## Data Checkpoint

### Research Question
What combination of factors best characterizes high-value housing in Los Angeles in the 1990s?

### Variables
We intend to analyze income, age, number of bedrooms, proximity to the ocean, and proximity to major cities as our independent variables. Our dependent variable will be house value. Additionally, we define “high-value housing” as houses with median prices that fall within the top 50% of all Los Angeles homes from our dataset.
See “Data Overview” for further details about variables.

### Background and Prior Work 

Housing prices in Los Angeles are influenced by multiple economic, social, and environmental factors. Characteristics such as income levels, proximity to major cities, and proximity to the ocean[1] are known to affect the value of a home because they appeal to buyers. These features can make a neighborhood more attractive to residents, leading to higher demand and higher housing values.

Previous studies of the Los Angeles housing market have emphasized how proximity to the ocean contributes to higher home values due to the unique views and lifestyle it provides[2]. In addition, neighborhood income levels and home sizes play a large role in property values, as higher-income neighborhoods with larger homes attract buyers who will pay a higher price for such features. Our work will build on this research by identifying which factors best characterize high-value housing in Los Angeles. Our analysis will provide a detailed understanding of the interactions between economic, demographic, and geographic factors in the Los Angeles housing market.

1) “Coastal Impact.” Los Angeles Regional Collaborative, www.laregionalcollaborative.com/coastal-impact.
2) Rose, Sarah. “Torrance’s South Bay Balance: How Coastal Proximity Influences Affordable Living and Los Angeles’ Rental Scene.” Propertywealthstrategies.com, 13 Oct. 2023, propertywealthstrategies.com/torrances-south-bay-balance-how-coastal-proximity-influences-affordable-living-and-los-angeles-rental-scene


### Hypothesis

We predict that high-value housing in Los Angeles will be best characterized by a combination of higher income, a greater total number of bedrooms, and closer proximity to the ocean. Higher average income levels may correlate with higher home values because greater income enables residents to afford more expensive homes, while a greater total number of bedrooms could indicate larger, more spacious houses.
Proximity to the ocean is a highly desirable feature, often associated with premium real estate prices due to the scenic views. Specifics, such as the kind of water body and exactly how close a house is to the water, may change the size of the impact on home values, but the overall effect is positive. For example, flood zones on rivers and lakes lower values slightly, yet the net impact of the water remains positive[1].
Additionally, we believe that there may be a relationship between income and house price, not only because higher income may enable people to afford more expensive homes, but also because a 2020 study of absolute income inequality and rising house prices showed that “...increasing income inequality contributed to the rise in real house prices.”[2]
Lastly, a greater number of bedrooms has historically been linked with higher home appraisal values. There are two reasons for this. First, more bedrooms often indicate higher square footage. Second, even two equally sized homes will be appraised differently if one has more bedrooms than the other[3]. In fact, the number of rooms tends to impact house price at a significantly higher rate than other factors[4].

1) Benson, Earl D., Hansen, Julia L., Schwartz Jr., Arthur L., Smersh, Greg T. “Pricing Residential Amenities: The Value of a View.” Journal of Real Estate Finance and Economics, 16:1, pp. 55–73.
2) Goda, Thomas, Stewart, Chris, Torres García, Alejandro. “Absolute income inequality and rising house prices.” Socio-Economic Review, Volume 18, Issue 4, October 2020, pp. 941–976.
3) Keller, L. (2022). “Does the number of bedrooms affect your appraisal?” SWFL Appraisals. https://www.swflappraisals.com
4) Zhang, Qingqi. (2021). “Housing Price Prediction Based on Multiple Linear Regression.” Scientific Programming, 2021, pp. 1–9. 10.1155/2021/7678931.


### Data Overview

- Dataset Name: California Housing with Name of Counties
- Number of Observations (Total): 20,640
- Number of Variables (Total): 16

***Description***

The dataset that our team has chosen to use is the “California Housing with Name of Counties” dataset, which is available on Kaggle. This dataset provides comprehensive information on housing prices across California counties. It includes 20,640 observations and 16 variables, making it a robust dataset for analyzing housing trends and the factors influencing housing prices. The data come from the 1990 California census and include variables such as median house value, median income, median housing age, total rooms, total bedrooms, population, households, latitude, longitude, and the name of the county. This dataset is particularly useful for understanding the geographic and socioeconomic factors affecting housing prices in California.

***Variables***

Median_House_Value: The median value of the houses in each block group (in USD) [$]

Median_Income: The median income of households in each block group (in tens of thousands of USD) [10K$]

Median_Age: Median age of houses in a block (in years)

Tot_Rooms: Total number of rooms in a block

Tot_Bedrooms: Total number of bedrooms in a block

Population: Total number of residents in a block

Households: Total number of households in a block

Latitude: Indicates how far north a house is (higher value means farther north) [°]

Longitude: Indicates how far west a house is (values are negative in California, so a more negative value means farther west) [°]

Distance_to_coast: Distance to the nearest coastal point [m]

Distance_to_LA: Distance to the center of Los Angeles [m]

Distance_to_SanDiego: Distance to the center of San Diego [m]

Distance_to_SanJose: Distance to the center of San Jose [m]

Distance_to_SanFrancisco: Distance to the center of San Francisco [m]

ocean_proximity: Description of nearest water body [“<1H OCEAN”, “INLAND”, “ISLAND”, “NEAR BAY”, “NEAR OCEAN”]

City: The county in which the household is located

The variables in bold are the ones we plan to use in our analysis.
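
As a small illustration of the units listed above, the sketch below converts income from tens of thousands of USD to USD and coastal distance from meters to kilometers. The two rows are made-up sample values, and the derived column names `Median_Income_USD` and `Distance_to_coast_km` are hypothetical helpers that are not part of the dataset.

```python
import pandas as pd

# Two made-up rows that reuse the dataset's column names and units.
sample = pd.DataFrame({
    "Median_Income": [1.2434, 7.8521],       # tens of thousands of USD
    "Distance_to_coast": [2524.6, 9154.5],   # meters
})

# Hypothetical derived columns (not in the original dataset).
sample["Median_Income_USD"] = sample["Median_Income"] * 10_000
sample["Distance_to_coast_km"] = sample["Distance_to_coast"] / 1_000

print(sample)
```

Keeping the original columns alongside the converted ones makes it easy to confirm that no unit was applied twice.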

***Data Wrangling Plan***

Given our research question, “What combination of factors best characterizes high-value housing in Los Angeles?”, we will focus on the following steps to prepare our dataset:

1) Filtering for Los Angeles
We plan to extract data specific to Los Angeles County from the dataset to ensure that our analysis is geographically relevant. This will narrow the number of observations that we work with from 20,640 to 5,836.

2) Defining High-Value Housing
Since we are focusing on factors that characterize high-value homes, we will only look at the top 50% of observations by Median_House_Value. This further narrows the dataset from 5,836 to 2,918 observations.

3) Feature Engineering
We may create new features if they are necessary or helpful. For example, we could compute the average number of people living in each household by dividing Population by Households.

4) Normalize
Normalize numerical values to ensure they are on a similar scale.

5) Encode
Encode categorical variables into numerical format to facilitate analysis.

6) Missing Values
Remove or impute any null or missing values.

7) Other
These wrangling methods are tentative; we may use fewer or additional methods if necessary.

By focusing on these steps, we can ensure that our dataset is clean and relevant, allowing us to analyze specific combinations of factors to see if/how they impact high-value housing in Los Angeles. This approach ensures that our analysis is both comprehensive and targeted, providing meaningful insights into our research question. 
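
Steps 3 through 6 of the plan above could be sketched as follows. This is a minimal illustration on a made-up four-row frame that reuses the dataset's column names, not our final pipeline. The choices of median imputation, z-score normalization, and one-hot encoding are assumptions, as are the order of the steps and the new column name `People_per_Household`.

```python
import numpy as np
import pandas as pd

# Small made-up frame that reuses the dataset's column names.
toy = pd.DataFrame({
    "Median_House_Value": [250000.0, 310000.0, 500001.0, 420000.0],
    "Median_Income": [3.5, np.nan, 9.6, 6.1],
    "Population": [1200, 800, 2795, 1500],
    "Households": [400, 250, 872, 500],
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN", "<1H OCEAN"],
})

# 3) Feature engineering: average number of people per household.
toy["People_per_Household"] = toy["Population"] / toy["Households"]

# 6) Missing values: impute with the column median (dropping rows is the alternative).
toy["Median_Income"] = toy["Median_Income"].fillna(toy["Median_Income"].median())

# 4) Normalize: z-score the numeric predictors, leaving the target untouched.
numeric_cols = ["Median_Income", "Population", "Households", "People_per_Household"]
toy[numeric_cols] = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()

# 5) Encode: one-hot encode the categorical column into 0/1 indicator columns.
toy = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)

print(toy.round(2))
```

Imputing before normalizing keeps the filled-in value on the same scale as the observed ones. Min-max scaling or dropping incomplete rows would be equally reasonable alternatives.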


### Ethics & Privacy

Our dataset comes from Kaggle and is based on the 1990 California census. It is substantial and does not contain any personal or private information in its listings, so breach of privacy is not a concern. All of the data is publicly available, so we do not expect to violate any privacy terms. The California census data comes from the U.S. Census Bureau, which updates this information every 10 years; because it is accurate, public information, we do not expect bias in the data. Our analysis focuses specifically on Los Angeles, so restricting the data to that area does not exclude populations relevant to our question.

### Team Expectations

Our team will prioritize effective communication and a collaborative tone to keep everything on track. For daily communication, we’ll use iMessage, with responses expected within 2–3 hours to keep things moving smoothly. Weekly virtual meetings on Mondays will help us stay aligned and troubleshoot any issues. We’ll aim for a tone that’s "blunt but polite," encouraging open feedback with a focus on solutions. For example, we’ll phrase criticism constructively: "I think X is a problem because of Y. Does everyone else agree, or am I missing something?" Major project decisions will require a unanimous vote, while smaller, section-specific ones will follow a majority rule, allowing for flexibility if someone is unavailable. Our work style will be balanced; while we’ll all contribute equally at first, as we progress, team members can focus on areas where they excel. If anyone is struggling with their tasks, they should notify the group immediately via the group chat. In these cases, we’ll divide responsibilities to ensure we meet our shared deadlines, which will generally be set for noon on the due date. Additionally, team members are expected to track task progress using the Google spreadsheet
here: https://docs.google.com/spreadsheets/d/1CqqX8_CHUeCzUD8tMlEtGdI9yYgvWX34g9T-bLnS-pE/edit?usp=sharing.


### Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 10/21 | 8 PM | Think and brainstorm about COGS 108 project expectations; review past COGS 108 projects | Determine which projects will be reviewed; discuss final project topic; begin working on project proposal |
| 10/23 | 7 PM | Finish project review; do background research on topic | Finish discussing past COGS 108 project; discuss ideal dataset(s) and ethics; draft project proposal |
| 10/28 | 10 AM | Edit and work on project proposal; begin search for datasets | Discuss any analytical approaches; assign group members to lead each specific part; work on team expectations and project timeline |
| 10/30 | 12 PM | Edit and finalize any work on project proposal | Finalize project proposal, work on anything that needs to be done, and add any touch-ups; submit final version of project proposal |
| 11/11 | 5 PM | Work on any progress towards Checkpoint #1; work on importing and wrangling data | Review/edit wrangling/EDA; discuss analysis plan; work on anything to get to Checkpoint #1 |
| 11/13 | 12 PM | Finalize wrangling/EDA; begin working on any analysis; finalize any work for Checkpoint #1 | Discuss/edit analysis; finalize any work that needs to be done for Checkpoint #1 |
| 11/25 | 6 PM | Complete analysis; draft results/conclusion/discussion | Work on getting to Checkpoint #2; discuss/edit project |
| 11/27 | 12 PM | Complete and work on finishing EDA | Work on anything that needs to get done to get to Checkpoint #2 |
| 12/11 | Before 11:59 PM | Finalize project | Work on anything to get project to final state; turn in Final Project & Group Project Surveys |

### Data Cleaning

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#Reading the Dataset
df = pd.read_csv('California_Housing_CitiesAdded.csv')

In [3]:
#Checking the structure of the Dataset
df.head(10)

Unnamed: 0,Median_House_Value,Median_Income,Median_Age,Tot_Rooms,Tot_Bedrooms,Population,Households,Latitude,Longitude,Distance_to_coast,Distance_to_LA,Distance_to_SanDiego,Distance_to_SanJose,Distance_to_SanFrancisco,ocean_proximity,City
0,500001.0,1.2434,52,249,78,396,85,37.8,-122.27,2524.614616,552234.0515,731023.5749,61415.35211,14466.70538,NEAR BAY,Alameda
1,500001.0,1.1696,52,609,236,1349,250,37.87,-122.25,7897.024567,556856.928,735788.3723,67242.51828,19172.81885,NEAR BAY,Alameda
2,500001.0,7.8521,52,1668,225,517,214,37.86,-122.24,9154.528309,555442.5086,734372.6023,65849.13943,19335.74118,NEAR BAY,Alameda
3,500001.0,9.3959,52,3726,474,1366,496,37.85,-122.24,8259.085109,554610.7171,733525.6829,64867.28983,18811.48745,NEAR BAY,Alameda
4,500001.0,7.8772,52,2990,379,947,361,37.83,-122.23,7284.913015,552365.4712,731263.5682,62493.11252,18750.94628,NEAR BAY,Alameda
5,500001.0,11.8603,39,2492,310,808,315,37.82,-122.22,7447.262038,550950.9605,729847.5883,61097.42739,19258.11494,NEAR BAY,Alameda
6,500001.0,13.499,42,2991,335,1018,335,37.82,-122.22,7447.262038,550950.9605,729847.5883,61097.42739,19258.11494,NEAR BAY,Alameda
7,500001.0,12.2138,52,3242,366,1001,352,37.82,-122.23,6670.966116,551535.6797,730418.0382,61517.93721,18412.54557,NEAR BAY,Alameda
8,500001.0,12.3804,52,3494,396,1192,383,37.82,-122.23,6670.966116,551535.6797,730418.0382,61517.93721,18412.54557,NEAR BAY,Alameda
9,500001.0,8.7477,52,1611,203,556,179,37.82,-122.23,6670.966116,551535.6797,730418.0382,61517.93721,18412.54557,NEAR BAY,Alameda


In [4]:
# Cleaning the dataset so it only contains rows with 'Los Angeles' under City, because we are interested in Los Angeles
# Resetting the index in the same step avoids modifying a filtered slice in place
df = df[df['City'] == 'Los Angeles'].reset_index(drop=True)

In [5]:
# Checking the size of the dataset and its structure
df.head(10)
df.shape

(5836, 16)

In [6]:
#Dropping columns that are not necessary
df = df.drop(['Latitude','Longitude','Distance_to_LA'],axis = 1)

In [7]:
df.head(10)

Unnamed: 0,Median_House_Value,Median_Income,Median_Age,Tot_Rooms,Tot_Bedrooms,Population,Households,Distance_to_coast,Distance_to_SanDiego,Distance_to_SanJose,Distance_to_SanFrancisco,ocean_proximity,City
0,500001.0,10.8805,10,7665,999,3517,998,27774.13728,215871.3204,454746.3628,522763.9011,<1H OCEAN,Los Angeles
1,500001.0,10.9052,23,4960,592,1929,586,25596.69339,214092.4277,456434.7298,524447.371,<1H OCEAN,Los Angeles
2,500001.0,9.6047,20,7384,845,2795,872,26263.8908,216646.7724,453800.2264,521807.6764,<1H OCEAN,Los Angeles
3,500001.0,8.565,31,1962,243,697,242,24171.28565,214323.6214,456090.5958,524095.0103,<1H OCEAN,Los Angeles
4,500001.0,8.1714,23,2980,362,1208,378,23072.17475,213444.1903,456939.6905,524941.0221,<1H OCEAN,Los Angeles
5,500001.0,8.8612,24,4338,558,1514,549,25264.85008,211549.2698,459073.1678,527090.4176,<1H OCEAN,Los Angeles
6,500001.0,4.1544,35,2245,393,783,402,20601.01743,193920.148,477109.7452,545136.9364,<1H OCEAN,Los Angeles
7,500001.0,11.2093,42,777,102,284,113,18364.0668,200022.4657,470614.6307,538631.1559,<1H OCEAN,Los Angeles
8,500001.0,9.6465,5,5429,665,2315,687,20019.8369,216057.4366,454226.0575,522179.7825,<1H OCEAN,Los Angeles
9,500001.0,3.1875,31,1950,383,870,357,15242.9884,207932.6733,462329.1622,530301.0635,<1H OCEAN,Los Angeles


In [8]:
# Checking the median (50th percentile) house value, which is the cutoff for the top 50%
df['Median_House_Value'].quantile(0.50)

203450.0

In [9]:
# Filtering the dataset to keep only houses at or above the median (the top 50%)
df = df[df['Median_House_Value'] >= df['Median_House_Value'].quantile(0.50)]

In [10]:
#Checking to see that we have enough entries in the dataset
df.shape

(2918, 13)
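
The filtering steps above could also be collected into one reusable function. The sketch below is one possible arrangement, not the code we ran: the function name `select_high_value` and its parameters are hypothetical, and the five-row frame is made-up data used only to show the behavior.

```python
import pandas as pd

def select_high_value(df, county="Los Angeles", q=0.50):
    """Keep one county's rows at or above the q-th quantile of Median_House_Value."""
    county_df = df[df["City"] == county].reset_index(drop=True)
    cutoff = county_df["Median_House_Value"].quantile(q)
    high_value = county_df[county_df["Median_House_Value"] >= cutoff]
    return high_value.reset_index(drop=True), cutoff

# Made-up rows for illustration only; real use would pass the full dataset.
toy = pd.DataFrame({
    "Median_House_Value": [100000.0, 200000.0, 300000.0, 400000.0, 500001.0],
    "City": ["Los Angeles", "Los Angeles", "Alameda", "Los Angeles", "Los Angeles"],
})

high_value, cutoff = select_high_value(toy)
print(cutoff)
print(high_value)
```

Because the comparison uses `>=`, rows exactly equal to the cutoff are kept, so the result can contain slightly more than half of the county's rows when values are tied at the median.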