# King County Housing Analysis

By Tosca Le and Jonny Hofmeister

## Overview

This project explores the location and details of homes in King County, Washington, which encompasses the greater Seattle area. The data was provided by Flatiron School, and the project requires us to use linear regression models to generate data-driven insights.

This analysis notebook (located in the main branch) is a combined analysis and summary of the work done individually by group partners Tosca Le and Jonny Hofmeister. Features and insights from the separate models are combined here to summarize our results and findings for the stakeholder presentation. Analyses for the separate location and details models can be found in the branches Jonny and Tosca, respectively.

![seattle neighborhood](images/GI_157380908_SeattleNeighborhoods.jpg)

# Business Understanding

The stakeholder we have selected for the context of this project is a real estate agency in Seattle. The real estate agency should be able to assist homeowners who are looking to buy and sell homes in King County. By understanding the house sales data, the real estate agency can give useful advice to homeowners on how location and characteristics of the home might increase or decrease the estimated value of their homes.

The scope of this regression analysis is determined by the data and features we have access to. Finding out how home price depends on features given in the data, like square footage, floors, number of bedrooms and bathrooms, zipcode, etc., is within the scope of the data. The scope of this analysis does not include predicting price, or finding its dependency on feature values, outside the range of the data we use to train the model. For example, this analysis can find the dependency and generally predict price for the zipcodes given in the King County data, but not for zipcodes outside of it. Likewise, it can find the trend for how the number of bedrooms affects price only as long as that number is not far outside the range of the data. It is important that the stakeholders/realtors understand this scope so they can properly apply this analysis to future homes and data.

# Data Understanding

In [1]:
# the usual imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

Begin by importing the data with pandas and examining the columns and data types.

In [4]:
df = pd.read_csv('data/kc_house_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  float64
 9   view           21534 non-null  float64
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

### *Target and Predictors*

The target variable in the data for this regression is price.

We have 20 other columns left at our disposal to use as predictors. Some, like 'sqft_living', are probably very useful, while others, like 'id' and 'date', are practically useless to us.

We decided to split the columns into two separate models in order to add context for the results we report to the realtors. The first model is built from location-based features of a home, such as zipcode and neighbor lot size. The second is built from home-specific features, like the number of bedrooms and the condition of the home.

Both these analyses will explore the dependency of price on features, but will provide two different contexts in which realtors can examine how location and home separately influence the price of homes. 

### *Addressing Model Assumptions*

If we are to use these variables as predictors, we must first validate some of the assumptions that go into linear regression. The two we focus on are that the predictors are approximately normally distributed and that they are not multicollinear.

**Home Model:**
For the home model, Tosca selected some features that are continuous and some that are categorical. Home square footage and year built are continuous. The others, like floors, bedrooms, condition, and grade, are stored as numbers but must be treated as categorical data because their values are limited to a small set of discrete values.

It was found that the target column, price, along with the continuous columns, is not normally distributed. To address this, she decided to log transform each of these columns and then apply a standard scaler.

In addressing multicollinearity of variables for the home model, it was found that home square footage has a very high correlation with the number of bathrooms and the grade. . . . .

**Need to complete for Tosca's final model.**

**Location Model:**
For the location model, Jonny also selected columns that were both continuous and categorical. Zipcode and waterfront are categorical and can be dealt with by one-hot encoding the zipcodes into boolean columns. 
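As a rough illustration of the one-hot encoding step, here is a minimal sketch on a tiny made-up frame. It assumes `pd.get_dummies` and the `drop_first=True` option; the location notebook may have used a different encoder or kept every category.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "zipcode": [98004, 98039, 98004, 98178],
    "waterfront": [0.0, 1.0, 0.0, 0.0],
})

# One indicator column per zipcode. drop_first=True (an assumption here)
# drops one reference zipcode so the dummies are not perfectly collinear
# with the model intercept.
zip_dummies = pd.get_dummies(toy["zipcode"], prefix="zip", drop_first=True)

# waterfront is already 0/1, so it can be used directly
encoded = pd.concat([toy[["waterfront"]], zip_dummies], axis=1)
print(encoded)
```

With a dropped reference category, each zipcode coefficient is read relative to the zipcode that was left out.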

The continuous columns are lot square footage, neighbor lot square footage, price, and home square footage. Note that home square footage is also contained in the location model. This is because it was determined that home size accounts for so much of the price that a viable model could not be produced without it.

In examining these features for multicollinearity, it was found that lot square footage and neighboring lot square footage were very highly correlated. To deal with this, they were combined into a single column: the ratio of the size of the lot of interest to the average lot size of the nearest 15 neighbors. This new variable describes whether a lot is bigger or smaller than its neighbors on average.
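The ratio feature can be sketched as follows. The column name `sqft_lot15` for the neighbors' average lot size is an assumption (it does not appear in the truncated column listing above), and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy values; sqft_lot15 is the assumed name of the neighbor-lot column
toy = pd.DataFrame({
    "sqft_lot":   [4000, 5000, 7500, 12000],
    "sqft_lot15": [4400, 4800, 7000, 10000],
})

# The two raw columns move together, which is the multicollinearity concern
corr = toy["sqft_lot"].corr(toy["sqft_lot15"])
print(f"correlation: {corr:.3f}")

# Replace the pair with one ratio: > 1 means larger than the neighbors,
# < 1 means smaller
toy["lot_ratio"] = toy["sqft_lot"] / toy["sqft_lot15"]
print(toy)
```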

Looking at the distributions of these continuous variables, it was found that none were normally distributed. Below, the distributions for price and sqft_living are shown as examples:

![pre-log feats](images/prelogfeats.png)

It was determined that log transforming price, sqft_lot, and the new lot_ratio column both increased the normality of the predictors and the confidence of the model results. Here are the same features after log transforming:
![post-log feats](images/postlogfeats.png)

We can visually confirm that logging our target and features greatly improves their normality and thus the performance of the model.

**Scaling:**

In order to compare the effects of each of these predictors, a standard scaler was applied after log transforming each of the three continuous columns. It is unnecessary to scale the one-hot encoded (OHE) categorical columns.
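The log-then-scale step might look like the sketch below. It uses scikit-learn's `StandardScaler` on two made-up predictor columns; the exact columns and order of operations in the individual notebooks may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy continuous predictors (illustrative values only)
toy = pd.DataFrame({
    "sqft_living": [800, 1500, 2200, 4000],
    "lot_ratio":   [0.5, 1.0, 1.2, 3.0],
})
cols = ["sqft_living", "lot_ratio"]

# 1) Log transform to reduce right skew
logged = np.log(toy[cols])

# 2) Standardize so each column has mean 0 and standard deviation 1,
#    which puts the regression coefficients on a comparable scale
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(logged), columns=cols, index=toy.index
)
print(scaled.round(3))
```

In a train/test workflow the scaler would normally be fit on the training rows only and then applied to the test rows with `scaler.transform`.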

### To summarize predictors:

**Home Model Predictors:** 
- Log of home square footage
- Log of year built
- Number of bedrooms
- Number of floors
- Condition rating
- Grade

**Location Model Predictors:**
- Zipcode
- Waterfront
- Log of home square footage
- Log of lot size ratio

***Log Target***

It is important to note that each of the models predicts the log of price, as this improved the fit in both cases. In order to examine residuals and error in a dollar context, predictions must be transformed back out of log space (exponentiated).

# Data Preparation

Data preparation here entails cleaning for missing and unwanted values, selecting columns of interest from the data, one-hot encoding, log transforming and scaling, and finally train/test splitting.

**Basic Cleaning:**

We decided to clean our data in the same way, so that even though we are using different features of the home, each model contains the same houses.

What was cleaned:
- Removed all duplicates from the data (177 rows).
- Dropped the 33-bedroom outlier house.
- Removed rows with NaN values in the waterfront column (2353 rows).
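The three cleaning steps listed above can be sketched on a toy frame as follows. How duplicates were defined (full-row versus by `id`) is not stated here, so full-row duplicates are assumed.

```python
import numpy as np
import pandas as pd

# Toy rows: one exact duplicate, one 33-bedroom outlier, one missing waterfront
toy = pd.DataFrame({
    "id":         [1, 1, 2, 3, 4],
    "bedrooms":   [3, 3, 33, 4, 2],
    "waterfront": [0.0, 0.0, 0.0, np.nan, 1.0],
})

cleaned = (
    toy.drop_duplicates()                  # assumed: exact duplicate rows
       .query("bedrooms != 33")            # drop the 33-bedroom outlier
       .dropna(subset=["waterfront"])      # drop rows missing waterfront
       .reset_index(drop=True)
)
print(cleaned)
```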



### Column Prep

Next, the columns for each model were selected, modified, or one-hot encoded (OHE) as needed, then log transformed and scaled. Specific information and code for how this was done is not given in this group analysis summary, but can be found in each of the individual model notebooks.

**Location Model Columns:**

Two successful location models were created. The first one-hot encoded the zipcodes given. The second grouped zipcodes into the different suburbs of Seattle, and then one-hot encoded these new categories. The goal of trying both models is to give slightly different context to the realtors/stakeholders and home owners/buyers. Hearing a zipcode and instinctively knowing specifically what area it covers is not something most of us can do; zipcodes have odd, often unintuitive shapes, even if you know the area. Grouping homes by city/suburb instead lets realtors refer by name to the area of Seattle a buyer is interested in, giving a more general description of how that location influences price than a zipcode does.

Examples of the Seattle suburbs used are: Auburn, Ballard, Bellevue, Black Diamond, Bothell, Broadview, Burien . . . Redmond, Renton, Sammamish, Snoqualmie, Tukwila, Queen Anne, University, Vashon, West Seattle, Woodinville.

While a suburb name may be more recognizable than a zipcode, there are still benefits to producing a zipcode model. The main benefit is that GIS information on zipcode boundaries is readily available. This means that a map outlining zipcode areas is an easy way to let stakeholders and customers visually explore how these areas influence price.


**Home Model Columns:**



#### *Train Test Split*

The final important step of preparation before modeling is splitting the data into training and testing sets. Just as we opted to clean the data identically above, we opted to split the data identically as well. We decided on using 75% of the data for training, and 25% for testing. We also used the same random seed within the train test split function so that we would be training and testing on the same homes/rows. 
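A minimal sketch of a reproducible 75/25 split with scikit-learn's `train_test_split` is shown here. The seed value `42` and the toy data are placeholders, not the values used in the project notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned housing frame
toy = pd.DataFrame({
    "sqft_living": np.arange(100),
    "price": np.arange(100) * 1000.0,
})

SEED = 42  # hypothetical shared seed; any fixed integer works


def split(df):
    """Split into 75% train / 25% test using the shared seed."""
    X = df.drop(columns="price")
    y = df["price"]
    return train_test_split(X, y, test_size=0.25, random_state=SEED)


# Two independent calls (e.g. one per notebook) select the same rows
X_train_a, X_test_a, y_train_a, y_test_a = split(toy)
X_train_b, X_test_b, y_train_b, y_test_b = split(toy)
print(len(X_train_a), len(X_test_a))
```

Because the row selection depends only on the number of rows and the seed, both models train and test on the same homes as long as the cleaned data is identical.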

# Modeling

Now that the data has been prepped and properly split, we perform the linear regression. Regressions from both the statsmodels and scikit-learn packages were used, so it is important to know that the results from each are identical. **statsmodels** was used initially because its summary feature conveniently shows all model results and coefficients. **scikit-learn** was used afterward because it is a more interactive and well-documented package that we can use to easily produce predictions for our testing set and evaluate error.

### Baseline Model (FSM):

In creating linear models, it is important to have something to compare the performance of your model against. This is where a baseline model comes in, and it can be built in many ways. The most basic and naive baseline we can create is simply the average price; i.e., for any values of the home features, we always predict the average price across all homes.

This baseline model is visualized below with a scatterplot of the log price and sqft_living.
![baseline avg](images/avg_baseline.png)

For every value of sqft_living (or any other column), the baseline model just predicts the average of the target (log price), shown here as the red line.

We can also use the same statistical tools to measure fit and error for this baseline in order to easily compare it to our next models. For this naive average baseline, the R^2 value is -0.001, meaning the line explains essentially none of the variance in the data. The root mean square error (RMSE) for this baseline is 0.522. This RMSE is in units of log(dollars), not dollars, and must be compared to the same log-scale error in future models.
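A sketch of how such a baseline can be scored is given below, using simulated log prices rather than the real data. It assumes the baseline predicts the training-set mean and is evaluated on the test set, which would explain an R^2 slightly below zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Simulated log prices (illustrative only)
rng = np.random.default_rng(1)
log_price_train = rng.normal(13.0, 0.5, size=750)
log_price_test = rng.normal(13.0, 0.5, size=250)

# Baseline: always predict the mean of the training target
baseline_pred = np.full_like(log_price_test, log_price_train.mean())

r2 = r2_score(log_price_test, baseline_pred)
rmse = np.sqrt(mean_squared_error(log_price_test, baseline_pred))
print(f"baseline R^2:  {r2:.4f}")
print(f"baseline RMSE: {rmse:.4f} (log dollars)")
```

R^2 compares the model against predicting the test-set mean, so a constant equal to the training mean can never score above zero and usually lands just below it.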

### Location Modeling:

This summary analysis will not show the entire iterative modeling process that the location model underwent. Different iterations included removing zipcodes or suburbs that contained no homes. Another iteration used lat/long to produce a distance from the city center, but this model was unsuccessful and was thrown out.

**How were the fit of models validated and error addressed?**

The main metric we relied on to validate model fit is the coefficient of determination, commonly known as the R-squared value, which tells us how much of the variance in our data the model could account for. The loss function used to evaluate predictive error in the model is root mean square error (RMSE), the square root of the average squared residual across all predictions.
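For reference, the standard definitions of these two metrics are written out here, with $y_i$ the observed target, $\hat{y}_i$ the model prediction, $\bar{y}$ the mean of the observed target, and $n$ the number of predictions (symbols introduced for this summary):

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```

R-squared is unitless, while RMSE carries the units of the target, which in these models is log(dollars) until predictions are transformed back.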

The threshold of performance we are seeking for a well-fit model is an R-squared value above 0.8. This was chosen because it is the threshold commonly used in the industry; explaining over 80% of the variance in the data will be considered successful.

**Zipcode Model Results:**

The zipcode model explained the most variance in the data, with an R-squared of 0.835 on the training set and 0.833 on the testing set. This means that the features and coefficients in the model explain about 83% of the variance in log price. For the RMSE, it was best to un-log the predictions so that the error is in dollars: the RMSE for the zipcode model was 147,000 dollars. So even though this model accounts for most of the variance in the data, the typical error for a single prediction is about 147k. The model is therefore able to describe the dependency of price on the features (R^2 = 0.833), but this error has implications if we were to use it to predict the price of single homes (RMSE = 147k dollars).
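Converting a log-scale error into a dollar error can be sketched as follows. The prices and predictions are made up, and a natural log target is assumed.

```python
import numpy as np

# Made-up actual sale prices and model predictions in log dollars
price_true = np.array([300_000.0, 450_000.0, 800_000.0, 1_200_000.0])
log_pred = np.log(np.array([320_000.0, 430_000.0, 700_000.0, 1_350_000.0]))

# RMSE on the scale the model was trained on (log dollars)
rmse_log = np.sqrt(np.mean((np.log(price_true) - log_pred) ** 2))

# Undo the log on the predictions, then compute RMSE in dollars
price_pred = np.exp(log_pred)
rmse_dollars = np.sqrt(np.mean((price_true - price_pred) ** 2))

print(f"RMSE (log dollars): {rmse_log:.3f}")
print(f"RMSE (dollars):     {rmse_dollars:,.0f}")
```

Because squaring weights large misses heavily, a few expensive homes with big dollar errors can dominate the dollar RMSE even when the log-scale fit is good.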


**Seattle Suburb Model Results:**

The city/suburb model performed slightly worse than the zipcode model, but it was still able to account for over 80% of the variance in the data. Its R-squared on the training set was 0.804, which is above the threshold and well above the R^2 for the baseline model.

#### Validating Residuals:

The last step to validate these models is to examine the distribution of their residuals. Normal residuals show us that the error for our model is spread evenly and centered around zero. Non-normal residuals imply that we predicted more values either above or below their true value, and that the model results are skewed. Residuals for both the Seattle Suburb model and the Zipcode model are shown in the graphs below:

![res](images/resid_loc.png)

The residuals are indeed approximately normal. This was verified a second time using Q-Q plots, shown in the separate notebook.
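A numerical version of the Q-Q check can be sketched with `scipy.stats.probplot`, which returns the quantile pairs along with the correlation of the fitted line. The residuals below are simulated, not taken from either model.

```python
import numpy as np
from scipy import stats

# Simulated residuals in log dollars (illustrative only)
rng = np.random.default_rng(2)
residuals = rng.normal(loc=0.0, scale=0.2, size=1000)

# probplot pairs theoretical normal quantiles (osm) with the sorted
# residuals (osr) and fits a straight line through them
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

print(f"mean residual:    {residuals.mean():.4f}")
print(f"Q-Q correlation:  {r:.4f}")
```

A Q-Q correlation close to 1 and a mean residual near zero are consistent with normally distributed, centered errors; plotting `osm` against `osr` gives the usual Q-Q plot.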

Now that both these models have been validated, we can move on to evaluating/analyzing them.

### Home Modeling:



# Evaluation

### Location Model Evaluation:

**Seattle Suburbs and Home Price**

***Here I will summarize the suburbs that influence price the most and least***
These suburbs did best, these did worst..... copy paste stuff from other branch

Maybe make a bar chart of area and coefs


**Zipcodes and Home Price**

The best way to evaluate how different zipcodes of Seattle influence home price is to map them. Below is a map in which each zipcode in the data has been outlined as a separate region and colored based on how high the regression coefficient is for that zipcode. Yellow zipcodes indicate a low regression coefficient, i.e., the fact that a home is located there does not add much value (and may even decrease it on average if coef < 0). Bluer areas indicate a relatively high regression coefficient, i.e., a home being located there does add value to it.

*To view this interactive map, click this [map link](https://nbviewer.jupyter.org/github/jonnyhof/kings_county_housing_analysis/blob/jonny/data/zipcode_coef_map.html)*

An immediate observation we can make from the map is that being in a zipcode closer to downtown or north Seattle correlates with a higher regression coefficient for price. **Add more observations that address the business problem**


**Shared features: waterfront, sqft_living, and lot_ratio**

Waterfront, sqft_living, and lot_ratio are columns that were included in both zipcode and suburb models, and performed about the same in each. 

Size of the home accounts for a much greater portion of the price than having a yard larger than your neighbors'. Going up one standard deviation in (log) home square footage raises the predicted log price by 0.3233, while going up one standard deviation in (log) lot ratio raises it by only 0.0146. Having a larger yard than your neighbors seems to matter a bit, but not nearly as much as home size itself does.

Waterfront has a significant impact on price in each of the models; its coefficient is 0.7-0.8. To add some context, a coefficient of 0.7 means that just being a waterfront house anywhere adds over 50% of the value, in log-price terms, that would be added if the home were located in the most expensive area, Medina (coef = 1.34).
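Because the target is the log of price, a coefficient acts multiplicatively on price itself. The sketch below converts the coefficients quoted in this section into price multipliers, assuming a natural log was used and that the log price target was not itself standardized.

```python
import numpy as np

# Coefficients quoted in this section (effects on log price)
coefs = {
    "sqft_living (+1 SD)": 0.3233,
    "lot_ratio (+1 SD)": 0.0146,
    "waterfront": 0.7,
    "Medina": 1.34,
}

# exp(coef) is the factor by which predicted price is multiplied,
# under the natural-log, unscaled-target assumption stated above
multipliers = {name: float(np.exp(b)) for name, b in coefs.items()}
for name, m in multipliers.items():
    print(f"{name:22s} x{m:.3f}")

# The "over 50%" comparison is a ratio of log-scale coefficients
waterfront_share = coefs["waterfront"] / coefs["Medina"]
print(f"waterfront / Medina coefficient ratio: {waterfront_share:.3f}")
```

Under these assumptions, one standard deviation of home size corresponds to roughly a 38% higher predicted price, while one standard deviation of lot ratio corresponds to about 1.5%.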

### Home Model Evaluation:



# Moving Forward