# King County Housing Analysis

By Tosca Le and Jonny Hofmeister

## Overview

This project explores the location and physical details of homes in King County, Washington, which encompasses the greater Seattle area. The data was provided by Flatiron School, and the project requires us to use linear regression models to generate data-driven insights.

This analysis notebook (located in the main branch) is a combined analysis and summary of the work done individually by group partners Tosca Le and Jonny Hofmeister. Features and insights from the separate models are combined here to summarize our results and findings for the stakeholder presentation. Analyses for the separate location and details models can be found in the branches Jonny and Tosca, respectively.

![seattle neighborhood](images/GI_157380908_SeattleNeighborhoods.jpg)

# Business Understanding

The stakeholder we have selected for the context of this project is a real estate agency in Seattle. The real estate agency should be able to assist homeowners who are looking to buy and sell homes in King County. By understanding the house sales data, the real estate agency can give useful advice to homeowners on how location and characteristics of the home might increase or decrease the estimated value of their homes.

The scope of this regression analysis is determined by the data and features we have access to. Finding out how home price depends on features given in the data, such as square footage, floors, number of bedrooms and bathrooms, and zipcode, is within scope. The scope does not include predicting price, or finding its dependency on features, for values outside the range of the data used to train the model. For example, this analysis can find the dependency and generally predict price for the zipcodes given in the King County data, but not for zipcodes outside of it. Likewise, it can find the trend for how the number of bedrooms affects price only as long as that number is not far outside the range of the data. It is important that the stakeholders/realtors understand this scope, so they can properly apply this analysis to future homes/data.

# Data Understanding

```python
# the usual imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
```

Begin by importing the data with pandas and examining the columns and data types.

```python
df = pd.read_csv('data/kc_house_data.csv')
df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  float64
 9   view           21534 non-null  float64
 10  condition      21597 non-null  int64  
 11  grade          21597 non-null  int64  
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  
```

### *Target and Predictors*

The target variable in the data for this regression is price.

We have 20 other columns at our disposal to use as predictors. Some, like 'sqft_living', are probably very useful, while others, like 'id' and 'date', are of little use to us.

We have decided to split the columns between two models in order to add context for the results we report to the realtors. The first model is built from location-based features of a home, such as zipcode and neighbor lot size. The second is built from home-specific features, such as the number of bedrooms and the condition of the home.

Both analyses explore the dependency of price on features, but they provide two different contexts in which realtors can examine how location and home characteristics separately influence the price of homes.

### *Addressing Model Assumptions*

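If we are to use these variables as predictors, we must first check some of the assumptions that go into linear regression. The two issues we focus on are the distributions of the variables and multicollinearity. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed predictors, but heavily skewed variables often lead to skewed residuals, so we examine the distribution of each variable.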

**Home Model:**
For the home model, Tosca selected some features that are continuous and some that are categorical. Home square footage and year built are continuous. The others, such as floors, bedrooms, condition, and grade, are stored as numbers but must be treated as categorical data because they take only a small set of discrete values.

It was found that the target column, price, along with the continuous columns, is not normally distributed. To address this, she decided to log transform each of these columns and then apply a standard scaler.

In addressing multicollinearity of variables for the home model, it was found that home square footage is very highly correlated with the number of bathrooms and the grade. . . .

**To be completed for Tosca's final model.**

**Location Model:**
For the location model, Jonny also selected columns that were both continuous and categorical. Zipcode and waterfront are categorical and can be dealt with by one-hot encoding the zipcodes into boolean columns. 
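As a rough illustration of the one-hot encoding step, the sketch below expands a zipcode column into indicator columns with `pandas.get_dummies`. The toy rows, the `zip` prefix, and the choice of `drop_first=True` are assumptions made for this example; the individual location notebook may encode the zipcodes differently.

```python
import pandas as pd

# Toy rows standing in for the King County data (values are made up).
homes = pd.DataFrame({
    "zipcode": [98103, 98115, 98103, 98004],
    "waterfront": [0.0, 0.0, 1.0, 0.0],
})

# Treat zipcode as a label rather than a number, then expand it into one
# indicator column per zipcode. drop_first=True (an assumption here) drops one
# reference zipcode so the indicators are not perfectly collinear with the
# model intercept.
zip_dummies = pd.get_dummies(
    homes["zipcode"].astype(str), prefix="zip", drop_first=True
)

# Combine the indicators with the other categorical location feature.
location_features = pd.concat([homes[["waterfront"]], zip_dummies], axis=1)
print(location_features.columns.tolist())
```

With a dropped reference zipcode, each remaining zipcode coefficient reads as a difference in (log) price relative to that reference area.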

The continuous columns are lot square footage, neighbor lot square footage, price, and home square footage. Note that home square footage is also included in the location model. This is because home size accounts for so much of the price that a viable model could not be produced without it.

In examining these features for multicollinearity, it was found that lot sqft and neighboring lot sqft were very highly correlated. To deal with this, they have been combined into a single column: the ratio of the size of the lot of interest to the average lot size of the nearest 15 neighbors. This new variable describes whether a lot is bigger or smaller than its neighbors on average.
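The ratio described above can be sketched in a couple of lines of pandas. The column name `sqft_lot15` (average lot size of the 15 nearest neighbors) is an assumption, since the column listing above is cut off before it; `lot_ratio` is the name used for the new column in this summary. The values are made up.

```python
import pandas as pd

# Toy rows; sqft_lot15 is assumed to hold the neighbors' average lot size.
homes = pd.DataFrame({
    "sqft_lot": [5000, 12000, 7500],
    "sqft_lot15": [5000, 6000, 10000],
})

# Ratio > 1: the lot is larger than its neighbors on average.
# Ratio < 1: the lot is smaller than its neighbors on average.
homes["lot_ratio"] = homes["sqft_lot"] / homes["sqft_lot15"]
print(homes)
```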

Looking at the distributions of these continuous variables, it was found that none were normally distributed. Log transforming price, sqft_lot, and the new lot_ratio column brought each distribution closer to normal and increased confidence in the model results.

In order to compare the effects of each of these predictors, a standard scaler was applied to each of the three continuous columns after log transforming. It is unnecessary to scale the one-hot encoded (OHE) columns.
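A minimal sketch of the log-then-scale step follows, using the three column names mentioned above and made-up values. The standardization is written out by hand in pandas; scikit-learn's `StandardScaler` performs the same centering and scaling (it divides by the population standard deviation, hence `ddof=0`). Whether the individual notebooks fit the scaler before or after splitting the data is not shown here.

```python
import numpy as np
import pandas as pd

# Toy rows with the three continuous columns named in the text.
homes = pd.DataFrame({
    "price": [250000.0, 400000.0, 650000.0, 1200000.0],
    "sqft_lot": [4000.0, 5500.0, 8000.0, 20000.0],
    "lot_ratio": [0.5, 1.0, 1.2, 3.0],
})
continuous = ["price", "sqft_lot", "lot_ratio"]

# Step 1: the natural log compresses the long right tail of each column.
logged = np.log(homes[continuous])

# Step 2: standardize each logged column to mean 0 and standard deviation 1,
# so that coefficients are on a comparable scale across predictors.
scaled = (logged - logged.mean()) / logged.std(ddof=0)
print(scaled.round(3))
```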

### To summarize predictors:

**Home Model Predictors:** 
- Log of home square footage
- Log of year built
- Number of bedrooms
- Number of floors
- Condition rating
- Grade

**Location Model Predictors:**
- Zipcode
- Waterfront
- Log of home square footage
- Log of lot size ratio

***Log Target***

It is important to note that each of the models predicts the log of price, as this improved the fit in both cases. In order to examine residuals and error in a dollar context, predictions must be un-log-transformed.
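As an illustration of reversing the log transform, the sketch below exponentiates hypothetical log-price predictions and computes residuals in dollars. The numbers are invented, and the sketch assumes the target was only log transformed; if the target was also standard scaled, that scaling would need to be inverted before calling `np.exp`.

```python
import numpy as np

# Hypothetical sale prices and model output on the log scale.
actual_price = np.array([300000.0, 450000.0, 800000.0])
log_actual = np.log(actual_price)
log_predicted = log_actual + np.array([0.05, -0.10, 0.02])  # pretend predictions

# Undo the natural log to get predictions back in dollars.
predicted_price = np.exp(log_predicted)

# Residuals and root mean squared error, now in dollar units.
residual_dollars = actual_price - predicted_price
rmse_dollars = np.sqrt(np.mean(residual_dollars ** 2))
print(residual_dollars.round(0), round(rmse_dollars))
```

Note that an equal error on the log scale translates to a larger dollar error for more expensive homes, which is worth keeping in mind when reading dollar residuals.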

# Data Preparation

Data preparation here entails cleaning for missing and unwanted values, selecting columns of interest from the data, one-hot encoding, log transforming and scaling, and finally train/test splitting.

**Basic Cleaning:**
We decided to clean our data in the same way, so that even though we are using different features of the home, each model contains the same houses.

What was cleaned:
- Removed all duplicates from the data (177 rows).
- Dropped the outlier house listed with 33 bedrooms.
- Removed rows with NaN values in the waterfront column (2353 rows).
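The three cleaning steps above could look roughly like the following on a toy frame. Treating repeated `id` values as the duplicates is an assumption made for this sketch; the individual notebooks contain the actual cleaning code.

```python
import numpy as np
import pandas as pd

# Toy rows: one repeated id, one 33-bedroom outlier, one missing waterfront value.
raw = pd.DataFrame({
    "id": [1, 1, 2, 3, 4],
    "bedrooms": [3, 3, 33, 4, 2],
    "waterfront": [0.0, 0.0, 0.0, np.nan, 1.0],
})

cleaned = (
    raw.drop_duplicates(subset="id")      # assumption: duplicates are repeated ids
       .query("bedrooms != 33")           # drop the 33-bedroom outlier
       .dropna(subset=["waterfront"])     # drop rows with missing waterfront
       .reset_index(drop=True)
)
print(cleaned)
```

Running the same cleaning function in both notebooks is one way to guarantee that the two models see the same set of houses.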



**Column Prep**

Next, the columns for each model were selected, modified, or one-hot encoded as needed, then log transformed and scaled. Specific information and code for how this was done is not given in this group analysis summary, but can be found in each of the individual model notebooks.

#### *Train Test Split*

The final important step of preparation before modeling is splitting the data into training and testing sets. Just as we opted to clean the data identically above, we opted to split the data identically as well. We decided on using 75% of the data for training, and 25% for testing. We also used the same random seed within the train test split function so that we would be training and testing on the same homes/rows. 
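A sketch of an identical 75/25 split using scikit-learn's `train_test_split` is shown below. The seed value, the toy data, and the column names are hypothetical; the point is only that reusing the same `random_state` on the same cleaned rows yields the same training and testing homes in both notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned, transformed homes.
homes = pd.DataFrame({
    "log_sqft_living": np.linspace(6.5, 8.5, 20),
    "log_price": np.linspace(12.0, 14.0, 20),
})
X = homes[["log_sqft_living"]]
y = homes["log_price"]

SEED = 42  # hypothetical value; any fixed integer shared by both notebooks works

# 75% of rows for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

# A second split with the same seed selects exactly the same rows.
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)
print(X_train.index.equals(X_train_b.index))
```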

# Modeling

# Evaluation

# Moving Forward