![banner](./data/home-sales-shutterstock-295804091-1068x601.jpg)

# King County Home Sales
**Authors:** [Jerry Vasquez](https://www.linkedin.com/in/jerry-vasquez-832b71224/), [Paul Lindquist](https://www.linkedin.com/in/paul-lindquist/), [Vu Brown](https://www.linkedin.com/in/austin-brown-b5211384/)

## Overview
***
This is our overview

## Business Problem
***
This is our business problem

## Data
***
This is where the data is sourced from with focuses:

## Methods
***
Descriptive analysis, etc.

## Exploratory Data Analysis
***

In [None]:
# Import libraries
from collections import Counter
import folium
import itertools
from math import sqrt
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import time
import scipy.stats as stats
import seaborn as sns
sns.set_theme(palette='magma_r')
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_validate, ShuffleSplit
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelEncoder, MinMaxScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", UserWarning)

pd.set_option('display.max_rows', 500) # Allows Jupyter Notebook to expand how much data is shown.

In [None]:
# Load DataFrame
df = pd.read_csv('./data/kc_house_data.csv')

### Preliminary Exploratory Data Analysis
Understanding the aspects of the dataset...

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.nunique(axis=0)

In [None]:
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

In [None]:
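# Restrict to numeric columns; recent pandas versions raise an error when
# corr() meets object columns such as 'date' or 'grade'
df.select_dtypes(include='number').corr()

In [None]:
# Features with highest correlation to price
price_corr = df.select_dtypes(include='number').corr()['price'].map(abs).sort_values(ascending=False)
price_corr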

In [None]:
# Plot the latitude and longitude coordinates to examine where home prices are
# the highest
df_minus_outliers = df[df.price < (df.price.mean() + 3*df.price.std())].copy()

# 'price' is numeric, so the color scale must be continuous
fig = px.scatter_mapbox(df_minus_outliers, lat="lat", lon="long", color="price",
                        color_continuous_scale="IceFire", zoom=10.5, height=1000)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## Data Cleaning
***

### Identify and Drop Duplicates

In [None]:
# Create a function to identify duplicates
def determine_dupes(series):
    series_vcs = series.value_counts()
    # Values that occur more than once in the series
    series_dupes = series_vcs[series_vcs > 1]
    print("Number of distinct duplicated values: " + str(len(series_dupes)))
    print("Total rows sharing a duplicated value: " + str(series_dupes.sum()))

    return series_vcs

In [None]:
# Run duplicates function for 'id' series
determine_dupes(df.id)

In [None]:
# Drop duplicates found within 'id' series
df = df.drop_duplicates(subset=['id'], keep='last')
df.info()
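
The effect of `keep='last'` is easiest to see on a few made-up rows. The sketch below uses a hypothetical `sales` frame (not the King County data). Note that "last" means last in row order, so it only corresponds to the most recent sale if the file is sorted by date, which we assume rather than verify here.

```python
import pandas as pd

# Hypothetical toy data: id 101 appears twice
sales = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "price": [300000, 450000, 320000, 500000],
})

# keep='last' retains the final occurrence of each id in row order
deduped = sales.drop_duplicates(subset=["id"], keep="last")
print(deduped)
```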

In [None]:
# # Consider dropping duplicates based upon latitude and longitude
# df[df.duplicated(subset=['lat','long'], keep=False)].sort_values('lat')
# df = df.drop_duplicates(subset=['lat', 'long'], keep='last')

### Preparing Features for Regression: Handling Missing or Placeholder Values and Converting Object Columns to Integers

In [None]:
# Replace NaN/?/missing values with 0, None or No for respective series
# Also change object series to integer via astype function
df.yr_renovated = df.yr_renovated.fillna(0)
df.yr_renovated = df.yr_renovated.astype('int64')

df.view = df.view.fillna('NONE')

df.waterfront = df.waterfront.fillna('NO')

df.loc[df.sqft_basement == '?', 'sqft_basement'] = 0.0
df.sqft_basement = df.sqft_basement.astype('float64').astype('int64')
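
An alternative to matching the `'?'` placeholder by hand is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into `NaN` before filling. This is a sketch on hypothetical values, not a replacement for the cell above; filling with 0 is the same assumption the notebook already makes (an unknown basement is treated as no basement).

```python
import pandas as pd

raw = pd.Series(["0.0", "?", "910.0", "?"])  # hypothetical sqft_basement values

# Unparseable strings such as '?' become NaN instead of raising an error
as_numbers = pd.to_numeric(raw, errors="coerce")

# Treat unknown basements as 0, then store as integers
cleaned = as_numbers.fillna(0).astype("int64")
print(cleaned.tolist())
```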

In [None]:
df.info()

#### Resolving issues with `grade`

In [None]:
df.grade.value_counts()

In [None]:
# Change 'grade' series objects to corresponding integers
df.grade = pd.to_numeric(df.grade.map(lambda x: x.split()[0]))
df.grade.value_counts()
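
The conversion above relies on each `grade` label starting with its number. A vectorized version of the same idea, shown on a few sample labels whose format we assume (number, space, description):

```python
import pandas as pd

# Assumed label format: "<number> <description>"
grades = pd.Series(["7 Average", "11 Excellent", "6 Low Average"])

# Take the first whitespace-separated token and convert it to an integer
grade_numbers = grades.str.split().str[0].astype("int64")
print(grade_numbers.tolist())
```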

#### Resolving issues with `condition`

In [None]:
df.condition.value_counts()

In [None]:
# Change 'condition' series objects to corresponding integers
# Integer values from https://info.kingcounty.gov/assessor/esales/Glossary.aspx
# Assigning the mapped column back avoids chained in-place replacement,
# which newer pandas versions may not apply to the parent DataFrame
condition_codes = {'Poor': 1, 'Fair': 2, 'Average': 3, 'Good': 4, 'Very Good': 5}
df['condition'] = df['condition'].map(condition_codes)
df.condition.value_counts()

#### Resolving issues with `waterfront`

In [None]:
df.waterfront.value_counts()

In [None]:
# Change 'waterfront' series objects to integers
lb_make = LabelEncoder()
df['waterfront'] = lb_make.fit_transform(df['waterfront'])
df.waterfront.value_counts()
# 0:NO, 1:YES
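
`LabelEncoder` assigns integers in sorted label order, which is why `'NO'` becomes 0 and `'YES'` becomes 1. For a two-level flag, an explicit mapping states the encoding directly. A small sketch on hypothetical values:

```python
import pandas as pd

waterfront = pd.Series(["NO", "YES", "NO", "NO"])  # hypothetical values

# The dictionary documents the encoding; unmapped labels would become NaN
waterfront_codes = waterfront.map({"NO": 0, "YES": 1})
print(waterfront_codes.tolist())
```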

#### Resolving issues with `view`

In [None]:
df.view.value_counts()

In [None]:
# Change 'view' series objects to corresponding integers
# Integer values mirrored from 'condition' series (NONE is 0, so 1 is unused)
view_codes = {'NONE': 0, 'FAIR': 2, 'AVERAGE': 3, 'GOOD': 4, 'EXCELLENT': 5}
df['view'] = df['view'].map(view_codes)
df.view.value_counts()

#### Resolving issues with `date`

In [None]:
# Change 'date' series to datetime data type (may not be needed)
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.info()  # info() prints directly and returns None, so display() is not needed
display(df.head())

### Identify & Drop Outliers for Inferential Model

In [None]:
# Define inferential dataframe
infer_df = df.copy()

In [None]:
# Define function to help with removal of outliers
def determine_outliers_cut_off(series, constant):
    price_outliers_low = series.mean() - constant*series.std()
    price_outliers_high = series.mean() + constant*series.std()
    return price_outliers_high, price_outliers_low
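
To make the cutoff rule concrete, here is the same mean ± constant × standard deviation calculation on a small hypothetical series with one extreme value. The `cutoffs` helper is an illustrative stand-in for the function above. Like the notebook, the sketch filters only on the upper bound.

```python
import pandas as pd

def cutoffs(series, constant):
    """Return (high, low) bounds at mean +/- constant * std."""
    center = series.mean()
    spread = series.std()  # pandas uses the sample standard deviation (ddof=1)
    return center + constant * spread, center - constant * spread

# Hypothetical data: nineteen ordinary values and one extreme value
values = pd.Series([10.0] * 19 + [1000.0])

high, low = cutoffs(values, 3)
kept = values[values < high]  # drop only values above the upper bound
print(round(low, 1), round(high, 1), len(kept))
```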

#### Identifying Outliers 

In [None]:
# Examine price
infer_df.price.hist(bins=100);

In [None]:
# Determine price outlier cutoffs at 3 standard deviations from the mean
price_outliers_high, price_outliers_low = determine_outliers_cut_off(infer_df.price, 3)
print(price_outliers_low)
print(price_outliers_high)

In [None]:
# Examine bedrooms
display(infer_df.bedrooms.value_counts())

bedrooms_outliers_high, bedrooms_outliers_low = determine_outliers_cut_off(infer_df.bedrooms, 3)
print(bedrooms_outliers_low)
print(bedrooms_outliers_high)

In [None]:
# Examine bathrooms
display(infer_df.bathrooms.value_counts())

bathrooms_outliers_high, bathrooms_outliers_low = determine_outliers_cut_off(infer_df.bathrooms, 3)
print(bathrooms_outliers_low)
print(bathrooms_outliers_high)

In [None]:
# Examine floors
display(infer_df.floors.value_counts())

floors_outliers_high, floors_outliers_low = determine_outliers_cut_off(infer_df.floors, 3)
print(floors_outliers_low)
print(floors_outliers_high)

In [None]:
# Examine sqft_living
constant = 3.5
fig, axs = plt.subplots(figsize=(10,4))
axs.scatter(infer_df.sqft_living, infer_df.price)
axs.axvline(infer_df.sqft_living.mean() + constant*infer_df.sqft_living.std())
axs.set_title('sqft_living');

sqft_living_outliers_high, sqft_living_outliers_low = determine_outliers_cut_off(infer_df.sqft_living, constant)
print(sqft_living_outliers_low)
print(sqft_living_outliers_high)

In [None]:
# Examine sqft_lot
constant = 3
fig, axs = plt.subplots(figsize=(20,10))
axs.scatter(infer_df.sqft_lot, infer_df.price)
axs.axvline(infer_df.sqft_lot.mean() + constant*infer_df.sqft_lot.std())
axs.set_title('sqft_lot');

sqft_lot_outliers_high, sqft_lot_outliers_low = determine_outliers_cut_off(infer_df.sqft_lot, constant)
print(sqft_lot_outliers_low)
print(sqft_lot_outliers_high)

In [None]:
# Examine sqft_above
constant = 3
fig, axs = plt.subplots(figsize=(10,4))
axs.scatter(infer_df.sqft_above, infer_df.price)
axs.axvline(infer_df.sqft_above.mean() + constant*infer_df.sqft_above.std())
axs.set_title('sqft_above');

sqft_above_outliers_high, sqft_above_outliers_low = determine_outliers_cut_off(infer_df.sqft_above, constant)
print(sqft_above_outliers_low)
print(sqft_above_outliers_high)

In [None]:
# Examine sqft_basement
constant = 3
fig, axs = plt.subplots(figsize=(10,4))
axs.scatter(infer_df.sqft_basement, infer_df.price)
axs.axvline(infer_df.sqft_basement.mean() + constant*infer_df.sqft_basement.std())
axs.set_title('sqft_basement');

sqft_basement_outliers_high, sqft_basement_outliers_low = determine_outliers_cut_off(infer_df.sqft_basement, constant)
print(sqft_basement_outliers_low)
print(sqft_basement_outliers_high)

#### Remove Outliers

In [None]:
infer_df = infer_df[infer_df.price < price_outliers_high]
infer_df.info()
infer_df.price.hist(bins=100);

In [None]:
infer_df = infer_df[infer_df.bedrooms < bedrooms_outliers_high]
infer_df.info()
infer_df.bedrooms.value_counts()

In [None]:
infer_df = infer_df[infer_df.bathrooms < bathrooms_outliers_high]
infer_df.info()
infer_df.bathrooms.value_counts()

In [None]:
infer_df = infer_df[infer_df.floors < floors_outliers_high]
infer_df.info()
infer_df.floors.value_counts()

In [None]:
infer_df = infer_df[infer_df.sqft_living < sqft_living_outliers_high]
infer_df.info()

fig, axs = plt.subplots(figsize=(10,4))
axs.scatter(infer_df.sqft_living, infer_df.price)
axs.set_title('sqft_living');

In [None]:
infer_df = infer_df[infer_df.sqft_lot < sqft_lot_outliers_high]
infer_df.info()

fig, axs = plt.subplots(figsize=(20,10))
axs.scatter(infer_df.sqft_lot, infer_df.price)
axs.set_title('sqft_lot');

In [None]:
infer_df = infer_df[infer_df.sqft_above < sqft_above_outliers_high]
infer_df.info()

fig, axs = plt.subplots(figsize=(10,4))
axs.scatter(infer_df.sqft_above, infer_df.price)
axs.set_title('sqft_above');

In [None]:
infer_df = infer_df[infer_df.sqft_basement < sqft_basement_outliers_high]
infer_df.info()

fig, axs = plt.subplots(figsize=(10,4))
axs.scatter(infer_df.sqft_basement, infer_df.price)
axs.set_title('sqft_basement');

### Identify & Drop Outliers for Predictive Model

In [None]:
# df.loc[df.bedrooms == 33].sort_values('sqft_living', ascending=False).head(20)

In [None]:
# 33 bedrooms for a 1620 sqft house is a data-entry mistake. We'll drop that row.
# 9, 10 & 11 bedrooms for houses under 5000 sqft are also treated as mistakes.
# Note that this drops every row with those bedroom counts, regardless of size.
df.drop(df[df['bedrooms'].isin([9, 10, 11, 33])].index, inplace=True)

df.sort_values('bedrooms', ascending=False).head(10)

# Inferential Modeling
***

This section shows the iterative approach taken to find the best inferential model: we first determine the relevant features of interest, then analyze the coefficient of determination, feature coefficients, p-values, and other statistically relevant aspects of each model. The findings and results of each iteration (trial) are included.

In [None]:
# Create model training and testing data

# Trial 1
# X = infer_df.drop(columns=['price', 'id', 'date', 'condition', 'sqft_above',
#                            'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
#                            'lat', 'long', 'sqft_living15', 'sqft_lot15'])

# Trial 2 (Difference from Trial 1--> Dropped grade)
# X = infer_df.drop(columns=['price', 'id', 'date', 'condition', 'grade', 'sqft_above',
#                            'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 
#                            'lat', 'long', 'sqft_living15', 'sqft_lot15'])

# Trial 3 (Difference from Trial 2--> Dropped bedrooms and bathrooms)
# X = infer_df.drop(columns=['price', 'id', 'date', 'bedrooms', 'bathrooms', 'condition',
#                            'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated',
#                            'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'])

# Trial 4 (Difference from Trial 3--> Dropped waterfront)
# X = infer_df.drop(columns=['price', 'id', 'date', 'bedrooms', 'bathrooms', 'waterfront',
#                            'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built',
#                            'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'])

# Trial 5 (Difference from Trial 1--> Dropped waterfront)
# X = infer_df.drop(columns=['price', 'id', 'date', 'waterfront', 'condition', 'sqft_above',
#                            'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
#                            'sqft_living15', 'sqft_lot15'])

# Trial 6 (Difference from Trial 1--> Dropped waterfront, view, grade; Added sqft_above, sqft_basement)
# X = infer_df.drop(columns=['price', 'id', 'date', 'waterfront', 'view', 'condition', 'grade',
#                            'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'])

# Trial 7 (Difference from Trial 1--> Added condition, sqft_above, sqft_basement)
X = infer_df.drop(columns=['price', 'id', 'date', 'yr_built', 'yr_renovated',
                           'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'])

y = infer_df.price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Show feature correlation of training data
train_data = pd.concat([X_train, y_train], axis=1)
corr = train_data.corr()

fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data=corr, mask=np.triu(np.ones_like(corr, dtype=bool)),
            ax=ax,annot=True, cbar_kws={"label": "Correlation",
                                        "orientation": "horizontal",
                                        "pad": .2, "extend": "both"});

# Trial 1 - Multicollinearity Concerns:
# sqft_living & bathrooms
# sqft_living & grade

# Trial 2 - Multicollinearity Concerns:
# sqft_living & bathrooms

# Trials 3, 4 - Correlation Notes:
# sqft_living & price (a feature-target correlation, not multicollinearity between features)

# Trial 5 - Multicollinearity Concerns:
# sqft_living & bathrooms
# sqft_living & grade

# Trial 6 - Multicollinearity Concerns:
# sqft_living & bathrooms
# sqft_living & sqft_above

# Trial 7 - Multicollinearity Concerns:
# sqft_living & bathrooms
# sqft_living & grade
# sqft_living & sqft_above
# sqft_above & grade

In [None]:
# Show scatter plots of training data compared to target

# Trials 1, 2, 5, 6
# fig, axes = plt.subplots(ncols=3, nrows=3, figsize=(16, 10))

# Trials 3, 4
# fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(16, 7))

# Trial 7 
fig, axes = plt.subplots(ncols=3, nrows=4, figsize=(16, 10))

fig.set_tight_layout(True)

for index, col in enumerate(X_train.columns):
    ax = axes[index//3][index%3]
    ax.scatter(X_train[col], y_train) #, alpha=0.2)
    ax.set_xlabel(col)
    ax.set_ylabel('price')

    
# Trial 1
# fig.delaxes(axes[2][2])

# Trials 2, 5, 6
# fig.delaxes(axes[2][1])
# fig.delaxes(axes[2][2])

# Trial 3
# fig.delaxes(axes[1][2])

# Trial 4
# fig.delaxes(axes[1][1])
# fig.delaxes(axes[1][2])

# Trial 7
fig.delaxes(axes[3][2])

In [None]:
# Create baseline model with DummyRegressor
baseline = DummyRegressor()
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)
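
With default settings, `DummyRegressor` predicts the training mean for every row, and `score` returns R², so the baseline should land at or slightly below 0. The sketch below reproduces that by hand with numpy on hypothetical prices. The `r_squared` helper and the `*_demo` arrays are illustrative names, not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical prices (in thousands)
y_train_demo = np.array([200.0, 300.0, 400.0, 500.0])
y_test_demo = np.array([250.0, 350.0, 450.0])

# A mean baseline predicts the training mean for every test row
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
baseline_r2 = r_squared(y_test_demo, baseline_pred)
print(baseline_r2)
```

Any model worth keeping should score clearly above this baseline on the same test data.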

In [None]:
# Run first model with the single most highly correlated feature for this trial

# Trials 1, 5, 7
most_correlated_feature = 'grade'

# Trial 2, 3, 4, 6
# most_correlated_feature = 'sqft_living'

first_model = LinearRegression()

splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

first_scores = cross_validate(estimator=first_model,
                              X=X_train[[most_correlated_feature]],
                              y=y_train, return_train_score=True,
                              cv=splitter)

print('First Model')
print('Train score: ', first_scores['train_score'].mean())
print('Validation score: ', first_scores['test_score'].mean())

# Trials 1, 5, 7:
# First Model
# Train score:  0.414588498037074
# Validation score:  0.42492697175493815

# Trials 2, 3, 4, 6:
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

In [None]:
# Examine OLS summary table to examine coefficients of first model
sm.OLS(y_train, sm.add_constant(X_train[[most_correlated_feature]])).fit().summary()

In [None]:
# Run second model with additional, correlated features

# Trial 1
# select_features = X_train[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                            'floors', 'waterfront', 'view', 'grade']].copy()

# Trial 2
# select_features = X_train[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                            'floors', 'waterfront', 'view']].copy()

# Trial 3
# select_features = X_train[['sqft_living', 'sqft_lot',
#                            'floors', 'waterfront', 'view']].copy()

# Trial 4
# select_features = X_train[['sqft_living', 'sqft_lot',
#                            'floors', 'view']].copy()

# Trial 5
# select_features = X_train[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                            'floors', 'view', 'grade']].copy()


# Trial 6
# select_features = X_train[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                            'floors', 'sqft_above', 'sqft_basement']].copy()

# Trial 7
select_features = X_train[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
                           'floors', 'waterfront', 'view', 'condition', 'grade',
                           'sqft_above', 'sqft_basement']].copy()

second_model = LinearRegression()

second_model_scores = cross_validate(estimator=second_model,
                                     X=select_features, y=y_train,
                                     return_train_score=True, cv=splitter)

print('Second Model')
print('Train score: ', second_model_scores['train_score'].mean())
print('Validation score: ', second_model_scores['test_score'].mean())
print('First Model')
print('Train score: ', first_scores['train_score'].mean())
print('Validation score: ', first_scores['test_score'].mean())

# Trial 1:
# Second Model
# Train score:  0.5115381808835865
# Validation score:  0.5185176054579146
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 2:
# Second Model
# Train score:  0.4472809084845386
# Validation score:  0.4499643386092405
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 3:
# Second Model
# Train score:  0.4405937590300398
# Validation score:  0.44493075864886356
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 4:
# Second Model
# Train score:  0.43840158886968944
# Validation score:  0.4432119138653097
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 5:
# Second Model
# Train score:  0.5087900596439866
# Validation score:  0.5161810311821672
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 6:
# Second Model
# Train score:  0.4151475204412618
# Validation score:  0.42144705522457837
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 7:
# Second Model
# Train score:  0.5322909097073351
# Validation score:  0.5360463814225995
# First Model
# Train score:  0.414588498037074
# Validation score:  0.42492697175493815

In [None]:
# Examine OLS summary table to examine coefficients of second model
sm.OLS(y_train, sm.add_constant(select_features)).fit().summary()

In [None]:
# Run third model with features with high p-value removed

# Remove features due to high p-value and possible multicollinearity
# Trials 1, 3, 4, 5
# N/A

# Trial #2
# less_features = select_features.drop(columns=['bathrooms']).copy()

# Trial #6
# less_features = select_features.drop(columns=['bathrooms', 'sqft_above', 'sqft_basement']).copy()

# Trial #7a
# less_features = select_features.drop(columns=['floors', 'sqft_above', 'sqft_basement']).copy()

# Trial #7b
less_features = select_features.drop(columns=['floors', 'waterfront', 'sqft_above', 'sqft_basement']).copy()

third_model = LinearRegression()

third_model_scores = cross_validate(estimator=third_model,
                                     X=less_features, y=y_train,
                                     return_train_score=True, cv=splitter)

print('Third Model')
print('Train score: ', third_model_scores['train_score'].mean())
print('Validation score: ', third_model_scores['test_score'].mean())
print('Second Model')
print('Train score: ', second_model_scores['train_score'].mean())
print('Validation score: ', second_model_scores['test_score'].mean())
print('First Model')
print('Train score: ', first_scores['train_score'].mean())
print('Validation score: ', first_scores['test_score'].mean())

# Trial 2:
# Third Model
# Train score:  0.4472589810970505
# Validation score:  0.4499575049891514
# Second Model
# Train score:  0.4472809084845386
# Validation score:  0.4499643386092405
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 6:
# Third Model
# Train score:  0.4120920473349818
# Validation score:  0.41827420141563426
# Second Model
# Train score:  0.4151475204412618
# Validation score:  0.42144705522457837
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 7a:
# Third Model
# Train score:  0.5277460988907076
# Validation score:  0.5323939628237843
# Second Model
# Train score:  0.5322909097073351
# Validation score:  0.5360463814225995
# First Model
# Train score:  0.3976366000836915
# Validation score:  0.40635354544352903

# Trial 7b:
# Third Model
# Train score:  0.5249436553249848
# Validation score:  0.5300064733067893
# Second Model
# Train score:  0.5322909097073351
# Validation score:  0.5360463814225995
# First Model
# Train score:  0.414588498037074
# Validation score:  0.42492697175493815

In [None]:
# Examine OLS summary table to examine coefficients of third model
# Trials 1, 3, 4, 5
# N/A

# Trials 2, 6, 7
sm.OLS(y_train, sm.add_constant(less_features)).fit().summary()

In [None]:
# Build final model and score it
# Trial 1
# final_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                   'floors', 'waterfront', 'view', 'grade']

# Trial 2
# final_features = ['bedrooms', 'sqft_living', 'sqft_lot',
#                   'floors', 'waterfront', 'view']

# Trial 3
# final_features = ['sqft_living', 'sqft_lot',
#                   'floors', 'waterfront', 'view']

# Trial 4
# final_features = ['sqft_living', 'sqft_lot',
#                   'floors', 'view']

# Trial 5
# final_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                   'floors', 'view', 'grade']

# Trial 6
# final_features = ['bedrooms', 'sqft_living', 'sqft_lot', 'floors']

# Trial 7a
# final_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                   'waterfront', 'view', 'condition', 'grade']

# Trial 7b
final_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
                  'view', 'condition', 'grade']

X_train_final = X_train[final_features]
X_test_final = X_test[final_features]

final_model = LinearRegression()
final_model.fit(X_train_final, y_train)

final_model.score(X_test_final, y_test)

# Trial 1 Score: 0.5122565930691938
# Trial 2 Score: 0.45218493940429394
# Trial 3 Score: 0.4449265646174718
# Trial 4 Score: 0.4449057795826621
# Trial 5 Score: 0.5120769199763275
# Trial 6 Score: 0.4203671368154732
# Trial 7a Score: 0.527617569659234
# Trial 7b Score: 0.527653369069877

## Results
***
This section provides the RMSE, coefficients of features, intercept, and the four assumptions of linear regression for the final inferential model.

In [None]:
# Show feature correlation of training data

# Trial 1
# final_features_include_price = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                                 'floors', 'waterfront', 'view', 'grade']

# Trial 2: 
# final_features_include_price = ['price', 'bedrooms', 'sqft_living', 'sqft_lot',
#                                 'floors', 'waterfront', 'view']

# Trial 3: 
# final_features_include_price = ['price', 'sqft_living', 'sqft_lot',
#                                 'floors', 'waterfront', 'view']

# Trial 4:
# final_features_include_price = ['price', 'sqft_living', 'sqft_lot', 'floors', 'view']

# Trial 5:
# final_features_include_price = ['price', 'bedrooms', 'bathrooms', 'sqft_living',
#                                 'sqft_lot', 'floors', 'view', 'grade']

# Trial 6:
# final_features_include_price = ['price', 'bedrooms', 'sqft_living', 'sqft_lot', 'floors']

# Trial 7a:
# final_features_include_price = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
#                                 'waterfront', 'view', 'condition', 'grade']

# Trial 7b:
final_features_include_price = ['price', 'bedrooms', 'bathrooms', 'sqft_living',
                                'sqft_lot', 'view', 'condition', 'grade']


final_features_infer_df = infer_df[final_features_include_price]
corr = final_features_infer_df.corr()

fig, ax = plt.subplots(figsize=(16,10))
sns.heatmap(data=corr, mask=np.triu(np.ones_like(corr, dtype=bool)),
            ax=ax,annot=True, cbar_kws={"label": "Correlation",
                                        "orientation": "horizontal",
                                        "pad": .2, "extend": "both"})
ax.set_title('Correlation Heatmap of Inferential Model Features');
plt.savefig('./data/correlation_heatmap.jpg', dpi=300)

In [None]:
# Check RMSE
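# Take the square root explicitly; the squared=False argument is deprecated
# and has been removed in recent scikit-learn releases
RMSE = sqrt(mean_squared_error(y_test, final_model.predict(X_test_final)))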

RMSE

# Trial 1 RMSE: 171891.21890108107
# Trial 2 RMSE: 182169.2084828494
# Trial 3 RMSE: 183372.0791291402
# Trial 4 RMSE: 183375.5123319613
# Trial 5 RMSE: 171922.8763082007
# Trial 6 RMSE: 187384.85530621544
# Trial 7a RMSE: 169162.79631882257
# Trial 7b RMSE: 169156.38621255787
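
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as price (dollars). A worked example on three hypothetical sales:

```python
import numpy as np

actual = np.array([300000.0, 450000.0, 600000.0])     # hypothetical sale prices
predicted = np.array([320000.0, 400000.0, 650000.0])  # hypothetical predictions

# Errors are -20000, 50000 and -50000; square, average, then take the root
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones.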

In [None]:
# Coefficients and intercept of final model
print(pd.Series(final_model.coef_, index=X_train_final.columns, name="Coefficients"))
print("Intercept:", final_model.intercept_)

# Trial 1:
# bedrooms       -12403.223041
# bathrooms      -24266.956596
# sqft_living       134.121810
# sqft_lot           -1.088879
# floors         -10654.493719
# waterfront     229531.849627
# view            40836.318305
# grade           93085.076678
# Name: Coefficients, dtype: float64
# Intercept: -366070.7543283864

# Trial 2:
# bedrooms       -29063.830949
# sqft_living       213.256807
# sqft_lot           -1.030757
# floors          22355.715531
# waterfront     194527.027628
# view            45831.812383
# Name: Coefficients, dtype: float64
# Intercept: 136144.5077016906

# Trial 3:
# sqft_living       191.904577
# sqft_lot           -0.919129
# floors          25587.279592
# waterfront     202699.625657
# view            47821.575812
# Name: Coefficients, dtype: float64
# Intercept: 75236.04639607953

# Trial 4:
# sqft_living      190.853129
# sqft_lot          -0.870734
# floors         26097.979106
# view           51603.310306
# Name: Coefficients, dtype: float64
# Intercept: 75743.4946812251

# Trial 5:
# bedrooms      -12935.899346
# bathrooms     -24223.335353
# sqft_living      133.802466
# sqft_lot          -1.035701
# floors         -9878.714194
# view           45113.873883
# grade          92505.116943
# Name: Coefficients, dtype: float64
# Intercept: -361427.00863892934

# Trial 6:
# bedrooms      -35927.312809
# sqft_living      229.127926
# sqft_lot          -0.988000
# floors         15505.418021
# Name: Coefficients, dtype: float64
# Intercept: 150018.0302155278

# Trial 7a:
# bedrooms       -15310.733187
# bathrooms      -21054.230584
# sqft_living       130.654034
# sqft_lot           -1.114860
# waterfront     231748.926119
# view            39120.879396
# condition       49844.313685
# grade           97399.866558
# Name: Coefficients, dtype: float64
# Intercept: -573847.2957910688

# Trial 7b:
# bedrooms      -15882.999485
# bathrooms     -20761.202482
# sqft_living      130.298547
# sqft_lot          -1.064769
# view           43412.287164
# condition      49681.336403
# grade          96913.876757
# Name: Coefficients, dtype: float64
# Intercept: -568477.6204495409
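
To read these numbers, recall that a linear model predicts price as the intercept plus the sum of each coefficient times its feature value. The sketch below plugs the Trial 7b coefficients and intercept reported above into that formula for a hypothetical house. The house attributes are made up for illustration.

```python
# Trial 7b coefficients and intercept, copied from the output above
coefficients = {
    "bedrooms": -15882.999485,
    "bathrooms": -20761.202482,
    "sqft_living": 130.298547,
    "sqft_lot": -1.064769,
    "view": 43412.287164,
    "condition": 49681.336403,
    "grade": 96913.876757,
}
intercept = -568477.6204495409

# Hypothetical house used only for illustration
house = {"bedrooms": 3, "bathrooms": 2, "sqft_living": 2000,
         "sqft_lot": 5000, "view": 0, "condition": 3, "grade": 7}

def predict(features):
    """Intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())

predicted_price = predict(house)

# Raising grade by one level, all else equal, shifts the prediction
# by exactly the grade coefficient
upgraded = dict(house, grade=house["grade"] + 1)
grade_effect = predict(upgraded) - predicted_price
print(round(predicted_price, 2), round(grade_effect, 2))
```

Each coefficient is an "all else held constant" effect, which is why the negative signs on `bedrooms` and `bathrooms` need to be read alongside `sqft_living` rather than in isolation.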

In [None]:
# Trial 7b
coef_s = pd.Series(final_model.coef_, index=X_train_final.columns, name="Coefficients")
coef_s.sort_values(ascending=False, inplace=True)

sns.set_style(style='dark')
fig, ax = plt.subplots(figsize=(16, 10))
# Pass x and y as keywords; newer seaborn versions reject positional data arguments
ax = sns.barplot(x=coef_s.index, y=coef_s.values, palette='magma_r')
ax.axhline(y=0, color='black')
ax.set_xlabel('Features', fontsize=14, labelpad=13)
ax.set_ylabel('Coefficients ($)', fontsize=14)
ax.set_title('Coefficients of Features', fontsize=15)
ax.ticklabel_format(axis='y', useOffset=False, style='plain')
# Use a dedicated name so the target variable 'y' is not overwritten
y_ticks = np.array([-20000, 0, 20000, 40000, 60000, 80000, 100000])
y_ticks_labels = ["-20K","0", "20K", "40K", "60K", "80K", "100K"]
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_ticks_labels)
xlocs, xlabs = plt.xticks()
for i, v in enumerate(coef_s):
    string = ''
    if v > 0:
        string = '$' + str(abs(round(v,2)))
        plt.text(xlocs[i] - 0.225, v + 1000, string, weight='bold')
    else:
        string = '-$' + str(abs(round(v,2)))
        plt.text(xlocs[i] - 0.225, v - 3000, string, weight='bold')
plt.savefig('./data/coefficients.jpg', dpi=300);

In [None]:
# Checking that the independence (i.e., no multicollinearity) assumption holds
vif = [variance_inflation_factor(X_train_final.values, i) for i in range(X_train_final.shape[1])]
pd.Series(vif, index=X_train_final.columns, name="Variance Inflation Factor")

# Trial 1:
# bedrooms       21.353913
# bathrooms      24.684748
# sqft_living    21.009740
# sqft_lot        1.756314
# floors         13.073733
# waterfront      1.098957
# view            1.239649
# grade          32.465846
# Name: Variance Inflation Factor, dtype: float64

# Trial 2: 
# bedrooms       14.387326
# sqft_living    14.998293
# sqft_lot        1.705328
# floors          7.919926
# waterfront      1.098914
# view            1.231901
# Name: Variance Inflation Factor, dtype: float64

# Trial 3: 
# sqft_living    8.113156
# sqft_lot       1.698304
# floors         6.783588
# waterfront     1.098832
# view           1.225531
# Name: Variance Inflation Factor, dtype: float64

# Trial 4:
# sqft_living    8.080519
# sqft_lot       1.692461
# floors         6.774367
# view           1.122524
# Name: Variance Inflation Factor, dtype: float64

# Trial 5:
# bedrooms       21.351724
# bathrooms      24.684676
# sqft_living    20.983958
# sqft_lot        1.750573
# floors         13.064506
# view            1.137428
# grade          32.464611
# Name: Variance Inflation Factor, dtype: float64

# Trial 6:
# bedrooms       14.299366
# sqft_living    14.393143
# sqft_lot        1.699181
# floors          7.882150
# Name: Variance Inflation Factor, dtype: float64

# Trial 7a:
# bedrooms       24.192977
# bathrooms      22.115402
# sqft_living    22.793305
# sqft_lot        1.728665
# waterfront      1.098486
# view            1.229774
# condition      20.103667
# grade          45.568216
# Name: Variance Inflation Factor, dtype: float64

# Trial 7b:
# bedrooms       24.185220
# bathrooms      22.113130
# sqft_living    22.772153
# sqft_lot        1.723913
# view            1.129638
# condition      20.098086
# grade          45.567261
# Name: Variance Inflation Factor, dtype: float64
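
One caveat when reading these values: to our understanding, statsmodels' `variance_inflation_factor` regresses each column on the others exactly as given and does not add a constant, so a design matrix without an intercept column can yield larger VIFs than the textbook definition, which is 1 / (1 - R²) from a regression that includes an intercept. The numpy sketch below computes that intercept-based version on synthetic data. The `vif_with_intercept` helper and the columns `a`, `b`, `c` are hypothetical.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the others plus a constant."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    result = []
    for i in range(n_cols):
        target = X[:, i]
        # Design matrix: a column of ones plus every other feature
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        result.append(1 / (1 - r2))
    return np.array(result)

rng = np.random.default_rng(0)
a = rng.normal(50, 5, size=500)       # values far from zero with little spread
b = rng.normal(100, 10, size=500)     # independent of a
c = a + rng.normal(0, 1, size=500)    # nearly a copy of a

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs.round(2))
```

Here the independent column `b` has a VIF near 1, while `a` and `c` are flagged as collinear. Comparing against a version like this can help separate genuine collinearity from inflation caused by the missing constant.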

In [None]:
# Checking linearity assumption holds
preds = final_model.predict(X_test_final)
fig, ax = plt.subplots()

# A perfect fit means predicted == actual, so plot the line y = x
perfect_line = [y_test.min(), y_test.max()]
ax.plot(perfect_line, perfect_line, linestyle="--", color="red", label="Perfect Fit")
ax.scatter(y_test, preds, alpha=0.5)
ax.set_xlabel("Actual Price")
ax.set_ylabel("Predicted Price")
ax.legend();

In [None]:
# Checking normality assumption holds
residuals = (y_test - preds)
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);

In [None]:
# Checking homoscedasticity assumption holds
fig, ax = plt.subplots()

ax.scatter(preds, residuals, alpha=0.5)
ax.plot(preds, [0 for i in range(len(X_test))])
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");

## Predictive Modeling
***
In this section, we iteratively build a predictive model from many correlated features, then log-transform and scale the data.

In [None]:
# Create model training and testing data
X = df.drop(['price'], axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Examine target ('price') distribution
sns.distplot(y_train, fit=stats.norm)
fig = plt.figure()
stats.probplot(y_train, plot=plt);

In [None]:
# Run log function to normalize target data
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)
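
To illustrate why a log transform helps with a right-skewed target, here is a minimal sketch on synthetic data. The `fake_prices` array is a hypothetical stand-in drawn from a log-normal distribution, not the King County data, so the printed values are only illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed "prices" (assumption: log-normal, not the real dataset)
rng = np.random.default_rng(42)
fake_prices = rng.lognormal(mean=13.0, sigma=0.5, size=5000)

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = stats.skew(fake_prices)
skew_log = stats.skew(np.log(fake_prices))

print(f"Skew before log: {skew_raw:.2f}")
print(f"Skew after log:  {skew_log:.2f}")

# np.exp is the inverse of np.log, so the original values can be recovered
recovered = np.exp(np.log(fake_prices))
```

Because `np.log` is the natural log, any later conversion back to dollars would use `np.exp`.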

In [None]:
# Re-examine target ('price') 
sns.distplot(y_train_log, fit=stats.norm)
fig = plt.figure()
stats.probplot(y_train_log, plot=plt);

In [None]:
# Show feature correlation of training data
train_data = pd.concat([X_train, y_train], axis=1)
corr = train_data.corr()

fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(data=corr, mask=np.triu(np.ones_like(corr, dtype=bool)),
            ax=ax,annot=True, cbar_kws={"label": "Correlation",
                                        "orientation": "horizontal",
                                        "pad": .2, "extend": "both"});

In [None]:
# Show linear correlation with 'price' & 'sqft_living'
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(df_minus_outliers.sqft_living, df_minus_outliers.price, alpha=0.5)
ax.set_xlabel('Living Area in Sq Ft')
ax.set_ylabel('Price')
ax.ticklabel_format(style='plain', axis='y')
ax.set_title('Living Area (Ft²) vs. Price', fontsize=13)
m, b = np.polyfit(df_minus_outliers.sqft_living, df_minus_outliers.price, 1)
plt.plot(df_minus_outliers.sqft_living, m*df_minus_outliers.sqft_living + b, color='black');

In [None]:
# Create baseline model with DummyRegressor
baseline = DummyRegressor()
baseline.fit(X_train, y_train_log)
baseline.score(X_test, y_test_log)
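
As a reference for reading the baseline score, the sketch below shows what `DummyRegressor` does with its default `strategy="mean"`: it ignores the features and always predicts the training mean, so its R² on held-out data is at most 0. The data here is synthetic and the `X_demo`/`y_demo` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic data (assumption: stands in for the real features and log price)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=200)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# The dummy model predicts the training mean for every row
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_r2 = dummy.score(X_te, y_te)

print("Constant prediction:", dummy.predict(X_te)[0])
print("Baseline R^2 on held-out data:", dummy_r2)
```

Any model worth keeping should score clearly above this baseline.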

In [None]:
# Run first model with the most highly correlated feature ('sqft_living')

first_model = LinearRegression()

splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

first_scores = cross_validate(estimator=first_model,
                                 X=X_train[['sqft_living']],
                                 y=y_train_log, return_train_score=True,
                                 cv=splitter)

print('Train score: ', first_scores['train_score'].mean())
print('Validation score: ', first_scores['test_score'].mean())

In [None]:
# Add additional, correlated features to X_train data
select_features = X_train[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
                             'floors', 'waterfront', 'view', 'condition', 'grade',
                             'sqft_above', 'sqft_basement', 'sqft_living15',
                             'sqft_lot15']].copy()

In [None]:
# Run 2nd model with additional, correlated features
second_model_with_ylog = LinearRegression()

second_model_scores = cross_validate(estimator=second_model_with_ylog,
                                     X=select_features, y=y_train_log,
                                     return_train_score=True, cv=splitter)

print('Second Model')
print('Train score: ', second_model_scores['train_score'].mean())
print('Validation score: ', second_model_scores['test_score'].mean())
print()
print('First Model')
print('Train score: ', first_scores['train_score'].mean())
print('Validation score: ', first_scores['test_score'].mean())

In [None]:
# View OLS summary table to examine coefficients and p-values
sm.OLS(y_train_log, sm.add_constant(select_features)).fit().summary()

In [None]:
# Remove 'sqft_basement' due to high p-value and possible multicollinearity
less_features = select_features.drop(['sqft_basement'], axis=1).copy()

In [None]:
# Run 3rd model with 'sqft_basement' removed
third_model_with_ylog = LinearRegression()

third_model_scores = cross_validate(estimator=third_model_with_ylog,
                                     X=less_features, y=y_train_log,
                                     return_train_score=True, cv=splitter)

print('Third Model')
print('Train score: ', third_model_scores['train_score'].mean())
print('Validation score: ', third_model_scores['test_score'].mean())
print()
print('Second Model')
print('Train score: ', second_model_scores['train_score'].mean())
print('Validation score: ', second_model_scores['test_score'].mean())
print()
print('First Model')
print('Train score: ', first_scores['train_score'].mean())
print('Validation score: ', first_scores['test_score'].mean())

In [None]:
# Use recursive feature elimination and feature selection to examine significant features
X_train_for_RFECV = StandardScaler().fit_transform(less_features)

model_for_RFECV = LinearRegression()

selector = RFECV(model_for_RFECV, cv=splitter)
selector.fit(X_train_for_RFECV, y_train_log)

print("Was the column selected?")
for index, col in enumerate(less_features.columns):
    print(f"{col}: {selector.support_[index]}")
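
For context on reading the RFECV output, the following sketch runs the same kind of selector on synthetic data where only two of four features drive the target. The column names and data are hypothetical. Besides `support_`, the fitted selector exposes `ranking_`, where selected features have rank 1.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit

# Synthetic data (assumption): only the first two columns affect the target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

demo_splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
demo_selector = RFECV(LinearRegression(), cv=demo_splitter)
demo_selector.fit(X_demo, y_demo)

# support_ is a boolean mask; ranking_ is 1 for every selected feature
demo_columns = ["signal_a", "signal_b", "noise_c", "noise_d"]
for name, kept, rank in zip(demo_columns, demo_selector.support_, demo_selector.ranking_):
    print(f"{name}: selected={kept}, rank={rank}")
```

Note that RFECV may keep uninformative features when they do not hurt the cross-validated score, so a `True` value is not proof that a feature is useful.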

Next, we create a final model with the selected features. This is also where we'll log-transform and scale the remaining data (independent variables).

In [None]:
final_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
                  'waterfront', 'view', 'condition', 'grade', 'sqft_above',
                  'sqft_living15', 'sqft_lot15']

In [None]:
# Build final model and score it

X_train_final = X_train[final_features]
X_test_final = X_test[final_features]

final_model = LinearRegression()
final_model.fit(X_train_final, y_train_log)

final_model.score(X_test_final, y_test_log)

In [None]:
# Check RMSE
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn releases
np.sqrt(mean_squared_error(y_test_log, final_model.predict(X_test_final)))

Next, we log-transform and scale the independent variables (X_train, X_test) and scale the target variable (y_train_log, y_test_log). Note that the target has already been log-transformed.

In [None]:
# Examine skew of final features
X_train[final_features].hist(figsize=(12,12));

In [None]:
# Apply log to continuous features and re-examine skew
X_train_continuous_log = pd.DataFrame([])
X_train_continuous_log['sqft_living_log'] = np.log(X_train['sqft_living'])
X_train_continuous_log['sqft_lot_log'] = np.log(X_train['sqft_lot'])
X_train_continuous_log['sqft_above_log'] = np.log(X_train['sqft_above'])
X_train_continuous_log['sqft_living15_log'] = np.log(X_train['sqft_living15'])
X_train_continuous_log['sqft_lot15_log'] = np.log(X_train['sqft_lot15'])
X_train_continuous_log.hist(figsize=(12,12));

In [None]:
# Create a DataFrame of all train features (independent & target) so
# everything can be scaled
X_train_discrete = X_train[['bedrooms', 'bathrooms', 'floors', 'waterfront',
                           'view', 'condition', 'grade']]

X_train_cont_disc = pd.concat([X_train_continuous_log, X_train_discrete, y_train_log],
                              axis=1)

train_columns = X_train_cont_disc.columns

In [None]:
# Scale all training features
scaler = StandardScaler()
X_train_log_scaled = scaler.fit_transform(X_train_cont_disc)

In [None]:
# Re-separate target and independent features
X_train_full = pd.DataFrame(X_train_log_scaled, columns=train_columns)

y_train_log_scaled = X_train_full['price']
X_train_log_scaled = X_train_full.drop(columns=['price'])
X_train_log_scaled

In [None]:
# Repeat the above process for the testing data
X_test_continuous_log = pd.DataFrame([])
X_test_continuous_log['sqft_living_log'] = np.log(X_test['sqft_living'])
X_test_continuous_log['sqft_lot_log'] = np.log(X_test['sqft_lot'])
X_test_continuous_log['sqft_above_log'] = np.log(X_test['sqft_above'])
X_test_continuous_log['sqft_living15_log'] = np.log(X_test['sqft_living15'])
X_test_continuous_log['sqft_lot15_log'] = np.log(X_test['sqft_lot15'])

X_test_discrete = X_test[['bedrooms', 'bathrooms', 'floors', 'waterfront',
                          'view', 'condition', 'grade']]
X_test_cont_disc = pd.concat([X_test_continuous_log, X_test_discrete, y_test_log],
                              axis=1)
test_columns = X_test_cont_disc.columns

scaler2 = StandardScaler()
X_test_log_scaled = scaler2.fit_transform(X_test_cont_disc)

X_test_full = pd.DataFrame(X_test_log_scaled, columns=test_columns)

y_test_log_scaled = X_test_full['price']
X_test_log_scaled = X_test_full.drop(columns=['price'])
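
A common alternative to fitting a separate scaler on the test set is to fit `StandardScaler` once on the training data and reuse its statistics for the test data, so both splits share one scale. The sketch below shows that pattern on synthetic arrays; `train_demo` and `test_demo` are hypothetical names, and this is not the approach used to produce the scores in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test splits (assumption: stand-ins for the real feature frames)
rng = np.random.default_rng(7)
train_demo = rng.normal(loc=10.0, scale=2.0, size=(400, 2))
test_demo = rng.normal(loc=10.0, scale=2.0, size=(100, 2))

# Learn mean and standard deviation from the training data only
demo_scaler = StandardScaler().fit(train_demo)

# Apply the same training statistics to both splits
train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)

print("Train column means:", train_scaled.mean(axis=0))
print("Test column means: ", test_scaled.mean(axis=0))
```

The test means will be close to, but generally not exactly, zero, which reflects that the test set was scaled with statistics it did not contribute to.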

In [None]:
# Create, run and score final model using log and scaled data
final_model_log_scaled = LinearRegression()
final_model_log_scaled.fit(X_train_log_scaled, y_train_log_scaled)

final_model_log_scaled.score(X_test_log_scaled, y_test_log_scaled)

In [None]:
# Find normalized-scaled RMSE
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn releases
RMSE_log_scaled = np.sqrt(mean_squared_error(y_test_log_scaled,
                   final_model_log_scaled.predict(X_test_log_scaled)))
RMSE_log_scaled

In [None]:
# Convert normalized-scaled RMSE back to USD
target_log = pd.concat([y_train_log_scaled, y_test_log_scaled], axis=0)

y_hat_train = final_model_log_scaled.predict(X_train_log_scaled)
y_hat_test = final_model_log_scaled.predict(X_test_log_scaled)

def inv_normalize_price(feature_normalized):

    mu = target_log.mean()
    sd = target_log.std()
    return sd*feature_normalized + mu

inv1 = 10**(inv_normalize_price(y_train_log_scaled))
inv2 = 10**(inv_normalize_price(y_hat_train))
inv3 = 10**(inv_normalize_price(y_test_log_scaled))
inv4 = 10**(inv_normalize_price(y_hat_test))

# Transform back to regular $USD price (not log price)
train_mse_non_log = mean_squared_error(inv1, inv2)
test_mse_non_log = mean_squared_error(inv3, inv4)

# Take the square root of MSE to find RMSE * 100 for USD units
non_log_train = round(np.sqrt(train_mse_non_log)*100, 2)
non_log_test = round(np.sqrt(test_mse_non_log)*100, 2)

print(f'Train RMSE non-log: ${non_log_train}')
print(f'Test RMSE non-log: ${non_log_test}')
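
For reference, one way to map a standardized natural-log target back to dollars is to undo the scaling with the fitted scaler's own statistics and then apply `np.exp`, the inverse of `np.log`. The sketch below demonstrates this round trip on synthetic prices; `to_usd`, `price_scaler` and the data are hypothetical and are not the code that produced the RMSE figures reported in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic prices in USD (assumption: log-normal stand-in for the real target)
rng = np.random.default_rng(3)
prices_usd = rng.lognormal(mean=13.0, sigma=0.5, size=1000)

# Forward transform: natural log, then standardize
log_prices = np.log(prices_usd).reshape(-1, 1)
price_scaler = StandardScaler().fit(log_prices)
scaled_log_prices = price_scaler.transform(log_prices).ravel()

def to_usd(scaled_values, fitted_scaler):
    """Undo standardization, then undo the natural log."""
    values = np.asarray(scaled_values, dtype=float).reshape(-1, 1)
    return np.exp(fitted_scaler.inverse_transform(values)).ravel()

# Round trip recovers the original prices
recovered_usd = to_usd(scaled_log_prices, price_scaler)

# Hypothetical noisy "predictions" in the scaled log space
noisy_preds = scaled_log_prices + rng.normal(scale=0.3, size=1000)
rmse_usd = np.sqrt(np.mean((prices_usd - to_usd(noisy_preds, price_scaler)) ** 2))
print(f"RMSE in USD: ${rmse_usd:,.2f}")
```

With this approach the RMSE comes out directly in dollars, so no extra multiplier is needed.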

Checking Linear Assumptions (though not as important for predictive purposes)

Linearity

In [None]:
preds = final_model_log_scaled.predict(X_test_log_scaled)
fig, ax = plt.subplots(figsize=(12,8))
# Perfect-fit reference line (y = x) across the range of the scaled log target
perfect_line = np.linspace(y_test_log_scaled.min(), y_test_log_scaled.max(), 100)
ax.plot(perfect_line, perfect_line, linestyle="--", color="black", label="Perfect Fit")
ax.scatter(y_test_log_scaled, preds, alpha=0.5)
ax.set_xlabel("Actual Price (log, scaled)")
ax.set_ylabel("Predicted Price (log, scaled)")
ax.legend();

Normality

In [None]:
residuals = (y_test_log_scaled - preds)
sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True);

Multicollinearity (Independence Assumption)

In [None]:
vif = [variance_inflation_factor(X_train_log_scaled.values, i) for i in range(X_train_log_scaled.shape[1])]
pd.Series(vif, index=X_train_log_scaled.columns, name="Variance Inflation Factor")
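
To make the VIF numbers easier to interpret, here is a minimal sketch of the definition: for each feature, regress it on the other features and compute 1 / (1 - R²). The `manual_vif` helper and the synthetic columns are hypothetical. Unlike the statsmodels function, this version fits an intercept, so values may differ slightly when the data is not centered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_vif(X):
    """Return the VIF of each column, computed as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        # Regress column j on all remaining columns
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features (assumption): b is nearly a copy of a, c is independent
rng = np.random.default_rng(5)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

demo_vifs = manual_vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], demo_vifs.round(2))))
```

The two nearly duplicate columns receive large VIFs while the independent column stays near 1, which mirrors how overlapping size-related features can inflate each other's VIF.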

Homoscedasticity

In [None]:
fig, ax = plt.subplots()
ax.scatter(preds, residuals, alpha=0.5)
ax.plot(preds, [0 for i in range(len(X_test_log_scaled))])
ax.set_xlabel("Predicted Value")
ax.set_ylabel("Actual - Predicted Value");

## Results
***
Given the provided dataset and a linear regression approach, neither our inferential nor our predictive model performed as well as we had hoped.
- For the inferential model, homoscedasticity was poor, multicollinearity (VIF) was too high and the RMSE exceeded \$173k. Linearity and normality were decent.
- For the predictive model, multicollinearity (VIF) was too high for several features, the model score (R²) was an underwhelming 0.601629 and the RMSE of more than \$193k was too large to provide any predictive value. Linearity and homoscedasticity were decent and normality was excellent, due to normalizing the data in pre-processing.

## Recommendations
***
We recommend against using our models for inferential or predictive purposes and suggest exploring modeling approaches other than linear regression.

## Overall Conclusions
***