## Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


## Business Understanding

- Predictive sale pricing for realtors working with sellers (price the home to sell)

### Problem/Stakeholder
We are a data science consulting company working with  

### Data Understanding

## Data Preparation

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
from random import gauss
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats as stats
from statsmodels.formula.api import ols
from sklearn.dummy import DummyRegressor

%matplotlib inline

Pulling in data and exploring data prior to cleaning.

In [None]:
data = pd.read_csv('./data/kc_house_data.csv')

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

We see nulls in "waterfront", "view", and "yr_renovated" columns. 

In [None]:
data.corr()

In [None]:
sns.heatmap(data.corr());

In [None]:
#Yr_renovated and price correlation .1296; might flatten past a certain year.  

In [None]:
data.yr_renovated.value_counts()

We see we have both nulls and "0" values in this column.

In [None]:
data.yr_renovated.describe()

In [None]:
17755 - 17011

In [None]:
data.info()

In [None]:
data.drop(columns='yr_renovated', inplace=True)

Dropping the 'yr_renovated' column because so few rows hold an actual renovation year; most values are either null or 0.
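Before dropping a column like this, it can help to put a number on how little usable information it holds. The following is a minimal sketch on a hypothetical toy column (the values are illustrative and not taken from the King County file); it counts nulls, zeros, and real renovation years separately.

```python
import pandas as pd

# Hypothetical toy column standing in for data['yr_renovated'];
# these values are made up for illustration.
toy = pd.DataFrame(
    {"yr_renovated": [0.0, 0.0, None, 1991.0, 0.0, None, 0.0, 2005.0]}
)

col = toy["yr_renovated"]
n_null = int(col.isna().sum())   # missing entries
n_zero = int((col == 0).sum())   # 0 appears to mean "never renovated"
n_real = int((col > 0).sum())    # rows with an actual renovation year

# Share of rows that carry a usable year
share_real = n_real / len(col)
print(n_null, n_zero, n_real, share_real)
```

If the share of real years is small, dropping the column (or collapsing it into a renovated / not renovated flag) is a reasonable choice.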

In [None]:
data.info()

In [None]:
data.waterfront.value_counts()

In [None]:
data.view.value_counts()

Only 60 null values; we could drop those rows with dropna or replace them with the mode value, "NONE".
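A minimal sketch of the mode-fill option, using a hypothetical toy Series in place of `data['view']`. It looks up the mode rather than hard-coding the label, which guards against a typo in the replacement string.

```python
import pandas as pd

# Hypothetical toy column standing in for data['view'].
view = pd.Series(["NONE", "NONE", None, "GOOD", "NONE", None, "FAIR"])

mode_value = view.mode().iloc[0]        # most frequent non-null label
view_filled = view.fillna(mode_value)   # returns a new Series

print(mode_value)
print(view_filled.value_counts())
```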

In [None]:
# Assign the result back; in-place fills on a selected column are unreliable in recent pandas
data['view'] = data['view'].fillna("NONE")

In [None]:
data['view'].value_counts()

In [None]:
data.info()

In [None]:
data['waterfront'].value_counts()

"NO" is the overwhelming mode; we could replace nulls with the mode or create a third category, "UNKNOWN".

In [None]:
# Keep missing values as their own category instead of assuming "NO"
data['waterfront'] = data['waterfront'].fillna("UNKNOWN")

In [None]:
data['waterfront'].value_counts()

In [None]:
data.info()

In [None]:
data['sqft_basement'].value_counts()

Here we see some values of "?", so we need to decide how to clean this column. We will replace "?" with 0 because a large proportion of the column is already 0.

In [None]:
# Replace the "?" placeholder with 0 and convert the column from strings to floats
data['sqft_basement'] = data['sqft_basement'].replace('?', 0.0).astype(float)
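One alternative way to handle placeholder strings is `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into NaN so the unknowns can be counted before being filled. This is a sketch on a hypothetical toy Series, not the actual column.

```python
import pandas as pd

# Hypothetical toy column; strings mimic a column read in as text.
basement = pd.Series(["0.0", "300.0", "?", "0.0", "910.0"])

as_number = pd.to_numeric(basement, errors="coerce")  # "?" becomes NaN
n_unknown = int(as_number.isna().sum())               # how many placeholders there were
cleaned = as_number.fillna(0.0)                       # same choice as in the text: treat "?" as 0

print(n_unknown)
print(cleaned.dtype)
```

Coercing first also catches any other unexpected non-numeric entries, not just "?".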

In [None]:
data['sqft_basement'].value_counts()

Adding a price per sqft column to help with comparison.

In [None]:
data['price_per_sqft_living'] = (data['price']/data['sqft_living'])

In [None]:
data.head()

We want to look further into how distance from the center of Seattle affects sale price. We will create a column holding the distance between each home and the center of Seattle, using the Seattle Art Museum coordinates as the reference point: (lat = 47.6077, long = -122.337).

In [None]:
from geopy.distance import geodesic
import geopy

In [None]:
print(geodesic((47.5112, -122.257), (47.6077, -122.337)).miles)

In [None]:
data.head()

In [None]:
coords = (47.6077, -122.337)

In [None]:
data['distance_from_Seattle'] = data.apply(lambda x: geopy.distance.distance((x.lat, x.long), coords).miles, axis=1)
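For intuition about what the distance column measures, here is a minimal sketch of the haversine (great-circle) formula in plain Python. It is not what geopy does internally: haversine assumes a spherical Earth, while geopy's geodesic distance uses an ellipsoid model, so the two results differ slightly. The function name and the radius constant are illustrative assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; a spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


seattle = (47.6077, -122.337)   # Seattle Art Museum, as in the text
home = (47.5112, -122.257)      # the sample point used in the geodesic check
d = haversine_miles(*home, *seattle)
print(round(d, 2))
```

Because the formula is only arithmetic and trigonometry, it could also be applied to whole columns at once with NumPy, which tends to be much faster than a row-wise `apply`.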

In [None]:
data.head()

We have cleaned all of our null values. 

In [None]:
data.corr()

In [None]:
cor = data.corr()

plt.figure(figsize = (15, 8))
sns.heatmap(cor, annot=True);

We see a large difference between the correlation of condition with price and the correlation of grade with price, even though the two columns seem to describe similar attributes.

- 'sqft_living' is highly correlated with 'bathrooms', 'grade', 'sqft_above', and 'sqft_living15'
- 'sqft_lot' is highly correlated with 'sqft_lot15'

Starting with our simple model, we will look at the model utilizing sqft_living as our independent variable based on it having the highest correlation with price.

In [None]:
simple_formula = 'price ~ sqft_living'
simple_mod = ols(formula=simple_formula, data=data).fit()

In [None]:
simple_mod_summ = simple_mod.summary()

In [None]:
simple_mod_summ

For our simple model, we see an R-squared of .493, meaning sqft_living alone explains roughly 49% of the variance in price. The p-value is < .05, so the relationship between sqft_living and price is statistically significant.
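To make the R-squared figure concrete, the sketch below fits a one-variable least-squares line by hand and computes R-squared as 1 - SS_res / SS_tot. The data are hypothetical toy values, not rows from the dataset.

```python
# Hypothetical toy data: living area (sqft) and sale price (dollars).
x = [1000, 1500, 2000, 2500, 3000]
y = [300000, 340000, 450000, 480000, 600000]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for price ~ sqft_living
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# R-squared: share of the variance in y explained by the fitted line
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

The slope reads as dollars per additional square foot, which is the same interpretation the sqft_living coefficient has in the statsmodels summary.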

In [None]:
## Other regressions prior to fitting that checked variables with high correlation to price

In [None]:
formula = 'price ~ bedrooms + sqft_living + sqft_lot + floors + condition + yr_built + zipcode + lat + long'
mod = ols(formula=formula, data=data).fit()
mod_summ = mod.summary()

In [None]:
mod_summ

We see a very high p-value for sqft_lot. We also see a very high Cond. No., suggesting strong multicollinearity or other numerical problems. Predictors on very different scales can inflate this number, so we need to scale our data. How to improve our regression:
- Scaling
- Improving multicollinearity issues/ lowering cond. no.
- Skew seems high
- May be working with too many variables
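The link between scaling and the condition number can be sketched with NumPy. The example below builds a design matrix from hypothetical predictors on very different scales (synthetic values, not the housing data) and compares `np.linalg.cond` before and after standardizing. The helper names `design` and `zscore` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors on very different scales (synthetic, for illustration).
sqft = rng.normal(2000, 800, n)
yr_built = rng.normal(1970, 30, n)


def design(*cols):
    """Stack an intercept column with the given predictors."""
    return np.column_stack([np.ones(n), *cols])


def zscore(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()


cond_raw = np.linalg.cond(design(sqft, yr_built))
cond_scaled = np.linalg.cond(design(zscore(sqft), zscore(yr_built)))

print(cond_raw, cond_scaled)
```

Scaling removes the part of the condition number that comes from units alone. Whatever remains after scaling points to genuine correlation between predictors.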

In [None]:
formula2 = 'price ~ sqft_living + floors + bedrooms '
mod2 = ols(formula=formula2, data=data).fit()
mod_summ2 = mod2.summary()

In [None]:
mod_summ2

Using fewer variables we see a decrease in both our R-squared and our Cond. No.

In [None]:
formula3 = 'price ~ grade + bathrooms + bedrooms '
mod3 = ols(formula=formula3, data=data).fit()
mod_summ3 = mod3.summary()

In [None]:
mod_summ3

- Our R-squared decreased, but our Cond. No. also decreased substantially once sqft was removed as a variable. Sqft seems to carry most of the multicollinearity. Skew is still high here.

We need to decide how to deal with our categorical variables.

In [None]:
data['condition'].value_counts()

In [None]:
data['grade'].value_counts()

In [None]:
data['condition'] = pd.Categorical(data['condition'], ['Poor','Fair','Average','Good', 'Very Good'])

Here, we create visualizations to see whether the data is roughly normal and whether we want to use it. If we decide to use these variables and one-hot encode them, we need to drop one column per variable to prevent perfect multicollinearity. The dropped column becomes the baseline. See: https://github.com/hoffm386/coefficients-of-dropped-categorical-variables


In [None]:
sns.displot(data=data, x='condition');

In [None]:
data['grade'] = pd.Categorical(data['grade'], ['3 Poor','4 Low','5 Fair','6 Low Average', '7 Average', '8 Good', '9 Better', '10 Very Good', '11 Excellent', '12 Luxury', '13 Mansion'])

In [None]:
sns.displot(data=data, x='grade', height=7, aspect=2);

- Roughly normal distribution. When modeling, we will drop the "grade_7 Average" column so that Average serves as the baseline.
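A minimal sketch of dropping a chosen baseline after one-hot encoding, on a hypothetical toy frame that reuses the grade labels above. The column name `grade_7 Average` follows the default `prefix_sep="_"` of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame; labels follow the grade strings used above.
toy = pd.DataFrame(
    {"grade": ["6 Low Average", "7 Average", "8 Good", "7 Average", "9 Better"]}
)

dummies = pd.get_dummies(toy, columns=["grade"])    # one column per grade label
baseline = "grade_7 Average"                        # category used as the reference
dummies_no_baseline = dummies.drop(columns=baseline)

print(list(dummies_no_baseline.columns))
```

Rows that belonged to the baseline grade are now all zeros across the remaining dummy columns, so each grade coefficient is read relative to an Average home.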

In [None]:
data_dummy_condition = pd.get_dummies(data, columns=['condition'])

In [None]:
data

In [None]:
data_dummy_grade= pd.get_dummies(data, columns=['grade'])

In [None]:
data_dummy_grade

The data needs to be scaled because the units are not the same. Whether the one-hot columns get scaled depends on the scaler and how we apply it: standard scaling applied to the whole frame will scale everything, including the one-hot columns.
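To illustrate the point about one-hot columns, the sketch below standardizes a continuous column and a 0/1 dummy column by hand with NumPy. The values and column names are hypothetical. The `standardize` helper mirrors the z-score that `StandardScaler` computes (mean 0, population standard deviation 1).

```python
import numpy as np

# Hypothetical toy columns: one continuous feature and one 0/1 dummy.
sqft_living = np.array([1180.0, 2570.0, 770.0, 1960.0, 1680.0])
is_grade_8 = np.array([0.0, 1.0, 0.0, 0.0, 1.0])


def standardize(v):
    """z-score using the population standard deviation (ddof=0)."""
    return (v - v.mean()) / v.std()


sqft_scaled = standardize(sqft_living)
dummy_scaled = standardize(is_grade_8)   # values are no longer 0 and 1

print(sqft_scaled)
print(dummy_scaled)
```

After scaling, the dummy still takes only two distinct values, but its coefficient no longer reads as "the effect of being in this category". One option is to scale only the continuous columns and leave the dummies as 0/1.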

In [None]:
data.drop(columns=('id'), inplace=True)

In [None]:
data.drop(columns=('date'), inplace=True)

Dropping ID and date columns because they do not contain data important to our analysis.

In [None]:
data

In [None]:
data.corr()

In [None]:
data_dummy_grade

In [None]:
data_dummy_grade.drop(columns=('id'), inplace=True)

In [None]:
data_dummy_grade.drop(columns=('date'), inplace=True)

In [None]:
data_dummy_grade.drop(columns=('waterfront'), inplace=True)

In [None]:
data_dummy_grade.drop(columns=('view'), inplace=True)

In [None]:
data_dummy_grade.drop(columns=('condition'), inplace=True)

In [None]:
data_dummy_grade.info()

In [None]:
ss = StandardScaler()
ss.fit(data_dummy_grade)
data_dummy_grade_scaled = ss.transform(data_dummy_grade)

In [None]:
data_dummy_grade_scaled

In [None]:
df_scaled = pd.DataFrame(ss.fit_transform(data_dummy_grade),columns = data_dummy_grade.columns)

In [None]:
df_scaled

In [None]:
df_scaled.corr()

In [None]:
cor = df_scaled.corr()

plt.figure(figsize = (15, 8))
sns.heatmap(cor, annot=True);