## House Prices Model

To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the houseprices data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

### Load Data

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import statsmodels.api as sm
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

# Load data from PostgreSQL database and print out
# observations
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

house_df = pd.read_sql_query('select * from houseprices',con=engine)

# No need for an open connection, as we're only doing a single query
engine.dispose()

house_df.head()

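| | id | mssubclass | mszoning | lotfrontage | lotarea | street | alley | lotshape | landcontour | utilities | ... | poolarea | poolqc | fence | miscfeature | miscval | mosold | yrsold | saletype | salecondition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |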


### Convert Features

In [2]:
# Encode street and overallqual as binary (0/1) indicator variables
house_df['street_is_paved'] = np.where(house_df['street'] == 'Pave', 1, 0)
house_df['overallqual_above_6'] = np.where(house_df['overallqual'] > 6, 1, 0)

house_df[['id', 'street', 'street_is_paved', 'overallqual', 'overallqual_above_6']].head(10)

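| | id | street | street_is_paved | overallqual | overallqual_above_6 |
|---|---|---|---|---|---|
| 0 | 1 | Pave | 1 | 7 | 1 |
| 1 | 2 | Pave | 1 | 6 | 0 |
| 2 | 3 | Pave | 1 | 7 | 1 |
| 3 | 4 | Pave | 1 | 7 | 1 |
| 4 | 5 | Pave | 1 | 8 | 1 |
| 5 | 6 | Pave | 1 | 5 | 0 |
| 6 | 7 | Pave | 1 | 8 | 1 |
| 7 | 8 | Pave | 1 | 7 | 1 |
| 8 | 9 | Pave | 1 | 7 | 1 |
| 9 | 10 | Pave | 1 | 5 | 0 |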


### Build Model

In [3]:
# Y is the target variable
Y = house_df['saleprice']

# X is the feature set
X = house_df[['street_is_paved', 'overallqual_above_6', 'lotarea', 'totalbsmtsf', 'grlivarea', 'garagearea']]

# Manually add constant
X = sm.add_constant(X)

# Use fit method to build model
results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.719 |
| Model: | OLS | Adj. R-squared: | 0.717 |
| Method: | Least Squares | F-statistic: | 618.5 |
| Date: | Sun, 29 Dec 2019 | Prob (F-statistic): | 0.0 |
| Time: | 00:33:43 | Log-Likelihood: | -17618.0 |
| No. Observations: | 1460 | AIC: | 35250.0 |
| Df Residuals: | 1453 | BIC: | 35290.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -3.904e+04 | 1.82e+04 | -2.151 | 0.032 | -7.46e+04 | -3437.927 |
| street_is_paved | 3.872e+04 | 1.78e+04 | 2.178 | 0.030 | 3847.798 | 7.36e+04 |
| overallqual_above_6 | 4.742e+04 | 2830.163 | 16.755 | 0.000 | 4.19e+04 | 5.3e+04 |
| lotarea | 0.4752 | 0.120 | 3.967 | 0.000 | 0.240 | 0.710 |
| totalbsmtsf | 40.0037 | 3.092 | 12.937 | 0.000 | 33.938 | 46.069 |
| grlivarea | 53.5868 | 2.666 | 20.102 | 0.000 | 48.358 | 58.816 |
| garagearea | 74.1952 | 6.519 | 11.382 | 0.000 | 61.408 | 86.982 |

| | | | |
|---|---|---|---|
| Omnibus: | 546.237 | Durbin-Watson: | 1.992 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 61601.254 |
| Skew: | -0.719 | Prob(JB): | 0.0 |
| Kurtosis: | 34.789 | Cond. No. | 333000.0 |


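All six features, along with the intercept, are statistically significant at the 5% level (every p-value is below 0.05). `street_is_paved` is the weakest of them, with a p-value of 0.030 and a wide 95% confidence interval of roughly $3,848 to $73,600.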
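
The significance figures in the summary can be cross-checked by hand. The sketch below recomputes the t-statistic and an approximate 95% confidence interval for `grlivarea` from the coefficient and standard error reported in the table. The critical value of 1.96 is a normal approximation (an assumption, though a reasonable one with 1,453 residual degrees of freedom), so the bounds differ slightly from the table.

```python
# Values copied from the regression summary for grlivarea
coef = 53.5868
std_err = 2.666

# t-statistic: the estimate divided by its standard error
t_stat = coef / std_err

# Approximate 95% interval using the normal critical value (an assumption;
# statsmodels uses the exact t-distribution quantile instead)
z_crit = 1.96
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

print(f"t = {t_stat:.2f}")
print(f"95% CI is roughly [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the interval stays well away from zero, the coefficient is significant at the 5% level, which agrees with the reported p-value.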

Because every feature is significant, none were excluded, so the model and its results are unchanged.
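
The exclusion step can be sketched as a simple filter on the p-values. The dictionary below is typed in from the summary table rather than read from the model; on the fitted model, `results.pvalues` should hold the same numbers at full precision. The 0.05 threshold is the one used in this notebook.

```python
# p-values copied from the regression summary (feature rows only)
p_values = {
    "street_is_paved": 0.030,
    "overallqual_above_6": 0.000,
    "lotarea": 0.000,
    "totalbsmtsf": 0.000,
    "grlivarea": 0.000,
    "garagearea": 0.000,
}

alpha = 0.05  # significance threshold

# Split the features into those to keep and those to drop
keep = [name for name, p in p_values.items() if p < alpha]
drop = [name for name, p in p_values.items() if p >= alpha]

print("keep:", keep)
print("drop:", drop)
```

Here `drop` comes back empty, which is why the reduced model is identical to the original one.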

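The intercept (constant) is about -$39,040. It is the predicted price when every feature equals zero, which does not describe a real house, so it has no practical interpretation on its own.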

Holding the other features constant, a home on a paved street sells for about $38,720 more, on average, than a home on an unpaved street.

Holding the other features constant, a home with an overall quality above 6 sells for about $47,420 more, on average, than a home with an overall quality of 6 or below.

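For each additional square foot of lot area (`lotarea`), the predicted price increases by about $0.48, holding the other features constant.

For each additional square foot of basement area (`totalbsmtsf`), the predicted price increases by about $40.00.

For each additional square foot of above-ground living area (`grlivarea`), the predicted price increases by about $53.59.

For each additional square foot of garage area (`garagearea`), the predicted price increases by about $74.20.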
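
To see how these coefficients combine, the sketch below applies them to a single house. The coefficients are copied from the summary table; the house itself is hypothetical (its values were chosen only for illustration and are not a row from the data). On the fitted model, `results.predict` would perform the same calculation.

```python
# Coefficients copied from the regression summary
coefs = {
    "const": -3.904e4,
    "street_is_paved": 3.872e4,
    "overallqual_above_6": 4.742e4,
    "lotarea": 0.4752,
    "totalbsmtsf": 40.0037,
    "grlivarea": 53.5868,
    "garagearea": 74.1952,
}

# Hypothetical house: illustrative values, not taken from the dataset
house = {
    "const": 1,                # the intercept always multiplies 1
    "street_is_paved": 1,
    "overallqual_above_6": 1,
    "lotarea": 8450,           # square feet
    "totalbsmtsf": 900,
    "grlivarea": 1700,
    "garagearea": 500,
}

# Linear prediction: sum of coefficient * feature value
predicted_price = sum(coefs[name] * house[name] for name in coefs)
print(f"Predicted price: ${predicted_price:,.0f}")
```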

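`street_is_paved` and `overallqual_above_6` have the largest coefficients, but they are dummy variables: each coefficient measures the full shift from 0 to 1, whereas the area coefficients are per square foot. The raw coefficients are therefore not directly comparable. For example, 1,000 additional square feet of above-ground living area corresponds to about $53,587, which is more than either dummy effect.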
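
One rough way to compare features measured in different units is to multiply each coefficient by a plausible change in that feature. In the sketch below the coefficients come from the summary table, but the step sizes are hypothetical and chosen only for illustration; a more principled choice would be each column's standard deviation, for instance from `house_df[col].std()`.

```python
# Coefficients copied from the regression summary (intercept omitted)
coefs = {
    "street_is_paved": 3.872e4,
    "overallqual_above_6": 4.742e4,
    "lotarea": 0.4752,
    "totalbsmtsf": 40.0037,
    "grlivarea": 53.5868,
    "garagearea": 74.1952,
}

# Hypothetical "typical" changes: 0 -> 1 for dummies, square feet otherwise
steps = {
    "street_is_paved": 1,
    "overallqual_above_6": 1,
    "lotarea": 5000,
    "totalbsmtsf": 400,
    "grlivarea": 500,
    "garagearea": 200,
}

# Dollar effect of each step, ranked from largest to smallest
effects = {name: coefs[name] * steps[name] for name in coefs}
ranked = sorted(effects, key=effects.get, reverse=True)

for name in ranked:
    print(f"{name:>20}: ${effects[name]:,.0f}")
```

The ranking depends entirely on the step sizes chosen, so it should be read as an illustration of the method rather than a finding about the data.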

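Yes, these results sound reasonable to me. It makes sense that street type and overall quality would have a large impact on house price, with lot size and square footage also playing a significant role. One caveat is that the residuals are far from normal (kurtosis of 34.8 and a Jarque-Bera p-value of 0.0), so the p-values should be read with some caution.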