##Predicting house prices


In [5]:
from __future__ import division

In [6]:
import numpy as np
import pandas as pd

df = pd.read_csv('home_data.csv')
df.columns

Index([u'id', u'date', u'price', u'bedrooms', u'bathrooms', u'sqft_living',
       u'sqft_lot', u'floors', u'waterfront', u'view', u'condition', u'grade',
       u'sqft_above', u'sqft_basement', u'yr_built', u'yr_renovated',
       u'zipcode', u'lat', u'long', u'sqft_living15', u'sqft_lot15'],
      dtype='object')

**1. Selection and summary statistics: We found the zip code with the highest average house price. What is the average house price of that zip code?**

$75,000

$7,700,000

$540,088

$2,160,607

In [2]:
df_grp = df[['zipcode','price']].groupby('zipcode').mean()
df_grp.sort_values(by='price', ascending=False)[:5]

Unnamed: 0_level_0,price
zipcode,Unnamed: 1_level_1
98039,2160606
98004,1355927
98040,1194230
98112,1095499
98102,901258


**2. Filtering data: What fraction of the houses have living space between 2000 sq.ft. and 4000 sq.ft.?**

Between 0.2 and 0.29

Between 0.3 and 0.39

Between 0.4 and 0.49

Between 0.5 and 0.59

Between 0.6 and 0.69

In [3]:
print( "%.2f" % (df['sqft_living'][(df['sqft_living']>2000) & (df['sqft_living']<5000)].size / df['sqft_living'].size))

0.45


**3.Building a regression model with several more features: What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features?**

the RMSE of the model with advanced_features lower by less than $25,000

the RMSE of the model with advanced_features lower by between $25,001 and $35,000

the RMSE of the model with advanced_features lower by between $35,001 and $45,000

the RMSE of the model with advanced_features lower by between $45,001 and $55,000

the RMSE of the model with advanced_features lower by more than $55,000

In [7]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition', # condition of house
    'grade', # measure of quality of construction
    'waterfront', # waterfront property
    'view', # type of view
    'sqft_above', # square feet above ground
    'sqft_basement', # square feet in basement
    'yr_built', # the year built
    'yr_renovated', # the year renovated
    'lat', 'long', # the lat-long of the parcel
    'sqft_living15', # average sq.ft. of 15 nearest neighbors
    'sqft_lot15', # average lot size of 15 nearest neighbors 
    ]


In [25]:
# change 'zipcode' to dummies

zipcode = pd.get_dummies(df['zipcode'])
my_df = df[my_features].drop('zipcode',axis=1)
my_df = pd.concat([zipcode, my_df], axis=1)
ad_df = df[advanced_features].drop('zipcode',axis=1)
ad_df = pd.concat([zipcode, ad_df], axis=1)

In [26]:
import numpy as np
from sklearn import linear_model
from sklearn.cross_validation import train_test_split

def RMSE(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size = 0.8, random_state=0)
    lr = linear_model.LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    return np.sqrt(((y_test - y_pred) ** 2).mean())



my_RMSE = RMSE(my_df, df['price'])
ad_RMSE = RMSE(ad_df, df['price'])

my_RMSE, ad_RMSE, my_RMSE - ad_RMSE


(171861.75510031905, 149854.64964335991, 22007.105456959136)

In [31]:
my_lr = linear_model.LinearRegression()
y = df['price']
my_X = df[my_features]
my_lr.fit(my_X, y)
y_pred = my_lr.predict(my_X)
my_RMSE = np.sqrt( ((y - y_pred) ** 2).mean())


In [28]:
ad_lr = linear_model.LinearRegression()
ad_X = df[advanced_features]
ad_lr.fit(ad_X, y)
y_pred = ad_lr.predict(ad_X)
ad_RMSE = np.sqrt( ((y - y_pred) ** 2).mean())


In [29]:
my_RMSE, ad_RMSE, my_RMSE - ad_RMSE


(255637.33438786431, 201163.90238547101, 54473.432002393296)