## Fire up necessary modules

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline  

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

## Load some house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [2]:
# In GraphLab Jupyter Notebook: Save SFrame to csv and create pandas DataFrame from csv
# import graphlab
# sf = graphlab.SFrame('/Users/lvg/Downloads/home_data.gl/') 
# sf.save('Downloads/home_data.csv', format='csv')
# df = pandas.read_csv('Downloads/home_data.csv')


In [3]:
df = pd.read_csv('home_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


## Assignment

In this module, we focused on using regression to predict a continuous value (house prices) from features of the house (square feet of living space, number of bedrooms, ...). We also built an IPython notebook for predicting house prices, using data from King County, USA, the region where the city of Seattle is located.
<br><br>
In this assignment, we are going to build a more accurate regression model for predicting house prices by including more features of the house. In the process, we will also become more familiar with how the Python language can be used for data exploration, data transformations and machine learning. These techniques will be key to building intelligent applications.

**Learning outcomes**

* Execute programs with the IPython notebook
* Load and transform real, tabular data
* Compute summaries and statistics of the data
* Build a regression model using features of the data

**What you will do**

Now you are ready! We are going to do three tasks in this assignment. There are 3 results you need to gather along the way to enter into the quiz after this reading.

    1. Selection and summary statistics:  In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price.

In [4]:
# 98039 is the most expensive zip code.
np.mean(df[df.zipcode==98039].price)

2160606.6
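The zip code 98039 is taken from the module notebook. A minimal sketch of how such a zip code could be found with pandas, assuming a small made-up DataFrame (`toy` and its values are hypothetical and not taken from `home_data.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for home_data.csv
toy = pd.DataFrame({
    'zipcode': [98039, 98039, 98178, 98178, 98125],
    'price':   [2000000, 2400000, 220000, 260000, 540000],
})

# Average sale price per zip code, sorted from highest to lowest
avg_price_by_zip = (
    toy.groupby('zipcode')['price'].mean().sort_values(ascending=False)
)

top_zip = avg_price_by_zip.index[0]   # zip code with the highest mean price
top_avg = avg_price_by_zip.iloc[0]    # that mean price
print(top_zip, top_avg)
```

On the real data, the same `groupby` applied to `df` should lead back to the zip code used in the cell above.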

    2. Filtering data: One of the key features we used in our model was the number of square feet of living space (‘sqft_living’) in the house. For this part, we are going to use the idea of filtering (selecting) data. Select the houses that have ‘sqft_living’ higher than 2000 sqft but no larger than 4000 sqft. What fraction of all the houses have ‘sqft_living’ in this range?

In [5]:
# Note: between() includes both endpoints, so houses with exactly 2000 sqft are counted too.
df_sub = df[df.sqft_living.between(2000, 4000)]
float(df_sub.shape[0]) / df.shape[0]

0.4266413732475825
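The task asks for houses larger than 2000 sqft but no larger than 4000 sqft, which is a half-open range. A small sketch on made-up values (the `toy` frame is hypothetical) shows how the default `between` differs from an explicit boolean mask at the boundaries:

```python
import pandas as pd

# Hypothetical living areas, including both boundary values
toy = pd.DataFrame({'sqft_living': [1500, 2000, 2500, 4000, 4500]})

# Series.between includes both endpoints by default: 2000 <= x <= 4000
inclusive = toy[toy.sqft_living.between(2000, 4000)]

# Explicit mask for "higher than 2000 but no larger than 4000": 2000 < x <= 4000
strict = toy[(toy.sqft_living > 2000) & (toy.sqft_living <= 4000)]

frac_inclusive = float(len(inclusive)) / len(toy)
frac_strict = float(len(strict)) / len(toy)
print(frac_inclusive, frac_strict)
```

On the real dataset the two fractions differ only by the share of houses with exactly 2000 sqft of living space.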

    3. Building a regression model with several more features: In the sample notebook, we built two regression models to predict house prices, one using just ‘sqft_living’ and the other using a few more features. We called this set
>my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

    Now, going back to the original dataset, you will build a model using the following features:

>advanced_features = 
[
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
'condition', # condition of house				
'grade', # measure of quality of construction				
'waterfront', # waterfront property				
'view', # type of view				
'sqft_above', # square feet above ground				
'sqft_basement', # square feet in basement				
'yr_built', # the year built				
'yr_renovated', # the year renovated				
'lat', 'long', # the lat-long of the parcel				
'sqft_living15', # average sq.ft. of 15 nearest neighbors 				
'sqft_lot15', # average lot size of 15 nearest neighbors 
]

    What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.

In [7]:
advanced_features = [ 
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', # condition of house
    'grade', # measure of quality of construction
    'waterfront', # waterfront property
    'view', # type of view
    'sqft_above', # square feet above ground
    'sqft_basement', # square feet in basement
    'yr_built', # the year built
    'yr_renovated', # the year renovated
    'lat', 'long', # the lat-long of the parcel
    'sqft_living15', # average sq.ft. of 15 nearest neighbors
    'sqft_lot15' # average lot size of 15 nearest neighbors 
]

In [8]:
# Split data into training and testing. We use random_state=0 so that everyone running this notebook gets the same results.  
train_data, test_data = train_test_split(df, test_size=0.2, random_state=0)

X_train3 = np.array(train_data[advanced_features])

# A pandas Series has no reshape in recent versions; reshape the underlying NumPy array instead
y_train = train_data.price.values.reshape(-1, 1)

regr3 = linear_model.LinearRegression()
regr3.fit(X=X_train3, y=y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [11]:
# Max Error & RMSE
X_test = test_data[advanced_features]
# A pandas Series has no reshape in recent versions; reshape the underlying NumPy array instead
y_test = test_data.price.values.reshape(-1, 1)
y_pred = regr3.predict(X_test)

print("Max Error : %.2f" % np.max(np.absolute(y_pred - y_test)))
print("")  # blank line
print("We can calculate MSE 'by hand' or have sklearn do it for us")
print("MSE : %.2f" % np.mean((y_pred - y_test) ** 2))
print("MSE : %.2f" % mean_squared_error(y_test, y_pred))
print("")  # blank line
print("RMSE is just the square root of MSE")
print("RMSE : %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))

Max Error : 3210865.09

We can calculate MSE 'by hand' or have sklearn do it for us
MSE : 36280106854.24
MSE : 36280106854.24

RMSE is just the square root of MSE
RMSE : 190473.38


The RMSE is about \$190,473, lower than that of our first model (~\$248,879) and our second model (~\$244,005). The model trained with advanced_features therefore has an RMSE roughly \$53,500 lower than the model trained with my_features (\$244,005 - \$190,473 = \$53,532).
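The comparison between feature sets can be sketched end to end as follows. This is only an illustration on synthetic data: the `toy` frame, its coefficients, and the helper `holdout_rmse` are assumptions made for the example and do not reproduce the numbers above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house data (all values are made up)
rng = np.random.RandomState(0)
n = 500
toy = pd.DataFrame({
    'sqft_living': rng.uniform(500, 4000, n),
    'grade': rng.randint(3, 13, n).astype(float),
})
toy['price'] = (200 * toy.sqft_living + 30000 * toy.grade
                + rng.normal(0, 20000, n))

# Same split settings as in the notebook
train, test = train_test_split(toy, test_size=0.2, random_state=0)

def holdout_rmse(features):
    """Fit on the training split and return RMSE on the test split."""
    model = LinearRegression().fit(train[features], train.price)
    pred = model.predict(test[features])
    return np.sqrt(mean_squared_error(test.price, pred))

rmse_basic = holdout_rmse(['sqft_living'])
rmse_more = holdout_rmse(['sqft_living', 'grade'])
print("RMSE difference: %.2f" % (rmse_basic - rmse_more))
```

With the real data, calling such a helper with `my_features` and `advanced_features` on the same train/test split would give the two RMSE values whose difference the quiz asks for.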