# Predicting House Prices (One feature)

In this notebook we will use data on house sales in King County, where Seattle is located, to predict house prices using simple (one feature) linear regression. You will:

* Use SArray and SFrame functions to compute important summary statistics
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input/feature given the output
* Compare two different models for predicting house prices

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [12]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [14]:
main_data = pd.read_csv('kc_house_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype=dtype_dict)

In [15]:
main_data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180.0,5650,1,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340.0,5650.0
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570.0,7242,2,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690.0,7639.0
2,5631500400,20150225T000000,180000.0,2.0,1.0,770.0,10000,1,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720.0,8062.0
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960.0,5000,1,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360.0,5000.0
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680.0,8080,1,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800.0,7503.0


In [18]:
print(main_data.shape)
print(test_data.shape)
print(train_data.shape)

(21613, 21)
(4229, 21)
(17384, 21)


### Write a generic function that accepts a column of data (e.g, an SArray) ‘input_feature’ and another column ‘output’ and returns the Simple Linear Regression parameters ‘intercept’ and ‘slope’. Use the closed form solution from lecture to calculate the slope and intercept. e.g. in python:

In [87]:
from sklearn import linear_model
regr = linear_model.LinearRegression()

def simple_linear_regression(input_feature, output):
    train_x = np.asanyarray(input_feature)
    train_y = np.asanyarray(output)
    regr.fit(train_x, train_y)
    intercept = regr.intercept_
    slope = regr.coef_
    
    return(intercept, slope)

Use your function to calculate the estimated slope and intercept on the training data to predict ‘price’ given ‘sqft_living’. e.g. in python with SFrames using:

In [88]:
input_feature = train_data[['sqft_living']]
output = train_data[['price']]

simple_linear_regression(input_feature, output)

(array([-47116.07907289]), array([[281.95883963]]))

In [89]:
squarefeet_intercept, squarfeet_slope = simple_linear_regression(input_feature, output)

### Write a function that accepts a column of data ‘input_feature’, the ‘slope’, and the ‘intercept’ you learned, and returns an a column of predictions ‘predicted_output’ for each entry in the input column. e.g. in python:

In [95]:
def get_regression_predictions(input_feature, intercept, slope):
    predicted_output = input_feature*slope + intercept
    return(predicted_output)

In [91]:
#What is the predcited price for a house with 2650 ft^2?
get_regression_predictions(2650,squarefeet_intercept,squarfeet_slope)

array([[700074.84594751]])

### Write a function that accepts column of data: ‘input_feature’, and ‘output’ and the regression parameters ‘slope’ and ‘intercept’ and outputs the Residual Sum of Squares (RSS). e.g. in python:

In [114]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    train_x=np.asanyarray(input_feature)
    train_y=np.asanyarray(output)
    regr.fit(train_x, train_y)
    y_hat = regr.predict(train_x)
    RSS = np.sum((y_hat - train_y)**2)
    return(RSS)

## Quiz Question: According to this function and the slope and intercept from (4) What is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data?

In [115]:
get_residual_sum_of_squares(
    train_data[['sqft_living']], train_data[['price']],
    squarefeet_intercept[0], squarfeet_slope[0][0])

1201918354177283.2

### Write a function that accept a column of data:‘output’ and the regression parameters ‘slope’ and ‘intercept’ and outputs the column of data: ‘estimated_input’. Do this by solving the linear function output = intercept + slope*input for the ‘input’ variable (i.e. ‘input’ should be on one side of the equals sign by itself). e.g. in python:

In [103]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input = (output-intercept)/slope
    return(estimated_input)

In [104]:
# estimated square-feet for a house costing $800,000?
inverse_regression_predictions(800000,squarefeet_intercept[0],squarfeet_slope[0][0])

3004.396245152277

## Instead of using ‘sqft_living’ to estimate prices we could use ‘bedrooms’ (a count of the number of bedrooms in the house) to estimate prices. Using your function from (3) calculate the Simple Linear Regression slope and intercept for estimating price based on bedrooms. Save this slope and intercept for later (you might want to call them e.g. bedroom_slope, bedroom_intercept).

In [105]:
input_feature = train_data[['bedrooms']]
output = train_data[['price']]

bedroom_intercept,bedroom_slope = simple_linear_regression(input_feature, output)

In [109]:
bedroom_intercept[0],bedroom_slope[0][0]

(109473.17762296257, 127588.95293398695)

## Now that we have 2 different models compute the RSS from BOTH models on TEST data.

In [116]:
#RSS from sqft test model
get_residual_sum_of_squares(
    test_data[['sqft_living']], test_data[['price']],
    squarefeet_intercept[0], squarfeet_slope[0][0])

275168573899671.78

In [117]:
#RSS from bedroom test model
get_residual_sum_of_squares(
    test_data[['bedrooms']], test_data[['price']],
    bedroom_intercept[0],bedroom_slope[0][0])

490597142829587.5