## UW Coursera "Machine Learning: Regression" Course
### Regression Week 1: Simple Linear Regression Assignment
#### Predicting House Prices in King County, Seattle

__By Mehmet Solmaz__
__01/13/2017__

Note: Please download the datasets from the Coursera website by enrolling in the course.

In [1]:
import pandas as pd
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [3]:
train_data=pd.read_csv("kc_house_train_data.csv", dtype=dtype_dict)

In [4]:
train_data.shape

(17384, 21)

In [5]:
train_data.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [6]:
train_data.dtypes

id                object
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living      float64
sqft_lot           int32
floors            object
waterfront         int32
view               int32
condition          int32
grade              int32
sqft_above         int32
sqft_basement      int32
yr_built           int32
yr_renovated       int32
zipcode           object
lat              float64
long             float64
sqft_living15    float64
sqft_lot15       float64
dtype: object

__Instruction #3__: Write a generic function that accepts a column of data (e.g., an SArray; a pandas Series in this notebook) ‘input_feature’ and another column ‘output’ and returns the Simple Linear Regression parameters ‘intercept’ and ‘slope’. Use __the closed-form solution__ from lecture to calculate the slope and intercept. For example, in Python:

In [7]:
def simple_linear_regression(input_feature, output):
    # Closed-form least-squares solution for a single feature.
    # (sklearn's linear_model.LinearRegression().fit(...) is an alternative.)
    n = input_feature.size
    sum_x = input_feature.sum()
    sum_y = output.sum()
    sum_xy = (input_feature * output).sum()
    sum_xx = (input_feature * input_feature).sum()
    # slope = (sum(xy) - sum(x)*sum(y)/N) / (sum(x^2) - sum(x)^2/N)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    # intercept = mean(y) - slope * mean(x)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope
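
As a quick sanity check, the closed-form estimates can be compared with an independent least-squares fit. The sketch below is not part of the assignment: the data are synthetic (`x`, `y`, and the coefficients 3.0 and 2.0 are assumptions made up for illustration), NumPy's `np.polyfit` serves as the reference, and the function is restated compactly so the block runs on its own.

```python
import numpy as np

def simple_linear_regression(input_feature, output):
    # Same closed-form solution as above, restated so this block is self-contained.
    n = input_feature.size
    sum_x, sum_y = input_feature.sum(), output.sum()
    sum_xy = (input_feature * output).sum()
    sum_xx = (input_feature * input_feature).sum()
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope

# Synthetic data (hypothetical): y = 3 + 2x plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.1, size=200)

intercept_cf, slope_cf = simple_linear_regression(x, y)

# np.polyfit with deg=1 returns coefficients highest degree first: [slope, intercept].
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print("closed form:", intercept_cf, slope_cf)
print("np.polyfit :", intercept_np, slope_np)
```

The two pairs should agree to within floating-point error, and both should land near the coefficients used to generate the data.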

__Instruction #4__: Use your function to calculate the estimated slope and intercept on the training data to predict ‘price’ given ‘sqft_living’

In [8]:
intercept, slope = simple_linear_regression(train_data['sqft_living'],train_data['price'])

In [9]:
print("Intercept: {}".format(intercept))
print("Slope: {}".format(slope))

Intercept: -47116.07907289418
Slope: 281.9588396303426


__Instruction #5__: Write a function that accepts a column of data ‘input_feature’, the ‘slope’, and the ‘intercept’ you learned, and returns a column of predictions ‘predicted_output’ for each entry in the input column. For example, in Python:

In [10]:
def get_regression_predictions(input_feature, intercept1, slope1):
    predicted_output = input_feature * slope1 + intercept1
    return(predicted_output)

In [11]:
predictions = get_regression_predictions(train_data['sqft_living'],intercept,slope)

In [12]:
print(predictions[:6])

0    2.855954e+05
1    6.775181e+05
2    1.699922e+05
3    5.055232e+05
4    4.265748e+05
5    1.481101e+06
Name: sqft_living, dtype: float64


In [13]:
print(train_data['price'].head(6))

0     221900.0
1     538000.0
2     180000.0
3     604000.0
4     510000.0
5    1225000.0
Name: price, dtype: float64


__6. Quiz Question__: Using your slope and intercept from (4), what is the predicted price for a house with 2650 sqft?

In [14]:
predicted_price_for_a_house_with_2650_sqft = get_regression_predictions(2650,intercept,slope)
print(predicted_price_for_a_house_with_2650_sqft)

700074.845948
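
The prediction is simply the fitted line evaluated at 2650 sqft. As a worked check, the snippet below redoes the arithmetic with the intercept (about -47116.08) and slope (about 281.96 dollars per square foot) found in step 4; the values are copied from the printed output, so the last digits should be treated as approximate.

```python
# Fitted parameters copied from the step 4 output.
intercept = -47116.07907289418
slope = 281.9588396303426

# Predicted price = slope * sqft + intercept
predicted_price = slope * 2650 + intercept
print(round(predicted_price, 2))
```

The result agrees with the 700074.845948 reported by `get_regression_predictions`.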


__Instruction #7__. Write a function that accepts two columns of data, ‘input_feature’ and ‘output’, and the regression parameters ‘slope’ and ‘intercept’, and outputs the Residual Sum of Squares (RSS). For example, in Python:

In [15]:
def get_residual_sum_of_squares(input_feature, output, intercept,slope):
    RSS = ((output - (input_feature * slope + intercept))**2).sum()
    return(RSS)
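
A small hand-checkable case can make the RSS definition concrete. In the sketch below, the three data points and the line (intercept 0, slope 2) are made up for illustration; the residuals are 0, 0, and 1, so the RSS should come out to 1.

```python
import numpy as np

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # Restated from above so the block runs on its own.
    return ((output - (input_feature * slope + intercept)) ** 2).sum()

# Toy data (hypothetical): the line predicts 2, 4, 6 against targets 2, 4, 7.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

toy_rss = get_residual_sum_of_squares(x, y, intercept=0.0, slope=2.0)
print(toy_rss)  # residuals 0, 0, 1 -> RSS of 1.0
```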

__8. Quiz Question__: According to this function and the slope and intercept from (4), what is the RSS for the simple linear regression using square feet to predict prices on TRAINING data?

In [16]:
rss = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], intercept,slope)
print("RSS for Model: {0:1.4E} ".format(rss))

RSS for Model: 1.2019E+15 


__Instruction #9__: Write a function that accepts a column of data ‘output’ and the regression parameters ‘slope’ and ‘intercept’, and returns the corresponding column of estimated inputs ‘estimated_input’.

In [17]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input = (output - intercept)/slope
    return(estimated_input)
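
Because this function simply inverts the fitted line, feeding it a predicted price should return the square footage that produced it. A minimal round-trip sketch, with both functions restated and the step 4 parameters copied from the printed output:

```python
def get_regression_predictions(input_feature, intercept, slope):
    # Restated from above: evaluate the fitted line.
    return input_feature * slope + intercept

def inverse_regression_predictions(output, intercept, slope):
    # Restated from above: solve the line equation for the input.
    return (output - intercept) / slope

# Fitted parameters copied from the step 4 output.
intercept = -47116.07907289418
slope = 281.9588396303426

price = get_regression_predictions(2650, intercept, slope)
recovered_sqft = inverse_regression_predictions(price, intercept, slope)
print(recovered_sqft)
```

Note that this inverts the fitted price line; it is not a separate regression of square feet on price.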

__Instruction #10__. Quiz Question: According to this function and the regression slope and intercept from (4), what is the estimated square feet for a house costing $800,000?

In [18]:
estimatedSqFeet=inverse_regression_predictions(800000,intercept, slope)

In [19]:
print(estimatedSqFeet)

3004.39624515


__Instruction #11__. Instead of using ‘sqft_living’ to estimate prices we could use ‘bedrooms’ (a count of the number of bedrooms in the house) to estimate prices.

In [20]:
bedroom_intercept, bedroom_slope = simple_linear_regression(train_data['bedrooms'],train_data['price'])

In [21]:
print("Intercept: {}".format(bedroom_intercept))
print("Slope: {}".format(bedroom_slope))

Intercept: 109473.1776229596
Slope: 127588.95293398784


__Instruction #12__. Now that we have two different models, compute the RSS from BOTH models on TEST data.

__Quiz Question__: Which model (square feet or bedrooms) has lowest RSS on TEST data? Think about why this might be the case.

In [22]:
test_data=pd.read_csv("kc_house_test_data.csv", dtype=dtype_dict)

In [23]:
rss_sqft = get_residual_sum_of_squares(test_data['sqft_living'].values,
                                       test_data['price'].values,
                                       intercept, slope)

In [24]:
print('The RSS on TEST DATA with Square Feet as input is : ' + str(rss_sqft))

The RSS on TEST DATA with Square Feet as input is : 2.75402933618e+14


In [25]:
rss_bedrooms = get_residual_sum_of_squares(test_data['bedrooms'].values,
                                           test_data['price'].values,
                                           bedroom_intercept, bedroom_slope)

In [26]:
print('The RSS on TEST DATA with Bedrooms as input is : ' + str(rss_bedrooms))

The RSS on TEST DATA with Bedrooms as input is : 4.9336458596e+14


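__Conclusion__: RSS on the TEST data was evaluated for both models. The square-feet model has the lower RSS (about 2.75e+14, versus about 4.93e+14 for the bedrooms model), so ‘sqft_living’ is the better single predictor of price here. One plausible reason is that living area varies continuously and tracks price more closely than a bedroom count, which takes only a few distinct values.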
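
To put the gap in perspective, the two test RSS values can be compared directly. The snippet below only does arithmetic on the numbers printed in the outputs above; it does not refit anything.

```python
# Test RSS values copied from the printed outputs.
rss_sqft = 2.75402933618e+14
rss_bedrooms = 4.9336458596e+14

# How many times larger the bedrooms model's error is.
ratio = rss_bedrooms / rss_sqft
print("Bedrooms RSS / sqft RSS: {:.2f}".format(ratio))
```

On these numbers the bedrooms model's test RSS is roughly 1.8 times that of the square-feet model.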