# Simple Linear Regression

## Import Modules

In [1]:
import pandas as pd
import numpy as np

## Define Column Data Types

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 
              'sqft_living15':float, 'grade':int, 'yr_renovated':int, 
              'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 
              'sqft_lot15':float, 'sqft_living':float, 'floors':str, 
              'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
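
As a small illustration of why identifier-like columns such as `zipcode` and `id` are read as `str`, the sketch below parses a made-up one-row CSV (not taken from the King County files) with and without an explicit `dtype`. Without it, pandas infers an integer column and any leading zeros are lost.

```python
import io
import pandas as pd

# Made-up CSV text for illustration only; not part of kc_house_data.csv
csv_text = "id,zipcode,price\n0012345,02134,250000\n"

# Let pandas infer the types: zipcode becomes an integer
inferred = pd.read_csv(io.StringIO(csv_text))

# Pass explicit types, as dtype_dict does: zipcode stays a string
typed = pd.read_csv(io.StringIO(csv_text),
                    dtype={"id": str, "zipcode": str, "price": float})

print(inferred.loc[0, "zipcode"])  # 2134
print(typed.loc[0, "zipcode"])     # 02134
```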

## Load Data from CSV Files

In [3]:
sales = pd.read_csv("kc_house_data.csv", dtype = dtype_dict)
train_data = pd.read_csv("kc_house_train_data.csv", dtype = dtype_dict)
test_data = pd.read_csv("kc_house_test_data.csv", dtype = dtype_dict)

## Exploring Some Data

In [4]:
sales.head(5)

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180.0 | 5650 | 1 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340.0 | 5650.0 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570.0 | 7242 | 2 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690.0 | 7639.0 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770.0 | 10000 | 1 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720.0 | 8062.0 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960.0 | 5000 | 1 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360.0 | 5000.0 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680.0 | 8080 | 1 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800.0 | 7503.0 |


## Build a Generic Simple Linear Regression Function

In [5]:
def simple_linear_regression(input_feature, output):
    # compute the sum of input_feature and output
    sum_input_feature = input_feature.sum()
    sum_output = output.sum()
    
    # compute the product of the output and the input_feature and its sum
    mult_input_feature_output = input_feature * output
    sum_output_input = mult_input_feature_output.sum()
    
    # compute the squared value of the input_feature and its sum
    input_squared = input_feature * input_feature
    sum_input_squared = input_squared.sum()
    
    # closed-form least-squares slope:
    # (sum(x*y) - sum(x)*sum(y)/N) / (sum(x^2) - sum(x)^2/N)
    num_data = output.size
    slope = ((sum_output_input - sum_output * sum_input_feature / num_data)
             / (sum_input_squared - sum_input_feature ** 2 / num_data))
    # the fitted line passes through the point (mean(x), mean(y))
    intercept = output.mean() - slope * input_feature.mean()
    return (intercept, slope)
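
Before using the function on the housing data, it can help to sanity-check the closed-form formula on data where the answer is known. The sketch below is only an illustration: `fit_line` is a hypothetical, condensed restatement of the same formula, the five points are made up to lie exactly on y = 1 + 2x, and `np.polyfit` serves as an independent reference.

```python
import numpy as np
import pandas as pd

def fit_line(x, y):
    """Hypothetical condensed version of the closed-form fit shown above."""
    n = y.size
    slope = (((x * y).sum() - y.sum() * x.sum() / n)
             / ((x * x).sum() - x.sum() ** 2 / n))
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Made-up data lying exactly on the line y = 1 + 2x
x = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

intercept, slope = fit_line(x, y)

# Independent check: a degree-1 polynomial fit returns (slope, intercept)
ref_slope, ref_intercept = np.polyfit(x.to_numpy(), y.to_numpy(), deg=1)

print("closed form:", intercept, slope)
print("np.polyfit :", ref_intercept, ref_slope)
```

Both approaches should recover an intercept of 1 and a slope of 2, up to floating-point rounding.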

Build a regression model that predicts price from sqft_living.

In [6]:
# fit the model for predicting price from sqft_living on the training data
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], 
                                                      train_data['price'])

In [7]:
print("Intercept: " + str(sqft_intercept))
print("Slope    : " + str(sqft_slope))

Intercept: -47116.07907289418
Slope    : 281.9588396303426


## Predicting Values

The following function returns the predicted output given the input_feature, slope and intercept.

In [8]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_output = intercept + slope*input_feature    
    return predicted_output
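
Because the prediction is plain arithmetic, the same function should work for a single number and for a whole pandas Series. Here is a minimal sketch; the coefficients are hypothetical values picked for easy arithmetic, not the fitted ones.

```python
import pandas as pd

def get_regression_predictions(input_feature, intercept, slope):
    # same formula as above: a point on the fitted line
    return intercept + slope * input_feature

# Hypothetical coefficients chosen for illustration only
intercept, slope = 10.0, 2.0

single = get_regression_predictions(3, intercept, slope)
many = get_regression_predictions(pd.Series([0, 1, 2]), intercept, slope)

print(single)         # 16.0
print(many.tolist())  # [10.0, 12.0, 14.0]
```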

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to find the estimated price of a house with 2650 square feet according to the square-feet model we estimated above.

In [9]:
# Predicts the price of a house with 2650 sqft
predicted_output = get_regression_predictions(2650,sqft_intercept,sqft_slope)
print("Predicted Price of a house with 2650 sqft: $" + str(predicted_output))

Predicted Price of a house with 2650 sqft: $700074.8459475137


## Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate it using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squared residuals, where a residual is simply the difference between the predicted output and the true output.

The following function computes the RSS of a simple linear regression model given the input_feature, output, intercept and slope.

In [10]:
def get_residual_sum_of_squares(input_feature, output, intercept,slope):
    # First get the predictions
    predictions = intercept + slope*input_feature
    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = output - predictions
    # square the residuals and add them up
    squared_residuals = residuals ** 2
    RSS = squared_residuals.sum()
    return RSS 
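
To see what the function returns, here is a tiny hand-checkable case. The data and coefficients are made up: the line y = 2x passes through the first two points and misses the third by 1, so the RSS should be 0² + 0² + 1² = 1.

```python
import pandas as pd

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # same steps as above: predict, take residuals, square and sum
    predictions = intercept + slope * input_feature
    residuals = output - predictions
    return (residuals ** 2).sum()

# Made-up data for illustration only
x = pd.Series([1.0, 2.0, 3.0])
y = pd.Series([2.0, 4.0, 7.0])

rss = get_residual_sum_of_squares(x, y, intercept=0.0, slope=2.0)
print(rss)  # 1.0
```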

Let's calculate the RSS value from the model above.

In [11]:
# RSS for the simple linear regression using squarefeet to predict prices on TRAINING data
RSS_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], 
                                                 train_data['price'], sqft_intercept, sqft_slope)
print("The RSS of predicting prices based on Square Feet is: %.5g " % RSS_prices_on_sqft)


The RSS of predicting prices based on Square Feet is: 1.2019e+15 


## Predict Square Feet Given Price

What if we want to predict the square footage given the price? Since the model is the equation y = a + b*x, we can solve it for x: x = (y - a)/b. So if we have the intercept (a), the slope (b) and the price (y), we can compute the estimated square feet (x).

In [12]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input = (output-intercept)/slope
    return estimated_input
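
A quick way to convince ourselves that the inversion is right is a round trip: predict an output from an input, then invert it and check that the input comes back. This sketch uses hypothetical coefficients, not the fitted ones, and assumes the slope is non-zero, since the inverse divides by it.

```python
def get_regression_predictions(input_feature, intercept, slope):
    return intercept + slope * input_feature

def inverse_regression_predictions(output, intercept, slope):
    # solve y = intercept + slope * x for x (requires slope != 0)
    return (output - intercept) / slope

# Hypothetical coefficients chosen for illustration only
intercept, slope = 10.0, 2.0

price = get_regression_predictions(3.0, intercept, slope)
recovered = inverse_regression_predictions(price, intercept, slope)

print(price)      # 16.0
print(recovered)  # 3.0
```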

Using this function with the fitted slope and intercept, the estimated square footage of a house costing $800,000 is:

In [13]:
# estimated square-feet for a house costing $800,000
estimate_sqft = inverse_regression_predictions(800000,sqft_intercept,sqft_slope)
print("Estimated Sqft for a house costing $800,000: %.5g" % estimate_sqft)

Estimated Sqft for a house costing $800,000: 3004.4


## New Model: Estimate Prices from Bedrooms

We have made one model for predicting house prices using square feet, but there are many other features in the sales DataFrame. Use your simple linear regression function to estimate the regression parameters for predicting price from the number of bedrooms. Use the training data!

In [14]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
print("Intercept: " + str(bedrooms_intercept))
print("Slope: " + str(bedrooms_slope))

Intercept: 109473.1776229596
Slope: 127588.95293398784


## Testing Linear Regression Algorithm

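Now we have two models for predicting the price of a house. How do we know which one is better? One way is to compute the RSS of each model on the test data: the model with the lower test RSS predicts unseen prices more accurately.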

In [15]:
# Compute RSS when using bedrooms on TEST data:
rss_prices_on_bedrooms = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrooms_intercept, bedrooms_slope)
print('The RSS of predicting Prices based on Bedrooms is : %.5g' % rss_prices_on_bedrooms)

The RSS of predicting Prices based on Bedrooms is : 4.9336e+14


In [16]:
# Compute RSS when using squarefeet on TEST data:
rss_prices_on_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print('The RSS of predicting Prices based on Square Feet is : %.5g' % rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 2.754e+14
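
To make the comparison explicit, the sketch below takes the two test-set RSS values printed above (already rounded to five significant figures) and picks the smaller one. The dictionary `rss_test` is a hypothetical helper introduced only for this illustration.

```python
# Test-set RSS values copied from the two outputs above (rounded)
rss_test = {"bedrooms": 4.9336e14, "sqft_living": 2.754e14}

# The feature whose model has the lowest RSS on the test data
best_feature = min(rss_test, key=rss_test.get)
ratio = rss_test["bedrooms"] / rss_test["sqft_living"]

print("Lower test RSS:", best_feature)
print("RSS ratio (bedrooms / sqft_living): %.2f" % ratio)
```

On this test set the square-feet model has the lower RSS, so of the two single-feature models it is the better predictor of price.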
