# Simple Linear Regression

# Fire up GraphLab Create

In [1]:
import graphlab

# Load house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [2]:
sales = graphlab.SFrame('kc_house_data.gl/')



# Split data into training and testing

In [3]:
train_data,test_data = sales.random_split(.8,seed=0)
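
To illustrate what a seeded random split does, here is a minimal plain-Python sketch. It is not GraphLab's implementation: the helper name `toy_random_split` is hypothetical, and it assumes each row is sent to the first partition independently with probability `fraction`, so the 80/20 ratio is only approximate.

```python
import random

def toy_random_split(rows, fraction, seed=0):
    """Hypothetical helper: send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales table
train_rows, test_rows = toy_random_split(rows, 0.8, seed=0)
```

Calling the helper again with the same seed returns the same two lists, which is why fixing `seed=0` keeps the results in this notebook reproducible.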

# Build a generic simple linear regression function 

### Function to compute the simple linear regression slope and intercept:

In [4]:
def simple_linear_regression(input_feature, output):
    # compute the sum of input_feature and output
    sum_input_feature = input_feature.sum()
    sum_output = output.sum()
    
    # compute the product of the output and the input_feature and its sum
    product_output_input_feature = input_feature*output
    sum_product = product_output_input_feature.sum()
    
    # compute the squared value of the input_feature and its sum
    squared_input_feature = input_feature*input_feature
    sum_squared_input = squared_input_feature.sum()
    
    # use the formula for the slope
    # numerator = (sum of X*Y) - (1/N)*((sum of X) * (sum of Y))
    # denominator = (sum of X^2) - (1/N)*((sum of X) * (sum of X))
    n = float(len(input_feature))
    numerator = sum_product - (1./n) * sum_input_feature * sum_output
    denominator = sum_squared_input - (1./n) * sum_input_feature * sum_input_feature
    slope = numerator / denominator
    
    
    # use the formula for the intercept
    # intercept = (mean of Y) - slope * (mean of X)
    intercept = output.mean() - slope * input_feature.mean()
    
    return (intercept, slope)
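
One way to sanity-check the closed-form formulas is to fit data that lies exactly on a known line. The sketch below re-implements the same formulas on plain Python lists instead of SArrays (the name `closed_form_fit` and the toy data are assumptions for illustration) and should recover the intercept and slope used to generate the points.

```python
def closed_form_fit(xs, ys):
    """Closed-form simple linear regression on plain Python lists."""
    n = float(len(xs))
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # slope = (sum of X*Y - (1/N) * sum of X * sum of Y) / (sum of X^2 - (1/N) * (sum of X)^2)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    # intercept = (mean of Y) - slope * (mean of X)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Toy data generated from y = 1 + 2*x, so the fit should return (1, 2)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
toy_intercept, toy_slope = closed_form_fit(xs, ys)
```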

### Build a regression model for predicting price based on sqft_living

In [5]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)

Intercept: -47116.0765749
Slope: 281.958838568


# Predicting Values

### Function to compute the predicted output given the input_feature, slope and intercept:

In [6]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = slope*input_feature + intercept
    
    return predicted_values

**The predicted price for a house with 2650 sqft using the learned slope and intercept**

In [7]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)

The estimated price for a house with 2650 squarefeet is $700074.85
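
The estimate can be reproduced by hand from the intercept and slope printed earlier. The sketch below hard-codes those printed (rounded) values, so it only assumes they are accurate to the digits shown.

```python
# Intercept and slope as printed by the sqft_living model above
reported_intercept = -47116.0765749
reported_slope = 281.958838568

sqft = 2650
# price = intercept + slope * sqft
estimate = reported_intercept + reported_slope * sqft
formatted = "%.2f" % estimate
```

Here `formatted` agrees with the price printed by the notebook to two decimal places.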


# Residual Sum of Squares

### Function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope

In [8]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predicted_values = slope * input_feature + intercept

    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = predicted_values - output
    
    # square the residuals and add them up
    squared_residuals = residuals*residuals
    RSS = squared_residuals.sum()

    return RSS
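
As a small worked example of RSS, consider three points that lie exactly on y = 1 + 2x. The sketch below uses plain Python lists and a hypothetical helper named `toy_rss`: the true line gives an RSS of zero, while shifting the intercept down by 1 leaves a residual of -1 at every point, for an RSS of 3.

```python
def toy_rss(xs, ys, intercept, slope):
    """Sum of squared differences between predictions and observed outputs."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]  # exactly y = 1 + 2*x

rss_exact = toy_rss(xs, ys, 1.0, 2.0)    # the line that generated the data
rss_shifted = toy_rss(xs, ys, 0.0, 2.0)  # same slope, intercept off by 1
```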

__Calculate the RSS on training data from the squarefeet model__

In [9]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 1.20191835632e+15


# Predict the squarefeet given price

### Function to compute the inverse regression estimate

In [10]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:
    estimated_feature = (output-intercept)/slope
    
    return estimated_feature

**The estimated square-feet for a house costing $800,000 given the slope and intercept from squarefeet model**

In [11]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print "The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet)

The estimated squarefeet for a house worth $800000.00 is 3004
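
The inverse estimate can be checked with a round trip: invert the price to get square feet, then feed that value back through the forward prediction. The sketch below hard-codes the intercept and slope printed earlier; the helper names `predict` and `invert` are assumptions for illustration.

```python
# Intercept and slope as printed by the sqft_living model above
reported_intercept = -47116.0765749
reported_slope = 281.958838568

def predict(x, intercept, slope):
    # forward model: output = intercept + slope * input
    return intercept + slope * x

def invert(y, intercept, slope):
    # solve output = intercept + slope * input for input
    return (y - intercept) / slope

estimated_sqft = invert(800000, reported_intercept, reported_slope)
round_trip_price = predict(estimated_sqft, reported_intercept, reported_slope)
```

Truncating `estimated_sqft` to an integer, as the `%d` format does, gives the 3004 printed above, and `round_trip_price` returns to 800000 up to floating-point error.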


# New Model: estimate prices from bedrooms

In [12]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'

bd_intercept, bd_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])

print "Intercept: " + str(bd_intercept)
print "Slope: " + str(bd_slope)

Intercept: 109473.180469
Slope: 127588.952175


# Test Linear Regression Algorithm

We now have two models for predicting the price of a house. Compute the RSS on TEST data for the bedrooms model and for the square-feet model. The model with the lower RSS on TEST data is the better fit.

In [13]:
# Compute RSS when using bedrooms on TEST data:
print 'RSS on test data using bedroom model: {:.3e}'.format(get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bd_intercept, bd_slope))


RSS on test data using bedroom model: 4.934e+14


In [15]:
# Compute RSS when using squarefeet on TEST data
print 'RSS on test data using squarefeet model: {:.3e}'.format(get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope))


RSS on test data using squarefeet model: 2.754e+14


**The square-feet model has the lower RSS on TEST data (2.754e+14 vs. 4.934e+14), so it is the better fit of the two.**
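
Using the two test RSS values printed above, the comparison can be written as a short sketch. The values are the rounded figures from the notebook output, and the dictionary layout is an assumption for illustration.

```python
# Test-set RSS per single-feature model, as printed above (rounded)
test_rss = {"bedrooms": 4.934e14, "sqft_living": 2.754e14}

# The preferred model is the one with the smallest test RSS
best_feature = min(test_rss, key=test_rss.get)

# How many times larger the bedrooms RSS is than the sqft_living RSS
ratio = test_rss["bedrooms"] / test_rss["sqft_living"]
```

With these figures, `best_feature` is `"sqft_living"` and the bedrooms model's test RSS is roughly 1.8 times larger.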