# Multiple Linear Regression (Implementation)

## Import Modules

In [1]:
import pandas as pd
import numpy as np

## Define Column Data Types

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 
              'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 
              'bedrooms':float, 'zipcode':str, 'long':float, 
              'sqft_lot15':float, 'sqft_living':float, 
              'floors':str, 'condition':int, 'lat':float, 
              'date':str, 'sqft_basement':int, 'yr_built':int, 
              'id':str, 'sqft_lot':int, 'view':int}

## Load Data from CSV Files

In [3]:
sales = pd.read_csv("kc_house_data.csv", dtype = dtype_dict)
train_data = pd.read_csv("kc_house_train_data.csv", dtype = dtype_dict)
test_data = pd.read_csv("kc_house_test_data.csv", dtype = dtype_dict)

# Add a 'constant' column of ones (used for the intercept term)
sales['constant'] = 1
train_data['constant'] = 1
test_data['constant'] = 1

## Predicting output given regression weights

In [4]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array
    # create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix, weights)
    return predictions
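As a quick sanity check, here is a minimal sketch of `predict_output` on made-up numbers. The three "houses" and the weights below are hypothetical values chosen for easy arithmetic, not values from the King County data.

```python
import numpy as np

def predict_output(feature_matrix, weights):
    # Same definition as in the cell above, repeated so this sketch runs on its own.
    return np.dot(feature_matrix, weights)

# Hypothetical toy data: three houses, columns are [constant, sqft_living].
toy_features = np.array([[1., 1000.],
                         [1., 2000.],
                         [1., 3000.]])
# Hypothetical weights: [intercept, price per square foot].
toy_weights = np.array([10000., 200.])

# Each prediction is intercept * 1 + slope * sqft_living.
toy_predictions = predict_output(toy_features, toy_weights)
print(toy_predictions)  # 210000, 410000 and 610000
```

Each row of the feature matrix is dotted with the weight vector, so the constant column is what lets the first weight act as the intercept.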

## Computing the derivative

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.
Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)^2

Here we have k features and a constant. By the chain rule, the derivative with respect to weight w[i] is:

2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)* [feature_i]

The term inside the parentheses is just the error (the difference between the prediction and the output), so we can rewrite this as:

2*error*[feature_i]

That is, the derivative with respect to the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. For the constant, whose value is always 1, this is just twice the sum of the errors!
Recall that the sum of the element-wise products of two vectors is their dot product. Therefore the derivative with respect to the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.
With this in mind, complete the following derivative function, which computes the derivative with respect to a weight given the values of the feature (over all data points) and the errors (over all data points).

In [5]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2 * np.dot(feature, errors)
    return derivative
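One way to gain confidence in the formula is to compare it against a numerical (finite-difference) derivative of the cost. The sketch below does this on small random data. The helper `rss` and all variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def feature_derivative(errors, feature):
    # Same definition as in the cell above, repeated so this sketch runs on its own.
    return 2 * np.dot(feature, errors)

def rss(feature_matrix, output, weights):
    # Hypothetical helper: residual sum of squares for a given weight vector.
    residuals = np.dot(feature_matrix, weights) - output
    return np.dot(residuals, residuals)

# Small synthetic problem: a constant column plus one random feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])
y = rng.normal(size=5)
w = np.array([0.5, -1.0])

# Analytic derivative with respect to w[1].
errors = np.dot(X, w) - y
analytic = feature_derivative(errors, X[:, 1])

# Central finite difference with respect to w[1].
eps = 1e-6
w_plus = w.copy()
w_plus[1] += eps
w_minus = w.copy()
w_minus[1] -= eps
numeric = (rss(X, y, w_plus) - rss(X, y, w_minus)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```

Because the cost is quadratic in the weights, the central difference should match the analytic value up to floating-point rounding.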

## Gradient Descent

Now we will write a function that performs gradient descent. The basic premise is simple: given a starting point, we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of steepest increase, so the negative gradient is the direction of steepest decrease, which is what we want when minimizing a cost function.
The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum, which we define by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.
With this in mind, complete the gradient descent function below using your derivative function above. For each step in the gradient descent, we update the weight for each feature before computing our stopping criterion.

In [6]:
from math import sqrt
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
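    # copy the initial weights as floats so the in-place updates below are not truncated to integers
    weights = np.array(initial_weights, dtype=float)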
    while not converged:
        # compute the predictions based on feature_matrix and weights:
        predictions = predict_output(feature_matrix, weights)
        # compute the errors as predictions - output:
        errors = predictions - output
        gradient_sum_squares = 0 # initialize the gradient
        # while not converged, update each weight individually:
        for i in range(len(weights)):
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative = feature_derivative(errors, feature_matrix[:, i])
            # add the squared derivative to the gradient magnitude
            gradient_sum_squares += (derivative**2)
            # update the weight based on step size and derivative:
            weights[i] -= (step_size * derivative)
            
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return weights
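The per-feature loop in the function above computes the gradient one component at a time. As a side note, the same gradient can be obtained in a single matrix product, `2 * np.dot(X.T, errors)`. The sketch below checks that equivalence on hypothetical toy numbers. It is an illustration only, not a replacement for the function above.

```python
import numpy as np

def feature_derivative(errors, feature):
    # Same definition as earlier in the notebook, repeated so this sketch runs on its own.
    return 2 * np.dot(feature, errors)

# Hypothetical toy data: a constant column and one feature.
X = np.array([[1., 2.],
              [1., 4.],
              [1., 6.]])
y = np.array([3., 5., 8.])
w = np.array([0., 1.])

errors = np.dot(X, w) - y  # [-1, -1, -2]

# Gradient computed one weight at a time, as in the loop above.
per_feature = np.array([feature_derivative(errors, X[:, i])
                        for i in range(X.shape[1])])

# Gradient computed for all weights at once.
vectorized = 2 * np.dot(X.T, errors)

print(per_feature, vectorized)  # both are [-8, -36]
```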

A few things to note before we run the gradient descent. The gradient is a sum over all the data points of the product of an error and a feature, so it will be very large: the features are large (square feet) and the output is large (prices). So while you might expect the tolerance to be small, "small" is only relative to the size of the features.
For the same reason, the step size will be much smaller than you might expect: it has to compensate for the very large gradient values.
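To get a feel for these scales, here is a rough sketch on synthetic data. The house sizes, the price-per-square-foot figure, and the noise level are invented for illustration and are not drawn from the King County files. Only the starting weights and the step size are taken from the simple regression run in this notebook.

```python
import numpy as np

# Synthetic "houses": sizes in square feet and prices that grow with size.
# All of these numbers are assumptions made up for this illustration.
rng = np.random.default_rng(0)
n = 1000
sqft = rng.uniform(500., 4000., size=n)
price = 280. * sqft + rng.normal(0., 50000., size=n)

X = np.column_stack([np.ones(n), sqft])
w = np.array([-47000., 1.])  # starting weights used in this notebook

# Gradient of the sum of squared errors at the starting weights.
errors = np.dot(X, w) - price
gradient = 2 * np.dot(X.T, errors)
gradient_magnitude = np.sqrt(np.dot(gradient, gradient))

# Size of the first update with a step size of 7e-12.
step_size = 7e-12
first_update = step_size * gradient

print("gradient magnitude: %.3g" % gradient_magnitude)
print("first update (intercept, slope):", first_update)
```

With data at this scale the gradient magnitude is on the order of 1e12, so a step size near 1e-12 moves the slope by a sensible amount while barely moving the intercept.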

## Running the Gradient Descent as Simple Regression

Now we will run the regression_gradient_descent function on some actual data. In particular we will use the gradient descent to estimate the model from Week 1 using just an intercept and slope. Use the following parameters:

* features: 'sqft_living'
* output: 'price'
* initial weights: -47000, 1 (intercept, sqft_living respectively)
* step_size: 7e-12
* tolerance: 2.5e7

In [7]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy array
simple_feature = train_data[['constant', 'sqft_living']].to_numpy()
output = train_data['price'].to_numpy()
initial_weights = np.array([-47000, 1.])
step_size = 7e-12
tolerance = 2.5e7

Use these parameters to estimate the slope and intercept for predicting prices based only on ‘sqft_living’.

In [8]:
simple_weights = regression_gradient_descent(simple_feature,output,initial_weights,step_size,tolerance)
print("Test Weights (Train Data): ", simple_weights)

Test Weights (Train Data):  [-46999.88716555    281.91211918]


Now build a corresponding ‘test_simple_feature_matrix’ and ‘test_output’ using test_data. Using ‘test_simple_feature_matrix’ and ‘simple_weights’ compute the predicted house prices on all the test data.

In [9]:
# Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data
# (you will need to create a numpy array of the test feature matrix first):
test_simple_feature_matrix = test_data[['constant', 'sqft_living']].to_numpy()
test_predictions = predict_output(test_simple_feature_matrix, simple_weights)
print("Test Predictions: ", test_predictions)

Test Predictions:  [ 356134.443255    784640.86440132  435069.83662406 ...,  663418.65315598
  604217.10812919  240550.47439317]


Now compute the RSS (residual sum of squares) on all the test data for this model. Record the value and store it for later.

In [10]:
test_residuals = test_data['price']- test_predictions
test_RSS = (test_residuals * test_residuals).sum()
print("RSS of Test Data: %.4g" % test_RSS)

RSS of Test Data: 2.754e+14
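Since the same computation is needed again for the second model, it could be wrapped in a small helper. The function name `compute_rss` and the toy numbers below are hypothetical, and this is only a sketch of one way to do it.

```python
import numpy as np

def compute_rss(predictions, output):
    # Residual sum of squares: sum over data points of (output - prediction)^2.
    residuals = np.asarray(output, dtype=float) - np.asarray(predictions, dtype=float)
    return np.dot(residuals, residuals)

# Hypothetical toy values: residuals are 1, 0 and -2.
toy_output = np.array([3., 5., 8.])
toy_predictions = np.array([2., 5., 10.])

toy_rss = compute_rss(toy_predictions, toy_output)
print(toy_rss)  # 1 + 0 + 4 = 5.0
```

With the variables from this notebook, the call would look like `compute_rss(test_predictions, test_data['price'])`.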


Now we will use the gradient descent to fit a model with more than 1 predictor variable (and an intercept). Use the following parameters:

* model features: 'sqft_living', 'sqft_living15'
* output: 'price'
* initial weights: [-100000, 1, 1] (intercept, sqft_living, and sqft_living15 respectively)
* step size: 4e-12
* tolerance: 1e9

In [11]:
# Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:
model_features = ['constant', 'sqft_living', 'sqft_living15']
my_output = 'price'
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy array
train_feature_matrix = train_data[model_features].to_numpy()
output = train_data[my_output].to_numpy()
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

Run gradient descent on a model with ‘sqft_living’ and ‘sqft_living15’ as well as an intercept, using the above parameters. Save the resulting regression weights.

In [12]:
# Weights for MLR Model
MLR_weights = regression_gradient_descent(train_feature_matrix,output,initial_weights,step_size,tolerance)
print("Weights: ", MLR_weights)

Weights:  [ -9.99999688e+04   2.45072603e+02   6.52795267e+01]


Predict the house prices for all the TEST data using the second model.

In [13]:
test_feature_matrix = test_data[model_features].to_numpy()
MLR_predictions = predict_output(test_feature_matrix, MLR_weights)
print("MLR Predictions: ", MLR_predictions)

MLR Predictions:  [ 366651.41162949  762662.39850726  386312.09557541 ...,  682087.39916306
  585579.27901327  216559.20391786]


Now compute RSS on all test data for the second model. Record the value and store it for later.

In [14]:
MLR_Residuals = test_data[my_output] - MLR_predictions
MLR_RSS = (MLR_Residuals * MLR_Residuals).sum()
print("RSS of MLR: %.4g" % MLR_RSS)

RSS of MLR: 2.703e+14
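To put the two recorded values side by side, the short sketch below computes the relative reduction in test RSS. It uses the rounded values printed above (four significant figures), so the result is approximate. The variable names are illustrative.

```python
# Rounded test RSS values copied from the printed outputs above.
simple_rss = 2.754e14  # model with sqft_living only
mlr_rss = 2.703e14     # model with sqft_living and sqft_living15

# Fraction by which the second model lowers the test RSS.
relative_reduction = (simple_rss - mlr_rss) / simple_rss
print("Relative reduction in test RSS: %.1f%%" % (100 * relative_reduction))
```

On these rounded figures the second model lowers the test RSS by a little under 2%.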
