In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import numpy as np
%matplotlib inline

In [2]:
print(pd.__version__)

0.24.2


In the first assignment we explored multiple regression using Pandas. Now we will use Pandas along with numpy to solve for the regression weights with gradient descent.

In this notebook we will cover estimating multiple regression weights via gradient descent. We will:

* Add a constant column of 1's to a dataframe to account for the intercept
* Convert a dataframe into a numpy array
* Write a predict_outcome() function using numpy
* Write a numpy function to compute the derivative of the regression cost function with respect to the weight of a single feature
* Write a gradient descent function to compute the regression weights given an initial weight vector, step size and tolerance.
* Use the gradient descent function to estimate regression weights for multiple features

In [3]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [4]:
train_data = pd.read_csv('kc_house_train_data.csv',dtype=dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv',dtype=dtype_dict)

In [5]:
train_data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180.0,5650,1,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340.0,5650.0
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570.0,7242,2,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690.0,7639.0
2,5631500400,20150225T000000,180000.0,2.0,1.0,770.0,10000,1,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720.0,8062.0
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960.0,5000,1,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360.0,5000.0
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680.0,8080,1,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800.0,7503.0


In [28]:
test_data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,constant
0,114101516,20140528T000000,310000.0,3.0,1.0,1430.0,19901,1.5,0,0,...,1430,0,1927,0,98028,47.7558,-122.229,1780.0,12697.0,1
1,9297300055,20150124T000000,650000.0,4.0,3.0,2950.0,5000,2.0,0,3,...,1980,970,1979,0,98126,47.5714,-122.375,2140.0,4000.0,1
2,1202000200,20141103T000000,233000.0,3.0,2.0,1710.0,4697,1.5,0,0,...,1710,0,1941,0,98002,47.3048,-122.218,1030.0,4705.0,1
3,8562750320,20141110T000000,580500.0,3.0,2.5,2320.0,3980,2.0,0,0,...,2320,0,2003,0,98027,47.5391,-122.07,2580.0,3980.0,1
4,7589200193,20141110T000000,535000.0,3.0,1.0,1090.0,3000,1.5,0,0,...,1090,0,1929,0,98117,47.6889,-122.375,1570.0,5080.0,1


Next, write a function that takes a data set, a list of features to be used as inputs (e.g. [‘sqft_living’, ‘bedrooms’]), and the name of the output (e.g. ‘price’). This function should return a features_matrix (2D array) consisting of a column of ones followed by columns containing the values of the input features, in the same order as the input list. It should also return an output_array, which is an array of the values of the output in the data set (e.g. ‘price’).

In [6]:
def get_numpy_data(dataframe, features, output):
    dataframe['constant'] = 1  # add a constant column (note: this modifies the caller's DataFrame in place)
    # prepend 'constant' to the features list
    features = ['constant'] + features
    # select the columns named in 'features' into the DataFrame 'features_df'
    features_df = dataframe[features]
    # convert features_df into a 2D numpy array
    features_matrix = features_df.to_numpy()
    # select the target column of the DataFrame as a pandas Series
    output_series = dataframe[output]
    # convert the Series into a 1D numpy array
    output_array = output_series.to_numpy()
    return features_matrix, output_array
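
Because the function above adds the 'constant' column to the DataFrame it receives, that column later shows up in the caller's data (as in the test_data.head() output earlier). The following is a minimal sketch of a variant that leaves the input untouched, assuming that behaviour is preferred. The name get_numpy_data_copy and the toy DataFrame are hypothetical and not part of the assignment.

```python
import numpy as np
import pandas as pd

def get_numpy_data_copy(dataframe, features, output):
    # Hypothetical variant: assign() returns a new DataFrame,
    # so the caller's DataFrame does not gain a 'constant' column.
    with_constant = dataframe.assign(constant=1)
    feature_matrix = with_constant[['constant'] + features].to_numpy()
    output_array = with_constant[output].to_numpy()
    return feature_matrix, output_array

# Made-up toy data, only to illustrate the shapes involved.
toy = pd.DataFrame({
    'sqft_living': [1000.0, 2000.0, 3000.0],
    'bedrooms': [2.0, 3.0, 4.0],
    'price': [200000.0, 400000.0, 600000.0],
})
toy_features, toy_output = get_numpy_data_copy(toy, ['sqft_living', 'bedrooms'], 'price')
print(toy_features.shape)          # one row per house, one column per feature plus the constant
print(toy_output)
print('constant' in toy.columns)   # the original frame is unchanged
```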

If the features matrix (including a column of 1s for the constant) is stored as a 2D array and the regression weights are stored as a 1D array, then the predicted output is just the dot product of the features matrix and the weights (with the weights on the right). Write a function ‘predict_outcome’ that accepts a 2D array ‘feature_matrix’ and a 1D array ‘weights’ and returns a 1D array ‘predictions’, e.g. in Python:

In [7]:
def predict_outcome(feature_matrix, weights):
    predictions = np.dot(feature_matrix, weights)
    return(predictions)
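
As a small worked example of the dot product described above, the sketch below applies the same function to two made-up houses. The toy matrix and weights are assumptions chosen so the arithmetic is easy to check by hand: intercept 10000 plus 200 per square foot.

```python
import numpy as np

def predict_outcome(feature_matrix, weights):
    # Each prediction is the dot product of one row with the weight vector.
    return np.dot(feature_matrix, weights)

# Column 0 is the constant, column 1 is a made-up sqft_living value.
toy_matrix = np.array([[1., 1000.],
                       [1., 2000.]])
toy_weights = np.array([10000., 200.])

toy_predictions = predict_outcome(toy_matrix, toy_weights)
print(toy_predictions)  # [210000. 410000.]
```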

If we have the values of a single input feature in an array ‘feature’ and the prediction ‘errors’ (predictions - output), then the derivative of the regression cost function with respect to the weight of ‘feature’ is just twice the dot product between ‘feature’ and ‘errors’. Write a function that accepts an ‘errors’ array and a ‘feature’ array and returns the ‘derivative’ (a single number), e.g. in Python:

In [8]:
def feature_derivative(errors, feature):
    derivative = 2*np.dot(errors, feature)
    return(derivative)
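
One way to gain confidence in this formula is to compare it with a numerical derivative of the residual sum of squares. The sketch below does that on made-up data. The helper rss, the toy arrays and the step eps are assumptions introduced only for this check.

```python
import numpy as np

def rss(feature_matrix, output, weights):
    # Regression cost: sum of squared (prediction - output).
    residuals = np.dot(feature_matrix, weights) - output
    return np.dot(residuals, residuals)

def feature_derivative(errors, feature):
    return 2 * np.dot(errors, feature)

# Made-up data: a constant column and one feature.
X = np.array([[1., 1.], [1., 2.], [1., 3.]])
y = np.array([2., 3., 5.])
w = np.array([0.5, 1.0])

# Analytic derivative with respect to the weight of column 1.
errors = np.dot(X, w) - y
analytic = feature_derivative(errors, X[:, 1])

# Central finite difference on the same weight.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1] += eps
w_minus[1] -= eps
numeric = (rss(X, y, w_plus) - rss(X, y, w_minus)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```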

Now we will use our predict_outcome and feature_derivative functions to write a gradient descent function. Although we can compute the derivative for all the features simultaneously (the gradient), we will explicitly loop over the features individually for simplicity. Write a gradient descent function that does the following:

* Accepts a numpy feature_matrix 2D array, a 1D output array, an array of initial weights, a step size and a convergence tolerance.
* While not converged updates each feature weight by subtracting the step size times the derivative for that feature given the current weights
* At each step computes the magnitude/length of the gradient (square root of the sum of squared components)
* When the magnitude of the gradient is smaller than the input tolerance returns the final weight vector.

In [9]:
import math
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    weights = np.array(initial_weights, dtype=float)  # float copy, so integer initial weights are not truncated on update
    while not converged:
        # compute the predictions based on feature_matrix and weights:
        predictions = predict_outcome(feature_matrix, weights)
        # compute the errors as predictions - output:
        errors = predictions - output
        
        gradient_sum_squares = 0 # running sum of the squared derivatives
        # while not converged, update each weight individually:
        for i in range(len(weights)):
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative = feature_derivative(errors, feature_matrix[:,i])
            
            # add the squared derivative to the gradient magnitude
            gradient_sum_squares += derivative**2
            # update the weight based on step size and derivative:
            weights[i] = weights[i] - step_size*derivative
        gradient_magnitude = math.sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
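
The text notes that the derivatives for all features can be computed simultaneously. A minimal sketch of that vectorized form follows. The function name regression_gradient_descent_vectorized and the max_iterations guard are assumptions added here, not part of the assignment. The guard simply stops the loop if the step size is too large for the gradient to ever fall below the tolerance. The toy data is made up so that the exact answer (intercept 1, slope 2) is known.

```python
import numpy as np

def regression_gradient_descent_vectorized(feature_matrix, output, initial_weights,
                                           step_size, tolerance, max_iterations=100000):
    # Hypothetical vectorized variant of the loop-based function.
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = np.dot(feature_matrix, weights) - output
        # All partial derivatives at once: 2 * X^T (Xw - y).
        gradient = 2 * np.dot(feature_matrix.T, errors)
        weights = weights - step_size * gradient
        # Same stopping rule as the loop version: gradient magnitude below tolerance.
        if np.linalg.norm(gradient) < tolerance:
            break
    return weights

# Made-up data generated from y = 1 + 2x with no noise.
toy_X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
toy_y = np.array([1., 3., 5., 7.])
toy_weights = regression_gradient_descent_vectorized(
    toy_X, toy_y, initial_weights=[0., 0.], step_size=0.01, tolerance=1e-6)
print(toy_weights)  # should be close to [1, 2]
```

Because the loop version computes the errors once per pass and reuses them for every weight, both forms apply the same update at each step.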

In [14]:
#Use these parameters to estimate the slope and intercept for predicting prices based only on ‘sqft_living’.
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

In [12]:
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights, step_size, tolerance)

In [13]:
simple_weights

array([-46999.88716555,    281.91211918])

## Now build a corresponding ‘test_simple_feature_matrix’ and ‘test_output’ using test_data. Using ‘test_simple_feature_matrix’ and ‘simple_weights’ compute the predicted house prices on all the test data.

In [15]:
simple_features = ['sqft_living']
my_output= 'price'
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)

In [16]:
#predicted test data
test_predictions = predict_outcome(test_simple_feature_matrix, simple_weights)

In [17]:
test_predictions

array([356134.443255  , 784640.86440132, 435069.83662406, ...,
       663418.65315598, 604217.10812919, 240550.47439317])

In [18]:
# Now compute RSS on all test data for this model
RSS = np.sum((test_predictions - test_output)**2)


In [19]:
RSS

275400044902128.3
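
Since RSS is computed again for the second model, it may be convenient to wrap the calculation in a small helper. The sketch below is one possible form. The name residual_sum_of_squares and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def residual_sum_of_squares(predictions, output):
    # Same quantity as np.sum((predictions - output)**2).
    residuals = predictions - output
    return np.dot(residuals, residuals)

# Made-up values: each prediction is off by exactly 10000.
toy_predictions = np.array([210000., 410000.])
toy_prices = np.array([200000., 400000.])
toy_rss = residual_sum_of_squares(toy_predictions, toy_prices)
print(toy_rss)  # 200000000.0
```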

In [20]:
#Now we will use the gradient descent to fit a model with more than 1 predictor variable (and an intercept)
model_features = ['sqft_living', 'sqft_living15'] 
#Note that sqft_living15 is the average square footage of the 15 nearest neighbouring houses.
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features,my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

## Use the regression weights from this second model (using sqft_living and sqft_living15) to predict the prices of all the houses in the TEST data.

In [21]:
regression_weights = regression_gradient_descent(feature_matrix, output,initial_weights, step_size, tolerance)

In [22]:
(test_feature_matrix, test_output2) = get_numpy_data(test_data, model_features, my_output)

In [23]:
#predicted test data
test_predictions2 = predict_outcome(test_feature_matrix, regression_weights)

In [24]:
test_predictions2

array([366651.41162949, 762662.39850726, 386312.09557541, ...,
       682087.39916306, 585579.27901327, 216559.20391786])

In [25]:
#the actual price for the 1st house in the test data:
test_output[0]

310000.0

In [26]:
#Now compute RSS on all test data for the second model.
RSS2 = np.sum((test_predictions2 - test_output2)**2)

In [27]:
RSS2

270263443629803.56
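
To put the two results side by side, the short sketch below uses only the numbers reported in the outputs above: the two test RSS values, and each model's prediction for the first test house against its actual price. The variable names are hypothetical.

```python
# Values copied from the notebook outputs above.
rss_simple = 275400044902128.3        # sqft_living only
rss_two_feature = 270263443629803.56  # sqft_living and sqft_living15

reduction = rss_simple - rss_two_feature
relative_reduction = reduction / rss_simple
print(f"absolute RSS reduction: {reduction:.3e}")
print(f"relative RSS reduction: {relative_reduction:.2%}")

# First test house: actual price and the two predictions.
actual_first = 310000.0
error_simple = abs(356134.443255 - actual_first)
error_two_feature = abs(366651.41162949 - actual_first)
print(f"first-house error, simple model: {error_simple:.2f}")
print(f"first-house error, two-feature model: {error_two_feature:.2f}")
```

The two-feature model has the lower RSS over the whole test set, even though the simple model is closer on this one house. A lower total error does not imply a better prediction for every individual example.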