# Linear Regression Algorithm from Scratch

The purpose of this assignment is to implement a linear regression algorithm by hand, using gradient descent. We test the algorithm on a housing data set, predicting the price of a house from its features.

The theory and method of the linear regression algorithm can be found in the PDF file titled "Linear Regression".

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
import sklearn
from math import log
%matplotlib inline

### The Data

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
house_df=pd.read_csv("kc_house_data.csv",dtype=dtype_dict)
train_df=pd.read_csv("kc_house_train_data.csv",dtype=dtype_dict)
test_df=pd.read_csv("kc_house_test_data.csv",dtype=dtype_dict)

In [3]:
house_df.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180.0 | 5650 | 1 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340.0 | 5650.0 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570.0 | 7242 | 2 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690.0 | 7639.0 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770.0 | 10000 | 1 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720.0 | 8062.0 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960.0 | 5000 | 1 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360.0 | 5000.0 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680.0 | 8080 | 1 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800.0 | 7503.0 |


## Defining the Algorithm
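
The PDF is not reproduced here. As a brief sketch that agrees with the code in the cells that follow, assume the cost being minimized is the residual sum of squares, with $H$ the feature matrix (first column all ones), $y$ the vector of observed prices, $w$ the weight vector, $\eta$ the step size, and $\epsilon$ the tolerance:

$$
\mathrm{RSS}(w) = (y - Hw)^{T}(y - Hw)
$$

$$
\nabla \mathrm{RSS}(w) = -2\,H^{T}(y - Hw)
$$

$$
w^{(t+1)} = w^{(t)} - \eta\,\nabla \mathrm{RSS}\big(w^{(t)}\big)
$$

The iteration stops once $\lVert \nabla \mathrm{RSS}(w^{(t)}) \rVert < \epsilon$ or the iteration limit is reached.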

In [4]:
#Define a function that accepts a dataframe of feature values and a list of feature names, and returns a numpy
#array H. The first column of H is all ones (this is for the constant term). The remaining columns appear in
#the same order as the names in the list.

def H_generator(x_data,col_list):
    #add the ones column to a copy, so the caller's dataframe is left unchanged
    x_data=x_data.assign(ones=1)
    return np.array(x_data[["ones"]+list(col_list)])
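
As a small usage sketch of the same idea, the cell below builds a design matrix from made-up numbers. The names `design_matrix` and `toy` are hypothetical and are not part of the assignment. This version adds the ones column with `DataFrame.assign`, which returns a new frame, so the input is not modified.

```python
import numpy as np
import pandas as pd

def design_matrix(frame, col_list):
    """Return [ones, features...] as a NumPy array without modifying `frame`."""
    with_ones = frame.assign(ones=1)  # assign() returns a new DataFrame
    return with_ones[["ones"] + list(col_list)].to_numpy()

# Made-up values, for illustration only
toy = pd.DataFrame({"sqft_living": [1000.0, 1500.0, 2000.0],
                    "bedrooms": [2.0, 3.0, 3.0]})
H_toy = design_matrix(toy, ["sqft_living", "bedrooms"])
print(H_toy)
```

Each row of `H_toy` starts with 1, followed by the feature values in the order given in the list.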
    

In [5]:
#Define a function that takes an H matrix, a w vector (current weights), and a y vector, and returns the gradient
#of the residual sum of squares with respect to w (a vector with one entry per column of H)

def gradient_generator(H,w,y):
    return -2*np.dot(np.transpose(H),y-np.dot(H,w))
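
One way to gain confidence in the gradient expression is to compare it with a finite-difference approximation on small random data. The sketch below assumes the cost is the residual sum of squares. The helper names (`rss_cost`, `rss_gradient`, `numerical_gradient`) and the random data are hypothetical and are used only for this check.

```python
import numpy as np

def rss_cost(H, w, y):
    """Residual sum of squares for weights w."""
    residual = y - H @ w
    return residual @ residual

def rss_gradient(H, w, y):
    """Same expression as gradient_generator: -2 H^T (y - H w)."""
    return -2 * H.T @ (y - H @ w)

def numerical_gradient(H, w, y, h=1e-6):
    """Central-difference approximation of the gradient of rss_cost."""
    grad = np.zeros_like(w, dtype=float)
    for k in range(len(w)):
        step = np.zeros_like(w, dtype=float)
        step[k] = h
        grad[k] = (rss_cost(H, w + step, y) - rss_cost(H, w - step, y)) / (2 * h)
    return grad

# Small random problem, for illustration only
rng = np.random.default_rng(0)
H_small = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_small = rng.normal(size=20)
w_small = rng.normal(size=3)

analytic = rss_gradient(H_small, w_small, y_small)
numeric = numerical_gradient(H_small, w_small, y_small)
print(np.allclose(analytic, numeric, rtol=1e-5, atol=1e-5))
```

If the two gradients disagree, the sign or the factor of 2 in the analytic formula is the first thing to check.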
    

In [6]:
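#Run gradient descent from the initial weights w_not with step size eta. The loop stops once the gradient
#length falls below epsilon or the iteration count exceeds max_iterations.
def regressor(H,w_not,y,eta,epsilon,max_iterations):
    wn=w_not
    j=0
    while True:
        j+=1
        gradient=gradient_generator(H,wn,y)
        #Euclidean length of the gradient at the current weights (computed before the update below)
        gradient_length=np.sqrt(np.dot(gradient,gradient))
        wn=wn-(eta*gradient)
        if gradient_length<epsilon or j>max_iterations:
            break
    print("Coef: {}\ny predict: {}\niterations: {}\ngradient_length: {}".format(wn,np.dot(H,wn),j,gradient_length))
    return wn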
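
To see the loop converge on a problem with a known answer, the sketch below runs a compact gradient descent on noise-free synthetic data and compares the result with NumPy's least-squares solver. This is an illustration under stated assumptions, not the assignment's function. `gradient_descent` and the synthetic data are hypothetical, and unlike `regressor` it tests the gradient length before taking a step.

```python
import numpy as np

def gradient_descent(H, y, w0, eta, epsilon, max_iterations):
    """Minimize the residual sum of squares with a fixed step size."""
    w = np.asarray(w0, dtype=float)
    for iteration in range(1, max_iterations + 1):
        gradient = -2 * H.T @ (y - H @ w)
        if np.linalg.norm(gradient) < epsilon:
            break  # converged: the gradient is short enough
        w = w - eta * gradient
    return w, iteration

# Synthetic data generated from y = 2 + 3x with no noise
x = np.linspace(0.0, 1.0, 50)
H_demo = np.column_stack([np.ones_like(x), x])
y_demo = 2.0 + 3.0 * x

w_gd, n_iter = gradient_descent(H_demo, y_demo, [0.0, 0.0],
                                eta=0.01, epsilon=1e-8, max_iterations=5000)
w_ls, *_ = np.linalg.lstsq(H_demo, y_demo, rcond=None)
print(w_gd, w_ls, n_iter)
```

Both weight vectors should be very close to the intercept 2 and slope 3 used to generate the data.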

In [7]:
#Define the model that makes predictions
def manual_predict(w,x_data,col_list):
    H=H_generator(x_data,col_list)
    return np.dot(H,w)
    

In [8]:
#Define a residual sum of squares function:
def rss(y_true,y_prediction):
    return ((y_true-y_prediction)**2).sum()

## Testing The Algorithm
Let's compare our algorithm to sklearn's LinearRegression.

### Sklearn

In [9]:
from sklearn.linear_model import LinearRegression

In [21]:
X_train1=train_df[["sqft_living",]][:]
y_train1=train_df.price

X_test1=test_df[["sqft_living",]][:]
y_test1=test_df.price

In [11]:
model_compare1=LinearRegression().fit(X_train1,y_train1)

In [12]:
model_compare1.coef_

array([ 281.95883963])

In [13]:
model_compare1.intercept_

-47116.079072893714

#### Score

In [14]:
print("Train score: {}\nTest score: {}".format(model_compare1.score(X_train1,y_train1),model_compare1.score(X_test1,y_test1)))

Train score: 0.49409140136495394
Test score: 0.4872491184162604


### Our Algorithm

In [15]:
H_train=H_generator(X_train1,["sqft_living",])

In [16]:
w_train=regressor(H_train,np.array([-47000,1]),y_train1,7e-12,2.5e7,10000)

Coef: [-46999.88716555    281.91211918]
y predict: [ 285656.4134612   677514.25911474  170072.44459936 ...,  384325.65517252
  404059.50351479  240550.47439317]
iterations: 12
gradient_length: 18320017.26866943


In [17]:
w_train

array([-46999.88716555,    281.91211918])

In [18]:
y_predict=manual_predict(w_train,X_test1,["sqft_living",])

In [23]:
#Re-create the data sets from the original frames so that they contain only the sqft_living column
X_train1=train_df[["sqft_living",]][:]
y_train1=train_df.price

X_test1=test_df[["sqft_living",]][:]
y_test1=test_df.price

In [25]:
predict_df=pd.DataFrame({"Actual y":y_test1,"Manual y":y_predict,"Sklearn y":model_compare1.predict(X_test1)})

In [26]:
predict_df.head()

| | Actual y | Manual y | Sklearn y |
|---|---|---|---|
| 0 | 310000.0 | 356134.443255 | 356085.061598 |
| 1 | 650000.0 | 784640.864401 | 784662.497837 |
| 2 | 233000.0 | 435069.836624 | 435033.536695 |
| 3 | 580500.0 | 607036.229321 | 607028.42887 |
| 4 | 535000.0 | 260284.322735 | 260219.056124 |


In [27]:
rss1=rss(y_test1,y_predict)
rss1

275400044902128.78

In [28]:
#We can compute the coefficient of determination (r squared value) with...
from sklearn.metrics import r2_score
r2_score(y_test1,y_predict)

0.48725449668690579
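
The coefficient of determination is tied directly to the RSS: it equals one minus the RSS divided by the total sum of squares around the mean of the true values. The sketch below computes it by hand on made-up numbers. `r_squared` and the two demo lists are hypothetical names.

```python
import numpy as np

def r_squared(y_true, y_prediction):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_prediction = np.asarray(y_prediction, dtype=float)
    rss_value = ((y_true - y_prediction) ** 2).sum()    # residual sum of squares
    tss_value = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1.0 - rss_value / tss_value

# Made-up values, for illustration only
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
print(r_squared(y_true_demo, y_pred_demo))
```

Because the total sum of squares depends only on the true values, comparing two models on the same test set by RSS or by r squared gives the same ranking.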

# Multiple Features

Now we will use gradient descent to fit a model with more than one predictor variable (plus an intercept). Use the following parameters:

model features = 'sqft_living', 'sqft_living15'

output = 'price'

initial weights = [-100000, 1, 1] (intercept, sqft_living, and sqft_living15 respectively)

step size = 4e-12

tolerance = 1e9
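
The step size has to be this small because the square-footage features are in the thousands and are not rescaled. For a sum-of-squares cost, fixed-step gradient descent only converges when the step size is below 1 divided by the largest eigenvalue of H^T H. The sketch below illustrates that bound on made-up data. `max_stable_step`, `gradient_norm_after`, and the synthetic square-footage values are hypothetical and are not taken from the housing data.

```python
import numpy as np

def max_stable_step(H):
    """Upper bound on the step size: 1 / (largest eigenvalue of H^T H)."""
    largest_eigenvalue = np.linalg.eigvalsh(H.T @ H)[-1]  # eigenvalues come back in ascending order
    return 1.0 / largest_eigenvalue

def gradient_norm_after(H, y, eta, steps):
    """Gradient length after `steps` gradient descent updates starting from zero weights."""
    w = np.zeros(H.shape[1])
    for _ in range(steps):
        w = w - eta * (-2 * H.T @ (y - H @ w))
    return np.linalg.norm(-2 * H.T @ (y - H @ w))

# Made-up square footage and prices, for illustration only
sqft = np.linspace(500.0, 5000.0, 200)
H_sqft = np.column_stack([np.ones_like(sqft), sqft])
price = 100.0 + 200.0 * sqft

eta_max = max_stable_step(H_sqft)
start = gradient_norm_after(H_sqft, price, eta_max, steps=0)
shrinking = gradient_norm_after(H_sqft, price, 0.9 * eta_max, steps=50)
growing = gradient_norm_after(H_sqft, price, 1.1 * eta_max, steps=50)
print(eta_max, shrinking < start < growing)
```

A step slightly below the bound shrinks the gradient, while a step slightly above it makes the gradient grow, which is why the step sizes in this notebook are on the order of 1e-12.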

### Sklearn

In [237]:
X_train2=train_df[["sqft_living","sqft_living15"]][:]
y_train2=train_df.price

In [238]:
X_test2=test_df[["sqft_living","sqft_living15"]][:]
y_test2=test_df.price

In [229]:
model_compare2=LinearRegression().fit(X_train2,y_train2)

In [230]:
model_compare2.intercept_

-100262.17515853408

In [231]:
model_compare2.coef_

array([ 245.18871442,   65.27158522])

### Our Algorithm

In [232]:
H_train2=H_generator(X_train2,["sqft_living","sqft_living15"])

In [233]:
w_train2=regressor(H_train2,np.array([-100000,1,1]),y_train2,4e-12,1e9,10000)

Coef: [ -9.99999688e+04   2.45072603e+02   6.52795267e+01]
y predict: [ 276660.26901584  640159.02217622  266266.24843983 ...,  374838.79030123
  384160.32933988  216559.20391786]
iterations: 274
gradient_length: 997480716.6301363


In [234]:
y_predict2=manual_predict(w_train2,X_test2,["sqft_living","sqft_living15"])

## A Few Questions About the Prediction...

In [242]:
#Create a data frame of true prices and both predicted prices
predict2_df=pd.DataFrame({"Actual y":y_test2,"Manual y":y_predict2,"Sklearn y":model_compare2.predict(X_test2)})

In [243]:
#Let's view the results
predict2_df.head()

| | Actual y | Manual y | Sklearn y |
|---|---|---|---|
| 0 | 310000.0 | 366651.411629 | 366541.108167 |
| 1 | 650000.0 | 762662.398507 | 762725.724772 |
| 2 | 233000.0 | 386312.095575 | 386240.259288 |
| 3 | 580500.0 | 636989.650072 | 636976.332184 |
| 4 | 535000.0 | 269618.025845 | 269469.912366 |


Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?

In [246]:
model1err=predict_df["Actual y"][0]-predict_df["Manual y"][0]
model2err=predict2_df["Actual y"][0]-predict2_df["Manual y"][0]
print("Model 1 difference: {}\nModel 2 difference: {}".format(model1err,model2err))

Model 1 difference: -46134.44325500238
Model 2 difference: -56651.41162949387


In [261]:
rss2=rss(y_test2,y_predict2)
rss2

270263443629803.3

Which model (1 or 2) has the lowest RSS on all of the TEST data?

In [263]:
rss2<rss1

True

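Model 2 has the lower RSS on the test data (about 2.70e14 versus about 2.75e14 for model 1), even though model 1 was closer to the true price for the first house.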