Perform ordinary least squares (OLS) regression and track the results of the model. The coefficients are computed from the closed-form solution, without relying on a regression library. The result is then compared against a modern library, scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

In [2]:
credit = pd.read_csv('Credit_N400_p9.csv')
credit.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Balance
0,14.891,3606,283,2,34,11,Male,No,Yes,333
1,106.025,6645,483,3,82,15,Female,Yes,Yes,903
2,104.593,7075,514,4,71,11,Male,No,No,580
3,148.924,9504,681,3,36,11,Female,No,No,964
4,55.882,4897,357,2,68,16,Male,No,Yes,331


Feature Engineering:

1. Convert categorical values to numeric codes.
2. Standardize the predictors and center the response.


In [3]:
# Change categorical values to 0 or 1
credit['Gender']=credit['Gender'].astype('category')
credit['Student']=credit['Student'].astype('category')
credit['Married']=credit["Married"].astype("category")
cat_columns=credit.select_dtypes(['category']).columns
credit[cat_columns]=credit[cat_columns].apply(lambda x:x.cat.codes)
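
To make the encoding explicit, here is a minimal sketch on toy data (the `toy` frame is hypothetical and not taken from the credit file). It assumes the default pandas behavior, where string categories are ordered alphabetically, so `Female`/`No` map to 0 and `Male`/`Yes` map to 1.

```python
import pandas as pd

# Toy data for illustration only (not from Credit_N400_p9.csv)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Student": ["No", "Yes", "No"],
})

# Same idea as the cell above: cast to category, then take the integer codes
toy = toy.astype("category")
codes = toy.apply(lambda col: col.cat.codes)

# Recover which label each integer code stands for
mapping = {name: dict(enumerate(toy[name].cat.categories)) for name in toy.columns}

print(codes)
print(mapping)
```

Printing the mapping before overwriting the original columns is a cheap way to confirm which level became 1, which matters when interpreting the sign of a coefficient.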

In [4]:
# select independent and dependent variables
X=credit.drop(['Balance'],axis=1)
y=credit['Balance']

In [5]:
# Define helper functions to standardize the predictors and center the response
def standardize(X):
  credit1=(X-X.mean())/X.std() # center each column and scale it to unit standard deviation
  return credit1
def centered_y(y):
  credit2=y-y.mean() # center the response values
  return credit2

In [6]:
# center and standardize the data
X=standardize(X) # standardize the independent variables
y=centered_y(y) # center the response values
X.head() # show the first rows of the standardized data

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married
0,-0.860505,-0.489386,-0.464957,-0.698255,-1.256101,-0.783948,1.034339,-0.332916,0.7944
1,1.725276,0.827225,0.827667,0.030993,1.526539,0.495967,-0.964384,2.996248,0.7944
2,1.684646,1.013518,1.028023,0.760241,0.888851,-0.783948,1.034339,-0.332916,-1.255665
3,2.942467,2.065853,2.107363,0.030993,-1.140158,-0.783948,-0.964384,-0.332916,-1.255665
4,0.302549,0.069925,0.013314,-0.698255,0.714936,0.815946,1.034339,-0.332916,0.7944
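
A quick sanity check on the standardization step can be sketched as follows, using a small made-up frame (`toy` is hypothetical). It relies on the fact that pandas `std()` uses the sample standard deviation (`ddof=1`) by default, whereas NumPy's `std()` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# Same formula as standardize(): subtract the column mean, divide by the column std
z = (toy - toy.mean()) / toy.std()

col_means = z.mean()   # should be ~0 for every column
col_stds = z.std()     # should be 1 for every column (pandas, ddof=1)

# NumPy's default (ddof=0) gives sqrt((n-1)/n) instead of 1
numpy_stds = z.to_numpy().std(axis=0)

print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
print(numpy_stds)
```

The difference between the two conventions is small for 400 rows, but it explains why standardized values can differ slightly between tools.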


Create an ordinary least squares function that computes beta using the closed-form formula.
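
For reference, the closed-form expression can be sketched from the usual normal-equations argument. This assumes that X^T X is invertible; here X denotes the standardized predictors, y the centered response, and β the coefficient vector.

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 \\
0 &= -2\, X^\top \left( y - X\hat{\beta} \right) \\
X^\top X \hat{\beta} &= X^\top y \\
\hat{\beta} &= \left( X^\top X \right)^{-1} X^\top y
\end{aligned}
```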


In [7]:
def beta(X,y):
    # Closed-form OLS estimate: (X^T X)^(-1) X^T y
    Xst = np.transpose(X)
    Xst_X = np.dot(Xst,X)
    Xst_X_inv = np.linalg.inv(Xst_X)
    Xst_y = np.dot(Xst,y)
    return np.dot(Xst_X_inv,Xst_y)
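
As an aside, forming the explicit inverse is not the only way to evaluate the closed form. The sketch below is an alternative, not the approach used in this notebook. It solves the same normal equations with `np.linalg.solve` and cross-checks against `np.linalg.lstsq`. The data and the names `X_demo`, `y_demo`, `beta_inv`, and `beta_solve` are made up for illustration.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=50)

def beta_inv(X, y):
    # Explicit inverse, mirroring the beta() function above
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

def beta_solve(X, y):
    # Solve (X^T X) b = X^T y directly, without forming the inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

b_inv = beta_inv(X_demo, y_demo)
b_solve = beta_solve(X_demo, y_demo)
b_lstsq = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print(np.allclose(b_inv, b_solve), np.allclose(b_inv, b_lstsq))
```

On well-conditioned data all three agree. Solving the linear system is generally preferred numerically when predictors are strongly correlated, since it avoids forming an explicit inverse.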

In [8]:
beta_hat = beta(X,y)
beta_hat

array([-274.97585713,  445.6598264 ,  170.54760351,   24.57773459,
        -10.94805296,   -3.48469203,    5.20664991,  128.10098409,
         -3.42384309])

Predict on the training data and measure the model's error with the mean squared error (MSE).

In [9]:
# Define helper functions for prediction and mean squared error
def predict(x,beta):
  prediction=np.dot(x,beta)
  return prediction

def mse(y_true,y_pred):
  error=(np.sum((y_true-y_pred)**2))/len(y_true)
  return error

In [10]:
pred = predict(X,beta_hat)

In [11]:
print("Mean squared error: %.2f" % mse(y, pred))

Mean squared error: 9502.04
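
The MSE is expressed in squared units of the response, so it can help to also look at its square root (RMSE), which is in the same units as `Balance`. A minimal sketch follows, with a made-up four-point example and a hypothetical helper name `mse_np`. The last lines simply take the square root of the MSE reported above.

```python
import numpy as np

def mse_np(y_true, y_pred):
    # Same quantity as mse() above, written with np.mean
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy values for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.0, 5.0])

toy_mse = mse_np(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4
toy_rmse = np.sqrt(toy_mse)

# RMSE corresponding to the MSE printed by the notebook
reported_rmse = np.sqrt(9502.04)

print(toy_mse, toy_rmse)
print(round(reported_rmse, 2))
```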


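Compare the results with the scikit-learn linear model. Because the predictors and the response are both centered, the fitted intercept is zero, so the model is fit with fit_intercept=False to match the closed-form solution.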

In [12]:
model = linear_model.LinearRegression(fit_intercept=False)
model = model.fit(X,y)
model.coef_

array([-274.97585713,  445.6598264 ,  170.54760351,   24.57773459,
        -10.94805296,   -3.48469203,    5.20664991,  128.10098409,
         -3.42384309])
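
Fitting without an intercept is safe here because of the centering. The sketch below illustrates this with NumPy only, on synthetic data (all names are made up for illustration, and it is a stand-in for, not a call to, scikit-learn). A least-squares fit with an explicit column of ones on centered data returns an intercept of essentially zero and the same slopes as the no-intercept fit.

```python
import numpy as np

# Synthetic data with a nonzero mean and a nonzero true intercept
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
y_raw = 3.0 + X_raw @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=100)

# Standardize the predictors and center the response, as in the notebook
Xc = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
yc = y_raw - y_raw.mean()

# Fit with an explicit intercept column of ones
X_aug = np.column_stack([np.ones(len(Xc)), Xc])
coef_with = np.linalg.lstsq(X_aug, yc, rcond=None)[0]

# Fit without an intercept
coef_without = np.linalg.lstsq(Xc, yc, rcond=None)[0]

intercept = coef_with[0]
print(intercept)
print(np.allclose(coef_with[1:], coef_without))
```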

In [13]:
y_predicted = model.predict(X)
print("Mean squared error: %.2f" % mse(y, y_predicted))


Mean squared error: 9502.04


Both approaches produce identical coefficients and the same mean squared error (9502.04), which confirms that the closed-form OLS implementation is correct.