### Week 11: Ordinary Least Squares

**OBJECTIVES**

- Demonstrate OLS in Python
  - from scratch
  - `statsmodels`
  - `sklearn`

In [1]:
%%latex
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$


In [2]:
from regression import LinearRegression   
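
The `regression` module imported above is a local file that is not shown in this notebook. As a rough sketch of what a from-scratch class could look like, the block below applies the normal equation $(X^TX)^{-1}X^Ty$ with NumPy. The class name `NormalEquationRegression` and the synthetic data are assumptions made for illustration; only the `fit`/`predict` methods and the `coefs_`/`intercept_` attribute names mirror how the notebook uses `LinearRegression`.

```python
import numpy as np


class NormalEquationRegression:
    """Hypothetical stand-in for the unseen `regression.LinearRegression`."""

    def __init__(self):
        self.coefs_ = None
        self.intercept_ = None  # this sketch fits no intercept term

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Solve (X^T X) beta = X^T y instead of forming the inverse explicitly;
        # the result is the same but the computation is more stable.
        self.coefs_ = np.linalg.solve(X.T @ X, X.T @ y)
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coefs_


# Synthetic check: y is an exact linear function of two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0])
model = NormalEquationRegression().fit(X_demo, y_demo)
```

Because the synthetic target has no noise, `model.coefs_` should recover the generating weights up to floating-point error.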

In [3]:
from sklearn.datasets import fetch_california_housing

In [4]:
cali = fetch_california_housing()

In [5]:
print(cali.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [6]:
X, y = cali.data, cali.target

In [7]:
lr = LinearRegression()

In [8]:
import numpy as np

In [9]:
lr.fit(X, y)

In [10]:
lr.coefs_

array([ 5.13515163e-01,  1.56511109e-02, -1.82528269e-01,  8.65099057e-01,
        7.79230657e-06, -4.69928985e-03, -6.39458199e-02, -1.63827177e-02])

In [11]:
lr.intercept_
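
`lr.intercept_` displays nothing, which in a notebook usually means the value is `None`. That would be consistent with a fit that has no intercept term, although the module is not shown, so this is an assumption. One common way to include an intercept is to prepend a column of ones to `X`. A minimal sketch on synthetic data (all names and values below are illustrative), using `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
true_coefs = np.array([1.5, -2.0, 0.5])
y_demo = 4.0 + X_demo @ true_coefs  # noiseless target with intercept 4.0

# Prepend a column of ones so the first fitted weight acts as the intercept.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept, coefs = beta[0], beta[1:]
```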

In [12]:
preds = lr.predict(X)

In [13]:
preds

array([4.09839978, 3.88380622, 3.52943155, ..., 0.6193242 , 0.74355959,
       0.99054306])

In [15]:
y

array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

In [16]:
y - y.mean()

array([ 2.45744183,  1.51644183,  1.45244183, ..., -1.14555817,
       -1.22155817, -1.17455817])
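
The centered values `y - y.mean()` are what the total sum of squares is built from. Written out, with $\hat{y}_i$ for the predictions and $\bar{y}$ for the mean of `y` (symbols chosen here for illustration; SSR and SST correspond to `ssr` and `sst` in the `RegressionMetrics` class):

$$\text{SSR} = \sum_i (y_i - \hat{y}_i)^2, \qquad \text{SST} = \sum_i (y_i - \bar{y})^2, \qquad R^2 = 1 - \frac{\text{SSR}}{\text{SST}}$$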

In [18]:
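import matplotlib.pyplot as plt


class RegressionMetrics:
    """Common regression metrics for true and predicted targets."""

    def __init__(self, y_true, y_pred):
        self.y = np.asarray(y_true)
        self.y_pred = np.asarray(y_pred)

    def r_squared(self):
        # ssr: sum of squared residuals; sst: total sum of squares about the mean
        ssr = np.sum((self.y - self.y_pred)**2)
        sst = np.sum((self.y - self.y.mean())**2)
        return 1 - ssr/sst

    def mse(self):
        self.mse_ = np.mean((self.y - self.y_pred)**2)
        return self.mse_

    def rmse(self):
        # Call mse() so rmse() works even if mse() has not been called yet
        self.rmse_ = np.sqrt(self.mse())
        return self.rmse_

    def plot(self):
        # The original method body was left empty; a predicted-vs-actual
        # scatter is assumed here as a placeholder.
        plt.scatter(self.y, self.y_pred, alpha=0.3)
        plt.xlabel("actual")
        plt.ylabel("predicted")
        plt.show()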

In [19]:
metric = RegressionMetrics(y, preds)

In [20]:
metric.r_squared()

0.5462360656980105
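
As a quick sanity check on the formulas used in `RegressionMetrics`, the sketch below recomputes R², MSE and RMSE as standalone functions on a tiny hand-made example. The function names and the four data points are made up for illustration; the values are small enough to verify by hand.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """1 - SSR/SST, with SST taken about the mean of y_true."""
    ssr = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ssr / sst


def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return np.sqrt(mse(y_true, y_pred))


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

# Every residual is +/-0.5, so SSR = 4 * 0.25 = 1.0 and MSE = 0.25.
# The mean of y_true is 2.5, so SST = 2.25 + 0.25 + 0.25 + 2.25 = 5.0.
r2_demo = r_squared(y_true, y_pred)   # 1 - 1.0/5.0 = 0.8
mse_demo = mse(y_true, y_pred)        # 0.25
rmse_demo = rmse(y_true, y_pred)      # 0.5
```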