<a href="https://colab.research.google.com/github/mounikarevanuru/mlfoundations/blob/main/algorithms/linear_regression/linear_regression_closed_form_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook implements a linear regression model from scratch using the closed-form (normal equation) solution. It uses the scikit-learn diabetes dataset as the example problem.
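
The closed form in question is the normal equation. As a brief sketch, in notation that is assumed here rather than taken from the notebook, let $\tilde{X}$ be the feature matrix with a leading column of ones and let $w$ stack the intercept and the coefficients. Minimizing the sum of squared residuals then gives, provided $\tilde{X}^\top \tilde{X}$ is invertible:

$$
\hat{w} = \arg\min_{w} \lVert y - \tilde{X} w \rVert_2^2 = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top y
$$

The first entry of $\hat{w}$ is the intercept and the remaining entries are the feature coefficients.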

In [11]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

In [12]:
class LinearRegression:
    def __init__(self):
        self.coef_ = None
        self.intercept_ = None

    def _rmse_loss(self, X, y):
        """
        Compute the root mean squared error (RMSE) of the predictions on X
        """
        y_pred = self.predict(X)
        return np.sqrt(np.mean((y - y_pred) ** 2))

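    def fit(self, X, y):
        """
        Fit the model with the normal equation w = (Xb^T Xb)^-1 Xb^T y,
        where Xb is X with a leading column of ones for the intercept
        """
        self.X = X
        self.y = y

        # Prepend a column of ones so the first weight acts as the intercept
        X_bias = np.c_[np.ones(X.shape[0]), X]

        # Normal equation (assumes X_bias.T @ X_bias is invertible)
        w = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)

        # Separate intercept and coefficients
        self.intercept_ = w[0]
        self.coef_ = w[1:]

        # Report the RMSE on the training data
        self.loss_ = self._rmse_loss(X, y)
        print("RMSE loss on training data:", self.loss_)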

    def predict(self, X):
        """
        Return the predicted y values for X
        """
        return X.dot(self.coef_) + self.intercept_

    def score(self, X, y):
        """
        Return R^2 score of the prediction
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot
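
Forming the explicit inverse in `fit` works for this dataset, but it can become numerically fragile when features are nearly collinear. As a hedged sketch, and not the method this notebook uses, the same least-squares weights can be obtained with `np.linalg.lstsq`, which avoids the explicit inverse. The helper name `fit_lstsq` and the synthetic data below are assumptions made for illustration only.

```python
import numpy as np

def fit_lstsq(X, y):
    """Hypothetical helper: least-squares fit without an explicit matrix inverse."""
    # Prepend a column of ones so the first weight is the intercept
    X_bias = np.c_[np.ones(X.shape[0]), X]
    # lstsq returns (solution, residuals, rank, singular values)
    w, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
    return w[0], w[1:]

# Synthetic, noise-free data (illustrative only, not the diabetes dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0

intercept, coef = fit_lstsq(X_demo, y_demo)

# Normal-equation solution on the same data, for comparison
X_b = np.c_[np.ones(X_demo.shape[0]), X_demo]
w_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_demo

print("lstsq weights:          ", np.r_[intercept, coef])
print("normal-equation weights:", w_normal)
```

On well-conditioned data such as this, both approaches should agree to within floating-point tolerance.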

In [13]:
diabetes = load_diabetes()
diabetes

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990749, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06833155, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286131, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04688253,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452873, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00422151,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

In [14]:
X = diabetes.data   # shape (442, 10)
y = diabetes.target # shape (442,)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Create LinearRegression instance
model = LinearRegression()
# Fit the model on training data
model.fit(X_train, y_train)

RMSE loss on training data: 53.55884336723094


In [21]:
y_pred = model.predict(X_test)
print("Predictions on test set:", y_pred[:10])  # print first 10 predictions

Predictions on test set: [139.5475584  179.51720835 134.03875572 291.41702925 123.78965872
  92.1723465  258.23238899 181.33732057  90.22411311 108.63375858]


In [19]:
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

Intercept: 151.34560453986003
Coefficients: [  37.90402135 -241.96436231  542.42875852  347.70384391 -931.48884588
  518.06227698  163.41998299  275.31790158  736.1988589    48.67065743]


In [20]:
r2 = model.score(X_test, y_test)
print("R^2 score on test set:", r2)

R^2 score on test set: 0.4526027629719199
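
The training error above is reported as RMSE while the test result is reported as R^2, so the two numbers are not directly comparable. A minimal sketch of computing both metrics from plain arrays follows. The helper names `rmse` and `r2_manual` and the toy arrays are assumptions for illustration, not part of the notebook's class.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_manual(y_true, y_pred):
    """R^2: one minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Toy values chosen so the arithmetic is easy to check by hand
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

demo_rmse = rmse(y_true_demo, y_pred_demo)
demo_r2 = r2_manual(y_true_demo, y_pred_demo)

# Predicting the mean everywhere gives R^2 = 0, the usual reference point
mean_pred = np.full_like(y_true_demo, y_true_demo.mean())
baseline_r2 = r2_manual(y_true_demo, mean_pred)

print("RMSE:", demo_rmse)
print("R^2:", demo_r2)
print("R^2 of mean predictor:", baseline_r2)
```

With the notebook's own variables, the equivalent call for a held-out RMSE would be `rmse(y_test, model.predict(X_test))`.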
