<a href="https://colab.research.google.com/github/olcaykursun/ML/blob/main/linear_regression_formulation_covariance_matrix_relationship.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Below is a summary of the steps that lead from the residual sum of squares (RSS) to the matrix form of multiple linear regression with two features $ x_1 $ and $ x_2 $:

1. **RSS Formula**:
$$ \text{RSS} = \sum_{i=1}^{n} (y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2}))^2 $$

2. **Expanding Squares**:
Expand the squared term in RSS so that each coefficient appears explicitly.
$$ \text{RSS} = \sum_{i=1}^{n} (y_i^2 - 2y_i(w_0 + w_1 x_{i1} + w_2 x_{i2}) + (w_0 + w_1 x_{i1} + w_2 x_{i2})^2 ) $$

3. **Minimizing RSS**:
Take the partial derivative of RSS with respect to each coefficient $ w_0, w_1, w_2 $, set it to zero, and solve the resulting equations. For example, for $ w_1 $:
$$ \frac{\partial \text{RSS}}{\partial w_1} = -2 \sum_{i=1}^{n} x_{i1}(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2})) = 0 $$

4. **Matrix Form**:
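The term $ \sum_{i=1}^{n} x_{i1}(y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2})) $ can be written in matrix form as $ \mathbf{x}_1^T (\mathbf{y} - \mathbf{Xw}) $, where $ \mathbf{X} $ is the $ n \times 3 $ matrix whose columns are a column of ones, $ \mathbf{x}_1 $, and $ \mathbf{x}_2 $, and $ \mathbf{w} = (w_0, w_1, w_2)^T $. The derivatives with respect to $ w_0 $ and $ w_2 $ give the same expression with the column of ones and $ \mathbf{x}_2 $ in place of $ \mathbf{x}_1 $. Stacking the three conditions gives $ \mathbf{X}^T (\mathbf{y} - \mathbf{Xw}) = \mathbf{0} $, which rearranges to the normal equations: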
$$ \mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y} $$

5. **Solving for $ w $**:
$$ \mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$

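This sequence takes us from the RSS formula to the normal equations in matrix form. The link to the covariance matrix is that, when the columns of $ \mathbf{X} $ are mean-centered, $ \mathbf{X}^T \mathbf{X} / n $ is the covariance matrix of the features. The matrix form provides a compact way to represent and solve multiple linear regression.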
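The steps above can be checked numerically. The following is a minimal sketch on synthetic data (the data, the random seed, and the `_demo` names are assumptions made for illustration and are not part of the notebook's dataset). It solves the normal equations and confirms that the gradient of the RSS, $ -2\mathbf{X}^T(\mathbf{y} - \mathbf{Xw}) $, vanishes at the solution.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the notebook's dataset)
rng = np.random.default_rng(0)
n = 50
features_demo = rng.normal(size=(n, 2))  # columns play the role of x_1 and x_2
y_demo = (3.0 + 1.5 * features_demo[:, 0] - 2.0 * features_demo[:, 1]
          + rng.normal(scale=0.1, size=n))

# Design matrix with a leading column of ones for the intercept w_0
X_demo = np.hstack([np.ones((n, 1)), features_demo])

# Solve the normal equations X^T X w = X^T y without forming an explicit inverse
w_demo = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Gradient of the RSS at the solution: -2 X^T (y - Xw); zero up to round-off
gradient_demo = -2 * X_demo.T @ (y_demo - X_demo @ w_demo)
print(np.allclose(gradient_demo, 0.0, atol=1e-8))
```

Using `np.linalg.solve` here is a design choice: it solves the same system as the explicit inverse in step 5 but is generally better behaved numerically.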

Note that the term "normal equations" [1] refers to the system of equations obtained by setting the gradient of the residual sum of squares (RSS) to zero. For linear regression, solving this system gives a closed-form expression for the weights that minimize the RSS (and therefore the MSE, which is the RSS divided by $ n $), provided $ \mathbf{X}^T \mathbf{X} $ is invertible.

Also note that a column of ones must be added to $ \mathbf{X} $ for the intercept term. The intercept $ w_0 $ appears in the RSS formula because we want the value of the intercept that minimizes the RSS, along with the values of the other coefficients. With a column of ones in $ \mathbf{X} $, the intercept becomes one more entry of $ \mathbf{w} $, so all of these terms are solved for simultaneously (the same technique is used for the bias term in neural networks, see for example [2]).
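The role of the column of ones can be illustrated with a small sketch on synthetic data (the data and all variable names below are assumptions for illustration). It fits the same data with and without the column of ones using `np.linalg.lstsq`. With the column, the normal equation for $ w_0 $ forces the residuals to sum to zero. Without it, the fitted plane is forced through the origin and the RSS is larger.

```python
import numpy as np

# Synthetic data with a non-zero intercept (an assumption for illustration)
rng = np.random.default_rng(1)
n = 40
feats = rng.normal(size=(n, 2))
target = 5.0 + 2.0 * feats[:, 0] + 0.5 * feats[:, 1] + rng.normal(scale=0.1, size=n)

# Fit with a column of ones: the first weight is the intercept
X_with = np.hstack([np.ones((n, 1)), feats])
w_with, *_ = np.linalg.lstsq(X_with, target, rcond=None)

# Fit without the column of ones: the model has no intercept
w_without, *_ = np.linalg.lstsq(feats, target, rcond=None)

resid_with = target - X_with @ w_with
resid_without = target - feats @ w_without

# The w_0 normal equation is 1^T (y - Xw) = 0, so the residuals sum to zero
print(abs(resid_with.sum()) < 1e-8)
# Dropping the intercept can only increase the RSS
print(np.sum(resid_with**2) < np.sum(resid_without**2))
```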

[1] https://mathworld.wolfram.com/NormalEquation.html

[2] https://www.cmpe.boun.edu.tr/~ethem/i2ml3e/3e_v1-0/i2ml3e-chap11.pdf (Slides 5 and 6)

In [5]:
!pwd
%cd /content/drive/MyDrive/Colab\ Notebooks/ML/
!pwd
import numpy as np
import pandas as pd

df = pd.read_csv('grades_dataset.csv')
df.head()

X = df.drop(columns='Final_Grade').values
y = df['Final_Grade'].values

# Add a column of ones to X for the intercept term
X = np.hstack([np.ones((X.shape[0], 1)), X])

# Calculate the weights (coefficients) using the normal equation
weights = np.linalg.inv(X.T @ X) @ X.T @ y

# The first element in 'weights' is the intercept, and the others are the coefficients
print(f"Intercept: {weights[0]}")
print(f"Coefficients: {weights[1:]}")


/content
/content/drive/MyDrive/Colab Notebooks/ML
/content/drive/MyDrive/Colab Notebooks/ML
Intercept: 11.7450528577571
Coefficients: [0.6889611  0.19356177]


In [17]:
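# X still contains the column of ones from the previous cell; centering turns it into zeros
X_centered = X - np.mean(X, axis=0)
N = X.shape[0]
# Population covariance (ddof=0) of the columns of X
C = np.cov(X, rowvar=False, ddof=0)
# Check that C equals X_centered^T X_centered / N
np.all(np.isclose(C,(X_centered.T @ X_centered)/N))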

True
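The check above ties the covariance matrix to $ \mathbf{X}^T \mathbf{X} $ for centered data. As a hedged sketch of how this connects back to the regression weights, the example below uses synthetic data (the data and all variable names are assumptions for illustration, and only the feature columns are used, without the column of ones). It compares the slopes from the normal equations with the slopes obtained from the feature covariance matrix and the feature-target covariances, and recovers the intercept from the means.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the notebook's dataset)
rng = np.random.default_rng(2)
n = 60
F = rng.normal(size=(n, 2))
t = 1.0 + 0.7 * F[:, 0] + 0.2 * F[:, 1] + rng.normal(scale=0.1, size=n)

# Route 1: normal equations on the design matrix with a column of ones
A = np.hstack([np.ones((n, 1)), F])
w_full = np.linalg.solve(A.T @ A, A.T @ t)

# Route 2: covariances of the centered data
F_c = F - F.mean(axis=0)
t_c = t - t.mean()
C_xx = (F_c.T @ F_c) / n   # feature covariance matrix (ddof=0)
c_xy = (F_c.T @ t_c) / n   # covariance of each feature with the target
slopes = np.linalg.solve(C_xx, c_xy)

# The intercept follows from the means once the slopes are known
intercept = t.mean() - F.mean(axis=0) @ slopes

print(np.allclose(w_full[1:], slopes), np.isclose(w_full[0], intercept))
```

The factor $ 1/n $ cancels between `C_xx` and `c_xy`, so the same slopes result with `ddof=1` as long as both covariances use the same normalization.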