# *`02: Multiple Linear Regression`*

## What is Multiple Linear Regression?
Multiple Linear Regression (MLR) is a statistical technique that uses *multiple independent variables* to predict the outcome of a dependent variable. It is an extension of Simple Linear Regression.

* Multiple regression extends simple linear (OLS) regression, which uses just one explanatory variable.
* MLR is used extensively in econometrics and financial inference.

\
\
**Formula for Multiple Linear Regression**:

$$
y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n
$$

The above formula can also be written as:

$$
y_i = \beta_0 + \sum_{j=1}^{n} \beta_j X_j
$$

```
where,

 i = index of the observation

 yi = dependent variable

 Xj = explanatory variables (j = 1, ..., n)

 β0 = y-intercept (constant term)

 βj = slope coefficient for each explanatory variable
```
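
As a quick illustration of the formula, the sketch below evaluates it for a single observation with three explanatory variables. The coefficient and feature values are made up for the example; they are not estimated from any data.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # slope coefficients beta_1..beta_3
x = np.array([4.0, 2.0, 1.0])       # explanatory variables X_1..X_3

# y = beta_0 + sum_j beta_j * X_j
y = beta_0 + np.sum(beta * x)
print(y)  # 5.0
```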

\
### Understanding MLR
\
To get a better idea of Multiple Linear Regression, let's take an example with only a few columns.

Consider the data below, where each student has set aside a certain number of hours per day for studying:


| | No. of Hours | IQ | Marks (out of 100) |
| --- | --- | --- | --- |
| stud1 | 4 | 100 | 93.4 |
| stud2 | 2 | 100 | 76.5 |
| stud3 | 6 | 100 | 85.2 |

\
The equation for this data would be as follows:

\
$$
Marks_i = \beta_0 + \beta_1 \cdot \text{no. of hours} + \beta_2 \cdot \text{IQ} \qquad (1)
$$

\
`student1` gets the highest marks after studying for only 4 hours, whereas `student3` studied for 6 hours and scored lower, even though all the students have the same IQ. Because IQ is constant, `no. of hours` is the only column left to explain the differences in marks, and on its own it cannot explain them.

\
Now consider the data below, where we have added a `no. of days` column that represents how many days before the exam each student started preparing:

| | No. of days | No. of Hours | IQ | Marks (out of 100) |
| --- | --- | --- | --- | --- |
| stud1 | 9 | 4 | 100 | 93.4 |
| stud2 | 5 | 2 | 100 | 76.5 |
| stud3 | 1 | 6 | 100 | 85.2 |

\
The equation for this data would be as follows:

\
$$
Marks_i = \beta_0 + \beta_1 \cdot \text{no. of days} + \beta_2 \cdot \text{no. of hours} + \beta_3 \cdot \text{IQ} \qquad (2)
$$

\
>In the above data, all the students have the same IQ, yet `student1` got the best marks. Also, `student3` studied for 6 hours starting just one day before the exam and earned better marks than `student2`, who started preparing 5 days before the exam. The influence of each column is represented by its $\beta$ coefficient.

***A larger absolute value of $\beta$ for a variable means that it contributes more towards predicting the outcome, provided the variables are measured on comparable scales.***
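
To see why the scale of a variable matters when comparing $\beta$ values, the sketch below uses made-up coefficients and expresses study time in minutes instead of hours. The coefficient shrinks by a factor of 60, while the predictions stay the same.

```python
import numpy as np

# Hypothetical coefficients: intercept, beta for hours, beta for IQ
beta_0, beta_hours, beta_iq = 10.0, 6.0, 0.5

hours = np.array([4.0, 2.0, 6.0])
iq = np.array([100.0, 100.0, 100.0])

pred_hours = beta_0 + beta_hours * hours + beta_iq * iq

# Same model with study time measured in minutes
minutes = hours * 60
beta_minutes = beta_hours / 60
pred_minutes = beta_0 + beta_minutes * minutes + beta_iq * iq

print(beta_hours, beta_minutes)                # 6.0 0.1
print(np.allclose(pred_hours, pred_minutes))   # True
```

This is why coefficients are usually compared only after the features have been put on a common scale, for example by standardizing them.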

\
This data shows that different features contribute differently to predicting the final value. A single attribute may not be able to explain the whole output by itself. So the algorithm tries to find the best values of the **`β`** coefficients, and that is what the regression problem aims to solve.

Let's calculate for equation $(2)$:

\
$$
y_{hat} = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_3
$$

\
where $X_1$ = no. of days, $X_2$ = no. of hours, $X_3$ = IQ

\
$y_{hat}$ will be calculated for each student. Let's represent it in matrix form as follows:

$$
y_{hat} = X \beta
$$

$y_{hat}$ for each student will be represented as follows:

$$
y_{hat_1} = X_{11}\beta_1 + X_{12}\beta_2 + X_{13}\beta_3 + \beta_0
$$
$$
y_{hat_2} = X_{21}\beta_1 + X_{22}\beta_2 + X_{23}\beta_3 + \beta_0
$$
$$
y_{hat_3} = X_{31}\beta_1 + X_{32}\beta_2 + X_{33}\beta_3 + \beta_0
$$

\
And hence it can be represented as follows:

$$
y_{hat} = \begin{bmatrix}
y_{hat_1}  \\
y_{hat_2}  \\
y_{hat_3}
\end{bmatrix}_{(3*1)}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & X_{13} \\
1 & X_{21} & X_{22} & X_{23} \\
1 & X_{31} & X_{32} & X_{33}
\end{bmatrix}_{(3*4)}
*
\begin{bmatrix}
\beta_0  \\
\beta_1  \\
\beta_2  \\
\beta_3
\end{bmatrix}_{(4 * 1)}
$$

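The dimensions of this matrix product work out as follows:
$(3 \times 4) \cdot (4 \times 1) = (3 \times 1)$

The column of ones is added to the $X$ matrix so that the intercept $\beta_0$ is included in the product. This reproduces the relationship below, in which $X$ and $\beta$ on the right-hand side contain only the feature columns and the slope coefficients:

\
$y_{hat} = \beta_0 + [X.\beta]$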
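
The matrix form can be checked numerically with NumPy. The sketch below builds the $3 \times 4$ design matrix from the student table (a column of ones followed by no. of days, no. of hours, and IQ) and multiplies it by a coefficient vector. The $\beta$ values are placeholders picked for illustration, not fitted coefficients.

```python
import numpy as np

# Features from the table: no. of days, no. of hours, IQ
X = np.array([
    [9, 4, 100],
    [5, 2, 100],
    [1, 6, 100],
], dtype=float)

# Placeholder coefficients [beta_0, beta_1, beta_2, beta_3] (illustrative only)
beta = np.array([20.0, 1.5, 2.0, 0.5])

# Prepend a column of ones so beta_0 is picked up by the matrix product
X_design = np.hstack((np.ones((X.shape[0], 1)), X))

y_hat = X_design @ beta                   # (3, 4) @ (4,) -> (3,)
y_hat_explicit = beta[0] + X @ beta[1:]   # beta_0 + X . beta

print(X_design.shape, beta.shape, y_hat.shape)   # (3, 4) (4,) (3,)
print(np.allclose(y_hat, y_hat_explicit))        # True
```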

\
We got the above equations for 3 independent variables. Similarly, we can extend this to `m` features and `n` rows as follows:

$$
y_{hat} = \begin{bmatrix}
y_{hat_1}  \\
y_{hat_2}  \\
. \\
. \\
. \\
y_{hat_n}
\end{bmatrix}_{(n*1)}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & ... & X_{1m} \\
1 & X_{21} & X_{22} & ... & X_{2m} \\
. & . & . & . & . \\
. & . & . & . & .  \\
. & . & . & . & .  \\
1 & X_{n1} & X_{n2} & ... & X_{nm}
\end{bmatrix}_{(n*(m+1))}
*
\begin{bmatrix}
\beta_0  \\
\beta_1  \\
\beta_2  \\
. \\
. \\
. \\
\beta_m
\end{bmatrix}_{((m+1) * 1)}
$$

Now that we have the final equation, let's try to replicate it by building our own Multiple Linear Regression class.

## Making Custom Multiple Linear Regression Class

In [1]:
# Let's build a class for Multiple Linear Regression
import numpy as np


class MultipleLinearRegression:
    def __init__(self):
        self.coef_ = None       # slope coefficients β1..βm
        self.intercept_ = None  # bias term β0

    def fit(self, x, y):
        # Prepend a column of ones to x so that β0 is estimated along with the slopes
        ones_column = np.ones((x.shape[0], 1))
        x = np.hstack((ones_column, x))

        # Calculate the coefficients with the normal equation: β = (XᵀX)⁻¹ Xᵀ y
        beta_coef = np.linalg.inv(x.T.dot(x)).dot(x.T).dot(y)
        self.intercept_ = beta_coef[0]  # first value of beta_coef is the bias term β0
        self.coef_ = beta_coef[1:]

    def predict(self, x):
        # Predict y for the rows of x using the fitted coefficients
        y_pred = np.dot(x, self.coef_) + self.intercept_
        return y_pred
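
The expression used in `fit` is the closed-form ordinary least squares solution, often called the normal equation. A brief derivation follows, writing $X$ for the design matrix that includes the column of ones and assuming $X^T X$ is invertible. The loss symbol $L$ is introduced here only for the derivation.

$$
\begin{aligned}
L(\beta) &= (y - X\beta)^T (y - X\beta) \\
\frac{\partial L}{\partial \beta} &= -2 X^T (y - X\beta) = 0 \\
X^T X \beta &= X^T y \\
\beta &= (X^T X)^{-1} X^T y
\end{aligned}
$$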

In [4]:
# Import necessary libraries
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Make regression data

In [15]:
# make a regression dataset using 10 features
X, y = make_regression(n_samples = 500,
                       n_features = 10,
                       n_informative = 7,
                       n_targets = 1,
                       noise = 1)

In [16]:
X.shape, y.shape

((500, 10), (500,))
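
Conceptually, a dataset like this is a random feature matrix multiplied by a coefficient vector, plus Gaussian noise, where only `n_informative` of the coefficients are non-zero. The NumPy sketch below imitates that idea. It is an approximation for intuition, not scikit-learn's actual implementation, and all names in it (`X_demo`, `y_demo`, `coef`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 500, 10, 7

# Random features
X_demo = rng.normal(size=(n_samples, n_features))

# Assumption for this sketch: only the first n_informative coefficients are non-zero
coef = np.zeros(n_features)
coef[:n_informative] = rng.uniform(0, 100, size=n_informative)

# Target = linear combination of the features + Gaussian noise (std = 1)
y_demo = X_demo @ coef + rng.normal(scale=1.0, size=n_samples)

print(X_demo.shape, y_demo.shape)   # (500, 10) (500,)
```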

## Scikit-Learn: LinearRegression class

In [17]:
# Set the random seed (the OLS fit itself is deterministic, so this does not change the result)
np.random.seed(42)

# Instantiate an object of Linear Regression model
LR = LinearRegression()

# Fit the model
LR.fit(X, y)

# Make predictions on the model
y_preds = LR.predict(X)

In [18]:
# Calculate r2 score
from sklearn.metrics import r2_score
r2_score(y_true = y, y_pred = y_preds)

0.9999166546936376
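
For reference, the $R^2$ score is defined as $1 - SS_{res}/SS_{tot}$, so a value close to 1 means the predictions account for almost all of the variation in `y`. The sketch below computes it by hand with NumPy on a small made-up example. The helper name `r2_manual` and the numbers are illustrative only.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(r2_manual(y_true, y_pred))
```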

## Custom Linear Regression class

In [19]:
# Set the random seed (the normal-equation fit is deterministic, so this does not change the result)
np.random.seed(42)

# Instantiate an object of Linear Regression model
my_LR = MultipleLinearRegression()

# Fit the model
my_LR.fit(X, y)

# Make predictions on the model
y_preds_my_lr = my_LR.predict(X)

In [20]:
# Calculate r2 Score
r2_score(y, y_preds_my_lr)

0.9999166546936376
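
The two classes give the same score because both solve the same least-squares problem. As a hedged aside, explicitly inverting $X^T X$ can become numerically unstable when features are strongly correlated, and `np.linalg.lstsq` is one alternative that avoids forming the inverse. The sketch below uses synthetic data generated with NumPy (not the dataset above) to show that both routes recover the same coefficients on a well-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 features, known coefficients plus small noise
X = rng.normal(size=(200, 3))
true_beta = np.array([4.0, 1.0, -2.0, 0.5])   # [intercept, slope_1, slope_2, slope_3]
X_design = np.hstack((np.ones((X.shape[0], 1)), X))
y = X_design @ true_beta + rng.normal(scale=0.1, size=200)

# Route 1: normal equation with an explicit inverse (as in the custom class)
beta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

# Route 2: least-squares solver, no explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))   # True
```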