# *`02: Multiple Linear Regression`*

## What is Multiple Linear Regression?
Multiple Linear Regression (MLR) is a statistical technique that uses *multiple independent variables* to predict the outcome of a dependent variable. It is an extension of Simple Linear Regression.

* Multiple regression extends simple linear (OLS) regression, which uses just one explanatory variable.
* MLR is used extensively in econometrics and financial inference.

\
\
**Formula for Multiple Linear Regression**:

$$
y_i = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n
$$

The above formula can also be written as:

$$
y_i = \beta_0 + \sum_{j=1}^{n} \beta_j X_j
$$

```
where,

 i = index of the observation

 yi = dependent variable

 X1 ... Xn = explanatory variables

 β0 = y-intercept (constant term)

 β1 ... βn = slope coefficients for each explanatory variable
```
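
As a minimal sketch of the formula, the snippet below evaluates it for a single observation with NumPy. The coefficient and feature values are made up purely for illustration.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula
beta_0 = 2.0                        # intercept
betas = np.array([0.5, -1.0, 3.0])  # slope coefficients beta_1..beta_3
x = np.array([4.0, 2.0, 1.0])       # explanatory variables X_1..X_3

# y = beta_0 + sum_j beta_j * X_j
y = beta_0 + np.sum(betas * x)
print(y)
```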

\
### Understanding MLR
\
To get a better idea about the Multiple Linear Regression, let's take an example with fewer columns.

Consider the data below, where each student has set aside a certain number of hours per day for studying:


| |No. of Hours | IQ | Marks(out of 100) |
| --- | --- | --- | --- |
| stud1 | 4 | 100 | 93.4 |
| stud2 | 2 | 100 | 76.5 |
| stud3 | 6 | 100 | 85.2 |

\
The equation for this data would be as follows:

\
$$
\text{Marks}_i = \beta_0 + \beta_1 \cdot \text{no. of hours} + \beta_2 \cdot \text{IQ} \qquad (1)
$$

\
`student1` gets the highest marks after studying for only 4 hours, whereas `student3` studied for 6 hours, and all the students have the same IQ. Because IQ is the same for everyone, we can compare the students only on the `No. of Hours` factor: it is the only column available to explain the results.

\
Now consider the data below, where we have added a `No. of days` column representing how many days before the exam each student started preparing:

| | No. of days | No. of Hours | IQ | Marks(out of 100) |
| --- | --- | --- | --- | --- |
| stud1 | 9 | 4 | 100 | 93.4 |
| stud2 | 5 | 2 | 100 | 76.5 |
| stud3 | 1 | 6 | 100 | 85.2 |

\
The equation for this data would be as follows:

\
$$
\text{Marks}_i = \beta_0 + \beta_1 \cdot \text{no. of days} + \beta_2 \cdot \text{no. of hours} + \beta_3 \cdot \text{IQ} \qquad (2)
$$

\
>In the above data, although all the students have the same IQ, `student1` got the best marks. Also, `student3` started just one day before the exam, studying 6 hours, and still earned better marks than `student2`, who started preparing 5 days before the exam. The influence of each column is represented by the $\beta$ coefficients.

***A larger absolute value of $\beta$ for a variable means that it contributes more towards the predicted outcome, provided the variables are on comparable scales.***
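
One caveat is worth checking before comparing $\beta$ values: a coefficient's size depends on the unit its feature is measured in. The sketch below uses synthetic data (not the student table) and a hypothetical helper `fit_slope` to fit the same relationship with study time expressed in hours and then in minutes.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.uniform(1, 8, size=50)                   # synthetic feature, in hours
marks = 40 + 5 * hours + rng.normal(0, 1, size=50)   # synthetic target

def fit_slope(feature, target):
    # Least-squares fit of target = b0 + b1 * feature; returns b1
    X = np.column_stack([np.ones_like(feature), feature])
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    return b[1]

slope_hours = fit_slope(hours, marks)
slope_minutes = fit_slope(hours * 60, marks)  # same information, different unit

print(slope_hours, slope_minutes)
```

The slope for minutes is 60 times smaller than the slope for hours, even though both models make the same predictions.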

\
This data shows that different features contribute differently to the final value. A single attribute may not be able to explain the whole output by itself. The algorithm therefore tries to find the best values of the **`β`** coefficients, and that is what the regression problem aims to solve.

Let's write out the prediction for equation $(2)$:

\
$$
y_{hat} = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_3
$$

\
where $X_1$ = no. of days, $X_2$ = no. of hours, $X_3$ = IQ

\
$y_{hat}$ is calculated for each student, so let's represent it in matrix form:

$$
y_{hat} = X \beta
$$

$y_{hat}$ for each student will be represented as follows:

$$
y_{hat_1} = X_{11}\beta_1 + X_{12}\beta_2 + X_{13}\beta_3 + \beta_0
$$
$$
y_{hat_2} = X_{21}\beta_1 + X_{22}\beta_2 + X_{23}\beta_3 + \beta_0
$$
$$
y_{hat_3} = X_{31}\beta_1 + X_{32}\beta_2 + X_{33}\beta_3 + \beta_0
$$

\
And hence it can be represented as follows:

$$
y_{hat} = \begin{bmatrix}
y_{hat_1}  \\
y_{hat_2}  \\
y_{hat_3}
\end{bmatrix}_{(3*1)}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & X_{13} \\
1 & X_{21} & X_{22} & X_{23} \\
1 & X_{31} & X_{32} & X_{33}
\end{bmatrix}_{(3*4)}
*
\begin{bmatrix}
\beta_0  \\
\beta_1  \\
\beta_2  \\
\beta_3
\end{bmatrix}_{(4 * 1)}
$$

The dimensions of the matrix product work out as follows:
$(3*4) . (4*1) = (3*1)$

The column of 1s is added to the $X$ matrix so that the product also picks up the intercept $\beta_0$, which gives the relationship:

\
$y_{hat} = \beta_0 + [X.\beta]$
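
The same shape arithmetic can be checked in NumPy using the three students from the second table. The coefficient values here are hypothetical placeholders, not fitted values (with three rows and four unknowns, this tiny table could not determine them uniquely anyway).

```python
import numpy as np

# Feature columns from the second table: no. of days, no. of hours, IQ
features = np.array([
    [9, 4, 100],
    [5, 2, 100],
    [1, 6, 100],
], dtype=float)

# Prepend a column of ones for the intercept -> shape (3, 4)
X = np.hstack([np.ones((features.shape[0], 1)), features])

# Hypothetical coefficients beta_0..beta_3 (placeholders) -> shape (4, 1)
beta = np.array([[10.0], [1.0], [2.0], [0.5]])

# Matrix product: (3, 4) @ (4, 1) -> (3, 1)
y_hat = X @ beta
print(X.shape, beta.shape, y_hat.shape)
```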

\
We got the above equations for 3 independent variables. Similarly, we can extend this to `m` features and `n` rows as follows:

$$
y_{hat} = \begin{bmatrix}
y_{hat_1}  \\
y_{hat_2}  \\
. \\
. \\
. \\
y_{hat_n}
\end{bmatrix}_{(n*1)}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & ... & X_{1m} \\
1 & X_{21} & X_{22} & ... & X_{2m} \\
. & . & . & . & . \\
. & . & . & . & .  \\
. & . & . & . & .  \\
1 & X_{n1} & X_{n2} & ... & X_{nm}
\end{bmatrix}_{(n*(m+1))}
*
\begin{bmatrix}
\beta_0  \\
\beta_1  \\
\beta_2  \\
. \\
. \\
. \\
\beta_m
\end{bmatrix}_{((m+1) * 1)}
$$

Now that we have the final equation, let's try to replicate it by building our own Multiple Linear Regression class.

## Making Custom Multiple Linear Regression Class
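
The `fit` method of the class below computes the coefficients in closed form. As a brief sketch of where that expression comes from (the standard ordinary least squares argument, assuming $X^TX$ is invertible), we minimise the sum of squared residuals $L(\beta)$ and set its gradient to zero:

$$
\begin{aligned}
L(\beta) &= (y - X\beta)^T (y - X\beta) \\
\frac{\partial L}{\partial \beta} &= -2X^T(y - X\beta) = 0 \\
X^TX\beta &= X^Ty \\
\beta &= (X^TX)^{-1}X^Ty
\end{aligned}
$$

Here $X$ is the matrix with the leading column of 1s, so the first entry of $\beta$ is the intercept $\beta_0$.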

In [11]:
# Let's build a class for Multiple Linear Regression
import numpy as np


class MultipleLinearRegression:
    def __init__(self):
        self.coef_ = None
        self.intercept_ = None

    def fit(self, x, y):
        # Prepend a column of ones to x so that the first coefficient is the intercept
        ones_column = np.ones((x.shape[0], 1))
        x = np.hstack((ones_column, x))

        # Solve the normal equation: beta = (X^T X)^-1 X^T y
        beta_coef = np.linalg.inv(x.T.dot(x)).dot(x.T).dot(y)
        self.intercept_ = beta_coef[0]  # first value of beta_coef is the bias term β0
        self.coef_ = beta_coef[1:]      # remaining values are the slopes β1..βm

    def predict(self, x):
        # Predict with the fitted coefficients: y_hat = x . coef_ + intercept_
        y_pred = np.dot(x, self.coef_) + self.intercept_
        return y_pred
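
Inverting $X^TX$ explicitly can be numerically fragile when features are strongly correlated. As a hedged alternative (not what the class above does), the sketch below solves the same least-squares problem with `np.linalg.lstsq` and checks on synthetic data that both routes agree. The function names `add_ones`, `fit_normal_equation` and `fit_lstsq` are illustrative only.

```python
import numpy as np

def add_ones(x):
    # Prepend a column of ones so the first coefficient is the intercept
    return np.hstack([np.ones((x.shape[0], 1)), x])

def fit_normal_equation(x, y):
    # Closed form used by the class above: (X^T X)^-1 X^T y
    X = add_ones(x)
    return np.linalg.inv(X.T @ X) @ X.T @ y

def fit_lstsq(x, y):
    # Same least-squares problem, solved without forming an explicit inverse
    X = add_ones(x)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic data: y = 1 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(42)
x = rng.normal(size=(100, 2))
y = 1 + 2 * x[:, 0] - 3 * x[:, 1] + rng.normal(scale=0.01, size=100)

beta_ne = fit_normal_equation(x, y)
beta_ls = fit_lstsq(x, y)
print(np.allclose(beta_ne, beta_ls))
```

On well-conditioned data like this the two approaches should match to floating-point precision; the difference only matters when $X^TX$ is close to singular.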

In [1]:
# Import necessary libraries
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Get the data

In [3]:
# Load the diabetes dataset from sklearn.datasets
diabetes_data = datasets.load_diabetes()

# Turn the dataset into a pandas dataframe
data = pd.DataFrame(diabetes_data['data'], columns = diabetes_data['feature_names'])
data['target'] = diabetes_data['target']
data.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


In [39]:
data.shape

(442, 11)

In [38]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB


In [40]:
data.describe()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 |
| mean | -2.511817e-19 | 1.23079e-17 | -2.245564e-16 | -4.79757e-17 | -1.381499e-17 | 3.918434e-17 | -5.777179e-18 | -9.04254e-18 | 9.293722e-17 | 1.130318e-17 | 152.133484 |
| std | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 77.093005 |
| min | -0.1072256 | -0.04464164 | -0.0902753 | -0.1123988 | -0.1267807 | -0.1156131 | -0.1023071 | -0.0763945 | -0.1260971 | -0.1377672 | 25.0 |
| 25% | -0.03729927 | -0.04464164 | -0.03422907 | -0.03665608 | -0.03424784 | -0.0303584 | -0.03511716 | -0.03949338 | -0.03324559 | -0.03317903 | 87.0 |
| 50% | 0.00538306 | -0.04464164 | -0.007283766 | -0.005670422 | -0.004320866 | -0.003819065 | -0.006584468 | -0.002592262 | -0.001947171 | -0.001077698 | 140.5 |
| 75% | 0.03807591 | 0.05068012 | 0.03124802 | 0.03564379 | 0.02835801 | 0.02984439 | 0.0293115 | 0.03430886 | 0.03243232 | 0.02791705 | 211.5 |
| max | 0.1107267 | 0.05068012 | 0.1705552 | 0.1320436 | 0.1539137 | 0.198788 | 0.1811791 | 0.1852344 | 0.1335973 | 0.1356118 | 346.0 |


In [5]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split as split
X_train, X_test, y_train, y_test = split(data.iloc[:,:-1],
                                         data.iloc[:,-1],
                                         test_size = 0.2,
                                         random_state = 42)
len(X_train), len(X_test), len(y_train), len(y_test)

(353, 89, 353, 89)

## Scikit-Learn: LinearRegression class

In [6]:
# set random seed for reproducibility
np.random.seed(42)

# Instantiate an object of Linear Regression model
LR = LinearRegression()

# Fit the model
LR.fit(X_train, y_train)

# Make predictions on the test set
y_preds = LR.predict(X_test)

In [8]:
# Calculate r2 score
from sklearn.metrics import r2_score
r2_score(y_true = y_test, y_pred = y_preds)

0.4526027629719195
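
For reference, the score reported here follows the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$. A minimal NumPy sketch of that definition is shown below; the helper name `r2_manual` and the toy numbers are illustrative, not taken from the diabetes data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Toy values for illustration only
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 9.5]
print(r2_manual(y_true_toy, y_pred_toy))
```

Applied to `y_test` and `y_preds`, a helper like this would be expected to agree with `r2_score` up to floating-point rounding.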

In [31]:
# Check the coef and intercept values
beta_values = pd.DataFrame(np.array(LR.coef_).reshape(1,10), columns = X_train.columns)
beta_values['intercept'] = LR.intercept_
beta_values

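| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | intercept |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 37.904021 | -241.964362 | 542.428759 | 347.703844 | -931.488846 | 518.062277 | 163.419983 | 275.317902 | 736.198859 | 48.670657 | 151.345605 |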


## Custom Linear Regression class

In [32]:
# set random seed for reproducibility
np.random.seed(42)

# Instantiate an object of Linear Regression model
my_LR = MultipleLinearRegression()

# Fit the model
my_LR.fit(X_train, y_train)

# Make predictions on the test set
y_preds_my_lr = my_LR.predict(X_test)

In [33]:
# Calculate r2 Score
r2_score(y_test, y_preds_my_lr)

0.4526027629719199

In [34]:
# Check the coef and intercept values
beta_values_my_lr = pd.DataFrame(np.array(my_LR.coef_).reshape(1,10), columns = X_train.columns)
beta_values_my_lr['intercept'] = my_LR.intercept_
beta_values_my_lr

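| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | intercept |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 37.904021 | -241.964362 | 542.428759 | 347.703844 | -931.488846 | 518.062277 | 163.419983 | 275.317902 | 736.198859 | 48.670657 | 151.345605 |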


In [37]:
## Compare the models
pd.concat([beta_values, beta_values_my_lr])

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | intercept |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 37.904021 | -241.964362 | 542.428759 | 347.703844 | -931.488846 | 518.062277 | 163.419983 | 275.317902 | 736.198859 | 48.670657 | 151.345605 |
| 0 | 37.904021 | -241.964362 | 542.428759 | 347.703844 | -931.488846 | 518.062277 | 163.419983 | 275.317902 | 736.198859 | 48.670657 | 151.345605 |
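
The two rows look identical at the printed precision. One way to confirm agreement programmatically is `np.allclose`. The sketch below uses the rounded values printed above rather than the live model objects, so it only verifies agreement to six decimal places.

```python
import numpy as np

# Coefficients (age..s6) followed by the intercept, as printed in the table above
sklearn_row = np.array([37.904021, -241.964362, 542.428759, 347.703844, -931.488846,
                        518.062277, 163.419983, 275.317902, 736.198859, 48.670657,
                        151.345605])
custom_row = np.array([37.904021, -241.964362, 542.428759, 347.703844, -931.488846,
                       518.062277, 163.419983, 275.317902, 736.198859, 48.670657,
                       151.345605])

# True when every pair of values matches within a small tolerance
print(np.allclose(sklearn_row, custom_row))
```

In the notebook itself, the same check could be applied directly to `LR.coef_` and `my_LR.coef_`. Tiny differences in the last digits, like those visible in the two $R^2$ scores, are expected from floating-point arithmetic.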
