<a href="https://colab.research.google.com/github/lakshit2808/Machine-Learning-Notes/blob/master/Machine-Learning-Notes/ML_Models/Regression/Multiple_Linear_Regression/multiple_linear_regression_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

<img src='https://github.com/lakshit2808/Machine-Learning-Notes/blob/master/Resources/Images/MultiReg.jpg?raw=true' width='500'>

**Assumptions of Linear Regression:**
- Linearity
- Homoscedasticity
- Multivariate Normality
- Independence of Errors
- Lack of Multicollinearity
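One common way to check the last assumption is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below is a minimal NumPy illustration on synthetic data. The helper name `variance_inflation_factors` and the demo columns `a`, `b`, `c` are made up for this example and are not part of the notebook.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X.

    R_j^2 comes from an ordinary least-squares fit of column j
    on all remaining columns plus an intercept.
    """
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r2 = 1.0 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic demo (assumption): column c is almost a copy of column a
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.05 * rng.normal(size=300)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

Large values for `a` and `c` flag the near-duplicate pair, while `b` stays close to 1. A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, but the threshold is a convention rather than a hard rule.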

**5 Methods of Building Models:**
- **All In:**
  - Keep every available predictor in the model, for example when prior knowledge says all of them matter or when preparing for backward elimination.
- **Backward Elimination (Fastest Method):**
  1. Select the significance level to stay in the model (e.g. SL = 0.05).
  2. Fit the model with all possible predictors.
  3. Consider the predictor with the highest P value. If P > SL, go to the next step; otherwise your model is ready.
  4. Remove that predictor.
  5. Fit the model without it, then return to step 3.
- **Forward Selection:**
  1. Select the significance level to enter the model (e.g. SL = 0.05).
  2. Fit all simple regression models **Y ~ Xn** and select the one with the lowest P value.
  3. Keep this variable and fit all possible models with one extra predictor added to the ones you already have.
  4. Consider the predictor with the lowest P value. If P < SL, go to step 3; otherwise stop and keep the previous model.
- **Bidirectional Elimination:**
  1. Select the significance levels to enter and to stay in the model (e.g. SLENTER = 0.05, SLSTAY = 0.05).
  2. Perform the next step of forward selection (a new variable must have P < SLENTER to enter).
  3. Perform all steps of backward elimination (old variables must have P < SLSTAY to stay), then return to step 2.
  4. When no new variable can enter and no old variable can exit, the model is ready.
- **Score Comparison (All Possible Models):**
  1. Select a criterion of goodness of fit.
  2. Construct all possible regression models: 2^N - 1 combinations in total.
  3. Select the one with the best criterion; the model is ready.
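The backward elimination loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's own method: P values come from the usual OLS t statistics computed with NumPy and SciPy, the helper names `ols_p_values` and `backward_elimination` are hypothetical, and the data is synthetic.

```python
import numpy as np
from scipy import stats

def ols_p_values(design, y):
    """Two-sided P values of the OLS coefficients (design includes the intercept column)."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - k
    sigma2 = residuals @ residuals / dof              # unbiased noise variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)   # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2.0 * stats.t.sf(np.abs(t_stats), dof)

def backward_elimination(X, y, sl=0.05):
    """Return the column indices of X that survive backward elimination."""
    kept = list(range(X.shape[1]))
    while kept:
        design = np.column_stack([np.ones(len(y)), X[:, kept]])
        p = ols_p_values(design, y)[1:]   # ignore the intercept's P value
        worst = int(np.argmax(p))
        if p[worst] <= sl:                # every remaining predictor is significant
            break
        kept.pop(worst)                   # drop the least significant predictor and refit
    return kept

# Synthetic demo (assumption): only columns 0 and 2 actually drive y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)
selected = backward_elimination(X_demo, y_demo)
print(selected)
```

Columns 0 and 2 should always survive here because their effect is strong. A pure-noise column can occasionally survive as well, since each one has about a 5% chance of looking significant at SL = 0.05.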

## Importing the libraries

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [5]:
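# Load the data; the path assumes the CSV sits in Colab's /content directory
dataset = pd.read_csv('/content/50_Startups.csv')
dataset.head()
X = dataset.iloc[:, :-1].values  # features: every column except the last
Y = dataset.iloc[:, -1].values   # target: the last column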

In [6]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## Encoding categorical data

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical column at index 3; pass the numeric columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3
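To see where the three leading columns in the output above come from, here is a minimal NumPy sketch of one-hot encoding. It is a hand-rolled illustration, not the `OneHotEncoder` implementation, and the `states` array simply repeats the first four state labels from the dataset printout.

```python
import numpy as np

# State labels of the first four rows in the printed dataset
states = np.array(['New York', 'California', 'Florida', 'New York'])

# np.unique sorts the categories and returns each row's category index
categories, codes = np.unique(states, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

The categories are sorted alphabetically, so 'California' maps to the first column, 'Florida' to the second and 'New York' to the third, which matches the first four encoded rows shown above.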

## Splitting the dataset into the Training set and Test set

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
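Conceptually, the split shuffles the row indices and sets 20% of them aside. The sketch below is a hand-rolled illustration of that idea, not scikit-learn's implementation. It assumes 50 rows, as the file name and the ten test predictions later in the notebook suggest, and it will not reproduce the exact rows chosen by `random_state = 0`.

```python
import numpy as np

n_samples = 50      # assumed row count of the dataset
test_size = 0.2

rng = np.random.default_rng(0)
indices = rng.permutation(n_samples)        # shuffled row indices 0..49

n_test = int(round(test_size * n_samples))  # 10 rows held out for testing
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))
```

Indexing `X[train_idx]` and `X[test_idx]` (and likewise for `Y`) would then give the two subsets.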

## Training the Multiple Linear Regression model on the Training set

In [10]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
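`LinearRegression` fits an ordinary least-squares model: an intercept plus one coefficient per feature. As a rough sketch of what that means, the same kind of fit can be written with `np.linalg.lstsq` on a design matrix with a leading column of ones. The synthetic data and the names `X_demo`, `y_demo` and `true_coef` are assumptions made for this illustration.

```python
import numpy as np

# Synthetic data (assumption): y = 4 + 2*x0 - 1*x1 + 0.5*x2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the first fitted weight acts as the intercept
design = np.column_stack([np.ones(len(y_demo)), X_demo])
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept, coef = beta[0], beta[1:]

print(intercept)
print(coef)
```

The recovered values should land close to the ones used to generate the data. On a fitted scikit-learn model, the corresponding attributes are `regressor.intercept_` and `regressor.coef_`.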

## Predicting the Test set results

In [18]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Show predictions next to the true values, one test row per line
pd.DataFrame({'Predicted Value': y_pred, 'Real Value': Y_test})

| | Predicted Value | Real Value |
|---|---|---|
| 0 | 103015.201598 | 103282.38 |
| 1 | 132582.277608 | 144259.4 |
| 2 | 132447.738452 | 146121.95 |
| 3 | 71976.098513 | 77798.83 |
| 4 | 178537.482211 | 191050.39 |
| 5 | 116161.242302 | 105008.31 |
| 6 | 67851.692097 | 81229.06 |
| 7 | 98791.733747 | 97483.56 |
| 8 | 113969.43533 | 110352.25 |
| 9 | 167921.065696 | 166187.94 |


## Model Evaluation
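**Explained variance regression score:**  
If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ the variance (the square of the standard deviation), then the explained variance is estimated as follows:

$\texttt{explainedVariance}(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}$  
The best possible score is 1.0; lower values are worse. Note that `regressor.score`, used in the next cell, returns the coefficient of determination $R^2$, which coincides with the explained variance only when the residuals have zero mean.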

In [21]:
# LinearRegression.score returns the coefficient of determination (R^2); 1.0 is a perfect prediction
print('R^2 Score {}'.format(regressor.score(X_test, Y_test)))

R^2 Score 0.9347068473282515
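The score printed above comes from `LinearRegression.score`, which computes R² rather than the explained variance formula. The two differ whenever the residuals do not average to zero. As a small check, the sketch below recomputes both from the ten predicted and real values listed in the table earlier in this notebook; the variable names are chosen for this illustration.

```python
import numpy as np

# Values copied from the prediction table earlier in the notebook
predicted = np.array([103015.201598, 132582.277608, 132447.738452, 71976.098513,
                      178537.482211, 116161.242302, 67851.692097, 98791.733747,
                      113969.43533, 167921.065696])
actual = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

residuals = actual - predicted

# R^2: 1 - (sum of squared residuals) / (total sum of squares around the mean)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((actual - actual.mean()) ** 2)

# Explained variance: 1 - Var(residuals) / Var(actual)
explained_variance = 1.0 - np.var(residuals) / np.var(actual)

print('R^2:', r2)
print('Explained variance:', explained_variance)
```

R² should come out close to the score printed above, while the explained variance is somewhat higher because the mean residual on this test set is not zero. If the explained variance itself is wanted, scikit-learn provides `sklearn.metrics.explained_variance_score(y_true, y_pred)`.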
