# Multiple Linear Regression
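y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn; the trend line (a plane or hyperplane when there is more than one IV) that best fits the data
y = DV (vertical axis); x1, x2, x3, ..., xn = IVs (horizontal axis); b0 = y-intercept (constant)
b1, ..., bn = slope coefficients; bi is the change in y per unit change in xi with the other IVs held fixed (with a single IV, b1 = slope of the line = (y2 - y1)/(x2 - x1))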
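
As a quick illustration of the equation above, the sketch below evaluates a prediction for one observation. The function name `predict_y` and all coefficient and input values are made up for the example; they do not come from the dataset used later.

```python
def predict_y(intercept, slopes, features):
    """Evaluate intercept + slope_1 * x_1 + ... + slope_n * x_n."""
    return intercept + sum(b * x for b, x in zip(slopes, features))


# Hypothetical coefficients and one hypothetical observation with three IVs
intercept = 10.0
slopes = [2.0, -1.0, 0.5]
features = [3.0, 4.0, 8.0]

y_hat = predict_y(intercept, slopes, features)
print(y_hat)  # 10 + 6 - 4 + 4 = 16.0
```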
## Assumptions of a linear regression
1. Linearity
2. Homoscedasticity
3. Multivariate Normality
4. Independence of Errors
5. Lack of Multicollinearity

## Ordinary least squares
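conceptually, compares all candidate trend lines and takes the one with the minimum sum of (actual y - predicted y)^2; in practice the minimising coefficients are computed directly rather than by trying lines one by one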
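
A minimal sketch of the idea, assuming NumPy is available: `np.linalg.lstsq` returns the coefficients with the smallest sum of squared residuals, and any other coefficients give a larger sum. The toy data and the helper name `sum_squared_residuals` are invented for this example.

```python
import numpy as np

# Toy data (made up): y = 1 + 2*x1 + 3*x2 exactly, so the best fit is known
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix with a leading column of ones for the intercept
A = np.column_stack((np.ones_like(x1), x1, x2))

# Coefficients that minimise sum((y - A @ b) ** 2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def sum_squared_residuals(b):
    """Sum of squared differences between actual and predicted y."""
    return float(np.sum((y - A @ b) ** 2))


ssr_best = sum_squared_residuals(coef)
# Nudging one slope away from the least-squares value makes the fit worse
ssr_other = sum_squared_residuals(coef + np.array([0.0, 0.1, 0.0]))

print(coef)
print(ssr_best, ssr_other)
```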
## R^2
sum of squares of residuals (SS_res) = sum of (actual y - predicted y)^2, the quantity OLS minimises
total sum of squares (SS_tot) = sum of (actual y - average y)^2
R^2 = 1 - (SS_res / SS_tot)
so R^2 measures how good the trend line is compared to the average line (a model that always predicts the average y)
OLS minimises SS_res, thereby reducing SS_res / SS_tot
so the closer R^2 is to 1 the better the fit; the further below 1, the worse
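
The definition can be checked by hand on a tiny example. The sketch below uses made-up values; the predictions come from the line y = 2.2 + 0.6x for x = 1..5, and `r_squared` is a helper name chosen for this illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared residuals / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up data
actual = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2_line = r_squared(actual, predicted)        # about 0.6
r2_average = r_squared(actual, [4.0] * 5)     # always predicting the average gives 0

print(r2_line, r2_average)
```

The second call shows the baseline: a model that always predicts the average y has R^2 = 0, so R^2 reports the improvement over that average line.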

## Adjusted R^2
problem with R^2: when you add a new variable, the regression can only keep or reduce the minimum sum of squares of residuals (it can always give the new variable a coefficient of 0)
so R^2 never decreases; it either stays the same or increases, even if the new variable is useless
hence Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1); n = sample size, p = number of regressors (IVs)
this penalises the model for adding IVs that don't help
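
A small sketch of the formula, with hypothetical numbers: if adding a sixth regressor leaves R^2 unchanged, the adjusted value goes down. The function name `adjusted_r_squared` and the inputs are assumptions for the example only.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical model: 50 observations, R^2 = 0.95
adj_with_5 = adjusted_r_squared(0.95, n=50, p=5)
# Same R^2 after adding a regressor that explains nothing new
adj_with_6 = adjusted_r_squared(0.95, n=50, p=6)

print(adj_with_5, adj_with_6)
```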

## Dummy Variable trap
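if a categorical variable with 2 categories is encoded as 2 dummy variables and both are used in a model with an intercept, they always sum to 1 (D2 = 1 - D1), which causes perfect multicollinearity
hence the model cannot distinguish the effects of the dummy variables from each other or from the constant
therefore the coefficients are not uniquely determined
so always omit one dummy variable per categorical variable (N categories -> N - 1 dummy variables)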
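
The redundancy can be seen by checking the rank of the design matrix. This is a sketch assuming NumPy, with a few made-up rows of state names; the variable names are illustrative.

```python
import numpy as np

# A few example rows of a categorical column with three categories
states = ["New York", "California", "Florida", "New York", "Florida"]
categories = sorted(set(states))  # ['California', 'Florida', 'New York']

# One dummy column per category
dummies = np.array([[1.0 if s == c else 0.0 for c in categories] for s in states])
intercept = np.ones((len(states), 1))

# Intercept plus all three dummies: the dummies sum to the intercept column
full = np.hstack((intercept, dummies))
# Intercept plus two dummies (first category omitted)
reduced = np.hstack((intercept, dummies[:, 1:]))

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # 4 columns, rank 3: one column is redundant
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3: no redundancy
```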
## P value
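The P value is the probability of obtaining results at least as extreme as those observed, assuming the "null hypothesis" is true. It measures how compatible the data are with the null hypothesis; it is not the probability that a hypothesis is correct.

If the P value is below a pre-determined significance level (for instance, 0.05), scientists reject the null hypothesis of their experiment; in other words, they rule out the hypothesis that the variables of their experiment had no meaningful effect on the results. In regression, each coefficient has its own P value, with the null hypothesis that the coefficient is 0.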
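
As a rough illustration, assuming SciPy is installed, `scipy.stats.linregress` reports the P value for the null hypothesis that the slope of a one-variable regression is zero. The data below are made up, and this uses a single IV rather than the multiple regression fitted later in these notes.

```python
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Made-up response that clearly rises with x
y_related = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
# Made-up response with no visible trend in x
y_unrelated = [5, 3, 6, 4, 5, 4, 6, 3, 5, 4]

related = linregress(x, y_related)
unrelated = linregress(x, y_unrelated)

alpha = 0.05  # pre-determined significance level
print(related.pvalue < alpha)    # True: reject "slope is 0"
print(unrelated.pvalue < alpha)  # False: cannot reject "slope is 0"
```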


### Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

In [4]:
dataset = pd.read_csv('50_Startups.csv')
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [5]:
# Split into features X (all columns except the last) and target y (last column, Profit)
X = dataset.iloc[:, :-1].values  # iloc[rows, cols]
y = dataset.iloc[:, -1].values

pd.DataFrame(X).head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 165349 | 136898.0 | 471784 | New York |
| 1 | 162598 | 151378.0 | 443899 | California |
| 2 | 153442 | 101146.0 | 407935 | Florida |
| 3 | 144372 | 118672.0 | 383200 | New York |
| 4 | 142107 | 91391.8 | 366168 | Florida |


In [6]:
pd.DataFrame(y).head()

| | 0 |
|---|---|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |


### Encoding Categorical data

One-hot encoding transforms a categorical variable into a set of binary variables (also known as dummy variables). For N categories in a variable, it uses N binary variables. Here the State column (index 3) has 3 categories, so it becomes 3 binary columns, which ColumnTransformer places before the remaining columns.

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [11]:
# print(X[:5])
pd.DataFrame(X).head()

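| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 165349 | 136898.0 | 471784 |
| 1 | 1 | 0 | 0 | 162598 | 151378.0 | 443899 |
| 2 | 0 | 1 | 0 | 153442 | 101146.0 | 407935 |
| 3 | 0 | 0 | 1 | 144372 | 118672.0 | 383200 |
| 4 | 0 | 1 | 0 | 142107 | 91391.8 | 366168 |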


#### Note
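Avoid the Dummy Variable Trap by removing one of the dummy variables. Scikit-Learn does not remove it automatically: OneHotEncoder() keeps all N columns unless you pass drop='first'. LinearRegression still trains and predicts correctly with the extra column, because its least-squares solver handles perfectly collinear columns, but the individual coefficients are then not uniquely determined and should not be interpreted.
Also, Backward Elimination is not required to make predictions here, but Scikit-Learn's LinearRegression does not select statistically significant features: it fits a coefficient for every column it is given. To keep only significant features, inspect the P values (for example with the statsmodels library) or apply a feature-selection step.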
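
One way to omit a dummy variable at encoding time is the `drop` parameter of `OneHotEncoder`. The sketch below reuses the first five rows shown earlier; `X_demo`, `ct_demo` and `X_demo_encoded` are names made up for this example, and it assumes a scikit-learn version that supports `drop`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First five feature rows of the dataset, typed in by hand
X_demo = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
    [144372.41, 118671.85, 383199.62, "New York"],
    [142107.34, 91391.77, 366168.42, "Florida"],
], dtype=object)

# drop="first" omits the dummy column of the first category (alphabetically, California)
ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(drop="first"), [3])],
    remainder="passthrough",
)
X_demo_encoded = np.array(ct_demo.fit_transform(X_demo))

# Two dummy columns (Florida, New York) followed by the three numeric columns
print(X_demo_encoded.shape)
print(X_demo_encoded[:, :2])
```

A California row is then encoded as two zeros, so California acts as the baseline category.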

### Splitting the dataset into the Training set and Test set

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### Training the Multiple Linear Regression model on the Training set

In [13]:
#Fitting multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Predicting the test set results

In [17]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predicted profits (left column) next to actual profits (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), axis=1))
# pd.DataFrame(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
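
The predicted and actual values printed above can be summarised with the R^2 measure described earlier. The sketch below copies the ten printed pairs by hand and uses scikit-learn's `r2_score`; the variable names are chosen for this example only, and the printed values are rounded to two decimals, so the result is approximate.

```python
import numpy as np
from sklearn.metrics import r2_score

# Left column of the printed output: predicted profit
y_pred_printed = np.array([
    103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
    116161.24, 67851.69, 98791.73, 113969.44, 167921.07,
])
# Right column of the printed output: actual profit
y_test_printed = np.array([
    103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
    105008.31, 81229.06, 97483.56, 110352.25, 166187.94,
])

# r2_score takes the actual values first, then the predictions
test_r2 = r2_score(y_test_printed, y_pred_printed)
print(round(test_r2, 2))  # about 0.93 on these ten test rows
```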
