# Multiple Linear Regression

### Importing the Libraries

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

In [8]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values  # All columns except the last one: the features (independent variables)
y = dataset.iloc[:, -1].values   # Only the last column: the dependent variable (profit)
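To see what the two `iloc` slices return, here is a minimal sketch on a made-up three-row DataFrame. The column names and values are placeholders, not the contents of the real CSV.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV (values are made up)
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "State": ["California", "Florida", "New York"],
    "Profit": [10.0, 20.0, 30.0],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last -> 2D array of features
y_toy = toy.iloc[:, -1].values   # only the last column -> 1D array of targets

print(X_toy.shape)  # (3, 2)
print(y_toy.shape)  # (3,)
```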


### Encoding categorical data

The categorical column (the state) has to be translated into a numerical format before the model can use it. We do this with one-hot encoding.

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [3])], remainder = 'passthrough')   # [3] is the index of the column to which we apply the OneHotEncoder
X = np.array(ct.fit_transform(X))
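The layout of the encoded array matters later, so here is a small sketch on made-up rows. It suggests two things: `OneHotEncoder` sorts the categories, and `ColumnTransformer` puts the encoded columns before the passthrough columns. The arrays `X_toy` and `ct_toy` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: three numeric columns followed by the state in column index 3
X_toy = np.array([
    [10.0, 20.0, 30.0, "New York"],
    [11.0, 21.0, 31.0, "California"],
    [12.0, 22.0, 32.0, "Florida"],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
encoded = np.array(ct_toy.fit_transform(X_toy)).astype(float)

# Categories are sorted, so the dummy columns are California, Florida, New York
print(ct_toy.named_transformers_["encoder"].categories_)
# Each row: three dummy columns first, then the three numeric columns
print(encoded)
```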

### Splitting the dataset into the Training set and Test set

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
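As a quick check of what `test_size = 0.2` and `random_state` do, the sketch below splits placeholder arrays of 50 rows. The row count is an assumption based on the file name and the ten test predictions printed later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 50 rows (assumed size, values are arbitrary)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)

# The same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te_again))  # True
```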

### Training the Multiple Linear Regression Model on the Training set

In [11]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Model Created

LinearRegression()

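We don't have to worry about the dummy variable trap here: `LinearRegression` solves the least-squares problem in a way that tolerates the redundant dummy column, so the predictions are unaffected (although the individual dummy coefficients are then not uniquely determined). Note that the class does not select features for us. It fits one coefficient for every column it is given, so eliminating features by p-value (dropping the feature with the highest p-value) would have to be done as a separate step.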
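One way to check the dummy variable trap point is to fit the model twice on synthetic data, once with all dummy columns and once with the first dummy column removed, and compare the predictions. This is only a sketch on made-up data. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (made up): one numeric feature and a three-level category
n = 60
category = rng.integers(0, 3, size=n)
dummies = np.eye(3)[category]            # all three dummy columns (they sum to 1 in every row)
spend = rng.uniform(0, 100, size=(n, 1))
y = 5.0 * spend[:, 0] + np.array([10.0, 20.0, 30.0])[category] + rng.normal(0, 1, n)

X_full = np.hstack([dummies, spend])            # keeps the redundant dummy column
X_dropped = np.hstack([dummies[:, 1:], spend])  # first dummy column removed

pred_full = LinearRegression().fit(X_full, y).predict(X_full)
pred_dropped = LinearRegression().fit(X_dropped, y).predict(X_dropped)

# Compare the two sets of predictions
print(np.allclose(pred_full, pred_dropped))
```

If the two fits agree, the redundant column changes how the coefficients are split between the dummies and the intercept, not what the model predicts.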

### Predicting the Test set results

In [12]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2) # Display numerical values with two digits after the decimal point
# Left column: predicted profit, right column: actual profit
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[114664.42 105008.31]
 [ 90593.16  96479.51]
 [ 75692.84  78239.91]
 [ 70221.89  81229.06]
 [179790.26 191050.39]
 [171576.92 182901.99]
 [ 49753.59  35673.41]
 [102276.66 101004.64]
 [ 58649.38  49490.75]
 [ 98272.03  97483.56]]
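To put a number on how close the two columns are, the sketch below copies the ten printed pairs and computes the mean absolute error and the R² score with `sklearn.metrics`. The array names are illustrative. In the notebook itself the same functions could be called on `y_test` and `y_pred` directly.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Predicted (left column) and actual (right column) profits copied from the printed output
y_pred_printed = np.array([114664.42, 90593.16, 75692.84, 70221.89, 179790.26,
                           171576.92, 49753.59, 102276.66, 58649.38, 98272.03])
y_test_printed = np.array([105008.31, 96479.51, 78239.91, 81229.06, 191050.39,
                           182901.99, 35673.41, 101004.64, 49490.75, 97483.56])

mae = mean_absolute_error(y_test_printed, y_pred_printed)
r2 = r2_score(y_test_printed, y_pred_printed)
print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.3f}")
```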


Comparing the two columns (predicted profit on the left, actual profit on the right), we can see that the values are fairly close to each other. The dataset does not necessarily follow a perfectly linear relationship, but the linear regression model captures enough of it to make reasonable predictions on the test set.

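Backward elimination is not performed here. Scikit-Learn's `LinearRegression` fits the model on all the features it is given and does not drop statistically insignificant ones automatically. In this example we simply keep every feature, which already gives reasonably accurate predictions. If you want to eliminate features, that has to be done as a separate step.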
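For cases where explicit feature selection is wanted, scikit-learn provides `SequentialFeatureSelector`, which can run in the backward direction. It removes features based on cross-validated score rather than p-values, so it is a related technique and not classical backward elimination. The sketch below uses synthetic data in which only two of five columns matter.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (made up): only columns 0 and 2 actually influence y
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 2] + rng.normal(scale=0.1, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",  # start from all features and remove one at a time
    cv=5,
)
selector.fit(X_syn, y_syn)

# Boolean mask of the columns that were kept
print(selector.get_support())
```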

## Question 1:
How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?

### Answer 1:

In [13]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[180892.25]


Note: The feature values were all entered inside a double pair of square brackets. That's because the "predict" method always expects a 2D array as input, and putting our values into a double pair of square brackets makes the input exactly a 2D array (one row, six columns).

Note: The "California" state was not entered as a string in the last column but as "1, 0, 0" in the first three columns. That is because the predict method expects the one-hot-encoded values of the state, and the `ColumnTransformer` places the encoded columns first.
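Instead of typing the dummy values by hand, the fitted `ColumnTransformer` can encode a new raw row. The sketch below is self-contained, so it trains on a few made-up rows rather than the real dataset. `X_raw`, `y_raw`, `ct_demo` and `model` are illustrative names. In the notebook the equivalent call would use the existing `ct` and `regressor`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows in the column order: R&D Spend, Administration, Marketing Spend, State
X_raw = np.array([
    [165000.0, 137000.0, 472000.0, "New York"],
    [163000.0, 151000.0, 444000.0, "California"],
    [153000.0, 101000.0, 408000.0, "Florida"],
    [144000.0, 119000.0, 383000.0, "New York"],
    [142000.0,  91000.0, 366000.0, "Florida"],
    [132000.0,  99000.0, 363000.0, "California"],
    [120000.0, 110000.0, 300000.0, "Florida"],
    [110000.0, 125000.0, 280000.0, "California"],
], dtype=object)
y_raw = np.array([192000.0, 191000.0, 182000.0, 166000.0,
                  157000.0, 156000.0, 140000.0, 130000.0])

ct_demo = ColumnTransformer([("encoder", OneHotEncoder(), [3])], remainder="passthrough")
X_enc = np.array(ct_demo.fit_transform(X_raw)).astype(float)
model = LinearRegression().fit(X_enc, y_raw)

# Encode the new startup with the fitted transformer instead of typing 1, 0, 0 by hand
new_raw = np.array([[160000.0, 130000.0, 300000.0, "California"]], dtype=object)
new_enc = np.array(ct_demo.transform(new_raw)).astype(float)

print(new_enc)                 # encoded row: dummy columns first, then the numeric values
print(model.predict(new_enc))  # prediction from the toy model, not the notebook's regressor
```

Using `transform` (not `fit_transform`) on the new row keeps the category order learned during training.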

## Question 2:
How do I get the final regression equation $y = b_0 + b_1 x_1 + b_2 x_2 + \ldots$ with the final values of the coefficients?

### Answer 2:

In [28]:
print(regressor.coef_)
print(regressor.intercept_)

[-2.85e+02  2.98e+02 -1.24e+01  7.74e-01 -9.44e-03  2.89e-02]
49834.88507321703


Therefore, the equation of our multiple linear regression model is:
$$ \text{Profit} = -285 \times \text{Dummy State 1} + 298 \times \text{Dummy State 2} - 12.4 \times \text{Dummy State 3} + 0.774 \times \text{R\&D Spend} - 0.00944 \times \text{Administration} + 0.0289 \times \text{Marketing Spend} + 49834.89 $$
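As a sanity check, the equation can be evaluated by hand with the values printed by `regressor.coef_` and `regressor.intercept_`, using the startup from Question 1. Because the printed coefficients are rounded to three significant digits, the result only approximately matches the 180892.25 returned by `predict`.

```python
import numpy as np

# Rounded values as printed by regressor.coef_ and regressor.intercept_
coef = np.array([-2.85e+02, 2.98e+02, -1.24e+01, 7.74e-01, -9.44e-03, 2.89e-02])
intercept = 49834.88507321703

# California (1, 0, 0), R&D Spend = 160000, Administration = 130000, Marketing Spend = 300000
x_new = np.array([1, 0, 0, 160000, 130000, 300000])

# Profit = b0 + b1*x1 + ... + b6*x6
profit = coef @ x_new + intercept
print(round(profit, 2))
```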
