# Machine Learning Notebook 3: Multiple Linear Regression

### Compiled by Amit Purswani
LinkedIn: https://www.linkedin.com/in/amit-purswani-2a073777/

Repositories
1. Data Analysis:
https://github.com/kranemetal/Data-Analysis-Projects

2. Machine Learning:
https://github.com/kranemetal/MachineLearning

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Importing Dataset

In [2]:
df = pd.read_csv(r"C:\Users\krane\Desktop\datasets\50_Startups.csv")  # raw string keeps the backslashes literal

In [3]:
df.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [4]:
df.shape

(50, 5)

The dataset has 50 records, 4 features and 1 target variable 'Profit'.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


The dataset has 3 numeric features and 1 categorical feature, 'State'.

In [6]:
df.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

The dataset does not have any Null values.

In [7]:
df.describe()

|   | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


Two features, 'R&D Spend' and 'Marketing Spend', have a minimum value of 0.
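A quick way to see which columns contain zeros, and how many, is to compare the frame against 0 and sum the matches per column. The sketch below uses a small hypothetical frame (`sample`, with made-up values) rather than the real dataset, so the counts it prints say nothing about `df` itself.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({
    "R&D Spend": [165349.2, 0.0, 542.05],
    "Marketing Spend": [471784.1, 45173.06, 0.0],
})

# (sample == 0) is a boolean frame; summing counts the True cells per column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the real data the same expression would be `(df == 0).sum()`.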

In [8]:
df['State'].value_counts()

California    17
New York      17
Florida       16
Name: State, dtype: int64

In [9]:
df['State'].nunique()

3

In [10]:
df['State'].unique()

array(['New York', 'California', 'Florida'], dtype=object)

'State' has 3 unique values: 'New York', 'California' and 'Florida'.

# Equations
Simple Linear Regression <br>
y = b0 + b1*x1<br><br>
Multiple Linear Regression<br>
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
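The multiple regression equation is an intercept plus a dot product between the coefficients and the feature values. A minimal NumPy sketch, with hypothetical coefficients `b0` and `b` chosen only for illustration:

```python
import numpy as np

b0 = 1.0                   # intercept (hypothetical value)
b = np.array([2.0, 3.0])   # b1 and b2 (hypothetical values)

# Two observations, each with features x1 and x2.
X = np.array([[1.0, 1.0],
              [2.0, 0.0]])

# y = b0 + b1*x1 + b2*x2, evaluated for every row at once.
y = b0 + X @ b
print(y)  # [6. 5.]
```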

## Assumptions of Linear Regression
1. Linearity: The relationship between the independent variables (X) and the dependent variable (Y) is linear.
This can be checked with scatter plots of each feature against the target.
Outliers should be checked as well, because linear regression is sensitive to them.

2. Homoscedasticity: The variance of the residuals is the same for any value of X.
In other words, the residuals have a constant spread along the regression line.
This can be checked with a scatter plot of the residuals against the predicted values.
When this is not the case, the residuals are said to suffer from heteroscedasticity.

3. Non-Multicollinearity: The independent variables should not be highly correlated with each other.
Multicollinearity occurs when they are. This can be checked with a correlation matrix, tolerance and the Variance Inflation Factor (VIF).

4. Normality: The residuals should be normally distributed.
This can be checked by plotting a histogram of the residuals. If they are not normally distributed, a non-linear transformation (e.g. log-transformation) of the variables might fix the issue.

5. Non-Autocorrelation: There should be no correlation between the residual (error) terms. The presence of such correlation is known as autocorrelation.

Refer URL: https://dataaspirant.com/assumptions-of-linear-regression-algorithm/
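Two of the checks above can be sketched with plain NumPy. The helper names `vif` and `durbin_watson` are hypothetical, and the data is synthetic: column `c` is built as a near copy of column `a` so that multicollinearity shows up clearly. VIF for a column is 1 / (1 - R²), where R² comes from regressing that column on the others; the Durbin-Watson statistic is close to 2 when residuals are uncorrelated.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)   # almost identical to a

vifs = vif(np.column_stack([a, b, c]))
dw = durbin_watson(rng.normal(size=200))  # independent noise as stand-in residuals

print(vifs)  # large for a and c, close to 1 for b
print(dw)    # near 2 for uncorrelated residuals
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cut-off is a judgment call.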

### Separating independent(X) and dependent variables(Y)

In [11]:
x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
print(x)
print(y)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

### Encoding categorical data

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
x=np.array(ct.fit_transform(x))

In [13]:
print(x)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

#### Note: We will not apply feature scaling in Multiple Linear Regression. Each variable gets its own coefficient, which compensates for the scale of that variable, so scaling would not change the predictions.
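The note on feature scaling can be illustrated with a small experiment. This is a sketch on synthetic data, using a hypothetical helper `ols_fit_predict` built on `np.linalg.lstsq` rather than the notebook's scikit-learn model: the fitted slopes change when the features are standardised, but the predictions do not.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_new):
    """Fit ordinary least squares with an intercept; return predictions and slopes."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    preds = np.column_stack([np.ones(len(X_new)), X_new]) @ coef
    return preds, coef[1:]

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
X = np.column_stack([rng.uniform(0, 100000, 40), rng.uniform(0, 10, 40)])
y = 5.0 + 0.5 * X[:, 0] + 300.0 * X[:, 1] + rng.normal(scale=10.0, size=40)

# Standardise each feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw, slopes_raw = ols_fit_predict(X, y, X)
pred_scaled, slopes_scaled = ols_fit_predict(X_scaled, y, X_scaled)

print(np.allclose(pred_raw, pred_scaled))      # True: same predictions
print(np.allclose(slopes_raw, slopes_scaled))  # False: different coefficients
```

This holds for plain least squares; regularised models such as ridge or lasso are sensitive to feature scale.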

### Splitting the data into Train and Test sets

In [14]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

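#### Note: scikit-learn's LinearRegression copes with the dummy variable trap, so we don't have to drop one of the dummy columns. The predictions are unaffected, although the individual dummy coefficients are then not uniquely determined.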
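The dummy variable note can be checked empirically. The sketch below uses synthetic data with hypothetical names (`state`, `spend`) and fits `LinearRegression` twice: once with all three dummy columns and once with the first dummy dropped. It assumes the default `LinearRegression()` settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
state = rng.integers(0, 3, size=n)        # three categories, coded 0..2
dummies = np.eye(3)[state]                # one-hot encoding, all three columns
spend = rng.uniform(0, 100, size=(n, 1))  # one numeric feature
state_effect = np.array([0.0, 5.0, -3.0])[state]
y = 10 + 2.0 * spend[:, 0] + state_effect + rng.normal(scale=1.0, size=n)

X_all = np.hstack([dummies, spend])          # keeps every dummy column
X_drop = np.hstack([dummies[:, 1:], spend])  # drops the first dummy column

pred_all = LinearRegression().fit(X_all, y).predict(X_all)
pred_drop = LinearRegression().fit(X_drop, y).predict(X_drop)

same_predictions = np.allclose(pred_all, pred_drop)
print(same_predictions)  # True
```

Dropping a dummy column still matters when the coefficients themselves are to be interpreted, or when using a library that requires a full-rank design matrix.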

### Training the Multiple Linear Regression Model on Train set

In [15]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression()

### Predict Test set results

In [16]:
y_pred  = regressor.predict(x_test)
np.set_printoptions(precision=2) # prints numbers up to 2 decimal places
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[104282.76 103282.38]
 [132536.88 144259.4 ]
 [133910.85 146121.95]
 [ 72584.77  77798.83]
 [179920.93 191050.39]
 [114549.31 105008.31]
 [ 66444.43  81229.06]
 [ 98404.97  97483.56]
 [114499.83 110352.25]
 [169367.51 166187.94]
 [ 96522.63  96778.92]
 [ 88040.67  96479.51]
 [110949.99 105733.54]
 [ 90419.19  96712.8 ]
 [128020.46 124266.9 ]]
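The two printed columns (predicted, then actual) can be summarised with a single number. A minimal sketch, assuming mean absolute error and R² are suitable metrics here, computed from the 15 pairs shown above:

```python
import numpy as np

# Predicted and actual profits, copied from the printed output above.
y_pred = np.array([104282.76, 132536.88, 133910.85, 72584.77, 179920.93,
                   114549.31, 66444.43, 98404.97, 114499.83, 169367.51,
                   96522.63, 88040.67, 110949.99, 90419.19, 128020.46])
y_test = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94,
                   96778.92, 96479.51, 105733.54, 96712.8, 124266.9])

# Mean absolute error: average size of the prediction error.
mae = np.mean(np.abs(y_pred - y_test))

# R squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, r2)
```

With the notebook's objects in scope, `sklearn.metrics.mean_absolute_error(y_test, y_pred)` and `sklearn.metrics.r2_score(y_test, y_pred)` compute the same quantities.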


#### Making a single prediction
(for example the profit of a startup with R&D Spend = 150000, Administration Spend = 120000, Marketing Spend = 280000 and State = 'Florida')

In [17]:
print(regressor.predict([[0, 1, 0, 150000, 120000, 280000]]))

[173800.72]


Therefore, our model predicts a profit of $173,800.72 for a Florida startup that spent 150000 on R&D, 120000 on Administration and 280000 on Marketing.

#### Important note 1: Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

0, 1, 0, 150000, 120000, 280000 → scalars

[0, 1, 0, 150000, 120000, 280000] → 1D array

[[0, 1, 0, 150000, 120000, 280000]] → 2D array
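The difference between the 1D and 2D forms is visible in the array's `shape`. A short NumPy sketch (the variable names are illustrative):

```python
import numpy as np

row_1d = np.array([0, 1, 0, 150000, 120000, 280000])
row_2d = np.array([[0, 1, 0, 150000, 120000, 280000]])

print(row_1d.shape)  # (6,)   -> six values, no row dimension
print(row_2d.shape)  # (1, 6) -> one row (sample) with six columns (features)

# An existing 1D array can be turned into a single-row 2D array with reshape.
reshaped = row_1d.reshape(1, -1)
print(reshaped.shape)  # (1, 6)
```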

#### Important note 2: Notice also that the "Florida" state was not input as a string in the last column but as "0, 1, 0" in the first three columns.
That's because the predict method expects the one-hot-encoded values of the state, and as the third row of the feature matrix x shows, "Florida" was encoded as "0, 1, 0". Be careful to put these values in the first three columns, not the last three: our ColumnTransformer places the encoded columns first and appends the passthrough columns after them.
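Instead of writing the dummy values by hand, the fitted transformer can encode a raw row. The sketch below is self-contained, so it refits a `ColumnTransformer` on only the first three rows of the dataset; in the notebook the already-fitted `ct` would be reused. `sparse_threshold=0` is an addition here to force a dense result, and the names `x_raw` and `new_raw` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature matrix, one per state.
x_raw = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
ct.fit_transform(x_raw)

# Categories are stored in sorted order, which fixes the dummy column order.
categories = list(ct.named_transformers_['encoder'].categories_[0])
print(categories)  # ['California', 'Florida', 'New York']

# Encode a new raw row: numeric columns first, state last, as in x.
new_raw = np.array([[150000, 120000, 280000, 'Florida']], dtype=object)
new_encoded = ct.transform(new_raw).astype(float)
print(new_encoded)
```

The encoded row starts with the three dummy columns, so it can be passed straight to `regressor.predict`.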

### Getting the final linear regression equation with the values of the coefficients

In [18]:
print(regressor.coef_)
print(regressor.intercept_)

[-2.56e+02  2.07e+02  4.89e+01  7.91e-01  3.02e-02  3.10e-02]
42659.813725518936


#### Therefore, the equation of our multiple linear regression model is:

Profit = (-256 × Dummy State 1) + (207 × Dummy State 2) + (48.9 × Dummy State 3) + (0.791 × R&D Spend) + (0.0302 × Administration) + (0.0310 × Marketing Spend) + 42659.81
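The equation can be evaluated by hand for the Florida example from the single prediction. This sketch uses the rounded coefficients exactly as written in the equation, so the result differs slightly from the model's 173800.72:

```python
import numpy as np

# Rounded coefficients and intercept, copied from the equation above.
coef = np.array([-256.0, 207.0, 48.9, 0.791, 0.0302, 0.0310])
intercept = 42659.81

# Florida startup: dummies (0, 1, 0), then R&D, Administration, Marketing.
x_new = np.array([0, 1, 0, 150000, 120000, 280000])

manual = intercept + coef @ x_new
print(round(manual, 2))  # about 173820.81
```

The gap of roughly 20 comes from rounding the coefficients to three significant figures; using `regressor.coef_` and `regressor.intercept_` at full precision reproduces the prediction.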

#### Important Note:
To get these coefficients we read the "coef_" and "intercept_" attributes of our regressor object. Unlike methods, attributes are accessed without parentheses; here they hold an array of coefficients and a single intercept value.
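The attribute versus method distinction is the same one seen on any Python object. A small NumPy illustration (the array `arr` is just an example):

```python
import numpy as np

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# Attribute: stored data, read without parentheses.
print(arr.shape)   # (2, 2)

# Method: a function bound to the object, called with parentheses.
print(arr.mean())  # 2.5
```

In the same way, `regressor.coef_` is read directly, while `regressor.predict(...)` has to be called.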

### <center>The End</center>