## Multiple Linear Regression


### Importing the libraries


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset


In [3]:
dataset = pd.read_csv('50_Startups.csv')
print("Complete Dataset Shape: ", dataset.shape)

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print("Shape of X: ", X.shape)

Complete Dataset Shape:  (50, 5)
Shape of X:  (50, 4)


In [4]:
# Display the first 5 rows of X
for row in X[:5]:
    print(row)

[165349.2 136897.8 471784.1 'New York']
[162597.7 151377.59 443898.53 'California']
[153441.51 101145.55 407934.54 'Florida']
[144372.41 118671.85 383199.62 'New York']
[142107.34 91391.77 366168.42 'Florida']


### Encoding categorical data (the last column, State, contains 3 distinct classes)


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

print("New Shape of X: ", X.shape)

New Shape of X:  (50, 6)


In [6]:
# Display the first 5 rows of X
for row in X[:5]:
    print(row)

[0.0 0.0 1.0 165349.2 136897.8 471784.1]
[1.0 0.0 0.0 162597.7 151377.59 443898.53]
[0.0 1.0 0.0 153441.51 101145.55 407934.54]
[0.0 0.0 1.0 144372.41 118671.85 383199.62]
[0.0 1.0 0.0 142107.34 91391.77 366168.42]
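
Comparing the rows before and after encoding shows that the three new leading columns stand for California, Florida and New York, in that order. The sketch below is an illustration rather than part of the original notebook: it refits the same transformer on a hypothetical three-row sample (`sample`, `ct_demo` and `categories` are names introduced here) and reads the category order back from the fitted encoder.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample: the first three rows printed earlier, one per state
sample = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough',
)
encoded = np.array(ct_demo.fit_transform(sample))

# The fitted encoder stores its categories in sorted order;
# the one-hot columns follow that order.
categories = [str(c) for c in ct_demo.named_transformers_['encoder'].categories_[0]]
print(categories)
print(encoded[:, :3])
```

Knowing this order matters later, when a single startup has to be encoded by hand for prediction.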


### Splitting the dataset into the Training set and Test set


In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
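
With 50 rows and `test_size = 0.2`, the split should hold out 10 rows for testing and keep 40 for training. A minimal sketch on hypothetical stand-in arrays (`X_demo` and `y_demo` are placeholders with the same shape as the encoded data, not the real dataset) confirms the shapes and that features and labels stay aligned after shuffling:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows, 6 features, label i belongs to row i
X_demo = np.arange(300).reshape(50, 6)
y_demo = np.arange(50)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(Xd_train.shape, Xd_test.shape)

# Each row of X_demo starts at 6 * i, so dividing by 6 recovers the label
print(np.array_equal(Xd_test[:, 0] // 6, yd_test))
```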

### Training the Multiple Linear Regression model on the Training set


In [8]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Predicting the Test set results


In [9]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)  # display floats with 2 decimal places

# Show predicted values (left column) next to the actual test set values (right column)
print(np.column_stack((y_pred, y_test)))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


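#### Making a single prediction

For example, predict the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California'. California is encoded as 1, 0, 0 in the first three columns (see the encoded rows above), and `predict` expects a 2D array, hence the double brackets.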

In [10]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[181566.92]


In [11]:
# Getting the final linear regression equation with the values of the coefficients

print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.529248549545


#### Therefore, the equation of our multiple linear regression model is:

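Profit = 86.6 × Dummy State 1 − 873 × Dummy State 2 + 786 × Dummy State 3 + 0.773 × R&D Spend + 0.0329 × Administration + 0.0366 × Marketing Spend + 42467.53

Here Dummy State 1, 2 and 3 are the one-hot columns for California, Florida and New York respectively.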
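
As a sanity check, the single prediction made earlier can be reproduced by hand from the values printed by `regressor.coef_` and `regressor.intercept_`. The sketch below hard-codes those printed numbers (the names `coef`, `intercept` and `startup` are introduced here for illustration). Because the printed coefficients are rounded to three significant figures, the result should land close to, but not exactly on, the 181566.92 returned by `regressor.predict`.

```python
import numpy as np

# Coefficients as printed above (rounded), in column order:
# California, Florida, New York, R&D Spend, Administration, Marketing Spend
coef = np.array([8.66e+01, -8.73e+02, 7.86e+02, 7.73e-01, 3.29e-02, 3.66e-02])
intercept = 42467.529248549545

# Same input as the single prediction: a California startup
startup = np.array([1, 0, 0, 160000, 130000, 300000])

# A linear model's prediction is the dot product plus the intercept
manual_prediction = float(np.dot(coef, startup) + intercept)
print(round(manual_prediction, 2))

# Gap caused by using rounded coefficients
print(round(abs(manual_prediction - 181566.92), 2))
```

The R&D Spend term contributes by far the largest share of the predicted profit, which only works out if its coefficient is positive.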

### Evaluating the model on the test set


In [12]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('')
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('')
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('')
print('R2 Score:', metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 7514.293659641278

Mean Squared Error: 83502864.03260367

Root Mean Squared Error: 9137.990152796383

R2 Score: 0.934706847328222
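
To make the definitions concrete, the same four metrics can be computed directly with NumPy. The sketch below is illustrative only: it hard-codes the ten predicted and actual values from the comparison printed earlier (already rounded to 2 decimals), so the results should agree with the scikit-learn output above up to rounding.

```python
import numpy as np

# Predicted vs. actual profits, copied from the printed test set comparison
y_pred_demo = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                        116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_demo = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                        105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

errors = y_test_demo - y_pred_demo

mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_test_demo - y_test_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MAE :', mae)
print('MSE :', mse)
print('RMSE:', rmse)
print('R2  :', r2)
```

MAE and RMSE are in the same units as profit, so an RMSE of roughly 9,100 can be read directly against profits that range from about 78,000 to 191,000 in this test set.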
