# Multiple Linear Regression to estimate the profit based on multiple parameters

A Venture Capital Fund has hired you as a data scientist to analyze the patterns in profit based on various parameters taken from 50 previously launched startups.

Use the acquired data to train and fit a data-driven model to estimate profit trends and assess feasibility for successfully funding future startups.

1.   Dependent Variable - Profit
2.   Independent Variables/Features - R&D Spend, Administration Spend, Marketing Spend and State




## Importing the libraries (pandas, numpy, matplotlib)

In [25]:
import pandas as pd
print("Pandas imported as pd")

import numpy as np
print("Numpy imported as np")

import matplotlib.pyplot as plt
print("Pyplot module of Matplotlib imported as plt")

Pandas imported as pd
Numpy imported as np
Pyplot module of Matplotlib imported as plt


## Importing the dataset (50_Startups.csv) from source folder 

In [16]:
data= pd.read_csv('50_Startups.csv')
print("Data set imported")

Data set imported


## Assigning all rows and all columns except the last column (i.e., Profit) to x

In [30]:
x = data.iloc[:, :-1].values
print("All rows and R&D Spend, Administration, Marketing Spend & State columns assigned to x variable")

print(x)

All rows and R&D Spend, Administration, Marketing Spend & State columns assigned to x variable
[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California

## Assigning all rows and last column to y

In [43]:
y = data.iloc[:, -1].values
print("All rows and Profit column assigned to y variable")
print(y)

All rows and Profit column assigned to y variable
[192261.83 191792.06 191050.39 182901.99 166187.94 156991.12 156122.51
 155752.6  152211.77 149759.96 146121.95 144259.4  141585.52 134307.35
 132602.65 129917.04 126992.93 125370.37 124266.9  122776.86 118474.03
 111313.02 110352.25 108733.99 108552.04 107404.34 105733.54 105008.31
 103282.38 101004.64  99937.59  97483.56  97427.84  96778.92  96712.8
  96479.51  90708.19  89949.14  81229.06  81005.76  78239.91  77798.83
  71498.49  69758.98  65200.33  64926.08  49490.75  42559.73  35673.41
  14681.4 ]


## Encoding categorical data

In [47]:
from sklearn.compose import ColumnTransformer
#Applies transformers to columns to form a single feature space of an array or pandas DataFrame
print("ColumnTransformer imported from compose of sci-kit learn library")

from sklearn.preprocessing import OneHotEncoder
#One Hot Encoding preprocesses categorical features by creating a new binary feature for each possible category
print("OneHotEncoder imported from preprocessing module of sci-kit learn library")


#Here we are encoding State column
ct=ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
x= np.array(ct.fit_transform(x))
print("One Hot Encoding applied to State column")


ColumnTransformer imported from compose of sci-kit learn library
OneHotEncoder imported from preprocessing module of sci-kit learn library
One Hot Encoding applied to State column


In [53]:
print("Below we can see that the State column has been encoded into 0s and 1s in the first three columns, whereas the remaining columns (R&D Spend, Administration and Marketing Spend) stay the same.")
print("We can see what corresponds to what if we take a look at our original data set:")
print ("001 corresponds to New York")
print ("100 corresponds to California")
print ("010 corresponds to Florida")
print(x) 

Below we can see that the State column has been encoded into 0s and 1s in the first three columns, whereas the remaining columns (R&D Spend, Administration and Marketing Spend) stay the same.
We can see what corresponds to what if we take a look at our original data set:
001 corresponds to New York
100 corresponds to California
010 corresponds to Florida
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.2
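
The mapping above follows from the categories being placed in sorted order, one indicator column per category. The snippet below is a minimal pure-Python sketch of that idea on a toy list of states; the helper `one_hot` is hypothetical and only mimics the default behaviour of `OneHotEncoder`, it is not the library's implementation.

```python
# Toy sketch: one indicator column per category, categories in sorted order.
states = ["New York", "California", "Florida", "New York"]

# Sorting gives California, Florida, New York, the same column order seen above.
categories = sorted(set(states))

def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

encoded = [one_hot(s, categories) for s in states]
for state, row in zip(states, encoded):
    print(state, row)
```

With this ordering, New York lands in the third column, California in the first and Florida in the second, which matches the 001 / 100 / 010 patterns printed by the notebook.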

## Feature Scaling 

### No need to apply feature scaling in case of Multiple Linear Regression

In [54]:
print("There is no need to apply feature scaling in Multiple Linear Regression. Each feature/independent variable has its own coefficient, and the fitted coefficients compensate for differences in scale, so the predictions are the same with or without scaling.")

There is no need to apply feature scaling in Multiple Linear Regression. Each feature/independent variable has its own coefficient, and the fitted coefficients compensate for differences in scale, so the predictions are the same with or without scaling.
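
The scaling argument can be checked numerically. The following is a small sketch, assuming synthetic data (not the startup data set) and an ordinary least squares fit done with NumPy's `np.linalg.lstsq`; the helper `fit_predict` is a hypothetical name introduced for illustration. It fits the same model on raw and on standardized features and compares the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales (hypothetical data).
X = np.column_stack([rng.uniform(0, 1, 30), rng.uniform(0, 100_000, 30)])
y = 5.0 + 2.0 * X[:, 0] + 0.0003 * X[:, 1] + rng.normal(0, 0.1, 30)

def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns (coefficients, predictions)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef

coef_raw, pred_raw = fit_predict(X, y)

# Standardize each column to zero mean and unit variance, then refit.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, pred_scaled = fit_predict(X_scaled, y)

print("raw coefficients:   ", coef_raw)
print("scaled coefficients:", coef_scaled)
print("same predictions:", np.allclose(pred_raw, pred_scaled))
```

The coefficients change with the scale of each column, but the predictions do not, which is why scaling is optional for plain least squares regression.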


## Splitting the dataset into the Training set and Test set

In [59]:
#Importing train_test_split function from model_selection module of sci-kit learn library
from sklearn.model_selection import train_test_split
print("train_test_split function imported")

x_train, x_test, y_train, y_test = train_test_split (x, y, test_size=0.2, random_state=0)
print ("Variable x split into x_train and x_test")
print("x_train" , x_train)
print("x_test" , x_test)

print ("Variable y split into y_train and y_test")
print("y_train" , y_train)
print("y_test" , y_test)


train_test_split function imported
Variable x split into x_train and x_test
x_train [[0.0 1.0 0.0 55493.95 103057.49 214634.81]
 [0.0 0.0 1.0 46014.02 85047.44 205517.64]
 [0.0 1.0 0.0 75328.87 144135.98 134050.07]
 [1.0 0.0 0.0 46426.07 157693.92 210797.67]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 1000.23 124153.04 1903.93]
 [0.0 0.0 1.0 542.05 51743.15 0.0]
 [0.0 0.0 1.0 65605.48 153032.06 107138.38]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [0.0 1.0 0.0 61994.48 115641.28 91131.24]
 [1.0 0.0 0.0 63408.86 129219.61 46085.25]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [1.0 0.0 0.0 23640.93 96189.63 148001.11]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 15505.73 127382.3 35534.17]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [1.0 0.0 0.0 64664.71 139553.16 137962.62]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [0.0 0.
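
To make the effect of `test_size=0.2` and a fixed `random_state` concrete, here is a rough pure-Python sketch of a seeded shuffle-and-cut split. The function `simple_split` is hypothetical and does not reproduce scikit-learn's shuffling, so it will not select the same rows as the cell above; it only illustrates the 40/10 proportion and the reproducibility that a fixed seed gives.

```python
import random

def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then hold out the first `test_size` share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(50))  # stand-in for the 50 startups
train, test = simple_split(rows)
print(len(train), len(test))
```

Calling `simple_split(rows)` again with the same seed returns the same partition, which is the role `random_state=0` plays in the notebook.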

## Training the Multiple Linear Regression model on the Training set

Q. Do we have to deploy Backward Elimination to keep only the features that have the lowest p-values, i.e. the most statistically significant ones?
Answer. No, not for this exercise. The class we are about to call, LinearRegression, fits a coefficient for every feature by ordinary least squares; it does not compute p-values or drop features on its own. We can therefore train on all columns and judge the model by its accuracy on the test set.


In [63]:
#Importing linear regression class from linear_model module of sci-kit learn library
from sklearn.linear_model import LinearRegression
print("LinearRegression class imported")

LinearRegression class imported


In [65]:
#Training Multiple Linear Regression model
regressor=LinearRegression()
regressor.fit(x_train,y_train)

LinearRegression()

In [66]:
print("Multiple Linear Regression model trained")

Multiple Linear Regression model trained
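
As a rough illustration of what fitting by ordinary least squares means, the sketch below solves the same kind of problem with NumPy on a tiny, noiseless, made-up data set generated from y = 2 + 3*x1 - x2. The data and variable names are assumptions for illustration only; every feature column receives a coefficient and none is dropped.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 2 + 3*x1 - 1*x2.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [4.0, 1.0], [5.0, 3.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# A leading column of ones lets the least squares solution include an intercept.
design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefficients = params[0], params[1:]

print("intercept:", intercept)
print("coefficients:", coefficients)
```

On real, noisy data such as the startup set the recovered values are estimates rather than exact, but the procedure is the same.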


## Predicting Test set results

In [71]:
y_pred=regressor.predict(x_test)
print(y_pred)

[103015.2  132582.28 132447.74  71976.1  178537.48 116161.24  67851.69
  98791.73 113969.44 167921.07]


## Comparing Actual result to the Predicted Model result


In [84]:
#np.set_printoptions(precision=2) #This will display results up to 2 decimal places
#print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test), 1)), 1))
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}) 
df

|   | Actual | Predicted |
|---|--------|-----------|
| 0 | 103282.38 | 103015.201598 |
| 1 | 144259.4 | 132582.277608 |
| 2 | 146121.95 | 132447.738452 |
| 3 | 77798.83 | 71976.098513 |
| 4 | 191050.39 | 178537.482211 |
| 5 | 105008.31 | 116161.242302 |
| 6 | 81229.06 | 67851.692097 |
| 7 | 97483.56 | 98791.733747 |
| 8 | 110352.25 | 113969.43533 |
| 9 | 166187.94 | 167921.065695 |


In [85]:
from sklearn import metrics  
#import metrics from sci-kit learn library
print('R-2:', metrics.r2_score(y_test, y_pred))

R-2: 0.9347068473281943
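
The score can also be recomputed by hand from its definition, R-2 = 1 - SS_res / SS_tot, using the actual and predicted values listed in the comparison table. This is a minimal sketch in plain Python; the variable names are chosen here for illustration.

```python
# Actual and predicted profits copied from the comparison table.
actual = [103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
          105008.31, 81229.06, 97483.56, 110352.25, 166187.94]
predicted = [103015.201598, 132582.277608, 132447.738452, 71976.098513,
             178537.482211, 116161.242302, 67851.692097, 98791.733747,
             113969.43533, 167921.065695]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: squared gaps between actual and predicted values.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: squared gaps between actual values and their mean.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print("R-2:", r2)
```

The result should agree with the value reported by `metrics.r2_score` up to the rounding of the tabulated predictions.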


## R-2 score = 0.93, meaning the model explains about 93% of the variance in profit on the test set