## Multiple Linear Regression 

In multiple linear regression, we use more than one independent feature to predict a continuous variable, called the dependent feature or response variable.

In this implementation, we have a dataset of 50 startups with their spending across different departments (R&D, administration, and marketing) and the state each one operates in. Based on these features, we will predict each company's Profit.
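
Written out, the model we are about to fit has the general form below. The symbols are notation introduced here for illustration and are not part of the dataset: $x_1, \dots, x_p$ are the independent features, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients to be learned, and $\varepsilon$ is the error term.

$$
\text{Profit} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$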

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("50_startups.csv")

In [3]:
data.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [6]:
# Split the dataset into independent and dependent variables
# X holds the independent features; y holds the dependent feature, Profit

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

The dataset has one categorical variable, State, which must be converted to numeric form before modeling. Because State has more than two categories, we use one-hot encoding (with only two categories, a label encoder would also work).
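
To make the encoding concrete, here is a minimal plain-Python sketch of one-hot encoding with the first category dropped. The helper `one_hot_drop_first` and the list `sample_states` are hypothetical names used only for illustration; the notebook itself relies on `pd.get_dummies` with `drop_first=True`. The sample values are the State entries of the first five rows shown above.

```python
def one_hot_drop_first(values):
    """One-hot encode labels, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is represented by all zeros
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows


# State values from the first five rows of the dataset
sample_states = ["New York", "California", "Florida", "New York", "Florida"]
columns, encoded = one_hot_drop_first(sample_states)

print(columns)
for state, row in zip(sample_states, encoded):
    print(state, row)
```

Dropping one category avoids the dummy variable trap: with an intercept in the model, a full set of dummy columns always sums to 1 and is therefore perfectly collinear with it. Here California becomes the baseline, encoded as zeros in both remaining columns.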

In [9]:
# Convert the categorical feature State to numeric dummy columns using pd.get_dummies()

states = pd.get_dummies(X['State'],drop_first=True)

In [10]:
states.head()

Unnamed: 0,Florida,New York
0,0,1
1,0,0
2,1,0
3,0,1
4,1,0


Now that the State column has been converted into dummy variables, we can remove the original State column from X.

In [11]:
X = X.drop(labels='State',axis=1)

In [13]:
# The State column has been removed

X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend
0,165349.2,136897.8,471784.1
1,162597.7,151377.59,443898.53
2,153441.51,101145.55,407934.54
3,144372.41,118671.85,383199.62
4,142107.34,91391.77,366168.42


In [15]:
# Concatenate the dataframe X with the dummy variables we just created

X = pd.concat([X,states],axis=1)

In [16]:
X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Florida,New York
0,165349.2,136897.8,471784.1,0,1
1,162597.7,151377.59,443898.53,0,0
2,153441.51,101145.55,407934.54,1,0
3,144372.41,118671.85,383199.62,0,1
4,142107.34,91391.77,366168.42,1,0


In [19]:
# Now that the data is ready for modeling, let's split it into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=123)

In [21]:
# Fit Multiple Linear Regression

from sklearn.linear_model import LinearRegression

LR = LinearRegression()
LR.fit(X_train,y_train)

LinearRegression()

In [22]:
# Predict on the test set

y_pred = LR.predict(X_test)

In [24]:
# Check prediction score using R squared

from sklearn.metrics import r2_score

score = r2_score(y_test,y_pred)

In [25]:
score

0.9667998486975283

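The model has an R squared of about 0.97 on the test set, meaning it explains roughly 97% of the variance in Profit for the held-out startups. Keep in mind that with test_size=0.2 on 50 rows, this score is based on only 10 test samples.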
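
For reference, R squared compares the model's squared errors with those of a baseline that always predicts the mean of the true values. The sketch below computes it by hand in plain Python; the function `r_squared` and the toy numbers are made up for illustration and are not taken from the startup data.

```python
def r_squared(y_true, y_hat):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, chosen only to keep the arithmetic easy to follow
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.5, 7.0, 9.5]

toy_r2 = r_squared(toy_true, toy_pred)
print(toy_r2)
```

A score of 1 means the predictions match the true values exactly, while a score of 0 means the model does no better than predicting the mean every time.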