# Multiple Linear Regression

Multiple regression is like simple linear regression, but with more than one independent variable: we try to predict a value based on two or more variables.

In the previous section we performed linear regression involving two variables: one predictor and one target. Almost all real-world problems that you will encounter have more than two variables. Linear regression involving multiple predictor variables is called "multiple linear regression".

In this section we will use multiple linear regression to predict gas consumption (in millions of gallons) in 48 US states based on gas taxes (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population that has a driver's license.
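
As a sketch, the model for this dataset can be written as a single linear equation. The symbols are notation assumed here for illustration, not names taken from the dataset:

$$
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4
$$

Here $\hat{y}$ is the predicted gas consumption, $x_1$ to $x_4$ are the gas tax, per capita income, paved highways and proportion of licensed drivers, $b_0$ is the intercept, and $b_1$ to $b_4$ are the coefficients that the fitting step estimates.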

In [None]:
# Importing Libraries
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline


__At the beginning of a regression analysis, the dataset can be split into two groups:__

__A training dataset and a testing dataset.__

__The training dataset is used to fit the model, that is, to find the line of best fit.__

__The testing dataset (or subset) is held back and used to check the model's predictions on data it has not seen.__

![1.jpg](attachment:1.jpg)

# Sklearn 

__Sklearn (or Scikit-learn) is a Python machine learning library that offers tools for data processing, classification, regression, clustering, and model selection.__

# train_test_split

__train_test_split is a function in the sklearn.model_selection module that splits data arrays into two subsets: one for training and one for testing.__


__Syntax: train_test_split(X, y, train_size=0.*, test_size=0.*, random_state=*)__

__1. X, y. The first arguments are the arrays you want to split: here, the features X and the target y.__

__2. train_size. This parameter sets the size of the training dataset. There are three options: None, which is the default; an int, which gives the exact number of training samples; and a float between 0.0 and 1.0, which gives the proportion of the dataset to use for training.__

__3. test_size. This parameter specifies the size of the testing dataset and accepts the same kinds of values. If it is left as None, it is set to the complement of the training size. If train_size is also None, it is set to 0.25.__


__4. random_state. Here you pass an integer, which acts as the seed for the random number generator during the split. If you don't pass anything, the RandomState instance used by np.random is used instead, so the split can differ from run to run.__
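
The parameters above can be tried out on a small list. This is a minimal sketch with toy data invented for the illustration; it shows the default test size, an integer test size, and how a fixed random_state reproduces a split.

```python
from sklearn.model_selection import train_test_split

# Toy data invented for this illustration
X = list(range(20))
y = [x * x for x in X]

# Neither size is given, so test_size falls back to 0.25 (5 of 20 samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))

# An int asks for an exact number of samples instead of a proportion
X_train_int, X_test_int, _, _ = train_test_split(X, y, test_size=4, random_state=42)
print(len(X_train_int), len(X_test_int))

# The same random_state gives the same split again
X_train_again, X_test_again, _, _ = train_test_split(X, y, random_state=42)
print(list(X_test) == list(X_test_again))
```

Changing or omitting random_state would normally change which samples land in each subset, while the subset sizes stay the same.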

In [None]:
X =  list(range(15))
print (X)

In [None]:
y = [x * x for x in X]
print (y)

In [None]:
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.65,test_size=0.35, random_state=0)
print ("X_train: ", X_train)


In [None]:
print ("y_train: ", y_train)


In [None]:
print("X_test: ", X_test)


In [None]:
print ("y_test: ", y_test)

__Note: By default, the Sklearn train_test_split function shuffles the data before splitting, so the original sequence of numbers is not preserved. After a split, the values can appear in a different order.__
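
If the original order matters, shuffling can be switched off with the shuffle parameter of train_test_split. A minimal sketch, using a made-up list and hypothetical variable names:

```python
from sklearn.model_selection import train_test_split

# Made-up ordered data for illustration
numbers = list(range(10))

# shuffle=False keeps the original order:
# the first samples go to training and the last ones to testing
first_part, last_part = train_test_split(numbers, test_size=3, shuffle=False)
print(first_part)
print(last_part)
```

With shuffling disabled there is no randomness, so random_state is not needed here.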

__A common rule of thumb is an 80:20 split between training and testing data.__

# Example 01

In [None]:
# Importing the Dataset
dataset = pd.read_csv('petrol_consumption.csv')

dataset.head() 

In [None]:
# To see statistical details of the dataset, execute the following command:
dataset.describe()

In [None]:
dataset.isnull().sum()

# Visualization

In [None]:
pd.plotting.scatter_matrix(dataset)

# Correlation coefficient

In [None]:
corrMatrix = dataset.corr()
print (corrMatrix)
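
By default, corr() computes the Pearson correlation coefficient for every pair of columns. To show what a single entry of that matrix means, here is a hand-written sketch of the formula in plain Python. The function name and the sample values are invented for the illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations from the means
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only
base = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # increases together with base
falling = [10, 8, 6, 4, 2]    # decreases as base increases

print(pearson_r(base, rising))
print(pearson_r(base, falling))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.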

# Fitting of regression model

__Import the LinearRegression model from sklearn.linear_model.__

__Create the model and call its fit() method with the training data:__
 
regressor = LinearRegression() 

regressor.fit(X_train, y_train) 

__To retrieve the intercept:__

print(regressor.intercept_)

__To retrieve the slopes (one coefficient for each x variable):__

regressor.coef_

__Create a pandas DataFrame of the coefficients for all x variables:__

coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient']) 

__Making predictions: test the model on the testing data:__

y_pred = regressor.predict(X_test)  
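
One way to check these steps is to fit a model on data whose true coefficients are known. The sketch below uses synthetic values and hypothetical variable names (X_demo, y_demo, demo_model), not the petrol dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features invented for this illustration: five rows, two columns
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
# Target built from a known rule: y = 3 + 2*x1 - 1*x2
y_demo = 3 + 2 * X_demo[:, 0] - 1 * X_demo[:, 1]

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

print(demo_model.intercept_)   # close to 3
print(demo_model.coef_)        # close to [2, -1]

# Predict for a new, unseen row
print(demo_model.predict(np.array([[6.0, 4.0]])))
```

Because the synthetic target contains no noise, the fitted intercept and coefficients match the rule used to build it, up to floating-point error.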


In [None]:
# Preparing the Data

# divide the data into attributes and labels
X = dataset.drop('Petrol_Consumption', axis=1)  # axis=1 drops a column; axis=0 would drop a row

y = dataset['Petrol_Consumption']  

# dividing data into training and testing set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  

# Training and Making Predictions
from sklearn.linear_model import LinearRegression  

regressor = LinearRegression()  

regressor.fit(X_train, y_train) 

coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  

coeff_df



__This means that for a unit increase in "petroltax", there is a decrease of 24.19 million gallons in gas consumption, with the other variables held constant.__

__Similarly, a unit increase in the proportion of the population with a driver's license results in an increase of 1.324 billion gallons in gas consumption.__

__We can see that "Averageincome" and "Paved_Highways" have very little effect on gas consumption per unit change.__
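
The "unit increase" reading of a coefficient can be verified with simple arithmetic. The intercept, coefficients and feature names below are hypothetical, chosen only to illustrate the idea; they are not the fitted values from this dataset.

```python
# Hypothetical intercept and coefficients, for illustration only
intercept = 100.0
coefficients = {"tax": -5.0, "income": 0.5}

def predict(row):
    """Linear prediction: intercept plus coefficient times value for each feature."""
    return intercept + sum(coefficients[name] * value for name, value in row.items())

base = {"tax": 8.0, "income": 40.0}
bumped = {"tax": 9.0, "income": 40.0}   # tax raised by one unit, income unchanged

change = predict(bumped) - predict(base)
print(change)   # equals the "tax" coefficient
```

The prediction changes by exactly the coefficient of the feature that moved, as long as every other feature stays fixed.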

In [None]:
# Making Predictions
y_pred = regressor.predict(X_test)  

# compare the actual output values for X_test with the predicted values
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  

df  

# Error terms

In [None]:
# Evaluating the Algorithm
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

__You can see that the root mean squared error is 68.31, which is slightly greater than 10% of the mean gas consumption across all states. This means that our model is not very accurate but can still make reasonably good predictions.__
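
To make the three error terms concrete, they can be computed by hand on a few numbers. This is a plain-Python sketch with made-up values, not the output of the model above.

```python
import math

# Made-up values for illustration only
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n    # mean absolute error
mse = sum(e ** 2 for e in errors) / n    # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error, in the target's units

# RMSE as a fraction of the mean actual value
relative_rmse = rmse / (sum(actual) / n)

print(mae, mse, rmse, relative_rmse)
```

Comparing the RMSE with the mean of the actual values, as done in the paragraph above, gives a rough sense of how large the typical error is relative to the quantity being predicted.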


# R square value: Coefficient of determination

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
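
The coefficient of determination compares the model's squared errors with the squared errors of always predicting the mean. A hand-computed sketch with made-up values:

```python
# Made-up values for illustration only
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: variation the model fails to explain
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: variation of the actual values around their mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A value of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values mean it does worse.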

# Lab Exercise

In [None]:
# Fit a regression model for the Real estate.csv file

In [None]:
# Fit a regression model for the following file:
dataset = pd.read_csv('movie_boxoffice.csv', encoding='ISO-8859-1')
dataset.head()

In [None]:
dataset.shape

In [None]:
dataset.isnull().sum()

In [None]:
dataset.dropna()

In [None]:
# dropna() returns a new DataFrame and leaves dataset unchanged, so keep the result
df = dataset.dropna()

In [None]:
df.shape