

## Problem Statement

Use multiple linear regression to **predict the consumption of petrol** from four variables: the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset from the file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [32]:
df=pd.read_csv('petrol.csv')
df.head()

|   | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |


In [33]:
df.describe()

|   | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| count | 48.0 | 48.0 | 48.0 | 48.0 | 48.0 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.95077 | 573.623768 | 3491.507166 | 0.05547 | 111.885816 |
| min | 5.0 | 3063.0 | 431.0 | 0.451 | 344.0 |
| 25% | 7.0 | 3739.0 | 3110.25 | 0.52975 | 509.5 |
| 50% | 7.5 | 4298.0 | 4735.5 | 0.5645 | 568.5 |
| 75% | 8.125 | 4578.75 | 7156.0 | 0.59525 | 632.75 |
| max | 10.0 | 5342.0 | 17782.0 | 0.724 | 968.0 |


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
Petrol_tax                      48 non-null float64
Average_income                  48 non-null int64
Paved_Highways                  48 non-null int64
Population_Driver_licence(%)    48 non-null float64
Petrol_Consumption              48 non-null int64
dtypes: float64(2), int64(3)
memory usage: 1.9 KB


# Question 2 - Cap outliers 

Find the outliers and cap them. Use (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Only data points that fall within this range are kept; any data point outside it is an outlier, and the entire row containing it must be removed.

In [35]:
Q1=df.quantile(0.25)
Q3=df.quantile(0.75)
IQR=Q3-Q1
IQR

Petrol_tax                         1.1250
Average_income                   839.7500
Paved_Highways                  4045.7500
Population_Driver_licence(%)       0.0655
Petrol_Consumption               123.2500
dtype: float64

In [36]:
Upper_lim=Q3+1.5*IQR
lower_lim=Q1-1.5*IQR
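As a hand check, the fences can be recomputed from the quartiles that `df.describe()` reports in Question 1. The sketch below is illustrative only: the dictionary `stats` and the helper `iqr_fences` are names introduced here, and the numbers are copied from the `describe()` output rather than read from `petrol.csv`.

```python
# Quartiles, min and max copied from the df.describe() output in Question 1.
stats = {
    "Petrol_tax": {"min": 5.0, "q1": 7.0, "q3": 8.125, "max": 10.0},
    "Average_income": {"min": 3063.0, "q1": 3739.0, "q3": 4578.75, "max": 5342.0},
    "Paved_Highways": {"min": 431.0, "q1": 3110.25, "q3": 7156.0, "max": 17782.0},
    "Population_Driver_licence(%)": {"min": 0.451, "q1": 0.52975, "q3": 0.59525, "max": 0.724},
    "Petrol_Consumption": {"min": 344.0, "q1": 509.5, "q3": 632.75, "max": 968.0},
}


def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


columns_with_outliers = []
for name, s in stats.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    # A column contains at least one outlier if its min or max is beyond a fence.
    has_outlier = s["min"] < lower or s["max"] > upper
    if has_outlier:
        columns_with_outliers.append(name)
    print(f"{name}: lower={lower:.4f}, upper={upper:.4f}, outlier present={has_outlier}")
```

On these numbers, every column except `Average_income` has a minimum or maximum beyond its fence, so some rows are expected to be dropped by the filter.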

In [37]:
# Keep only the rows in which no column falls outside [lower_lim, Upper_lim]
df1=df[~((df<lower_lim)|(df>Upper_lim)).any(axis=1)]

In [38]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 0 to 47
Data columns (total 5 columns):
Petrol_tax                      43 non-null float64
Average_income                  43 non-null int64
Paved_Highways                  43 non-null int64
Population_Driver_licence(%)    43 non-null float64
Petrol_Consumption              43 non-null int64
dtypes: float64(2), int64(3)
memory usage: 2.0 KB


In [39]:
df1.head()

|   | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |
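The row filter used in this question combines an element-wise comparison with `any(axis=1)`. A minimal sketch on toy data (the frame `toy` and its values are made up for illustration and are not taken from `petrol.csv`) shows how a single out-of-range cell removes its whole row:

```python
import pandas as pd

# Toy data (not from petrol.csv): column "a" has one extreme value.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True wherever a single cell lies outside its own column's fences.
outside = (toy < lower) | (toy > upper)

# any(axis=1) flags a row if at least one of its cells is outside;
# ~ inverts the flag so that only fully in-range rows are kept.
kept = toy[~outside.any(axis=1)]
print(kept)
```

The filtered frame keeps the original index labels, which is why `df1.info()` reports 43 entries with labels still running from 0 to 47.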


# Question 3 - Transform the dataset 
Divide the data into feature (X) and target (Y) sets.

In [40]:
X=df1.drop(columns=['Petrol_Consumption'])
Y=df1['Petrol_Consumption']

# Question 4 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [41]:
from sklearn.model_selection import train_test_split


In [42]:
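# Note: test_size=0.3 gives a 70-30 split, not the 80-20 split the question asks for;
# the results reported below were produced with this setting.
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=0)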
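The question asks for an 80-20 split and for the shapes to be printed, whereas the cell above passes `test_size=0.3`. A small sketch, using placeholder arrays with the same dimensions as the cleaned data (43 rows, 4 features; `X_demo`, `y_demo` and `shapes` are names introduced here), shows how the shapes differ between the two settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the cleaned data:
# 43 rows and 4 feature columns (the values themselves are arbitrary).
X_demo = np.arange(43 * 4).reshape(43, 4)
y_demo = np.arange(43)

shapes = {}
for test_size in (0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0
    )
    shapes[test_size] = (X_tr.shape, X_te.shape)
    print(f"test_size={test_size}: train {X_tr.shape}, test {X_te.shape}")
```

In the notebook itself, the equivalent check would be `print(X_train.shape, X_test.shape)` after the split.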

# Question 5 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [43]:
from sklearn.linear_model import LinearRegression

In [44]:
lm=LinearRegression()
lm.fit(X_train,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [45]:
lm.coef_

array([-4.85092410e+01, -8.60668301e-02, -4.24564300e-03,  8.39706647e+02])

In [46]:
X.columns

Index([u'Petrol_tax', u'Average_income', u'Paved_Highways',
       u'Population_Driver_licence(%)'],
      dtype='object')
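The question asks for a single dataframe that pairs each coefficient with its feature name. One possible way to build it is sketched below; the values are copied from the `lm.coef_` and `X.columns` outputs above so that the block runs on its own, and `coef_df` is a name introduced here.

```python
import pandas as pd

# Values copied from the lm.coef_ and X.columns outputs above.
coefficients = [-4.85092410e+01, -8.60668301e-02, -4.24564300e-03, 8.39706647e+02]
feature_names = [
    "Petrol_tax",
    "Average_income",
    "Paved_Highways",
    "Population_Driver_licence(%)",
]

# One row of coefficients, with the feature names as columns.
coef_df = pd.DataFrame([coefficients], columns=feature_names, index=["Coefficient"])
print(coef_df)
```

Inside the notebook, the same frame could presumably be built directly with `pd.DataFrame([lm.coef_], columns=X.columns)`.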

# R-Square 

# Question 6 - Evaluate the model 
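Calculate the score for the above model. For a regression model, `lm.score` returns the coefficient of determination (R²) rather than a classification accuracy.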

In [47]:
print('R^2',lm.score(X_train,Y_train))

('R^2', 0.7471365231708735)


In [49]:
print('R^2',lm.score(X_test,Y_test))

('R^2', 0.17570305578644507)
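For a regressor, `score` reports R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand on made-up numbers (the helper `r_squared` and the sample values are illustrative, not taken from the petrol data):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r2 = r_squared(y_true, y_pred)
print(round(r2, 4))
```

Because SS_res is measured against predictions while SS_tot is measured against the mean of the evaluated targets, R² on held-out data can fall well below the training value and can even be negative.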


# Question 7: Repeat the same multiple linear regression modelling by adding both Income and Highway features
Find R².


In [50]:
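# Drop Petrol_tax and the target; the remaining features are
# Average_income, Paved_Highways and Population_Driver_licence(%)
X=df1.drop(columns=['Petrol_tax','Petrol_Consumption'])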

In [51]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=0)

In [52]:
lm=LinearRegression()
lm.fit(X_train,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [53]:
print('R^2',lm.score(X_train,Y_train))

('R^2', 0.6142399697238925)


In [54]:
print('R^2',lm.score(X_test,Y_test))

('R^2', 0.40856118403230757)


# Question 8: Print the coefficients of the multilinear regression model

In [55]:
lm.coef_

array([-7.13749234e-02,  1.34186609e-03,  1.13948422e+03])

# Question 9
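In one or two sentences, give your reasoning about R-Square on the basis of the above findings.

**Answer:** With all four features the model scores R² ≈ 0.75 on the training set but only ≈ 0.18 on the test set, so it fits the training data far better than it generalises. Dropping `Petrol_tax` lowers the training R² to ≈ 0.61 but raises the test R² to ≈ 0.41; with only 43 rows in total, both test scores come from a very small held-out sample and should be read with caution.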
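To make the comparison concrete, the gap between training and test R² can be tabulated for the two models. The sketch below only restates the scores printed in Questions 6 and 7; the dictionary names `scores` and `gaps` are introduced here for illustration.

```python
# R^2 values copied from the outputs of Questions 6 and 7.
scores = {
    "all four features": {"train": 0.7471365231708735, "test": 0.17570305578644507},
    "without Petrol_tax": {"train": 0.6142399697238925, "test": 0.40856118403230757},
}

gaps = {}
for name, s in scores.items():
    # A large positive gap means the model does much better on data it has seen.
    gaps[name] = s["train"] - s["test"]
    print(f"{name}: train={s['train']:.3f}, test={s['test']:.3f}, gap={gaps[name]:.3f}")
```

A smaller gap suggests, but does not prove, better generalisation; with a data set this small, the result of a single split can change noticeably with a different `random_state`.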