# External Lab 

Each question carries 1 mark.

# Multiple Linear Regression

## Problem Statement

Use Multiple Linear Regression to **predict the consumption of petrol**, given that the relevant variables are the petrol tax, the per capita income, the number of miles of paved highway, and the proportion of the population with driver's licenses.

## Dataset

There are 48 rows of data.  The data include:

      I,  the index;
      A1, the petrol tax;
      A2, the per capita income;
      A3, the number of miles of paved highway;
      A4, the proportion of drivers;
      B,  the consumption of petrol.

### Reference 

    Helmut Spaeth,
    Mathematical Algorithms for Linear Regression,
    Academic Press, 1991,
    ISBN 0-12-656460-4.

    S Weisberg,
    Applied Linear Regression,
    New York, 1980, pages 32-33.

## Question 1 - Exploratory Data Analysis

*Read the dataset given in file named **'petrol.csv'**. Check the statistical details of the dataset.*

**Hint:** You can use **df.describe()**

In [1]:
import numpy as np
import pandas as pd

petroldf = pd.read_csv("petrol.csv")

In [2]:
petroldf.describe().round(2).transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tax,48.0,7.67,0.95,5.0,7.0,7.5,8.12,10.0
income,48.0,4241.83,573.62,3063.0,3739.0,4298.0,4578.75,5342.0
highway,48.0,5565.42,3491.51,431.0,3110.25,4735.5,7156.0,17782.0
dl,48.0,0.57,0.06,0.45,0.53,0.56,0.6,0.72
consumption,48.0,576.77,111.89,344.0,509.5,568.5,632.75,968.0
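
The later cells address columns as `' consumption'`, `' income'` and `' highway'`, with a leading space, which suggests that the CSV header has spaces after its commas. A minimal sketch of how such names could be inspected and, optionally, normalised is shown below. It uses a small hypothetical inline CSV (`sample_csv`) rather than the real file, and the names `sample_df` and `clean_df` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a header with spaces after the commas.
sample_csv = (
    "tax, income, highway, dl, consumption\n"
    "9.0,3571,1976,0.525,541\n"
    "9.0,4092,1250,0.572,524\n"
)

sample_df = pd.read_csv(io.StringIO(sample_csv))
raw_columns = list(sample_df.columns)      # names exactly as pandas parsed them

# Strip surrounding whitespace so columns can be addressed without the leading space.
clean_df = sample_df.rename(columns=str.strip)
clean_columns = list(clean_df.columns)

print(raw_columns)
print(clean_columns)
```

If the names were stripped this way in the notebook, the later cells would have to use `'consumption'` instead of `' consumption'`.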


# Question 2 - Cap outliers 

Find the outliers and cap them. Use (Q1 - 1.5 * IQR) as the minimum cap and (Q3 + 1.5 * IQR) as the maximum cap. Only data points that fall within this range are kept; any data point outside it is an outlier, and the entire row containing it must be removed.

In [3]:
# Calculating Q1 for each column

Q1 = petroldf.quantile(0.25)

# Calculating Q3 for each column

Q3 = petroldf.quantile(0.75)

# Calculating the interquartile range (IQR) for each column

IQR = Q3-Q1

# Identifying & Printing outliers

petroldf_ol = petroldf[(petroldf < (Q1-1.5*IQR)) | (petroldf > (Q3+1.5*IQR))]

petroldf_ol.dropna(how='all') # Prints only the rows with outlier values (non NaN values in the below result are outliers)

Unnamed: 0,tax,income,highway,dl,consumption
5,10.0,,,,
11,,,14186.0,,
18,,,,0.724,865.0
36,5.0,,17782.0,,
39,,,,,968.0


From the above results, the tax column has 2 outliers, the income column has 0, the highway column has 2, the dl column has 1 and the consumption column has 2. These 7 outlier values sit in 5 rows (5, 11, 18, 36 and 39), which need to be removed.
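
The number of outlier values and the number of rows to drop can differ, because one row may hold outliers in several columns. A small sketch of how both counts could be checked is given below. The DataFrame `toy` is hypothetical data invented for illustration, not the petrol dataset.

```python
import pandas as pd

# Hypothetical toy data: the last row is extreme in both columns.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 100],
    "b": [10, 11, 12, 13, 14, 15, 16, 170],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Boolean mask: True where a value lies outside the 1.5 * IQR fences.
outlier_mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

outliers_per_column = outlier_mask.sum()                    # outlier values per column
total_outlier_values = int(outlier_mask.sum().sum())        # outlier values overall
rows_with_outliers = int(outlier_mask.any(axis=1).sum())    # rows with at least one outlier

print(outliers_per_column)
print(total_outlier_values, rows_with_outliers)
```

Here two outlier values fall in a single row, so only one row would be removed.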

In [4]:
# Removing Outliers from the dataset

petroldf_wol = petroldf[~((petroldf < (Q1 - 1.5 * IQR)) |(petroldf > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f'Shape of the dataset without outliers: {petroldf_wol.shape}') # 43 rows remain after deleting 5 rows : 5,11,18,36,39
print('\nDataset without outliers:')
petroldf_wol

Shape of the dataset without outliers: (43, 5)

Dataset without outliers:


Unnamed: 0,tax,income,highway,dl,consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410
6,8.0,5319,11868,0.451,344
7,8.0,5126,2138,0.553,467
8,8.0,4447,8577,0.529,464
9,7.0,4512,8507,0.552,498
10,8.0,4391,5939,0.53,580


# Question 3 - Independent variables and collinearity 
Which attributes seem to have a stronger association with the dependent variable consumption?

In [59]:
## Correlation Matrix

petroldf_wol.corr()

Unnamed: 0,tax,income,highway,dl,consumption
tax,1.0,-0.109537,-0.390602,-0.314702,-0.446116
income,-0.109537,1.0,0.051169,0.150689,-0.347326
highway,-0.390602,0.051169,1.0,-0.016193,0.034309
dl,-0.314702,0.150689,-0.016193,1.0,0.611788
consumption,-0.446116,-0.347326,0.034309,0.611788,1.0


Observing the correlation values above, the proportion of drivers (dl) has the strongest association with consumption (0.61). Tax (-0.45) and income (-0.35) are negatively associated with consumption, while highway (0.03) shows almost no linear association.

Insights:

As tax increases, consumption decreases.

As the proportion of drivers increases, consumption increases.
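
Reading the strongest associations off a full correlation matrix is easier when the target column is sorted by absolute value. The following is a minimal sketch of that idea on hypothetical data; the frame `toy` and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: "x1" tracks the target closely, "x2" moves against it.
toy = pd.DataFrame({
    "x2": [6, 4, 5, 2, 3, 1],
    "x1": [1, 2, 3, 4, 5, 6],
    "target": [2, 4, 6, 8, 10, 13],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

# Rank by strength of association, ignoring the sign.
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

The sign of each value still shows the direction of the relationship, while the ordering reflects only its strength.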

# Question 4 - Transform the dataset 
Divide the data into feature(X) and target(Y) sets.

In [74]:
# Predictor variables (Feature Dataset)

X = petroldf_wol.drop([' consumption'],axis=1)

X

Unnamed: 0,tax,income,highway,dl
0,9.0,3571,1976,0.525
1,9.0,4092,1250,0.572
2,9.0,3865,1586,0.58
3,7.5,4870,2351,0.529
4,8.0,4399,431,0.544
6,8.0,5319,11868,0.451
7,8.0,5126,2138,0.553
8,8.0,4447,8577,0.529
9,7.0,4512,8507,0.552
10,8.0,4391,5939,0.53


In [75]:
# Dependent Variables (Target Dataset)

Y = petroldf_wol[[' consumption']]

Y

Unnamed: 0,consumption
0,541
1,524
2,561
3,414
4,410
6,344
7,467
8,464
9,498
10,580


# Question 5 - Split data into train, test sets 
Divide the data into training and test sets with 80-20 split using scikit-learn. Print the shapes of training and test feature sets.

In [77]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20 , random_state=1)

In [81]:
print(f'Shape of Training feature dataset, (X_train)  : {X_train.shape}')
print(f'Shape of Training target dataset,  (Y_train)  : {Y_train.shape}')
print(f'Shape of Testing feature dataset,  (X_test)   : {X_test.shape}')
print(f'Shape of Testing target dataset,   (Y_test)   : {Y_test.shape}')

Shape of Training feature dataset, (X_train)  : (34, 4)
Shape of Training target dataset,  (Y_train)  : (34, 1)
Shape of Testing feature dataset,  (X_test)   : (9, 4)
Shape of Testing target dataset,   (Y_test)   : (9, 1)
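
The 34/9 split can be reproduced by hand. The sketch below assumes that the test share is rounded up to a whole number of rows and the remainder goes to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 43      # rows left after outlier removal
test_size = 0.20    # fraction passed to train_test_split

# 0.20 * 43 = 8.6, which is rounded up to 9 test rows (assumed rounding rule).
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test    # the remaining rows form the training set

print(n_train, n_test)
```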


# Question 6 - Build Model 
Estimate the coefficients for each input feature. Construct and display a dataframe with coefficients and X.columns as columns

In [85]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [94]:
coefficients_df = pd.DataFrame(regression_model.coef_,index=['coefficients'],columns=X.columns)
coefficients_df

Unnamed: 0,tax,income,highway,dl
coefficients,-39.411584,-0.062628,-0.003022,950.882744
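
The table above lists only the slopes; the fitted model also has an intercept, which scikit-learn stores in `intercept_`. As a rough illustration of what an ordinary least squares fit estimates, the sketch below recovers a known intercept and known slopes from hypothetical noise-free data using NumPy. The arrays `X_toy` and `y_toy` are invented for the example.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 5 + 2*x1 - 3*x2.
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y_toy = 5.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1]

# Prepend a column of ones so the first fitted parameter is the intercept.
design = np.column_stack([np.ones(len(X_toy)), X_toy])
params, *_ = np.linalg.lstsq(design, y_toy, rcond=None)

intercept, coefs = params[0], params[1:]
print(intercept, coefs)
```

Each slope is the change in the prediction for a one-unit change in that feature with the others held fixed, so slopes on differently scaled features (for example dl versus income) are not directly comparable in size.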


# R-Square 

# Question 7 - Evaluate the model 
Calculate the accuracy score for the above model.

In [98]:
print(f'The accuracy score (R2) of the above model is : {regression_model.score(X_test, Y_test):.2f}')

The accuracy score (R2) of the above model is : 0.69
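
For a regressor, the score reported here is the coefficient of determination, R2 = 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal pure-Python sketch of that formula follows; the helper `r_squared` and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

score = r_squared(y_true, y_pred)
baseline = r_squared(y_true, [6.0] * 4)   # always predicting the mean

print(score, baseline)
```

Predicting the mean for every row gives a score of 0, and a model that does worse than that on unseen data gets a negative score.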


# Question 8: Repeat the same Multi linear regression modelling by adding both Income and Highway features
Find R2 


In [110]:
# Model under Question 6 is built including the Income and Highway features. Hence building the model without those features here

X_train2 = X_train.drop([' income',' highway'],axis=1)
X_test2  = X_test.drop([' income',' highway'],axis=1)

regression_model2 = LinearRegression()
regression_model2.fit(X_train2, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [107]:
print(f'The accuracy score (R2) of the above model is : {regression_model2.score(X_test2, Y_test):.2f}')

The accuracy score (R2) of the above model is : 0.29


# Question 9: Print the coefficients of the multilinear regression model

In [111]:
coefficients2_df = pd.DataFrame(regression_model2.coef_,index=['coefficients'],columns=X_train2.columns)
coefficients2_df

Unnamed: 0,tax,dl
coefficients,-30.709243,892.886209


# Question 10 
In one or two sentences give reasoning on R-Square on the basis of above findings
Answer

### R-squared tends to increase as more independent variables are added to the analysis

The R-squared value of the model with the income and highway features included is 0.69, which is much higher than the value (0.29) of the model without these features. This suggests that the model with all the features included is the better of the two.

However, R-squared on the training data never decreases when independent variables are added, so part of the increase may simply reflect the larger number of features, and R-squared alone may not be a reliable measure here (the value is also not especially high in absolute terms). A measure that penalises extra features, such as adjusted R-squared, can be used to judge the model with more confidence.
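
Adjusted R-squared is computed as 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of features. The sketch below shows the penalty at work; the helper `adjusted_r_squared` and the R2 value of 0.70 are illustrative assumptions, not results from the models above.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-squared: penalises R-squared for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative value only: the same raw R-squared achieved with 2 and with 4 features.
illustrative_r2 = 0.70
adj_two = adjusted_r_squared(illustrative_r2, n_samples=34, n_features=2)
adj_four = adjusted_r_squared(illustrative_r2, n_samples=34, n_features=4)

print(round(adj_two, 4), round(adj_four, 4))
```

For the same raw R-squared, the model with more features gets the lower adjusted value, so an extra feature has to improve the fit enough to offset its penalty.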