# Auto MPG

# NJIT

John Morrison  
Joseph Bennett

The goal of this project is to estimate the miles per gallon (mpg) of different types of cars from a set of other attributes of each car, for example the number of cylinders in the engine, the weight of the car, and the model year.

* The data set is maintained by Carnegie Mellon and can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/auto+mpg).
* The methods for generating the models were found in the [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html) documentation.
* The rationale for using Ridge Regression was found from [NCSS](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf).
* The definition of Lasso Regression was found on [Wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics)).

## Necessary Libraries

We want to predict a numerical value for the miles per gallon (mpg), so the models we use are regression models:
1. Ordinary Least Squares model,
2. Ridge Regression model,
3. Lasso Regression model,
4. Support Vector Regression model.

We will use cross-validation to detect and limit overfitting.
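
As a rough illustration of what k-fold cross-validation does, the sketch below scores a linear model on five held-out folds. The data is synthetic and the names `X_demo`, `y_demo`, `folds`, and `fold_scores` are made up for this example; it is not the Auto MPG data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (illustrative only): y depends linearly on three features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split into 5 folds; each fold is held out once while the model
# is fit on the remaining four.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds, scoring="r2")

print("r2 per fold:", np.round(fold_scores, 4))
print("mean r2:", round(fold_scores.mean(), 4))
```

A large gap between the score on the training data and the mean score across folds is one common sign of overfitting.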

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import learning_curve, GridSearchCV
from sklearn.model_selection import train_test_split

## Read data

In [15]:
data = pd.read_csv('auto-mpg.data', sep=r'\s+')
data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


## Data Preprocessing

The data is not clean. Some horsepower values are unknown and are marked with '?'.

In [16]:
data.where(data['horsepower'] == '?').dropna()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4.0 | 98.0 | ? | 2046.0 | 19.0 | 71.0 | 1.0 | ford pinto |
| 126 | 21.0 | 6.0 | 200.0 | ? | 2875.0 | 17.0 | 74.0 | 1.0 | ford maverick |
| 330 | 40.9 | 4.0 | 85.0 | ? | 1835.0 | 17.3 | 80.0 | 2.0 | renault lecar deluxe |
| 336 | 23.6 | 4.0 | 140.0 | ? | 2905.0 | 14.3 | 80.0 | 1.0 | ford mustang cobra |
| 354 | 34.5 | 4.0 | 100.0 | ? | 2320.0 | 15.8 | 81.0 | 2.0 | renault 18i |
| 374 | 23.0 | 4.0 | 151.0 | ? | 3035.0 | 20.5 | 82.0 | 1.0 | amc concord dl |


In [17]:
for column in data:
    print("{:14}: {}".format(column, data[column].dtype))

mpg           : float64
cylinders     : int64
displacement  : float64
horsepower    : object
weight        : float64
acceleration  : float64
model_year    : int64
origin        : int64
car_name      : object


We will drop the rows with unknown horsepower, because the horsepower variable is important in determining the mpg target variable and we have no reliable value to substitute. We also need to convert the horsepower variable to a numeric type because the '?' entries caused it to be loaded as an 'object' type.

In [18]:
# .copy() makes the filtered frame independent, avoiding pandas' SettingWithCopyWarning
data = data[data['horsepower'] != '?'].copy()
data['horsepower'] = pd.to_numeric(data['horsepower'])
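
An alternative way to handle the '?' placeholder is to coerce unparseable values to NaN and then drop them. The sketch below shows this on a tiny hand-made frame; the frame `demo` and its values are invented for illustration and are not rows from the data set.

```python
import pandas as pd

# Tiny hand-made frame that mimics the '?' placeholder (values are illustrative).
demo = pd.DataFrame({
    "mpg": [25.0, 18.0, 21.0],
    "horsepower": ["?", "130.0", "95.0"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
demo["horsepower"] = pd.to_numeric(demo["horsepower"], errors="coerce")

# Drop the rows where horsepower is missing and renumber the index.
demo_clean = demo.dropna(subset=["horsepower"]).reset_index(drop=True)
print(demo_clean.dtypes)
print(demo_clean)
```

Passing `na_values='?'` to `pd.read_csv` should give a similar result at load time, since the column is then parsed as numeric with NaN for the unknown entries.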

In [19]:
data.shape

(392, 9)

In [20]:
for column in data:
    print("{:14}: {}".format(column, data[column].dtype))

mpg           : float64
cylinders     : int64
displacement  : float64
horsepower    : float64
weight        : float64
acceleration  : float64
model_year    : int64
origin        : int64
car_name      : object


Drop the car name, since it is not used in the regression analysis.

In [21]:
data = data.drop(['car_name'], axis=1)

## Set up the target (y) and independent (X) variables, split into training and test sets

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['mpg'], axis=1), pd.DataFrame(data, columns=['mpg']), 
    test_size=0.33, random_state=42)

## Independent variables

In [23]:
X_train.head()

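| | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 368 | 4 | 112.0 | 88.0 | 2640.0 | 18.6 | 82 | 1 |
| 182 | 4 | 107.0 | 86.0 | 2464.0 | 15.5 | 76 | 2 |
| 120 | 4 | 121.0 | 112.0 | 2868.0 | 15.5 | 73 | 2 |
| 309 | 4 | 98.0 | 76.0 | 2144.0 | 14.7 | 80 | 2 |
| 221 | 8 | 305.0 | 145.0 | 3880.0 | 12.5 | 77 | 1 |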


## Target

In [24]:
y_train.head()

| | mpg |
|---|---|
| 368 | 27.0 |
| 182 | 28.0 |
| 120 | 19.0 |
| 309 | 41.5 |
| 221 | 17.5 |


## Building the Models

#### Selected Models
1. Ordinary Least Squares
2. Ridge Regression
3. Lasso
4. Support Vector Regression

### Ordinary Least Squares

According to [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares), Ordinary Least Squares fits a linear model that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.

Scores from a k=5 cross-validation on the training set, and the mean squared error on the held-out test set, for an Ordinary Least Squares linear model:

In [36]:
model = LinearRegression()
# 'normalize' is not searched: LinearRegression no longer accepts it as of scikit-learn 1.2
parameters = {'fit_intercept': [True, False],
              'copy_X': [True, False]}
ols = GridSearchCV(model, parameters, cv=5, scoring="r2")
ols.fit(X_train, y_train)
print("Best mean CV r2: {0:.4f}".format(ols.best_score_))
# Mean (not sum) of the squared residuals on the test set
print("Test mean squared error: {0:.4f}".format(mean_squared_error(y_test, ols.predict(X_test))))

Best mean CV r2: 0.8061
Test mean squared error: 10.5095
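
To make the "minimize the residual sum of squares" idea concrete, the sketch below solves a small least-squares problem directly with `np.linalg.lstsq` and compares the result with `LinearRegression`. The data is synthetic and the names (`X_demo`, `y_demo`, `A`, `beta`, `sk_model`, `rss`) are chosen only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): intercept 3.0 and two slopes, plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.2, size=50)

# Add a column of ones so the intercept is estimated along with the slopes.
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# lstsq returns the coefficients that minimize ||A @ beta - y||^2.
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

sk_model = LinearRegression().fit(X_demo, y_demo)
rss = np.sum((A @ beta - y_demo) ** 2)

print("lstsq intercept and slopes:", beta)
print("sklearn intercept and slopes:", sk_model.intercept_, sk_model.coef_)
print("residual sum of squares:", rss)
```

The two sets of coefficients should agree up to floating-point error, since both solve the same minimization problem.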


### Ridge Regression

Ridge Regression addresses issues with multicollinearity ([NCSS](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf)). Multicollinearity is a property of a dataset in which one independent variable can be predicted, to a large degree, by a linear combination of the other independent variables.

Scores from a k=5 cross-validation on the training set, and the mean squared error on the held-out test set, for a Ridge Regression linear model:

In [37]:
model = Ridge()
parameters = {'alpha': [.1, .2, .3]}
ridge = GridSearchCV(model, parameters, cv=5, scoring="r2")
ridge.fit(X_train, y_train)
print("Best mean CV r2: {0:.4f}".format(ridge.best_score_))
# Mean (not sum) of the squared residuals on the test set
print("Test mean squared error: {0:.4f}".format(mean_squared_error(y_test, ridge.predict(X_test))))

Best mean CV r2: 0.8061
Test mean squared error: 10.5067
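
The grid search reports only the best score. As a sketch of how one might see which `alpha` was selected and how the candidates compare, the example below runs a similar search on synthetic data. The wider `alpha` grid and the names `search` and `results` are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative only).
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(scale=0.3, size=120)

# A hypothetical, wider alpha grid than the one used in the notebook.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["alpha"])

# cv_results_ holds one row per candidate with its mean and std score across folds.
results = pd.DataFrame(search.cv_results_)[["param_alpha", "mean_test_score", "std_test_score"]]
print(results)
```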


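The scores for ridge regression are almost identical to the ordinary least squares scores. With penalties this small (alpha between 0.1 and 0.3 on unscaled features), this suggests that regularization at these strengths adds little for this dataset. It does not by itself show that multicollinearity is absent.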
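
Multicollinearity can also be checked directly rather than inferred from model scores. The sketch below uses a correlation matrix and a hand-written variance inflation factor (VIF) on synthetic data where one column is nearly a copy of another. The helper `vif` and the frame `demo` are written for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# Pairwise correlations: values near +1 or -1 point to collinear pairs.
print(demo.corr().round(2))

def vif(frame, column):
    """Variance inflation factor: 1 / (1 - R^2) from regressing one column on the rest."""
    others = frame.drop(columns=[column])
    r2 = LinearRegression().fit(others, frame[column]).score(others, frame[column])
    return 1.0 / (1.0 - r2)

vifs = {col: vif(demo, col) for col in demo.columns}
print(vifs)
```

A VIF well above roughly 5 to 10 is a common rule of thumb for problematic collinearity; the same check could be applied to the columns of `X_train`.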

### Lasso Regression

Lasso Regression is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and the interpretability of the statistical model it produces ([Wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics))).

Scores from a k=5 cross-validation on the training set, and the mean squared error on the held-out test set, for a Lasso Regression linear model:

In [40]:
model = Lasso()
parameters = {'alpha': [.1, .2, .3]}
lasso = GridSearchCV(model, parameters, cv=5, scoring="r2")
lasso.fit(X_train, y_train)
print("Best mean CV r2: {0:.4f}".format(lasso.best_score_))
# mean_squared_error handles the shape difference between the
# (n, 1) y_test DataFrame and the array returned by predict
print("Test mean squared error: {0:.4f}".format(mean_squared_error(y_test, lasso.predict(X_test))))
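
The variable-selection behaviour of Lasso can be seen on a small synthetic problem where only two of five features affect the target. This is a sketch under assumed settings (`alpha=0.5`, made-up data and names), not a result for the Auto MPG data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (illustrative only): only the first two features influence the target.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ols_demo = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# OLS gives every feature a small nonzero weight; the L1 penalty
# shrinks the weights of the uninformative features to exactly zero.
print("OLS coefficients:  ", np.round(ols_demo.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("features kept by Lasso:", np.flatnonzero(lasso_demo.coef_))
```

The surviving Lasso coefficients are also shrunk toward zero relative to OLS, which is the regularization half of the method.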

### Support Vector Regression

According to [scikit-learn](https://scikit-learn.org/stable/modules/svm.html#svm-regression), the method of Support Vector Classification can be extended to solve regression problems as well; this is called Support Vector Regression.

In [None]:
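from sklearn.model_selection import cross_val_score

svr_rbf = SVR(gamma='scale')
# Cross-validate on the training split; SVR expects a 1-D target
svr_rbf_scores = cross_val_score(svr_rbf, X_train, y_train.values.ravel(), scoring='neg_mean_squared_error', cv=5)
# Scores are negated MSE values, so flip the sign to report a positive MSE
-np.mean(svr_rbf_scores)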
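
SVR with an RBF kernel is generally sensitive to the scale of the input features. The sketch below compares an unscaled SVR with one wrapped in a `StandardScaler` pipeline, on synthetic data with two features of very different magnitudes. All names and data here are invented for illustration; whether scaling helps on the Auto MPG data would need to be checked separately.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (illustrative only): two features on very different scales.
rng = np.random.default_rng(4)
small = rng.uniform(0, 1, size=200)
large = rng.uniform(1000, 5000, size=200)
X_demo = np.column_stack([small, large])
y_demo = 10 * small + large / 1000 + rng.normal(scale=0.1, size=200)

plain_svr = SVR(gamma="scale")
# The pipeline refits the scaler inside each fold, so no test-fold
# statistics leak into training.
scaled_svr = make_pipeline(StandardScaler(), SVR(gamma="scale"))

plain_mse = -cross_val_score(plain_svr, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5).mean()
scaled_mse = -cross_val_score(scaled_svr, X_demo, y_demo,
                              scoring="neg_mean_squared_error", cv=5).mean()

print("MSE without scaling:", plain_mse)
print("MSE with scaling:   ", scaled_mse)
```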