# Auto MPG

# NJIT

John Morrison  
Joseph Bennett

The goal of this project is to estimate the miles per gallon (mpg) of different types of cars from a set of other attributes of each car, such as the number of cylinders in the engine, the weight of the car, and the model year.

* The data set is maintained by Carnegie Mellon and can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/auto+mpg).
* The methods for generating the models were found in the [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html) documentation.
* The rationale for using Ridge Regression was found from [NCSS](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf).
* The definition of Lasso Regression was found on [Wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics)).

## Necessary Libraries

We want to predict a numerical value for the miles per gallon (mpg), so the models we use will be regression models:
1. Ordinary Least Squares model,
2. Ridge Regression model,
3. Lasso Regression model,
4. Support Vector Regression model.  

We will use cross-validation to estimate how well each model generalizes to unseen data and to detect overfitting.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn import svm
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
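
To make the cross-validation step concrete, here is a minimal sketch of how five folds are formed. It assumes a toy array of ten samples (`X_toy`, a hypothetical stand-in for the real data). For a regressor, passing `cv=5` to `cross_val_score` corresponds to `KFold(n_splits=5)` without shuffling, so the folds are consecutive slices of the rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples standing in for the rows of the data set (hypothetical data).
X_toy = np.arange(10).reshape(-1, 1)

# Five consecutive, unshuffled folds.
kfold = KFold(n_splits=5)

test_folds = []
for train_idx, test_idx in kfold.split(X_toy):
    # Each sample is held out exactly once; the model is fit on the remaining rows.
    print("train:", train_idx, "test:", test_idx)
    test_folds.append(test_idx)
```

Because the folds are not shuffled, any ordering in the file (the rows here appear to be sorted by model_year) carries over into the folds, which may be worth keeping in mind when fold scores differ widely.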

## Read data

In [4]:
data = pd.read_csv('auto-mpg.data', sep=r'\s+')  # raw string avoids an invalid-escape warning
data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


## Data Preprocessing

The data is not clean: some horsepower values are unknown and are marked with '?'.

In [5]:
data.where(data['horsepower'] == '?').dropna()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4.0 | 98.0 | ? | 2046.0 | 19.0 | 71.0 | 1.0 | ford pinto |
| 126 | 21.0 | 6.0 | 200.0 | ? | 2875.0 | 17.0 | 74.0 | 1.0 | ford maverick |
| 330 | 40.9 | 4.0 | 85.0 | ? | 1835.0 | 17.3 | 80.0 | 2.0 | renault lecar deluxe |
| 336 | 23.6 | 4.0 | 140.0 | ? | 2905.0 | 14.3 | 80.0 | 1.0 | ford mustang cobra |
| 354 | 34.5 | 4.0 | 100.0 | ? | 2320.0 | 15.8 | 81.0 | 2.0 | renault 18i |
| 374 | 23.0 | 4.0 | 151.0 | ? | 3035.0 | 20.5 | 82.0 | 1.0 | amc concord dl |


In [6]:
for column in data:
    print("{:14}: {}".format(column, data[column].dtype))

mpg           : float64
cylinders     : int64
displacement  : float64
horsepower    : object
weight        : float64
acceleration  : float64
model_year    : int64
origin        : int64
car_name      : object


We will drop these six rows rather than the horsepower column, because horsepower is important in determining the mpg target variable. We also need to convert horsepower to a numeric type because the '?' entries caused it to be loaded as an 'object' type.

In [7]:
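# Keep only rows with a known horsepower; copy so the assignment below
# modifies an independent frame instead of a slice of the original.
data = data[data['horsepower'] != '?'].copy()
data['horsepower'] = pd.to_numeric(data['horsepower'])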

In [8]:
data.shape

(392, 9)
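
An alternative way to do the same cleaning, shown here only as a sketch, is to let pandas turn the unparseable entries into NaN and then drop them. The three-row sample below is hypothetical and merely mimics the whitespace-separated layout of the file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same whitespace-separated layout.
sample = io.StringIO(
    "mpg horsepower weight\n"
    "18.0 130.0 3504.0\n"
    "25.0 ? 2046.0\n"
    "15.0 165.0 3693.0\n"
)
toy = pd.read_csv(sample, sep=r"\s+")

# errors="coerce" converts entries that cannot be parsed, such as '?', to NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Drop the rows whose horsepower could not be parsed.
toy = toy.dropna(subset=["horsepower"])
print(toy.dtypes)
print(toy.shape)
```

This variant does not depend on the exact placeholder character, which may be convenient if a file uses more than one marker for missing values.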

In [9]:
for column in data:
    print("{:14}: {}".format(column, data[column].dtype))

mpg           : float64
cylinders     : int64
displacement  : float64
horsepower    : float64
weight        : float64
acceleration  : float64
model_year    : int64
origin        : int64
car_name      : object


Drop the car name, since it is not used in the regression analysis.

In [10]:
data = data.drop(['car_name'], axis=1)

## Set up the target (cross_y) and independent (cross_X) variables

In [11]:
cross_X = data.drop(['mpg'], axis=1)
cross_y = pd.DataFrame(data, columns=['mpg'])

## Independent variables

In [12]:
cross_X.head()

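| | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |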


## Target

In [13]:
cross_y.head()

| | mpg |
|---|---|
| 0 | 18.0 |
| 1 | 15.0 |
| 2 | 18.0 |
| 3 | 16.0 |
| 4 | 17.0 |


## Building the Models

#### Selected Models
1. Ordinary Least Squares
2. Ridge Regression
3. Lasso
4. Support Vector Regression

### Ordinary Least Squares

According to [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares), Ordinary Least Squares fits a linear model that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
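
In symbols, and following the notation of the scikit-learn documentation (with $X$ the matrix of independent variables, $y$ the vector of observed mpg values, $w$ the coefficient vector, and the intercept omitted for brevity), the fitted coefficients solve:

$$
\min_{w} \; \lVert Xw - y \rVert_2^2
$$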

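R² scores from 5-fold cross-validation for an Ordinary Least Squares linear model. For regressors, cross_val_score reports the coefficient of determination (R²) by default, not a classification accuracy.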

In [14]:
ols = LinearRegression()
ols_scores = cross_val_score(ols, cross_X, cross_y, cv=5)
ols_scores

array([0.55691895, 0.68950582, 0.82212138, 0.6795006 , 0.2250594 ])
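
Each value in the array above is the R² score of one fold. As a reminder of what that number measures, here is a minimal pure-Python sketch of the formula $R^2 = 1 - SS_{res} / SS_{tot}$. The helper name `r2` and the sample values are illustrative assumptions, not part of the notebook.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical mpg values, for illustration only.
y_true = [18.0, 15.0, 18.0, 16.0, 17.0]

perfect = r2(y_true, y_true)                  # predictions equal the targets
mean_only = r2(y_true, [16.8] * len(y_true))  # always predict the mean of y_true
print(perfect, mean_only)
```

A score of 1 means perfect predictions, a score near 0 means the model does no better than always predicting the mean of the fold, and negative scores are possible for models that do worse than that.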

### Ridge Regression

Ridge Regression addresses issues with multicollinearity ([NCSS](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf)). Multicollinearity is a property of a dataset in which one independent variable can be linearly predicted from the other independent variables in the dataset.
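
One simple way to look for multicollinearity is to inspect the pairwise correlations between the independent variables. The sketch below uses synthetic features (`x1`, `x2`, `x3` are hypothetical, with `x2` built to be almost a linear function of `x1`) rather than the Auto MPG columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical features: x2 is almost a linear function of x1, x3 is unrelated.
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X_toy = np.column_stack([x1, x2, x3])

# Columns are variables (rowvar=False); entries close to +/-1 flag collinear pairs.
corr = np.corrcoef(X_toy, rowvar=False)
print(np.round(corr, 2))
```

On the notebook's data, a comparable check would be cross_X.corr(), which returns the same kind of matrix for the columns of the DataFrame.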

R² scores from 5-fold cross-validation for a Ridge Regression linear model.

In [15]:
ridge = Ridge(alpha=.5)
ridge_scores = cross_val_score(ridge, cross_X, cross_y, cv=5)
ridge_scores

array([0.5569872 , 0.68950028, 0.8221379 , 0.67964279, 0.22455814])

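The R² scores for ridge regression are nearly identical to the ordinary least squares scores. This suggests that, at alpha=0.5 on the unscaled features, the ridge penalty has little effect on predictive performance. It does not by itself show that the independent variables are uncorrelated.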

### Lasso Regression

Lasso Regression is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and the interpretability of the statistical model it produces ([Wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics))).
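
Variable selection here means that the L1 penalty can shrink some coefficients exactly to zero. The following is a minimal sketch on synthetic data, not on the Auto MPG set: the names `X_toy`, `y_toy`, and `lasso_toy` are hypothetical, and the alpha value is deliberately large so that the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: only the first two of five features influence the target.
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# A fairly large alpha, chosen for illustration only.
lasso_toy = Lasso(alpha=0.5).fit(X_toy, y_toy)

# The two informative features keep non-zero (but shrunken) coefficients;
# the three irrelevant features are expected to end up at exactly zero.
print(lasso_toy.coef_)
```

Ridge regression, by contrast, shrinks coefficients toward zero without generally setting them to exactly zero, which is why Lasso is the one described as performing variable selection.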

R² scores from 5-fold cross-validation for a Lasso Regression linear model.

In [None]:
lasso = Lasso(alpha=.1)
lasso_scores = cross_val_score(lasso, cross_X, cross_y, cv=5)
lasso_scores

### Support Vector Regression

#### NOTE: Try to normalize the data.  
The score output is wrong.

According to [scikit-learn](https://scikit-learn.org/stable/modules/svm.html#svm-regression), the method of Support Vector Classification can be extended to solve regression problems as well.  
There are three different implementations of Support Vector Regression:
1. SVR
2. NuSVR
3. LinearSVR
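
The note above about normalizing the data could be handled along the following lines. This is only a sketch on synthetic features (`X_toy` and `y_toy` are hypothetical, built to mimic columns on very different scales), using a pipeline so that the scaler is fit on the training part of each fold only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (think cylinders vs. weight).
X_toy = np.column_stack([
    rng.integers(4, 9, size=100),        # small integers from 4 to 8
    rng.uniform(1600, 5000, size=100),   # values in the thousands
])
y_toy = 45.0 - 0.5 * X_toy[:, 0] - 0.007 * X_toy[:, 1] + rng.normal(scale=1.0, size=100)

# StandardScaler is refit inside every training fold, so no test-fold
# statistics leak into the model.
scaled_svr = make_pipeline(
    StandardScaler(),
    SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1),
)
scaled_scores = cross_val_score(scaled_svr, X_toy, y_toy, cv=5)
print(scaled_scores)
```

On the notebook's data, the same pipeline object could be passed to cross_val_score together with cross_X and cross_y.values.ravel() in place of the toy arrays.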

In [None]:
svr_rbf = svm.SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)
svr_rbf_scores = cross_val_score(svr_rbf, cross_X, cross_y.values.ravel(), cv=5)
svr_rbf_scores

In [None]:
svr_lin = svm.SVR(kernel='linear', C=100, gamma='auto')
svr_lin_scores = cross_val_score(svr_lin, cross_X, cross_y.values.ravel(), cv=5)
svr_lin_scores

In [None]:
svr_poly = svm.SVR(kernel='poly', C=100, gamma='auto', degree=3, epsilon=.1, coef0=1)
svr_poly_scores = cross_val_score(svr_poly, cross_X, cross_y.values.ravel(), cv=5)
svr_poly_scores