# Simple Linear Regression for Automobile mpg Data

In this demo, you will see how to:
* Load data from a text file using the `pandas` package
* Create a scatter plot of data
* Handle missing data
* Fit a simple linear model
* Plot the linear fit with the test data
* Use a nonlinear transformation for an improved fit

## Loading the Data

The Python [`pandas`](http://pandas.pydata.org/) library is a package for data analysis.  In this course, we will use a small portion of its features -- just reading and writing data from files.  After reading the data, we will convert it to `numpy` for all numerical processing, including running machine learning algorithms.

We begin by loading the packages.

In [None]:
import pandas as pd
import numpy as np

The data for this demo comes from a survey of cars to determine the relation of mpg to engine characteristics.  The data can be found in the UCI library: https://archive.ics.uci.edu/ml/datasets/auto+mpg. The specific files we need are in the "Data Folder" there: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg.

### Try 1:  Loading the Data Incorrectly

The `pandas` library has very good methods for loading data from ASCII tables. In this case, we want to read the data in the file:
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

Suppose we assume the file is a CSV (comma-separated values) file. Then we can try the `read_csv` command, which creates a pandas *dataframe*:

In [None]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data')

We can see the first six lines of the dataframe with the `head` command:

In [None]:
df.head(6)

There were three errors:
* All the data appeared in one column.  That is, the columns were not "delimited" correctly.
* The first line was mistaken for a header.
* The columns are missing their header names.

### Try 2: Fixing the Loading Errors

The problems above are common.  It often takes a few attempts to load the data correctly, which is why it is good to look at the first few rows of the dataframe before proceeding.
After some googling, you can find that you need to pass some other options to the `read_csv` command.
First, you need to supply the names of the columns.  In this case, I have supplied them manually based on the description on the UCI website:

In [None]:
names = ['mpg', 'cylinders','displacement', 'horsepower', 
         'weight', 'acceleration', 'model year', 'origin', 'car name']

Then, we can repeat the `read_csv` command with the correct options. 

In [None]:
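# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas);
# na_values='?' marks the '?' entries in the file as missing
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data',
                 header=None,sep=r'\s+',names=names,na_values='?')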
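To see what these options do without downloading anything, the sketch below parses two made-up rows from an in-memory string. The sample rows and the shortened column list are invented for illustration; they only mimic the whitespace-delimited layout and the `?` placeholder for a missing value.

```python
import io
import pandas as pd

# Made-up rows (not from the UCI file): whitespace-delimited, '?' = missing value
sample = (
    "18.0   8   307.0   130.0\n"
    "25.0   4   98.00   ?\n"
)
cols = ['mpg', 'cylinders', 'displacement', 'horsepower']

# header=None: the first row is data; names: supply column names;
# sep=r'\s+': split on runs of whitespace; na_values='?': treat '?' as NaN
toy = pd.read_csv(io.StringIO(sample), header=None, sep=r'\s+',
                  names=cols, na_values='?')
print(toy.shape)                        # two rows, four columns
print(toy['horsepower'].isna().sum())   # number of missing horsepower entries
```

Because `?` is converted to NaN, the `horsepower` column is read as floating point rather than as strings.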

If you re-run the `head` command now, you can see the loading was correct. You can see the column names, index, and values:

In [None]:
df.head(6)

## Manipulating the Data
We can get the `shape` of the data, which indicates the number of samples and the number of attributes:

In [None]:
df.shape

You can also see the three components of the `dataframe` object.  The dataframe is stored in a table (similar to a SQL table if you know databases).  In this case, there is one row for each car and the attributes of the car are stored in the columns.  The command `df.columns` returns the names of the columns.

In [None]:
df.columns

The field `df.index` returns the indices of the rows.  In this case, they are just enumerated 0,1,...

In [None]:
df.index

Finally, `df.values` is a 2D `numpy` array with values of the attributes for each car.  Note that the data can be *heterogeneous*:  Some entries are integers, some are floating point values and some are strings.

In [None]:
df.values
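As a small illustration of what heterogeneous data means for the resulting array, the sketch below uses a made-up three-column dataframe (the values are invented). When the columns have mixed types, `values` falls back to a generic `object` array; when only numeric columns are selected, the result is an ordinary floating-point array.

```python
import pandas as pd

# Made-up dataframe with a float, an integer and a string column
toy = pd.DataFrame({'mpg': [18.0, 25.0],
                    'cylinders': [8, 4],
                    'car name': ['car a', 'car b']})

mixed = toy.values                           # mixed types -> dtype is object
numeric = toy[['mpg', 'cylinders']].values   # numeric only -> upcast to float64
print(mixed.dtype, numeric.dtype)
```

This is one reason the later cells select specific numeric columns before doing any arithmetic.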

The `df.columns` attribute is not a Python list, but a `pandas`-specific data structure called an `Index`.  To convert it to a list, use the `tolist()` method:

In [None]:
df.columns.tolist()

You can select subsets of the attributes with indexing.  For example, the following selects one attribute and returns what is called a pandas `Series`:

In [None]:
df2 = df['cylinders']
df2.head(6)

You can also select a list of column names which returns another dataframe.  Note the use of the double brackets `[[ ... ]]`.

In [None]:
df2 = df[['cylinders','horsepower']]
df2.head(6)
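The difference between single and double brackets can be checked directly on a small made-up dataframe (the values below are invented for illustration): single brackets give a one-dimensional `Series`, while a list of names gives a two-dimensional dataframe, even when the list has only one entry.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'cylinders': [8, 4, 6],
                    'horsepower': [130.0, 90.0, 105.0]})

one = toy['cylinders']      # single brackets -> Series
sub = toy[['cylinders']]    # double brackets -> DataFrame with one column
print(type(one).__name__, one.shape)
print(type(sub).__name__, sub.shape)
```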

## Plotting the Data
We load the `matplotlib` module to plot the data.  This module has excellent plotting routines that are very similar to those in MATLAB.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

First, we need to convert the dataframes to numpy arrays:

In [None]:
x = df['displacement'].values
y = df['mpg'].values

Then, we can create a scatter plot:

In [None]:
plt.plot(x,y,'o')
plt.xlabel('displacement')
plt.ylabel('mpg')
plt.grid(True)

## Manipulating Numpy arrays

Once the data is converted to a numpy array, we can perform many useful simple calculations.  For example, we can compute the sample mean:

In [None]:
mx = np.mean(x)
my = np.mean(y)
print('Mean displacement = {0:5.1f}, mean mpg= {1:5.1f}'.format(mx, my))

Fraction of cars with > 25 mpg:

In [None]:
np.mean(y > 25)

Sample mean displacement for the cars that have mpg > 25

In [None]:
I = (y>25)

In [None]:

np.sum(x*I)/np.sum(I)

You can also do the previous command with [boolean indexing](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html).

In [None]:
np.mean(x[I])
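To see why the two forms agree, here is a small sketch on made-up arrays (the numbers are invented so the arithmetic can be checked by hand). Multiplying by the boolean mask zeros out the unselected entries, and `np.sum(I)` counts the selected ones, so the ratio is the mean over the selected entries only.

```python
import numpy as np

x_toy = np.array([300.0, 100.0, 250.0, 90.0])   # made-up displacements
y_toy = np.array([15.0, 30.0, 18.0, 35.0])      # made-up mpg values
I_toy = (y_toy > 25)                            # [False, True, False, True]

frac = np.mean(I_toy)                           # True counts as 1, False as 0
m_mask = np.sum(x_toy * I_toy) / np.sum(I_toy)  # (100 + 90) / 2
m_index = np.mean(x_toy[I_toy])                 # mean of [100, 90]
print(frac, m_mask, m_index)
```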

## Missing Data and NaN Values

Now, try a different field, horsepower

In [None]:
x = np.array(df['horsepower'])
y = np.array(df['mpg'])
np.mean(x)

When you compute the mean, it gives `nan`, which means "not a number".  The reason is that there was missing data in the original file and the `read_csv` function put `nan` values in the places where the data was missing.  This is very common.  To remove the rows with the missing data, we can use the `dropna` method:

In [None]:
df1 = df[['mpg','horsepower']]
df2 = df1.dropna()
df2.shape

We can see that some of the rows have been dropped.  Specifically, the number of samples went from 398 to 392.  We can now compute the mean using the reduced dataframe.

In [None]:
x = df2['horsepower'].values
y = df2['mpg'].values
np.mean(x)
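An alternative that this demo does not use is `np.nanmean`, which skips NaN entries when averaging a single array. The sketch below compares the two approaches on a made-up dataframe (values invented for illustration).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing horsepower value
toy = pd.DataFrame({'mpg': [18.0, 25.0, 30.0],
                    'horsepower': [130.0, np.nan, 90.0]})

plain = np.mean(toy['horsepower'].values)       # NaN propagates through the mean
skipped = np.nanmean(toy['horsepower'].values)  # ignores the NaN entry
clean = toy.dropna()                            # drops the whole row with the NaN
print(plain, skipped, clean.shape)
```

For the regression that follows, `dropna` is the safer choice, since it removes the whole row and keeps the `horsepower` and `mpg` arrays aligned.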

And, we can plot the data.

In [None]:
plt.plot(x,y,'o')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.grid(True)

## Computing and Plotting a Linear Fit
We can now try to fit a linear model, $\hat{y} = \beta_0 + \beta_1 x$.
From class, the formulae are:
$$\beta_1 = s_{yx}/s_{xx}, \quad \beta_0 = \bar{y} - \beta_1\bar{x},$$
where $\bar{x}$ and $\bar{y}$ are the sample means and $s_{yx}$ and $s_{xx}$ are the cross- and auto-covariances.
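As a reminder of where these formulae come from, here is a brief sketch of the standard least-squares derivation, written with the $1/n$ normalization that `np.mean` uses. The coefficients minimize
$$\mathrm{RSS}(\beta_0,\beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$
Setting the derivative with respect to $\beta_0$ to zero gives
$$\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0 \quad\Rightarrow\quad \beta_0 = \bar{y} - \beta_1\bar{x}.$$
Substituting this back and setting the derivative with respect to $\beta_1$ to zero gives
$$\sum_{i=1}^n (x_i-\bar{x})\big[(y_i-\bar{y}) - \beta_1(x_i-\bar{x})\big] = 0 \quad\Rightarrow\quad \beta_1 = \frac{s_{yx}}{s_{xx}},$$
where
$$s_{yx} = \frac{1}{n}\sum_{i=1}^n (y_i-\bar{y})(x_i-\bar{x}), \qquad s_{xx} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2.$$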

In [None]:
xm = np.mean(x)
ym = np.mean(y)
syx = np.mean((y-ym)*(x-xm))
sxx = np.mean((x-xm)**2)
syy = np.mean((y-ym)**2)
beta1 = syx/sxx
beta0 = ym - beta1*xm

print("mean of x ={0:7.2f}, mean of y ={1:7.2f}".format(xm,ym))
print("sqrt(sxx) ={0:6.2f},  sqrt(syy)={1:6.2f}".format(np.sqrt(sxx),np.sqrt(syy)))
print("beta0 ={0:6.2f}, beta1 ={1:6.2f}".format(beta0,beta1))
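One way to sanity-check the covariance formulae is to compare them with NumPy's built-in degree-1 polynomial fit, `np.polyfit`. The sketch below does this on synthetic data; the data-generating line and noise level are assumptions chosen only for illustration, not values from the auto-mpg data.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the auto-mpg data)
rng = np.random.default_rng(0)
x_toy = rng.uniform(50, 200, size=100)
y_toy = 40 - 0.15 * x_toy + rng.normal(0, 2, size=100)

# Coefficients from the sample-covariance formulae
xm_t, ym_t = np.mean(x_toy), np.mean(y_toy)
b1 = np.mean((y_toy - ym_t) * (x_toy - xm_t)) / np.mean((x_toy - xm_t) ** 2)
b0 = ym_t - b1 * xm_t

# np.polyfit returns coefficients with the highest degree first: [slope, intercept]
slope, intercept = np.polyfit(x_toy, y_toy, 1)
print(np.allclose([b1, b0], [slope, intercept]))
```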

We can create a plot of the regression line on top of the scatter plot.

In [None]:
# Points on the regression line
ypred = beta1*x + beta0

plt.plot(x,y,'o')                    # Plot the data points
plt.plot(x,ypred,'-',linewidth=3)  # Plot the regression line (the predicted values)
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.grid(True)

The squared loss, or residual sum of squares, is $\mathrm{RSS} = \sum_i (y_i-\hat{y}_i)^2$, where $\hat{y}_i = \beta_0 + \beta_1x_i$.  We can compute it as follows:

In [None]:
yhat=beta0+beta1*x
loss = np.sum((y-yhat)**2)
print("Loss = {0:7.2f}".format(loss))
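At the least-squares solution, the RSS can also be written in terms of the sample covariances as $\mathrm{RSS} = n\,(s_{yy} - s_{yx}^2/s_{xx})$, which follows from expanding the square with $\beta_1 = s_{yx}/s_{xx}$. The sketch below checks this identity numerically on synthetic data (the generating line and noise level are assumptions for illustration).

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
x_toy = rng.uniform(50, 200, size=50)
y_toy = 40 - 0.15 * x_toy + rng.normal(0, 2, size=50)
n = len(x_toy)

xm_t, ym_t = np.mean(x_toy), np.mean(y_toy)
syx_t = np.mean((y_toy - ym_t) * (x_toy - xm_t))
sxx_t = np.mean((x_toy - xm_t) ** 2)
syy_t = np.mean((y_toy - ym_t) ** 2)
b1 = syx_t / sxx_t
b0 = ym_t - b1 * xm_t

rss_direct = np.sum((y_toy - (b0 + b1 * x_toy)) ** 2)  # sum of squared residuals
rss_closed = n * (syy_t - syx_t ** 2 / sxx_t)          # covariance form
print(np.isclose(rss_direct, rss_closed))
```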

## Nonlinear Transformation

We see that the linear regression captures the general trend of the relation between `y=mpg` and `x=horsepower`.  However, the trend does not really appear linear - instead it has an inverse type relation.   So, a natural idea is to use a *nonlinear transformation*:
* Transform the data `z=1/y` 
* Fit `z` vs. `x` with a linear model:  $\hat{z}=\beta_0 + \beta_1x$.
* Invert the nonlinear relation to get a model for `y`:  $\hat{y} = 1/\hat{z}=1/(\beta_0 + \beta_1x)$.
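The three steps above can be sketched on synthetic data for which `1/y` is exactly linear in `x`, so the fit should recover the generating coefficients. The helper name `fit_line` and the coefficients `0.01` and `0.0003` are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_line(x, t):
    """Least-squares fit t ~ b0 + b1*x using sample covariances (hypothetical helper)."""
    xm, tm = np.mean(x), np.mean(t)
    b1 = np.mean((t - tm) * (x - xm)) / np.mean((x - xm) ** 2)
    b0 = tm - b1 * xm
    return b0, b1

# Synthetic data (an assumption): 1/y is exactly linear in x
x_toy = np.linspace(50.0, 200.0, 16)
y_toy = 1.0 / (0.01 + 0.0003 * x_toy)

z_toy = 1.0 / y_toy                        # Step 1: transform the target
b0_inv, b1_inv = fit_line(x_toy, z_toy)    # Step 2: linear fit of z vs. x
y_hat = 1.0 / (b0_inv + b1_inv * x_toy)    # Step 3: invert to predict y
print(b0_inv, b1_inv)
```

Note that this procedure minimizes the squared error in `z`, not in `y`, so it is not guaranteed to give the smallest possible RSS in the original domain.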

We begin by computing `z` and creating a scatter plot of `z` vs. `x`.  Note that `z` represents gallons per mile (1/mpg).

In [None]:
z = 1/y
plt.plot(x,z,'o')
plt.xlabel('horsepower')
plt.ylabel('1/mpg')
plt.grid(True)

We see a clear linear relation between `z` (1/mpg) and `x` (horsepower). We can fit a linear model,
$z = \beta_0 + \beta_1 x$.  

In [None]:
# Use linear regression to fit `z` vs. `x`
xm = np.mean(x)
zm = np.mean(z)
sxz = np.mean((z-zm)*(x-xm))
sxx = np.mean((x-xm)**2)
beta1_inv = sxz/sxx
beta0_inv = zm - beta1_inv*xm

We can create a plot of the regression line on top of the scatter plot.  

In [None]:
z = 1/y
xplt_inv = np.arange(20,250)
zplt_inv = beta1_inv*xplt_inv + beta0_inv
plt.plot(x,z,'o')
plt.plot(xplt_inv,zplt_inv,'-',linewidth=3)
plt.xlabel('horsepower')
plt.ylabel('1/mpg')
plt.grid(True)

Finally, we compute the estimate in the original domain:  $\hat{y}=1/\hat{z}$.  We plot the data, original linear fit and the linear fit with inversion.

In [None]:
yplt_inv = 1/zplt_inv 
plt.plot(x,y,'o')
plt.plot(xplt_inv,beta0 + beta1*xplt_inv,'-',linewidth=3)
plt.plot(xplt_inv,yplt_inv,'-',linewidth=3)
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.grid(True)
plt.legend(['data', 'linear', 'linear+inversion'])

We can conclude by comparing the squared loss using the linear fit and the linear fit+inversion.  We see that we get a slightly reduced error using the nonlinear transformation.

In [None]:
zhat_inv = beta0_inv + beta1_inv*x
yhat_inv = 1/zhat_inv
loss_inv = np.sum((yhat_inv-y)**2)
print("RSS = {0:7.2f} (linear)".format(loss))
print("RSS = {0:7.2f} (linear+inversion)".format(loss_inv))