# Linear regression
In this notebook we'll perform a multiple linear regression to look at correlations in a data set describing properties of automobiles.  The data comes from the [UC Irvine machine learning repository](http://archive.ics.uci.edu/ml/datasets.html).

First let's download the data.  NumPy's `genfromtxt` function makes this easy.

In [139]:
import numpy as np
data = np.genfromtxt("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", 
                     usecols=range(8))
data.shape

(398, 8)

So we have 398 samples, and 8 characteristics for each.  What are these properties?

At http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names we learn that they are

1. mpg:           continuous
2. cylinders:     multi-valued discrete
3. displacement:  continuous
4. horsepower:    continuous
5. weight:        continuous
6. acceleration:  continuous
7. model year:    multi-valued discrete
8. origin:        multi-valued discrete

In fact the original dataset has a 9th column with the make and model, but I've omitted it (that is what `usecols=range(8)` does), since `genfromtxt` can't easily read that text into a numeric array.

Unfortunately, some values are missing:

In [140]:
print(data[32,:])

[  2.50000000e+01   4.00000000e+00   9.80000000e+01              nan
   2.04600000e+03   1.90000000e+01   7.10000000e+01   1.00000000e+00]


Here `nan` means *not a number* -- the value in the data file is actually a question mark.  We'd like to automatically find and eliminate the rows with missing data.  We can detect them like this:

In [141]:
any(np.isnan(data[32,:]))

True
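
To see where the `nan` comes from, here is a minimal sketch that parses a few hand-written lines laid out like the data file.  The names `sample_lines` and `sample` are made up for illustration, and the rows only imitate the file's format (the second one mirrors the row printed above).  It assumes `genfromtxt`'s default settings, under which an entry that can't be read as a float becomes `nan`.

```python
import numpy as np

# Hand-written lines imitating the file layout. The values are illustrative;
# the second line mirrors the row with the missing horsepower shown above.
sample_lines = [
    '18.0 8 307.0 130.0 3504. 12.0 70 1\t"make model a"',
    '25.0 4 98.00 ?     2046. 19.0 71 1\t"make model b"',
    '22.0 6 198.0 95.00 2833. 15.5 70 1\t"make model c"',
]

# usecols=range(8) keeps the eight numeric columns and skips the quoted name.
sample = np.genfromtxt(sample_lines, usecols=range(8))

print(sample.shape)                   # rows and columns that were parsed
print(np.isnan(sample[1]))            # flags the entry that was a question mark
print(np.isnan(sample).any(axis=1))   # one flag per row: does it have a gap?
```

The last line is the row-wise version of the check above: it reports, for every row at once, whether any entry is missing.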

Now let's delete the rows with missing data:

In [142]:
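# Collect the indices of all rows that contain at least one nan
bad_rows = []

for i, line in enumerate(data):
    if any(np.isnan(line)):
        bad_rows.append(i)

# Delete them one at a time.  Each deletion shifts the later rows up by one,
# so we subtract the number of rows already removed.
count = 0
for i in bad_rows:
    data = np.delete(data,i-count,0)
    count = count + 1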
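
The same cleanup can also be written without a loop by using a boolean mask.  The sketch below runs on a small made-up array (`demo`, `has_nan` and `clean` are illustrative names, not part of the notebook above) so that it can be tried on its own.

```python
import numpy as np

# Small made-up array with two incomplete rows
demo = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [4.0, 5.0],
                 [6.0, np.nan]])

# True for every row that contains at least one nan
has_nan = np.isnan(demo).any(axis=1)

# Keep only the complete rows; ~ inverts the boolean mask
clean = demo[~has_nan]

print(has_nan)
print(clean.shape)
```

Applied to the notebook's array, the equivalent line would be `data = data[~np.isnan(data).any(axis=1)]`, which avoids the index bookkeeping needed when deleting rows one by one.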

Check that it's clean:

In [143]:
np.isnan(data).any()

False

In [145]:
data.shape

(392, 8)

So we removed six rows, which still leaves us plenty of data to work with.

Let's create a model that predicts fuel economy (miles per gallon) based on the remaining characteristics.  Since there isn't a numerical meaning to the values for *origin*, we'll omit the last column.  Remember that Python indexes from zero!

In [150]:
A = data[:,1:7]
A.shape

(392, 6)

In [151]:
y = data[:,0] # miles per gallon

Now we have the *inputs* for our model in the matrix $A$ and the *outputs* in the vector $y$.  We'll solve

$$Ax = y$$

in the least squares sense, which will give us an idea of how fuel economy typically varies with each of the other factors.
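
To spell out what "in the least squares sense" means: with 392 equations and only 6 unknowns there is generally no $x$ that satisfies $Ax = y$ exactly, so we look for the $x$ that makes the residual as small as possible,

$$\min_x \|Ax - y\|_2^2 .$$

Assuming the columns of $A$ are linearly independent, the minimizer is unique and satisfies the normal equations

$$A^T A x = A^T y .$$

Note that $A$ as built here has no column of ones, so this particular model has no constant (intercept) term.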

We can solve the system using this function from `numpy`:

In [152]:
np.linalg.lstsq?
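
As a rough illustration of what `lstsq` returns, here is a sketch on synthetic data where the true coefficients are known.  Everything in it (`A_demo`, `x_true`, `y_demo`, the noise level, the seed) is made up for the example, and it assumes a NumPy version recent enough to provide `np.random.default_rng` and to accept `rcond=None`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: 50 samples, 3 inputs, known coefficients
A_demo = rng.normal(size=(50, 3))
x_true = np.array([2.0, -1.0, 0.5])

# Outputs are an exact linear combination plus a little noise
y_demo = A_demo @ x_true + 0.01 * rng.normal(size=50)

# lstsq returns the solution, the sum of squared residuals,
# the rank of the matrix, and its singular values
x_est, resid_demo, rank_demo, s_demo = np.linalg.lstsq(A_demo, y_demo, rcond=None)

print(x_est)       # should be close to x_true
print(rank_demo)   # number of linearly independent columns
```

Because the noise is small compared with the signal, the estimated coefficients should land close to `x_true`; with real data there is no such ground truth to compare against.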

In [153]:
x, resid, rank, s = np.linalg.lstsq(A,y)

In [154]:
print(x)

[-0.5226089   0.01022108 -0.020873   -0.00639456 -0.05202195  0.61025869]


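What do these results mean?  The coefficients are in the same order as the columns of $A$: cylinders, displacement, horsepower, weight, acceleration, and model year.  Each one is the change in predicted mpg per unit increase in that input, with the other inputs held fixed.  As one might expect, fuel economy tends to be worse for cars with more cylinders, horsepower, or weight.  Newer cars tend to be more fuel efficient, as do cars with greater displacement (i.e., greater engine volume).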

Furthermore, we can make quantitative statements.  Cars seem to be getting more fuel efficient by about 0.6 mpg per model year.  The weight coefficient is about -0.0064 mpg per pound, so each thousand pounds of weight added to a car reduces its predicted fuel efficiency by roughly 6 mpg.
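
The arithmetic behind these statements can be sketched as follows.  The coefficients are copied from the output printed above; the labels in `names` are my own shorthand for the columns of $A$, and `effect` is just a convenience dictionary for the example.

```python
import numpy as np

# Column labels for A = data[:,1:7], in order
names = ["cylinders", "displacement", "horsepower",
         "weight", "acceleration", "model year"]

# Coefficients copied from the printed solution vector
coeffs = np.array([-0.5226089, 0.01022108, -0.020873,
                   -0.00639456, -0.05202195, 0.61025869])

# Pair each label with its coefficient (mpg per unit of that input)
effect = dict(zip(names, coeffs))

# Change in predicted mpg for 1000 extra pounds, other inputs held fixed
per_1000_lb = 1000 * effect["weight"]

# Change in predicted mpg for one additional model year
per_year = effect["model year"]

print("mpg change per 1000 lb: %.2f" % per_1000_lb)
print("mpg change per model year: %.2f" % per_year)
```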