## Lab for Linear Regression

### Linear Algebra in Python/Numpy

In this lab we will use:
- the `numpy` linear algebra package for computations
- the `bokeh` plotting package for graphics

The next cell loads these libraries.

In [1]:
import numpy as np
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
output_notebook()

### Distance to a moving object (simulated data)

We will begin by looking at Figures 1 and 2 in the [Linear Regression](../notes/main.pdf) notes.

In Figure 1 we began with simulated data that represents the distance from an observer to a moving object. 
We will generate that data.  The observations occur at times $0, 1, 2, \ldots, 9$.  

In [4]:
# Create a 1-d numpy array containing 0,1,...,9
x = np.arange(10)

Let's suppose that the object approaches the observer at a true speed of $15$ m/s and that the initial distance to the object is $150$ m.  Then
the object's distance is given by $d=150-15x$ where $x$ is the time in seconds.  However, each time we measure the distance, there is a random error
of between $-10$ and $10$ meters. The function `np.random.uniform` returns uniformly distributed random numbers within a given range.

In [10]:
y = 150-15*x+np.random.uniform(-10,10,size=x.shape[0])
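
Because the noise is random, the numbers printed later in this lab change on every run. If repeatable values are wanted, one option is a seeded generator. The sketch below is only an illustration: the names `rng`, `x_demo`, `noise`, and `y_demo` are hypothetical, and the seed `0` is an arbitrary choice.

```python
import numpy as np

# Assumption: seed 0 is arbitrary; any fixed integer makes the run repeatable.
rng = np.random.default_rng(0)

x_demo = np.arange(10)
# Uniform measurement errors between -10 and 10 meters, one per observation.
noise = rng.uniform(-10, 10, size=x_demo.shape[0])
y_demo = 150 - 15 * x_demo + noise

# Each simulated measurement is within 10 m of the true distance.
print(np.all(np.abs(y_demo - (150 - 15 * x_demo)) <= 10))
```

Re-running this cell with the same seed produces the same `y_demo` each time.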

We will plot the data points $(x,y)$.

In [11]:
f=figure(width=400,height=400,title='Measured Distance to Moving Object')
f.scatter(x=x,y=y)
f.xaxis.axis_label='time'
f.yaxis.axis_label='distance (m)'
show(f)

The points are roughly on a line, but not exactly, because of the measurement error.  The
data matrix $X$ for this problem consists of two columns, one of which is $0,1,2,\ldots, 9$ and the other
is $1,1,1,1,\ldots$ (see section 1.3 in the notes).

We build $X$ by:
- converting `x` to a column vector
- creating a column vector of 1's
- concatenating these along the column axis

In [19]:
X=np.concatenate([x.reshape(-1,1),np.ones(shape=(x.shape[0],1))],axis=1)

In [20]:
X

array([[0., 1.],
       [1., 1.],
       [2., 1.],
       [3., 1.],
       [4., 1.],
       [5., 1.],
       [6., 1.],
       [7., 1.],
       [8., 1.],
       [9., 1.]])
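
As a cross-check, the same matrix can also be assembled with `np.column_stack`, which treats 1-d arrays as columns. This is just an alternative sketch; the names `x_demo`, `X_concat`, and `X_stack` are hypothetical.

```python
import numpy as np

x_demo = np.arange(10)

# The construction used in this lab: reshape to a column, then concatenate.
X_concat = np.concatenate(
    [x_demo.reshape(-1, 1), np.ones(shape=(x_demo.shape[0], 1))], axis=1
)

# Alternative: column_stack turns each 1-d array into one column.
X_stack = np.column_stack([x_demo, np.ones(x_demo.shape[0])])

print(X_stack.shape)
print(np.array_equal(X_concat, X_stack))
```

Either form gives a matrix with one row per observation and two columns.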

The vector $Y$ holds our measurements; we only need to reshape `y` into a column vector.

In [21]:
Y = y.reshape(-1,1)

In [22]:
Y

array([[141.70146876],
       [126.74924305],
       [128.0650881 ],
       [111.02029657],
       [ 93.21515479],
       [ 80.66603855],
       [ 63.67188964],
       [ 47.35925691],
       [ 39.1045185 ],
       [  7.29868746]])

Now we will use the formulae in equations (7) and (8) from the notes to compute the least squares line.


In [23]:
D = np.dot(X.transpose(),X)

The matrix $M$ contains the slope and intercept of our least squares line.

In [30]:
M = np.dot(np.linalg.inv(D),np.dot(X.transpose(), Y))
print('Slope is {}, Intercept is {}'.format(M[0,0],M[1,0]))

Slope is -14.431888464120771, Intercept is 148.8286623223696
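
The explicit inverse follows the formula in the notes. As a sanity check, the same coefficients can be obtained without forming the inverse, using `np.linalg.solve` on the system $DM=X^{\intercal}Y$ or `np.linalg.lstsq` on $X$ and $Y$ directly. The sketch below rebuilds a small data set of its own (seeded, so the numbers differ from the cell above); all names ending in `_demo` and the `M_*` variables are hypothetical.

```python
import numpy as np

# Assumption: a fresh seeded data set, not the y generated earlier in the lab.
rng = np.random.default_rng(0)
x_demo = np.arange(10)
X_demo = np.column_stack([x_demo, np.ones(x_demo.shape[0])])
Y_demo = (150 - 15 * x_demo + rng.uniform(-10, 10, size=10)).reshape(-1, 1)

D_demo = X_demo.T @ X_demo

# 1. The formula used in this lab: explicit inverse of D.
M_inv = np.linalg.inv(D_demo) @ (X_demo.T @ Y_demo)

# 2. Solve D M = X^T Y without forming the inverse.
M_solve = np.linalg.solve(D_demo, X_demo.T @ Y_demo)

# 3. Let numpy's least squares routine work on X and Y directly.
M_lstsq, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

print(np.allclose(M_inv, M_solve), np.allclose(M_inv, M_lstsq))
```

For a small, well-conditioned problem like this one, all three approaches agree to floating point precision.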


The predicted values of $Y$ are given by equation (8).

In [31]:
Yhat = np.dot(np.dot(np.dot(X,np.linalg.inv(D)),X.transpose()),Y)
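
The matrix $XD^{-1}X^{\intercal}$ is often called the hat matrix, and the predictions it produces equal $XM$. The following sketch checks two consequences on a seeded data set of its own: the residuals $Y-\hat{Y}$ are orthogonal to the columns of $X$, and applying the hat matrix twice changes nothing. The names `H`, `resid`, and the `_demo` variables are hypothetical.

```python
import numpy as np

# Assumption: a fresh seeded data set, not the Y computed earlier in the lab.
rng = np.random.default_rng(0)
x_demo = np.arange(10)
X_demo = np.column_stack([x_demo, np.ones(x_demo.shape[0])])
Y_demo = (150 - 15 * x_demo + rng.uniform(-10, 10, size=10)).reshape(-1, 1)

D_demo = X_demo.T @ X_demo
M_demo = np.linalg.inv(D_demo) @ (X_demo.T @ Y_demo)

# Hat matrix: maps the measurements Y to the predictions Yhat.
H = X_demo @ np.linalg.inv(D_demo) @ X_demo.T
Yhat_demo = H @ Y_demo
resid = Y_demo - Yhat_demo

print(np.allclose(Yhat_demo, X_demo @ M_demo))      # Yhat equals X M
print(np.allclose(X_demo.T @ resid, 0, atol=1e-6))  # residuals orthogonal to columns of X
print(np.allclose(H @ H, H))                        # projecting twice changes nothing
```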

Let's add these predicted values to our plot for comparison.  Notice that the green
dots lie along the regression line.

In [35]:
f.scatter(x=x,y=Yhat[:,0],color='green')
show(f)

We can connect the dots to see the line of best fit.

In [38]:
f.line(x=x,y=Yhat[:,0],color='green')
show(f)

### The MPG Data

Figure 2 in the notes shows data relating engine size to mileage for a group of cars.  Your task is to reproduce the graph
shown in the figure.

#### Step 1.  Load the data

The mileage data is in a comma-separated values (csv) file called `../data/auto-mpg/auto-mpg.csv`.  The function `np.genfromtxt`
can be used to read in an array like this.  You can examine the file by using the Jupyter file browser and double-clicking on the file name.  You will see that:
- the first row is a header
- the last column is the type of car, which is a string; `np.genfromtxt` cannot parse it as a number, so it fills that column with `nan`, meaning 'not a number'
- mpg is column zero
- displacement is column 2

In [44]:
data = np.genfromtxt('../data/auto-mpg/auto-mpg.csv',delimiter=',',skip_header=1)
data

array([[ 18.,   8., 307., ...,  70.,   1.,  nan],
       [ 15.,   8., 350., ...,  70.,   1.,  nan],
       [ 18.,   8., 318., ...,  70.,   1.,  nan],
       ...,
       [ 32.,   4., 135., ...,  82.,   1.,  nan],
       [ 28.,   4., 120., ...,  82.,   1.,  nan],
       [ 31.,   4., 119., ...,  82.,   1.,  nan]])
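
To see which columns came back as `nan`, `np.isnan` can be applied column by column. The sketch below does not read the real file; it parses a tiny in-memory sample whose header names and rows are assumptions chosen to mimic the layout described above, and the names `sample_csv`, `sample`, and `nan_cols` are hypothetical.

```python
import io
import numpy as np

# Assumption: illustrative rows only; the real file's header and contents may differ.
sample_csv = io.StringIO(
    "mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name\n"
    "18,8,307,130,3504,12,70,1,car a\n"
    "15,8,350,165,3693,11.5,70,1,car b\n"
)

sample = np.genfromtxt(sample_csv, delimiter=',', skip_header=1)

# A column is flagged when every entry in it failed to parse as a number.
nan_cols = np.isnan(sample).all(axis=0)
print(np.where(nan_cols)[0])   # indices of the all-nan columns

# Columns 0 (mpg) and 2 (displacement) are numeric and safe to use.
print(sample[:, 0], sample[:, 2])
```

Running the same `np.isnan(...).all(axis=0)` check on `data` shows which columns of the real file are unusable as numbers.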

Now your job is to complete the code below, following the work we did above, so that you:
- Create a variable $x$ that is just column 2 (displacement) of the data matrix.
- Create a variable $Y$ that is just column 0 (mpg) of the data matrix.
- Plot $x$ and $Y$ as a scatter plot.
- Create a matrix $X$ whose first column is $x$ and whose second column is all $1$'s.
- Compute the matrix $D=X^{\intercal}X$.
- Find $M=D^{-1}X^{\intercal}Y$.
- Find $\hat{Y}= XD^{-1}X^{\intercal}Y$ (called `Yhat` in the code).
- Plot $\hat{Y}$ and the predicted line.

In [63]:
#x=data[]
#Y=data[]
#f = figure()
#f.scatter()
#show(f)


In [68]:
x=data[:,2]
Y=data[:,0]
f=figure()
f.scatter(x=x,y=Y)
show(f)

In [69]:

#X = np.concatenate([],axis=1)
#D = np.dot(...)
#M = np.dot(...)
#Yhat = ...
#f.scatter(...,color='green')
#f.line(...,color='green')

In [70]:
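# Data matrix: displacement in column 0, ones in column 1
X=np.concatenate([x.reshape(-1,1),np.ones((x.shape[0],1))],axis=1)
D = np.dot(X.transpose(),X)
# Y is 1-d here, so M is 1-d: M[0] is the slope, M[1] the intercept
M = np.dot(np.linalg.inv(D),np.dot(X.transpose(),Y))
# Predicted mpg at each observed displacement (same as X D^{-1} X^T Y)
Yhat = np.dot(X,M)
f.scatter(x=x,y=Yhat,color='green')
f.line(x=x,y=Yhat,color='green')
show(f)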
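
Unlike the times in the first example, the displacement values are not in increasing order, so `f.line` joins the predicted points in file order and retraces the same straight line back and forth. The picture still looks right, but sorting by $x$ first gives a single clean pass. The sketch below uses only the six rows visible in the printed `data` output above; the names `order`, `x_sorted`, `Yhat_sorted`, and the `_demo` variables are hypothetical.

```python
import numpy as np

# Displacement and mpg from the six rows shown in the printed array above.
x_demo = np.array([307.0, 350.0, 318.0, 135.0, 120.0, 119.0])
Y_demo = np.array([18.0, 15.0, 18.0, 32.0, 28.0, 31.0])

X_demo = np.column_stack([x_demo, np.ones(x_demo.shape[0])])
D_demo = X_demo.T @ X_demo
M_demo = np.linalg.inv(D_demo) @ (X_demo.T @ Y_demo)
Yhat_demo = X_demo @ M_demo

# Indices that put the displacements in increasing order.
order = np.argsort(x_demo)
x_sorted = x_demo[order]
Yhat_sorted = Yhat_demo[order]

# In the lab this pair would be passed to f.line(x=x_sorted, y=Yhat_sorted, color='green').
print(x_sorted)
```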