Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

---

## Lab 5: Regression
**This lab was distributed Monday 9/30/2019 and should be completed by Monday 10/7/2019 at 11:59PM.**


Welcome to your fifth lab of the semester!<br>

This lab aims to get you started with linear algebra operations and linear regression in Python.

### Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

### Section 1: Linear algebra with Numpy
The purpose of this section is to give you a chance to do a little more work with numpy before tackling the homework.

**Question 1.1** Create a 100x1 numpy array with normally distributed random entries, centered around 0.
Call the array $a$. If you can't remember the numpy syntax, [this might help](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html).

In [None]:
a = ...
print(a)

**Question 1.2** Print the first 5 entries and the last 5 entries of the array (you might want to look up [slicing and indexing in numpy](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html) as a refresher).
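
As a refresher on the slicing syntax, here is a minimal sketch on a small made-up column vector (the name `v` and its values are assumptions for illustration, not part of the lab data). Slicing a 2-D array along its first axis keeps the second axis, so each result is still a column.

```python
import numpy as np

# Hypothetical 10x1 column vector holding 0, 1, ..., 9
v = np.arange(10).reshape(10, 1)

first_three = v[:3]   # rows at positions 0, 1, 2
last_three = v[-3:]   # the final three rows

print(first_three.ravel())
print(last_three.ravel())
```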

In [None]:
# YOUR CODE HERE

**Question 1.3** Plot the distribution of $a$ to give a visual verification that it's normal.  

In [None]:
# YOUR CODE HERE

**Question 1.4** Now make a matrix of random entries that you can multiply by the vector you just created.
We'll call the matrix $X$, and after this we're going to find the product $Xa$. Make sure to choose the dimensions of $X$ so that it's possible to perform the matrix multiplication $Xa$.
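
The dimension rule can be sketched on a small example: a product `M @ v` is defined when the number of columns of `M` equals the number of rows of `v`. The names `M` and `v` and the sizes below are assumptions chosen for illustration only.

```python
import numpy as np

M = np.ones((2, 3))   # hypothetical 2x3 matrix
v = np.ones((3, 1))   # hypothetical 3x1 column vector

# Inner dimensions agree (3 and 3), so the product is a 2x1 array
product = M @ v
print(product.shape)
```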

In [None]:
# YOUR CODE HERE

**Question 1.5** Now multiply $a$ by $X$, that is, compute $Xa$.  
If numpy gives you an error, diagnose and fix it, and write comments on what happened in the markdown cell below your solution.

In [None]:
# YOUR CODE HERE

(enter comments here on any problems you encountered and how you fixed them)

**Question 1.6** Preparing to invert a matrix: Here, you'll define a matrix that you'll later invert. Remember that the inverse of a matrix $B$ is written as $B^{-1}$, and it satisfies the formula:<br>
$BB^{-1} = I$<br>

$I$ is the identity matrix, which is a matrix where the diagonal is all 1s and the rest of the elements are 0. For example, if you try to perform the following matrix multiplication, you'll find that you get this result:<br>

$\begin{bmatrix}
 1&2\\
 3&4\\
 \end{bmatrix}
 \begin{bmatrix}
 -2&1\\
 1.5&-0.5\\
 \end{bmatrix} = 
 \begin{bmatrix}
 1&0\\
 0&1\\
 \end{bmatrix}$<br>
 
The two matrices above are inverses of each other. The identity matrix $I$ is important because if you multiply it by *any* matrix with the same dimensions, you'll get that same matrix (try it!).<br>
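
The example above can be checked numerically. This is a small sketch, with `P`, `Q`, and `I2` as illustrative names. It uses `np.allclose` rather than `==` because floating-point products are not always exact.

```python
import numpy as np

P = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[-2.0, 1.0], [1.5, -0.5]])
I2 = np.eye(2)  # 2x2 identity matrix

# P times Q should match the identity up to floating-point tolerance
product_is_identity = np.allclose(P @ Q, I2)

# Multiplying by the identity should leave P unchanged
identity_keeps_P = np.allclose(I2 @ P, P)

print(product_is_identity, identity_keeps_P)
```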
 
Only square matrices can have an inverse, and not every square matrix has one. A square matrix with no inverse is called a "singular matrix". This happens when you can write one row as a linear combination of one or more other rows. For example

$\begin{bmatrix}
 x&y\\
 2x&2y\\
 \end{bmatrix}$
 
is singular because the second row is just two times the first.  But 

$\begin{bmatrix}
 x&y&z\\
 2x&2y&0\\
 0&0&z\\
 \end{bmatrix}$
 
is also singular because the first row equals half of the second plus the third.<br>
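
To make the two examples concrete, here is a sketch that plugs in made-up values ($x=1$, $y=3$, $z=5$, chosen only for illustration) and asks numpy for the rank of each matrix. A rank smaller than the number of rows means the rows are linearly dependent.

```python
import numpy as np

# Second row is two times the first (x=1, y=3)
S2 = np.array([[1.0, 3.0],
               [2.0, 6.0]])

# First row equals half of the second plus the third (x=1, y=3, z=5)
S3 = np.array([[1.0, 3.0, 5.0],
               [2.0, 6.0, 0.0],
               [0.0, 0.0, 5.0]])

rank2 = np.linalg.matrix_rank(S2)  # fewer than 2 independent rows
rank3 = np.linalg.matrix_rank(S3)  # fewer than 3 independent rows
print(rank2, rank3)
```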

In the cell below, define a square matrix $B$ (that is different from the example matrix provided above). You can choose its size and the value of each element.

In [None]:
B = ...
print(B)

Here is a function to check whether a matrix is invertible (that is, not singular).

In [None]:
### Run this cell; don't change it
def is_invertible(a):
    return a.shape[0] == a.shape[1] and np.linalg.matrix_rank(a) == a.shape[0]

**Question 1.7** What two things is the function `is_invertible` checking?

*Your answer here*

**Question 1.8** Is your matrix invertible?
Use `is_invertible` from the cell above to verify invertibility.  If your matrix isn't invertible, make adjustments until it is.

In [None]:
# YOUR CODE HERE

**Question 1.9** Use numpy to invert your matrix.

In [None]:
Binv = ...
print(Binv)

**Question 1.10** Verify that `Binv` is the inverse of $B$, using the definition of an inverse matrix.

In [None]:
# YOUR CODE HERE

**Question 1.11** Time taken to invert matrices: the next cell will report the time numpy takes to invert matrices of different sizes.  Run the cell and watch the output.  In the following markdown cell, discuss what you see happening to the time to compute the inverse as the matrix grows.<br>

From lecture 9, you'll recall that when we do multiple regression we need to calculate an inverse matrix to find the coefficients. What do these results tell you about computing coefficients for a multiple regression with a large number of predictors?

In [None]:
### Don't modify this cell, just run it.
A_small = np.random.randn(100).reshape([10,10])
print('Inverting small matrix')
%timeit np.linalg.inv(A_small)

A_med = np.random.randn(10000).reshape([100,100])
print('Inverting medium matrix')
%timeit np.linalg.inv(A_med)

A_realbig = np.random.randn(1000000).reshape([1000,1000])
print('Inverting real big matrix')
%timeit np.linalg.inv(A_realbig)

*Your answer here*

### Section 2: Solving least squares regression.
In lecture we went over formulas to solve for the coefficients $\beta_0$ and $\beta_1$ in a single-variable least squares regression problem:

$y_i = \beta_0 + \beta_1 x_i + e_i$.

Those formulas are:

$$
\begin{aligned}
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\bar{x}\\
\hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}
\end{aligned}
$$

**Question 2.1** In this section, we'll be working with data from the [California Department of Transportation (CalTrans)](https://data.ca.gov/dataset/caltrans-annual-vehicle-delay). 

Load in the .csv file in the "data" folder and save it to a dataframe `df`.

In [None]:
df = ...
df.head()

**Question 2.2** This dataset reports freeway congestion in California, organized by county and route. For this exercise, we'll be looking specifically at the Annual Vehicle Miles Traveled (VMT) field, which represents the total number of miles traveled by vehicles on that route in that county, and the Incidents/ Day field, which represents the average number of traffic incidents per day for that route and county in 2017.

To start off with, create a scatter plot of Annual VMT on the x-axis and Incidents/Day on the y-axis. What can you say about the general relationship between these two variables?

*Note*: instead of typing out a long column name every time you need to use it, you can create a variable that contains that column name as a string. For instance, rather than typing out `df["Annual Vehicle Miles Traveled (VMT)"]`, you can define a variable `vmt`:
```python
vmt = "Annual Vehicle Miles Traveled (VMT)"
df[vmt]
```
You can also just rename the columns.
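
For the renaming route, a minimal sketch using `DataFrame.rename` on a tiny made-up frame (the frame `toy`, its values, and the short name `"vmt"` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-row frame with the long column name (values are made up)
toy = pd.DataFrame({"Annual Vehicle Miles Traveled (VMT)": [100.0, 250.0]})

# rename returns a new DataFrame; map old column name -> new column name
toy = toy.rename(columns={"Annual Vehicle Miles Traveled (VMT)": "vmt"})

print(list(toy.columns))
```

If you rename columns, keep in mind that the skeleton code later in this lab indexes with `df[vmt]`, so the variable `vmt` still needs to hold whatever the column is called.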

In [None]:
# YOUR CODE HERE

*Your observations here*

**Question 2.3** Write a function that returns $\beta_0$ and $\beta_1$ using the summation formulas from class (i.e., the formulas above in the lab notebook), taking the $x$ and $y$ observations as input. Note that you can return two values from a function using the syntax `return (value1,value2)`.

In [None]:
def get_betas(x,y):
    # YOUR CODE HERE
    return ...
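
One possible way to sanity-check your function once it is written is to compare its output with `np.polyfit`, which fits a degree-1 polynomial by least squares and returns the coefficients from highest degree to lowest (slope first, then intercept). The toy data below are an assumption for illustration: the points lie exactly on the line $y = 1 + 2x$.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 1 + 2x
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

# Degree-1 fit: first coefficient is the slope, second is the intercept
slope, intercept = np.polyfit(x_toy, y_toy, 1)
print(slope, intercept)
```

Your `get_betas(x_toy, y_toy)` should agree with these values up to floating-point error, with $\beta_0$ matching the intercept and $\beta_1$ matching the slope.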

**Question 2.4** Use your function to compute $\beta_0$ and $\beta_1$ for two variables of interest in the Caltrans data you loaded.  

In [None]:
# YOUR CODE HERE
print('Beta values are', b0, 'and', b1)

**Question 2.5** Output a plot that overlays your regression line on a scatterplot of VMT vs. incidents per day. To get started, you'll need to calculate your predicted value `y_hat` for each of your $x$ values.

In [None]:
y_hat = ...
# YOUR CODE HERE

**Question 2.5** In this section, we'll calculate our error term $e$ as well as our mean squared error (MSE). Below, calculate $e$ for each pair of predictions and observations, and then calculate the MSE. The result for $e$ should be a 1-dimensional array that has the same length as our number of observations; for the MSE it will be a single, non-negative value.
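
As a small illustration of the two quantities on made-up numbers (the arrays `y_obs` and `y_pred` are assumptions, not lab data): the error is the elementwise difference between observations and predictions, and the MSE is the mean of its squares.

```python
import numpy as np

y_obs = np.array([1.0, 2.0, 4.0])    # hypothetical observations
y_pred = np.array([1.5, 2.0, 3.0])   # hypothetical predictions

resid = y_obs - y_pred               # one error value per observation
mse = np.mean(resid ** 2)            # single non-negative number

print(resid, mse)
```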

In [None]:
error = ...
MSE = ...
print(MSE)

**Question 2.6** In the previous few questions, we didn't divide our data into training and testing sets - we fit the regression line to the full dataset, and tested its performance on that same full dataset. In this section we'll build a loop that randomly selects the training data from the full dataset, finds the beta values using the function you've defined, and then tests its performance against the testing data. For each iteration, we'll save the beta values, the error array, and the MSE. The skeleton code below will get you started.
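
The index-splitting idea used in the skeleton can be sketched on a tiny example. The names and sizes here (`n_total`, `n_held`, `held`, `rest`) are assumptions for illustration. `np.random.choice` with `replace=False` draws distinct positions, and `np.setdiff1d` returns the positions that were not drawn, so the two groups never overlap.

```python
import numpy as np

n_total = 10   # hypothetical number of observations
n_held = 7     # hypothetical size of the randomly drawn group

# Draw n_held distinct index values from 0..n_total-1
held = np.random.choice(n_total, size=n_held, replace=False)

# Every index that was not drawn
rest = np.setdiff1d(np.arange(n_total), held)

print(sorted(held), rest)
```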

In [None]:
n_iter = 100 # number of iterations
n_test = int(np.round(len(df[vmt])*0.7)) # number of test data points is equal to 70% of the observations, rounded to the nearest integer
n_train = int(len(df[vmt]) - n_test) # number of training data points

betas = np.zeros((n_iter,2)) # initialize empty array that will hold beta values, where each row is the beta value for a different iteration
error = np.zeros((n_iter, n_test)) # each row has the error values of the testing dataset for a different iteration
MSE = np.zeros(n_iter) # each value is mean squared error for a different iteration

for i in range(n_iter):
    test_indx = np.random.choice(len(df[vmt]), size = n_test, replace = False) # index values for test data
    train_indx = np.setdiff1d(np.arange(len(df[vmt])),test_indx) # index values for training data
 
    # YOUR CODE BELOW
    betas[i,:] = ...
    y_hat = ...
    error[i,:] = ...
    MSE[i] = ...

**Question 2.7** Plot a distribution of your mean squared error using `sns.kdeplot()`. What do you notice about the distribution? Can you explain its shape, based on what you've observed about the dataset?

In [None]:
# YOUR CODE HERE

*Your observations here*

**Question 2.8** Plot a scatter plot of all observations, overlaid with all 100 linear regression lines. We can plot the regression lines by using the array `betas` to calculate their value at two points, 0 and the maximum $x$ value (`df[vmt].max()`). The skeleton code below gets you started by defining those two $x$ values. Play around with the linestyles, scatter plot marker sizes, and linewidths to get a legible plot.

In [None]:
x = np.array([0, df[vmt].max()]) # two x values, at which y_hat can be calculated

plt.figure(figsize = (10,5))

plt.scatter(...)

for i in range(n_iter):
    y_hat = ...
    plt.plot(...)
    
plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.show()

# Hooray, you're done! 

Please remember to submit your lab work, after clicking Kernel -> Restart & Run All, in .html and .ipynb format on bCourses.