In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ERG 131] Homework 5: K-nearest neighbors, regression
<br>

In this homework students will work with air quality data, run k-nearest neighbors, and do regression using scikit-learn.

K-nearest neighbors is covered in Introduction to Statistical Learning. KNN for classification is described in Section 2.2.3 and regression in Section 3.5. In this homework we're going to use KNN for quantitative spatial forecasting, meaning we'll predict a numeric value for a location in space based on the average of the K-nearest points in space for which we have data.

We'll use the EPA air pollution measurements again (first used in HW2). For linear regression the objective is to build simple prediction models for PM2.5 concentration versus time. The data can be found [here](https://aqs.epa.gov/aqsweb/airdata/download_files.html).

**Important note**: You'll notice in the dependencies code block that there's a section that we want you to comment out in the final version, and a section that we want you to uncomment. Make sure to uncomment everything in `# uncomment this for final version` and comment everything in `# comment this out for final version` - it ensures that one of the plots you'll be outputting will show up properly in the .html file you submit. We'll remind you at the end of the homework, too!

---

### Topics Covered
- Continue getting comfortable working with new data, and continue to practice working with tools that help manage and summarize large data sets.
- Understand how KNN works and make some cool maps in the process.
- Learn how to use the simple single linear regression tool in scikit-learn.
- Analyze spatial distribution of annual changes in pollutant concentration.

### Table of Contents

1 - [K-Nearest Neighbors](#section1)<br>
2 - [Single Linear Regression with scikit-learn](#section2)<br>
3 - [Multiple regression](#section3) <br>
4 - [Model selection](#section4) <br>
5 - [Project](#section5)<br>

**Dependencies:**

In [None]:
# Run this cell to install these packages
! pip install scikit-learn
! pip install plotly
! pip install mapbox

In [None]:
# Run this cell to set up your notebook
import requests
from pathlib import Path
import zipfile
import os
import csv
import pandas as pd
import numpy as np
from numpy.linalg import inv

import utils
from utils import run_plotly

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

# uncomment this for final version
# import plotly.offline as py
# py.init_notebook_mode(connected=False)
# import plotly.graph_objs as go
# uncomment this for final version

# comment this out for final version
import plotly
import plotly.graph_objs as go
# comment this out for final version

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Section 1: K-Nearest Neighbors  <a id='section1'></a>

Let's run a KNN algorithm on the [hourly EPA PM2.5 data](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files) that we previously used in Homework 2. This time, we've reduced the dataset to contain just the hourly data from California on Nov 9, 2018, which you may remember was one of the peak wildfire burning days of Camp Fire. There are a lot of steps to calculating the K-nearest neighbours, but you'll get to produce a really cool plot at the end!<br>

We will use KNN to plot a map of predicted PM2.5 concentrations in locations throughout California. We've gone ahead and created the PM2.5 dataset for you to use as `pm25_nov9.csv`.

In [None]:
# Run the following cell
nov9 = pd.read_csv('data/pm25_nov9.csv', low_memory=False)
nov9.head()

In addition, we've also gathered together a dataset containing the latitude and longitude coordinates of every major city and town in the state of California as `ca_cities_towns.csv`. We will use these as our locations on which we will run our algorithm to predict PM2.5 concentrations.

In [None]:
# Run the following cell
ca_locations = pd.read_csv('data/ca_cities_towns.csv', low_memory=False)
ca_locations.head()

For our purposes, nearest neighbor proximity will be based on spatial distance. For each location, we will find its K-nearest neighbors in the EPA dataset, and then we will use their average PM2.5 concentration as the forecast for that location. This simple but effective algorithm should allow us in the end to create a map of California where we can color locations based on their observed and predicted PM2.5 concentrations.<br>

Before we jump into writing the KNN algorithm, we'll do a quick review of what KNN is (the lecture 11 slides and section 2.2 of ISLR are also helpful resources here). KNN estimates the value at a point by taking an average of the K nearest values to that point (so if K = 2, then it takes the average value of the 2 nearest points). Mathematically, this looks like:

$$\hat{y}_j=\frac{1}{K}\sum_{i \in N_j}y_i$$<br>

In the formula above, we're trying to predict the value of $y$ at position *j*. $N_j$ is the set of the $K$ points closest to position *j*. The formula sums the values $y_i$ of all the points in the set $N_j$, and then divides by $K$ to get an average.
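To make the formula concrete, here is a minimal sketch on made-up one-dimensional data. The positions, values, and the helper name `knn_average` are all hypothetical and are not part of the homework code; the point is only to show the "find the K nearest, then average" step.

```python
import numpy as np

# Hypothetical toy data: five monitor positions on a line and their readings.
positions = np.array([0.0, 1.0, 2.0, 5.0, 9.0])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

def knn_average(target, positions, values, K):
    """Average the values of the K positions closest to `target`."""
    distances = np.abs(positions - target)
    nearest = np.argsort(distances)[:K]  # indices of the K smallest distances
    return values[nearest].mean()

print(knn_average(1.2, positions, values, K=2))  # neighbors at 1.0 and 2.0 -> 25.0
print(knn_average(1.2, positions, values, K=3))  # adds the point at 0.0 -> 20.0
```

Notice how the prediction changes as K grows: each additional neighbor is farther away, so it pulls the average toward values from more distant points.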

### Writing the KNN Algorithm

Because we are working with an hourly dataset, we want to find out the PM2.5 concentration at each location by hour. This means that for each call to the algorithm that we will build in this section, we will need to go through our EPA dataset and select only the data that correspond to that hour.

In [None]:
# Run to see the recorded hours
np.unique(nov9['Time Local'])

A downside to KNN is that it can be particularly slow. If we are working with a large dataset, we will have to iterate many times over to find the K-nearest neighbors and thus our computational cost will be very high. `ca_locations` contains 1500+ cities and towns so we will need to decrease its size.

Additionally, since we plan to eventually combine these two datasets, it will be useful if we first categorize our data into types, meaning we need to flag the data in `nov9` as observed data and the data in `ca_locations` as predicted data.

<br>

<b>Question 1.1:</b> Write a `get_hour_data()` function that takes an hour parameter passed in as a string and returns a data frame containing only data from `nov9` that was recorded during that hour. Besides the columns in the original dataframe, this dataframe should also contain a column, 'Type', that contains a string value 'Observed'. This will be useful later on when we merge our observed and predicted values.

In addition, write a `get_sample()` function that when called returns a data frame with a random sample of 150 locations from `ca_locations`. This function should also take in a seed parameter passed in as an integer that allows us to replicate the random set of locations that we get every time we run the function. Like `get_hour_data()`, `get_sample()` should return a dataframe with all of the original columns plus a new column, 'Type', that contains a string value 'Predicted'. Again, this will be useful when we merge dataframes.

*Hint 1*: `np.repeat('a', 3)` returns `['a', 'a', 'a']`.<br>
*Hint 2*: There are many ways to select random samples: you can use the numpy method that we used in lab and lecture, or [`pandas.sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)
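As a quick illustration of the two hints, the sketch below works on a small made-up table (the `toy` dataframe and the `'Example'` label are hypothetical). It assumes the `random_state` argument of `DataFrame.sample()` is used as the seed; the numpy approach from lab works as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for a list of locations.
toy = pd.DataFrame({"Location": ["A", "B", "C", "D", "E"],
                    "Latitude": [34.0, 35.5, 36.1, 37.8, 38.6]})

# random_state fixes the random draw, so the same rows come back on every call.
first = toy.sample(n=3, random_state=42)
second = toy.sample(n=3, random_state=42)
print(first.equals(second))  # True

# np.repeat builds a constant label column of the right length.
labeled = first.copy()
labeled["Type"] = np.repeat("Example", len(labeled))
print(labeled)
```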

In [None]:
def get_hour_data(hour):   
    # YOUR CODE HERE
    return ...

In [None]:
def get_sample(seed):
    #YOUR CODE HERE
    return ...

In [None]:
# run this cell, do not change it
get_hour_data('10:00').head()

In [None]:
# run this cell, do not change it
get_hour_data('10:00').shape

In [None]:
# run this cell, do not change it
get_sample(1).head()

In [None]:
# run this cell, do not change it
get_sample(1).shape

Now that we are able to get our hour data and our sampled California cities and towns, it's time to run the KNN algorithm. The first step to running KNN is to find the distance between our locations of interest (California cities and towns) and our locations where PM2.5 is measured. Both the hour data and the sampled locations contain latitude and longitude coordinates. Let's take advantage of that by defining a function that finds the distance between any two points given each point's latitude and longitude values, which will help us when comparing nearest distances. We can use the Euclidean distance formula, which says that if we have one point $(a_1,b_1)$ and another point $(a_2,b_2)$, the distance between them is:<br>

$distance = \sqrt{(a_1-a_2)^2 + (b_1-b_2)^2}$

*Side note*: calculating distances between latitude-longitude pairs is often more complicated than the formula above because the distance between two points of longitude actually varies based on how far away you are from the equator. Since we're calculating distances over a relatively small area (the state of California), we can use the approximation above. If you wanted to accurately look at distances between latitude-longitude pairs over a larger area of the globe, you would have to use a slightly more involved trigonometry formula.
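For the curious, one common version of that "more involved" formula is the haversine formula. The sketch below is optional background and is not needed for this homework; the function name `haversine_km` and the mean Earth radius of 6371 km are our own choices for illustration.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One degree of longitude spans less ground distance farther from the equator.
print(haversine_km(0.0, 0.0, 0.0, 1.0))    # about 111 km at the equator
print(haversine_km(40.0, 0.0, 40.0, 1.0))  # about 85 km at 40 degrees north
```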

<b>Question 1.2:</b> Define the function `find_distance()`, which returns the distances between each pair in a series of coordinates $(x, y)$ and a single coordinate $(a, b)$ using the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). The value it returns should be a list of distances with the same length as $x$ and $y$.

In [None]:
def find_distance(x, y, a, b):
    # YOUR CODE HERE

In [None]:
# run this cell to check your function; do not change it
print(find_distance([5,3],[0,4],0,0)) # calculate the distance from (5,0) and (3,4) to (0,0)

**Question 1.3**: Using `find_distance()`, we're going to create a distance matrix. Each row will be one of the 150 sampled California locations where we're interested in predicting the PM2.5 level, and each column will represent one of the measured PM2.5 value locations. The elements of the array will represent the distance between each California town or city and each measured PM2.5 value location. For instance, if Oakland was the first out of 150 cities to appear in our sampled dataframe, then row 0 would contain distances between Oakland and every measurement location in the PM2.5 dataset.<br>

Define a function, `get_dist_array()`, that creates this array of distances. As input, it will take in `hour_data` - a dataframe of observations at a given hour - and `ca_sample` - a dataframe of sampled California data. It should return a numpy array with 150 rows (for each California location) and a number of columns equal to the number of rows in `hour_data` (i.e. the number of observed measurements in that hour).

In [None]:
def get_dist_array(ca_sample, hour_data):
    dist_array = np.zeros((ca_sample.shape[0], hour_data.shape[0])) # initialize an array of zeros
    for i in range(ca_sample.shape[0]): # loop through CA cities/towns
        # calculate distance between each city/town and each measurement location, 
        # and add to array row
        dist_array[i,:] = ... # YOUR CODE HERE
    
    return dist_array

In [None]:
# run this cell; do not change it
hour_data = get_hour_data('10:00')
ca_sample = get_sample(1)
print(get_dist_array(ca_sample, hour_data))
print(get_dist_array(ca_sample, hour_data).shape)

**Question 1.4:** Now, write a function that predicts PM2.5 measurements for each point in your set of sampled California towns and cities using K-nearest neighbours. This function, `predict_measurements()`, should take in as parameters the hour data `hour_data`, the dataframe of sampled California towns `ca_sample`, and a value for $K$.

You can use `get_dist_array()` to find the spatial distance between each town and each measurement location, and then select the K-nearest neighbours in each row (remember, the rows represent the California towns), and from there find the average PM2.5 measurement.

`predict_measurements()` takes in as parameters the hour data, the CA sample data, and a value for K. It should return the `ca_sample` dataframe with KNN performed on it, containing the predicted measurements under a 'Sample Measurement' column.

*Hint*: you may want to use the [np.argsort()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) function here.
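If `np.argsort()` is new to you, here is a small sketch of how it behaves on a two-dimensional array. The `dist` and `readings` arrays are made up for illustration and are much smaller than the arrays in this homework.

```python
import numpy as np

# Hypothetical 2 x 4 distance array: 2 target locations (rows), 4 monitors (columns).
dist = np.array([[0.9, 0.1, 0.5, 0.3],
                 [0.2, 0.8, 0.4, 0.6]])
readings = np.array([10.0, 20.0, 30.0, 40.0])  # one value per monitor

K = 2
# argsort along axis=1 orders the column indices in each row by increasing distance.
nearest = np.argsort(dist, axis=1)[:, :K]
print(nearest)                         # [[1 3] [0 2]]
print(readings[nearest].mean(axis=1))  # [30. 20.]
```

Indexing `readings` with the 2-D index array `nearest` returns an array of the same shape, so averaging along `axis=1` gives one value per row.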

In [None]:
def predict_measurements(hour_data, ca_sample, K):
    # get distances between CA cities/towns and measurement locations
    # for each CA city/town, get the average value of the K nearest measurement locations: this is your predicted measurement
    # add measurement predictions to ca_sample
    
    return ca_sample

In [None]:
# run this cell; do not change it
hour_data = get_hour_data('10:00')
ca_sample = get_sample(1)
predict_measurements(hour_data, ca_sample, 2).head()
# your ca_sample dataframe should have 5 columns: 
# Location, Latitude, Longitude, Type, and Sample Measurement

In the real world, data that we work with is often messy, incomplete, and/or missing important values. Case in point, the hourly dataset we pulled from the EPA website that we have been working with so far &mdash; although it contains precise latitude and longitude coordinates for each location &mdash; only contains the county name for each location and not the city or town. This is in contrast to `ca_locations` which contains city and town names.

When we plot all of our data, we would like to have the city and town names visible instead of county names for greater accuracy and clarity. We can use `ca_locations` to approximate the locations in the hour data based on their latitude and longitude coordinates.

<br>

<b>Question 1.5:</b> Write `approximate_locations()` which takes as input `hour_data` and `ca_locations`. For every point in the hour data, it should go through all the records in `ca_locations` and find the nearest city or town to that point. The function should return the hour data with an appended 'Location' column that contains the name of the closest city or town. Here, you can make use of `get_dist_array()` again - remember that each column of the array that `get_dist_array()` returns corresponds to a measurement location, and each element within a given column tells you the distance from that measurement location to a city or town in `ca_locations` (note that we're using `ca_locations` and not `ca_sample` in this case, because we want to look at all the possible cities or towns).

In [None]:
def approximate_locations(hour_data, ca_locations):
    
    # get distances between CA cities/towns and measurement locations
    # for each measurement location, find the nearest CA city/town: this is the location name corresponding to that measurement location
    # add locations to hour_data
    
    return hour_data

For a quick check of your results, you can choose a couple lat-long coordinates in `hour_data`, input them to Google Maps, and make sure that the location that your code is outputting is at or near those coordinates.

In [None]:
# run this cell; do not change it
hour_data = get_hour_data('10:00')
ca_sample = get_sample(1)
approximate_locations(hour_data, ca_locations).head()

The last thing we need to do before we can plot our data is more formatting. Taking a glance at `nov9`, we see that our PM2.5 sample measurements range anywhere from 0 LC to more than 300 LC, with most data falling far below 300 LC.

In [None]:
# run this cell to see the distribution of measurements
plt.hist(nov9["Sample Measurement"], bins = 20);
plt.title("Distribution of PM2.5 observations on Nov 9 2018")
plt.xlabel("PM2.5 concentration")
plt.ylabel("Count")
plt.show()

To allow our plot of measurements to have greater colour contrast, we will need to take the log of these measurements.
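To see why the log helps, consider the made-up values below (they are not taken from `nov9`). One thing to watch for, as an assumption about what your data may contain: `np.log` returns `-inf` for a value of 0 and `nan` for negative values, so it is worth checking for those before plotting.

```python
import numpy as np

# Hypothetical right-skewed measurements: mostly small, one very large.
raw = np.array([2.0, 5.0, 8.0, 12.0, 300.0])
logged = np.log(raw)

print(raw.max() / raw.min())        # 150.0: the largest value dominates a linear colour scale
print(logged.max() - logged.min())  # about 5.01: the log values are far more evenly spread
```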

In addition, we would like to add a 'Text' column to our data that will allow us to display information about each point when we plot the data. For each point we would like to display the city or town name, the data type (predicted or observed), and the PM2.5 sample measurement.

<br>

<b>Question 1.6:</b> Write a `convert_to_log()` function and an `add_text()` function that both take in a data frame. Assume that the data frame passed into these functions will be the hour data and the sampled locations concatenated into one data frame, with a column "Sample Measurement" that contains either the observed or predicted PM2.5 value, a column "Location" that contains the town or city name, and a column "Type" that contains the data type (observed or predicted).

`convert_to_log()` should return the data frame with an appended 'Log Sample Measurement' column.

`add_text()` should return the data frame with an appended 'Text' column where each entry is a string that contains the data point's location name, data type, and measurement. Be sure to round the measurement to 3 decimals.

In [None]:
def convert_to_log(data):
    # YOUR CODE HERE

In [None]:
def add_text(data):
    # YOUR CODE HERE

Now, we are able to create our KNN map. Let's use the functions we've defined above to write our KNN algorithm and graph the data.

<br>

<b>Question 1.7:</b> Write `knn_algorithm()`. For the parameters, it takes in a string for the hour to filter the data, an integer for the seed to choose the set of locations, and an integer $K$ to run the algorithm with.

Be sure that after you have predicted measurements for the sampled locations and approximated locations for the hour data, you concatenate them into one data frame. Once you have the total data, format it (by taking the log of measurements and adding a "Text" column) and plot it. We've provided for you a `run_plotly()` function that takes in the observed data, predicted data, total data, hour, and K, and plots the map using `plotly` and `mapbox`. The function takes in the observed and predicted data separately, so you will need to separate your total data after formatting it.

If you are stuck or unsure how to approach this problem, try looking back to see the order of the steps we took to get the data, run the algorithm, and format the data for plotting. If you later encounter any errors, try going back to your previous code to look for any potential mistakes.

In [None]:
def knn_algorithm(hour, seed, K):
    
    # get data for the specified hour, and using the seed get a random sample of CA cities/towns 
    # predict the measurement values for the sampled CA cities/towns
    # get the approximate locations for the hourly measured data
    # concatenate dataframes, convert the measurements to log values, and add a text column
    # subset your dataframe into observed and predicted data
    
    # return a plot of observed and predicted values
    return run_plotly(observed_data, predicted_data, total_data, hour, K)

### Analyzing the KNN Algorithm

Try out the KNN algorithm for `hour='12:00'`, `seed=100`, and `K=3`. When the map loads, try hovering over points, zooming in and out, right clicking and dragging, and toggling on/off options in the interactive legend to get a better grasp of what the data looks like in both a local and a regional sense. Once you've done that, try it out for different hours and for different values of K.

Try different hours to see how PM2.5 concentrations changed throughout the day. The K value, however, should be the main focus of your analysis.

Try different values of K to see the changes in predicted measurements. Keep in mind that larger values of K take longer to load; anything above K = 10 may take too long to run.

Also, try out different seeds, but keep in mind that the seed is meant to preserve a randomized set of locations, so when comparing different hours and K values it is best to keep the same seed.

In [None]:
# Run to see the recorded hours for reference
np.unique(nov9['Time Local'])

In [None]:
# YOUR CODE HERE

<b>Question 1.8:</b> Comment on what you think is a "good" value of K, and explain why. Note that there is no single right answer here, but there are undoubtedly better and worse options &mdash; what would be a bad value of K?

*Your answer here*

<b>Question 1.9:</b> What are other factors that might be affecting spatial distributions? Explain why it would be good to create a model that predicts concentrations based on location, nearby measurements *and* the other factors that you've listed.

*Your answer here*

---

## Section 2: Single Linear Regression with `scikit-learn` <a id='section2'></a>

Now that we've learned how to generate maps using KNN, we will learn how to use the simple single linear regression tool in [`scikit-learn`](http://scikit-learn.org/stable/), a popular Python package for machine learning algorithms. Their documentation is quite good, so feel free to browse if you would like to learn the details behind how their functions work.

For this section, we will use `scikit-learn` on the yearly PM2.5 dataset.

### Downloading and Filtering the Data

First, let's download the data we will be using for this section. Run the following cell below to download the zip files from the EPA website. Each file contains a dataset of annual air pollutant concentrations by site, or "monitor", and related data.

In [None]:
# Download the zip files from the EPA website
# This cell only needs to be run once
# Once the files are downloaded, they'll stay on datahub.
for year in np.arange(1998, 2019):
    airquality_url = 'https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_' + str(year) + '.zip'
    airquality_path = Path('annual_conc_by_monitor_' + str(year) +'.zip')
    if not airquality_path.exists():
        print('Downloading ' + str(airquality_path) + ' ...', end=' ')
        airquality_data = requests.get(airquality_url)
        with airquality_path.open('wb') as f:
            f.write(airquality_data.content)
        print('Done!')

Let's try to get a sense of what our data looks like. Run the next cell to see the 2018 dataset.

In [None]:
airquality_path = Path('annual_conc_by_monitor_2018.zip')
zf = zipfile.ZipFile(airquality_path, 'r')
f_name = 'annual_conc_by_monitor_2018.csv'

# Unzip the file
with zf.open(f_name) as fh:

    # Create data frame
    annual_2018 = pd.read_csv(fh, low_memory=False)

print(annual_2018.columns)

For this homework we will only be considering annual measures for PM2.5 in the state of California. Our goal right now is to create a single dataframe that compiles data from all of the annual files.

<br><b>Question 2.1:</b> The goal of the following cell is to create one dataframe, `pm25_ca`, that contains PM2.5 data ('Parameter Code' = 88101) with a sample duration of 24 hours and pollutant standard of 'PM25 Annual 2006', from California only.<br> 

To do this, you can look at each csv file within each zip file, read that .csv file into a dataframe, and create a filtered dataframe that contains only the data we care about based on the conditions above. Then, you can concatenate that dataframe to `pm25_ca` so that every time you run through the loop, you've added data from a different year to your dataframe. <br>

If you're unsure of where to start, look at the code block above to see how we open zipfiles and then access .csv files from within those zip files.
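The general "open, filter, concatenate" pattern can be sketched on made-up data as follows. Everything here is hypothetical (the in-memory archives, the file names, the `State` and `Value` columns, and the helper `make_zip`); the EPA files have different names and many more columns, so treat this as a pattern rather than a solution.

```python
import io
import zipfile
import pandas as pd

def make_zip(csv_name, csv_text):
    """Build an in-memory zip archive holding a single CSV file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(csv_name, csv_text)
    buffer.seek(0)
    return buffer

# Hypothetical archives with made-up columns, one per year.
archives = {
    2001: make_zip("toy_2001.csv", "State,Value\nCalifornia,12.0\nNevada,9.0\n"),
    2002: make_zip("toy_2002.csv", "State,Value\nCalifornia,11.0\nOregon,7.0\n"),
}

frames = []
for year, archive in archives.items():
    with zipfile.ZipFile(archive, "r") as zf:
        with zf.open("toy_" + str(year) + ".csv") as fh:
            yearly = pd.read_csv(fh)
    # Boolean masks combine with & (each condition in parentheses).
    keep = (yearly["State"] == "California") & (yearly["Value"] > 0)
    frames.append(yearly[keep].assign(Year=year))

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the yearly pieces in a list and calling `pd.concat()` once at the end is one option; concatenating inside the loop, as described above, gives the same result.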

In [None]:
pm25_ca = pd.DataFrame() # initialize empty dataframe

for year in np.arange(1998, 2019):
  # YOUR CODE HERE

In [None]:
# run this cell
pm25_ca.head()

Run the cell below to see if the final data frame has the correct dimensions.

In [None]:
assert pm25_ca.shape == (2354, 55)

### Using `scikit-learn`

Now that our data is loaded, we can use `scikit-learn`. The `scikit-learn` package has a `linear_model` module, from which you can call `LinearRegression()` to generate a linear regression object:

`lm = linear_model.LinearRegression()`

The `.fit()` method of `lm` takes in arrays $X$ and $y$, where $X$ is a data frame of independent variables and $y$ is a data frame of the dependent variable, or our "target" data.

<b>Question 2.2:</b> Using `scikit-learn`, let's fit a linear regression model to predict PM2.5 concentrations by year for the city of Victorville, California. First, create a `pm25_victorville` data frame that contains only data from Victorville. Then, you can generate a linear regression object `lm_victorville`, and then fit that linear model and save the output to `fit_victorville`.<br>

*Note*: `scikit-learn` will only accept $X$ if the number of columns in the input array is explicitly defined. That means a one-dimensional array needs to have the dimensions `(number of observations,1)`. You'll have to get the values from your pandas dataframe and then use the `.reshape()` method to get the right dimensions (see the lecture 9 in-class notebook for reference); you can treat $y$ the same way, although a one-dimensional $y$ is also accepted. Alternatively, `scikit-learn` will also accept an input if it is a pandas dataframe rather than a pandas series; for example, defining $X$ as `df[['column_name']]` is acceptable in `scikit-learn` syntax but defining $X$ as `df['column_name']` is not.
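The shapes involved can be checked on a small made-up dataset. The `toy` dataframe below is hypothetical and lies exactly on a known line, so the fitted slope and intercept are easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical toy data lying exactly on the line y = 2x + 1.
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

X_series = toy["x"]                          # shape (4,): one-dimensional
X_reshaped = toy["x"].values.reshape(-1, 1)  # shape (4, 1)
X_frame = toy[["x"]]                         # shape (4, 1)
print(X_series.shape, X_reshaped.shape, X_frame.shape)

lm = linear_model.LinearRegression()
fit = lm.fit(X_reshaped, toy["y"])
print(fit.coef_, fit.intercept_)  # close to [2.] and 1.0
```

Passing `-1` to `.reshape()` asks numpy to infer that dimension from the length of the array, which saves you from typing the number of observations by hand.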

In [None]:
# YOUR CODE BELOW
pm25_victorville = ...
X = ...
y = ...
lm_victorville = ...
fit_victorville = ...

<b>Question 2.3:</b> Now that we've fitted the linear model `fit_victorville`, we can use it to predict the PM2.5 concentrations for each year. Our linear model has a `.predict()` method, which takes in X and returns an array of predicted $y$ values, one for each row of X. We can then plot these points using `matplotlib` and compare the regression line with the observed data points. Generate `y_prediction` and plot the `pm25_victorville` data as well as the regression line. Again, make sure to give the plot a title, label the axes, and choose a range for the xticks that makes sense.

In [None]:
# YOUR CODE HERE

Now that we've learned how to use the linear regression tool in `scikit-learn` to generate plots, let's do further analysis on the outputs. Namely, let's look at two coefficients that our linear regression object stores &mdash; the intercept and slope.

<b>Question 2.4:</b> Browse through the [`scikit-learn`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) documentation to find out how to call the intercept and slope attributes of `fit_victorville`, and print them.

In [None]:
# YOUR CODE HERE

<b>Question 2.5:</b> In the context of the plot we generated, try to make sense of our intercept and slope. What do they mean? Write down an explanation.

*Your answer here*

### Linear Regression on `pm25_ca`

Now that we've gotten practice with using the single linear regression function on a sample dataset, we are now able to observe the spatial distribution of annual changes in pollutant concentration for all locations in the state.

<b>Question 2.6:</b> Use what we've learned in this homework on the `pm25_ca` dataset to estimate and print out the slope (for PM2.5 concentration versus time) for all of California, and create the corresponding scatter plot and regression line. As always, be sure to give the plot proper formatting.

In [None]:
# YOUR CODE HERE

<b>Question 2.7:</b> What trends do you observe? How does the model for California compare to the model for Victorville?

*Your answer here*

**Question 2.8:** What does our model predict the average PM2.5 concentration in California will be in 2020? How about 2030?

In [None]:
# YOUR CODE HERE

---


## Section 3: Multiple regression using land-use regression data <a id='section3'></a>

The next two sections use the data used by Novotny et al. (2011). We'll use it to explore multiple linear regression and the important questions one has to ask when running and interpreting results.

We'll be using two different libraries: `scikit-learn` and `StatsModels`. `scikit-learn` is preferred in the machine-learning community, and is easier to use for methods concerning prediction (e.g., cross validation). `StatsModels` is preferred in the statistics and econometrics communities, shares syntax closer to R, and generally provides more statistical information.

**Question 3.1** Let's start by reading in the .csv file "BechleLUR_2006_finalmodel.csv", found in the data folder, as a Pandas dataframe named `df`. Print its first few rows.<br>

This is the data used in the Novotny et al paper, and it contains the response and predictor variables, as well as the model results (the predicted variable).

In [None]:
# YOUR CODE HERE

**Question 3.2** If the purpose of using multiple regression is to predict NO2 levels, which column is our response variable? Which columns are our predictor variables? State in words what each represents, along with their units of measurement. The Novotny et al paper is a good reference here.

*Your answer here*

**Question 3.3** Now let's get ready to do a multiple linear regression! There are some columns in the dataframe `df` that we will not be using as predictors or response variables - specifically Monitor_ID, Latitude, Longitude, State and Predicted_NO2_ppb. Create a new dataframe, `df_clean`, that does not include these variables, and print the first few rows.

In [None]:
# YOUR CODE HERE

**Question 3.4** Now, let's use `scikit-learn` to fit our linear model! In the cell below, fit a linear model (call it `sk_model`) using your response variable and your predictor variables. The process will be very similar to the process for fitting a linear model using a single predictor variable in section 2. Save the output of `.fit()` to `sk_fit`.

In [None]:
# YOUR CODE BELOW
X = ...
y = ...
sk_model = ...
sk_fit = ...

Run the cell below to print out the model's intercept and coefficients.

In [None]:
# run this cell
# Intercept
print("Intercept:", sk_fit.intercept_)
# Coefficients
print("Coefficients:", sk_fit.coef_)

Notice how scikit-learn is very simple to use, but is not always informative - in this case, we aren't told which column each of these coefficients corresponds to. In order to get this information, we are going to run linear regression using `statsmodels`, which is a library we haven't used before. Run the cell below to import `statsmodels`.

In [None]:
# run this cell
import statsmodels.api as sm

**Question 3.5** In the cell below, fit $X$ and $y$ to a linear model using `statsmodels`. The skeleton code below will get you started, but you should also check out the [documentation for linear modeling in statsmodels](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html).<br> 

A good check of whether or not you set up your `statsmodels` regression properly is if the coefficient and intercept values match up with those output by `scikit-learn`. If not, something went wrong in either this regression or the `scikit-learn` regression.
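To see why a column of 1s matters for the intercept, here is a sketch that uses plain numpy least squares (`np.linalg.lstsq`) in place of `statsmodels`, on made-up data. It is meant only as intuition: a model fitted without the constant column has no intercept term at all.

```python
import numpy as np

# Hypothetical toy data on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Without a column of ones the fitted line is forced through the origin.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a column of ones, the first coefficient plays the role of the intercept.
X_const = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(slope_only)  # a single slope, not equal to 2
print(coefs)       # close to [1. 2.]
```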

In [None]:
# unlike scikit learn, statsmodels expects a column of 1s in the x-input to the model
# one way to achieve this is to use the add_constant function on the X values that you defined previously
# try printing X2 to make sure it's working properly!
X2 = sm.add_constant(X) 

sm_model = sm.OLS(...)
results = sm_model.fit()
print(results.summary())

**Question 3.6** Choose one predictor variable. What is its coefficient? What are the bounds of its 95% confidence interval? What do these values mean?

*Your answer here*

---


## Section 4: Model selection <a id='section4'></a>

Now that we've tried producing a multiple regression model, we can think about model selection. Model selection can be thought of the process of choosing a subset of variables, but in order to do so, we first need a benchmark to compare different models. And given the benchmark, we also need a search strategy: how are we going to systematically include or exclude different variables in our model, and then calculate the benchmark values for those models? With a limited number a predictors, we are able to search all possible models (i.e. including all combinations of predictor variables). One way to assess a model is using the Aikake Information Criterion ($\text{AIC}$).

The $\text{AIC}$ assesses the ***quality*** of a model relative to other models fit to the same data. It balances how well the model fits against how many features it uses. Adding a feature to a model never makes the fit worse, but every feature also adds a penalty, so a new feature improves the $\text{AIC}$ only when it improves the fit by enough to outweigh that penalty.

We define $\text{AIC}$ as the following:

$\text{AIC} = 2 \times (\text{number of features}) - 2 \times \ln(\text{maximum value of likelihood function})$

The likelihood function measures how probable the observed $y$ values are under the model for a given set of coefficients; fitting the model chooses the coefficients that maximize it. We don't go into it in much depth, but we will provide the code to calculate it.
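For intuition, here is a small sketch of the maximized log-likelihood of a linear model under the assumption that the errors are Gaussian. In that case it depends only on the number of observations $n$ and the residual sum of squares $\text{RSS}$: $\ln \hat{L} = -\frac{n}{2}\left(\ln(2\pi) + \ln(\text{RSS}/n) + 1\right)$. The function name `gaussian_log_likelihood` and the residual values are made up for illustration.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood of a linear model with Gaussian errors.

    Uses the maximum-likelihood variance estimate RSS / n.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

# Hypothetical residuals from a close fit and from a poor fit.
tight_fit = gaussian_log_likelihood([0.1, -0.1, 0.1, -0.1])
loose_fit = gaussian_log_likelihood([1.0, -1.0, 1.0, -1.0])
print(tight_fit > loose_fit)  # True: smaller residuals give a higher log-likelihood
```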

The smaller the $\text{AIC}$, the better the model. One way to think about it: a small AIC means the likelihood is high, so there is a high likelihood that the coefficients predict the observed $y$ values. If one model uses fewer features than another but both have the same likelihood, the model with fewer features has the smaller AIC. In other words, AIC treats a model as high quality when it has a high probability of predicting the observed values while using as few features as possible.
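The trade-off between the number of features and the likelihood can be checked with plain arithmetic. The sketch below applies the AIC definition to made-up log-likelihood values; the function name `aic` and all numbers are hypothetical.

```python
def aic(num_features, log_likelihood):
    """AIC = 2 * (number of features) - 2 * ln(maximum likelihood)."""
    return 2 * num_features - 2 * log_likelihood

# Two hypothetical models with the same log-likelihood but different sizes.
small_model = aic(num_features=3, log_likelihood=-120.0)
large_model = aic(num_features=5, log_likelihood=-120.0)
print(small_model, large_model)  # 246.0 250.0

# The larger model only wins if its log-likelihood improves by more than
# the number of extra features (here, by more than 2).
large_model_better_fit = aic(num_features=5, log_likelihood=-117.5)
print(large_model_better_fit)  # 245.0
```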

$\text{AIC}$ is important because we can use it as a form of model selection. **Our goal is to find the model that has the highest *quality* (the lowest AIC) among a list of candidate models.** In this section, we'll load the file "BechleLUR_2006_allmodelbuildingdata.csv", which contains the features that were in `df_clean` as well as additional features.

In [None]:
# run this cell to load the csv
df_all = pd.read_csv("data/BechleLUR_2006_allmodelbuildingdata.csv")
df_all.head()

**Question 4.1** Fill in the code below to compute the AIC from the log-likelihood. `statsmodels` makes the log-likelihood available on the fitted model. In the function definition below, `fit_model` represents the output of a call to the `statsmodels` `.fit()` method (e.g. the `results` variable that we defined above to get the multiple regression). `k` represents the number of features in the model.

*Note*: `statsmodels` also returns AIC directly, but we'd like you to do at least *a little* work to compute AIC here! Check the [attributes section of the linear regression documentation](https://www.statsmodels.org/stable/regression.html) to figure out how to grab the likelihood value.

In [None]:
def computeAIC(fit_model,k):
    llf = ... # get log-likelihood
    AIC = ... # calculate AIC
    return AIC

**Question 4.2** Use the function that we defined in the previous question to compute the AIC of the final model from part 3 of the homework.

In [None]:
# YOUR CODE HERE

As stated earlier, the lower the AIC the better. Let's choose our own features and see if we can create a model that has a comparable AIC; we can start off choosing a few features and see what we get.


**Question 4.3** Choose the features `Population_800`, `Major_1200`, `Impervious_2500`, `Major_400`, and choose two more of your choice! Then, fit this model and calculate the AIC.

In [None]:
# YOUR CODE HERE

Let's try computing a model with fewer features.

**Question 4.4** From the previous model, keep only `Population_800`, `Major_400`, and `Major_1200` and calculate the AIC. 

In [None]:
# YOUR CODE HERE

**Question 4.5** In this question, you'll make a plot that shows the AIC value and the likelihood function on the y-axis and $k$ on the x-axis, ranging from k = 1 to the total number of features in `df_all`. You can approach this however you want, but you do have to explain your approach - specifically, how did you choose which features to add for each $k$ value? Do you notice any trends in the AIC and likelihood values? Can you explain that trend, based on what you know about how AIC is calculated?<br>

*Note 1*: we're not asking you to calculate AIC for every combination of independent variables, just for different numbers of independent variables (features).

*Note 2*: when plotting the AIC value and the likelihood function, you'll need to use two different y scales to be able to display it. The skeleton code below sets that up for you. You're welcome to change the formatting if you'd like; in any case, you do have to write the input values to the `.plot()` commands.

In [None]:
# YOUR CODE HERE

In [None]:
# plotting AIC and llf
fig, ax1 = plt.subplots()

color = 'darkred'
ax1.set_xlabel('k')
ax1.set_ylabel('AIC', color=color)
ax1.plot(..., color=color) # YOUR CODE HERE
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()

color = 'lightseagreen'
ax2.set_ylabel('likelihood function', color=color)
ax2.plot(..., color=color) # YOUR CODE HERE
ax2.tick_params(axis='y', labelcolor=color)

ax1.set_yticks(np.linspace(ax1.get_yticks()[0], ax1.get_yticks()[-1], len(ax2.get_yticks())))
ax1.grid(None)
ax1.set_ylim(bottom = ax1.get_yticks()[0])
ax2.set_ylim(bottom = ax2.get_yticks()[0])

fig.suptitle(...) # YOUR CODE HERE
plt.show()

*Your discussion here*

---


## Section 5: Project <a id='section5'></a>

The questions in this section should be answered separately for each group member (i.e. each member of your group will submit their own answers to these questions).<br>

Last week, you all put down your initial ideas for the project. In this final section of the homework, we'll do a check-in on how project development is going; most homeworks from now on will have a short section about the project.<br>

**Question 5.1** What was challenging about answering the project-related questions in HW 4 (defining a prediction problem and listing relevant datasets)? If you've been able to work through those challenges, how have you done so? <br> 

*Note*: try to be as descriptive as possible here! E.g. instead of saying "finding data was hard", you can say "I wanted to find non-US data for drinking water quality and it has been challenging to locate a dataset". You can discuss conceptual challenges (e.g. figuring out if a question is phrased as a prediction problem) or practical challenges (e.g. finding time to meet with your group). This question is mainly here so that we can organize the lab time and the resources we provide in a helpful way.

*Your answer here*

**Question 5.2** In a few sentences, give some context for your prediction problem. Have you come across any work that answers questions that are similar or related to the ones that you are asking? What results have they found? What are you hoping to do differently from other researchers who have asked similar questions?<br>

*Note*: we're definitely not expecting you to review a lot of academic papers and projects for this question, but you should take a look around to see if there are any papers or reports that ask similar questions or use similar data - beyond giving the reader context for your project in your final report, looking at other people's work can give you ideas for how to approach your own project.

*Your answer here*

**Question 5.3** In a few sentences, explain the *motivation* behind your prediction problem. Who would be interested in seeing the results of your prediction model? Why is it important to answer this question?

*Your answer here*

**Question 5.4** Open one of your potential data sources, load it into a pandas DataFrame, and grab some descriptive statistics about it using the DataFrame's `.describe()` method (e.g. `df.describe()`). Paste the output below (you can also load it and run `.describe()` below, but if it's a very large dataset you might hit the memory limit, in which case you should load and inspect it in a separate notebook and then paste the output below). What do you notice when you run `.describe()`? Is there anything surprising or expected about the output?

*Your answer here*

----
## Submission

Congrats, you've finished homework 5! In the dependencies code block, make sure to uncomment everything in `# uncomment this for final version` and comment everything in `# comment this out for final version`.

Before you submit, click **Kernel** --> **Restart & Clear Output**. Then, click **Cell** --> **Run All**. Then, go to the toolbar and click **File** --> **Download as** --> **.html** and submit the file through bCourses.


---

## Bibliography

- Adi Bronshtein - Referred to KNN concepts. https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7

- Anwar A. Ruff - Used normal equation example as model. https://github.com/aaruff/Course-MachineLearning-AndrewNg/blob/master/NormalEquation.ipynb

- Introduction to Statistical Learning - Referred to KNN concepts. https://www-bcf.usc.edu/~gareth/ISL/

- Manu Jeevan - Adapted scikit-learn techniques. http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/

- Maps of World - Obtained latitude/longitude of CA cities and towns. https://www.mapsofworld.com/usa/states/california/lat-long.html

- scikit-learn.org - Referred to scikit-learn documentation. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

- Shawon Ashraf - Adapted normal equation implementation techniques. https://www.c-sharpcorner.com/article/normal-equation-implementation-from-scratch-in-python/

---
Notebook developed by: Joshua Asuncion

Data Science Modules: http://data.berkeley.edu/education/modules
