In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ER 131] Homework 6: Land Use Regression and Model Selection


In this homework, we will continue to build linear regression models. Instead of the single-variable regression models of Homework 5, we will build multi-variable regression models. Specifically, we will work with the Novotny et al. (2011) data to build and analyze the performance of land use regression (LUR) models that predict nitrogen dioxide concentration near the Earth's surface.

### Table of Contents
1. [Project](#section1)<br>
1. [Multiple linear regression using land-use regression data](#section2)<br>
1. [Model selection](#section3)<br>

---


## Section 1: Project <a id='section1'></a>

This week, your group should work collaboratively to refine your research question and identify candidate data sources. You may develop answers to Questions 1.1 and 1.2 as a group, but if you do so, please identify each member's unique contribution (for example: "Jessica summarized reports or datasets #1-2, Duncan summarized reports or datasets #3-4...")

**Question 1.1** Give some context for your prediction problem. Have you come across any work that answers questions that are similar or related to the ones that you are asking? What results have they found? What are you hoping to do differently from other researchers who have asked similar questions?<br>

We're definitely not expecting you to review a lot of academic papers and projects for this question, but you should take a look around to see if there are any papers or reports that ask similar questions or use similar data - beyond giving the reader context for your project in your final report, looking at other people's work can give you ideas for how to approach your own project. 

One way you could approach this question is for each team member to identify and summarize 1-2 relevant citations. Summaries should focus less on the specifics of other papers and more on how findings in the literature inform and motivate your research question. Ultimately, this information can inform the introduction and motivation sections of your final project.

*YOUR ANSWER HERE*

**Question 1.2** Identify, open, and summarize your group's candidate datasets. Try to find $1+N_{s}$ relevant data sources, where $N_{s}$ is the number of students in your group. For **each** dataset:
- Insert a link to and a brief description of the data. 
- Open the dataset and incorporate the dataset into a Pandas dataframe if feasible.
- Grab some descriptive statistics about the dataset using the dataframe's `.describe()` method (e.g., `df.describe()`). Paste the output below (you can also load the data and run `df.describe()` below, but if it's a very large dataset you might hit the memory limit, in which case you should load and inspect it in a separate notebook and then paste the output below). What do you notice when you run `df.describe()`? Is anything in the output surprising, or is it what you expected?

No need to do a full-scale EDA at this point (we'll do that next week!); this week's focus is on making sure you can open and summarize your candidate datasets.
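As a minimal sketch of what a descriptive summary looks like, the cell below calls `describe()` on a tiny made-up dataframe. The column names and values are hypothetical and are not taken from any course dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
toy = pd.DataFrame({
    "no2_ppb": [12.1, 8.4, 15.3, 9.9],
    "elevation_km": [0.10, 0.35, 0.02, 1.20],
    "state": ["CA", "CA", "NV", "OR"],
})

# describe() is a dataframe method; by default it summarizes numeric columns only.
summary = toy.describe()
print(summary)

# include="all" also reports counts and unique values for non-numeric columns.
print(toy.describe(include="all"))
```

Comparing the `count` row against the number of rows is a quick way to spot missing values.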

*YOUR ANSWER HERE*

---


## Section 2: Multiple linear regression using land-use regression data <a id='section2'></a>

In the remainder of this homework, we will dig into the data used by [Novotny et al. (2011)](https://bcourses.berkeley.edu/files/78396490/download?download_frd=1). We'll use these data to explore multiple linear regression, land use regression, and the important questions one has to ask when conducting these types of analyses and interpreting results.

We'll be using two different libraries: `scikit-learn` and `statsmodels`. `scikit-learn` is preferred in the machine-learning community and is easier to use for prediction-oriented methods (e.g., cross-validation). `statsmodels` is preferred in the statistics and econometrics communities, has syntax closer to R's, and generally provides more statistical information.

**Dependencies**

In [None]:
# run this cell
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
pd.set_option('display.max_columns', 150)

**Question 2.1** Let's start by reading in the .csv file "BechleLUR_2006_finalmodel.csv", found in the data folder, as a Pandas dataframe named `df`. Print its first few rows.<br>

These are the data used in the Novotny et al. (2011) paper. The dataframe contains the response and predictor variables, as well as the model results (i.e., the predicted values of the response).

In [None]:
# YOUR CODE HERE
df = ...

**Question 2.2** If your goal is to use multiple linear regression to predict NO2 levels, which column represents the response variable (i.e., the "truth" that you are trying to predict)? Which columns are the independent variables (i.e., features)? State in words what the response variable and each independent variable represent, along with their units of measurement. The Novotny et al. (2011) paper is a good reference here.

*YOUR ANSWER HERE*

**Question 2.3** Let's filter our dataframe to make it easier to do multiple linear regression. We will not be using all columns in the dataframe `df` as independent or response variables - specifically, we will ignore Monitor_ID, Latitude, Longitude, State and Predicted_NO2_ppb. Create a new dataframe, `df_clean`, that does not include these variables, and print the first few rows.

In [None]:
# YOUR CODE HERE

**Question 2.4** There is one qualitative variable in our dataframe. Which one is it? What are its possible values?

In [None]:
# SCRATCH WORK HERE

*YOUR ANSWER HERE*

Let's transform the qualitative categories of the variable you identified in Question 2.4 into a set of binary variables using the Pandas one-hot encoder. 
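For reference, here is a small sketch of one-hot encoding with `pd.get_dummies` on a made-up categorical column (the name `land_cover` and its values are hypothetical). In recent pandas versions the result holds `True`/`False` by default, so the sketch passes `dtype=int` to get 0/1 values.

```python
import pandas as pd

# Hypothetical categorical column; not taken from the homework data.
toy = pd.DataFrame({"land_cover": ["forest", "urban", "water", "urban"]})

# One column per category; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(toy["land_cover"], dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column matching that row's category.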

**Question 2.5** What is the minimum number of binary variables you will need to create in order to represent the categorical data in the qualitative variable you identified in Question 2.4?

*YOUR ANSWER HERE*

**Question 2.6** Replace the ellipsis in the cell below to create a dataframe in which the columns are the binary variables (corresponding to the categories of our qualitative variable), the rows correspond to observations, and the elements are either 0 or 1.

In [None]:
binary_vars = pd.get_dummies(...) # YOUR CODE HERE
binary_vars.head()

In [None]:
assert binary_vars.shape == (369,4)

**Question 2.7** Replace the qualitative variable in your `df_clean` dataframe with the set of binary variables you produced in Question 2.6. In other words, `drop` the column containing the qualitative variable and `concat` the columns you produced in 2.6. Do not change the name of `df_clean`.
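The drop-then-concat pattern can be sketched on a toy dataframe as follows. All names here (`toy`, `land_cover`, `toy_encoded`) are hypothetical stand-ins rather than the homework's variables.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0],
    "land_cover": ["forest", "urban", "forest"],
})
dummies = pd.get_dummies(toy["land_cover"], dtype=int)

# drop() returns a new dataframe without the categorical column;
# concat with axis=1 places the dummy columns alongside it, aligned on the index.
toy_encoded = pd.concat([toy.drop(columns="land_cover"), dummies], axis=1)
print(toy_encoded)
```

Because `concat` aligns on the index, both pieces should share the same index; otherwise rows of `NaN` can appear.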

In [None]:
# YOUR CODE HERE

**Question 2.8** Now, let's use `scikit-learn` to fit our linear model. In the cell below, fit a linear model (call it `sk_model`) using the response and independent variables in `df_clean`. The process will be very similar to the process for fitting a linear model with a single independent variable (see Lab and Homework 5). Save the output of `.fit()` to `sk_fit`.

In [None]:
# YOUR CODE HERE

Print your model's intercept and coefficients.

In [None]:
# YOUR CODE HERE
# Intercept
print("Intercept:", ...)
# Coefficients
print("Coefficients:", ...)

Notice how `scikit-learn` is very simple to use, but is not always informative: in this case, we aren't told which column each of these coefficients corresponds to. In order to get that information, we are going to run linear regression using the `statsmodels` library.
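As an aside, the coefficients returned by `scikit-learn` follow the column order of the feature matrix, so they can be labeled by hand. The sketch below uses synthetic, noiseless data with made-up column names `x1` and `x2`, so the fit recovers the values used to generate `y`.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data; x1 and x2 are hypothetical feature names.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y = 1.0 + 2.0 * X["x1"] - 3.0 * X["x2"]  # noiseless by construction

model = linear_model.LinearRegression().fit(X, y)

# coef_ follows the column order of X, so zip pairs each name with its coefficient.
coef_by_name = dict(zip(X.columns, model.coef_))
print("Intercept:", model.intercept_)
print("Coefficients:", coef_by_name)
```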

**Question 2.9** In the cell below, fit a linear model to $X$ and $y$ using `statsmodels`. The skeleton code below will get you started, but you should also check out the [documentation for linear modeling in statsmodels](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html).<br> 

Don't forget to add a column of 1's to $X$ so that `statsmodels` can fit an intercept.

In [None]:
# YOUR CODE HERE
X2 = sm.add_constant(...)
sm_model = sm.OLS(...)
results = sm_model.fit()
results.summary()

**Question 2.10** A good check of whether or not you've set up your `statsmodels` regression properly is whether the coefficient and intercept values match those output by `scikit-learn`. Compare your outputs from 2.8 and 2.9. Were your intercepts and all of the coefficients the same across the two methods? If not, why do you think this might be the case? How might you change the inputs to either the `scikit-learn` or the `statsmodels` fit to make the outputs match?

*Hint:* If you're stuck, go back and listen closely to video W6.2.2 of the asynchronous modules.

In [None]:
# SCRATCH WORK HERE

*YOUR ANSWER HERE*

**Question 2.11** Examine the bounds of the 95% confidence interval provided in the output from Question 2.9 (i.e., the "[0.025" and "0.975]" columns). For which of the independent variables would you be most skeptical that a relationship exists between that variable and the response variable? Why?

*YOUR ANSWER HERE*

---


## Section 3: Model selection <a id='section3'></a>

Now that we've produced a multiple regression model, we can think about model selection. Model selection is the process of choosing a subset of independent variables to include in a regression model. To do this, we need a search strategy: how are we going to systematically include or exclude different variables in our model? 

One way to assess a model is using the Akaike Information Criterion ($\text{AIC}$). The $\text{AIC}$ assesses the ***quality*** of a model by weighing how well the model fits the data against how many independent variables it uses. Sometimes adding more independent variables improves quality, and sometimes it doesn't.

We define $\text{AIC}$ as the following:

$\text{AIC} = 2 \times (\text{number of features}) - 2 \times \ln(\text{maximum value of likelihood function})$

The likelihood function measures how probable the observed $y$ values are under the model for a given set of coefficients; its maximum value is the likelihood evaluated at the fitted coefficients. We won't go into it in much depth, but we will provide the code to calculate it.

The smaller $\text{AIC}$ is, the better the model. One way to think about it: for a fixed number of features, a small $\text{AIC}$ means the maximized likelihood is high, i.e., the observed $y$ values are highly probable under the fitted coefficients. And if one model uses fewer features than another but both reach the same maximized likelihood, then the model with fewer features has the smaller AIC. AIC thus treats models that make the observed values relatively probable, while using relatively few features, as high-quality models.

$\text{AIC}$ is important because we can use it as a benchmark for model selection. **Our goal is to find a model that has the highest *quality*--i.e., the lowest AIC.** 
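To make the trade-off concrete, the sketch below evaluates the formula above by hand on synthetic data, using only NumPy. It assumes Gaussian errors, for which the maximized log-likelihood of a least-squares fit has the closed form used in `gaussian_loglik`; the function names and the `junk` feature are hypothetical. Here `k` counts only the features, following the definition above. Some software also counts the intercept as a parameter, which shifts every AIC by the same constant when all compared models include an intercept.

```python
import numpy as np

def gaussian_loglik(y, X):
    """Maximized Gaussian log-likelihood of a least-squares fit of y on the columns of X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # Log-likelihood at the fitted coefficients with variance estimate rss / n.
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def aic(loglik, k):
    """AIC = 2 * (number of features) - 2 * ln(maximized likelihood)."""
    return 2 * k - 2 * loglik

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                  # unrelated to y by construction
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

ones = np.ones(n)                          # intercept column
X_small = np.column_stack([ones, x1])
X_big = np.column_stack([ones, x1, junk])

ll_small = gaussian_loglik(y, X_small)
ll_big = gaussian_loglik(y, X_big)
print("log-likelihoods:", ll_small, ll_big)
print("AICs:", aic(ll_small, k=1), aic(ll_big, k=2))
```

Adding a feature can never lower the maximized log-likelihood of a nested least-squares model, so the extra feature only lowers the AIC if it raises the log-likelihood by more than 1.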

**Question 3.1** Load the file "allmodelbuildingdata.csv" into a Pandas dataframe. This dataframe contains all the features that were in `df` as well as additional features.

In [None]:
# YOUR CODE HERE
df_all = ...
df_all.head()

**Question 3.2** Fill in the code below to compute the AIC from the log-likelihood. `statsmodels` stores the log-likelihood as an attribute of the fitted model. In the function definition below, `fit_model` represents the output of a call to the `statsmodels` `.fit()` method (e.g., the `results` variable that we defined above for the multiple regression). `k` represents the number of features in the model.

*Note*: Yes, `statsmodels` also returns AIC directly, but we'd like you to do at least *a little* work to compute AIC here! Check the [attributes section of the linear regression documentation](https://www.statsmodels.org/stable/regression.html) to figure out how to grab the log-likelihood value.

In [None]:
def computeAIC(fit_model,k):
    llf = ... # get the log-likelihood from fit_model
    AIC = ... # calculate AIC
    return AIC

**Question 3.3** Use `computeAIC` to compute the AIC of the `results` model from Question 2.9 of the homework. Check that your result matches the AIC given in the `statsmodels` summary.

In [None]:
# YOUR CODE HERE

As stated earlier, the lower the AIC the better. Let's choose our own features and see if we can create a model that has a comparable AIC; we can start off choosing a few features and see what we get.


**Question 3.4** Choose the features `WRF+DOMINO`, `Distance_to_coast_km`, `Elevation_truncated_km`, `Impervious_100`, and two more features of your choice. Then, fit this model and calculate the AIC.

In [None]:
# YOUR CODE HERE

Let's try computing a model with fewer features.

**Question 3.5** From the previous model, keep only `WRF+DOMINO`, `Distance_to_coast_km`, and `Impervious_100` and calculate the AIC. Did the quality of your model improve?

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

**Question 3.6** Make a plot that shows model quality and the likelihood function as a function of the number of independent variables. Your plot will have two subplots: the y-axis of the first will be AIC, and the y-axis of the second will be the log-likelihood. The x-axis of both subplots is $k$, ranging from $k = 1$ to the total number of features in `df_all`. You can approach this however you want, but you do have to explain your approach: specifically, how did you choose which features to add for each $k$ value? Do you notice any trends in the AIC and likelihood values? Can you explain those trends, based on what you know about how AIC is calculated?<br>

*Note*: we're not asking you to calculate AIC for every combination of independent variables, just for different numbers of independent variables (features).

*YOUR APPROACH HERE*

In [None]:
# YOUR CODE HERE

In [None]:
# # YOUR PLOT HERE
# fig, (ax0,ax1) = plt.subplots(nrows=2, sharex=True, figsize = (12,8))

# ax0.plot(...,..., color = 'navy')
# ax0.set_xlabel(...)
# ax0.set_ylabel(...)
# ax0.set_title(...)

# ax1.plot(...,..., color = 'gold')
# ax1.set_xlabel(...)
# ax1.set_ylabel(...)
# ax1.set_title(...)

# plt.show()

**Question 3.7** Approximately how many features does the highest quality model have, based on AIC? How many features maximize the likelihood that the model predicts the true response variable?

*YOUR ANSWER HERE*

----
## Submission

Congrats, you've finished Homework 6! 

Before you submit, click **Kernel** --> **Restart & Clear Output**. Then, click **Cell** --> **Run All**. Then, go to the toolbar and click **File** --> **Download as** --> **.html** and submit the file through bCourses.


---
Notebook developed by: Joshua Asuncion, revised by Jessica Katz

Data Science Modules: http://data.berkeley.edu/education/modules
