# [ERG 190C] Homework 6

<u>Learning Objectives</u>
1. Understand the basic structure of this land-use regression dataset
2. Learn how to run multiple linear regression with scikit-learn and interpret coefficients
3. Learn how to interpret p-values, and the pitfalls of p-hacking
4. Learn how to use a model selection criterion like AIC
5. Address some of the potential problems that arise in linear regression models

This homework uses the data set from the study in Novotny et al., ES&T (2011). We'll use it as a basis for exploring multiple linear regression and the important questions one has to ask when running a regression and interpreting its results.

We'll be using two different libraries: scikit-learn and StatsModels. Scikit-learn is preferred in the machine-learning community and is easier to use for prediction-oriented methods (e.g., cross-validation). StatsModels is preferred in the statistics and econometrics communities, has syntax closer to R, and generally provides more statistical information.

### 1. Land Use Regression Dataset  <a id='section1'></a>

In this homework, we are going to use the Land Use Regression Dataset to do basic multiple linear regression using scikit-learn and StatsModels.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

To understand the basic structure of the dataset, read in the .csv file named "BechleLUR_2006_finalmodel.csv" as a Pandas dataframe. Print its first few rows.

In [None]:
# Replace the ... with your code
df_final = ...
df_final.head()

**Q1)** If our purpose of using multiple regression is to predict NO2 levels, which column is our response variable? Which columns are our predictor variables? State in words what each represents, along with their units of measurement. The publication is included in the lab06 folder on DataHub.

In [None]:
# your answer here

### 2. Basic Multiple Regression <a id='section2'></a>

There are several variables that we will not use in our regression, specifically Monitor_ID, Latitude, Longitude, State and Predicted_NO2_ppb.

**Q2.1)** Assign a dataframe without those columns to a new variable called df_final_clean.

In [None]:
# Replace the ... with your code
df_final_clean = ...
df_final_clean.head()

**Q2.2)** We will start off with scikit-learn. In order to use scikit-learn, we need to organize our data properly.

- Assign X to a dataframe that contains all relevant columns *except for* the response variable.
- Assign Y to only the response variable column.

In [38]:
# Replace the ... with your code
X = ...
Y = ...

**Q2.3)** Using [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), fit X and Y to a linear model.

In [None]:
# Replace the ... with your code
from sklearn import linear_model

sk_model = linear_model...
sk_model.fit(...)

Run the cell below to print out the model's intercept and coefficients.

In [6]:
## RUN THIS CELL - do not edit

# Intercept
print("Intercept:", sk_model.intercept_)
# Coefficients
print("Coefficients:", sk_model.coef_)

Intercept: 2.557225014332216
Coefficients: [ 7.20408718e-01  9.42846187e-02  1.06405415e+01  3.21020598e-01
  2.48321581e+00 -1.19248325e-03]


Notice how scikit-learn is very simple to use, but is not always informative - in this case, we aren't told which column each of these coefficients corresponds to. In order to get this information, we are going to run linear regression using StatsModels.
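As a side note, scikit-learn returns `coef_` in the same order as the columns passed to `fit`, so the coefficients can be labeled by hand. The sketch below shows this on a small made-up dataset; the names `toy_X`, `toy_y`, `toy_model`, and `labeled_coefs` are hypothetical and are not part of the homework data.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Toy data (hypothetical, not the homework dataset): y = 1 + 2*a - 3*b exactly
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
toy_y = 1.0 + 2.0 * toy_X["a"] - 3.0 * toy_X["b"]

toy_model = linear_model.LinearRegression().fit(toy_X, toy_y)

# coef_ follows the column order of the dataframe passed to fit(),
# so the column names can serve as labels
labeled_coefs = pd.Series(toy_model.coef_, index=toy_X.columns)
print(labeled_coefs)
print("intercept:", toy_model.intercept_)
```

This only recovers the names; it still gives no standard errors or p-values, which is why StatsModels is used next.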

**Q2.4)** Using [StatsModels](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html), fit X and Y to a linear model, and print out the model summary.

In [None]:
# Replace the ... with your code
import statsmodels.api as sm

# In order to have an intercept, we need to add a column of 1's to X
X2 = sm.add_constant(X)

sm_model = sm...
results = sm_model...
print(results...)

This output includes much more statistical information, including the p-values of the coefficients!

**Q2.5)** Pick 3 of the coefficients. What does each coefficient represent? What are their units, and what do their 95% confidence intervals mean?

### 3. p-Values and p-Hacking

In the previous problem, we created a multiple regression model using the StatsModels package. We now use StatsModels to find the p-values of the intercept and of each predictor's coefficient:

In [8]:
#Run this cell
results.pvalues

const                     2.634006e-09
WRF+DOMINO                1.840009e-58
800m_Impervious_%         2.056259e-21
Elevation_truncated_km    1.950387e-09
800m_MajorRoads_km        1.865722e-04
100m_MinorRoads_km        1.998295e-02
Distance_to_coast_km      2.417248e-03
dtype: float64

For each term, StatsModels tests the null hypothesis that the term's coefficient is zero, i.e., that there is no linear relationship between the term ($x$) and the response ($y$) once the other terms in the model are accounted for. We reject the null hypothesis when the p-value falls below the significance level $\alpha$, which is the largest probability of wrongly rejecting a true null hypothesis that you are willing to accept. In class we used $\alpha = 0.05$; the other popular choice is $\alpha = 0.01$, which makes the test stricter.
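The decision rule described above can be sketched in a few lines. The p-values below are made up for illustration and are not the homework results; `toy_pvalues` and `decisions` are hypothetical names.

```python
alpha = 0.05  # significance level, chosen before looking at any p-values

# Hypothetical p-values for illustration only (not the homework results)
toy_pvalues = {"term_a": 0.003, "term_b": 0.04, "term_c": 0.20}

# Reject the null hypothesis (coefficient = 0) only when p < alpha
decisions = {
    name: ("reject H0" if p < alpha else "fail to reject H0")
    for name, p in toy_pvalues.items()
}

for name, decision in decisions.items():
    print(f"{name}: p = {toy_pvalues[name]} -> {decision}")
```

Note that "fail to reject" is not the same as accepting the null hypothesis; it only means the data do not provide enough evidence against it at the chosen $\alpha$.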

**Q3.1)** Interpret the p-values for each of the seven terms in `results`. Determine whether there is a statistically significant relationship between each term and the response variable. You are free to choose your own $\alpha$ value, whatever you feel is appropriate. Fill in the ellipses below.

$\alpha$ = ...
<br> const: ...
<br> WRF+DOMINO: ...
<br> 800m\_Impervious_%: ...
<br> Elevation_truncated_km: ...
<br> 800m_MajorRoads_km: ...
<br> 100m_MinorRoads_km: ...
<br> Distance_to_coast_km: ...

Depending on your $\alpha$ level, some variables may or may not be statistically significant. Choosing a higher or lower $\alpha$ level as a result of seeing the p-values biases your conclusions, and is an example of *p-hacking*. In other words, unless you had a standard go-to $\alpha$ level *before* analyzing the p-values, you were p-hacking. It is best practice to pick an $\alpha$ level *before* seeing your results, to avoid this bias.
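To get a feel for the size of this bias, here is a minimal simulation sketch. It makes several assumptions that are not part of the homework: it uses a simple two-sided z-test on pure-noise data with known unit variance (not the t-tests that StatsModels reports), and it compares an analyst who fixes $\alpha = 0.05$ in advance with one who picks among 0.01, 0.05, and 0.10 after seeing each p-value. All variable names are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_obs = 2000, 30

# Each trial draws pure noise, so the null hypothesis (mean = 0) is true every time
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_obs))
z = samples.mean(axis=1) * math.sqrt(n_obs)

# Two-sided p-value of a z-test with known unit variance
pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

# Analyst 1 fixes alpha = 0.05 before seeing any result
fixed_rate = np.mean(pvals < 0.05)

# Analyst 2 sees the p-value first, then picks whichever of 0.01, 0.05, 0.10
# makes it "significant"; this is equivalent to always using alpha = 0.10
flexible_rate = np.mean(pvals < 0.10)

print("false positive rate, alpha fixed in advance:", fixed_rate)
print("false positive rate, alpha chosen afterwards:", flexible_rate)
```

Because every null hypothesis here is true by construction, every "significant" result is a false positive, and the analyst who chooses $\alpha$ afterwards reports more of them.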


In creating `results`, we added an extra column of ones so that the model could fit an intercept. Let's dig a little deeper into the `const` column.
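The role of the ones column can be seen on a small made-up example. The sketch below uses plain NumPy least squares instead of StatsModels, and the data and variable names are hypothetical; stacking a column of ones next to `x` is similar in spirit to what `sm.add_constant` does.

```python
import numpy as np

# Toy data (hypothetical): y = 4 + 0.5 * x exactly
x = np.arange(10, dtype=float)
y = 4.0 + 0.5 * x

# Design matrix with a leading column of ones
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("with ones column:", beta)  # first entry multiplies the ones, second multiplies x

# Without the ones column, the fitted line is forced through the origin
beta_no_const, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
print("without ones column:", beta_no_const)
```

Comparing the two fits shows what the coefficient on the ones column contributes to the model.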

**Q3.2)** What does a column of ones represent in terms of the data? Are the ones truly in your data? What does it mean for a column of ones to be statistically significant in your model?

Note: This question is supposed to make you think about the items you're testing for significance. Even though `const` has a very low p-value, is there a meaningful relationship between a column of ones and your response variable?

In [None]:
# your answer here

---

That's all for now! The latter half of this homework on model selection criteria will be sent over to you within the next day or so. 