# ER131 Final Project (replace this with your project title)
Fall 2020

In this cell, give an alphabetical (by last name) list of student group members.  Beside each student's name, provide a description of each student's contribution to the project.

**Longmate, Julia**: [contribution] <br>
**Murayama, Hikari**: [contribution] <br>
**Sims, Michelle**: [contribution] <br>
**Worsham, Marshall**: [contribution] <br>

## Basic Project Requirements (delete this markdown cell in your final submission)

**How to use this notebook**:  This notebook is the template for your semester project.  Each markdown cell provides instructions on what to do in order to complete a successful project.  The cell you're reading right now is the only one you can delete from what you eventually hand in.  For the other cells:
1. You may replace the instructions in each cell with your own work but do not edit the cell titles (with the exception of the project title, above).  
2. Follow the instructions in each section carefully.  For some sections you will enter only markdown text in the existing cells. For other sections, you'll accompany the markdown cells with additional code cells, and perhaps more markdown, before moving on to the next section.  

**Grading**.  You'll see point allocations listed in each of the section titles below.  In addition, there are other categories for points: 
1. Visualization (10 points).  Plots should be well organized, legible, labelled, and well-suited for the question they are being used to answer or explore.  
2. Clarity (5 points). Note that clarity also supports points elsewhere, because if we can't understand what you're explaining, we'll assume you didn't understand what you were doing and give points accordingly!  

For each Section or Category, we will give points according to the following percentage scale:
1. More than 90%:  work that is free of anything but superficial mistakes, and demonstrates creativity and / or a very deep understanding of what you are doing.
2. 80-90%: work without fundamental errors and demonstrates a basic understanding of what you're doing.
3. 60-80%: work with fundamental flaws in the analysis and / or conveys that you do not understand the basics of the work you are trying to do.
4. Below 60%: Work that is severely lacking or incomplete.  

Note that we distinguish *mistakes* from *"my idea didn't work"*.  Sometimes you don't know if you can actually do the thing you're trying to do and as you dig in you find that you can't.  That doesn't necessarily mean you made a mistake; it might just mean you needed more information.  We'll still give high marks to ambitious projects that "fail" at their stated objective, as long as that objective was clear and you demonstrate an understanding of what you were doing and why it didn't work.

**Number of prediction questions:**  The number of prediction questions must be greater than or equal to the number of students in the team minus one.  (A 4 person team would need to explore 4-1 = 3 questions.)  Questions should be related, but have distinct work efforts, interpretation and analysis. An example: for land use regression, you could have a core prediction question (what is pollution concentration on a fine spatial scale), a supporting question that explores how the degree of spatial aggregation influences prediction quality, plus a prediction model that explores *temporal* prediction at one point in space.  There is a lot of flexibility here; if you have any doubt about whether your questions are distinct, consult with the instructors.

**Data requirements**:  Projects must use data from a minimum of $1+N_s$ different sources, where $N_s$ is the number of students in the group.  You should merge at least two data sets.

**Advice on Project Topics**:  We want you to do a project that relates to energy and environment topics.  

**Suggested data sets**: If you choose not to work on a client project, here are some ideas for data starting points. You can definitely bring your own data to the table!
1. [Purple Air](https://www.purpleair.com) Instructions on how to download PurpleAir data are [here](https://docs.google.com/document/d/15ijz94dXJ-YAZLi9iZ_RaBwrZ4KtYeCy08goGBwnbCU/edit).
2. California Enviroscreen database.  Available [here](https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-30).
3. Several data sets available from the UC Irvine machine learning library:
    1. [Forest Fires](https://archive.ics.uci.edu/ml/datasets/Forest+Fires)
    2. [Climate](https://archive.ics.uci.edu/ml/datasets/Greenhouse+Gas+Observing+Network)
    3. [Ozone](https://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection)
4. California Solar Initiative data (installed rooftop solar systems).  Available [here](https://www.californiasolarstatistics.ca.gov/data_downloads/).
5. World Bank Open Data, available [here](https://data.worldbank.org).
6. California ISO monitored emissions data, [here](http://www.caiso.com/TodaysOutlook/Pages/Emissions.aspx).
7. Energy Information Administration Residential Energy Consumption Survey, [here](https://www.eia.gov/consumption/residential/data/2015/).

In [None]:
#Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from scipy import stats
import statsmodels.api as sm

## Abstract (5 points)
Although this section comes first, you'll write it last.  It should be a ~250 word summary of your project.  1/3rd of the abstract should provide background, 1/3rd should explain what you did, and 1/3rd should explain what you learned.

## Project Background (5 points)
In this section you will describe relevant background for your project.  It should give enough information that a non-expert can understand in detail the history and / or context of the system or setting you wish to study, the need for quantitative analysis, and, broadly, what impact a quantitative analysis could have on the system.  Shoot for 500 words here.

## Project Objective (5 points)
In this section you will pose the central objective or objectives for your semester project.  Objectives should be extremely clear, well-defined and clearly cast as forecasting problems.  

Some example questions: 
1. *"The purpose of this project is to train and evaluate different models to predict soil heavy metal contamination levels across the state of Louisiana, using a variety of features drawn from EPA, the US Census, and NAICS databases."* or
2. *"The purpose of this project is to train and evaluate different models to predict 1-minute generation from a UCSD solar PV site, up to 2 hours into the future, using historical data as well as basic weather forecast variables."* or
3. *"The purpose of this project is to forecast daily emergency room visits for cardiac problems in 4 major US cities, using a variety of features including air quality forecasts, weather forecasts and seasonal variables."*

You should reflect here on why it's important to answer these questions.  In most cases this will mean that you'll frame the answers to your questions as informing one or more *resource allocation* problems.  If you have done a good job of providing project background (in the cell above) then this reflection will be short and easy to write.

**Comment on novelty:** You may find it hard to identify a project question that has *never* been answered before.  It's ok if you take inspiration from existing analyses.  However, you shouldn't exactly reproduce someone else's analysis.  If you take inspiration from another analysis, you should still use different models, different data, and so on.

## Input Data Description (5 points)
Here you will provide an initial description of your data sets, including:
1. The origins of your data.  Where did you get the data?  How were the data collected from the original sources?
2. The structure, granularity, scope, temporality and faithfulness (SGSTF) of your data.  To discuss these attributes you should load the data into one or more data frames (so you'll start building code cells for the first time).  At a minimum, use some basic methods (`.head`, `.loc`, and so on) to provide support for the descriptions you provide for SGSTF. 

[Chapter 5](https://www.textbook.ds100.org/ch/05/eda_intro.html) of the DS100 textbook might be helpful for you in this section.
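
As a rough illustration of how basic methods can support an SGSTF description, here is a minimal sketch on a small made-up data frame. The column names (`site_id`, `date`, `ch4_ppb`) and all values are hypothetical and do not come from any of the project's data sources.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for one of the project's sources.
toy = pd.DataFrame({
    "site_id": ["A", "A", "B", "B", "C"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02",
                            "2020-01-01", "2020-01-02", "2020-01-01"]),
    "ch4_ppb": [1890.0, 1902.5, np.nan, 1875.0, 1910.2],
})

# Structure: dimensions and column types.
n_rows, n_cols = toy.shape
dtypes = toy.dtypes

# Granularity: is each row a unique (site, date) pair?
rows_are_unique = not toy.duplicated(subset=["site_id", "date"]).any()

# Scope and temporality: how many sites, and what date range is covered?
n_sites = toy["site_id"].nunique()
date_range = (toy["date"].min(), toy["date"].max())

# Faithfulness: count missing values per column.
missing = toy.isna().sum()

# .loc selects rows by label or boolean mask, e.g. all records for site "A".
site_a = toy.loc[toy["site_id"] == "A"]

print(toy.head())
```

Each of these quantities can be quoted directly in the written description, for example the number of sites for scope or the count of missing values for faithfulness.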

TO DO: INPUT DESCRIPTION + (Photos of?) BASIC METHODS `.head(), .loc` (where possible)

### Y variable

#### AVIRIS-NG mean methane concentration
[Marshall]

### Features

#### TROPOMI Satellite Measurements of CH4, O3, and SO2
[Hikari]

#### VISTA-CA Methane Emitters
[Marshall]

#### USGS National Land Cover Data
[Michelle]

#### US Energy Information Administration Natural Gas Pipeline
[Julia]

#### EPA Air Now Measurements of PM2.5, Ozone, and NOx
[Hikari]

#### USDA(?) Cow Density Data
[Julia]

## Data Cleaning (10 points)
In this section you will walk through the data cleaning and merging process.  Explain how you make decisions to clean and merge the data.  Explain how you convince yourself that the data don't contain problems that will limit your ability to produce a meaningful analysis from them.  

[Chapter 4](https://www.textbook.ds100.org/ch/04/cleaning_intro.html) of the DS100 textbook might be helpful to you in this section.  
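
One way to check a merge for silent problems is sketched below, assuming two hypothetical tables keyed on a made-up `site_id` column. The table names, column names, and values are illustrative only.

```python
import pandas as pd

# Hypothetical tables: methane readings per site and land-cover class per site.
readings = pd.DataFrame({"site_id": [1, 2, 3, 4],
                         "ch4": [1.9, 2.1, 1.8, 2.4]})
land = pd.DataFrame({"site_id": [1, 2, 3, 5],
                     "land_cover": ["crop", "urban", "crop", "forest"]})

# A left merge keeps every reading. indicator=True adds a `_merge` column that
# records whether each row found a match, and validate raises an error if the
# key is not unique on both sides.
merged = readings.merge(land, on="site_id", how="left",
                        indicator=True, validate="one_to_one")

# Rows that found no partner in the land-cover table.
unmatched = merged.loc[merged["_merge"] == "left_only", "site_id"].tolist()

# Missing values introduced by the merge.
n_missing_land = merged["land_cover"].isna().sum()
```

Reporting the unmatched keys and the number of missing values created by each merge is one concrete way to show that the merged data are fit for analysis.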

TO DO: INPUT DESCRIPTION OF DATA CLEANING + ANY CODE(? - maybe just merged dataframe at the end?)

#### AVIRIS-NG mean methane concentration
[Marshall]

#### TROPOMI Satellite Measurements of CH4, O3, and SO2
[Hikari]

#### VISTA-CA Emitter Locations 
[Marshall]

#### VISTA-CA Emitter Type
[Michelle]

#### USGS National Land Cover Data
[Michelle]

#### US Energy Information Administration Natural Gas Pipeline
[Julia]

#### EPA Air Now Measurements of PM2.5, Ozone, and NOx
[Hikari]

#### USDA(?) Cow Density Data
[Julia]

## Data Summary and Exploratory Data Analysis (10 points)

In this section you should provide a tour through some of the basic trends and patterns in your data.  This includes providing initial plots to summarize the data, such as box plots, histograms, trends over time, and scatter plots relating one variable to another.  

[Chapter 6](https://www.textbook.ds100.org/ch/06/viz_intro.html) of the DS100 textbook might be helpful for providing ideas for visualizations that describe your data.  

Ideas for visualizations to include:
1. Scatter plot relationship between each feature and our y variable
2. One or two summary visualizations for each feature?
3. Scatter plot relationship between specific variables
4...
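
The first idea in the list above could be wrapped in a small helper so that every feature is plotted against the response in the same way. This is a sketch only: the function name `scatter_features_vs_target`, the column names, and the synthetic data are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_features_vs_target(data, features, target):
    """Draw one labelled scatter panel per feature against the target column."""
    fig, axes = plt.subplots(1, len(features),
                             figsize=(4 * len(features), 3.5), squeeze=False)
    for ax, feature in zip(axes[0], features):
        ax.scatter(data[feature], data[target], s=10, alpha=0.6)
        ax.set_xlabel(feature)
        ax.set_ylabel(target)
        ax.set_title(f"{target} vs. {feature}")
    fig.tight_layout()
    return fig


# Synthetic data: feature_a is related to the target, feature_b is not.
rng = np.random.default_rng(15)
toy = pd.DataFrame({"feature_a": rng.normal(size=50),
                    "feature_b": rng.uniform(size=50)})
toy["target"] = 2.0 * toy["feature_a"] + rng.normal(scale=0.5, size=50)

fig = scatter_features_vs_target(toy, ["feature_a", "feature_b"], "target")
```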

## Forecasting and Prediction Modeling (25 points)

This section is where the rubber meets the road.  In it you must:
1. Explore at least 3 prediction modeling approaches for each prediction question, ranging from the simple (e.g. linear regression, KNN) to the complex (e.g. SVM, random forests, Lasso).  
2. Motivate all your modeling decisions.  This includes parameter choices (e.g., how many folds in k-fold cross validation, what time window you use for averaging your data) as well as model form (e.g., If you use regression trees, why?  If you include nonlinear features in a regression model, why?). 
3. Carefully describe your cross validation and model selection process.  You should partition your data into training and testing data sets.  The training data set is what you use for cross-validation (i.e. you sample from within it to create folds, etc.).  The testing data set is held to the very end of your efforts, and used to compare qualitatively different models (e.g. OLS vs random forests).
4. Very carefully document your workflow.  We will be reading a lot of projects, so we need you to explain each basic step in your analysis.  
5. Seek opportunities to write functions that allow you to avoid doing things over and over, and that make your code more succinct and readable. 
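
The cross-validation workflow described in the list above can be sketched as follows. The data are synthetic, and the model (KNN) and candidate values of `k` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for a real feature matrix and response.
rng = np.random.default_rng(15)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=200)

# 1. Set the test data aside before any model tuning.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=15)

# 2. Tune the hyperparameter with K-fold cross-validation on the training data only.
kf = KFold(n_splits=5, shuffle=True, random_state=15)
cv_mse = {}
for k in [1, 5, 15]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X_tr, y_tr,
                             cv=kf, scoring="neg_mean_squared_error")
    cv_mse[k] = -scores.mean()  # scores are negated MSE, so flip the sign
best_k = min(cv_mse, key=cv_mse.get)

# 3. Refit on all training data and touch the test data once, at the very end.
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final_model.predict(X_te))
```

The test MSE from step 3 is the number to use when comparing this model against a qualitatively different one.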

### Question 1: Predicting air quality measurements near methane emitter sources 
##### **To use as a feature for prediction questions 2 & 3**
We'd like to use air quality as a feature in our model; however, we don't have air quality measurements at the exact location of our emitter points. Therefore, we'll use K Nearest Neighbors to predict air quality at our emitter points, and use these predictions as a feature for our main prediction questions. 

#### Model 1: K Nearest Neighbors
First, we'll try K Nearest Neighbors (KNN) to predict the average air quality measurements for XX timeframe at each of our emitter points. 
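
A minimal sketch of this idea, assuming monitor and emitter locations are given as latitude/longitude pairs. All coordinates and readings below are made up, and the choices `n_neighbors=3` and `weights="distance"` are placeholders rather than tuned values. Plain Euclidean distance on degrees is only a rough approximation of ground distance, so projected coordinates may be preferable in practice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical monitor locations (latitude, longitude) and average readings.
monitor_coords = np.array([[37.80, -122.27],
                           [37.87, -122.27],
                           [38.58, -121.49],
                           [34.05, -118.24],
                           [36.74, -119.79]])
monitor_pm25 = np.array([9.5, 8.7, 11.2, 12.8, 15.1])

# Hypothetical emitter locations where no monitor exists.
emitter_coords = np.array([[37.83, -122.29],
                           [35.37, -119.02]])

# Each prediction is a distance-weighted average of the 3 nearest monitors.
knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn.fit(monitor_coords, monitor_pm25)
emitter_pm25 = knn.predict(emitter_coords)
```

Because each prediction is a weighted average of observed readings, it always falls between the smallest and largest monitor values, which is worth keeping in mind when interpreting the resulting feature.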

#### [Model 2: Land Use Regression?]

#### [Model 3:....]

### Question 2: Predicting ground-level methane emissions throughout California
TO DO: Explanation of approach. Why regression. 

#### Overview of our dataset
TO DO: Provide overview of all our features, explain anything that requires explanation if it hasn't already been justified in prior sections (buffers, averaging time/space for features, timeframe used, etc.)

NOTE: How do we want to handle trying out different versions of our features, such as nonlinear versions of our features, etc... if we want to delve into this at all? Maybe we wait until we do some exploratory analysis on the data to see if there are any clearly nonlinear relationships?

In [None]:
#First, we'll look at the dataframe that contains the data we'll use to create our model.
df.head()

#### Step 1: Train/Test Split
Before delving into our models, we'll create a train/test split of our data. We'll set aside 20% of our observations as test data, and use the remaining 80% as training data. After we fit each of our models, we will use our test data to evaluate which of our models performs the best. 

In [None]:
#We'll need to create a dataframe of our X variables (features), and our y variable
X = ...
Y = ...

In [None]:
#Create train/test split 
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 15)

We'll also want a **standardized** version of our features for the Ridge and Lasso models. Standardizing is important in Ridge and Lasso because the regularization term in these models penalizes large coefficients; therefore, in order to ensure that no feature dominates the solution because of its range of values, we standardize them. We'll standardize our features below, and create another train/test split (with the same `random_state`, to ensure that the train/test split indexes in the same way as our train/test split above).

In [None]:
#Standardize features (fit the scaler on X, then transform X to zero mean and unit variance)
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)

In [None]:
#Create train/test split with standardized features
X_std_train, X_std_test, y_std_train, y_std_test = train_test_split(X_std, Y, test_size = 0.2, random_state = 15)

We will use `X_train` and `y_train` to fit each of our models in the sections below, and `X_test` and `y_test` to compare between models once we have fit the parameters and hyperparameters of each model. 
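
One variant worth considering, offered as an assumption rather than the approach used in the cells above: split first and fit the scaler on the training rows only, so that the test rows do not influence the means and standard deviations used for standardization. The sketch below uses synthetic data, and the names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(15)
X_demo = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(100, 2))
y_demo = rng.normal(size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=15)

# ... then learn the scaling parameters from the training rows only,
# and apply the same transformation to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which mirrors how new data would be treated.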

#### Model 1: Ordinary Least Squares Regression
To begin, we'll use Ordinary Least Squares Linear regression.

We'll start by using `statsmodels` to fit the model on our training data, as this package provides more detailed statistical information about linear regression models. This will allow us to examine statistics such as R-squared and the P-value for each of our coefficients. 

We'll then run the model in `scikit-learn` on both our testing and training data. 

In [None]:
#NOTE: MAKE SURE ONE DUMMY VARIABLE FOR EACH SET IS DROPPED, OTHERWISE MODEL RESULTS BETWEEN statsmodels AND 
#scikit-learn WILL NOT BE THE SAME. 

#For statsmodels, we need to add a column of ones to X so that it can fit an intercept. We'll add this column to 
# our training data to run the model. 
X_stats = sm.add_constant(X_train)

#Fit OLS regression
sm_model = sm.OLS(y_train, X_stats)
results = sm_model.fit()
results.summary()

Discussion of results.

Now, we'll fit the model in `scikit-learn`, and evaluate how the model performs on the testing data.

In [None]:
#Create a function that fits a linear regression model and returns the training and test MSE. 
def OLS(X_train, y_train, X_test, y_test):
    """ Fits an Ordinary Least Squares Linear regression on the training set of X and y,
    and finds the MSE of the training and test set. 
    Arguments:
        X_train: An ndarray containing the set of features used to train the model. 
        y_train: A list/array containing the set of response variable observations used to train the model.
        X_test: An ndarray containing the set of features used to test the model.
        y_test: A list/array containing the set of response variable observations used to test the model. 
    Returns:
        train_mse: The MSE for the training data
        test_mse: The MSE for the test data
        """
    #Fit model
    lm = LinearRegression()
    lm.fit(X_train, y_train)
    
    #Get training MSE
    train_mse = mean_squared_error(y_train, lm.predict(X_train))
    
    #Get test MSE
    test_mse = mean_squared_error(y_test, lm.predict(X_test))
    
    return train_mse, test_mse

Now, we can utilize our function to fit the model using Ordinary Least Squares Linear regression. 

In [None]:
ols_train_mse, ols_test_mse = OLS(X_train, y_train, X_test, y_test)
print("Training MSE: ", ols_train_mse)
print("Test MSE: ", ols_test_mse)

Discussion. 

NOTE: Should we check that coefficients between `statsmodels` and `scikit-learn` are the same? Do we want to look at AIC at all? How to evaluate which features to use, or should we just use them all and then use ridge and lasso to evaluate our features?

#### Model 2: Ridge Regression
Since we have a large number of features in our model, we'll want to try regularization methods; namely Ridge and Lasso Regression. These methods are computationally faster than subset selection, and can help reduce model variance associated with a large number of features.

TODO: Add additional comments that might justify use of Ridge after seeing results from standard OLS. 


First, we'll try Ridge regression. Ridge will shrink our coefficients toward zero, without ever fully eliminating them. 

We'll begin by creating a function that fits the model, including tuning the hyperparameters, using K-fold cross validation. Considering the large size of our dataset, K-fold cross-validation will save a significant amount of computing time, as opposed to Leave One Out Cross Validation. 

In the case of Ridge and Lasso, the only hyperparameter we need to tune is our shrinkage penalty, which we'll denote here as `alpha`. 
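
For reference, the objectives that scikit-learn documents for its `Ridge` and `Lasso` estimators can be written as follows, with $n$ observations, feature matrix $X$, response $y$, coefficient vector $\beta$, and shrinkage penalty $\alpha$. The symbols are notation introduced here, not taken from the cells above.

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2
$$

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
$$

Because the squared-error term is scaled by $1/(2n)$ in the Lasso objective but not in the Ridge objective, the same value of `alpha` does not correspond to the same amount of shrinkage in the two models. This is one reason to search a separate grid of values for each.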

In [None]:
#Define model for Ridge and Lasso Cross Validation
def fit_model_cv(Model, X_train, y_train, X_test, y_test, kf, alphas):
    """Fits a Ridge or Lasso model with K-fold cross-validation on the training set of X and y, 
    and finds the MSE of the training and test set. 
    Arguments:
        Model: The type of Model to use, RidgeCV or LassoCV.
        X_train: An ndarray containing the set of features used to train the model.
        y_train: A list/array containing the set of response variable observations used to train the model. 
        X_test: An ndarray containing the set of features used to test the model. 
        y_test: A list/array containing the set of response variable observations used to test the model. 
        kf: a KFold cross-validation selector object.
            [Note: This should have n_splits, shuffle, and random_state specified]. 
        alphas: a list of alpha values to test during the cross-validation process
    Returns:
        train_mse: the MSE for the training data
        test_mse: the MSE for the test data
        opt_alpha: the optimal alpha value"""
    
    #Fit model
    modelcv = Model(cv = kf, alphas = alphas)
    modelcv.fit(X_train, y_train)
    
    #Get optimal alpha value
    opt_alpha = modelcv.alpha_
    
    #Get training MSE
    train_mse = mean_squared_error(y_train, modelcv.predict(X_train))
    
    #Get test MSE
    test_mse = mean_squared_error(y_test, modelcv.predict(X_test))

    return train_mse, test_mse, opt_alpha    

We'll define a KFold cross validation selector object, as well as a list of alphas to test during the cross validation process. 

For our model, we'll use 5/10 folds (EXPLAIN WHY). 
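
A common way to build a candidate grid of shrinkage penalties is to space values evenly on a log scale. The sketch below is illustrative only: the grid bounds, the number of folds, and the `demo_` names are hypothetical and are not the values chosen for this project. It also shows a quick check for whether the selected value sits on the edge of the grid, which would suggest widening the search.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# Synthetic data with two features that have no effect on the response.
rng = np.random.default_rng(15)
X_demo = rng.normal(size=(150, 5))
true_coefs = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=150)

# Hypothetical grid: 13 values from 0.001 to 1000, evenly spaced on a log scale.
demo_alphas = np.logspace(-3, 3, 13)
demo_kf = KFold(n_splits=5, shuffle=True, random_state=15)

# RidgeCV evaluates every alpha with the supplied folds and keeps the best one.
ridge_cv = RidgeCV(alphas=demo_alphas, cv=demo_kf).fit(X_demo, y_demo)

# If the best alpha is the smallest or largest candidate, the grid may be too narrow.
at_edge = bool(np.isclose(ridge_cv.alpha_, demo_alphas[0])
               or np.isclose(ridge_cv.alpha_, demo_alphas[-1]))
print("Selected alpha:", ridge_cv.alpha_, "| on grid edge:", at_edge)
```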

In [None]:
#Define K-Fold cross-validation object
kf = KFold(n_splits = 10, shuffle = True, random_state = 15)

#Define list of alpha values 
ridge_alphas = ...

Now, we can utilize our function to find the optimal model using Ridge Regression. 

In [None]:
#Fit model
r_train_mse, r_test_mse, r_alpha = fit_model_cv(RidgeCV, 
                                                X_std_train, 
                                                y_std_train, 
                                                X_std_test, 
                                                y_std_test, 
                                                kf, 
                                                ridge_alphas)
print("Training MSE: ", r_train_mse)
print("Test MSE: ", r_test_mse)
print("Optimal Alpha Value: ", r_alpha)

Discussion. 

#### Model 3: Lasso Regression
Lasso, similar to Ridge, will shrink our coefficients; however, Lasso has the benefit of also performing subset selection, since some coefficients can be shrunk to zero. We'll try this model next, and see if we can achieve a better performance. 
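
The difference between the two penalties can be seen on synthetic data in which only two of eight features affect the response. The data, the `alpha` values, and the `_demo` names below are arbitrary choices for illustration, not tuned or project-specific values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic standardized-like features; only the first two matter.
rng = np.random.default_rng(15)
X_demo = rng.normal(size=(200, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=200)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Ridge shrinks coefficients but leaves them nonzero;
# Lasso can set some of them to exactly zero.
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0.0))

print("Coefficients set to zero by Ridge:", n_zero_ridge)
print("Coefficients set to zero by Lasso:", n_zero_lasso)
```

The features whose Lasso coefficients are exactly zero are the ones the model has effectively dropped, which is what makes Lasso usable as a feature selection step.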

TODO: Add additional comments that might justify use of Lasso after seeing results from standard OLS and/or Ridge.

For our Lasso model, we'll use the same K-Fold cross validation object. However, we'll define a new set of alpha values, since optimal Lasso alpha values tend to have a smaller range than Ridge alpha values. (IS THIS ALWAYS TRUE?)

In [None]:
#Define lasso alpha values
lasso_alphas = ...

In [None]:
#Fit model
l_train_mse, l_test_mse, l_alpha = fit_model_cv(LassoCV, 
                                                X_std_train, 
                                                y_std_train, 
                                                X_std_test, 
                                                y_std_test, 
                                                kf, 
                                                lasso_alphas)
print("Training MSE: ", l_train_mse)
print("Test MSE: ", l_test_mse)
print("Optimal Alpha Value: ", l_alpha)

Discussion. 

#### Model Comparison: OLS, Ridge, and Lasso

In [None]:
#Print OLS
print("OLS Test MSE: ", ols_test_mse)

#Print Ridge Values
print("Ridge Test MSE: ", r_test_mse)

#Print Lasso Values
print("Lasso Test MSE: ", l_test_mse)

In [None]:
#Add any visuals?

Discussion. 
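
One possible way to present the comparison is a single table sorted by test MSE. The helper name `compare_models` is hypothetical, and the numbers passed to it below are placeholders chosen only to make the sketch run. They are not results from this project.

```python
import pandas as pd


def compare_models(results):
    """Build a table of train/test MSE per model, sorted by test MSE (best first).

    results: dict mapping a model name to a (train_mse, test_mse) pair.
    """
    table = pd.DataFrame.from_dict(results, orient="index",
                                   columns=["train_mse", "test_mse"])
    # A large gap between test and training error can be a sign of overfitting.
    table["test_minus_train"] = table["test_mse"] - table["train_mse"]
    return table.sort_values("test_mse")


# Placeholder numbers, NOT project results.
placeholder_results = {
    "OLS": (1.00, 1.40),
    "Ridge": (1.05, 1.30),
    "Lasso": (1.10, 1.25),
}
comparison = compare_models(placeholder_results)
print(comparison)
```

In the notebook itself, the dictionary would be filled with the values returned by the model-fitting functions, for example `(ols_train_mse, ols_test_mse)` for OLS.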

### Question 3:

## Interpretation and Conclusions (20 points)
In this section you must relate your modeling and forecasting results to your original prediction question.  You must:
1. Address a resource allocation question.  What do the answers mean? What advice would you give a decision maker on the basis of your results?  How might they allocate their resources differently with the results of your model?  Why should the reader care about your results?
2. Discuss caveats and / or reasons your results might be flawed.  No model is perfect, and understanding a model's imperfections is extremely important for the purpose of knowing how to interpret your results.  Often, we know the model output is wrong but we can assign a direction for its bias.  This helps to understand whether or not your answers are conservative.  

Shoot for 500-1000 words for this section.