Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

---

## Lab 6: Multiple Regression and Geopandas (Part 2)
**This lab was distributed the week of October 5th and should be completed by Tuesday, 10/13/2020 at 11:59PM.**

-------------------------------------------

Welcome to your sixth lab of the semester!<br>

This lab continues to build on the spatial analysis and modeling skills we have been developing in previous assignments. Specifically, we will use Geopandas and the `statsmodels` library to try to predict the area burned by large wildfires in the Sierra Nevada region of California. 

Feel free to refer to Lab 3 for the basic Geopandas methods we learned a few weeks ago, and to Lab 5 for linear regression (single variable) basics. 

## Setup & Review

Let's begin by importing the packages we'll need.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd

%matplotlib inline

The first dataset we will examine is fire perimeter data from the [Monitoring Trends in Burn Severity (MTBS)](https://www.mtbs.gov/project-overview) database. The data are stored as shapefiles in the `data/mtbs_ca` folder. To reduce the file size, we pre-processed the original, nationwide data to only include data for the Sierra Nevada region (as defined by [the Sierra Nevada Conservancy boundary](https://gis.data.ca.gov/datasets/f147fdc76a104484b9fa90baacf9462f_0?geometry=-133.799%2C35.544%2C-106.047%2C41.552)). The raw MTBS data includes information about prescribed fires and wildfires; in this lab, we have filtered out all fire types except for wildfires. 

**Question 1:** Import the shapefile as a GeoDataFrame named `sn_wildfires`. Print the first few rows. 

In [None]:
# YOUR CODE HERE

**Question 2:** Let's do a very light EDA of the `sn_wildfires` dataframe, focusing on granularity and scope. Using the project information linked above and your own exploration of the dataframe, answer the questions below: <br>
a. What is the temporal extent of the data? *(scope)*<br>
b. What sizes of fires are included in MTBS? What land ownership types? *(scope)*<br>
c. What is the temporal resolution of the data? *(granularity)*<br>
d. How many records are there, and what does each record represent? *(granularity)*<br>
e. What is the coordinate reference system (CRS) of the data? *(structure)* <br>
f. What type of geometries are included in the dataframe?

In [None]:
#scratch work here


*YOUR ANSWER HERE*

**Question 3:** Print the geometry for one Polygon and one MultiPolygon of your choice in the dataframe. Your answer should be a rendering of each object.

In [None]:
# Print the geometry of a Polygon of your choice. 


In [None]:
# Print the geometry of a MultiPolygon of your choice.


## More handy Geopandas operations

Geopandas provides a veritable treasure trove of [methods and attributes for GeoSeries](https://geopandas.org/reference.html). As a reminder, in a GeoDataFrame, the GeoSeries is the column that holds the geometries; that column is often, but not always, named `geometry`. Let's try out a few of these methods on our `sn_wildfires` data.

In our `sn_wildfires` dataframe, each geometry represents the perimeter of the area burned by a wildfire incident. We can use Geopandas operations to explore different properties of these geometries. 

For example, we might want to know the **centroid** of each burned area:

In [None]:
sn_wildfires.geometry.centroid.head()

# equivalently, we could have called sn_wildfires['geometry'].centroid

We can also obtain the **area**... 

In [None]:
sn_wildfires.geometry.area.head()

... and the **perimeter** of each Polygon:

In [None]:
sn_wildfires.geometry.length.head()

In this case, our dataframe already included an area metric, specifically, the `Acres` column. 
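
To make the three attributes above concrete, here is a minimal sketch on a toy geometry rather than the wildfire data. The rectangle and the names `rect` and `toy` are hypothetical and exist only for illustration; no CRS is set, so the results are in arbitrary coordinate units.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A hypothetical 4 x 3 rectangle in an arbitrary planar coordinate system (no CRS set).
rect = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])
toy = gpd.GeoSeries([rect])

toy_centroid = toy.centroid.iloc[0]   # Point at the rectangle's centre, (2, 1.5)
toy_area = toy.area.iloc[0]           # 4 * 3 = 12 square coordinate units
toy_perimeter = toy.length.iloc[0]    # 2 * (4 + 3) = 14 coordinate units

print(toy_centroid, toy_area, toy_perimeter)
```

Note that `.length` returns the perimeter for polygons and the line length for linestrings.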

**Question 4:** Your centroid, perimeter, and area calls probably returned the following warning:  
`Geometry is in a geographic CRS. Results from 'length' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.`

Why does Python give you this warning? What are the units of area and length returned by the Geopandas built-in methods? *Hint:* it might be helpful to [look up the CRS](https://epsg.org/home.html) for this dataset. 

*YOUR ANSWER HERE*
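
The re-projection that the warning suggests can be sketched on a toy geometry. Everything here is an assumption made for illustration: the box coordinates are made up, and EPSG:3310 (NAD83 / California Albers) is just one projected CRS that covers California, not necessarily the one this lab intends you to use.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical 0.1 x 0.1 degree box in longitude/latitude (EPSG:4326).
toy_geo = gpd.GeoSeries([box(-120.1, 38.0, -120.0, 38.1)], crs="EPSG:4326")

# Re-project to a projected CRS. EPSG:3310 is an illustrative choice only.
toy_proj = toy_geo.to_crs(epsg=3310)

# After re-projection, area is expressed in the projected CRS's linear units, squared.
area_km2 = toy_proj.area.iloc[0] / 1e6
print(toy_proj.crs.is_projected, round(area_km2, 1))
```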

In this lab, we will try to predict the area burned by a wildfire, with `Acres` as our response variable and start day and (relative) distance to the nearest highway as independent variables. For the latter, we will need [data on the locations of primary roads (i.e., interstates and highways) in the U.S.](https://catalog.data.gov/dataset/tiger-line-shapefile-2016-nation-u-s-primary-roads-national-shapefile/resource/d7153734-1bce-4cb6-9882-466ecf897b65) 

**Question 5:** Open the shapefile in `data/tl_2016_us_primaryroads/` as a GeoDataFrame named `roads`. Does the dataset need to be transformed to a different CRS? 

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

**Question 6:** How many records are in `sn_roads`, and what do they represent? What type of geometry are the objects in the `sn_roads` dataframe?

In [None]:
# SCRATCH WORK HERE

*YOUR ANSWER HERE*

**Question 7:** Use Geopandas operations to find the length of each road in `sn_roads`.

In [None]:
# YOUR CODE HERE

Geopandas can also calculate the distance between geometries. The code below finds the shortest distance between each road in `sn_roads` and the centroid of the first wildfire listed in `sn_wildfires`. The result is a Series with one distance per road.

In [None]:
dsts = sn_roads.distance(sn_wildfires.centroid.loc[0])
dsts
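
As a rough illustration of what `.distance()` returns, here is a sketch with two made-up lines and one made-up point in an arbitrary planar coordinate system. The names `toy_roads`, `toy_point`, and `toy_dsts` are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Two hypothetical "roads" in an arbitrary planar coordinate system.
toy_roads = gpd.GeoSeries([
    LineString([(0, 0), (10, 0)]),   # runs along the x-axis
    LineString([(0, 5), (10, 5)]),   # parallel line at y = 5
])
toy_point = Point(3, 2)

# One value per line: the shortest straight-line distance from the point to that line.
toy_dsts = toy_roads.distance(toy_point)
print(toy_dsts.tolist())  # [2.0, 3.0]
```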

**Question 8:** Write a function that takes in a single point ("point") and a series of linestrings ("lines") and returns the distance between that point and the nearest line.

In [None]:
def min_distance(point, lines):
    return ... # YOUR CODE HERE

In [None]:
assert min_distance(sn_wildfires.centroid[0], sn_roads) == dsts.min()

**Question 9:** Using your `min_distance` function, add a new column to `sn_wildfires` in which each element is the distance between the centroid of the burned area and the nearest major road in `sn_roads`. Name this column `dist_to_rd`.

In [None]:
# YOUR CODE HERE


In [None]:
sn_wildfires.head()

## Multi-Variable Regression

In addition to distance to the nearest highway, we want to use the day of the year on which the fire starts as an independent variable. To facilitate this analysis, let's make a new column in our `sn_wildfires` dataframe that combines the information from the `Year`, `StartMonth`, and `StartDay` columns into a Pandas datetime object.

In [None]:
sn_wildfires['date'] = pd.to_datetime({'year':sn_wildfires['Year'],
                      'month':sn_wildfires['StartMonth'],
                      'day':sn_wildfires['StartDay']})
sn_wildfires.head()
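
The same pattern can be tried on a tiny, made-up table to see how the three integer columns are assembled into dates. The dataframe `toy_fires` and its values are hypothetical; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical records with the date split across three integer columns.
toy_fires = pd.DataFrame({"Year": [2013, 2018],
                          "StartMonth": [8, 11],
                          "StartDay": [17, 8]})

# pd.to_datetime accepts a mapping with 'year', 'month', and 'day' keys.
toy_fires["date"] = pd.to_datetime({"year": toy_fires["Year"],
                                    "month": toy_fires["StartMonth"],
                                    "day": toy_fires["StartDay"]})

toy_dates = toy_fires["date"].dt.strftime("%Y-%m-%d").tolist()
print(toy_dates)  # ['2013-08-17', '2018-11-08']
```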

**Question 10:** Add a column called `day_of_year` to `sn_wildfires`. Each element of this column will be an integer between 1 and 366 representing the day of the year (366 occurs only in leap years). *Hint:* You've done this before! Refer to Homework 3. 

In [None]:
# YOUR CODE HERE

At this point, we've prepared our data to fit a regression model. Before we do so, let's visualize the data and qualitatively try to identify any patterns or trends.

**Question 11:** Create a pair of scatter plots showing the relationship between `Acres` burned (the target variable, represented on the y-axis), and each of the independent variables (`day_of_year` and `dist_to_rd`, represented on the x-axes). Do you notice any trends? What happens if you log-transform the y-axis?

In [None]:
#YOUR CODE HERE


*YOUR OBSERVATION HERE*


We are ready at last to create our linear regression model, using two features (start day and distance to nearest highway) to predict acres burned. 

This time, instead of `scikit-learn`, we'll use a library called `statsmodels`. One nice feature of `statsmodels` is its clean, informative summary of regression results and statistics.

In [None]:
# Run this cell to import the statsmodels library
import statsmodels.api as sm

Estimating a model with `statsmodels` uses a similar process to model estimation in `scikit-learn`. We first initialize a model, in this case using `sm.OLS()`, which takes **y** and **X** (in that order, in dataframe form) as arguments. We then `.fit()` the model and can view information about the coefficients and model performance using `.summary()`. 

**Question 12:** Create a dataframe **X**, which holds our two independent variables, each as a column of observations. In addition, create a dataframe **y** that holds the response variable.

In [None]:
# YOUR CODE HERE

Unlike `scikit-learn`, `statsmodels` expects a column of 1's in the **X** dataframe in order to fit an intercept. One way to achieve this is to apply the built-in `add_constant` function from `statsmodels` to your dataframe of **X** values.

In [None]:
# run this cell
X_const = sm.add_constant(X)
X_const.head()

Run the cell below to fit a model to **X** and **y** and view the results.

In [None]:
# Run this cell
sn_wf_model = sm.OLS(y,X_const)
sn_wf_results = sn_wf_model.fit()
sn_wf_results.summary()

**Question 13:** What are the values and 95% confidence intervals of the three coefficients? What do the confidence intervals imply about the model we've built?

*YOUR ANSWER HERE* 

## Feature Engineering

Let's try to improve our model by adding more features. Instead of using new sources of data, we will transform the two independent variables we already have and add these transformations as additional features. This process is known as "feature engineering."

**Question 14:** To make it easy to test different sets of features, write a function `fit_OLS` that takes in a dataframe containing the independent variables ($X$) and another dataframe containing the response variable ($y$). The function should fit a linear regression model and output the `statsmodels` summary for the model. Feel free to use the code in the previous section as a template. Test your function on the $X$ and $y$ dataframes you created in Question 12.

In [None]:
def fit_OLS(X,y):
    # YOUR CODE HERE
    ...
    return ...

In [None]:
fit_OLS(X,y)

The first new feature we will add to our model is the natural log of the `dist_to_rd` variable. The code block below provides an approach to expanding our $X$ dataframe to include this new feature.

In [None]:
X['log_dist'] = np.log(sn_wildfires['dist_to_rd'])
X.head()

**Question 15:** Add two more features to $X$: a) the `day_of_year` variable squared, and b) $e^f$, where $f$ is `dist_to_rd`/`day_of_year`.   

*Note:* These new features don't necessarily have any intuitive meaning in the real world. We're just experimenting to see if we can come up with some new transformations that improve our model's performance. Since we are focusing on prediction and not inference, we don't have to understand the physical reasons why a particular transformation might work or not.

In [None]:
# YOUR CODE HERE
...
X.head()

**Question 16:** Use your `fit_OLS` function to estimate a model for your expanded $X$ feature set. Did the addition of the transformed features improve the model? *Hint:* compare the AIC value for the model you estimated in the previous section to this one.

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

**Question 17:** Take a stab at engineering at least one new feature of your own, using different transformations and/or combinations of the features in the $X$ dataframe. Fit your model and view the results. Did your new feature(s) improve the model?

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

**Question 18:** Besides feature engineering, what else might you be able to do to build a model that better predicts burned area in the Sierra Nevada?

*YOUR ANSWER HERE*

# Hooray, you're done! 

Please remember to submit your lab work, after clicking Kernel -> Restart & Run All, in .html and .ipynb format on bCourses.