In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list names of anyone you worked with on this homework.

# [ER 131] Homework 5: K-nearest neighbors and single linear regression
<br>

## Table of Contents
1 - [Project](#section1)<br>
2 - [Set-Up](#section2)<br>
3 - [K-Nearest Neighbors](#section3)<br>
4 - [Single Linear Regression with scikit-learn](#section4)<br>

---

In this homework, you will create your first models to predict air quality data using the K-nearest neighbors and regression methods. But first, you'll spend some more time preparing your final project.

### Topics Covered
- Continue getting comfortable working with new data, and continue to practice working with tools that help manage and summarize large data sets.
- Understand how KNN works and make some cool maps in the process.
- Learn how to use the simple single linear regression tool in scikit-learn.
- Analyze spatial distribution of annual changes in pollutant concentration.

---


## Section 1: Project <a id='section1'></a>

Last week, you provided your initial ideas for the project. Refer back to that homework and to the project-related discussions and thinking you have engaged in over the past week.<br>

**Question 1.1** What was challenging about answering the project-related questions in HW 4 (defining a prediction problem and listing relevant datasets)? If you've been able to work through those challenges, how have you done so? <br> 

*Note*: Try to be as descriptive as possible here! For example, instead of saying "finding data was hard", you can say "I wanted to find non-US data for drinking water quality and it has been challenging to locate a dataset." You can discuss conceptual challenges (e.g., figuring out if a question is phrased as a prediction problem) or practical challenges (e.g., figuring out a time to meet as a group). This question is mainly here so that we can organize the lab time and the resources we provide in a helpful way.

*YOUR ANSWER HERE*

**Question 1.2** Who will you be working with on the project? Enter first + last names of all your group members below.

*YOUR ANSWER HERE*

**Question 1.3** What prediction questions will your group be exploring? As in HW 4, your answer can be preliminary - it's ok (and expected) if it changes and becomes more refined throughout the next few weeks. Write down at least two.  Make it clear that you're posing prediction problems, not inference problems. 

*YOUR ANSWER HERE*

**Question 1.4** In a few sentences, explain the *motivation* behind your prediction problem. Who would be interested in seeing the results of your prediction model? Why is it important to answer this question?

*YOUR ANSWER HERE*

## Section 2: Set-Up <a id='section2'></a>

**Important note**: You'll notice in the dependencies code block that there's a section that we want you to comment out in the final version, and a section that we want you to uncomment. Make sure to uncomment everything in `# uncomment this for final version` and comment everything in `# comment this out for final version` - it ensures that one of the plots you'll be outputting will show up properly in the .html file you submit. We'll remind you at the end of the homework, too!

---


**Dependencies:**

In [None]:
# Run this cell to install these packages
! pip install scikit-learn
! pip install plotly
! pip install mapbox

In [None]:
# Run this cell to set up your notebook
import requests
from pathlib import Path
import zipfile
import os
import csv
import pandas as pd
import numpy as np
from numpy.linalg import inv

pd.set_option('display.max_columns', None)

import utils
from utils import run_plotly

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

# uncomment this for final version
# import plotly.offline as py
# py.init_notebook_mode(connected=False)
# import plotly.graph_objs as go
# uncomment this for final version

# comment this out for final version
import plotly
import plotly.graph_objs as go
# comment this out for final version

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Section 3: K-Nearest Neighbors  <a id='section3'></a>



K-nearest neighbors is covered in Introduction to Statistical Learning (ISLR): KNN for classification is described in Section 2.2.3 and KNN for regression in Section 3.5. In this homework we're going to use KNN for quantitative spatial forecasting, meaning we'll predict a numeric value for a location in space based on the average of the K-nearest points in space for which we have data.

We'll use the EPA air pollution measurements again and create a map of predicted PM2.5 concentrations in locations throughout California. Let's run a KNN algorithm on the [hourly EPA PM2.5 data](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files) that we used in Homework 2. This time, we've gone ahead and created a reduced dataset that contains only the hourly data from California on Nov 15, 2018, which you may remember was one of the worst air quality days associated with the Camp Fire. 

There are a lot of steps to calculating the K-nearest neighbors, but you'll get to produce a really cool plot at the end!<br>

In [None]:
# Run the following cell
nov15 = pd.read_csv('data/pm25_nov15.csv', low_memory=False)
nov15.head()

In addition, we've gathered a dataset containing the latitude and longitude coordinates of every major city and town in the state of California, stored as `ca_cities_towns.csv`. These are the locations at which we will run our algorithm to predict PM2.5 concentrations.

In [None]:
# Run the following cell
ca_locations = pd.read_csv('data/ca_cities_towns.csv', low_memory=False)
ca_locations.head()

For our purposes, nearest neighbor proximity will be based on spatial distance. We will find a given location's *K* geographically nearest neighbors in the EPA dataset, and then we will use average PM2.5 concentration at these neighbor points as the forecast for that location. This simple but effective algorithm should allow us in the end to create a map of California on which we can color locations based on their observed and predicted PM2.5 concentrations.<br>

Before we jump into writing the KNN algorithm, we'll do a quick review of what KNN is (the Module 5 asynchronous content and section 2.2 of ISLR are also helpful resources here). KNN estimates the value at a point by taking an average of the K nearest values to that point (so if K = 2, then the algorithm estimates the value at the point by taking the average value of the 2 nearest points). Mathematically, this looks like:

$$\hat{y}_j=\frac{1}{K}\sum_{i \in N_j}y_i$$<br>

In the formula above, we're trying to predict the value of $y$ at position *j*. $N_j$ is the set of $K$ points closest to *j*. The formula sums the $y$ values at all of the points within the set $N_j$, and then divides by $K$ to get an average.
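As a concrete illustration of the formula, here is a minimal sketch on made-up numbers (the distances and values below are hypothetical, not taken from the EPA data). With $K = 2$, the prediction is the mean of the two $y$ values whose distances to *j* are smallest.

```python
import numpy as np

# Hypothetical toy data: distances from point j to five measurement sites,
# and the y value observed at each site.
distances = np.array([4.0, 1.0, 3.0, 0.5, 2.0])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
K = 2

# N_j: indices of the K sites closest to j
nearest = np.argsort(distances)[:K]

# y_hat_j: average of y over the sites in N_j
y_hat = y[nearest].mean()
print(nearest, y_hat)
```

Here the two closest sites are the ones at distances 0.5 and 1.0, so the estimate is the average of 40 and 20.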

### Writing the KNN Algorithm

Our KNN algorithm will estimate the PM2.5 concentration at each location by hour. This means that for each call to our algorithm, we will need to go through our EPA dataset and select only the data that correspond to a given hour.

In [None]:
# Run to see the recorded hours
np.unique(nov15['Time Local'])

<b>Question 3.1:</b> Write a `get_hour_data()` function that:
- Takes an hour parameter passed in as a string and returns a dataframe containing only data from `nov15` that was recorded during that hour;
- Adds a new column to the others included in the original dataframe. The name of the new column should be 'Source', and it should be populated with the string value 'Observation' for every row of the dataframe. 
- Renames the 'Sample Measurement' column 'Value'. 

(The latter two manipulations will be useful later on when we merge observed data from `nov15` with predicted data for `ca_locations`.)

*Hint*: `np.repeat('a', 3)` returns `['a', 'a', 'a']`.<br>
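If the pandas idioms involved are unfamiliar, the sketch below shows them on a tiny made-up dataframe. The column names `Time`, `Reading`, `Label`, and `Amount` are hypothetical and chosen only for illustration; they are not the columns of `nov15`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names here are for illustration only.
toy = pd.DataFrame({
    'Time': ['01:00', '02:00', '01:00'],
    'Reading': [3.5, 4.0, 5.5],
})

# A boolean mask keeps only the rows matching one time stamp;
# .copy() avoids modifying a view of the original frame.
subset = toy[toy['Time'] == '01:00'].copy()

# Add a constant-valued column, one entry per row
subset['Label'] = np.repeat('A', len(subset))

# Rename a column; rename returns a new dataframe
subset = subset.rename(columns={'Reading': 'Amount'})
print(subset)
```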

In [None]:
def get_hour_data(hour):
    '''Takes in a string indicating the hour (using a 24-hour clock) and returns the nov15 dataframe filtered
    to include only observations from that hour on November 15th. Adds a "Source" column and assigns "Observation" 
    as its value for all observations. Renames the "Sample Measurement" column "Value."'''
    
    #YOUR CODE HERE
        
    return ...

In [None]:
# run this cell, do not change it
get_hour_data('18:00').head()

In [None]:
# run this cell, do not change it
assert get_hour_data('18:00').shape == (76, 25)

A downside to KNN is that it can be particularly slow. If we are working with a large dataset, we will have to iterate many times over to find the K-nearest neighbors and thus our computational cost will be very high. `ca_locations` contains 1500+ cities and towns, so we will need to decrease its size to 150-300 locales.


**Question 3.2:** Write a `get_sample()` function that when called returns a dataframe with a random sample of $N$ locations from `ca_locations`. Along with $N$, this function should take in a random seed parameter passed in as an integer. The random seed allows us to replicate the random set of locations every time we run the function. Like `get_hour_data()`, `get_sample()` should return a dataframe with all of the original columns plus a new column, 'Source', that contains a string value 'Prediction' for every row. In addition, rename the "Location" column to "Locale". Again, these last two steps will be useful when we merge dataframes.


*Hint*: There are many ways to select random samples: you can use the numpy method that we used in lab and lecture, or [`pandas.DataFrame.sample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).
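To see how a seed makes a random sample repeatable, here is a small sketch using `DataFrame.sample()` and its `random_state` argument on a made-up dataframe (the `places` frame and its columns are hypothetical).

```python
import pandas as pd

# Hypothetical toy frame standing in for a table of places
places = pd.DataFrame({'Name': list('ABCDEFGH'), 'Score': range(8)})

# random_state fixes the seed, so the same rows come back on every call
first = places.sample(n=3, random_state=42)
second = places.sample(n=3, random_state=42)
print(first.equals(second))
```

Changing `random_state` to a different integer would generally select a different set of rows.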

In [None]:
def get_sample(N, seed):
    '''Takes in N (an integer representing the number of locations) and seed 
    (any number) and returns the ca_locations dataframe, 
    filtered to include a sample of N locations, randomly selected. Adds a "Source" column and 
    assigns "Prediction" as its value for all observations. Renames the "Location" column "Locale".'''
    
    # YOUR CODE HERE
        
    return ...

In [None]:
# run this cell, do not change it
get_sample(200,1).head()

In [None]:
assert get_sample(200,1).shape == (200,4)

Now that we have functions to get our hour data and our sampled California cities and towns, it's time to set up the KNN algorithm. The first step to running KNN is to find the distance between our locations of interest (i.e., each of the randomly-sampled California locales) and the locations at which PM2.5 has been measured. Both `nov15` and `ca_locations` contain latitude and longitude coordinates. Let's take advantage of this fact by defining a function that finds the distance between any two points given each point's latitude and longitude values. We can use the Euclidean distance formula; i.e., if we have one point $(a_1,b_1)$ and another point $(a_2,b_2)$, the distance between them is:<br>

$distance = \sqrt{(a_1-a_2)^2 + (b_1-b_2)^2}$

*Side note*: calculating distances between latitude-longitude pairs is often more complicated than the formula above makes it appear because the distance between two points of longitude actually varies based on how far away the points are from the equator. Since we're calculating distances over a relatively small area (the state of California), we can use the approximation above. If you wanted to accurately look at distances between latitude-longitude pairs over a larger area of the globe, you would have to use a slightly more involved trigonometry formula.
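For the curious, one common choice for that "more involved" formula is the haversine formula, which gives the great-circle distance on a sphere. The sketch below is optional background and is not needed for this homework. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    '''Great-circle distance in kilometers between points given in degrees.
    radius_km is an assumed mean Earth radius.'''
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# One degree of longitude spans fewer kilometers at 40 degrees north
# than it does at the equator.
print(haversine_km(0, 0, 0, 1))
print(haversine_km(40, 0, 40, 1))
```

The second distance comes out smaller than the first, which is exactly the effect the side note describes.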

<b>Question 3.3:</b> Define the function `find_distance()`, which returns the distances between a single point $(a,b)$ and a series of $(\vec{x},\vec{y})$ coordinates using [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). The function should return a one-dimensional array of the same length as $x$ and $y$, containing the distances.

*Hint:* NumPy functions (e.g., [np.subtract](https://numpy.org/doc/stable/reference/generated/numpy.subtract.html), [np.sqrt](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html)) are useful for conducting element-wise calculations on arrays.

In [None]:
def find_distance(x, y, a, b):  
    '''Returns a one-dimensional array containing the distances between the point (a,b) and a series of 
    points whose x-coordinates are represented as the one-dimensional array x, and whose y-coordinates 
    are represented by the one-dimensional array y'''
    # YOUR CODE HERE

In [None]:
# run this cell to check your function; do not change it
print(find_distance([5,3],[0,4],0,0)) # calculate the distance from (5,0) and (3,4) to (0,0)

**Question 3.4**: Using `find_distance()`, we're now going to create a distance matrix. Each row will be one of the $N$ sampled California locations where we're interested in predicting the PM2.5 concentration. Each column will represent one of the locations at which PM2.5 has been measured. The elements of the array will represent the distance between each sampled California locale and each PM2.5 measurement during a given hour. For instance, if Oakland was the first out of the cities to appear in the output of `get_sample()`, then row 0 of the distance matrix would contain distances between Oakland and every PM2.5 measurement location in a given hour.<br>

Define a function, `get_dist_array()`, that creates this distance matrix. As input, it will take in `hour_data` - a dataframe of PM2.5 measurements during a given hour - and `ca_sample` - a dataframe of randomly-sampled locales in California. It should return a numpy array with $N$ rows (for each California locale in the sample) and *M* columns, where *M* is equal to the number of rows in `hour_data` (i.e. the number of observed measurements in that hour).

A skeleton code has been provided, but you are welcome to come up with your own approach.

In [None]:
def get_dist_array(ca_sample, hour_data):
    '''Inputs: 
        ca_sample - a dataframe of sampled cities and towns in California. 
        hour_data - a dataframe of observations at a given hour 
    Returns: a numpy array with N rows (for each California location in the sample) and M columns, 
    where M is equal to the number of rows in `hour_data` (i.e. the number of observed measurements in that hour).'''
    # Fill in the ellipses, or replace the skeleton code with your own approach
    dist_array = np.full((len(ca_sample), len(hour_data)), np.nan) # initialize an array of size NxM, filled with NaN 
    
    for i in range(len(ca_sample)): # loop through CA cities/towns
        # calculate distance between each city/town and each measurement location, 
        # and add to array row
        dist_array[i,:] = ... #YOUR CODE HERE
    
    return dist_array

In [None]:
# run this cell; do not change it
hour_data = get_hour_data('18:00')
ca_sample = get_sample(200,1)
print(get_dist_array(ca_sample, hour_data))
assert get_dist_array(ca_sample, hour_data).shape == (200,76)

**Question 3.5:** Next, write a function that predicts PM2.5 concentrations for each point in your set of randomly-sampled California locales based on the average PM2.5 measurement of its K-nearest measurement locations. This function, `predict_PM25()`, should take in as parameters the `hour_data` dataframe, the `ca_sample` dataframe, and a value for $K$.

Your function should use `get_dist_array()` to find the spatial distance between each locale and each measurement location. Then, it should select the $K$ nearest neighboring measurement sites to the locale. Finally, it should assign the average PM2.5 measurement for those $K$ nearest neighbors to a new "Value" column.

*Hint*: you may want to use the [np.argsort()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) function here.
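As a rough sketch of how `np.argsort()` behaves on a two-dimensional array, consider the made-up distance matrix below (the numbers and the `values` array are hypothetical). Sorting along `axis=1` orders the columns of each row by distance, so the first $K$ entries of each row are the indices of that row's $K$ nearest sites.

```python
import numpy as np

# Hypothetical 2 x 4 distance matrix: 2 target points, 4 measurement sites
dist = np.array([[3.0, 1.0, 2.0, 5.0],
                 [0.5, 4.0, 0.1, 2.0]])
values = np.array([10.0, 20.0, 30.0, 40.0])  # one made-up value per site
K = 2

# argsort along axis=1 orders the column indices of each row by distance;
# keeping the first K columns gives each row's K nearest sites
nearest_idx = np.argsort(dist, axis=1)[:, :K]

# Indexing with the integer array pulls the values at those sites;
# the row mean is the KNN average
knn_mean = values[nearest_idx].mean(axis=1)
print(nearest_idx)
print(knn_mean)
```

A loop over rows would work just as well; the vectorized form is shown only because it is compact.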

In [None]:
def predict_PM25(hour_data, ca_sample, K):
    '''Takes in as parameters the hour_data dataframe, the ca_sample dataframe, and an integer value for K 
    (number of nearest neighbors). 
    Returns the `ca_sample` dataframe with a new 'Value' column, whose elements are the predicted PM2.5 concentrations for
    each city or town, calculated based on the mean of the K nearest Sample Measurements to that city or town.'''

    # get distances between CA cities/towns and measurement locations
    # for each CA city/town, get the average value of the K nearest measurement locations: this is your predicted Value
    # add predicted values to ca_sample

    return ca_sample

In [None]:
# run this cell; do not change it
hour_data = get_hour_data('18:00')
ca_sample = get_sample(200,1)
test = predict_PM25(hour_data, ca_sample, 2)
test.head()
# your ca_sample dataframe should have 5 columns: 
# Locale, Latitude, Longitude, Source, and Value

In the real world, data that we work with is often messy, incomplete, and/or missing important values. Case in point: the hourly dataset we pulled from the EPA website that we have been working with so far &mdash; although it contains precise latitude and longitude coordinates for each location &mdash; only contains the county name for each location and not the city or town. This is in contrast to `ca_locations` which contains city and town names.

When we plot all of our data, we would like to have the city and town names visible instead of county names for greater accuracy and clarity. We can use `ca_locations` to approximate the locations in the `hour_data` based on their latitude and longitude coordinates.

<br>

<b>Question 3.6:</b> Write `approximate_locale()`, which takes as input `hour_data` and `ca_locations`. For every record in `hour_data`, the function should go through all the records in `ca_locations` and find the nearest city or town to that measurement. The function should return the `hour_data` dataframe with an appended 'Locale' column that contains the name of the closest city or town. Here, you can make use of `get_dist_array()` again. Remember that each **column** of the array that `get_dist_array()` returns corresponds to a measurement location, and each element within a given column tells you the distance from that measurement location to a locale in `ca_locations` (note that we're using `ca_locations` and not `ca_sample` in this case, because we want to look at all the possible cities or towns).
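One way to find the smallest entry in each column is `np.argmin()` with `axis=0`. The sketch below uses a made-up 3 x 2 matrix and hypothetical place names to show the idea.

```python
import numpy as np

# Hypothetical distance matrix: 3 named places (rows) x 2 measurement sites (columns)
names = np.array(['Alpha', 'Beta', 'Gamma'])
dist = np.array([[2.0, 0.3],
                 [0.4, 1.5],
                 [3.0, 0.9]])

# argmin along axis=0 gives, for each column, the row index with the smallest distance
closest_row = np.argmin(dist, axis=0)

# Indexing the names array with those row indices gives one name per column
print(names[closest_row])
```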

In [None]:
ca_locations.columns

In [None]:
def approximate_locale(hour_data, ca_locations):
    '''Takes in as parameters the hour_data dataframe and the ca_locations dataframe. 
    Returns the `hour_data` dataframe with a new 'Locale' column. Each element is the name (as a string) 
    of the nearest city or town to the measurement site.'''
    
    # get distances between CA cities/towns and measurement locations
    # for each measurement location, find the nearest CA city/town: this is the location name corresponding to that measurement location
    # add locales to hour_data
    
    return hour_data

For a quick check of your results, you can choose a couple lat-long coordinates in `hour_data`, input them to Google Maps, and make sure that the location that your code is outputting is at or near those coordinates.

In [None]:
# run this cell; do not change it
hour_data = get_hour_data('18:00')
ca_sample = get_sample(200,1)
approximate_locale(hour_data, ca_locations).head()

To make a meaningful plot, we need to do a little more formatting. Taking a glance at `nov15`, we see that our PM2.5 sample measurements range anywhere from about 0 LC to more than 200 LC, with most data falling far below 200 LC.

In [None]:
# run this cell to see the distribution of measurements
plt.hist(nov15["Sample Measurement"], bins = 20);
plt.title("Distribution of PM2.5 observations on Nov 15 2018")
plt.xlabel("PM2.5 concentration")
plt.ylabel("Count")
plt.show()

To allow our plot of PM2.5 concentrations to have greater color contrast, we will need to take the log of these measurements.

In addition, we would like to add a 'Text' column to our data that will allow us to display information about each point when we plot the data. For each point, we would like to display the city or town (Locale) name, the data Source (predicted or observed), and the PM2.5 concentration Value.

<br>

<b>Question 3.7:</b> Write a `convert_to_log()` function and an `add_text()` function that both take in a dataframe. Assume that the dataframe passed into these functions will be the `hour_data` and `ca_sample` merged into one data frame, with a column "Value" that contains either the observed or predicted PM2.5 concentration, a column "Locale" that contains the nearest town or city name, and a column "Source" that contains the data source ("Observation" or "Prediction").

`convert_to_log()` should return the dataframe with an appended 'Log Value' column, whose elements are the natural logarithms of the "Value" column in the same dataframe.

`add_text()` should return the dataframe with an appended 'Text' column, in which each entry is a string that contains a record's nearest city or town name, data source, and PM2.5 value. Be sure to round the value to 3 decimals.
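The two building blocks here are an element-wise natural logarithm and string formatting with rounding. A minimal sketch on a made-up two-row dataframe follows; the column names `Place`, `Amount`, and `Log Amount` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with made-up values
df = pd.DataFrame({'Place': ['Alpha', 'Beta'], 'Amount': [1.0, 123.45678]})

# np.log is the natural logarithm and works element-wise on a column
df['Log Amount'] = np.log(df['Amount'])

# Build one label per row; round() limits the value to 3 decimals before str()
labels = [p + ': ' + str(round(a, 3)) for p, a in zip(df['Place'], df['Amount'])]
print(labels)
```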

In [None]:
def convert_to_log(df):
    '''Returns the input dataframe df with a new column "Log Value" whose elements are the natural logarithms of the values 
    in the Value column'''
    
    #YOUR CODE HERE
    

def add_text(df):
    '''Returns the input dataframe df with an appended 'Text' column, in which each element is a string that 
    contains the data point's locale name, data source, and PM2.5 value, rounded to 3 decimal places.'''
    # FILL IN THE ELLIPSES OR REPLACE WITH YOUR OWN APPROACH
    text = ... # INITIALIZE A LIST TO HOLD THE STRINGS
    
    for i in np.arange(...): # LOOP OVER THE ROWS OF THE DATAFRAME
        locale = ... # FIND THE LOCALE FOR THE RECORD
        data_source = ... # FIND THE SOURCE OF THE RECORD (PREDICTED OR OBSERVED)
        value = ... # FIND THE (MEASURED OR PREDICTED) PM2.5 CONCENTRATION OF THE RECORD
        text.append(... + '<br>' + ... + ':' + ... + ' LC')
    
    ... # ADD A NEW COLUMN, "TEXT," TO df
    return df

Now, we are able to create our KNN map. Let's use the functions we've defined above to write our KNN algorithm and graph the data.

<br>

<b>Question 3.8:</b> Write the `knn_algorithm()` function. The input parameters are:
- hour: the hour (as a string) on which to filter the data
- N_locales: an integer number of cities or towns to sample from `ca_locations`
- seed: an integer for the random seed to use to choose the sample of locales
- K: an integer designating the number of nearest neighbors on which to base the algorithm.

Your function should merge the predicted and observed PM2.5 concentrations for the designated hour into one dataframe (*hint:* use an outer merge on more than one column). Format this `total_data` dataframe by taking the log of the PM2.5 values and adding a "Text" column. We've provided for you a `run_plotly()` function that takes in the observed data, predicted data, total data, hour, and K, and plots the map using `plotly` and `mapbox`. The function takes in the observed and predicted data separately, so you will need to separate your total_data after formatting it.
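If outer merges on several columns are new to you, the sketch below shows the behavior on two made-up dataframes. All frame names, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames that share the columns 'Place', 'Kind', and 'Amount'
measured = pd.DataFrame({'Place': ['Alpha'], 'Kind': ['Measured'],
                         'Amount': [12.0], 'Site': [7]})
estimated = pd.DataFrame({'Place': ['Beta', 'Gamma'],
                          'Kind': ['Estimated', 'Estimated'],
                          'Amount': [9.5, 11.0]})

# An outer merge on several shared columns keeps every row from both frames;
# columns present in only one frame are filled with NaN for the other frame's rows
combined = pd.merge(measured, estimated, how='outer', on=['Place', 'Kind', 'Amount'])
print(combined)

# A boolean mask splits the combined frame back apart
only_estimated = combined[combined['Kind'] == 'Estimated']
```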

If you are stuck or unsure how to approach this problem, try looking back to see the order of the steps we took to get the data, run the algorithm, and format the data for plotting. If you later encounter any errors, try going back to your previous code to look for any potential mistakes.

In [None]:
def knn_algorithm(hour, N_locales, seed, K):
    
    # get data for the specified hour, and using the seed get a random sample of CA cities/towns 
    # predict the PM2.5 concentrations for the sampled CA cities/towns
    # get the approximate locations for the hourly measured data
    # merge (or concatenate) dataframes, convert the measurements to log values, and add a text column
    # subset your dataframe into observed and predicted data
    
    # return a plot of observed and predicted values
    
    return run_plotly(observed_data, predicted_data, total_data, hour, K)

### Analyzing the KNN Algorithm

Try out the KNN algorithm for `hour='12:00'`, `N_locales=250`, `seed=100`, and `K=3`. When the map loads, try hovering over points, zooming in and out, right clicking and dragging, and toggling on/off options in the interactive legend to get a better grasp of what the data looks like in both a local and a regional sense. Once you've done that, try it out for different hours and for different values of K.

The K value should be the main focus of your analysis. Try different values of K to see the changes in predicted measurements. Keep in mind that larger values of K will take longer to load; anything above K = 10 will most likely take too long to run.

Also, try out different seeds, but keep in mind that the seed is meant to preserve a randomized set of locations, so when comparing different hours and K values it is best to keep the same seed.

In [None]:
# Run to see the recorded hours for reference
np.unique(nov15['Time Local'])

In [None]:
#YOUR CODE HERE

<b>Question 3.9:</b> Comment on what you think is a "good" value of K, and explain why. Note that there is no single right answer here, but there are undoubtedly better and worse options &mdash; what would be a bad value of K?

*Your answer here*

<b>Question 3.10:</b> What are other factors that might be affecting spatial distributions? Explain why it would be good to create a model that predicts concentrations based on location, nearby measurements *and* the other factors that you've listed.

*Your answer here*

---

## Section 4: Single Linear Regression with `scikit-learn` <a id='section4'></a>

Now that we've learned how to generate maps using KNN, we will learn how to use the simple single linear regression tool in [`scikit-learn`](http://scikit-learn.org/stable/), a popular Python package for machine learning algorithms. Their documentation is quite good, so feel free to browse if you would like to learn the details behind how their functions work.

For this section, we will use `scikit-learn` to build a simple prediction model of ozone concentration over time. 

### Downloading and Filtering the Data

First, let's download the data we will be using for this section. Run the following cell below to download the zip files from the EPA website. Each file contains a dataset of annual air pollutant concentrations by site, or "monitor", and related data.

In [None]:
# Download the zip files from the EPA website
# This cell only needs to be run once
# Once the files are downloaded, they'll stay on datahub.
for year in np.arange(1998, 2020):
    airquality_url = 'https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_' + str(year) + '.zip'
    airquality_path = Path('annual_conc_by_monitor_' + str(year) +'.zip')
    if not airquality_path.exists():
        print('Downloading ' + str(airquality_path) + ' ...', end=' ')
        airquality_data = requests.get(airquality_url)
        with airquality_path.open('wb') as f:
            f.write(airquality_data.content)
        print('Done!')

Let's try to get a sense of what our data look like. Run the next cell to see the 2019 dataset.

In [None]:
airquality_path = Path('annual_conc_by_monitor_2019.zip')
zf = zipfile.ZipFile(airquality_path, 'r')
f_name = 'annual_conc_by_monitor_2019.csv'

# Unzip the file
with zf.open(f_name) as fh:

    # Create data frame
    annual_2019 = pd.read_csv(fh, low_memory=False)

annual_2019.head()

For this homework we will only be considering annual measures for ozone in the state of California. Our goal right now is to create a single dataframe that compiles data from all of the annual files.

<br><b>Question 4.1:</b> The goal of the following cell is to create one dataframe, `O3_ca`, that contains ozone data ('Parameter Code' = 44201), focusing on the daily maximum of an 8-hour running average and the pollutant standard of 'Ozone 8-hour 2015', from California only.<br> 

To do this, you can look at each csv file within each zip file, read that .csv file into a dataframe, and create a filtered dataframe that contains only the data we care about based on the conditions above. Then, you can concatenate that dataframe to `O3_ca` so that every time you run through the loop, you've added data from a different year to your dataframe. When you concatenate, make sure the `axis` parameter is correctly specified so that each iteration adds the **rows** associated with a given year to the dataframe.<br>
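To illustrate what the `axis` parameter of `pd.concat()` does, here is a sketch that stacks a few made-up per-year dataframes by rows. The columns and values are hypothetical stand-ins for the filtered yearly files.

```python
import pandas as pd

# Hypothetical per-year frames standing in for the filtered yearly files
frames = []
for year in [2001, 2002, 2003]:
    yearly = pd.DataFrame({'Year': [year, year], 'Reading': [0.1, 0.2]})
    frames.append(yearly)

# axis=0 stacks rows; ignore_index=True renumbers the row index from 0
stacked = pd.concat(frames, axis=0, ignore_index=True)
print(stacked.shape)
```

This sketch collects the frames in a list and concatenates once at the end. Concatenating onto a growing dataframe inside the loop, as described above, produces the same rows.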

If you're unsure of where to start, look at the code block above to see how we open zipfiles and then access .csv files from within those zip files.

In [None]:
O3_ca = pd.DataFrame() # initialize empty dataframe

for year in np.arange(1998, 2020):
    
    # YOUR CODE HERE
    

In [None]:
# run this cell
O3_ca.head()

Run the cell below to see if the final data frame has the correct dimensions.

In [None]:
assert O3_ca.shape == (4329, 55)

**Question 4.2:** We are interested in predicting ozone concentration as a function of year. The "Year" column in `O3_ca` represents our independent variable ($X$). The "Arithmetic Mean" column will be our response variable ($y$). Describe what you interpret the values in the "Arithmetic Mean" column to represent (i.e., this column is the mean of ...?). 

*YOUR ANSWER HERE*

### Using `scikit-learn`

Now that our data is loaded, we can use `scikit-learn`. Refer to Lab 5 for the introduction and important notes on the `scikit-learn` `linear_model` tools.
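As a quick refresher, the sketch below fits `LinearRegression` to made-up points that lie exactly on the line $y = 1 + 2x$. The data and variable names are hypothetical. The main thing to notice is that `scikit-learn` expects the predictor as a two-dimensional array, hence the `reshape(-1, 1)`.

```python
import numpy as np
from sklearn import linear_model

# Made-up data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# scikit-learn expects X as a 2-D array of shape (n_samples, n_features)
X = x.reshape(-1, 1)

lm = linear_model.LinearRegression()
fit = lm.fit(X, y)          # fit() returns the fitted estimator itself

y_hat = fit.predict(X)      # predicted y for each row of X
print(fit.intercept_, fit.coef_)
```

Because the points sit exactly on a line, the fitted intercept and slope recover 1 and 2 up to floating-point error.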

<b>Question 4.3:</b> Using `scikit-learn`, let's fit a linear regression model for the city of Modesto, California. First, create a `O3_modesto` data frame that contains only data for Modesto. Then, generate a linear regression object called `lm_modesto`. Finally, fit that linear model to the "Year" values (the independent variable) and the "Arithmetic Mean" values (the target variable). Save the fitted model to an object named `fit_modesto`.<br>

In [None]:
# YOUR CODE HERE
O3_modesto = ...
X = ...
y = ...
lm_modesto = ...
fit_modesto = ...

<b>Question 4.4:</b> Now that we've fitted the linear model `fit_modesto`, we can use it to predict the ozone concentrations for each year. Our linear model has a `.predict()` method, which takes in X and returns an array of predicted $y$ values, one for each row of X. We can then use `matplotlib` to compare the regression line with the observed data points. Generate `y_prediction`. Then, plot the `O3_modesto` observations as well as the regression line. Again, make sure to give the plot a title, label the axes, and choose a range for the xticks that makes sense.

In [None]:
# YOUR CODE HERE

Let's do further analysis on the outputs. Namely, let's look at two coefficients that our linear regression object stores &mdash; $\hat{\beta}_0$ (the intercept) and $\hat{\beta}_1$ (the slope).

<b>Question 4.5:</b> Browse through the [`scikit-learn`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) documentation to find out how to call the intercept and slope attributes of `fit_modesto`, and print them.

In [None]:
#YOUR CODE HERE

<b>Question 4.6:</b> In the context of the plot we generated, try to make sense of our intercept and slope. What do they mean? Write down an explanation.

*YOUR ANSWER HERE*

### Linear Regression on `O3_ca`

Now that we've gotten practice with using the single linear regression function on a sample dataset, we are now able to observe the spatial distribution of annual changes in pollutant concentration for all locations in the state.

<b>Question 4.7:</b> Use what we've learned in this homework on the `O3_ca` dataset to estimate and print out the slope (for ozone concentration versus time) if we include data for all of California. Create the corresponding scatter plot and regression line. As always, be sure to give the plot proper formatting.

In [None]:
# YOUR CODE HERE

<b>Question 4.8:</b> What trends do you observe? How does the model for California compare to the model for Modesto?

*YOUR ANSWER HERE*

**Question 4.9:** What does our model predict the average ozone concentration in Modesto will be in 2025? How about 2040?

In [None]:
# YOUR CODE HERE

### Extra Credit
Compute the 95% confidence intervals for $\hat{\beta}_0$ and $\hat{\beta}_1$ for both the Modesto and the California linear models you have created. Feel free to draw from the functions in the asynchronous lectures and the lab as a starting point.
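As one possible starting point, the sketch below computes 95% confidence intervals for the intercept and slope of a single linear regression using the normal equation and the usual standard-error formula. It runs on synthetic data, not on the ozone data, and it assumes the standard linear model with independent, constant-variance errors. All variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic data: a known line plus noise (seeded so the result is repeatable)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Ordinary least squares estimates via the normal equation
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance estimate with n - p degrees of freedom
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Standard errors are the square roots of the diagonal of sigma^2 (X'X)^-1
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# A two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(ci)  # row 0: intercept interval, row 1: slope interval
```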

In [None]:
# YOUR CODE HERE

----
## Submission

Congrats, you've finished homework 5! **In the dependencies code block, make sure to uncomment everything in `# uncomment this for final version` and comment everything in `# comment this out for final version`.**

Before you submit, click **Kernel** -> **Restart & Clear Output**. Then, click **Cell** -> **Run All**. Then, go to the toolbar and click **File** -> **Download as** -> **.html** and submit the file through bCourses.


---

## Bibliography

- Adi Bronshtein - Referred to KNN concepts. https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7

- Anwar A. Ruff - Used normal equation example as model. https://github.com/aaruff/Course-MachineLearning-AndrewNg/blob/master/NormalEquation.ipynb

- Introduction to Statistical Learning - Referred to KNN concepts. https://www-bcf.usc.edu/~gareth/ISL/

- Manu Jeevan - Adapted scikit-learn techniques. http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/

- Maps of World - Obtained latitude/longitude of CA cities and towns. https://www.mapsofworld.com/usa/states/california/lat-long.html

- scikit-learn.org - Referred to scikit-learn documentation. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

- Shawon Ashraf - Adapted normal equation implementation techniques. https://www.c-sharpcorner.com/article/normal-equation-implementation-from-scratch-in-python/

---
Notebook developed by: Joshua Asuncion, edited by Jessica Katz

Data Science Modules: http://data.berkeley.edu/education/modules
