# [ERG 190C] Homework 5


---

In this homework students will start working with air quality data, run k-nearest neighbors, and do a simple linear regression.

K-nearest neighbors (and the remainder of the methods I'll cover in the semester) is covered in Introduction to Statistical Learning. KNN for classification is described in Section 2.2.3 and for regression in Section 3.5. In this homework we're going to use KNN for quantitative spatial forecasting, meaning we'll predict a numeric value for a location in space based on the average of the K-nearest points in space for which we have data.

We'll use the EPA air pollution measurements again (first used in HW2). For linear regression the objective is to build simple prediction models for PM2.5 concentration versus time. The data can be found [here](https://aqs.epa.gov/aqsweb/airdata/download_files.html).

---

### Topics Covered
- Continue getting comfortable working with new data, and continue to practice working with tools that help manage and summarize large data sets.
- Understand how KNN works and make some cool maps in the process.
- Learn how to implement the normal equations. Estimate regression coefficients using the normal equation.
- Learn how to use the simple single linear regression tool in scikit-learn.
- Analyze spatial distribution of annual changes in pollutant concentration.

### Table of Contents

1 - [K-Nearest Neighbors](#section1)<br>
2 - [Regression and the Normal Equation](#section2)<br>
3 - [Single Linear Regression with scikit-learn](#section3)<br>

**Dependencies:**

In [None]:
# Run this cell to install these packages
! pip install scikit-learn
! pip install plotly
! pip install mapbox

In [None]:
# Run this cell to set up your notebook
import requests
from pathlib import Path
import zipfile
import os
import csv
import pandas as pd
import numpy as np
from numpy.linalg import inv

import utils
from utils import run_plotly

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

# Use plotly in offline mode so the maps render inside the notebook
import plotly.offline as py
py.init_notebook_mode(connected=False)
import plotly.graph_objs as go

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Section 1: K-Nearest Neighbors  <a id='section1'></a>

Let's run a KNN algorithm on the EPA `hourly_88101_2017.csv` dataset that we previously used in Homework 2. This time, we've reduced the dataset to contain just the hourly data from California. We will use KNN to plot a map of predicted PM2.5 concentrations in locations throughout California, focusing on October 13, 2017. You may remember that date from the [October 2017 Northern California wildfires](https://en.wikipedia.org/wiki/October_2017_Northern_California_wildfires) as the day air pollution in some areas reached hazardous levels. We've gone ahead and created that dataset for you to use as `pm25_oct13.csv`.

In [None]:
# Run the following cell
oct_13 = pd.read_csv('data/pm25_oct13.csv', low_memory=False)
oct_13.head()

In addition, we've also gathered together a dataset containing the latitude and longitude coordinates of every major city and town in the state of California as `ca_cities_towns.csv`. We will use these as our locations on which we will run our algorithm to predict PM2.5 concentrations.

In [None]:
# Run the following cell
ca_locations = pd.read_csv('data/ca_cities_towns.csv', low_memory=False)
ca_locations.head()

For our purposes, nearest neighbor proximity will be based on spatial distance. For each location, we will find its K-nearest neighbors in the EPA dataset, and then we will use their average PM2.5 concentration as the forecast for that location. This simple but effective algorithm should allow us in the end to create a map of California where we can color locations based on their observed and predicted PM2.5 concentrations.
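To make the idea concrete before writing the full algorithm, here is a minimal sketch of the K-nearest-neighbor average on a hypothetical toy dataset. The three monitor coordinates, their readings, and the helper name `knn_forecast` are made up for illustration and are not part of the EPA data.

```python
import numpy as np

# Hypothetical toy data: three monitors given as (latitude, longitude) with a PM2.5 reading each
monitor_coords = np.array([[37.0, -122.0], [38.0, -121.0], [34.0, -118.0]])
monitor_pm25 = np.array([10.0, 20.0, 60.0])

def knn_forecast(point, coords, values, k):
    """Average the values of the k monitors closest to `point` (Euclidean distance)."""
    distances = np.sqrt(((coords - point) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
    return values[nearest].mean()

# Forecast for a location that sits between the first two monitors
forecast = knn_forecast(np.array([37.5, -121.5]), monitor_coords, monitor_pm25, k=2)
print(forecast)
```

With `k=2` the distant third monitor is ignored, so the forecast is simply the mean of the two nearby readings.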

### Writing the KNN Algorithm

Because we are working with an hourly dataset, we want to plot our points by hour. This means that for each call to our algorithm, we will need to go through our EPA dataset and select only the data that correspond to that hour.

In [None]:
# Run to see the recorded hours
np.unique(oct_13['Time Local'])

A downside to KNN is that it can be particularly slow. If we are working with a large dataset, we will have to iterate many times over to find the K-nearest neighbors and thus our computational cost will be very high. `ca_locations` contains 1500+ cities and towns so we will need to decrease its size.

As we plan to eventually combine these two datasets, it will be useful for us if we first categorize our data into types &mdash; meaning we need to keep track that `oct_13` contains our observed data and `ca_locations` will contain our predicted data.

<br>

<b>Question 1.1:</b> Write a `get_hour_data()` function that takes an hour parameter passed in as a string and returns a data frame containing only data from `oct_13` that was recorded during that hour.

In addition, write a `create_grid()` function that when called returns a data frame of a random sample of 150 locations from `ca_locations`. This function should also take in a seed parameter passed in as an integer that allows us to reuse the same randomized set of locations.

Make sure in both functions to append a 'Type' column to the data frame that contains an array of strings of either 'Observed' (for the `get_hour_data` function) or 'Predicted' (for the `create_grid` function).  These will be useful later on.

*Hint: `np.repeat('a', 3)` returns `['a', 'a', 'a']`.*

*Hint: Selecting random samples using `pandas` might be helpful.*
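As a small illustration of the second hint, the sketch below shows how a fixed seed makes a `pandas` random sample repeatable. The data frame `toy` and its values are hypothetical; only `DataFrame.sample` and its `random_state` argument are the point here.

```python
import pandas as pd

# Hypothetical stand-in for a table of locations
toy = pd.DataFrame({'Location': ['A', 'B', 'C', 'D', 'E'],
                    'Latitude': [34.0, 35.0, 36.0, 37.0, 38.0]})

# Sampling twice with the same seed returns the same rows in the same order
first = toy.sample(3, random_state=1)
second = toy.sample(3, random_state=1)
print(first.equals(second))
```

Reusing the same seed is what lets us compare different hours and values of K on an identical set of locations.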

In [None]:
# SOLUTION
def get_hour_data(hour):
    # Copy the filtered rows so that adding a column does not touch a view of oct_13
    hour_data = oct_13[oct_13['Time Local'] == hour].copy()
    hour_data['Type'] = np.repeat('Observed', len(hour_data))
    
    return hour_data

def create_grid(seed):
    grid = ca_locations.sample(150, random_state = seed)
    grid['Type'] = np.repeat('Predicted', len(grid))
    
    return grid

In [None]:
create_grid(1).head()

Now that we are able to get our hour data and our grid for the map, it's time to run the KNN algorithm. Both the hour data and the grid contain latitude and longitude coordinates. Let's take advantage of that by defining a function that finds the distance between any two points given each point's latitude and longitude values, which will help us when comparing nearest distances.

Then, write a function that predicts PM2.5 measurements for each point in the grid by first calculating the spatial distance between that point and every point in the hour data, selecting the K-nearest neighbors, then finding the average of their PM2.5 measurements, with the function returning the grid appended with the predicted measurements.

<b>Question 1.2:</b> Fill out the following cells. `find_distance()` finds the distance from point $(x, y)$ to point $(a, b)$ using the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). `predict_measurements()` takes in as parameters the hour data, the grid, and a value for K. It should return the grid with KNN performed on it, containing the predicted measurements under a 'Sample Measurement' column.

In [None]:
# SOLUTION
def find_distance(x, y, a, b):  
    return np.sqrt((x - a) ** 2 + (y - b) ** 2)

def predict_measurements(hour_data, grid, k):
    predicted_measurements = []
    
    for i in np.arange(0, len(grid)):
        distances = []
        
        for j in np.arange(0, len(hour_data)):
            distance = find_distance(grid.iloc[i]['Latitude'], grid.iloc[i]['Longitude'],
                                     hour_data.iloc[j]['Latitude'], hour_data.iloc[j]['Longitude'])
            distances.append(distance)
            
        hour_data['Distances'] = distances
        nearest_neighbors = hour_data.sort_values(by='Distances').iloc[0:k]
        average_measurement = np.mean(nearest_neighbors['Sample Measurement'])
        predicted_measurements.append(average_measurement)
        
    grid['Sample Measurement'] = predicted_measurements
    return grid
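The nested loops above are easy to follow but slow. As an optional alternative, the following sketch computes the same K-nearest-neighbor averages with NumPy broadcasting. The function name `predict_measurements_vectorized` and the toy data frames are our own assumptions; the column names match the ones used in this homework.

```python
import numpy as np
import pandas as pd

def predict_measurements_vectorized(hour_data, grid, k):
    """K-nearest-neighbor average for every grid point, without Python loops."""
    grid_xy = grid[['Latitude', 'Longitude']].to_numpy()       # shape (g, 2)
    obs_xy = hour_data[['Latitude', 'Longitude']].to_numpy()   # shape (n, 2)
    obs_values = hour_data['Sample Measurement'].to_numpy()    # shape (n,)

    # Pairwise Euclidean distances between grid points and observations, shape (g, n)
    diffs = grid_xy[:, None, :] - obs_xy[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # For each grid point, the indices of its k closest observations
    nearest = np.argsort(distances, axis=1)[:, :k]

    result = grid.copy()
    result['Sample Measurement'] = obs_values[nearest].mean(axis=1)
    return result

# Hypothetical toy data to check the behavior
toy_hour = pd.DataFrame({'Latitude': [0.0, 0.0, 10.0],
                         'Longitude': [0.0, 1.0, 10.0],
                         'Sample Measurement': [2.0, 4.0, 100.0]})
toy_grid = pd.DataFrame({'Latitude': [0.0, 9.0], 'Longitude': [0.5, 9.0]})
toy_result = predict_measurements_vectorized(toy_hour, toy_grid, k=2)
print(toy_result)
```

Unlike the loop version, this sketch returns a copy of the grid and leaves `hour_data` unchanged, which is a design choice rather than a requirement of the assignment.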

In [None]:
sample_grid = create_grid(1)
sample_grid.iloc[0,2] == sample_grid.iloc[0]['Longitude']

In the real world, the data we work with is often messy, incomplete, and/or missing important values. Case in point: the hourly dataset we pulled from the EPA website contains precise latitude and longitude coordinates for each location, but only the county name for each location and not the city or town. This is in contrast to `ca_locations`, which contains city and town names.

For our plot, we would like to have the city and town names visible instead of county names for greater accuracy and clarity. We can use `ca_locations` to approximate the locations in the hour data based on their latitude and longitude coordinates.

<br>

<b>Question 1.3:</b> Write `approximate_locations()` which takes in the hour data and grid. For every point in the hour data, it should go through all the locations in the grid and find the nearest location to that point. The function should return the hour data with an appended 'Location' column that contains the approximated locations.

In [None]:
# SOLUTION
def approximate_locations(hour_data, grid):
    locations = []
    
    for i in np.arange(0, len(hour_data)):
        distances = []
        
        for j in np.arange(0, len(grid)):
            distance = find_distance(hour_data.iloc[i]['Latitude'], hour_data.iloc[i]['Longitude'],
                                     grid.iloc[j]['Latitude'], grid.iloc[j]['Longitude'])
            distances.append(distance)
            
        grid['Distances'] = distances
        nearest_location = grid.sort_values(by='Distances').iloc[0]
        locations.append(nearest_location['Location'])

    hour_data['Location'] = locations
    return hour_data

The last thing we need to do before we can plot our data is more formatting. Take a glance at `oct_13` to see that our PM2.5 sample measurements range anywhere from 0 LC to more than 300 LC, with most data falling far below 300 LC. To allow our locations to have greater color contrast, we will need to take the log of these measurements.
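To see why the log helps, here is a short sketch on a few hypothetical readings (the numbers are made up, not taken from `oct_13`).

```python
import numpy as np

# Hypothetical readings spanning the range described above
measurements = np.array([1.0, 10.0, 30.0, 300.0])
log_measurements = np.log(measurements)

# The raw values span a factor of 300; the logged values span only log(300)
print(log_measurements.round(3))
print(log_measurements.max() - log_measurements.min())
```

On the log scale, the gap between 1 and 10 is about as wide as the gap between 30 and 300, so low and moderate readings no longer collapse into a single color. One caveat: `np.log` maps a reading of exactly 0 to `-inf`, so any such rows may need special handling before plotting.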

In addition, we would like to add a 'Text' column to our data that will allow us to display information about each point when we plot the data. For each point we would like to display the city or town name, the data type (predicted or observed), and the PM2.5 sample measurement.

<br>

<b>Question 1.4:</b> Write a `convert_to_log()` function and an `add_text()` function that both take in a data frame. Assume that the data frame passed into these functions will be the hour data and grid concatenated into one data frame.

`convert_to_log()` should return the data frame with an appended 'Log Sample Measurement' column.

`add_text()` should return the data frame with an appended 'Text' column where each entry is a string that contains the data point's location name, data type, and measurement. Be sure to round the measurement to 3 decimals.

In [None]:
# SOLUTION
def convert_to_log(data):
    data['Log Sample Measurement'] = np.log(data['Sample Measurement'])
    return data

def add_text(data):
    text = []
    
    for i in np.arange(0, len(data)):
        location = data.iloc[i]['Location']
        data_type = data.iloc[i]['Type']
        measurement = round(data.iloc[i]['Sample Measurement'], 3).astype(str)
        text.append(location + '<br>' + data_type + ' Concentration: ' + measurement + ' LC')
    
    data['Text'] = text
    return data

Now, we are able to create our KNN map. Let's use the functions we've defined above to write our KNN algorithm and graph the data.

<br>

<b>Question 1.5:</b> Write `knn_algorithm()`. For the parameters, it takes in a string for the hour to filter the data, an integer for the seed to choose the set of locations, and an integer for K to run the algorithm with.

Be sure that after you have predicted measurements for the grid and approximated locations for the hour data that you concatenate them into one data frame, and then once you have the total data, format it and plot it. We've provided for you a `run_plotly()` function that takes in the observed data, predicted data, total data, hour, and K, and plots the map using `plotly` and `mapbox`. The function takes in the observed and predicted data separately, so you will need to separate your total data after formatting it.

If you are stuck or unsure how to approach this problem, try looking back to see the order of the steps we took to get the data, run the algorithm, and format the data for plotting. If you later encounter any errors, try going back to your previous code to look for any potential mistakes.

In [None]:
# SOLUTION
def knn_algorithm(hour, seed, k):
    # Our solution took 10 lines
    
    hour_data = get_hour_data(hour)
    grid = create_grid(seed)
    grid = predict_measurements(hour_data, grid, k)
    hour_data = approximate_locations(hour_data, grid)
    total_data = pd.concat([grid, hour_data])
    total_data = convert_to_log(total_data)
    total_data = add_text(total_data)
    observed_data = total_data[total_data['Type'] == 'Observed']
    predicted_data = total_data[total_data['Type'] == 'Predicted']
    
    return run_plotly(observed_data, predicted_data, total_data, hour, k)

### Analyzing the KNN Algorithm

Try out the KNN algorithm for `hour='12:00'`, `seed=100`, and `k=3`. When the map loads, try hovering over points, zooming in and out, right clicking and dragging, and toggling on/off options in the interactive legend to get a better grasp of what the data looks like in both a local and a regional sense. Once you've done that, try it out for different hours and for different values of K.

Try different hours to see how PM2.5 concentrations changed throughout the day. The K value, however, should be the main focus of your analysis.

Try different values of K to see the changes in predicted measurements. Keep in mind that larger values of K will take longer to load; anything above K = 10 will most likely take too long to run.

Also, try out different seeds, but keep in mind that the seed is meant to preserve a randomized set of locations, so when comparing different hours and K values it is best to keep the same seed.

In [None]:
# Run to see the recorded hours for reference
np.unique(oct_13['Time Local'])

In [None]:
knn_algorithm(hour='12:00', seed=100, k=3)

<b>Question 1.6:</b> Comment on what you think is a "good" value of K, and explain why. Note that there is no single right answer here, but there are undoubtedly better and worse options &mdash; what would be a bad value of K?

**Important points:**
- The important thing is that K is not too small or too large 
- A low K-value (e.g., 1) would result in high variance
- A high K-value (e.g., 100) would result in high bias

<b>Question 1.7:</b> What are other factors that might be affecting spatial distributions? Explain why it would be good to create a model that predicts concentrations based on location, nearby measurements *and* those other factors.

**Important points:** 
- Listing a handful of potential factors of interest. For example: topography could also affect how air pollution spreads.
- These air quality monitors are not uniformly distributed - why? Possible explanations: population density, economic development, urban vs. rural...

---

## Section 2: Regression and the Normal Equation <a id='section2'></a>

Now that we've learned how to generate maps using the KNN algorithm, we will move on to the topic of linear regression, one of the more essential aspects of data analysis.

In this section, we will learn how to create the regression line for a dataset using linear algebra, gaining practice with the normal equation along the way. In a later section we will compare our results here with the results from a popular Python package.

### Downloading and Filtering the Data

First, let's download the data we will be using for the rest of this homework. Run the following cell below to download the zip files from the EPA website. Each file contains a dataset of annual air pollutant concentrations by site, or "monitor", and related data.

In [None]:
# Download the zip files from the EPA website
# This cell only needs to be run once
# Once the files are downloaded, they'll stay on datahub.
for year in np.arange(1998, 2018):
    airquality_url = 'https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_' + str(year) + '.zip'
    airquality_path = Path('annual_conc_by_monitor_' + str(year) +'.zip')
    if not airquality_path.exists():
        print('Downloading ' + str(airquality_path) + ' ...', end=' ')
        airquality_data = requests.get(airquality_url)
        with airquality_path.open('wb') as f:
            f.write(airquality_data.content)
        print('Done!')

Let's try to get a sense of what our data looks like. Run the next cell to see the 2017 dataset.

In [None]:
airquality_path = Path('annual_conc_by_monitor_2017.zip')
zf = zipfile.ZipFile(airquality_path, 'r')
f_name = 'annual_conc_by_monitor_2017.csv'

# Unzip the file
with zf.open(f_name) as fh:

    # Create data frame
    annual_2017 = pd.read_csv(fh, low_memory=False)

print(annual_2017.columns)

For this homework we will only be considering annual measures for PM2.5 in the state of California. Our goal right now is to create a single csv file that compiles all of the annual files using these specifications.

<br><b>Question 2.1:</b> Fill out the following cell. For each csv file let's write a filtered file that contains only the data we care about. Create a table with PM2.5 data (parameter code 88101) with sample duration of 24 hours and pollutant standard of 'PM25 Annual 2006'. Be sure to select just the data from California.

In [None]:
# SOLUTION
# Create a new filtered csv file for each annual zip file
for year in np.arange(1998, 2018):
    
    zip_name = 'annual_conc_by_monitor_' + str(year) +'.zip'
    airquality_path = Path(zip_name)
    zf = zipfile.ZipFile(airquality_path, 'r')
    f_name = 'annual_conc_by_monitor_' + str(year) +'.csv'
    
    # Unzip the file
    with zf.open(f_name) as fh:
        print('Writing ' + 'pm25_' + str(year) +'.csv' + ' ...', end=' ')
        
        # Create data frame
        df = pd.read_csv(fh, low_memory=False)

        # Filter data frame according to specifications
        df = df[df['Parameter Code'] == 88101]
        df = df[df['Sample Duration'] == '24 HOUR']
        df = df[df['Pollutant Standard'] == 'PM25 Annual 2006']
        df = df[df['State Name'] == 'California']

        # Write new filtered csv file
        df.to_csv('pm25_' + str(year) +'.csv')
        os.remove(zip_name)
print('Done!')

Now that we've filtered each file, run the following cell to concatenate them into a single `pm25_ca` csv file which we will use for the duration of the homework.

***NOTE:*** When you have completed Question 2.1, only run the following cell **ONCE**. If you run this cell multiple times, the dataset will have extra rows and thus not work for the rest of the homework. Make sure you've filtered the data correctly in the previous cell.

In [None]:
# Concatenate the filtered csv files
fout = open('pm25_ca.csv','a')

# First file (written in full, including its header):
first_file = 'pm25_1998.csv'
with open(first_file) as f:
    for line in f:
        fout.write(line)
os.remove(first_file)
    
# Now the rest:
for year in np.arange(1999, 2018):
    pm_file = 'pm25_' + str(year) + '.csv'
    f = open(pm_file)
    
    # Skip the header
    next(f)
    for line in f:
        fout.write(line)
    
    # Close the file before deleting it
    f.close()
    os.remove(pm_file)
fout.close()

This is what our resulting annual California PM2.5 dataset looks like.

In [None]:
pm25_ca = pd.read_csv('pm25_ca.csv', low_memory=False)
pm25_ca.head()

Run the cell below to see if the final data frame has the correct dimensions. ***DO NOT*** move on if it raises an error.

In [None]:
assert pm25_ca.shape == (2221, 56)

----
### Linear Regression Using the Normal Equation


Now that we have a single PM2.5 dataset, let's regress annual pollutant concentration using the normal equation. Recall that the normal equation is given as: 

<img src="normal_equation.jpg" width=150>

For this section, we will use `pm25_ca`, `pandas` and `numpy` to perform single linear regression of PM2.5 concentration on time for one location: the city of [Victorville](https://www.google.com/maps/place/Victorville,+CA/@34.5311766,-118.8229951,7.83z/data=!4m5!3m4!1s0x80c3645a63ddd279:0xd95115925f43476!8m2!3d34.5362184!4d-117.2927641), CA. In our case, $X$ is a matrix of two columns, where the first is our independent variable and the second is an array of ones meant to help us with calculating the intercept in our linear equation. $X^T$ is the transpose of $X$, $(X^T X)^{-1}$ is the inverse of the matrix $X^T X$, and $y$ is an array of our dependent variable.
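For reference alongside the image above, the normal equation can also be written out in LaTeX. This is the standard least-squares result rather than anything specific to this homework; $\theta$ denotes the vector of regression coefficients, matching the variable name `theta` used later in this section.

$$\theta = (X^T X)^{-1} X^T y$$

It comes from choosing $\theta$ to minimize the sum of squared residuals $\lVert y - X\theta \rVert^2$. Setting the gradient with respect to $\theta$ to zero gives

$$X^T X \theta = X^T y$$

and multiplying both sides by $(X^T X)^{-1}$ yields the first expression. With the column order used here (independent variable first, ones second), $\theta = (a, b)^T$, where $a$ is the slope and $b$ is the intercept.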

<b>Question 2.2:</b> Create a data frame of annual PM2.5 concentrations just for Victorville. We want the year to be our independent variable, and we want the average concentration value for the year to be our dependent variable. Only include these in the data frame. Refer to the [doc](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_format_3) to figure out which columns these are. Then, add to it an additional 'Intercept' column that contains an array of ones.

*Hint: `np.ones(n)` creates a length n array of ones.*

In [None]:
# SOLUTION
pm25_victorville = pm25_ca[pm25_ca['City Name'] == 'Victorville']
pm25_victorville = pm25_victorville[['Arithmetic Mean', 'Year']]
pm25_victorville['Intercept'] = np.ones(len(pm25_victorville))
pm25_victorville.head()

Built into `pandas` is the ability to find the transpose of a matrix as well as the ability to find the dot product of matrices. Given data frames `X` and `Y`, we can call `X.T` to get the transpose of `X`, and we can call `X.dot(Y)` to get their dot product.

Built into the `numpy` package is `linalg` which provides useful operations to work with linear equations. One such function is `np.linalg.inv(X)` which finds the inverse of `X`.
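Before applying these tools to the Victorville data, it can help to try them on a case where the answer is known. The sketch below uses hypothetical points that lie exactly on the line $y = 2x + 1$; the variable names are our own.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Design matrix: independent variable first, column of ones second
X = np.column_stack([x, np.ones(len(x))])

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(theta)

# Alternative that solves the same linear system without forming the inverse
theta_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(theta_solve)
```

Both approaches recover a slope close to 2 and an intercept close to 1. `np.linalg.solve` is generally preferred over an explicit inverse for numerical stability, although either is fine for a problem this small.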

<b>Question 2.3:</b> Using these tools, solve for the normal equation for `pm25_victorville`. Use the normal equation from above. What should our $X$ and $y$ be?

In [None]:
# SOLUTION
X = pm25_victorville[['Year', 'Intercept']]
y = pm25_victorville['Arithmetic Mean']

xTx = X.T.dot(X)
xTx_inv = np.linalg.inv(xTx)
theta = xTx_inv.dot(X.T).dot(y)

print(theta)

Recall that $\theta$ is a vector of two coefficients $a$ and $b$ which are used in calculating the linear regression line that has the form $y = ax + b$, where $x$ is our independent variable and $y$ is our dependent variable.

Now that we have solved for the normal equation and estimated our regression coefficients, we can find the regression line for our Victorville dataset.

<b>Question 2.4:</b> Add a 'Prediction' column to `pm25_victorville` that contains the predicted y values from our regression line. Create a scatter plot of the observed Victorville data, and plot the regression line of the predicted values. Be sure to give the plot a title and label the axes. In addition, make sure to choose a range for the xticks that makes sense.

In [None]:
# SOLUTION
pm25_victorville['Prediction'] = theta[0] * pm25_victorville['Year'] + theta[1]

plt.scatter(pm25_victorville['Year'], pm25_victorville['Arithmetic Mean'], color='black')
plt.plot(pm25_victorville['Year'], pm25_victorville['Prediction'], color='red', linewidth=3)

plt.title('Annual PM2.5 Concentration in Victorville', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('PM2.5 Micrograms/cubic meter (LC)', fontsize=13)
plt.xticks(np.arange(1999, 2017, 2))
plt.show()

<b>Question 2.5:</b> Based on the plot, what predictions can we make about future PM2.5 concentration levels in Victorville?

<b>Solution:</b> PM2.5 levels will fall in future years.

---


## Section 3: Single Linear Regression with `scikit-learn` <a id='section3'></a>

Now that we've learned how to calculate the regression line and regression coefficients using the normal equation, we will learn how to use the simple single linear regression tool in [`scikit-learn`](http://scikit-learn.org/stable/), a popular Python package for machine learning algorithms. Their documentation is quite good, so feel free to browse if you would like to learn the details behind how their functions work.

For this section, we will use `scikit-learn` on the yearly PM2.5 dataset from the previous section to compare with the results we obtained from the use of the normal equations.

### Using `scikit-learn`

<b>Question 3.1:</b> Should the output of the `scikit-learn` linear regression function be the same as the one from the normal equations?

<b>Answer:</b> YOUR ANSWER HERE

<b>Solution:</b> Yes, it should.

The `scikit-learn` package has a `linear_model` object upon which you can call `LinearRegression()` to generate a linear regression object:

`lm = linear_model.LinearRegression()`

The `.fit()` method of `lm` takes in arrays X and y, where X is a data frame of independent variables and y is a data frame of the dependent variable, or our "target" data.

<b>Question 3.2:</b> Using `scikit-learn`, let's fit a linear regression model to predict PM2.5 concentrations by year using the `pm25_victorville` data frame, and since we're working only with single linear regression, let X be a data frame of our independent variable and our arbitrary `'Intercept'` column, and let y be our target data. What should we set X and y to be, i.e. what is our independent variable and target variable?

In [None]:
# SOLUTION
X = pm25_victorville[['Year', 'Intercept']]
y = pm25_victorville['Arithmetic Mean']
lm_victorville = linear_model.LinearRegression()
lm_victorville.fit(X, y)
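A note on the `'Intercept'` column: `LinearRegression` fits its own intercept by default (`fit_intercept=True`), so the column of ones is not needed by `scikit-learn` the way it was for the normal equation. The sketch below, on hypothetical toy data, compares a fit with and without that column.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit once with an extra column of ones and once with x alone
X_with_ones = np.column_stack([x, np.ones(len(x))])
lm_with_ones = linear_model.LinearRegression().fit(X_with_ones, y)
lm_plain = linear_model.LinearRegression().fit(x.reshape(-1, 1), y)

print(lm_with_ones.coef_, lm_with_ones.intercept_)
print(lm_plain.coef_, lm_plain.intercept_)
```

The slope and `intercept_` agree between the two fits, and the coefficient attached to the column of ones comes out as zero, which is why reading the slope from `coef_[0]` works in either setup.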

<b>Question 3.3:</b> Now that we've fitted a linear model to `lm_victorville`, we can use it to predict the PM2.5 concentrations for each year. Our linear model has a `.predict()` method, which takes in X and returns an array of predicted values of y. We can then plot these points using `matplotlib` and compare the regression line with the observed data points. Generate `y_prediction` and plot the `pm25_victorville` data as well as the regression line. Again, make sure to give the plot a title, label the axes, and choose a range for the xticks that makes sense.

In [None]:
# SOLUTION
y_prediction = lm_victorville.predict(X)

plt.scatter(pm25_victorville['Year'], y, color='black')
plt.plot(pm25_victorville['Year'], y_prediction, color='red', linewidth=3)

plt.title('Annual PM2.5 Concentration in Victorville', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('PM2.5 Micrograms/cubic meter (LC)', fontsize=13)
plt.xticks(np.arange(1999, 2017, 2))
plt.show()

Compare this graph with the one we generated with the normal equations. Are they similar?
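One way to make this comparison numerical rather than visual is to fit the same data both ways and compare the coefficients. The sketch below does this on hypothetical noisy data; the seed, the true slope of 3, and the true intercept of 5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical noisy data around y = 3x + 5
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=len(x))

# Normal equation with the design matrix [x, 1]
X = np.column_stack([x, np.ones(len(x))])
theta = np.linalg.solve(X.T @ X, X.T @ y)

# scikit-learn fit on x alone (it adds the intercept itself)
lm = linear_model.LinearRegression()
lm.fit(x.reshape(-1, 1), y)

print('Normal equation:', theta)
print('scikit-learn:   ', lm.coef_[0], lm.intercept_)
```

The two pairs of numbers should match up to floating-point rounding, since both methods minimize the same sum of squared residuals.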

Now that we've learned how to use the linear regression tool in `scikit-learn` to generate plots, let's do further analysis on the outputs. Namely, let's look at two coefficients that our linear regression object stores &mdash; the intercept and slope.

<b>Question 3.4:</b> Browse through the [`scikit-learn`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) documentation to find out how to call the intercept and slope attributes of `lm_victorville`, and print them.

*Hint: The slope is given as a coefficient.*

In [None]:
# SOLUTION
intercept = lm_victorville.intercept_
slope = lm_victorville.coef_
print('Intercept:', intercept)
print('Slope:', slope[0])

<b>Question 3.5:</b> In the context of the plot we generated, try to make sense of our intercept and slope. What do they mean? Write down an explanation. Keeping in mind the range of our axes, does our intercept make sense in relation to the data? What can we predict will happen in future years from the slope? Also, write down a possible explanation for causality (Time is not the causal variable &mdash; it is just correlated with other things. What might those be?).

**Solution**:
- Intercept means that at x = 0, or year 0, our predicted concentration level is 871.732. In terms of our data, the intercept is nonsensical. 
- Slope means that for every year, we predict the concentration level falls by 0.429. From the slope, we can predict in future years that the concentration levels will continue to fall. Possible causal variable could be increased effort to reduce pollutant concentration (although we can't make any claims about causality).
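The intercept looks nonsensical only because year 0 is far outside the data. As an optional illustration, the sketch below fits hypothetical concentrations against the raw year and against years since 2000, using `np.polyfit` with degree 1 (which returns the slope first, then the intercept). The data values are made up.

```python
import numpy as np

# Hypothetical annual concentrations falling by 0.4 per year
years = np.array([2000.0, 2005.0, 2010.0, 2015.0])
conc = np.array([14.0, 12.0, 10.0, 8.0])

# Fit against the raw year: the intercept is the prediction for year 0
slope, intercept = np.polyfit(years, conc, 1)

# Fit against years since 2000: the intercept is the prediction for the year 2000
slope_c, intercept_c = np.polyfit(years - 2000, conc, 1)

print(slope, intercept)
print(slope_c, intercept_c)
```

Shifting the time axis leaves the slope unchanged but turns the intercept into the predicted concentration in the reference year, which is much easier to interpret.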

### Linear Regression on `pm25_ca`

Now that we've gotten practice with using the single linear regression function on a sample dataset, we are now able to observe the spatial distribution of annual changes in pollutant concentration for all locations in the state.

<b>Question 3.6:</b> Use what we've learned in this homework on the `pm25_ca` dataset to estimate and print out the coefficient (that is, PM2.5 concentration versus time) for all of California, and create the corresponding scatter plot and regression line. As always, be sure to give the plot proper formatting.

In [None]:
# SOLUTION
# Our solution took 7 lines to make the model and 7 lines for plotting and formatting

lm_ca = linear_model.LinearRegression()
pm25_ca['Intercept'] = np.ones(len(pm25_ca))
X = pm25_ca[['Year', 'Intercept']]
y = pm25_ca['Arithmetic Mean']
lm_ca.fit(X, y)
y_prediction = lm_ca.predict(X)
slope = lm_ca.coef_

plt.scatter(pm25_ca['Year'], y, color='black', marker='.')
plt.plot(pm25_ca['Year'], y_prediction, color='red', linewidth=3)

plt.title('Annual PM2.5 Concentration in California', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('PM2.5 Micrograms/cubic meter (LC)', fontsize=13)
plt.xticks(np.arange(2000, 2020, 5))
plt.show()

print('Coefficient:', slope[0])

<b>Question 3.7:</b> Fill out the markdown cell. What trends do you observe? Discuss whether PM2.5 concentration is increasing or decreasing in most California regions. What can we predict will happen in the future?

Then fill out the code cell. What does our model predict will be the average concentration in 2020? What about in 2030?

<b>Solution:</b> We observe a downward trend in PM2.5 concentration throughout California. Using the regression line we predict PM2.5 levels will further decrease in the future.

In [None]:
# SOLUTION
predicted_2020 = lm_ca.coef_[0] * 2020 + lm_ca.intercept_
predicted_2030 = lm_ca.coef_[0] * 2030 + lm_ca.intercept_
print('Prediction for 2020:', predicted_2020, 'Micrograms/cubic meter (LC)')
print('Prediction for 2030:', predicted_2030, 'Micrograms/cubic meter (LC)')

----
## Submission

Congrats, you've finished homework 5!

Before you submit, click **Kernel** --> **Restart & Clear Output**. Then, click **Cell** --> **Run All**. Then, go to the toolbar and click **File** --> **Download as** --> **.html** and submit the file through bCourses.


---

## Bibliography

- Adi Bronshtein - Referred to KNN concepts. https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7

- Anwar A. Ruff - Used normal equation example as model. https://github.com/aaruff/Course-MachineLearning-AndrewNg/blob/master/NormalEquation.ipynb

- Introduction to Statistical Learning - Referred to KNN concepts. https://www-bcf.usc.edu/~gareth/ISL/

- Manu Jeevan - Adapted scikit-learn techniques. http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/

- Maps of World - Obtained latitude/longitude of CA cities and towns. https://www.mapsofworld.com/usa/states/california/lat-long.html

- scikit-learn.org - Referred to scikit-learn documentation. http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

- Shawon Ashraf - Adapted normal equation implementation techniques. https://www.c-sharpcorner.com/article/normal-equation-implementation-from-scratch-in-python/

---
Notebook developed by: Joshua Asuncion

Data Science Modules: http://data.berkeley.edu/education/modules
