In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab11.ipynb")

# Lab 11: Least Squares 

**Recommended Readings**: 

* [The Regression Line](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)
* [Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)
* [Least Squares Regression](https://www.inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)


Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure not to re-assign variables later in the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you were previously passing!


**Note: This homework has hidden tests on it. That means even though the tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.


In [None]:
# Run this cell to set up the notebook, but please don't change it.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from datetime import datetime

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Part 1. NBA Spreads 

We will again use data from sports betting on NBA games. 

In a basketball game, each team scores some number of points.  Conventionally, the team playing at its own arena is called the "home team", and their opponent is called the "away team".  The winner is the team with more points at the end of the game.

We can summarize what happened in a game by the "**outcome**", defined as **the away team's score minus the home team's score**:

$$\text{outcome} = \text{points scored by the away team} - \text{points scored by the home team}$$

If this number is positive, the away team won.  If it's negative, the home team won. 
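
As a quick illustration of the definition, here is a small sketch on made-up scores. The table and its column names (`away_points`, `home_points`) are hypothetical and are not taken from `spreads.csv`.

```python
import pandas as pd

# Made-up scores for three games; the column names are illustrative only
games = pd.DataFrame({
    "away_points": [102, 95, 110],
    "home_points": [98, 101, 104],
})

# outcome = away team's points minus home team's points
games["outcome"] = games["away_points"] - games["home_points"]
print(games)
```

A positive outcome (first and third rows) means the away team won; a negative outcome (second row) means the home team won.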

In order to facilitate betting on games, analysts at casinos try to predict the outcome of the game. This prediction of the outcome is called the **spread.**


In [None]:
spreads = pd.read_csv("spreads.csv")
spreads.head(10)

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.1 

Create a scatter plot of the outcomes and spreads, with spreads on the horizontal axis.  

*Note:* Make sure to label your axes. 

In [None]:
# Create scatter plot
    

<!-- END QUESTION -->

<br>

---

### Question 1.2 

You will write functions that work for this dataset and for other datasets as well: `standard_units` converts an array to standard units, and `correlation`, `slope`, and `intercept` compute the corresponding quantities for the regression line. 
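
One detail to watch when converting to standard units: NumPy and pandas use different default definitions of the standard deviation. The sketch below uses made-up numbers and assumes that the recommended readings use the population SD (dividing by `n`), which is what `np.std` computes by default.

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_sd = np.std(values.to_numpy())   # population SD: divides by n
sample_sd = values.std()             # pandas default: divides by n - 1
pop_sd_pandas = values.std(ddof=0)   # population SD computed with pandas

print(pop_sd, sample_sd, pop_sd_pandas)
```

Whichever convention you choose, use it consistently in all four functions.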

In [None]:
def standard_units(x):
    # "Convert any array of numbers to standard units."
    return ...

def correlation(df, x, y):
    # Computes the correlation between columns x and y of DataFrame df
    x_su = ...
    y_su = ...
    return ...

def slope(df, x, y):
    # Computes the slope of the regression line
    r = ...
    y_sd = ...
    x_sd = ...
    return ...
    
def intercept(df, x, y):
    # Computes the intercept of the regression line
    x_mean = ...
    y_mean = ...
    return ...
    

spreads_r = correlation(spreads, 'Spread', 'Outcome')
spreads_slope = slope(spreads, 'Spread', 'Outcome')
spreads_intercept = intercept(spreads, 'Spread', 'Outcome')

In [None]:
grader.check("q1_2")

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.3 

Suppose that we create another model that simply predicts the average outcome regardless of the value of the spread. Does this new model minimize the mean squared error? Why or why not?




*Answer Here*

<!-- END QUESTION -->

#### Fitting a Least-Squares Regression Line

Recall that the least-squares regression line is the unique straight line that minimizes root mean squared error (RMSE) among all possible fit lines. Using this property, we can find the equation of the regression line by finding the pair of slope and intercept values that minimize root mean squared error. 
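
To see this property on a small scale, the sketch below compares the RMSE of the least-squares line with the RMSE of two nearby lines. The data are made up, the helper `line_rmse` is a hypothetical name used only here, and `np.polyfit` stands in for the least-squares fit; it is not the method you will implement in this lab.

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])

def line_rmse(slope, intercept):
    """Root mean squared error of the line slope * x + intercept on the toy data."""
    predictions = slope * x + intercept
    return np.sqrt(np.mean((y - predictions) ** 2))

# With degree 1, np.polyfit returns the least-squares [slope, intercept]
ls_slope, ls_intercept = np.polyfit(x, y, 1)

print(line_rmse(ls_slope, ls_intercept))        # least-squares line
print(line_rmse(ls_slope + 0.5, ls_intercept))  # steeper line
print(line_rmse(ls_slope, ls_intercept - 1.0))  # lower line
```

The first printed value should be the smallest of the three.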

<br>

---

### Question 1.4 

Define a function called `errors`.  It should take three arguments:
1. a DataFrame `df` like `spreads` (with the same column names and meanings, but not necessarily the same data)
2. the `slope` of a line (a number)
3. the `intercept` of a line (a number).

It should **return an array of the errors** made when a line with that slope and intercept is used to predict outcome from spread for each game in the given table.

*Note*: Make sure you are returning an array of the errors, and not the RMSE. 

In [None]:
def errors(df, slope, intercept):
    ...
    

In [None]:
grader.check("q1_4")

<br>

---

### Question 1.5 

Using `errors`, compute the errors for the line with slope `0.5` and intercept `25` on the `spreads` dataset.  Name this array/Series `outcome_errors`.  Then, make a scatter plot of the errors. 

*Hint:* For the scatter plot of errors, plot the error for each outcome in the dataset.  Put the actual spread on the horizontal axis and the outcome error on the vertical axis. 

In [None]:
outcome_errors = ...
...

In [None]:
grader.check("q1_5")

You should find that the errors are almost all negative.  That means our line is not the best fit to our data.  Let's find a better one.

<br>

---

### Question 1.6 

Define a function called `fit_line`.  It should take a DataFrame like `spreads` (with the same column names and meanings) as its argument.  It should return an array containing the slope (as the first element) and intercept (as the second element) of the least-squares regression line predicting outcome from spread for that table. 

*Hint*: Define a function `rmse` within `fit_line` that takes an array as its argument, where the first element of the array is a slope and the second element is an intercept. `rmse` will use the DataFrame passed into `fit_line` to compute predicted outcomes and then return the root mean squared error between the predicted and actual outcomes. Within `fit_line`, you can call `rmse` the way you would any other function.

You will need to use the `minimize` function of the [`scipy.optimize` library](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html). 

*Hint*: `minimize` returns a result object; its `x` attribute holds the values that minimize the given function, `rmse`. That attribute is an array whose first element is the slope and whose second element is the intercept (the same ordering as the argument to `rmse`). 

*Hint*: The default options can be used for the `minimize` function. 
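
If you have not used `minimize` before, here is a minimal sketch on a made-up function that is unrelated to the lab data. The function `bowl` and the starting guess of zeros are assumptions chosen for illustration; note that `minimize` needs a starting guess `x0` even when every other option is left at its default.

```python
import numpy as np
from scipy.optimize import minimize

def bowl(params):
    """A made-up function of two numbers that is smallest at (3, -1)."""
    a, b = params
    return (a - 3) ** 2 + (b + 1) ** 2

# x0 is the starting guess; it has one entry per value being optimized
result = minimize(bowl, x0=np.array([0.0, 0.0]))

print(result.x)    # array of the minimizing values, close to [3, -1]
print(result.fun)  # value of the function at that point, close to 0
```

The function being minimized takes a single array argument, which is why `rmse` should accept the slope and intercept packed into one array.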


In [None]:
from scipy.optimize import minimize

def fit_line(df):
    # Your code may need more than 1 line below here.
    def rmse(...):
        return ... 
    return ... 
    
# Here is an example call to your function.  To test your function,
# figure out the right slope and intercept by hand.
example_df = pd.DataFrame({
    "Spread": np.array([0, 1]),
    "Outcome": np.array([1, 3])})
fit_line(example_df)

In [None]:
grader.check("q1_6")

<br>

---

### Question 1.7 

Use `fit_line` to fit a line to `spreads`, and assign the output to `best_line`. Assign the first and second elements in `best_line` to `best_line_slope` and `best_line_intercept`, respectively.

Then, set `new_errors` to the array of errors that we get by calling `errors` with our new line. The provided code will graph the corresponding residual plot with a best fit line. 

*Hint:* Make sure that the residual plot makes sense. What qualities should the best fit line of a residual plot have?

In [None]:
best_line = ...
best_line_slope = ...
best_line_intercept = ...

new_errors = ...

# This code displays the residual plot, given your values for the best_line_slope and best_line_intercept
sns.regplot(pd.DataFrame({"Spread": spreads["Spread"], "Outcome errors":  new_errors}), 
            x='Spread', y='Outcome errors')

# This just prints your slope and intercept
"Slope: {:g} | Intercept: {:g}".format(best_line_slope, best_line_intercept)

In [None]:
grader.check("q1_7")

<br><br>

<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Part 2 - Plovers 

The [Snowy Plover](https://www.audubon.org/field-guide/bird/snowy-plover) is a tiny bird that lives on the coast in parts of California and elsewhere. It is so small that it is vulnerable to many predators, including people and dogs that don't look where they are stepping when they go to the beach. It is considered endangered in many parts of the U.S.

The data are about the eggs and newly-hatched chicks of the Snowy Plover. Here's a picture of [a parent bird incubating its eggs](http://cescos.fau.edu/jay/eps/articles/snowyplover.html).

<img src="plover_and_eggs.jpeg" alt="Plover and Eggs">

The data were collected at the Point Reyes National Seashore by a former student at UC Berkeley. The goal was to see how the size of an egg could be used to predict the weight of the resulting chick. The bigger the newly-hatched chick, the more likely it is to survive.

<img src="plover_and_chick.jpeg" alt="Plover and Chick">

Each row of the table below corresponds to one Snowy Plover egg and the resulting chick. Note how tiny the bird is:

- `Egg Length` and `Egg Breadth` (widest diameter) are measured in millimeters
- `Egg Weight` and `Bird Weight` are measured in grams; for comparison, a standard paper clip weighs about one gram

In [None]:
birds = pd.read_csv('snowy_plover.csv')
birds.head(8)

In this investigation, we will be using the egg weight to predict bird weight. Run the cell below to create a scatter plot of the egg weights and bird weights.

In [None]:
sns.scatterplot(birds, x='Egg Weight', y='Bird Weight');

Looking at the scatter plot of our sample, we observe a linear relationship between egg weight and bird weight. 

<br>

---

### Question 2.1 

Using the functions you defined above, determine the correlation between `Egg Weight` and `Bird Weight`. 

The functions you created should be generic enough to work not only on the `spreads` dataset above, but on this new `birds` dataset as well.

In [None]:
birds_r = ...

In [None]:
grader.check("q2_1")

<br>

---

### Question 2.2 

Next, you will determine the slope and intercept for the least squares regression line.  If you try using the `fit_line` function implemented in Question 1.6, you will get an error, because it expects a DataFrame like `spreads` in the `errors` function (implemented in Question 1.4). 

Therefore, create a generic `errors_generic` function that takes in a DataFrame (where the first column is the feature and the second column is what is to be predicted), the slope, and the intercept. 

Create a new function `fit_line_generic` that will work for any such DataFrame (where the first column is the feature and the second column is what is to be predicted).

Use the new `errors_generic` and `fit_line_generic` functions to find the slope and intercept for the plover data. 
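
One way to write code that does not depend on column names is to select columns by position. The sketch below uses a made-up table with placeholder column names; it shows how to build a two-column DataFrame in a chosen order and then read its first and second columns with `iloc`.

```python
import pandas as pd

# Made-up table; the column names are placeholders
wide = pd.DataFrame({
    "extra": [0.1, 0.2, 0.3],
    "feature": [1.0, 2.0, 3.0],
    "target": [2.0, 4.0, 6.0],
})

# Selecting with a list of names keeps only those columns, in that order
two_cols = wide[["feature", "target"]]

x = two_cols.iloc[:, 0]  # first column, whatever it is called
y = two_cols.iloc[:, 1]  # second column

print(x.name, y.name)
```

Because `iloc` ignores the labels, the same code works for any DataFrame arranged with the feature first and the predicted quantity second.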

In [None]:
def errors_generic(df, slope, intercept):
    ...

def fit_line_generic(df):
    def rmse(...):
        return ... 
    return ... 
    

...
[birds_slope, birds_intercept]

In [None]:
grader.check("q2_2")

<br><br>

<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Congratulations! You have finished Lab 11!

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)