In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab10.ipynb")

# Lab 10: Linear Regression


**Recommended Readings**: 

* [The Regression Line](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)
* [Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)
* [Least Squares Regression](https://www.inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that ask for a written explanation, you **must** provide your answer in the designated space. **Moreover, throughout this lab and all future assignments, please do not reassign variables!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you were passing previously!


**Note: This lab has hidden tests. Even if the visible tests say 100% passed, your final grade may not be 100%. We will run more correctness tests once everyone has submitted.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline and use the FiveThirtyEight plot style.
%matplotlib inline
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from datetime import datetime

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Part 1. Triple Jump Distances vs. Vertical Jump Heights 

Does skill in one sport imply skill in a related sport?  The answer might be different for different activities. Let's find out whether it's true for the [triple jump](https://en.wikipedia.org/wiki/Triple_jump) (a horizontal jump similar to a long jump) and the [vertical jump](https://en.wikipedia.org/wiki/Vertical_jump).  Since we're learning about linear regression, we will look specifically for a *linear* association between skill level in the two sports.

The following data was collected by observing 40 collegiate-level soccer players. Each athlete's distances in both events were measured in centimeters. Run the cell below to load the data.

In [None]:
# Run this cell to load the data
jumps = pd.read_csv('triple_vertical.csv')
jumps.head(10)

<br>

--- 

### Question 1.1 

Create a function `standard_units` that converts the values in `data`, which may be a NumPy array or a pandas Series, to standard units.
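
For reference, the usual definition of standard units subtracts the mean from each value and divides by the standard deviation. The symbols below ($x_i$, $\bar{x}$, $\sigma_x$, $n$) are notation introduced here for illustration, and $\sigma_x$ is assumed to be the population standard deviation, which is what `np.std` computes by default:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Note that calling `.std()` directly on a pandas Series divides by $n - 1$ by default (`ddof=1`), so it gives a slightly different value than the formula above.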


In [None]:
def standard_units(data):
    ...

In [None]:
grader.check("q1_1")

<br>

--- 

### Question 1.2

Now, using the `standard_units` function, define the function `correlation`, which computes the correlation between `x` and `y` (where `x` and `y` can each be a NumPy array or a pandas Series).
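
As a rough illustration of the idea, the sketch below computes $r$ on made-up numbers as the mean of the products of the two variables in standard units, then compares the result with NumPy's built-in `np.corrcoef`. The helper names `to_standard_units` and `toy_correlation` and the toy arrays are hypothetical and are not part of the lab.

```python
import numpy as np

def to_standard_units(values):
    """Hypothetical helper: center on the mean, scale by the population SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # ndarray.std() uses ddof=0

def toy_correlation(x, y):
    """r as the mean of the products of x and y in standard units."""
    return np.mean(to_standard_units(x) * to_standard_units(y))

# Made-up numbers, not the lab data.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r_toy = toy_correlation(toy_x, toy_y)
r_numpy = np.corrcoef(toy_x, toy_y)[0, 1]  # independent cross-check
print(r_toy, r_numpy)
```

The two values should agree up to floating-point rounding, which is a quick sanity check for any correlation function.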


In [None]:
def correlation(x, y):
    ...

In [None]:
grader.check("q1_2")

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.3

Before running a regression, it's important to see what the data looks like, because our eyes are good at picking out unusual patterns in data.  Draw a scatter plot **that includes the regression line**, with the triple jump distances on the horizontal axis and the vertical jump heights on the vertical axis.

See the documentation for seaborn's `regplot` [here](https://seaborn.pydata.org/generated/seaborn.regplot.html#seaborn.regplot) for instructions on how to have Python draw the regression line automatically.
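
As a minimal sketch of how `regplot` is typically called, the example below uses synthetic data with made-up column names (`x_feature`, `y_feature`), not the lab's `jumps` data. The `ci=None` argument is an optional choice here that hides the shaded confidence band.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data for illustration only; the column names are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x_feature": rng.normal(50, 10, size=30)})
toy["y_feature"] = 0.5 * toy["x_feature"] + rng.normal(0, 5, size=30)

# regplot draws the scatter points and overlays a fitted regression line.
# ci=None turns off the shaded confidence band around the line.
ax = sns.regplot(data=toy, x="x_feature", y="y_feature", ci=None)
ax.set_xlabel("x_feature (made-up units)")
ax.set_ylabel("y_feature (made-up units)")
```

`regplot` returns the matplotlib Axes it drew on, so labels and titles can be adjusted afterwards as shown.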




In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br> 

---

### Question 1.4

Does the correlation coefficient $r$ look closest to 0, 0.5, or -0.5? Use the visualization to explain.


*Fill in your answer here*

<!-- END QUESTION -->

<br>

---

### Question 1.5

Create a function called `parameter_estimates` that takes in the argument `df`, a two-column DataFrame whose first column holds the $x$ values and whose second column holds the $y$ values. It should return an array with three elements: the **(1) correlation coefficient** of the two columns and the **(2) slope** and **(3) intercept** of the regression line that predicts the second column from the first, in original units.
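
For reference, the recommended readings express the slope and intercept of the regression line in original units in terms of $r$, the means, and the standard deviations of the two variables. The symbols $\bar{x}$, $\bar{y}$, $\sigma_x$, and $\sigma_y$ are notation introduced here for illustration:

$$\text{slope} = r \cdot \frac{\sigma_y}{\sigma_x}, \qquad \text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$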

*Hint:* This is a rare occasion where it’s better to implement the function using column indices instead of column names, in order to be able to call this function on any table. 
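
To illustrate the hint, the sketch below selects columns by position with `DataFrame.iloc` on a made-up table. The DataFrame and its column names are hypothetical. Positional selection keeps working after the columns are renamed, which is why it suits a function meant for any two-column table.

```python
import pandas as pd

# Made-up table for illustration only.
toy = pd.DataFrame({"first_name": [1, 2, 3], "second_name": [10, 20, 30]})

# iloc[:, k] selects every row of the k-th column, whatever it is called.
first_col = toy.iloc[:, 0]
second_col = toy.iloc[:, 1]

# Renaming the columns does not affect positional access.
renamed = toy.rename(columns={"first_name": "a", "second_name": "b"})
print(renamed.iloc[:, 0].tolist(), renamed.iloc[:, 1].tolist())
```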


In [None]:
def parameter_estimates(df):
    ...
    r = ... 
    slope = ...
    intercept = ...
    return np.array([r, slope, intercept])
    

parameters = parameter_estimates(jumps) 
print('r:', parameters.item(0), '; slope:', parameters.item(1), '; intercept:', parameters.item(2))

In [None]:
grader.check("q1_5")

<br>

---

### Question 1.6

Now suppose you want to go the other way and predict a triple jump distance given a vertical jump height. What would the regression parameters of this linear model be? How do they compare to the regression parameters from the model that predicts vertical jump height from triple jump distance (in Question 1.5)?

Set `regression_changes` to an array of 3 Boolean elements, where each element indicates whether the corresponding item returned by `parameter_estimates` changes when vertical and triple are switched as $x$ and $y$. For example, if $r$ and the slope changed but the intercept did not, `regression_changes` would be `np.array([True, True, False])`.

*Hint*: Try to answer this question without running any code. 


In [None]:
regression_changes = ...
regression_changes

In [None]:
grader.check("q1_6")

<br>

---

### Question 1.7 

Let's use the `parameters` (from Question 1.5) to create a function `predict_vertical` that will predict what certain athletes' vertical jump heights would be given their triple jump distances. 

*Note:* Make sure your function works for both a single triple jump value and an array or Series of values.
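
One way to satisfy this requirement is to rely on NumPy's elementwise arithmetic, which treats a single number and an array the same way. The sketch below uses a hypothetical helper `predict_linear` with made-up slope and intercept values. It is not the lab's `predict_vertical`.

```python
import numpy as np

def predict_linear(x, slope, intercept):
    """Hypothetical helper: evaluate slope * x + intercept elementwise."""
    return slope * np.asarray(x, dtype=float) + intercept

# The same expression handles one value or many values.
single = predict_linear(3.0, 2.0, 1.0)
many = predict_linear([0.0, 1.0, 2.0], 2.0, 1.0)
print(single, many)
```

Converting with `np.asarray` is a design choice of this sketch so that plain lists also work. Arithmetic on a pandas Series is already elementwise without it.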

In [None]:
def predict_vertical(triple, parameters):
    # Predict vertical jump distance using the triple jump value and parameters
    # from Question 1.5
    ...
    

pred_vert = predict_vertical(jumps.iloc[0, 0], parameters)
pred_vert

In [None]:
grader.check("q1_7")

<br>

---

### Question 1.8

Let's use `parameters` (from Question 1.5) to predict what certain athletes' vertical jump heights would be given their triple jump distances. 

The world record for the triple jump distance is 18.29 *meters*, set by Jonathan Edwards. What is the prediction for Edwards' vertical jump using this line?

*Hint:* Make sure to convert from meters to centimeters!


In [None]:
triple_record_vert_est = ...
print("Predicted vertical jump distance: {:f} centimeters".format(triple_record_vert_est))

In [None]:
grader.check("q1_8")

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.9

Do you think it makes sense to use this line to predict Edwards' vertical jump? 

*Hint:* Compare Edwards' triple jump distance to the triple jump distances in `jumps`. Is it relatively similar to the rest of the data (shown in Question 1.3)? 


*Enter your answer here*

<!-- END QUESTION -->

<br>

---

### Question 1.10 

Create a function `error_vertical` that calculates the prediction error, defined as the true vertical jump minus the predicted vertical jump. The function takes the triple jump values and the parameters from Question 1.5 as inputs.

Then find the maximum absolute error and assign it to `max_vert_error`, and identify the data point with the largest absolute error and assign it to `max_vert_error_data`.
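
As a small illustration of locating the largest absolute value in an array, the sketch below uses a made-up error array rather than the lab's errors. `np.abs` takes absolute values, and `np.argmax` returns the position of the largest one, which can then be used to look up the matching row.

```python
import numpy as np

# Made-up errors for illustration only.
toy_errors = np.array([1.5, -4.0, 0.5, 3.0])

abs_errors = np.abs(toy_errors)          # magnitude of each error
largest = abs_errors.max()               # size of the biggest miss
position = int(np.argmax(abs_errors))    # where the biggest miss occurs
print(largest, position)
```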

In [None]:
def error_vertical(triple, parameters):
    # Calculate error of true vertical - predicted vertical 
    ...
    

# Pass the triple jump column (position 0), matching the function's `triple` argument.
vert_error = error_vertical(jumps.iloc[:, 0], parameters)
max_vert_error = ...
max_vert_error_data = ...

In [None]:
grader.check("q1_10")

<br><br>

<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Congratulations! You have finished Lab 10!


Running the cell at the end of this notebook will automatically generate a zip file with your autograded answers.

**You are responsible for ensuring your submission follows our requirements. We will not grant regrade requests or extensions for submissions that don't follow instructions.** If you encounter any difficulties with submission, please don't hesitate to reach out to staff prior to the deadline.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)