In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")

# Homework 10: Linear Regression


**Recommended Readings**: 

* [The Regression Line](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html)
* [Method of Least Squares](https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares.html)
* [Least Squares Regression](https://www.inferentialthinking.com/chapters/15/4/Least_Squares_Regression.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that ask for a written explanation, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please do not reassign variables within the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!


**Note: This homework has hidden tests. Even if the visible tests say 100% passed, your final grade is not guaranteed to be 100%. We will run more tests for correctness once everyone turns in the homework.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline and apply a plotting style (pyplot is already imported above as plt).
import matplotlib
%matplotlib inline
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from datetime import datetime

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Part 1. Cryptocurrencies

Imagine you're an investor in December 2017. Cryptocurrencies, online currencies backed by secure software, are becoming extremely valuable, and you want in on the action!

The two most valuable cryptocurrencies are Bitcoin (BTC) and Ethereum (ETH). Each one has a dollar price attached to it at any given moment in time. For example, on December 1st, 2017, one BTC costs $\$10,859.56$ and one ETH costs $\$424.64.$

For fun, here are the current prices of [Bitcoin](https://www.coinbase.com/price/bitcoin) and [Ethereum](https://www.coinbase.com/price/ethereum)!

**You want to predict the price of ETH at some point in time based on the price of BTC.** Below, we load two [tables](https://www.kaggle.com/jessevent/all-crypto-currencies/data) called `btc` and `eth`. Each has 5 columns:
* `date`, the date
* `open`, the value of the currency at the beginning of the day
* `close`, the value of the currency at the end of the day
* `market`, the market cap or total dollar value invested in the currency
* `day`, the number of days since the start of our data

In [None]:
btc = pd.read_csv('btc.csv')
btc.head(5)

In [None]:
eth = pd.read_csv('eth.csv')
eth.head(5)

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.1

In the cell below, create an overlaid line plot that visualizes the BTC and ETH open prices as a function of the day. Both BTC and ETH open prices should be plotted on the same graph. 

*Hint:* We suggest creating a DataFrame that combines the "date", "eth open" and "btc open" prices. Then use seaborn's [`lineplot` function](https://seaborn.pydata.org/generated/seaborn.lineplot.html) to draw both series on the same axes.



In [None]:
# Create a line plot of btc and eth open prices as a function of time
combined = ... 



<!-- END QUESTION -->

<br>

---

### Question 1.2 

Now, calculate the correlation coefficient between the opening prices of BTC and ETH using the `numpy` [`corrcoef` function](https://numpy.org/devdocs/reference/generated/numpy.corrcoef.html). 
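Note that `np.corrcoef` returns a correlation matrix rather than a single number. The sketch below uses two made-up arrays (`toy_x` and `toy_y` are hypothetical names) to show where the coefficient sits in that matrix.

```python
import numpy as np

# Hypothetical toy arrays; toy_y is exactly 2 * toy_x, so the correlation is 1.
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# For two 1-D inputs, the result is a 2x2 matrix.
toy_matrix = np.corrcoef(toy_x, toy_y)

# The off-diagonal entry is the correlation between the two inputs.
toy_r = toy_matrix[0, 1]
print(toy_matrix.shape, toy_r)
```

The diagonal entries are each variable's correlation with itself, so they are always 1.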


In [None]:
r = ...
r

In [None]:
grader.check("q1_2")

<br>

---

### Question 1.3 

Create a function called `parameter_estimates` that takes in the argument `df`, a two-column DataFrame where the first column is the x-axis and the second column is the y-axis. It should return an array with three elements: the **(1) correlation coefficient** of the two columns and the **(2) slope** and **(3) intercept** of the regression line that predicts the second column from the first, in original units. 

*Hint:* This is a rare occasion where it’s better to implement the function using column indices instead of column names, in order to be able to call this function on any table. 
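To make the "original units" idea concrete, here is a sketch on a made-up two-column table. It assumes the standard relationships from the recommended readings: the slope is $r$ times the ratio of the standard deviation of y to that of x, and the line passes through the point of averages. `toy_df` and every `toy_*` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table: first column is x, second column is y.
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [3.0, 5.0, 8.0, 9.0]})

# Select columns by position so the code does not depend on column names.
toy_x = toy_df.iloc[:, 0].to_numpy()
toy_y = toy_df.iloc[:, 1].to_numpy()

toy_r = np.corrcoef(toy_x, toy_y)[0, 1]

# Slope in original units: r * (SD of y) / (SD of x).
toy_slope = toy_r * np.std(toy_y) / np.std(toy_x)

# The regression line passes through the point of averages.
toy_intercept = np.mean(toy_y) - toy_slope * np.mean(toy_x)

# Cross-check against NumPy's least-squares fit of a degree-1 polynomial,
# which returns [slope, intercept].
check_slope, check_intercept = np.polyfit(toy_x, toy_y, 1)
print(toy_slope, toy_intercept, check_slope, check_intercept)
```

Using `iloc[:, 0]` and `iloc[:, 1]` is what lets the same logic run on any two-column table, whatever its columns are called.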


In [None]:
def parameter_estimates(df):
    ...
    r = ... 
    slope = ...
    intercept = ...
    return np.array([r, slope, intercept])
    
btc_eth_df = pd.DataFrame({'btc': btc['open'], 'eth': eth['open']})
parameters = parameter_estimates(btc_eth_df) 
print('r:', parameters.item(0), '; slope:', parameters.item(1), '; intercept:', parameters.item(2))

In [None]:
grader.check("q1_3")

<br>

---

### Question 1.4

Write a function `eth_predictor` which takes an opening BTC price and predicts the opening price of ETH. Again, it will be helpful to use the function `parameter_estimates` that you defined above. 

*Hint:* Think about the input to `parameter_estimates`. 
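Once a slope and an intercept are known, a prediction is simply a point on the line. The sketch below uses a hypothetical helper, `toy_predict`, with made-up slope and intercept values; it is not the function the question asks for.

```python
import numpy as np

def toy_predict(x_value, slope, intercept):
    """Return the height of the line slope * x + intercept at x_value."""
    return slope * x_value + intercept

# Made-up line: slope 2, intercept 1.
single = toy_predict(3.0, 2.0, 1.0)

# The same arithmetic works elementwise on a NumPy array of x values.
many = toy_predict(np.array([0.0, 1.0, 2.0]), 2.0, 1.0)
print(single, many)
```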


In [None]:
def eth_predictor(btc_price):
    ...
    r = ... 
    slope = ...
    intercept = ...
    return ...
    

In [None]:
grader.check("q1_4")

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.5

Now, using the `eth_predictor` function you just defined, make a scatter plot with BTC prices along the x-axis and both real and predicted ETH prices along the y-axis. The color of the dots for the real ETH prices should be different from the color for the predicted ETH prices. 


*Hint:* Think about creating a DataFrame, `eth_predictions`, that has the BTC open price, the ETH open price, and the predicted ETH open price. 
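One way to show actual and predicted values in different colors is to draw two scatter layers on the same axes. The following sketch uses a made-up table, `toy_predictions`, and a made-up line; in the homework, the predictions would come from your own `eth_predictor`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data with an x column and actual y values.
toy_predictions = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "actual y": [3.0, 5.0, 8.0, 9.0],
})

# Made-up line used only for illustration.
toy_predictions["predicted y"] = 2.1 * toy_predictions["x"] + 1.0

fig, ax = plt.subplots()

# Two scatter layers on the same axes, each with its own color and label.
ax.scatter(toy_predictions["x"], toy_predictions["actual y"],
           color="tab:blue", label="actual")
ax.scatter(toy_predictions["x"], toy_predictions["predicted y"],
           color="tab:orange", label="predicted")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Because the predicted values all lie on one line, the second layer traces out the regression line over the cloud of actual points.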


In [None]:
# Draw scatterplot 
    

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br>

---

### Question 1.6

Considering the shape of the scatter plot of the true data, is the model we used reasonable? If so, what features or characteristics make this model reasonable? If not, what features or characteristics make it unreasonable? 


*Enter your answer here*

<!-- END QUESTION -->

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Part 2: Evaluating NBA Game Predictions

#### A Brief Introduction to Sports Betting

In a basketball game, each team scores some number of points.  Conventionally, the team playing at its own arena is called the "home team", and their opponent is called the "away team".  The winner is the team with more points at the end of the game.

We can summarize what happened in a game by the "**outcome**", defined as **the away team's score minus the home team's score**:

$$\text{outcome} = \text{points scored by the away team} - \text{points scored by the home team}$$

If this number is positive, the away team won.  If it's negative, the home team won. 
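As a quick worked example of the sign convention, with made-up scores that do not come from `spreads.csv`:

```python
# Hypothetical final scores for one game.
away_points = 104
home_points = 98

# Outcome is the away score minus the home score.
toy_outcome = away_points - home_points

# Positive outcome: away team won. Negative outcome: home team won.
toy_winner = "away" if toy_outcome > 0 else "home"
print(toy_outcome, toy_winner)
```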

In order to facilitate betting on games, analysts at casinos try to predict the outcome of the game. This prediction of the outcome is called the **spread.**


In [None]:
spreads = pd.read_csv("spreads.csv")
spreads

Here's a scatter plot of the outcomes and spreads, with the spreads on the horizontal axis.

In [None]:
spreads.plot.scatter("Spread", "Outcome")

From the scatter plot, you can see that the outcome is never 0 and that the spread is 0 in only one case. A game of basketball never ends in a tie: one team has to win, so the outcome can never be 0. The spread is almost never 0 because it's chosen to estimate the outcome.

Let's investigate how well the casinos are predicting game outcomes.

One question we can ask is: Is the casino's prediction correct on average? In other words, for every value of the spread, is the average outcome of games assigned that spread equal to the spread? If not, the casino would apparently be making a systematic error in its predictions.
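To illustrate what "correct on average" would mean, the sketch below groups a made-up table by spread and compares each group's average outcome with the spread itself. The table `toy_spreads` and its values are hypothetical; only the column names `Spread` and `Outcome` mirror the real data.

```python
import pandas as pd

# Hypothetical games: two with spread -3 and three with spread 5.
toy_spreads = pd.DataFrame({
    "Spread": [-3.0, -3.0, 5.0, 5.0, 5.0],
    "Outcome": [-7, 1, 10, 2, 6],
})

# Average outcome among games that share the same spread.
avg_by_spread = toy_spreads.groupby("Spread")["Outcome"].mean()

# Gap between the average outcome and the spread; 0 means "correct on average".
toy_gap = avg_by_spread - avg_by_spread.index.to_numpy()
print(toy_gap)
```

In this toy table the games with a spread of -3 average exactly -3, while the games with a spread of 5 average 6, a gap of 1 point.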

<br>

---

### Question 2.1 

Compute the correlation coefficient between outcomes and spreads. 

*Note:* You can use the `np.corrcoef` function or the `correlation` function from Lab 10.


In [None]:
spread_r = ...
spread_r

In [None]:
grader.check("q2_1")

<br>

---

### Question 2.2

Among games with a spread between 3.5 and 6.5 (including both 3.5 and 6.5), what was the average outcome? 
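Filtering with both endpoints included can be done with a boolean mask. The sketch below uses a made-up table (`toy_games` is a hypothetical name) and the pandas `Series.between` method, which includes both bounds by default; a pair of `>=` and `<=` comparisons would work equally well.

```python
import pandas as pd

# Hypothetical games, including spreads exactly on both boundaries.
toy_games = pd.DataFrame({
    "Spread": [3.0, 3.5, 5.0, 6.5, 7.0],
    "Outcome": [1, 4, -2, 7, 9],
})

# between() is inclusive on both ends by default, so 3.5 and 6.5 are kept.
in_range = toy_games["Spread"].between(3.5, 6.5)
toy_subset = toy_games[in_range]

# Average outcome among the rows that passed the filter.
toy_average = toy_subset["Outcome"].mean()
print(len(toy_subset), toy_average)
```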


In [None]:
spreads_around_5 = ...
spread_5_outcome_average = ...
print("Average outcome for spreads around 5:", spread_5_outcome_average)

In [None]:
grader.check("q2_2")

<br>

---

### Question 2.3

Use the function `parameter_estimates` that you defined earlier to compute the least-squares linear regression line that predicts outcomes from spreads, in original units. We have provided a two-column DataFrame for you in the cell below, with the first column representing `Spread` (x) and the second column representing `Outcome` (y). Pass it as the argument to the function. 


In [None]:
compute_df = spreads[['Spread', 'Outcome']]
estimates = ...
spread_slope = ...
spread_intercept = ...
print("Slope:", round(spread_slope, 3))
print("Intercept:", round(spread_intercept, 3))

In [None]:
grader.check("q2_3")


<br><br>

<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Congratulations! You have finished HW 10!

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)