# Homework 4 Lab: Forecasting Points Per Game (PPG) in the NBA Part I

We will be using data scraped from https://www.basketball-reference.com/ to explore different forecasting strategies to predict who will have the highest PPG in the NBA in a particular week. We'll use data from the 2020-21 NBA season between December 2020 and March 2021 to forecast the highest PPG players in April 2021, the final month of the regular season. In this lab, we will focus on the approach of **reference class forecasting**. Part II in a future week will explore another approach. This lab has many open-ended questions, and there is often not one correct answer. We will grade submissions mostly based on your reasoning. 

## Gradescope Submission 
To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run All), then export it as a PDF (File > Download as > PDF) and submit it to Gradescope. You are also welcome to use R or another language if you are more comfortable with data analysis in that language.


**This assignment should be completed and submitted before Monday, February 14, 2022 at 11:59 PM PT.**

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

### Data
We have collected the raw data you will be using into the file "scraped_games_2020-21.csv", which you can download from the class website [here](). Here we will import it and provide small amounts of starter code with the pandas library in Python, but feel free to use other languages. If you're interested in seeing the code we ran to scrape this data (which could be useful for your final project), we'll post it in a separate file on the course website.

In [3]:
df = pd.read_csv("scraped_games_2020-21.csv", index_col=0, parse_dates=['Date'])
df.head()

Unnamed: 0,Date,Player,Pts
0,2020-12-22,"Wiggins,Andrew",13
1,2020-12-22,"Curry,Stephen",20
2,2020-12-22,"Oubre,Kelly",6
3,2020-12-22,"Wiseman,James",19
4,2020-12-22,"Paschall,Eric",6


### Optional preprocessing steps

In [4]:
# Split into pre-April and April data; here we've named them train and test, but the variable names don't matter.
df_train = df[df["Date"] < "2021-3-29"]
df_test = df[df["Date"] >= "2021-3-29"]

In [5]:
# We group the games by week and find the total points scored by each player that week
# The flag freq='W-MON' groups the data into weekly bins anchored on Mondays (with the pandas defaults, each bin ends on, and is labeled by, a Monday), as described in https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
df_train_weekly = df_train.groupby([pd.Grouper(key='Date', freq='W-MON'), 'Player'])['Pts'].sum().reset_index().sort_values(['Date', 'Pts'], ascending=[True,False])
df_test_weekly = df_test.groupby([pd.Grouper(key='Date', freq='W-MON'), 'Player'])['Pts'].sum().reset_index().sort_values(['Date', 'Pts'], ascending=[True,False])
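
Note that the cell above sums points per week, while the questions below ask about points per game. As a minimal sketch of one way to get a weekly PPG instead, the example below runs the same grouping on a small made-up game log and averages over the games each player actually played that week. The toy data and the names `toy`, `weekly`, `total`, `games`, and `ppg` are assumptions for illustration only; they are not part of the starter code.

```python
import pandas as pd

# Toy game log with the same columns as the scraped file (all values are made up).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-05", "2021-01-07", "2021-01-11",
                            "2021-01-12", "2021-01-06", "2021-01-12"]),
    "Player": ["A", "A", "A", "A", "B", "B"],
    "Pts": [30, 20, 10, 40, 25, 15],
})

# Same weekly grouping as the starter code, but we also keep the number of
# games played and the mean points per game in each week.
weekly = (
    toy.groupby([pd.Grouper(key="Date", freq="W-MON"), "Player"])["Pts"]
       .agg(total="sum", games="count", ppg="mean")
       .reset_index()
       .sort_values(["Date", "ppg"], ascending=[True, False])
)
print(weekly)
```

In this toy data, the game on Monday 2021-01-11 lands in the bin labeled 2021-01-11, and the game on Tuesday 2021-01-12 starts the next bin. Whether to require a minimum number of games before a weekly PPG counts is a judgment call left to you.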

# Approach: Reference class forecasting

We can look at how often each player had the highest PPG over a week. If the weeks we are forecasting are similar to weeks in the rest of the season, this is a reasonable reference class to use. 

## Q1. Collect reference class data

a) For each of the 14 weeks between December 2020 and March 2021, which player had the highest PPG?

b) Which players had the highest PPG in more than one week, and how many weeks did they have the highest PPG? 
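
As a hedged sketch of the mechanics (not a solution on the real data), the example below picks the top player in each week from a small made-up weekly table and counts how many weeks each player led. The table `weekly_ppg` and its `ppg` column are hypothetical; you would need to build the equivalent from the scraped data yourself.

```python
import pandas as pd

# Hypothetical weekly PPG table (made-up values), one row per (week, player).
weekly_ppg = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-11"] * 2 + ["2021-01-18"] * 2 + ["2021-01-25"] * 2),
    "Player": ["A", "B"] * 3,
    "ppg": [31.0, 28.5, 27.0, 33.0, 35.5, 30.0],
})

# idxmax gives the row label of the highest PPG within each week;
# .loc then pulls out those rows.
leader_rows = weekly_ppg.loc[weekly_ppg.groupby("Date")["ppg"].idxmax()]
leaders = leader_rows.set_index("Date")["Player"]

# Number of weeks in which each player had the highest PPG.
lead_counts = leaders.value_counts()
print(leaders)
print(lead_counts)
```

If two players tie within a week, `idxmax` keeps only the first one it encounters, so ties would need separate handling.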

In [6]:
# space for work

## Q2. Using Q1, forecast PPG in April 2021 with a naive reference class forecast.
Let's assume that the highest PPG player each week is drawn from a multinomial distribution, where each player $i$ has probability $p_i$ of getting the highest PPG. Next, assume that the data we collected in Q1 gives us exactly those probabilities (i.e., if a player had the highest PPG in 1 out of 14 weeks, $p_i$ for that player is $1/14 \approx 0.07$).

Based on these assumptions, what would be our forecast for who will get the highest PPG in a given week in April? (This should assign a forecast $q_i$ to every NBA player, where $\sum q_i = 1$).
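
Under these assumptions the forecast is just the empirical frequency of each past leader. The sketch below shows that step on a made-up list of 14 weekly leaders; the player names and counts are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical weekly PPG leaders over a 14-week history (made-up names and counts).
leaders = pd.Series(["A"] * 6 + ["B"] * 5 + ["C"] * 3)

# The share of weeks each player led is the naive forecast q_i.
q = leaders.value_counts(normalize=True)
print(q)

# Any player who never led does not appear in q, which means the naive
# forecast implicitly assigns that player probability 0.
```

Here `q` sums to 1 by construction, with `A`, `B`, and `C` receiving 6/14, 5/14, and 3/14.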

In [7]:
# space for work

## Q3. Evaluate the Q2 forecast
We know who was the player with the highest PPG in each of the weeks of April. Imagine we submitted the forecast from Q2 for every week in April. Calculate the average Brier quadratic score of the forecast from Q2 across April's weeks (all 6 weeks that appear in the test dataset), using the resolved outcomes in our dataset.
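
As a rough sketch, assuming the multi-category form of the score, $\sum_i (q_i - o_i)^2$, where $o_i = 1$ for the player who actually led the week and $0$ for everyone else (lower is better), the computation could look like the example below. Please check this against the definition used in lecture, which may differ in sign or scaling. The helper `brier_score`, the forecast, and the list of winners are all made up for illustration.

```python
import pandas as pd

def brier_score(forecast, winner):
    """Quadratic score sum_i (q_i - o_i)^2 for one resolved week (assumed convention)."""
    # Make sure the winner is included even if the forecast gave them no mass.
    players = forecast.index.union([winner])
    q = forecast.reindex(players, fill_value=0.0)
    # Outcome vector: 1 for the actual leader, 0 for everyone else.
    o = pd.Series(0.0, index=players)
    o[winner] = 1.0
    return float(((q - o) ** 2).sum())

# Made-up forecast and made-up weekly outcomes.
forecast = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})
winners = ["A", "D", "B"]

scores = [brier_score(forecast, w) for w in winners]
average_score = sum(scores) / len(scores)
print(scores, average_score)
```

Players who are neither in the forecast nor the winner contribute $(0 - 0)^2 = 0$, so they can be left out of the sum. Note how the week won by `D`, who had forecast probability 0, receives the worst score of the three.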

In [8]:
# space for work

## Q4. Better base rates

We made pretty strong assumptions in Q2, for example that if a player didn’t have the highest PPG in any of the 14 weeks before April, they had 0 probability of getting the highest PPG in future weeks. This seems pretty inaccurate, and in lecture we discussed ways to estimate base rates for events that haven't happened yet.

Modify the forecast from Q2 to give **positive** probability $p_i$ to each player. Justify your forecast. (You can lump players together in groups, as demonstrated in lecture 7, so you don't need to enumerate every player individually. This question is meant to be open-ended; there is not one correct answer we are looking for.)
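
One possible way to do this, shown here only as an illustrative sketch and not necessarily the method from lecture, is additive smoothing: add a pseudo-count to every category, including a lumped group for everyone who never led a week, and then renormalize. The counts, the group label, and the value of `alpha` below are all assumptions.

```python
import pandas as pd

# Hypothetical counts of weeks led (made-up), plus one lumped group for all other players.
lead_counts = pd.Series({"A": 6, "B": 5, "C": 3, "everyone else": 0})

alpha = 1.0  # pseudo-count added to each category; an assumed value you should justify

# Additive smoothing: (count + alpha) / (total count + alpha * number of categories).
smoothed = (lead_counts + alpha) / (lead_counts.sum() + alpha * len(lead_counts))
print(smoothed)
```

With these numbers the four categories receive 7/18, 6/18, 4/18, and 1/18. If the forecast is later scored player by player, the lumped group's probability has to be divided among its members in some way, which is another choice to justify.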

In [9]:
# space for work

## Q5. Evaluate the Q4 forecast

Calculate the average Brier quadratic score of the forecast from Q4 across April's weeks (the 6 weeks that appear in the test dataset), using the resolved outcomes in our dataset. How does the score compare to the score for the Q2 forecast? 

In [10]:
# space for work