# Homework 4 Lab: Forecasting Points Per Game (PPG) in the NBA Part I

We will be using data scraped from https://www.basketball-reference.com/ to explore different forecasting strategies to predict who will have the highest PPG in the NBA in a particular week.

## Objective
We'll use data from the 2020-21 NBA season between December 2020 and March 2021 to forecast the highest PPG players in April 2021, the final month of the regular season. In this lab, we will focus on the approach of **reference class forecasting**.

Part II in a future week will explore another approach. This lab has many open-ended questions, and there is often not one correct answer. We will grade submissions mostly based on your reasoning. 

## Gradescope Submission 
To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run all), then export it as a PDF (File > Save and export notebook as... > Webpdf) and submit it to Gradescope. After downloading, **please double-check that the PDF you generated contains all of your work, including any relevant plots.** You are also welcome to use R or another language if you are more comfortable with data analysis in that language.


**This assignment should be completed and submitted before Wednesday, February 19, 2024 at 11:59pm.**

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

### Data
We have collected the raw data you will be using into the file "scraped_games_2020-21.csv", which you can download from the class website [here](http://www.forecastingclass.com/assets/hw4/scraped_games_2020-21.csv). Here we will import it and provide small amounts of starter code with the pandas library in Python, but feel free to use other languages. If you're interested in seeing the code we ran to scrape this data (which could be useful for your final project), we'll post it in a separate file on the course website.

In [3]:
df = pd.read_csv("scraped_games_2020-21.csv", index_col=0, parse_dates=['Date'])
df.head()

Unnamed: 0,Date,Player,Pts
0,2020-12-22,"Wiggins,Andrew",13
1,2020-12-22,"Curry,Stephen",20
2,2020-12-22,"Oubre,Kelly",6
3,2020-12-22,"Wiseman,James",19
4,2020-12-22,"Paschall,Eric",6


### Optional preprocessing steps

In [4]:
# Split into pre-April and April data; here we've named them train and test, but the variable names don't matter.
df_train = df[df["Date"] < "2021-3-29"]
df_test = df[df["Date"] >= "2021-3-29"]

In [5]:
# We group the games by week and find the total points scored by each player that week
# freq='W-MON' groups the data into weeks ending on Monday, as described in https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
df_train_weekly = df_train.groupby([pd.Grouper(key='Date', freq='W-MON'), 'Player'])['Pts'].sum().reset_index().sort_values(['Date', 'Pts'], ascending=[True,False])
df_test_weekly = df_test.groupby([pd.Grouper(key='Date', freq='W-MON'), 'Player'])['Pts'].sum().reset_index().sort_values(['Date', 'Pts'], ascending=[True,False])
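
To see how `freq='W-MON'` buckets games, here is a small sketch on made-up data (the player name and point values are hypothetical, not taken from the scraped file). It assumes pandas' default behaviour for weekly frequencies, where each bin is closed on the right and labelled by its closing Monday.

```python
import pandas as pd

# Hypothetical toy data: three games for one made-up player.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-22", "2020-12-28", "2020-12-29"]),
    "Player": ["A", "A", "A"],
    "Pts": [10, 20, 30],
})

# A game played on Monday 2020-12-28 falls in the week labelled 2020-12-28,
# while a game on Tuesday 2020-12-29 opens the week labelled 2021-01-04.
weekly = (
    toy.groupby([pd.Grouper(key="Date", freq="W-MON"), "Player"])["Pts"]
    .sum()
    .reset_index()
)
print(weekly)
```

Each label is therefore the last day of its week, which is why the lab's loop treats consecutive labels as (exclusive start, inclusive end) boundaries.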

# Approach: Reference class forecasting

We can look at how often each player had the highest PPG over a week. If the weeks we are forecasting are similar to weeks in the rest of the season, this is a reasonable reference class to use. 

## Q1. Collect reference class data

a) For each of the 14 weeks between December 2020 and March 2021, which player had the highest PPG?

b) Which players had the highest PPG in more than one week, and how many weeks did they have the highest PPG? 

In [6]:
first_date = (pd.to_datetime(df_train["Date"].iloc[0]) - pd.Timedelta(days=1)).to_datetime64()
unique_dates = np.unique(df_train_weekly["Date"])
unique_dates = np.append(first_date, unique_dates)

df_train_weekly["Week Games Played"] = 0

# Iterate over consecutive pairs of dates in unique_dates; each pair bounds one week
for i in range(len(unique_dates) - 1):
    start_date = unique_dates[i]
    end_date = unique_dates[i+1]

    # Count the number of games each player played that week
    weekly_counts = (
        df_train[(df_train["Date"] > start_date) & (df_train["Date"] <= end_date)]
        .groupby("Player")
        .size()
        .reset_index(name="Week Games Played")
    )

    # Assign the number of games played to the rows of the corresponding week
    df_train_weekly.loc[
        (df_train_weekly["Date"] > start_date) & (df_train_weekly["Date"] <= end_date),
        "Week Games Played"
    ] = df_train_weekly["Player"].map(weekly_counts.set_index("Player")["Week Games Played"])

# Fill NaN with 0
df_train_weekly["Week Games Played"] = df_train_weekly["Week Games Played"].fillna(0).astype(int)

In [7]:
df_train_weekly["Week PPG"] = (
    df_train_weekly["Pts"] / df_train_weekly["Week Games Played"].replace(0, np.nan)
).fillna(0)
highest_ppg_per_week = df_train_weekly.loc[
    df_train_weekly
    .groupby("Date")["Week PPG"]
    .idxmax(),
    ["Date", "Player", "Week PPG"]]
highest_ppg_per_week

Unnamed: 0,Date,Player,Week PPG
147,2020-12-28,"Harden,James",39.0
480,2021-01-04,"Curry,Stephen",37.25
844,2021-01-11,"Beal,Bradley",45.0
1408,2021-01-18,"Lillard,Damian",33.25
1943,2021-01-25,"Young,Trae",40.5
1977,2021-02-01,"Beal,Bradley",35.75
2423,2021-02-08,"Curry,Stephen",38.75
2810,2021-02-15,"Dončić,Luka",39.333333
3378,2021-02-22,"Sabonis,Domantas",36.0
3450,2021-03-01,"Antetokounmpo,Giannis",37.0


a) The table above lists the player with the highest PPG in each week.
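
The weekly PPG can also be computed with a single grouped aggregation instead of a loop. The following is a minimal sketch on hypothetical toy data (the sample rows and the names `weekly_ppg` and `leaders` are illustrative assumptions): sum the points and count the games per player-week, divide, then select the top row of each week with `idxmax`.

```python
import pandas as pd

# Hypothetical toy data: two made-up players over two weeks.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-12-22", "2020-12-23", "2020-12-22", "2020-12-29", "2020-12-29"]
    ),
    "Player": ["A", "A", "B", "A", "B"],
    "Pts": [30, 20, 28, 45, 40],
})

# Total points and number of games per (week, player).
weekly_ppg = (
    toy.groupby([pd.Grouper(key="Date", freq="W-MON"), "Player"])["Pts"]
    .agg(total="sum", games="count")
    .reset_index()
)
weekly_ppg["Week PPG"] = weekly_ppg["total"] / weekly_ppg["games"]

# Row with the highest weekly PPG in each week.
leaders = weekly_ppg.loc[
    weekly_ppg.groupby("Date")["Week PPG"].idxmax(),
    ["Date", "Player", "Week PPG"],
]
print(leaders)
```

In the first toy week, player B leads with a single 28-point game over player A's 25-point average across two games, which shows how a player with few games can top a weekly PPG ranking.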

In [8]:
highest_ppg_per_week_count = highest_ppg_per_week["Player"].value_counts().reset_index()
highest_ppg_per_week_count = highest_ppg_per_week_count.rename(columns={"count": "Nb Week Highest PPG"})
highest_ppg_per_week_count

Unnamed: 0,Player,Nb Week Highest PPG
0,"Curry,Stephen",2
1,"Beal,Bradley",2
2,"Dončić,Luka",2
3,"Lillard,Damian",2
4,"Harden,James",1
5,"Young,Trae",1
6,"Sabonis,Domantas",1
7,"Antetokounmpo,Giannis",1
8,"Embiid,Joel",1
9,"Fox,De'Aaron",1


b) Four players had the highest PPG in more than one week: Bradley Beal, Stephen Curry, Luka Dončić, and Damian Lillard.

Each of them had the highest PPG in 2 weeks.

## Q2. Using Q1, forecast PPG in April 2021 with a naive reference class forecast.
Let's assume that the highest PPG player each week is drawn from a multinomial, where each player $i$ has probability $p_i$ of getting the highest PPG. Next assume that the data we collected from Q1 gives us exactly those probabilities (i.e., if a player had the highest PPG in 1 out of 14 weeks, $p_i$ for that player is $1/14 \approx 0.07$).

Based on these assumptions, what would be our forecast for who will get the highest PPG in a given week in April? (This should assign a forecast $q_i$ to every NBA player, where $\sum q_i = 1$).

In [9]:
# space for work
highest_ppg_per_week_count["Forecast qi"] = highest_ppg_per_week_count["Nb Week Highest PPG"] / sum(highest_ppg_per_week_count["Nb Week Highest PPG"])
print("Sum qi =", round(sum(highest_ppg_per_week_count["Forecast qi"])))
highest_ppg_per_week_count

Sum qi = 1


Unnamed: 0,Player,Nb Week Highest PPG,Forecast qi
0,"Curry,Stephen",2,0.142857
1,"Beal,Bradley",2,0.142857
2,"Dončić,Luka",2,0.142857
3,"Lillard,Damian",2,0.142857
4,"Harden,James",1,0.071429
5,"Young,Trae",1,0.071429
6,"Sabonis,Domantas",1,0.071429
7,"Antetokounmpo,Giannis",1,0.071429
8,"Embiid,Joel",1,0.071429
9,"Fox,De'Aaron",1,0.071429
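
Under the multinomial assumption the empirical frequencies are the forecast, so `value_counts(normalize=True)` gives the same $q_i$ in one step. The sketch below rebuilds the list of 14 weekly leaders by hand from the counts in the table above; the `counts` and `leaders` variables are hypothetical helpers for illustration only.

```python
import pandas as pd

# Weekly-leader counts copied from the Q1 table (14 weeks in total).
counts = {
    "Curry,Stephen": 2, "Beal,Bradley": 2, "Dončić,Luka": 2, "Lillard,Damian": 2,
    "Harden,James": 1, "Young,Trae": 1, "Sabonis,Domantas": 1,
    "Antetokounmpo,Giannis": 1, "Embiid,Joel": 1, "Fox,De'Aaron": 1,
}

# Expand the counts into one entry per week, then normalise.
leaders = pd.Series([player for player, c in counts.items() for _ in range(c)])
q = leaders.value_counts(normalize=True)
print(q)
```

Any player absent from `q` implicitly receives a forecast of 0 under this naive approach.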


## Q3. Evaluate the Q2 forecast
We know who was the player with the highest PPG in each of the weeks of April. Imagine we submitted the forecast from Q2 for every week in April. Calculate the average Brier quadratic score of the forecast from Q2 across April's weeks (all 6 weeks that appear in the test dataset), using the resolved outcomes in our dataset.

In [10]:
first_date = (pd.to_datetime(df_test["Date"].iloc[0]) - pd.Timedelta(days=1)).to_datetime64()
unique_dates = np.unique(df_test_weekly["Date"])
unique_dates = np.append(first_date, unique_dates)

df_test_weekly["Week Games Played"] = 0

# Iterate over consecutive pairs of dates in unique_dates; each pair bounds one week
for i in range(len(unique_dates) - 1):
    start_date = unique_dates[i]
    end_date = unique_dates[i+1]

    # Count the number of games each player played that week
    weekly_counts = (
        df_test[(df_test["Date"] > start_date) & (df_test["Date"] <= end_date)]
        .groupby("Player")
        .size()
        .reset_index(name="Week Games Played")
    )

    # Assign the number of games played to the rows of the corresponding week
    df_test_weekly.loc[
        (df_test_weekly["Date"] > start_date) & (df_test_weekly["Date"] <= end_date),
        "Week Games Played"
    ] = df_test_weekly["Player"].map(weekly_counts.set_index("Player")["Week Games Played"])

# Fill NaN with 0
df_test_weekly["Week Games Played"] = df_test_weekly["Week Games Played"].fillna(0).astype(int)
df_test_weekly["Week PPG"] = (
    df_test_weekly["Pts"] / df_test_weekly["Week Games Played"].replace(0, np.nan)
).fillna(0)
test_highest_ppg_per_week = df_test_weekly.loc[
    df_test_weekly
    .groupby("Date")["Week PPG"]
    .idxmax(),
    ["Date", "Player", "Week PPG"]]
test_highest_ppg_per_week

Unnamed: 0,Date,Player,Week PPG
86,2021-03-29,"Harden,James",38.0
326,2021-04-05,"Curry,Stephen",36.5
734,2021-04-12,"Curry,Stephen",41.0
1143,2021-04-19,"Curry,Stephen",42.75
1493,2021-04-26,"Beal,Bradley",35.0
2197,2021-05-03,"Tatum,Jayson",47.5


In [11]:
# space for work
test_highest_ppg_per_week["Accurate Prediction"] = test_highest_ppg_per_week["Player"].isin(highest_ppg_per_week_count["Player"].iloc[0:3]).astype(int)
test_highest_ppg_per_week

Unnamed: 0,Date,Player,Week PPG,Accurate Prediction
86,2021-03-29,"Harden,James",38.0,0
326,2021-04-05,"Curry,Stephen",36.5,1
734,2021-04-12,"Curry,Stephen",41.0,1
1143,2021-04-19,"Curry,Stephen",42.75,1
1493,2021-04-26,"Beal,Bradley",35.0,1
2197,2021-05-03,"Tatum,Jayson",47.5,0


In [12]:
predictions_vs_test = pd.merge(test_highest_ppg_per_week,highest_ppg_per_week_count, on="Player", how="left").fillna(0)
predictions_vs_test = predictions_vs_test.drop(columns=["Nb Week Highest PPG", "Week PPG"])
predictions_vs_test

Unnamed: 0,Date,Player,Accurate Prediction,Forecast qi
0,2021-03-29,"Harden,James",0,0.071429
1,2021-04-05,"Curry,Stephen",1,0.142857
2,2021-04-12,"Curry,Stephen",1,0.142857
3,2021-04-19,"Curry,Stephen",1,0.142857
4,2021-04-26,"Beal,Bradley",1,0.142857
5,2021-05-03,"Tatum,Jayson",0,0.0


In [13]:
y_pred = predictions_vs_test["Forecast qi"]
y_true = predictions_vs_test["Accurate Prediction"]
n = predictions_vs_test["Player"].shape[0]

In [14]:
brier_score = (1 / n) * np.sum((y_pred - y_true)**2)
print("Q3) Brier Score =" , brier_score)

Q3) Brier Score = 0.4906462585034014
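
One common definition of the Brier quadratic score for a forecast over many outcomes sums the squared error over every player, not only the eventual leader: for a week led by player $w$, the score is $\sum_i (q_i - \mathbb{1}[i = w])^2$. The sketch below applies that definition to the Q2 forecast and the six test-week leaders listed above. The function name `brier_multiclass` is an assumption made for illustration, and under this convention a score ranges from 0 (perfect) to 2.

```python
import numpy as np

# Q2 forecast: weeks led divided by 14; every other player has forecast 0.
q2_forecast = {
    "Curry,Stephen": 2 / 14, "Beal,Bradley": 2 / 14,
    "Dončić,Luka": 2 / 14, "Lillard,Damian": 2 / 14,
    "Harden,James": 1 / 14, "Young,Trae": 1 / 14, "Sabonis,Domantas": 1 / 14,
    "Antetokounmpo,Giannis": 1 / 14, "Embiid,Joel": 1 / 14, "Fox,De'Aaron": 1 / 14,
}

# Weekly leaders in the test data, copied from the table above.
test_leaders = [
    "Harden,James", "Curry,Stephen", "Curry,Stephen",
    "Curry,Stephen", "Beal,Bradley", "Tatum,Jayson",
]

def brier_multiclass(forecast, winner):
    """Sum over all outcomes of (q_i - o_i)^2, where o_i is 1 only for the winner."""
    # Players with zero forecast who did not win contribute 0, so it is
    # enough to loop over the forecast's players plus the winner.
    players = set(forecast) | {winner}
    return sum(
        (forecast.get(p, 0.0) - (1.0 if p == winner else 0.0)) ** 2
        for p in players
    )

weekly_scores = [brier_multiclass(q2_forecast, w) for w in test_leaders]
avg_brier = float(np.mean(weekly_scores))
print("Average multi-outcome Brier score =", round(avg_brier, 4))
```

Because this convention also penalises the probability placed on players who did not lead the week, its values are not directly comparable to a score computed from the leader's probability alone.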


## Q4. Better base rates

We made pretty strong assumptions in Q2, for example that if a player didn’t have the highest PPG in any of the 14 weeks before April, they had 0 probability of getting the highest PPG in future weeks. This seems pretty inaccurate, and in lecture we discussed ways to estimate base rates for events that haven't happened yet.

Modify the forecast from Q2 to give **positive** probability $p_i$ to each player. Justify your forecast. (You can lump together players in groups, so you don't need to enumerate every player individually, as demonstrated in the lecture on "Other Options". Also, this is meant to be open-ended; there is not one correct answer we are looking for).

In [38]:
# Compute each player's PPG over the training period and rank the players by it
df_train_ppg = (
    df_train.groupby("Player")["Pts"].mean()
    .reset_index()
    .rename(columns={"Pts": "PPG"})
    .sort_values("PPG", ascending=False)
)
df_train_ppg.head(10)
# Split the players into 3 groups: the top 5 scorers over this period, the next 15 (ranks 6-20), and everyone else.
# .copy() makes each group an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning.
top_5_scorers = df_train_ppg.iloc[0:5, :].copy()
other_top_15_scorers = df_train_ppg.iloc[5:20, :].copy()
rest_of_scorers = df_train_ppg.iloc[20:, :].copy()
# Based on data from previous seasons, we assume that each week's top scorer has an 80% chance of coming from the top 5,
# a 15% chance of coming from the next 15, and a 5% chance of coming from the remaining players.
# Within the first two groups, we weight each player's probability by their share of the group's total PPG.
# This weighting cannot be used for the last group: many of its players have 0 PPG, possibly because of injuries,
# and we have no information on whether they will return for the final weeks.
# We therefore assign the same probability to every player in the last group.
top_5_scorers["Forecast"] = top_5_scorers["PPG"] / sum(top_5_scorers["PPG"]) * 4/5
other_top_15_scorers["Forecast"] = other_top_15_scorers["PPG"] / sum(other_top_15_scorers["PPG"]) * 15/100
rest_of_scorers["Forecast"] = 1 / len(rest_of_scorers["PPG"]) * 5/100
ppg_scorers = pd.concat([top_5_scorers, other_top_15_scorers, rest_of_scorers])
ppg_scorers

Unnamed: 0,Player,PPG,Forecast
37,"Beal,Bradley",31.317073,0.168146
134,"Embiid,Joel",29.935484,0.160728
272,"Lillard,Damian",29.772727,0.159854
128,"Durant,Kevin",29.000000,0.155705
108,"Curry,Stephen",28.974359,0.155567
...,...,...,...
281,"Magnay,Will",0.000000,0.000103
364,"Pasečņiks,Anžejs",0.000000,0.000103
482,"Whittington,Greg",0.000000,0.000103
503,"İlyasova,Ersan",0.000000,0.000103
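
Another simple way to give every player a positive probability is additive (Laplace) smoothing of the Q1 counts, with all players who never led a week lumped into a single "other" group. The sketch below is illustrative only: the pseudo-count `alpha = 1.0` and the label `"Other players (lumped)"` are assumed values, not ones taken from the lecture.

```python
# Weekly-leader counts copied from the Q1 table (14 weeks in total).
counts = {
    "Curry,Stephen": 2, "Beal,Bradley": 2, "Dončić,Luka": 2, "Lillard,Damian": 2,
    "Harden,James": 1, "Young,Trae": 1, "Sabonis,Domantas": 1,
    "Antetokounmpo,Giannis": 1, "Embiid,Joel": 1, "Fox,De'Aaron": 1,
}

alpha = 1.0  # assumed pseudo-count added to every category
categories = list(counts) + ["Other players (lumped)"]

# Each category gets (observed count + alpha) / (total count + alpha * number of categories).
total = sum(counts.values()) + alpha * len(categories)
smoothed = {c: (counts.get(c, 0) + alpha) / total for c in categories}

for category, prob in smoothed.items():
    print(f"{category}: {prob:.3f}")
```

With these assumed values, the lumped group receives 1/25 of the probability mass, which would still need to be divided among its individual players.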


## Q5. Evaluate the Q4 forecast

Calculate the average Brier quadratic score of the forecast from Q4 across April's weeks (the 6 weeks that appear in the test dataset), using the resolved outcomes in our dataset. How does the score compare to the score for the Q2 forecast? 

In [39]:
test_highest_ppg_per_week["Accurate Prediction"] = test_highest_ppg_per_week["Player"].isin(top_5_scorers["Player"]).astype(int)
predictions_vs_test = pd.merge(test_highest_ppg_per_week, ppg_scorers, on="Player", how="left").fillna(0)
predictions_vs_test = predictions_vs_test.drop(columns=["PPG", "Week PPG"])
predictions_vs_test

Unnamed: 0,Date,Player,Accurate Prediction,Forecast
0,2021-03-29,"Harden,James",0,0.009804
1,2021-04-05,"Curry,Stephen",1,0.155567
2,2021-04-12,"Curry,Stephen",1,0.155567
3,2021-04-19,"Curry,Stephen",1,0.155567
4,2021-04-26,"Beal,Bradley",1,0.168146
5,2021-05-03,"Tatum,Jayson",0,0.00953


In [40]:
# space for work
y_pred = predictions_vs_test["Forecast"]
y_true = predictions_vs_test["Accurate Prediction"]
n = predictions_vs_test["Player"].shape[0]

brier_score = (1 / n) * np.sum((y_pred - y_true)**2)
print("Q5) Brier Score =" , brier_score)

Q5) Brier Score = 0.4718946079146149


The Brier score improves only slightly on the Q2 forecast (about 0.472 versus 0.491) and remains poor, because the dataset is missing many relevant factors: we cannot anticipate injuries; we do not know team records, and therefore whether players are likely to rest or push hard; we have no information on the top scorers of previous seasons; and we cannot assign specific probabilities to players who usually perform well at this point in the season.