# NBA Player Points

### Group Members:
Matthew Gillies - 59188508 (Group Leader),
Hans Lam - 80010721
...



### Group Contributions:
Matthew: R code appendix, 
...



## Objective: 
We analyze scoring data for NBA players who appeared in both the regular season and the playoffs. Our goals are to estimate the average points per game (PPG) in the regular season and in the playoffs, and to determine whether the average player performs better (scores more) in the playoffs.

## Background: 
In the NBA, the regular season and the playoffs differ considerably in player usage and rotations. Star players often "coast" through the regular season to save energy and stay healthy for the playoffs. Some players raise their level of play in the playoffs, while others are overwhelmed by the pressure and fall short of their usual standards. Our goal is to determine whether the average NBA player performs better in the playoffs than in the regular season, and to produce standalone estimates for each. Several underlying factors affect this analysis. Rotations become shorter in the playoffs, which gives star players more opportunities to score and role players fewer. The playoffs are also a higher level of competition overall, since only teams that were successful enough in the regular season qualify. Through this analysis we hope to better understand the relationship between regular season and playoff scoring.

## Importance: 
During the 2022/2023 NBA season, the 8th-seeded Miami Heat reached the NBA Finals. This was only the second time an 8th-seeded team had made it to the Finals, the first being in 1999. Naturally, this created a strong underdog narrative around the Heat during the playoffs, with many sports commentators pointing out how much better the Heat players were performing in the playoffs than in the regular season. In particular, the Heat's star player Jimmy Butler earned the moniker "Playoff Jimmy" because of the perceived jump in his level of play in the playoffs. This narrative made our group wonder whether the increase in level of play between the regular season and the playoffs applies to NBA players in general, which led to the objective stated above.



## Sampling: 
The target population is the set of NBA players who played in both the 2022/2023 regular season and the 2022/2023 playoffs. We chose this population for the following reasons:
1. It is the most recently completed NBA season, so the data was easily accessible.
2. The 8th-seeded Miami Heat reached the NBA Finals that season. As stated previously, the sports media narrative was that everyone on that team played better in the playoffs. By including the Heat players in our population, we can see whether the data supports that narrative and whether it generalizes to players as a whole.

Our parameters of interest for measuring the difference in performance between the regular season and the playoffs are:
1. The mean PPG (points per game) across players during the 2022/2023 NBA regular season
2. The mean PPG across players during the 2022/2023 NBA playoffs
3. The proportion of players whose playoff PPG was higher than their regular season PPG
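
One way to write these three parameters formally is shown below. The notation is introduced here only for illustration: $U$ is the population of $N$ players, $y_i$ is player $i$'s regular season PPG, and $x_i$ is the same player's playoff PPG.

$$
\bar{y}_U = \frac{1}{N}\sum_{i \in U} y_i, \qquad
\bar{x}_U = \frac{1}{N}\sum_{i \in U} x_i, \qquad
p = \frac{1}{N}\sum_{i \in U} \mathbf{1}(x_i > y_i)
$$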

The two sampling methods we chose to use are simple random sampling and stratified sampling. 
For our simple random sample, we selected $n = 50$ players at random without replacement and recorded each player's PPG for both the regular season and the playoffs. 
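
As a rough sketch of the SRS calculation, the following R snippet estimates a mean PPG with a finite population correction and a normal-approximation confidence interval. The population here is simulated rather than real NBA data, and the helper name `srs_mean_ci` and its `conf` argument are our own illustrative choices. The default of 0.90 corresponds to the `qnorm(0.95)` critical value used in the appendix.

```r
# Simulated population of 200 PPG values (made up, not real NBA data)
set.seed(1)
pop_ppg <- round(rgamma(200, shape = 2, scale = 4), 1)

# SRS estimate of a population mean with finite population correction.
# y: sampled values, N: population size, conf: two-sided confidence level
srs_mean_ci <- function(y, N, conf = 0.90) {
  n <- length(y)
  est <- mean(y)
  fpc <- 1 - n / N                      # finite population correction
  se <- sqrt(fpc * var(y) / n)          # estimated standard error
  z <- qnorm(1 - (1 - conf) / 2)        # normal critical value
  c(estimate = est, se = se, lower = est - z * se, upper = est + z * se)
}

# Draw 50 values without replacement and summarise them
srs <- sample(pop_ppg, 50)
res <- srs_mean_ci(srs, N = length(pop_ppg))
print(res)
```

Passing `conf = 0.95` would give the wider 95% interval instead.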

For our stratified sample, since the cost of sampling is 0 for all players, we set an arbitrary sample size of $n = 50$, matching the SRS. We stratified by player position, giving $H = 5$ strata: "C" (center), "PF" (power forward), "PG" (point guard), "SF" (small forward), and "SG" (shooting guard). Our reasons for stratifying by position are: 
1. The between-strata variation should be relatively large. Each position has a defined role, and these roles make average scoring differ noticeably between positions. For example, centers are mainly responsible for rebounding and for scoring close to the basket, whereas shooting guards are mainly responsible for taking long-range three-point shots. We therefore expect average points scored to be lower for centers than for shooting guards, since shooting guards have more opportunities to score. 
2. The within-strata variation should be relatively small. Because each position has a defined role, players at the same position should score in similar ways. For example, although the quality of centers varies across the NBA, the way centers score should be about the same from one center to the next. This should limit the differences in points scored within a position.
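
One way to check these two assumptions before committing to position as the stratification variable is to compare per-position summaries and compute the share of total PPG variation that lies between positions. The sketch below uses simulated values in place of the real data. The column names `Pos` and `PTS` follow the appendix code, while the helper `between_share` is hypothetical.

```r
# Simulated stand-in for the regular season data (values are made up)
set.seed(42)
toy <- data.frame(
  Pos = rep(c("C", "PF", "PG", "SF", "SG"), each = 40),
  PTS = round(rgamma(200, shape = 2, scale = 5), 1)
)

# Per-position mean and SD of PPG
pos_summary <- aggregate(PTS ~ Pos, data = toy,
                         FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(pos_summary)

# Fraction of the total sum of squares that lies between groups.
# Values near 1 mean the strata differ a lot and are internally homogeneous;
# values near 0 mean stratifying on g gains little over an SRS.
between_share <- function(y, g) {
  grand_mean <- mean(y)
  group_means <- tapply(y, g, mean)
  group_sizes <- tapply(y, g, length)
  ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
  ss_total <- sum((y - grand_mean)^2)
  ss_between / ss_total
}

share <- between_share(toy$PTS, toy$Pos)
print(share)
```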

To choose the sample size for each stratum, we looked at the population size $N_h$ of each position in our dataset and the standard deviation of PPG within each position, which serves as our guess $s_{h,\text{guess}}$. Using the proportionality $n_h \propto N_h \cdot s_{h,\text{guess}}$, we divided the $n = 50$ among the strata in proportion to each stratum's $N_h \cdot s_{h,\text{guess}}$, resulting in:
1. $n_{C} = 10$ for the center position "C"
2. $n_{PF} = 10$ for the power forward position "PF"
3. $n_{PG} = 9$ for the point guard position "PG"
4. $n_{SF} = 10$ for the small forward position "SF"
5. $n_{SG} = 11$ for the shooting guard position "SG" 
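
The allocation step can be sketched as follows. The stratum sizes and standard deviations below are placeholder values rather than the figures from our dataset, and the helper `neyman_allocation` is hypothetical. It resolves rounding with a largest-remainder rule so the sizes always sum to $n$, which is a different approach from the manual adjustment used in the appendix code.

```r
# Placeholder stratum sizes and guessed SDs (illustrative only, not our data)
strata <- data.frame(
  Pos     = c("C", "PF", "PG", "SF", "SG"),
  N_h     = c(40, 40, 35, 40, 45),
  s_guess = c(6, 6, 7, 6, 6)
)

# Allocate n sample units with n_h proportional to N_h * s_h.
# Fractional allocations are floored, then the leftover units go to the
# strata with the largest fractional parts (largest-remainder rounding).
# This sketch does not check that n_h stays below N_h.
neyman_allocation <- function(N_h, s_h, n) {
  w <- N_h * s_h
  raw <- n * w / sum(w)
  n_h <- floor(raw)
  leftover <- n - sum(n_h)
  if (leftover > 0) {
    idx <- order(raw - n_h, decreasing = TRUE)[seq_len(leftover)]
    n_h[idx] <- n_h[idx] + 1
  }
  n_h
}

strata$n_h <- neyman_allocation(strata$N_h, strata$s_guess, n = 50)
print(strata)
```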

## Data Analysis:


## Conclusion: 
...


## Appendix

#### R code:

In [None]:
library(tidyverse)
set.seed(123)
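playoff_data <- read.csv("202223nbaplayoffs.csv", header = TRUE)
reg_data <- read.csv("202223regseasonnodupes.csv", header = TRUE)

## SRS sampling
sample_size <- 50
# Population size is read from the data so it stays consistent with the N
# used for the stratified estimates below
population_size <- nrow(reg_data)

# Draw one set of row indices so the same players are used for both estimates
common_sample_indices <- sample(1:nrow(reg_data), sample_size)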

# SRS for Regular season
srs_sample_reg <- reg_data[common_sample_indices, ]
srs_est_reg <- mean(srs_sample_reg$PTS)
fpc <- 1 - sample_size / population_size
se_reg <- sqrt(var(srs_sample_reg$PTS)/sample_size * fpc)
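# qnorm(0.95) is the critical value for a two-sided 90% interval
# (use qnorm(0.975) for 95%). Note that this name masks base::quantile().
quantile <- qnorm(0.95)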
CI_reg <- c(srs_est_reg - quantile*se_reg, srs_est_reg + quantile*se_reg)


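# SRS for Playoffs
# Look up the sampled players by name so the estimate does not depend on the
# two files sharing the same row order
srs_sample_playoffs <- playoff_data[match(srs_sample_reg$Player, 
                                          playoff_data$Player), ]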
srs_est_playoffs <- mean(srs_sample_playoffs$PTS)
se_playoffs <- sqrt(var(srs_sample_playoffs$PTS)/sample_size * fpc)
CI_playoffs <- c(srs_est_playoffs - quantile*se_playoffs, srs_est_playoffs +
                   quantile*se_playoffs)

## Stratified Sampling for regular season:
stratum_sizes <- reg_data %>%
  group_by(Pos) %>%
  summarize(StratumSize = n())

within_strata_vars <- reg_data %>%
  group_by(Pos) %>%
  summarize(SD = sd(PTS))

## The within-stratum SDs differ, so we use Neyman allocation, i.e. optimal
## allocation with equal sampling cost in every stratum: 
## n_h proportional to N_h * S_h

alloc_weights <- stratum_sizes$StratumSize * within_strata_vars$SD
samp_sizes_strat <- round(alloc_weights/sum(alloc_weights) * sample_size)

## We round the point guard stratum (third in alphabetical order) down to 9 
## so that the total sample size is 50, although it should technically round up. 
samp_sizes_strat[3] <- 9


final_samples <- data.frame()  # Initialize an empty data frame to store the final samples

for (i in seq_along(samp_sizes_strat)) {
  samp_size <- samp_sizes_strat[i]
  # Take stratum labels from stratum_sizes so they line up with samp_sizes_strat
  # (group_by sorts positions alphabetically; unique() keeps file order instead)
  current_stratum <- stratum_sizes$Pos[i]
  
  stratum_samples <- reg_data %>%
    filter(Pos == current_stratum) %>%
    slice_sample(n = samp_size)
  
  final_samples <- rbind(final_samples, stratum_samples)
}

## Check to make sure sampling is done correctly: 
check_stratsize <- final_samples %>%
  group_by(Pos) %>%
  summarize(Size = n())


## Estimating Stratified for Regular Season: 
N <- nrow(reg_data)
Nh_data <- reg_data %>% group_by(Pos) %>% summarize(n = n())
Nh <- Nh_data$n
pos_avg <- final_samples %>% group_by(Pos) %>% summarize(mean = mean(PTS))
pos_means <- pos_avg$mean

strat_est_reg <- sum((Nh/N)*pos_means)

pos_vars_dat <- final_samples %>% group_by(Pos) %>% summarize(Var = var(PTS))
pos_vars <- pos_vars_dat$Var

se_strat_reg <- sqrt(sum((Nh/N)^2 * (1-(samp_sizes_strat/Nh)) * 
                           (pos_vars/samp_sizes_strat)))
CI_strat_reg <- c(strat_est_reg - quantile*se_strat_reg, 
                  strat_est_reg + quantile*se_strat_reg)

## Getting stratified sample for Playoffs: 
strat_players <- final_samples$Player
playoffs_strat <- playoff_data %>% filter(Player %in% strat_players)

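## Estimating stratified for playoffs: 
## The population is the same set of players as for the regular season, so N
## is tied to Nh to keep the stratum weights Nh/N summing to 1
N <- sum(Nh)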
pos_avg_p <- playoffs_strat %>% group_by(Pos) %>% summarize(mean = mean(PTS))
pos_means_p <- pos_avg_p$mean

strat_est_playoff <- sum((Nh/N)*pos_means_p)

pos_vars_dat_p <- playoffs_strat %>% group_by(Pos) %>% summarize(Var = var(PTS))
pos_vars_p <- pos_vars_dat_p$Var

se_strat_p <- sqrt(sum((Nh/N)^2 * (1-(samp_sizes_strat/Nh)) * 
                           (pos_vars_p/samp_sizes_strat)))
# Stored under its own name so the regular season interval is not overwritten
CI_strat_playoff <- c(strat_est_playoff - quantile*se_strat_p, 
                      strat_est_playoff + quantile*se_strat_p)

## Proportion Estimate for players who score more in playoffs than regular 
## season:

## SRS
merged_data <- merge(srs_sample_reg, srs_sample_playoffs, by = "Player", 
                     suffixes = c("_reg", "_playoff"))
prop_playoff_srs <- mean(merged_data$PTS_playoff > merged_data$PTS_reg)
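# The unbiased variance estimate for a proportion under SRS divides by (n - 1),
# which matches the var() used for the stratified proportion below
se_prop_srs <- sqrt(fpc*((prop_playoff_srs* (1-prop_playoff_srs))/(sample_size - 1)))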
CI_prop_srs <- c(prop_playoff_srs - quantile*se_prop_srs, 
                 prop_playoff_srs + quantile*se_prop_srs)

## Stratified: 
merged_strat <- merge(final_samples, playoffs_strat, by = "Player", 
                      suffixes = c("_reg", "_playoff"))
merged_strat$Playoff_Higher <- ifelse(merged_strat$PTS_playoff > 
                                        merged_strat$PTS_reg, 1, 0)

props <- merged_strat %>% group_by(Pos_reg) %>%
  summarize(Prop = mean(Playoff_Higher))
strat_props <- props$Prop
strat_est_prop <- sum((Nh/N)*strat_props)

vars_prop_dat <- merged_strat %>% group_by(Pos_reg) %>%
  summarize(Var = var(Playoff_Higher))
vars_prop <- vars_prop_dat$Var

se_prop_strat <- sqrt(sum((Nh/N)^2 * (1-(samp_sizes_strat/Nh)) * 
                            (vars_prop/samp_sizes_strat)))
CI_prop_strat <- c(strat_est_prop - quantile*se_prop_strat, 
                   strat_est_prop + quantile*se_prop_strat)

#### Regular Season Data

In [None]:
reg_data <- read.csv("202223regseasonnodupes.csv", header = TRUE)
reg_data

#### Playoff Data

In [None]:
playoff_data <- read.csv("202223nbaplayoffs.csv", header = TRUE)
playoff_data