# COGS 108 - EDA Checkpoint

# Names

- Megan Yee
- Steven Bui
- Hung Ho
- Junmo Robbins


# Research Question

Does playing back-to-back games in the NBA affect a player's individual performance? Specifically, does fatigue influence key performance metrics, such as points per game, shooting percentage, rebounds, and turnovers?

## Background and Prior Work


The National Basketball Association (NBA) has 30 teams, each of which plays 82 games in a regular season that runs from October through April. With a schedule this dense, teams regularly play on consecutive days, and star players may have to play both games of a back-to-back to give their team the best chance of winning. This leads to our research question: does playing back-to-back games affect a player’s performance in areas such as points per game (PPG), shooting percentage, rebounds, and turnovers? In our initial research, we could not find analyses of individual player performance in back-to-back games; however, we did find analyses of team performance. That team-level work suggests performance tends to decline on the second day of a back-to-back. Comparing a player’s performance when they have a one-day break with their performance in a back-to-back game should therefore give us player-level statistics on which to base a conclusion.

Previously, a similar study examined an NBA team’s margin of victory on the second night of a back-to-back. It found that teams played 1 point worse on the second night at home, compared to 2.5 points worse if they played away.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) While this tells us that teams do in fact play worse on the second night of a back-to-back, the margin is very small, and it doesn’t account for the modern NBA practice of resting players. In recent years, NBA teams have increasingly rested their star players in back-to-back games, either by giving them fewer minutes or by not playing them at all. On average, NBA teams play 16% of their games as a back-to-back; most commonly (63% of their games), they play with a one-day break. To account for players potentially resting, the Sparse Impacts NBA Model was designed. <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This model assigns each player a numerical point value representing their worth, which can be compared to a team’s performance on the second night of a back-to-back. While this previous study gives us insight into how teams usually perform on the second night of a back-to-back, we specifically want to study individual players’ performances so we can analyze player consistency.

The second study we found examines how back-to-back games affect NBA team performance, specifically when travel is involved. Using data from 2014-2021, the study found that teams generally perform worse on the second game of a back-to-back, with an average decline in margin of victory of 2.5 points. <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) A Welch’s t-test confirmed a significant difference in performance: the “mean margin of victory on a back-to-back with travel” was -2.33 points, compared to 0.6 points without travel. The effect is therefore more noticeable when teams need to travel, with a gap of nearly 3 points between the two cases. By distinguishing games where teams travel from those where they remain at the same location, the study concludes that travel-induced fatigue may considerably impact player performance on back-to-backs, suggesting that reducing travel during these games could maintain game quality and competitiveness.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) The Stats Behind Back to Back NBA Games *The Data Jocks*. https://thedatajocks.com/the-stats-behind-back-to-back-nba-games/#How_Often_do_Teams_Play_Back_to_Back_Games/ 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Live NBA Model: Sparse Impacts Model *The Data Jocks*. https://thedatajocks.com/sparse-impacts-model-2023/ 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Analyzing the Impact of Back-to-Backs on NBA Team Performance *Medium*. https://medium.com/@kevhopper92/analyzing-the-impact-of-back-to-backs-on-nba-team-performance-88a4b80a5d99#:~:text=Travel%20during%20a%20Back-to,a%20back-to-back/ 


# Hypothesis



We predict that NBA players will experience a significant decrease in key performance metrics during the second game of a back-to-back compared to their performance in games with at least a day of rest. Fatigue can affect not only a player’s physical abilities but also their cognitive functions, which may lead to an increase in turnovers.


# Data

## Data overview

- Dataset #1
  - Dataset Name: Combined Stats from Unique Players
  - Link to the dataset: https://github.com/COGS108/Group123-FA24/blob/master/Combined_Unique_Stats.csv
  - Number of observations: 92
  - Number of variables: 4
- Dataset #2
  - Dataset Name: Generated Back to Back Games
  - Link to the dataset: https://github.com/COGS108/Group123-FA24/blob/master/Generated_Back_to_Back_Games.csv
  - Number of observations: 424
  - Number of variables: 4

For our first dataset, instead of analyzing every NBA player's statistics, we selected the top 50 players from last season based on their highest averages in points per game, rebounds, and turnovers. We then removed any duplicate players appearing across multiple lists and narrowed the columns to display each player’s team and the three key metrics we plan to analyze.
Our second dataset includes the NBA schedule for all back-to-back games played by each team last season. We first gathered the full schedule and identified back-to-back games as those a team played on consecutive calendar days (dates exactly one day apart). These dates will allow us to track each of the top 50 players’ individual statistics for these specific games.
Finally, we will combine these datasets to show each player’s stats for every back-to-back game they played. By comparing their averages in the first game of a back-to-back to their averages in the second, we aim to understand how fatigue affects an NBA player’s performance.
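
The combining step described above could be sketched as a merge between a per-game player log and a table of back-to-back dates. This is only an illustration under stated assumptions: the player names, team labels, dates, and point totals are made up, and it assumes both tables use the same team labels and a `Date` column.

```python
import pandas as pd

# Hypothetical per-game log; column names and values are illustrative only.
game_log = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Team": ["Team X", "Team X", "Team X", "Team Y"],
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-02"]),
    "PTS": [30, 24, 28, 18],
})

# Hypothetical back-to-back table: one row per team per second-night game.
back_to_back = pd.DataFrame({
    "Team": ["Team X"],
    "Date": pd.to_datetime(["2024-01-02"]),
})

# An inner merge keeps only the player games that fall on a second night.
second_night = game_log.merge(back_to_back, on=["Team", "Date"], how="inner")
print(second_night)
```

If the two sources label teams differently (for example, full names in one and abbreviations in the other), the labels would need to be mapped to a common form before merging.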



## Dataset #1 (Player Performance Metrics)

In [None]:
import pandas as pd

# Load CSV files
points_df = pd.read_csv('Pts_per_Game.csv', index_col=0)
turnover_df = pd.read_csv('Turnover.csv', index_col=0)
rebound_df = pd.read_csv('Rebound.csv', index_col=0)

# Combine dataframes
combined_df = pd.concat([points_df, turnover_df, rebound_df], ignore_index=True)

# Drop duplicate players
unique_combined_df = combined_df.drop_duplicates(subset=['Player', 'TEAM'])

# Keep only the specified columns
unique_combined_df = unique_combined_df[['Player', 'TEAM', 'PTS', 'TOV', 'REB']]

# Save the filtered DataFrame to a new CSV file
unique_combined_df.to_csv('Combined_Unique_Players.csv', index=False)



## Dataset #2 (Back-to-Back Games Schedule Filtering)

In [None]:
import pandas as pd

# Load the NBA schedule CSV file
nba_schedule = pd.read_csv('NBA_Schedule.csv')

# Convert the Date column to datetime format
nba_schedule['Date'] = pd.to_datetime(nba_schedule['Date'], errors='coerce')

# Remove any extra whitespace
nba_schedule['Visitor/Neutral'] = nba_schedule['Visitor/Neutral'].str.strip()
nba_schedule['Home/Neutral'] = nba_schedule['Home/Neutral'].str.strip()

# Create separate DataFrames for visitor and home games and rename
visitor_games = nba_schedule[['Date', 'Visitor/Neutral']].rename(columns={'Visitor/Neutral': 'Team'})
home_games = nba_schedule[['Date', 'Home/Neutral']].rename(columns={'Home/Neutral': 'Team'})

# Combine visitor and home games into a single schedule
team_schedule = pd.concat([visitor_games, home_games], ignore_index=True)
team_schedule = team_schedule.sort_values(by=['Team', 'Date']).reset_index(drop=True)

# Identify back-to-back games
team_schedule['Previous_Game_Date'] = team_schedule.groupby('Team')['Date'].shift(1)
team_schedule['Back_to_Back'] = (team_schedule['Date'] - team_schedule['Previous_Game_Date']).dt.days == 1

# Filter to include only the back-to-back games
back_to_back_games = team_schedule[team_schedule['Back_to_Back']]

# Save the back-to-back games to a new CSV file
back_to_back_games.to_csv('Generated_Back_to_Back_Games.csv', index=False)
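
The filter above keeps only the second night of each pair (a game whose previous game was one day earlier). Since the analysis compares first and second nights, one possible extension is to flag both. The following is a minimal sketch on a toy schedule, assuming the same `Team` and `Date` columns; the team name, the dates, and the `Is_First_Day` / `Is_Second_Day` column names are hypothetical.

```python
import pandas as pd

# Toy schedule for one hypothetical team; dates are illustrative only.
schedule = pd.DataFrame({
    "Team": ["Team X"] * 4,
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-07"]),
}).sort_values(["Team", "Date"]).reset_index(drop=True)

one_day = pd.Timedelta(days=1)

# Previous and next game dates within each team's own schedule.
prev_date = schedule.groupby("Team")["Date"].shift(1)
next_date = schedule.groupby("Team")["Date"].shift(-1)

# Second night: the previous game was exactly one calendar day earlier.
schedule["Is_Second_Day"] = (schedule["Date"] - prev_date) == one_day
# First night: the next game is exactly one calendar day later.
schedule["Is_First_Day"] = (next_date - schedule["Date"]) == one_day

print(schedule)
```

A game in the middle of three consecutive days would be flagged as both a first and a second night under this rule, so such cases would need a separate decision.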



# Results

## Exploratory Data Analysis



### Section 1 of EDA - Scatterplots and Boxplots for Stats, First and Second day B2B games

We used scatterplots and boxplots to compare player performance on the first and second days of back-to-back games. The scatterplots show how points, rebounds, and turnovers on Day 1 compare to Day 2, but they’re hard to interpret because so many points overlap. To address this, we added boxplots, which make it easier to spot overall trends. From the boxplots, we see that points scored drop slightly on Day 2, while rebounds and turnovers remain about the same. This suggests fatigue might impact scoring more than other parts of the game, but further analysis is needed to confirm this.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the datasets for the first and second days of back-to-back games
first_day_df = pd.read_csv("Back_to_Back_First_Day.csv")
second_day_df = pd.read_csv("Back_to_Back_Second_Day.csv")

# Merge the datasets on PLAYER NAME
merged_df = pd.merge(first_day_df, second_day_df, on='PLAYER NAME', suffixes=('_Day1', '_Day2'))

# Metrics to visualize
metrics = ['PTS', 'REB', 'TOV']

# Scatter plots for each metric
for metric in metrics:
    plt.figure(figsize=(8, 6))
    plt.scatter(merged_df[f'{metric}_Day1'], merged_df[f'{metric}_Day2'], alpha=0.7, label='Players')
    plt.title(f'{metric}: First Day vs. Second Day of Back-to-Back Games')
    plt.xlabel(f'{metric} on First Day')
    plt.ylabel(f'{metric} on Second Day')
    # Reference line y = x: points below it mean a lower value on the second day
    plt.axline((0, 0), slope=1, color='red', linestyle='--', label='No change (y = x)')
    plt.legend()
    plt.grid(True)
    plt.show()

# Box plots comparing Day 1 and Day 2 for each metric
plt.figure(figsize=(12, 8))
for i, metric in enumerate(metrics, 1):
    plt.subplot(1, 3, i)
    merged_df[[f'{metric}_Day1', f'{metric}_Day2']].boxplot()
    plt.title(f'{metric}: Day 1 vs Day 2')
    plt.ylabel(metric)
    plt.xticks([1, 2], ['Day 1', 'Day 2'])
plt.tight_layout()
plt.show()
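
Because overlapping points make the scatterplots hard to read, another option is to reduce each player to a single number: the change from Day 1 to Day 2. The sketch below uses a made-up stand-in for `merged_df` with the same `_Day1` / `_Day2` column suffixes; the helper name `summarize_change` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for merged_df; the numbers are made up for illustration.
toy = pd.DataFrame({
    "PLAYER NAME": ["A", "B", "C", "D"],
    "PTS_Day1": [25.0, 20.0, 18.0, 30.0],
    "PTS_Day2": [23.0, 21.0, 15.0, 30.0],
})

def summarize_change(df, metric):
    """Summarize the per-player change (Day 2 minus Day 1) for one metric."""
    diff = df[f"{metric}_Day2"] - df[f"{metric}_Day1"]
    return {
        "mean_change": diff.mean(),
        "median_change": diff.median(),
        "share_declined": (diff < 0).mean(),  # fraction of players who dropped
    }

summary = summarize_change(toy, "PTS")
print(summary)
```

A histogram of the same per-player differences would show the spread around zero without the overplotting seen in the scatterplots.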


### Section 2 of EDA - Means, T-Stat, P-Value

We used the same metrics and dataset from Section 1 to run a paired t-test for each metric, comparing Day 1 and Day 2 of the back-to-back games, and used the resulting p-values to assess statistical significance. The differences in the mean values between the two days were minor for every stat, so the metrics are very similar overall. However, at the 0.05 significance level, the p-values indicate a significant change for PTS and REB but not for TOV.

In [None]:
import pandas as pd
from scipy.stats import ttest_rel

# Load the datasets for the first and second days of back-to-back games
first_day_df = pd.read_csv("Back_to_Back_First_Day.csv")
second_day_df = pd.read_csv("Back_to_Back_Second_Day.csv")

# Merge the datasets on PLAYER NAME
merged_df = pd.merge(first_day_df, second_day_df, on='PLAYER NAME', suffixes=('_Day1', '_Day2'))

# Metrics to analyze
metrics = ['PTS', 'REB', 'TOV']

# Dictionary to store the metrics summary
metrics_summary = {}
for metric in metrics:
    day1_mean = merged_df[f'{metric}_Day1'].mean()
    day2_mean = merged_df[f'{metric}_Day2'].mean()
    t_stat, p_value = ttest_rel(merged_df[f'{metric}_Day1'], merged_df[f'{metric}_Day2'])
    metrics_summary[metric] = {
        'Day 1 Mean': day1_mean,
        'Day 2 Mean': day2_mean,
        'Difference': day1_mean - day2_mean,
        'T-Statistic': t_stat,
        'P-Value': p_value
    }

# Convert the metrics summary to a DataFrame
metrics_summary_df = pd.DataFrame(metrics_summary).T

# Save the summary DataFrame to a CSV file, keeping the metric names (the index)
updated_file_path = "MetricSummary.csv"
metrics_summary_df.to_csv(updated_file_path, index_label='Metric')
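
To make clear what the paired t-test measures, the sketch below computes the paired t statistic by hand from the per-player differences, using only the standard library. It also reports a paired effect size (Cohen's d), which is one way to see how a small mean difference can still reach significance. The values are made up, and the helper name `paired_t` is hypothetical.

```python
import math
import statistics

# Made-up Day 1 / Day 2 values for five hypothetical players.
day1 = [25.0, 20.0, 18.0, 30.0, 22.0]
day2 = [23.0, 21.0, 15.0, 28.0, 22.0]

def paired_t(first, second):
    """Return (t statistic, Cohen's d) for paired samples, first minus second."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff  # standardized size of the paired difference
    return t_stat, cohens_d

t_stat, cohens_d = paired_t(day1, day2)
print(f"t = {t_stat:.3f}, d = {cohens_d:.3f}")
```

Since the t statistic equals the effect size multiplied by the square root of the number of pairs, a larger sample can produce a significant result even when the effect itself is small.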


# Ethics & Privacy

Since our project uses publicly available NBA player statistics, privacy concerns are minimal. However, we recognize that some biases might affect our results. For example, certain players may get more rest depending on their team’s schedule, which could skew performance data in favor of some teams over others. Additionally, factors like travel, injuries, and coaching strategies can influence performance but aren’t directly reflected in our dataset. We’ll note these limitations and interpret our results with these in mind.

# Team Expectations 

* *Our team will communicate through Discord. A reasonable response time for a message is within 12 hours. Our team will meet twice a week on Discord to get our work done.*
* *We will communicate with each other as soon as any problem arises so that we can address it quickly. We will reach out to each other for help if needed.*
* *Since we value each other’s opinions, our team decisions will mainly be made by consensus after a debate.*
* *We will split the workload so that everyone contributes a bit of everything to the team. We will mostly work together on Discord calls, so work will be assigned during those calls.*
* *If someone is struggling to deliver something they promised to do, reach out to the team as soon as possible for help. If things cannot be resolved within the team, reach out to the IA or TA for extra help.*
* *Make sure to keep updating the project timeline proposal in order to keep up with deadlines and plan ahead.*


# Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---------------|--------------|--------------------------|---------------------|
| 10/22         | 8 PM         | Read through past projects and discuss team goals | Complete the project review form together, set team expectations, and start brainstorming research questions |
| 10/29         | 9 PM         | Finalize a research question through Discord and complete background research | Work on project proposal draft together, completing most of it as a team; assign remaining tasks for individual work |
| 10/30         | 8 PM         | Complete assigned sections from previous meeting | Review, finalize, and submit the project proposal |
| 11/5          | 8 PM         | Gather and import data; start initial research focusing on player performance metrics | Review data for completeness and outline the analysis plan based on initial findings |
| 11/12         | 8 PM         | Create CSV files with our datasets and combine them according to our project | Review, finalize, and submit the DataCheckpoint. |
| 11/19         | 8 PM         | Complete detailed research and clean any remaining data | Discuss research results and make adjustments to the analysis plan if needed |
| 11/25         | 8 PM         | Create a Google Sheet and manually combine each player's data for each game across the chosen season | Continue manually combining data, discuss how the data help answer the research question, and plan the analysis |
| 11/26         | 9 PM         | Begin full analysis based on research insights, focusing on metrics for fatigue impact | Review analysis progress, discuss preliminary findings, and start drafting results and conclusions |
| 12/11         | Before 11:59 PM | Finalize all results and conclusions; record presentation | Final review of project deliverables, and submit all materials, including team surveys |
