#### All text below is a draft for my medium article. All code is up to date

# Analyzing the impact of back-to-backs on NBA team performance between the 2013-2014 and 2021-2022 seasons

One of the more fun parts of the NBA's All-Star Weekend is the goofy interview questions that players answer, which tend to go viral. This year in Indianapolis, Devin Booker got a lot of laughs for his response:

https://x.com/CGBBURNER/status/1758937072855835125?s=20

Reporter: “If you could add one rule to the NBA, what would it be?”

Booker: “No back-to-backs.”

Reporter: “How does this affect LeBron’s legacy?”

Booker: “That sounds like a bot comment” 

While we most definitely won't look at how banning back-to-backs would affect LeBron's legacy, we can look at back-to-backs to see if they are as bad as Booker seems to think they are. Do back-to-backs affect a team's performance? Are back-to-backs bad for the NBA?


Back-to-backs, pairs of games played on consecutive days, are said to be grueling ordeals that put teams to the test. Players, more tired than normal, often compete in a different city with no days of rest between games. It makes logical sense that teams would play worse in the second game of a back-to-back than in other games, but let's see whether the data shows a statistically significant difference in team performance. Once we have established whether back-to-backs are associated with worse performance, we will compare back-to-backs with and without travel to explore whether the back-to-back itself, or the travel that usually comes with it, accounts for the drop.

## Measuring Teams' Performance

First we'll need to determine how to measure the teams' performance. After considering player stats and advanced metrics, I determined that the best tool for the job may just be the most obvious: how many points the back-to-back team won (or lost) by. This metric - the margin of victory - is simple: subtract the points a team allowed from the points the team scored. If a team wins, its margin of victory will be positive; if it loses, it will be negative. One assumption accompanies this metric: over the large dataset we will be looking at, margin of victory will approximate a team's performance and game-to-game luck will be mitigated. In other words, margin of victory will be a proxy for how well a team plays. Across the whole dataset, the mean margin of victory is exactly 0, because every game contributes one positive margin (the winner's) and one equal negative margin (the loser's). In our experiments, we will explore whether the average margin of victory is notably different in back-to-backs compared to games with more rest.
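As a small illustration of the metric, the sketch below computes margin of victory from both teams' points of view for three made-up games. The scores and the `points_for` / `points_against` column names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical scores for three games, one row per team per game
games = pd.DataFrame({
    "team": ["A", "B", "C", "D", "A", "C"],
    "points_for": [110, 102, 98, 99, 120, 101],
    "points_against": [102, 110, 99, 98, 101, 120],
})

# Margin of victory: points scored minus points allowed
games["margin_of_victory"] = games["points_for"] - games["points_against"]

# Each game contributes +m for the winner and -m for the loser,
# so the mean over all team-game rows is 0
overall_mean = games["margin_of_victory"].mean()
print(games)
print(f"Overall mean margin of victory: {overall_mean}")
```

The zero overall mean is why the interesting comparison is between subgroups of games (back-to-back versus other) rather than against the dataset as a whole.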

## Obtaining the Data

Once we have determined which metric we will use, we need to obtain the data required for the analysis. Season-length data on games, dates, teams, and scores is available on basketball-reference.com, scattered across several pages (one per month of each season). Below, I have created a function that uses the Beautiful Soup package in combination with the pandas package to loop over the necessary pages and collect the requested seasons into a dataframe. One challenge I encountered in this step was basketball-reference.com's rate limiting, which I worked around in the web-scraping script by adding a pause after each page request. After pulling the data and formatting it into a dataframe, the script saves the result as a CSV for offline analysis.
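The pattern described above (build one URL per month, request each page, pause between requests) might be sketched as follows. The helper names `month_urls` and `fetch_politely` and the injectable `fetch` callable are illustrative assumptions rather than part of the notebook; the URL pattern and the 10-second default pause mirror the notebook's scraping function.

```python
import time

MONTHS = ["october", "november", "december", "january", "february",
          "march", "april", "may", "june"]

def month_urls(year):
    """Build the monthly schedule URLs for one season."""
    base = "https://www.basketball-reference.com/leagues"
    return [f"{base}/NBA_{year}_games-{month}.html" for month in MONTHS]

def fetch_politely(urls, fetch, pause_seconds=10.0):
    """Call fetch(url) for each URL, pausing after every request.

    `fetch` is any callable that returns the page contents (for example a
    thin wrapper around urllib.request.urlopen). Pages that fail to load
    are skipped, since some seasons are missing certain months.
    """
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as error:
            print(f"Skipping {url}: {error}")
        time.sleep(pause_seconds)  # stay under the site's rate limit
    return pages

# Stand-in fetcher so the sketch runs without network access
urls = month_urls(2014)
pages = fetch_politely(urls, fetch=lambda url: f"<html>{url}</html>", pause_seconds=0)
print(len(pages), urls[0])
```

Passing the fetcher in as an argument keeps the pause-and-skip logic separate from the HTML parsing, which makes it easy to try out without hitting the site.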

By the end of this process, we have 11528 games across 8 seasons (2014-2021) of unstructured data ready to be formatted into a usable state.

## Structuring the Data

The initial dataset is loose and unstructured; we need to clean it down to the columns required for the analysis. A fundamental limitation of the raw data is that each entry represents a game, whereas we need an individual entry for each team's performance within a game. To resolve this, we restructure the data to show performance from each team's point of view; both teams get a margin of victory for each game (positive for the winner, negative for the loser). With the data restructured, we add a calculated field for our metric, margin of victory. As a last step before data exploration, we add a flag that marks back-to-backs. After applying this restructuring, we find ourselves with 23056 rows of team performances. Let's take a look at some descriptive statistics and histograms to start to see the shape of the data.
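A toy version of this restructuring is sketched below. The three games are made up; the column names follow the scraped data, and the approach (two records per game, then a per-team date difference) is the same one the notebook uses.

```python
import pandas as pd

# Hypothetical schedule: one row per game, as in the scraped data
games = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
    "Vis": ["A", "A", "B"],
    "VisScore": [100, 95, 110],
    "Home": ["B", "C", "C"],
    "Home_Score": [90, 105, 108],
})

# One record per team per game: the visitor's view, then the home team's view
vis = pd.DataFrame({
    "Date": games["Date"], "Team": games["Vis"], "Opponent": games["Home"],
    "Margin_of_victory": games["VisScore"] - games["Home_Score"],
})
home = pd.DataFrame({
    "Date": games["Date"], "Team": games["Home"], "Opponent": games["Vis"],
    "Margin_of_victory": games["Home_Score"] - games["VisScore"],
})
team_games = pd.concat([vis, home], ignore_index=True).sort_values(["Team", "Date"])

# Days since each team's previous game; exactly 1 day means a back-to-back
days_rest = team_games.groupby("Team")["Date"].diff().dt.days
team_games["back_to_back"] = days_rest.eq(1)
print(team_games)
```

Only team A's second game is flagged here; a team's first game has no previous date, so its rest is missing and the flag stays False.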


## Examining the Data

With data acquisition and structuring complete, the fun can finally begin. Let's examine our data by looking at descriptive statistics and the distributions (by flag). In addition, we can do some fun preliminary analysis into which teams had the most back-to-backs over the period and which teams performed the worst on back-to-backs compared to their baseline.

In doing the initial exploration, there were a couple of key ideas that stuck out to me:
1) There was a clear difference in mean between back-to-backs and normal games. Statistical testing will be the key to telling if that is a fluke or not, but at first look, our initial gut feeling seems validated.
2) The distribution of margin of victory is bimodal - this makes sense, as games never end in a tie, so a margin of 0 is impossible. Notably, we see that many games are won or lost by 3-7 points.
3) Back-to-backs possibly have a different variance in margin of victory than non-back-to-back games (p-value of ~0.06). That falls just short of significance at an alpha of 0.05, but it is close enough that I would rather not assume equal variances. Combined with the bimodal distribution, this informs our choice of Welch's t-test as the hypothesis test.
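The draft does not say which test produced the variance p-value, so the following is only a sketch of one common option, Levene's test from SciPy, run on synthetic margins. The sample sizes, means, and spreads are made up to loosely resemble the two groups and are not the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two groups of margins (not the real data)
other_games = rng.normal(loc=0.4, scale=14.5, size=5000)
back_to_backs = rng.normal(loc=-1.9, scale=14.0, size=1000)

# Levene's test: the null hypothesis is that both groups have equal variance
statistic, p_value = stats.levene(other_games, back_to_backs, center="median")
print(f"Levene statistic: {statistic:.3f}, p-value: {p_value:.3f}")

# A small p-value argues against assuming equal variances in the t-test
assume_equal_variances = p_value >= 0.05
print(f"Assume equal variances: {assume_equal_variances}")
```

Welch's t-test is the variant of the t-test that does not assume equal variances, so choosing it is the cautious option when a check like this is borderline.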

## Select and Conduct the Hypothesis Test

Our gut-feeling that back-to-backs are detrimental to team performance seems promising, but there is still one more level of rigor needed before we can claim significance: we must define and test our idea with a hypothesis test.

#### Null and Alternative Hypotheses:

Null hypothesis: The second game of a back to back (playing a game the day after another game) does not correlate with lower performance (in the form of a lower margin of victory)

Alternative hypothesis: The second game of a back to back (playing a game the day after another game) does correlate with lower performance (in the form of a lower margin of victory)

$$H_0: \text{(Mean margin of victory)}_{\text{back to back}} - \text{(Mean margin of victory)}_{\text{other}} \geq 0$$


$$H_1: \text{(Mean margin of victory)}_{\text{back to back}} - \text{(Mean margin of victory)}_{\text{other}}<0$$


Again, a negative margin of victory indicates a team lost the game.

#### Confidence Value
We will be testing for 95% confidence, with an alpha of 0.05.

#### Test Selection
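The sample variances of back-to-back games and other games may not be equal (p ≈ 0.06) and the underlying distributions are bimodal rather than normal, so we will use Welch's t-test, which does not assume equal variances. With samples this large, the t-test is also reasonably robust to the non-normal shape.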

#### Test Results

P-value: 7.493420944647e-21

Mean margin of victory on a back to back -1.94 points
Mean margin of victory otherwise 0.39 points

#### Interpreting Test Results
The p-value is much smaller than the alpha of 0.05, so we reject the null hypothesis and conclude, at the 95% confidence level, that the mean margin of victory is lower in back-to-back games than in other games.
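Since the hypotheses are one-sided, it can be worth checking the one-sided p-value directly. The sketch below assumes only the per-group count, mean, and standard deviation from the notebook's summary statistics and uses SciPy's `ttest_ind_from_stats`; the `alternative` argument needs a reasonably recent SciPy release.

```python
from scipy import stats

# Per-group summary statistics of margin of victory (from the notebook's describe() output)
other = {"mean": 0.392637, "std": 14.460829, "n": 19178}
b2b = {"mean": -1.941723, "std": 14.025152, "n": 3878}

# Welch's t-test (equal_var=False), two-sided by default
two_sided = stats.ttest_ind_from_stats(
    b2b["mean"], b2b["std"], b2b["n"],
    other["mean"], other["std"], other["n"],
    equal_var=False,
)

# One-sided version matching H1: mean(back-to-back) - mean(other) < 0
one_sided = stats.ttest_ind_from_stats(
    b2b["mean"], b2b["std"], b2b["n"],
    other["mean"], other["std"], other["n"],
    equal_var=False, alternative="less",
)

print(f"t statistic: {two_sided.statistic:.4f}")
print(f"two-sided p-value: {two_sided.pvalue:.3e}")
print(f"one-sided p-value: {one_sided.pvalue:.3e}")
```

Because the statistic falls in the hypothesized direction, the one-sided p-value is half the two-sided one, so the conclusion does not change either way.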

## Drawing Conclusions

If we were to rule based on this finding alone, we would recommend limiting the number of back-to-back games per season to ensure that games are competitive and that the quality of the entertainment product is as high as possible. In addition, to ensure a fair regular season, we would want to urge the NBA to do their best to ensure that all teams have a similar number of back to backs, considering that back-to-backs have a measurable impact on teams' performance. However, we're not making recommendations based on just this one test; let's finish our analysis below.

## Travel during a Back-to-Back

At first glance, it may seem obvious that back-to-backs are bad and that the NBA should abolish them. At least, that's what the preliminary research indicates. I am not satisfied with ending the research there, so let's keep exploring what makes a back-to-back so tough. As someone who travels for work, one thing that stuck out to me is that a travel day is long and tiring (chartered jet or not!); taking an hours-long flight the night after a close game must be exhausting for players - who would be able to perform at 100% the day after travel? With this in mind, I started to develop a new idea to test: back-to-backs with travel may be more detrimental to team performance than back-to-backs without travel. If the data supports this, we can create an actionable recommendation for the NBA: if home back-to-backs do not affect teams' performance, reduce travel back-to-backs and increase home back-to-backs to compensate.

## Structuring the Data

For this analysis, we first have to disregard data from the season of the COVID bubble; for most of the season all of the teams stayed at the same complex in Orlando and thus did not travel. We reload the testing data, leaving out the COVID season in the loading step. After we load and structure the data the same way we did to test back-to-backs vs normal games, we will use pandas' shift() function to flag games that a team played at a different arena than its previous game. This captures cases where a team travels:
1) Home to away
2) Away to home
3) Away to away (to play at different venues)
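The flagging step can be sketched on a toy schedule as follows. The teams and venue names are made up; the `groupby(...).shift()` approach is the one the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical schedule for two teams (venue names are made up)
team_games = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                            "2021-01-01", "2021-01-03"]),
    "Venue": ["Arena A", "Arena A", "Arena C", "Arena B", "Arena B"],
}).sort_values(["Team", "Date"]).reset_index(drop=True)

# Venue of each team's previous game (missing for a team's first game)
team_games["prev_venue"] = team_games.groupby("Team")["Venue"].shift()

# A game counts as travel when its venue differs from the previous game's venue
team_games["Travel"] = np.where(
    team_games["Venue"] == team_games["prev_venue"], "No Travel", "Travel"
)
print(team_games)
```

A team's first game has no previous venue, so it is labeled "Travel" by default; that does not affect the analysis, because a first game can never be the second night of a back-to-back.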

## Examining the Data

After the travel flag is assigned, let's examine the data again.


## Hypothesis Testing - Travel vs No Travel 

Already, we see a large discrepancy in mean margin of victory for back to backs with travel vs back to backs without travel. Let's conduct a test to see if that difference is statistically significant with an alpha of 0.05.

$$H_0: \text{(Mean margin of victory)}_{\text{back to back with travel}} - \text{(Mean margin of victory)}_{\text{back to back without travel}} \geq 0$$


$$H_1: \text{(Mean margin of victory)}_{\text{back to back with travel}} - \text{(Mean margin of victory)}_{\text{back to back without travel}}<0$$

## Interpretation of Results

We reject the null hypothesis because the test's p-value is smaller than the alpha of 0.05. Combining this with what we know about the rigors of travel, it seems likely that a back-to-back with travel leads a team to perform worse than a back-to-back without travel. More interestingly, the mean margin of victory on back-to-backs without travel has a 95% confidence interval of (-0.7, 1.9). Because that interval contains 0, a pure back-to-back may not be a detriment to performance, assuming there is no travel.
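For reference, that interval follows from the usual t-based formula for a confidence interval on a mean. Plugging in the no-travel summary statistics from the notebook ($n = 508$, $\bar{x} \approx 0.604$, $s \approx 14.962$), with the critical value rounded:

$$\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \approx 0.604 \pm 1.965 \times \frac{14.962}{\sqrt{508}} \approx 0.604 \pm 1.304 = (-0.70,\ 1.91)$$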


# TODO:
add a conclusion
add a plug that I am interested in data science opportunities -> direct to linkedin
update linkedin picture


further research: using different stats to measure teams' performance: xMargin of victory, player specific performance, etc

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import time
import os
cwd = os.getcwd()
import plotly.express as px
import numpy as np
import scipy.stats as stats

## scrape_seasons(years) function
This function scrapes Basketball Reference for the given list of years and returns a dataframe with all games played in those seasons

In [2]:
def scrape_seasons(years):
    
    #create a list of months in the season
    months = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']
    
    #initialize a dataframe to add data onto each season
    all_seasons = pd.DataFrame(columns = ['Date', 'Time', 'Vis', 'VisScore', 'Home', 'Home_Score', 'Box Score', 'OT', 'Attendence', 'Venue', 'Notes'])
    
    #loop through the passed seasons
    for year in years:
        
        #initialize a dataframe to add data onto each month
        year_season = pd.DataFrame(columns = ['Date', 'Time', 'Vis', 'VisScore', 'Home', 'Home_Score', 'Box Score', 'OT', 'Attendence', 'Venue', 'Notes'])
        
        #loop through the months in a season
        for month in months:
            
            #provide a status message and update the url to the appropriate page
            print(f"Pulling the {month} of the {year} season")
            url = f"https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"
            
            #as there are seasons that are missing certain months (due to strikes, COVID, etc) I handled errors with a try/except
            try:
                #create some nice and beautiful soup from the webpage
                html = urlopen(url)
                soup = BeautifulSoup(html, features="lxml")
                table = soup.table
                table_rows = table.find_all('tr')
                
                #create an empty list for games in the month
                month_games = []
                
                #loop through each game in the month, adding them to the above-created list
                for tr in table_rows:
                    td = tr.find_all('td')
                    row = [tr.find('th').text]+[i.text for i in td]
                    if len([i.text for i in td]) > 1:
                        month_games.append(row)
                
                #format the list into a dataframe, then append to the dataframe that contains all other months in the season using pd.concat
                month_games_df = pd.DataFrame(month_games, columns = ['Date', 'Time', 'Vis', 'VisScore', 'Home', 'Home_Score', 'Box Score', 'OT', 'Attendence', 'Venue', 'Notes'])
                year_season = pd.concat([year_season,month_games_df], ignore_index = True)
            
            #print the exception if the webpage does not exist for the month in that season
            except Exception as error:
                print(f"Error: {error}")
            
            #sleep for 10 seconds to respect rate-limiting (and not get temporarily banned)
            time.sleep(10)
        
        #append the season's dataframe to the dataframe that contains all other seasons using pd.concat
        all_seasons = pd.concat([all_seasons, year_season], ignore_index = False)
    
    #save the resulting dataframe to the computer for offline use
    #fall back to the working directory if the "Seasons" folder does not exist
    try:
        all_seasons.to_csv(f"Seasons//{years[0]}-{years[-1]} seasons.csv")
    except OSError:
        all_seasons.to_csv(f"{years[0]}-{years[-1]} seasons.csv")
    
    #return the dataframe (not used in the final version of the code, but does allow for fun impromptu analysis)
    return all_seasons

In [3]:
#Scrape each season from 2014 to 2021
#seasons = scrape_seasons(['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'])

In [17]:
#create a dataframe, df_seasons, from the saved csvs
df_seasons = pd.DataFrame()
for i in [f for f in os.listdir("Seasons") if f.endswith('.csv')]:
    temp_df = pd.read_csv(f"Seasons//{i}")
    df_seasons = pd.concat([df_seasons, temp_df])
    
#sanity check to make sure the data imported successfully (if this line prints 0, a common error in import has been avoided. Woo!)
#print(df_seasons['Home'].isna().sum())

#format the required columns into dateTime, and drop unneeded columns
df_seasons['Date'] = pd.to_datetime(df_seasons["Date"], format = "mixed")
df_seasons['Time'] = pd.to_datetime(df_seasons["Time"], format = "mixed")
df_seasons = df_seasons[['Date', "Vis", 'VisScore', 'Home', 'Home_Score', 'Venue']]

#each game has two teams, but is only represented once in the data
#to account for this, I split each game into two records: one from each team's point of view

#first I create the visiting team's game record. Note how we mark that they are NOT the home team with the column 'home_team'
df_vis_games = pd.DataFrame()
df_vis_games[['Date', 'Team', 'Opponent', 'Venue']] = df_seasons[['Date', 'Vis', 'Home', 'Venue']]
df_vis_games['Margin_of_victory'] = df_seasons['VisScore'] - df_seasons['Home_Score']
df_vis_games['home_team'] = 0

#Next I create the home team's game record. Note how we mark that they are the home team with the column 'home_team'
df_home_games = pd.DataFrame()
df_home_games[['Date', 'Team', 'Opponent', 'Venue']] = df_seasons[['Date', 'Home', 'Vis', 'Venue']]
df_home_games['Margin_of_victory'] = df_seasons['Home_Score'] - df_seasons['VisScore']
df_home_games['home_team'] = 1

#Combine both dataframes and sort by the date of the game
df_all_teams = pd.concat([df_vis_games, df_home_games]).sort_values(by='Date')

#Find the days since each team's last game using the diff() function
df_all_teams["Days_Since_Game"] = df_all_teams.groupby('Team')['Date'].diff()

#handle N/As by replacing them with 0s - these N/As appeared on the first game for each team
#(assign the result back instead of using inplace=True on a column selection, which newer pandas versions warn about)
df_all_teams['Days_Since_Game'] = df_all_teams['Days_Since_Game'].fillna(pd.Timedelta(0))

#cast the days since last game number to an Int for later calculations
df_all_teams['Days_Since_Game'] = df_all_teams['Days_Since_Game'].dt.days.astype(int)

#assign a flag to games that are back-to-back (exactly one day after the team's previous game)
df_all_teams['back_to_back'] = df_all_teams['Days_Since_Game'] == 1

# Data Exploration

### Which teams had the most back-to-backs from 2014->2022?

In [5]:
df_all_teams[df_all_teams['back_to_back'] == 1].groupby('Team')['back_to_back'].count().sort_values(ascending=False)

Team
Atlanta Hawks             137
Cleveland Cavaliers       136
Milwaukee Bucks           136
Washington Wizards        136
Detroit Pistons           135
Sacramento Kings          135
Chicago Bulls             133
Memphis Grizzlies         132
Utah Jazz                 132
Indiana Pacers            131
Boston Celtics            131
Portland Trail Blazers    131
Toronto Raptors           130
San Antonio Spurs         130
Los Angeles Clippers      130
Philadelphia 76ers        130
New Orleans Pelicans      130
Houston Rockets           129
Orlando Magic             128
Golden State Warriors     128
Brooklyn Nets             126
New York Knicks           126
Minnesota Timberwolves    126
Phoenix Suns              122
Dallas Mavericks          122
Oklahoma City Thunder     121
Denver Nuggets            121
Los Angeles Lakers        120
Miami Heat                118
Charlotte Hornets         114
Charlotte Bobcats          22
Name: back_to_back, dtype: int64

### Which organizations performed the worst on back-to-backs compared to other games from 2014->2022?

Almost all organizations seem to perform worse on a back-to-back than their baseline.


In [6]:
(df_all_teams[df_all_teams['back_to_back'] == 1].groupby('Team')['Margin_of_victory'].mean() - df_all_teams[df_all_teams['back_to_back'] == 0].groupby('Team')['Margin_of_victory'].mean()).sort_values(ascending=True)

Team
Brooklyn Nets            -6.673913
Philadelphia 76ers       -6.229007
Charlotte Bobcats        -6.194602
Denver Nuggets           -4.263995
Phoenix Suns             -3.759456
Los Angeles Lakers       -3.724183
Milwaukee Bucks          -3.709458
Minnesota Timberwolves   -3.469673
Utah Jazz                -3.444878
Indiana Pacers           -3.295693
Oklahoma City Thunder    -3.133058
Dallas Mavericks         -3.102173
New Orleans Pelicans     -2.823127
Sacramento Kings         -2.583648
Cleveland Cavaliers      -2.538479
Detroit Pistons          -2.472871
Boston Celtics           -2.386446
Charlotte Hornets        -2.011278
Memphis Grizzlies        -1.943134
Chicago Bulls            -1.904120
Golden State Warriors    -1.874571
Atlanta Hawks            -0.975676
New York Knicks          -0.902732
Miami Heat               -0.800306
Los Angeles Clippers     -0.618101
Washington Wizards       -0.310720
Toronto Raptors           0.027711
San Antonio Spurs         0.081464
Orlando Magic  

### Histogram of Margin of victory for back to back (red) and other games (blue)

In [25]:
fig = px.histogram(df_all_teams, x = "Margin_of_victory", color='back_to_back', histnorm = 'percent', barmode = 'overlay', title = 'Histogram of Margin of Victory, Back-to-back vs Other')
fig.update_layout(
    title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_layout(font_family = 'Arial', title_font_color = 'black')
fig.show()

### Summary Statistics for back to backs and other games

In [19]:
df_all_teams.groupby('back_to_back')['Margin_of_victory'].describe()

| back_to_back | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| False | 19178.0 | 0.392637 | 14.460829 | -57.0 | -9.0 | 1.0 | 10.0 | 73.0 |
| True | 3878.0 | -1.941723 | 14.025152 | -73.0 | -11.0 | -3.0 | 8.0 | 61.0 |


In [21]:
df_all_teams[df_all_teams['back_to_back']==True]

| | Date | Team | Opponent | Venue | Margin_of_victory | home_team | Days_Since_Game | back_to_back |
|---|---|---|---|---|---|---|---|---|
| 8 | 2013-10-30 | Indiana Pacers | New Orleans Pelicans | Smoothie King Center | 5 | 0 | 1 | True |
| 16 | 2013-10-30 | Los Angeles Lakers | Golden State Warriors | Oracle Arena | -31 | 0 | 1 | True |
| 9 | 2013-10-30 | Orlando Magic | Minnesota Timberwolves | Target Center | -5 | 0 | 1 | True |
| 5 | 2013-10-30 | Miami Heat | Philadelphia 76ers | Wells Fargo Center | -4 | 0 | 1 | True |
| 17 | 2013-10-31 | New York Knicks | Chicago Bulls | United Center | -1 | 0 | 1 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1215 | 2022-04-10 | Indiana Pacers | Brooklyn Nets | Barclays Center | -8 | 0 | 1 | True |
| 1227 | 2022-04-10 | New Orleans Pelicans | Golden State Warriors | Smoothie King Center | -21 | 1 | 1 | True |
| 1222 | 2022-04-10 | Philadelphia 76ers | Detroit Pistons | Wells Fargo Center | 12 | 1 | 1 | True |
| 1219 | 2022-04-10 | Memphis Grizzlies | Boston Celtics | FedEx Forum | -29 | 1 | 1 | True |


# Hypothesis testing - Mean margin of victory

### Null and Alternative Hypotheses
Null hypothesis: The second game of a back to back (playing a game the day after another game) does not correlate with lower performance (in the form of a lower margin of victory)

Alternative hypothesis: The second game of a back to back (playing a game the day after another game) does correlate with lower performance (in the form of a lower margin of victory)

$$H_0: \text{(Mean margin of victory)}_{\text{back to back}} - \text{(Mean margin of victory)}_{\text{other}} \geq 0$$


$$H_1: \text{(Mean margin of victory)}_{\text{back to back}} - \text{(Mean margin of victory)}_{\text{other}}<0$$


As a note, a negative margin of victory indicates a team lost the game.

### Confidence Value
We will be testing for 95% confidence, with an alpha (significance level) of 0.05.

### Test selection
The variances may not be equal and the underlying distributions are bimodal rather than normal, so we will use Welch's t-test, which does not assume equal variances.

In [9]:
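#Perform Welch's t-test (equal_var = False); other games are passed first, so a positive statistic means back-to-backs fared worse
#Note: ttest_ind is two-sided by default, so for our one-sided hypothesis the reported p-value is conservative
#(it can be halved when the statistic is in the hypothesized direction, as it is here)
t_statistic, p_value = stats.ttest_ind(df_all_teams[df_all_teams['back_to_back']==0]['Margin_of_victory'], df_all_teams[df_all_teams['back_to_back']==1]['Margin_of_victory'], equal_var = False)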

print(f"Welch's T-statistic: {t_statistic}")
print(f"P-value: {p_value}\n")

print(f"Mean margin of victory on a back to back {df_all_teams[df_all_teams['back_to_back']==1]['Margin_of_victory'].mean().round(2)} points")
print(f"Mean margin of victory otherwise {df_all_teams[df_all_teams['back_to_back']==0]['Margin_of_victory'].mean().round(2)} points")

Welch's T-statistic: 9.4033275285209
P-value: 7.493420944647e-21

Mean margin of victory on a back to back -1.94 points
Mean margin of victory otherwise 0.39 points


## Interpretation of results
Wow, what a tiny p-value. With a p-value that small, we can rule that it is incredibly unlikely that the mean margin of victory in a back to back is equal to or higher than the mean margin of victory otherwise and reject the null hypothesis.

From our test, there is strong evidence that back-to-back games are associated with reduced performance (in the form of a lower margin of victory).
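Statistical significance says little about the size of an effect, especially with samples this large. As a rough gauge, the sketch below computes Cohen's d from the summary statistics reported above. Cohen's d is a standardized mean difference; it is my choice of effect-size measure for illustration, not something used in the original analysis.

```python
import math

# Summary statistics of margin of victory from the describe() output above
n_other, mean_other, std_other = 19178, 0.392637, 14.460829
n_b2b, mean_b2b, std_b2b = 3878, -1.941723, 14.025152

# Pooled standard deviation across both groups
pooled_var = (
    (n_other - 1) * std_other**2 + (n_b2b - 1) * std_b2b**2
) / (n_other + n_b2b - 2)
pooled_std = math.sqrt(pooled_var)

# Cohen's d: difference in means expressed in pooled standard deviations
mean_diff = mean_other - mean_b2b
cohens_d = mean_diff / pooled_std

print(f"Difference in means: {mean_diff:.2f} points")
print(f"Cohen's d: {cohens_d:.2f}")
```

By the common rule of thumb, a d of around 0.2 or less is a small effect: the gap of a little over two points is very unlikely to be a fluke, but it is modest next to the game-to-game spread of roughly 14 points.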

Based on this finding, I recommend limiting back-to-back games to ensure that the quality of the entertainment product is as high as possible. In addition, to ensure a fair regular season, it is important that all teams have a similar number of back-to-backs, considering they are disadvantageous.

# How much worse is a back to back with travel vs a normal back to back?
Above, we established that a back-to-back is associated with, on average, a swing of roughly 2.3 points in a team's margin of victory. How much worse is a back-to-back with travel?

As a note, for this investigation, we will categorize all games played in the bubble season as no-travel games.

In [10]:
#create a dataframe, df_seasons_trvl, from the saved csvs
df_seasons_trvl = pd.DataFrame()
for i in [f for f in os.listdir("trvl_seasons") if f.endswith('.csv')]:
    temp_df = pd.read_csv(f"trvl_seasons//{i}")
    df_seasons_trvl = pd.concat([df_seasons_trvl, temp_df])

df_seasons_trvl = df_seasons_trvl.sort_values(by='Date', ascending = True)
#sanity check to make sure all the data imported successfully (if this prints 0, the data has been imported successfully)
#print(df_seasons_trvl['Home'].isna().sum())

#format the required columns into dateTime, and drop unneeded columns
df_seasons_trvl['Date'] = pd.to_datetime(df_seasons_trvl["Date"], format = "mixed")
df_seasons_trvl['Time'] = pd.to_datetime(df_seasons_trvl["Time"], format = "mixed")
df_seasons_trvl = df_seasons_trvl[['Date', "Vis", 'VisScore', 'Home', 'Home_Score', 'Venue']]

#each game has two teams, but is only represented once in the data
#to account for this, I split each game into two records: one from each team's point of view

#first I create the visiting team's game record. Note how we mark that they are NOT the home team with the column 'home_team'
df_vis_games_trvl = pd.DataFrame()
df_vis_games_trvl[['Date', 'Team', 'Opponent', 'Venue']] = df_seasons_trvl[['Date', 'Vis', 'Home', 'Venue']]
df_vis_games_trvl['Margin_of_victory'] = df_seasons_trvl['VisScore'] - df_seasons_trvl['Home_Score']
df_vis_games_trvl['home_team'] = 0

#Next I create the home team's game record. Note how we mark that they are the home team with the column 'home_team'
df_home_games_trvl = pd.DataFrame()
df_home_games_trvl[['Date', 'Team', 'Opponent', 'Venue']] = df_seasons_trvl[['Date', 'Home', 'Vis', 'Venue']]
df_home_games_trvl['Margin_of_victory'] = df_seasons_trvl['Home_Score'] - df_seasons_trvl['VisScore']
df_home_games_trvl['home_team'] = 1

#Combine both dataframes and sort by the date of the game
df_all_teams_trvl = pd.concat([df_vis_games_trvl, df_home_games_trvl]).sort_values(by='Date')

#Find the days since each team's last game using the diff() function
df_all_teams_trvl["Days_Since_Game"] = df_all_teams_trvl.groupby('Team')['Date'].diff()

#handle N/As by replacing them with 0s - these N/As appeared on the first game for each team
#(assign the result back instead of using inplace=True on a column selection, which newer pandas versions warn about)
df_all_teams_trvl['Days_Since_Game'] = df_all_teams_trvl['Days_Since_Game'].fillna(pd.Timedelta(0))

#cast the days since last game number to an Int for later calculations
df_all_teams_trvl['Days_Since_Game'] = df_all_teams_trvl['Days_Since_Game'].dt.days.astype(int)

#assign a flag to games that are back-to-back (exactly one day after the team's previous game)
df_all_teams_trvl['back_to_back'] = df_all_teams_trvl['Days_Since_Game'] == 1

#Reset the index to allow for the shift function on the next line
df_all_teams_trvl = df_all_teams_trvl.sort_values(by=['Team', 'Date']).reset_index()

#Pull the previous games' venues
df_all_teams_trvl['prev_venue'] = df_all_teams_trvl.groupby('Team')['Venue'].shift().fillna(1)

#'Travel' where the venue differs from the team's previous game, 'No Travel' where it is the same
df_all_teams_trvl['Travel'] = np.where(df_all_teams_trvl['Venue']==df_all_teams_trvl['prev_venue'], 'No Travel', 'Travel')

In [11]:
#Keep only the back-to-back games; the 'Travel' column splits them into travel and no-travel groups
df_b2b = df_all_teams_trvl[df_all_teams_trvl['back_to_back'] == 1]

## Histogram comparing back to backs with travel to those without

In [12]:
fig = px.histogram(df_b2b, x = "Margin_of_victory", color='Travel', nbins = df_b2b['Margin_of_victory'].nunique()*2, barmode = 'overlay', title = '<b>Histogram of Margin of Victory: Travel vs No Travel</b>')
fig.update_layout(
    title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_layout(font_family = 'Arial', title_font_color = 'black')
fig.show()

In [26]:
fig = px.histogram(df_b2b, x = "Margin_of_victory", color='Travel', histnorm = 'percent', nbins = df_b2b['Margin_of_victory'].nunique()*2, barmode = 'overlay', title = 'Histogram of Margin of Victory: Travel vs No Travel')
fig.update_layout(
    title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_layout(font_family = 'Arial', title_font_color = 'black')
fig.show()

### Summary Statistics

In [13]:
df_b2b.groupby('Travel')['Margin_of_victory'].describe()

| Travel | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| No Travel | 508.0 | 0.604331 | 14.962245 | -56.0 | -9.25 | 2.0 | 10.0 | 50.0 |
| Travel | 3370.0 | -2.325519 | 13.84018 | -73.0 | -11.0 | -3.0 | 7.0 | 61.0 |


In [23]:
df_b2b

| | index | Date | Team | Opponent | Venue | Margin_of_victory | home_team | Days_Since_Game | back_to_back | prev_venue | Travel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 138 | 2013-11-16 | Atlanta Hawks | New York Knicks | Madison Square Garden (IV) | 20 | 0 | 1 | True | Philips Arena | Travel |
| 11 | 165 | 2013-11-20 | Atlanta Hawks | Detroit Pistons | Philips Arena | 8 | 1 | 1 | True | AmericanAirlines Arena | Travel |
| 13 | 189 | 2013-11-23 | Atlanta Hawks | Boston Celtics | Philips Arena | -7 | 1 | 1 | True | The Palace of Auburn Hills | Travel |
| 15 | 220 | 2013-11-27 | Atlanta Hawks | Houston Rockets | Toyota Center | -29 | 0 | 1 | True | Philips Arena | Travel |
| 17 | 240 | 2013-11-30 | Atlanta Hawks | Washington Wizards | Verizon Center | -7 | 0 | 1 | True | Philips Arena | Travel |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20713 | 1009 | 2022-03-12 | Washington Wizards | Portland Trail Blazers | Moda Center | -9 | 0 | 1 | True | Crypto.com Arena | Travel |
| 20717 | 1059 | 2022-03-19 | Washington Wizards | Los Angeles Lakers | Capital One Arena | 8 | 1 | 1 | True | Madison Square Garden (IV) | Travel |
| 20720 | 1099 | 2022-03-25 | Washington Wizards | Detroit Pistons | Little Caesars Arena | 3 | 0 | 1 | True | Fiserv Forum | Travel |
| 20723 | 1136 | 2022-03-30 | Washington Wizards | Orlando Magic | Capital One Arena | 17 | 1 | 1 | True | Capital One Arena | No Travel |


## Initial interpretation of data

Already, we see a large discrepancy in mean margin of victory for back-to-backs with travel vs back-to-backs without travel. Let's conduct a test to see if that difference is statistically significant, with an alpha of 0.05.

$$H_0: \text{(Mean margin of victory)}_{\text{back to back with travel}} - \text{(Mean margin of victory)}_{\text{back to back without travel}} \geq 0$$


$$H_1: \text{(Mean margin of victory)}_{\text{back to back with travel}} - \text{(Mean margin of victory)}_{\text{back to back without travel}}<0$$

In [14]:
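#Perform Welch's t-test (equal_var = False); no-travel games are passed first, so a positive statistic means travel games fared worse
#Note: ttest_ind is two-sided by default, so for our one-sided hypothesis the reported p-value is conservative
t_statistic, p_value = stats.ttest_ind(df_b2b[df_b2b['Travel']=='No Travel']['Margin_of_victory'], df_b2b[df_b2b['Travel']=='Travel']['Margin_of_victory'], equal_var = False)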

print(f"Welch's T-statistic: {t_statistic}")
print(f"P-value: {p_value}\n")

print(f"Mean margin of victory on a back to back with travel {df_b2b[df_b2b['Travel']=='Travel']['Margin_of_victory'].mean().round(2)} points")
print(f"Mean margin of victory on a back to back without travel {df_b2b[df_b2b['Travel']=='No Travel']['Margin_of_victory'].mean().round(2)} points")

Welch's T-statistic: 4.153720542537873
P-value: 3.7117713908065724e-05

Mean margin of victory on a back to back with travel -2.33 points
Mean margin of victory on a back to back without travel 0.6 points


## Interpretation of results

We reject the null hypothesis because the test's p-value is smaller than the alpha of 0.05. Combining this with what we know about the rigors of travel, it seems likely that a back-to-back with travel leads a team to perform worse than a back-to-back without travel. More interestingly, the mean margin of victory on back-to-backs without travel has a 95% confidence interval of (-0.7, 1.9). Because that interval contains 0, a pure back-to-back may not be a detriment to performance, assuming there is no travel.

In [15]:
from scipy.stats import t

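# Back-to-backs without travel (named to avoid confusion with the df= argument of t.ppf below)
no_travel_b2b = df_b2b[df_b2b['Travel']=='No Travel']

# Sample statistics
sample_mean = no_travel_b2b['Margin_of_victory'].mean()
sample_std = no_travel_b2b['Margin_of_victory'].std()
sample_size = len(no_travel_b2b)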

# Confidence level (e.g., 95%)
confidence_level = 0.95

# Degrees of freedom (for a t-distribution)
degrees_of_freedom = sample_size - 1

# Calculate the critical value from the t-distribution
critical_value = t.ppf((1 + confidence_level) / 2, df=degrees_of_freedom)

# Calculate the margin of error
margin_of_error = critical_value * (sample_std / np.sqrt(sample_size))

# Calculate the confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Display the results
print(f"Sample Mean: {sample_mean}")
print(f"Confidence Interval ({int(confidence_level * 100)}%): {confidence_interval}")


Sample Mean: 0.6043307086614174
Confidence Interval (95%): (-0.6998896931496859, 1.9085511104725206)
