# Nate Gentry's MLB Circadian Advantage Solution

First, I imported the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from math import sqrt
import requests

Next, I pulled in the CSV for the case from GitHub and printed a sample to make sure everything looked OK.

In [2]:
url = ('https://raw.githubusercontent.com/johnjfox/Analytic_Enterprise/master/data/circadian/circadian.csv')
df = pd.read_csv(url)
print(df.sample(5))

        Game         Team     Date  Year H/A W/L  CT  CTopp
3349    3350      Oakland   8/4/98  1998   h   l   0      0
29759  29760  Los Angeles  4/29/04  2004   h   l   0      0
18606  18607    Cleveland   9/1/01  2001   a   w   0      0
42273  42274      Arizona  8/11/06  2006   h   l   0     -3
39213  39214      Toronto  4/16/06  2006   a   l   0      0


The first thing I wanted to do with the dataset was add a column for the calculated Circadian Advantage of the subject team in each game. Since the advantage is determined by the number of hours away from zero, not by whether the value is positive or negative, I created two new columns that convert the existing 'CT' and 'CTopp' columns into absolute values. Finally, I created a column that subtracts the absolute value of the subject team's CT hours from that of its opponent to give the Circadian Advantage for each game.

In [3]:
df['CT_Abs'] = df['CT'].abs()
df['CTopp_Abs'] = df['CTopp'].abs()
df['CTadv'] = df['CTopp_Abs'] - df['CT_Abs']
df.sample(5)

        Game        Team     Date  Year H/A W/L  CT  CTopp  CT_Abs  CTopp_Abs  CTadv
23751  23752   Tampa Bay  9/10/02  2002   h   l   0      0       0          0      0
19863  19864   Baltimore  4/18/02  2002   a   l   0      0       0          0      0
20427  20428  NY Yankees   5/9/02  2002   a   w   0      0       0          0      0
10894  10895   San Diego  5/18/00  2000   a   w   0      0       0          0      0
20934  20935   Minnesota  5/28/02  2002   a   w   1      0       1          0     -1
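
As a quick sanity check of the calculation above, the sketch below applies the same steps to a few hand-made rows. The toy values are invented for illustration and are not taken from the case dataset.

```python
import pandas as pd

# Hypothetical rows (not from the case data): hours each team is shifted
# from its home time zone; only the distance from zero matters.
toy = pd.DataFrame({'CT': [0, 1, 0, 2], 'CTopp': [0, 0, -3, -2]})

# Advantage = opponent's absolute shift minus the subject team's absolute shift.
toy['CTadv'] = toy['CTopp'].abs() - toy['CT'].abs()

print(toy)
```

A positive 'CTadv' means the opponent is further from its body clock than the subject team; equal shifts in opposite directions cancel to zero.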


Next, I converted the 'W/L' column from 'w' and 'l' values into 1 for wins and 0 for losses so I could more easily manipulate the data later on. I named the resulting column 'Won_Game'.

In [4]:
df['W/L_New'] = np.where(df['W/L']=='w',1,0)
df = df.drop(['W/L'], axis=1)  # pass labels as a list rather than a set
df = df.rename(columns={'W/L_New': 'Won_Game'})

df.sample(5)

        Game          Team     Date  Year H/A  CT  CTopp  CT_Abs  CTopp_Abs  CTadv  Won_Game
34185  34186       Seattle  4/11/05  2005   a   2      2       2          2      0         1
38820  38821       Anaheim  10/2/05  2005   a   0      0       0          0      0         1
15861  15862     Nationals  5/21/01  2001   h   0      0       0          0      0         0
28222  28223       Houston  8/27/03  2003   h   0      0       0          0      0         1
16797  16798  Philadelphia  6/25/01  2001   h   0      0       0          0      0         0


The next column I added to the dataset was a categorical column that labels each game as one in which the subject team had a Circadian Advantage ('+'), a Disadvantage ('-'), or no difference (0). I named the new column 'CTadvCat'.

In [5]:
df['CTadvCat'] = 0
# .ix has been removed from pandas; .loc does the same label-based assignment
df.loc[df.CTadv > 0, 'CTadvCat'] = '+'
df.loc[df.CTadv < 0, 'CTadvCat'] = '-'
df.sample(5)

        Game          Team     Date  Year H/A  CT  CTopp  CT_Abs  CTopp_Abs  CTadv  Won_Game CTadvCat
2181    2182    Cincinnati  6/21/98  1998   a   0      0       0          0      0         0        0
7197    7198     Minnesota   7/1/99  1999   a   0      0       0          0      0         0        0
6187    6188     Tampa Bay  5/25/99  1999   h   0      0       0          0      0         0        0
24051  24052       NY Mets  9/21/02  2002   h   0      0       0          0      0         1        0
42407  42408  Philadelphia  8/15/06  2006   h   0      0       0          0      0         1        0
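
The same labeling can also be written as a single vectorized step. The sketch below is one possible alternative using `np.select` on invented values. Unlike the cell above, it labels neutral games with the string '0' rather than the integer 0, an assumption made here so that the column has a single type. String labels sort as '+', '-', '0', so any later code that relies on row position would need to be adjusted.

```python
import numpy as np
import pandas as pd

# Hypothetical advantage values, for illustration only.
toy = pd.DataFrame({'CTadv': [2, 0, -1, 3, 0]})

# np.select checks the conditions in order and uses `default` when none match.
toy['CTadvCat'] = np.select(
    [toy['CTadv'] > 0, toy['CTadv'] < 0],
    ['+', '-'],
    default='0',  # string label (assumption); the notebook cell uses integer 0
)

print(toy)
```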


Now that I had all of the new columns I needed, I dropped the columns I no longer needed from the dataframe. These included the game identifiers 'Game', 'Team', 'Date', and 'Year'. I also dropped the columns that were used to calculate the new columns and were now irrelevant to the analysis.

In [6]:
df = df.drop(['Game','Team','Date','Year','CT','CTopp','CT_Abs','CTopp_Abs'], axis=1)
df.sample(5)

      H/A  CTadv  Won_Game CTadvCat
21924   a      0         1        0
21792   a      0         1        0
13700   h      1         1        +
5167    h      0         0        0
31171   a      0         0        0


Next, I filtered the dataset down to home games only. As the case mentions, every game is represented in the data by two rows, one for each team that played in it, so each subject row has a mirror-image row for the same game. Counting both rows would double-count every game and overstate the sample size, so I removed all of the away games.

In [7]:
df = df[df['H/A']=='h']
df.sample(5)

      H/A  CTadv  Won_Game CTadvCat
40144   h      0         0        0
19470   h      0         0        0
6488    h      0         1        0
30007   h      0         1        0
26879   h      1         1        +
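
To illustrate why keeping only one row per game loses no information, the sketch below builds a tiny invented schedule in which each game appears once per team. The `game_id` column is hypothetical and is used here only to pair the rows; the case data may not contain such a shared identifier.

```python
import pandas as pd

# Hypothetical two-game schedule: one home row and one away row per game.
toy = pd.DataFrame({
    'game_id':  [1, 1, 2, 2],          # hypothetical pairing key
    'H/A':      ['h', 'a', 'h', 'a'],
    'CTadv':    [1, -1, 0, 0],
    'Won_Game': [1, 0, 0, 1],
})

home = toy[toy['H/A'] == 'h'].set_index('game_id')
away = toy[toy['H/A'] == 'a'].set_index('game_id')

# The away row mirrors the home row: opposite advantage, opposite result.
mirrored = bool(
    (home['CTadv'] == -away['CTadv']).all()
    and (home['Won_Game'] == 1 - away['Won_Game']).all()
)

print(len(toy), len(home), mirrored)
```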


The next step in my analysis was to pivot the dataset to determine the winning percentage for games in which the home team had a positive, negative, or no Circadian Advantage. To do this, I made a pivot table indexed on the 'CTadvCat' column with values taken from the 'Won_Game' column. Using the aggfunc option, I produced one column with the number of games won (sum) and another with the number of games played (count) for each category. I then added a column, 'Home_WinPct', that divides wins by games played to give a winning percentage for each category.

In [8]:
pivot = pd.pivot_table(df, index=('CTadvCat'), values = 'Won_Game', aggfunc = ('sum', 'count'))
pivot['Home_WinPct'] = pivot['sum'] / pivot['count']
pivot.rename(columns={'sum': 'Home_Wins','count': 'Home_Games'}, inplace=True)
pivot

          Home_Wins  Home_Games  Home_WinPct
CTadvCat
0              9224       17268     0.534167
+              1840        3360     0.547619
-               676        1227     0.550937
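
The same summary can also be produced with `groupby` instead of `pivot_table`. The sketch below is an alternative formulation on invented rows, not the cell used above. It looks rows up by label afterwards, so it does not depend on the order in which the categories appear.

```python
import pandas as pd

# Hypothetical home-game rows, for illustration only.
toy = pd.DataFrame({
    'CTadvCat': ['+', '+', '-', '0', '0', '0'],
    'Won_Game': [1, 0, 1, 1, 0, 0],
})

# Wins are the sum of the 0/1 flag; games played are the row count.
summary = toy.groupby('CTadvCat')['Won_Game'].agg(['sum', 'count'])
summary = summary.rename(columns={'sum': 'Home_Wins', 'count': 'Home_Games'})
summary['Home_WinPct'] = summary['Home_Wins'] / summary['Home_Games']

print(summary)
```

Selecting with `summary.loc['+', 'Home_WinPct']` is less fragile than positional indexing if the category labels or their sort order ever change.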


With wins and games (successes and sample sizes) as well as winning percentages (proportions) for each category, I was ready to run a hypothesis test. I first treated the no-advantage category as the baseline (null hypothesis) sample and the positive-advantage category as the test sample, and plugged the data into the two-proportion z-test, using the pooled proportion of the two samples for the standard error. I then converted the resulting z-value into a one-tailed p-value.

In [9]:
Ratio_Diff_Adv = pivot.iloc[1][2] - pivot.iloc[0][2]
Pop_Proportion_Adv = ((pivot.iloc[0][0] + pivot.iloc[1][0]) / (pivot.iloc[0][1] + pivot.iloc[1][1]))
Std_Error_Adv = np.sqrt((Pop_Proportion_Adv*(1-Pop_Proportion_Adv))*((1/(pivot.iloc[1][1])+(1/(pivot.iloc[0][1])))))
z_test_Adv = Ratio_Diff_Adv / Std_Error_Adv
z_test_Adv

1.4306184249952527

In [10]:
p_value_Adv = stats.norm.sf(abs(z_test_Adv))
p_value_Adv

0.076269801971357132
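
The calculation above can be packaged as a small reusable function. The sketch below is one way to do that, assuming Python 3 division; the function name `two_proportion_z` is my own and not part of any library. It is fed the win and game counts from the pivot table above.

```python
from math import sqrt
from scipy import stats

def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """Pooled two-proportion z statistic for group A versus group B."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    # Under the null hypothesis both groups share one underlying proportion.
    pooled = (wins_a + wins_b) / (games_a + games_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (p_a - p_b) / std_err

# Advantage games versus neutral games, counts taken from the pivot table.
z_adv = two_proportion_z(1840, 3360, 9224, 17268)
p_adv = stats.norm.sf(abs(z_adv))  # one-tailed, as in the cell above

print(z_adv, p_adv)
```

Run on these counts, the function should reproduce the z-value of about 1.43 and the one-tailed p-value of about 0.076 shown above.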

Next, I did the same calculation with the negative-advantage category as the test sample.

In [11]:
Ratio_Diff_NegAdv = pivot.iloc[2][2] - pivot.iloc[0][2]
Pop_Proportion_NegAdv = ((pivot.iloc[0][0] + pivot.iloc[2][0]) / (pivot.iloc[0][1] + pivot.iloc[2][1]))
Std_Error_NegAdv = np.sqrt((Pop_Proportion_NegAdv*(1-Pop_Proportion_NegAdv))*((1/(pivot.iloc[2][1])+(1/(pivot.iloc[0][1])))))
z_test_NegAdv = Ratio_Diff_NegAdv / Std_Error_NegAdv
z_test_NegAdv

1.1380542948766821

In [12]:
p_value_NegAdv = stats.norm.sf(abs(z_test_NegAdv))
p_value_NegAdv

0.12754890567837712
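
Both p-values above are one-tailed because they take the upper tail of `abs(z)`. If no direction is assumed in advance, a two-sided p-value may be the more natural choice, particularly for the disadvantage group, where the observed difference runs opposite to what the hypothesis predicts. The sketch below computes two-sided values from the z-values reported above, using only the standard library. The helper name `two_sided_p` is my own.

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))

z_adv = 1.4306184249952527  # z-value reported above for advantage games
z_neg = 1.1380542948766821  # z-value reported above for disadvantage games

print(two_sided_p(z_adv), two_sided_p(z_neg))
```

Each result is twice the corresponding one-tailed value, roughly 0.153 and 0.255, so the conclusion of no statistically significant difference is unchanged.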

## Summary

My analysis showed that, for the games in the dataset, the home team's winning percentage was higher both in games where it had a Circadian Advantage and in games where it had a Disadvantage than in games where both teams had the same circadian score. However, when the advantage and disadvantage categories were each tested against the neutral games, neither difference in winning percentage met the conventional 0.05 threshold for statistical significance. For the advantage category, the one-tailed p-value of .076 means that, if the true winning percentage were the same as in neutral games, a difference at least this large would be expected about 7.6% of the time with samples of this size. The disadvantage category had an even higher p-value despite a better winning percentage, because its sample is much smaller (1,227 games versus 3,360). At .127, this p-value suggests that a difference at least this large would arise about 12.7% of the time by chance alone if there were no real difference from the neutral games.

Ultimately, the data provides no statistically significant evidence for the "Circadian Advantage", which suggests that it may be a myth.