# Inferential Statistics

## Background

With the increasing prevalence of drug abuse in horse racing, the Thoroughbred Horseracing Integrity Act of 2015 was introduced to monitor and regulate the administration of drugs to American racehorses under a uniform national standard. Effective January 1, 2017, the legislation authorizes the Thoroughbred Horseracing Anti-Doping Authority (THADA) to develop and administer a national anti-doping program. As an independent organization, _______

In [40]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats 

# import dataset as a dataframe
df = pd.read_csv('Equine_Breakdown_Death_Doping.csv',index_col = 'Unnamed: 0')

# Yearly Death and Breakdown Rate

As discussed in the exploratory data analysis, the number of horses that died or broke down each year has fallen over time. But is this decrease statistically significant? The Thoroughbred Horseracing Integrity Act took effect on January 1, 2017, so we will test whether the 2017 death and breakdown rate differs significantly from the mean of the other years. We have only 9 years' worth of data in total, so we will use a t-test, which accounts for the extra uncertainty of estimating the spread from a small sample.

## Yearly Rates - Total (Doping Trainers and Non-Doping Trainers)

### Hypothesis Test

$H_0$: There is no difference between the 2017 death and breakdown rate and the average death and breakdown rate of the other years (2009 - 2016)

$H_A$: There is a difference between the 2017 death and breakdown rate and the average death and breakdown rate of the other years (2009 - 2016)

### Perform t Test
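
For reference, the statistic computed in this section can be written out as below. This is a sketch of what the notebook code does rather than a formal derivation: $\bar{x}$ is the mean of the 2009 - 2016 yearly totals, $x_{2017}$ is the 2017 total, $n = 8$ is the number of pre-2017 years, and $s$ is their standard deviation (the code uses `np.std`, whose default `ddof=0` gives the population form). The symbol names are introduced here for illustration.

$$
t = \frac{\bar{x} - x_{2017}}{s / \sqrt{n}}, \qquad p = 2 \, P\left(T_{n-1} \ge |t|\right)
$$

Here $T_{n-1}$ denotes a t-distributed variable with $n - 1 = 7$ degrees of freedom, so $p$ is the two-sided tail probability.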

In [41]:
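# extract yearly totals: count rows (deaths and breakdowns) per year
year_total = df.groupby('Year')['Year'].count().reset_index(name='Total')
# groupby sorts the years (2009-2017), so position 8 is 2017
total_rate_2017 = year_total['Total'].values[8]
# the first 8 rows are the comparison years, 2009-2016
total_rate_otheryrs = year_total['Total'][:8]
n1 = len(total_rate_otheryrs)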

# print some sample statistics
year_total['Total'].describe()

count      9.000000
mean     360.000000
std       92.126272
min      213.000000
25%      295.000000
50%      373.000000
75%      426.000000
max      483.000000
Name: Total, dtype: float64
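
The positional indexing used above (`values[8]` and `[:8]`) relies on the grouped years being sorted with none missing. A year with no rows would simply be absent from the `groupby` result and shift every later position. As a hedged alternative, the sketch below selects by year label instead. The `toy` dataframe and its values are made up purely for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from the CSV.
toy = pd.DataFrame({'Year': [2015, 2015, 2016, 2017, 2017, 2017]})

# Count rows per year, keeping the year itself as the index.
counts = toy.groupby('Year').size()

# Reindex over the full range so a year with no rows appears as 0
# instead of silently disappearing.
counts = counts.reindex(range(2014, 2018), fill_value=0)

# Select by label rather than by position.
count_2017 = counts.loc[2017]
count_other = counts.drop(2017)

print(count_2017)
print(count_other)
```

With label-based selection, the 2017 value and the comparison years stay correct even if a year is missing or the rows arrive in a different order.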

In [42]:
# calculate t statistic
# note: np.std defaults to ddof=0 (population standard deviation)
t1 = ( np.mean(total_rate_otheryrs) - total_rate_2017 ) / ( np.std(total_rate_otheryrs) / np.sqrt(n1) )
print('t =', round(t1,3))

# perform two-sided t test with n1 - 1 degrees of freedom
p_t1 = stats.t.sf(abs(t1), n1-1) * 2
print('p = p( t <= ', -abs(round(t1,3)), ') + p( t >= ', abs(round(t1,3)), ') = ', round(p_t1,3))
print('p (unrounded) is ', p_t1)

t = 6.337
p = p( t <=  -6.337 ) + p( t >=  6.337 ) =  0.0
p (unrounded) is  0.000390091360871


We have found a very small p-value (about 0.0004), so we reject the null hypothesis. There is a statistically significant difference between the 2017 death and breakdown rate and the average death and breakdown rate of the other years (2009 - 2016). This suggests that the sport has been improving over the past few years, whether or not the improvement is related to anti-doping legislation.

We do have to keep in mind that we are working with a small dataset: only 8 years' worth of data (2009 - 2016) goes into the pre-2017 mean. Even so, the very small p-value suggests that the difference is significant.

## Yearly Rates - Total (Doping Trainers)

Now that we know there has been a significant decrease in the number of deaths and breakdowns per year, let's see whether it is related to the introduction of anti-doping programs. Has the number of horses dying and breaking down under trainers with a doping history also decreased?

### Hypothesis Test

$H_0$: There is no difference between the 2017 death and breakdown rate and the average death and breakdown rate of the other years (2009 - 2016) for horses trained by trainers with a history of doping

$H_A$: There is a difference between the 2017 death and breakdown rate and the average death and breakdown rate of the other years (2009 - 2016) for horses trained by trainers with a history of doping

### Perform t Test

In [48]:
# keep rows whose trainer has a doping history (non-null 'Year of Action')
doping_trainers = df.loc[pd.notnull(df['Year of Action'])]
# extract yearly totals for those trainers
doped_year_total = doping_trainers.groupby('Year')['Year'].count().reset_index(name='Total')
# position 8 is 2017; the first 8 rows are 2009-2016
doped_total_rate_2017 = doped_year_total['Total'].values[8]
doped_total_rate_otheryrs = doped_year_total['Total'][:8]
n2 = len(doped_total_rate_otheryrs)

# display the 2009-2016 yearly totals for doping trainers
doped_total_rate_otheryrs

0    10
1    29
2    23
3    28
4    31
5    25
6    19
7    15
Name: Total, dtype: int64

In [49]:
# calculate t statistic
# note: np.std defaults to ddof=0 (population standard deviation)
t2 = ( np.mean(doped_total_rate_otheryrs) - doped_total_rate_2017 ) / ( np.std(doped_total_rate_otheryrs) / np.sqrt(n2) )
print('t =', round(t2,3))

# perform two-sided t test with n2 - 1 degrees of freedom
p_t2 = stats.t.sf(abs(t2), n2-1) * 2
print('p = p( t <= ', -abs(round(t2,3)), ') + p( t >= ', abs(round(t2,3)), ') = ', round(p_t2,3))
print('p (unrounded) is ', p_t2)

t = 5.57
p = p( t <=  -5.57 ) + p( t >=  5.57 ) =  0.001
p (unrounded) is  0.000842307171112
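
Because `np.std` defaults to the population standard deviation (`ddof=0`), it is worth checking how sensitive this result is to using the sample standard deviation (`ddof=1`), which is what `scipy.stats.ttest_1samp` uses. The sketch below does that with the eight yearly totals printed above. The 2017 total is never printed in the notebook, so the value of 9 used here is an assumption, back-calculated from the reported t = 5.57.

```python
import numpy as np
from scipy import stats

# Yearly totals for 2009-2016, copied from the output above.
other_years = np.array([10, 29, 23, 28, 31, 25, 19, 15])

# ASSUMPTION: the 2017 total is not shown in the notebook; 9 is
# back-calculated from the reported t statistic of 5.57.
total_2017 = 9

n = len(other_years)
diff = other_years.mean() - total_2017

# Population standard deviation (ddof=0), as in the notebook code.
t_pop = diff / (other_years.std(ddof=0) / np.sqrt(n))
p_pop = stats.t.sf(abs(t_pop), n - 1) * 2

# Sample standard deviation (ddof=1), the usual choice for a t test.
t_samp = diff / (other_years.std(ddof=1) / np.sqrt(n))
p_samp = stats.t.sf(abs(t_samp), n - 1) * 2

# Cross-check against SciPy's one-sample t test, which uses ddof=1.
scipy_result = stats.ttest_1samp(other_years, popmean=total_2017)

print('ddof=0: t =', round(t_pop, 3), ' p =', p_pop)
print('ddof=1: t =', round(t_samp, 3), ' p =', p_samp)
print('scipy : t =', round(scipy_result.statistic, 3), ' p =', scipy_result.pvalue)
```

Under these assumptions the `ddof=0` line reproduces the t statistic reported above, while the `ddof=1` version gives a somewhat smaller t (about 5.21) that matches SciPy. The p-value stays below 0.01 either way, so the choice does not change the conclusion here, although it could for a result closer to the threshold.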
