# Data Wrangling

For this project, I'll be analyzing data from two datasets. The first is the 'Equine Death and Breakdown' dataset provided by the New York State Gaming Commission. The second is a dataset I compiled through my own research into whether various trainers have a history of drugging their horses. I built it by identifying trainers who have been suspended or fined after one or more of their horses tested above the legal limit for a drug (article sources are provided in the dataset).

To build the complete dataset for my project, I will merge the two datasets on the trainer names. I will also derive new variables from the 'Weather Conditions' and 'Incident Description' columns in order to test whether those factors are significant.
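As a rough sketch of what the merge on trainer names might look like, the example below uses two toy dataframes. The trainer names are placeholders, and the 'Drug Violation' column and the `normalize_name` helper are assumptions for illustration, not the actual layout of my trainer dataset. Normalizing the names first is probably worthwhile, since small differences in case or spacing would otherwise prevent a match.

```python
import pandas as pd

# Toy stand-ins for the two datasets. Names are placeholders and the
# 'Drug Violation' column is a hypothetical name, not the real schema.
incidents = pd.DataFrame({
    'Trainer': ['JANE Q. EXAMPLE', 'SAM T. PLACEHOLDER', 'ALEX R. SAMPLE'],
    'Incident Type': ['EQUINE DEATH', 'RACING INJURY', 'FALL OF RIDER'],
})
violations = pd.DataFrame({
    'Trainer': ['jane q.  example ', 'Sam T. Placeholder'],
    'Drug Violation': [True, False],
})

def normalize_name(names):
    """Upper-case, trim, and collapse repeated spaces so keys compare equal."""
    return names.str.upper().str.strip().str.replace(r'\s+', ' ', regex=True)

incidents['trainer_key'] = normalize_name(incidents['Trainer'])
violations['trainer_key'] = normalize_name(violations['Trainer'])

# A left merge keeps every incident row; indicator=True adds a '_merge'
# column showing which rows found a match in the trainer dataset.
merged = incidents.merge(
    violations.drop(columns='Trainer'),
    on='trainer_key',
    how='left',
    indicator=True,
)
print(merged[['Trainer', 'Drug Violation', '_merge']])
```

Rows marked `left_only` in `_merge` are incidents whose trainer has no entry in the second dataset, which makes unmatched names easy to review.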

In [514]:
# import packages
import pandas as pd
import numpy as np

# Equine Death and Breakdown Dataset

Before merging our two datasets, I will clean the Equine Death and Breakdown dataset and study its contents.

In [515]:
# load the dataset as a dataframe
df1 = pd.read_csv('Equine_Death_and_Breakdown.csv')
df1.head()

Unnamed: 0,Year,Incident Date,Incident Type,Track,Inv Location,Racing Type Description,Division,Weather Conditions,Horse,Trainer,Jockey Driver,Incident Description,Death or Injury
0,2009,03/04/2009,EQUINE DEATH,Aqueduct Racetrack (NYRA),Aqueduct,Racing,Thoroughbred,,Private Details,JOHN P. TERRANOVA II,,Private Details-Tr. John Terranova-fell on tra...,Euthanasia
1,2009,03/04/2009,ON-TRACK ACCIDENT,Aqueduct Racetrack (NYRA),,Racing,Thoroughbred,,Private Details,JOHN P. TERRANOVA II,,Private Details-Tr. John Terranova-fell fx LF...,
2,2009,03/04/2009,ON-TRACK ACCIDENT,Aqueduct Racetrack (NYRA),Aqueduct,Racing,Thoroughbred,,All Bets Off,B E. LEVINE,,All Bets Off-Tr. Bruce Levine-fell over downed...,
3,2009,03/04/2009,ON-TRACK ACCIDENT,Aqueduct Racetrack (NYRA),Aqueduct,Racing,Thoroughbred,,Hot Chile Soup,ENRIQUE ARROYO,,Hot Chile Soup-Tr. Enrique Arroyo-fell over do...,
4,2009,03/04/2009,ON-TRACK ACCIDENT,Aqueduct Racetrack (NYRA),Aqueduct,Racing,Thoroughbred,,One Dream Union,BRUCE R. BROWN,,One Dream Union-Tr. Bruce Brown-fell over down...,


In [516]:
# use the info() method to identify size and datatypes of the dataframe
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 13 columns):
Year                       3218 non-null int64
Incident Date              3218 non-null object
Incident Type              3218 non-null object
Track                      3218 non-null object
Inv Location               3216 non-null object
Racing Type Description    3218 non-null object
Division                   3218 non-null object
Weather Conditions         3171 non-null object
Horse                      3218 non-null object
Trainer                    3218 non-null object
Jockey Driver              3218 non-null object
Incident Description       3218 non-null object
Death or Injury            3217 non-null object
dtypes: int64(1), object(12)
memory usage: 326.9+ KB


In [517]:
# check the column names
df1.columns

Index(['Year', 'Incident Date', 'Incident Type', 'Track', 'Inv Location',
       'Racing Type Description', 'Division', 'Weather Conditions', 'Horse',
       'Trainer', 'Jockey Driver', 'Incident Description', 'Death or Injury'],
      dtype='object')

In [518]:
# check the shape of the dataframe
df1.shape

(3218, 13)

Our dataframe has 3,218 rows and 13 columns. Since most of the data is categorical, let's check that the categories in various columns aren't misspelled or duplicated under slightly different names, which would create more values than necessary.
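One quick way to surface labels that differ only in capitalization or stray whitespace is to group the raw labels by a normalized key. This is a minimal sketch on a toy series (the values are made up for illustration). It will not catch reworded labels, so the counts still need to be read by eye.

```python
import pandas as pd

# Toy column; in this notebook it would be a column of df1 instead.
labels = pd.Series(['Death', 'death', 'Equine Death', 'Injury ', 'Injury', 'Lameness'])

# Normalize case and surrounding whitespace, then collect the raw
# spellings that collapse onto the same normalized key.
key = labels.str.strip().str.lower()
variants = labels.groupby(key).unique()

# Keep only the keys that have more than one raw spelling.
suspects = variants[variants.map(len) > 1]
print(suspects)
```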

In [519]:
# use groupby() to display the different incident types
# use count() to display how many records for each incident type
df1.groupby(['Incident Type'])['Incident Type'].count()

Incident Type
ACCIDENT - DRIVER/JOCKEY               10
ACCIDENT - IN STARTING GATE            22
ACCIDENT - ON TRACK                   107
ACCIDENT - TAGGED SULKY                56
DRIVER/JOCKEY INJURED                   7
EQUINE DEATH                         1219
EQUINE DEATH - INFECTIOUS DISEASE      10
FALL OF HORSE                         113
FALL OF RIDER                         211
ON-TRACK ACCIDENT                       7
RACING INJURY                         302
STEWARDS/VETS LIST                   1154
Name: Incident Type, dtype: int64

There are 12 incident types, each distinct except for two that both describe on-track accidents: 'ACCIDENT - ON TRACK' and 'ON-TRACK ACCIDENT'. Let's relabel the rows in the second category so they fall under the first.

In [520]:
# recategorize the on track accidents so they have the same incident type name
df1['Incident Type'] = df1['Incident Type'].replace('ON-TRACK ACCIDENT','ACCIDENT - ON TRACK')
df1.groupby(['Incident Type'])['Incident Type'].count()

Incident Type
ACCIDENT - DRIVER/JOCKEY               10
ACCIDENT - IN STARTING GATE            22
ACCIDENT - ON TRACK                   114
ACCIDENT - TAGGED SULKY                56
DRIVER/JOCKEY INJURED                   7
EQUINE DEATH                         1219
EQUINE DEATH - INFECTIOUS DISEASE      10
FALL OF HORSE                         113
FALL OF RIDER                         211
RACING INJURY                         302
STEWARDS/VETS LIST                   1154
Name: Incident Type, dtype: int64

We now have 11 incident types, each distinct from the others. Now let's change the datatype from object to category.

In [521]:
# update the datatype of 'Incident Type' to be categorical and check the datatype
df1['Incident Type'] = df1['Incident Type'].astype('category')
df1['Incident Type'].dtype

category

Let's move on to check some other columns that are also categorical in nature.

In [522]:
# check the track names
df1.groupby(['Track'])['Track'].count()

Track
Aqueduct Racetrack (NYRA)               619
Batavia Downs                            63
Belmont Park (NYRA)                     719
Buffalo Raceway                         101
Finger Lakes Gaming & Racetrack         508
Monticello Raceway & Mighty M Gaming    203
Saratoga Gaming & Raceway               400
Saratoga Racecourse (NYRA)              326
Tioga Downs                              80
Vernon Downs                             62
Yonkers Raceway                         137
Name: Track, dtype: int64

In [523]:
# check racing type descriptions
df1.groupby(['Racing Type Description'])['Racing Type Description'].count()

Racing Type Description
Non-Racing     313
Racing        2512
Training       373
Unknown         20
Name: Racing Type Description, dtype: int64

In [524]:
# check division
df1.groupby(['Division'])['Division'].count()

Division
Harness         1046
Thoroughbred    2172
Name: Division, dtype: int64

The 'Track', 'Racing Type Description', and 'Division' columns all have clean categorical data. Let's convert their data types to category and then keep checking other columns with categorical data.

In [525]:
# update datatype of 'Track', 'Racing Type Description', and 'Division' to category
df1['Track'] = df1['Track'].astype('category')
df1['Racing Type Description'] = df1['Racing Type Description'].astype('category')
df1['Division'] = df1['Division'].astype('category')

# check datatype of these three columns by calling the info() method
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 13 columns):
Year                       3218 non-null int64
Incident Date              3218 non-null object
Incident Type              3218 non-null category
Track                      3218 non-null category
Inv Location               3216 non-null object
Racing Type Description    3218 non-null category
Division                   3218 non-null category
Weather Conditions         3171 non-null object
Horse                      3218 non-null object
Trainer                    3218 non-null object
Jockey Driver              3218 non-null object
Incident Description       3218 non-null object
Death or Injury            3217 non-null object
dtypes: category(4), int64(1), object(8)
memory usage: 240.0+ KB
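The three conversions above could also be written as a single call by passing a column-to-dtype mapping to `astype()`. A small sketch on a toy dataframe that reuses the same column names:

```python
import pandas as pd

# Toy frame with the same three column names used in this notebook.
df = pd.DataFrame({
    'Track': ['Belmont Park (NYRA)', 'Tioga Downs'],
    'Racing Type Description': ['Racing', 'Training'],
    'Division': ['Thoroughbred', 'Harness'],
    'Year': [2009, 2010],
})

# A {column: dtype} mapping converts several columns at once and
# leaves the remaining columns (here 'Year') unchanged.
category_cols = ['Track', 'Racing Type Description', 'Division']
df = df.astype({col: 'category' for col in category_cols})
print(df.dtypes)
```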


Four of our columns now have data of type category. Let's check the 'Death or Injury' column.

In [526]:
# check death or injury
df1.groupby(['Death or Injury'])['Death or Injury'].count()

Death or Injury
                                478
Accident                        249
Death                             3
Equine Death                    679
Equine Injury                   101
Equine Injury / Equine Death     46
Euthanasia                      460
Injury                          218
Lame no death                     1
Lameness                        110
Steward's List                  871
death                             1
Name: Death or Injury, dtype: int64

There are a few adjustments we can make to the categories in the 'Death or Injury' column. To start, there are three categories for death: 'Death', 'Equine Death', and 'death'. Similarly, there are two categories for lameness ('Lameness' and 'Lame no death') and two for injury ('Equine Injury' and 'Injury'). We can probably assume that horses marked as lame didn't die at the time of the incident, and that 'Injury' refers to the horse's injury rather than the jockey's. We can consolidate all of these.

In [527]:
# use the replace() method to consolidate categories that are the same
df1['Death or Injury'] = df1['Death or Injury'].replace(['Death','death'],'Equine Death')
df1['Death or Injury'] = df1['Death or Injury'].replace('Lame no death','Lameness')
df1['Death or Injury'] = df1['Death or Injury'].replace('Injury','Equine Injury')
df1.groupby(['Death or Injury'])['Death or Injury'].count()

Death or Injury
                                478
Accident                        249
Equine Death                    683
Equine Injury                   319
Equine Injury / Equine Death     46
Euthanasia                      460
Lameness                        111
Steward's List                  871
Name: Death or Injury, dtype: int64
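The same consolidation can be expressed as one mapping from each variant label to its canonical label, which keeps all the renaming decisions in one place. A minimal sketch on a toy series (the sample values are illustrative, not the real data):

```python
import pandas as pd

outcome = pd.Series(['Death', 'death', 'Equine Death', 'Injury',
                     'Lame no death', 'Lameness', 'Euthanasia'])

# One mapping from variant spelling to the canonical label.
canonical = {
    'Death': 'Equine Death',
    'death': 'Equine Death',
    'Lame no death': 'Lameness',
    'Injury': 'Equine Injury',
}

# Labels that are not keys of the mapping are left untouched.
outcome = outcome.replace(canonical)
print(outcome.value_counts())
```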

We have a lot of rows with no data filled in. We should check how these categories match up with the 'Incident Type' column, since the descriptions are fairly similar. If blank rows in the 'Death or Injury' column line up with categories in the 'Incident Type' column that suggest the horse died, perhaps we can fill them in with 'Equine Death'.

In [528]:
# identify matching pairs of 'Incident Type' and 'Death or Injury' in our dataframe
df1[['Incident Type','Death or Injury']].drop_duplicates().sort_values('Incident Type').reset_index(drop=True)

Unnamed: 0,Incident Type,Death or Injury
0,ACCIDENT - DRIVER/JOCKEY,
1,ACCIDENT - DRIVER/JOCKEY,Accident
2,ACCIDENT - IN STARTING GATE,
3,ACCIDENT - IN STARTING GATE,Steward's List
4,ACCIDENT - IN STARTING GATE,Accident
5,ACCIDENT - IN STARTING GATE,Equine Injury
6,ACCIDENT - ON TRACK,Accident
7,ACCIDENT - ON TRACK,Equine Injury
8,ACCIDENT - ON TRACK,Steward's List
9,ACCIDENT - ON TRACK,


The above table shows that the 'Incident Type' and 'Death or Injury' categories don't match up very well. The two columns overlap heavily, with 43 distinct pairs in total. Since the 'Incident Type' column is fully populated, we can use this column for most of our analysis. For now, we will leave the blank rows in 'Death or Injury' as is. We may revisit this later.
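A cross-tabulation is another way to look at how the two columns pair up, since it shows how many records fall into each combination rather than only which combinations exist. The sketch below uses a toy dataframe, with a single-space string standing in for a blank outcome; that representation of "blank" is an assumption. Note that `pd.crosstab` leaves out rows where either value is NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Incident Type': ['EQUINE DEATH', 'EQUINE DEATH', 'RACING INJURY', 'FALL OF HORSE'],
    'Death or Injury': ['Euthanasia', ' ', 'Equine Injury', ' '],
})

# Rows are incident types, columns are outcomes, cells are record counts.
pairs = pd.crosstab(df['Incident Type'], df['Death or Injury'])
print(pairs)

# Number of distinct (incident type, outcome) combinations that occur.
n_pairs = int((pairs > 0).sum().sum())
print(n_pairs)
```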

Finally, let's convert this column to the category datatype.

In [529]:
# update the datatype of 'Death or Injury' to be categorical and check the datatype
df1['Death or Injury'] = df1['Death or Injury'].astype('category')
df1['Death or Injury'].dtype

category

In [530]:
# check all datatypes using the info() method
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 13 columns):
Year                       3218 non-null int64
Incident Date              3218 non-null object
Incident Type              3218 non-null category
Track                      3218 non-null category
Inv Location               3216 non-null object
Racing Type Description    3218 non-null category
Division                   3218 non-null category
Weather Conditions         3171 non-null object
Horse                      3218 non-null object
Trainer                    3218 non-null object
Jockey Driver              3218 non-null object
Incident Description       3218 non-null object
Death or Injury            3217 non-null category
dtypes: category(5), int64(1), object(7)
memory usage: 218.4+ KB


Another thing I want to test is whether weather is a significant factor in whether horses get injured or die. We're provided with a 'Weather Conditions' column, but the descriptions aren't consistent and some are more elaborate than others. Let's check how many unique descriptions there are in the 'Weather Conditions' column.

In [531]:
# check number of unique strings
df1['Weather Conditions'].nunique()

799

799 different descriptions is a lot, but it makes sense that there are so many since many of them list the temperature and the weather condition (sometimes more than 1 weather condition). Let's see what the top weather condition descriptions are to get a better idea of what the data in this column looks like.

In [532]:
# display some of the more common weather condition descriptions
df1['Weather Conditions'].value_counts().head(15)

                               999
Clear                          306
Cloudy                         125
Sunny                           39
Clear 50 to 55 : degrees F      34
Clear 60 to 65 : degrees F      28
clear                           27
Cloudy 50 to 55 : degrees F     24
Overcast                        22
Rain                            22
80* Clear                       20
Clear 30 to 40 : degrees F      18
Clear 45 to 50 : degrees F      18
60* Clear                       17
75* Clear                       16
Name: Weather Conditions, dtype: int64

A good chunk of our records don't have anything recorded for the weather conditions. We can also see at least one inconsistency in capitalization ('clear' and 'Clear'). Instead of altering the data within this column, I'll create a few new columns filled with Boolean data. If a weather condition keyword appears in the weather condition description, the value will be True; otherwise it will be False. This is a useful way to represent the data because a single description can mention several weather conditions.

I will iterate through all the rows and check, for each weather description, which of my new boolean columns apply to it.

In [533]:
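# create the new weather columns; they start out as empty strings
# and will be filled with booleans in the next step
df1['Cloudy'] = ''
df1['Sunny'] = ''
df1['Clear'] = ''
df1['Overcast'] = ''
df1['Rain'] = ''
df1['Snow'] = ''
df1['Wind'] = ''
df1['Thunder Storm'] = ''
df1['Hot'] = ''
df1['Humid'] = ''
df1['Warm'] = ''
# cast the descriptions to strings so keyword checks work on every row
# (missing values become the string 'nan')
df1['Weather Conditions'] = df1['Weather Conditions'].astype('str')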

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 24 columns):
Year                       3218 non-null int64
Incident Date              3218 non-null object
Incident Type              3218 non-null category
Track                      3218 non-null category
Inv Location               3216 non-null object
Racing Type Description    3218 non-null category
Division                   3218 non-null category
Weather Conditions         3218 non-null object
Horse                      3218 non-null object
Trainer                    3218 non-null object
Jockey Driver              3218 non-null object
Incident Description       3218 non-null object
Death or Injury            3217 non-null category
Cloudy                     3218 non-null object
Sunny                      3218 non-null object
Clear                      3218 non-null object
Overcast                   3218 non-null object
Rain                       3218 non-null object
Snow                

In [534]:
# iterate through all the rows of the dataframe with a for loop
for index, row in df1.iterrows():
    condition = row['Weather Conditions']
    # if the weather condition isn't blank, we will populate the boolean columns
    if condition.strip():
        # check if the weather condition is cloudy and assign boolean to 'Cloudy'
        # lowercase once so 'cloud' also matches 'Cloud'
        # ("'cloud' or 'Cloud' in condition" would always evaluate as True)
        # write through df1.loc, because assigning to row only changes a copy
        df1.loc[index, 'Cloudy'] = 'cloud' in condition.lower()

#df1.head(15)

# *** NEED TO COME BACK TO THIS
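When I come back to this, a vectorized version may be simpler than the row loop. The sketch below is one possible approach on a toy series, not the final implementation. The keyword chosen for each column is an assumption, and plain substring matching can over-match (for example, 'rain' would also match 'training').

```python
import pandas as pd

weather = pd.Series(['Clear 50 to 55 : degrees F', 'cloudy', '80* Clear',
                     'Rain, windy', ' ', None], name='Weather Conditions')

# Column name -> keyword to look for in the description.
# These keyword choices are assumptions for illustration.
keywords = {
    'Cloudy': 'cloud',
    'Clear': 'clear',
    'Rain': 'rain',
    'Wind': 'wind',
}

# Lowercase once so matching ignores capitalization; na=False makes
# missing descriptions evaluate to False instead of NaN.
lowered = weather.str.lower()
flags = pd.DataFrame({
    col: lowered.str.contains(kw, regex=False, na=False)
    for col, kw in keywords.items()
})

# Rows with no usable description, in case they should be treated as
# unknown rather than as False.
no_description = weather.fillna('').str.strip() == ''
print(flags.assign(no_description=no_description))
```

In this notebook, the resulting columns could then be attached to df1 with something like `df1[flags.columns] = flags`, after building `flags` from `df1['Weather Conditions']`.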

The Equine Death and Breakdown dataset is now clean enough for us to use. We've consolidated some categories, added columns relating to weather condition, and 