## Summary

The goal of our data analysis was to predict the win probability (`winprobability`) of each horse in the provided dataset. Through model testing, we decided to use only the following variables as predictors:

* `HorseAge` 
* `WeightCarried`
* `HandicapDistance`
* `Gender`
* `FrontShoes`
* `HindShoes`

We did not use all of these in raw form. We label-encoded `HandicapDistance` into a new variable called `HandicapDistance_encoded`, and we created dummy variables for `Gender`, `FrontShoes` and `HindShoes`, stored in the following new variables:

* `Gender_M` 
* `Gender_F` 
* `FrontShoes_0` 
* `FrontShoes_1`
* `FrontShoes_2`
* `FrontShoes_3` 
* `HindShoes_0`
* `HindShoes_1` 
* `HindShoes_2`
* `HindShoes_3`
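
As a rough illustration of the two encodings described above, the sketch below applies them to a made-up three-row frame. The values are invented, and `np.unique(..., return_inverse=True)` is used here as a stand-in that assigns the same sorted integer codes as a label encoder, so the example needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Invented toy rows; the notebook itself works on the full race dataset.
toy = pd.DataFrame({
    "HandicapDistance": [25.0, 0.0, 25.0],
    "Gender": ["M", "F", "M"],
    "FrontShoes": [0, 1, 0],
})

# Integer codes follow the sorted order of the distinct values (0.0 -> 0, 25.0 -> 1).
_, codes = np.unique(toy["HandicapDistance"].to_numpy(), return_inverse=True)
toy["HandicapDistance_encoded"] = codes

# One indicator column per category level: Gender_F, Gender_M, FrontShoes_0, FrontShoes_1.
toy = pd.get_dummies(toy, columns=["Gender", "FrontShoes"])
print(toy.columns.tolist())
```

Note that the label-encoded column is an ordinal integer code, while the dummy columns carry no ordering between levels.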

Furthermore, we cleaned the dataset by removing all observations where `FinishPosition` was non-numerical, `BeatenMargin` was 999 or `Disqualified` was `True`. We also dropped every race in which a disqualified horse had a `BeatenMargin` of 0. The remaining data covers only horses that finished the race with a numerical rank and were not disqualified.
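
The row-level filter can be sketched on a small invented frame as follows. This is only an illustration: it detects non-numerical finish positions with `pd.to_numeric(..., errors="coerce")` rather than an explicit list of codes, and the toy values are assumptions.

```python
import pandas as pd

# Invented toy rows: two races, with one disqualified and one pulled-up horse.
toy = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "FinishPosition": ["1", "2", "DQ ", "1", "PU "],
    "BeatenMargin": [0.0, 1.5, 999.0, 0.0, 999.0],
    "Disqualified": [False, False, True, False, False],
})

# A finish position that cannot be parsed as a number becomes NaN.
non_numeric = pd.to_numeric(toy["FinishPosition"], errors="coerce").isna()
placeholder_margin = toy["BeatenMargin"] == 999

# Drop a row if any one of the three conditions holds.
drop = non_numeric | placeholder_margin | toy["Disqualified"]
cleaned = toy[~drop].copy()
print(cleaned)
```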


Then, we created the variables `RelativeWeightCarried` and `RelativeHorseAge`, which rank each horse's `WeightCarried` and `HorseAge` against the other horses in the same `RaceID`. Rank 1 goes to the horse with the lowest weight carried (or the lowest age) in the race, and tied horses share the average of their ranks.
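
A minimal sketch of the within-race ranking on invented ages, showing how ties are handled by the default `rank` method:

```python
import pandas as pd

# Invented toy data: race 1 has two horses of the same age.
toy = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "HorseAge": [4, 6, 6, 5, 3],
})

# Rank within each race; the youngest horse gets rank 1 and ties share the average rank.
toy["RelativeHorseAge"] = toy.groupby("RaceID")["HorseAge"].rank(ascending=True)
print(toy)
```

In race 1 the two six-year-olds would occupy ranks 2 and 3, so each receives 2.5.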

We then created the `win` variable defined as the following:

$$
\texttt{win} =
\begin{cases}
      1 & \text{if the observation has the minimum } \texttt{BeatenMargin} \text{ in its race,} \\
      0 & \text{otherwise.}
\end{cases}
$$
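
On a small invented frame, this definition can be sketched with `groupby(...).idxmin()`, which returns the row label of the smallest `BeatenMargin` in each race:

```python
import pandas as pd

# Invented toy data: two races with one zero-margin horse each.
toy = pd.DataFrame({
    "RaceID": [1, 1, 1, 2, 2],
    "BeatenMargin": [1.5, 0.0, 3.2, 0.0, 0.8],
})

# Row label of the minimum BeatenMargin per race.
winners = toy.groupby("RaceID")["BeatenMargin"].idxmin()

toy["win"] = 0
toy.loc[winners, "win"] = 1
print(toy)
```

One consequence of this approach is that `idxmin` returns a single row per race, so if two horses shared the minimum margin only the first would be marked as the winner.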

Next, we created the training data from all observations up to 31st October, 2021 and the testing data from all observations from 1st November, 2021 onwards. The response in our training data consisted only of the `win` variable, and we considered only the following variables as predictors:

* `HorseAge` 
* `WeightCarried`
* `HandicapDistance_encoded`
* `Gender_M` 
* `Gender_F` 
* `FrontShoes_0` 
* `FrontShoes_1`
* `FrontShoes_2`
* `FrontShoes_3` 
* `HindShoes_0`
* `HindShoes_1` 
* `HindShoes_2`
* `HindShoes_3`
* `RelativeWeightCarried`
* `RelativeHorseAge`
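
The date-based split can be sketched as below. The three timestamps are invented and assumed to be timezone-naive; the cutoff mirrors the 1st November, 2021 boundary described above.

```python
import pandas as pd

# Invented toy rows straddling the cutoff date.
toy = pd.DataFrame({
    "RaceStartTime": ["2021-10-31 18:30:00", "2021-11-01 00:00:00", "2021-11-02 14:00:00"],
    "win": [1, 0, 1],
})
toy["startTime"] = pd.to_datetime(toy["RaceStartTime"])

cutoff = pd.Timestamp("2021-11-01")
train = toy[toy["startTime"] < cutoff]   # everything up to 31st October, 2021
test = toy[toy["startTime"] >= cutoff]   # everything from 1st November, 2021
print(len(train), len(test))
```

Splitting by time rather than at random keeps every test race later than every training race.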

Finally, we used a 4-layer neural network as our model. The first 3 layers use the `relu` activation, since we only have non-negative values, and the last layer uses the `sigmoid` activation, since we want a win probability between 0 and 1. The first layer has 15 nodes, the second has 10 and the third has 5. The final layer has 1 node, which outputs the `winprobability` of a single horse. We trained for 40 epochs using the `Adam` optimizer with the `binary_crossentropy` loss function and a learning rate of 0.01.
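
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: a dense layer with `n_in` inputs and `n_out` nodes has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` below is only illustrative, and the input width of 15 is the number of predictor columns.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the node counts of the four layers.
layer_sizes = [15, 15, 10, 5, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

The per-layer counts (240, 160, 55, 6) and the total of 461 agree with the `model.summary()` output further down.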

After predicting each horse's individual `winprobability`, we grouped the data by `RaceID` and divided each value by the sum of the `winprobability` values in that race, so that the probabilities within a race sum to 1. For example, if there were 3 horses with `winprobability` 0.5, 0.6 and 0.9 respectively, the new values would be 0.5/2.0, 0.6/2.0 and 0.9/2.0, i.e. 0.25, 0.30 and 0.45, where 0.5+0.6+0.9=2.0.
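
The three-horse example can be reproduced on a toy frame (the `RaceID` value 7 is arbitrary):

```python
import pandas as pd

# The three raw model outputs from the example above, all in one race.
toy = pd.DataFrame({
    "RaceID": [7, 7, 7],
    "winprobability": [0.5, 0.6, 0.9],
})

# Divide each value by the sum over its race so the race total becomes 1.
toy["winprobability"] = (
    toy.groupby("RaceID")["winprobability"].transform(lambda x: x / x.sum())
)
print(toy)
```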

In [111]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential


In [None]:
df = pd.read_parquet('trots_2013-2022.parquet')

In [88]:
Standard = ['AgeRestriction', 'Barrier', 'ClassRestriction', 'CourseIndicator',
'DamID', 'Distance', 'FoalingCountry', 'FoalingDate',
'FrontShoes', 'Gender', 'GoingAbbrev', 'GoingID',
'HandicapDistance', 'HandicapType', 'HindShoes', 'HorseAge',
'HorseID', 'JockeyID', 'RaceGroup', 'RaceID', 'RacePrizemoney',
'RaceStartTime', 'RacingSubType', 'Saddlecloth',
'SexRestriction', 'SireID', 'StartType', 'StartingLine', 'Surface',
'TrackID', 'TrainerID', 'WeightCarried', 'WetnessScale']

Performance = ['BeatenMargin', 'Disqualified', 'FinishPosition', 'PIRPosition', 'Prizemoney', 
               'RaceOverallTime', 'PriceSP', 'NoFrontCover', 'PositionInRunning', 'WideOffRail']

In [89]:
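# Non-numeric FinishPosition codes (each stored with a trailing space)
FinishPositionFilter = ['UR ', 'BS ', 'UN ', 'PU ', 'DQ ', 'FL ', 'NP ', 'WC ']
BeatenMarginFilter = 999.00

# Flag every horse with a non-numeric finish code or the placeholder margin
df.loc[df['FinishPosition'].isin(FinishPositionFilter), 'Disqualified'] = True
df.loc[df['BeatenMargin'] == BeatenMarginFilter, 'Disqualified'] = True

# Races in which a flagged horse has BeatenMargin 0 are dropped entirely
dqraces = df[(df.BeatenMargin == 0) & (df.Disqualified)].RaceID.unique()

# .copy() so that later column assignments do not act on a slice of df
dfCleaned = df[~df['Disqualified']]
dfCleaned = dfCleaned[~dfCleaned.RaceID.isin(dqraces)].copy()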

dfCleaned


Unnamed: 0,AgeRestriction,Barrier,BeatenMargin,ClassRestriction,CourseIndicator,DamID,Disqualified,Distance,FinishPosition,FoalingCountry,...,StartType,StartingLine,Surface,TrackID,TrainerID,NoFrontCover,PositionInRunning,WideOffRail,WeightCarried,WetnessScale
0,6yo,5,1.55,NW$101 CD,,1491946,False,2150.0,2,FR,...,M,1,S,951,38190,-9,-9,-9,0.0,3
1,6yo,6,3.55,NW$101 CD,,1509392,False,2150.0,4,FR,...,M,1,S,951,38432,-9,-9,-9,0.0,3
2,6yo,7,5.55,NW$101 CD,,1507967,False,2150.0,6,FR,...,M,1,S,951,37826,-9,-9,-9,0.0,3
6,6yo,11,6.35,NW$101 CD,,1495060,False,2150.0,7,FR,...,M,2,S,951,38366,-9,-9,-9,0.0,3
7,6yo,1,2.45,NW$75 CE,,1496640,False,2575.0,2,FR,...,M,1,S,906,38070,-9,-9,-9,0.0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200404,8&9yo,0,11.95,NW$231,G,1522959,False,2850.0,11,FR,...,V,-1,C,1969,38832,0,15,2,0.0,3
1200405,8&9yo,0,14.40,NW$231,G,1511895,False,2850.0,12,FR,...,V,-1,C,1969,37955,0,14,2,0.0,3
1200406,8&9yo,0,14.55,NW$231,G,1476559,False,2850.0,13,FR,...,V,-1,C,1969,38131,0,12,1,0.0,3
1200410,6yo,0,0.00,CA,G,1552103,False,2850.0,1,ITY,...,V,-1,C,1969,40749,1,1,1,0.0,3


In [90]:
dfCleaned['RelativeHorseAge'] = dfCleaned.groupby('RaceID')['HorseAge'].rank(ascending=True)
dfCleaned['RelativeWeightCarried'] = dfCleaned.groupby('RaceID')['WeightCarried'].rank(ascending=True)

dfCleaned

Unnamed: 0,AgeRestriction,Barrier,BeatenMargin,ClassRestriction,CourseIndicator,DamID,Disqualified,Distance,FinishPosition,FoalingCountry,...,Surface,TrackID,TrainerID,NoFrontCover,PositionInRunning,WideOffRail,WeightCarried,WetnessScale,RelativeHorseAge,RelativeWeightCarried
0,6yo,5,1.55,NW$101 CD,,1491946,False,2150.0,2,FR,...,S,951,38190,-9,-9,-9,0.0,3,4.0,4.0
1,6yo,6,3.55,NW$101 CD,,1509392,False,2150.0,4,FR,...,S,951,38432,-9,-9,-9,0.0,3,4.0,4.0
2,6yo,7,5.55,NW$101 CD,,1507967,False,2150.0,6,FR,...,S,951,37826,-9,-9,-9,0.0,3,4.0,4.0
6,6yo,11,6.35,NW$101 CD,,1495060,False,2150.0,7,FR,...,S,951,38366,-9,-9,-9,0.0,3,4.0,4.0
7,6yo,1,2.45,NW$75 CE,,1496640,False,2575.0,2,FR,...,S,906,38070,-9,-9,-9,0.0,3,5.5,5.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200404,8&9yo,0,11.95,NW$231,G,1522959,False,2850.0,11,FR,...,C,1969,38832,0,15,2,0.0,3,10.0,7.0
1200405,8&9yo,0,14.40,NW$231,G,1511895,False,2850.0,12,FR,...,C,1969,37955,0,14,2,0.0,3,10.0,7.0
1200406,8&9yo,0,14.55,NW$231,G,1476559,False,2850.0,13,FR,...,C,1969,38131,0,12,1,0.0,3,10.0,7.0
1200410,6yo,0,0.00,CA,G,1552103,False,2850.0,1,ITY,...,C,1969,40749,1,1,1,0.0,3,5.0,5.0


In [91]:
label_encoder = LabelEncoder()
dfCleaned['HandicapDistance_encoded'] = label_encoder.fit_transform(dfCleaned['HandicapDistance'])

categorical_columns = ['Gender', 'Surface', 'FrontShoes', 'HindShoes']

dfCleaned = pd.get_dummies(dfCleaned, columns=categorical_columns)

dfCleaned

Unnamed: 0,AgeRestriction,Barrier,BeatenMargin,ClassRestriction,CourseIndicator,DamID,Disqualified,Distance,FinishPosition,FoalingCountry,...,Surface_S,Surface_T,FrontShoes_0,FrontShoes_1,FrontShoes_2,FrontShoes_3,HindShoes_0,HindShoes_1,HindShoes_2,HindShoes_3
0,6yo,5,1.55,NW$101 CD,,1491946,False,2150.0,2,FR,...,True,False,True,False,False,False,True,False,False,False
1,6yo,6,3.55,NW$101 CD,,1509392,False,2150.0,4,FR,...,True,False,True,False,False,False,True,False,False,False
2,6yo,7,5.55,NW$101 CD,,1507967,False,2150.0,6,FR,...,True,False,True,False,False,False,True,False,False,False
6,6yo,11,6.35,NW$101 CD,,1495060,False,2150.0,7,FR,...,True,False,True,False,False,False,True,False,False,False
7,6yo,1,2.45,NW$75 CE,,1496640,False,2575.0,2,FR,...,True,False,True,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200404,8&9yo,0,11.95,NW$231,G,1522959,False,2850.0,11,FR,...,False,False,True,False,False,False,True,False,False,False
1200405,8&9yo,0,14.40,NW$231,G,1511895,False,2850.0,12,FR,...,False,False,True,False,False,False,True,False,False,False
1200406,8&9yo,0,14.55,NW$231,G,1476559,False,2850.0,13,FR,...,False,False,False,True,False,False,True,False,False,False
1200410,6yo,0,0.00,CA,G,1552103,False,2850.0,1,ITY,...,False,False,True,False,False,False,True,False,False,False


In [93]:
new_data = dfCleaned[['RaceID', 'HorseID', 'RaceStartTime', 'HorseAge', 'WeightCarried', 'HandicapDistance_encoded', 
                         'Gender_F', 'Gender_M', 'FrontShoes_0', 'FrontShoes_1','FrontShoes_2', 'FrontShoes_3', 
                         'HindShoes_0', 'HindShoes_1', 'HindShoes_2', 'HindShoes_3', 'RelativeWeightCarried', 'RelativeHorseAge',
                         'FinishPosition', 'BeatenMargin']]

dftemp = new_data.copy()

winning_indices = dftemp.groupby('RaceID')['BeatenMargin'].idxmin()
dftemp['win'] = 0
dftemp.loc[winning_indices, 'win'] = 1


In [104]:
dftemp['startTime'] = pd.to_datetime(dftemp['RaceStartTime'])

dftemp.replace({True:1, False:0}, inplace=True)

#Train/test split
dftrain = dftemp[dftemp['startTime']<'2021-11-01']
dftest = dftemp[dftemp['startTime']>='2021-11-01']


In [152]:
# Predictor columns shared by the training and test sets
features = ['HorseAge', 'WeightCarried', 'HandicapDistance_encoded',
            'Gender_F', 'Gender_M', 'FrontShoes_0', 'FrontShoes_1', 'FrontShoes_2', 'FrontShoes_3',
            'HindShoes_0', 'HindShoes_1', 'HindShoes_2', 'HindShoes_3', 'RelativeWeightCarried', 'RelativeHorseAge']

Xtrain = dftrain[features]
Xtest = dftest[features]

Ytrain = dftrain[['win']]
Ytest = dftest[['win']]

In [149]:

model = Sequential(
    [               
        tf.keras.Input(shape=(15,)),
        Dense(15, activation='relu', name='L1'),
        Dense(10, activation='relu', name='L2'),
        Dense(5, activation='relu', name='L3'),
        Dense(1, activation='sigmoid', name='L4')
    ],
)
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=['accuracy']
)




In [147]:
model.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 L1 (Dense)                  (None, 15)                240       
                                                                 
 L2 (Dense)                  (None, 10)                160       
                                                                 
 L3 (Dense)                  (None, 5)                 55        
                                                                 
 L4 (Dense)                  (None, 1)                 6         
                                                                 
Total params: 461 (1.80 KB)
Trainable params: 461 (1.80 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [123]:
print(Xtrain.shape)
print(Ytrain.shape)

(835593, 15)
(835593, 1)


In [150]:

history = model.fit(
    Xtrain,Ytrain,
    epochs=40
)

Epoch 1/40
...
Epoch 40/40


In [156]:
Ypred = np.rint(model.predict(Xtest).flatten())
print(metrics.accuracy_score(Ytest, Ypred))

0.9019100032372936
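
Since `win` equals 1 for only one horse per race, accuracy on this target is dominated by the non-winners, so it may help to compare the score above with a baseline that never predicts a win. The sketch below uses an invented ten-runner race purely for illustration; the actual field sizes in the dataset are not assumed.

```python
import numpy as np

# Invented labels for one race with ten runners and a single winner.
y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Baseline classifier that predicts "no win" for every horse.
always_lose = np.zeros_like(y_true)

baseline_accuracy = (y_true == always_lose).mean()
print(baseline_accuracy)
```

On the real test set the same baseline would be `1 - Ytest['win'].mean()`, which gives a reference point for interpreting the reported accuracy.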


In [165]:
predictions = model.predict(Xtest)



In [215]:
#Merging forecasts with the dataframe
dfforecast = pd.merge(dftemp[dftemp['startTime']>='2021-11-01'].reset_index(), pd.DataFrame(predictions), left_index=True, right_index=True)

In [216]:
dfforecast.drop(columns=['index'], inplace=True)
dfforecast.rename(columns={0:'winprobability'}, inplace=True)

# Normalise within each race so that the probabilities of its runners sum to 1
dfforecast['winprobability'] = dfforecast.groupby('RaceID')['winprobability'].transform(lambda x: x / x.sum())

In [218]:
dfforecast.to_parquet('forecasts.parquet')

In [220]:
model.save('forecastmodel.keras')