### Definitions 

- "win_fair_price" is the price representing the true probability that the horse wins. (In reality you never know this but assume you do).
- "win_starting_price" is the market's price that the horse wins at the start time of the race. 
- "winner" is a binary value indicating whether the horse won or lost the race. 
- "Early_Market_Price" is the price you are offered to bet by the bookmaker at some time, t, before the race starts. 
- "Early_Model_Price" is your own model's price for the horse to win at the time that the bookmaker offers his prices before the race starts (time t).
- "Starting_Model_Price" is your model's price for the horse to win at the start time of the race.
- "race_number" and "saddle_number" are unique identifiers for the race and the horse, respectively.

In [89]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from collections import defaultdict

In [90]:
#import data from csv file
horses_df = pd.read_csv('horses.csv')

#select the columns showing the available information
available_information = horses_df[['Early_Market_Price', 'Early_Model_Price']]

In [91]:
horses_df

Unnamed: 0,race_number,saddle_number,win_fair_price,win_starting_price,winner,Early_Market_Price,Early_Model_Price,Starting_Model_Price
0,1,4,1.7353,1.7098,1,1.55,1.729805,1.729805
1,1,1,6.0313,6.0914,0,4.90,6.526502,6.526502
2,1,5,7.6923,7.5101,0,7.08,8.431470,8.431470
3,1,6,20.3325,20.4978,0,16.10,22.772208,22.772208
4,1,2,23.9991,23.4710,0,19.07,26.987190,26.987190
...,...,...,...,...,...,...,...,...
86643,10000,7,14.4772,13.8269,0,11.18,14.134298,14.134298
86644,10000,5,29.0062,29.8636,0,21.75,27.934114,27.934114
86645,10000,3,50.8005,48.9715,0,35.42,48.614364,48.614364
86646,10000,9,310.1959,305.6448,0,185.15,292.496189,292.496189


## Q1: How many selections are you betting? Why?

Selections are made by comparing the model's early price $P_m$ for a horse at time $t$ with the bookmaker's price $P_b$ at the same time. If the model's price is lower than the bookmaker's (i.e. if $P_m < P_b$), the model assigns the horse a higher win probability than the bookmaker's price implies, so a bet should be placed on that horse, assuming we have full confidence in the model's predictions at time $t$.

In [97]:
#select bets where the model's early price is smaller than the market's early price
p_m_win = 1 / horses_df['Early_Model_Price']
p_b_win = 1 / horses_df['Early_Market_Price']
p_true_win = 1 / horses_df['win_fair_price']
favourable_probs = p_m_win > p_b_win
selected_bets = horses_df.copy().loc[favourable_probs]
num_selected_bets = len(selected_bets)
print(f'Number of bets we should select = {num_selected_bets}')

Number of bets we should select = 2112


## Q2

### a). Calculate the Total Stake and Total PnL RoI, assuming a Unit Stake per selection that you bet.

In [98]:
#calculate PnL and PnL ROI for selected bets
unit_stake_bets = selected_bets.copy()
unit_stake_bets['Stake'] = np.ones(num_selected_bets)
unit_stake_bets['PnL'] = unit_stake_bets['Stake'] * (unit_stake_bets['Early_Market_Price'] * unit_stake_bets['winner'] - 1)
unit_stake_bets['PnL ROI'] = unit_stake_bets['PnL'] / unit_stake_bets['Stake']

total_stake = sum(unit_stake_bets['Stake'])
total_pnl = sum(unit_stake_bets['PnL'])
total_pnl_roi = total_pnl / total_stake

print(f'Total Stake = {total_stake}')
print(f'Total PnL ROI = {total_pnl_roi}')

Total Stake = 2112.0
Total PnL ROI = -0.016216856060606067


### b). Calculate the Total Stake and Total PnL RoI, assuming a Kelly Stake per selection that you bet. Use 100,000 as your Kelly bankroll for this calculation.

Kelly formula for the optimal fraction of the bankroll to bet:

$$
f^* = \frac{p \cdot \text{odds} - 1}{\text{odds} - 1}
$$

where 

- $f^{*}$ is the fraction of the current bankroll to wager.
- $p$ is the probability of a win (i.e. $\frac{1}{P_m}$)
- odds is the bookmaker price on the bet to win. 

Since it isn't clear whether the bets should be placed sequentially or all at once, we assume that they are all placed at once. The Kelly fraction is calculated for each selected bet, and each fraction is then divided by the sum of all fractions, giving the relative share of the bankroll to place on each bet. Note that this normalisation stakes the entire bankroll across the selected bets.

In [125]:
kelly_stake_bets = selected_bets.copy()

p_model_win = 1 / kelly_stake_bets['Early_Model_Price'] #probability of winning implied by the model's price
odds = kelly_stake_bets['Early_Market_Price'] #bookmaker odds

fractions = [] 
bankroll = 100000 #start with initial bankroll
for i in range(len(kelly_stake_bets)):
    fractions.append((p_model_win.iloc[i]*odds.iloc[i] - 1) / (odds.iloc[i] - 1)) #kelly formula for fraction of bankroll to bet

fractions = fractions / sum(fractions)
stakes = bankroll * fractions
kelly_stake_bets['Stake'] = stakes
kelly_stake_bets['PnL'] = kelly_stake_bets['Stake'] * (kelly_stake_bets['Early_Market_Price'] * kelly_stake_bets['winner'] - 1) #calculate PnL

#calculate total PnL and PnL ROI
total_stake = sum(kelly_stake_bets['Stake'])
total_pnl = sum(kelly_stake_bets['PnL'])
total_pnl_roi = total_pnl / total_stake

print(f'Total Stake = {total_stake}')
print(f'Total PnL ROI = {total_pnl_roi}')

Total Stake = 99999.99999999994
Total PnL ROI = 0.03241795273893227
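
The per-bet Kelly calculation and the normalisation step can also be written without an explicit loop. The following is a minimal sketch on made-up prices (the `toy` frame and the `kelly_fraction` helper are illustrative names, not part of the notebook), assuming decimal prices as elsewhere in this analysis:

```python
import numpy as np
import pandas as pd

def kelly_fraction(p_win, decimal_odds):
    """Kelly fraction f* = (p * odds - 1) / (odds - 1) for decimal odds."""
    return (p_win * decimal_odds - 1) / (decimal_odds - 1)

# Made-up prices for three selections (not taken from horses.csv)
toy = pd.DataFrame({
    'Early_Market_Price': [3.0, 5.0, 11.0],
    'Early_Model_Price': [2.5, 4.0, 10.0],
})

# Win probability implied by the model's price
p_model = 1 / toy['Early_Model_Price']
fractions = kelly_fraction(p_model, toy['Early_Market_Price'])

# Normalise so that the stakes sum to the bankroll, as in the cell above
bankroll = 100000
toy['Stake'] = bankroll * fractions / fractions.sum()
print(toy)
```

In this toy example the raw Kelly fractions are 0.1, 0.0625 and 0.01, which sum to 0.1725. Normalising therefore scales every stake up by a factor of roughly 5.8 compared with what the raw fractions would suggest.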


## Q3

$$
EV = p \cdot \text{stake} \cdot \text{odds} - (1-p) \cdot \text{stake}
$$

### Using the win_fair_price calculate the Total EV RoI for: 
- a) the Unit Stake strategy
- b) the Kelly Stake strategy

In [127]:
#calculate true probability of winning
p_true_win = 1 / selected_bets['win_fair_price']

#EV using unit stake strategy
unit_stake_bets['EV'] = p_true_win * unit_stake_bets['Stake'] * unit_stake_bets['Early_Market_Price'] - (1 - p_true_win)*unit_stake_bets['Stake']
unit_total_ev_roi = sum(unit_stake_bets['EV']) / sum(unit_stake_bets['Stake'])

#EV using Kelly stake strategy
kelly_stake_bets['EV'] = p_true_win * kelly_stake_bets['Stake'] * kelly_stake_bets['Early_Market_Price'] - (1 - p_true_win)*kelly_stake_bets['Stake']
kelly_total_ev_roi = sum(kelly_stake_bets['EV']) / sum(kelly_stake_bets['Stake'])

print(f'Total EV ROI for Unit Stake Strategy = {unit_total_ev_roi}')
print(f'Total EV ROI for Kelly Stake Strategy = {kelly_total_ev_roi}')

Total EV ROI for Unit Stake Strategy = 1.037653958682085
Total EV ROI for Kelly Stake Strategy = 0.6303574392383056
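
The EV formula at the top of this question is exact if `odds` denotes the net payout per unit staked. The prices in this dataset appear to be decimal prices (the PnL in Q2 uses `Early_Market_Price * winner - 1`), in which case the profit on a win is `stake * (odds - 1)` and the two expressions differ by `p * stake` per bet. A minimal sketch comparing both on made-up numbers, where `ev_decimal` and `ev_as_written` are hypothetical helper names:

```python
def ev_decimal(p_win, stake, decimal_odds):
    """Expected profit when a win returns stake * decimal_odds in total."""
    # Win: profit is stake * (odds - 1); loss: profit is -stake
    return p_win * stake * (decimal_odds - 1) - (1 - p_win) * stake

def ev_as_written(p_win, stake, odds):
    """The EV formula exactly as stated in Q3."""
    return p_win * stake * odds - (1 - p_win) * stake

# Made-up example: fair price 4.0 (p = 0.25), bookmaker price 5.0, unit stake
p, stake, odds = 0.25, 1.0, 5.0
print(ev_decimal(p, stake, odds))     # 0.25
print(ev_as_written(p, stake, odds))  # 0.5
```

Under the decimal-price assumption, the EV RoI of a strategy should be comparable to its long-run simulated PnL RoI, which gives a quick consistency check on whichever convention is used.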


## Q4: Is the model profitable in the long term?

#### a). Monte-Carlo Simulation

Here we run a simulation of 1,000,000 samples. Each sample draws one of the selected bets at random and simulates the race outcome from the true probability. We then calculate the profits and losses under the unit and Kelly strategies, with the same assumptions holding for the latter as in question 2b.

In [128]:
import random
from tqdm import tqdm

In [133]:
#record per-sample results for the unit and Kelly strategies
record = {'unit_PnL': [], 'unit_stakes': [], 
          'kelly_fractions': [], 'odds*winner': []}

#run simulation
num_simulations = 1000000
for _ in tqdm(range(num_simulations)):
    sample = selected_bets.iloc[random.randint(0, len(selected_bets)-1)]
    p_model_win = 1 / sample['Early_Model_Price']
    p_true_win = 1 / sample['win_fair_price'] #true probability of winning
    winner = random.random() < p_true_win #sample outcome
    odds = sample['Early_Market_Price'] #bookmaker odds that we pay

    #calculate unit PnL
    unit_PnL = odds * winner - 1
    record['odds*winner'].append(odds * winner)
    record['kelly_fractions'].append((p_model_win*odds - 1) / (odds - 1))
    record['unit_PnL'].append(unit_PnL)
    record['unit_stakes'].append(1)

100%|██████████| 1000000/1000000 [02:56<00:00, 5661.08it/s]


In [135]:
#kelly statistics
bankroll = num_simulations
record['refactored_fractions'] = record['kelly_fractions'] / sum(record['kelly_fractions'])
record['kelly_stakes'] = bankroll * record['refactored_fractions']
record['kelly_PnL'] = record['kelly_stakes'] * (record['odds*winner'] - np.ones(num_simulations))

In [136]:
#unit strategy
total_unit_stake = sum(record['unit_stakes'])
total_unit_pnl = sum(record['unit_PnL'])
total_unit_pnl_roi = total_unit_pnl / total_unit_stake

print(f'Total Unit Stake = {total_unit_stake}')
print(f'Total Unit PnL ROI = {total_unit_pnl_roi}')

#kelly strategy
total_kelly_stake = sum(record['kelly_stakes'])
total_kelly_pnl = sum(record['kelly_PnL'])
total_kelly_pnl_roi = total_kelly_pnl / total_kelly_stake

print(f'Total Kelly Stake = {total_kelly_stake}')
print(f'Total Kelly PnL ROI = {total_kelly_pnl_roi}')

Total Unit Stake = 1000000
Total Unit PnL ROI = 0.14747568000001127
Total Kelly Stake = 1000000.0000000458
Total Kelly PnL ROI = 0.05533722176597893


After running a simulation of 1,000,000 samples, we observe that the PnL ROI is 0.1475 for the unit strategy and 0.0553 for the Kelly strategy, so the model should be profitable in the long term. The fact that the unit strategy performs better than the Kelly strategy suggests that the model's prices are accurate but not precise. In other words, the model correctly identifies when a bet should be made (i.e. when the bookmaker underestimates the probability of a horse winning), but its prices do not reflect the true probabilities closely enough to justify weighting bets by the model's predictions.
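
The progress bar above shows that the row-by-row loop took about three minutes. The same sampling scheme can be expressed with NumPy arrays, which is typically much faster. The following is a sketch on made-up selections for the unit strategy only; the `toy` frame and the seed are assumptions, and the random stream differs from the `random` module used above, so it would not reproduce the exact figures reported here:

```python
import numpy as np
import pandas as pd

# Made-up selections (not taken from horses.csv)
toy = pd.DataFrame({
    'win_fair_price': [2.0, 4.0, 10.0],
    'Early_Market_Price': [2.2, 4.4, 11.0],
})

rng = np.random.default_rng(seed=0)  # the seed is an arbitrary choice
num_simulations = 1_000_000

# Draw which selection is bet on in each simulated trial
idx = rng.integers(0, len(toy), size=num_simulations)
p_true = 1 / toy['win_fair_price'].to_numpy()[idx]
odds = toy['Early_Market_Price'].to_numpy()[idx]

# Sample each race outcome from the true win probability
winner = rng.random(num_simulations) < p_true

# Unit-stake PnL per trial: odds on a win, minus the stake
unit_pnl = odds * winner - 1
unit_roi = unit_pnl.sum() / num_simulations
print(f'Simulated unit-stake PnL ROI = {unit_roi:.4f}')
```

In this toy setup every selection has `p * odds = 1.1`, so the simulated ROI should settle close to 0.1, which makes it easy to check that the sampling logic behaves as intended.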

#### b). Which of the four prices (win_starting_price, Early_Market_Price, Early_Model_Price, Starting_Model_Price) is the best predictor of the win_fair_price, based on the data provided? Explain your answer.

To determine which of the prices is the best predictor of the true price, we can calculate the $r^2$ value (the coefficient of determination, i.e. the square of the Pearson correlation) between each price and the fair price. The price whose $r^2$ is closest to 1 is the best predictor.

In [140]:
from scipy import stats

true_price = horses_df['win_fair_price']
early_market_price = horses_df['Early_Market_Price']
early_model_price = horses_df['Early_Model_Price']
starting_model_price = horses_df['Starting_Model_Price']

def r_squared(x, y):
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return r_value**2

early_market = r_squared(true_price, early_market_price)
early_model = r_squared(true_price, early_model_price)
starting_model = r_squared(true_price, starting_model_price)

print(r"Early Market r^2 = {:.4f}".format(early_market))
print(r"Early Model r^2 = {:.4f}".format(early_model))
print(r"Starting Model r^2 = {:.4f}".format(starting_model))

Early Market r^2 = 0.3076
Early Model r^2 = 0.0050
Starting Model r^2 = 0.0050


The coefficients of determination between the fair price and the different prices are: 

- Early Market Price = 0.3076
- Early Model Price = 0.0050
- Starting Model Price = 0.0050

Therefore, of the three prices compared, the best predictor of the actual probability that a horse wins is the early market price.
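
The comparison above covers three of the four prices listed in the question; `win_starting_price` could be evaluated in the same way. The following is a sketch of a helper that ranks any set of candidate columns by $r^2$, shown on made-up prices. The name `r_squared_table` is hypothetical, and it uses `np.corrcoef`, whose squared value matches the `r_value**2` from `stats.linregress` for a simple linear fit:

```python
import numpy as np
import pandas as pd

def r_squared_table(df, target, candidates):
    """Squared Pearson correlation of each candidate column with the target."""
    scores = {col: np.corrcoef(df[target], df[col])[0, 1] ** 2
              for col in candidates}
    return pd.Series(scores).sort_values(ascending=False)

# Made-up prices for illustration only (not taken from horses.csv)
toy = pd.DataFrame({
    'win_fair_price': [2.0, 4.0, 8.0, 16.0],
    'win_starting_price': [2.0, 4.0, 8.0, 16.0],
    'Early_Market_Price': [1.8, 3.9, 6.5, 15.0],
})

result = r_squared_table(toy, 'win_fair_price',
                         ['win_starting_price', 'Early_Market_Price'])
print(result)
```

On the real data this would be called as `r_squared_table(horses_df, 'win_fair_price', [...])` with all four price columns in the list.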