# Online Sports Betting: Beating the Bookie

Lisandro Kaunitz, Shenjun Zhong & Javier Kreiner, the authors of **Beating the bookies with their own numbers - and how the online sports betting market is rigged**, attempt a novel and brilliant approach to sports betting. Rather than compete with the bookmakers' predictions, Kaunitz et al. attempt to beat the bookmakers by using their predictions against them. In the paper, they demonstrate how to take advantage of mispriced odds using the implicit information in bookmakers' aggregate odds and conclude it is possible. Bookmakers countered the authors' success by limiting the size and type of bets they were allowed to place, leading to the second conclusion of the paper: even if a bettor has a consistently profitable strategy, the bookies are under no obligation to continue taking his or her bets. Betting exchanges use discriminatory practices against successful gamblers, and online sports betting remains a long-term losing proposition.

Github: https://github.com/Lisandro79/BeatTheBookie/tree/master/src

Paper: https://www.researchgate.net/publication/320296375_Beating_the_bookies_with_their_own_numbers_-_and_how_the_online_sports_betting_market_is_rigged

Blog: https://www.lisandrokaunitz.com/index.php/en/category/beatthebookies-en/

### Summary 

Originally, I intended to create a value betting algorithm for this project, but Kaunitz, et al. convinced me it was a bad idea for the simple reason that my model would have to predict the probability of sporting events better than the bookmakers' models - and making good models is their whole business. Furthermore, I can aggregate their guesses to make mine and bypass the data-intensive and computationally expensive process of analyzing teams and players.

This notebook represents my best attempt to recreate such a model using the same soccer match data. I found...

### Agenda

* Exploring the odds
* Evaluating the odds

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import scipy.stats as scs
import pandas as pd
import gzip
import shutil
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')

In [2]:
#Unzip data into Kaggle working directory
PATH = "../input/beat-the-bookie-worldwide-football-dataset/closing_odds.csv.gz"
with gzip.open(PATH, 'rb') as f_in:
    with open('./closing_odds.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

#Matches with odds data.
close = pd.read_csv('./closing_odds.csv', index_col=0)

### Exploring the odds

Kaunitz, et al. collected historical closing odds (odds provided at game-time) for 479,440 soccer matches between 2005 and 2015 from 32 online bookmakers. The hard part of scraping the data off bookies' websites and cleaning it has been done for me.

In [3]:
close.info()

In [4]:
close.head(3)

I'll convert the odds of a home win, away win, or draw to a probability for each match. The consensus probability of an outcome is the inverse of the mean odds quoted for it across bookmakers.

$P = 1 / (avg.odds)$

In [5]:
#prob. rounded down (floored) to a multiple of 1/80
bins = 80
close['avg_prob_home_win'] = np.floor((1/close['avg_odds_home_win'])*bins)/bins
close['avg_prob_away_win'] = np.floor((1/close['avg_odds_away_win'])*bins)/bins
close['avg_prob_draw'] = np.floor((1/close['avg_odds_draw'])*bins)/bins
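
As a quick illustration of the conversion, the sketch below uses made-up decimal odds from three hypothetical bookmakers (these numbers are not taken from the dataset). It also shows that the three consensus probabilities typically sum to more than 1; the excess reflects the bookmakers' margin.

```python
from statistics import mean

# Hypothetical decimal odds quoted by three bookmakers for one match
# (illustrative values only, not from the dataset).
odds = {
    "home_win": [1.90, 1.95, 2.00],
    "away_win": [3.80, 4.00, 4.20],
    "draw": [3.30, 3.40, 3.50],
}

# Consensus probability: inverse of the mean odds for each outcome.
consensus = {outcome: 1 / mean(quotes) for outcome, quotes in odds.items()}

# Amount by which the three probabilities together exceed 1.
margin = sum(consensus.values()) - 1

for outcome, p in consensus.items():
    print(f"{outcome}: {p:.3f}")
print(f"probabilities sum to 1 + {margin:.3f}")
```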

All three odds distributions are right-skewed. Home-wins are more likely than away-wins. Away-wins have more variance. Notice that these are decimal odds: they represent the total amount the bettor receives back, stake included, if she wins a 1 unit bet, which means the odds cannot be less than 1. Mean odds restated as ratios would be:
* Home-win: 1.5:1
* Away-win: 2:1
* Draw: 2.7:1

In [6]:
for odds in ['avg_odds_home_win', 'avg_odds_away_win', 'avg_odds_draw']:
    plt.figure(figsize=(9,1))
    sns.histplot(close[odds].clip(0,10)) #there are some really high odds
    plt.ylabel('')
    plt.xlim(1,10)
    plt.show()
    print(f'   mean: {round(close[odds].mean(),2)}')
    print(f'   st. dev.: {round(close[odds].std(),2)}')

### Evaluating the odds

A strategy intended to beat the bookmakers at predicting the game outcome (value betting, the commonest form of algorithmic betting) requires a more accurate model than the ones bookmakers have developed.

The mean accuracy in the prediction of the outcome is the proportion of games ending in home team victory, draw, or away team victory for each probability bin. Kaunitz, et al. figured out that the **consensus probability is a good estimate of the underlying probability of an outcome**. There's not much room for improvement if I want to build a value betting algorithm.

In [27]:
#create dummy variable for home win, away win, or draw
close['home_win'] = close.home_score > close.away_score
close['away_win'] = close.home_score < close.away_score
close['draw'] = close.home_score == close.away_score

#calculate number of observations (matches) in each probability bin
home_obs = close.groupby('avg_prob_home_win').size()
away_obs = close.groupby('avg_prob_away_win').size()
draw_obs = close.groupby('avg_prob_draw').size()

#calculate the accuracy for each predicted probability for each outcome
#(select the column before aggregating so non-numeric columns are ignored)
home_win_acc = close.groupby('avg_prob_home_win')['home_win'].mean()
away_win_acc = close.groupby('avg_prob_away_win')['away_win'].mean()
draw_acc = close.groupby('avg_prob_draw')['draw'].mean()

#retain accuracy if there are a min. of observations
min_obs = 100
home_win_acc = home_win_acc[home_obs>min_obs]
away_win_acc = away_win_acc[away_obs>min_obs]
draw_acc = draw_acc[draw_obs>min_obs]

In [28]:
plt.figure(figsize=(6,6))
plt.plot(home_win_acc)
plt.plot(away_win_acc)
plt.plot(draw_acc)
plt.title("Consensus Probabilities are Accurate")
plt.xlabel(f"estimated prob. (min. {min_obs} obs.)")
plt.ylabel("% of predictions correct")
plt.legend(['Home Win','Away Win','Draw'])
plt.show()

### Correlations and Regression

Linear regression shows a strong correlation between the consensus probability and the outcome of the game.

In [29]:
X1, y1 = np.array(home_win_acc.index), home_win_acc.values
X2, y2 = np.array(away_win_acc.index), away_win_acc.values
X3, y3 = np.array(draw_acc.index), draw_acc.values

#usage note: Pearson is most appropriate for measurements taken from an interval scale.
print(f'Home win correlation w/ consensus prob.: {scs.pearsonr(X1, y1)[0]}')
print(f'Away win correlation w/ consensus prob.: {scs.pearsonr(X2, y2)[0]}')
print(f'Draw correlation w/ consensus prob.: {scs.pearsonr(X3, y3)[0]}')

In [30]:
lr_home = LinearRegression().fit(X1.reshape(-1,1), y1)
print(f'Home win R^2: {lr_home.score(X1.reshape(-1,1), y1)}')
print(f'  Slope: {lr_home.coef_[0]}')
alpha_home = -lr_home.intercept_
print(f'  Intercept: {-alpha_home}')

lr_away = LinearRegression().fit(X2.reshape(-1,1), y2)
print(f'\nAway win R^2: {lr_away.score(X2.reshape(-1,1), y2)}')
print(f'  Slope: {lr_away.coef_[0]}')
alpha_away = -lr_away.intercept_
print(f'  Intercept: {-alpha_away}')

lr_draw = LinearRegression().fit(X3.reshape(-1,1), y3)
print(f'\nDraw R^2: {lr_draw.score(X3.reshape(-1,1), y3)}')
print(f'  Slope: {lr_draw.coef_[0]}')
alpha_draw = -lr_draw.intercept_
print(f'  Intercept: {-alpha_draw}')
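
To see what the intercept measures, here is a small synthetic check with invented numbers (not the match data). It assumes observed frequencies sit a constant 0.04 below the quoted probabilities, plus a little noise, and uses `np.polyfit` as a stand-in for `LinearRegression`. Under that assumption a straight-line fit should return a slope near 1 and an intercept near -0.04.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration data: quoted probabilities, and observed frequencies
# shifted down by an assumed constant offset of 0.04 plus small noise.
quoted = np.linspace(0.1, 0.9, 40)
observed = quoted - 0.04 + rng.normal(0, 0.005, size=quoted.size)

# Degree-1 least-squares fit; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(quoted, observed, 1)
print(f"slope: {slope:.3f}, intercept: {intercept:.3f}")
```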


### Building a Strategy

Now I'll try to implement a strategy like Kaunitz, et al.

The optimal strategy maximizes **expected payoff**, which looks like this.

$E(X) = p*Ω - 1$

where $X$ is a random variable representing the payoff of the bet,

where $Ω$ are the odds paid by the bookmaker

and where $p$ is the underlying probability of the outcome

I should bet when $E(X) > 0$
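
A worked example of the formula, with hypothetical numbers: at decimal odds of 3.0 a bet breaks even when the underlying probability is 1/3, so a probability of 0.36 gives a positive expected payoff and 0.30 a negative one. The function name `expected_payoff` is made up for this sketch.

```python
def expected_payoff(p: float, odds: float) -> float:
    """Expected payoff per unit staked: E(X) = p * odds - 1."""
    return p * odds - 1


# Same odds, two assumed underlying probabilities either side of break-even (1/3).
for p in (0.30, 0.36):
    ev = expected_payoff(p, 3.0)
    print(f"p = {p:.2f}, odds = 3.0 -> E(X) = {ev:+.2f}")
```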

Kaunitz, et al. calculated an adjustment term $α$ for the consensus probabilities in order to calculate the underlying probability to account for the bookies' commission. In their paper, the regression intercepts are the estimated $α$ while the final value is found through trial-and-error. Why intercepts? Because we would expect the regression line to pass through the origin (0, 0) if there wasn't a commission.

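Therefore, substituting $p = p_{cons} - α$, where $p_{cons}$ is the consensus probability, into $E(X) > 0$ and rearranging for odds, the betting condition is,

$Ω > 1 / (p_{cons} - α)$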

Notice that increasing $α$ raises the odds threshold, which increases the expected value of each qualifying bet while decreasing the number of available bets (because a larger margin is required). To implement this strategy in real life, I'd need a dashboard showing each game and the bookmaker offering the maximum odds. To keep things simple, all bets in this simulation are the same size (50 dollars).
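
The betting rule can be sketched as a small function. This is a hedged illustration, not the code used in the paper: `should_bet` and its argument names are made up here, and the guard for a non-positive adjusted probability is my own assumption about how to treat extreme longshots.

```python
def should_bet(p_cons: float, max_odds: float, alpha: float = 0.05) -> bool:
    """Return True when max_odds > 1 / (p_cons - alpha).

    p_cons is the consensus probability and max_odds the best odds on offer.
    Assumption: skip the bet when p_cons - alpha is not positive, because
    the threshold 1 / (p_cons - alpha) is then undefined or negative.
    """
    adjusted = p_cons - alpha
    if adjusted <= 0:
        return False
    return max_odds > 1 / adjusted


# Consensus probability 0.30 gives a threshold of about 1 / 0.25 = 4.0.
print(should_bet(0.30, 4.2))   # best odds above the threshold
print(should_bet(0.30, 3.6))   # best odds below the threshold
print(should_bet(0.04, 30.0))  # adjusted probability is negative, so no bet
```

Raising `alpha` raises the threshold for every outcome, so fewer bets pass the test.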


In [31]:
#original alpha = -(alpha_home + alpha_away + alpha_draw)/3 was less successful
alpha = .05
min_odds = 4 #min. number of bookmakers quoting odds on an outcome
bet_size = 50 #this could be any amount

The max odds represent the best available odds on that bet among all the bookmakers, so that's what will be used to calculate the payoff. Ensure at least four bookmakers have given odds for the bet so the market isn't too thin.

In [44]:
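#rows with fewer than min_odds bookmakers become NaN here and never trigger a bet
#note: where avg_prob < alpha the implied odds are negative, so the comparison below is always True for those rows
close['implied_odds_home_win'] = 1/(close[close.n_odds_home_win>=min_odds].avg_prob_home_win - alpha)
close['implied_odds_away_win'] = 1/(close[close.n_odds_away_win>=min_odds].avg_prob_away_win - alpha)
close['implied_odds_draw'] = 1/(close[close.n_odds_draw>=min_odds].avg_prob_draw - alpha)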

close['bet_home_win'] = close.implied_odds_home_win < close.max_odds_home_win
close['bet_away_win'] = close.implied_odds_away_win < close.max_odds_away_win 
close['bet_draw'] = close.implied_odds_draw < close.max_odds_draw

close.head(3)

I put the bets in their own dataframe and look at payouts. What's going on? I'm losing money hand over fist.

In [59]:
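#adding boolean columns acts as a logical OR, so this keeps matches with at least one bet
bets = close[(close['bet_home_win']+close['bet_away_win']+close['bet_draw'])==1].copy()
#for booleans, bet <= outcome is False only when a bet was placed and the outcome did not occur
bets['bet_won'] = (bets.bet_home_win <= bets.home_win) & (bets.bet_away_win <= bets.away_win) & (bets.bet_draw <= bets.draw)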
bets['payoff'] = -1 + bets.bet_won * (bets.home_win * bets.max_odds_home_win +
                                      bets.away_win * bets.max_odds_away_win +
                                      bets.draw * bets.max_odds_draw)
bets = bets[['match_date', 'home_win', 'away_win', 'payoff']]
bets['gain_loss'] = bets.payoff * bet_size
bets['cumulative'] = bet_size + bets.gain_loss.cumsum()
             
acc = sum(bets.payoff>0)/len(bets)*100
print(f'{len(bets)} bets placed, {round(len(bets)/len(close)*100,1)} percent of all games')
print(f'{round(acc,2)} percent of bets won')

bets.head(3)
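
The `bet_won` line relies on comparing booleans, which is easy to misread. A toy example with invented values (NumPy arrays standing in for the dataframe columns) shows that `bet <= outcome` is False only where a bet was placed and the outcome did not occur:

```python
import numpy as np

# Invented values: whether a home-win bet was placed, and whether the home team won.
bet_home_win = np.array([True, False, True])
home_win = np.array([True, True, False])

# True <= True -> True (bet placed and won)
# False <= True -> True (no bet placed, so nothing to lose)
# True <= False -> False (bet placed and lost)
consistent = bet_home_win <= home_win
print(consistent)
```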

In [72]:
plt.plot(bets.cumulative)
plt.ylabel('cumulative gains/losses')
plt.xlabel('match ID (chronological over 10 years)')
plt.title('Losing!')
plt.show()

In [47]:
bets.tail(3)

### Random bet strategy (baseline)

I randomly choose the same number of games to bet on from the whole dataset and bet home win, away win, or draw based on the prior probability of those outcomes.

In [43]:
#priors
prior_home = sum(bets.home_win)/len(bets) #.45
prior_away = sum(bets.away_win)/len(bets) #.31
prior_draw = 1 - prior_home - prior_away #too high?

#draw one random bet per game according to the priors: 0 = home win, 1 = away win, 2 = draw
arr = np.random.choice([0, 1, 2], size=len(bets), p=[prior_home, prior_away, prior_draw])

prior_home, prior_away, prior_draw

In [77]:
#bootstrapping
trials = 30
accs = []

for _ in range(trials):
    idx = np.random.choice(len(close), len(bets)) #sample games with replacement
    samp = close.iloc[idx].copy()
    samp['result'] = samp.away_win + 2*samp.draw #categorical: 0 = home win, 1 = away win, 2 = draw
    samp['bet_won'] = samp.result == arr
    samp['payoff'] = -1 + samp.bet_won * (samp.home_win * samp.max_odds_home_win +
                                          samp.away_win * samp.max_odds_away_win + samp.draw * samp.max_odds_draw)
    samp = samp[['match_date', 'home_win', 'away_win', 'payoff']]
    samp['gain_loss'] = samp.payoff * bet_size
    samp = samp.sort_values(by='match_id')
    samp['cumulative'] = bet_size + samp.gain_loss.cumsum()
    accs.append(sum(samp.payoff>0)/len(samp)*100)

accs = np.array(accs)
print(f'Mean betting accuracy {round(accs.mean(),3)}% and variance {round(accs.var(),4)}')
samp.head(3)

In [79]:
plt.plot(samp.cumulative)
plt.ylabel('cumulative gains/losses')
plt.xlabel('match ID (chronological over 10 years)')
plt.title('Also losing.')
plt.show()

In [78]:
samp.tail(3)

## Results

While the implied odds strategy beat the random bet strategy w.r.t. accuracy (40.3% to 35.6%), it loses a lot more simulated money. It seems my version is betting draws too often, which may be hurting the accuracy. This is underperforming Kaunitz's model by a lot.

I'll need more time to figure out why my strategy isn't profitable even on training data before I can think about real bets.