# Oddstradamus
### Good odds and where to find them

### Introduction

"In the long run, the bookmaker always wins." The aim of this project is to disprove exactly that claim. We focus on the football betting market and try to develop a strategy that is profitable in the long term, so that the bookmaker leaves the pitch as the loser. Three aspects of this strategy need to be optimised.

These are:

- the selection of suitable football matches
- the prediction of the corresponding outcome
- and the determination of the optimal stake per bet.

To achieve this goal, a data set covering almost 60,000 football matches from 22 different leagues is compiled. This data set is processed, evaluated and then used to develop the long-term strategy with the help of selected machine learning algorithms.

The data comes from the following source: [Data source](https://www.football-data.co.uk/downloadm.php)

### Merging the data

The first step is to read the data from 264 .csv files and combine them. Before the data set is saved, an additional column holding the season of each match is created so that every match can be assigned unambiguously to its season.
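
The same read, tag and combine steps can also be written as one small helper. The following is only a sketch under assumptions: the function names `load_season` and `merge_seasons` are hypothetical and not part of the original notebook, and unlike the cells below it resets the row index after merging.

```python
import glob
import os

import pandas as pd


def load_season(folder, season_label, file_type="csv", separator=","):
    """Read every file in one season folder and tag the rows with the season."""
    paths = sorted(glob.glob(os.path.join(folder, "*." + file_type)))
    frames = [pd.read_csv(path, sep=separator) for path in paths]
    season = pd.concat(frames, ignore_index=True)
    season["Season"] = season_label
    return season


def merge_seasons(folder_to_label):
    """Stack all seasons into one DataFrame (hypothetical helper)."""
    frames = [load_season(folder, label) for folder, label in folder_to_label.items()]
    # Columns missing in some seasons are filled with NaN by pd.concat.
    return pd.concat(frames, ignore_index=True, sort=False)
```

With the folder layout used in this notebook, a call could look like `merge_seasons({'13:14': '13/14', '14:15': '14/15'})`, extended by the remaining seasons.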

In [1]:
# import packages
import glob
import os
import pandas as pd

In [2]:
# loading the individual datasets of the different seasons
file_type = 'csv'
separator = ','
df_20_21 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('20:21' + "/*."+file_type)],ignore_index=True)
df_19_20 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('19:20' + "/*."+file_type)],ignore_index=True)
df_18_19 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('18:19' + "/*."+file_type)],ignore_index=True)
df_17_18 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('17:18' + "/*."+file_type)],ignore_index=True)
df_16_17 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('16:17' + "/*."+file_type)],ignore_index=True)
df_15_16 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('15:16' + "/*."+file_type)],ignore_index=True)
df_14_15 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('14:15' + "/*."+file_type)],ignore_index=True)
df_13_14 = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob('13:14' + "/*."+file_type)],ignore_index=True)

In [3]:
# add a column of the season for clear assignment
df_20_21['Season'] = '20/21'
df_19_20['Season'] = '19/20'
df_18_19['Season'] = '18/19'
df_17_18['Season'] = '17/18'
df_16_17['Season'] = '16/17'
df_15_16['Season'] = '15/16'
df_14_15['Season'] = '14/15'
df_13_14['Season'] = '13/14'

In [4]:
# combining the individual datasets into one
dfs = [df_14_15, df_15_16, df_16_17, df_17_18, df_18_19, df_19_20, df_20_21]
# DataFrame.append was removed in pandas 2.0; pd.concat produces the same result
results = pd.concat([df_13_14] + dfs, sort=False)

In [5]:
# saving the merged dataframe for processing
results.to_csv("Data/Results2013_2021.csv")

### Quick Overview

In [6]:
# output of the data shape
results.shape

(59415, 133)

In its initial state, the data set comprises almost 60,000 rows and 133 columns. Besides the league, the season and the two teams involved, it records the final result and match statistics: goals, shots, shots on target, corners, fouls, and yellow and red cards for the home and away teams. It also contains betting odds from a large number of bookmakers.

As a large proportion of the columns are only sporadically filled, especially the betting odds, only those bookmakers whose odds are available across the roughly 60,000 matches were retained. This step alone reduced the data set from 133 to 31 columns.
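
One way to find such well-filled columns is to measure the share of non-missing values per column. This is a minimal sketch, not the procedure actually used here: the helper `well_filled_columns`, the 95 % threshold and the sparse toy column `XYZH` are assumptions made for illustration.

```python
import pandas as pd


def well_filled_columns(df, min_share=0.95):
    """Return the columns whose share of non-missing values is at least min_share."""
    coverage = df.notna().mean()  # fraction of filled cells per column
    return coverage[coverage >= min_share].index.tolist()


# Toy frame: the Bet365 column is complete, the hypothetical "XYZH" column is sparse.
toy = pd.DataFrame({
    "FTR": ["H", "D", "A", "H"],
    "B365H": [1.8, 2.1, 3.4, 1.5],
    "XYZH": [None, None, 2.0, None],
})
kept = well_filled_columns(toy, min_share=0.95)
print(kept)
```

Applied to `results`, the returned list could serve as a starting point for the manual column selection in the next cell.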

In [7]:
# selecting the necessary columns of the original data set
results = results[['Div', 'Season', 'HomeTeam','AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
                   'AC', 'HY', 'AY', 'HR', 'AR','B365H','B365D','B365A', 'BWH','BWD','BWA', 'IWH', 'IWD', 'IWA', 'WHH', 'WHD', 'WHA']]
results.shape

(59415, 31)

The remaining columns are briefly explained in the following table:

| Column | Description |
| - | - |
| `Div` | League Division |
| `Season` | Season in which the match took place |
| `HomeTeam` | Home Team |
| `AwayTeam` | Away Team |
| `FTHG` | Full Time Home Team Goals |
| `FTAG` | Full Time Away Team Goals |
| `FTR` | Full Time Result (H=Home Win, D=Draw, A=Away Win) |
| `HS` | Home Team Shots |
| `AS` | Away Team Shots |
| `HST` | Home Team Shots on Target |
| `AST` | Away Team Shots on Target |
| `HF` | Home Team Fouls Committed |
| `AF` | Away Team Fouls Committed |
| `HC` | Home Team Corners |
| `AC` | Away Team Corners |
| `HY` | Home Team Yellow Cards |
| `AY` | Away Team Yellow Cards |
| `HR` | Home Team Red Cards |
| `AR` | Away Team Red Cards |
| `B365H` | Bet365 Home Win Odds |
| `B365D` | Bet365 Draw Odds |
| `B365A` | Bet365 Away Win Odds |
| `BWH` | Bet&Win Home Win Odds |
| `BWD` | Bet&Win Draw Odds |
| `BWA` | Bet&Win Away Win Odds |
| `IWH` | Interwetten Home Win Odds |
| `IWD` | Interwetten Draw Odds |
| `IWA` | Interwetten Away Win Odds |
| `WHH` | William Hill Home Win Odds |
| `WHD` | William Hill Draw Odds |
| `WHA` | William Hill Away Win Odds |

Since one aim is to use the data to predict football matches, it must be noted that everything except the league, the season, the teams involved and the betting odds only becomes known after the final whistle. In its present form, the data therefore cannot be used directly to predict the outcome of a match. In the following notebook, the data is processed and transformed so that it can contribute to the prediction without data leakage.
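
To illustrate what leakage-free means, post-match statistics can be turned into features that only look at earlier matches. The sketch below is one common approach and not necessarily the transformation used in the following notebook: the helper `add_prior_form`, the window of 5 and the toy data are assumptions, the rows are assumed to be in chronological order, and only a team's home matches are considered for simplicity.

```python
import pandas as pd


def add_prior_form(df, team_col, stat_col, window=5):
    """Add the team's mean of stat_col over its previous `window` matches.

    The shift(1) excludes the current match, so the feature never contains
    information from the match it is meant to help predict.
    """
    out = df.copy()
    out[stat_col + "_prior_mean"] = (
        out.groupby(team_col)[stat_col]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return out


# Toy data: team A plays three home matches, team B one.
toy_matches = pd.DataFrame({
    "HomeTeam": ["A", "B", "A", "A"],
    "HS": [10, 8, 14, 6],
})
with_form = add_prior_form(toy_matches, "HomeTeam", "HS")
print(with_form)
```

A team's first match has no history, so its feature value is missing and has to be dropped or imputed before modelling.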