### Machine learning for football results

A script that tries to predict football match scores with machine learning, using historical match results.

I don't expect this to work very well, given that squads change constantly, but it may work a little, since national teams do have a 'national identity' in their playing style.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

In [3]:
data_path = '../data/football_data/results.csv'
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


Get training, validation and target data ready.

To reduce the amount of data, we will only consider UEFA competitions: the European Championship, its qualifiers, and the Nations League.

In [4]:
# Select tournaments to consider
tournaments = ['UEFA Nations League', 'UEFA Euro', 'UEFA Euro qualification']
euro_mask = data['tournament'].isin(tournaments)
euro_data = data[euro_mask]
euro_data.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
4713,1958-09-28,Russia,Hungary,3,1,UEFA Euro qualification,Moscow,Soviet Union,False
4714,1958-10-01,France,Greece,7,1,UEFA Euro qualification,Paris,France,False
4741,1958-11-02,Romania,Turkey,3,0,UEFA Euro qualification,Bucharest,Romania,False
4748,1958-12-03,Greece,France,1,1,UEFA Euro qualification,Athens,Greece,False
4795,1959-04-05,Republic of Ireland,Czechoslovakia,2,0,UEFA Euro qualification,Dublin,Republic of Ireland,False


In [5]:
# Some preprocessing: convert the date to just the year; also convert neutral from True/False to 1/0
# Work on an explicit copy of the filtered rows so pandas does not emit SettingWithCopyWarning
euro_data = euro_data.copy()
euro_data['date'] = pd.to_datetime(euro_data['date']).dt.year
euro_data = euro_data.rename(columns={'date': 'year'})
euro_data['neutral'] = euro_data['neutral'].astype(int)
euro_data.head()


Unnamed: 0,year,home_team,away_team,home_score,away_score,tournament,city,country,neutral
4713,1958,Russia,Hungary,3,1,UEFA Euro qualification,Moscow,Soviet Union,0
4714,1958,France,Greece,7,1,UEFA Euro qualification,Paris,France,0
4741,1958,Romania,Turkey,3,0,UEFA Euro qualification,Bucharest,Romania,0
4748,1958,Greece,France,1,1,UEFA Euro qualification,Athens,Greece,0
4795,1959,Republic of Ireland,Czechoslovakia,2,0,UEFA Euro qualification,Dublin,Republic of Ireland,0


In [6]:
# So what we want to predict is the score of both the home and away team
y = euro_data[['home_score', 'away_score']]
# Drop those for X, use the rest
X = euro_data.drop(['home_score', 'away_score'], axis=1)

# Split the data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

# Select all categorical (string) columns. A cardinality filter
# (nunique() < 60) was tried first, but runtime is fine without it.
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].dtype == "object"]

# Select numerical columns; any column that is neither categorical nor numerical is dropped
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64', 'int32']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

print(categorical_cols, numerical_cols)
X_train.head()

['home_team', 'away_team', 'tournament', 'city', 'country'] ['year', 'neutral']


Unnamed: 0,home_team,away_team,tournament,city,country,year,neutral
12029,Belgium,Scotland,UEFA Euro qualification,Brussels,Belgium,1979,0
35068,Georgia,Latvia,UEFA Euro qualification,Tbilisi,Georgia,2011,0
45260,Gibraltar,Bulgaria,UEFA Nations League,Gibraltar,Gibraltar,2022,0
20542,Wales,Moldova,UEFA Euro qualification,Cardiff,Wales,1995,0
8602,Romania,Czechoslovakia,UEFA Euro qualification,Bucharest,Romania,1971,0


Preprocessing: one-hot encode the string columns.

In [7]:
# Check if there are any missing values we have to impute
missing_val_count_by_column = X_train.isnull().sum()
print(missing_val_count_by_column)
# Nope, so just use the one hot encoder

home_team     0
away_team     0
tournament    0
city          0
country       0
year          0
neutral       0
dtype: int64


In [8]:
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_cols),  # otherwise the numerical columns would be dropped
        ('cat', categorical_transformer, categorical_cols)
    ])

# Do the preprocessing
X_train_transformed = preprocessor.fit_transform(X_train)
X_valid_transformed = preprocessor.transform(X_valid)

In [9]:
# Print the values to make sure it all worked
feature_names = numerical_cols + list( preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols) )
X_train_transformed_df = pd.DataFrame(X_train_transformed.toarray(), columns=feature_names)
X_valid_transformed_df = pd.DataFrame(X_valid_transformed.toarray(), columns=feature_names)
X_train_transformed_df[["year", "neutral", "home_team_Belgium", "away_team_Scotland", "tournament_UEFA Euro qualification"]].head()

Unnamed: 0,year,neutral,home_team_Belgium,away_team_Scotland,tournament_UEFA Euro qualification
0,1979.0,0.0,1.0,1.0,1.0
1,2011.0,0.0,0.0,0.0,1.0
2,2022.0,0.0,0.0,0.0,0.0
3,1995.0,0.0,0.0,0.0,1.0
4,1971.0,0.0,0.0,0.0,1.0


Define the model and get predictions

In [10]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=1, 
                        early_stopping_rounds=5, random_state=42)
my_model.fit(X_train_transformed_df, y_train,
             eval_set=[(X_valid_transformed_df, y_valid)], 
             verbose=False)

predictions = my_model.predict(X_valid_transformed_df)

mae = mean_absolute_error(y_valid, predictions)  # signature is (y_true, y_pred)
print(f"Mean Absolute Error: {mae:.3f}")
print(predictions[:5], '\n', y_valid.head())

Mean Absolute Error: 0.900
[[1.15387   1.0556861]
 [1.4894325 1.1408753]
 [1.1495245 1.1852598]
 [1.5500809 1.0241152]
 [1.0500188 1.2662162]] 
        home_score  away_score
21090           0           1
13426           0           1
10285           2           1
20565           2           1
28327           2           4
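
An MAE on its own is hard to judge without a reference point. One possible reference, sketched here under the assumption that a "predict the average scoreline for every match" baseline is a fair comparison, is the MAE obtained from the training-set mean of each target column. The helper name `mean_score_baseline_mae` and the toy frames are hypothetical; in the notebook it could be called as `mean_score_baseline_mae(y_train, y_valid)`.

```python
import pandas as pd

def mean_score_baseline_mae(y_train: pd.DataFrame, y_valid: pd.DataFrame) -> float:
    """MAE when every match is predicted as the average training scoreline."""
    mean_scores = y_train.mean()                 # one mean per target column
    abs_errors = (y_valid - mean_scores).abs()   # the means are aligned on column names
    # Averaging over all cells equals the uniform average of the per-column MAEs
    return float(abs_errors.to_numpy().mean())

# Toy scores (hypothetical, for illustration only)
toy_train = pd.DataFrame({"home_score": [2, 0, 1, 1], "away_score": [1, 1, 0, 2]})
toy_valid = pd.DataFrame({"home_score": [3, 1], "away_score": [1, 0]})

baseline_mae = mean_score_baseline_mae(toy_train, toy_valid)
print(baseline_mae)  # 0.75 for the toy data
```

If the model's MAE is not clearly below this baseline, the features add little beyond the average scoreline.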


In [11]:
# See if it can get the winner right
correct_list = []
for i in range(len(predictions)):
    correct = False
    if (predictions[i][0] > predictions[i][1]) and (y_valid.iloc[i]['home_score'] > y_valid.iloc[i]['away_score']):
        correct = True
    elif (predictions[i][0] < predictions[i][1]) and (y_valid.iloc[i]['home_score'] < y_valid.iloc[i]['away_score']):
        correct = True
    elif (predictions[i][0] == predictions[i][1]) and (y_valid.iloc[i]['home_score'] == y_valid.iloc[i]['away_score']):
        correct = True

    correct_list.append(correct)

print(f"Correct winner predictions: {sum(correct_list)} out of {len(predictions)}, {sum(correct_list)/len(predictions)*100:.2f}%")

Correct winner predictions: 443 out of 748, 59.22%


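In the end, it predicts the correct outcome more than half of the time, which is at least better than a uniform random guess among home win, draw and away win (about 33%). Note that the equality check on two float predictions almost never fires, so the model effectively never predicts a draw.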
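
The winner check can also be written without an explicit loop, which makes it easy to reuse for a second model. The following is a minimal sketch, not the notebook's own code: the function name `outcome_accuracy` and the `draw_margin` parameter are assumptions introduced for illustration. With `draw_margin=0.0` it reproduces the strict comparison used in the loop; a positive margin treats near-equal predicted scores as a draw.

```python
import numpy as np

def outcome_accuracy(pred_scores, true_scores, draw_margin=0.0):
    """Fraction of matches whose outcome (home win / draw / away win) is predicted correctly.

    draw_margin is a hypothetical knob: predicted goal differences whose
    absolute value is <= draw_margin count as a predicted draw.
    """
    pred_scores = np.asarray(pred_scores, dtype=float)
    true_scores = np.asarray(true_scores, dtype=float)
    pred_diff = pred_scores[:, 0] - pred_scores[:, 1]
    true_diff = true_scores[:, 0] - true_scores[:, 1]
    # Treat small predicted differences as draws, then reduce to -1 / 0 / +1
    pred_outcome = np.sign(np.where(np.abs(pred_diff) <= draw_margin, 0.0, pred_diff))
    true_outcome = np.sign(true_diff)
    return float(np.mean(pred_outcome == true_outcome))

# Toy predictions and results (hypothetical, for illustration only)
toy_pred = [[1.2, 1.0], [0.9, 1.4], [1.1, 1.05], [1.6, 0.8]]
toy_true = [[2, 1], [0, 1], [1, 1], [0, 2]]

acc_strict = outcome_accuracy(toy_pred, toy_true)
acc_margin = outcome_accuracy(toy_pred, toy_true, draw_margin=0.1)
print(acc_strict, acc_margin)  # 0.5 0.75 for the toy data
```

In the notebook this could be called as `outcome_accuracy(predictions, y_valid)`. Any `draw_margin` value would have to be chosen on data other than the set it is evaluated on.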

In [16]:
# Now try to optimise model hyperparameters
from sklearn.model_selection import GridSearchCV
import numpy as np

# Parameter values to try
parameters = {
    "n_estimators": [10, 50, 100, 200, 300],  # 500, 1000, 1200 not better
    "learning_rate": [0.01, 0.05, 0.1, 0.5]
}

# Model with fixed parameters
model = XGBRegressor(n_jobs=-1, random_state=42, early_stopping_rounds=5)

# Grid searcher
model_cv = GridSearchCV(model, parameters, cv=2)
model_cv.fit(X_train_transformed_df, y_train,
             eval_set=[(X_valid_transformed_df, y_valid)], 
             verbose=False)

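# Print best parameters and the best mean cross-validated score
# (no `scoring` was passed, so this is the regressor's default score, R^2)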
print(model_cv.best_params_, model_cv.best_score_)

{'learning_rate': 0.1, 'n_estimators': 100} 0.19764468779293826
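
The second number printed above is not an MAE: when `GridSearchCV` is given no `scoring` argument it uses the estimator's own `score` method, which for scikit-learn style regressors is R². To make the search optimise the same metric that was reported earlier, a `scoring` string can be passed. The sketch below shows this on synthetic data with a `DecisionTreeRegressor` as a stand-in for the XGBoost model (both the data and the stand-in estimator are assumptions, chosen only to keep the example self-contained).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
# Two synthetic targets, loosely mimicking (home_score, away_score)
y_toy = np.column_stack([
    X_toy[:, 0] + rng.normal(scale=0.5, size=200),
    X_toy[:, 1] + rng.normal(scale=0.5, size=200),
])

param_grid = {"max_depth": [2, 4, 8]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),  # stand-in estimator, not XGBoost
    param_grid,
    cv=2,
    scoring="neg_mean_absolute_error",  # scoring=None would fall back to R^2
)
search.fit(X_toy, y_toy)

# scikit-learn maximises scores, so the MAE is reported negated
best_mae = -search.best_score_
print(search.best_params_, round(best_mae, 3))
```

The same `scoring` argument should carry over to the `XGBRegressor` search, though that has not been run here.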


In [17]:
best_model = model_cv.best_estimator_

predictions = best_model.predict(X_valid_transformed_df)

# See if it can get the winner right
correct_list = []
for i in range(len(predictions)):
    correct = False
    if (predictions[i][0] > predictions[i][1]) and (y_valid.iloc[i]['home_score'] > y_valid.iloc[i]['away_score']):
        correct = True
    elif (predictions[i][0] < predictions[i][1]) and (y_valid.iloc[i]['home_score'] < y_valid.iloc[i]['away_score']):
        correct = True
    elif (predictions[i][0] == predictions[i][1]) and (y_valid.iloc[i]['home_score'] == y_valid.iloc[i]['away_score']):
        correct = True

    correct_list.append(correct)

print(f"Correct winner predictions: {sum(correct_list)} out of {len(predictions)}, {sum(correct_list)/len(predictions)*100:.2f}%")

Correct winner predictions: 446 out of 748, 59.63%


The tuned model does marginally better (446 vs. 443 correct out of 748). So although the predicted scores themselves aren't great, the winner prediction is better than random, though still not very good.

### What if, instead of actual scores, we try to predict just who wins in the first place?

In [18]:
# Then change the target variable from the individual scores to 'home_win', 
# which is 1 if the home team won, 0 otherwise
y_win = (y["home_score"]>y["away_score"]).astype(int)
y_win.head()

4713    1
4714    1
4741    1
4748    0
4795    1
dtype: int32

In [22]:
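# Redo the split for the new target. The same random_state gives the same rows as before,
# so the already transformed feature frames can be reused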
X_train_full_noUse, X_valid_full_noUse, \
    y_win_train, y_win_valid = train_test_split(X, y_win, train_size=0.8, test_size=0.2, random_state=42)

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=1, 
                        early_stopping_rounds=5, random_state=42)
my_model.fit(X_train_transformed_df, y_win_train,
             eval_set=[(X_valid_transformed_df, y_win_valid)], 
             verbose=False)

predictions = my_model.predict(X_valid_transformed_df)
print(predictions[:5])

threshold = 0.5
binary_pred = (predictions >= threshold).astype(int)
print(binary_pred[:5], '\n', y_win_valid.head())

# If the difference between the predicted value and the real one is 0, the prediction is correct
difference = y_win_valid.values - binary_pred
corrects = difference == 0
correct_frac = sum(corrects)/len(corrects)
print("Fraction of instances where the model was correct:", round(correct_frac, 2))

[0.33431584 0.51182276 0.27979702 0.4551581  0.21224345]
[0 1 0 0 0] 
 21090    0
13426    0
10285    1
20565    1
28327    0
dtype: int32
Fraction of instances where the model was correct: 0.68


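We see here that the correct label is chosen 68% of the time, which is pretty decent. That is better than the winner accuracy obtained from the predicted scores, but keep in mind that this is a two-class problem (home win vs. draw or away win), so it is easier than the three-way winner prediction above and the two percentages are not directly comparable.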
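
To put an accuracy like this in context, it helps to compare it with the accuracy of always predicting the most frequent label. The sketch below is one way to compute that; the helper name `majority_class_accuracy` and the toy series are hypothetical, and in the notebook it could be called as `majority_class_accuracy(y_win_train, y_win_valid)`.

```python
import pandas as pd

def majority_class_accuracy(y_train: pd.Series, y_valid: pd.Series) -> float:
    """Accuracy of always predicting the most frequent training label."""
    majority_label = y_train.mode().iloc[0]
    return float((y_valid == majority_label).mean())

# Toy labels (hypothetical, for illustration only): 1 = home win, 0 = otherwise
toy_win_train = pd.Series([1, 0, 0, 1, 0, 0])
toy_win_valid = pd.Series([0, 1, 0, 0])

baseline_acc = majority_class_accuracy(toy_win_train, toy_win_valid)
print(baseline_acc)  # 0.75 for the toy data
```

Only the margin by which the model beats this baseline reflects what it has actually learned from the features.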