# Term Project: Predicting EURO 2024

### Goal: build a model that predicts the outcome of every EURO 2024 match, from the group stage through to the final. The ultimate question is who will win the tournament.

### Main features:  venue, tournament, location, home team, away team, score

#### Datasets: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017 and https://www.kaggle.com/datasets/cashncarry/fifaworldranking

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## EDA

In [None]:
results_with_confeds = pd.read_csv("results.csv")
results_with_confeds.replace({"Czech Republic": "Czechia"}, inplace=True)
shootouts = pd.read_csv("shootouts.csv")
rankings = pd.read_csv("rankings.csv")
results_with_confeds['date'] = pd.to_datetime(results_with_confeds['date'])
results_with_confeds.insert(5, "score_diff",results_with_confeds['home_score'] - results_with_confeds['away_score'])
shootouts['date'] = pd.to_datetime(shootouts['date'])
rankings['rank_date'] = pd.to_datetime(rankings['rank_date'])

### Home vs away (venue)

In [None]:
non_neutral = results_with_confeds[results_with_confeds['neutral']==False]

In [None]:
plt.figure(figsize=(7, 4))
sns.boxplot([non_neutral['home_score'], non_neutral['away_score']])
plt.show()

#### We see that historically, at non-neutral venues, the home team scores more goals. This does not necessarily mean the home team wins. Next, we'll see how many wins the home team gets compared to draws and losses.

In [None]:
def wins_draws_losses(df):
    return pd.Series([len(df[df['score_diff'] > 0]), len(df[df['score_diff'] == 0]), len(df[df['score_diff'] < 0])])


d = wins_draws_losses(non_neutral)

plt.figure(figsize=(7, 4))
fig = d.plot(kind='bar')
fig.set_xticklabels(["Wins", "Draws", "Losses"])
plt.xticks(rotation=0)
plt.title("Home team results in non-neutral venues")
plt.show()

#### We see that the home team at non-neutral venues is more likely to win the match than to draw or lose it. This is plausible, since the home crowd rallies behind its team and can give it a boost.
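Raw counts depend on how many matches are in the sample, so shares are easier to compare across subsets (for example, all non-neutral matches versus host-nation matches only). A minimal sketch, assuming a `score_diff` column as above; the helper name `result_shares` and the toy values are made up for illustration:

```python
import numpy as np
import pandas as pd

def result_shares(df):
    """Return the share of home wins, draws and home losses in df."""
    labels = {1: "Wins", 0: "Draws", -1: "Losses"}
    # np.sign maps each score difference to 1, 0 or -1
    outcome = np.sign(df["score_diff"]).map(labels)
    shares = outcome.value_counts(normalize=True)
    return shares.reindex(["Wins", "Draws", "Losses"], fill_value=0.0)

# Toy data, not taken from results.csv
toy = pd.DataFrame({"score_diff": [2, 0, -1, 1, 3, 0, 1, -2]})
shares = result_shares(toy)
print(shares)
```

The same helper could be applied to `non_neutral` and `host_nations` to compare the two on one scale.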

#### Now, we'll look more specifically into the EUROs. We'll explore how the host nation performs: although the matches are classed as being at a neutral venue, the competition still takes place in the host's own country.

In [None]:
host_nations = results_with_confeds[(results_with_confeds['home_team'] == results_with_confeds['country']) & (results_with_confeds['tournament'] == "UEFA Euro")]

res = wins_draws_losses(host_nations)
plt.figure(figsize=(7, 4))
fig = res.plot(kind='bar')
fig.set_xticklabels(["Wins", "Draws", "Losses"])
plt.xticks(rotation=0)
plt.title("Host nation performances in EUROs")
plt.show()

#### We see that, looking at single games, host nations seem to be favored in the EUROs. However, this doesn't necessarily mean that they will win the competition, since a single loss may mean getting knocked out completely. On the other hand, it is possible to recover from a group stage loss.

### Progression of teams

#### Here, we'll look at how far a given team got in previous EUROs. We can do that by simple counting. From 2016 onwards, there are 6 groups of 4 teams each, so there are $6\cdot \binom{4}{2}=36$ group stage matches, followed by $8$ round of 16 matches, $4$ quarter finals, $2$ semi finals, and $1$ final. In earlier years the format was a little different, and we adjust for this.
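The counting argument can be checked in a few lines. This is a small sketch rather than part of the pipeline; `stage_labels` is a hypothetical helper that builds one label per match for a given format:

```python
from math import comb

def stage_labels(n_groups, group_size, knockout):
    """Build per-match stage labels for a tournament format.

    knockout is a list of (stage name, number of matches) pairs in playing order.
    """
    # Every pair of teams in a group meets once: C(group_size, 2) matches per group
    labels = n_groups * comb(group_size, 2) * ["Group Stage"]
    for name, n_matches in knockout:
        labels += n_matches * [name]
    return labels

labels_2016 = stage_labels(
    6, 4,
    [("Round of 16", 8), ("Quarter Finals", 4), ("Semi Finals", 2), ("Final", 1)],
)
print(len(labels_2016))  # 36 + 8 + 4 + 2 + 1 = 51
```

Because the labels are assigned by position, the number of labels has to equal the number of rows for that year, and the rows have to be in chronological order.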

In [None]:
def get_euro(year):
    euro = results_with_confeds[(results_with_confeds['date'].dt.year == year) & (results_with_confeds['tournament']=="UEFA Euro")]
    if year >= 2016:
        stages = 36*["Group Stage"] + 8*["Round of 16"] + 4*["Quarter Finals"] + 2*["Semi Finals"] + 1*["Final"]
    elif year >= 1996:
        stages = 24*["Group Stage"] + 4*["Quarter Finals"] + 2*["Semi Finals"] + 1*["Final"]
    elif year > 1980:
        stages = 12*["Group Stage"] + 2*["Semi Finals"] + 1*["Final"]
    elif year == 1980:
        stages = 12*["Group Stage"] + 1*["Third Place Playoff"] + 1*["Final"]
    elif year == 1968:
        stages = 2*["Semi Finals"] + 1*["Third Place Playoff"] + 2*["Final"]
    else:
        stages = 2*["Semi Finals"] + 1*["Third Place Playoff"] + 1*["Final"]
    euro.insert(7, "stage", stages)
    return euro

def achievement(team, year):
    t = get_euro(year)
    return t.loc[(t['home_team'] == team) | (t['away_team']==team)].tail(1)['stage'].values[0]

def winner(year):
    if year == 1968:
        return 'Italy'
    t = get_euro(year)
    f = t[t['stage']=="Final"]
    match f['score_diff'].values[0]:
        case n if n>0: return f['home_team'].values[0] 
        case n if n<0: return f['away_team'].values[0] 
        case n if n==0: return shootouts[shootouts['date']==f['date'].values[0]]['winner'].values[0]

In [None]:
years = [1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2021]
winners = [winner(year) for year in years]
winners_new = list(set(winners))
counts = [winners.count(team) for team in winners_new]
li = sorted(list(zip(winners_new, counts)), key=lambda x: x[1], reverse=True)

labels = [e[0] for e in li]
values = [e[1] for e in li]

plt.figure(figsize=(8, 4))
plt.xticks(rotation=90)
plt.bar(labels, values)
plt.title("EURO Winners")
plt.show()

### Host Nation Progression

In [None]:
def host_nation(year):
    return results_with_confeds[(results_with_confeds['tournament'] == "UEFA Euro") & (results_with_confeds['date'].dt.year == year)].tail(1)['country'].values[0]

years_except_2021 = [1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]

progressions = [achievement(host_nation(year), year) for year in years_except_2021]

progressions = list(map(lambda x: x.replace('Third Place Playoff', 'Semi Finals'), progressions))

palette_color = sns.color_palette('bright') 
d = {i:progressions.count(i) for i in set(progressions)}

plt.pie(x=d.values(), labels=d.keys(),colors=palette_color, autopct='%.0f%%') 
plt.title("Host Nation Progression")
plt.show()

#### We see that historically, nations perform well when backed by a home crowd. 87% of host nations made it to the semi finals or further, although this is misleading: in the early years, teams progressed to the semi finals directly upon getting past the group stage.

## Building the Model

### Including the team rankings and confederations into the dataset

In [None]:
confeds = rankings.drop_duplicates(subset='country_full')[['country_full', 'confederation']]

home_df = confeds.rename(columns={'country_full': 'home_team', 'confederation': 'home_confederation'})
away_df = confeds.rename(columns={'country_full': 'away_team', 'confederation': 'away_confederation'})

results_with_confeds = results_with_confeds.merge(home_df[['home_team', 'home_confederation']], on='home_team', how='left')
results_with_confeds = results_with_confeds.merge(away_df[['away_team', 'away_confederation']], on='away_team', how='left')

In [None]:
rankings = rankings.sort_values(by=['country_full', 'rank_date'])

rankings_home = rankings.rename(columns={'country_full': 'home_team', 'rank': 'home_ranking', 'rank_date': 'home_ranking_date'})
results_with_confeds = pd.merge_asof(results_with_confeds.sort_values('date'),
                           rankings_home.sort_values('home_ranking_date'),
                           left_on='date',
                           right_on='home_ranking_date',
                           by='home_team',
                           direction='backward')

rankings_away = rankings.rename(columns={'country_full': 'away_team', 'rank': 'away_ranking', 'rank_date': 'away_ranking_date'})
results_with_confeds = pd.merge_asof(results_with_confeds.sort_values('date'),
                           rankings_away.sort_values('away_ranking_date'),
                           left_on='date',
                           right_on='away_ranking_date',
                           by='away_team',
                           direction='backward')
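To see what `pd.merge_asof` with `direction='backward'` does here, consider a toy example: each match is paired with the most recent ranking of that team published on or before the match date, and a team with no earlier ranking gets `NaN`. The dates and ranks below are invented for illustration only.

```python
import pandas as pd

# Toy matches and rankings (made-up values, not from the datasets)
matches = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-06-15"]),
    "home_team": ["Italy", "Spain", "Spain"],
})
ranks = pd.DataFrame({
    "home_ranking_date": pd.to_datetime(["2024-02-15", "2024-04-04", "2024-02-15"]),
    "home_team": ["Spain", "Spain", "Italy"],
    "home_ranking": [8, 7, 9],
})

merged = pd.merge_asof(
    matches.sort_values("date"),
    ranks.sort_values("home_ranking_date"),
    left_on="date",
    right_on="home_ranking_date",
    by="home_team",          # only match rankings of the same team
    direction="backward",    # latest ranking on or before the match date
)
print(merged)
```

In this toy frame the Italy match precedes Italy's first ranking, so its `home_ranking` is missing; rows like that are what the later `dropna()` calls remove.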

In [None]:
results_with_confeds.columns

results_with_rankings = results_with_confeds.drop(columns=['country_abrv_x', 'total_points_x', 'previous_points_x',
       'rank_change_x', 'confederation_x', 'home_ranking_date', 'country_abrv_y', 'total_points_y', 'previous_points_y',
       'rank_change_y', 'confederation_y', 'away_ranking_date', 'city', 'country'])

results_with_rankings

In [None]:
euro24 = results_with_rankings[(results_with_rankings['tournament']=="UEFA Euro") & (results_with_rankings['date'].dt.year==2024)].drop(
    columns=['home_score', 'away_score', 'date', 'score_diff', 'home_confederation', 'away_confederation', 'tournament'])

euro24.replace({False: 0, True: 1}, inplace=True)

### Training set 1: Predict $\texttt{score\_diff}$ (one target)

In [None]:
with_gd = results_with_rankings.dropna()
with_gd = with_gd[(with_gd['home_confederation'] == "UEFA") & (with_gd['away_confederation'] == "UEFA")]
with_gd.drop(columns=['home_team', 'away_team', 'date', 'home_confederation', 'away_confederation', 'tournament', 'home_score', 'away_score'], inplace=True)
with_gd.replace({False: 0, True: 1}, inplace=True)

with_gd

In [None]:
def pred_group_stages(model):
    euro24_testing = euro24.drop(columns=["home_team", "away_team"]).dropna()
    pred_score_diffs = model.predict(euro24_testing)
    rounded_preds = np.round(pred_score_diffs).astype(int)
    group_stage = euro24.dropna().reset_index(drop=True)
    groups = ['A', 'A', 'B', 'B', 'D', 'C', 'C', 'E', 'D', 'E', 'F', 'F', 'A', 'A', 'B', 'B', 'C', 'C',
              'D', 'D', 'E', 'E', 'F', 'F', 'A', 'A', 'B', 'B', 'D', 'C', 'D', 'C', 'E', 'E', 'F', 'F']
    group_stage.insert(2, "group", groups)
    group_stage['score_diff'] = rounded_preds.flatten()
    return group_stage

#### Model 1.0 (Baseline): Linear Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score

X = with_gd.drop('score_diff', axis=1)
y = with_gd['score_diff']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
from sklearn.linear_model import LinearRegression
model1_0 = LinearRegression()
model1_0.fit(X_train, y_train)
predictions = model1_0.predict(X_test)
rounded_predictions = np.round(predictions).astype(int)
mae = mean_absolute_error(y_test, rounded_predictions)
mse = mean_squared_error(y_test, rounded_predictions)
accuracy = np.mean(rounded_predictions.flatten() == y_test)

mae, mse, accuracy
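The accuracy above counts a prediction as correct only when the rounded goal difference matches exactly, which is a strict criterion. A hedged alternative, not used elsewhere in this notebook, is to score only the predicted outcome (home win, draw, away win). The helper names and the toy arrays are assumptions for illustration:

```python
import numpy as np

def exact_accuracy(y_true, y_pred):
    """Share of matches where the rounded goal difference is exactly right."""
    return np.mean(np.round(y_pred).astype(int) == np.asarray(y_true))

def outcome_accuracy(y_true, y_pred):
    """Share of matches where the predicted result (win/draw/loss) is right."""
    pred_sign = np.sign(np.round(y_pred).astype(int))
    return np.mean(pred_sign == np.sign(np.asarray(y_true)))

# Toy values: true goal differences and raw regression outputs
y_true_demo = np.array([2, 0, -1, 1])
y_pred_demo = np.array([0.6, 0.2, -2.4, -0.7])

exact = exact_accuracy(y_true_demo, y_pred_demo)
outcome = outcome_accuracy(y_true_demo, y_pred_demo)
print(exact, outcome)
```

For a knockout simulation, the outcome-level score is arguably the more relevant of the two, since only the sign of the difference decides who advances.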

In [None]:
# pred_group_stages(model1_0)

#### Model 1.1: Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
model1_1 = RandomForestRegressor()
model1_1.fit(X_train, y_train)
predictions = model1_1.predict(X_test)
rounded_predictions = np.round(predictions).astype(int)
mae = mean_absolute_error(y_test, rounded_predictions)
mse = mean_squared_error(y_test, rounded_predictions)
accuracy = np.mean(rounded_predictions.flatten() == y_test)

mae, mse, accuracy

#### Model 1.2: Neural Networks

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.optimizers import Adam

model1_2 = Sequential()

# First hidden layer with 128 neurons
model1_2.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))
model1_2.add(BatchNormalization())  
model1_2.add(Dropout(0.5))  

# Second hidden layer with 64 neurons
model1_2.add(Dense(64, activation='relu'))
model1_2.add(BatchNormalization())
model1_2.add(Dropout(0.5))

# Third hidden layer with 32 neurons
model1_2.add(Dense(32, activation='relu'))
model1_2.add(BatchNormalization())
model1_2.add(Dropout(0.5))

model1_2.add(Dense(1, activation='linear'))

model1_2.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')

model1_2.fit(X_train, y_train, epochs=100, batch_size=10, verbose=1)

predictions = model1_2.predict(X_test)

rounded_predictions = np.round(predictions).astype(int)

In [None]:
accuracy = np.mean(rounded_predictions.flatten() == y_test)
accuracy

In [None]:
# pred_group_stages(model1_2)

### Training set 3: Predicting both teams' scores.

In [None]:
set_3 = results_with_rankings.drop(columns=["home_team", "away_team","date", "score_diff", "tournament", "home_confederation", "away_confederation"]).dropna()

set_3.replace({False: 0, True: 1}, inplace=True)

set_3

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor


def pred_group_stages_multi(model):
    group_stage = euro24.dropna().drop(columns=["home_team", "away_team"]).reset_index(drop=True)
    preds = np.round(model.predict(group_stage)).astype(int)
    preds_df = pd.DataFrame(preds, columns=["home_score", "away_score"]).reset_index(drop=True)
    group_stage = pd.concat([euro24.dropna().reset_index(drop=True), preds_df], axis=1)
    groups = ['A', 'A', 'B', 'B', 'D', 'C', 'C', 'E', 'D', 'E', 'F', 'F', 'A', 'A', 'B', 'B', 'C', 'C',
              'D', 'D', 'E', 'E', 'F', 'F', 'A', 'A', 'B', 'B', 'D', 'C', 'D', 'C', 'E', 'E', 'F', 'F']
    group_stage.insert(2, "group", groups)
    return group_stage.drop(columns=["neutral", "home_ranking", "away_ranking"])

#### Model 3.1: MultiOutputRegressor with Linear Regression

In [None]:
from sklearn.multioutput import MultiOutputRegressor

np.random.seed(42)

X = set_3.drop(columns=["home_score", "away_score"])
y = set_3[["home_score", "away_score"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

model3_1 = MultiOutputRegressor(LinearRegression())
model3_1.fit(X_train, y_train)
y_pred_lr = model3_1.predict(X_test)
mse = mean_squared_error(y_test, y_pred_lr)

pred_group_stages_multi(model3_1)

#### Model 3.2: MultiOutputRegressor with Random Forest

In [None]:
np.random.seed(42)

model3_2 = MultiOutputRegressor(RandomForestRegressor(n_estimators=200))
model3_2.fit(X_train, y_train)
y_pred = model3_2.predict(X_test)
# mse = mean_squared_error(y_test, y_pred)

pred_group_stages_multi(model3_2)

#### Model 3.3: Neural Networks

In [None]:
np.random.seed(42)

model3_3 = Sequential([
    Dense(64, input_dim=3, activation='relu'),  # First hidden layer; takes the 3 input features
    Dropout(0.2),  # Dropout layer for regularization
    Dense(128, activation='relu'),  # Second hidden layer
    Dropout(0.2),  # Dropout layer for regularization
    Dense(64, activation='relu'),  # Third hidden layer
    Dense(32, activation='relu'),  # Fourth hidden layer
    Dense(16, activation='relu'),  # Fifth hidden layer
    Dense(2)  # Output layer with 2 neurons (for 2 continuous outcomes)
])

model3_3.compile(optimizer='adam', loss='mean_squared_error') 
model3_3.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)  # Validation split for monitoring

In [None]:
pred_group_stages_multi(model3_3)

## Group Stages

#### After predicting each match of the group stage, we now calculate the group tables. The top two teams of each group progress to the round of 16, along with the four best third-placed teams. The group table for each group A to F will be represented as a $\texttt{dict\{\text{Team}: List[\text{pts}, \text{scored}, \text{conceded}, \text{difference}]\}}$.
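As a small illustration of that representation, the sketch below ranks a made-up group by points, then goal difference, then goals scored. This is a simplification: the official regulations apply head-to-head criteria first, which are not modeled here. The team names and numbers are invented.

```python
# Toy table: team -> [pts, scored, conceded, difference]
toy_table = {
    "Team A": [4, 3, 3, 0],
    "Team B": [4, 5, 5, 0],
    "Team C": [7, 4, 1, 3],
    "Team D": [1, 2, 5, -3],
}

def rank_table(table):
    """Sort by points, then goal difference, then goals scored (all descending)."""
    key = lambda item: (item[1][0], item[1][3], item[1][1])
    return dict(sorted(table.items(), key=key, reverse=True))

ranked = rank_table(toy_table)
print(list(ranked))
```

Team A and Team B are level on points and goal difference, so goals scored separates them.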

In [None]:
def get_group_table(group, model):
    predictions = pred_group_stages_multi(model)
    groups = [dict(), dict(), dict(), dict(), dict(), dict()]
    h2h = dict()
    group_letters = ['A', 'B', 'C', 'D', 'E', 'F']
    for index, row in predictions.iterrows():
        ind = group_letters.index(row['group'])
        dic = groups[ind]
        home = row['home_team']
        away = row['away_team']
        home_score = row['home_score']
        away_score = row['away_score']
        if home not in dic.keys():
            dic[home] = [0,0,0,0]
        if away not in dic.keys():
            dic[away] = [0,0,0,0]
        if home not in h2h.keys():
            h2h[home] = dict()
        if away not in h2h.keys():
            h2h[away] = dict()
        if home_score > away_score:
            diff = home_score - away_score
            dic[home][0] += 3
            dic[home][1] += home_score
            dic[home][2] += away_score
            dic[home][3] += diff
            dic[away][1] += away_score
            dic[away][2] += home_score
            dic[away][3] -= diff
            h2h[home][away] = 1
            h2h[away][home] = 0
        elif home_score == away_score:
            dic[home][0] += 1
            dic[home][1] += home_score
            dic[home][2] += away_score
            dic[away][0] += 1
            dic[away][1] += away_score
            dic[away][2] += home_score
            h2h[home][away] = 0
            h2h[away][home] = 0
        else:
            diff = away_score - home_score
            dic[away][0] += 3
            dic[away][1] += away_score
            dic[away][2] += home_score
            dic[away][3] += diff
            dic[home][1] += home_score
            dic[home][2] += away_score
            dic[home][3] -= diff
            h2h[home][away] = 0
            h2h[away][home] = 1
        groups[ind] = dic
    ans = groups[group_letters.index(group)]
    # Rank by points, then goal difference, then goals scored (index 1; index 2 is goals conceded)
    return dict(sorted(ans.items(), key=lambda x: (x[1][0], x[1][3], x[1][1]), reverse=True))

def get_h2h(model):
    predictions = pred_group_stages_multi(model)
    h2h = dict()
    for index, row in predictions.iterrows():
        home = row['home_team']
        away = row['away_team']
        home_score = row['home_score']
        away_score = row['away_score']
        if home not in h2h.keys():
            h2h[home] = dict()
        if away not in h2h.keys():
            h2h[away] = dict()
        if home_score > away_score:
            h2h[home][away] = 1
            h2h[away][home] = 0
        elif home_score == away_score:
            h2h[home][away] = 0
            h2h[away][home] = 0
        else:
            h2h[home][away] = 0
            h2h[away][home] = 1
    return h2h

get_group_table('E', model3_2)

### Determining the bracket

![EURO 2024 knockout bracket](euro-2024-bracket.jpeg)

## Round of 16

In [None]:
def get_recent_ranking(country):
    return rankings[rankings['country_full']==country].tail(1)['rank'].values[0]

def group_winner(group, model):
    group_tbl = list(get_group_table(group, model))
    return group_tbl[0]

def runner_up(group, model):
    group_tbl = list(get_group_table(group, model))
    return group_tbl[1]

def third_place_teams(model):
    third_places = dict()
    for group in ["A", "B", "C", "D", "E", "F"]:
        (key, value) = list(get_group_table(group, model).items())[2]
        third_places[key] = (value, group)
    # Rank third-placed teams by points, then goal difference, then goals scored (index 1)
    return dict(sorted(third_places.items(), key=lambda x: (x[1][0][0], x[1][0][3], x[1][0][1]), reverse=True))

def third_place_slot(third_place, possible_groups):
    for key in third_place:
        if third_place[key][1] in possible_groups:
            return key, third_place[key][1]
        
def get_neutral(row):
    if row['home_team'] == "Germany":
        return 0
    else:
        return 1

def forced_home_adv(x):
    return (x != "Germany", x)

def home_rankings(row):
    return get_recent_ranking(row['home_team'])

def away_rankings(row): 
    return get_recent_ranking(row['away_team'])

def get_r16(model):
    third_places = third_place_teams(model)
    r16 = pd.DataFrame(columns=["home_team", "away_team"])
    third_ADEF = third_place_slot(third_places, ["A", "D", "E", "F"])[0]
    del third_places[third_ADEF]
    third_ABC = third_place_slot(third_places, ["A", "B", "C"])[0]
    del third_places[third_ABC]
    third_ABCD = third_place_slot(third_places, ["A", "B", "C", "D"])[0]
    del third_places[third_ABCD]
    third_DEF = third_place_slot(third_places, ["D", "E", "F"])[0]
    del third_places[third_DEF]
    r16.loc[0] = sorted([group_winner("B", model), third_ADEF], key=forced_home_adv)
    r16.loc[1] = sorted([group_winner("A", model), runner_up("C", model)], key=forced_home_adv)
    r16.loc[2] = [group_winner("F", model), third_ABC]
    r16.loc[3] = [runner_up("D", model), runner_up("E", model)]
    r16.loc[4] = [group_winner("E", model), third_ABCD]
    r16.loc[5] = [group_winner("D", model), runner_up("F", model)]
    r16.loc[6] = [group_winner("C", model), third_DEF]
    r16.loc[7] = [runner_up("A", model), runner_up("B", model)]
    r16['neutral'] = r16.apply(get_neutral, axis=1)
    r16['home_ranking'] = r16.apply(home_rankings, axis=1)
    r16['away_ranking'] = r16.apply(away_rankings, axis=1)
    return r16

get_r16(model3_2)
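Only the four best third-placed teams advance, and each must go into a slot that allows its group. Filling the slots one at a time with the first match can leave a later slot without a candidate: with qualifying groups D, E, F, A in that ranking order, the first two slots take D and A, and nothing remains for the A/B/C/D slot. The sketch below searches for any assignment that satisfies the slot constraints used above. It is not the official UEFA allocation table, which fixes one specific pairing per combination; `assign_third_places` is a hypothetical helper.

```python
from itertools import permutations

# Allowed groups per third-place slot, in the order used in the bracket above:
# vs 1B, vs 1F, vs 1E, vs 1C
SLOTS = [("A", "D", "E", "F"), ("A", "B", "C"), ("A", "B", "C", "D"), ("D", "E", "F")]

def assign_third_places(best_four_groups, slots=SLOTS):
    """Return one group letter per slot so that every slot gets an allowed group.

    Returns None if no valid assignment exists.
    """
    for perm in permutations(best_four_groups):
        if all(group in allowed for group, allowed in zip(perm, slots)):
            return list(perm)
    return None

# Groups of the four best third-placed teams, best first (illustrative input)
assignment = assign_third_places(["D", "E", "F", "A"])
print(assignment)
```

With only 24 permutations to check, the brute-force search is cheap; replacing it with the official lookup table would be the faithful option.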

### Predicting penalty shootouts as a tiebreaker

In [None]:
rankings = rankings.sort_values(by=['country_full', 'rank_date'])

rankings_home = rankings.rename(columns={'country_full': 'home_team', 'rank': 'home_ranking', 'rank_date': 'home_ranking_date'})

shootouts_with_rankings = pd.merge_asof(shootouts.sort_values('date'),
                           rankings_home.sort_values('home_ranking_date'),
                           left_on='date',
                           right_on='home_ranking_date',
                           by='home_team',
                           direction='backward')

rankings_away = rankings.rename(columns={'country_full': 'away_team', 'rank': 'away_ranking', 'rank_date': 'away_ranking_date'})

shootouts_with_rankings = pd.merge_asof(shootouts_with_rankings.sort_values('date'),
                           rankings_away.sort_values('away_ranking_date'),
                           left_on='date',
                           right_on='away_ranking_date',
                           by='away_team',
                           direction='backward')

In [None]:
shootouts_with_rankings.drop(columns=['country_abrv_x', 'total_points_x', 'previous_points_x',
       'rank_change_x', 'confederation_x', 'home_ranking_date', 'country_abrv_y', 'total_points_y', 'previous_points_y',
       'rank_change_y', 'confederation_y', 'away_ranking_date'], inplace=True)

shootouts_with_rankings = shootouts_with_rankings.dropna()

In [None]:
shootouts_with_rankings.loc[shootouts_with_rankings['winner'] == shootouts_with_rankings['home_team'], 'winner_01'] = 1
shootouts_with_rankings.loc[shootouts_with_rankings['winner'] == shootouts_with_rankings['away_team'], 'winner_01'] = 0

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

shootout_training = shootouts_with_rankings.iloc[:, -3:]
X = shootout_training.drop('winner_01', axis=1)
y = shootout_training['winner_01']

n_trials = 500
best_accuracy = 0
best_so_model = None

for i in range(n_trials):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    if acc > best_accuracy:
        best_accuracy = acc
        best_so_model = model
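Keeping the model with the highest test accuracy out of 500 random splits tends to overstate how well it will do on new shootouts, because the best split is partly luck. A hedged alternative for estimating accuracy is k-fold cross-validation on a single model. The data below is synthetic (random ranks plus noise), so only the procedure carries over, not the numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (home_ranking, away_ranking) -> winner_01
rng = np.random.default_rng(0)
home_rank = rng.integers(1, 101, size=300)
away_rank = rng.integers(1, 101, size=300)
noise = rng.normal(0, 40, size=300)
# A lower rank number means a stronger team; noise keeps the outcome uncertain
home_wins = (home_rank - away_rank + noise < 0).astype(int)
X_demo = np.column_stack([home_rank, away_rank])

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, home_wins, cv=5, scoring="accuracy"
)
print(scores.mean(), scores.std())
```

The final shootout model could then be fitted once on all available rows instead of on a 70% subset.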

## Quarters

In [None]:
def get_quarters(model):
    r16 = get_r16(model)
    r16_train = r16.drop(columns=['home_team', 'away_team'])
    
    r16_preds = np.round(model.predict(r16_train)).astype(int)
    r16_preds_df = pd.DataFrame(r16_preds, columns=["home_score", "away_score"]).reset_index(drop=True)
    
    r16_results = pd.concat([r16.reset_index(drop=True), r16_preds_df], axis=1)

    # Use the shootout classifier (trained on the two ranking columns) as the tiebreaker
    shootout_preds = best_so_model.predict(r16.drop(columns=['home_team', 'away_team', 'neutral']))

    r16_results['winner'] = shootout_preds
    
    r16_results.loc[r16_results['home_score'] > r16_results['away_score'], 'winner'] = r16_results['home_team']
    r16_results.loc[r16_results['home_score'] < r16_results['away_score'], 'winner'] = r16_results['away_team']
    r16_results.loc[r16_results['winner'] == 1, 'winner'] = r16_results['home_team']
    r16_results.loc[r16_results['winner'] == 0, 'winner'] = r16_results['away_team']

    winners = list(r16_results['winner'])
    quarters = pd.DataFrame(columns=["home_team", "away_team"])
    
    quarters.loc[0] = sorted([winners[0], winners[1]], key=forced_home_adv)
    quarters.loc[1] = sorted([winners[2], winners[3]], key=forced_home_adv)
    quarters.loc[2] = sorted([winners[4], winners[5]], key=forced_home_adv)
    quarters.loc[3] = sorted([winners[6], winners[7]], key=forced_home_adv)
    
    quarters['neutral'] = quarters.apply(get_neutral, axis=1)
    quarters['home_ranking'] = quarters.apply(home_rankings, axis=1)
    quarters['away_ranking'] = quarters.apply(away_rankings, axis=1)
    
    return quarters

get_quarters(model3_2)
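The same winner logic recurs in every knockout round: the side with more predicted goals advances, and a predicted draw is settled by the shootout prediction. As a hedged sketch (not the notebook's own code), it could live in one helper. `decide_winners` and the fixtures below are hypothetical, and the helper assumes the shootout model outputs 1 when the home side wins:

```python
import numpy as np
import pandas as pd

def decide_winners(results, shootout_home_wins):
    """Pick the advancing team per row from predicted scores and shootout picks."""
    home_ahead = (results["home_score"] > results["away_score"]).to_numpy()
    level = (results["home_score"] == results["away_score"]).to_numpy()
    shootout_home = np.asarray(shootout_home_wins).astype(bool)
    # Home advances on a predicted win, or on a draw plus a predicted shootout win
    pick_home = home_ahead | (level & shootout_home)
    winners = np.where(pick_home, results["home_team"], results["away_team"])
    return pd.Series(winners, index=results.index, name="winner")

# Made-up fixtures and scores for illustration
toy_round = pd.DataFrame({
    "home_team": ["Spain", "France", "Germany", "England"],
    "away_team": ["Italy", "Portugal", "Denmark", "Netherlands"],
    "home_score": [2, 1, 0, 0],
    "away_score": [1, 1, 0, 1],
})
toy_winners = decide_winners(toy_round, shootout_home_wins=[0, 0, 1, 1])
print(toy_winners.tolist())
```

Building the winner column in one step also avoids writing team names into a column that first held numeric shootout predictions.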

## Semis

In [None]:
def get_semis(model):
    quarters = get_quarters(model)
    
    quarters_train = quarters.drop(columns=['home_team', 'away_team'])
    
    preds = np.round(model.predict(quarters_train)).astype(int)
    preds_df = pd.DataFrame(preds, columns=["home_score", "away_score"]).reset_index(drop=True)
    
    quarters_results = pd.concat([quarters.reset_index(drop=True), preds_df], axis=1)

    # Use the shootout classifier (trained on the two ranking columns) as the tiebreaker
    shootout_preds = best_so_model.predict(quarters.drop(columns=['home_team', 'away_team', 'neutral']))

    quarters_results['winner'] = shootout_preds
    
    quarters_results.loc[quarters_results['home_score'] > quarters_results['away_score'], 'winner'] = quarters_results['home_team']
    quarters_results.loc[quarters_results['home_score'] < quarters_results['away_score'], 'winner'] = quarters_results['away_team']
    quarters_results.loc[quarters_results['winner'] == 1, 'winner'] = quarters_results['home_team']
    quarters_results.loc[quarters_results['winner'] == 0, 'winner'] = quarters_results['away_team']
    
    winners = list(quarters_results['winner'])
    semis = pd.DataFrame(columns=["home_team", "away_team"])
    semis.loc[0] = sorted([winners[0], winners[1]], key=forced_home_adv)
    semis.loc[1] = sorted([winners[2], winners[3]], key=forced_home_adv)
    
    semis['neutral'] = semis.apply(get_neutral, axis=1)
    semis['home_ranking'] = semis.apply(home_rankings, axis=1)
    semis['away_ranking'] = semis.apply(away_rankings, axis=1)
    
    return semis

get_semis(model3_2)

## Final

In [None]:
def get_final(model):
    semis = get_semis(model)
    
    semis_train = semis.drop(columns=['home_team', 'away_team'])
    
    preds = np.round(model.predict(semis_train)).astype(int)
    preds_df = pd.DataFrame(preds, columns=["home_score", "away_score"]).reset_index(drop=True)
    
    semis_results = pd.concat([semis.reset_index(drop=True), preds_df], axis=1)

    # Use the shootout classifier (trained on the two ranking columns) as the tiebreaker
    shootout_preds = best_so_model.predict(semis.drop(columns=['home_team', 'away_team', 'neutral']))

    semis_results['winner'] = shootout_preds
    
    semis_results.loc[semis_results['home_score'] > semis_results['away_score'], 'winner'] = semis_results['home_team']
    semis_results.loc[semis_results['home_score'] < semis_results['away_score'], 'winner'] = semis_results['away_team']
    semis_results.loc[semis_results['winner'] == 1, 'winner'] = semis_results['home_team']
    semis_results.loc[semis_results['winner'] == 0, 'winner'] = semis_results['away_team']
    
    winners = list(semis_results['winner'])
    final = pd.DataFrame(columns=["home_team", "away_team"])
    final.loc[0] = sorted([winners[0], winners[1]], key=forced_home_adv)
    final['neutral'] = final.apply(get_neutral, axis=1)
    final['home_ranking'] = final.apply(home_rankings, axis=1)
    final['away_ranking'] = final.apply(away_rankings, axis=1)
    
    return final

get_final(model3_2)

## Winner

In [None]:
def get_winner(model):
    final = get_final(model)
    
    final_train = final.drop(columns=['home_team', 'away_team'])
    
    preds = np.round(model.predict(final_train)).astype(int)
    preds_df = pd.DataFrame(preds, columns=["home_score", "away_score"]).reset_index(drop=True)
    
    final_result = pd.concat([final.reset_index(drop=True), preds_df], axis=1)

    # Use the shootout classifier (trained on the two ranking columns) as the tiebreaker
    shootout_preds = best_so_model.predict(final.drop(columns=['home_team', 'away_team', 'neutral']))

    final_result['winner'] = shootout_preds
    
    final_result.loc[final_result['home_score'] > final_result['away_score'], 'winner'] = final_result['home_team']
    final_result.loc[final_result['home_score'] < final_result['away_score'], 'winner'] = final_result['away_team']
    final_result.loc[final_result['winner'] == 1, 'winner'] = final_result['home_team']
    final_result.loc[final_result['winner'] == 0, 'winner'] = final_result['away_team']
    
    return final_result['winner'][0]

get_winner(model3_2)

## Testing Multiple Models

In [None]:
# X = set_3.drop(columns=["home_score", "away_score"])
# y = set_3[["home_score", "away_score"]]

# winners = []
# n_trials = 100

# for i in range(n_trials):
#     np.random.seed(i+100)
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
#     model3_2 = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
#     model3_2.fit(X_train, y_train)
#     winners.append(get_winner(model3_2))

In [None]:
# x = sorted(list(set(winners)), key = lambda c: winners.count(c), reverse=True)
# y = [winners.count(c) for c in x]


# plt.figure(figsize=(10, 5))
# plt.bar(x, y)
# plt.xlabel('Winner')
# plt.ylabel('Frequency')
# plt.show()

In [None]:
# X = set_3.drop(columns=["home_score", "away_score"])
# y = set_3[["home_score", "away_score"]]

# winners = []
# n_trials = 100

# for i in range(n_trials):
#     np.random.seed(i+100)
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
#     model3_1 = MultiOutputRegressor(LinearRegression())
#     model3_1.fit(X_train, y_train)
#     winners.append(get_winner(model3_1))

In [None]:
# x = sorted(list(set(winners)), key = lambda c: winners.count(c), reverse=True)
# y = [winners.count(c) for c in x]

# plt.figure(figsize=(5,5))
# plt.bar(x, y)
# plt.xlabel('Winner')
# plt.ylabel('Frequency')
# plt.show()

# winners.count("Spain")

In [None]:
# X = set_3.drop(columns=["home_score", "away_score"])
# y = set_3[["home_score", "away_score"]]

# winners = []
# n_trials = 100

# for i in range(n_trials):
#     np.random.seed(i+100)
#     model3_3 = Sequential([
#     Dense(64, input_dim=3, activation='relu'), 
#     Dropout(0.2),  # Dropout layer for regularization
#     Dense(128, activation='relu'),  # First hidden layer
#     Dropout(0.2),  # Dropout layer for regularization
#     Dense(64, activation='relu'),  # Second hidden layer
#     Dense(32, activation='relu'),  # Third hidden layer
#     Dense(16, activation='relu'),  # Fourth hidden layer
#     Dense(2)  # Output layer with 2 neurons (for 2 continuous outcomes)
# ])
#     model3_3.compile(optimizer='adam', loss='mean_squared_error') 
#     model3_3.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)  # Validation split for monitoring
#     winners.append(get_winner(model3_3))


In [None]:
# x = sorted(list(set(winners)), key = lambda c: winners.count(c), reverse=True)
# y = [winners.count(c) for c in x]

# plt.figure(figsize=(5,5))
# plt.bar(x, y)
# plt.xlabel('Winner')
# plt.ylabel('Frequency')
# plt.show()

# winners.count("Netherlands")

## Tuning model

In [None]:
from sklearn.model_selection import GridSearchCV

best_model = None
best_mse = 1000000
n_trials = 100

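X = set_3.drop(columns=["home_score", "away_score"])
y = set_3[["home_score", "away_score"]]

# Re-split here: X_train / y_train were last overwritten by the shootout data above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

# for i in range(n_trials):
#     np.random.seed(i+100)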
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
#     model3_2 = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
#     model3_2.fit(X_train, y_train)
#     y_pred = model3_2.predict(X_test)
#     mse = mean_squared_error(y_test, y_pred)
#     if mse < best_mse:
#         best_mse = mse

# best_mse

# Define the base model
base_model = RandomForestRegressor(n_estimators=200)

# Wrap it inside MultiOutputRegressor
model = MultiOutputRegressor(base_model)

# Define the parameter grid
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__max_depth': [None, 10, 20],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for regressors it was equivalent to 1.0 (all features)
    'estimator__max_features': [1.0, 'sqrt', 'log2']
}

# Set up the grid search with cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", -grid_search.best_score_)
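Before launching a search of this shape, it helps to know how much work it implies. The sketch below counts the candidates for a grid with five parameters of three values each; `demo_grid` is a stand-in defined here so the snippet runs on its own:

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "estimator__n_estimators": [100, 200, 300],
    "estimator__max_depth": [None, 10, 20],
    "estimator__min_samples_split": [2, 5, 10],
    "estimator__min_samples_leaf": [1, 2, 4],
    "estimator__max_features": [1.0, "sqrt", "log2"],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3**5 parameter combinations
n_fits = n_candidates * 5                     # 5-fold CV, excluding the final refit
n_forests = n_fits * 2                        # one forest per target (home, away)
print(n_candidates, n_fits, n_forests)
```

If that is too slow, a randomized search over the same ranges or a smaller grid would be reasonable alternatives.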

In [None]:
np.random.seed(42)

base_model = RandomForestRegressor(n_estimators=500)
best_model = MultiOutputRegressor(base_model)


X = set_3.drop(columns=["home_score", "away_score"])
y = set_3[["home_score", "away_score"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

best_model.fit(X_train, y_train)