<h1 align="center">Predicting the UEFA European Football Championship</h1>

<h1 style="font-size:12px" align="center">Tutorial by Aabid Roshan, Sid Joshi, Pranav Sivaraman</h1>

---

<h1 align="center">
    Introduction
</h1>

<h1 style="font-size:18px">
    Tournament info
</h1>

<p style="font-size:16px">
    The UEFA European Football Championship, also known as the Euros, is a tournament held by UEFA every four years and contested by the national teams of Europe. Other continents have similar competitions, such as the CONCACAF Gold Cup, which is held for North American international teams. Although there are many continental competitions, the Euros are arguably the most popular, in part because many of the world's best players play for European teams. Although this competition is called the 2020 Euros, it is being held in 2021 because of the COVID-19 pandemic. The qualifiers for the tournament have already been played, so the 24 teams taking part have already been decided. Later on you will see a list of these teams. The reigning champions are Portugal.
</p>

<p style="font-size:16px">
    The tournament starts with a group stage made up of 6 groups of 4 teams each. During the group stage each team plays every other team in its group once. Each team gets 3 points for a win, 1 for a draw, and 0 for a loss. At the end of the group stage, the top 2 teams in each group, together with the four best third-placed teams, move on to the next round and play in a standard knockout-style tournament.
</p>
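
<p style="font-size:16px">
    The points rule described above (3 for a win, 1 for a draw, 0 for a loss) can be sketched in a few lines of Python. This is only an illustration: the function name <code>group_points</code>, the team names, and the scores are made up and do not come from the dataset.
</p>

```python
from collections import defaultdict


def group_points(results):
    """Tally points from (home, away, home_goals, away_goals) tuples."""
    points = defaultdict(int)
    for home, away, home_goals, away_goals in results:
        if home_goals > away_goals:
            points[home] += 3
            points[away] += 0  # makes sure the losing team still appears in the table
        elif home_goals < away_goals:
            points[away] += 3
            points[home] += 0
        else:
            points[home] += 1
            points[away] += 1
    return dict(points)


# Made-up results for illustration only (not real matches)
example_results = [
    ("Team A", "Team B", 2, 0),
    ("Team C", "Team D", 1, 1),
    ("Team A", "Team C", 0, 0),
]

# Sort teams from most to fewest points
table = sorted(group_points(example_results).items(), key=lambda kv: kv[1], reverse=True)
print(table)
```

<p style="font-size:16px">
    A real group table would also need tie-breakers such as goal difference, which this sketch leaves out.
</p>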

<h1 style="font-size:18px">
    Why do we want to predict the winner?
</h1>

<p style="font-size:16px">
    
</p>

<h1 align="center">
    Data Scraping
</h1>
<p style="font-size:16px">
    First off, here are the libraries we used throughout the tutorial.
</p>

In [21]:
import pandas as pd
import matplotlib.pyplot as plt

<h1 style="font-size:20px">
    Loading the Data
</h1>
<p style="font-size:16px">
    Fortunately for us, all international match data is stored in a CSV file that can be downloaded from <a href="https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017">this website</a>. The dataset was compiled by Mart Jürisoo, so extra thanks to him.
</p>
<p style="font-size:16px">
    To extract the data from the file we use the pandas
        <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">read_csv</a> function, which loads the data into a dataframe. After loading the data we convert the date column into datetime objects, which makes filtering by year more convenient later on. This can be done with the pandas <a href="https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html">to_datetime</a> function.
</p>

In [22]:
df = pd.read_csv("match_results.csv")
df['date'] = pd.to_datetime(df['date'])
df

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False
...,...,...,...,...,...,...,...,...,...
42079,2021-03-31,Andorra,Hungary,1,4,FIFA World Cup qualification,Andorra la Vella,Andorra,False
42080,2021-03-31,San Marino,Albania,0,2,FIFA World Cup qualification,Serravalle,San Marino,False
42081,2021-03-31,Armenia,Romania,3,2,FIFA World Cup qualification,Yerevan,Armenia,False
42082,2021-03-31,Germany,North Macedonia,1,2,FIFA World Cup qualification,Duisburg,Germany,False


<p style="font-size:16px">
    This is the data we will be using going forward. Below is a list of what each column represents.
    <ol>
        <li><b>date:</b> The date of the match</li>
        <li><b>home_team:</b> The home team</li>
        <li><b>away_team:</b> The away team</li>
        <li><b>home_score:</b> The number of goals scored by the home team</li>
        <li><b>away_score:</b> The number of goals scored by the away team</li>
        <li><b>tournament:</b> The name of the competition the match was played in. This can range anywhere from a simple friendly to a World Cup final. There are many different competitions around the world</li>
        <li><b>city:</b> The city the game was played in</li>
        <li><b>country:</b> The country the game was played in</li>
        <li><b>neutral:</b> Whether or not the game was played at a neutral stadium. At a neutral stadium neither team is truly home or away, because both teams must travel to a stadium that is not their own</li>
    </ol>
</p>
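
<p style="font-size:16px">
    To make the column descriptions above more concrete, here is a small sketch that rebuilds the first two rows of the table by hand and uses the columns together. The <code>goal_diff</code> column and the variable names are assumptions added for illustration and are not part of the dataset.
</p>

```python
import pandas as pd

# First two rows of the table above, typed in by hand with the same column names
sample = pd.DataFrame({
    "date": pd.to_datetime(["1872-11-30", "1873-03-08"]),
    "home_team": ["Scotland", "England"],
    "away_team": ["England", "Scotland"],
    "home_score": [0, 4],
    "away_score": [0, 2],
    "tournament": ["Friendly", "Friendly"],
    "city": ["Glasgow", "London"],
    "country": ["Scotland", "England"],
    "neutral": [False, False],
})

# Goal difference from the home team's point of view (hypothetical helper column)
sample["goal_diff"] = sample["home_score"] - sample["away_score"]

# Friendlies that were played at a real home ground rather than a neutral stadium
home_friendlies = sample[(sample["tournament"] == "Friendly") & ~sample["neutral"]]
print(home_friendlies[["date", "home_team", "away_team", "goal_diff"]])
```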

---

<h1 align="center">
    Data Management/Representation
</h1>

<h1 style="font-size:18px">
    Creating the Ranking System
</h1>

<p style="font-size:16px">
    This is the list of teams we are going to look at. It consists of the teams that qualified for the upcoming 2020 UEFA European Football Championship. The list can be changed if need be.
</p>

In [23]:
teams = ['Turkey', 'Italy', 'Wales', 'Switzerland',
         'Denmark', 'Finland', 'Belgium', 'Russia',
         'Netherlands', 'Ukraine', 'Austria', 'North Macedonia',
         'England', 'Croatia', 'Scotland', 'Czech Republic',
         'Spain', 'Sweden', 'Poland', 'Slovakia',
         'Hungary', 'Portugal', 'France', 'Germany']

<p style="font-size:16px">
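    The functions below calculate an Elo rating for each team. This is a simplified version of the Elo systems used to rank international football teams: it uses a fixed K-factor and does not account for goal difference or home advantage.
    For more information on the math behind the calculations, see <a href="https://en.wikipedia.org/wiki/World_Football_Elo_Ratings">this link</a>.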
</p>
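
<p style="font-size:16px">
    Before looking at the full implementation, it may help to see a single Elo update worked through by hand. The sketch below uses the same default scale factor (400) and K-factor (22.2) as the functions in this tutorial. The helper names <code>expected_score</code> and <code>updated_rating</code> and the two starting ratings are made up for illustration.
</p>

```python
def expected_score(rating_a, rating_b, scale_factor=400):
    """Expected score of team A against team B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / scale_factor))


def updated_rating(rating, actual, expected, k_factor=22.2):
    """Move a rating toward the actual result (1 = win, 0.5 = draw, 0 = loss)."""
    return rating + k_factor * (actual - expected)


# Made-up ratings: the home team is the favourite by 100 points
home, away = 1300.0, 1200.0

p_home = expected_score(home, away)  # roughly 0.64
p_away = expected_score(away, home)  # roughly 0.36

# Suppose the underdog (away team) wins
new_home = updated_rating(home, 0, p_home)
new_away = updated_rating(away, 1, p_away)

print(round(p_home, 2), round(new_home, 1), round(new_away, 1))
```

<p style="font-size:16px">
    Because the two expected scores add up to 1, the points the favourite loses are exactly the points the underdog gains, so the total rating in the system stays constant.
</p>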

In [32]:
def initialize_elo_system(teams, initial_rating):
    ratings = {}
    for team in teams:
        ratings[team] = [initial_rating]
        
    return ratings

In [33]:
def calculate_probabilities(home_rating, away_rating, scale_factor=400):
    # Expected score of each side under the Elo model; the two values sum to 1
    p_home = 1 / (1 + 10**((away_rating - home_rating) / scale_factor))
    p_away = 1 / (1 + 10**((home_rating - away_rating) / scale_factor))
    
    return p_home, p_away

In [34]:
def calculate_rankings(matches, ratings, k_factor=22.2):
    num_matches = len(matches)
    home_teams, away_teams = matches['home_team'].values, matches['away_team'].values
    home_scores, away_scores = matches['home_score'].values, matches['away_score'].values
    
    for i in range(num_matches):
        home_team, away_team = home_teams[i], away_teams[i]
        home_score, away_score = home_scores[i], away_scores[i]
        
        p_home, p_away = calculate_probabilities(ratings[home_team][-1], ratings[away_team][-1])
        
        if home_score > away_score:
            match_result_home = 1
            match_result_away = 0
        elif home_score < away_score:
            match_result_home = 0
            match_result_away = 1
        elif home_score == away_score:
            match_result_home = 0.5
            match_result_away = 0.5
            
        new_rating_home = ratings[home_team][-1] + k_factor * (match_result_home - p_home)
        new_rating_away = ratings[away_team][-1] + k_factor * (match_result_away - p_away)
        
        ratings[home_team].append(new_rating_home)
        ratings[away_team].append(new_rating_away)
        
    return ratings

In [35]:
def calculate_elo(teams, year_range, initial_rating=1200):
    ratings = initialize_elo_system(teams, initial_rating)
    matches = df[df['home_team'].isin(teams) & df['away_team'].isin(teams)]
    matches = matches[matches['date'].dt.year.between(year_range[0], year_range[1])]
    new_ratings = calculate_rankings(matches, ratings)
    rankings = {k: v[-1] for k, v in sorted(new_ratings.items(), key=lambda item: item[1][-1], reverse=True)}
    
    return rankings

In [36]:
rankings = calculate_elo(teams, [2006, 2010])
rankings

{'Spain': 1364.5183009647797,
 'Netherlands': 1313.7134131866942,
 'Germany': 1311.2241288528364,
 'Croatia': 1240.5535333845992,
 'Italy': 1239.1587366951428,
 'France': 1232.2957884337645,
 'England': 1226.9820451253613,
 'Portugal': 1226.670041047823,
 'Switzerland': 1223.7896858908427,
 'Russia': 1213.045771738059,
 'Turkey': 1205.615986060427,
 'Czech Republic': 1193.2703435177937,
 'Sweden': 1188.5366080369754,
 'Slovakia': 1187.7533660997435,
 'Ukraine': 1183.0565508409186,
 'Denmark': 1165.6846236970146,
 'Finland': 1165.6292344395317,
 'Poland': 1159.2568534609081,
 'Scotland': 1148.6082421430551,
 'North Macedonia': 1138.9454850945222,
 'Hungary': 1138.0076886006218,
 'Belgium': 1118.97073932807,
 'Wales': 1109.6238695003447,
 'Austria': 1105.088963860171}

---
# Seeing if being at home gives teams an advantage

In [28]:
def calculate_home_percentages(year_range=None):
    
    # This method calculates winning/non-losing percentages for home teams
    #
    #
    # Inputs:
    #     year_range: An array containing two elements [start, end]
    #                 No input will calculate from all years
    #
    # Outputs: 
    #     W: win rate of the home team
    #     U: non-losing percentages of the home team (Undefeated percentages)
    
    home_win = 0
    draw = 0
    total = 0
    
    nonneutral = df[df['neutral'] == False]
    if year_range is not None:
        nonneutral = nonneutral[nonneutral['date'].dt.year.between(year_range[0], year_range[1])]

    for index, row in nonneutral.iterrows():
        if row['home_score'] > row['away_score']:
            home_win += 1
        elif row['home_score'] == row['away_score']:
            draw += 1
        total += 1

    return (home_win/total*100, (home_win+draw)/total*100)

In [29]:
calculate_home_percentages()

(50.42236651326988, 73.55481308705794)
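
<p style="font-size:16px">
    The same two percentages can also be computed without an explicit loop. The sketch below is one possible vectorized version, not the function used above: the name <code>home_percentages</code> and the toy dataframe are made up for illustration, and it assumes the <code>neutral</code> column holds booleans.
</p>

```python
import pandas as pd


def home_percentages(matches):
    """Return (home win %, home unbeaten %) over non-neutral matches."""
    played = matches[~matches["neutral"]]  # assumes a boolean 'neutral' column
    home_win = played["home_score"] > played["away_score"]
    draw = played["home_score"] == played["away_score"]
    # The mean of a boolean Series is the fraction of True values
    return home_win.mean() * 100, (home_win | draw).mean() * 100


# Made-up matches for illustration only: one home win, one draw, one home loss,
# plus a neutral-venue match that should be ignored
toy = pd.DataFrame({
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 1, 3],
    "neutral": [False, False, False, True],
})

win_pct, unbeaten_pct = home_percentages(toy)
print(win_pct, unbeaten_pct)
```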