# Exploratory Data Analysis

I obtained the data from three different Kaggle datasets.

- The first dataset contains matches played in each team's home (domestic) league. It includes many statistics for each match and will be one of the most useful datasets for this project. However, league strength will not be used for these matches, since both teams that face each other are in the same league.

- The second dataset covers the UEFA Champions League. The only data it provides that is useful for this application are the two teams that played each other, the date, and the final score. It only contains matches between teams whose leagues are based in Europe.

- The third dataset covers the CONMEBOL Libertadores. It also contains only the date, teams, and score, and it only includes matches between teams from South America.

Finding data for the North American, Asian, and African continental competitions proved difficult. Without time to scrape that data manually, it makes more sense to train the model on the data that is readily available and hope that the model can still predict the outcomes accurately.

There is also a slight issue with the home leagues dataset: some of the leagues represented at the Club World Cup are not present in it. For those cases, I am hoping that metrics involving league and team strength will be enough to produce good predictions.
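
A quick way to see which leagues are affected is to compare the league ids I want against the divisions that actually appear in the file. The following is only a minimal sketch: the function name `missing_leagues` and the example values are hypothetical and are not taken from the dataset.

```python
def missing_leagues(wanted_ids, observed_divisions):
    """Return the wanted league ids that never appear in the observed divisions."""
    return sorted(set(wanted_ids) - set(observed_divisions))


# Hypothetical example values. In the notebook these would be the league id
# list and the unique values of the "Division" column, taken before filtering.
wanted = ["E0", "SP1", "EGY", "TUN"]
observed = ["E0", "SP1", "D1", "E0"]

print(missing_leagues(wanted, observed))
```

Any id this returns has no domestic matches to learn from, so predictions for its teams would have to rely on the strength metrics alone.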

## Clean and Combine the Data

The data needs to be cleaned since all of the tables contain different information. 

I plan to train the model only on data from roughly the past four seasons, because I only have access to the current team and league strengths. In football, four years is still a long time, but a team's performance is not expected to change drastically over that span, so I believe current rankings can still help assess the difference between teams over those seasons. Since leagues across the globe start at different times of the year, I will use January 1st, 2021 as the cutoff date.
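
The three sources store dates in different formats, so the cutoff can only be applied once each one is parsed. As a rough sketch using only the standard library, the format strings below are the ones used in the cleaning cells of this notebook. The sample strings are made up to match those formats, and the time portion of the UEFA sample is purely hypothetical.

```python
from datetime import datetime

CUTOFF = datetime(2021, 1, 1)

# (sample string, format string) per source; the samples are illustrative only
samples = {
    "league": ("2021-01-02", "%Y-%m-%d"),
    "uefa": ("15-Sep-21 08.00.00.000000 PM", "%d-%b-%y %I.%M.%S.%f %p"),
    "conmebol": ("04/11/2023", "%d/%m/%Y"),
}

# Parse every sample with its own format, then compare against the cutoff
parsed = {name: datetime.strptime(text, fmt) for name, (text, fmt) in samples.items()}
kept = {name: ts >= CUTOFF for name, ts in parsed.items()}

for name, ts in parsed.items():
    print(name, ts.date(), "kept" if kept[name] else "dropped")
```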

As for the leagues table specifically, I will only need data for the leagues that have representatives at the 2025 FIFA Club World Cup.

In [4]:
import pandas as pd

# ids of only the leagues we want to use
league_id = [
    "E0", "I1", "D1", "SP1", "F1", "BRA", "P1", "EGY", "USA", "AUS",
    "ARG", "TUN", "MEX", "JAP", "SA", "KOR", "UAE", "MOR", "KSA", "AUT",
]

# Load league matches data
league_matches = pd.read_csv("../data/league_training_data_raw.csv",low_memory=False)

# Change the date to a datetime object
league_matches["MatchDate"] = pd.to_datetime(
    league_matches["MatchDate"],
    format="%Y-%m-%d")

# Filter the dataset by date and league id; .copy() avoids pandas'
# SettingWithCopyWarning when new columns are added below
league_matches = league_matches[
    (league_matches["MatchDate"] >= "2021-01-01") &
    (league_matches["Division"].isin(league_id))
].copy()

# Both teams in a domestic match belong to the same league
league_matches["HomeLeague"] = league_matches["Division"]
league_matches["AwayLeague"] = league_matches["Division"]

# Drop any rows that have important missing values
required_cols = ["MatchDate", "HomeTeam", "AwayTeam", "Division", "FTHome", "FTAway", "FTResult"]
league_matches = league_matches.dropna(subset=required_cols)

# Remove any teams that play fewer than 4 matches
# Stack HomeTeam and AwayTeam into one Series
team_appearances = pd.concat([league_matches["HomeTeam"], league_matches["AwayTeam"]])

# Count how many matches each team has played
team_match_counts = team_appearances.value_counts()

# Keep only teams with 4+ appearances
eligible_teams = team_match_counts[team_match_counts >= 4].index

league_matches = league_matches[
    (league_matches["HomeTeam"].isin(eligible_teams)) &
    (league_matches["AwayTeam"].isin(eligible_teams))
]

# Select only columns that could be useful for training
# (.copy() so the team-name replacements below modify an independent frame)
league_matches = league_matches[
    [
        "MatchDate", "HomeTeam", "HomeLeague", "AwayTeam", "AwayLeague",
        "FTHome", "FTAway", "FTResult", "HomeTarget", "AwayTarget",
        "HomeFouls", "AwayFouls", "HomeCorners", "AwayCorners",
        "HomeYellow", "AwayYellow", "HomeRed", "AwayRed"
    ]
].copy()

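# Fix team names that are not consistent
# NOTE: "Tigre" is also the name of an Argentine club, so it is worth checking
# the league of those rows before merging them into the Mexican side "Tigres"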
team_name_map = {
    "MGladbach": "M'Gladbach",
    "M'gladbach": "M'Gladbach",
    "Nottm Forest": "Nott'm Forest",
    "Tigre": "Tigres",
    "U.A.N.L.- Tigres": "Tigres"
}

league_matches["HomeTeam"] = league_matches["HomeTeam"].replace(team_name_map)
league_matches["AwayTeam"] = league_matches["AwayTeam"].replace(team_name_map)

print(league_matches.head())
print(league_matches.shape)

home_teams = set(league_matches["HomeTeam"].unique())
away_teams = set(league_matches["AwayTeam"].unique())
all_teams = home_teams.union(away_teams)

print(f"Total unique teams: {len(all_teams)}")
print(sorted(all_teams))

        MatchDate       HomeTeam HomeLeague      AwayTeam AwayLeague  FTHome  \
177975 2021-01-01        Everton         E0      West Ham         E0     0.0   
177977 2021-01-01     Man United         E0   Aston Villa         E0     2.0   
177978 2021-01-02  Santos Laguna        MEX  Club America        MEX     1.0   
177984 2021-01-02      Tottenham         E0         Leeds         E0     3.0   
177988 2021-01-02     Villarreal        SP1       Levante        SP1     2.0   

        FTAway FTResult  HomeTarget  AwayTarget  HomeFouls  AwayFouls  \
177975     1.0        A         2.0         5.0        9.0        7.0   
177977     1.0        H         9.0         5.0       22.0       10.0   
177978     1.0        D         NaN         NaN        NaN        NaN   
177984     0.0        H         7.0         5.0       15.0       13.0   
177988     1.0        H         7.0         2.0       12.0       11.0   

        HomeCorners  AwayCorners  HomeYellow  AwayYellow  HomeRed  AwayRed  
177
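
The "keep only teams with 4+ appearances" step is repeated for each dataset in this notebook, so it could be factored into a small helper. This is a sketch rather than the code that was actually run: `filter_min_appearances` is a hypothetical name and the toy frame is invented for illustration.

```python
import pandas as pd


def filter_min_appearances(df, home_col, away_col, min_matches=4):
    """Keep rows where both teams appear at least `min_matches` times in df."""
    # Stack the home and away columns and count appearances per team
    appearances = pd.concat([df[home_col], df[away_col]]).value_counts()
    eligible = appearances[appearances >= min_matches].index
    mask = df[home_col].isin(eligible) & df[away_col].isin(eligible)
    # Return a copy so later column assignments do not touch the original
    return df[mask].copy()


# Toy data: A appears 5 times, B 4 times, C only once
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "A", "B", "C"],
    "AwayTeam": ["B", "A", "B", "A", "A"],
})

filtered = filter_min_appearances(toy, "HomeTeam", "AwayTeam")
print(filtered)
```

Note that this is a single pass, matching the cells in this notebook: once rows are removed, a team that was kept can end up with fewer than `min_matches` rows in the result. If a strict threshold matters, the helper would need to be applied repeatedly until the row count stops changing.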

### Cleaning Other Datasets to Combine
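
Both continental datasets spell club names with accented characters, which the name maps in this section replace by hand. Part of that work could be automated with Unicode normalization. The sketch below uses only the standard library, and `strip_accents` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
import unicodedata


def strip_accents(name):
    """Remove combining marks (accents, cedillas, tildes) from a team name."""
    # NFKD splits characters such as "ö" into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for raw in ["Beşiktaş", "Malmö FF", "São Paulo", "Ñublense", "Bayern München"]:
    print(raw, "->", strip_accents(raw))
```

This only removes diacritics. It turns "Bayern München" into "Bayern Munchen" rather than "Bayern Munich", and it does nothing for differences such as "FC Barcelona" versus "Barcelona", so a manual map is still needed for those cases.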

In [5]:
uefa_ucl_matches = pd.read_csv("../data/uefa_champions_league_training_data_raw.csv")

# Change the date to a datetime object
uefa_ucl_matches["DATE_TIME"] = pd.to_datetime(
    uefa_ucl_matches["DATE_TIME"],
    format="%d-%b-%y %I.%M.%S.%f %p"
)

uefa_ucl_matches["DATE"] = uefa_ucl_matches["DATE_TIME"].dt.date

# filter dataset by date
uefa_ucl_matches = uefa_ucl_matches[
    (uefa_ucl_matches["DATE"] >= pd.to_datetime("2021-01-01").date())
]

team_appearances = pd.concat([uefa_ucl_matches["HOME_TEAM"], uefa_ucl_matches["AWAY_TEAM"]])

# Count how many matches each team has played
team_match_counts = team_appearances.value_counts()

# Keep only teams with 4+ appearances
eligible_teams = team_match_counts[team_match_counts >= 4].index

# .copy() so the team-name replacements below modify an independent frame
uefa_ucl_matches = uefa_ucl_matches[
    (uefa_ucl_matches["HOME_TEAM"].isin(eligible_teams)) &
    (uefa_ucl_matches["AWAY_TEAM"].isin(eligible_teams))
].copy()

# Fix team names that are not consistent and have special characters
team_name_map = {
    "AC Milan": "Milan",
    "Atlético Madrid":"Ath Madrid",
    "Bayern München":"Bayern Munich",
    "Beşiktaş":"Besiktas",
    "Borussia Dortmund":"Dortmund",
    "Chelsea FC":"Chelsea",
    "Dinamo Kiev":"Dynamo Kyiv",
    "FC Barcelona":"Barcelona",
    "FC Porto":"Porto",
    "Lille OSC":"Lille",
    "Liverpool FC":"Liverpool",
    "Malmö FF":"Malmo FF",
    "Manchester City":"Man City",
    "Manchester United":"Man United",
    "Paris Saint-Germain":"Paris SG",
    "RB Salzburg":"Salzburg",
    "SL Benfica":"Benfica",
    "Sevilla FC":"Sevilla",
    "Sporting CP":"Sp Lisbon",
    "VfL Wolfsburg":"Wolfsburg",
    "Villarreal CF":"Villarreal"
}

uefa_ucl_matches["HOME_TEAM"] = uefa_ucl_matches["HOME_TEAM"].replace(team_name_map)
uefa_ucl_matches["AWAY_TEAM"] = uefa_ucl_matches["AWAY_TEAM"].replace(team_name_map)

# Drop any rows that have important missing values
required_cols = ["DATE", "HOME_TEAM", "AWAY_TEAM", "HOME_TEAM_SCORE", "AWAY_TEAM_SCORE"]

uefa_ucl_matches = uefa_ucl_matches.dropna(subset=required_cols)

# select only columns that could be useful for training
uefa_ucl_matches = uefa_ucl_matches[required_cols]

print(uefa_ucl_matches.head())
print(uefa_ucl_matches.shape)

home_teams = set(uefa_ucl_matches["HOME_TEAM"].unique())
away_teams = set(uefa_ucl_matches["AWAY_TEAM"].unique())
all_teams = home_teams.union(away_teams)

print(f"Total unique teams: {len(all_teams)}")
print(sorted(all_teams))

         DATE       HOME_TEAM       AWAY_TEAM  HOME_TEAM_SCORE  \
0  2021-09-15        Man City      RB Leipzig                6   
1  2021-09-15  Club Brugge KV        Paris SG                1   
2  2021-09-28        Paris SG        Man City                2   
3  2021-09-28      RB Leipzig  Club Brugge KV                1   
4  2021-10-19  Club Brugge KV        Man City                1   

   AWAY_TEAM_SCORE  
0                3  
1                1  
2                0  
3                2  
4                5  
(150, 5)
Total unique teams: 32
['AFC Ajax', 'Atalanta', 'Ath Madrid', 'BSC Young Boys', 'Barcelona', 'Bayern Munich', 'Benfica', 'Besiktas', 'Chelsea', 'Club Brugge KV', 'Dortmund', 'Dynamo Kyiv', 'FC Sheriff', 'Inter', 'Juventus', 'Lille', 'Liverpool', 'Malmo FF', 'Man City', 'Man United', 'Milan', 'Paris SG', 'Porto', 'RB Leipzig', 'Real Madrid', 'Salzburg', 'Sevilla', 'Shakhtar Donetsk', 'Sp Lisbon', 'Villarreal', 'Wolfsburg', 'Zenit St. Petersburg']
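
The league table carries an `FTResult` label (`H`, `D`, or `A`), while the UEFA table only has the two scores. If the tables are to be combined, one option is to rename the UEFA columns to the league table's names and derive the label from the score. The sketch below works under that assumption. The rows are the first three from the output above, and reusing the league table's column names as the target schema is an assumption, not something this notebook has settled yet.

```python
import numpy as np
import pandas as pd

# First three rows of the cleaned UEFA output shown above
ucl = pd.DataFrame({
    "DATE": ["2021-09-15", "2021-09-15", "2021-09-28"],
    "HOME_TEAM": ["Man City", "Club Brugge KV", "Paris SG"],
    "AWAY_TEAM": ["RB Leipzig", "Paris SG", "Man City"],
    "HOME_TEAM_SCORE": [6, 1, 2],
    "AWAY_TEAM_SCORE": [3, 1, 0],
})

# Rename to the league table's column names (assumed target schema)
ucl = ucl.rename(columns={
    "DATE": "MatchDate",
    "HOME_TEAM": "HomeTeam",
    "AWAY_TEAM": "AwayTeam",
    "HOME_TEAM_SCORE": "FTHome",
    "AWAY_TEAM_SCORE": "FTAway",
})

# H = home win, A = away win, D = draw (anything that is neither)
ucl["FTResult"] = np.select(
    [ucl["FTHome"] > ucl["FTAway"], ucl["FTHome"] < ucl["FTAway"]],
    ["H", "A"],
    default="D",
)

print(ucl)
```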


In [9]:
comnebol_matches = pd.read_csv("../data/conmebol_libertadores_training_raw.csv")

# Change the date to a datetime object
comnebol_matches["Date"] = pd.to_datetime(
    comnebol_matches["Date"],
    format="%d/%m/%Y"
)

# filter dataset by date
comnebol_matches = comnebol_matches[
    (comnebol_matches["Date"] >= pd.to_datetime("2021-01-01"))
]

team_appearances = pd.concat([comnebol_matches["Home Club"], comnebol_matches["Away Club"]])

# Count how many matches each team has played
team_match_counts = team_appearances.value_counts()

# Keep only teams with 4+ appearances
eligible_teams = team_match_counts[team_match_counts >= 4].index

comnebol_matches = comnebol_matches[
    (comnebol_matches["Home Club"].isin(eligible_teams)) &
    (comnebol_matches["Away Club"].isin(eligible_teams))
]

# Drop any rows that have important missing values
required_cols = ["Date", "Home Club", "Away Club", "Home Score", "AwayScore"]

comnebol_matches = comnebol_matches.dropna(subset=required_cols)

# Select only columns that could be useful for training
# (.copy() so the team-name replacements below modify an independent frame)
comnebol_matches = comnebol_matches[required_cols].copy()

# Fix team names that are not consistent and have special characters
team_name_map = {
    "América Mineiro":"America MG",
    "América de Cali":"America de Cali",
    "Argentinos Juniors":"Argentinos Jrs",
    "Atlético Mineiro":"Atletico-MG",
    "Atlético Nacional": "Atletico Nacional",
    "Atlético PR":"Athletico-PR",
    "Bolívar":"Bolivar",
    "Botafogo FR":"Botafogo RJ",
    "Cerro Porteño":"Cerro Porteno",
    "Deportivo Táchira":"Deportivo Tachira",
    "Estudiantes de la Plata":"Estudiantes L.P.",
    "Flamengo":"Flamengo RJ",
    "Fluminense FC":"Fluminense",
    "Fortaleza Esporte Clube":"Fortaleza",
    "Grêmio":"Gremio",
    "Huracán":"Huracan",
    "Independiente Medellín":"Independiente Medellin",
    "Liverpool":"Liverpool Montevideo",
    "Nacional":"Nacional Uruguay",
    "Peñarol":"Penarol",
    "Red Bull Bragantino SP":"Bragantino",
    "San Lorenzo de Almagro":"San Lorenzo",
    "Santos FC":"Santos",
    "São Paulo":"Sao Paulo",
    "Talleres de Cordoba":"Talleres Cordoba",
    "Universidad Católica":"Universidad Catolica",
    "Unión La Calera":"Union La Calera",
    "Vélez Sarsfield":"Velez Sarsfield",
    "Ñublense":"Nublense"
}

comnebol_matches["Home Club"] = comnebol_matches["Home Club"].replace(team_name_map)
comnebol_matches["Away Club"] = comnebol_matches["Away Club"].replace(team_name_map)

print(comnebol_matches.head())
print(comnebol_matches.shape)

home_teams = set(comnebol_matches["Home Club"].unique())
away_teams = set(comnebol_matches["Away Club"].unique())
all_teams = home_teams.union(away_teams)

print(f"Total unique teams: {len(all_teams)}")
print(sorted(all_teams))


        Date      Home Club      Away Club  Home Score  AwayScore
0 2023-11-04     Fluminense   Boca Juniors           2          1
1 2023-10-06      Palmeiras   Boca Juniors           1          1
2 2023-10-05  Internacional     Fluminense           1          2
3 2023-09-29   Boca Juniors      Palmeiras           0          0
4 2023-09-28     Fluminense  Internacional           2          2
(609, 5)
Total unique teams: 69
['Alianza Lima', 'Always Ready', 'America MG', 'America de Cali', 'Argentinos Jrs', 'Athletico-PR', 'Atletico Nacional', 'Atletico-MG', 'Aucas', 'Barcelona SC', 'Boca Juniors', 'Bolivar', 'Botafogo RJ', 'Bragantino', 'CD Magallanes', 'CS Deportivo Pereira', 'Caracas', 'Cerro Porteno', 'Cobresal', 'Colo Colo', 'Colón Santa Fe', 'Corinthians', 'Defensa y Justicia', 'Deportivo Cali', 'Deportivo La Guaira FC', 'Deportivo Tachira', 'Emelec', 'Estudiantes L.P.', 'FBC Melgar', 'Flamengo RJ', 'Fluminense', 'Fortaleza', 'Gremio', 'Huachipato', 'Huracan', 'Independiente', 'In