# Downloading Data

This notebook is designed to gather and prepare the base dataset required to create a comprehensive historical database of football matches from the top four European leagues (Premier League, La Liga, Serie A, Bundesliga) and their second divisions. The ultimate objective is to leverage this data to predict match outcomes, with a specific focus on goal scoring. The data collection process is critical, as it lays the groundwork for feature engineering and model development, where we will combine season-by-season data into a single, cohesive dataset that includes essential variables such as match dates, teams, results, betting odds, and a wide array of match statistics. This notebook will also include the creation of new features to improve predictive performance by identifying patterns that correlate with match outcomes.

The project’s initial task involved locating reliable and rich sources of historical football data. I identified several trusted websites that provide the necessary information:
- General match data and statistics: https://www.football-data.co.uk/ – This site contains a wealth of data on historical matches, including detailed statistics like goals, shots, fouls, and cards, as well as betting odds from different sources.
- Expected goals (xG): https://fbref.com – A more advanced dataset, which includes xG, a metric that quantifies the quality of scoring chances.
- Current betting odds: https://www.iforbet.pl – This source provides up-to-date betting odds, an important variable for incorporating market-based expectations into the model.
- Team Elo rating: http://clubelo.com/ – Elo ratings provide a continuous measure of team strength, accounting for historical performance and recent form.

After gathering the data from these sources, I focused on the seven to eight most recent seasons of each of the eight leagues. These datasets were downloaded and systematically cleaned to ensure consistency across the various formats and sources. The goal was to standardize key variables like match dates, team names, and results, while integrating additional variables such as betting odds, team ratings, and performance statistics.
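
As a rough illustration of the download step, the sketch below builds per-season CSV addresses for the eight leagues. The URL pattern and the division codes (`E0`, `E1`, `SP1`, ...) are assumptions based on how football-data.co.uk appears to organise its files and should be checked against the site before use; `LEAGUE_CODES`, `season_code`, and `csv_url` are hypothetical names, not part of `Dataset_functions`.

```python
# Assumed base address and division codes; verify both on the site itself.
BASE_URL = "https://www.football-data.co.uk/mmz4281"
LEAGUE_CODES = {
    "premier_league": "E0", "championship": "E1",
    "la_liga": "SP1", "la_liga_2": "SP2",
    "serie_a": "I1", "serie_b": "I2",
    "bundesliga": "D1", "bundesliga_2": "D2",
}

def season_code(start_year: int) -> str:
    """Turn a season's starting year into a four-digit code, e.g. 2018 -> '1819'."""
    return f"{start_year % 100:02d}{(start_year + 1) % 100:02d}"

def csv_url(league: str, start_year: int) -> str:
    """Build the (assumed) CSV address for one league and one season."""
    return f"{BASE_URL}/{season_code(start_year)}/{LEAGUE_CODES[league]}.csv"

# One URL per league and season, for the seasons starting in 2018 through 2024.
urls = {
    (league, year): csv_url(league, year)
    for league in LEAGUE_CODES
    for year in range(2018, 2025)
}
```

Each address could then be passed to `pd.read_csv` and the result saved under a systematic local file name.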

#### Data Cleaning and Integration

Data cleaning was a crucial step, as it involved handling missing values, resolving discrepancies in team names, and ensuring that all variables were uniformly formatted across seasons and leagues. Each dataset had slightly different structures and terminologies, so careful attention was given to aligning them to a unified format. I merged the data season-by-season and across the different leagues, combining them into one comprehensive table. This master table now serves as the primary dataset for further analysis and model building.
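
A minimal sketch of the name and date harmonization described above, assuming the `Date`, `HomeTeam`, and `AwayTeam` column names used by football-data.co.uk. The alias map and the `standardize_matches` helper are hypothetical; the real mapping depends on which sources are being merged.

```python
import pandas as pd

# Hypothetical alias map: each spelling variant points to one canonical name.
TEAM_ALIASES = {
    "Man United": "Manchester United",
    "Manchester Utd": "Manchester United",
    "Wolves": "Wolverhampton",
    "Wolverhampton Wanderers": "Wolverhampton",
}

def standardize_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with canonical team names and parsed match dates."""
    out = df.copy()
    for col in ("HomeTeam", "AwayTeam"):
        # Strip stray whitespace, then map known variants to the canonical name.
        out[col] = out[col].str.strip().replace(TEAM_ALIASES)
    # Dates are assumed to be day/month/four-digit-year strings.
    out["Date"] = pd.to_datetime(out["Date"], format="%d/%m/%Y")
    return out

raw = pd.DataFrame({
    "Date": ["11/08/2018", "12/08/2018"],
    "HomeTeam": ["Man United ", "Wolves"],
    "AwayTeam": ["Leicester", "Everton"],
})
clean = standardize_matches(raw)
```

Names that do not appear in the alias map pass through unchanged, so unmapped variants are easy to spot afterwards by listing the unique team names.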

Additionally, I referred to the documentation provided on: https://www.football-data.co.uk/notes.txt, which details the structure and meaning of the various columns and statistics available. This helped me ensure that all primary features were correctly interpreted and formatted.

#### Feature Engineering

In the next steps of this notebook, I will focus on feature engineering, where I will extract and create new variables from the existing data. These new features are intended to provide deeper insights and improve the predictive power of the model. For instance, I plan to calculate rolling averages of team performance metrics (like shots on target, goals scored, and conceded) over the past five to ten matches. Other derived variables could include form-based indicators, home and away team statistics, head-to-head performance, and more nuanced team ratings.

In total, this process will generate over a hundred meaningful variables, which will then be used for training machine learning models aimed at predicting the outcome of matches in terms of goals scored. These features will capture both team-specific attributes (e.g., current form, defensive strength, offensive capabilities) and match-level characteristics (e.g., betting odds, Elo ratings, and home advantage). The aim is to provide the model with as much relevant contextual information as possible to improve the accuracy of the predictions.

A well-organized, comprehensive dataset gives the later modeling work a solid foundation, and careful cleaning and integration reduce the biases and errors that could otherwise degrade the model’s performance. Each step in this notebook is designed with that goal in mind: a reliable, accurate outcome prediction model.

In [43]:
import pandas as pd
import numpy as np
import warnings
from Dataset_functions import *
# Suppress all warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

### Data Preprocessing and Feature Engineering

After downloading the CSV files for each league, I renamed them systematically so that any error during data preparation could be traced back to its source file. To handle the specific needs of each dataset, I created tailored functions for each league. These functions, imported from the `Dataset_functions` module, address several key preprocessing tasks:

- **Scraping Expected Goals Data**: For each league, I scraped websites such as FBref to gather advanced metrics like Expected Goals (xG), which provides deeper insight into the quality of scoring opportunities in each match. This data plays a crucial role in enhancing the model's understanding of team performance.
  
- **Data Cleaning and Standardization**: Cleaning the data involved multiple steps, including removing unnecessary columns and variables that weren’t directly relevant to predicting match outcomes. I ensured that the formatting of match dates was consistent across all leagues, which is essential when merging datasets or conducting time-series analysis. Dummy variables were created for the teams in each match to represent categorical data numerically, making it easier for machine learning models to interpret team-specific interactions.
  
- **Odds Conversion**: Betting odds, initially in decimal format, were converted into probabilities, taking into account the bookmaker’s margin. This transformation provides a more accurate representation of the implied likelihood of match outcomes (win, lose, or draw), which can serve as an additional predictor in the model.
  
- **Feature Creation from Existing Statistics**: I generated new variables from the raw data. For instance, statistics like shots, goals, and corners were used to compute more insightful metrics, such as averages over specific periods or performance comparisons between home and away games. These features help capture trends that might influence match outcomes.
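
The odds conversion above can be sketched as follows. This is one common approach (proportional normalization of the inverse odds), not necessarily the one implemented in `Dataset_functions`; the helper name `odds_to_probabilities` is hypothetical, and the default column names `B365H`, `B365D`, `B365A` are assumed from the football-data.co.uk column key.

```python
import pandas as pd

def odds_to_probabilities(df, home="B365H", draw="B365D", away="B365A"):
    """Convert decimal odds to implied probabilities with the margin removed."""
    inverse = 1.0 / df[[home, draw, away]]
    # The inverse odds sum to more than 1; the excess is the bookmaker's margin.
    overround = inverse.sum(axis=1)
    # Rescale each row so the three probabilities sum to exactly 1.
    probs = inverse.div(overround, axis=0)
    probs.columns = ["p_home", "p_draw", "p_away"]
    probs["margin"] = overround - 1.0
    return probs

odds = pd.DataFrame({"B365H": [1.9], "B365D": [3.5], "B365A": [4.2]})
probs = odds_to_probabilities(odds)
```

Proportional scaling spreads the margin evenly across the three outcomes, which is a simplification; bookmakers often load more margin onto longshots.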

## Feature Engineering Overview

In the next step, I focused extensively on feature engineering, where I transformed existing data into new, more informative features to enhance the predictive capability of the model. The primary variables I concentrated on include **points**, **goals**, **shots**, **corners**, **form**, **home/away status**, and **expected goals (xG)**. By breaking down each of these variables, I was able to create numerous features that capture a team’s performance trends over time.

For example, in the case of **goals**, I calculated various metrics:

- The total number of goals scored by a team in previous matches
- The number of goals conceded in recent games
- The average number of goals scored and conceded per game
- The total and average number of goals in the last *n* games, such as the last 5 or 10 matches
- Separate statistics for home and away games, recognizing the significant impact that venue has on team performance

This process alone generated **36 different features** related to goals, such as "home goals scored in the last 5 games" or "average away goals conceded in the last 10 games." By repeating this method for other primary variables like shots, points, and corners, I was able to build a robust dataset with enough depth to feed a machine learning model with valuable, non-trivial inputs.

These features provide the model with rich contextual information, helping it understand patterns of team performance, form, and tendencies based on home/away status. This careful feature engineering is designed to improve the accuracy of the model’s predictions by supplying it with a wide variety of meaningful variables, rather than relying on raw data alone.
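
To make the "last *n* games" idea concrete, here is a minimal sketch of rolling goal averages per team. It assumes the football-data.co.uk column names `Date`, `HomeTeam`, `AwayTeam`, `FTHG`, and `FTAG`; the function name `team_goal_form` and the output column names are hypothetical and do not reproduce the notebook's actual 36 features.

```python
import pandas as pd

def team_goal_form(matches: pd.DataFrame, windows=(5, 10)) -> pd.DataFrame:
    """Return one row per team per match with rolling goal averages."""
    home = matches.rename(
        columns={"HomeTeam": "team", "FTHG": "scored", "FTAG": "conceded"}
    ).assign(venue="home")
    away = matches.rename(
        columns={"AwayTeam": "team", "FTAG": "scored", "FTHG": "conceded"}
    ).assign(venue="away")
    cols = ["Date", "team", "venue", "scored", "conceded"]
    long = (
        pd.concat([home[cols], away[cols]])
        .sort_values(["team", "Date"])
        .reset_index(drop=True)
    )
    for n in windows:
        for stat in ("scored", "conceded"):
            # shift(1) drops the current match, so only earlier games count.
            long[f"{stat}_avg_last{n}"] = long.groupby("team")[stat].transform(
                lambda s, n=n: s.shift(1).rolling(n, min_periods=1).mean()
            )
    return long

matches = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
    "HomeTeam": ["A", "B", "A"],
    "AwayTeam": ["B", "A", "B"],
    "FTHG": [1, 0, 2],
    "FTAG": [0, 3, 2],
})
form = team_goal_form(matches)
```

Filtering `long` by `venue` before the rolling step would give the separate home and away variants. The shift matters: without it, each feature would include the result of the match it is meant to predict.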


## Premier League data

In [None]:
premier_league_season17_18 = prepare_dataframe('17_18_eng.csv', '2017-2018')
premier_league_season18_19 = prepare_dataframe('18_19_eng.csv', '2018-2019')
premier_league_season19_20 = prepare_dataframe('19_20_eng.csv', '2019-2020')
premier_league_season20_21 = prepare_dataframe('20_21_eng.csv', '2020-2021')
premier_league_season21_22 = prepare_dataframe('21_22_eng.csv', '2021-2022')
premier_league_season22_23 = prepare_dataframe('22_23_eng.csv', '2022-2023')
premier_league_season23_24 = prepare_dataframe('23_24_eng.csv', '2023-2024')
premier_league_season24_25 = prepare_dataframe('24_25_eng.csv', '2024-2025')

# Stack all seasons into one table, fill missing values with 0, and save
premier_league_data = pd.concat([
    premier_league_season17_18, premier_league_season18_19, premier_league_season19_20,
    premier_league_season20_21, premier_league_season21_22, premier_league_season22_23,
    premier_league_season23_24, premier_league_season24_25,
])
premier_league_data.fillna(0, inplace = True)
premier_league_data.to_csv('premier_league_data.csv', index = False)

## Championship data

In [49]:
championship_season18_19 = prepare_dataframe_championship('championship_18_19.csv', '2018-2019')
championship_season19_20 = prepare_dataframe_championship('championship_19_20.csv', '2019-2020')
championship_season20_21 = prepare_dataframe_championship('championship_20_21.csv', '2020-2021')
championship_season21_22 = prepare_dataframe_championship('championship_21_22.csv', '2021-2022')
championship_season22_23 = prepare_dataframe_championship('championship_22_23.csv', '2022-2023')
championship_season23_24 = prepare_dataframe_championship('championship_23_24.csv', '2023-2024')
championship_season24_25 = prepare_dataframe_championship('championship_24_25.csv', '2024-2025')

In [None]:
championship_data = pd.concat([championship_season18_19, championship_season19_20, championship_season20_21, championship_season21_22, championship_season22_23, championship_season23_24,championship_season24_25])
championship_data.fillna(0, inplace = True)
championship_data.to_csv('championship_data.csv', index = False)

## La Liga data

In [30]:
# spanish1_season17_18 = prepare_dataframe_spanish1('SP1_17_18.csv', '2017-2018')
spanish1_season18_19 = prepare_dataframe_spanish1('SP1_18_19.csv', '2018-2019')
spanish1_season19_20 = prepare_dataframe_spanish1('SP1_19_20.csv', '2019-2020')
spanish1_season20_21 = prepare_dataframe_spanish1('SP1_20_21.csv', '2020-2021')
spanish1_season21_22 = prepare_dataframe_spanish1('SP1_21_22.csv', '2021-2022')
spanish1_season22_23 = prepare_dataframe_spanish1('SP1_22_23.csv', '2022-2023')
spanish1_season23_24 = prepare_dataframe_spanish1('SP1_23_24.csv', '2023-2024')
spanish1_season24_25 = prepare_dataframe_spanish1('SP_24_25.csv', '2024-2025')

In [31]:
spanish1_data = pd.concat([spanish1_season18_19, spanish1_season19_20, spanish1_season20_21, spanish1_season21_22, spanish1_season22_23, spanish1_season23_24,spanish1_season24_25])
spanish1_data.fillna(0, inplace = True)
spanish1_data.to_csv('spanish1_data.csv', index = False)

## La Liga 2 data

In [None]:
spanish2_season18_19 = prepare_dataframe_spanish2('SP2_18_19.csv', '2018-2019')
spanish2_season19_20 = prepare_dataframe_spanish2('SP2_19_20.csv', '2019-2020')
spanish2_season20_21 = prepare_dataframe_spanish2('SP2_20_21.csv', '2020-2021')
spanish2_season21_22 = prepare_dataframe_spanish2('SP2_21_22.csv', '2021-2022')
spanish2_season22_23 = prepare_dataframe_spanish2('SP2_22_23.csv', '2022-2023')
spanish2_season23_24 = prepare_dataframe_spanish2('SP2_23_24.csv', '2023-2024')
# spanish2_season24_25 = prepare_dataframe_spanish2('SP2_24_25.csv', '2024-2025')

In [None]:
spanish2_data = pd.concat([spanish2_season18_19, spanish2_season19_20, spanish2_season20_21, spanish2_season21_22, spanish2_season22_23, spanish2_season23_24])
spanish2_data.fillna(0, inplace = True)
spanish2_data.to_csv('spanish2_data.csv', index = False)

## Serie A data

In [None]:
italian1_season18_19 = prepare_dataframe_italian1('I1_18_19.csv', '2018-2019')
italian1_season19_20 = prepare_dataframe_italian1('I1_19_20.csv', '2019-2020')
italian1_season20_21 = prepare_dataframe_italian1('I1_20_21.csv', '2020-2021')
italian1_season21_22 = prepare_dataframe_italian1('I1_21_22.csv', '2021-2022')
italian1_season22_23 = prepare_dataframe_italian1('I1_22_23.csv', '2022-2023')
italian1_season23_24 = prepare_dataframe_italian1('I1_23_24.csv', '2023-2024')
# italian1_season24_25 = prepare_dataframe_italian1('I1_24_25.csv', '2024-2025')

In [None]:
italian1_data = pd.concat([italian1_season18_19, italian1_season19_20, italian1_season20_21, italian1_season21_22, italian1_season22_23, italian1_season23_24])
italian1_data.fillna(0, inplace = True)
italian1_data.to_csv('italian1_data.csv', index = False)

## Serie B data

In [None]:
# italian2_season18_19 = prepare_dataframe_italian2('I2_18_19.csv', '2018-2019')
italian2_season19_20 = prepare_dataframe_italian2('I2_19_20.csv', '2019-2020')
italian2_season20_21 = prepare_dataframe_italian2('I2_20_21.csv', '2020-2021')
italian2_season21_22 = prepare_dataframe_italian2('I2_21_22.csv', '2021-2022')
italian2_season22_23 = prepare_dataframe_italian2('I2_22_23.csv', '2022-2023')
italian2_season23_24 = prepare_dataframe_italian2('I2_23_24.csv', '2023-2024')
# italian2_season24_25 = prepare_dataframe_italian2('I2_24_25.csv', '2024-2025')

In [None]:
# Use the Serie B (italian2) frames here, not the Serie A (italian1) ones
italian2_data = pd.concat([italian2_season19_20, italian2_season20_21, italian2_season21_22, italian2_season22_23, italian2_season23_24])
italian2_data.fillna(0, inplace = True)
italian2_data.to_csv('italian2_data.csv', index = False)

## 1. Bundesliga data

In [None]:
german1_season18_19 = prepare_dataframe_german1('D1_18_19.csv', '2018-2019')
german1_season19_20 = prepare_dataframe_german1('D1_19_20.csv', '2019-2020')
german1_season20_21 = prepare_dataframe_german1('D1_20_21.csv', '2020-2021')
german1_season21_22 = prepare_dataframe_german1('D1_21_22.csv', '2021-2022')
german1_season22_23 = prepare_dataframe_german1('D1_22_23.csv', '2022-2023')
german1_season23_24 = prepare_dataframe_german1('D1_23_24.csv', '2023-2024')
german1_season24_25 = prepare_dataframe_german1('D1_24_25.csv', '2024-2025')

In [None]:
german1_data = pd.concat([german1_season18_19, german1_season19_20, german1_season20_21, german1_season21_22, german1_season22_23, german1_season23_24,german1_season24_25])
german1_data.fillna(0, inplace = True)
german1_data.to_csv('german1_data.csv', index = False)

## 2. Bundesliga data

In [None]:
german2_season18_19 = prepare_dataframe_german2('D2_18_19.csv', '2018-2019')
german2_season19_20 = prepare_dataframe_german2('D2_19_20.csv', '2019-2020')
german2_season20_21 = prepare_dataframe_german2('D2_20_21.csv', '2020-2021')
german2_season21_22 = prepare_dataframe_german2('D2_21_22.csv', '2021-2022')
german2_season22_23 = prepare_dataframe_german2('D2_22_23.csv', '2022-2023')
german2_season23_24 = prepare_dataframe_german2('D2_23_24.csv', '2023-2024')
german2_season24_25 = prepare_dataframe_german2('D2_24_25.csv', '2024-2025')

In [None]:
german2_data = pd.concat([german2_season18_19, german2_season19_20, german2_season20_21, german2_season21_22, german2_season22_23, german2_season23_24,german2_season24_25])
german2_data.fillna(0, inplace = True)
german2_data.to_csv('german2_data.csv', index = False)

## H2H Data (Head-to-Head)

In addition to the previously described variables, we can extend the feature set by incorporating head-to-head (H2H) matchups. In league systems, two teams typically face each other twice per season, and historical data shows that certain teams consistently outperform their rivals, even when other factors suggest otherwise. This makes H2H data an important consideration when building predictive models, as the outcomes of these specific matchups can be influenced by team-specific dynamics.

The `get_team_h2h` function captures this dynamic by calculating H2H statistics such as goals and points scored in previous encounters between the two teams. This function aggregates historical H2H data, providing insights into long-standing rivalries and how they have evolved over time. These features are derived from the available history of matches and offer additional predictive power by acknowledging the unique relationships between certain teams.

By leveraging H2H data, we enhance the model's ability to predict outcomes in situations where specific rivalries play a key role, independent of other team performance metrics.
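
The internals of `get_team_h2h` are not shown in this notebook, so the following is only a sketch of how such features could be computed. The function name `add_h2h_features` and the output columns are hypothetical, and the input column names (`Date`, `HomeTeam`, `AwayTeam`, `FTHG`, `FTAG`) are assumed from football-data.co.uk.

```python
import pandas as pd

def add_h2h_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Add prior head-to-head totals, seen from the current home team's side."""
    df = matches.sort_values("Date").reset_index(drop=True)
    history = {}  # frozenset({team_a, team_b}) -> list of (home, home_goals, away_goals)
    played, points, goals = [], [], []
    for row in df.itertuples(index=False):
        key = frozenset((row.HomeTeam, row.AwayTeam))
        past = history.setdefault(key, [])
        pts = scored = 0
        for home, hg, ag in past:
            # Re-orient each earlier result to today's home team.
            gf, ga = (hg, ag) if home == row.HomeTeam else (ag, hg)
            scored += gf
            pts += 3 if gf > ga else 1 if gf == ga else 0
        played.append(len(past))
        points.append(pts)
        goals.append(scored)
        # Record the current match only after its features are computed.
        past.append((row.HomeTeam, row.FTHG, row.FTAG))
    df["h2h_played"] = played
    df["h2h_home_points"] = points
    df["h2h_home_goals"] = goals
    return df

matches = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
    "HomeTeam": ["A", "B", "A"],
    "AwayTeam": ["B", "A", "B"],
    "FTHG": [1, 0, 2],
    "FTAG": [0, 3, 2],
})
h2h = add_h2h_features(matches)
```

Because each match is added to the history only after its own row is filled in, the features for a fixture never include that fixture's result.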


In [34]:
season_data0 = pd.read_csv('24_25_eng.csv')
season_data00 = pd.read_csv('23_24_eng.csv')
season_data2 = pd.read_csv('22_23_eng.csv')
season_data3 = pd.read_csv('21_22_eng.csv')
season_data4 = pd.read_csv('20_21_eng.csv')
season_data5 = pd.read_csv('19_20_eng.csv')
season_data6 = pd.read_csv('18_19_eng.csv')
season_data7 = pd.read_csv('17_18_eng.csv')

championship_data0 = pd.read_csv('championship_24_25.csv')
championship_data1 = pd.read_csv('championship_23_24.csv')
championship_data2 = pd.read_csv('championship_22_23.csv')
championship_data3 = pd.read_csv('championship_21_22.csv')
championship_data4 = pd.read_csv('championship_20_21.csv')
championship_data5 = pd.read_csv('championship_19_20.csv')
championship_data6 = pd.read_csv('championship_18_19.csv')

spanish1_data0 = pd.read_csv('SP_24_25.csv')
spanish1_data1 = pd.read_csv('SP1_23_24.csv')
spanish1_data2 = pd.read_csv('SP1_22_23.csv')
spanish1_data3 = pd.read_csv('SP1_21_22.csv')
spanish1_data4 = pd.read_csv('SP1_20_21.csv')
spanish1_data5 = pd.read_csv('SP1_19_20.csv')
spanish1_data6 = pd.read_csv('SP1_18_19.csv')

# spanish2_data0 = pd.read_csv('SP2_24_25.csv')
spanish2_data1 = pd.read_csv('SP2_23_24.csv')
spanish2_data2 = pd.read_csv('SP2_22_23.csv')
spanish2_data3 = pd.read_csv('SP2_21_22.csv')
spanish2_data4 = pd.read_csv('SP2_20_21.csv')
spanish2_data5 = pd.read_csv('SP2_19_20.csv')
spanish2_data6 = pd.read_csv('SP2_18_19.csv')

italian1_data1 = pd.read_csv('I1_23_24.csv')
italian1_data2 = pd.read_csv('I1_22_23.csv')
italian1_data3 = pd.read_csv('I1_21_22.csv')
italian1_data4 = pd.read_csv('I1_20_21.csv')
italian1_data5 = pd.read_csv('I1_19_20.csv')
italian1_data6 = pd.read_csv('I1_18_19.csv')

italian2_data1 = pd.read_csv('I2_23_24.csv')
italian2_data2 = pd.read_csv('I2_22_23.csv')
italian2_data3 = pd.read_csv('I2_21_22.csv')
italian2_data4 = pd.read_csv('I2_20_21.csv')
italian2_data5 = pd.read_csv('I2_19_20.csv')
# italian2_data6 = pd.read_csv('I2_18_19.csv')

german1_data0 = pd.read_csv('D1_24_25.csv')
german1_data1 = pd.read_csv('D1_23_24.csv')
german1_data2 = pd.read_csv('D1_22_23.csv')
german1_data3 = pd.read_csv('D1_21_22.csv')
german1_data4 = pd.read_csv('D1_20_21.csv')
german1_data5 = pd.read_csv('D1_19_20.csv')
german1_data6 = pd.read_csv('D1_18_19.csv')

german2_data0 = pd.read_csv('D2_24_25.csv')
german2_data1 = pd.read_csv('D2_23_24.csv')
german2_data2 = pd.read_csv('D2_22_23.csv')
german2_data3 = pd.read_csv('D2_21_22.csv')
german2_data4 = pd.read_csv('D2_20_21.csv')
german2_data5 = pd.read_csv('D2_19_20.csv')
german2_data6 = pd.read_csv('D2_18_19.csv')

# Combine every league and season into one table
season_data = pd.concat([
    season_data7, season_data6, season_data5, season_data4, season_data3, season_data2, season_data00, season_data0,
    championship_data0, championship_data1, championship_data2, championship_data3, championship_data4, championship_data5, championship_data6,
    spanish1_data0, spanish1_data1, spanish1_data2, spanish1_data3, spanish1_data4, spanish1_data5, spanish1_data6,
    spanish2_data1, spanish2_data2, spanish2_data3, spanish2_data4, spanish2_data5, spanish2_data6,
    italian1_data1, italian1_data2, italian1_data3, italian1_data4, italian1_data5, italian1_data6,
    italian2_data2, italian2_data3, italian2_data4, italian2_data5, italian2_data1,
    german1_data0, german1_data1, german1_data2, german1_data3, german1_data4, german1_data5, german1_data6,
    german2_data0, german2_data1, german2_data2, german2_data3, german2_data4, german2_data5, german2_data6,
])
# Keep the first 10 characters (dd/mm/yyyy) and parse them as dates
season_data['Date'] = season_data['Date'].str[:10]
season_data['Date'] = pd.to_datetime(season_data['Date'], format = '%d/%m/%Y')
# Sort chronologically before computing the head-to-head features
season_data.sort_values(['Date'], ascending = True, inplace = True)
season_data.reset_index(drop = True, inplace = True)
season_data = get_team_h2h(season_data)

In [35]:
season_data.to_csv('h2h_data.csv', index = False)