# 01 Data Exploration

This notebook performs a structured first-pass inspection of FBref player stats data (2022–2025) to validate consistency and prepare for downstream merging. It covers:

* **Data loading & cleaning:** Loads CSVs whose real header sits on the second row, standardizes column names, and identifies the placeholder `Matches` column left over from the source table
* **Schema validation:** Confirms key columns like `Player`, `Age`, and `Pos` are present across seasons for linking
* **Data sanity checks:** Flags unexpected values in noisy columns (`Matches`), suggesting cleanup before analysis
* **Player continuity mapping:** Identifies players who appear in more than one season by grouping on exact player names

> The output of this step is a set of loaded, column-normalized tables and a season-labeled list of player identities, which together form the backbone for joining performance metrics over time.

In [20]:
import pandas as pd
from pathlib import Path

from typing import List, Tuple, Dict

In [21]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [22]:
# Base path for all CSVs
base_path = '../data/raw/'

In [23]:
def load_clean_fbref_csv(path: str) -> pd.DataFrame:
    """
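    Load an FBref CSV whose first row is leftover HTML header junk and whose second row is the real header.
    Column names are stripped of surrounding whitespace and inner spaces are replaced with underscores.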
    """
    df = pd.read_csv(path, header=1)
    df.columns = [col.strip().replace(' ', '_') for col in df.columns]
    return df
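
To illustrate what `header=1` does, the sketch below runs a small in-memory CSV through the same two steps. The sample text is made up for illustration (its first row stands in for the grouped-header row) and is not taken from the real files.

```python
import io

import pandas as pd

# Hypothetical miniature of a raw export: row 0 is a grouping row,
# row 1 holds the real column names.
raw_csv_text = (
    ",,Playing Time,Playing Time,Tackles\n"
    "Player,Pos,MP,Starts,Def 3rd\n"
    "Example Player,DF,10,9,4\n"
)

# header=1 makes the second line the header and discards the first.
example_df = pd.read_csv(io.StringIO(raw_csv_text), header=1)

# Same normalization as load_clean_fbref_csv: trim and replace spaces.
example_df.columns = [col.strip().replace(" ", "_") for col in example_df.columns]

print(list(example_df.columns))
```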

In [24]:
# Season 2024–2025
df_player_stats_2425         = load_clean_fbref_csv(base_path + 'df_player_stats_2425.csv')
df_player_shooting_2425      = load_clean_fbref_csv(base_path + 'df_player_shooting_2425.csv')
df_player_passing_2425       = load_clean_fbref_csv(base_path + 'df_player_passing_2425.csv')
df_player_passing_types_2425 = load_clean_fbref_csv(base_path + 'df_player_passing_types_2425.csv')
df_player_gca_2425           = load_clean_fbref_csv(base_path + 'df_player_gca_2425.csv')
df_player_defense_2425       = load_clean_fbref_csv(base_path + 'df_player_defense_2425.csv')
df_player_possession_2425    = load_clean_fbref_csv(base_path + 'df_player_possession_2425.csv')

# Season 2023–2024
df_player_stats_2324         = load_clean_fbref_csv(base_path + 'df_player_stats_2324.csv')
df_player_shooting_2324      = load_clean_fbref_csv(base_path + 'df_player_shooting_2324.csv')
df_player_passing_2324       = load_clean_fbref_csv(base_path + 'df_player_passing_2324.csv')
df_player_passing_types_2324 = load_clean_fbref_csv(base_path + 'df_player_passing_types_2324.csv')
df_player_gca_2324           = load_clean_fbref_csv(base_path + 'df_player_gca_2324.csv')
df_player_defense_2324       = load_clean_fbref_csv(base_path + 'df_player_defense_2324.csv')
df_player_possession_2324    = load_clean_fbref_csv(base_path + 'df_player_possession_2324.csv')

# Season 2022–2023
df_player_stats_2223         = load_clean_fbref_csv(base_path + 'df_player_stats_2223.csv')
df_player_shooting_2223      = load_clean_fbref_csv(base_path + 'df_player_shooting_2223.csv')
df_player_passing_2223       = load_clean_fbref_csv(base_path + 'df_player_passing_2223.csv')
df_player_passing_types_2223 = load_clean_fbref_csv(base_path + 'df_player_passing_types_2223.csv')
df_player_gca_2223           = load_clean_fbref_csv(base_path + 'df_player_gca_2223.csv')
df_player_defense_2223       = load_clean_fbref_csv(base_path + 'df_player_defense_2223.csv')
df_player_possession_2223    = load_clean_fbref_csv(base_path + 'df_player_possession_2223.csv')
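
The 21 assignments above follow one naming pattern, so the file paths can also be generated. The sketch below only builds the name-to-path mapping; `TABLE_NAMES`, `SEASON_CODES`, and `build_fbref_csv_paths` are illustrative names, not part of the notebook.

```python
from pathlib import Path
from typing import Dict, List

# Table and season identifiers copied from the file names used above.
TABLE_NAMES: List[str] = [
    "stats", "shooting", "passing", "passing_types",
    "gca", "defense", "possession",
]
SEASON_CODES: List[str] = ["2425", "2324", "2223"]


def build_fbref_csv_paths(base_dir: str) -> Dict[str, Path]:
    """Map each dataset name (e.g. 'df_player_gca_2324') to its CSV path."""
    return {
        f"df_player_{table}_{season}": Path(base_dir) / f"df_player_{table}_{season}.csv"
        for season in SEASON_CODES
        for table in TABLE_NAMES
    }


csv_paths = build_fbref_csv_paths("../data/raw/")
print(len(csv_paths))
```

Each path could then be passed to `load_clean_fbref_csv(str(path))` to fill a dictionary of DataFrames instead of 21 separate variables.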

## Data Description

In [25]:
df_player_stats_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,G+A,G-PK,PK,PKatt,CrdY,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Gls.1,Ast.1,G+A.1,G-PK.1,G+A-PK,xG.1,xAG.1,xG+xAG,npxG.1,npxG+xAG.1,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,37,37,3319.0,36.9,1.0,0.0,1.0,1.0,0.0,0.0,6.0,0.0,0.4,0.4,0.0,0.4,35.0,123.0,6.0,0.03,0.0,0.03,0.03,0.03,0.01,0.0,0.01,0.01,0.01,Matches
1,Diego López,es ESP,"MF,FW",22.0,38,35,2737.0,30.4,8.0,5.0,13.0,8.0,0.0,0.0,3.0,0.0,6.6,6.6,5.0,11.6,76.0,85.0,219.0,0.26,0.16,0.43,0.26,0.43,0.22,0.17,0.38,0.22,0.38,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34,34,3060.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.1,0.1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Matches
3,César Tárrega,es ESP,DF,22.0,34,34,3026.0,33.6,2.0,0.0,2.0,2.0,0.0,0.0,7.0,0.0,1.9,1.9,0.1,2.1,20.0,79.0,1.0,0.06,0.0,0.06,0.06,0.06,0.06,0.0,0.06,0.06,0.06,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,36,33,2831.0,31.5,5.0,3.0,8.0,3.0,2.0,2.0,6.0,0.0,3.3,1.7,4.5,6.2,93.0,103.0,228.0,0.16,0.1,0.25,0.1,0.19,0.1,0.14,0.25,0.05,0.2,Matches


| Feature    | Datatype | Description                                                              |
| ---------- | -------- | ------------------------------------------------------------------------ |
| Player     | string   | Player name                                                              |
| Nation     | string   | Player's nationality (two-letter lowercase code plus three-letter code)  |
| Pos        | string   | Primary position(s) played                                               |
| Age        | float    | Age in years                                                             |
| MP         | int      | Matches played                                                           |
| Starts     | int      | Matches started                                                          |
| Min        | float    | Total minutes played                                                     |
| 90s        | float    | Minutes played divided by 90                                             |
| Gls        | float    | Goals scored                                                             |
| Ast        | float    | Assists                                                                  |
| G+A        | float    | Goals + Assists                                                          |
| G-PK       | float    | Non-penalty goals                                                        |
| PK         | float    | Penalty goals                                                            |
| PKatt      | float    | Penalty attempts                                                         |
| CrdY       | float    | Yellow cards                                                             |
| CrdR       | float    | Red cards                                                                |
| xG         | float    | Expected goals                                                           |
| npxG       | float    | Non-penalty expected goals                                               |
| xAG        | float    | Expected assisted goals                                                  |
| npxG+xAG   | float    | Total attacking contribution (non-pen xG + xAG)                          |
| PrgC       | float    | Progressive carries                                                      |
| PrgP       | float    | Progressive passes                                                       |
| PrgR       | float    | Progressive passes received                                              |
| Gls.1      | float    | Goals per 90                                                             |
| Ast.1      | float    | Assists per 90                                                           |
| G+A.1      | float    | Goals + assists per 90                                                   |
| G-PK.1     | float    | Non-penalty goals per 90                                                 |
| G+A-PK     | float    | G+A minus penalties per 90                                               |
| xG.1       | float    | Expected goals per 90                                                    |
| xAG.1      | float    | Expected assisted goals per 90                                           |
| xG+xAG     | float    | Combined xG and xAG per 90                                               |
| npxG.1     | float    | Non-penalty xG per 90                                                    |
| npxG+xAG.1 | float    | Non-penalty xG plus xAG per 90                                           |
| Matches    | string   | Placeholder field (repeats “Matches” — likely noise from original table) |
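
The per-90 columns can be cross-checked against the totals. The sketch below recomputes a few values for the Diego López row shown above, assuming per-90 figures are totals divided by `90s` and rounded to two decimals (an assumption that happens to match this row).

```python
# Values copied from the Diego López row in the preview above.
minutes_played = 2737.0
nineties = 30.4
goals, assists, expected_goals = 8.0, 5.0, 6.6

# 90s should equal minutes / 90, rounded to one decimal.
recomputed_nineties = round(minutes_played / 90, 1)

# Per-90 columns (Gls.1, Ast.1, G+A.1, xG.1) as total / 90s.
goals_per_90 = round(goals / nineties, 2)
assists_per_90 = round(assists / nineties, 2)
goals_plus_assists_per_90 = round((goals + assists) / nineties, 2)
expected_goals_per_90 = round(expected_goals / nineties, 2)

print(recomputed_nineties, goals_per_90, assists_per_90,
      goals_plus_assists_per_90, expected_goals_per_90)
```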

In [26]:
df_player_shooting_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,90s,Gls,Sh,SoT,SoT%,Sh/90,SoT/90,G/Sh,G/SoT,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,36.9,1,5,2,40.0,0.14,0.05,0.2,0.5,10.5,0,0,0,0.4,0.4,0.08,0.6,0.6,Matches
1,Diego López,es ESP,"MF,FW",22.0,30.4,8,44,14,31.8,1.45,0.46,0.18,0.57,15.1,0,0,0,6.6,6.6,0.15,1.4,1.4,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34.0,0,0,0,,0.0,0.0,,,,0,0,0,0.0,0.0,,0.0,0.0,Matches
3,César Tárrega,es ESP,DF,22.0,33.6,2,20,6,30.0,0.59,0.18,0.1,0.33,10.3,0,0,0,1.9,1.9,0.1,0.1,0.1,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,31.5,5,28,12,42.9,0.89,0.38,0.11,0.25,21.2,1,2,2,3.3,1.7,0.06,1.7,1.3,Matches


| Feature  | Datatype | Description                                                |
| -------- | -------- | ---------------------------------------------------------- |
| Player   | string   | Player name                                                |
| Nation   | string   | Player's nationality                                       |
| Pos      | string   | Position(s) played                                         |
| Age      | float    | Player age                                                 |
| 90s      | float    | Minutes played divided by 90                               |
| Gls      | int      | Total goals scored                                         |
| Sh       | int      | Total shots taken                                          |
| SoT      | int      | Shots on target                                            |
| SoT%     | float    | Percentage of shots on target                              |
| Sh/90    | float    | Shots per 90 minutes                                       |
| SoT/90   | float    | Shots on target per 90 minutes                             |
| G/Sh     | float    | Goals per shot (penalty kicks excluded)                    |
| G/SoT    | float    | Goals per shot on target (penalty kicks excluded)          |
| Dist     | float    | Average shot distance                                      |
| FK       | int      | Shots taken from free kicks                                |
| PK       | int      | Penalty goals                                              |
| PKatt    | int      | Penalty attempts                                           |
| xG       | float    | Expected goals                                             |
| npxG     | float    | Non-penalty expected goals                                 |
| npxG/Sh  | float    | Non-penalty xG per shot                                    |
| G-xG     | float    | Goals minus expected goals (finishing efficiency)          |
| np:G-xG  | float    | Non-penalty goals minus non-penalty xG                     |
| Matches  | string   | Placeholder (likely constant or repeated "Matches" string) |
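
The ratio columns are undefined for players with no shots, which explains the empty cells in the goalkeeper row. Below is a minimal sketch of recomputing `SoT%` and `G/Sh` in pandas while keeping those cases as `NaN`, using values from the preview above. The numerator for `G/Sh` is taken as goals minus penalty goals, an inference from the Luis Rioja row (3 non-penalty goals from 28 shots gives 0.11) rather than something the notebook states.

```python
import pandas as pd

# Gls, PK, Sh, SoT for three of the rows previewed above.
shooting_sample = pd.DataFrame(
    {
        "Player": ["Diego López", "Giorgi Mamardashvili", "Luis Rioja"],
        "Gls": [8, 0, 5],
        "PK": [0, 0, 2],
        "Sh": [44, 0, 28],
        "SoT": [14, 0, 12],
    }
)

# Mask zero-shot rows so the division yields NaN rather than inf.
shots_or_nan = shooting_sample["Sh"].where(shooting_sample["Sh"] > 0)

# Assumption: goals-per-shot uses non-penalty goals as the numerator.
non_penalty_goals = shooting_sample["Gls"] - shooting_sample["PK"]

shooting_sample["SoT%_check"] = (shooting_sample["SoT"] / shots_or_nan * 100).round(1)
shooting_sample["G/Sh_check"] = (non_penalty_goals / shots_or_nan).round(2)

print(shooting_sample)
```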

In [27]:
df_player_passing_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,90s,Cmp,Att,Cmp%,TotDist,PrgDist,Cmp.1,Att.1,Cmp%.1,Cmp.2,Att.2,Cmp%.2,Cmp.3,Att.3,Cmp%.3,Ast,xAG,xA,A-xAG,KP,1/3,PPA,CrsPA,PrgP,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,36.9,1858,2053,90.5,32827,11556,766,812,94.3,947,1009,93.9,126,185,68.1,0,0.0,0.3,0.0,0,118,1,0,123,Matches
1,Diego López,es ESP,"MF,FW",22.0,30.4,680,943,72.1,10541,2502,373,457,81.6,251,320,78.4,37,61,60.7,5,5.0,4.7,0.0,38,43,34,9,85,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34.0,732,1080,67.8,20581,15111,131,131,100.0,393,398,98.7,208,548,38.0,0,0.1,0.1,-0.1,2,11,1,0,0,Matches
3,César Tárrega,es ESP,DF,22.0,33.6,1428,1692,84.4,27168,9416,477,512,93.2,817,892,91.6,127,247,51.4,0,0.1,0.2,-0.1,3,79,1,0,79,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,31.5,821,1212,67.7,13662,3934,427,506,84.4,307,419,73.3,68,181,37.6,3,4.5,5.5,-1.5,35,42,60,25,103,Matches


| Feature | Datatype | Description                                                      |
| ------- | -------- | ---------------------------------------------------------------- |
| Player  | string   | Player name                                                      |
| Nation  | string   | Player's nationality                                             |
| Pos     | string   | Position(s) played                                               |
| Age     | float    | Player age                                                       |
| 90s     | float    | Minutes played divided by 90                                     |
| Cmp     | int      | Total completed passes                                           |
| Att     | int      | Total pass attempts                                              |
| Cmp%    | float    | Overall pass completion percentage                               |
| TotDist | int      | Total passing distance (in yards)                                |
| PrgDist | int      | Progressive passing distance                                     |
| Cmp.1   | int      | Completed short passes                                           |
| Att.1   | int      | Short pass attempts                                              |
| Cmp%.1  | float    | Short pass completion percentage                                 |
| Cmp.2   | int      | Completed medium passes                                          |
| Att.2   | int      | Medium pass attempts                                             |
| Cmp%.2  | float    | Medium pass completion percentage                                |
| Cmp.3   | int      | Completed long passes                                            |
| Att.3   | int      | Long pass attempts                                               |
| Cmp%.3  | float    | Long pass completion percentage                                  |
| Ast     | int      | Assists                                                          |
| xAG     | float    | Expected assisted goals                                          |
| xA      | float    | Expected assists                                                 |
| A-xAG   | float    | Difference between assists and xAG (over/underperformance)       |
| KP      | int      | Key passes (passes leading directly to a shot)                   |
| 1/3     | int      | Passes into final third                                          |
| PPA     | int      | Passes into penalty area                                         |
| CrsPA   | int      | Crosses into penalty area                                        |
| PrgP    | int      | Progressive passes                                               |
| Matches | string   | Placeholder field (likely constant or repeated “Matches” string) |
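
The `.1`, `.2`, `.3` suffixes are the kind pandas adds when a header contains repeated column names. Renaming them once makes later merges easier to read. The target names below (`Cmp_short` and so on) are only a suggested convention, based on the short/medium/long grouping described in the table above.

```python
import pandas as pd

# Suggested (hypothetical) names for the suffixed passing columns.
PASSING_COLUMN_RENAMES = {
    "Cmp": "Cmp_total", "Att": "Att_total", "Cmp%": "Cmp%_total",
    "Cmp.1": "Cmp_short", "Att.1": "Att_short", "Cmp%.1": "Cmp%_short",
    "Cmp.2": "Cmp_medium", "Att.2": "Att_medium", "Cmp%.2": "Cmp%_medium",
    "Cmp.3": "Cmp_long", "Att.3": "Att_long", "Cmp%.3": "Cmp%_long",
}

# One-row sample with values from the Cristhian Mosquera row above.
passing_sample = pd.DataFrame(
    [[1858, 2053, 766, 812, 947, 1009, 126, 185]],
    columns=["Cmp", "Att", "Cmp.1", "Att.1", "Cmp.2", "Att.2", "Cmp.3", "Att.3"],
)

# rename() ignores mapping keys that are absent from the frame.
passing_sample = passing_sample.rename(columns=PASSING_COLUMN_RENAMES)

# Completion percentage for long passes, recomputed from the counts.
long_completion_pct = round(
    float(passing_sample.loc[0, "Cmp_long"]) / float(passing_sample.loc[0, "Att_long"]) * 100, 1
)
print(list(passing_sample.columns), long_completion_pct)
```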

In [28]:
df_player_passing_types_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,90s,Att,Live,Dead,FK,TB,Sw,Crs,TI,CK,In,Out,Str,Cmp,Off,Blocks,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,36.9,2053,1931,117,57,0,6,2,49,0,0,0,0,1858,5,9,Matches
1,Diego López,es ESP,"MF,FW",22.0,30.4,943,904,36,0,5,3,67,33,0,0,0,0,680,3,47,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34.0,1080,742,336,117,0,3,0,0,0,0,0,0,732,2,1,Matches
3,César Tárrega,es ESP,DF,22.0,33.6,1692,1608,79,47,0,10,1,1,0,0,0,0,1428,5,14,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,31.5,1212,1077,132,28,6,14,203,48,56,31,14,0,821,3,61,Matches


| Feature | Datatype | Description                                                      |
| ------- | -------- | ---------------------------------------------------------------- |
| Player  | string   | Player name                                                      |
| Nation  | string   | Player's nationality                                             |
| Pos     | string   | Position(s) played                                               |
| Age     | float    | Player age                                                       |
| 90s     | float    | Minutes played divided by 90                                     |
| Att     | int      | Total passes attempted                                           |
| Live    | int      | Live-ball passes                                                 |
| Dead    | int      | Dead-ball passes                                                 |
| FK      | int      | Passes attempted from free kicks                                 |
| TB      | int      | Through balls                                                    |
| Sw      | int      | Switches (long diagonal passes)                                  |
| Crs     | int      | Crosses                                                          |
| TI      | int      | Throw-ins                                                        |
| CK      | int      | Corner kicks                                                     |
| In      | int      | Inswinging corner kicks                                          |
| Out     | int      | Outswinging corner kicks                                         |
| Str     | int      | Straight corner kicks                                            |
| Cmp     | int      | Total completed passes (likely overall)                          |
| Off     | int      | Passes that led to an offside                                    |
| Blocks  | int      | Passes blocked by opponent                                       |
| Matches | string   | Placeholder field (likely constant or repeated “Matches” string) |

In [29]:
df_player_gca_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,90s,SCA,SCA90,PassLive,PassDead,TO,Sh,Fld,Def,GCA,GCA90,PassLive.1,PassDead.1,TO.1,Sh.1,Fld.1,Def.1,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,36.9,14,0.38,12,0,0,0,0,2,1,0.03,1,0,0,0,0,0,Matches
1,Diego López,es ESP,"MF,FW",22.0,30.4,69,2.27,60,1,3,2,3,0,9,0.3,7,0,1,0,1,0,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34.0,9,0.26,6,3,0,0,0,0,2,0.06,1,1,0,0,0,0,Matches
3,César Tárrega,es ESP,DF,22.0,33.6,10,0.3,9,0,0,1,0,0,0,0.0,0,0,0,0,0,0,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,31.5,82,2.61,65,8,5,1,2,1,10,0.32,8,1,0,1,0,0,Matches


| Feature    | Datatype | Description                                            |
| ---------- | -------- | ------------------------------------------------------ |
| Player     | string   | Player name                                            |
| Nation     | string   | Player's nationality                                   |
| Pos        | string   | Position(s) played                                     |
| Age        | float    | Player age                                             |
| 90s        | float    | Minutes played divided by 90                           |
| SCA        | int      | Shot-Creating Actions (total)                          |
| SCA90      | float    | Shot-Creating Actions per 90 minutes                   |
| PassLive   | int      | SCA via live-ball passes                               |
| PassDead   | int      | SCA via dead-ball passes                               |
| TO         | int      | SCA via take-ons (successful dribbles leading to shot) |
| Sh         | int      | SCA via shots (e.g., a shot that leads to another shot) |
| Fld        | int      | SCA via fouls drawn                                    |
| Def        | int      | SCA via defensive actions                              |
| GCA        | int      | Goal-Creating Actions (total)                          |
| GCA90      | float    | Goal-Creating Actions per 90 minutes                   |
| PassLive.1 | int      | GCA via live-ball passes                               |
| PassDead.1 | int      | GCA via dead-ball passes                               |
| TO.1       | int      | GCA via take-ons                                       |
| Sh.1       | int      | GCA via shots                                          |
| Fld.1      | int      | GCA via fouls drawn                                    |
| Def.1      | int      | GCA via defensive actions                              |
| Matches    | string   | Placeholder field (likely constant “Matches”)          |
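
Going by the descriptions above, the six SCA type columns should add up to `SCA`, and likewise for GCA. The sketch below checks that on the five previewed rows; on the full table, any mismatching rows would be worth a closer look.

```python
import pandas as pd

sca_component_columns = ["PassLive", "PassDead", "TO", "Sh", "Fld", "Def"]

# SCA totals and components for the five rows previewed above.
gca_sample = pd.DataFrame(
    [
        ["Cristhian Mosquera", 14, 12, 0, 0, 0, 0, 2],
        ["Diego López", 69, 60, 1, 3, 2, 3, 0],
        ["Giorgi Mamardashvili", 9, 6, 3, 0, 0, 0, 0],
        ["César Tárrega", 10, 9, 0, 0, 1, 0, 0],
        ["Luis Rioja", 82, 65, 8, 5, 1, 2, 1],
    ],
    columns=["Player", "SCA"] + sca_component_columns,
)

# Rows where the components do not add up to the reported total.
component_sum = gca_sample[sca_component_columns].sum(axis=1)
mismatching_rows = gca_sample[component_sum != gca_sample["SCA"]]

print(len(mismatching_rows))
```

The same check applies to `GCA` with the `.1`-suffixed columns.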

In [30]:
df_player_defense_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,90s,Tkl,TklW,Def_3rd,Mid_3rd,Att_3rd,Tkl.1,Att,Tkl%,Lost,Blocks,Sh,Pass,Int,Tkl+Int,Clr,Err,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,36.9,56,36,41,13,2,31,40,77.5,9,37,24,13,35,91,128,4,Matches
1,Diego López,es ESP,"MF,FW",22.0,30.4,31,18,11,16,4,16,47,34.0,31,30,0,30,16,47,15,0,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34.0,2,2,2,0,0,2,3,66.7,1,0,0,0,1,3,16,3,Matches
3,César Tárrega,es ESP,DF,22.0,33.6,49,37,33,13,3,19,31,61.3,12,41,20,21,22,71,222,2,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,31.5,33,21,14,12,7,14,42,33.3,28,30,1,29,22,55,44,1,Matches


| Feature  | Datatype | Description                                   |
| -------- | -------- | --------------------------------------------- |
| Player   | string   | Player name                                   |
| Nation   | string   | Player's nationality                          |
| Pos      | string   | Position(s) played                            |
| Age      | float    | Player age                                    |
| 90s      | float    | Minutes played divided by 90                  |
| Tkl      | int      | Total tackles                                 |
| TklW     | int      | Tackles won                                   |
| Def\_3rd | int      | Tackles in the defensive third                |
| Mid\_3rd | int      | Tackles in the middle third                   |
| Att\_3rd | int      | Tackles in the attacking third                |
| Tkl.1    | int      | Tackles vs dribbles                           |
| Att      | int      | Times dribbled at                             |
| Tkl%     | float    | Tackle success rate vs dribbles               |
| Lost     | int      | Times beaten by dribbler                      |
| Blocks   | int      | Total blocks (shots or passes)                |
| Sh       | int      | Blocked shots                                 |
| Pass     | int      | Blocked passes                                |
| Int      | int      | Interceptions                                 |
| Tkl+Int  | int      | Tackles + Interceptions (total)               |
| Clr      | int      | Clearances                                    |
| Err      | int      | Defensive errors leading to opponent chance   |
| Matches  | string   | Placeholder field (likely constant “Matches”) |

In [31]:
df_player_possession_2425.head()

Unnamed: 0,Player,Nation,Pos,Age,90s,Touches,Def_Pen,Def_3rd,Mid_3rd,Att_3rd,Att_Pen,Live,Att,Succ,Succ%,Tkld,Tkld%,Carries,TotDist,PrgDist,PrgC,1/3,CPA,Mis,Dis,Rec,PrgR,Matches
0,Cristhian Mosquera,es ESP,DF,20.0,36.9,2367,233,1159,1144,74,7,2367,9,5,55.6,4,44.4,1401,7430,4283,35,11,0,18,10,1583,6,Matches
1,Diego López,es ESP,"MF,FW",22.0,30.4,1257,11,132,510,634,100,1257,77,23,29.9,51,66.2,715,4212,2026,76,51,24,71,27,881,219,Matches
2,Giorgi Mamardashvili,ge GEO,GK,23.0,34.0,1163,978,1155,8,0,0,1163,0,0,,0,,574,2610,1452,1,0,0,0,0,490,0,Matches
3,César Tárrega,es ESP,DF,22.0,33.6,2096,330,1141,914,58,37,2096,15,11,73.3,2,13.3,1134,6087,3641,20,10,0,16,4,1318,1,Matches
4,Luis Rioja,es ESP,"MF,FW",30.0,31.5,1505,39,252,533,741,59,1503,91,43,47.3,37,40.7,816,5793,2812,93,57,28,42,40,1031,228,Matches


| Feature  | Datatype | Description                                                          |
| -------- | -------- | -------------------------------------------------------------------- |
| Player   | string   | Player name                                                          |
| Nation   | string   | Player's nationality                                                 |
| Pos      | string   | Position(s) played                                                   |
| Age      | float    | Player age                                                           |
| 90s      | float    | Minutes played divided by 90                                         |
| Touches  | int      | Total touches                                                        |
| Def\_Pen | int      | Touches in defensive penalty area                                    |
| Def\_3rd | int      | Touches in defensive third                                           |
| Mid\_3rd | int      | Touches in middle third                                              |
| Att\_3rd | int      | Touches in attacking third                                           |
| Att\_Pen | int      | Touches in attacking penalty area                                    |
| Live     | int      | Live-ball touches                                                    |
| Att      | int      | Take-on attempts                                                     |
| Succ     | int      | Successful take-ons                                                  |
| Succ%    | float    | Take-on success rate                                                 |
| Tkld     | int      | Times tackled while attempting take-on                               |
| Tkld%    | float    | Percentage of take-ons resulting in a tackle                         |
| Carries  | int      | Total carries (times the player controlled the ball with their feet) |
| TotDist  | int      | Total distance carried (in yards)                                    |
| PrgDist  | int      | Progressive carry distance                                           |
| PrgC     | int      | Progressive carries (≥10 yards towards goal or into the penalty area; excludes carries ending in the defending half) |
| 1/3      | int      | Carries into final third                                             |
| CPA      | int      | Carries into penalty area                                            |
| Mis      | int      | Miscontrols                                                          |
| Dis      | int      | Dispossessions                                                       |
| Rec      | int      | Passes received                                                      |
| PrgR     | int      | Progressive passes received                                          |
| Matches  | string   | Placeholder field (likely constant “Matches”)                        |
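
Several short names recur across tables with different meanings: `Att` is pass attempts in the passing table, times dribbled at in the defense table, and take-on attempts here. Below is a minimal sketch of one way to keep them apart when merging, by prefixing every non-key column with its table name. The helper `add_table_prefix` and the choice of `Player` as the only key are assumptions for illustration.

```python
from typing import List

import pandas as pd

KEY_COLUMNS: List[str] = ["Player"]  # assumed join key for this sketch


def add_table_prefix(dataframe: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """Prefix all non-key columns so equal names from different tables stay distinct."""
    renamed_columns = {
        column: f"{table_name}__{column}"
        for column in dataframe.columns
        if column not in KEY_COLUMNS
    }
    return dataframe.rename(columns=renamed_columns)


# 'Att' values for two players, taken from the defense and possession previews.
defense_sample = pd.DataFrame({"Player": ["Diego López", "Luis Rioja"], "Att": [47, 42]})
possession_sample = pd.DataFrame({"Player": ["Diego López", "Luis Rioja"], "Att": [77, 91]})

merged_sample = add_table_prefix(defense_sample, "defense").merge(
    add_table_prefix(possession_sample, "possession"),
    on=KEY_COLUMNS,
    how="outer",
    validate="one_to_one",  # fail loudly if a player name repeats within a table
)
print(list(merged_sample.columns))
```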

In [32]:
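def print_fbref_datasets_with_unexpected_matches_column_values(
    fbref_datasets_by_name: Dict[str, pd.DataFrame]
) -> None:
    """
    Print the value counts of the 'Matches' column for every dataset in which
    that column holds anything other than the single value 'Matches' (NaN counts as unexpected).
    """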
    for dataset_name in fbref_datasets_by_name:
        fbref_dataframe: pd.DataFrame = fbref_datasets_by_name[dataset_name]

        matches_column_value_counts: pd.Series = fbref_dataframe['Matches'].value_counts(dropna=False)

        matches_column_has_unexpected_values: bool = (
            len(matches_column_value_counts) > 1 or matches_column_value_counts.index[0] != 'Matches'
        )

        if matches_column_has_unexpected_values:
            print(f"[!] Unexpected values found in '.Matches' column of {dataset_name}:")
            print(matches_column_value_counts, end="\n\n")

In [33]:
datasets = {
    'df_player_stats_2425': df_player_stats_2425,
    'df_player_shooting_2425': df_player_shooting_2425,
    'df_player_passing_2425': df_player_passing_2425,
    'df_player_passing_types_2425': df_player_passing_types_2425,
    'df_player_gca_2425': df_player_gca_2425,
    'df_player_defense_2425': df_player_defense_2425,
    'df_player_possession_2425': df_player_possession_2425,
}

print_fbref_datasets_with_unexpected_matches_column_values(datasets)

[!] Unexpected values found in '.Matches' column of df_player_stats_2425:
Matches
Matches    41
NaN         2
Name: count, dtype: int64

[!] Unexpected values found in '.Matches' column of df_player_shooting_2425:
Matches
Matches    32
NaN         2
Name: count, dtype: int64

[!] Unexpected values found in '.Matches' column of df_player_passing_2425:
Matches
Matches    32
NaN         2
Name: count, dtype: int64

[!] Unexpected values found in '.Matches' column of df_player_passing_types_2425:
Matches
Matches    32
NaN         2
Name: count, dtype: int64

[!] Unexpected values found in '.Matches' column of df_player_gca_2425:
Matches
Matches    32
NaN         2
Name: count, dtype: int64

[!] Unexpected values found in '.Matches' column of df_player_defense_2425:
Matches
Matches    32
NaN         2
Name: count, dtype: int64

[!] Unexpected values found in '.Matches' column of df_player_possession_2425:
Matches
Matches    32
NaN         2
Name: count, dtype: int64
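
The counts above also show that the standard stats table has more rows (41 + 2) than the other six tables (32 + 2), so some players appear only in the stats table. A sketch of listing them, with made-up names standing in for the real data:

```python
import pandas as pd

# Hypothetical stand-ins for df_player_stats_2425 and df_player_shooting_2425.
stats_players = pd.DataFrame({"Player": ["Player A", "Player B", "Player C"]})
shooting_players = pd.DataFrame({"Player": ["Player A", "Player B"]})

# indicator=True adds a '_merge' column telling which side each row came from.
membership = stats_players.merge(
    shooting_players, on="Player", how="left", indicator=True
)
only_in_stats = membership.loc[membership["_merge"] == "left_only", "Player"].tolist()

print(only_in_stats)
```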



In [34]:
# NOTE: 'Matches' is a constant placeholder (plus 2 NaN rows per table); drop the column before analysis
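
One way to act on this note is sketched below, under the assumption that rows with a missing `Matches` value are not player rows (for example aggregate rows; this should be confirmed by inspecting them first). `drop_matches_placeholder` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def drop_matches_placeholder(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose 'Matches' cell is the literal 'Matches', then drop the column."""
    is_player_row = dataframe["Matches"] == "Matches"
    return dataframe.loc[is_player_row].drop(columns=["Matches"]).reset_index(drop=True)


# Small made-up frame mimicking the pattern reported above.
sample = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Some Total Row"],
        "MP": [10, 12, 38],
        "Matches": ["Matches", "Matches", None],
    }
)

cleaned_sample = drop_matches_placeholder(sample)
print(cleaned_sample)
```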

---

In [35]:
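def find_players_appearing_in_multiple_seasons(dataframes_with_season_labels: List[Tuple[pd.DataFrame, str]]) -> pd.DataFrame:
    """
    Return one row per player who appears in more than one season, with the sorted
    season labels in a 'Seasons' column. Players are matched on the exact 'Player'
    string, so spelling variants count as different players.
    """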
    list_of_player_season_dataframes: List[pd.DataFrame] = []

    for single_season_dataframe, season_label in dataframes_with_season_labels:
        if 'Player' not in single_season_dataframe.columns:
            raise ValueError(f"The required column 'Player' is missing in the dataset for season {season_label}")

        player_and_season_dataframe: pd.DataFrame = single_season_dataframe[['Player']].copy()
        player_and_season_dataframe['Season'] = season_label
        list_of_player_season_dataframes.append(player_and_season_dataframe)

    combined_player_season_dataframe: pd.DataFrame = pd.concat(list_of_player_season_dataframes)

    grouped_players_with_all_seasons: pd.DataFrame = (
        combined_player_season_dataframe
        .groupby('Player')['Season']
        .agg(lambda seasons: sorted(seasons.unique()))
        .reset_index()
    )

    players_with_multiple_seasons: pd.DataFrame = grouped_players_with_all_seasons[
        grouped_players_with_all_seasons['Season'].apply(len) > 1
    ].rename(columns={'Season': 'Seasons'})

    return players_with_multiple_seasons

In [36]:
dataframes_with_season_labels: List[Tuple[pd.DataFrame, str]] = [
    (df_player_stats_2223, '2223'),
    (df_player_stats_2324, '2324'),
    (df_player_stats_2425, '2425'),
]

players_with_multiple_seasons: pd.DataFrame = find_players_appearing_in_multiple_seasons(dataframes_with_season_labels)

In [37]:
players_with_multiple_seasons

Unnamed: 0,Player,Seasons
0,Alberto Marí,"[2223, 2324, 2425]"
2,Cenk Özkacar,"[2223, 2324, 2425]"
4,Cristhian Mosquera,"[2223, 2324, 2425]"
5,Cristian,"[2223, 2324]"
6,César Tárrega,"[2324, 2425]"
8,David Otorbi,"[2324, 2425]"
9,Diego López,"[2223, 2324, 2425]"
10,Dimitri Foulquier,"[2223, 2324, 2425]"
11,Domingos André Ribeiro Almeida,"[2223, 2324, 2425]"
16,Francisco Perez,"[2223, 2324, 2425]"
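
Grouping on the raw `Player` string treats any spelling difference as a different player and any shared name as the same one (short names such as `Cristian` are the risky case). Below is a hedged sketch of a normalized key that at least absorbs accent, case, and spacing differences; `normalize_player_name` is a hypothetical helper and does not resolve true name collisions.

```python
import unicodedata


def normalize_player_name(player_name: str) -> str:
    """Lower-case, strip accents, and collapse whitespace in a player name."""
    decomposed = unicodedata.normalize("NFKD", player_name)
    without_accents = "".join(
        character for character in decomposed if not unicodedata.combining(character)
    )
    return " ".join(without_accents.casefold().split())


print(normalize_player_name("César  Tárrega"))
print(normalize_player_name("Cenk Özkacar"))
```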


In [44]:
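def save_all_loaded_dataframes_to_interim_folder() -> None:
    """
    Write every global DataFrame whose variable name starts with 'df_' to ../data/interim/<name>.csv.
    """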
    interim_data_directory: Path = Path("..", "data", "interim")
    interim_data_directory.mkdir(parents=True, exist_ok=True)

    loaded_dataframes: dict[str, pd.DataFrame] = {
        variable_name: variable_object
        for variable_name, variable_object in globals().items()
        if variable_name.startswith("df_") and isinstance(variable_object, pd.DataFrame)
    }

    for dataframe_name, dataframe_object in loaded_dataframes.items():
        output_file_path: Path = interim_data_directory / f"{dataframe_name}.csv"
        dataframe_object.to_csv(output_file_path, index=False)
        print(f"Saved: {output_file_path}")
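
Collecting frames through `globals()` picks up every `df_`-prefixed DataFrame in the session, including temporary ones. Below is a sketch of an alternative that takes an explicit mapping; `save_dataframes_to_directory` is a hypothetical name, and a temporary directory is used so the example does not touch `../data/interim`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

import pandas as pd


def save_dataframes_to_directory(
    dataframes_by_name: Dict[str, pd.DataFrame], output_directory: Path
) -> List[Path]:
    """Write each DataFrame to '<output_directory>/<name>.csv' and return the paths."""
    output_directory.mkdir(parents=True, exist_ok=True)
    written_paths: List[Path] = []
    for dataframe_name, dataframe in dataframes_by_name.items():
        output_path = output_directory / f"{dataframe_name}.csv"
        dataframe.to_csv(output_path, index=False)
        written_paths.append(output_path)
    return written_paths


with tempfile.TemporaryDirectory() as temporary_directory:
    saved_paths = save_dataframes_to_directory(
        {"df_example": pd.DataFrame({"Player": ["Player A"], "MP": [1]})},
        Path(temporary_directory) / "interim",
    )
    saved_file_names = [path.name for path in saved_paths]
    round_trip = pd.read_csv(saved_paths[0])

print(saved_file_names)
```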

In [45]:
save_all_loaded_dataframes_to_interim_folder()

Saved: ../data/interim/df_player_stats_2425.csv
Saved: ../data/interim/df_player_shooting_2425.csv
Saved: ../data/interim/df_player_passing_2425.csv
Saved: ../data/interim/df_player_passing_types_2425.csv
Saved: ../data/interim/df_player_gca_2425.csv
Saved: ../data/interim/df_player_defense_2425.csv
Saved: ../data/interim/df_player_possession_2425.csv
Saved: ../data/interim/df_player_stats_2324.csv
Saved: ../data/interim/df_player_shooting_2324.csv
Saved: ../data/interim/df_player_passing_2324.csv
Saved: ../data/interim/df_player_passing_types_2324.csv
Saved: ../data/interim/df_player_gca_2324.csv
Saved: ../data/interim/df_player_defense_2324.csv
Saved: ../data/interim/df_player_possession_2324.csv
Saved: ../data/interim/df_player_stats_2223.csv
Saved: ../data/interim/df_player_shooting_2223.csv
Saved: ../data/interim/df_player_passing_2223.csv
Saved: ../data/interim/df_player_passing_types_2223.csv
Saved: ../data/interim/df_player_gca_2223.csv
Saved: ../data/interim/df_player_defense_