# 02 Data Exploration

This notebook explores and summarizes the combined FBref player datasets (2022–2025) to assess data quality and inform feature engineering for player value modeling. It covers:

* **Data loading:** Loads all interim per-season CSVs and binds each one to a global DataFrame variable named after its file
* **Multi-season merging:** Concatenates datasets by type (e.g. passing, shooting) across seasons with a `Season` column for time-aware analysis
* **Statistical summaries:** Prints key statistics, missing values, and outlier counts per dataset, focusing only on numeric columns with modeling relevance
* **Domain-specific notes:** Adds practical cleaning and transformation insights for each dataset, from minutes normalization to log-scaling and role-awareness
* **Feature relevance review:** Notes pitfalls like inflated rate stats, context leakage, sparse roles, and non-linear player valuation signals

> Output of this step is a set of combined, season-tagged datasets ready for cleaning and feature extraction. The summary insights guide what to transform, exclude, or enrich before modeling.

In [9]:
from pathlib import Path
import pandas as pd

In [10]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [11]:
interim_data_directory = Path("..", "data", "interim")
csv_files = interim_data_directory.glob("df_*.csv")

In [12]:
for csv_file in csv_files:
    df_name = csv_file.stem  # Remove ".csv"
    globals()[df_name] = pd.read_csv(csv_file)
    print(f"Loaded: {df_name}")

Loaded: df_player_defense_2324
Loaded: df_player_shooting_2324
Loaded: df_player_possession_2324
Loaded: df_player_passing_types_2223
Loaded: df_player_passing_types_2425
Loaded: df_player_passing_types_2324
Loaded: df_player_possession_2223
Loaded: df_player_possession_2425
Loaded: df_player_shooting_2425
Loaded: df_player_shooting_2223
Loaded: df_player_defense_2223
Loaded: df_player_defense_2425
Loaded: df_player_gca_2324
Loaded: df_player_stats_2223
Loaded: df_player_stats_2425
Loaded: df_player_passing_2324
Loaded: df_player_passing_2425
Loaded: df_player_passing_2223
Loaded: df_player_stats_2324
Loaded: df_player_gca_2425
Loaded: df_player_gca_2223
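
As a possible alternative to writing into `globals()`, the same files could be collected in a dictionary keyed by file stem. This is only a sketch: `load_interim_frames` is a hypothetical helper, not part of the notebook, and the later cells would have to look frames up in the dictionary instead of in `globals()`.

```python
from pathlib import Path

import pandas as pd


def load_interim_frames(directory: Path, pattern: str = "df_*.csv") -> dict[str, pd.DataFrame]:
    """Read every CSV matching `pattern` into a dict keyed by file stem."""
    frames: dict[str, pd.DataFrame] = {}
    # sorted() makes the load order deterministic; raw glob order depends on the filesystem
    for csv_file in sorted(directory.glob(pattern)):
        frames[csv_file.stem] = pd.read_csv(csv_file)
    return frames
```

With the notebook's directory this would be called as `load_interim_frames(Path("..", "data", "interim"))`. A dictionary keeps the loaded frames in one place and avoids accidental name clashes with other notebook variables.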


---

We combine the season-specific datasets with `pd.concat` so that cross-season analysis, player tracking, and time-aware feature engineering can all work on one unified structure per dataset type.

In [13]:
def concat_player_dataframes_across_seasons(dataframe_prefix: str) -> pd.DataFrame:
    """Concatenates all player DataFrames with the given prefix across seasons and adds a Season column."""
    seasons: list[str] = ["2223", "2324", "2425"]
    season_dataframes: list[pd.DataFrame] = []

    for season_label in seasons:
        dataframe_variable_name: str = f"df_{dataframe_prefix}_{season_label}"
        if dataframe_variable_name not in globals():
            raise ValueError(f"DataFrame {dataframe_variable_name} not found in current environment.")

        # Copy so the per-season DataFrame held in globals() is not mutated
        single_season_dataframe: pd.DataFrame = globals()[dataframe_variable_name].copy()
        single_season_dataframe["Season"] = season_label
        season_dataframes.append(single_season_dataframe)

    # Concatenate once instead of growing a DataFrame inside the loop
    return pd.concat(season_dataframes, ignore_index=True)

In [14]:
dataframe_prefixes: list[str] = [
    "player_stats",
    "player_shooting",
    "player_passing",
    "player_passing_types",
    "player_gca",
    "player_defense",
    "player_possession"
]

combined_dataframes: dict[str, pd.DataFrame] = {}

for prefix in dataframe_prefixes:
    combined_name: str = f"df_combined_{prefix}"
    combined_dataframes[combined_name] = concat_player_dataframes_across_seasons(prefix)
    print(f"Combined: {combined_name}")

Combined: df_combined_player_stats
Combined: df_combined_player_shooting
Combined: df_combined_player_passing
Combined: df_combined_player_passing_types
Combined: df_combined_player_gca
Combined: df_combined_player_defense
Combined: df_combined_player_possession


---

In [15]:
def summarize_dataframe_statistics(dataframe: pd.DataFrame, dataframe_name: str) -> None:
    print(f"\n--- Summary: {dataframe_name} ---")

    # Columns to exclude from summary
    excluded_columns: set[str] = {"Player", "Nation", "Pos", "Matches", "Season"}

    # Filtered DataFrame
    columns_to_analyze: list[str] = [
        column_name for column_name in dataframe.columns
        if column_name not in excluded_columns and pd.api.types.is_numeric_dtype(dataframe[column_name])
    ]
    filtered_dataframe: pd.DataFrame = dataframe[columns_to_analyze]

    # Missing values
    missing_total: pd.Series = filtered_dataframe.isna().sum()
    missing_percent: pd.Series = (missing_total / len(filtered_dataframe)) * 100
    missing_summary: pd.DataFrame = pd.concat([missing_total, missing_percent], axis=1)
    missing_summary.columns = ['Missing Values', '% Missing']
    missing_summary = missing_summary[missing_summary['Missing Values'] > 0]

    if not missing_summary.empty:
        print("\nMissing Values (Total and %):")
        display(missing_summary.sort_values('% Missing', ascending=False).T)

    # Descriptive statistics
    print("\nDescriptive Statistics:")
    display(filtered_dataframe.describe().transpose())

    # Outlier detection
    for column_name in filtered_dataframe.columns:
        series = filtered_dataframe[column_name]
        if series.std() == 0 or series.isna().all():
            continue
        z_scores = (series - series.mean()) / series.std()
        outlier_count = (z_scores.abs() > 3).sum()
        if outlier_count > 0:
            print(f"Outliers in {column_name}: {outlier_count} rows > 3 std deviations")

In [16]:
# Run summary for all combined DataFrames
for dataframe_name, dataframe_object in combined_dataframes.items():
    summarize_dataframe_statistics(dataframe_object, dataframe_name)


--- Summary: df_combined_player_stats ---

Missing Values (Total and %):


Unnamed: 0,Min,PrgC,npxG.1,xG+xAG,xAG.1,xG.1,G+A-PK,G-PK.1,G+A.1,Ast.1,Gls.1,PrgR,PrgP,npxG+xAG,90s,xAG,npxG,xG,CrdR,CrdY,PKatt,PK,G-PK,G+A,Ast,Gls,npxG+xAG.1
Missing Values,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0
% Missing,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333,20.833333



Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,120.0,23.184167,3.987796,15.0,20.0,22.5,25.0,35.0
MP,120.0,16.8,14.116919,0.0,1.0,16.5,30.25,38.0
Starts,120.0,31.35,89.873174,0.0,0.0,7.0,24.0,418.0
Min,95.0,1401.431579,1132.042596,7.0,362.0,1315.0,2245.0,3420.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
Gls,95.0,4.010526,10.525041,0.0,0.0,1.0,2.0,51.0
Ast,95.0,2.621053,7.221686,0.0,0.0,0.0,2.0,38.0
G+A,95.0,6.631579,17.602014,0.0,0.0,1.0,4.0,89.0
G-PK,95.0,3.515789,9.404462,0.0,0.0,0.0,2.0,45.0
PK,95.0,0.494737,1.47946,0.0,0.0,0.0,0.0,8.0


Outliers in Starts: 6 rows > 3 std deviations
Outliers in Gls: 6 rows > 3 std deviations
Outliers in Ast: 4 rows > 3 std deviations
Outliers in G+A: 6 rows > 3 std deviations
Outliers in G-PK: 5 rows > 3 std deviations
Outliers in PK: 5 rows > 3 std deviations
Outliers in PKatt: 4 rows > 3 std deviations
Outliers in CrdY: 5 rows > 3 std deviations
Outliers in CrdR: 2 rows > 3 std deviations
Outliers in xG: 6 rows > 3 std deviations
Outliers in npxG: 6 rows > 3 std deviations
Outliers in xAG: 6 rows > 3 std deviations
Outliers in npxG+xAG: 6 rows > 3 std deviations
Outliers in PrgC: 5 rows > 3 std deviations
Outliers in PrgP: 6 rows > 3 std deviations
Outliers in PrgR: 6 rows > 3 std deviations
Outliers in Gls.1: 4 rows > 3 std deviations
Outliers in Ast.1: 4 rows > 3 std deviations
Outliers in G+A.1: 4 rows > 3 std deviations
Outliers in G-PK.1: 5 rows > 3 std deviations
Outliers in G+A-PK: 4 rows > 3 std deviations
Outliers in xG.1: 6 rows > 3 std deviations
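Outliers in xAG.1: 5 rows > 3 std deviations
[... output truncated ...]

--- Summary: df_combined_player_shooting ---

Missing Values (Total and %):


Unnamed: 0,G/SoT,SoT%,G/Sh,Dist,npxG/Sh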
Missing Values,25.0,13.0,13.0,13.0,13.0
% Missing,26.315789,13.684211,13.684211,13.684211,13.684211



Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,95.0,23.569474,3.701152,16.0,21.0,23.0,26.0,35.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
Gls,95.0,4.010526,10.525041,0.0,0.0,1.0,2.0,51.0
Sh,95.0,38.726316,98.157433,0.0,3.0,10.0,21.5,480.0
SoT,95.0,12.610526,33.003963,0.0,0.0,2.0,7.0,154.0
SoT%,82.0,27.684146,16.644173,0.0,16.7,30.2,36.9,66.7
Sh/90,95.0,1.802526,2.696021,0.0,0.395,0.89,1.995,12.63
SoT/90,95.0,0.559579,0.932148,0.0,0.0,0.23,0.655,4.05
G/Sh,82.0,0.067561,0.089575,0.0,0.0,0.045,0.1,0.5
G/SoT,70.0,0.245714,0.257128,0.0,0.0,0.25,0.38,1.0


Outliers in Age: 1 rows > 3 std deviations
Outliers in Gls: 6 rows > 3 std deviations
Outliers in Sh: 6 rows > 3 std deviations
Outliers in SoT: 6 rows > 3 std deviations
Outliers in Sh/90: 6 rows > 3 std deviations
Outliers in SoT/90: 5 rows > 3 std deviations
Outliers in G/Sh: 1 rows > 3 std deviations
Outliers in FK: 4 rows > 3 std deviations
Outliers in PK: 5 rows > 3 std deviations
Outliers in PKatt: 4 rows > 3 std deviations
Outliers in xG: 6 rows > 3 std deviations
Outliers in npxG: 6 rows > 3 std deviations
Outliers in npxG/Sh: 2 rows > 3 std deviations
Outliers in G-xG: 2 rows > 3 std deviations
Outliers in np:G-xG: 2 rows > 3 std deviations

--- Summary: df_combined_player_passing ---

Missing Values (Total and %):


Unnamed: 0,Cmp%.3,Cmp%.2
Missing Values,4.0,1.0
% Missing,4.210526,1.052632



Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,95.0,23.569474,3.701152,16.0,21.0,23.0,26.0,35.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
Cmp,95.0,1306.157895,3376.123461,2.0,86.0,387.0,799.5,16672.0
Att,95.0,1673.568421,4291.092768,4.0,128.0,482.0,1071.0,20730.0
Cmp%,95.0,74.198947,10.627889,40.0,70.15,75.5,80.4,100.0
TotDist,95.0,23320.842105,60549.109485,27.0,1445.0,5750.0,15090.0,302306.0
PrgDist,95.0,8427.715789,22007.847171,3.0,352.5,1857.0,5211.0,108756.0
Cmp.1,95.0,588.936842,1510.871355,1.0,49.0,161.0,370.0,7313.0
Att.1,95.0,670.147368,1714.350655,1.0,56.0,202.0,420.0,8278.0
Cmp%.1,95.0,86.330526,9.051273,53.3,82.35,87.9,90.95,100.0


Outliers in Age: 1 rows > 3 std deviations
Outliers in Cmp: 6 rows > 3 std deviations
Outliers in Att: 6 rows > 3 std deviations
Outliers in Cmp%: 2 rows > 3 std deviations
Outliers in TotDist: 6 rows > 3 std deviations
Outliers in PrgDist: 6 rows > 3 std deviations
Outliers in Cmp.1: 6 rows > 3 std deviations
Outliers in Att.1: 6 rows > 3 std deviations
Outliers in Cmp%.1: 2 rows > 3 std deviations
Outliers in Cmp.2: 6 rows > 3 std deviations
Outliers in Att.2: 6 rows > 3 std deviations
Outliers in Cmp%.2: 2 rows > 3 std deviations
Outliers in Cmp.3: 6 rows > 3 std deviations
Outliers in Att.3: 6 rows > 3 std deviations
Outliers in Ast: 4 rows > 3 std deviations
Outliers in xAG: 6 rows > 3 std deviations
Outliers in xA: 5 rows > 3 std deviations
Outliers in A-xAG: 1 rows > 3 std deviations
Outliers in KP: 6 rows > 3 std deviations
Outliers in 1/3: 5 rows > 3 std deviations
Outliers in PPA: 5 rows > 3 std deviations
Outliers in CrsPA: 4 rows > 3 std deviations
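Outliers in PrgP: 6 rows > 3 std deviations
[... output truncated ...]

--- Summary: df_combined_player_passing_types ---

Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max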
Age,95.0,23.569474,3.701152,16.0,21.0,23.0,26.0,35.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
Att,95.0,1673.568421,4291.092768,4.0,128.0,482.0,1071.0,20730.0
Live,95.0,1483.947368,3818.221744,4.0,100.0,417.0,897.0,18759.0
Dead,95.0,182.157895,461.955932,0.0,8.0,33.0,89.5,2007.0
FK,95.0,51.915789,130.901917,0.0,0.0,8.0,33.5,570.0
TB,95.0,4.284211,12.379344,0.0,0.0,0.0,2.0,78.0
Sw,95.0,13.336842,34.998362,0.0,0.0,3.0,9.5,167.0
Crs,95.0,69.252632,176.217646,0.0,1.0,6.0,46.5,924.0
TI,95.0,78.610526,205.603848,0.0,0.0,2.0,21.5,868.0


Outliers in Age: 1 rows > 3 std deviations
Outliers in Att: 6 rows > 3 std deviations
Outliers in Live: 6 rows > 3 std deviations
Outliers in Dead: 6 rows > 3 std deviations
Outliers in FK: 6 rows > 3 std deviations
Outliers in TB: 3 rows > 3 std deviations
Outliers in Sw: 5 rows > 3 std deviations
Outliers in Crs: 4 rows > 3 std deviations
Outliers in TI: 6 rows > 3 std deviations
Outliers in CK: 5 rows > 3 std deviations
Outliers in In: 5 rows > 3 std deviations
Outliers in Out: 4 rows > 3 std deviations
Outliers in Str: 3 rows > 3 std deviations
Outliers in Cmp: 6 rows > 3 std deviations
Outliers in Off: 6 rows > 3 std deviations
Outliers in Blocks: 6 rows > 3 std deviations

--- Summary: df_combined_player_gca ---

Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,95.0,23.569474,3.701152,16.0,21.0,23.0,26.0,35.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
SCA,95.0,69.094737,174.424163,0.0,6.0,15.0,44.5,859.0
SCA90,95.0,2.958105,4.575692,0.0,0.645,1.81,2.75,22.61
PassLive,95.0,49.347368,124.035322,0.0,4.0,12.0,31.5,607.0
PassDead,95.0,6.852632,18.233106,0.0,0.0,0.0,2.5,88.0
TO,95.0,3.873684,10.184722,0.0,0.0,0.0,3.0,59.0
Sh,95.0,3.789474,10.250993,0.0,0.0,1.0,2.5,53.0
Fld,95.0,3.863158,10.010219,0.0,0.0,0.0,2.0,51.0
Def,95.0,1.368421,3.386621,0.0,0.0,0.0,1.0,17.0


Outliers in Age: 1 rows > 3 std deviations
Outliers in SCA: 6 rows > 3 std deviations
Outliers in SCA90: 6 rows > 3 std deviations
Outliers in PassLive: 6 rows > 3 std deviations
Outliers in PassDead: 6 rows > 3 std deviations
Outliers in TO: 5 rows > 3 std deviations
Outliers in Sh: 4 rows > 3 std deviations
Outliers in Fld: 5 rows > 3 std deviations
Outliers in Def: 4 rows > 3 std deviations
Outliers in GCA: 6 rows > 3 std deviations
Outliers in GCA90: 5 rows > 3 std deviations
Outliers in PassLive.1: 5 rows > 3 std deviations
Outliers in PassDead.1: 3 rows > 3 std deviations
Outliers in TO.1: 3 rows > 3 std deviations
Outliers in Sh.1: 4 rows > 3 std deviations
Outliers in Fld.1: 5 rows > 3 std deviations
Outliers in Def.1: 3 rows > 3 std deviations

--- Summary: df_combined_player_defense ---

Missing Values (Total and %):


Unnamed: 0,Tkl%
Missing Values,12.0
% Missing,12.631579



Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,95.0,23.569474,3.701152,16.0,21.0,23.0,26.0,35.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
Tkl,95.0,61.915789,155.03978,0.0,4.0,18.0,41.0,678.0
TklW,95.0,37.821053,94.343305,0.0,2.0,12.0,24.5,428.0
Def_3rd,95.0,31.252632,78.097997,0.0,1.0,8.0,19.0,354.0
Mid_3rd,95.0,23.210526,58.940629,0.0,1.0,7.0,14.0,291.0
Att_3rd,95.0,7.452632,18.894876,0.0,0.0,3.0,4.5,88.0
Tkl.1,95.0,29.515789,74.124803,0.0,2.0,9.0,18.5,321.0
Att,95.0,57.452632,144.230516,0.0,4.5,17.0,39.0,659.0
Tkl%,83.0,53.626506,22.582845,0.0,39.25,54.5,66.7,100.0


Outliers in Age: 1 rows > 3 std deviations
Outliers in Tkl: 6 rows > 3 std deviations
Outliers in TklW: 6 rows > 3 std deviations
Outliers in Def_3rd: 6 rows > 3 std deviations
Outliers in Mid_3rd: 6 rows > 3 std deviations
Outliers in Att_3rd: 6 rows > 3 std deviations
Outliers in Tkl.1: 6 rows > 3 std deviations
Outliers in Att: 6 rows > 3 std deviations
Outliers in Lost: 5 rows > 3 std deviations
Outliers in Blocks: 6 rows > 3 std deviations
Outliers in Sh: 5 rows > 3 std deviations
Outliers in Pass: 6 rows > 3 std deviations
Outliers in Int: 6 rows > 3 std deviations
Outliers in Tkl+Int: 6 rows > 3 std deviations
Outliers in Clr: 4 rows > 3 std deviations
Outliers in Err: 4 rows > 3 std deviations

--- Summary: df_combined_player_possession ---

Missing Values (Total and %):


Unnamed: 0,Succ%,Tkld%
Missing Values,15.0,15.0
% Missing,15.789474,15.789474



Descriptive Statistics:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,95.0,23.569474,3.701152,16.0,21.0,23.0,26.0,35.0
90s,95.0,15.573684,12.580246,0.1,4.0,14.6,24.9,38.0
Touches,95.0,2068.421053,5276.516462,4.0,146.0,620.0,1251.5,24786.0
Def_Pen,95.0,211.442105,558.045272,0.0,6.0,24.0,75.0,2406.0
Def_3rd,95.0,672.547368,1724.223586,0.0,25.5,144.0,395.0,7654.0
Mid_3rd,95.0,917.6,2366.407806,0.0,42.5,237.0,561.0,11522.0
Att_3rd,95.0,498.347368,1269.133469,0.0,31.5,117.0,317.5,6255.0
Att_Pen,95.0,66.947368,172.746793,0.0,5.5,16.0,39.0,822.0
Live,95.0,2067.747368,5274.849912,4.0,146.0,620.0,1251.5,24780.0
Att,95.0,63.947368,162.426636,0.0,2.5,14.0,38.5,747.0


Outliers in Age: 1 rows > 3 std deviations
Outliers in Touches: 6 rows > 3 std deviations
Outliers in Def_Pen: 6 rows > 3 std deviations
Outliers in Def_3rd: 6 rows > 3 std deviations
Outliers in Mid_3rd: 6 rows > 3 std deviations
Outliers in Att_3rd: 6 rows > 3 std deviations
Outliers in Att_Pen: 6 rows > 3 std deviations
Outliers in Live: 6 rows > 3 std deviations
Outliers in Att: 6 rows > 3 std deviations
Outliers in Succ: 5 rows > 3 std deviations
Outliers in Tkld: 6 rows > 3 std deviations
Outliers in Carries: 6 rows > 3 std deviations
Outliers in TotDist: 6 rows > 3 std deviations
Outliers in PrgDist: 6 rows > 3 std deviations
Outliers in PrgC: 5 rows > 3 std deviations
Outliers in 1/3: 6 rows > 3 std deviations
Outliers in CPA: 6 rows > 3 std deviations
Outliers in Mis: 6 rows > 3 std deviations
Outliers in Dis: 6 rows > 3 std deviations
Outliers in Rec: 6 rows > 3 std deviations
Outliers in PrgR: 6 rows > 3 std deviations
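
The 3-standard-deviation rule used above relies on the mean and standard deviation, and both are pulled upward by the very extremes it is trying to find. As a hedged cross-check, the sketch below counts outliers with the interquartile range instead. This is a swapped-in technique, not what the notebook does; `count_iqr_outliers` and the multiplier `k=1.5` are assumptions chosen for illustration.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    values = series.dropna()
    if values.empty:
        return 0
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Quartiles barely move when a few extreme rows are added, unlike mean and std
    return int(((values < lower) | (values > upper)).sum())
```

On the toy series `[0, 0, 1, 1, 2, 2, 3, 50]` the 3-sigma rule flags nothing, because the value 50 inflates the standard deviation enough to keep its own z-score near 2.5, while the IQR rule flags it. One caveat: for zero-inflated columns such as `PK`, where both quartiles are 0 in the summary above, the IQR is 0 and every non-zero value is counted as an outlier.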


#### Dataset Notes

##### `df_combined_player_stats`

* The `90s` column (minutes played divided by 90) determines how much every other stat can be trusted. With few minutes, most stats are noise, so players with too little playing time should probably be filtered out.
* Age should be standardized and possibly transformed, since player value does not rise linearly with age. The peak is usually in the mid-20s.
* Position is not covered by this numeric summary (the code excludes `Pos`) but should definitely be brought back in later, for example one-hot encoded or as coarser position groups.
* Goals, assists, and penalties have high variance and many zeros, so log-transforming or binning them could make sense.
* There is no information about team strength or player role, which could distort value predictions if it is not handled carefully.
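
A minimal sketch of the first two points, under stated assumptions: the cutoff `MIN_90S` and the peak `PEAK_AGE` are hypothetical values picked for illustration, and `filter_and_add_age_curve` is not a function from the notebook. Only the `90s` and `Age` columns from the summary above are used.

```python
import pandas as pd

MIN_90S = 5.0    # hypothetical cutoff: roughly five full matches
PEAK_AGE = 25.0  # hypothetical peak age, in line with the "mid-20s" note


def filter_and_add_age_curve(
    df: pd.DataFrame, min_90s: float = MIN_90S, peak_age: float = PEAK_AGE
) -> pd.DataFrame:
    """Keep rows with enough playing time and add a squared distance from peak age."""
    # NaN >= min_90s evaluates to False, so rows with a missing `90s` are dropped too
    out = df[df["90s"] >= min_90s].copy()
    # A squared term lets even a linear model express "rises, peaks, then declines"
    out["age_from_peak_sq"] = (out["Age"] - peak_age) ** 2
    return out
```

Both parameters are worth tuning: the right minutes cutoff depends on how many rows survive, and the peak age could be estimated from the data instead of being fixed.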

##### `df_combined_player_shooting`

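* Several features here are rates or ratios (`Sh/90`, `SoT/90`, `G/Sh`, `G/SoT`, `SoT%`). They only make sense if the player actually played and shot enough.
* Penalty stats are biased toward certain roles, such as designated penalty takers. This needs to be kept in mind.
* Ratio stats such as `G/SoT`, `G/Sh`, `SoT%`, and `npxG/Sh` have missing values in the summary above, most likely for players without shots. They should be either filled or flagged.
* Watch out for apparent overperformance in ratios like `G/Sh` or `G/SoT`. On small shot counts it could just be randomness.
* It might be useful to calculate how much a player over- or underperforms their xG. The `G-xG` and `np:G-xG` columns already hold this as season totals.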
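
The last two points could be sketched as follows. This is an illustration only: `add_finishing_features`, the new column names, and the `min_shots` cutoff are assumptions, while `Gls`, `xG`, `Sh`, `G/Sh`, and `90s` are columns from the shooting summary above.

```python
import numpy as np
import pandas as pd


def add_finishing_features(df: pd.DataFrame, min_shots: int = 20) -> pd.DataFrame:
    """Add a per-90 goals-minus-xG delta and a sample-size-aware G/Sh."""
    out = df.copy()
    # Guard against division by zero for rows without recorded minutes
    nineties = out["90s"].replace(0, np.nan)
    # Positive values mean more goals than the shot quality predicted
    out["G_minus_xG_per90"] = (out["Gls"] - out["xG"]) / nineties
    # Blank out G/Sh where the ratio rests on too few shots (hypothetical cutoff)
    out["G_per_Sh_reliable"] = out["G/Sh"].where(out["Sh"] >= min_shots)
    # Explicit flag so a model can tell "unknown" apart from a real value
    out["G_per_Sh_is_missing"] = out["G_per_Sh_reliable"].isna()
    return out
```

Masking instead of dropping keeps the row available for other features, and the flag column preserves the information that the ratio was unreliable.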

##### `df_combined_player_passing`

* Pass completion percentage says little if a player barely passed the ball, so low-attempt cases are better ignored.
* Breaking passing down by distance (short, medium, long) would give a clearer picture.
* Progressive passes and passes into the final third can be used to estimate passing risk or intent.
* Raw pass counts should be scaled by minutes played.
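
One way these passing notes might translate into features is sketched below. The helper name, the new column names, and the `min_attempts` cutoff are hypothetical; `Att`, `PrgP`, `Cmp%`, and `90s` come from the passing summary above.

```python
import numpy as np
import pandas as pd


def add_passing_features(df: pd.DataFrame, min_attempts: int = 100) -> pd.DataFrame:
    """Scale pass volume by playing time and express progression as a share of attempts."""
    out = df.copy()
    # Replace zeros so that a division yields NaN instead of inf
    nineties = out["90s"].replace(0, np.nan)
    attempts = out["Att"].replace(0, np.nan)
    out["Att_per90"] = out["Att"] / nineties
    # Share of attempted passes that were progressive: a rough proxy for passing intent
    out["PrgP_share"] = out["PrgP"] / attempts
    # Completion percentage is kept only where enough passes back it up
    out["Cmp_pct_reliable"] = out["Cmp%"].where(out["Att"] >= min_attempts)
    return out
```

Using a share instead of a raw count separates how often a player passes from how ambitious those passes are.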

##### `df_combined_player_passing_types`

* Some of these stats are very sparse and depend heavily on position. They need careful handling, or gaps can simply be filled with zeros.
* They could help identify roles, like fullbacks who cross a lot or midfielders who switch play.
* A few columns might need log-scaling or binning because of their skewed distributions.
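
A hedged sketch of the zero-fill and log-scaling idea: `log_scale_sparse_counts` and the default column selection are assumptions for illustration, with column names taken from the summary output above. Zero-filling is only appropriate if a missing count really means the event never happened.

```python
import numpy as np
import pandas as pd

SPARSE_COLUMNS = ("TB", "Sw", "Crs", "TI", "CK")  # hypothetical selection of sparse pass types


def log_scale_sparse_counts(
    df: pd.DataFrame, columns: tuple[str, ...] = SPARSE_COLUMNS
) -> pd.DataFrame:
    """Zero-fill sparse count columns and add log1p-scaled copies."""
    out = df.copy()
    for column in columns:
        # Assumption: a missing count means the player never made that kind of pass
        filled = out[column].fillna(0)
        # log1p(x) = log(1 + x): zeros stay at zero, large counts are compressed
        out[f"{column}_log1p"] = np.log1p(filled)
    return out
```

`log1p` is used instead of a plain logarithm because these columns contain many zeros, for which `log` is undefined.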

##### `df_combined_player_gca`

* GCA and SCA totals are raw counts, so they depend heavily on minutes and need to be normalized (`SCA90` and `GCA90` already do this for the totals).
* Some columns could be used to group players by creative style (e.g., dribbling vs. passing).
* These features could be strong signals for value, but they are probably noisy too.
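
Grouping by creative style could start from the share of each shot-creating action type, as in this sketch. The function name, the `min_sca` cutoff, and the output column names are hypothetical, and the code assumes that the six type columns from the summary above add up to `SCA`.

```python
import pandas as pd

SCA_TYPES = ["PassLive", "PassDead", "TO", "Sh", "Fld", "Def"]


def creation_style(df: pd.DataFrame, min_sca: int = 10) -> pd.DataFrame:
    """Return per-type SCA shares and the dominant type for players with enough actions."""
    # Restrict to rows with enough actions so shares are not driven by one or two events
    eligible = df[df["SCA"] >= min_sca]
    shares = eligible[SCA_TYPES].div(eligible["SCA"], axis=0)
    out = shares.add_suffix("_share")
    # The column holding the largest share names the player's main route to creating shots
    out["dominant_sca_type"] = shares.idxmax(axis=1)
    return out
```

The result keeps the original index, so it can be attached with `df.join(creation_style(df))`. Rows below the cutoff then carry missing values, which fits the idea of flagging low-sample players instead of guessing.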

##### `df_combined_player_defense`

* Tackle and block counts can say more about team tactics than about player quality, so they can be misleading.
* Prefer tackle win rate or other efficiency stats over raw totals.
* Errors or possessions lost can actually be higher for important players, so they are not always a bad sign.
* It could be worth clustering defensive styles based on these columns.
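
A plain win rate (`TklW / Tkl`) is unstable for players with only a handful of tackles. The sketch below uses additive smoothing toward the pooled rate, which is a swapped-in technique and not something the notebook prescribes. `smoothed_tackle_win_rate` and `prior_strength` are hypothetical names, and the prior weight of 20 tackles is an arbitrary starting point.

```python
import numpy as np
import pandas as pd


def smoothed_tackle_win_rate(df: pd.DataFrame, prior_strength: float = 20.0) -> pd.Series:
    """Tackle win rate shrunk toward the pooled rate, weighted by sample size."""
    total_tackles = df["Tkl"].sum()
    if total_tackles == 0:
        # No tackles anywhere: the rate is undefined for every row
        return pd.Series(np.nan, index=df.index)
    pooled_rate = df["TklW"].sum() / total_tackles
    # Acts as if every player had `prior_strength` extra tackles won at the pooled rate
    return (df["TklW"] + prior_strength * pooled_rate) / (df["Tkl"] + prior_strength)
```

A player with zero tackles receives exactly the pooled rate, while a player with hundreds of tackles stays close to their raw rate. This also avoids the missing values that a raw ratio produces when the denominator is zero.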

##### `df_combined_player_possession`

* Carries and touches indicate how involved a player is with the ball.
* Miscontrols and dispossessions often rise with involvement, so they should be divided by touches or carries.
* Combining this with the other datasets helps show whether a player is just busy or actually efficient.
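
Combining datasets might look like the sketch below, which joins possession and passing on player and season and expresses ball losses per 100 touches. The function and new column names are hypothetical. The join key is also an assumption: a player who changed clubs mid-season may appear more than once, in which case a further key column would be needed.

```python
import numpy as np
import pandas as pd


def involvement_vs_losses(possession: pd.DataFrame, passing: pd.DataFrame) -> pd.DataFrame:
    """Join pass accuracy onto possession data and add ball losses per 100 touches."""
    keys = ["Player", "Season"]
    merged = possession.merge(
        passing[keys + ["Cmp%"]],
        on=keys,
        how="left",
        # Fails loudly if a player appears twice in one season on either side
        validate="one_to_one",
    )
    # Replace zeros so that a division yields NaN instead of inf
    touches = merged["Touches"].replace(0, np.nan)
    merged["losses_per_100_touches"] = 100 * (merged["Mis"] + merged["Dis"]) / touches
    return merged
```

Reading `losses_per_100_touches` next to `Cmp%` gives a first impression of whether high involvement comes with care on the ball.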

#### General Notes

* Normalize all count stats by minutes played; otherwise comparisons between players are off.
* Handle missing values with care. Many of them are probably role-related rather than random.
* Do not use future performance when predicting current or future value, since that leaks information into the model.
* Use the `Season` column for train-test splits, especially for time-based modeling.
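
A time-based split along the `Season` column could be sketched like this. Both helpers are hypothetical. The label format (`"2223"` for 2022/23) follows the season labels used in the concatenation step, and holding out the latest season is just one reasonable choice.

```python
import pandas as pd


def season_start_year(season_label: str) -> int:
    """Turn a label such as "2223" into its starting year, 2022."""
    return 2000 + int(season_label[:2])


def split_by_season(df: pd.DataFrame, test_season: str = "2425") -> tuple[pd.DataFrame, pd.DataFrame]:
    """Train on all seasons before `test_season`, test on `test_season` itself."""
    # astype(str) also covers labels that were re-read from CSV as integers
    start_years = df["Season"].astype(str).map(season_start_year)
    cutoff = season_start_year(test_season)
    train = df[start_years < cutoff]
    test = df[start_years == cutoff]
    return train, test
```

Splitting by season instead of at random keeps every training row strictly earlier than every test row, which is what guards against the leakage mentioned above.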


### Additional thoughts and considerations

* Add team strength data like league position, team Elo, or average xG. Stats can look inflated or deflated depending on the club.
* Consider player role and system. A player in a high-pressing or possession-heavy team will show different patterns.
* Contract info would be useful. Years left, salary, or buyout clauses can influence market value.
* Transfer history and rumors can help. Past fees and interest from other clubs can shift value quickly.
* National team appearances are a good signal. Being called up regularly usually reflects a higher reputation.
* Split minutes by competition. Performing in the Champions League is not the same as performing in domestic league games.
* Add injury history or availability data. Players who miss many games are harder to value based on stats alone.
* Recent form compared to the season average can show whether a player is trending up or down.
* External ratings like Transfermarkt or maybe SofaScore can serve as a proxy for public or expert perception.
* Social media or media coverage might help as an additional signal. More exposure can increase market attention.
* Creating player similarity groups might help compare profiles and estimate an expected value range.