# Complete Data Audit - Thesis Datasets

Independent analysis of all parquet files in thesis_data.

**Files analyzed:**
1. `raw_data_twelve/Twelve/male_transfer_model.parquet` (115 MB) - Main transfer performance data
2. `tm_data/transfer_history_all.parquet` (7.7 MB) - Transfermarkt transfer history
3. `raw_data_twelve/Wyscout/players_wyscout.parquet` (2.6 MB) - Player metadata
4. `raw_data_twelve/Transfermarkt/wy_tm_players_mapping.parquet` (2.0 MB) - Player ID mapping
5. `raw_data_twelve/Transfermarkt/tm_teams.parquet` (131 KB) - Team metadata
6. `raw_data_twelve/Wyscout/competitions_wyscout.parquet` (38 KB) - Competition metadata
7. `raw_data_twelve/Transfermarkt/tm_league_links.parquet` (16 KB) - League links

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)

BASE_PATH = "../../thesis_data/"

# Load all datasets
df_transfers = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Twelve/male_transfer_model.parquet")
df_tm_history = pd.read_parquet(f"{BASE_PATH}tm_data/transfer_history_all.parquet")
df_players = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Wyscout/players_wyscout.parquet")
df_mapping = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Transfermarkt/wy_tm_players_mapping.parquet")
df_teams = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Transfermarkt/tm_teams.parquet")
df_competitions = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Wyscout/competitions_wyscout.parquet")
df_leagues = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Transfermarkt/tm_league_links.parquet")

print("All datasets loaded successfully!")

All datasets loaded successfully!


---
## 1. Dataset Overview

In [2]:
datasets = {
    'male_transfer_model': df_transfers,
    'transfer_history_all': df_tm_history,
    'players_wyscout': df_players,
    'wy_tm_players_mapping': df_mapping,
    'tm_teams': df_teams,
    'competitions_wyscout': df_competitions,
    'tm_league_links': df_leagues
}

overview = pd.DataFrame([
    {
        'dataset': name,
        'rows': df.shape[0],
        'columns': df.shape[1],
        'memory_mb': df.memory_usage(deep=True).sum() / 1e6,
        'duplicates': df.duplicated().sum(),
        'missing_pct': df.isna().mean().mean() * 100
    }
    for name, df in datasets.items()
])

overview.round(2)

Unnamed: 0,dataset,rows,columns,memory_mb,duplicates,missing_pct
0,male_transfer_model,78191,362,150.21,0,8.11
1,transfer_history_all,363943,22,304.54,0,29.83
2,players_wyscout,50000,14,40.73,0,1.8
3,wy_tm_players_mapping,171892,2,1.38,0,0.0
4,tm_teams,5890,3,0.87,0,0.0
5,competitions_wyscout,1793,18,1.02,0,0.0
6,tm_league_links,399,3,0.1,0,0.0


---
## 2. male_transfer_model.parquet (Main Dataset)

This is the core dataset with player performance before and after transfers.

In [3]:
print(f"Shape: {df_transfers.shape}")
print(f"\nColumn types:")
print(df_transfers.dtypes.value_counts())
print(f"\nColumns:")
print(df_transfers.columns.tolist()[:30], "...")

Shape: (78191, 362)

Column types:
float32    342
int32       10
object       7
int16        2
float64      1
Name: count, dtype: int64

Columns:
['player_id', 'from_competition', 'to_competition', 'from_team_id', 'to_team_id', 'from_season', 'to_season', 'last_played_date', 'first_played_date', 'from_position', 'from_Minutes', 'from_Hold-up play', 'from_Involvement', 'from_Providing teammates', 'from_Aerial threat', 'from_Poaching', 'from_Run quality', 'from_Pressing', 'from_Finishing', 'from_Box threat', 'from_Dribbling', 'from_Active defence', 'from_Progression', 'from_Intelligent defence', 'from_Defensive heading', 'from_Passing quality', 'from_Winning duels', 'from_Composure', 'from_Effectiveness', 'from_Territorial dominance'] ...


In [4]:
# Identify column prefixes to understand structure
cols = df_transfers.columns.tolist()
from_cols = [c for c in cols if c.startswith('from_')]
to_cols = [c for c in cols if c.startswith('to_')]
other_cols = [c for c in cols if not c.startswith('from_') and not c.startswith('to_')]

print(f"from_* columns: {len(from_cols)}")
print(f"to_* columns: {len(to_cols)}")
print(f"Other columns: {len(other_cols)}")
print(f"\nOther columns: {other_cols}")

from_* columns: 176
to_* columns: 176
Other columns: 10

Other columns: ['player_id', 'last_played_date', 'first_played_date', 'competition_start_date', 'birth_date', 'short_name', 'player_season_age', 'team_id_to', 'competition_to', 'season_to']


In [5]:
# Top 20 columns with most missing values
missing = df_transfers.isna().mean().sort_values(ascending=False)
print("Top 20 columns with most missing values:")
missing.head(20)

Top 20 columns with most missing values:


from_Territorial dominance                                          0.609213
from_z_score_Defensive line height (m)                              0.609213
from_Defensive area (m^2)                                           0.609213
from_z_score_Opposition xT into defensive area                      0.609213
from_z_score_Opposition xT from defensive area                      0.609213
from_Opposition pass success % into defensive area                  0.609213
from_Opposition progressive passes from defensive area %            0.609213
from_Opposition xG after defensive action                           0.609213
from_Opposition xG from defensive area                              0.609213
from_Opposition xT from defensive area                              0.609213
from_Chance prevention                                              0.609213
from_Opposition xT into defensive area                              0.609213
from_z_score_Opposition xG from defensive area                      0.609213
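
The identical missing share (0.609213) across these columns suggests that they are absent together on the same rows. A minimal sketch of how one might confirm this, using a toy frame rather than `df_transfers`; the column names and the helper `same_null_rows` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_transfers; the column names are hypothetical.
toy = pd.DataFrame({
    "from_metric_a": [1.0, np.nan, np.nan, 4.0],
    "from_metric_b": [0.5, np.nan, np.nan, 0.1],
    "from_metric_c": [1.0, 2.0, np.nan, 4.0],
})

# Group column names by their (rounded) missing share.
missing_rate = toy.isna().mean().round(6)
blocks = {}
for col, rate in missing_rate.items():
    blocks.setdefault(rate, []).append(col)

def same_null_rows(frame, columns):
    """Return True when all `columns` are missing on exactly the same rows."""
    masks = frame[columns].isna()
    return bool(masks.eq(masks.iloc[:, 0], axis=0).all().all())

for rate, cols in blocks.items():
    print(f"{rate:.3f}: {cols} (same rows: {same_null_rows(toy, cols)})")
```

If the real columns pass this check, they can be treated as one block that is either fully present or fully absent for a row.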

In [6]:
# Key identifier columns
key_cols = ['player_id', 'from_competition', 'to_competition', 
            'from_team_id', 'to_team_id', 'from_season', 'to_season',
            'from_position', 'to_position']

print("Key columns missing values:")
df_transfers[key_cols].isna().mean()

Key columns missing values:


player_id           0.0
from_competition    0.0
to_competition      0.0
from_team_id        0.0
to_team_id          0.0
from_season         0.0
to_season           0.0
from_position       0.0
to_position         0.0
dtype: float64

In [7]:
# Season distribution
print("From Season distribution:")
print(df_transfers['from_season'].value_counts().sort_index())
print(f"\nSeason range: {df_transfers['from_season'].min()} - {df_transfers['from_season'].max()}")

From Season distribution:
from_season
2018     7579
2019     7285
2020     9836
2021    11421
2022    12918
2023    16114
2024    12176
2025      862
Name: count, dtype: int64

Season range: 2018 - 2025


In [8]:
# Positions
print("From Position distribution:")
print(df_transfers['from_position'].value_counts())
print("\nTo Position distribution:")
print(df_transfers['to_position'].value_counts())

From Position distribution:
from_position
Midfielder          19853
Central Defender    16903
Full Back           13653
Winger              11508
Striker             10761
Goalkeeper           5513
Name: count, dtype: int64

To Position distribution:
to_position
Midfielder          19858
Central Defender    17130
Full Back           13708
Winger              11207
Striker             10776
Goalkeeper           5512
Name: count, dtype: int64


In [9]:
# Check uniqueness at different granularities
granularities = [
    ['player_id'],
    ['player_id', 'from_season', 'to_season'],
    ['player_id', 'from_competition', 'to_competition', 'from_season', 'to_season'],
    ['player_id', 'from_competition', 'to_competition', 'from_season', 'to_season', 'from_position', 'to_position']
]

print("Uniqueness analysis:")
for g in granularities:
    n_unique = df_transfers[g].drop_duplicates().shape[0]
    print(f"  {' × '.join(g)}: {n_unique:,} unique ({n_unique/len(df_transfers)*100:.1f}%)")

Uniqueness analysis:
  player_id: 41,468 unique (53.0%)
  player_id × from_season × to_season: 66,779 unique (85.4%)
  player_id × from_competition × to_competition × from_season × to_season: 66,976 unique (85.7%)
  player_id × from_competition × to_competition × from_season × to_season × from_position × to_position: 78,191 unique (100.0%)
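
Only the last combination is unique per row, so it is the natural candidate for a row key. A small sketch of how such a key could be asserted before any join; `is_unique_key` is a hypothetical helper and the frame is toy data:

```python
import pandas as pd

def is_unique_key(frame: pd.DataFrame, key: list[str]) -> bool:
    """Return True when no two rows share the same values in `key`."""
    return not frame.duplicated(subset=key).any()

# Toy data: the same player and season appears under two positions.
toy = pd.DataFrame({
    "player_id": [1, 1, 2],
    "from_season": [2020, 2020, 2021],
    "from_position": ["Winger", "Striker", "Midfielder"],
})

print(is_unique_key(toy, ["player_id", "from_season"]))
print(is_unique_key(toy, ["player_id", "from_season", "from_position"]))
```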


In [10]:
# CRITICAL: How many are actual transfers vs promotions/relegations?
same_team = df_transfers['from_team_id'] == df_transfers['to_team_id']
same_comp = df_transfers['from_competition'] == df_transfers['to_competition']
same_season = df_transfers['from_season'] == df_transfers['to_season']

print("Transfer type breakdown:")
print(f"  Different team (real transfers): {(~same_team).sum():,} ({(~same_team).mean()*100:.1f}%)")
print(f"  Same team, different competition: {(same_team & ~same_comp).sum():,} ({(same_team & ~same_comp).mean()*100:.1f}%)")
print(f"  Same team, same competition: {(same_team & same_comp).sum():,} ({(same_team & same_comp).mean()*100:.1f}%)")
print(f"\nOf same team, different competition:")
print(f"  Different season (promotion/relegation): {(same_team & ~same_comp & ~same_season).sum():,}")
print(f"  Same season (multi-league): {(same_team & ~same_comp & same_season).sum():,}")

Transfer type breakdown:
  Different team (real transfers): 56,697 (72.5%)
  Same team, different competition: 21,494 (27.5%)
  Same team, same competition: 0 (0.0%)

Of same team, different competition:
  Different season (promotion/relegation): 19,142
  Same season (multi-league): 2,352
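
The three boolean masks above can be folded into a single label per row, which makes later filtering explicit. A sketch under the assumption that the label names (`transfer`, `promotion_or_relegation`, `multi_league`) are acceptable; they are illustrative and not part of the dataset:

```python
import numpy as np
import pandas as pd

def classify_moves(frame: pd.DataFrame) -> pd.Series:
    """Label each row using the same team/competition/season comparisons as above."""
    same_team = frame["from_team_id"] == frame["to_team_id"]
    same_comp = frame["from_competition"] == frame["to_competition"]
    same_season = frame["from_season"] == frame["to_season"]
    labels = np.select(
        [
            ~same_team,
            same_team & ~same_comp & ~same_season,
            same_team & ~same_comp & same_season,
        ],
        ["transfer", "promotion_or_relegation", "multi_league"],
        default="same_team_same_competition",
    )
    return pd.Series(labels, index=frame.index, name="move_type")

# Toy rows covering the three cases observed in the breakdown.
toy = pd.DataFrame({
    "from_team_id": [10, 20, 30],
    "to_team_id": [11, 20, 30],
    "from_competition": ["A", "A", "A"],
    "to_competition": ["A", "B", "C"],
    "from_season": [2020, 2020, 2021],
    "to_season": [2021, 2021, 2021],
})
print(classify_moves(toy).value_counts())
```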


In [11]:
# Player statistics
n_players = df_transfers['player_id'].nunique()
transfers_per_player = df_transfers.groupby('player_id').size()

print(f"Unique players: {n_players:,}")
print(f"\nRows per player distribution:")
print(transfers_per_player.describe())

Unique players: 41,468

Rows per player distribution:
count    41468.000000
mean         1.885574
std          1.370815
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max         19.000000
dtype: float64


In [12]:
# Competition coverage
from_comps = df_transfers['from_competition'].nunique()
to_comps = df_transfers['to_competition'].nunique()
all_comps = pd.concat([df_transfers['from_competition'], df_transfers['to_competition']]).nunique()

print(f"Competitions in from_competition: {from_comps}")
print(f"Competitions in to_competition: {to_comps}")
print(f"Total unique competitions: {all_comps}")

Competitions in from_competition: 320
Competitions in to_competition: 313
Total unique competitions: 327


In [13]:
# Sample of performance metrics (z-scores)
z_cols = [c for c in df_transfers.columns if 'z_score' in c.lower()]
print(f"Number of z-score columns: {len(z_cols)}")
print(f"\nSample z-score columns:")
print(z_cols[:10])

# Check z-score distributions
if z_cols:
    sample_z = z_cols[0]
    print(f"\nDistribution of {sample_z}:")
    print(df_transfers[sample_z].describe())

Number of z-score columns: 150

Sample z-score columns:
['from_z_score_Aerials per 90', 'from_z_score_Aerials won %', 'from_z_score_Aerials won per 90', 'from_z_score_Assists per 90', 'from_z_score_Attacking aerials won %', 'from_z_score_Attacking aerials won per 90', 'from_z_score_Ball progression (xT) per 90', 'from_z_score_Ball recoveries per 90', 'from_z_score_Ball runs (xT) per 90', 'from_z_score_Box entries per 90']

Distribution of from_z_score_Aerials per 90:
count    78186.000000
mean         0.023071
std          1.005709
min         -3.059211
25%         -0.686467
50%         -0.129804
75%          0.581117
max          7.364971
Name: from_z_score_Aerials per 90, dtype: float64
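
The sample column has a mean near 0 and a standard deviation near 1, but a maximum above 7. A hedged sketch of how the same check could be run across all z-score columns at once; the threshold of 5 is an arbitrary assumption and the frame is toy data:

```python
import numpy as np
import pandas as pd

def zscore_summary(frame: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Per z-score column: mean, std, and count of values with |z| above `threshold`."""
    z_cols = [c for c in frame.columns if "z_score" in c.lower()]
    z = frame[z_cols]
    return pd.DataFrame({
        "mean": z.mean(),
        "std": z.std(),
        "n_extreme": (z.abs() > threshold).sum(),
    })

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "from_z_score_x": [-1.0, 0.0, 1.0, 7.0],
    "from_z_score_y": [0.5, -0.5, 0.0, np.nan],
    "from_Minutes": [900, 1200, 300, 450],
})
summary = zscore_summary(toy)
print(summary)
```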


---
## 3. transfer_history_all.parquet (Transfermarkt History)

In [14]:
print(f"Shape: {df_tm_history.shape}")
print(f"\nColumns:")
print(df_tm_history.columns.tolist())
print(f"\nColumn types:")
print(df_tm_history.dtypes)

Shape: (363943, 22)

Columns:
['wy_player_id', 'player_name', 'player_first_name', 'player_last_name', 'player_short_name', 'player_id', 'team_id_from', 'team_name_from', 'team_id_to', 'team_name_to', 'competition_id_from', 'competition_name_from', 'competition_country_from', 'competition_id_to', 'competition_name_to', 'competition_country_to', 'age_at_transfer', 'transfer_fee', 'transfer_value', 'date', 'remaining_contract_period', 'contract_until_date']

Column types:
wy_player_id                          int64
player_name                          object
player_first_name                    object
player_last_name                     object
player_short_name                    object
player_id                             int64
team_id_from                          int64
team_name_from                       object
team_id_to                            int64
team_name_to                         object
competition_id_from                  object
competition_name_from                obje

In [15]:
df_tm_history.head(10)

Unnamed: 0,wy_player_id,player_name,player_first_name,player_last_name,player_short_name,player_id,team_id_from,team_name_from,team_id_to,team_name_to,competition_id_from,competition_name_from,competition_country_from,competition_id_to,competition_name_to,competition_country_to,age_at_transfer,transfer_fee,transfer_value,date,remaining_contract_period,contract_until_date
0,274255,,,,,281017,3804,Kelty Hearts FC,3024,East Fife FC,SC3,Scottish League One,Scotland,SC4,Scottish League Two,Scotland,29.0,0.0,0,2023-06-30,,2027-05-30
1,274255,,,,,281017,3024,East Fife FC,3804,Kelty Hearts FC,SC3,Scottish League One,Scotland,SC5L,Scottish Lowland Football League,Scotland,27.0,,0,2021-03-30,0.0,2023-05-30
2,274255,,,,,281017,3804,Kelty Hearts FC,3024,East Fife FC,SC5L,Scottish Lowland Football League,Scotland,SC3,Scottish League One,Scotland,27.0,,0,2021-03-15,1172.0,2021-03-30
3,274255,,,,,281017,2451,Inverness Caledonian Thistle FC,3804,Kelty Hearts FC,SC2,Scottish Championship,Scotland,,,,25.0,0.0,0,2019-06-30,,2024-05-30
4,274255,,,,,281017,1191,Falkirk FC,2451,Inverness Caledonian Thistle FC,SC2,Scottish Championship,Scotland,SC2,Scottish Championship,Scotland,23.0,0.0,100000,2017-12-31,,2019-05-30
5,274255,,,,,281017,3024,East Fife FC,1191,Falkirk FC,SC4,Scottish League Two,Scotland,SC2,Scottish Championship,Scotland,22.0,,0,2016-04-30,0.0,
6,274255,,,,,281017,1191,Falkirk FC,3024,East Fife FC,SC2,Scottish Championship,Scotland,SC4,Scottish League Two,Scotland,21.0,,0,2016-01-15,,2016-04-30
7,274255,,,,,281017,3024,East Fife FC,1191,Falkirk FC,SC4,Scottish League Two,Scotland,SC2,Scottish Championship,Scotland,21.0,,0,2016-01-14,502.0,
8,274255,,,,,281017,29832,Leven United AFC,3024,East Fife FC,,,,SC3,Scottish League One,Scotland,19.0,,0,2013-07-25,,2017-05-30
9,274295,,,,,60639,3030,Montrose FC,3016,Forfar Athletic FC,SC3,Scottish League One,Scotland,SC4,Scottish League Two,Scotland,32.0,0.0,0,2023-06-30,,2026-05-30


In [16]:
print("Missing values:")
print(df_tm_history.isna().mean().sort_values(ascending=False))

Missing values:
player_name                  0.699667
player_first_name            0.699667
player_last_name             0.699667
player_short_name            0.699667
remaining_contract_period    0.634775
transfer_fee                 0.609573
competition_name_from        0.416639
competition_country_from     0.416639
contract_until_date          0.414109
competition_name_to          0.355333
competition_country_to       0.355333
competition_id_from          0.305029
competition_id_to            0.242153
team_name_from               0.008042
team_name_to                 0.005853
age_at_transfer              0.000179
date                         0.000179
team_id_to                   0.000000
team_id_from                 0.000000
transfer_value               0.000000
player_id                    0.000000
wy_player_id                 0.000000
dtype: float64


In [17]:
# Check transfer_fee and transfer_value
if 'transfer_fee' in df_tm_history.columns:
    print("Transfer Fee:")
    print(df_tm_history['transfer_fee'].describe())
    print(f"\nZero values: {(df_tm_history['transfer_fee'] == 0).sum()}")
    print(f"Null values: {df_tm_history['transfer_fee'].isna().sum()}")

if 'transfer_value' in df_tm_history.columns:
    print("\nTransfer Value (Market Value):")
    print(df_tm_history['transfer_value'].describe())
    print(f"\nZero values: {(df_tm_history['transfer_value'] == 0).sum()}")

Transfer Fee:
count    1.420930e+05
mean     5.174800e+05
std      3.343274e+06
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.800000e+08
Name: transfer_fee, dtype: float64

Zero values: 119547
Null values: 221850

Transfer Value (Market Value):
count    3.639430e+05
mean     5.505689e+05
std      2.404151e+06
min      0.000000e+00
25%      0.000000e+00
50%      1.000000e+05
75%      3.500000e+05
max      1.800000e+08
Name: transfer_value, dtype: float64

Zero values: 134708
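
Because `transfer_fee` contains both zeros and nulls, it may help to keep the two apart rather than fill nulls with 0. A minimal sketch on toy values; the status labels and the helper `fee_status` are hypothetical:

```python
import numpy as np
import pandas as pd

def fee_status(fee: pd.Series) -> pd.Series:
    """Label each fee as 'paid' (> 0), 'zero' (recorded as 0) or 'unknown' (null)."""
    status = pd.Series("paid", index=fee.index, name="fee_status")
    status[fee == 0] = "zero"
    status[fee.isna()] = "unknown"
    return status

# Toy fees: one zero, two nulls, one positive fee.
toy_fee = pd.Series([0.0, np.nan, 2_500_000.0, np.nan])
print(fee_status(toy_fee).value_counts())
```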


In [18]:
# Date range
# The transfer date column in this file is 'date' (there is no
# 'date_of_transfer' or 'season' column), so the calendar year is derived from it.
if 'date' in df_tm_history.columns:
    transfer_dates = pd.to_datetime(df_tm_history['date'], errors='coerce')
    print(f"Date range: {transfer_dates.min()} to {transfer_dates.max()}")
    print("\nTransfers per calendar year:")
    print(transfer_dates.dt.year.value_counts().sort_index())

In [19]:
# Player coverage
if 'player_id' in df_tm_history.columns:
    n_players_tm = df_tm_history['player_id'].nunique()
    print(f"Unique players in TM history: {n_players_tm:,}")
    print(f"Transfers per player: {len(df_tm_history) / n_players_tm:.1f}")

Unique players in TM history: 34,533
Transfers per player: 10.5


---
## 4. players_wyscout.parquet

In [20]:
print(f"Shape: {df_players.shape}")
print(f"\nColumns:")
print(df_players.columns.tolist())
df_players.head()

Shape: (50000, 14)

Columns:
['player_id', 'short_name', 'first_name', 'last_name', 'name', 'birth_date', 'height', 'weight', 'passport', 'birth_country', 'image_url', 'gender', 'foot', 'role']


Unnamed: 0,player_id,short_name,first_name,last_name,name,birth_date,height,weight,passport,birth_country,image_url,gender,foot,role
0,2,G. Coutinho,Gino,Coutinho,Gino Coutinho,1982-08-05,180,78,Suriname,Netherlands,https://cdn5.wyscout.com/photos/players/public...,male,right,Goalkeeper
1,3,M. de Zwart,Martijn,de Zwart,Martijn de Zwart,1990-11-08,181,0,Netherlands,Netherlands,https://cdn5.wyscout.com/photos/players/public...,male,right,Goalkeeper
2,4,R. Zwinkels,Robert,Zwinkels,Robert Zwinkels,1983-05-04,186,82,Netherlands,Netherlands,https://cdn5.wyscout.com/photos/players/public...,male,right,Goalkeeper
3,5,A. Ammi,Ahmed,Ammi,Ahmed Ammi,1981-01-19,179,72,Netherlands,Morocco,https://cdn5.wyscout.com/photos/players/public...,male,right,Defender
4,6,T. de Rijk,Tim,de Rijk,Tim de Rijk,1992-03-31,0,0,Netherlands,Netherlands,https://cdn5.wyscout.com/photos/players/public...,male,left,Defender


In [21]:
print("Missing values:")
print(df_players.isna().mean().sort_values(ascending=False))

Missing values:
foot             0.25156
passport         0.00002
player_id        0.00000
short_name       0.00000
first_name       0.00000
last_name        0.00000
name             0.00000
birth_date       0.00000
height           0.00000
weight           0.00000
birth_country    0.00000
image_url        0.00000
gender           0.00000
role             0.00000
dtype: float64


In [22]:
# Gender distribution (important! - main dataset is male only)
print("Gender distribution:")
print(df_players['gender'].value_counts())

Gender distribution:
gender
male      47146
female     2854
Name: count, dtype: int64


In [23]:
# Height/Weight issues (0 = missing)
print(f"Height = 0: {(df_players['height'] == 0).sum()} ({(df_players['height'] == 0).mean()*100:.1f}%)")
print(f"Weight = 0: {(df_players['weight'] == 0).sum()} ({(df_players['weight'] == 0).mean()*100:.1f}%)")

# Non-zero ranges (zeros excluded; implausible values such as 1 cm or 255 cm still remain)
valid_height = df_players.loc[df_players['height'] > 0, 'height']
valid_weight = df_players.loc[df_players['weight'] > 0, 'weight']
print(f"\nValid height range: {valid_height.min()} - {valid_height.max()} cm")
print(f"Valid weight range: {valid_weight.min()} - {valid_weight.max()} kg")

Height = 0: 9073 (18.1%)
Weight = 0: 12019 (24.0%)

Valid height range: 1 - 255 cm
Valid weight range: 45 - 110 kg
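
Excluding zeros is not enough here, since a height of 1 cm or 255 cm is still implausible. A sketch of one way to turn both the 0 sentinel and out-of-range values into `NaN`; the bounds below are assumptions chosen for illustration, not values taken from the data:

```python
import pandas as pd

# Hypothetical plausibility bounds; adjust as needed.
HEIGHT_CM = (140, 215)
WEIGHT_KG = (45, 120)

def clean_measure(values: pd.Series, low: float, high: float) -> pd.Series:
    """Set values outside [low, high] (including the 0 sentinel) to NaN."""
    return values.where(values.between(low, high)).astype("float64")

# Toy data mirroring the problems seen above: zeros, 1 cm and 255 cm.
toy = pd.DataFrame({
    "height": [180, 0, 1, 255, 186],
    "weight": [78, 0, 82, 72, 0],
})
toy["height_clean"] = clean_measure(toy["height"], *HEIGHT_CM)
toy["weight_clean"] = clean_measure(toy["weight"], *WEIGHT_KG)
print(toy)
```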


In [24]:
# Birth date / Age
df_players['birth_date'] = pd.to_datetime(df_players['birth_date'], errors='coerce')
df_players['age'] = (pd.Timestamp.now() - df_players['birth_date']).dt.days / 365.25

print("Age distribution (current):")
print(df_players['age'].describe())

Age distribution (current):
count    50000.000000
mean        38.489520
std          4.719886
min         16.659822
25%         34.970568
50%         37.744011
75%         41.223819
max         69.711157
Name: age, dtype: float64


---
## 5. wy_tm_players_mapping.parquet

In [25]:
print(f"Shape: {df_mapping.shape}")
print(f"\nColumns: {df_mapping.columns.tolist()}")
print(f"\nTypes:\n{df_mapping.dtypes}")
df_mapping.head()

Shape: (171892, 2)

Columns: ['wy_id', 'tm_id']

Types:
wy_id    int32
tm_id    int32
dtype: object


Unnamed: 0,wy_id,tm_id
0,896402,939615
1,886722,963878
2,818118,992686
3,809491,983989
4,808262,668547


In [26]:
# Check 1:1 mapping
wy_per_tm = df_mapping.groupby('tm_id')['wy_id'].nunique()
tm_per_wy = df_mapping.groupby('wy_id')['tm_id'].nunique()

print(f"WY IDs per TM ID - max: {wy_per_tm.max()}, mean: {wy_per_tm.mean():.2f}")
print(f"TM IDs per WY ID - max: {tm_per_wy.max()}, mean: {tm_per_wy.mean():.2f}")

if wy_per_tm.max() > 1 or tm_per_wy.max() > 1:
    print("\n⚠️  WARNING: Not a 1:1 mapping!")

WY IDs per TM ID - max: 3, mean: 1.00
TM IDs per WY ID - max: 1, mean: 1.00

⚠️  WARNING: Not a 1:1 mapping!

---
## 6. Dataset Cross-Coverage Analysis

In [27]:
# Player ID coverage across datasets
transfers_players = set(df_transfers['player_id'].unique())
wyscout_players = set(df_players['player_id'].unique())
mapping_wy = set(df_mapping['wy_id'].unique())
mapping_tm = set(df_mapping['tm_id'].unique())

# Check if tm_history has player_id
if 'player_id' in df_tm_history.columns:
    tm_history_players = set(df_tm_history['player_id'].dropna().unique())
else:
    tm_history_players = set()

print("=== Player ID Coverage ===")
print(f"Transfers (Twelve): {len(transfers_players):,}")
print(f"Wyscout players: {len(wyscout_players):,}")
print(f"Mapping (WY side): {len(mapping_wy):,}")
print(f"Mapping (TM side): {len(mapping_tm):,}")
print(f"TM History: {len(tm_history_players):,}")

print("\n=== Overlaps ===")
print(f"Transfers ∩ Wyscout: {len(transfers_players & wyscout_players):,} ({len(transfers_players & wyscout_players)/len(transfers_players)*100:.1f}% of transfers)")
print(f"Transfers ∩ Mapping(WY): {len(transfers_players & mapping_wy):,} ({len(transfers_players & mapping_wy)/len(transfers_players)*100:.1f}% of transfers)")

=== Player ID Coverage ===
Transfers (Twelve): 41,468
Wyscout players: 50,000
Mapping (WY side): 171,892
Mapping (TM side): 171,198
TM History: 34,533

=== Overlaps ===
Transfers ∩ Wyscout: 8,163 (19.7% of transfers)
Transfers ∩ Mapping(WY): 34,911 (84.2% of transfers)


In [28]:
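# Team ID coverage
# Note: this compares team IDs from the Twelve/Wyscout data directly against
# Transfermarkt tm_id values, so a low overlap may reflect different ID namespaces.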
transfer_teams = set(df_transfers['from_team_id'].unique()) | set(df_transfers['to_team_id'].unique())
tm_teams = set(df_teams['tm_id'].dropna().unique())

print("=== Team ID Coverage ===")
print(f"Teams in transfers: {len(transfer_teams):,}")
print(f"Teams in TM teams: {len(tm_teams):,}")
print(f"Overlap: {len(transfer_teams & tm_teams):,} ({len(transfer_teams & tm_teams)/len(transfer_teams)*100:.1f}% of transfer teams)")

=== Team ID Coverage ===
Teams in transfers: 5,746
Teams in TM teams: 5,890
Overlap: 647 (11.3% of transfer teams)


In [29]:
# Competition coverage
transfer_comps = set(df_transfers['from_competition'].unique()) | set(df_transfers['to_competition'].unique())
wyscout_comps = set(df_competitions['competition_id'].unique())

print("=== Competition Coverage ===")
print(f"Competitions in transfers: {len(transfer_comps)}")
print(f"Competitions in Wyscout metadata: {len(wyscout_comps)}")
print(f"Overlap: {len(transfer_comps & wyscout_comps)} ({len(transfer_comps & wyscout_comps)/len(transfer_comps)*100:.1f}%)")

=== Competition Coverage ===
Competitions in transfers: 327
Competitions in Wyscout metadata: 269
Overlap: 194 (59.3%)
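
The overlap counts say how many IDs match but not which ones are missing. A small sketch of a reusable report that also lists the unmatched IDs; `coverage_report` is a hypothetical helper and the inputs are toy values:

```python
def coverage_report(needed, available):
    """Compare two ID collections and list the needed IDs that have no match."""
    needed, available = set(needed), set(available)
    matched = needed & available
    share = len(matched) / len(needed) if needed else 0.0
    return {
        "matched": len(matched),
        "share": share,
        "unmatched": sorted(needed - available),
    }

# Toy IDs: two of the four needed IDs are available.
report = coverage_report([1, 2, 3, 4], [2, 4, 9])
print(report)
```

Applied to `transfer_comps` and `wyscout_comps`, the `unmatched` list would show which competition IDs lack metadata, assuming both sets hold IDs of the same type.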


---
## 7. Data Quality Issues Summary

In [30]:
issues = []

# Issue 1: Promotion/relegation in transfers
same_team_pct = (df_transfers['from_team_id'] == df_transfers['to_team_id']).mean() * 100
if same_team_pct > 5:
    issues.append(f"CRITICAL: {same_team_pct:.1f}% of transfer rows are same team (not real transfers)")

# Issue 2: Low player coverage
player_coverage = len(transfers_players & wyscout_players) / len(transfers_players) * 100
if player_coverage < 50:
    issues.append(f"WARNING: Only {player_coverage:.1f}% of transfer players have Wyscout metadata")

# Issue 3: Low team coverage
team_coverage = len(transfer_teams & tm_teams) / len(transfer_teams) * 100
if team_coverage < 50:
    issues.append(f"WARNING: Only {team_coverage:.1f}% of teams match Transfermarkt IDs")

# Issue 4: Missing values in key columns
for col in ['transfer_fee', 'transfer_value']:
    if col in df_tm_history.columns:
        missing_pct = df_tm_history[col].isna().mean() * 100
        if missing_pct > 50:
            issues.append(f"WARNING: {col} is {missing_pct:.1f}% missing in TM history")

# Issue 5: Gender mismatch
if 'gender' in df_players.columns:
    female_pct = (df_players['gender'] == 'female').mean() * 100
    if female_pct > 1:
        issues.append(f"NOTE: Wyscout players includes {female_pct:.1f}% female players (main dataset is male only)")

print("=== DATA QUALITY ISSUES ===")
for i, issue in enumerate(issues, 1):
    print(f"{i}. {issue}")

if not issues:
    print("No critical issues detected.")

=== DATA QUALITY ISSUES ===
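1. CRITICAL: 27.5% of transfer rows are same team (not real transfers)
2. WARNING: Only 19.7% of transfer players have Wyscout metadata
3. WARNING: Only 11.3% of teams match Transfermarkt IDs
4. WARNING: transfer_fee is 61.0% missing in TM history
5. NOTE: Wyscout players includes 5.7% female players (main dataset is male only)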


---
## 8. Can We Build a Complete Analysis Dataset?

Let's see what happens when we try to join everything together.

In [31]:
# Start with transfers, try to add player metadata
df_analysis = df_transfers.copy()
initial_rows = len(df_analysis)

# Check what columns already exist to avoid conflicts
existing_cols = set(df_analysis.columns)
print("Columns that already exist in transfers:", [c for c in ['birth_date', 'height', 'weight', 'foot', 'passport'] if c in existing_cols])

# Add Wyscout player info (with suffix for conflicts)
df_analysis = df_analysis.merge(
    df_players[['player_id', 'birth_date', 'height', 'weight', 'foot', 'passport']],
    on='player_id',
    how='left',
    suffixes=('', '_wy')  # Keep original, add _wy to new ones
)

# Add TM player mapping
df_analysis = df_analysis.merge(
    df_mapping.rename(columns={'wy_id': 'player_id'}),
    on='player_id',
    how='left'
)

print(f"\nStarted with: {initial_rows:,} rows")
print(f"After joins: {len(df_analysis):,} rows")

# Use the Wyscout columns (with _wy suffix if there was a conflict)
birth_col = 'birth_date_wy' if 'birth_date_wy' in df_analysis.columns else 'birth_date'
height_col = 'height_wy' if 'height_wy' in df_analysis.columns else 'height'

print(f"\nPlayer metadata coverage (from Wyscout):")
print(f"  {birth_col}: {df_analysis[birth_col].notna().mean()*100:.1f}%")
print(f"  {height_col} (non-zero): {((df_analysis[height_col].notna()) & (df_analysis[height_col] > 0)).mean()*100:.1f}%")
print(f"  tm_id (for market value lookup): {df_analysis['tm_id'].notna().mean()*100:.1f}%")

Columns that already exist in transfers: ['birth_date']

Started with: 78,191 rows
After joins: 78,191 rows

Player metadata coverage (from Wyscout):
  birth_date_wy: 21.9%
  height (non-zero): 21.8%
  tm_id (for market value lookup): 88.3%
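
The row count is unchanged after the joins, which is what a many-to-one lookup should produce. A sketch of how pandas can enforce and report this directly through the `validate` and `indicator` arguments of `merge`, shown on toy frames:

```python
import pandas as pd

# Toy stand-ins for df_transfers and df_mapping.
transfers = pd.DataFrame({"player_id": [1, 1, 2, 3]})
mapping = pd.DataFrame({"wy_id": [1, 2], "tm_id": [101, 202]})

merged = transfers.merge(
    mapping.rename(columns={"wy_id": "player_id"}),
    on="player_id",
    how="left",
    validate="many_to_one",  # raises MergeError if player_id repeats in the mapping
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

coverage = (merged["_merge"] == "both").mean()
print(f"Rows: {len(merged)}, rows with a tm_id: {coverage:.1%}")
```

With `validate` set, a duplicated key in the lookup table fails loudly instead of silently multiplying rows.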


In [32]:
# Filter to REAL transfers only (different teams)
df_real_transfers = df_analysis[
    df_analysis['from_team_id'] != df_analysis['to_team_id']
].copy()

print(f"Real transfers (different teams): {len(df_real_transfers):,}")
print(f"Unique players in real transfers: {df_real_transfers['player_id'].nunique():,}")
print(f"\nWith TM ID for market value: {df_real_transfers['tm_id'].notna().sum():,} ({df_real_transfers['tm_id'].notna().mean()*100:.1f}%)")

Real transfers (different teams): 56,697
Unique players in real transfers: 34,643

With TM ID for market value: 49,044 (86.5%)


---
## 9. Key Findings & Recommendations

1. MAIN DATASET (male_transfer_model.parquet)
   - Primary dataset for analysis
   - Contains pre/post transfer performance metrics
   - Granularity: player × from/to competition × from/to season × from/to position (the only combination that is unique per row)
   
2. CRITICAL ISSUE: 27.5% of rows (21,494) are NOT real transfers
   - Same team, different competition: 19,142 rows across seasons (promotion/relegation) and 2,352 rows within one season (multi-league)
   - Must filter these out for transfer success analysis
   
3. ID NAMESPACE MISMATCH
   - Twelve/Wyscout use one set of IDs
   - Transfermarkt uses different IDs
   - Mapping exists but coverage varies
   
4. COVERAGE GAPS
   - Player metadata: ~20% of transfer players (19.7%) appear in players_wyscout
   - Team IDs: ~11% match Transfermarkt tm_id values
   - TM market values: ~84% of transfer players have a TM ID mapping
   
5. TRANSFER HISTORY (transfer_history_all.parquet)
   - Separate TM dataset with transfer fees/values
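   - ~61% missing transfer_fee (nulls are separate from the 119,547 rows recorded with a fee of 0, so a missing fee cannot be read as a free transfer without further checks)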
   - Contains contract information

=== RECOMMENDATIONS ===

1. Always filter out same-team "transfers" for analysis
2. Use wy_tm_players_mapping to link to market values
3. Be aware of position multi-occurrence (same player, multiple positions)
4. The datasets are from different sources - expect ID mismatches
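
The first three recommendations can be sketched as one preparation step. This is an illustration on toy frames, not the pipeline used for the thesis; `build_transfer_frame` is a hypothetical helper:

```python
import pandas as pd

def build_transfer_frame(transfers: pd.DataFrame, mapping: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where the team changes, then attach tm_id via the WY-to-TM mapping."""
    real = transfers[transfers["from_team_id"] != transfers["to_team_id"]]
    return real.merge(
        mapping.rename(columns={"wy_id": "player_id"}),
        on="player_id",
        how="left",
        validate="many_to_one",  # the mapping must have at most one row per player_id
    )

# Toy data: player 1 moves once but is listed under two positions;
# player 2 stays at the same team and is filtered out.
transfers = pd.DataFrame({
    "player_id": [1, 1, 2, 3],
    "from_team_id": [10, 10, 20, 30],
    "to_team_id": [11, 11, 20, 31],
    "from_season": [2020, 2020, 2021, 2022],
    "to_season": [2021, 2021, 2022, 2023],
    "from_position": ["Winger", "Striker", "Midfielder", "Goalkeeper"],
})
mapping = pd.DataFrame({"wy_id": [1, 3], "tm_id": [101, 303]})

linked = build_transfer_frame(transfers, mapping)

# Recommendation 3: count positions per move to spot multi-position duplicates.
positions_per_move = linked.groupby(
    ["player_id", "from_season", "to_season"]
)["from_position"].nunique()
print(linked)
print(positions_per_move)
```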