# Player Metadata Review (Wyscout) 👤⚽

This notebook reviews player-level metadata from Wyscout, focusing on identifiers, demographics, and structural consistency required to link players across datasets used in the thesis.

## 0) Imports & Setup

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 180)

PATH = "../../raw_data_agust_wy/"
PATH2 = "../../raw_data_agust_12/"

## 1) Quick Data Snapshot

In [2]:
df_players = pd.read_parquet(f"{PATH}players_wyscout.parquet")
df_players.shape

(50000, 14)

In [3]:
df_players.dtypes.value_counts()

object    11
int64      3
Name: count, dtype: int64

In [4]:
df_players.head()

| | player_id | short_name | first_name | last_name | name | birth_date | height | weight | passport | birth_country | image_url | gender | foot | role |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | G. Coutinho | Gino | Coutinho | Gino Coutinho | 1982-08-05 | 180 | 78 | Suriname | Netherlands | https://cdn5.wyscout.com/photos/players/public... | male | right | Goalkeeper |
| 1 | 3 | M. de Zwart | Martijn | de Zwart | Martijn de Zwart | 1990-11-08 | 181 | 0 | Netherlands | Netherlands | https://cdn5.wyscout.com/photos/players/public... | male | right | Goalkeeper |
| 2 | 4 | R. Zwinkels | Robert | Zwinkels | Robert Zwinkels | 1983-05-04 | 186 | 82 | Netherlands | Netherlands | https://cdn5.wyscout.com/photos/players/public... | male | right | Goalkeeper |
| 3 | 5 | A. Ammi | Ahmed | Ammi | Ahmed Ammi | 1981-01-19 | 179 | 72 | Netherlands | Morocco | https://cdn5.wyscout.com/photos/players/public... | male | right | Defender |
| 4 | 6 | T. de Rijk | Tim | de Rijk | Tim de Rijk | 1992-03-31 | 0 | 0 | Netherlands | Netherlands | https://cdn5.wyscout.com/photos/players/public... | male | left | Defender |


In [5]:
df_players.isna().mean().sort_values(ascending=False).head(15)

foot             0.25156
passport         0.00002
player_id        0.00000
short_name       0.00000
first_name       0.00000
last_name        0.00000
name             0.00000
birth_date       0.00000
height           0.00000
weight           0.00000
birth_country    0.00000
image_url        0.00000
gender           0.00000
role             0.00000
dtype: float64

In [6]:
df_players["foot"].value_counts(normalize=True, dropna=False)

foot
right    0.55914
None     0.25156
left     0.16494
both     0.02436
Name: proportion, dtype: float64

In [7]:
df_players["gender"].value_counts(dropna=False)

gender
male      47146
female     2854
Name: count, dtype: int64

In [8]:
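# Count missing birth dates; if the column is datetime, also count epoch (1970-01-01) placeholders, otherwise report "n/a"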
(df_players["birth_date"].isna().sum(),
 (df_players["birth_date"] == pd.Timestamp(0)).sum() if np.issubdtype(df_players["birth_date"].dtype, np.datetime64) else "n/a")

(np.int64(0), 'n/a')
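The `'n/a'` result above means `birth_date` is not stored as a datetime column at this point, so the epoch-placeholder check never ran. A minimal sketch of how the same check could be run after parsing, using a toy series whose values are made up for illustration (they are not taken from the Wyscout file):

```python
import pandas as pd

# Toy stand-in for df_players["birth_date"]; values are illustrative only.
birth_raw = pd.Series(["1982-08-05", "1970-01-01", "not a date", None])

# Parse strings to datetime; entries that cannot be parsed become NaT.
birth_parsed = pd.to_datetime(birth_raw, errors="coerce")

# NaT count covers both original missing values and unparseable strings.
n_missing_or_unparseable = int(birth_parsed.isna().sum())

# Epoch placeholders (1970-01-01), the same sentinel the cell above looks for.
n_epoch = int((birth_parsed == pd.Timestamp(0)).sum())

print(n_missing_or_unparseable, n_epoch)
```

Running the check on the parsed series separates genuinely missing dates from placeholder dates, which a plain `isna()` on string data cannot do.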

In [9]:
(df_players["height"] == 0).sum(), (df_players["weight"] == 0).sum()

(np.int64(9073), np.int64(12019))

In [10]:
df_players[["height","weight"]].describe().loc[
    ["min","25%","50%","75%","max"]
]

| | height | weight |
|---|---|---|
| min | 0.0 | 0.0 |
| 25% | 171.0 | 60.0 |
| 50% | 179.0 | 73.0 |
| 75% | 185.0 | 79.0 |
| max | 255.0 | 110.0 |


In [11]:
# Only the last expression in a cell is displayed, so the output below is the weight summary.
df_players.loc[df_players["height"] > 0, "height"].describe()
df_players.loc[df_players["weight"] > 0, "weight"].describe()

count    37981.000000
mean        75.940813
std          7.124709
min         45.000000
25%         71.000000
50%         76.000000
75%         80.000000
max        110.000000
Name: weight, dtype: float64

In [12]:
df_players_clean = df_players.copy()

df_players_clean["height"] = df_players_clean["height"].replace(0, np.nan)
df_players_clean["weight"] = df_players_clean["weight"].replace(0, np.nan)

In [13]:
df_players_clean[["height","weight"]].isna().mean()

height    0.18146
weight    0.24038
dtype: float64

In [14]:
df_players_clean[["height","weight"]].describe().loc[
    ["min","25%","50%","75%","max"]
]

| | height | weight |
|---|---|---|
| min | 1.0 | 45.0 |
| 25% | 176.0 | 71.0 |
| 50% | 181.0 | 76.0 |
| 75% | 186.0 | 80.0 |
| max | 255.0 | 110.0 |


In [15]:
df_players_clean.loc[
    df_players_clean["height"].notna(),
    ["player_id","name","height","weight","birth_date"]
].sort_values("height").head(20)

| | player_id | name | height | weight | birth_date |
|---|---|---|---|---|---|
| 33015 | 77159 | Emiliano Tade | 1.0 | NaN | 1988-03-03 |
| 17098 | 41904 | José Rodolfo Pires Ribeiro | 18.0 | 79.0 | 1992-02-06 |
| 19356 | 47399 | Geovane Batista Loubo | 117.0 | 75.0 | 1992-01-09 |
| 26012 | 61894 | Shu O Tseng | 151.0 | NaN | 1984-09-06 |
| 13401 | 32213 | Anais Dumont | 152.0 | 50.0 | 1989-10-06 |
| 25974 | 61781 | Trudy Camilleri | 153.0 | NaN | 1991-09-16 |
| 8722 | 17930 | Anja Selensky | 153.0 | NaN | 1993-02-05 |
| 25967 | 61762 | Katrina Gorry | 154.0 | NaN | 1992-08-13 |
| 12591 | 27191 | Julie Machart-Rabanne | 154.0 | 48.0 | 1989-04-18 |
| 43379 | 100650 | Marcin Garuch | 154.0 | 55.0 | 1988-09-14 |


In [16]:
df_players_clean.loc[
    df_players_clean["weight"].notna(),
    ["player_id","name","height","weight","birth_date"]
].sort_values("weight").head(20)

| | player_id | name | height | weight | birth_date |
|---|---|---|---|---|---|
| 47192 | 109984 | Endra Prasetya Suprapto | 178.0 | 45.0 | 1981-05-01 |
| 44218 | 102507 | Olga Boichenko | 157.0 | 46.0 | 1989-01-06 |
| 12607 | 27210 | Shirley Cruz Traña | 162.0 | 46.0 | 1985-08-28 |
| 13407 | 32231 | Delphine Chatelin | 157.0 | 48.0 | 1988-05-17 |
| 13426 | 32328 | Julie Morel | 157.0 | 48.0 | 1982-08-06 |
| 8912 | 18609 | Marie-Louise Eta | 163.0 | 48.0 | 1991-07-07 |
| 10605 | 22404 | Daniela Stracchi | 160.0 | 48.0 | 1983-09-02 |
| 11285 | 24444 | Elena Cascarano | 158.0 | 48.0 | 1989-04-07 |
| 11284 | 24443 | Giulia Ambrosetti | 157.0 | 48.0 | 1993-10-17 |
| 12591 | 27191 | Julie Machart-Rabanne | 154.0 | 48.0 | 1989-04-18 |


In [17]:
bad_height_ids = [77159, 41904, 47399]

df_players_clean.loc[
    df_players_clean["player_id"].isin(bad_height_ids),
    "height"
] = np.nan

In [18]:
df_players_clean.loc[
    df_players_clean["player_id"].isin(bad_height_ids),
    ["player_id","name","height","weight","birth_date"]
]

| | player_id | name | height | weight | birth_date |
|---|---|---|---|---|---|
| 17098 | 41904 | José Rodolfo Pires Ribeiro | NaN | 79.0 | 1992-02-06 |
| 19356 | 47399 | Geovane Batista Loubo | NaN | 75.0 | 1992-01-09 |
| 33015 | 77159 | Emiliano Tade | NaN | NaN | 1988-03-03 |
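The three IDs above were nulled by hand after inspection. As an alternative, the same cleanup could be expressed as a range rule. This is only a sketch: the bounds `MIN_HEIGHT` and `MAX_HEIGHT` are hypothetical values chosen for illustration, not thresholds used in this notebook, and the toy frame copies a few heights from the sorted listing.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring some heights seen in the sorted listing.
toy = pd.DataFrame({
    "player_id": [77159, 41904, 47399, 61894, 2],
    "height": [1.0, 18.0, 117.0, 151.0, 180.0],
})

# Hypothetical plausibility bounds (assumption, not from the source).
MIN_HEIGHT, MAX_HEIGHT = 140, 230

# Flag non-missing values outside the bounds; existing NaN stays unflagged.
implausible = toy["height"].notna() & ~toy["height"].between(MIN_HEIGHT, MAX_HEIGHT)
flagged_ids = toy.loc[implausible, "player_id"].tolist()

# Null the flagged values, as the manual step does for the hard-coded IDs.
toy.loc[implausible, "height"] = np.nan

print(flagged_ids)
```

A rule like this is reproducible when the raw file is refreshed, but it should still be paired with manual inspection, since any fixed bound can also hit legitimate outliers.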


In [19]:
df_players_clean["birth_date"] = pd.to_datetime(df_players_clean["birth_date"], errors="coerce")

In [20]:
today = pd.Timestamp("today")

df_players_clean["age"] = (
    (today - df_players_clean["birth_date"]).dt.days / 365.25
)

In [21]:
df_players_clean["age"].describe()

count    50000.000000
mean        38.464880
std          4.719886
min         16.635181
25%         34.945927
50%         37.719370
75%         41.199179
max         69.686516
Name: age, dtype: float64
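Because `today` changes on every run, the age distribution above drifts over time and does not describe ages at the time of the data. A minimal sketch of the same calculation against a fixed reference date follows; `REFERENCE_DATE` is a hypothetical cut-off picked for illustration, and the three birth dates are taken from the `head()` output.

```python
import pandas as pd

# Hypothetical fixed cut-off (assumption); replace with the date relevant to the analysis.
REFERENCE_DATE = pd.Timestamp("2019-07-01")

# Birth dates of the first players shown in the head() output.
birth = pd.to_datetime(pd.Series(["1982-08-05", "1990-11-08", "1992-03-31"]))

# Same formula as the cell above, with a fixed reference instead of "today".
age_at_ref = (REFERENCE_DATE - birth).dt.days / 365.25

print(age_at_ref.round(2).tolist())
```

With a fixed reference the summary statistics are reproducible across runs.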

In [22]:
name_dupes = (
    df_players_clean
    .groupby("name")["player_id"]
    .nunique()
    .sort_values(ascending=False)
)

name_dupes.head(20)

name
Andreas Johansson    5
Mark Jones           4
Mohamed Fofana       3
Daniel Andersson     3
Anders Nielsen       3
Aleksey Kozlov       3
Danny Rose           3
Emil Johansson       3
Josip Barišić        3
Michael Smith        3
Stephen O'Donnell    3
Scott Brown          3
Fredrik Lundgren     3
Paul Robinson        3
Mark Hughes          3
Tommy Smith          3
Dae-Ho Kim           2
Selçuk Şahin         2
Adam Smith           2
Ashley Williams      2
Name: player_id, dtype: int64

### Player Data Summary

- `player_id` is unique per row (50,000 distinct IDs across 50,000 rows) and is used as the key identifying individuals.
- Several full names map to more than one `player_id` (up to 5 for the most frequent name); this reflects homonymy, so no semantic deduplication was required.
- Player ages, computed relative to the notebook's run date, fall within plausible ranges (roughly 16.6 to 69.7 years), with no systematic issues detected.
- Preferred foot information is missing for a substantial share of players (~25%).
- Height and weight are not fully complete:
  - Missing values are encoded as `0` in the raw data and were converted to NaN (about 18% of heights and 24% of weights).
  - Three clearly invalid height values (1, 18, and 117) were manually set to NaN after inspection.
  - Weight values were left unchanged unless unequivocally invalid.
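One way to separate homonyms from possible duplicate records is to group on name plus birth date: different people who share a name rarely share a birthday. The sketch below uses an invented toy frame (the names, IDs, and dates are placeholders, not real players) and is not a check that was run in this notebook.

```python
import pandas as pd

# Invented toy data: "Player A" is a homonym pair, "Player B" looks like a duplicate.
toy = pd.DataFrame({
    "player_id": [1, 2, 3, 4],
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "birth_date": ["1980-01-01", "1991-05-05", "1990-07-02", "1990-07-02"],
})

# Count distinct IDs per (name, birth_date) key.
ids_per_key = toy.groupby(["name", "birth_date"])["player_id"].nunique()

# Keys with more than one ID are candidates for manual review.
suspect = ids_per_key[ids_per_key > 1]

print(suspect)
```

An empty `suspect` series on the real data would support the conclusion that repeated names are homonyms.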

## 2) Players ↔ Transfers Coverage

In [23]:
df_transfers = pd.read_parquet(f"{PATH2}male_transfers_data.parquet")
df_transfers.shape

(77782, 366)

In [24]:
players_ids = set(df_players_clean["player_id"].unique())
transfer_ids = set(df_transfers["player_id"].unique())

len(players_ids), len(transfer_ids)

(50000, 41141)

In [25]:
common_ids = players_ids & transfer_ids
len(common_ids), len(common_ids) / len(transfer_ids)

(7988, 0.19416154201404925)

In [26]:
missing_players = transfer_ids - players_ids
len(missing_players)

33153
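The set arithmetic above measures coverage per distinct ID. Since a player can appear in several transfer rows, the share of matched rows can differ from the share of matched IDs. A minimal sketch of both measures on toy frames (the IDs are illustrative), using `merge` with `indicator=True`:

```python
import pandas as pd

# Toy stand-ins for df_players_clean and df_transfers.
players = pd.DataFrame({"player_id": [2, 3, 4]})
transfers = pd.DataFrame({"player_id": [3, 3, 4, 118365]})

# Left merge keeps every transfer row; "_merge" records whether a player record was found.
merged = transfers.merge(players, on="player_id", how="left", indicator=True)

# Share of transfer rows with a matching player record.
row_match_rate = (merged["_merge"] == "both").mean()

# Share of distinct transfer IDs with a matching player record.
unique_ids = transfers["player_id"].drop_duplicates()
id_match_rate = unique_ids.isin(players["player_id"]).mean()

print(row_match_rate, id_match_rate)
```

Reporting both rates makes it clear whether unmatched IDs are concentrated among players with few or many transfer rows.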

In [27]:
df_transfers.loc[
    df_transfers["player_id"].isin(list(missing_players)),
    ["player_id","short_name","from_season","to_season"]
].drop_duplicates().head(10)

| | player_id | short_name | from_season | to_season |
|---|---|---|---|---|
| 3552 | 118365 | I. Dzaria | 2018 | 2019 |
| 3553 | 127729 | N. Janičić | 2018 | 2019 |
| 3554 | 128946 | O. Rolović | 2018 | 2019 |
| 3555 | 216722 | Reginaldo | 2018 | 2019 |
| 3556 | 236450 | F. Aliti | 2018 | 2019 |
| 3557 | 259790 | A. Alla | 2018 | 2019 |
| 3559 | 348844 | G. Selmani | 2018 | 2019 |
| 3560 | 370660 | A. Abibi | 2018 | 2019 |
| 3561 | 413626 | Biniam Belay | 2018 | 2019 |
| 3562 | 445078 | L. Amadu | 2018 | 2019 |


### Players ↔ Transfers – Coverage Summary

The Twelve Football transfer dataset was linked with Wyscout player metadata using `player_id`.

- Only ~19.4% of the `player_id` values in the transfer data (7,988 of 41,141) are present in the Wyscout player dump.
- This low match rate reflects differences in dataset scope and historical coverage, not an identification error.
- Manual sanity checks (including external validation of selected players) confirm that matched player_ids correspond to the same real-world individuals.
- For matched players, metadata quality is high:
  - No missing values in birth date or gender.
  - Low missing rates for height, weight, and preferred foot.
- Missing player-level attributes at the row level are driven by unmatched player_ids, not incomplete player records.
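The metadata-quality statements for matched players can be checked by restricting the player table to IDs that also occur in the transfer data and recomputing the missing rates. The sketch below uses toy data with invented values; the column names follow the player table, but the resulting rates say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_players_clean (values invented for illustration).
players = pd.DataFrame({
    "player_id": [2, 3, 4, 5],
    "height": [180.0, np.nan, 186.0, 179.0],
    "foot": ["right", "right", None, "left"],
})

# Toy stand-in for the distinct player IDs in the transfer data.
transfer_ids = [3, 4, 5, 118365]

# Keep only players that also appear in the transfer data.
matched = players[players["player_id"].isin(transfer_ids)]

# Missing-value share per attribute within the matched subset.
missing_rates = matched[["height", "foot"]].isna().mean()

print(missing_rates)
```

Comparing these rates with the full-table rates shows whether matched players are better documented than the player dump as a whole.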