# Wyscout ↔ Transfermarkt Player Mapping 👤🔗

Main purpose: enable joins between Wyscout-based datasets and Transfermarkt information (e.g., market value, nationality, teams).

We will verify:
- uniqueness of IDs (one-to-one vs one-to-many),
- missingness,
- potential duplicate mappings.

## 0) Imports & Setup

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 180)

TM_PATH = "../../raw_data_agust_tm/"
TWELVE_PATH = "../../raw_data_agust_12/"
WY_PATH = "../../raw_data_agust_wy/"

df_map = pd.read_parquet(f"{TM_PATH}wy_tm_players_mapping.parquet")

## 1) Quick Data Snapshot

In [2]:
df_map.shape

(171892, 2)

In [3]:
df_map.dtypes.value_counts()

int32    2
Name: count, dtype: int64

In [4]:
df_map.head()

| | wy_id | tm_id |
|---|---|---|
| 0 | 896402 | 939615 |
| 1 | 886722 | 963878 |
| 2 | 818118 | 992686 |
| 3 | 809491 | 983989 |
| 4 | 808262 | 668547 |


In [5]:
df_map.isna().mean().sort_values(ascending=False).head(15)

wy_id    0.0
tm_id    0.0
dtype: float64

In [6]:
df_map.duplicated().sum()

np.int64(0)

In [7]:
df_map.columns.tolist()

['wy_id', 'tm_id']

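No missing values and no fully duplicated rows. Note that `duplicated()` only flags whole-row duplicates, so it does not show whether an individual `wy_id` or `tm_id` repeats. The unique counts in section 3 (171,892 `wy_id` values vs. 171,198 `tm_id` values across 171,892 rows) show that some `tm_id` values are shared by more than one `wy_id`.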
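The checks above cover missing values and whole-row duplicates. Per-column uniqueness (one-to-one vs. one-to-many) could be checked with something like the sketch below. It runs on a small toy frame with made-up IDs, and the helper name `id_cardinality` is hypothetical; only the column names `wy_id` and `tm_id` come from the mapping table.

```python
import pandas as pd

# Toy stand-in for df_map (made-up values, not real player IDs).
toy_map = pd.DataFrame(
    {"wy_id": [1, 2, 3, 4], "tm_id": [10, 20, 20, 30]}
)


def id_cardinality(df: pd.DataFrame, left: str, right: str) -> dict:
    """Count IDs on each side that map to more than one ID on the other side."""
    right_per_left = df.groupby(left)[right].nunique()
    left_per_right = df.groupby(right)[left].nunique()
    return {
        f"{left}_with_multiple_{right}": int((right_per_left > 1).sum()),
        f"{right}_with_multiple_{left}": int((left_per_right > 1).sum()),
        "is_one_to_one": bool(
            (right_per_left <= 1).all() and (left_per_right <= 1).all()
        ),
    }


summary = id_cardinality(toy_map, "wy_id", "tm_id")
print(summary)

# Rows whose tm_id appears on more than one row of the mapping.
shared_tm = toy_map[toy_map.duplicated("tm_id", keep=False)]
print(shared_tm)
```

On the real table, the same call would be `id_cardinality(df_map, "wy_id", "tm_id")`; the rows returned by the `duplicated("tm_id", keep=False)` filter are the ones worth inspecting manually.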

## 2) Player IDs ↔ Twelve Transfers Coverage

In [8]:
df_transfers = pd.read_parquet(f"{TWELVE_PATH}male_transfers_data.parquet")

In [9]:
twelve_ids = set(df_transfers["player_id"].unique())
len(twelve_ids)

41141

In [10]:
wy_ids_map = set(df_map["wy_id"].unique())
tm_ids_map = set(df_map["tm_id"].unique())

n_wy_match = len(twelve_ids & wy_ids_map)
n_tm_match = df_map.loc[
    df_map["wy_id"].isin(twelve_ids), "tm_id"
].nunique()

n_wy_match, n_tm_match

(34612, 34595)

In [11]:
wy_coverage = n_wy_match / len(twelve_ids)
tm_coverage = n_tm_match / len(twelve_ids)

wy_coverage, tm_coverage

(0.8413018643202644, 0.8408886512238399)

### Twelve Football → Transfermarkt Coverage

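Key findings:
- ~84% (34,612 of 41,141) of unique Twelve Football `player_id` values appear as a `wy_id` in the mapping table.
- Those 34,612 Wyscout IDs link to 34,595 distinct Transfermarkt IDs (also ~84% of Twelve Football players). Because the mapping has no missing values, every matched `wy_id` has a `tm_id`.
- The near-identical counts indicate that the mapping is almost, but not exactly, one-to-one for the covered players: the gap of 17 means a small number of `tm_id` values are shared by more than one `wy_id`.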
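The coverage arithmetic above can be wrapped in a small helper so the same calculation is reused for every source dataset. This is a minimal sketch on toy data: the function name `coverage`, the key names in the returned dict, and all ID values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins (made-up values): a mapping table and a set of source player IDs.
toy_map = pd.DataFrame({"wy_id": [1, 2, 3, 4], "tm_id": [10, 20, 20, 30]})
toy_source_ids = {1, 2, 3, 5, 6}


def coverage(source_ids: set, mapping: pd.DataFrame) -> dict:
    """Share of source IDs found as wy_id, plus the wy_id vs. tm_id count gap."""
    matched = mapping[mapping["wy_id"].isin(source_ids)]
    n_wy = matched["wy_id"].nunique()
    n_tm = matched["tm_id"].nunique()
    return {
        "n_source": len(source_ids),
        "n_wy_matched": n_wy,
        "n_tm_matched": n_tm,
        "wy_coverage": n_wy / len(source_ids),
        # A positive gap means some tm_id values are shared by several wy_ids.
        "wy_tm_gap": n_wy - n_tm,
    }


result = coverage(toy_source_ids, toy_map)
print(result)
```

With the real data, `coverage(twelve_ids, df_map)` should reproduce the matched counts reported above, and `wy_tm_gap` would make the difference between the two counts explicit.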

## 3) TM Player IDs ↔ WY Players Coverage

In [12]:
df_players_wy = pd.read_parquet(f"{WY_PATH}players_wyscout.parquet")

In [13]:
wy_ids_players = set(df_players_wy["player_id"].unique())
wy_ids_map = set(df_map["wy_id"].unique())
tm_ids_map = set(df_map["tm_id"].unique())

len(wy_ids_players), len(wy_ids_map), len(tm_ids_map)

(50000, 171892, 171198)

In [14]:
n_wy_match = len(wy_ids_players & wy_ids_map)
coverage_wy = n_wy_match / len(wy_ids_players)

n_wy_match, coverage_wy

(29962, 0.59924)

### Transfermarkt ↔ Wyscout Players Coverage

- ~60% (29,962 of 50,000) of players in `players_wyscout.parquet` have a corresponding Transfermarkt player ID via the mapping table.
- This lower coverage (relative to the ~84% Twelve Football → Transfermarkt coverage) likely reflects differences in dataset scope rather than identification errors, although this notebook does not test that directly.
- To get the data we actually require, we would first need to align the scope of all datasets.
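To see which players fall outside the mapping, rather than only how many, one option is a left merge with pandas' `indicator` and `validate` arguments. The sketch below uses toy frames; the `name` column and all values are made up for illustration, and only `player_id`, `wy_id`, and `tm_id` are column names taken from the notebook.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the Wyscout players table and the mapping.
toy_players = pd.DataFrame(
    {"player_id": [1, 2, 3, 5], "name": ["A", "B", "C", "D"]}
)
toy_map = pd.DataFrame({"wy_id": [1, 2, 3, 4], "tm_id": [10, 20, 20, 30]})

merged = toy_players.merge(
    toy_map,
    how="left",
    left_on="player_id",
    right_on="wy_id",
    validate="many_to_one",  # raises MergeError if wy_id repeats in the mapping
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Players with no Transfermarkt ID in the mapping.
unmatched = merged.loc[merged["_merge"] == "left_only", "player_id"].tolist()
match_rate = (merged["_merge"] == "both").mean()

print(unmatched)
print(match_rate)
```

Inspecting the `left_only` rows (for example by any grouping columns available in the players table) is one way to judge whether the missing 40% is a scope difference or a matching problem.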