# Exploring and merging the Transfers datasets

This notebook verifies that the two parquet files in `raw_data/Transfers/` are complementary:
- `male_transfer_model.parquet` → transfers between **different** competitions
- `transfers_same_competitions_data.parquet` → transfers within the **same** competition

It then merges them into a single file: `male_transfers_model_2018_2025.parquet`

In [1]:
# -- Paths (resolve the Unicode directory name dynamically) --
from pathlib import Path

import pandas as pd

docs = Path("/Users/jorgepadilla/Documents")
for _d in docs.iterdir():
    if "Jorge" in _d.name and "MacBook" in _d.name and _d.is_dir():
        RAW = _d / "thesis_data" / "raw_data"
        PROCESSED = _d / "thesis_data" / "processed_data"
        break
else:
    # No directory matched: fail here rather than with a NameError on RAW below
    raise FileNotFoundError(f"No matching data directory under {docs}")

base = RAW / "Transfers"

df_diff = pd.read_parquet(base / "male_transfer_model.parquet")
df_same = pd.read_parquet(base / "transfers_same_competitions_data.parquet")

print(f"male_transfer_model (diff comps):         {df_diff.shape}")
print(f"transfers_same_competitions (same comp):  {df_same.shape}")

male_transfer_model (diff comps):         (78191, 362)
transfers_same_competitions (same comp):  (184149, 362)
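
The path lookup above matches on ASCII fragments of the directory name. If the reason is that the folder name contains a non-ASCII character whose encoded form varies between systems (an assumption; the notebook only says the name is Unicode), normalizing both sides before comparing is one option. A minimal sketch, where `find_dir` is a hypothetical helper and not part of the notebook:

```python
import unicodedata
from pathlib import Path
from typing import Optional


def find_dir(parent: Path, *fragments: str) -> Optional[Path]:
    """Return the first subdirectory whose NFC-normalized name contains every fragment.

    Hypothetical helper: normalizes names so composed and decomposed
    forms of the same character compare equal.
    """
    wanted = [unicodedata.normalize("NFC", f) for f in fragments]
    for child in sorted(parent.iterdir()):
        name = unicodedata.normalize("NFC", child.name)
        if child.is_dir() and all(f in name for f in wanted):
            return child
    return None
```

Returning `None` instead of leaving a variable undefined lets the caller decide how to fail when nothing matches.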


## 1. Column comparison

In [2]:
cols_diff = set(df_diff.columns)
cols_same = set(df_same.columns)

print("Identical columns:", cols_diff == cols_same)
print(f"\nOnly in male_transfer_model:         {sorted(cols_diff - cols_same)}")
print(f"Only in transfers_same_competitions: {sorted(cols_same - cols_diff)}")
print(f"\nShared columns: {len(cols_diff & cols_same)} of {len(cols_diff)} / {len(cols_same)}")

Identical columns: False

Only in male_transfer_model:         ['competition_to', 'season_to', 'team_id_to']
Only in transfers_same_competitions: ['competition', 'season', 'team_id']

Shared columns: 359 of 362 / 362


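The three differing columns are treated as semantic equivalents:
- `competition_to` (diff) ↔ `competition` (same), both taken to mean `to_competition`
- `season_to` (diff) ↔ `season` (same), both taken to mean `to_season`
- `team_id_to` (diff) ↔ `team_id` (same), both taken to mean `to_team_id`

In `same`, the origin and destination competition coincide, so a single generic `competition` column is enough. That argument does not carry over automatically to `season` and `team_id`: the checks in section 2 cover only `competition`, `competition_to`, and `season_to`, and the ranges in section 3 show `season` spanning 2018 - 2024, which matches `from_season` rather than `to_season` (2019 - 2025). These two mappings should be verified before relying on the renamed columns.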
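
The pairing above can be derived from the column names instead of being typed by hand. The sketch below assumes the only naming difference is a `_to` suffix; `propose_renames` is a hypothetical helper. It proposes candidates by name only and says nothing about whether the values agree.

```python
def propose_renames(cols_a, cols_b, suffix="_to"):
    """Pair columns found only in cols_b with 'name + suffix' columns found only in cols_a.

    Returns (mapping, unmatched): mapping goes from the cols_b name to the
    cols_a name; unmatched lists every one-sided column left unpaired.
    """
    only_a = set(cols_a) - set(cols_b)
    only_b = set(cols_b) - set(cols_a)
    mapping = {}
    for col in sorted(only_b):
        candidate = col + suffix
        if candidate in only_a:
            mapping[col] = candidate
    unmatched = sorted((only_a - set(mapping.values())) | (only_b - set(mapping)))
    return mapping, unmatched


# Toy column lists that mimic the three differing names in this notebook
mapping, unmatched = propose_renames(
    ["player_id", "competition_to", "season_to", "team_id_to"],
    ["player_id", "competition", "season", "team_id"],
)
print(mapping, unmatched)
```

An empty `unmatched` list means every one-sided column found a partner; anything left in it needs a manual decision.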

## 2. Hypothesis verification

In [3]:
# Hypothesis: male_transfer_model = DIFFERENT competitions
same_in_diff = (df_diff["from_competition"] == df_diff["to_competition"]).sum()
diff_in_diff = (df_diff["from_competition"] != df_diff["to_competition"]).sum()
print("=== male_transfer_model ===")
print(f"  from_comp == to_comp: {same_in_diff:,}")
print(f"  from_comp != to_comp: {diff_in_diff:,}")
print(f"  → 100% different competitions: {same_in_diff == 0}")

# Hypothesis: transfers_same_competitions = SAME competition
same_in_same = (df_same["from_competition"] == df_same["to_competition"]).sum()
diff_in_same = (df_same["from_competition"] != df_same["to_competition"]).sum()
print("\n=== transfers_same_competitions ===")
print(f"  from_comp == to_comp: {same_in_same:,}")
print(f"  from_comp != to_comp: {diff_in_same:,}")
print(f"  → 100% same competition: {diff_in_same == 0}")

=== male_transfer_model ===
  from_comp == to_comp: 0
  from_comp != to_comp: 78,191
  → 100% different competitions: True

=== transfers_same_competitions ===
  from_comp == to_comp: 184,149
  from_comp != to_comp: 0
  → 100% same competition: True
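
One caveat with the `==` / `!=` counts above: in pandas a missing value never compares equal to anything, so a row with a missing competition would be counted as "different". Whether these files contain such rows is not shown in the notebook. A sketch that reports them separately, using a hypothetical `competition_split` helper and toy data:

```python
import pandas as pd


def competition_split(df, from_col="from_competition", to_col="to_competition"):
    """Count rows as same / different / missing, keeping NaN rows out of both comparisons."""
    missing = df[from_col].isna() | df[to_col].isna()
    same = (df[from_col] == df[to_col]) & ~missing
    different = (df[from_col] != df[to_col]) & ~missing
    return {
        "same": int(same.sum()),
        "different": int(different.sum()),
        "missing": int(missing.sum()),
    }


# Toy frame: one same-competition row, one different, one with a missing origin
toy = pd.DataFrame({
    "from_competition": [127, 127, float("nan")],
    "to_competition": [127, 546, 546],
})
print(competition_split(toy))
```

If `missing` is zero for both real frames, the original counts already tell the whole story.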


In [4]:
# Verify that the extra columns are redundant
print("SAME: competition == from_competition:", (df_same["competition"] == df_same["from_competition"]).all())
print("SAME: competition == to_competition: ", (df_same["competition"] == df_same["to_competition"]).all())
print("DIFF: competition_to == to_competition:", (df_diff["competition_to"] == df_diff["to_competition"]).all())
print("DIFF: season_to == to_season:          ", (df_diff["season_to"] == df_diff["to_season"]).all())

SAME: competition == from_competition: True
SAME: competition == to_competition:  True
DIFF: competition_to == to_competition: True
DIFF: season_to == to_season:           True
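
The cell above checks `competition`, `competition_to`, and `season_to`, but not `season`, `team_id`, or `team_id_to`. A sketch of how the remaining columns could be tested against both the `from_` and `to_` variants; `matching_columns` is a hypothetical helper, and the toy frame is illustrative only, not a result from the real data:

```python
import pandas as pd


def matching_columns(df, generic, candidates):
    """Map each candidate column to True if it equals `generic` on every row.

    Rows where both values are missing count as equal.
    """
    result = {}
    for col in candidates:
        both_missing = df[generic].isna() & df[col].isna()
        result[col] = bool(((df[generic] == df[col]) | both_missing).all())
    return result


# Toy data only; the real frames may behave differently
toy = pd.DataFrame({
    "season": [2018, 2019],
    "from_season": [2018, 2019],
    "to_season": [2019, 2020],
})
print(matching_columns(toy, "season", ["from_season", "to_season"]))
```

On the real data this would be called as `matching_columns(df_same, "season", ["from_season", "to_season"])`, and similarly for `team_id` with `from_team_id` and `to_team_id`.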


## 3. Time range

In [5]:
print("=== male_transfer_model (diff) ===")
print(f"  from_season: {df_diff['from_season'].min()} - {df_diff['from_season'].max()}")
print(f"  to_season:   {df_diff['to_season'].min()} - {df_diff['to_season'].max()}")
print(f"  season_to:   {df_diff['season_to'].min()} - {df_diff['season_to'].max()}")

print("\n=== transfers_same_competitions (same) ===")
print(f"  from_season: {df_same['from_season'].min()} - {df_same['from_season'].max()}")
print(f"  to_season:   {df_same['to_season'].min()} - {df_same['to_season'].max()}")
print(f"  season:      {df_same['season'].min()} - {df_same['season'].max()}")

=== male_transfer_model (diff) ===
  from_season: 2018 - 2025
  to_season:   2018 - 2025
  season_to:   2018 - 2025

=== transfers_same_competitions (same) ===
  from_season: 2018 - 2024
  to_season:   2019 - 2025
  season:      2018 - 2024
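
The min/max prints above can be condensed into a small table, and the gap between `to_season` and `from_season` gives a quick view of how the two season columns relate. A sketch with hypothetical helpers `season_ranges` and `season_gap_counts`, run here on toy data:

```python
import pandas as pd


def season_ranges(df, cols):
    """Return one row per column with its min and max."""
    return df[list(cols)].agg(["min", "max"]).T


def season_gap_counts(df, from_col="from_season", to_col="to_season"):
    """Count rows by the difference to_season - from_season."""
    gap = df[to_col] - df[from_col]
    return gap.value_counts().sort_index().to_dict()


# Toy data only
toy = pd.DataFrame({
    "from_season": [2018, 2018, 2019],
    "to_season": [2019, 2018, 2020],
})
print(season_ranges(toy, ["from_season", "to_season"]))
print(season_gap_counts(toy))
```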


## 4. Merge into a single parquet file

In [6]:
# Rename the columns of same so they match diff
df_same_r = df_same.rename(columns={
    "competition": "competition_to",
    "season": "season_to",
    "team_id": "team_id_to"
})

assert set(df_diff.columns) == set(df_same_r.columns), "Columns don't match!"
print("Columns aligned correctly.")

Columns aligned correctly.
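
`DataFrame.rename` ignores mapping keys that are not present by default, and it can produce duplicate column names if a target name already exists. The set comparison above catches the first case; a stricter variant that rejects both before renaming could look like this sketch (`safe_rename` is a hypothetical helper):

```python
import pandas as pd


def safe_rename(df, mapping):
    """Rename columns, raising if a source is absent or a target would collide."""
    missing = [old for old in mapping if old not in df.columns]
    clashes = [
        new for old, new in mapping.items()
        if new in df.columns and new not in mapping
    ]
    if missing or clashes:
        raise ValueError(f"missing sources: {missing}; clashing targets: {clashes}")
    return df.rename(columns=mapping)


# Toy frame with two of the generic column names
toy = pd.DataFrame({"competition": [127], "season": [2018]})
renamed = safe_rename(toy, {"competition": "competition_to"})
print(list(renamed.columns))
```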


In [7]:
# Add a transfer-type column for traceability
df_diff["transfer_type"] = "different_competition"
df_same_r["transfer_type"] = "same_competition"

# Fix mixed dtypes in the date columns before the concat
for col in ["last_played_date", "first_played_date"]:
    df_diff[col] = pd.to_datetime(df_diff[col], errors="coerce")
    df_same_r[col] = pd.to_datetime(df_same_r[col], errors="coerce")

# Concatenate
df_all = pd.concat([df_diff, df_same_r], ignore_index=True)

print(f"Diff:     {len(df_diff):>10,} rows")
print(f"Same:     {len(df_same_r):>10,} rows")
print(f"Combined: {len(df_all):>10,} rows x {len(df_all.columns)} cols")

Diff:         78,191 rows
Same:        184,149 rows
Combined:    262,340 rows x 363 cols
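
The date columns were fixed by name in the cell above. To find any other columns whose dtype differs between the two frames before concatenating, a comparison like the following sketch can help. `dtype_mismatches` is a hypothetical helper; it only detects differences across frames, not mixed values inside a single object column.

```python
import pandas as pd


def dtype_mismatches(a, b):
    """Return {column: (dtype_in_a, dtype_in_b)} for shared columns whose dtypes differ."""
    shared = [c for c in a.columns if c in b.columns]
    return {
        c: (str(a[c].dtype), str(b[c].dtype))
        for c in shared
        if a[c].dtype != b[c].dtype
    }


# Toy frames: the same date column stored as text in one and as datetime in the other
left = pd.DataFrame({"player_id": [1], "last_played_date": ["2018-12-23 12:30:00"]})
right = pd.DataFrame({
    "player_id": [2],
    "last_played_date": pd.to_datetime(["2018-12-21 15:30:00"]),
})
print(dtype_mismatches(left, right))
```

On the real data the call would be `dtype_mismatches(df_diff, df_same_r)`, run before the `pd.concat`.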


In [8]:
# Save
out_path = base / "male_transfers_model_2018_2025.parquet"
df_all.to_parquet(out_path, index=False)

print(f"Saved to: {out_path}")
print(f"Size: {out_path.stat().st_size / 1024 / 1024:.1f} MB")

# Verify
df_check = pd.read_parquet(out_path)
print(f"\nCheck: {df_check.shape}")
print(f"\n{df_check['transfer_type'].value_counts().to_string()}")

Saved to: raw_data/Transfers/male_transfers_model_2018_2025.parquet
Size: 377.7 MB

Check: (262340, 363)

transfer_type
same_competition         184149
different_competition     78191
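
The check above compares shape and label counts by eye. A sketch that turns the same comparison into a single equality test between the in-memory frame and the reloaded one; `frame_fingerprint` is a hypothetical helper, and the toy frame stands in for `df_all` and `df_check`:

```python
import pandas as pd


def frame_fingerprint(df, label_col="transfer_type"):
    """Summarize a frame by shape, column order, and label counts."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "label_counts": df[label_col].value_counts().to_dict(),
    }


# Toy frame only
toy = pd.DataFrame({
    "player_id": [1, 2, 3],
    "transfer_type": ["same_competition", "same_competition", "different_competition"],
})
print(frame_fingerprint(toy))
```

On the real data, `assert frame_fingerprint(df_all) == frame_fingerprint(df_check)` would fail loudly if the round trip changed the shape, the column order, or the label counts. It does not compare individual values.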


In [9]:
df_check.head()

Unnamed: 0,player_id,from_competition,to_competition,from_team_id,to_team_id,from_season,to_season,last_played_date,first_played_date,from_position,...,to_z_score_xG + xA per 100 touches,to_z_score_xG per 90,to_z_score_xG per box touch,to_z_score_xG per shot,to_z_score_xGBuildup per 90,to_z_score_xGChain per possession,to_z_score_xGCreated per 90,to_z_score_xGDribble per 90,to_z_score_xGOT per 90,transfer_type
0,39558,127,546,8714,10985,2018,2019,2018-12-23 12:30:00,2019-04-07 12:00:00,Midfielder,...,-0.668823,-0.442331,,-0.764173,1.161678,0.126879,-0.465592,-0.17891,0.081343,different_competition
1,40619,127,546,8682,10989,2018,2019,2018-12-21 15:30:00,2019-03-09 09:00:00,Winger,...,-0.794302,-0.253436,0.689239,-0.400522,-0.573538,-0.866235,-0.276468,-0.689415,0.295396,different_competition
2,40619,127,546,8682,10989,2018,2019,2018-12-21 15:30:00,2019-03-09 09:00:00,Winger,...,-0.854912,-0.446597,0.141068,0.127532,-0.689799,-1.503197,-0.899362,-0.918006,0.053274,different_competition
3,64543,127,43114,8714,63764,2018,2019,2018-11-12 13:00:00,2019-03-30 23:00:00,Full Back,...,0.968465,-1.040329,,-1.09372,-1.102712,-0.830265,-0.606746,-0.529227,-0.797386,different_competition
4,64543,127,43114,8714,63764,2018,2019,2018-11-12 13:00:00,2019-03-30 23:00:00,Full Back,...,1.350675,-1.363574,-0.517759,-1.490842,0.158656,-0.191947,0.990421,-0.713805,-1.607982,different_competition
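
In the preview above, rows 1 and 2 share the same player, teams, seasons, and dates and differ in the visible z-score columns; the same holds for rows 3 and 4. The notebook does not say whether such repeats are intended. A sketch for measuring how common they are, where `duplicate_key_rows` is a hypothetical helper and the choice of key columns is an assumption:

```python
import pandas as pd


def duplicate_key_rows(df, keys):
    """Return every row whose key combination appears more than once."""
    return df[df.duplicated(subset=list(keys), keep=False)]


# Toy frame reusing three key combinations from the preview
toy = pd.DataFrame({
    "player_id": [39558, 40619, 40619],
    "from_team_id": [8714, 8682, 8682],
    "to_team_id": [10985, 10989, 10989],
    "from_season": [2018, 2018, 2018],
    "to_season": [2019, 2019, 2019],
})
keys = ["player_id", "from_team_id", "to_team_id", "from_season", "to_season"]
print(len(duplicate_key_rows(toy, keys)))
```

Applied to `df_check` with the same keys, the length of the result shows how many rows belong to a repeated transfer.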
