# Investigation: Liga MX "Internal" Transfers 🇲🇽⚽

## Objective
Investigate whether the 12 records labeled as "internal Liga MX transfers" are valid data.

## 🚨 KEY FINDINGS (SPOILER)

### Finding #1: THE DATASET HAS ZERO INTERNAL TRANSFERS FOR ANY LEAGUE
- **78,191 total records**
- **0 records** where `from_competition == to_competition`
- **ALL records** are inter-league transfers (from one league to a different league)

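### Finding #2: The "12 Liga MX internal transfers" are actually cross-division moves:
- **Second to first division:** Liga de Expansión MX (615) → Liga MX (617), 84 records
- **First to second division:** Liga MX (617) → Liga de Expansión MX (615), 33 records
- Only 14 of these 117 records keep the same team ID (a club changing division); the other 103 are players changing clubs across divisions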

### Finding #3: The competition_id is STABLE across seasons
- Liga MX = 617 in 2018, 2019, 2020, ..., 2025
- The ID does NOT change per season or tournament

### Conclusion
The data provided **DOES NOT WORK** for internal Liga MX transfer analysis because:
1. The Twelve dataset contains only inter-league transfers; with 0 of 78,191 records sharing a source and destination competition, this appears to be **by design**
2. The records given are moves between divisions (615 ↔ 617), NOT internal transfers

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 200)

BASE_PATH = "../../thesis_data/"
PATH_COMP = f"{BASE_PATH}raw_data_twelve/Wyscout/competitions_wyscout.parquet"
PATH_TRANSFERS = f"{BASE_PATH}raw_data_twelve/Twelve/male_transfer_model.parquet"

In [2]:
df_comp = pd.read_parquet(PATH_COMP)
df = pd.read_parquet(PATH_TRANSFERS)

print(f"Total transfer records: {len(df):,}")
print(f"Total competitions in metadata: {df_comp['competition_id'].nunique()}")

Total transfer records: 78,191
Total competitions in metadata: 269


## 1) 🚨 CRITICAL CHECK: Does the dataset have ANY internal transfers?

In [3]:
same_comp = df[df['from_competition'] == df['to_competition']]
diff_comp = df[df['from_competition'] != df['to_competition']]

print("="*80)
print("🚨 CRITICAL FINDING: INTERNAL vs INTER-LEAGUE TRANSFERS")
print("="*80)
print(f"\nRecords where from_competition == to_competition: {len(same_comp):,}")
print(f"Records where from_competition != to_competition: {len(diff_comp):,}")
print(f"Total records: {len(df):,}")

if len(same_comp) == 0:
    print("\n" + "❌"*40)
    print("THE DATASET HAS ZERO INTERNAL TRANSFERS FOR ANY LEAGUE")
    # Derive the count from the data instead of hardcoding it
    print(f"ALL {len(df):,} RECORDS ARE INTER-LEAGUE TRANSFERS")
    print("❌"*40)

🚨 CRITICAL FINDING: INTERNAL vs INTER-LEAGUE TRANSFERS

Records where from_competition == to_competition: 0
Records where from_competition != to_competition: 78,191
Total records: 78,191

❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌
THE DATASET HAS ZERO INTERNAL TRANSFERS FOR ANY LEAGUE
ALL 78,191 RECORDS ARE INTER-LEAGUE TRANSFERS
❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌❌


### 📍 Key Insight

The Twelve transfer dataset is **designed to only track inter-league transfers**.

It does NOT contain:
- Players moving from Club América to Chivas (both Liga MX)
- Players moving from Barcelona to Real Madrid (both La Liga)
- Any player movement within the same league

It ONLY contains:
- Players moving from one league to a DIFFERENT league
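To confirm that the equality check would actually catch an internal transfer if one existed, it can help to run it on a positive control. The sketch below uses a made-up three-row frame (the `player_id` values and the row mix are illustrative, not taken from the Twelve dataset; only the column names and the IDs 615/617 come from this notebook).

```python
import pandas as pd

# Toy data, NOT from the Twelve dataset. Row 1 is a deliberate
# positive control: a hypothetical Liga MX -> Liga MX move.
toy = pd.DataFrame({
    "player_id":        [1, 2, 3],
    "from_competition": [617, 615, 617],
    "to_competition":   [617, 617, 795],
})

# A row is "internal" when source and destination competition match.
toy["is_internal"] = toy["from_competition"] == toy["to_competition"]

n_internal = int(toy["is_internal"].sum())
print(toy)
print(f"Internal rows: {n_internal} of {len(toy)}")
```

If the same comparison returns zero rows on the real data, the absence is a property of the dataset rather than a flaw in the check.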

## 2) Does `competition_id` change per season?

In [4]:
# In Wyscout metadata: each competition_id spans multiple seasons
comp_seasons = df_comp.groupby(['competition_id', 'name']).agg(
    n_seasons=('season', 'nunique'),
    seasons=('season', lambda x: sorted(x.unique()))
).reset_index()

print("="*80)
print("WYSCOUT: competition_id is a LEAGUE identifier, NOT season-specific")
print("="*80)
print(f"\nAvg seasons per competition_id: {comp_seasons['n_seasons'].mean():.1f}")
print(f"Max seasons per competition_id: {comp_seasons['n_seasons'].max()}")
print(f"\nExamples:")
print(comp_seasons.sort_values('n_seasons', ascending=False).head(10).to_string())

WYSCOUT: competition_id is a LEAGUE identifier, NOT season-specific

Avg seasons per competition_id: 6.5
Max seasons per competition_id: 9

Examples:
     competition_id                  name  n_seasons                                                 seasons
47              255               Serie A          9  [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026]
44              248        Pernambucano 1          9  [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026]
32              219              Baiano 1          9  [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026]
34              225         Catarinense 1          9  [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026]
35              227            Cearense 1          9  [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026]
36              230              Gaúcho 1          9  [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026]
37              232              Goiano 1          9  [2018, 2019, 2020, 2021, 2022, 2

In [5]:
# In Transfers: same competition_id appears across multiple seasons
from_comp_seasons = df.groupby('from_competition').agg(
    n_seasons=('from_season', 'nunique'),
    seasons=('from_season', lambda x: sorted(x.unique()))
).reset_index()

print("\n" + "="*80)
print("TRANSFERS: Same pattern - competition_id spans multiple seasons")
print("="*80)
print(f"\nAvg seasons per from_competition: {from_comp_seasons['n_seasons'].mean():.1f}")
print(f"\n\u2705 competition_id does NOT change per season.")


TRANSFERS: Same pattern - competition_id spans multiple seasons

Avg seasons per from_competition: 4.6

✅ competition_id does NOT change per season.


In [6]:
LIGA_MX = 617
LIGA_EXPANSION = 615

# Liga MX in Wyscout
liga_mx_wyscout = df_comp[df_comp['competition_id'] == LIGA_MX][['competition_id', 'name', 'season', 'season_name']]

print("="*80)
print("LIGA MX (617) IN WYSCOUT METADATA")
print("="*80)
print(liga_mx_wyscout.to_string())

# Liga MX in Transfers
from_617 = df[df['from_competition'] == LIGA_MX]
to_617 = df[df['to_competition'] == LIGA_MX]

print(f"\nLIGA MX (617) IN TRANSFERS:")
# .tolist() yields plain Python ints, so the lists print without np.int16(...) wrappers
print(f"  FROM Liga MX: seasons {sorted(from_617['from_season'].unique().tolist())}")
print(f"  TO Liga MX:   seasons {sorted(to_617['to_season'].unique().tolist())}")
print(f"\n\u2705 Same ID (617) used across ALL seasons. It is a league ID, not a season ID.")

LIGA MX (617) IN WYSCOUT METADATA
     competition_id     name  season season_name
933             617  Liga MX    2018   2018/2019
934             617  Liga MX    2019   2019/2020
935             617  Liga MX    2020   2020/2021
936             617  Liga MX    2021   2021/2022
937             617  Liga MX    2022   2022/2023
938             617  Liga MX    2023   2023/2024
939             617  Liga MX    2024   2024/2025
940             617  Liga MX    2025   2025/2026

LIGA MX (617) IN TRANSFERS:
  FROM Liga MX: seasons [2018, 2019, 2020, 2021, 2022, 2023, 2024]
  TO Liga MX:   seasons [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]

✅ Same ID (617) used across ALL seasons. It is a league ID, not a season ID.
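Counting seasons per ID shows that an ID spans seasons, but a stricter test of stability is that each ID maps to exactly one league name, and each league maps to exactly one ID. A minimal sketch of that two-way check follows, on an illustrative frame shaped like the Wyscout metadata (columns `competition_id`, `name`, `country`, `season`; the rows are made up for the example). Grouping by country as well as name matters because names such as "Primera División" are shared by leagues in different countries.

```python
import pandas as pd

# Illustrative metadata, shaped like competitions_wyscout.parquet.
meta = pd.DataFrame({
    "competition_id": [617, 617, 617, 615, 615],
    "name": ["Liga MX", "Liga MX", "Liga MX",
             "Liga de Expansión MX", "Liga de Expansión MX"],
    "country": ["Mexico"] * 5,
    "season": [2018, 2019, 2020, 2018, 2019],
})

# Direction 1: an ID that carries several names would be unstable.
names_per_id = meta.groupby("competition_id")["name"].nunique()
unstable_ids = names_per_id[names_per_id > 1].index.tolist()

# Direction 2: a (country, name) pair with several IDs would suggest
# that IDs are issued per season or per tournament.
ids_per_league = meta.groupby(["country", "name"])["competition_id"].nunique()
reused_leagues = ids_per_league[ids_per_league > 1].index.tolist()

print(f"IDs with more than one name: {unstable_ids}")
print(f"Leagues with more than one ID: {reused_leagues}")
```

Both lists being empty on the real metadata would support the conclusion that `competition_id` identifies a league rather than a season.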


## 3) What ARE the "Liga MX internal transfers"?

In [7]:
# Mexican competitions in the dataset
mexico_comps = df_comp[df_comp['country'] == 'Mexico']

print("="*80)
print("🇲🇽 MEXICAN COMPETITIONS IN DATASET")
print("="*80)

mexico_summary = (
    mexico_comps
    .groupby(['competition_id', 'name'])
    .agg(
        n_seasons=('season', 'nunique'),
        seasons=('season', lambda x: sorted(x.unique()))
    )
    .reset_index()
)

mexico_summary

üá≤üáΩ MEXICAN COMPETITIONS IN DATASET


Unnamed: 0,competition_id,name,n_seasons,seasons
0,615,Liga de Expansi√≥n MX,8,"[2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]"
1,617,Liga MX,8,"[2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]"


In [8]:
# Get competition name mapping
comp_id_to_name = df_comp.drop_duplicates('competition_id').set_index('competition_id')['name'].to_dict()

# Find all transfers between Mexican leagues
mexico_mexico = df[
    (df['from_competition'].isin([LIGA_MX, LIGA_EXPANSION])) &
    (df['to_competition'].isin([LIGA_MX, LIGA_EXPANSION]))
].copy()

mexico_mexico['from_comp_name'] = mexico_mexico['from_competition'].map(comp_id_to_name)
mexico_mexico['to_comp_name'] = mexico_mexico['to_competition'].map(comp_id_to_name)

print("="*80)
print("TRANSFERS BETWEEN MEXICAN LEAGUES")
print("="*80)
print(f"\nTotal records: {len(mexico_mexico)}")

TRANSFERS BETWEEN MEXICAN LEAGUES

Total records: 117


In [9]:
# Breakdown by competition pair
pairs = (
    mexico_mexico
    .groupby(['from_competition', 'to_competition', 'from_comp_name', 'to_comp_name'])
    .agg(
        n_records=('player_id', 'count'),
        n_players=('player_id', 'nunique')
    )
    .reset_index()
)

print("\n📊 BREAKDOWN BY COMPETITION PAIR:")
pairs


📊 BREAKDOWN BY COMPETITION PAIR:


   from_competition  to_competition        from_comp_name          to_comp_name  n_records  n_players
0               615             617  Liga de Expansión MX               Liga MX         84         65
1               617             615               Liga MX  Liga de Expansión MX         33         29


In [10]:
# Check if these are same team (promotion/relegation) or different team
mexico_mexico['same_team'] = mexico_mexico['from_team_id'] == mexico_mexico['to_team_id']

print("="*80)
print("SAME TEAM vs DIFFERENT TEAM")
print("="*80)
print(f"\nSame team (promotion/relegation): {mexico_mexico['same_team'].sum()}")
print(f"Different team (player moved clubs): {(~mexico_mexico['same_team']).sum()}")

SAME TEAM vs DIFFERENT TEAM

Same team (promotion/relegation): 14
Different team (player moved clubs): 103
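The same-team flag and the direction of the move can be combined into a single label per record, which makes the 14 / 103 split easier to inspect by direction. The following is a sketch under stated assumptions: the label names are invented for this example, the first two rows reuse team IDs that appear in this notebook's sample records, and the third row is hypothetical.

```python
import numpy as np
import pandas as pd

TOP, SECOND = 617, 615  # Liga MX, Liga de Expansión MX

moves = pd.DataFrame({
    "from_team_id":     [25866, 26045, 15412],   # third row is hypothetical
    "to_team_id":       [25866, 15416, 25865],
    "from_competition": [SECOND, SECOND, TOP],
    "to_competition":   [TOP, TOP, SECOND],
})

same_team = moves["from_team_id"] == moves["to_team_id"]
moving_up = (moves["from_competition"] == SECOND) & (moves["to_competition"] == TOP)

# np.select picks the first matching condition for each row.
moves["move_type"] = np.select(
    [same_team & moving_up, same_team & ~moving_up, ~same_team & moving_up],
    ["same_club_up", "same_club_down", "club_change_up"],
    default="club_change_down",
)

print(moves["move_type"].value_counts())
```

Applied to `mexico_mexico`, a `value_counts()` on such a column would show how many of the same-team records go in each direction, which the flat 14 / 103 split does not reveal.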


In [11]:
# Show some example records
cols = ['player_id', 'from_team_id', 'to_team_id', 'from_competition', 'to_competition',
        'from_comp_name', 'to_comp_name', 'from_season', 'to_season', 'same_team']

# Use a single \U escape: surrogate-pair escapes (\ud83d\udd0d) raise
# UnicodeEncodeError when the kernel sends the output as UTF-8
print("\n\U0001F50D SAMPLE RECORDS:")
mexico_mexico[cols].head(20)

       player_id  from_team_id  to_team_id  from_competition  to_competition        from_comp_name  to_comp_name  from_season  to_season  same_team
44120       3584         25866       25866               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44121       6376         25866       25866               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44122      86240         34402       34402               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44130     112446         26045       15416               615             617  Liga de Expansión MX       Liga MX         2018       2019      False
44132     112639         34402       34402               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44133     112705         15465       15421               615             617  Liga de Expansión MX       Liga MX         2018       2019      False
44134     112778         34402       34402               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44135     113068         25866       25866               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44136     113092         25866       25866               615             617  Liga de Expansión MX       Liga MX         2018       2019       True
44137     113102         25865       15412               615             617  Liga de Expansión MX       Liga MX         2018       2019      False



## 4) Complete Picture: All Liga MX Transfer Types

In [12]:
# All transfers involving Liga MX
all_liga_mx = df[
    (df['from_competition'] == LIGA_MX) | 
    (df['to_competition'] == LIGA_MX)
].copy()

# Categorize
internal = all_liga_mx[(all_liga_mx['from_competition'] == LIGA_MX) & (all_liga_mx['to_competition'] == LIGA_MX)]
inbound = all_liga_mx[(all_liga_mx['to_competition'] == LIGA_MX) & (all_liga_mx['from_competition'] != LIGA_MX)]
outbound = all_liga_mx[(all_liga_mx['from_competition'] == LIGA_MX) & (all_liga_mx['to_competition'] != LIGA_MX)]

print("="*80)
print("📊 COMPLETE BREAKDOWN OF LIGA MX TRANSFERS")
print("="*80)
print(f"\nTotal records involving Liga MX: {len(all_liga_mx)}")
print(f"\n  ❌ Internal (617 → 617): {len(internal)}")
print(f"  ✅ Inbound (other → 617): {len(inbound)}")
print(f"  ✅ Outbound (617 → other): {len(outbound)}")

📊 COMPLETE BREAKDOWN OF LIGA MX TRANSFERS

Total records involving Liga MX: 694

  ❌ Internal (617 → 617): 0
  ✅ Inbound (other → 617): 448
  ✅ Outbound (617 → other): 246


In [13]:
# Where do inbound transfers come from?
# assign() returns a new DataFrame, so no column is written into a filtered slice
# (which is what triggers SettingWithCopyWarning)
inbound = inbound.assign(from_comp_name=inbound['from_competition'].map(comp_id_to_name))

inbound_sources = (
    inbound
    .groupby(['from_competition', 'from_comp_name'])
    .size()
    .reset_index(name='count')
    .sort_values('count', ascending=False)
    .head(15)
)

# Single \U escape instead of a surrogate pair, which cannot be encoded as UTF-8
print("\n\U0001F4E5 TOP 15 SOURCE LEAGUES FOR LIGA MX INBOUND TRANSFERS:")
inbound_sources

    from_competition              from_comp_name  count
23               615        Liga de Expansión MX     84
 0               146  Liga Profesional de Fútbol     62
 9               295                Liga BetPlay     27
37               869                         MLS     25
11               339                    Liga Pro     21
 8               284            Primera División     20
31               795                     La Liga     20
38               879            Primera División     18
32               797            Segunda División     13
34               852                   Süper Lig     11


In [14]:
# Where do outbound transfers go?
# assign() returns a new DataFrame, so no column is written into a filtered slice
# (which is what triggers SettingWithCopyWarning)
outbound = outbound.assign(to_comp_name=outbound['to_competition'].map(comp_id_to_name))

outbound_destinations = (
    outbound
    .groupby(['to_competition', 'to_comp_name'])
    .size()
    .reset_index(name='count')
    .sort_values('count', ascending=False)
    .head(15)
)

# Single \U escape instead of a surrogate pair, which cannot be encoded as UTF-8
print("\n\U0001F4E4 TOP 15 DESTINATION LEAGUES FOR LIGA MX OUTBOUND TRANSFERS:")
outbound_destinations

    to_competition                to_comp_name  count
 0             146  Liga Profesional de Fútbol     38
18             615        Liga de Expansión MX     33
27             869                         MLS     25
 9             284            Primera División     14
10             295                Liga BetPlay     14
28             879            Primera División     13
 7             255                     Serie A     12
25             797            Segunda División      9
12             339                    Liga Pro      8
20             688            Primera División      6


---

## 🎯 FINAL CONCLUSION

### Q: Does the `competition_id` change per season?
**NO.** Liga MX is always 617, across all seasons (2018–2025). The ID represents the **league**, not the season.

### The Problem with the Data

1. **The Twelve dataset ONLY contains inter-league transfers**
   - 78,191 total records
   - 0 records where `from_competition == to_competition`
   - This appears to be BY DESIGN, not a bug

2. **The "12 internal Liga MX transfers" are NOT internal transfers**
   - They are cross-division records (117 in the dataset, of which only 14 keep the same team ID) between:
     - Liga de Expansión MX (615) - Second Division
     - Liga MX (617) - First Division

3. **The competition_id is stable across seasons**
   - Liga MX is ALWAYS 617 (2018, 2019, 2020, ..., 2025)
   - The ID does NOT change per season or tournament

### Why This Matters

If you need to analyze **internal Liga MX transfers** (e.g., player from América to Chivas), **this dataset cannot help you** because:
- Such transfers don't exist in the data
- The data only tracks movements BETWEEN different leagues

### Recommendation

Reject the 12 records as invalid for internal transfer analysis. They represent:
- **Inter-league** movements (Liga de Expansión MX ↔ Liga MX)
- **NOT intra-league** movements (Liga MX → Liga MX)
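
The rejection rule can be written as a small guard that splits any candidate set into rows that are internal to a league and rows that are not. This is a sketch, not part of the original analysis: `split_internal_candidates` is a hypothetical helper name, and the candidate rows (including the 617 → 617 positive control) are made up for illustration.

```python
import pandas as pd

def split_internal_candidates(records: pd.DataFrame, league_id: int):
    """Return (accepted, rejected): rows internal to league_id, and the rest."""
    is_internal = (
        (records["from_competition"] == league_id)
        & (records["to_competition"] == league_id)
    )
    return records[is_internal].copy(), records[~is_internal].copy()

# Hypothetical candidates: two cross-division rows and one
# made-up internal row as a positive control.
candidates = pd.DataFrame({
    "player_id":        [1, 2, 3],
    "from_competition": [615, 617, 617],
    "to_competition":   [617, 615, 617],
})

accepted, rejected = split_internal_candidates(candidates, league_id=617)
print(f"accepted={len(accepted)}, rejected={len(rejected)}")
```

Run against the 12 supplied records, a guard like this would be expected to reject all of them, since the dataset contains no row where both competitions equal 617.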