# Understanding the Transfer Dataset Structure

This notebook explains the key aspects of the dataset:
1. **"Not Real Transfers"** - Why 27.5% of rows are not actual market transfers
2. **Positional Granularity** - Why some transfers have multiple rows
3. **Position Changes** - What happens when a player changes role
4. **Recommended Filters** - How to create clean analysis datasets

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 200)

BASE_PATH = "../../thesis_data/"

df = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Twelve/male_transfer_model.parquet")
comps = pd.read_parquet(f"{BASE_PATH}raw_data_twelve/Wyscout/competitions_wyscout.parquet")

print(f"Dataset loaded: {len(df):,} rows")

Dataset loaded: 78,191 rows


---
## PART 1: What are "Not Real Transfers"?

About 27.5% of rows (21,494 of 78,191) have the same team ID before and after. The player **stayed at the same club**, but the club's competition changed, typically through promotion or relegation.

### Simple Explanation
Imagine a player at Club A:
- **Season 2020**: Club A plays in Serie B (2nd division)
- **Season 2021**: Club A gets promoted to Serie A (1st division)

The player never changed employers - he's still at Club A. But the dataset records this as a "transfer" because the **competitive context changed**.

This is NOT a market transfer - no money changed hands, no contract negotiation happened.

In [2]:
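# Boolean masks used to classify each row
same_team = df['from_team_id'] == df['to_team_id']
# Note: diff_comp is not used below; same-team rows are assumed to differ in competition
diff_comp = df['from_competition'] != df['to_competition']
diff_season = df['from_season'] != df['to_season']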

print("=== ROW TYPE BREAKDOWN ===")
print(f"\nTotal rows: {len(df):,}")
print(f"\n1. REAL TRANSFERS (different team): {(~same_team).sum():,} ({(~same_team).mean()*100:.1f}%)")
print(f"\n2. SAME TEAM, DIFFERENT LEAGUE: {(same_team).sum():,} ({same_team.mean()*100:.1f}%)")
print(f"   - Promotion/Relegation (different season): {(same_team & diff_season).sum():,}")
print(f"   - Multi-league same season: {(same_team & ~diff_season).sum():,}")

=== ROW TYPE BREAKDOWN ===

Total rows: 78,191

1. REAL TRANSFERS (different team): 56,697 (72.5%)

2. SAME TEAM, DIFFERENT LEAGUE: 21,494 (27.5%)
   - Promotion/Relegation (different season): 19,142
   - Multi-league same season: 2,352


In [3]:
# Concrete examples
print("=== CONCRETE EXAMPLES ===")

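# One example per player. Caveat: GroupBy.first() returns the first non-null value
# per column, so a printed example can mix values from several rows of the same player.
examples = df[same_team].groupby('player_id').first().reset_index().head(5)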

for _, row in examples.iterrows():
    print(f"\n--- {row['short_name']} ---")
    print(f"Team ID: {row['from_team_id']} (same before and after)")
    print(f"Season: {row['from_season']} → {row['to_season']}")
    
    from_c = comps[comps['competition_id'] == row['from_competition']][['name', 'country', 'division']].drop_duplicates()
    to_c = comps[comps['competition_id'] == row['to_competition']][['name', 'country', 'division']].drop_duplicates()
    
    if len(from_c) > 0:
        print(f"League BEFORE: {from_c.iloc[0]['name']} ({from_c.iloc[0]['country']}, Division {from_c.iloc[0]['division']})")
    else:
        print(f"League BEFORE: ID {row['from_competition']} (not in metadata)")
    
    if len(to_c) > 0:
        print(f"League AFTER: {to_c.iloc[0]['name']} ({to_c.iloc[0]['country']}, Division {to_c.iloc[0]['division']})")
    else:
        print(f"League AFTER: ID {row['to_competition']} (not in metadata)")

=== CONCRETE EXAMPLES ===

--- K. Omeruo ---
Team ID: 712 (same before and after)
Season: 2019 → 2020
League BEFORE: La Liga (Spain, Division 1)
League AFTER: Segunda División (Spain, Division 2)

--- J. Verhoek ---
Team ID: 2452 (same before and after)
Season: 2020 → 2021
League BEFORE: ID 425 (not in metadata)
League AFTER: 2. Bundesliga (Germany, Division 2)

--- N. Boilesen ---
Team ID: 7452 (same before and after)
Season: 2020 → 2021
League BEFORE: Superliga (Denmark, Division 1)
League AFTER: Eliteserien (Norway, Division 1)

--- I. Aissati ---
Team ID: 4490 (same before and after)
Season: 2018 → 2019
League BEFORE: 1. Lig (Turkey, Division 2)
League AFTER: Süper Lig (Turkey, Division 1)

--- N. Mäenpää ---
Team ID: 3191 (same before and after)
Season: 2020 → 2021
League BEFORE: Serie B (Italy, Division 2)
League AFTER: Serie A (Italy, Division 1)


### Interpretation

These cases are:
- **Promotions**: Team moved up a division (e.g., Serie B → Serie A)
- **Relegations**: Team moved down a division (e.g., La Liga → Segunda División)
- **League restructuring**: Competition ID changed but it's the same context

**These are NOT market transfers** - the player never changed employer.
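
To tell promotions from relegations programmatically, one option is to look up each competition's division and compare the two sides. The sketch below runs on a made-up toy frame, not the real data. The column names `competition_id` and `division` follow the `comps` table used above, while `toy`, `toy_comps`, and `classify_move` are hypothetical names. It assumes one division per competition ID and that a lower division number means a higher tier.

```python
import pandas as pd

# Toy stand-ins for `df` and `comps`; all values are made up for illustration.
toy = pd.DataFrame({
    "player_id": [1, 2, 3],
    "from_team_id": [10, 20, 30],
    "to_team_id": [10, 20, 30],
    "from_competition": [102, 101, 999],
    "to_competition": [101, 102, 101],
})
toy_comps = pd.DataFrame({
    "competition_id": [101, 102],
    "division": [1, 2],
})

# Assumption: one division per competition_id, so a lookup Series is enough.
division = (
    toy_comps.drop_duplicates("competition_id")
    .set_index("competition_id")["division"]
)
toy["from_division"] = toy["from_competition"].map(division)
toy["to_division"] = toy["to_competition"].map(division)


def classify_move(row):
    """Label a same-team row by the direction of the division change."""
    if pd.isna(row["from_division"]) or pd.isna(row["to_division"]):
        return "unknown"  # competition missing from the metadata
    if row["to_division"] < row["from_division"]:
        return "promotion"  # assumes a lower number means a higher tier
    if row["to_division"] > row["from_division"]:
        return "relegation"
    return "same division"


toy["move_type"] = toy.apply(classify_move, axis=1)
print(toy[["player_id", "from_division", "to_division", "move_type"]])
```

Rows labelled `unknown` correspond to the "not in metadata" cases in the examples above, and `same division` would catch restructurings or cross-country oddities that keep the tier unchanged.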

### Data Quality Note
Some cases look strange. In the N. Boilesen example above, team ID 7452 appears in both the Danish Superliga and the Norwegian Eliteserien. This could be:
- Data errors
- Team ID namespace collisions across different data sources

Either way, these rows have the same team ID on both sides, so filtering by `from_team_id != to_team_id` removes them.
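
If these rows need to be inspected rather than just dropped, a possible check is to flag same-team rows whose two competitions belong to different countries. This is a sketch on made-up toy data. `country` and `competition_id` follow the `comps` columns used above, and the other names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `df` and `comps`; IDs and values are made up for illustration.
toy = pd.DataFrame({
    "player_id": [1, 2],
    "from_team_id": [1, 2],
    "to_team_id": [1, 2],
    "from_competition": [201, 302],
    "to_competition": [401, 301],
})
toy_comps = pd.DataFrame({
    "competition_id": [201, 401, 301, 302],
    "country": ["Denmark", "Norway", "Italy", "Italy"],
})

# Assumption: one country per competition_id.
country = (
    toy_comps.drop_duplicates("competition_id")
    .set_index("competition_id")["country"]
)
from_country = toy["from_competition"].map(country)
to_country = toy["to_competition"].map(country)

same_team = toy["from_team_id"] == toy["to_team_id"]
# Flag only rows where both countries are known and differ.
suspicious = (
    same_team
    & from_country.notna()
    & to_country.notna()
    & (from_country != to_country)
)
print(toy[suspicious])
```

Requiring both countries to be known keeps competitions that are missing from the metadata out of the flagged set.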

In [4]:
# Filtered dataset with real transfers only
df_real = df[df['from_team_id'] != df['to_team_id']].copy()
print(f"Original dataset: {len(df):,} rows")
print(f"Real transfers only: {len(df_real):,} rows")
print(f"Unique players in real transfers: {df_real['player_id'].nunique():,}")

Original dataset: 78,191 rows
Real transfers only: 56,697 rows
Unique players in real transfers: 34,643


---
## PART 2: Why Are There Multiple Rows Per Transfer?

Some transfers span 2, 3, 4, 6, or even 9 rows. This is because the granularity includes **position**: each row is one (position before, position after) pair.

### Simple Explanation
Think of it this way: the dataset treats "Juan as Striker" and "Juan as Midfielder" as **different entities**.

If Juan played as both Striker and Midfielder before the transfer, and as both Winger and Midfielder after, you get:
- (Striker, Winger)
- (Striker, Midfielder)
- (Midfielder, Winger)
- (Midfielder, Midfielder)

That's 2 × 2 = 4 rows for a single transfer.
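
The same enumeration can be reproduced with a Cartesian product. This is a small illustration using the hypothetical player from the explanation above, not code from the dataset pipeline.

```python
from itertools import product

positions_before = ["Striker", "Midfielder"]
positions_after = ["Winger", "Midfielder"]

# Every (before, after) pairing becomes one row in the dataset.
pairs = list(product(positions_before, positions_after))
for before, after in pairs:
    print(f"({before}, {after})")

print(f"{len(positions_before)} x {len(positions_after)} = {len(pairs)} rows")
```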

In [5]:
# Columns that define a unique transfer
keys = ['player_id', 'from_team_id', 'to_team_id', 'from_competition', 'to_competition', 'from_season', 'to_season']

rows_per_transfer = df.groupby(keys).size().reset_index(name='n_rows')

print("=== ROWS PER TRANSFER ===")
print(rows_per_transfer['n_rows'].value_counts().sort_index())
print(f"\nTotal unique transfers: {len(rows_per_transfer):,}")
print(f"\nLogic: n_rows = positions_before × positions_after")

=== ROWS PER TRANSFER ===
n_rows
1    58313
2     7370
3      255
4      944
6       83
9       11
Name: count, dtype: int64

Total unique transfers: 66,976

Logic: n_rows = positions_before × positions_after
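
The product rule can be checked directly by counting distinct positions on each side of every transfer. The sketch below uses a made-up toy frame with a shortened key list (`toy_keys`, a hypothetical name). On the real data the full `keys` list from the cell above would take its place.

```python
import pandas as pd

# Toy data; values are made up. Two transfers: one 2 x 2, one 1 x 1.
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2],
    "from_team_id": [10, 10, 10, 10, 30],
    "to_team_id": [20, 20, 20, 20, 40],
    "from_position": ["Striker", "Striker", "Midfielder", "Midfielder", "Winger"],
    "to_position": ["Winger", "Midfielder", "Winger", "Midfielder", "Winger"],
})
toy_keys = ["player_id", "from_team_id", "to_team_id"]  # shortened for the toy frame

check = toy.groupby(toy_keys).agg(
    n_rows=("from_position", "size"),
    n_from=("from_position", "nunique"),
    n_to=("to_position", "nunique"),
).reset_index()

# True when the transfer's rows form a full positions_before x positions_after grid.
check["is_full_product"] = check["n_rows"] == check["n_from"] * check["n_to"]
print(check)
```

Any transfer where `is_full_product` is false would have duplicate or missing position pairs and would be worth inspecting.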


In [6]:
# Example with 9 rows
print("=== EXAMPLE: TRANSFER WITH 9 ROWS ===")

example_9 = rows_per_transfer[rows_per_transfer['n_rows'] == 9].iloc[0]
mask = pd.Series(True, index=df.index)
for k in keys:
    mask &= (df[k] == example_9[k])

example_df = df[mask][['short_name', 'from_position', 'to_position', 'from_Minutes', 'to_Minutes']].copy()

print(f"\nPlayer: {example_df['short_name'].iloc[0]}")
print(f"\nEach row is a combination of position_before × position_after:")
print(example_df[['from_position', 'to_position', 'from_Minutes', 'to_Minutes']])

print(f"\n3 positions before × 3 positions after = 9 rows")

=== EXAMPLE: TRANSFER WITH 9 ROWS ===

Player: Dani Rodríguez

Each row is a combination of position_before × position_after:
      from_position to_position  from_Minutes  to_Minutes
58886        Winger  Midfielder           627         705
58887        Winger      Winger           627        1610
58888        Winger     Striker           627         540
58889       Striker  Midfielder          1647         705
58890       Striker      Winger          1647        1610
58891       Striker     Striker          1647         540
58892    Midfielder  Midfielder           811         705
58893    Midfielder      Winger           811        1610
58894    Midfielder     Striker           811         540

3 positions before × 3 positions after = 9 rows


### How Metrics Work

Each row compares:
- `from_*` metrics: Performance when playing as `from_position` **before** the transfer
- `to_*` metrics: Performance when playing as `to_position` **after** the transfer
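
One practical consequence is that `from_*` values repeat across rows that share a `from_position`, so summing them naively over a multi-row transfer counts the same minutes several times. The sketch below rebuilds the 9-row example shown earlier from its printed minutes and compares a naive sum with a deduplicated one. It assumes `from_Minutes` is the time played at `from_position`, as the printed table suggests.

```python
from itertools import product

import pandas as pd

# Minutes per position, copied from the 9-row example printed earlier.
from_minutes = {"Winger": 627, "Striker": 1647, "Midfielder": 811}
to_minutes = {"Midfielder": 705, "Winger": 1610, "Striker": 540}

rows = [
    {
        "from_position": f,
        "to_position": t,
        "from_Minutes": from_minutes[f],
        "to_Minutes": to_minutes[t],
    }
    for f, t in product(from_minutes, to_minutes)
]
example = pd.DataFrame(rows)

# Naive sum counts each from_position once per to_position (three times here).
naive_total = example["from_Minutes"].sum()
# Keep one row per from_position before summing.
true_total = example.drop_duplicates("from_position")["from_Minutes"].sum()

print(f"Naive sum: {naive_total}, deduplicated sum: {true_total}")
```

On the full dataset the deduplication would need the transfer key columns as well as `from_position`, so that rows from different transfers are not collapsed together.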

### Does This Comparison Make Sense?

| from_position | to_position | Comparison Makes Sense? |
|---------------|-------------|-------------------------|
| Striker | Striker | ✅ Yes - Did they improve as a striker? |
| Striker | Winger | ⚠️ Questionable - Different roles |
| Goalkeeper | Striker | ❌ No - Completely different metrics |
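
When an analysis needs exactly one row per transfer, one possible rule (an assumption here, not something the dataset prescribes) is to keep the pair formed by the most-played position before and the most-played position after. A sketch on made-up toy data, with `toy_keys` as a hypothetical shortened key list:

```python
import pandas as pd

# Toy data; values are made up. Player 1 has a 2 x 2 transfer, player 2 a 1 x 1.
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2],
    "from_team_id": [10, 10, 10, 10, 30],
    "to_team_id": [20, 20, 20, 20, 40],
    "from_position": ["Striker", "Striker", "Midfielder", "Midfielder", "Winger"],
    "to_position": ["Winger", "Midfielder", "Winger", "Midfielder", "Winger"],
    "from_Minutes": [1647, 1647, 811, 811, 900],
    "to_Minutes": [1610, 705, 1610, 705, 1200],
})
toy_keys = ["player_id", "from_team_id", "to_team_id"]  # shortened for the toy frame

# Sort so the most-played (before, after) pair comes first within each transfer,
# then keep only that first row per transfer.
primary = (
    toy.sort_values(["from_Minutes", "to_Minutes"], ascending=False)
    .drop_duplicates(toy_keys, keep="first")
    .sort_values(toy_keys)
    .reset_index(drop=True)
)
print(primary[["player_id", "from_position", "to_position"]])
```

This relies on each transfer containing the full position grid, so the row combining both maxima exists. Ties in minutes are broken arbitrarily, and the selected pair can still be a position change.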

---
## PART 3: Position Changes in Detail

Even in 1-to-1 transfers (single position before, single position after), there can be position changes.

In [7]:
same_team = df['from_team_id'] == df['to_team_id']
same_pos = df['from_position'] == df['to_position']

print("=== POSITION CHANGE BREAKDOWN ===")

print(f"\n--- REAL TRANSFERS (different team) ---")
real = df[~same_team]
print(f"Total rows: {len(real):,}")
print(f"  Same position: {(real['from_position'] == real['to_position']).sum():,} ({(real['from_position'] == real['to_position']).mean()*100:.1f}%)")
print(f"  Different position: {(real['from_position'] != real['to_position']).sum():,} ({(real['from_position'] != real['to_position']).mean()*100:.1f}%)")

print(f"\n--- SAME TEAM (promotion/relegation) ---")
not_real = df[same_team]
print(f"Total rows: {len(not_real):,}")
print(f"  Same position: {(not_real['from_position'] == not_real['to_position']).sum():,}")
print(f"  Different position: {(not_real['from_position'] != not_real['to_position']).sum():,}")

=== POSITION CHANGE BREAKDOWN ===

--- REAL TRANSFERS (different team) ---
Total rows: 56,697
  Same position: 45,415 (80.1%)
  Different position: 11,282 (19.9%)

--- SAME TEAM (promotion/relegation) ---
Total rows: 21,494
  Same position: 17,163
  Different position: 4,331


In [8]:
# Most common position changes in real transfers
print("=== MOST COMMON POSITION CHANGES (Real Transfers) ===")

real_diff_pos = df[(~same_team) & (~same_pos)]
print(f"\nTotal rows with position change: {len(real_diff_pos):,}")
print(f"\nMost common combinations:")
print(real_diff_pos.groupby(['from_position', 'to_position']).size().sort_values(ascending=False).head(10))

=== MOST COMMON POSITION CHANGES (Real Transfers) ===

Total rows with position change: 11,282

Most common combinations:
from_position     to_position     
Winger            Midfielder          1238
Striker           Winger              1223
Winger            Striker             1209
Midfielder        Winger              1158
Full Back         Central Defender    1113
Central Defender  Full Back           1033
Winger            Full Back            669
Midfielder        Central Defender     618
Central Defender  Midfielder           588
Full Back         Winger               540
dtype: int64


In [9]:
# Even 1-to-1 transfers can have position changes
print("=== 1-TO-1 TRANSFERS WITH POSITION CHANGES ===")

transfers_1to1 = rows_per_transfer[rows_per_transfer['n_rows'] == 1]
df_1to1 = df.merge(transfers_1to1[keys], on=keys)
real_1to1 = df_1to1[df_1to1['from_team_id'] != df_1to1['to_team_id']]

print(f"\nTotal 1-to-1 real transfers: {len(real_1to1):,}")
same_pos_1to1 = (real_1to1['from_position'] == real_1to1['to_position']).sum()
diff_pos_1to1 = (real_1to1['from_position'] != real_1to1['to_position']).sum()

print(f"  Same position: {same_pos_1to1:,} ({same_pos_1to1/len(real_1to1)*100:.1f}%)")
print(f"  Different position: {diff_pos_1to1:,} ({diff_pos_1to1/len(real_1to1)*100:.1f}%)")

print("\n** This means: even when a player had only 1 position before and 1 after,")
print("   they could have changed roles (e.g., Winger before → Striker after)")

=== 1-TO-1 TRANSFERS WITH POSITION CHANGES ===

Total 1-to-1 real transfers: 43,405
  Same position: 39,380 (90.7%)
  Different position: 4,025 (9.3%)

** This means: even when a player had only 1 position before and 1 after,
   they could have changed roles (e.g., Winger before → Striker after)


---
## PART 4: Recommended Dataset Splits

Based on the analysis, here are the recommended ways to filter the data:

In [10]:
same_team = df['from_team_id'] == df['to_team_id']
same_pos = df['from_position'] == df['to_position']

print("=== RECOMMENDED DATASET SPLITS ===")
print(f"\nOriginal dataset: {len(df):,} rows\n")

# Split 1: Cleanest - Real transfers, same position
df_clean = df[(~same_team) & (same_pos)]
print(f"1. REAL TRANSFERS + SAME POSITION (cleanest for comparison):")
print(f"   Rows: {len(df_clean):,} ({len(df_clean)/len(df)*100:.1f}%)")
print(f"   Unique players: {df_clean['player_id'].nunique():,}")
print(f"   Use case: 'Did the player improve in the same role after transfer?'")

# Split 2: Real transfers, all positions
df_real_all = df[~same_team]
print(f"\n2. REAL TRANSFERS + ALL POSITIONS:")
print(f"   Rows: {len(df_real_all):,} ({len(df_real_all)/len(df)*100:.1f}%)")
print(f"   Unique players: {df_real_all['player_id'].nunique():,}")
print(f"   Use case: 'All market transfers, need to handle position logic'")

# Split 3: Position changers only
df_pos_change = df[(~same_team) & (~same_pos)]
print(f"\n3. REAL TRANSFERS + POSITION CHANGE:")
print(f"   Rows: {len(df_pos_change):,} ({len(df_pos_change)/len(df)*100:.1f}%)")
print(f"   Unique players: {df_pos_change['player_id'].nunique():,}")
print(f"   Use case: 'Study role transitions after transfers'")

# Split 4: Promotion/Relegation
df_promo_releg = df[same_team]
print(f"\n4. SAME TEAM (promotion/relegation):")
print(f"   Rows: {len(df_promo_releg):,} ({len(df_promo_releg)/len(df)*100:.1f}%)")
print(f"   Unique players: {df_promo_releg['player_id'].nunique():,}")
print(f"   Use case: 'Study impact of playing at different competitive levels'")

=== RECOMMENDED DATASET SPLITS ===

Original dataset: 78,191 rows

1. REAL TRANSFERS + SAME POSITION (cleanest for comparison):
   Rows: 45,415 (58.1%)
   Unique players: 31,983
   Use case: 'Did the player improve in the same role after transfer?'

2. REAL TRANSFERS + ALL POSITIONS:
   Rows: 56,697 (72.5%)
   Unique players: 34,643
   Use case: 'All market transfers, need to handle position logic'

3. REAL TRANSFERS + POSITION CHANGE:
   Rows: 11,282 (14.4%)
   Unique players: 8,521
   Use case: 'Study role transitions after transfers'

4. SAME TEAM (promotion/relegation):
   Rows: 21,494 (27.5%)
   Unique players: 13,017
   Use case: 'Study impact of playing at different competitive levels'
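
Instead of keeping several filtered copies, the splits can also be stored as a label column on a single frame. The sketch below uses made-up toy data, and the label strings are hypothetical. Split 2 above is the union of the two `real_*` labels.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "from_team_id": [10, 10, 30, 50],
    "to_team_id": [20, 20, 30, 50],
    "from_position": ["Striker", "Striker", "Winger", "Midfielder"],
    "to_position": ["Striker", "Winger", "Winger", "Winger"],
})

same_team = toy["from_team_id"] == toy["to_team_id"]
same_pos = toy["from_position"] == toy["to_position"]

# Conditions are checked in order; rows matching neither fall back to the default.
toy["split"] = np.select(
    [~same_team & same_pos, ~same_team & ~same_pos],
    ["real_same_position", "real_position_change"],
    default="same_team",
)
print(toy["split"].value_counts())
```

A single label column makes it easy to group or filter later, for example `toy[toy["split"] == "real_same_position"]`.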


## SUMMARY

1. **"Not Real Transfers" (27.5% of rows)**
   - Same team, different league (promotion/relegation)
   - Player never changed employer
   - Filter: `df['from_team_id'] != df['to_team_id']`

2. **Multiple Rows per Transfer**
   - Cartesian product: positions_before × positions_after
   - Each row compares performance in one specific position combination
   - 87% of transfers are simple (1 position → 1 position)

3. **Position Changes**
   - Even 1-to-1 transfers can have position changes
   - ~9% of 1-to-1 real transfers involve a role change
   - Consider filtering: `df['from_position'] == df['to_position']`

4. **Recommended Approach**
   - For the cleanest analysis: real transfers + same position (58% of rows)
   - Keep position changers separate for specialized analysis
   - Promotion/relegation is NOT a market transfer