# Competition Metadata Review (Wyscout) 🏟️🌍

This notebook reviews competition-level metadata from Wyscout, with the goal of understanding competition structure, scope, and identifiers needed to contextualize transfers and competitive level changes.

## 0) Imports & Setup

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 180)

PATH = "../../raw_data_agust_wy//"

## 1) Quick Data Snapshot

In [2]:
df_comp = pd.read_parquet(f"{PATH}competitions_wyscout.parquet")

In [3]:
df_comp.shape

(1793, 18)

In [4]:
df_comp.dtypes.value_counts()

object    8
int64     5
bool      5
Name: count, dtype: int64

In [5]:
df_comp.head()

| | competition_id | competition | name | country | division | gender | type | category | season_id | start_date | end_date | completed | season | youth | domestic_cup | domestic_league | international_cup | season_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 127 | 127 | Abissnet Superiore | Albania | 1 | male | club | default | 185884 | 2018-08-19 | 2019-05-26 | True | 2018 | False | False | True | False | 2018/2019 |
| 1 | 127 | 127 | Abissnet Superiore | Albania | 1 | male | club | default | 185885 | 2019-08-23 | 2020-07-29 | True | 2019 | False | False | True | False | 2019/2020 |
| 2 | 127 | 127 | Abissnet Superiore | Albania | 1 | male | club | default | 186452 | 2020-09-12 | 2021-05-26 | True | 2020 | False | False | True | False | 2020/2021 |
| 3 | 127 | 127 | Abissnet Superiore | Albania | 1 | male | club | default | 187536 | 2021-09-10 | 2022-05-26 | True | 2021 | False | False | True | False | 2021/2022 |
| 4 | 127 | 127 | Abissnet Superiore | Albania | 1 | male | club | default | 188329 | 2022-08-19 | 2023-05-29 | True | 2022 | False | False | True | False | 2022/2023 |


In [6]:
df_comp.isna().mean().sort_values(ascending=False)

competition_id       0.0
competition          0.0
international_cup    0.0
domestic_league      0.0
domestic_cup         0.0
youth                0.0
season               0.0
completed            0.0
end_date             0.0
start_date           0.0
season_id            0.0
category             0.0
type                 0.0
gender               0.0
division             0.0
country              0.0
name                 0.0
season_name          0.0
dtype: float64

In [7]:
df_comp.duplicated().sum()

np.int64(0)

## 2) Clean Dataframe

In [8]:
# Does each competition_id map to a single name? (checked here against the `competition` column)
check_id_to_name = (
    df_comp.groupby("competition_id")["competition"]
    .nunique()
    .sort_values(ascending=False)
)

check_id_to_name.head()

competition_id
127    1
883    1
836    1
837    1
841    1
Name: competition, dtype: int64

In [9]:
# Does each name map to a single competition_id? (checked here against the `competition` column)
check_name_to_id = (
    df_comp.groupby("competition")["competition_id"]
    .nunique()
    .sort_values(ascending=False)
)

check_name_to_id.head()

competition
127    1
883    1
836    1
837    1
841    1
Name: competition_id, dtype: int64
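Both checks above group on `competition`, which mirrors `competition_id`, so they pass trivially. A minimal sketch of the same one-to-one check against the human-readable `name` column follows. It runs on a toy frame rather than the Wyscout file, and the id `900` and its row are hypothetical values invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_comp; id 900 and its row are hypothetical, not from the dataset.
toy = pd.DataFrame({
    "competition_id": [127, 127, 255, 255, 900],
    "name": ["Abissnet Superiore", "Abissnet Superiore", "Serie A", "Serie A", "Serie A"],
    "country": ["Albania", "Albania", "Brazil", "Brazil", "Italy"],
})

# Direction 1: each competition_id should map to exactly one name.
names_per_id = toy.groupby("competition_id")["name"].nunique()
ids_with_many_names = names_per_id[names_per_id > 1]

# Direction 2: a name may be shared by several ids (same league name in
# different countries), so hits here are informative rather than errors.
ids_per_name = toy.groupby("name")["competition_id"].nunique()
shared_names = ids_per_name[ids_per_name > 1]

print(ids_with_many_names)
print(shared_names)
```

If `shared_names` is non-empty on the real data, `name` alone is probably not a safe join key and `competition_id` should be used instead.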

In [10]:
df_comp_clean = (
    df_comp
    .drop(columns=["competition"])
    .copy()
)

In [11]:
df_comp[["gender","type","category"]].nunique()

gender      1
type        1
category    1
dtype: int64

In [12]:
df_comp_clean = (
    df_comp
    .drop(columns=["gender","type","category"])
    .copy()
)

In [13]:
df_comp[["youth","domestic_cup","domestic_league","international_cup"]].nunique()

youth                1
domestic_cup         1
domestic_league      1
international_cup    1
dtype: int64

In [14]:
df_comp_clean = df_comp.drop(
    columns=["competition","gender","type","category","youth","domestic_cup","domestic_league","international_cup"]
).copy()

In [15]:
df_comp_clean.head()

| | competition_id | name | country | division | season_id | start_date | end_date | completed | season | season_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 127 | Abissnet Superiore | Albania | 1 | 185884 | 2018-08-19 | 2019-05-26 | True | 2018 | 2018/2019 |
| 1 | 127 | Abissnet Superiore | Albania | 1 | 185885 | 2019-08-23 | 2020-07-29 | True | 2019 | 2019/2020 |
| 2 | 127 | Abissnet Superiore | Albania | 1 | 186452 | 2020-09-12 | 2021-05-26 | True | 2020 | 2020/2021 |
| 3 | 127 | Abissnet Superiore | Albania | 1 | 187536 | 2021-09-10 | 2022-05-26 | True | 2021 | 2021/2022 |
| 4 | 127 | Abissnet Superiore | Albania | 1 | 188329 | 2022-08-19 | 2023-05-29 | True | 2022 | 2022/2023 |


In [16]:
df_comp_clean.shape

(1793, 10)

### Dropped Columns and Rationale

The following columns were removed from the competition metadata because they either duplicated another column or were constant across all observations, and so added no analytical value:

- `competition`: same value as `competition_id`
- `gender`: always `male`
- `type`: always `club`
- `category`: always `default`
- `youth`: always `False`
- `domestic_cup`: always `False`
- `domestic_league`: always `True`
- `international_cup`: always `False`

Since these fields either do not vary or merely repeat `competition_id`, they provide no additional information for joins or analysis. Removing them reduces noise and keeps the competition table focused on identifiers and season-specific context.
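The drop list could also be derived programmatically instead of column by column. The sketch below is one possible approach on a toy frame, not the notebook's actual procedure. The toy values and the variable names `constant_cols` and `redundant_cols` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for df_comp with one constant and one duplicated column.
toy = pd.DataFrame({
    "competition_id": [127, 127, 255],
    "competition": [127, 127, 255],      # repeats competition_id
    "gender": ["male", "male", "male"],  # constant
    "season_id": [185884, 185885, 190001],
})

# Columns with a single distinct value (NaN counted as a value) carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Columns that match competition_id value-for-value are redundant with it.
redundant_cols = [
    c for c in toy.columns
    if c != "competition_id" and toy[c].equals(toy["competition_id"])
]

toy_clean = toy.drop(columns=constant_cols + redundant_cols)
print(constant_cols, redundant_cols, list(toy_clean.columns))
```

Deriving the list this way keeps the rationale and the code in sync if a later data refresh makes one of these fields vary.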

## 3) Data Summary

In [17]:
n_countries = df_comp_clean["country"].nunique()
n_competitions = df_comp_clean["competition_id"].nunique()

n_countries, n_competitions

(119, 269)

In [18]:
comp_by_country = (
    df_comp_clean.groupby("country")["competition_id"]
    .nunique()
    .sort_values(ascending=False)
)

comp_by_country.head(20)

country
United States       48
Brazil              20
Australia           12
Ghana                3
Azerbaijan           3
Denmark              3
Slovakia             3
Canada               3
New Zealand          3
Argentina            3
Peru                 3
Serbia               2
Sweden               2
Spain                2
Hungary              2
Iceland              2
India                2
Slovenia             2
Ireland Republic     2
Israel               2
Name: competition_id, dtype: int64

In [19]:
country_summary = (
    df_comp_clean.groupby("country")
    .agg(
        competitions=("competition_id","nunique"),
        seasons=("season_id","nunique"),
    )
    .sort_values("seasons", ascending=False)
)

country_summary.head(20)

| country | competitions | seasons |
| --- | --- | --- |
| United States | 48 | 257 |
| Brazil | 20 | 166 |
| Australia | 12 | 68 |
| Slovakia | 3 | 23 |
| Denmark | 3 | 21 |
| Argentina | 3 | 19 |
| China PR | 2 | 18 |
| Azerbaijan | 3 | 18 |
| Paraguay | 2 | 17 |
| Colombia | 2 | 17 |


In [20]:
comp_time = (
    df_comp_clean.groupby("competition_id")
    .agg(
        country=("country","first"),
        name=("name","first"),
        n_seasons=("season_id","nunique"),
        first_season=("season","min"),
        last_season=("season","max"),
    )
    .sort_values("n_seasons", ascending=False)
)

comp_time.head(20)

| competition_id | country | name | n_seasons | first_season | last_season |
| --- | --- | --- | --- | --- | --- |
| 43226 | United States | NCAA D1 Non-conference matches | 12 | 2019 | 2025 |
| 886 | Venezuela | Primera División | 10 | 2018 | 2026 |
| 255 | Brazil | Serie A | 9 | 2018 | 2026 |
| 244 | Brazil | Paulista A1 | 9 | 2018 | 2026 |
| 1414 | United States | USL Championship | 9 | 2018 | 2026 |
| 248 | Brazil | Pernambucano 1 | 9 | 2018 | 2026 |
| 682 | Panama | LPF | 9 | 2018 | 2026 |
| 245 | Brazil | Paulista A2 | 9 | 2018 | 2026 |
| 890 | Vietnam | V.League 1 | 9 | 2018 | 2025 |
| 685 | Paraguay | Division Profesional | 9 | 2018 | 2026 |


In [21]:
comp_time["n_seasons"].describe()

count    269.000000
mean       6.665428
std        2.332356
min        1.000000
25%        6.000000
50%        8.000000
75%        8.000000
max       12.000000
Name: n_seasons, dtype: float64

## Competition Data Summary

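- The cleaned dataset contains **269 unique competitions** across **119 countries**, in 1,793 competition-season rows.
- Rows are at the **competition-season level**: `competition_id` identifies a competition and stays fixed across its seasons (for example, `127` for every Abissnet Superiore season), while `season_id` identifies one season of it.
- Most competitions show **strong temporal coverage**:
  - Median: **8 seasons**
  - IQR: **6–8 seasons**
  - The minimum is 1 season, but at least three quarters of competitions have 6 or more.
- The **United States (48), Brazil (20), and Australia (12)** have the most competitions, reflecting domestic league structure (for example, Brazilian state championships such as Paulista A1 and Pernambucano 1) rather than data issues.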
- Overall, the competition metadata is **clean, stable, and suitable** for contextualizing transfers and competitive environments.
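The grain and coverage claims above can be checked with a few lines of pandas. This is a minimal sketch on a toy frame, assuming the same `competition_id` and `season_id` columns as `df_comp_clean`. The ids `900`, `190001`, `190002`, and `190003` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_comp_clean; the last three season ids and id 900 are hypothetical.
toy = pd.DataFrame({
    "competition_id": [127, 127, 127, 255, 255, 900],
    "season_id": [185884, 185885, 186452, 190001, 190002, 190003],
})

# Grain check: no repeated (competition_id, season_id) pair,
# and season_id on its own is unique, so it can serve as a join key.
has_duplicate_pairs = toy.duplicated(["competition_id", "season_id"]).any()
season_id_is_key = toy["season_id"].is_unique

# Coverage: seasons per competition, quartiles, and share of single-season competitions.
n_seasons = toy.groupby("competition_id")["season_id"].nunique()
q1, median, q3 = n_seasons.quantile([0.25, 0.5, 0.75])
single_season_share = (n_seasons == 1).mean()

print(has_duplicate_pairs, season_id_is_key)
print(q1, median, q3, single_season_share)
```

On the real table, swapping `toy` for `df_comp_clean` would put a number on how many competitions have only one season, which `describe()` alone does not show.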