In [2]:
import os
import pandas as pd

raw_path = os.path.join("..", "raw") 

df_game = pd.read_csv(os.path.join(raw_path, "games.csv"))
df_player = pd.read_csv(os.path.join(raw_path, "players.csv"))
df_stats = pd.read_csv(os.path.join(raw_path, "stats.csv"))

## Data pre-processing on df_player

In [3]:
df_player.head()

Unnamed: 0,PlayerId,PlayerName,Height,Weight,Dob,Position
0,2020654979,Jake Aarts,177,75,1994-12-08,Forward
1,2018655703,Ryan Abbott,200,100,1991-06-25,Ruck
2,2002652211,Gary Ablett,182,87,1984-05-14,Forward
3,2014651814,Blake Acres,191,90,1995-10-07,Midfield
4,2025654137,Jed Adams,196,91,2004-05-14,Defender


In [4]:
# Check for Duplicate Rows
duplicate_total = df_player.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_total}")

# Check for Null (Missing) Values
null_only = df_player.isnull().sum()
print("--- Columns with Missing Values ---")
print(null_only[null_only > 0])

Total Duplicate Rows: 0
--- Columns with Missing Values ---
Series([], dtype: int64)


In [5]:
# Examine "Position" column
pos_counts = df_player["Position"].value_counts()
print(pos_counts)

Position
Defender              527
Forward               503
Midfield              372
Midfield, Forward     170
Ruck                   91
Defender, Midfield     66
Defender, Forward      47
Forward, Ruck          44
no position             3
Midfield, Ruck          1
Defender, Ruck          1
Name: count, dtype: int64


In this phase, we transform raw player metadata into analytical features suitable for causal inference. The processing pipeline follows three core principles:

- Age: We extract the birth year from the Dob column so that age can be calculated later. Age is a critical factor because it reflects professional experience and physical maturity.

- Positional Integrity (Ruck-First Rule): To ensure independent outcome models, we resolve hybrid roles (e.g., "Forward, Ruck"). We apply a Priority-Based Mapping strategy as follows:
    - Ruck Priority: Any player whose position includes "Ruck" (e.g., Forward, Ruck) is classified strictly as a Ruck. This maximizes the sample size for our height-based causal model, where physical traits are most dominant.
    - Primary Role Selection: For other hybrids (e.g., Defender, Midfield), we assign the first listed position as the Primary Position.
    - Data Pruning: Rows labeled as no position are removed as they lack the critical categorical information required for our analysis.

- Physiological Derivation: We derive BMI (weight in kilograms divided by the square of height in metres) to capture the relationship between body mass and physical performance metrics like Tackles or Contested Marks.
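
The mapping and BMI rules above can be checked on a few hand-written labels. This is a minimal sketch rather than the notebook's pipeline; the helper name `to_primary_position` and the toy values are illustrative assumptions.

```python
import pandas as pd

def to_primary_position(pos: str) -> str:
    """Ruck-first rule: any label containing 'Ruck' maps to 'Ruck';
    otherwise the first listed role is kept."""
    if "Ruck" in pos:
        return "Ruck"
    return pos.split(",")[0].strip()

# Toy rows (illustrative values, not taken from players.csv)
toy = pd.DataFrame({
    "Position": ["Forward, Ruck", "Defender, Midfield", "Midfield", "no position"],
    "Height": [200, 188, 182, 190],   # cm
    "Weight": [100, 88, 87, 85],      # kg
})

# Drop unlabeled players first, then collapse hybrids to a single role
toy = toy[toy["Position"] != "no position"].copy()
toy["PrimaryPosition"] = toy["Position"].map(to_primary_position)

# BMI = weight (kg) / height (m)^2
toy["BMI"] = toy["Weight"] / (toy["Height"] / 100.0) ** 2

print(toy[["Position", "PrimaryPosition", "BMI"]])
```

In this toy frame the "Forward, Ruck" hybrid becomes a Ruck, while "Defender, Midfield" keeps its first-listed role.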


In [6]:
def extract_birth_year(df):
    df = df.copy()
    if "Dob" in df.columns:
        df["Dob"] = pd.to_datetime(df["Dob"], errors="coerce")
        df["BirthYear"] = df["Dob"].dt.year.astype("Int64")
    return df

def finalize_player_positions(df):
    """
    Standardizes player positions using the 'Position' column.
    Removes 'no position' entries and applies the Ruck-Priority rule.
    """
    # 1. Filter out 'no position'
    df_clean = df[df['Position'] != 'no position'].copy()
    
    def map_to_single(pos):
        if pd.isna(pos): 
            return 'Unknown'
        
        # 2. Ruck Priority: Any hybrid with 'Ruck' becomes a 'Ruck'
        if 'Ruck' in pos:
            return 'Ruck'
        
        # 3. First-Mention: Take the first role for other hybrids
        return pos.split(',')[0].strip()

    df_clean['PrimaryPosition'] = df_clean['Position'].apply(map_to_single)
    return df_clean

def add_players_derived_features(df):
    df = df.copy()
    if "Height" in df.columns and "Weight" in df.columns:
        valid_mask = (df["Height"] > 0) & (df["Height"].notna())
        h_m = df.loc[valid_mask, "Height"] / 100.0
        df.loc[valid_mask, "BMI"] = df.loc[valid_mask, "Weight"] / (h_m ** 2)
    return df

# Execution
df_player_processed = (df_player
                       .pipe(extract_birth_year)
                       .pipe(finalize_player_positions)
                       .pipe(add_players_derived_features))

# Drop raw 'Dob' and 'Position' columns as they are no longer needed
cols_to_drop = [c for c in ["Dob", "Position"] if c in df_player_processed.columns]
df_player_final = df_player_processed.drop(columns=cols_to_drop)

df_player_final.head()

Unnamed: 0,PlayerId,PlayerName,Height,Weight,BirthYear,PrimaryPosition,BMI
0,2020654979,Jake Aarts,177,75,1994,Forward,23.939481
1,2018655703,Ryan Abbott,200,100,1991,Ruck,25.0
2,2002652211,Gary Ablett,182,87,1984,Forward,26.264944
3,2014651814,Blake Acres,191,90,1995,Midfield,24.670376
4,2025654137,Jed Adams,196,91,2004,Defender,23.688047


## Data pre-processing on df_game

In [7]:
df_game.columns

Index(['GameId', 'Year', 'Round', 'Date', 'MaxTemp', 'MinTemp', 'Rainfall',
       'Venue', 'StartTime', 'Attendance', 'HomeTeam', 'HomeTeamScoreQT',
       'HomeTeamScoreHT', 'HomeTeamScore3QT', 'HomeTeamScoreFT',
       'HomeTeamScore', 'AwayTeam', 'AwayTeamScoreQT', 'AwayTeamScoreHT',
       'AwayTeamScore3QT', 'AwayTeamScoreFT', 'AwayTeamScore'],
      dtype='object')

In [8]:
# Check for Duplicate Rows
duplicate_total = df_game.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_total}")

# Check for Null (Missing) Values
null_only = df_game.isnull().sum()
print("--- Columns with Missing Values ---")
print(null_only[null_only > 0])

Total Duplicate Rows: 0
--- Columns with Missing Values ---
MaxTemp     11
MinTemp     11
Rainfall    23
dtype: int64


Our data audit revealed a small number of missing values in the environmental columns: MaxTemp (11), MinTemp (11), and Rainfall (23). Because the affected rows represent under 1% of the match records (24 of 2,879), we have opted to drop them.
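
One way to quantify the share of affected rows before dropping them is sketched here. The frame `games` is a made-up stand-in for `df_game`, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_game (values are made up)
games = pd.DataFrame({
    "MaxTemp": [19.3, np.nan, 22.1, 14.6],
    "MinTemp": [12.7, np.nan, 13.1, 9.3],
    "Rainfall": [0.0, 1.2, np.nan, 0.2],
})
weather_cols = ["MaxTemp", "MinTemp", "Rainfall"]

# A row is lost if ANY weather column is missing, so count rows, not cells
rows_affected = games[weather_cols].isna().any(axis=1).sum()
share_affected = rows_affected / len(games)

print(f"Rows with at least one missing weather value: {rows_affected}")
print(f"Share of all rows: {share_affected:.1%}")
```

Counting rows rather than summing the per-column null counts avoids double counting matches that miss more than one weather value.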

In [9]:
# Create a copy to avoid SettingWithCopyWarning
df_game_cleaned = df_game.dropna(subset=['MaxTemp', 'MinTemp', 'Rainfall']).copy()

# Verification
print(f"Original Row Count: {len(df_game)}")
print(f"Cleaned Row Count: {len(df_game_cleaned)}")
print(f"Rows Removed: {len(df_game) - len(df_game_cleaned)}")

# Final check to confirm 0 nulls
print("\n--- Remaining Null Values ---")
print(df_game_cleaned[['MaxTemp', 'MinTemp', 'Rainfall']].isnull().sum())

Original Row Count: 2879
Cleaned Row Count: 2855
Rows Removed: 24

--- Remaining Null Values ---
MaxTemp     0
MinTemp     0
Rainfall    0
dtype: int64


In our framework, we hypothesize that environmental conditions (specifically Maximum Temperature, Minimum Temperature, and Rainfall) do not just correlate with performance but act as Effect Modifiers: they may change how strongly physical traits translate into performance.

In [10]:
selected_features = df_game[["MaxTemp", "MinTemp", "Rainfall"]].value_counts()
print(selected_features.head(10)) # Show the top 10 most common weather patterns
print()
print(df_game[["MaxTemp", "MinTemp", "Rainfall"]].describe())

MaxTemp  MinTemp  Rainfall
19.3     12.7     0.0         4
13.5     9.8      0.0         4
14.1     6.8      0.0         3
22.1     13.1     0.0         3
13.4     6.5      0.0         3
         9.8      0.0         3
17.3     5.1      0.0         3
14.6     9.3      0.2         3
17.0     8.9      0.0         3
18.4     7.6      0.0         3
Name: count, dtype: int64

           MaxTemp      MinTemp     Rainfall
count  2868.000000  2868.000000  2856.000000
mean     19.234066     9.818096     1.947549
std       5.107526     4.580458     6.199726
min       8.700000    -5.400000     0.000000
25%      15.100000     6.800000     0.000000
50%      18.500000     9.400000     0.000000
75%      22.700000    12.700000     1.000000
max      38.600000    25.000000   114.400000


To transform raw meteorological data into actionable causal modifiers, we derived the following features:
- Average Temperature (AvgTemp): Calculated as the arithmetic mean of MaxTemp and MinTemp to represent the overall thermal environment of the match.

- Thermal Volatility (TempRange): Calculated as MaxTemp minus MinTemp to capture the fluctuation in temperature, which may serve as a stressor for player endurance.

- Rainfall Categorization (IsRainy): While raw rainfall volume is recorded, the psychological and tactical shift in AFL primarily depends on whether the surface is wet. We created a binary indicator (IsRainy), set to 1 when Rainfall is greater than 0, to distinguish between "Dry Ball" and "Wet Ball" conditions.

In [11]:
def refine_weather_features(df):
    """
    Refines weather data by calculating mean temperatures and 
    creating binary indicators for rainfall.
    """
    df = df.copy()
    
    # 1. Calculate Average Temperature and Temperature Range: This provides a more stable metric for thermal stress than just Max/Min
    df['AvgTemp'] = (df['MaxTemp'] + df['MinTemp']) / 2
    df['TempRange'] = df['MaxTemp'] - df['MinTemp']
    
    # 2. Create a Binary Rainfall Indicator (Modifier): This captures the threshold effect of 'slippery' conditions
    df['IsRainy'] = (df['Rainfall'] > 0).astype(int)
    
    return df

# Execution 
df_game_refined = refine_weather_features(df_game_cleaned)

df_game_refined[['AvgTemp', 'TempRange', 'IsRainy']].head()

Unnamed: 0,AvgTemp,TempRange,IsRainy
0,18.1,11.8,0
1,17.7,16.0,0
2,18.55,17.7,1
3,22.1,14.0,0
4,23.95,8.5,0


## Data pre-processing on df_stats

In [12]:
df_stats.columns

Index(['GameId', 'Year', 'Round', 'Team', 'PlayerId', 'PlayerName',
       'GameNumber', 'Disposals', 'Kicks', 'Marks', 'Handballs', 'Goals',
       'Behinds', 'HitOuts', 'Tackles', 'Rebounds', 'Inside50s', 'Clearances',
       'Clangers', 'Frees', 'FreesAgainst', 'BrownlowVotes',
       'ContestedPossessions', 'UncontestedPossessions', 'ContestedMarks',
       'MarksInside50', 'OnePercenters', 'Bounces', 'GoalAssists', '%Played',
       'Subs'],
      dtype='object')

In [13]:
# Check for Duplicate Rows
duplicate_total = df_stats.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_total}")

# Remove Duplicates
df_stats = df_stats.drop_duplicates()
print(f"Duplicates removed. New row count: {len(df_stats)}")

# Check for Null (Missing) Values
null_only = df_stats.isnull().sum()
print("--- Columns with Missing Values ---")
print(null_only[null_only > 0])

Total Duplicate Rows: 138
Duplicates removed. New row count: 128800
--- Columns with Missing Values ---
Subs    43560
dtype: int64


In [14]:
# Examine "Subs" column
sub = df_stats["Subs"].value_counts()
print(sub)

Subs
-      42182
0      36291
Off     3386
On      3381
Name: count, dtype: int64


An investigation of the Subs column reveals that it is a categorical variable tracking AFL substitution events. Among the 85,240 non-missing observations, about 92% are marked as - or 0, which we read as no sub event. A further 43,560 rows (about 34% of the dataset) are missing entirely, presumably representing seasons in which the sub rule was not active.

We have decided to drop this column. Because "Off" and "On" events account for only 6,767 rows (about 5% of the dataset), including this variable would introduce unnecessary complexity without providing significant explanatory power for our primary study.
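
The category shares behind this decision can be computed directly, with missing values kept as their own category. The sketch below uses a made-up series in place of `df_stats["Subs"]`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_stats["Subs"] (made-up values)
subs = pd.Series(["-", "-", "0", "0", "On", "Off", np.nan, np.nan, np.nan, "-"], name="Subs")

# dropna=False keeps missing values as their own category;
# normalize=True converts counts to shares of all rows
shares = subs.value_counts(dropna=False, normalize=True)
print(shares)

# Share of rows carrying an actual substitution event
event_share = subs.isin(["On", "Off"]).mean()
print(f"On/Off share of all rows: {event_share:.1%}")
```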

In [15]:
df_stats = df_stats.drop(columns=["Subs"])
df_stats.columns

Index(['GameId', 'Year', 'Round', 'Team', 'PlayerId', 'PlayerName',
       'GameNumber', 'Disposals', 'Kicks', 'Marks', 'Handballs', 'Goals',
       'Behinds', 'HitOuts', 'Tackles', 'Rebounds', 'Inside50s', 'Clearances',
       'Clangers', 'Frees', 'FreesAgainst', 'BrownlowVotes',
       'ContestedPossessions', 'UncontestedPossessions', 'ContestedMarks',
       'MarksInside50', 'OnePercenters', 'Bounces', 'GoalAssists', '%Played'],
      dtype='object')

In [16]:
# Examine "Round" column
df_stats["Round"].value_counts()

Round
Round 1              5634
Round 7              5634
Round 18             5634
Round 17             5634
Round 5              5588
Round 6              5588
Round 4              5588
Round 2              5542
Round 8              5502
Round 16             5498
Round 9              5458
Round 3              5452
Round 10             5370
Round 23             5238
Round 22             5238
Round 21             5238
Round 20             5238
Round 19             5238
Round 11             5062
Round 15             5002
Round 14             4460
Round 12             4378
Round 13             4342
Qualifying Final     1252
Semi Final           1252
Preliminary Final    1252
Elimination Final    1252
Round 24             1242
Grand Final           626
Opening Round         368
Name: count, dtype: int64

In [17]:
# Remove Round column to simplify the model's categorical complexity
df_stats = df_stats.drop(columns=["Round"])
df_stats.columns

Index(['GameId', 'Year', 'Team', 'PlayerId', 'PlayerName', 'GameNumber',
       'Disposals', 'Kicks', 'Marks', 'Handballs', 'Goals', 'Behinds',
       'HitOuts', 'Tackles', 'Rebounds', 'Inside50s', 'Clearances', 'Clangers',
       'Frees', 'FreesAgainst', 'BrownlowVotes', 'ContestedPossessions',
       'UncontestedPossessions', 'ContestedMarks', 'MarksInside50',
       'OnePercenters', 'Bounces', 'GoalAssists', '%Played'],
      dtype='object')

In [18]:
# Examine "GameNumber" column
df_stats["GameNumber"].value_counts()

GameNumber
1      1252
2      1209
3      1181
4      1152
5      1132
       ... 
428       1
429       1
430       1
431       1
432       1
Name: count, Length: 432, dtype: int64

In [19]:
# Remove GameNumber column to simplify the feature space
df_stats = df_stats.drop(columns=["GameNumber"])
df_stats.columns

Index(['GameId', 'Year', 'Team', 'PlayerId', 'PlayerName', 'Disposals',
       'Kicks', 'Marks', 'Handballs', 'Goals', 'Behinds', 'HitOuts', 'Tackles',
       'Rebounds', 'Inside50s', 'Clearances', 'Clangers', 'Frees',
       'FreesAgainst', 'BrownlowVotes', 'ContestedPossessions',
       'UncontestedPossessions', 'ContestedMarks', 'MarksInside50',
       'OnePercenters', 'Bounces', 'GoalAssists', '%Played'],
      dtype='object')

## Feature Selection Rationale

**1. Removal of Redundant Features**
* **Disposals**: Since $Disposals = Kicks + Handballs$, we have opted to retain **Disposals** as a high-level proxy for total ball handling and removed the individual components (**Kicks** and **Handballs**) to simplify the feature space.
* **Contested & Uncontested Possessions**: These metrics essentially partition total possession into "quality" buckets. As they correlate almost perfectly with total ball-handling metrics and our study focuses on overall performance volume rather than tactical "quality" differences, these were excluded to focus the model on high-signal variables.

**2. Removal of Non-Predictive Identifiers**
* **PlayerName**: This is a high-cardinality string variable that provides no numerical value to a regression or machine learning model. Since **PlayerId** serves as a unique numerical key for each individual, the name column was dropped to reduce memory overhead and ensure the model only processes relevant features.

**3. Eliminating Data Leakage**
* **Brownlow Votes**: These votes are awarded by umpires *post-match* as a subjective evaluation of the "Best on Ground." Since our objective is to predict performance outcomes based on pre-existing physical traits, including an award that is determined after the performance has occurred would introduce data leakage. Removing it keeps the model from conditioning on information that is only available after the match.

**4. Pruning Low-Variance and Sparse Features**
* **Bounces**: In AFL, bounces are highly specific to certain roles, and for the majority of the dataset this value is zero or near zero. Including a feature with such high sparsity adds noise and can lead to overfitting without providing a meaningful causal link to physical traits like Height or BMI across the entire league.
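
Two of the claims above are easy to verify empirically before dropping columns: the identity $Disposals = Kicks + Handballs$ and the sparsity of Bounces. The following sketch runs both checks on a made-up frame standing in for `df_stats`.

```python
import pandas as pd

# Toy stand-in for df_stats (made-up values)
stats = pd.DataFrame({
    "Kicks":     [10, 5, 12, 0],
    "Handballs": [8, 3, 7, 2],
    "Disposals": [18, 8, 19, 2],
    "Bounces":   [0, 0, 3, 0],
})

# Share of rows where the identity Disposals = Kicks + Handballs holds
identity_rate = (stats["Disposals"] == stats["Kicks"] + stats["Handballs"]).mean()

# Share of rows with zero bounces (a simple sparsity measure)
zero_bounce_rate = (stats["Bounces"] == 0).mean()

print(f"Identity holds in {identity_rate:.0%} of rows")
print(f"Bounces == 0 in {zero_bounce_rate:.0%} of rows")
```

If the identity rate on the real data were noticeably below 100%, the component columns would carry information that Disposals does not, and dropping them would deserve a second look.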

In [20]:
# 1. Define the columns to be removed based on the four analytical principles
columns_to_remove = [
    # Principle 1: Keep 'Disposals' as the aggregate proxy
    'Kicks', 'Handballs', 
    'ContestedPossessions', 'UncontestedPossessions',
    
    # Principle 2: Remove Non-Predictive Metadata
    'PlayerName', 
    
    # Principle 3: Eliminate Data Leakage
    'BrownlowVotes', 
    
    # Principle 4: Remove Sparse/Low-Variance Noise
    'Bounces'
]

# 2. Drop columns only if they exist in the dataframe to avoid errors
existing_cols = [col for col in columns_to_remove if col in df_stats.columns]
df_stats_final = df_stats.drop(columns=existing_cols)

# 3. Verification
df_stats_final.columns

Index(['GameId', 'Year', 'Team', 'PlayerId', 'Disposals', 'Marks', 'Goals',
       'Behinds', 'HitOuts', 'Tackles', 'Rebounds', 'Inside50s', 'Clearances',
       'Clangers', 'Frees', 'FreesAgainst', 'ContestedMarks', 'MarksInside50',
       'OnePercenters', 'GoalAssists', '%Played'],
      dtype='object')

## Data merge

In [21]:
df_final = df_stats_final.merge(df_player_final, on='PlayerId', how='left')
df_final = df_final.merge(df_game_refined[['GameId', 'AwayTeam', 'AvgTemp', 'TempRange', 'IsRainy']], on='GameId', how='left')

df_final.columns

Index(['GameId', 'Year', 'Team', 'PlayerId', 'Disposals', 'Marks', 'Goals',
       'Behinds', 'HitOuts', 'Tackles', 'Rebounds', 'Inside50s', 'Clearances',
       'Clangers', 'Frees', 'FreesAgainst', 'ContestedMarks', 'MarksInside50',
       'OnePercenters', 'GoalAssists', '%Played', 'PlayerName', 'Height',
       'Weight', 'BirthYear', 'PrimaryPosition', 'BMI', 'AwayTeam', 'AvgTemp',
       'TempRange', 'IsRainy'],
      dtype='object')
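
Because "no position" players and matches with missing weather were removed upstream, the left joins can leave stat rows without player or weather values. A hedged sketch of how such a merge could be validated is shown here, using toy frames rather than the real tables; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the stats and player tables (made-up values)
stats = pd.DataFrame({"PlayerId": [1, 1, 2, 3], "Disposals": [18, 22, 9, 14]})
players = pd.DataFrame({"PlayerId": [1, 2], "Height": [189, 200]})

merged = stats.merge(
    players,
    on="PlayerId",
    how="left",
    validate="many_to_one",  # raises MergeError if PlayerId repeats in `players`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# A many-to-one left join must never change the number of stat rows
assert len(merged) == len(stats)

# Rows without a matching player keep NaN in the player columns
unmatched = (merged["_merge"] == "left_only").sum()
print(f"Stat rows without player metadata: {unmatched}")
```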

In [22]:
## Calculate "Age" at the time of game

df_final['Age'] = df_final['Year'] - df_final['BirthYear']
df_final = df_final.drop(columns=["BirthYear"])
df_final.head()

Unnamed: 0,GameId,Year,Team,PlayerId,Disposals,Marks,Goals,Behinds,HitOuts,Tackles,...,PlayerName,Height,Weight,PrimaryPosition,BMI,AwayTeam,AvgTemp,TempRange,IsRainy,Age
0,2012R0105,2012,Adelaide,2011675768,18,5,2,5,0,5,...,Ian Callinan,171.0,70.0,Forward,23.93899,Adelaide,23.95,8.5,0.0,30
1,2012R0105,2012,Adelaide,2008681760,25,3,2,0,0,2,...,Patrick Dangerfield,189.0,92.0,Midfield,25.755158,Adelaide,23.95,8.5,0.0,22
2,2012R0105,2012,Adelaide,2000686938,17,4,0,0,0,3,...,Michael Doughty,177.0,81.0,Defender,25.854639,Adelaide,23.95,8.5,0.0,33
3,2012R0105,2012,Adelaide,2006687579,19,6,1,3,0,7,...,Richard Douglas,181.0,79.0,Midfield,24.114038,Adelaide,23.95,8.5,0.0,25
4,2012R0105,2012,Adelaide,2010728130,8,1,0,0,0,1,...,Ricky Henderson,188.0,89.0,Midfield,25.181077,Adelaide,23.95,8.5,0.0,24


## Performance Metrics Dictionary

To accurately validate our hypothesis, we categorize performance metrics based on positional responsibilities. This ensures that our models evaluate players against the specific objectives of their roles (Defender, Midfield, Forward, and Ruck).

##### **1. Universal Metrics (Important for All Positions)**
These features represent general ball involvement, discipline, and efficiency across the field.
* **Disposals**: The primary measure of ball handling; represents the total volume of player-ball interactions (kicks plus handballs).
* **Marks**: Catching a kicked ball cleanly. This serves as a general indicator of aerial ability and ball retention.
* **Clangers**: Critical unforced errors or poor disposals. Used to measure efficiency and the "cost" of player involvement.
* **Frees / Frees Against**: Measures of discipline and the ability to win favorable umpire decisions.
* **% Played**: Time on Ground (ToG). This is used as an **exposure variable** to normalize statistics (e.g., stats per 100% game time).
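
A minimal sketch of the exposure normalization mentioned for % Played follows. The column name `Disposals_per100` and the toy values are assumptions; the zero guard is one possible way to handle players with no time on ground.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); '%Played' is time on ground in percent
df = pd.DataFrame({
    "Disposals": [20, 12, 5],
    "%Played":   [80, 100, 0],
})

# Guard against division by zero for rows with no recorded game time
played = df["%Played"].replace(0, np.nan)

# Rate per 100% game time: raw count scaled by 100 / %Played
df["Disposals_per100"] = df["Disposals"] * 100.0 / played

print(df)
```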

##### **2. Ruck: Physical Dominance & Stoppages**
For the Ruck position, the focus is on aerial dominance at stoppages to initiate team possession.
* **Hit-outs**: **Primary Outcome ($Y$)**. The act of tapping the ball out of a ruck contest (similar to a basketball tip-off) toward a teammate's advantage.
* **Clearances**: Successfully extracting the ball from a stoppage (ball-up or throw-in) to initiate an offensive chain.
* **Contested Marks**: Catching the ball while under direct physical pressure from an opponent, a key indicator of raw aerial strength.

##### **3. Midfield: Progression & Inside-Game**
Midfielders are responsible for driving the ball from the defensive zone into the attacking 50.
* **Clearances**: A high-impact metric for midfielders indicating the ability to win the ball in congested areas.
* **Inside 50s**: The total number of times a player successfully delivers the ball into the attacking 50-meter arc.
* **Goal Assists**: The final action (pass or tap) that directly results in a teammate scoring a goal.

##### **4. Forward: Scoring & Offensive Pressure**
The primary goal for Forwards is maximizing scoreboard impact and applying "Front-Half" pressure.
* **Goals / Behinds**: **Primary Outcomes ($Y$)**. The 6-point and 1-point scoring units. Total Score is calculated as $(6 \times Goals) + (1 \times Behinds)$.
* **Marks Inside 50**: A strong indicator of a forward’s ability to find space or win physical contests within scoring range.
* **Tackles**: For forwards, this often represents "forward-pressure" intended to trap the ball in the attacking zone.
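
The Total Score formula given for Goals and Behinds can be applied column-wise. This short sketch uses made-up values, and the column name `TotalScore` is an assumption rather than a field in the dataset.

```python
import pandas as pd

# Toy rows (made-up values)
forwards = pd.DataFrame({"Goals": [2, 0, 4], "Behinds": [5, 1, 0]})

# A goal is worth 6 points and a behind 1 point
forwards["TotalScore"] = 6 * forwards["Goals"] + forwards["Behinds"]

print(forwards)
```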

##### **5. Defender: Prevention & Rebounding**
Defenders focus on intercepting opposition attacks and launching counter-offensives.
* **Rebounds**: Clearing the ball out of the defensive 50-meter arc after an opposition attack.
* **One Percenters**: Unselfish defensive acts such as spoils (punching the ball away), smothers, and shepherds.
* **Tackles**: Physical obstruction used to stop an opponent and prevent goal-scoring opportunities.

## Split - Outlier & Skewness Handling (per position) - Scaling Pipeline
Since each player position (Ruck, Midfield, Forward, Defender) has distinct roles and performance objectives, we implemented a Position-Stratified Outlier Handling approach.

In [23]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

def get_preprocessed_split_data(df, target_col, position_filter, test_size=0.2, random_state=42):
    """
    MODULAR PREPROCESSING FACTORY:
    1. Filters data by position.
    2. Splits into Train/Test partitions to prevent Data Leakage.
    3. Handles outliers (Winsorization) using Training statistics only.
    4. Handles skewness (Log Transform) based on Training distribution.
    5. Scales features using RobustScaler fitted on Training data.
    """
    
    # --- STEP 1: Positional Filtering ---
    df_pos = df[df['PrimaryPosition'] == position_filter].copy()
    if df_pos.empty:
        raise ValueError(f"No data found for position: {position_filter}")

    # --- STEP 2: Feature & Target Separation ---
    numeric_features = [
        'Height', 'Weight', 'BMI', 'Age', 'AvgTemp', 'TempRange',
        'Disposals', 'Marks', 'Goals', 'Behinds', 'HitOuts', 'Tackles', 
        'Rebounds', 'Inside50s', 'Clearances', 'Clangers', 'Frees', 
        'FreesAgainst', 'ContestedMarks', 'MarksInside50', 'OnePercenters', 
        'GoalAssists', '%Played'
    ]
    # Exclude the target from the predictors; otherwise the model would see its own outcome
    numeric_features = [c for c in numeric_features if c != target_col]
    passthrough_features = ['IsRainy']
    all_cols = numeric_features + passthrough_features
    
    X = df_pos[all_cols]
    y = df_pos[target_col]

    # --- STEP 3: Train/Test Split (The Baseline for Leakage Prevention) ---
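    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Work on explicit copies so the in-place clipping below cannot raise SettingWithCopyWarning
    X_train, X_test = X_train.copy(), X_test.copy()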

    # --- STEP 4: Positional Outlier & Skewness Handling (Post-Split) ---
    universal = ['Disposals', 'Marks', 'Clangers', 'Frees', 'FreesAgainst', '%Played']
    mapping = {
        'Ruck': universal + ['HitOuts', 'Clearances', 'ContestedMarks'],
        'Midfield': universal + ['Clearances', 'Inside50s', 'GoalAssists'],
        'Forward': universal + ['Goals', 'Behinds', 'MarksInside50', 'Tackles'],
        'Defender': universal + ['Rebounds', 'OnePercenters', 'Tackles']
    }
    target_metrics = mapping.get(position_filter, universal)

    for metric in target_metrics:
        if metric in X_train.columns:
            # A. Winsorization: Define thresholds based ONLY on X_train
            lower_limit = X_train[metric].quantile(0.01)
            upper_limit = X_train[metric].quantile(0.99)
            
            # Apply thresholds to both sets (Clipping X_test based on X_train logic)
            X_train[metric] = X_train[metric].clip(lower_limit, upper_limit)
            X_test[metric] = X_test[metric].clip(lower_limit, upper_limit)
            
            # B. Log Transformation: Check skewness of X_train ONLY
            if abs(X_train[metric].skew()) > 2.0:
                X_train[metric] = np.log1p(X_train[metric])
                X_test[metric] = np.log1p(X_test[metric])

    # --- STEP 5: Robust Scaling Pipeline ---
    # RobustScaler is chosen to mitigate the influence of superstar performers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', RobustScaler(), numeric_features),
            ('pass', 'passthrough', passthrough_features)
        ]
    )
    
    scaling_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

    # Fit ONLY on the processed training data
    X_train_scaled = scaling_pipeline.fit_transform(X_train)
    X_test_scaled = scaling_pipeline.transform(X_test)

    # Convert back to DataFrame, keeping the original row index so X and y stay aligned
    X_train_final = pd.DataFrame(X_train_scaled, columns=all_cols, index=X_train.index)
    X_test_final = pd.DataFrame(X_test_scaled, columns=all_cols, index=X_test.index)

    print(f"--- Pipeline Ready: Data processed for {position_filter} ---")
    return X_train_final, X_test_final, y_train, y_test, scaling_pipeline
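
The leakage-safe part of this function (thresholds and the skewness decision taken from the training partition only) can be illustrated in isolation. The sketch below uses synthetic counts rather than real match data, and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed counts standing in for a metric such as HitOuts
train = pd.Series(rng.poisson(3, size=500).astype(float))
test = pd.Series([0.0, 2.0, 50.0])  # 50 is an extreme value unseen in training

# Thresholds come from the training partition only
lower, upper = train.quantile(0.01), train.quantile(0.99)

train_clipped = train.clip(lower, upper)
test_clipped = test.clip(lower, upper)   # test data is clipped, never used to fit

# The log transform is decided by training skewness, then applied to both partitions
if abs(train_clipped.skew()) > 2.0:
    train_clipped, test_clipped = np.log1p(train_clipped), np.log1p(test_clipped)

print(f"Train-derived limits: [{lower}, {upper}]")
print(test_clipped.tolist())
```

With the notebook's own function, a call would look like `get_preprocessed_split_data(df_final, 'HitOuts', 'Ruck')`, assuming `df_final` from the merge step is in scope.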

## FYI - AFL Rule Evolutions (2012–2025) to Consider

##### **1. The 6-6-6 Starting Position Rule (Introduced in 2019)**
* **The Change:** At every center bounce, teams must now adhere to a strict 6-6-6 formation (6 Defenders, 6 Midfielders, and 6 Forwards).
* **Causal Impact on Height:** Prior to 2019, teams could "flood" the midfield with extra players to neutralize a dominant Ruckman. The 6-6-6 rule creates more space around the contest, which plausibly increases **Hit-out Efficiency**.
* **Analytical Objective:** We hypothesize that the causal influence of **Height** on **Hit-outs** and **Clearances** will be statistically stronger (higher regression coefficients) in the post-2019 era due to reduced midfield congestion.
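
One way to test whether the Height coefficient differs between eras is a regression with a Height × post-2019 interaction term. The sketch below is an illustration on synthetic data only: the effect sizes are invented, and the variable names are assumptions rather than part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic data only: the effect sizes below are invented for illustration
height = rng.normal(200, 4, n)              # cm
post_2019 = rng.integers(0, 2, n)           # 1 if season >= 2019, else 0
height_c = height - height.mean()           # centre height so coefficients are readable

# Planted truth: slope 0.5 before 2019, 0.5 + 0.3 afterwards
hitouts = (20 + 0.5 * height_c + 2.0 * post_2019
           + 0.3 * height_c * post_2019 + rng.normal(0, 2, n))

# Design matrix: intercept, height, era dummy, height x era interaction
X = np.column_stack([np.ones(n), height_c, post_2019, height_c * post_2019])
beta, *_ = np.linalg.lstsq(X, hitouts, rcond=None)

names = ["intercept", "height", "post_2019", "height_x_post_2019"]
print(dict(zip(names, beta.round(3))))
```

A positive interaction coefficient on real data would be consistent with the hypothesis, although it would not by itself rule out other changes that coincided with the 2019 season.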

##### **2. The Interchange Rotation Cap: The "Weight" Dilemma**
* **The Change:** Between 2012 and 2025, the AFL progressively reduced the allowed player rotations per match (from 120 to 90, and eventually to 75).
* **Causal Impact on Weight/BMI:** 
    - **Classic Era (~2012):** Higher rotation caps allowed heavier, "tanker-style" players to maintain high intensity in short bursts, making high **Weight** an advantage for physical contests.
    - **Modern Era (~2025):** With limited rest, endurance is paramount. Excessive body mass may now have a **negative effect**, as heavier players struggle with the increased aerobic demands of longer on-field stints.
* **Analytical Objective:** We expect the correlation between **Weight** and **Disposals** to shift from positive (+) in the early 2010s to neutral or negative (-) in the 2020s.
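
The expected sign change could be examined by computing the Weight–Disposals correlation separately per era. The following sketch uses synthetic data with a planted sign flip; the era cut points are assumptions and should follow the actual rule-change seasons.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1500

# Synthetic data only: the sign flip is planted to illustrate the check
year = rng.integers(2012, 2026, n)
weight = rng.normal(88, 8, n)
slope = np.where(year < 2016, 0.15, -0.15)   # hypothetical era-dependent effect
disposals = 18 + slope * (weight - 88) + rng.normal(0, 3, n)

df = pd.DataFrame({"Year": year, "Weight": weight, "Disposals": disposals})

# Era bins are an assumption chosen for illustration
df["Era"] = pd.cut(df["Year"], bins=[2011, 2015, 2020, 2025],
                   labels=["2012-2015", "2016-2020", "2021-2025"])

# Pearson correlation between Weight and Disposals within each era
era_corr = {
    str(era): group["Weight"].corr(group["Disposals"])
    for era, group in df.groupby("Era", observed=True)
}
print(era_corr)
```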

##### **3. The "Stand" Rule (Introduced in 2021)**
* **The Change:** Once a mark is taken, the defending player "on the mark" is prohibited from moving laterally and must remain stationary.
* **Causal Impact on BMI and Agility:** This rule significantly increased the speed of ball movement and improved passing lanes for the attacking team.
* **Analytical Objective:** We hypothesize that this shift favors players with lower **BMI** and higher **Agility**, as the game transition speed has accelerated, rewarding mobility over raw stationary strength.