# 03 - Feature Engineering

This notebook computes rolling statistics, form metrics, and head-to-head features for backtesting.

## Key Enhancements
- Rolling calculations use vectorized pandas operations
- Per-team rolling windows via `groupby(...).transform(...)`
- Head-to-head history restricted to a lookback window and to Tier 1/2 leagues to bound runtime
- Schema alignment with `ingest_all_data.py` features_metadata structure
- Clear separation of PRE-MATCH vs POST-MATCH features

## Features to Compute
1. Team form (last 5/10 games): W-D-L, points, goals, clean sheets (PRE-MATCH)
2. Rolling stats: possession, shots, corners, fouls averages (PRE-MATCH)
3. Home/away specific metrics (PRE-MATCH)
4. Head-to-head history (PRE-MATCH)
5. Match outcome labels (POST-MATCH ground truth for backtesting)


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime, timedelta
from typing import Optional, Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Optimize pandas for large datasets
pd.set_option('display.max_columns', 50)
pd.set_option('compute.use_bottleneck', True)
pd.set_option('compute.use_numexpr', True)

DATA_DIR = Path('../data')
PROCESSED_DIR = DATA_DIR / 'processed'

print("📊 Feature Engineering Pipeline")
print("=" * 50)

📊 Feature Engineering Pipeline


In [2]:
# Load cleaned matches
matches = pd.read_parquet(PROCESSED_DIR / 'matches_base.parquet')
matches['date'] = pd.to_datetime(matches['date'])
matches = matches.sort_values('date').reset_index(drop=True)

print(f"✅ Loaded {len(matches):,} matches")
print(f"   Date range: {matches['date'].min()} to {matches['date'].max()}")
print(f"   Memory usage: {matches.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

✅ Loaded 57,870 matches
   Date range: 2024-01-01 05:00:00+00:00 to 2025-12-15 20:45:00+00:00
   Memory usage: 39.58 MB


## 1. Build Team Match History

Create a unified view of all matches from each team's perspective for rolling calculations.
This approach allows efficient computation of form metrics per team.
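
As a minimal sketch of this reshaping, the toy example below turns a two-match table into four team-level rows. The values are made up for illustration, and only the ID and score columns are included; the real function also carries the match statistics.

```python
import pandas as pd

# Toy fixture table (hypothetical values), one row per match.
toy = pd.DataFrame({
    "eventId": [1, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-01-08"]),
    "homeTeamId": [10, 20],
    "awayTeamId": [20, 10],
    "homeTeamScore": [2, 0],
    "awayTeamScore": [1, 0],
})

# Home perspective: the home side's score becomes goals_for.
home = toy.rename(columns={
    "homeTeamId": "teamId", "awayTeamId": "opponentId",
    "homeTeamScore": "goals_for", "awayTeamScore": "goals_against",
}).assign(is_home=True)

# Away perspective: same matches, roles swapped.
away = toy.rename(columns={
    "awayTeamId": "teamId", "homeTeamId": "opponentId",
    "awayTeamScore": "goals_for", "homeTeamScore": "goals_against",
}).assign(is_home=False)

# concat aligns columns by name, so the differing column order is harmless.
long_df = (
    pd.concat([home, away], ignore_index=True)
    .sort_values(["teamId", "date"])
    .reset_index(drop=True)
)
print(long_df[["eventId", "teamId", "opponentId", "goals_for", "goals_against", "is_home"]])
```

Each match now appears twice, once per team, which is what lets a single `groupby('teamId')` cover both home and away games.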

In [3]:
def build_team_history(matches_df: pd.DataFrame) -> pd.DataFrame:
    """
    Build a unified team history DataFrame where each row represents
    one team's participation in a match.
    
    This doubles the rows but enables efficient rolling calculations.
    """
    # Define columns we need for both perspectives
    base_cols = ['eventId', 'date', 'leagueId', 'tier']
    
    # Home team perspective
    home_cols = base_cols + [
        'homeTeamId', 'awayTeamId', 
        'homeTeamScore', 'awayTeamScore',
        'home_possessionPct', 'home_totalShots', 'home_shotsOnTarget',
        'home_wonCorners', 'home_foulsCommitted', 
        'home_yellowCards', 'home_redCards'
    ]
    
    # Filter to available columns
    home_cols = [c for c in home_cols if c in matches_df.columns]
    home = matches_df[home_cols].copy()
    
    # Rename to team-agnostic columns
    rename_map = {
        'homeTeamId': 'teamId',
        'awayTeamId': 'opponentId',
        'homeTeamScore': 'goals_for',
        'awayTeamScore': 'goals_against',
        'home_possessionPct': 'possession',
        'home_totalShots': 'shots',
        'home_shotsOnTarget': 'shots_on_target',
        'home_wonCorners': 'corners',
        'home_foulsCommitted': 'fouls',
        'home_yellowCards': 'yellow_cards',
        'home_redCards': 'red_cards'
    }
    home = home.rename(columns={k: v for k, v in rename_map.items() if k in home.columns})
    home['is_home'] = True
    
    # Away team perspective
    away_cols = base_cols + [
        'awayTeamId', 'homeTeamId',
        'awayTeamScore', 'homeTeamScore',
        'away_possessionPct', 'away_totalShots', 'away_shotsOnTarget',
        'away_wonCorners', 'away_foulsCommitted',
        'away_yellowCards', 'away_redCards'
    ]
    
    away_cols = [c for c in away_cols if c in matches_df.columns]
    away = matches_df[away_cols].copy()
    
    rename_map_away = {
        'awayTeamId': 'teamId',
        'homeTeamId': 'opponentId',
        'awayTeamScore': 'goals_for',
        'homeTeamScore': 'goals_against',
        'away_possessionPct': 'possession',
        'away_totalShots': 'shots',
        'away_shotsOnTarget': 'shots_on_target',
        'away_wonCorners': 'corners',
        'away_foulsCommitted': 'fouls',
        'away_yellowCards': 'yellow_cards',
        'away_redCards': 'red_cards'
    }
    away = away.rename(columns={k: v for k, v in rename_map_away.items() if k in away.columns})
    away['is_home'] = False
    
    # Combine and sort
    history = pd.concat([home, away], ignore_index=True)
    history = history.sort_values(['teamId', 'date']).reset_index(drop=True)
    
    # Compute derived fields (vectorized)
    history['result'] = np.select(
        [history['goals_for'] > history['goals_against'],
         history['goals_for'] < history['goals_against']],
        ['W', 'L'],
        default='D'
    )
    history['points'] = history['result'].map({'W': 3, 'D': 1, 'L': 0})
    history['clean_sheet'] = (history['goals_against'] == 0).astype('int8')
    history['failed_to_score'] = (history['goals_for'] == 0).astype('int8')
    
    return history

team_history = build_team_history(matches)
print(f"✅ Built team history: {len(team_history):,} records")
print(f"   Unique teams: {team_history['teamId'].nunique():,}")
print(f"   Memory usage: {team_history.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

✅ Built team history: 115,740 records
   Unique teams: 4,049
   Memory usage: 14.24 MB


## 2. Optimized Rolling Form Features

Rolling windows are computed per team on values shifted by one match, so each feature only sees matches played before the current one and cannot leak the current result.
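
The effect of the shift is easiest to see on a toy series. The sketch below uses made-up points for two teams and a window of 2 (both are illustrative assumptions) and compares a window that includes the current match with one that ends at the previous match.

```python
import pandas as pd

# Hypothetical points per match, already in chronological order per team.
points = pd.DataFrame({
    "teamId": [1, 1, 1, 1, 2, 2],
    "points": [3, 0, 1, 3, 1, 3],
})

# Leaky: the window ends at, and includes, the current match.
points["leaky_2"] = points.groupby("teamId")["points"].transform(
    lambda s: s.rolling(2, min_periods=1).sum()
)

# Pre-match: shift first, so the window ends at the previous match.
points["form_points_2"] = points.groupby("teamId")["points"].transform(
    lambda s: s.shift(1).rolling(2, min_periods=1).sum()
)
print(points)
```

The first match of each team gets NaN in `form_points_2` because no history exists yet. With `min_periods=1`, the next few rows are based on fewer matches than the nominal window size.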

In [4]:
def compute_rolling_form(team_history: pd.DataFrame, n_games: int = 5) -> pd.DataFrame:
    """
    Compute rolling form metrics for each team before each match.
    Uses shift(1) to ensure we only use data from BEFORE the current match.
    
    All features computed here are PRE-MATCH and safe for filter conditions.
    """
    df = team_history.sort_values(['teamId', 'date']).copy()
    
    # Create boolean columns for efficient rolling
    df['is_win'] = (df['result'] == 'W').astype('int8')
    df['is_draw'] = (df['result'] == 'D').astype('int8')
    df['is_loss'] = (df['result'] == 'L').astype('int8')
    
    # Group by team so each rolling window stays within one team's history
    grouped = df.groupby('teamId', sort=False)
    
    # Rolling metrics with shift(1) to exclude current match (PRE-MATCH only)
    # These are all SAFE for filter conditions
    
    # Form counts (wins, draws, losses)
    df[f'form_wins_{n_games}'] = grouped['is_win'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).sum()
    ).astype('float32')
    
    df[f'form_draws_{n_games}'] = grouped['is_draw'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).sum()
    ).astype('float32')
    
    df[f'form_losses_{n_games}'] = grouped['is_loss'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).sum()
    ).astype('float32')
    
    # Points and goals
    df[f'form_points_{n_games}'] = grouped['points'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).sum()
    ).astype('float32')
    
    df[f'form_goals_scored_{n_games}'] = grouped['goals_for'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).mean()
    ).astype('float32')
    
    df[f'form_goals_conceded_{n_games}'] = grouped['goals_against'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).mean()
    ).astype('float32')
    
    df[f'form_clean_sheets_{n_games}'] = grouped['clean_sheet'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).sum()
    ).astype('float32')
    
    # Goal difference average
    df['goal_diff'] = df['goals_for'] - df['goals_against']
    df[f'form_goal_diff_{n_games}'] = grouped['goal_diff'].transform(
        lambda x: x.shift(1).rolling(n_games, min_periods=1).mean()
    ).astype('float32')
    
    # Clean up temp columns
    df = df.drop(columns=['is_win', 'is_draw', 'is_loss', 'goal_diff'], errors='ignore')
    
    return df

# Compute form for last 5 and 10 games
print("Computing rolling form (5 games)...")
team_history = compute_rolling_form(team_history, n_games=5)
print("Computing rolling form (10 games)...")
team_history = compute_rolling_form(team_history, n_games=10)

print(f"✅ Computed rolling form (5 & 10 games)")

Computing rolling form (5 games)...
Computing rolling form (10 games)...
✅ Computed rolling form (5 & 10 games)


## 3. Rolling Statistics (PRE-MATCH Averages)

In [5]:
def compute_rolling_stats(team_history: pd.DataFrame, n_games: int = 5) -> pd.DataFrame:
    """
    Compute rolling averages for match statistics.
    These represent the team's AVERAGE stats going into a match (PRE-MATCH).
    """
    df = team_history.copy()
    grouped = df.groupby('teamId', sort=False)
    
    stats_cols = ['possession', 'shots', 'shots_on_target', 'corners', 'fouls']
    stats_cols = [c for c in stats_cols if c in df.columns]
    
    for col in stats_cols:
        df[f'{col}_avg_{n_games}'] = grouped[col].transform(
            lambda x: x.shift(1).rolling(n_games, min_periods=1).mean()
        ).astype('float32')
    
    return df

team_history = compute_rolling_stats(team_history, n_games=5)
print(f"✅ Computed rolling stats (5 games)")

✅ Computed rolling stats (5 games)


## 4. Home/Away Specific Form (PRE-MATCH)

In [6]:
def compute_venue_specific_form(team_history: pd.DataFrame, n_games: int = 5) -> pd.DataFrame:
    """
    Compute form metrics specific to home or away matches.
    Home form is computed over a team's home matches only, and away form
    over its away matches only.
    """
    df = team_history.copy()
    
    # Home-specific form (only for home matches)
    home_mask = df['is_home']
    home_df = df.loc[home_mask].copy()
    
    if len(home_df) > 0:
        home_grouped = home_df.groupby('teamId', sort=False)
        
        home_df[f'home_form_wins_{n_games}'] = home_grouped['result'].transform(
            # Cast to int before shifting so the shifted series stays numeric
            lambda x: (x == 'W').astype('int8').shift(1).rolling(n_games, min_periods=1).sum()
        ).astype('float32')
        
        home_df[f'home_form_goals_{n_games}'] = home_grouped['goals_for'].transform(
            lambda x: x.shift(1).rolling(n_games, min_periods=1).mean()
        ).astype('float32')
    
    # Away-specific form (only for away matches)
    away_mask = ~df['is_home']
    away_df = df.loc[away_mask].copy()
    
    if len(away_df) > 0:
        away_grouped = away_df.groupby('teamId', sort=False)
        
        away_df[f'away_form_wins_{n_games}'] = away_grouped['result'].transform(
            # Cast to int before shifting so the shifted series stays numeric
            lambda x: (x == 'W').astype('int8').shift(1).rolling(n_games, min_periods=1).sum()
        ).astype('float32')
        
        away_df[f'away_form_goals_{n_games}'] = away_grouped['goals_for'].transform(
            lambda x: x.shift(1).rolling(n_games, min_periods=1).mean()
        ).astype('float32')
    
    # Merge back on the (eventId, teamId) key; rows from the other venue get NaN
    home_cols = [c for c in home_df.columns if c.startswith('home_form_')]
    away_cols = [c for c in away_df.columns if c.startswith('away_form_')]
    
    if home_cols:
        df = df.merge(
            home_df[['eventId', 'teamId'] + home_cols],
            on=['eventId', 'teamId'],
            how='left'
        )
    
    if away_cols:
        df = df.merge(
            away_df[['eventId', 'teamId'] + away_cols],
            on=['eventId', 'teamId'],
            how='left'
        )
    
    return df

team_history = compute_venue_specific_form(team_history, n_games=5)
print(f"✅ Computed home/away specific form")

✅ Computed home/away specific form


## 5. Optimized Head-to-Head Features

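For each match, earlier meetings between the same two teams (in either venue order) within a lookback window are aggregated. The implementation iterates match by match and scans the full frame each time, so it is run on Tier 1/2 leagues only.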
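
For larger inputs, one possible alternative is to group by an order-independent team-pair key and take cumulative counts within each group. The sketch below is not the notebook's method: `h2h_cumulative` is a hypothetical helper, it assumes numeric team IDs and no two meetings of the same pair at the same timestamp, and it deliberately ignores the `max_lookback_days` window, so it counts all earlier meetings.

```python
import numpy as np
import pandas as pd

def h2h_cumulative(df: pd.DataFrame) -> pd.DataFrame:
    """Prior head-to-head counts per match (hypothetical helper, no lookback limit)."""
    out = df.sort_values("date").reset_index(drop=True)
    home = out["homeTeamId"].to_numpy()
    away = out["awayTeamId"].to_numpy()
    home_goals = out["homeTeamScore"].to_numpy()
    away_goals = out["awayTeamScore"].to_numpy()

    # Order-independent pair key: lower ID first, so A-vs-B and B-vs-A share a group.
    out["_lo"] = np.minimum(home, away)
    out["_hi"] = np.maximum(home, away)
    home_is_lo = home == out["_lo"].to_numpy()

    # Express each result from the lower-ID team's point of view.
    lo_goals = np.where(home_is_lo, home_goals, away_goals)
    hi_goals = np.where(home_is_lo, away_goals, home_goals)
    out["_lo_win"] = (lo_goals > hi_goals).astype("int64")
    out["_hi_win"] = (hi_goals > lo_goals).astype("int64")
    out["_draw"] = (lo_goals == hi_goals).astype("int64")
    out["_goals"] = lo_goals + hi_goals

    grouped = out.groupby(["_lo", "_hi"], sort=False)

    def prior(col: str) -> pd.Series:
        # Cumulative sum within the pair, minus the current match itself.
        return grouped[col].cumsum() - out[col]

    out["h2h_matches"] = grouped.cumcount()
    prior_lo, prior_hi = prior("_lo_win"), prior("_hi_win")
    # Map the pair-level counts back to the current home and away sides.
    out["h2h_home_wins"] = np.where(home_is_lo, prior_lo, prior_hi)
    out["h2h_away_wins"] = np.where(home_is_lo, prior_hi, prior_lo)
    out["h2h_draws"] = prior("_draw")
    out["h2h_avg_goals"] = (prior("_goals") / out["h2h_matches"]).where(out["h2h_matches"] > 0)
    return out.drop(columns=["_lo", "_hi", "_lo_win", "_hi_win", "_draw", "_goals"])

# Toy data (hypothetical values): teams 10 and 20 meet three times.
toy = pd.DataFrame({
    "eventId": [1, 2, 3, 4],
    "date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01"]),
    "homeTeamId": [10, 20, 10, 30],
    "awayTeamId": [20, 10, 20, 10],
    "homeTeamScore": [2, 1, 0, 1],
    "awayTeamScore": [0, 1, 3, 0],
})
h2h = h2h_cumulative(toy)
print(h2h[["eventId", "h2h_matches", "h2h_home_wins", "h2h_away_wins", "h2h_draws", "h2h_avg_goals"]])
```

Adding the lookback window back would need a time-based window per pair rather than a plain cumulative sum, so results from this sketch match the cell's output only when every earlier meeting falls inside the window.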

In [7]:
def compute_h2h_features_optimized(matches_df: pd.DataFrame, max_lookback_days: int = 730) -> pd.DataFrame:
    """
    Compute head-to-head history between teams for each match.
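    For each match, scans the frame for earlier meetings of the same two teams
    (either venue order), so runtime grows quadratically with the number of matches.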
    
    Parameters:
    - max_lookback_days: Only consider H2H matches within this period (default 2 years)
    """
    df = matches_df.copy()
    
    # Build an order-independent key per team pair (used only for the
    # unique-pair count printed below)
    df['team_pair'] = df.apply(
        lambda r: tuple(sorted([r['homeTeamId'], r['awayTeamId']])), axis=1
    )
    
    # Initialize H2H columns
    h2h_cols = ['h2h_matches', 'h2h_home_wins', 'h2h_away_wins', 'h2h_draws', 'h2h_avg_goals']
    for col in h2h_cols:
        df[col] = np.nan
    
    # Group by team pair for efficient processing
    print(f"   Processing {df['team_pair'].nunique():,} unique team pairs...")
    
    # Sort by date for chronological processing
    df = df.sort_values('date').reset_index(drop=True)
    
    # Process each match in chronological order
    h2h_data = []
    
    for idx, row in df.iterrows():
        home_id = row['homeTeamId']
        away_id = row['awayTeamId']
        match_date = row['date']
        cutoff_date = match_date - pd.Timedelta(days=max_lookback_days)
        
        # Find previous meetings within lookback period
        prev_mask = (
            (df['date'] < match_date) &
            (df['date'] >= cutoff_date) &
            (
                ((df['homeTeamId'] == home_id) & (df['awayTeamId'] == away_id)) |
                ((df['homeTeamId'] == away_id) & (df['awayTeamId'] == home_id))
            )
        )
        prev_meetings = df.loc[prev_mask]
        
        if len(prev_meetings) == 0:
            h2h_data.append({
                'eventId': row['eventId'],
                'h2h_matches': 0,
                'h2h_home_wins': 0,
                'h2h_away_wins': 0,
                'h2h_draws': 0,
                'h2h_avg_goals': np.nan
            })
        else:
            # Count outcomes across the previous meetings (row-by-row loop)
            home_wins = 0
            away_wins = 0
            draws = 0
            
            # Current home team's results in previous meetings
            for _, m in prev_meetings.iterrows():
                if m['homeTeamId'] == home_id:
                    if m['homeTeamScore'] > m['awayTeamScore']:
                        home_wins += 1
                    elif m['homeTeamScore'] < m['awayTeamScore']:
                        away_wins += 1
                    else:
                        draws += 1
                else:
                    if m['awayTeamScore'] > m['homeTeamScore']:
                        home_wins += 1
                    elif m['awayTeamScore'] < m['homeTeamScore']:
                        away_wins += 1
                    else:
                        draws += 1
            
            total_goals = (prev_meetings['homeTeamScore'] + prev_meetings['awayTeamScore']).sum()
            
            h2h_data.append({
                'eventId': row['eventId'],
                'h2h_matches': len(prev_meetings),
                'h2h_home_wins': home_wins,
                'h2h_away_wins': away_wins,
                'h2h_draws': draws,
                'h2h_avg_goals': total_goals / len(prev_meetings)
            })
        
        # Progress indicator
        if (idx + 1) % 10000 == 0:
            print(f"   Processed {idx + 1:,} matches...")
    
    h2h_df = pd.DataFrame(h2h_data)
    
    # Merge back
    result = matches_df.merge(h2h_df, on='eventId', how='left')
    result = result.drop(columns=['team_pair'], errors='ignore')
    
    return result

# Compute H2H for all matches (or sample for large datasets)
print("Computing H2H features...")

# For efficiency, compute H2H for Tier 1 & 2 leagues first
tier_12_matches = matches[matches['tier'].isin([1, 2])].copy()
if len(tier_12_matches) > 0:
    tier_12_with_h2h = compute_h2h_features_optimized(tier_12_matches)
    print(f"✅ Computed H2H for {len(tier_12_with_h2h):,} Tier 1/2 matches")
else:
    tier_12_with_h2h = None
    print("   No Tier 1/2 matches found")

Computing H2H features...
   Processing 12,468 unique team pairs...
   Processed 10,000 matches...
   Processed 20,000 matches...
   Processed 30,000 matches...
✅ Computed H2H for 30,957 Tier 1/2 matches


## 6. Compute Match Outcome Labels (Ground Truth)

In [8]:
def compute_outcome_labels(matches_df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute match outcome labels for backtesting evaluation.
    These are POST-MATCH ground truth values, NOT for filter conditions.
    """
    df = matches_df.copy()
    
    # Result (1X2) - vectorized
    df['result'] = np.select(
        [df['homeTeamScore'] > df['awayTeamScore'],
         df['homeTeamScore'] < df['awayTeamScore']],
        ['H', 'A'],
        default='D'
    )
    
    # Total goals
    df['total_goals'] = (df['homeTeamScore'] + df['awayTeamScore']).astype('int16')
    
    # Over/Under thresholds - vectorized
    df['over_0_5'] = (df['total_goals'] > 0.5).astype('int8')
    df['over_1_5'] = (df['total_goals'] > 1.5).astype('int8')
    df['over_2_5'] = (df['total_goals'] > 2.5).astype('int8')
    df['over_3_5'] = (df['total_goals'] > 3.5).astype('int8')
    
    # Both Teams to Score
    df['btts'] = ((df['homeTeamScore'] > 0) & (df['awayTeamScore'] > 0)).astype('int8')
    
    # Clean sheets
    df['home_clean_sheet'] = (df['awayTeamScore'] == 0).astype('int8')
    df['away_clean_sheet'] = (df['homeTeamScore'] == 0).astype('int8')
    
    return df

matches = compute_outcome_labels(matches)
print(f"✅ Computed outcome labels")
print(f"\n📊 Result Distribution:")
print(matches['result'].value_counts(normalize=True).round(3))
print(f"\n📊 Over 2.5 Goals: {matches['over_2_5'].mean()*100:.1f}%")
print(f"📊 BTTS: {matches['btts'].mean()*100:.1f}%")

✅ Computed outcome labels

📊 Result Distribution:
result
H    0.451
A    0.298
D    0.250
Name: proportion, dtype: float64

📊 Over 2.5 Goals: 49.3%
📊 BTTS: 49.3%


## 7. Merge Features Back to Matches

Team features are attached with two left merges keyed on `eventId` plus the team ID, once for the home side and once for the away side.
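
A left merge keeps every match row, but a duplicated key on the feature side would silently multiply rows. The cell below guards against that with `drop_duplicates`. As an alternative sketch, pandas' `validate` argument to `merge` can make the expectation explicit. The frames and values here are toy assumptions.

```python
import pandas as pd

# Toy frames (hypothetical values).
matches_toy = pd.DataFrame({"eventId": [1, 2], "homeTeamId": [10, 20]})
home_features_toy = pd.DataFrame({
    "eventId": [1, 2],
    "homeTeamId": [10, 20],
    "home_form_points_5": [7.0, 4.0],
})

# validate="one_to_one" raises if either side has duplicate merge keys.
merged = matches_toy.merge(
    home_features_toy,
    on=["eventId", "homeTeamId"],
    how="left",
    validate="one_to_one",
)
print(merged)

# Same merge with a duplicated feature row: the check fails loudly.
dup_features = pd.concat(
    [home_features_toy, home_features_toy.iloc[[0]]], ignore_index=True
)
duplicate_detected = False
try:
    matches_toy.merge(
        dup_features, on=["eventId", "homeTeamId"], how="left", validate="one_to_one"
    )
except pd.errors.MergeError:
    duplicate_detected = True
print("duplicate keys detected:", duplicate_detected)
```

Failing fast here surfaces upstream duplicates instead of hiding them, which may or may not be preferable to dropping them quietly.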

In [9]:
def merge_team_features_to_matches(matches_df: pd.DataFrame, 
                                    team_history: pd.DataFrame) -> pd.DataFrame:
    """
    Merge computed team features back to the matches DataFrame.
    Uses left merges keyed on eventId plus the home/away team ID.
    """
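    # Identify feature columns to merge (PRE-MATCH features).
    # Note: the 'form_' test also matches the venue-specific columns
    # (home_form_*, away_form_*), so the output includes columns such as
    # home_away_form_* that are NaN by construction.
    feature_cols = [col for col in team_history.columns 
                    if 'form_' in col or '_avg_' in col]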
    
    # Home team features
    home_features = team_history.loc[team_history['is_home'], ['eventId', 'teamId'] + feature_cols].copy()
    home_features.columns = ['eventId', 'homeTeamId'] + [f'home_{col}' for col in feature_cols]
    home_features = home_features.drop_duplicates(subset=['eventId', 'homeTeamId'])
    
    # Away team features
    away_features = team_history.loc[~team_history['is_home'], ['eventId', 'teamId'] + feature_cols].copy()
    away_features.columns = ['eventId', 'awayTeamId'] + [f'away_{col}' for col in feature_cols]
    away_features = away_features.drop_duplicates(subset=['eventId', 'awayTeamId'])
    
    # Left-merge on (eventId, team ID) so every match row is kept
    df = matches_df.merge(home_features, on=['eventId', 'homeTeamId'], how='left')
    df = df.merge(away_features, on=['eventId', 'awayTeamId'], how='left')
    
    return df

# Merge features
matches_enriched = merge_team_features_to_matches(matches, team_history)
print(f"✅ Merged team features to matches")
print(f"   Total columns: {len(matches_enriched.columns)}")
print(f"   Memory usage: {matches_enriched.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

✅ Merged team features to matches
   Total columns: 107
   Memory usage: 53.87 MB


## 8. Save Enriched Data

In [10]:
# Save enriched matches
output_path = PROCESSED_DIR / 'matches_enriched.parquet'
matches_enriched.to_parquet(output_path, index=False, compression='snappy')
print(f"✅ Saved enriched matches to {output_path}")
print(f"   Shape: {matches_enriched.shape}")
print(f"   Size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")

# Save team history for reference
history_path = PROCESSED_DIR / 'team_history.parquet'
team_history.to_parquet(history_path, index=False, compression='snappy')
print(f"✅ Saved team history to {history_path}")

✅ Saved enriched matches to ../data/processed/matches_enriched.parquet
   Shape: (57870, 107)
   Size: 4.82 MB
✅ Saved team history to ../data/processed/team_history.parquet


## Summary

### PRE-MATCH Features (Safe for filter conditions):
- **Form (5 & 10 games)**: wins, draws, losses, points, goals scored/conceded, clean sheets, goal diff
- **Rolling stats (5 games)**: possession, shots, shots on target, corners, fouls averages
- **Home/Away specific**: venue-specific form metrics
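- **H2H**: matches, home/away wins, draws, avg goals (computed for Tier 1/2 leagues in `tier_12_with_h2h`; the cells above do not merge it into `matches_enriched` or save it)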

### POST-MATCH Features (Ground truth only):
- **Outcomes**: result (1X2), over/under, BTTS, clean sheets
- **Match stats**: actual possession, shots, corners, etc.

### Schema Alignment:
Output matches `ingest_all_data.py` expected format for `features_metadata`.

Next: `04_data_export.ipynb` for final validation and PostgreSQL export