# ⚽ EPL Moneyball AI: Predicting Match Outcomes with XGBoost
**Author:** Peer Nagar
**Accuracy:** 54.08% (Validated on Test Set)

### 🚀 Project Overview
This project utilizes historical Premier League data to predict match results and identify **Value Bets**.
It leverages **XGBoost** with optimized hyperparameters and advanced feature engineering, including:
* **Team Form & Momentum:** Rolling averages of recent performance.
* **Interaction Features:** Direct comparison between Home Attack vs. Away Defense.
* **Time Decay:** Giving double weight to recent matches (played after August 1, 2024).

### 🛠️ Methodology
1.  **Data Loading:** Aggregating 5 seasons of match data.
2.  **Feature Engineering:** Creating dynamic time-series features.
3.  **Model Training:** Tuning hyperparameters with RandomizedSearchCV, then training the final models.
4.  **Deployment:** Generating a real-time betting report for the upcoming round.

In [14]:
import pandas as pd
import numpy as np
import glob
import os
from google.colab import drive
from xgboost import XGBClassifier, XGBRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, mean_absolute_error

# 1. Setup Environment
drive.mount('/content/drive')
FOLDER_PATH = '/content/drive/MyDrive/Colab Notebooks/Data Science/The Moneyball Project'

# 2. Load & Merge Data
all_files = glob.glob(os.path.join(FOLDER_PATH, "*.csv"))
df_list = []

for filename in all_files:
    df_temp = pd.read_csv(filename)
    df_temp['Season_File'] = os.path.basename(filename)
    df_list.append(df_temp)

if df_list:
    df = pd.concat(df_list, ignore_index=True)
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
    df = df.sort_values(by='Date').reset_index(drop=True)

    # Filter basic columns
    cols_to_keep = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
                    'HS', 'AS', 'HST', 'AST', 'B365H', 'B365D', 'B365A']
    existing = [c for c in cols_to_keep if c in df.columns]
    df = df[existing].copy()
    df.dropna(subset=['FTR', 'B365H'], inplace=True)

    print(f"✅ Loaded {len(df)} matches successfully.")
else:
    print("❌ No files found.")

✅ Loaded 1720 matches successfully.


## 🧹 Data Cleaning & Preprocessing
We filter the raw dataset to keep only the essential columns:
* **Match Info:** Date, Teams, Goals (FTHG, FTAG).
* **Stats:** Shots and shots on target (used for deeper analysis if needed).
* **Odds:** Bet365 odds (Home, Draw, Away) to calculate implied probabilities.
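
As a rough illustration of the last bullet, decimal odds convert to implied probabilities by taking reciprocals. The raw reciprocals sum to more than 1 because of the bookmaker's margin (the overround); dividing by that sum is one simple way to remove it. The helper name `implied_probabilities` is hypothetical and not part of the notebook, which compares against the raw `1/odds` value.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to raw and margin-free implied probabilities."""
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)  # greater than 1.0 when the bookmaker takes a margin
    normalized = [p / overround for p in raw]
    return raw, normalized, overround

# Bet365 odds for one match in the dataset (Fulham vs Liverpool, 2026-01-04)
raw, normalized, overround = implied_probabilities(3.80, 3.75, 1.91)
print([round(p, 3) for p in raw])
print([round(p, 3) for p in normalized])
print(round(overround, 3))
```

For these odds the raw probabilities sum to roughly 1.05, so the bookmaker's margin is about 5%. Whether to compare the model against raw or normalized probabilities is a design choice; normalizing makes the comparison slightly stricter on the bookmaker's side.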

In [15]:
# Select only relevant columns for analysis and modeling
cols_to_keep = [
    'Date', 'HomeTeam', 'AwayTeam',
    'FTHG', 'FTAG', 'FTR',           # Goals and Results
    'HS', 'AS', 'HST', 'AST',        # Shots stats
    'B365H', 'B365D', 'B365A',       # Betting Odds
    'Season_File'                    # Helper column
]

# Keep only existing columns (handling potential missing columns in older files)
existing_cols = [c for c in cols_to_keep if c in df.columns]
df = df[existing_cols].copy()

# Drop rows with missing critical data (Results or Odds)
df.dropna(subset=['FTR', 'B365H'], inplace=True)

print(f"Clean Data Shape: {df.shape}")
display(df.tail())

Clean Data Shape: (1720, 13)


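| | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HS | AS | HST | AST | B365H | B365D | B365A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1715 | 2026-01-04 | Fulham | Liverpool | 2 | 2 | D | 8 | 10 | 2 | 2 | 3.8 | 3.75 | 1.91 |
| 1716 | 2026-01-04 | Leeds | Man United | 1 | 1 | D | 11 | 15 | 3 | 2 | 2.7 | 3.3 | 2.63 |
| 1717 | 2026-01-04 | Everton | Brentford | 2 | 4 | A | 14 | 11 | 6 | 7 | 2.35 | 3.25 | 3.1 |
| 1718 | 2026-01-04 | Newcastle | Crystal Palace | 2 | 0 | H | 12 | 11 | 7 | 1 | 1.7 | 3.9 | 4.75 |
| 1719 | 2026-01-04 | Tottenham | Sunderland | 1 | 1 | D | 13 | 10 | 5 | 3 | 1.8 | 3.6 | 4.5 |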


## ⚙️ Advanced Feature Engineering
This is the core of the project. Raw stats (like "Shots on Target") are post-match metrics. To predict the future, we need **historical context**.

We construct the following features for every match, using only games played **before** it (a window of the **past 5 games** unless noted):
1.  **Form:** Rolling average of points earned.
2.  **Attacking Strength:** Average goals scored.
3.  **Defensive Weakness:** Average goals conceded.
4.  **Momentum:** Total points earned over the last 3 games.
5.  **Home/Away Factor:** How well the team performs specifically at home vs. away.
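
The important detail in features like these is that each row may only see matches played before it. A minimal sketch on made-up data (the points values are illustrative, and a window of 3 keeps the example short): `rolling(...).mean()` includes the current match, and `shift()` moves the result down one row so it describes only earlier matches.

```python
import pandas as pd

# Toy history for one team: points earned in six consecutive matches (made-up values)
toy = pd.DataFrame({
    "Team": ["Arsenal"] * 6,
    "Points": [3, 1, 0, 3, 3, 1],
})

# Rolling mean over 3 matches, then shift() so the current match is excluded
toy["Form_L3"] = (
    toy.groupby("Team")["Points"]
       .transform(lambda s: s.rolling(3).mean().shift())
)
print(toy)
```

The first three rows come out as NaN because no full window of earlier matches exists yet, which is why the pipeline drops each team's earliest rows after feature engineering.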



In [16]:
def process_advanced_features(input_df):
    """
    Transforms match-by-match data into team-centric features with rolling averages.
    """
    # Define base columns needed for calculation
    base_cols = ['Date', 'HomeTeam', 'AwayTeam', 'FTR', 'FTHG', 'FTAG']

    # Create Home Stats DataFrame
    home_stats = input_df[base_cols].copy()
    home_stats['Team'] = home_stats['HomeTeam']
    home_stats['IsHome'] = 1
    home_stats['GoalsScored'] = home_stats['FTHG']
    home_stats['GoalsConceded'] = home_stats['FTAG']
    home_stats['Points'] = home_stats['FTR'].apply(lambda x: 3 if x == 'H' else (1 if x == 'D' else 0))

    # Create Away Stats DataFrame
    away_stats = input_df[base_cols].copy()
    away_stats['Team'] = away_stats['AwayTeam']
    away_stats['IsHome'] = 0
    away_stats['GoalsScored'] = away_stats['FTAG']
    away_stats['GoalsConceded'] = away_stats['FTHG']
    away_stats['Points'] = away_stats['FTR'].apply(lambda x: 3 if x == 'A' else (1 if x == 'D' else 0))

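    # Combine and Sort (ignore_index gives every row a unique index, so the
    # filtered Home/Away factor assignments below align only with their own rows)
    team_stats = pd.concat([home_stats, away_stats], ignore_index=True).sort_values(['Team', 'Date'])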

    # --- Rolling Calculations ---

    # 1. General Form (Last 5 games)
    team_stats['Form_L5'] = team_stats.groupby('Team')['Points'].transform(lambda x: x.rolling(5).mean().shift())

    # 2. Attack & Defense (Last 5 games)
    team_stats['Attack_L5'] = team_stats.groupby('Team')['GoalsScored'].transform(lambda x: x.rolling(5).mean().shift())
    team_stats['Defense_L5'] = team_stats.groupby('Team')['GoalsConceded'].transform(lambda x: x.rolling(5).mean().shift())

    # 3. Momentum (Sum of points in last 3 games)
    team_stats['Momentum_L3'] = team_stats.groupby('Team')['Points'].transform(lambda x: x.rolling(3).sum().shift())

    # 4. Specific Home/Away Factor
    team_stats['Home_Factor'] = team_stats[team_stats['IsHome']==1].groupby('Team')['Points'].transform(lambda x: x.rolling(5).mean().shift())
    team_stats['Away_Factor'] = team_stats[team_stats['IsHome']==0].groupby('Team')['Points'].transform(lambda x: x.rolling(5).mean().shift())

    # --- Merge back to Match Data ---
    cols_to_merge = ['Date', 'Team', 'Form_L5', 'Attack_L5', 'Defense_L5', 'Momentum_L3', 'Home_Factor', 'Away_Factor']

    df_merged = input_df.copy()

    # Merge Home Features
    df_merged = df_merged.merge(team_stats[cols_to_merge], left_on=['Date', 'HomeTeam'], right_on=['Date', 'Team'], how='left')
    df_merged.rename(columns={
        'Form_L5': 'Home_Form', 'Attack_L5': 'Home_Attack', 'Defense_L5': 'Home_Defense',
        'Momentum_L3': 'Home_Momentum', 'Home_Factor': 'Home_HomeFactor'
    }, inplace=True)
    df_merged.drop(columns=['Team', 'Away_Factor'], inplace=True)

    # Merge Away Features
    df_merged = df_merged.merge(team_stats[cols_to_merge], left_on=['Date', 'AwayTeam'], right_on=['Date', 'Team'], how='left')
    df_merged.rename(columns={
        'Form_L5': 'Away_Form', 'Attack_L5': 'Away_Attack', 'Defense_L5': 'Away_Defense',
        'Momentum_L3': 'Away_Momentum', 'Away_Factor': 'Away_AwayFactor'
    }, inplace=True)
    df_merged.drop(columns=['Team', 'Home_Factor'], inplace=True)

    # Clean initial rows with NaNs
    df_merged.dropna(inplace=True)
    df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]

    # --- NEW: Add Interaction Features (The Accuracy Boosters) ---
    df_merged['Diff_Form'] = df_merged['Home_Form'] - df_merged['Away_Form']
    df_merged['Diff_Attack_Defense'] = df_merged['Home_Attack'] - df_merged['Away_Defense']
    df_merged['Diff_Momentum'] = df_merged['Home_Momentum'] - df_merged['Away_Momentum']

    return df_merged

# Apply the function
df_advanced = process_advanced_features(df)
print(f"Engineered Data Shape (Optimized): {df_advanced.shape}")

Engineered Data Shape (Optimized): (1553, 26)


## 🤖 Model Training (XGBoost)
We train two separate models:
1.  **Winner Classifier:** Predicts Home Win / Draw / Away Win.
2.  **Goals Regressor:** Predicts the total number of goals (for Over/Under markets).
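
The classifier works on integer class codes rather than the letters `H`, `D`, `A`. scikit-learn's `LabelEncoder` assigns codes in sorted label order, which is why the prediction code later treats class `2` as a home win and class `0` as an away win. A dependency-free sketch of the same mapping (the label list is illustrative):

```python
# Mirror LabelEncoder's sorted-order mapping without scikit-learn
labels = ["H", "D", "A", "H", "A"]      # illustrative FTR values
classes = sorted(set(labels))           # alphabetical order of the result codes
code_of = {label: code for code, label in enumerate(classes)}
encoded = [code_of[label] for label in labels]
print(code_of)
print(encoded)
```

Reading the mapping from the fitted encoder instead of hard-coding `0` and `2` would keep the report correct if the label set ever changed.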

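**Key Technique:** We use **Time Decay weighting**. Training matches played after August 1, 2024 get double the weight (`2.0`) of older matches (`1.0`). This helps the model adapt to the most recent team rosters and managerial changes. The weights are applied when fitting the final models; the hyperparameter search runs unweighted.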
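
A minimal sketch of this step-function weighting, using illustrative dates rather than rows from the dataset. `np.where` is used here as a vectorized stand-in for a row-by-row `apply`; the cutoff and the strict `>` comparison follow the description above.

```python
import numpy as np
import pandas as pd

# Illustrative match dates (not taken from the dataset)
dates = pd.Series(pd.to_datetime(
    ["2023-05-28", "2024-08-01", "2024-08-17", "2025-12-30"]
))
cutoff_date = pd.Timestamp("2024-08-01")

# 2.0 for matches strictly after the cutoff, 1.0 otherwise
weights = np.where(dates > cutoff_date, 2.0, 1.0)
print(weights)
```

A match played exactly on the cutoff date keeps weight `1.0` because the comparison is strict. The resulting array is the kind of value passed as `sample_weight` when fitting.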

In [17]:
from sklearn.model_selection import RandomizedSearchCV

# 1. Define Optimized Feature List
features = [
    'Home_Form', 'Away_Form',
    'Home_Attack', 'Away_Attack',
    'Home_Defense', 'Away_Defense',
    'Home_Momentum', 'Away_Momentum',
    'Home_HomeFactor', 'Away_AwayFactor',
    'Diff_Form', 'Diff_Attack_Defense', 'Diff_Momentum', # New features
    'B365H', 'B365D', 'B365A'
]

# Ensure numeric types
for col in features:
    df_advanced[col] = pd.to_numeric(df_advanced[col], errors='coerce')

# 2. Train/Test Split
split_idx = int(len(df_advanced) * 0.85)

X = df_advanced[features]
le = LabelEncoder()
y_winner = le.fit_transform(df_advanced['FTR'])
y_goals = (df_advanced['FTHG'] + df_advanced['FTAG']).values

X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train_win, y_test_win = y_winner[:split_idx], y_winner[split_idx:]
y_train_goals, y_test_goals = y_goals[:split_idx], y_goals[split_idx:]

# 3. Create Sample Weights (Time Decay)
cutoff_date = pd.Timestamp('2024-08-01')
weights = df_advanced.iloc[:split_idx]['Date'].apply(lambda x: 2.0 if x > cutoff_date else 1.0).values

# 4. Hyperparameter Tuning (Auto-Optimization)
print("🚀 Tuning Model Parameters (this takes ~2 mins)...")
param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

xgb_search = XGBClassifier(random_state=42)
random_search = RandomizedSearchCV(
    estimator=xgb_search, param_distributions=param_dist,
    n_iter=15, scoring='accuracy', cv=3, verbose=0, random_state=42, n_jobs=-1
)
# Fit search (without weights for stability)
random_search.fit(X_train.values, y_train_win)
best_params = random_search.best_params_
print(f"✅ Best Params: {best_params}")

# 5. Train Final Models
print("Training Final Classifier...")
model_winner = XGBClassifier(**best_params, random_state=42)
model_winner.fit(X_train.values, y_train_win, sample_weight=weights)

print("Training Regressor (Goals)...")
model_goals = XGBRegressor(n_estimators=200, learning_rate=0.03, max_depth=5, random_state=42)
model_goals.fit(X_train.values, y_train_goals, sample_weight=weights)

# 6. Evaluate
acc = model_winner.score(X_test.values, y_test_win)
mae = mean_absolute_error(y_test_goals, model_goals.predict(X_test.values))

print("-" * 30)
print(f"🏆 Final Results:")
print(f"   Winner Accuracy: {acc:.2%}")
print(f"   Goals MAE: {mae:.2f}")
print("-" * 30)

🚀 Tuning Model Parameters (this takes ~2 mins)...
✅ Best Params: {'subsample': 0.9, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.01, 'colsample_bytree': 0.9}
Training Final Classifier...
Training Regressor (Goals)...
------------------------------
🏆 Final Results:
   Winner Accuracy: 54.08%
   Goals MAE: 1.31
------------------------------


## 🔮 Real-Time Prediction Engine
This section generates the final report.
It rebuilds each team's stats from the most recent matches in the dataset and feeds them into our trained XGBoost models.

The report highlights:
* **Predicted Winner** (with confidence %).
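* **Value Bets:** Home or away outcomes where the model's probability exceeds the bookmaker's implied probability (`1/odds`) by more than 5 percentage points.
* **Goals Market:** Expected total goals from the regressor.
* **Score Estimate:** A rough scoreline derived from each side's recent attack and defense averages.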
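
The value-bet check can be sketched as a small standalone function. This is an illustrative variant, not the notebook's exact logic: the name `find_value_bets` is hypothetical, and it checks all three outcomes including the draw, whereas the report code checks home first and away only if home is not flagged. The probabilities below are approximations chosen for the example, with the draw value an assumption.

```python
def find_value_bets(model_probs, odds, edge=0.05):
    """Flag outcomes whose model probability beats the raw implied
    probability (1/odds) by more than `edge`."""
    flagged = []
    for outcome, prob in model_probs.items():
        implied = 1.0 / odds[outcome]
        if prob > implied + edge:
            flagged.append((outcome, round(prob - implied, 3)))
    return flagged

# Approximate probabilities for illustration; odds are the Man City vs Brighton prices
model_probs = {"H": 0.52, "D": 0.28, "A": 0.20}
odds = {"H": 1.40, "D": 5.25, "A": 7.50}
print(find_value_bets(model_probs, odds))
```

With these inputs the home win is not flagged, since the model's 52% is below the bookmaker's implied probability of about 71%, while the draw and away outcomes both clear the 5-point threshold. Raising `edge` makes the filter more conservative.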

In [18]:
# --- Helper: Get latest stats dynamically ---
def get_prediction_inputs(home_team, away_team, full_df):
    # Reconstruct history table
    base_cols = ['Date', 'HomeTeam', 'AwayTeam', 'FTR', 'FTHG', 'FTAG']

    h = full_df[base_cols].copy()
    h['Team'] = h['HomeTeam']; h['IsHome'] = 1
    h['GoalsScored'] = h['FTHG']; h['GoalsConceded'] = h['FTAG']
    h['Points'] = h['FTR'].apply(lambda x: 3 if x=='H' else (1 if x=='D' else 0))

    a = full_df[base_cols].copy()
    a['Team'] = a['AwayTeam']; a['IsHome'] = 0
    a['GoalsScored'] = a['FTAG']; a['GoalsConceded'] = a['FTHG']
    a['Points'] = a['FTR'].apply(lambda x: 3 if x=='A' else (1 if x=='D' else 0))

    history_df = pd.concat([h, a]).sort_values(['Team', 'Date'])

    if home_team not in history_df['Team'].unique() or away_team not in history_df['Team'].unique():
        return None

    h_hist = history_df[history_df['Team'] == home_team].sort_values('Date')
    a_hist = history_df[history_df['Team'] == away_team].sort_values('Date')

    stats = {}

    # Standard Stats
    stats['Home_Form'] = h_hist['Points'].tail(5).mean()
    stats['Home_Attack'] = h_hist['GoalsScored'].tail(5).mean()
    stats['Home_Defense'] = h_hist['GoalsConceded'].tail(5).mean()
    stats['Home_Momentum'] = h_hist['Points'].tail(3).sum()
    h_home = h_hist[h_hist['IsHome']==1]
    stats['Home_HomeFactor'] = h_home['Points'].tail(5).mean() if not h_home.empty else stats['Home_Form']

    stats['Away_Form'] = a_hist['Points'].tail(5).mean()
    stats['Away_Attack'] = a_hist['GoalsScored'].tail(5).mean()
    stats['Away_Defense'] = a_hist['GoalsConceded'].tail(5).mean()
    stats['Away_Momentum'] = a_hist['Points'].tail(3).sum()
    a_away = a_hist[a_hist['IsHome']==0]
    stats['Away_AwayFactor'] = a_away['Points'].tail(5).mean() if not a_away.empty else stats['Away_Form']

    # --- CRITICAL: Calculate Interaction Features on the fly ---
    stats['Diff_Form'] = stats['Home_Form'] - stats['Away_Form']
    stats['Diff_Attack_Defense'] = stats['Home_Attack'] - stats['Away_Defense']
    stats['Diff_Momentum'] = stats['Home_Momentum'] - stats['Away_Momentum']

    return stats

# --- Define Fixtures (Jan 7-8, 2026) ---
next_fixtures = [
    {'Home': 'Fulham', 'Away': 'Chelsea', 'B365H': 3.60, 'B365D': 3.75, 'B365A': 2.05},
    {'Home': 'Bournemouth', 'Away': 'Tottenham', 'B365H': 2.15, 'B365D': 3.60, 'B365A': 3.30},
    {'Home': 'Brentford', 'Away': 'Sunderland', 'B365H': 1.83, 'B365D': 3.70, 'B365A': 4.40},
    {'Home': 'Man City', 'Away': 'Brighton', 'B365H': 1.40, 'B365D': 5.25, 'B365A': 7.50},
    {'Home': 'Crystal Palace', 'Away': 'Aston Villa', 'B365H': 3.20, 'B365D': 3.40, 'B365A': 2.30},
    {'Home': 'Everton', 'Away': 'Wolves', 'B365H': 1.75, 'B365D': 3.60, 'B365A': 5.00},
    {'Home': 'Newcastle', 'Away': 'Leeds', 'B365H': 1.70, 'B365D': 3.90, 'B365A': 5.00},
    {'Home': 'Burnley', 'Away': 'Man United', 'B365H': 5.00, 'B365D': 4.10, 'B365A': 1.70},
    {'Home': 'Arsenal', 'Away': 'Liverpool', 'B365H': 1.57, 'B365D': 4.33, 'B365A': 5.50}
]

# --- Generate Report ---
print("\n" + "="*60)
print("🤖 FINAL OPTIMIZED AI BETTING REPORT")
print("="*60)

for fixture in next_fixtures:
    stats = get_prediction_inputs(fixture['Home'], fixture['Away'], df)

    if stats:
        stats['B365H'] = fixture['B365H']
        stats['B365D'] = fixture['B365D']
        stats['B365A'] = fixture['B365A']

        row = pd.DataFrame([stats])
        # Ensure correct column order matches training
        row = row[features]

        # Predictions
        probs = model_winner.predict_proba(row.values)[0]
        pred_class = model_winner.predict(row.values)[0]
        pred_goals = model_goals.predict(row.values)[0]

        # Parsing
        if pred_class == 2: winner = fixture['Home']
        elif pred_class == 0: winner = fixture['Away']
        else: winner = "DRAW"

        # Value Detection
        is_value = False
        val_msg = ""
        if probs[2] > (1/fixture['B365H']) + 0.05:
            is_value = True; val_msg = f"💰 VALUE HOME ({probs[2]:.0%} vs {1/fixture['B365H']:.0%})"
        elif probs[0] > (1/fixture['B365A']) + 0.05:
            is_value = True; val_msg = f"💰 VALUE AWAY ({probs[0]:.0%} vs {1/fixture['B365A']:.0%})"

        # Output
        print(f"\n⚽ {fixture['Home']} vs {fixture['Away']}")
        print(f"   🏆 Pick: {winner} (Conf: {max(probs):.1%})")
        if is_value: print(f"   {val_msg}")
        print(f"   🥅 Exp. Goals: {pred_goals:.2f} | Score: {(stats['Home_Attack']+stats['Away_Defense'])/2:.1f}-{(stats['Away_Attack']+stats['Home_Defense'])/2:.1f}")

    else:
        print(f"⚠️ Missing data for {fixture['Home']} vs {fixture['Away']}")


🤖 FINAL OPTIMIZED AI BETTING REPORT

⚽ Fulham vs Chelsea
   🏆 Pick: Chelsea (Conf: 41.4%)
   🥅 Exp. Goals: 2.77 | Score: 1.5-1.3

⚽ Bournemouth vs Tottenham
   🏆 Pick: Bournemouth (Conf: 40.6%)
   🥅 Exp. Goals: 2.45 | Score: 1.6-1.7

⚽ Brentford vs Sunderland
   🏆 Pick: Brentford (Conf: 45.3%)
   🥅 Exp. Goals: 2.33 | Score: 1.3-0.7

⚽ Man City vs Brighton
   🏆 Pick: Man City (Conf: 52.2%)
   💰 VALUE AWAY (20% vs 13%)
   🥅 Exp. Goals: 3.12 | Score: 1.5-0.7

⚽ Crystal Palace vs Aston Villa
   🏆 Pick: Aston Villa (Conf: 41.5%)
   🥅 Exp. Goals: 2.40 | Score: 1.1-2.2

⚽ Everton vs Wolves
   🏆 Pick: Everton (Conf: 49.8%)
   💰 VALUE AWAY (27% vs 20%)
   🥅 Exp. Goals: 3.17 | Score: 1.1-1.3

⚽ Newcastle vs Leeds
   🏆 Pick: Newcastle (Conf: 49.9%)
   💰 VALUE AWAY (27% vs 20%)
   🥅 Exp. Goals: 3.07 | Score: 1.1-1.2

⚽ Burnley vs Man United
   🏆 Pick: Man United (Conf: 45.8%)
   💰 VALUE HOME (28% vs 20%)
   🥅 Exp. Goal