
# FightIQ Codex: End-to-End UFC Fight Prediction Pipeline

This notebook documents the FightIQ Codex project's leak-safe data pipeline, training workflow, and prediction tooling so it can be shared easily on Kaggle or any hosted notebook platform.



## Project Overview

- **Leak-safe feature store** built from UFC Stats historical data with strict point-in-time (PTI) joins.
- **Training scripts** for a winner stacking ensemble (LightGBM/XGBoost + parity meta), multi-task method/round models, and calibrated betting helpers.
- **Upcoming ingestion** that pulls odds via TheOddsAPI, engineers full pre-fight features, and produces calibrated predictions.
- **Reusable command-line tooling** so the entire workflow can be orchestrated or cherry-picked inside a notebook.
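
To make the point-in-time idea concrete, here is a minimal sketch using `pandas.merge_asof`. The tables, column names (`as_of_date`, `career_wins`), and values are hypothetical and are not taken from the FightIQ feature store; the sketch only shows how a strictly-earlier join keeps same-day or future information out of a fight's feature row.

```python
import pandas as pd

# Hypothetical fight table: one row per fighter per bout.
fights = pd.DataFrame({
    "fighter": ["A", "A", "B"],
    "event_date": pd.to_datetime(["2023-01-14", "2023-06-10", "2023-06-10"]),
})

# Hypothetical per-fighter snapshots, stamped with the date they became known.
snapshots = pd.DataFrame({
    "fighter": ["A", "A", "B"],
    "as_of_date": pd.to_datetime(["2022-09-03", "2023-01-14", "2023-06-10"]),
    "career_wins": [4, 5, 9],
})

# merge_asof requires both frames to be sorted by their time keys.
pti = pd.merge_asof(
    fights.sort_values("event_date"),
    snapshots.sort_values("as_of_date"),
    left_on="event_date",
    right_on="as_of_date",
    by="fighter",
    direction="backward",
    allow_exact_matches=False,  # strictly earlier: same-day snapshots are excluded
)
print(pti[["fighter", "event_date", "as_of_date", "career_wins"]])
```

Fighter B has no snapshot dated before the bout, so the joined value is missing rather than borrowed from the fight day itself.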


In [None]:

from pathlib import Path
import pandas as pd
import subprocess

PROJECT_ROOT = Path('..').resolve().parent
PROJECT_ROOT



## Repository Layout

Key directories you'll interact with:

- `data/`: raw/silver/gold parquet layers generated from the ingestion pipeline.
- `artifacts/`: trained model bundles (LightGBM/XGBoost, calibrators, metadata).
- `scripts/`: CLI entry points for ingestion, feature building, training, evaluation, and prediction.
- `fightiq_codex/data/`: upcoming feature matrices used for inference notebooks or deployments.


In [None]:

for path in [
    PROJECT_ROOT / 'data',
    PROJECT_ROOT / 'artifacts',
    PROJECT_ROOT / 'fightiq_codex' / 'data',
    PROJECT_ROOT / 'scripts'
]:
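    print(f'Directory: {path}')
    if not path.is_dir():
        # Skip directories that have not been generated yet instead of raising.
        print('  (missing)')
        continue
    for item in sorted(path.iterdir()):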
        if item.is_dir():
            print(f'  [dir] {item.name}')
        else:
            size_mb = item.stat().st_size / 1e6
            print(f'  {item.name} ({size_mb:.2f} MB)')



## Historical Feature Snapshot

The gold layer contains leak-safe rolling statistics, matchup deltas, odds features, and rankings deltas for every historical fight.
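
One common way to keep a rolling statistic leak-safe is to shift each fighter's history by one bout before aggregating. The sketch below illustrates that pattern on made-up data; the column names (`sig_strikes`, `sig_strikes_prev3_mean`) and the window of three fights are assumptions, not the actual gold-layer schema.

```python
import pandas as pd

# Hypothetical per-fight history; column names are illustrative only.
history = pd.DataFrame({
    "fighter": ["A", "A", "A", "B", "B"],
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-06-01", "2023-01-01", "2022-03-01", "2022-09-01"]
    ),
    "sig_strikes": [40, 60, 80, 30, 50],
}).sort_values(["fighter", "event_date"])

# shift(1) removes the current fight from its own window, so each row only
# sees bouts that finished before it. A debut has no history and stays NaN.
history["sig_strikes_prev3_mean"] = (
    history.groupby("fighter")["sig_strikes"]
    .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)
print(history)
```

Without the shift, the statistic for a fight would include that fight's own outcome, which is exactly the leakage the gold layer is meant to avoid.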


In [None]:

gold_path = PROJECT_ROOT / 'data' / 'gold_features.parquet'
gold_df = pd.read_parquet(gold_path)
print(gold_df.shape)
gold_df.head()



## Upcoming Feature Matrix

Upcoming fights are ingested via `scripts/ingest_upcoming_from_odds_api.py` and enriched with the same feature engineering logic using `scripts/build_upcoming_features.py`.
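
For readers unfamiliar with odds features, the sketch below shows one standard way to turn two-way (head-to-head) decimal odds into probabilities that sum to 1 by dividing out the bookmaker margin. It assumes decimal odds as input; whether the FightIQ scripts use this exact normalization is not stated here, and `implied_probs` is a hypothetical helper name.

```python
def implied_probs(decimal_odds_f1: float, decimal_odds_f2: float) -> tuple[float, float]:
    """Convert two-way decimal odds into probabilities that sum to 1."""
    if decimal_odds_f1 <= 1.0 or decimal_odds_f2 <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    raw_f1 = 1.0 / decimal_odds_f1
    raw_f2 = 1.0 / decimal_odds_f2
    overround = raw_f1 + raw_f2  # exceeds 1 when the book includes a margin
    return raw_f1 / overround, raw_f2 / overround


# Illustrative prices, not real market data.
p1, p2 = implied_probs(1.50, 2.75)
print(f"fighter 1: {p1:.3f}, fighter 2: {p2:.3f}")
```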


In [None]:

upcoming_path = PROJECT_ROOT / 'fightiq_codex' / 'data' / 'upcoming_features.parquet'
upcoming_df = pd.read_parquet(upcoming_path)
print(upcoming_df[['fight_url','event_date','f_1_name','f_2_name']].head())
print('Feature columns:', len(upcoming_df.columns))



### Optional: Rebuild Upcoming Features

Set `RUN_INGEST` to `True` if you want to pull fresh odds (requires `THEODDS_API_KEY`) and regenerate the upcoming feature matrix inside the notebook.


In [None]:

RUN_INGEST = False  # flip to True to re-run ingestion/feature build
if RUN_INGEST:
    import os
    assert os.getenv('THEODDS_API_KEY'), 'Set THEODDS_API_KEY before running ingestion'
    def run(cmd):
        print('RUN:', ' '.join(cmd))
        result = subprocess.run(cmd, cwd=str(PROJECT_ROOT), capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            print(result.stderr)
            raise RuntimeError(f'Command failed: {cmd}')
    run(['python', 'scripts/ingest_upcoming_from_odds_api.py', '--regions', 'us', '--markets', 'h2h'])
    run(['python', 'scripts/build_upcoming_features.py'])



## Training & Evaluation Entrypoints

Each training component can be executed directly from the notebook for reproducible experiments. Heavy jobs are disabled by default; enable the corresponding flag when running on your own hardware or in a Kaggle session.


In [None]:

RUN_TRAINING = False  # toggle to True to retrain models
if RUN_TRAINING:
    def run(cmd):
        print('RUN:', ' '.join(cmd))
        result = subprocess.run(cmd, cwd=str(PROJECT_ROOT), capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            print(result.stderr)
            raise RuntimeError(f'Command failed: {cmd}')
    run(['python', 'scripts/train_winner_enhanced.py', '--min-val-acc', '0.71', '--min-test-acc', '0.62'])
    run(['python', 'scripts/train_multitask.py'])
else:
    print('Training skipped. Set RUN_TRAINING=True to retrain the models.')



## Generate Upcoming Predictions

The prediction script loads the latest artifacts, blends parity + tuned models, and emits risk-controlled bets with calibrated method/round probabilities.
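
The predictions include a `kelly_frac_f1` column. As background, the sketch below computes a Kelly stake from a win probability and decimal odds, then floors it at zero and caps it. The `cap` parameter and its 5% default are hypothetical stand-ins for a risk control; the real script's sizing rules may differ.

```python
def kelly_fraction(p_win: float, decimal_odds: float, cap: float = 0.05) -> float:
    """Kelly stake as a fraction of bankroll, floored at 0 and capped at `cap`."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1.0")
    b = decimal_odds - 1.0  # net profit per unit staked
    full_kelly = (b * p_win - (1.0 - p_win)) / b
    # Never bet on a negative edge, and never exceed the (hypothetical) cap.
    return min(max(full_kelly, 0.0), cap)


print(kelly_fraction(0.60, 2.0, cap=1.0))  # uncapped Kelly stake
print(kelly_fraction(0.60, 2.0))           # same bet, limited by the default cap
print(kelly_fraction(0.40, 2.0))           # negative edge, so no bet
```

Capping the stake trades some expected growth for lower variance, which matters because the stake is only as reliable as the calibrated probability behind it.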


In [None]:

predictions_path = None
if RUN_INGEST or RUN_TRAINING:
    result = subprocess.run(['python', 'scripts/predict_upcoming.py'], cwd=str(PROJECT_ROOT), capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr)
        raise RuntimeError('Prediction failed')
    for line in result.stdout.splitlines():
        if 'Wrote predictions:' in line:
            predictions_path = line.split(':', 1)[1].strip()
else:
    predictions_dir = PROJECT_ROOT / 'fightiq_codex' / 'outputs'
    csv_files = sorted(predictions_dir.glob('upcoming_predictions_*.csv'))
    if csv_files:
        predictions_path = csv_files[-1]

predictions_path


In [None]:

if predictions_path:
    preds_df = pd.read_csv(predictions_path)
    display(preds_df[['fight_url','f_1_name','f_2_name','predicted_winner','pred_win_prob_f1','kelly_frac_f1']].head())
else:
    print('No predictions available yet. Run the prediction cell above.')



## Next Steps

- Upload this notebook and the `data/` and `artifacts/` directories as Kaggle datasets to reproduce results online.
- Customize the orchestration cell to run the full weekly pipeline (ingest → validate → train → predict).
- Extend with visualization cells (calibration curves, ROI trends) for richer storytelling in the notebook environment.
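
As a starting point for the weekly orchestration, the sketch below chains the scripts already used in this notebook behind one helper that stops at the first failure. `run_pipeline` and `WEEKLY_STEPS` are hypothetical names, and the validate step is left out because no validation script is named in this notebook. It calls `sys.executable` rather than `python` so the subprocesses use the same interpreter as the kernel.

```python
import subprocess
import sys


def run_pipeline(steps, cwd):
    """Run each (name, command) pair in order and stop at the first failure."""
    for name, cmd in steps:
        print(f"[{name}] {' '.join(cmd)}")
        result = subprocess.run(cmd, cwd=str(cwd), capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            print(result.stderr)
            raise RuntimeError(f"Step '{name}' failed: {cmd}")


# Script paths and flags are the ones used in the cells above.
WEEKLY_STEPS = [
    ("ingest", [sys.executable, "scripts/ingest_upcoming_from_odds_api.py",
                "--regions", "us", "--markets", "h2h"]),
    ("features", [sys.executable, "scripts/build_upcoming_features.py"]),
    ("train_winner", [sys.executable, "scripts/train_winner_enhanced.py",
                      "--min-val-acc", "0.71", "--min-test-acc", "0.62"]),
    ("train_multitask", [sys.executable, "scripts/train_multitask.py"]),
    ("predict", [sys.executable, "scripts/predict_upcoming.py"]),
]

# Uncomment inside the notebook, where PROJECT_ROOT is defined:
# run_pipeline(WEEKLY_STEPS, cwd=PROJECT_ROOT)
```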
