# Race Pacing Models (AutoML + Deep Learning)
Interactive notebook version of `race_pacing_models.py`.

**What you can do here:**
- Train AutoML + Deep Learning models from your uploaded pacing datasets
- Generate pre-race optimal splits (per-50 and per-100)
- Run post-race analysis with actionable per-50 suggestions


In [None]:
# race_pacing_models.py
# -*- coding: utf-8 -*-
"""
Race Pacing: AutoML + Deep Learning models for pre-race split prediction and post-race analysis.

This module builds two kinds of models per event length (n_splits = distance_m / 50):
1) AutoML (best-of-several regressors via cross-validation)
2) Deep Learning (Keras, with hyperparameter tuning if available; graceful fallback to sklearn MLP)

Targets are predicted as normalized split fractions that sum to 1 across the race.
At inference time, fractions are multiplied by the target time to return per-50 splits.

Functions you’ll likely call:
- train_all_models(data_paths: list[str], out_dir: str = "models")
- pre_race_optimal_splits(distance_m: int, stroke: str, pool: str, pb50_s: float, target_time_s: float,
                          model_kind: str = "auto", models_dir: str = "models") -> dict
- post_race_analysis(distance_m: int, stroke: str, pool: str, splits: list[float], split_interval: int = 50,
                     model_kind: str = "auto", target_time_s: float | None = None,
                     models_dir: str = "models") -> dict

Data expectations (robust parsing implemented):
- A CSV per stroke/category. Useful columns (any subset is fine; best effort parsing):
    * 'Event' (e.g., '200 Fly', '400 IM') or 'Distance' (50/100/...)
    * 'Stroke' (Free/Back/Breast/Fly/IM) – inferred from file name if missing
    * 'Pool' (LCM/SCM/SCY) – optional
    * 'Final time in seconds' OR 'FinalTime' OR 'Time' (mm:ss.xx or ss.xx) – optional (falls back to sum(splits))
    * 'Splits' (semicolon-separated string per-50), OR split columns like '50_1', '50_2', ...
- Rows whose splits do not match the inferred number of 50s for the event are dropped.

Modeling notes:
- Features X: [distance_m, one-hot(stroke), one-hot(pool), pb50_estimate, total_time_s]
  During training, pb50_estimate = min(per-50 splits) of that row (best effort).
- Targets Y: normalized split fractions (length = n_splits). No padding is applied, because each n_splits gets its own model.
- We train a separate model per n_splits (e.g., 1, 2, 4, 8, 16, 30).

Dependencies (used if installed):
- numpy, pandas, scikit-learn, joblib
- optionally: xgboost, lightgbm, catboost
- deep learning: tensorflow/keras (preferred) and keras-tuner (optional); fallback to sklearn.neural_network.MLPRegressor.

Author: SwimForge (HydroSmasher)
"""
from __future__ import annotations

import os
import re
import json
import math
import warnings
from dataclasses import dataclass
from typing import List, Dict, Tuple, Optional

import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
import joblib

# Optional imports
try:
    import xgboost as xgb
    HAS_XGB = True
except Exception:
    HAS_XGB = False

try:
    import lightgbm as lgb
    HAS_LGBM = True
except Exception:
    HAS_LGBM = False

try:
    from catboost import CatBoostRegressor
    HAS_CAT = True
except Exception:
    HAS_CAT = False

# Deep learning optional imports
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    HAS_TF = True
except Exception:
    HAS_TF = False

try:
    import keras_tuner as kt  # hyperparameter tuning for keras models
    HAS_KT = True
except Exception:
    HAS_KT = False

warnings.filterwarnings("ignore", category=UserWarning)

RNG = np.random.default_rng(42)
MAX_SPLITS = 30  # supports up to 1500m in 50m increments
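
The module docstring above describes the targets as normalized split fractions that are multiplied by the target time at inference. A minimal NumPy sketch of that round trip follows; the splits and the target time are made-up values for illustration, not data from the uploaded CSVs.

```python
import numpy as np

# Hypothetical per-50 splits for a 200 m race (seconds).
splits_50 = np.array([26.0, 27.5, 28.0, 28.5])

# Training target: each split as a fraction of the total, so the vector sums to 1.
fractions = splits_50 / splits_50.sum()

# Inference: scale the fractions by a target time to get per-50 splits back.
target_time_s = 108.0
planned_splits = fractions * target_time_s

print(fractions.round(4))
print(planned_splits.round(2))
```

Because the fractions sum to 1, the planned splits always add up to the requested target time, whatever shape the pacing profile has.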

In [None]:
# ------------------------------
# Utilities: time parsing/formatting

In [None]:
# ------------------------------
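def time_to_seconds(t: str | float | int) -> float:
    """Parse time string 'mm:ss.xx' or 'ss.xx' to seconds. Pass through numeric values."""
    # NumPy integers (e.g. np.int64) are not instances of int, so list them explicitly.
    if isinstance(t, (int, float, np.integer, np.floating)):
        return float(t)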
    if not isinstance(t, str):
        return np.nan
    t = t.strip()
    if not t:
        return np.nan
    # Replace commas with dots
    t = t.replace(',', '.')
    try:
        if ':' in t:
            mins, secs = t.split(':', 1)
            return float(mins) * 60.0 + float(secs)
        return float(t)
    except Exception:
        return np.nan


def seconds_to_time_str(s: float) -> str:
    """Format seconds into 'M:SS.ss'."""
    if s is None or np.isnan(s):
        return ""
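    # Round first so that values such as 59.999 format as '1:00.00' rather than '0:60.00'.
    s = round(float(s), 2)
    m = int(s // 60)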
    rem = s - 60 * m
    return f"{m}:{rem:05.2f}"


def parse_splits_field(val: str | list | tuple) -> List[float]:
    """Parse splits either from semicolon-separated string or list-like."""
    if val is None:
        return []
    if isinstance(val, (list, tuple, np.ndarray)):
        return [float(x) for x in val]
    if isinstance(val, str):
        raw = [x.strip() for x in val.replace(',', ';').split(';') if x.strip()]
        out = []
        for x in raw:
            out.append(time_to_seconds(x))
        return out
    return []


def aggregate_splits(splits_50: List[float], interval: int = 100) -> List[float]:
    """Aggregate per-50 splits to per-100 splits if requested."""
    if interval == 50:
        return splits_50
    if interval == 100:
        out = []
        for i in range(0, len(splits_50), 2):
            s = splits_50[i:i+2]
            out.append(sum(s))
        return out
    raise ValueError("interval must be 50 or 100")
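
To make the helpers above concrete, here is a small standalone sketch of the same idea: parse a semicolon-separated splits string, then pair the 50s into 100s. `parse_time` and `pair_up` are hypothetical names used only for this illustration; they mirror the notebook's functions but are not them.

```python
from typing import List


def parse_time(text: str) -> float:
    """Convert 'mm:ss.xx' or 'ss.xx' to seconds (illustrative helper)."""
    text = text.strip()
    if ":" in text:
        mins, secs = text.split(":", 1)
        return float(mins) * 60.0 + float(secs)
    return float(text)


def pair_up(splits_50: List[float]) -> List[float]:
    """Sum consecutive pairs of 50s; a trailing odd 50 is kept on its own."""
    return [sum(splits_50[i:i + 2]) for i in range(0, len(splits_50), 2)]


raw = "26.00; 27.50; 28.00; 1:02.50"
splits_50 = [parse_time(x) for x in raw.split(";")]
splits_100 = pair_up(splits_50)
print(splits_50)   # [26.0, 27.5, 28.0, 62.5]
print(splits_100)  # [53.5, 90.5]
```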

In [None]:
# ------------------------------
# Data ingestion & feature building

In [None]:
# ------------------------------
STROKE_ALIASES = {
    'FREE': 'Free',
    'FREESTYLE': 'Free',
    'BK': 'Back', 'BACK': 'Back', 'BACKSTROKE': 'Back',
    'BR': 'Breast', 'BREAST': 'Breast', 'BREASTSTROKE': 'Breast',
    'FLY': 'Fly', 'BUTTERFLY': 'Fly',
    'IM': 'IM', 'INDIVIDUAL MEDLEY': 'IM'
}

POOL_ALIASES = {'LCM': 'LCM', 'SCM': 'SCM', 'SCY': 'SCY', 'LC': 'LCM', 'SC': 'SCM'}


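def infer_stroke_from_filename(path: str) -> Optional[str]:
    """Infer the stroke from whole words in the file name, e.g. '... Bk fastest times.csv' -> 'Back'."""
    # Compare whole alphabetic tokens; a plain substring check would find 'im' inside 'time'.
    tokens = set(re.findall(r'[a-z]+', os.path.basename(path).lower()))
    for key, norm in [('free', 'Free'), ('freestyle', 'Free'),
                      ('bk', 'Back'), ('back', 'Back'), ('backstroke', 'Back'),
                      ('br', 'Breast'), ('breast', 'Breast'), ('breaststroke', 'Breast'),
                      ('fly', 'Fly'), ('butterfly', 'Fly'),
                      ('im', 'IM')]:
        if key in tokens:
            return norm
    return None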


def clean_stroke(val: str | None, fallback: Optional[str] = None) -> Optional[str]:
    if isinstance(val, str):
        up = val.strip().upper()
        return STROKE_ALIASES.get(up, val.strip().title())
    return fallback


def clean_pool(val: str | None) -> Optional[str]:
    if isinstance(val, str):
        up = val.strip().upper()
        return POOL_ALIASES.get(up, val.strip().upper())
    return None


def infer_distance(row: pd.Series) -> Optional[int]:
    # Priority: 'Distance' numeric; else extract from 'Event' like '200 Fly'
    for col in ['Distance', 'distance', 'DISTANCE']:
        if col in row and pd.notnull(row[col]):
            try:
                return int(float(row[col]))
            except Exception:
                pass
    for col in ['Event', 'event', 'EVENT']:
        if col in row and isinstance(row[col], str):
            m = re.search(r'(\d+)', row[col])
            if m:
                try:
                    return int(m.group(1))
                except Exception:
                    pass
    # Fallback: from splits length * 50
    if 'splits_50' in row and isinstance(row['splits_50'], list) and len(row['splits_50']) > 0:
        return len(row['splits_50']) * 50
    return None


def extract_splits_from_columns(df: pd.DataFrame) -> List[List[float]]:
    """Try to extract splits from a variety of column patterns."""
    # Common names
    common_names = [c for c in df.columns if c.strip().lower() in ('splits', 'split', 'splits_50', 'per_50_splits')]
    splits = []
    if common_names:
        for _, r in df.iterrows():
            first = None
            for cn in common_names:
                if pd.notnull(r.get(cn)):
                    first = r.get(cn)
                    break
            splits.append(parse_splits_field(first))
        return splits

    # Try numeric sequence columns (50_1, 50_2, ...)
    seq_cols = [c for c in df.columns if re.match(r'^(50|100|split_?\d+|s\d+|p\d+)', c.strip().lower())]
    if seq_cols:
        # Sort by name length, then alphabetically, so that '50_9' comes before '50_10'
        seq_cols_sorted = sorted(seq_cols, key=lambda x: (len(x), x))
        for _, r in df.iterrows():
            vals = []
            for c in seq_cols_sorted:
                v = r.get(c)
                if pd.isnull(v):
                    continue
                vals.append(time_to_seconds(str(v)))
            splits.append(vals)
        return splits

    # Nothing obvious
    return [[] for _ in range(len(df))]


@dataclass
class PreparedData:
    X: pd.DataFrame
    Y: np.ndarray  # shape (n_samples, n_splits)
    feature_cols: List[str]
    cat_cols: List[str]
    n_splits: int


def prepare_training_frame(df: pd.DataFrame, default_stroke: Optional[str] = None) -> pd.DataFrame:
    cols = {c: c.strip() for c in df.columns}
    df = df.rename(columns=cols)
    # Standardize key columns if present
    if 'Stroke' not in df.columns:
        df['Stroke'] = default_stroke
    df['Stroke'] = df['Stroke'].apply(lambda s: clean_stroke(s, default_stroke))
    if 'Pool' in df.columns:
        df['Pool'] = df['Pool'].apply(clean_pool)
    else:
        df['Pool'] = None

    # Time columns
    time_col_candidates = [c for c in df.columns if c.lower() in ('final time in seconds', 'finaltime', 'time', 'final_time', 'final_time_s')]
    df['splits_50'] = extract_splits_from_columns(df)
    # distance
    df['Distance_m'] = df.apply(infer_distance, axis=1)

    # total time
    if time_col_candidates:
        # use first available non-null
        def _pick_time(r):
            for c in time_col_candidates:
                v = r.get(c)
                if pd.notnull(v):
                    return time_to_seconds(v)
            return np.nan
        df['TotalTime_s'] = df.apply(_pick_time, axis=1).astype(float)
    else:
        df['TotalTime_s'] = np.nan

    # If TotalTime missing, compute from splits
    def _fix_total(r):
        if pd.notnull(r['TotalTime_s']) and r['TotalTime_s'] > 0:
            return r['TotalTime_s']
        if isinstance(r['splits_50'], list) and len(r['splits_50']) > 0:
            return float(sum(r['splits_50']))
        return np.nan
    df['TotalTime_s'] = df.apply(_fix_total, axis=1)

    # Filter rows
    def _valid_row(r):
        if pd.isnull(r['Distance_m']) or pd.isnull(r['TotalTime_s']) or r['TotalTime_s'] <= 0:
            return False
        ns = int(r['Distance_m'] // 50)
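        if not (isinstance(r['splits_50'], list) and len(r['splits_50']) == ns):
            return False
        # Unparseable splits come back as NaN and would poison the target fractions.
        return bool(np.all(np.isfinite(r['splits_50'])))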
    df = df[df.apply(_valid_row, axis=1)].reset_index(drop=True)

    # pb50 estimate
    df['PB50_est_s'] = df['splits_50'].apply(lambda s: float(np.nanmin(s)) if len(s) > 0 else np.nan)

    return df


def make_features_and_targets(df: pd.DataFrame, n_splits: int) -> PreparedData:
    work = df[df['Distance_m'] // 50 == n_splits].copy()
    if work.empty:
        raise ValueError(f"No rows for n_splits={n_splits}")
    # Targets: normalized split fractions of length n_splits
    def _fractions(splits):
        arr = np.array(splits, dtype=float)
        total = arr.sum() if arr.sum() > 0 else 1.0
        return (arr / total).tolist()
    work['fractions'] = work['splits_50'].apply(_fractions)
    Y = np.stack(work['fractions'].values, axis=0)  # shape (N, n_splits)

    # Features
    work['Pool'] = work['Pool'].fillna('UNK')
    X = work[['Distance_m', 'Stroke', 'Pool', 'PB50_est_s', 'TotalTime_s']].copy()
    feature_cols = ['Distance_m', 'PB50_est_s', 'TotalTime_s']
    cat_cols = ['Stroke', 'Pool']

    return PreparedData(X=X, Y=Y, feature_cols=feature_cols, cat_cols=cat_cols, n_splits=n_splits)
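
The row filter and feature derivation above can be illustrated without pandas. The sketch below uses hypothetical rows (plain dicts) and keeps only those whose split count equals distance / 50, then derives the PB50 estimate and the target fractions in the way the module docstring describes. The field names are chosen for this example only.

```python
# Hypothetical rows; the second one is missing a split and should be dropped.
rows = [
    {"distance_m": 200, "splits_50": [26.0, 27.5, 28.0, 28.5]},
    {"distance_m": 200, "splits_50": [26.0, 27.5, 28.0]},
    {"distance_m": 100, "splits_50": [25.0, 27.0]},
]

prepared = []
for row in rows:
    n_splits = row["distance_m"] // 50
    splits = row["splits_50"]
    if len(splits) != n_splits:
        continue  # split count does not match the event length
    total = sum(splits)
    prepared.append({
        "distance_m": row["distance_m"],
        "n_splits": n_splits,
        "pb50_est_s": min(splits),                 # fastest 50 in this race
        "total_time_s": total,
        "fractions": [s / total for s in splits],  # targets, sum to 1
    })

for p in prepared:
    print(p["distance_m"], p["pb50_est_s"], p["total_time_s"])
```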

In [None]:
# ------------------------------
# Model builders

In [None]:
# ------------------------------
def build_automl_model() -> List[Tuple[str, object]]:
    """Return a list of (name, estimator) candidates to try; all wrapped later in MultiOutputRegressor."""
    cands = [
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=500, random_state=42)),
        ("etr", ExtraTreesRegressor(n_estimators=600, random_state=42)),
        ("gbr", GradientBoostingRegressor(random_state=42))
    ]
    if HAS_XGB:
        cands.append(("xgb", xgb.XGBRegressor(
            n_estimators=800, max_depth=6, learning_rate=0.05, subsample=0.9, colsample_bytree=0.9,
            tree_method="hist", random_state=42, n_jobs=-1)))
    if HAS_LGBM:
        cands.append(("lgbm", lgb.LGBMRegressor(
            n_estimators=1000, num_leaves=63, learning_rate=0.05, subsample=0.9, colsample_bytree=0.9,
            random_state=42)))
    if HAS_CAT:
        cands.append(("cat", CatBoostRegressor(
            iterations=1200, depth=6, learning_rate=0.05, loss_function='MAE', random_seed=42, verbose=False)))
    return cands


def fit_best_automl(X: pd.DataFrame, Y: np.ndarray, feature_cols: List[str], cat_cols: List[str]) -> Tuple[str, Pipeline]:
    """Fit several regressors with CV and pick the best MAE. Return (name, pipeline)."""
    pre = ColumnTransformer([
        ("num", StandardScaler(), feature_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ])
    best_name, best_score, best_pipe = None, float("inf"), None
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for name, est in build_automl_model():
        model = MultiOutputRegressor(est)
        pipe = Pipeline([("pre", pre), ("model", model)])
        # Negative MAE -> higher is better; we store MAE
        scores = -cross_val_score(pipe, X, Y, cv=kf, scoring="neg_mean_absolute_error", n_jobs=None)
        mae = float(scores.mean())
        if mae < best_score:
            best_name, best_score, best_pipe = name, mae, pipe

    if best_pipe is None:
        # Safety fallback
        est = MultiOutputRegressor(Ridge())
        best_pipe = Pipeline([("pre", pre), ("model", est)])
        best_name = "ridge"
        best_pipe.fit(X, Y)
        return best_name, best_pipe

    best_pipe.fit(X, Y)
    return best_name, best_pipe


def build_keras_model(input_dim: int, output_dim: int, hp: Optional["kt.HyperParameters"] = None) -> "keras.Model":
    """Create a simple MLP regressor in Keras. If hp provided (keras-tuner), tune width/depth/dropout."""
    if not HAS_TF:
        raise RuntimeError("TensorFlow/Keras not available")

    if hp is None:
        width = 256
        depth = 3
        dropout = 0.1
        lr = 1e-3
    else:
        width = hp.Int("width", min_value=128, max_value=512, step=64)
        depth = hp.Int("depth", min_value=2, max_value=5, step=1)
        dropout = hp.Float("dropout", min_value=0.0, max_value=0.4, step=0.05)
        lr = hp.Choice("lr", values=[1e-4, 5e-4, 1e-3])

    inputs = keras.Input(shape=(input_dim,), name="features")
    x = inputs
    for _ in range(depth):
        x = layers.Dense(width, activation="relu")(x)
        if dropout > 0:
            x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(output_dim, activation="softmax", name="fractions")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mae", metrics=["mae"])
    return model


def fit_best_deep_model(X: pd.DataFrame, Y: np.ndarray, feature_cols: List[str], cat_cols: List[str]) -> Tuple[str, object, ColumnTransformer]:
    """
    Train a deep model. Preferred: Keras with (optional) keras-tuner.
    Fallback: sklearn MLPRegressor with basic hyperparameter search.
    Returns (impl_name, trained_model, preprocessor) where preprocessor encodes inputs for the model.
    """
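    # sparse_threshold=0.0 forces a dense array, so the encoded matrix can be passed straight to Keras.
    pre = ColumnTransformer([
        ("num", StandardScaler(), feature_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ], sparse_threshold=0.0)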
    X_enc = pre.fit_transform(X)
    input_dim = X_enc.shape[1]
    output_dim = Y.shape[1]

    if HAS_TF:
        # Keras path
        if HAS_KT:
            tuner = kt.RandomSearch(
                lambda hp: build_keras_model(input_dim, output_dim, hp),
                objective="val_mae",
                max_trials=12,
                executions_per_trial=1,
                overwrite=True,
                directory=".kerastuner",
                project_name=f"pacing_ns{output_dim}"
            )
            X_tr, X_va, Y_tr, Y_va = train_test_split(X_enc, Y, test_size=0.2, random_state=42)
            stop = keras.callbacks.EarlyStopping(monitor="val_mae", patience=15, restore_best_weights=True)
            tuner.search(X_tr, Y_tr, validation_data=(X_va, Y_va), epochs=200, batch_size=64, callbacks=[stop], verbose=0)
            model = tuner.get_best_models(num_models=1)[0]
            impl = "keras_tuner"
        else:
            model = build_keras_model(input_dim, output_dim, hp=None)
            stop = keras.callbacks.EarlyStopping(monitor="val_mae", patience=20, restore_best_weights=True)
            model.fit(X_enc, Y, epochs=250, batch_size=64, validation_split=0.2, callbacks=[stop], verbose=0)
            impl = "keras"
        return impl, model, pre

    # Fallback: sklearn MLP
    param_grid = {
        "hidden_layer_sizes": [(256, 256), (256, 128), (128, 128, 64)],
        "activation": ["relu"],
        "alpha": [1e-5, 1e-4, 1e-3],
        "learning_rate_init": [1e-3, 5e-4],
        "max_iter": [1000],
        "random_state": [42]
    }
    best_model = None
    best_mae = float("inf")
    for hls in param_grid["hidden_layer_sizes"]:
        for alpha in param_grid["alpha"]:
            for lr in param_grid["learning_rate_init"]:
                mlp = MLPRegressor(hidden_layer_sizes=hls, activation="relu", alpha=alpha,
                                   learning_rate_init=lr, max_iter=1000, random_state=42)
                model = MultiOutputRegressor(mlp)
                kf = KFold(n_splits=5, shuffle=True, random_state=42)
                scores = -cross_val_score(Pipeline([("model", model)]), X_enc, Y, cv=kf,
                                          scoring="neg_mean_absolute_error")
                mae = float(scores.mean())
                if mae < best_mae:
                    best_mae, best_model = mae, model
    # Fit on full data
    best_model.fit(X_enc, Y)
    return "sklearn_mlp", best_model, pre
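
The selection loop in `fit_best_automl` boils down to "cross-validate each candidate, keep the lowest MAE, refit on everything". Here is a minimal sketch of that pattern on synthetic data. The data, the two candidates and their settings are assumptions picked to keep the example fast; they are not the notebook's actual candidate list.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 races, 2 features, 4 split fractions per race.
X = rng.normal(size=(60, 2))
raw = 1.0 + 0.05 * X @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(60, 4))
Y = raw / raw.sum(axis=1, keepdims=True)  # each row sums to 1

candidates = {
    "ridge": Ridge(alpha=1.0),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
}

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, est in candidates.items():
    model = MultiOutputRegressor(est)
    # scikit-learn reports negative MAE (higher is better), so flip the sign.
    neg_mae = cross_val_score(model, X, Y, cv=kf, scoring="neg_mean_absolute_error")
    scores[name] = float(-neg_mae.mean())

# Keep the candidate with the lowest cross-validated MAE and refit it on all rows.
best_name = min(scores, key=scores.get)
best_model = MultiOutputRegressor(candidates[best_name]).fit(X, Y)
print(scores, "->", best_name)
```

Note that 5-fold cross-validation needs at least five rows, which matters for event lengths with very few recorded races.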

In [None]:
# ------------------------------
# Persistence

In [None]:
# ------------------------------
def _ns_model_paths(models_dir: str, n_splits: int) -> Dict[str, str]:
    os.makedirs(models_dir, exist_ok=True)
    return {
        "automl": os.path.join(models_dir, f"automl_ns{n_splits}.joblib"),
        "dl_model": os.path.join(models_dir, f"dl_ns{n_splits}.keras" if HAS_TF else f"dl_ns{n_splits}.joblib"),
        "dl_prep": os.path.join(models_dir, f"dl_prep_ns{n_splits}.joblib"),
        "meta": os.path.join(models_dir, f"meta_ns{n_splits}.json")
    }


def save_automl(pipe: Pipeline, path: str):
    joblib.dump(pipe, path)


def load_automl(path: str) -> Pipeline:
    return joblib.load(path)


def save_dl(model, preproc: ColumnTransformer, model_path: str, prep_path: str, impl_name: str):
    if HAS_TF and isinstance(model, tf.keras.Model):
        model.save(model_path, include_optimizer=False)
    else:
        joblib.dump(model, model_path)
    joblib.dump({"pre": preproc, "impl": impl_name}, prep_path)


def load_dl(model_path: str, prep_path: str):
    meta = joblib.load(prep_path)
    pre = meta["pre"]
    impl = meta["impl"]
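    # In current Keras a '.keras' model is a single file, not a directory, so dispatch on the saved impl name.
    if HAS_TF and str(impl).startswith("keras"):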
        model = tf.keras.models.load_model(model_path, compile=False)
        # compile for inference loss if needed
        model.compile(optimizer="adam", loss="mae")
    else:
        model = joblib.load(model_path)
    return impl, model, pre
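
The persistence layer keeps one set of artifacts per `n_splits`, plus a small JSON metadata file. The standalone sketch below reproduces only the naming scheme and the metadata round trip in a temporary directory. `artifact_paths` is a hypothetical helper for this example and the metadata values are placeholders.

```python
import json
import tempfile
from pathlib import Path


def artifact_paths(models_dir: Path, n_splits: int) -> dict:
    """Illustrative naming scheme: one set of files per n_splits."""
    return {
        "automl": models_dir / f"automl_ns{n_splits}.joblib",
        "meta": models_dir / f"meta_ns{n_splits}.json",
    }


with tempfile.TemporaryDirectory() as tmp:
    paths = artifact_paths(Path(tmp), n_splits=4)   # 4 splits = 200 m
    meta = {"n_splits": 4, "automl_best": "ridge"}  # placeholder values
    paths["meta"].write_text(json.dumps(meta, indent=2))
    loaded = json.loads(paths["meta"].read_text())

print(paths["automl"].name)  # automl_ns4.joblib
print(loaded)
```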

In [None]:
# ------------------------------
# Training orchestration

In [None]:
# ------------------------------
def read_any_csv(path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(path)
    except UnicodeDecodeError:
        # Retry with a permissive encoding; any other error propagates to the caller.
        return pd.read_csv(path, encoding="latin1")


def concat_datasets(paths: List[str]) -> pd.DataFrame:
    frames = []
    for p in paths:
        if not os.path.exists(p):
            print(f"[WARN] Missing dataset file: {p}")
            continue
        df = read_any_csv(p)
        stroke_hint = infer_stroke_from_filename(p)
        frames.append(prepare_training_frame(df, default_stroke=stroke_hint))
    if not frames:
        raise RuntimeError("No valid datasets found.")
    full = pd.concat(frames, ignore_index=True)
    # sanity: only 50m-based distances
    full = full[full['Distance_m'] % 50 == 0].reset_index(drop=True)
    return full


def train_all_models(data_paths: List[str], out_dir: str = "models") -> Dict[int, Dict[str, str]]:
    """
    Train AutoML + Deep Learning models per n_splits. Returns dict of n_splits -> saved paths.
    """
    df = concat_datasets(data_paths)
    results = {}
    for n in sorted(df['Distance_m'].unique() // 50):
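        try:
            prep = make_features_and_targets(df, n_splits=int(n))
        except ValueError:
            continue
        if len(prep.X) < 5:
            # The 5-fold cross-validation used for model selection needs at least 5 rows.
            print(f"[WARN] Skipping n_splits={int(n)}: only {len(prep.X)} usable rows")
            continue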

        # AutoML
        best_name, automl_pipe = fit_best_automl(prep.X, prep.Y, prep.feature_cols, prep.cat_cols)

        # Deep
        dl_impl, dl_model, dl_pre = fit_best_deep_model(prep.X, prep.Y, prep.feature_cols, prep.cat_cols)

        # Save
        paths = _ns_model_paths(out_dir, int(n))
        save_automl(automl_pipe, paths["automl"])
        save_dl(dl_model, dl_pre, paths["dl_model"], paths["dl_prep"], dl_impl)

        meta = {
            "n_splits": int(n),
            "feature_cols": prep.feature_cols,
            "cat_cols": prep.cat_cols,
            "automl_best": best_name,
            "dl_impl": dl_impl
        }
        with open(paths["meta"], "w") as f:
            json.dump(meta, f, indent=2)
        results[int(n)] = paths
        print(f"[OK] Trained n_splits={n}: AutoML={best_name}, DL={dl_impl}")
    return results
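
Training loops over the distinct split counts found in the combined data, so it helps to know how many rows each group holds before starting. A short sketch with hypothetical distances:

```python
from collections import Counter

# Hypothetical event distances found in a combined dataset (metres).
distances_m = [50, 100, 100, 200, 200, 200, 400, 800, 1500]

# One model is trained per distinct split count (distance / 50).
rows_per_n_splits = Counter(d // 50 for d in distances_m if d % 50 == 0)

for n_splits in sorted(rows_per_n_splits):
    print(f"n_splits={n_splits:>2}: {rows_per_n_splits[n_splits]} row(s)")
```

Groups with only a handful of rows are the ones most likely to fail or overfit during cross-validation.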

In [None]:
# ------------------------------
# Inference utilities

In [None]:
# ------------------------------
def _encode_inputs(dfrow: pd.DataFrame, pre: ColumnTransformer) -> np.ndarray:
    return pre.transform(dfrow)


def _predict_fractions_automl(n_splits: int, xrow: pd.DataFrame, models_dir: str) -> np.ndarray:
    paths = _ns_model_paths(models_dir, n_splits)
    pipe = load_automl(paths["automl"])
    fracs = pipe.predict(xrow)[0]
    # normalize just in case
    fracs = np.maximum(0, np.array(fracs, dtype=float))
    s = fracs.sum()
    if s <= 0:
        fracs = np.ones(n_splits) / n_splits
    else:
        fracs = fracs / s
    return fracs


def _predict_fractions_dl(n_splits: int, xrow: pd.DataFrame, models_dir: str) -> np.ndarray:
    paths = _ns_model_paths(models_dir, n_splits)
    impl, model, pre = load_dl(paths["dl_model"], paths["dl_prep"])
    X_enc = _encode_inputs(xrow, pre)
    if HAS_TF and isinstance(model, tf.keras.Model):
        fracs = model.predict(X_enc, verbose=0)[0]
    else:
        fracs = model.predict(X_enc)[0]
    fracs = np.maximum(0, np.array(fracs, dtype=float))
    s = fracs.sum()
    if s <= 0:
        fracs = np.ones(n_splits) / n_splits
    else:
        fracs = fracs / s
    return fracs


def _build_feature_row(distance_m: int, stroke: str, pool: str, pb50_s: float, total_time_s: float) -> pd.DataFrame:
    row = pd.DataFrame([{
        "Distance_m": int(distance_m),
        "Stroke": clean_stroke(stroke, stroke),
        "Pool": clean_pool(pool) or "UNK",
        "PB50_est_s": float(pb50_s),
        "TotalTime_s": float(total_time_s)
    }])
    return row


def _choose_model_kind(kind: str) -> str:
    k = (kind or "auto").strip().lower()
    if k in ("auto", "automl", "ml"):
        return "automl"
    if k in ("dl", "deep", "deep_learning", "nn"):
        return "dl"
    return "automl"
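
Both prediction helpers post-process the raw model output the same way: clip negatives to zero, rescale so the fractions sum to 1, and fall back to an even split if nothing positive remains. A self-contained NumPy sketch of that step follows; `renormalize` is a name invented for this example.

```python
import numpy as np


def renormalize(pred) -> np.ndarray:
    """Clip negatives, then rescale to sum to 1 (uniform if nothing is left)."""
    fracs = np.maximum(0.0, np.asarray(pred, dtype=float))
    total = fracs.sum()
    if total <= 0:
        return np.full(len(fracs), 1.0 / len(fracs))
    return fracs / total


print(renormalize([0.26, 0.25, -0.01, 0.27]))
print(renormalize([-0.1, -0.2]))  # falls back to an even split
```

This matters mostly for the tree and linear regressors, whose outputs are not constrained to sum to 1. A softmax output layer already satisfies the constraint.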

In [None]:
# ------------------------------
# Public API

In [None]:
# ------------------------------
def pre_race_optimal_splits(distance_m: int, stroke: str, pool: str, pb50_s: float, target_time_s: float,
                            model_kind: str = "auto", models_dir: str = "models") -> Dict:
    """
    Predict optimal per-50 splits for a given event and target time.
    Returns dict with splits_50 (list), splits_100 (list), and formatted strings.
    """
    if distance_m % 50 != 0:
        raise ValueError("distance_m must be a multiple of 50")
    n_splits = int(distance_m // 50)
    kind = _choose_model_kind(model_kind)
    xrow = _build_feature_row(distance_m, stroke, pool, pb50_s, target_time_s)

    if kind == "automl":
        fracs = _predict_fractions_automl(n_splits, xrow, models_dir)
    else:
        fracs = _predict_fractions_dl(n_splits, xrow, models_dir)

    splits_50 = (fracs * float(target_time_s)).tolist()
    splits_100 = aggregate_splits(splits_50, interval=100)
    return {
        "event": {"distance_m": distance_m, "stroke": clean_stroke(stroke), "pool": clean_pool(pool) or "UNK"},
        "target_time_s": float(target_time_s),
        "pb50_s": float(pb50_s),
        "splits_50_s": splits_50,
        "splits_50_str": [seconds_to_time_str(s) for s in splits_50],
        "splits_100_s": splits_100,
        "splits_100_str": [seconds_to_time_str(s) for s in splits_100]
    }


def post_race_analysis(distance_m: int, stroke: str, pool: str, splits: List[float], split_interval: int = 50,
                       model_kind: str = "auto", target_time_s: Optional[float] = None,
                       models_dir: str = "models") -> Dict:
    """
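    Analyze a finished race. You can pass per-50 or per-100 splits via 'split_interval'.
    - Per-100 splits are expanded by halving each 100, so per-50 deltas are only indicative in that case.
    - Predicts an "optimal" split profile for your total time using the trained model
      (scaled to target_time_s instead, if it is given).
    - Compares actual vs. optimal per-50, and aggregates to per-100 as needed.
    Returns: dict with actual and optimal splits, deltas, and suggestions.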
    """
    if distance_m % 50 != 0:
        raise ValueError("distance_m must be a multiple of 50")
    if split_interval not in (50, 100):
        raise ValueError("split_interval must be 50 or 100")

    # Normalize to per-50 splits
    n_splits = int(distance_m // 50)
    if split_interval == 100:
        # expand per-100 to per-50 (assume even split within each 100m for expansion)
        splits50 = []
        for s100 in splits:
            splits50.extend([float(s100)/2.0, float(s100)/2.0])
    else:
        splits50 = [float(x) for x in splits]

    if len(splits50) != n_splits:
        raise ValueError(f"Expected {n_splits} splits of {50}m; got {len(splits50)} based on split_interval={split_interval}")

    total_time = float(sum(splits50)) if target_time_s is None else float(target_time_s)
    pb50_est = float(np.nanmin(splits50)) if splits50 else 0.0

    kind = _choose_model_kind(model_kind)
    xrow = _build_feature_row(distance_m, stroke, pool, pb50_est, total_time)

    if kind == "automl":
        fracs_opt = _predict_fractions_automl(n_splits, xrow, models_dir)
    else:
        fracs_opt = _predict_fractions_dl(n_splits, xrow, models_dir)

    opt_50 = (fracs_opt * total_time).tolist()
    deltas_50 = [a - o for a, o in zip(splits50, opt_50)]  # +ve = slower than optimal; -ve = faster than optimal

    # Simple textual suggestions
    suggestions = []
    for i, d in enumerate(deltas_50, start=1):
        if d > 0.20:
            suggestions.append(f"50m #{i}: You were {d:.2f}s slower than model-optimal. Aim to pick up pace here.")
        elif d < -0.20:
            suggestions.append(f"50m #{i}: You were {-d:.2f}s faster than model-optimal. Consider redistributing effort.")
        else:
            suggestions.append(f"50m #{i}: Close to optimal (+/-0.20s).")

    # Aggregate to per-100 if needed
    actual_100 = aggregate_splits(splits50, interval=100)
    optimal_100 = aggregate_splits(opt_50, interval=100)
    deltas_100 = [a - o for a, o in zip(actual_100, optimal_100)]

    return {
        "event": {"distance_m": distance_m, "stroke": clean_stroke(stroke), "pool": clean_pool(pool) or "UNK"},
        "actual": {
            "splits_50_s": splits50,
            "splits_50_str": [seconds_to_time_str(s) for s in splits50],
            "splits_100_s": actual_100,
            "splits_100_str": [seconds_to_time_str(s) for s in actual_100],
            "total_time_s": sum(splits50),
            "total_time_str": seconds_to_time_str(sum(splits50))
        },
        "optimal": {
            "splits_50_s": opt_50,
            "splits_50_str": [seconds_to_time_str(s) for s in opt_50],
            "splits_100_s": optimal_100,
            "splits_100_str": [seconds_to_time_str(s) for s in optimal_100],
        },
        "deltas": {
            "per_50_s": deltas_50,
            "per_100_s": deltas_100
        },
        "suggestions": suggestions
    }
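
The suggestion logic in `post_race_analysis` is a simple threshold on the per-50 difference between swum and model splits. The sketch below reproduces that comparison with hypothetical numbers; `optimal_50` stands in for a model prediction and is not real model output.

```python
actual_50 = [26.0, 27.5, 28.0, 28.5]   # swum splits (hypothetical)
optimal_50 = [26.3, 27.4, 27.9, 28.4]  # stand-in for a model prediction
threshold_s = 0.20

labels = []
for i, (a, o) in enumerate(zip(actual_50, optimal_50), start=1):
    delta = a - o  # positive = slower than the reference profile
    if delta > threshold_s:
        labels.append((i, "slower"))
    elif delta < -threshold_s:
        labels.append((i, "faster"))
    else:
        labels.append((i, "close"))

print(labels)
```

Here only the first 50 is flagged: it was swum 0.3 s faster than the reference, while the remaining three sit within the 0.20 s band.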

In [None]:
# ------------------------------
# Convenience: default dataset list & CLI

In [None]:
# ------------------------------
DEFAULT_DATASETS = [
    # Update paths as needed; these are the names the user uploaded.
    "/mnt/data/Competitive Pacing (Responses) - All time free fastest times.csv",
    "/mnt/data/Competitive Pacing (Responses) - All time Bk fastest times .csv",
    "/mnt/data/Competitive Pacing (Responses) - All time Br fastest times.csv",
    "/mnt/data/Competitive Pacing (Responses) - All time Fly fastest times .csv",
    "/mnt/data/Competitive Pacing (Responses) - All time IM fastest times.csv",
]


def _maybe_train_default():
    """
    Helper for quick training via:
        python race_pacing_models.py --train
    """
    print("[INFO] Starting training on default dataset list...")
    results = train_all_models(DEFAULT_DATASETS, out_dir="models")
    print(json.dumps(results, indent=2))


def _demo_inference():
    """
    Quick demo after training:
        python race_pacing_models.py --demo
    """
    # Example: 200 Free, LCM, PB50=24.0s, target=110.0s (1:50.00)
    out = pre_race_optimal_splits(distance_m=200, stroke="Free", pool="LCM", pb50_s=24.0, target_time_s=110.0,
                                  model_kind="auto", models_dir="models")
    print("[DEMO] Pre-race optimal splits (200 Free):")
    print(json.dumps(out, indent=2))

    # Post-race analysis with hypothetical splits @50m
    splits_50 = [26.0, 27.5, 28.0, 28.5]
    ana = post_race_analysis(distance_m=200, stroke="Free", pool="LCM", splits=splits_50, split_interval=50,
                             model_kind="auto", models_dir="models")
    print("[DEMO] Post-race analysis (200 Free):")
    print(json.dumps(ana, indent=2))


# [Notebook] CLI entry point removed; call functions directly in cells.

## Usage: Train and Infer
Run the following cells to train models on the default datasets and then run inference. Make sure the CSVs are present at the paths listed in `DEFAULT_DATASETS`. The imports assume `race_pacing_models.py` is importable from the notebook's working directory.

In [None]:
from race_pacing_models import train_all_models, DEFAULT_DATASETS
results = train_all_models(DEFAULT_DATASETS, out_dir='models')
results

In [None]:
from race_pacing_models import pre_race_optimal_splits
pre = pre_race_optimal_splits(
    distance_m=200, stroke='Free', pool='LCM',
    pb50_s=24.0, target_time_s=110.0,  # 1:50.00
    model_kind='auto', models_dir='models'
)
pre

In [None]:
from race_pacing_models import post_race_analysis
post = post_race_analysis(
    distance_m=200, stroke='Free', pool='LCM',
    splits=[26.0, 27.5, 28.0, 28.5], split_interval=50,
    model_kind='auto', models_dir='models'
)
post