# Model Training - Time Series Cross-Validation

Various combinations of model architectures, features, and training approaches were tested. Because the baseline models' performance is not expected to change, they are not retrained in this notebook.

## Model Architectures
1. **Baseline Model (avg)**: Predict the average `'K%'` across a pitcher's available data.
2. **Baseline Model (last)**: Predict the last observed `'K%'` from the pitcher's available data.
3. **Baseline Model (xK%)**: Predict `'K%'` using the formula: `xK% = -0.61 + (L/Str * 1.1538) + (S/Str * 1.4696) + (F/Str * 0.9417)` (see [The Definitive Pitcher Expected K% Formula](https://fantasy.fangraphs.com/the-definitive-pitcher-expected-k-formula/)).
4. [**Linear Regression Model**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
5. [**Random Forest Model**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
6. [**XGBoost Model**](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)
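
As a rough illustration of the xK% baseline above, the formula can be written as a small function. This is a sketch that assumes the three inputs are fractions of total strikes (for example `0.27`, not `27`); the function name and the sample pitcher are hypothetical and are not taken from `bullpen`.

```python
def expected_k_rate(l_str: float, s_str: float, f_str: float) -> float:
    """Expected K% from looking (L/Str), swinging (S/Str), and foul (F/Str) strike rates."""
    # Coefficients copied from the xK% formula quoted in this notebook
    return -0.61 + (l_str * 1.1538) + (s_str * 1.4696) + (f_str * 0.9417)


# Hypothetical pitcher: 27% looking, 18% swinging, 28% foul strikes
xk = expected_k_rate(l_str=0.27, s_str=0.18, f_str=0.28)
print(f"xK% = {xk:.4f}")
```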

## Features
1. **All 67 Features**:
    - 31 categorical one-hot encoded features for teams (30 MLB teams plus `'---'` for multi-team).
    - 36 numeric features.
2. **7 Features Selected from Lasso Model** (see [03-feature-engineering.ipynb](./03-feature-engineering.ipynb)):
    - `'numeric__Pit/PA'`
    - `'numeric__Str%'`
    - `'numeric__F/Str'`
    - `'numeric__I/Str'`
    - `'numeric__Con'`
    - `'numeric__30%'`
    - `'numeric__L/SO'`

## Training Approach
Two training schemes are used for model training, with **MSE** (mean squared error) as the evaluation metric:
1. **Classical Cross-Validation**: The data will be split into K-folds, with the goal of predicting `'K%'`.
2. **Time-Series Cross-Validation**: Folds will be based on the season, using earlier years (e.g., `2021`) to predict subsequent years (e.g., `2022`), and so on (e.g., `2021-2022` to predict `2023`).
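
For reference, the MSE metric named above, along with the R² score that appears in the training logs, can be computed by hand as follows. This is a dependency-free sketch with made-up `K%` values; the notebook itself relies on `bullpen.model_utils` for its metrics, and the helper names here are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up K% values, for illustration only
y_true = [0.20, 0.25, 0.30]
y_pred = [0.22, 0.25, 0.27]
print(f"mse={mse(y_true, y_pred):.6f} r2={r2(y_true, y_pred):.3f}")
```

R² is negative whenever a model's squared error exceeds that of always predicting the mean of `y_true`, which is how a badly specified model can log a score far below zero.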

**THIS NOTEBOOK USES THE TIME-SERIES CROSS-VALIDATION APPROACH**
- For more details, refer to [02-data-partitioning.ipynb](./02-data-partitioning.ipynb).
```mermaid
graph TD
    A["Player Pool"]
    A --> B["Training Pool"]
    
    subgraph TrainingFlow [" "]
        direction LR
        D["2021 --- 2022 --- 2023 "]
        F["X 2024"]:::red
    end
    B -- Training Flow --> D

    
    subgraph CVTimeSeries ["TimeSeries CV: Previous year predicts next year's K%"]
        FoldTitle11["Fold1"]:::noBorder
        FoldTitle22["Fold2"]:::noBorder
        FoldTitle33["Fold3"]:::noBorder

        Split11["Split1"]:::noBorder
        Fold11["2021"]:::green
        Fold22["2022"]:::blue
        Fold33["2023"]:::transparent
        
        Split22["Split2"]:::noBorder
        Fold44["2021"]:::green
        Fold55["2022"]:::green
        Fold66["2023"]:::blue
        
        Split33["Split3"]:::transparent
        Fold77["Fold1"]:::transparent
        Fold88["Fold2"]:::transparent
        Fold99["Fold3"]:::transparent
        
        FoldTitle11 ~~~ Fold11
        FoldTitle22 ~~~ Fold22
        FoldTitle33 ~~~ Fold33
        
        Split11 ~~~ Split22
        Split22 ~~~ Split33

        Fold11 ~~~ Fold44
        Fold22 ~~~ Fold55
        Fold33 ~~~ Fold66

        Fold44 ~~~ Fold77
        Fold55 ~~~ Fold88
        Fold66 ~~~ Fold99
    end

    TrainingFlow --> CVTimeSeries

    classDef red fill:#FFCCCC,stroke:#FF0000,stroke-width:2px;
    classDef green fill:#CCFFCC,stroke:#00FF00,stroke-width:2px;
    classDef blue fill:#CCCCFF,stroke:#0000FF,stroke-width:2px;
    classDef noBorder fill:none,stroke:none,color:#000000;
    classDef transparent fill:#FFFFFF,stroke:#FFFFFF,stroke-width:2px,opacity:0;
```

Inspired by scikit-learn:
- https://scikit-learn.org/stable/modules/cross_validation.html
- https://scikit-learn.org/1.5/modules/cross_validation.html#time-series-split
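
The expanding-window scheme in the diagram can be sketched without any dependencies. The helper below is hypothetical (it is not part of `bullpen`) and works on a plain list of per-row season labels, returning row positions for each split.

```python
def expanding_window_splits(seasons):
    """Yield (train_idx, val_idx) lists: train on all earlier seasons, validate on the next."""
    ordered = sorted(set(seasons))  # chronological order of distinct seasons
    for i in range(1, len(ordered)):
        train_years = set(ordered[:i])
        val_year = ordered[i]
        train_idx = [j for j, s in enumerate(seasons) if s in train_years]
        val_idx = [j for j, s in enumerate(seasons) if s == val_year]
        yield train_idx, val_idx


# Toy season column: six rows spread over three seasons
seasons = [2021, 2022, 2023, 2022, 2023, 2021]
for train_idx, val_idx in expanding_window_splits(seasons):
    print(f"TRAIN rows: {train_idx} VAL rows: {val_idx}")
```

Yielding `(train, validation)` index pairs mirrors the shape that scikit-learn accepts for a custom `cv` iterable, so a helper along these lines could in principle be passed to `GridSearchCV` instead of writing a manual loop.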

--- 
## Development Workflow

All functions and pipelines demonstrated in this notebook are defined in the `bullpen.data_utils` and `bullpen.model_utils` modules for clarity, reusability, and unit testing. While this notebook retains the initial development and intent of these functions, their inclusion here is primarily for transparency and ease of reference.  

For production usage, refer to the source code in the `bullpen.data_utils` and `bullpen.model_utils` modules.

# Model Training - Time Series Cross-Validation
Several combinations of model architectures, feature sets, and training approaches were tested in the [04a-modeling-classic-cv.ipynb](./04a-modeling-classic-cv.ipynb) notebook. This notebook repeats the non-baseline experiments using time-series cross-validation.

In [1]:
from itertools import product

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.stats
import xgboost as xgb
from sklearn import set_config
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

set_config(display="text")

from bullpen import data_utils, model_utils

In [2]:
train_df = pd.read_csv(data_utils.DATA_DIR.joinpath("train.csv"))
train_df

Unnamed: 0,PlayerId,Team,Season,MLBAMID,Name,Age,TBF,K%,Rk,IP,...,02s,02h,L/SO,S/SO,L/SO%,3pK,4pW,PAu,Pitu,Stru
0,18655,ATL,2021,621345,A.J. Minter,27,221,0.257919,696,52.1,...,44,7,11,46,0.193,11,4,0,0,0
1,18655,ATL,2022,621345,A.J. Minter,28,271,0.346863,649,70.0,...,50,2,23,71,0.245,12,0,0,0,0
2,18655,ATL,2023,621345,A.J. Minter,29,260,0.315385,647,64.2,...,40,4,13,69,0.159,8,1,0,0,0
3,19343,OAK,2022,640462,A.J. Puk,27,281,0.270463,773,66.1,...,48,6,22,54,0.289,15,4,0,0,0
4,19343,MIA,2023,640462,A.J. Puk,28,242,0.322314,755,56.2,...,42,6,22,56,0.282,16,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
983,1943,HOU,2021,425844,Zack Greinke,37,697,0.172166,417,171.0,...,51,4,34,85,0.283,13,6,0,0,0
984,1943,KCR,2022,425844,Zack Greinke,38,585,0.124786,396,137.0,...,39,3,22,51,0.301,7,2,0,0,0
985,1943,KCR,2023,425844,Zack Greinke,39,593,0.163575,353,142.1,...,53,6,25,70,0.263,11,2,0,0,0
986,25918,STL,2022,668868,Zack Thompson,24,136,0.198529,967,34.2,...,40,3,10,17,0.370,3,2,0,0,0


In [3]:
# unique() returns values in order of appearance, so sort to guarantee chronological order
year_list = sorted(train_df.Season.unique().tolist())
year_list

[2021, 2022, 2023]

In [4]:
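def make_timeseries_splits(year_list, train_df):
    """Build expanding-window splits: train on seasons up to year i, validate on year i + 1.

    Assumes `year_list` is in chronological order.
    """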
    splits = {"train": [], "val": []}
    for idx, year in enumerate(year_list[:-1]):
        train_years = year_list[: idx + 1]
        val_year = year_list[idx + 1]

        print(f"TRAIN: {train_years} VAL: {[val_year]}")

        splits["train"].append(train_df[train_df["Season"].isin(train_years)])
        splits["val"].append(train_df[train_df["Season"] == val_year])
    return splits

In [5]:
splits = make_timeseries_splits(year_list, train_df)

TRAIN: [2021] VAL: [2022]
TRAIN: [2021, 2022] VAL: [2023]


In [6]:
# sniff test (split 1)
splits["train"][0].Season.unique(), splits["val"][0].Season.unique()

(array([2021]), array([2022]))

In [7]:
# sniff test (split 2)
splits["train"][1].Season.unique(), splits["val"][1].Season.unique()

(array([2021, 2022]), array([2023]))

In [8]:
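def pred_X_y(split, target="K%", drop_cols=None):
    """Split a DataFrame into features X and target y, dropping `drop_cols` from X."""
    # By default, drop the player name and the target itself from the features
    drop_cols = ["Name", target] if drop_cols is None else drop_cols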

    X_df = split[[c for c in split.columns if c not in drop_cols]]
    y_df = split[target]
    return X_df, y_df

In [9]:
X_df, y_df = pred_X_y(splits["train"][0])
X_val_df, y_val_df = pred_X_y(splits["val"][0])

In [10]:
# sniff test
X_df.Season.unique(), X_val_df.Season.unique()

(array([2021]), array([2022]))

In [11]:
X_df, y_df = pred_X_y(splits["train"][1])
X_val_df, y_val_df = pred_X_y(splits["val"][1])

In [12]:
# sniff test
X_df.Season.unique(), X_val_df.Season.unique()

(array([2021, 2022]), array([2023]))

In [13]:
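processor = model_utils.make_processing_pipeline(
    categorical_features=["Team"],
    # ("Team",) is a one-element tuple; ("Team") is just a string, which would
    # turn `not in` into a substring check
    numeric_features=[f for f in X_df.columns if f not in ("Team",)],
)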

processor

ColumnTransformer(transformers=[('categorical',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Team']),
                                ('numeric',
                                 Pipeline(steps=[('scaler', StandardScaler())]),
                                 ['PlayerId', 'Season', 'MLBAMID', 'Age', 'TBF',
                                  'Rk', 'IP', 'PA', 'Pit', 'Pit/PA', 'Str',
                                  'Str%', 'L/Str', 'S/Str', 'F/Str', 'I/Str',
                                  'AS/Str', 'I/Bll', 'AS/Pit', 'Con', '1st%',
                                  '30%', '30c', '30s', '02%', '02c', '02s',
                                  '02h', 'L/SO', 'S/SO', ...])])

In [14]:
lasso_features = [
    "Pit/PA",
    "Str%",
    "F/Str",
    "I/Str",
    "Con",
    "30%",
    "L/SO",
]
X_df_lasso = X_df[lasso_features]

processor_lasso = model_utils.make_processing_pipeline(
    categorical_features=None,
    numeric_features=list(X_df_lasso.columns),
)

processor_lasso

ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('scaler', StandardScaler())]),
                                 ['Pit/PA', 'Str%', 'F/Str', 'I/Str', 'Con',
                                  '30%', 'L/SO'])])

## Model Training

### Baseline Estimators
Since the baseline models rely solely on the target variable (without incorporating any feature data), and the `ArticleModel` uses raw, unscaled data, these models are excluded from the training process in this notebook. The results of the baseline models are expected to remain consistent, and models trained using the time-series cross-validation approach should outperform these baseline models.

| Model                        | R²       | MSE     |
|------------------------------|----------|---------|
| ArticleModel()               | 0.876    | 0.000396|
| Baseline (method='mean')     | 0.835    | 0.000527|
| Baseline (method='last')     | 0.673    | 0.001041|
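
To make the two target-only baselines concrete, here is a minimal sketch of how they might produce a prediction from a pitcher's earlier seasons. The function is hypothetical and may differ from the `Baseline` class used in the earlier notebooks; the sample values are A.J. Minter's 2021 and 2022 `K%` from the training data shown above.

```python
def baseline_predict(history, method="mean"):
    """Predict a pitcher's next K% from earlier-season K% values (oldest first)."""
    if not history:
        raise ValueError("history must contain at least one season")
    if method == "mean":
        return sum(history) / len(history)  # average of all available seasons
    if method == "last":
        return history[-1]  # most recent season only
    raise ValueError(f"unknown method: {method!r}")


# A.J. Minter's 2021 and 2022 K% (his 2023 K% in the data is 0.315385)
minter_history = [0.257919, 0.346863]
print(baseline_predict(minter_history, method="mean"))
print(baseline_predict(minter_history, method="last"))
```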

### Non-Baseline Estimators
With the baseline models established, we proceed with more advanced models, including:
1. [Linear Regression Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
2. [Random Forest Model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
3. [XGBoost Model](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)

Each of the above model architectures will be tested with two different feature sets:
1. **All 67 Features**  
   - 31 categorical one-hot features for teams (30 MLB teams and '---' for multi-team)
   - 36 numeric features
2. **7 Features Selected from Lasso Model** (see [03-feature-engineering.ipynb](./03-feature-engineering.ipynb))
   - `'numeric__Pit/PA'`
   - `'numeric__Str%'`
   - `'numeric__F/Str'`
   - `'numeric__I/Str'`
   - `'numeric__Con'`
   - `'numeric__30%'`
   - `'numeric__L/SO'`

**Note**: The baseline models do not utilize features; they rely only on the mean or the last observed value of the target variable (`'K%'`). The `ArticleModel` specifically selects columns with the highest predictive power as mentioned in the article. Therefore, testing the performance of the baseline models with the full feature set is unnecessary. For the remaining models, we will evaluate performance using both the full feature set and the Lasso-selected feature set.

### Time Series Cross-Validation
Previously, we used scikit-learn's `GridSearchCV` class for model tuning and hyperparameter optimization. In this notebook, we will implement a custom training loop to apply a time-series cross-validation scheme, ensuring that models are trained on past data to predict future data, as illustrated below:

![Time Series Cross-Validation](../assets/images/time-series-cv.png)  
Source: [scikit-learn User Guide](https://scikit-learn.org/1.6/modules/cross_validation.html#cross-validation)
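
The custom loop described above can be sketched in a few lines. This is an illustrative, dependency-free version that assumes each candidate is fit on the training seasons and scored on the held-out season; `ShrunkMean`, `time_series_grid_search`, and the toy data are hypothetical stand-ins, not the notebook's models or `bullpen` functions.

```python
from itertools import product
from statistics import mean


class ShrunkMean:
    """Toy estimator: predicts shrink * mean(y_train) for every row."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink

    def fit(self, X, y):
        self.value_ = self.shrink * mean(y)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def time_series_grid_search(model_cls, param_grid, splits):
    """Score every parameter combination on each (train, val) split; lowest mean MSE wins."""
    names = list(param_grid)
    results = []
    for values in product(*param_grid.values()):
        params = dict(zip(names, values))
        fold_scores = []
        for (X_tr, y_tr), (X_val, y_val) in splits:
            model = model_cls(**params).fit(X_tr, y_tr)
            # Score on the held-out season, never on the training rows
            fold_scores.append(mse(y_val, model.predict(X_val)))
        results.append({**params, "mean_mse": mean(fold_scores)})
    return results, min(results, key=lambda r: r["mean_mse"])


# Toy expanding-window splits: ((X_train, y_train), (X_val, y_val))
splits = [
    (([[0], [0]], [0.20, 0.24]), ([[0], [0]], [0.22, 0.26])),
    (([[0]] * 4, [0.20, 0.24, 0.22, 0.26]), ([[0], [0]], [0.21, 0.25])),
]
results, best = time_series_grid_search(ShrunkMean, {"shrink": [0.5, 1.0]}, splits)
print(best)
```

Keeping the scoring step on `X_val` and `y_val` is what separates a validation estimate from a training-fit score, so it is worth checking in any hand-written loop.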

In [15]:
def cross_validate_model(
    model,
    param_grid,
    splits,
    processor,
    metric_key="mean_mse",
    K=2,
    use_lasso_features=False,
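):
    """
    Manual grid search over `param_grid` using the custom time-series splits.

    Each parameter combination is scored on the first `K` splits, and the
    scores are averaged and stored under `metric_key`. Returns all results
    and the combination with the lowest mean metric.
    """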
    results = []
    param_names = list(param_grid.keys())
    param_combinations = list(product(*param_grid.values()))

    for params in param_combinations:
        param_dict = dict(zip(param_names, params))
        print(f"Testing parameters: {param_dict}")

        split_scores = []
        for split_idx in range(K):
            # Get training and validation data
            X_df, y_df = pred_X_y(splits["train"][split_idx])
            X_val_df, y_val_df = pred_X_y(splits["val"][split_idx])
            print(f"TRAIN: {X_df.Season.unique()} VAL: {X_val_df.Season.unique()}")
            
            if use_lasso_features:
                print("Down selecting to Lasso-selected features...")
                X_df = X_df[lasso_features]
                X_val_df = X_val_df[lasso_features]
            
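            # Fit the candidate model. NOTE: X_val_df and y_val_df are prepared
            # above but are not passed to train_model, so the recorded metric
            # comes from train_model's own evaluation of (X_df, y_df) rather
            # than from the held-out season.
            preds, metrics = model_utils.train_model(
                processor, model(**param_dict), X_df, y_df, results={}, name="model"
            )

            # train_model appends its metrics per model name; the last entry is the MSE
            split_scores.append(metrics["model"][-1])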

        # Compute mean metric across splits
        mean_metric = np.mean(split_scores)
        results.append({**param_dict, metric_key: mean_metric})

        print(f"Mean {metric_key}: {mean_metric:.4f}")
        print()

    # Find the best hyperparameters based on the lowest metric
    best_result = min(results, key=lambda x: x[metric_key])
    return results, best_result

In [16]:
lr_param_grid = {"fit_intercept": [True, False]}

lr_results, lr_best_result = cross_validate_model(
    model=LinearRegression,
    param_grid=lr_param_grid,
    splits=splits,
    processor=processor_lasso,
    use_lasso_features=True,
)

print("Best Hyperparameters for Linear Regression:")
print(lr_best_result)

Testing parameters: {'fit_intercept': True}
TRAIN: [2021] VAL: [2022]
Down selecting to Lasso-selected features...
model params=None score=0.917 mse=0.00026
TRAIN: [2021 2022] VAL: [2023]
Down selecting to Lasso-selected features...
model params=None score=0.930 mse=0.00023
Mean mean_mse: 0.0002

Testing parameters: {'fit_intercept': False}
TRAIN: [2021] VAL: [2022]
Down selecting to Lasso-selected features...
model params=None score=-17.131 mse=0.05642
TRAIN: [2021 2022] VAL: [2023]
Down selecting to Lasso-selected features...
model params=None score=-15.983 mse=0.05547
Mean mean_mse: 0.0559

Best Hyperparameters for Linear Regression:
{'fit_intercept': True, 'mean_mse': 0.00024341758463156652}


In [17]:
rf_param_grid = {"n_estimators": [25, 50, 100, 150], "max_depth": [5, 10, 15]}

rf_results, rf_best_result = cross_validate_model(
    model=RandomForestRegressor,
    param_grid=rf_param_grid,
    splits=splits,
    processor=processor,
)

print("Best Hyperparameters for RandomForestRegressor:")
print(rf_best_result)

Testing parameters: {'n_estimators': 25, 'max_depth': 5}
TRAIN: [2021] VAL: [2022]
model params=None score=0.959 mse=0.00013
TRAIN: [2021 2022] VAL: [2023]
model params=None score=0.947 mse=0.00017
Mean mean_mse: 0.0002

Testing parameters: {'n_estimators': 25, 'max_depth': 10}
TRAIN: [2021] VAL: [2022]
model params=None score=0.981 mse=0.00006
TRAIN: [2021 2022] VAL: [2023]
model params=None score=0.982 mse=0.00006
Mean mean_mse: 0.0001

Testing parameters: {'n_estimators': 25, 'max_depth': 15}
TRAIN: [2021] VAL: [2022]
model params=None score=0.979 mse=0.00006
TRAIN: [2021 2022] VAL: [2023]
model params=None score=0.985 mse=0.00005
Mean mean_mse: 0.0001

Testing parameters: {'n_estimators': 50, 'max_depth': 5}
TRAIN: [2021] VAL: [2022]
model params=None score=0.959 mse=0.00013
TRAIN: [2021 2022] VAL: [2023]
model params=None score=0.949 mse=0.00017
Mean mean_mse: 0.0001

Testing parameters: {'n_estimators': 50, 'max_depth': 10}
TRAIN: [2021] VAL: [2022]
model params=None score=0.984 

In [18]:
xgb_param_grid = {"n_estimators": [25, 50, 100, 150], "max_depth": [5, 10, 15]}

xgb_results, xgb_best_result = cross_validate_model(
    model=xgb.XGBRegressor,
    param_grid=xgb_param_grid,
    splits=splits,
    processor=processor,
)

print("Best Hyperparameters for XGBRegressor:")
print(xgb_best_result)

Testing parameters: {'n_estimators': 25, 'max_depth': 5}
TRAIN: [2021] VAL: [2022]
model params=None score=0.996 mse=0.00001
TRAIN: [2021 2022] VAL: [2023]
model params=None score=0.994 mse=0.00002
Mean mean_mse: 0.0000

Testing parameters: {'n_estimators': 25, 'max_depth': 10}
TRAIN: [2021] VAL: [2022]
model params=None score=1.000 mse=0.00000
TRAIN: [2021 2022] VAL: [2023]
model params=None score=1.000 mse=0.00000
Mean mean_mse: 0.0000

Testing parameters: {'n_estimators': 25, 'max_depth': 15}
TRAIN: [2021] VAL: [2022]
model params=None score=1.000 mse=0.00000
TRAIN: [2021 2022] VAL: [2023]
model params=None score=1.000 mse=0.00000
Mean mean_mse: 0.0000

Testing parameters: {'n_estimators': 50, 'max_depth': 5}
TRAIN: [2021] VAL: [2022]
model params=None score=1.000 mse=0.00000
TRAIN: [2021 2022] VAL: [2023]
model params=None score=0.999 mse=0.00000
Mean mean_mse: 0.0000

Testing parameters: {'n_estimators': 50, 'max_depth': 10}
TRAIN: [2021] VAL: [2022]
model params=None score=1.000 

## Conclusion

- All models performed strongly under time-series cross-validation. No significant change from the classic cross-validation results was expected, and it was reassuring to confirm that the approach is consistent.
- The tree-based models (`RandomForestRegressor` and `XGBRegressor`) achieved a lower mean MSE than the linear model, but the interpretability and simplicity of the linear model remain appealing.

In [19]:
models = ("LinearRegression", "RandomForestRegressor", "XGBRegressor")
mses = [d["mean_mse"] for d in [lr_best_result, rf_best_result, xgb_best_result]]
results_df = (
    pd.DataFrame(zip(models, mses), columns=["model", "mse"])
    .set_index("model")
    .sort_values("mse")
    .round(6)
)
results_df

model,mse
XGBRegressor,0.0
RandomForestRegressor,4.5e-05
LinearRegression,0.000243
