# M03. Predict PAs
- This predicts the outcome of plate appearances
- Type: Model
- Run Frequency: Irregular
- Sources:
    - MLB API
    - Steamer
- Created: 4/19/2024
- Updated: 11/4/2025

Consider: 
- imputed starter, imputed reliever, unimputed starter, unimputed reliever variables
- Using batter woba and pitcher woba to determine quantiles, not projected
- imp_wfx

### Imports

In [1]:
%run "U01. Imports.ipynb"
%run "U02. Functions.ipynb"
%run "U03. Classes.ipynb"
%run "U04. Datasets.ipynb"
%run "U05. Models.ipynb"

In [2]:
# Set option to display numbers without scientific notation
pd.set_option('display.float_format', '{:.6f}'.format)

##### Test Device

In [5]:
def test_cuda():
    if torch.cuda.is_available():
        print("CUDA is available!")
        
    else:
        print("CUDA is NOT available. Check your GPU and drivers.")

if __name__ == "__main__":
    test_cuda()


CUDA is available!


### Data

##### Park x Weather Factors

In [None]:
multiplier_df = pd.read_csv(os.path.join(baseball_path, "Park and Weather Factors.csv"))

Choose WFX
- _unadj: predicted based on weather / predicted based on batted ball <br>
- _adj: average of actual rates in similarly predicted games / predicted based on batted ball

In [None]:
wfx_type = 'adj'
for event in events_list:
    multiplier_df[f'{event}_wfx_l'] = multiplier_df[f'{event}_wfx_{wfx_type}_l'].copy()
    multiplier_df[f'{event}_wfx_r'] = multiplier_df[f'{event}_wfx_{wfx_type}_r'].copy()

In [None]:
multiplier_df['date'].min()

##### Plate Appearances

In [None]:
complete_dataset = pd.read_csv(os.path.join(baseball_path, "Final Dataset.csv"))

##### Steamer

In [None]:
steamer_hitters_df = pd.read_csv(os.path.join(baseball_path, "A03. Steamer", "steamer_hitters_weekly_log.csv"), encoding='iso-8859-1')

In [None]:
steamer_pitchers_df = pd.read_csv(os.path.join(baseball_path, "A03. Steamer", "steamer_pitchers_weekly_log.csv"), encoding='iso-8859-1')

### Clean

##### MLB Stats API

Remove rows with infinite values

In [None]:
complete_dataset = complete_dataset[~complete_dataset[batter_inputs].isin([np.inf, -np.inf]).any(axis=1)]
complete_dataset = complete_dataset[~complete_dataset[pitcher_inputs].isin([np.inf, -np.inf]).any(axis=1)]

Scale

In [None]:
%%time
complete_dataset[batter_inputs] = scale_batter_stats.transform(complete_dataset[batter_inputs])
complete_dataset[pitcher_inputs] = scale_pitcher_stats.transform(complete_dataset[pitcher_inputs])

Set data types

In [None]:
complete_dataset['date_time'] = pd.to_datetime(complete_dataset['date'], format='%Y%m%d')
complete_dataset['date_time_copy'] = complete_dataset['date_time'].copy()

complete_dataset['batter'] = complete_dataset['batter'].astype(int).astype(str)
complete_dataset['pitcher'] = complete_dataset['pitcher'].astype(int).astype(str)

Sort to prep for merge

In [None]:
complete_dataset.sort_values('date_time', inplace=True)

##### Steamer

Clean

In [None]:
steamer_hitters_df2 = clean_steamer_hitters(steamer_hitters_df).dropna(subset=batter_stats_fg)
steamer_pitchers_df2 = clean_steamer_pitchers(steamer_pitchers_df).dropna(subset=pitcher_stats_fg)

Scale

In [None]:
steamer_hitters_df2[batter_stats_fg] = scale_batter_stats_steamer.transform(steamer_hitters_df2[batter_stats_fg])
steamer_pitchers_df2[pitcher_stats_fg] = scale_pitcher_stats_steamer.transform(steamer_pitchers_df2[pitcher_stats_fg])

Remove pitchers with a missing MLBAM ID (occurs occasionally in 2014)

In [None]:
steamer_pitchers_df2 = steamer_pitchers_df2[~steamer_pitchers_df2['mlbamid'].isna()].reset_index(drop=True)

Set data types

In [None]:
steamer_hitters_df2['date_time'] = pd.to_datetime(steamer_hitters_df2['date'], format='%Y%m%d')
steamer_pitchers_df2['date_time'] = pd.to_datetime(steamer_pitchers_df2['date'], format='%Y%m%d')

steamer_hitters_df2['mlbamid'] = steamer_hitters_df2['mlbamid'].astype(int).astype(str)
steamer_pitchers_df2['mlbamid'] = steamer_pitchers_df2['mlbamid'].astype(int).astype(str)

Rename for compatibility with MLB Stats API data

In [None]:
steamer_hitters_df2.rename(columns={'mlbamid': 'batter'}, inplace=True)
steamer_pitchers_df2.rename(columns={'mlbamid': 'pitcher'}, inplace=True)

Drop unnecessary columns

In [None]:
steamer_hitters_df2.drop(columns=['date', 'firstname', 'lastname', 'steamerid'], inplace=True)
steamer_pitchers_df2.drop(columns=['date', 'firstname', 'lastname', 'steamerid'], inplace=True)

Sort to prep for merge

In [None]:
steamer_hitters_df2.sort_values('date_time', inplace=True)
steamer_pitchers_df2.sort_values('date_time', inplace=True)

### Merge

##### Merge #1. Plate Appearances and Steamer Batters

In [None]:
complete_dataset = pd.merge_asof(
    complete_dataset,
    steamer_hitters_df2,
    on='date_time',
    by='batter',
    direction='backward'
)
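
The backward as-of merge attaches, to each plate appearance, the most recent Steamer row dated on or before that plate appearance for the same player. A minimal sketch on toy data (the dates, IDs, and the `proj_woba` column are hypothetical and not taken from the Steamer files):

```python
import pandas as pd

# Hypothetical plate appearances, already sorted by date_time
pas = pd.DataFrame({
    "date_time": pd.to_datetime(["2024-04-02", "2024-04-03", "2024-04-10"]),
    "batter": ["2", "1", "1"],
})

# Hypothetical weekly projections, already sorted by date_time
proj = pd.DataFrame({
    "date_time": pd.to_datetime(["2024-04-01", "2024-04-08", "2024-04-15"]),
    "batter": ["1", "1", "2"],
    "proj_woba": [0.310, 0.320, 0.330],
})

# direction='backward' only looks at projections dated on or before the PA
merged = pd.merge_asof(pas, proj, on="date_time", by="batter", direction="backward")
print(merged)
```

Batter "2" has no projection dated on or before the plate appearance, so that row is left as NaN. This is why missing Steamer stats still need handling after the merge.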

##### Merge #2. Add Steamer Pitchers 

In [None]:
complete_dataset = pd.merge_asof(
    complete_dataset,
    steamer_pitchers_df2,
    on='date_time',
    by='pitcher',
    direction='backward'  
)

##### Merge #3. Add WFX

In [None]:
complete_dataset = pd.merge(complete_dataset, multiplier_df, on=['gamePk', 'date', 'venue_id'], how='left')

##### Free up memory

In [None]:
del steamer_hitters_df, steamer_hitters_df2, steamer_pitchers_df, steamer_pitchers_df2, multiplier_df

### Impute

For players with insufficient sample sizes, stats are imputed

##### Option 1: Steamer

In [None]:
# # First, remove from dataset if ever missing FG/Steamer stats
# complete_dataset = complete_dataset[~complete_dataset['b1_rate'].isna()]
# complete_dataset = complete_dataset[~complete_dataset['H9'].isna()]

# # Add hands to use in imputation
# batter_stats_fg_imp = batter_stats_fg + ['b_L', 'p_L', 'imp_b']
# pitcher_stats_fg_imp = pitcher_stats_fg + ['b_L', 'p_L', 'imp_p']

# ### Batters
# # Use Steamer stats to predict API/Statcast stats for those with limited samples
# batter_predictions = impute_batter_stats.predict(complete_dataset.loc[complete_dataset['pa_b'] < 40, batter_stats_fg_imp])

# # Impute inputs with limited sample size with predicted values
# complete_dataset.loc[complete_dataset['pa_b'] < 40, batter_inputs] = batter_predictions

# ### Pitchers
# # Use Steamer stats to predict API/Statcast stats for those with limited samples
# pitcher_predictions = impute_pitcher_stats.predict(complete_dataset.loc[complete_dataset['pa_p'] < 40, pitcher_stats_fg_imp])

# # Impute inputs with limited sample size with predicted values
# complete_dataset.loc[complete_dataset['pa_p'] < 40, pitcher_inputs] = pitcher_predictions

##### Option 2: Middle

In [None]:
# # First, remove from dataset if ever missing FG/Steamer stats
# complete_dataset = complete_dataset[~complete_dataset['b1_rate'].isna()]
# complete_dataset = complete_dataset[~complete_dataset['H9'].isna()]

# # Instead of imputing, just weighting with 0s
# complete_dataset[batter_inputs].fillna(0.0, inplace=True)
# complete_dataset[pitcher_inputs].fillna(0.0, inplace=True)

# # Calculate the weighted average for each column in batter_inputs
# # Could be simplified, but I wanted to show the steps
# # Weighted average of the provided value and 0, weighted by PAs and (50 - PAs)
# for col in batter_inputs:
#     complete_dataset[col] = (complete_dataset[col] * complete_dataset['pa_b'] + 0.0 * (50-complete_dataset['pa_b']))/50

# # Calculate the weighted average for each column in pitcher_stats
# for col in pitcher_inputs:
#     complete_dataset[col] = (complete_dataset[col] * complete_dataset['pa_p'] + 0.0 * (50-complete_dataset['pa_p']))/50

##### Option 3: 0s and 1s

Assume 0s for player stats where sample is insufficient or missing

In [None]:
complete_dataset.loc[complete_dataset['pa_b'] < 40, batter_inputs] = 0
complete_dataset.loc[complete_dataset['pa_p'] < 40, pitcher_inputs] = 0

complete_dataset[batter_stats_fg] = complete_dataset[batter_stats_fg].fillna(0)
complete_dataset[pitcher_stats_fg] = complete_dataset[pitcher_stats_fg].fillna(0)

Assume 1 for WFX where WFX are missing

In [None]:
complete_dataset['imp_wfx'] = (complete_dataset['hr_wfx_l'].isna() | complete_dataset['hr_wfx_r'].isna()).astype(int)

In [None]:
complete_dataset[[f'{event}_wfx_l' for event in events_list]] = complete_dataset[[f'{event}_wfx_l' for event in events_list]].fillna(1)
complete_dataset[[f'{event}_wfx_r' for event in events_list]] = complete_dataset[[f'{event}_wfx_r' for event in events_list]].fillna(1)
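
The order matters here: the missing-value flag has to be computed before the fill, otherwise every row looks complete. A small sketch with made-up multiplier values:

```python
import numpy as np
import pandas as pd

# Hypothetical multipliers; NaN means no park/weather factor was available
df = pd.DataFrame({
    "hr_wfx_l": [1.08, np.nan, 0.95],
    "hr_wfx_r": [1.02, np.nan, np.nan],
})

# Flag first, while the NaNs are still visible
df["imp_wfx"] = (df["hr_wfx_l"].isna() | df["hr_wfx_r"].isna()).astype(int)

# Then fill with the neutral multiplier of 1
cols = ["hr_wfx_l", "hr_wfx_r"]
df[cols] = df[cols].fillna(1)
print(df)
```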

### Sample

Drop early observations

In [None]:
complete_dataset = complete_dataset[(complete_dataset['game_date'] > '2018-01-01') & (complete_dataset['game_date'] < '2025-01-01')]

Drop atypical events

In [None]:
complete_dataset = complete_dataset.query('eventsModel != "Cut"')

Drop observations from inactive parks

In [None]:
active_parks = list(team_map['VENUE_ID'].astype(int))
complete_dataset = complete_dataset[complete_dataset['venue_id'].astype(int).isin(active_parks)]

### Shift

Many batter and pitcher stats are calculated at the end of the plate appearance. For prediction purposes, we need these stats coming into the plate appearance.

##### Batter Inputs

Sort

In [None]:
complete_dataset.sort_values(['date', 'gamePk', 'atBatIndex'], ascending=True, inplace=True)

Shift

In [None]:
complete_dataset[batter_inputs + ['ab_b', 'pa_b', 'imp_b']] = complete_dataset.groupby(['batter', 'pitchHand'])[batter_inputs + ['ab_b', 'pa_b', 'imp_b']].shift(1)
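
To illustrate the shift, here is a toy example with a single cumulative column. The `pa_b_pre` column is a hypothetical name used only to show the before and after values side by side; the notebook overwrites the columns in place.

```python
import pandas as pd

# Rows are in chronological order; pa_b is the count recorded AFTER each PA
df = pd.DataFrame({
    "batter": ["a", "a", "b", "a"],
    "pa_b": [1, 2, 1, 3],
})

# Shifting within each batter gives the value coming INTO the plate appearance
df["pa_b_pre"] = df.groupby("batter")["pa_b"].shift(1)
print(df)
```

Each player's first row becomes NaN because nothing precedes it, so shifted columns need a fill or imputation step afterwards.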

##### Pitcher Inputs

Sort

In [None]:
complete_dataset.sort_values(['date', 'gamePk', 'atBatIndex'], ascending=True, inplace=True)

Shift

In [None]:
complete_dataset[pitcher_inputs + ['ab_p', 'pa_p', 'imp_p']] = complete_dataset.groupby(['pitcher', 'batSide'])[pitcher_inputs + ['ab_p', 'pa_p', 'imp_p']].shift(1)

##### Inning Sums

Sort

In [None]:
complete_dataset.sort_values(['date', 'gamePk', 'atBatIndex'], ascending=True, inplace=True)

Shift

In [None]:
cumulative_inning_input_list = [col for col in complete_dataset.columns if col.endswith("_inning")]

complete_dataset[cumulative_inning_input_list] = complete_dataset.groupby(['gamePk', 'inning', 'pitcher'])[cumulative_inning_input_list].shift(1)
complete_dataset[cumulative_inning_input_list] = complete_dataset[cumulative_inning_input_list].fillna(0)

##### Game Sums

Sort

In [None]:
complete_dataset.sort_values(['date', 'gamePk', 'atBatIndex'], ascending=True, inplace=True)

Shift

In [None]:
cumulative_game_input_list = [col for col in complete_dataset.columns if col.endswith("_game")]
cumulative_game_input_list.remove('rbi_game')

complete_dataset[cumulative_game_input_list + ['times_faced']] = complete_dataset.groupby(['gamePk', 'pitcher'])[cumulative_game_input_list + ['times_faced']].shift(1)
complete_dataset[cumulative_game_input_list + ['times_faced']] = complete_dataset[cumulative_game_input_list + ['times_faced']].fillna(0)

### Train/Test Split

Split

In [None]:
np.random.seed(1)
complete_dataset['split'] = np.random.choice([0, 0, 1], size=len(complete_dataset))

Create masks to identify training and testing datasets

Note: to train on the entire dataset, set split = 0 for every row

In [None]:
training_mask = (complete_dataset['split'] == 0)
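
Drawing from `[0, 0, 1]` assigns roughly two thirds of rows to training (split 0) and one third to testing (split 1). A quick check on a hypothetical row count:

```python
import numpy as np

np.random.seed(1)
n_rows = 30000  # hypothetical size, chosen only for illustration

# Each row independently draws one of the three list entries with equal probability
split = np.random.choice([0, 0, 1], size=n_rows)
train_share = (split == 0).mean()
print(f"Training share: {train_share:.3f}")
```

Because rows are assigned independently, plate appearances from the same game can land on both sides of the split.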

### Evaluations

##### Constructed Stats

This builds stats used for evaluating model performance (actual event rates, FP, wOBA, outs)

In [None]:
def constructed_stats(complete_dataset):
    # Actual Stats
    for event in events_list:
        complete_dataset[f'{event}_act'] = (complete_dataset['eventsModel'] == event).astype(int)

    # FP - Pitchers
    pitcher_weights = {'fo': 1.0460, 'go': 1.0460, 'po': 1.0460, 'lo': 1.0460, 'so': 3.0408, 'bb': -1.3508, 'b1': -1.7427, 'b2': -1.7427, 'b3': -1.7427, 'hr': -3.6639}
    for suffix in ['act', 'pred']:
        complete_dataset.loc[~training_mask, f'FP_P_{suffix}'] = sum(
            complete_dataset.loc[~training_mask, f'{col}_{suffix}'] * w
            for col, w in pitcher_weights.items()
        )
    
    # FP - Batters
    batter_weights = {'b1': 4.3665, 'b2': 6.8271, 'b3': 10.8503, 'hr': 15.2611, 'bb':  2.8725, 'hbp': 2.9639}
    for suffix in ['act', 'pred']:
        complete_dataset.loc[~training_mask, f'FP_B_{suffix}'] = sum(
            complete_dataset.loc[~training_mask, f'{col}_{suffix}'] * w
            for col, w in batter_weights.items()
        )

    # wOBA (roughly)
    woba_weights = {'b1': 0.882, 'b2': 1.254, 'b3': 1.590, 'hr': 2.050, 'bb': 0.689, 'hbp': 0.720}
    for suffix in ['act', 'pred']:
        complete_dataset.loc[~training_mask, f'wOBA_{suffix}'] = sum(
            complete_dataset.loc[~training_mask, f'{col}_{suffix}'] * w
            for col, w in woba_weights.items()
        )
    
    # Out
    complete_dataset['is_out_act'] = complete_dataset['is_out'].copy()
    complete_dataset.loc[~training_mask, 'is_out_pred'] = complete_dataset.loc[~training_mask, ['fo_pred','go_pred','po_pred','lo_pred','so_pred']].sum(axis=1)
    

    return complete_dataset
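
As a worked illustration of the weighted sums in `constructed_stats`, the sketch below applies the same wOBA weights to one row of predicted probabilities. The probabilities are hypothetical.

```python
# wOBA weights copied from constructed_stats
woba_weights = {'b1': 0.882, 'b2': 1.254, 'b3': 1.590, 'hr': 2.050, 'bb': 0.689, 'hbp': 0.720}

# Hypothetical predicted probabilities for a single plate appearance
# (the remaining probability mass belongs to the out events, which carry no wOBA weight)
pred = {'b1': 0.15, 'b2': 0.05, 'b3': 0.005, 'hr': 0.03, 'bb': 0.08, 'hbp': 0.01}

# Expected wOBA is the probability-weighted sum of event values
woba_pred = sum(pred[event] * w for event, w in woba_weights.items())
print(round(woba_pred, 5))
```

The `_act` columns use the same formula with 0/1 indicators in place of probabilities, so averaging `wOBA_act` and `wOBA_pred` over many rows puts both on the same scale.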

##### Summary Statistics

In [None]:
def summary_statistics(complete_dataset, year, parameters, filename, le):
    """
    Print evaluation figures for the held-out rows and return a dataframe of aggregate stats.
    Also stores per-output quantile dataframes (b1_year_df, hr_year_df, etc.) as globals.
    """
    import pandas as pd
    import numpy as np

    # Outputs from the label encoder + additional continuous targets
    output_vars = list(le.classes_) + ['is_out', 'wOBA', 'FP_B', 'FP_P']

    quantiles = 10  # used throughout

    # ==============================
    #  Figure 1 – Pitchers: starter/imputation
    # ==============================
    print("\nFigure 1: Pitchers by Starter and Imputation Status")
    print(
        complete_dataset[~training_mask]
        .query(f'year == {year}')
        .groupby(['imp_p', 'starter'])[
            ['FP_P_pred', 'FP_P_act', 'wOBA_act', 'so_act']
        ].mean()
    )

    # ==============================
    #  Figure 2 – Pitchers: imputation only
    # ==============================
    print("\nFigure 2: Pitchers by Imputation Status")
    print(
        complete_dataset[~training_mask]
        .query(f'year == {year}')
        .groupby(['imp_p'])[
            ['FP_P_pred', 'FP_P_act', 'wOBA_act', 'so_act']
        ].mean()
    )

    # ==============================
    #  Figure 3 – Batters by Imputation Status
    # ==============================
    print("\nFigure 3: Batters by Imputation Status")
    print(
        complete_dataset[~training_mask]
        .query(f'year == {year}')
        .groupby(['imp_b'])[
            ['FP_B_pred', 'FP_B_act', 'wOBA_act', 'hr_act']
        ].mean()
    )

    # ==============================
    #  Figure 4 – FP by Venue
    # ==============================
    print("\nFigure 4: FP by Venue")
    venue_cols = ['FP_B_pred', 'FP_B_act', 'FP_P_pred', 'FP_P_act']
    means = (
        complete_dataset[~training_mask]
        .query(f'year == {year}')
        .groupby('venue_id')[venue_cols]
        .mean()
    )
    print(means)
    print(f"FP_B MSE: {np.mean((means['FP_B_pred'] - means['FP_B_act'])**2):.4f}")
    print(f"FP_P MSE: {np.mean((means['FP_P_pred'] - means['FP_P_act'])**2):.4f}")

    # ==============================
    #  Figure 5 – HR by WFX quantile
    # ==============================
    print("\nFigure 5: HRs by Quantile")
    complete_dataset['hr_wfx_quantile'] = (
        pd.qcut(
            complete_dataset['hr_wfx'],
            q=quantiles,
            duplicates='drop',
            labels=False,
        ) + 1
    )
    print(
        complete_dataset[~training_mask]
        .groupby('hr_wfx_quantile')[['hr_pred', 'hr_act']]
        .mean()
    )

    # ==============================
    #  Quantile performance tables + all_stat_df
    # ==============================
    all_stat_list = []

    for var in output_vars:
        pred_col = f"{var}_pred"
        act_col = f"{var}_act"
        q_col = f"{var}_quantile"

        # Assign quantile column
        complete_dataset.loc[~training_mask, q_col] = pd.qcut(
            complete_dataset.loc[~training_mask, pred_col],
            quantiles,
            labels=False,
            duplicates='drop'
        )

        # ---- ALL years quantile table ----
        df_all = (
            complete_dataset[~training_mask]
            .groupby(q_col)[[act_col, pred_col]]
            .mean()
            .reset_index()
        )
        mse_all = ((df_all[act_col] - df_all[pred_col]) ** 2).mean()

        # ---- Specific year quantile table ----
        df_year = (
            complete_dataset.query(f'year == {year}')
            .loc[~training_mask]
            .groupby(q_col)[[act_col, pred_col]]
            .mean()
            .reset_index()
        )
        mse_year = ((df_year[act_col] - df_year[pred_col]) ** 2).mean()

        # Aggregate stats (ALL)
        actual_all = complete_dataset.loc[~training_mask, act_col].mean()
        predicted_all = complete_dataset.loc[~training_mask, pred_col].mean()
        mult_all = actual_all / predicted_all
        stdev_all = complete_dataset.loc[~training_mask, pred_col].std()

        all_stat_list.append([
            "All", var, actual_all, predicted_all, mult_all,
            stdev_all, mse_all, filename, str(parameters['hidden_layer_sizes'])
        ])

        # Aggregate stats (YEAR)
        actual_year = (
            complete_dataset.query(f'year == {year}')
            .loc[~training_mask, act_col]
            .mean()
        )
        predicted_year = (
            complete_dataset.query(f'year == {year}')
            .loc[~training_mask, pred_col]
            .mean()
        )
        mult_year = actual_year / predicted_year
        stdev_year = (
            complete_dataset.query(f'year == {year}')
            .loc[~training_mask, pred_col]
            .std()
        )

        all_stat_list.append([
            year, var, actual_year, predicted_year, mult_year,
            stdev_year, mse_year, filename, str(parameters['hidden_layer_sizes'])
        ])

        # ==============================
        #  Expose per-variable quantile dataframes as globals (e.g. b1_year_df)
        # ==============================
        varname = f"{var}_year_df"
        globals()[varname] = df_year   # read by graph_by_quantile when graph == '_year'


    # ==============================
    #  Build and return all_stat_df
    # ==============================
    all_stat_df = pd.DataFrame(
        all_stat_list,
        columns=['Year', 'Output', 'Actual', 'Predicted', 'Multiplier',
                 'Std. Dev', 'MSE', 'File', 'Layers']
    )

    print(all_stat_df[['Year','Output','Actual','Predicted','Multiplier','Std. Dev','MSE']])

    return all_stat_df
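
The quantile tables are a calibration check: rows are binned by predicted probability, and the mean prediction in each bin is compared with the mean outcome. A self-contained sketch on simulated data (the `hr_*` columns and the simulated rates are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulate a well-calibrated model: outcomes occur at the predicted rate
pred = rng.uniform(0.05, 0.35, size=20000)
act = (rng.uniform(size=20000) < pred).astype(int)
df = pd.DataFrame({"hr_pred": pred, "hr_act": act})

# Bin rows into deciles of the prediction
df["hr_quantile"] = pd.qcut(df["hr_pred"], 10, labels=False, duplicates="drop")

# Mean actual vs. mean predicted per bin, then MSE across bins
table = df.groupby("hr_quantile")[["hr_act", "hr_pred"]].mean().reset_index()
mse = ((table["hr_act"] - table["hr_pred"]) ** 2).mean()
print(table)
print(f"Quantile MSE: {mse:.6f}")
```

A low quantile MSE alone does not show that the model separates plate appearances well, which is presumably why the standard deviation of the predictions is tracked alongside it.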


##### Plots

In [None]:
import matplotlib.pyplot as plt

def graph_by_quantile(graph, le):
    """
    Plot predicted vs actual values by quantile for outputs.
    graph: a string suffix used in globals() variable names (e.g., '_year')
    le: LabelEncoder with class names
    """
    rows, columns = 5, 3
    fig, axs = plt.subplots(rows, columns, figsize=(columns*4, rows*4))

    total_plots = rows * columns
    output_vars = list(le.classes_) + ['is_out','wOBA','FP_B','FP_P']
    output_vars = output_vars[:total_plots]

    for i, var in enumerate(output_vars):
        row = i // columns
        col = i % columns
        df_name = f"{var}{graph}_df"
        if df_name not in globals():
            print(f"Warning: dataframe {df_name} not found, skipping")
            continue
        df = globals()[df_name]
        axs[row, col].plot(df[f'{var}_quantile'], df[f'{var}_pred'], color='red', label='Predicted')
        axs[row, col].plot(df[f'{var}_quantile'], df[f'{var}_act'], color='black', label='Actual')
        axs[row, col].set_title(var)
        axs[row, col].legend()

    fig.tight_layout(pad=2.0)
    plt.show()


### Model A. All - Unadjusted

##### Inputs

Batter Inputs

In [None]:
batter_input_list = batter_inputs

Remove directional proclivities

In [None]:
batter_input_list = [stat for stat in batter_input_list if "to_" not in stat]

Pitcher Inputs

In [None]:
pitcher_input_list = pitcher_inputs

Remove directional proclivities

In [None]:
pitcher_input_list = [stat for stat in pitcher_input_list if "to_" not in stat]

Hand Inputs

In [None]:
hand_input_list = ['p_L', 'b_L']

Imputation Inputs

In [None]:
imp_input_list = ['imp_b', 'imp_p']

Starter Input(s)

In [None]:
starter_input_list = ['starter']

Cumulative Inning Inputs

In [None]:
cumulative_inning_input_list = [col for col in complete_dataset.columns if col.endswith("_inning")]

In [None]:
cumulative_inning_input_list.remove('rbi_inning')

Cumulative Game Inputs

In [None]:
cumulative_game_input_list = [col for col in complete_dataset.columns if col.endswith("_game")]

In [None]:
cumulative_game_input_list.remove('rbi_game')

Game State Inputs

In [None]:
complete_dataset['winning'] = (complete_dataset['preBatterScore'] > complete_dataset['prePitcherScore']).astype(int)
complete_dataset['winning_big'] = (complete_dataset['preBatterScore'] > complete_dataset['prePitcherScore'] + 3).astype(int)

In [None]:
game_state_input_list = ['onFirst', 'onSecond', 'onThird', 'top', 'score_diff', 'prePitcherScore', 'preBatterScore', 'winning', 'winning_big', 'times_faced']

Inning Inputs

In [None]:
for inning in range(1, 12):
    complete_dataset[f'inning_{inning}'] = (complete_dataset['inning'] == inning).astype(int)
complete_dataset['inning_11'] = (complete_dataset['inning'] >= 11).astype(int)

In [None]:
inning_input_list = [col for col in complete_dataset.columns if col.startswith("inning_")]

Out Inputs

In [None]:
for out in range(0, 3):
    complete_dataset[f'outs_{out}'] = (complete_dataset['outs_pre'] == out).astype(int)

In [None]:
out_input_list = ['outs_0', 'outs_1', 'outs_2']

Venue Inputs

Note: venue inputs are no longer preferred now that park effects are integrated into WFX

In [None]:
complete_dataset['venue_id2'] = complete_dataset['venue_id'].copy()
complete_dataset = pd.get_dummies(complete_dataset, columns=['venue_id2'], prefix='venue')

In [None]:
venue_input_list = [col for col in complete_dataset.columns if col.startswith("venue_") and col != "venue_id" and col != "venue_name"]
venue_input_list = list(dict.fromkeys(venue_input_list))

Assign batSide-specific Weather Multipliers

In [None]:
for event in events_list:
    complete_dataset[f'{event}_wfx'] = np.where(complete_dataset['batSide'] == "L", complete_dataset[f'{event}_wfx_l'], 
                                                                                    complete_dataset[f'{event}_wfx_r'])

In [None]:
multiplier_input_list = [f'{event}_wfx' for event in events_list]

Imputation and starter interactions

In [None]:
complete_dataset['imputed_starter'] = complete_dataset['imp_p'] * complete_dataset['starter']
complete_dataset['imputed_reliever'] = complete_dataset['imp_p'] * (complete_dataset['starter'] == 0).astype(int)
complete_dataset['unimputed_starter'] = (complete_dataset['imp_p'] == 0).astype(int) * complete_dataset['starter']
complete_dataset['unimputed_reliever'] = (complete_dataset['imp_p'] == 0).astype(int) * (complete_dataset['starter'] == 0).astype(int)

In [None]:
imp_starter_input_list = ['imputed_starter', 'imputed_reliever', 'unimputed_starter', 'unimputed_reliever']

Model Inputs

In [None]:
model_a_input_list = (batter_input_list + pitcher_input_list + hand_input_list + imp_input_list + starter_input_list + 
                      cumulative_inning_input_list + cumulative_game_input_list + game_state_input_list + 
                      inning_input_list + out_input_list + imp_starter_input_list + batter_stats_fg + pitcher_stats_fg)

In [None]:
n1 = len(model_a_input_list) + 1

Fill in missings

In [None]:
complete_dataset[model_a_input_list] = complete_dataset[model_a_input_list].fillna(0)

Outputs

In [None]:
output_list = ['is_out', 'eventsModel']

Other variables

In [None]:
additional_list = ['pa_b', 'pa_p', 'year', 'date', 'gamePk', 'atBatIndex', 'venue_id', 'batterName', 'pitcherName', 'imp_wfx']

Variables to keep

In [None]:
keep_list = model_a_input_list + output_list + multiplier_input_list + additional_list 

##### Testing

Single model with adjusted multiplier inputs: subtract 1 so that a neutral multiplier (1.0) becomes 0.

In [None]:
complete_dataset.loc[:, multiplier_input_list] = (complete_dataset.loc[:, multiplier_input_list] - 1)

One model, all inputs

In [None]:
model_a_input_list = (batter_input_list + pitcher_input_list + hand_input_list + imp_input_list + starter_input_list + 
                      cumulative_inning_input_list + cumulative_game_input_list + game_state_input_list + 
                      inning_input_list + out_input_list + imp_starter_input_list + batter_stats_fg + pitcher_stats_fg + multiplier_input_list)

##### Memory

Remove unnecessary columns

In [None]:
complete_dataset = complete_dataset[keep_list]

Convert boolean columns to float

In [None]:
bool_cols = complete_dataset.select_dtypes(include="bool").columns
complete_dataset[bool_cols] = complete_dataset[bool_cols].astype(float)

##### Neural Network

Create a class that works like sklearn's neural network but wraps weights trained in PyTorch and predicts with NumPy

In [None]:
class PredictAll:
    def __init__(self, ensemble_numpy, input_columns, classes, metadata=None):
        """
        ensemble_numpy: list of models, each a list of [W1, b1, W2, b2, ..., Wn, bn]
        input_columns: list of feature names used during training (order matters!)
        classes: list of class labels (same order as in training)
        metadata: optional dict with additional info (hidden_layers, num_classifiers, etc.)
        """
        self.ensemble = ensemble_numpy
        self.input_columns = input_columns
        self.classes_ = classes
        self.metadata = metadata or {}

    @staticmethod
    def _softmax(x):
        e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return e_x / e_x.sum(axis=1, keepdims=True)

    @staticmethod
    def _forward(model_layers, x):
        """
        Forward pass for a single model.
        model_layers: [W1, b1, W2, b2, ..., Wn, bn]
        x: numpy array of shape [n_samples, n_features]
        """
        n_layers = len(model_layers) // 2
        h = x
        for i in range(n_layers - 1):
            W = model_layers[2*i]
            b = model_layers[2*i + 1]
            h = np.maximum(0, h @ W + b)  # ReLU
        # final layer
        W = model_layers[-2]
        b = model_layers[-1]
        logits = h @ W + b
        return PredictAll._softmax(logits)

    def predict_proba(self, X):
        """
        X: pandas DataFrame, Series, or NumPy array
        Returns: numpy array [n_samples, n_classes] with probabilities
        """
        # Convert DataFrame or Series to NumPy array
        if isinstance(X, pd.DataFrame):
            # Reorder columns to match training
            x_np = X[self.input_columns].to_numpy(dtype=np.float32)
        elif isinstance(X, pd.Series):
            # Single row
            x_np = X[self.input_columns].to_numpy(dtype=np.float32).reshape(1, -1)
        else:
            x_np = np.array(X, dtype=np.float32)
            if x_np.ndim == 1:
                x_np = x_np.reshape(1, -1)

        # Check input size
        expected_size = self.ensemble[0][0].shape[0]
        if x_np.shape[1] != expected_size:
            raise ValueError(
                f"Input feature size ({x_np.shape[1]}) does not match model first layer ({expected_size})"
            )

        # Run all models in ensemble
        probs_list = [self._forward(model, x_np) for model in self.ensemble]

        # Average probabilities
        avg_probs = np.mean(probs_list, axis=0)
        return avg_probs

    def predict(self, X):
        """
        Returns predicted class labels (argmax), like sklearn's predict()
        """
        probs = self.predict_proba(X)
        return np.array([self.classes_[i] for i in np.argmax(probs, axis=1)])
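
The prediction path needs nothing beyond NumPy: ReLU hidden layers, a softmax output, and an average over the ensemble. The standalone sketch below mirrors that logic with random weights. The layer sizes and the `random_model` helper are hypothetical, and it does not use the `PredictAll` class itself.

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability
    e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)

def forward(layers, x):
    # layers is [W1, b1, ..., Wn, bn] with each W shaped [n_in, n_out]
    h = x
    for k in range(len(layers) // 2 - 1):
        h = np.maximum(0, h @ layers[2 * k] + layers[2 * k + 1])  # ReLU
    return softmax(h @ layers[-2] + layers[-1])

rng = np.random.default_rng(0)

def random_model(n_in, n_hidden, n_out):
    # Hypothetical helper: one hidden layer with random weights and zero biases
    return [rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)]

ensemble = [random_model(4, 8, 3) for _ in range(3)]
x = rng.normal(size=(5, 4))

# Average the class probabilities across ensemble members
avg_probs = np.mean([forward(m, x) for m in ensemble], axis=0)
print(avg_probs.sum(axis=1))
```

Averaging probabilities rather than logits keeps every row a valid distribution, since a mean of vectors that each sum to 1 also sums to 1.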

Define PyTorch MLP

In [None]:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_layers, output_size):
        super().__init__()
        layers = []
        prev_size = input_size
        for h in hidden_layers:
            layers.append(nn.Linear(prev_size, h))
            layers.append(nn.ReLU())
            prev_size = h
        layers.append(nn.Linear(prev_size, output_size))
        self.net = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.net(x)

##### Settings

Model

In [None]:
num_classifiers = 3 # Ensemble size
num_models = 40 # Number of voting classifiers to run in loop
random_state = random.randint(10000,90000) 

all_stat_list = [] # List of dataframes with evaluation data

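# Only hidden_layer_sizes, learning_rate_init, and max_iter are read by the PyTorch training loop;
# the remaining sklearn-style keys are kept for reference
model_a_parameters = {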
    'hidden_layer_sizes': (168,80,40),
    'activation': 'relu',
    'max_iter': 100,
    'alpha': 0.00001,
    'learning_rate_init': 0.01, 
    'batch_size': 'auto',
    'random_state': random_state,
    # dropout = 0.1 # Need to switch to MLPDropout to use
    'early_stopping': True,
    'tol': 0.00001,
    'n_iter_no_change': 20,
    'validation_fraction': 0.05
}

Plots

In [None]:
quantiles = 10
year = 2024
venue = 19
graph = '_year' # options include '_year', '_venue', or '' (for all years and venues)

##### Train, Predict, and Evaluate

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Encode string outputs to integers
le = LabelEncoder()
y_train_np = le.fit_transform(complete_dataset['eventsModel'].values[training_mask])
y_train = torch.tensor(y_train_np, dtype=torch.long, device=device)

# Convert numeric inputs to torch tensor
X_train_np = complete_dataset.loc[training_mask, model_a_input_list].astype(float).values
X_train = torch.tensor(X_train_np, dtype=torch.float32, device=device)

input_size = X_train.shape[1]
output_size = len(le.classes_)
hidden_layers = model_a_parameters['hidden_layer_sizes']
lr = model_a_parameters['learning_rate_init']
num_epochs = model_a_parameters['max_iter']

all_stat_list = []

# Training loop
for i in range(num_models):
    print(f"Training ensemble {i+1}/{num_models}")
    ensemble = []

    all_filename = f"predict_all_{''.join(str(x) for x in hidden_layers)}_{random_state+i}_{todaysdate}"
    print(all_filename)
    
    for j in range(num_classifiers):
        # Ensure different random weights for each model
        seed = random_state + 100*j + i
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed(seed)
            torch.cuda.manual_seed_all(seed)
        np.random.seed(seed)
        random.seed(seed)

        model = MLP(input_size, hidden_layers, output_size).to(device)
        optimizer = optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()

        # Train model
        model.train()
        for epoch in range(num_epochs):
            optimizer.zero_grad()
            outputs = model(X_train)
            loss = criterion(outputs, y_train)
            loss.backward()
            optimizer.step()

        ensemble.append(model)

    # Save PyTorch ensemble as before
    torch.save({
        'state_dicts': [m.state_dict() for m in ensemble],
        'input_size': input_size,
        'hidden_layers': hidden_layers,
        'output_size': output_size
    }, os.path.join(model_path, "M03. Plate Appearances", f'{all_filename}.pt'))
    
    # ---- Export NumPy weights for PredictAll ----
    ensemble_numpy = []
    for m in ensemble:
        state_dict = m.state_dict()
        layers = []
    
        # Identify Linear layers; state_dict already preserves layer order
        # (a string sort would place e.g. "net.10.weight" before "net.2.weight")
        linear_keys = [k for k in state_dict.keys() if "weight" in k]
        
        # Do not reuse `i` here; it indexes the outer training loop
        for key in linear_keys:
            layers.append(state_dict[key].cpu().numpy().T)          # W, transposed to [in, out]
            bias_key = key.replace("weight", "bias")
            layers.append(state_dict[bias_key].cpu().numpy())       # b
        
        ensemble_numpy.append(layers)

    
    # ---- Build PredictAll wrapper with metadata ----
    metadata = {
        "hidden_layers": hidden_layers,
        "num_classifiers": num_classifiers,
        "random_seed": random_state,
        "training_epochs": num_epochs
    }
    
    predict_all_wrapper = PredictAll(
        ensemble_numpy=ensemble_numpy,
        input_columns=model_a_input_list,
        classes=le.classes_.tolist(),
        metadata=metadata
    )
    
    # Save wrapper to disk
    pickle_filename = os.path.join(model_path, "M03. Plate Appearances", f"{all_filename}_wrapper.pkl")
    with open(pickle_filename, "wb") as f:
        pickle.dump(predict_all_wrapper, f)
    print(f"Saved PredictAll wrapper to {pickle_filename}")
    
    # Predict on the held-out (non-training) rows
    X_test_np = complete_dataset.loc[~training_mask, model_a_input_list].astype(float).values
    X_test = torch.tensor(X_test_np, dtype=torch.float32, device=device)

    with torch.no_grad():
        probs_list = [F.softmax(m(X_test), dim=1) for m in ensemble]
        avg_probs = torch.stack(probs_list).mean(dim=0)

    # Store predictions in dataframe
    all_outputs_pred = [c + "_pred" for c in le.classes_]
    complete_dataset.loc[~training_mask, all_outputs_pred] = avg_probs.cpu().numpy()

    # Construct evaluation stats, summarize, and graph
    complete_dataset = constructed_stats(complete_dataset)
    all_stat_df = summary_statistics(complete_dataset, year, parameters=model_a_parameters, filename=all_filename, le=le)
    all_stat_list.append(all_stat_df)
    graph_by_quantile(graph, le=le)
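
The exported `ensemble_numpy` structure can be read as a flat `[W, b, W, b, ...]` list per ensemble member, with each `W` already transposed to `(in, out)`. The sketch below shows one way such weights could be used for inference without PyTorch. It is not the `PredictAll` implementation (that lives in U03. Classes.ipynb). The function names and the ReLU hidden activation are assumptions for illustration, and the toy weights are random.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(layers, X):
    """Forward pass through one member. layers = [W0, b0, W1, b1, ...], W shaped (in, out)."""
    a = X
    n_linear = len(layers) // 2
    for k in range(n_linear):
        W, b = layers[2 * k], layers[2 * k + 1]
        a = a @ W + b
        if k < n_linear - 1:
            a = np.maximum(a, 0.0)  # ASSUMPTION: ReLU between hidden layers
    return softmax(a)

def ensemble_predict_proba(ensemble_numpy, X):
    """Average class probabilities across members, mirroring torch.stack(...).mean(dim=0)."""
    return np.mean([forward(layers, X) for layers in ensemble_numpy], axis=0)

# Toy ensemble: 3 members, 4 inputs -> 8 hidden -> 3 classes (random, illustrative only)
rng = np.random.default_rng(0)
toy_ensemble = [
    [rng.normal(size=(4, 8)), np.zeros(8), rng.normal(size=(8, 3)), np.zeros(3)]
    for _ in range(3)
]
X_toy = rng.normal(size=(5, 4))
toy_probs = ensemble_predict_proba(toy_ensemble, X_toy)
```

Because the average of valid probability vectors is itself a valid probability vector, each row of `toy_probs` sums to 1 without further normalization.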


Pareto-Optimal Models

In [None]:
all_stat_df = pd.concat(all_stat_list, ignore_index=True)

pareto_optimal(all_stat_df.query(f'Year == "{year}"') # Will accept variable year and string "All"
                          .query('Output == "wOBA"')
                          .query('1.01 > Multiplier > 0.99').reset_index(drop=True), ['MSE', 'Std. Dev'], ['Minimize', 'Maximize']).sort_values('Std. Dev')
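
For readers unfamiliar with the selection step, the sketch below illustrates what Pareto-optimal filtering means for the two criteria used here (minimize MSE, maximize Std. Dev). It is a dependency-free illustration on plain dicts, not the `pareto_optimal` function from U02. Functions.ipynb. The helper name `pareto_front` and the toy rows are hypothetical.

```python
def pareto_front(rows, metrics, directions):
    """Return the rows that no other row dominates.

    Row a dominates row b if a is at least as good on every metric
    and strictly better on at least one.
    """
    def dominates(a, b):
        at_least_as_good = True
        strictly_better = False
        for metric, direction in zip(metrics, directions):
            # Flip the sign for "Maximize" so that smaller is always better
            av, bv = (a[metric], b[metric]) if direction == "Minimize" else (-a[metric], -b[metric])
            if av > bv:
                at_least_as_good = False
            if av < bv:
                strictly_better = True
        return at_least_as_good and strictly_better

    return [r for r in rows if not any(dominates(other, r) for other in rows)]

# Toy evaluation rows (made-up numbers, illustrative only)
toy_rows = [
    {"File": "a", "MSE": 0.10, "Std. Dev": 0.030},
    {"File": "b", "MSE": 0.11, "Std. Dev": 0.035},
    {"File": "c", "MSE": 0.12, "Std. Dev": 0.030},  # worse MSE than "a" at the same Std. Dev
    {"File": "d", "MSE": 0.09, "Std. Dev": 0.020},
]
front = pareto_front(toy_rows, ["MSE", "Std. Dev"], ["Minimize", "Maximize"])
front_files = [r["File"] for r in front]
```

With a DataFrame, the same helper could be fed `df.to_dict("records")`.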

### Predict

Load model

Note: this will overwrite the predict_all model loaded in U05. Models.ipynb

In [None]:
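# Note: the training cell above saves "<name>.pt" and "<name>_wrapper.pkl";
# the .sav file named here must already exist in the model folder.
all_filename = "predict_all_16080_36421_20251105.sav"

# Use a context manager so the file handle is closed after loading
with open(os.path.join(model_path, "M03. Plate Appearances", all_filename), 'rb') as f:
    predict_all = pickle.load(f)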

Predict

In [None]:
all_outputs_pred = [x + "_pred1" for x in list(predict_all.classes_)]

complete_dataset[all_outputs_pred] = predict_all.predict_proba(complete_dataset[model_a_input_list])

### Model B. All - WFX Adjusted

##### Inputs

Calculate Predicted Rate x WFX Interactions

In [None]:
interactions_list = []

for event in events_list:
    complete_dataset[f'{event}_int'] = complete_dataset[f'{event}_pred1'] * complete_dataset[f'{event}_wfx']
    interactions_list.append(f'{event}_int')

Model Inputs

In [None]:
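# Option 4 (Interacted Inputs): predicted rate x WFX interactions (inactive)
# model_b_input_list = interactions_list + imp_starter_input_list #+ ['imp_wfx']
# Option 3 (Split): Model A predictions and WFX multipliers as separate inputs (active)
model_b_input_list = ([f"{event}_pred1" for event in events_list] + multiplier_input_list + imp_starter_input_list)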

##### Settings

Model

In [None]:
num_classifiers = 3 # Ensemble size
num_models = 40 # Number of voting classifiers to run in loop
random_state = random.randint(10000,90000) 

all_adjusted_stat_list = [] # List of dataframes with evaluation data

model_b_parameters = {
    'hidden_layer_sizes': (16,),
    'activation': 'relu',
    'max_iter': 100,
    'alpha': 0.00001,
    'learning_rate_init': 0.001, 
    'batch_size': 1024,
    'random_state': random_state,
    # dropout = 0.1 # Need to switch to MLPDropout to use
    'early_stopping': True,
    'tol': 0.00001,
    'n_iter_no_change': 10,
    'validation_fraction': 0.05
}

Plots

In [None]:
quantiles = 10
year = 2024 
venue = 19
graph = '_year' # options include '_year', '_venue', or '' (for all years and venues)

##### Train, Predict, and Evaluate

In [None]:
%%time
print(f"Ensemble Size: {num_classifiers}")
for i in range(num_models):
    # Set filename
    all_adjusted_filename = f"predict_all_adjusted_{''.join(str(x) for x in model_b_parameters['hidden_layer_sizes'])}_{random_state+i}_{todaysdate}.sav"
    print(f"Model {i}: {all_adjusted_filename}")

    ### Train
    # Build list of MLP classifiers with varied random_state
    estimators = []
    for j in range(num_classifiers):
        # Determine random state
        model_b_parameters['random_state'] = random_state + 100 * j + i
        # Create model
        clf = SafeMLPClassifier(**model_b_parameters)
        estimators.append((f"mlp_{j}", clf))
    # Combine into a soft voting classifier
    predict_all_adjusted = VotingClassifier(estimators=estimators, voting='soft', n_jobs=-1)

    # Fit
    predict_all_adjusted.fit(complete_dataset[training_mask][model_b_input_list], complete_dataset[training_mask][['eventsModel']].values.ravel())

    # Save model (context manager closes the file handle)
    with open(os.path.join(model_path, "M03. Plate Appearances", all_adjusted_filename), 'wb') as f:
        pickle.dump(predict_all_adjusted, f)

    
    ### Predict
    all_outputs_pred = [x + "_pred" for x in list(predict_all_adjusted.classes_)]
    complete_dataset.loc[~training_mask, all_outputs_pred] = predict_all_adjusted.predict_proba(complete_dataset[~training_mask][model_b_input_list])


    ### Evaluate
    # Construct stats required for model evaluations
    complete_dataset = constructed_stats(complete_dataset)

    # Print summary statistics
    all_stat_df = summary_statistics(complete_dataset, year, parameters=model_b_parameters, filename=all_adjusted_filename, model=predict_all_adjusted)

    # Add model statistics to a running dataframe list for later evaluation across models
    all_adjusted_stat_list.append(all_stat_df)

    # Graph
    graph_by_quantile(graph, model=predict_all_adjusted)
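
Both training loops seed classifier `j` of model `i` with `random_state + 100 * j + i`. A quick check, sketched below, suggests this keeps every seed distinct only while `num_models` is at most 100. Beyond that, classifier 0 of model 100 would share a seed with classifier 1 of model 0. The helper names and the base seed `12345` are hypothetical.

```python
def ensemble_seeds(random_state, num_models, num_classifiers):
    """Map (model index i, classifier index j) to the seed used in the training loops."""
    return {
        (i, j): random_state + 100 * j + i
        for i in range(num_models)
        for j in range(num_classifiers)
    }

def has_collisions(seeds):
    """True if two (i, j) pairs share a seed."""
    return len(set(seeds.values())) < len(seeds)

# Settings used in this notebook: 40 models of 3 classifiers each
seeds_40 = ensemble_seeds(12345, num_models=40, num_classifiers=3)

# Hypothetical larger run: i = 100, j = 0 collides with i = 0, j = 1
seeds_101 = ensemble_seeds(12345, num_models=101, num_classifiers=3)

collision_free_40 = not has_collisions(seeds_40)
collision_in_101 = has_collisions(seeds_101)
```

If more than 100 models per run are ever needed, widening the stride (for example `1000 * j`) would be one way to keep the seeds distinct.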

Pareto-Optimal Models

In [None]:
all_adjusted_stat_df = pd.concat(all_adjusted_stat_list, ignore_index=True)

pareto_optimal(all_adjusted_stat_df.query(f'Year == "{year}"') # Will accept variable year and string "All"
                                   .query('Output == "wOBA"')
                                   .query('1.01 > Multiplier > 0.99').reset_index(drop=True), ['MSE', 'Std. Dev'], ['Minimize', 'Maximize']).sort_values('Std. Dev')

Note: We have the following options for predicting plate appearances using player, game, and weather inputs:
1. Kitchen Sink: One model with all features.
2. Interacted Outputs: One model with player/game features. Its outputs are then multiplied by the wfx multipliers to create the final probabilities.
3. Split: Two models. The first uses player/game stats. The second takes the first model's outputs and the wfx multipliers as inputs.
4. Interacted Inputs: Two models. The first uses player/game stats. The second takes the first model's outputs x wfx multipliers as inputs.
5. No Rain: One model with player/game stats and no wfx at all (just a baseline for comparison).
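
As a rough illustration of option 2 (Interacted Outputs), the sketch below multiplies a row of predicted event probabilities by per-event multipliers. The renormalization step is an assumption: after multiplication the values no longer sum to 1, and dividing by the row total is one simple way to restore that. The function name, event ordering, and numbers are hypothetical.

```python
import numpy as np

def apply_wfx(probs, multipliers):
    """Scale each event probability by its multiplier, then renormalize each row.

    ASSUMPTION: renormalizing by the row sum is one reasonable choice;
    this notebook does not specify how option 2 would restore a valid distribution.
    """
    adjusted = probs * multipliers
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# One plate appearance, three illustrative events (e.g. out, single, home run)
base_probs = np.array([[0.70, 0.20, 0.10]])
wfx = np.array([[1.00, 1.00, 1.20]])  # conditions that favor the third event by 20%

adjusted_probs = apply_wfx(base_probs, wfx)
```

Note that renormalizing shrinks the effective boost slightly: the third event rises from 0.10 to 0.12 / 1.02, not to 0.12, and the other events fall even though their multipliers are 1.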