# Model Training and Evaluation 

## Predicting MLB game results

This notebook trains and evaluates machine learning models that predict the outcomes of MLB games, using a dataset built from current-season statistics.

### Notebook Setup

Import the necessary dependencies.

In [1]:
import pandas as pd
from datetime import datetime
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from utils.notebook_setup import setup_notebook_env, load_env_variables

Set up the notebook environment: add the project root to `sys.path` so that higher-level modules can be imported from this notebook, and load the environment variables required to configure the connection to the PostgreSQL database.

In [2]:
setup_notebook_env()
load_env_variables()
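The source of `setup_notebook_env` is not shown in this notebook. As a rough illustration of the `sys.path` step described above, the following is a minimal sketch that assumes the project root can be recognised by a marker file. The helper names `find_project_root` and `add_to_sys_path` and the marker `pyproject.toml` are hypothetical and may differ from what the project actually uses.

```python
import sys
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str) -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path once, so re-running the cell adds no duplicates."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# Hypothetical usage: "pyproject.toml" is an assumed marker file.
project_root = find_project_root(Path.cwd(), marker="pyproject.toml")
if project_root is not None:
    add_to_sys_path(project_root)
```

Guarding the insert keeps `sys.path` stable when the setup cell is executed more than once in the same kernel.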

Now import the remaining dependencies, which could not be resolved before the notebook environment was set up, and connect to the database.

In [3]:
from shared.database import connect_to_db
from machine_learning.data.processing.mlb_data_pipeline import MLBDataPipeline

session = connect_to_db()

### Model Selection

Read the relevant database tables into DataFrames so training data can be prepared using the data pipeline class.

In [4]:
teams_df = pd.read_sql_table("mlb_teams", session.bind)
schedule_df = pd.read_sql_table("mlb_schedule", session.bind)
offensive_stats_df = pd.read_sql_table("mlb_offensive_stats", session.bind)
defensive_stats_df = pd.read_sql_table("mlb_defensive_stats", session.bind)

Prepare training data for a given date range.

In [5]:
data_pipeline = MLBDataPipeline(rolling_window=10, head_to_head_window=5)

start_date = datetime(2024, 5, 1)
end_date = datetime(2024, 10, 1)

training_data = data_pipeline.prepare_training_data(
    schedule_df=schedule_df,
    teams_df=teams_df,
    offensive_stats_df=offensive_stats_df,
    defensive_stats_df=defensive_stats_df,
    start_date=start_date,
    end_date=end_date
)
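The internals of `MLBDataPipeline` are not shown here, so the following is only a sketch of how a rolling feature such as `home_rolling_win_pct` could be computed under the assumption that each game should see only earlier games. The toy `games` frame, its column names, and the window of 3 are illustrative and are not taken from the pipeline.

```python
import pandas as pd

# Toy per-team game log (made-up values, for illustration only).
games = pd.DataFrame({
    "team_id": [1, 1, 1, 1, 2, 2, 2],
    "game_date": pd.to_datetime([
        "2024-05-01", "2024-05-02", "2024-05-03", "2024-05-04",
        "2024-05-01", "2024-05-02", "2024-05-03",
    ]),
    "won": [1, 0, 1, 1, 0, 0, 1],
})

# Rolling statistics are only meaningful on chronologically ordered rows.
games = games.sort_values(["team_id", "game_date"]).reset_index(drop=True)

# shift(1) removes the current game from its own window, so the feature
# reflects only results that were known before first pitch.
games["rolling_win_pct"] = games.groupby("team_id")["won"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(games)
```

Each team's first game has no history and therefore yields `NaN`; how such rows are handled (dropped, imputed, or excluded by the start date) is a design decision the pipeline would have to make.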

Preview the first rows of the training data as an initial check for missing values or other inconsistencies.

In [6]:
training_data.head()

| | home_rolling_win_pct | away_rolling_win_pct | home_rolling_runs_scored | away_rolling_runs_scored | home_rolling_runs_allowed | away_rolling_runs_allowed | home_days_rest | away_days_rest | home_batting_avg | away_batting_avg | home_obp | away_obp | home_slg | away_slg | home_era | away_era | home_whip | away_whip | home_strikeouts | away_strikeouts | h2h_home_win_pct | h2h_away_win_pct | h2h_games_played | game_id | game_date | home_team_id | away_team_id | home_team_won | run_differential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.6 | 0.5 | 4.7 | 3.7 | 3.7 | 5.4 | 1.0 | 1.0 | 0.234 | 0.248 | 0.3 | 0.312 | 0.385 | 0.392 | 3.61 | 4.04 | 1.16 | 1.26 | 1354 | 1308 | 0.5 | 0.5 | 2 | 746478 | 2024-05-01 | 116 | 138 | True | 3.0 |
| 1 | 0.5 | 0.3 | 3.8 | 3.2 | 5.2 | 5.0 | 1.0 | 1.0 | 0.248 | 0.23 | 0.326 | 0.302 | 0.403 | 0.366 | 3.65 | 3.77 | 1.23 | 1.2 | 1373 | 1406 | 0.5 | 0.5 | 2 | 745996 | 2024-05-01 | 158 | 139 | True | 6.0 |
| 2 | 0.3 | 0.9 | 4.1 | 6.7 | 5.5 | 3.7 | 1.0 | 1.0 | 0.221 | 0.246 | 0.278 | 0.315 | 0.34 | 0.411 | 4.67 | 4.26 | 1.44 | 1.23 | 1366 | 1500 | 0.0 | 1.0 | 5 | 746800 | 2024-05-01 | 145 | 142 | False | -5.0 |
| 3 | 0.4 | 0.4 | 3.1 | 3.9 | 4.3 | 4.0 | 1.0 | 1.0 | 0.241 | 0.248 | 0.313 | 0.306 | 0.389 | 0.403 | 4.29 | 3.76 | 1.27 | 1.24 | 1314 | 1339 | 0.4 | 0.6 | 5 | 744939 | 2024-05-01 | 141 | 118 | False | -5.0 |
| 4 | 0.6 | 0.3 | 3.3 | 2.3 | 3.6 | 3.8 | 1.0 | 1.0 | 0.233 | 0.234 | 0.301 | 0.301 | 0.393 | 0.371 | 4.37 | 4.15 | 1.33 | 1.31 | 1263 | 1356 | 1.0 | 0.0 | 2 | 745670 | 2024-05-01 | 133 | 134 | True | 4.0 |
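A five-row preview cannot reveal missing values deeper in the data. As a complement, the sketch below counts missing values per column, checks for duplicated keys, and verifies chronological order. The helper `summarize_quality` and the `sample` frame are hypothetical; in the notebook the same call would be made on `training_data`, assuming `game_id` identifies a game uniquely.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame, key: str, date_col: str) -> dict:
    """Return simple data-quality indicators for a prepared dataset."""
    missing = df.isna().sum()
    return {
        # Only columns that actually contain missing values.
        "missing_per_column": {col: int(n) for col, n in missing.items() if n > 0},
        # Rows whose key repeats an earlier row.
        "duplicate_keys": int(df[key].duplicated().sum()),
        # True when dates never decrease from one row to the next.
        "is_date_sorted": bool(df[date_col].is_monotonic_increasing),
    }


# Made-up sample with one missing value and one repeated key.
sample = pd.DataFrame({
    "game_id": [1, 2, 2],
    "game_date": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
    "home_era": [3.61, np.nan, 4.67],
})

report = summarize_quality(sample, key="game_id", date_col="game_date")
print(report)
```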


Create a `TimeSeriesSplit` cross-validator, and select the feature columns by excluding the target columns (`home_team_won`, `run_differential`) and any timestamp columns. Define the models to be evaluated, and create an empty `results` dictionary to store the performance metrics for each model.

In [7]:
tscv = TimeSeriesSplit(n_splits=5)

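# Collect timestamp columns so they can be excluded from the features.
timestamp_cols = training_data.select_dtypes(include=["datetime64"]).columns

# Exclude the target columns as well. Identifier columns (game_id, home_team_id,
# away_team_id) are not excluded here and remain in the feature set.
excluded_cols = ["home_team_won", "run_differential", *timestamp_cols]
feature_cols = [col for col in training_data.columns if col not in excluded_cols]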

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
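`TimeSeriesSplit` splits by row position, not by date, so it only respects time order when the rows are already chronological. The notebook does not show whether `prepare_training_data` returns sorted rows, so the sketch below sorts explicitly before splitting. The `toy` frame is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Deliberately shuffled toy data (made-up values).
toy = pd.DataFrame({
    "game_date": pd.to_datetime([
        "2024-05-03", "2024-05-01", "2024-05-02", "2024-05-06",
        "2024-05-04", "2024-05-05", "2024-05-08", "2024-05-07",
    ]),
    "home_team_won": [True, False, True, True, False, True, False, True],
})

# Sort chronologically so that positional splits are also temporal splits.
toy = toy.sort_values("game_date").reset_index(drop=True)

folds = list(TimeSeriesSplit(n_splits=3).split(toy))
for fold, (train_idx, test_idx) in enumerate(folds):
    train_end = toy["game_date"].iloc[train_idx].max()
    test_start = toy["game_date"].iloc[test_idx].min()
    print(f"fold {fold}: train rows {len(train_idx)}, test rows {len(test_idx)}, "
          f"train ends {train_end.date()}, test starts {test_start.date()}")
```

Each fold trains on an expanding window of earlier rows and tests on the block that follows, so after sorting no fold is evaluated on games that precede its training data.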

Fit each model on every fold, using the time series cross-validator to split the data into training and test sets. For each fold, predict on the test set, compute the evaluation metrics, and store the results.

In [8]:
for name, model in models.items():
    fold_scores = []
    for train_index, test_index in tscv.split(training_data):
        # Earlier rows form the training fold; the rows that follow form the test fold.
        train_fold = training_data.iloc[train_index]
        test_fold = training_data.iloc[test_index]
        X_train, X_test = train_fold[feature_cols], test_fold[feature_cols]
        y_train, y_test = train_fold["home_team_won"], test_fold["home_team_won"]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        scores = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred)
        }
        fold_scores.append(scores)

    results[name] = fold_scores

Inspect the Random Forest classifier's metrics, averaged across the five folds.

In [9]:
results_df = pd.DataFrame(results["random_forest"])

results_df.mean()

accuracy     0.535562
precision    0.549762
recall       0.608086
f1           0.576760
dtype: float64
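Mean scores are easier to interpret next to a trivial baseline and a measure of fold-to-fold spread. The sketch below evaluates a majority-class `DummyClassifier` under the same kind of time-ordered split. It runs on synthetic data, so its numbers say nothing about MLB games. In the notebook, `X` and `y` would be replaced by `training_data[feature_cols]` and `training_data["home_team_won"]`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data; the 0.5 threshold is arbitrary, not a measured home-win rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.random(200) < 0.5

baseline_scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Always predicts the most frequent class seen in the training fold.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X[train_idx], y[train_idx])
    y_pred = baseline.predict(X[test_idx])
    baseline_scores.append(accuracy_score(y[test_idx], y_pred))

baseline_scores = np.array(baseline_scores)
print(f"baseline accuracy: mean={baseline_scores.mean():.3f}, std={baseline_scores.std():.3f}")
```

If a model's mean accuracy sits within roughly one standard deviation of the baseline, the difference may not be meaningful at this sample size.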