# Model Training and Evaluation 

## Predicting MLB game results

This notebook trains and evaluates machine learning models that predict the outcomes of MLB games, using a dataset built from current-season statistics.

### Notebook Setup

Import the necessary dependencies.

In [1]:
import pandas as pd
from datetime import datetime
from utils.notebook_setup import setup_notebook_env, load_env_variables

Set up the notebook environment. This adds the project root to `sys.path` so that higher-level modules can be imported from this notebook, and loads the environment variables required to configure the connection to the PostgreSQL database.

In [2]:
setup_notebook_env()
load_env_variables()
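
The helpers in `utils.notebook_setup` are not shown in this notebook. As a rough sketch of what helpers of this kind might do, the block below puts a directory on `sys.path` and reads `KEY=VALUE` lines from a file into `os.environ`. The function names `add_project_root_to_path` and `load_env_file`, and the `KEY=VALUE` file format, are assumptions made for illustration and may differ from the project's actual implementation.

```python
import os
import sys
from pathlib import Path


def add_project_root_to_path(root: Path) -> None:
    """Hypothetical helper: make modules under `root` importable."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


def load_env_file(path: Path) -> dict:
    """Hypothetical helper: read KEY=VALUE lines from `path` into os.environ."""
    loaded = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    # Variables already set in the environment take precedence over the file.
    for key, value in loaded.items():
        os.environ.setdefault(key, value)
    return loaded
```

From a notebook that lives one level below the project root, this could be called as `add_project_root_to_path(Path.cwd().parent)` followed by `load_env_file(Path.cwd().parent / ".env")`, where the `.env` location is again an assumption.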

Import the remaining dependencies, which only become importable once the notebook environment has been set up, and connect to the database.

In [3]:
from shared.database import connect_to_db
from machine_learning.data.processing.mlb_data_pipeline import MLBDataPipeline

session = connect_to_db()

### Model Selection

Read the relevant database tables into DataFrames so training data can be prepared using the data pipeline class.

In [4]:
teams_df = pd.read_sql_table("mlb_teams", session.bind)
schedule_df = pd.read_sql_table("mlb_schedule", session.bind)
offensive_stats_df = pd.read_sql_table("mlb_offensive_stats", session.bind)
defensive_stats_df = pd.read_sql_table("mlb_defensive_stats", session.bind)

Prepare training data for the date range 2024-05-01 to 2024-06-01, with `rolling_window=10` and `head_to_head_window=5`.

In [5]:
data_pipeline = MLBDataPipeline(rolling_window=10, head_to_head_window=5)

start_date = datetime(2024, 5, 1)
end_date = datetime(2024, 6, 1)

training_data = data_pipeline.prepare_training_data(
    schedule_df=schedule_df,
    teams_df=teams_df,
    offensive_stats_df=offensive_stats_df,
    defensive_stats_df=defensive_stats_df,
    start_date=start_date,
    end_date=end_date
)
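
The internals of `MLBDataPipeline` are not shown here. To illustrate what a rolling-window feature such as `home_rolling_win_pct` might involve, the sketch below computes a per-team win percentage over each team's previous games, shifting by one game so that a game's own result never leaks into its features. The function `rolling_win_pct`, the column names `team_id`, `game_date`, and `won`, and the toy data are all assumptions for illustration, not the pipeline's actual API or real results.

```python
import pandas as pd


def rolling_win_pct(games: pd.DataFrame, window: int = 10) -> pd.Series:
    """Win percentage over each team's previous `window` games (hypothetical helper).

    Expects one row per team per game with columns team_id, game_date, won.
    """
    ordered = games.sort_values(["team_id", "game_date"])
    pct = ordered.groupby("team_id")["won"].transform(
        # shift(1) drops the current game; min_periods=1 tolerates short histories.
        lambda s: s.astype(float).shift(1).rolling(window, min_periods=1).mean()
    )
    return pct.reindex(games.index)  # restore the caller's row order


# Toy data, invented for this example only.
games = pd.DataFrame(
    {
        "team_id": [1, 1, 1, 1, 2, 2],
        "game_date": pd.to_datetime(
            [
                "2024-05-01", "2024-05-02", "2024-05-03", "2024-05-04",
                "2024-05-01", "2024-05-02",
            ]
        ),
        "won": [True, False, True, True, False, True],
    }
)
games["rolling_win_pct"] = rolling_win_pct(games, window=2)
```

Each team's first game has no history, so its value is `NaN`; a real pipeline would need a policy for such rows, for example dropping them or filling in a neutral value.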

In [6]:
training_data.head()

| | home_rolling_win_pct | away_rolling_win_pct | home_rolling_runs_scored | away_rolling_runs_scored | home_rolling_runs_allowed | away_rolling_runs_allowed | home_days_rest | away_days_rest | home_batting_avg | away_batting_avg | ... | away_strikeouts | h2h_home_win_pct | h2h_away_win_pct | h2h_games_played | game_id | game_date | home_team_id | away_team_id | home_team_won | run_differential |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.6 | 0.5 | 4.7 | 3.7 | 3.7 | 5.4 | 1.0 | 1.0 | 0.234 | 0.248 | ... | 1308 | 0.5 | 0.5 | 2 | 746478 | 2024-05-01 | 116 | 138 | True | 3.0 |
| 1 | 0.5 | 0.3 | 3.8 | 3.2 | 5.2 | 5.0 | 1.0 | 1.0 | 0.248 | 0.23 | ... | 1406 | 0.5 | 0.5 | 2 | 745996 | 2024-05-01 | 158 | 139 | True | 6.0 |
| 2 | 0.3 | 0.9 | 4.1 | 6.7 | 5.5 | 3.7 | 1.0 | 1.0 | 0.221 | 0.246 | ... | 1500 | 0.0 | 1.0 | 5 | 746800 | 2024-05-01 | 145 | 142 | False | -5.0 |
| 3 | 0.4 | 0.4 | 3.1 | 3.9 | 4.3 | 4.0 | 1.0 | 1.0 | 0.241 | 0.248 | ... | 1339 | 0.4 | 0.6 | 5 | 744939 | 2024-05-01 | 141 | 118 | False | -5.0 |
| 4 | 0.6 | 0.3 | 3.3 | 2.3 | 3.6 | 3.8 | 1.0 | 1.0 | 0.233 | 0.234 | ... | 1356 | 1.0 | 0.0 | 2 | 745670 | 2024-05-01 | 133 | 134 | True | 4.0 |
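
A quick consistency check on this output can catch labelling mistakes before any model is trained. The sketch below rebuilds four columns from the five rows displayed above and tests two relationships that hold in those rows: `home_team_won` agrees with the sign of `run_differential`, and the two head-to-head win percentages sum to 1. That `run_differential` is measured from the home team's perspective, and that both relationships hold for every row of `training_data`, are assumptions inferred from these five rows only.

```python
import pandas as pd

# Values copied from the five rows of training_data.head() shown above.
sample = pd.DataFrame(
    {
        "h2h_home_win_pct": [0.5, 0.5, 0.0, 0.4, 1.0],
        "h2h_away_win_pct": [0.5, 0.5, 1.0, 0.6, 0.0],
        "home_team_won": [True, True, False, False, True],
        "run_differential": [3.0, 6.0, -5.0, -5.0, 4.0],
    }
)

# The label should be True exactly when the run differential is positive.
label_matches_margin = (sample["run_differential"] > 0) == sample["home_team_won"]

# Head-to-head shares should sum to 1, compared with a small float tolerance.
h2h_total = sample["h2h_home_win_pct"] + sample["h2h_away_win_pct"]
h2h_sums_to_one = (h2h_total - 1.0).abs() < 1e-9

assert label_matches_margin.all()
assert h2h_sums_to_one.all()
```

To run the same checks on the full dataset, `sample` could be replaced with `training_data`. The head-to-head check may need to exclude rows where `h2h_games_played` is 0, if the pipeline produces such rows.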
