# Feature Engineering

## MLB Current Season Statistics by Team

This notebook engineers features from current-season MLB statistics for a matchup between two teams. The resulting features will be used to predict the matchup outcome.

### Notebook Setup

Import the necessary dependencies.

In [1]:
import pandas as pd
from datetime import datetime
from pprint import pprint

from utils.notebook_setup import setup_notebook_env, load_env_variables

Set up the notebook environment so that the project root directory can be accessed and the environment variables needed to configure the PostgreSQL connection are loaded.

In [2]:
setup_notebook_env()
load_env_variables()
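For intuition, a helper such as `setup_notebook_env` could work by locating the project root and adding it to `sys.path`. The sketch below illustrates that idea only. It is not the project's implementation, and the function name `add_project_root` and the `.git` marker are assumptions made for illustration.

```python
import sys
from pathlib import Path
from typing import Optional


def add_project_root(marker: str = ".git", start: Optional[Path] = None) -> Path:
    """Hypothetical helper: find the nearest ancestor directory that contains
    `marker` and put it at the front of sys.path so its packages can be imported."""
    current = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {current}")
```

Calling `add_project_root()` from a notebook nested a few directories below the root would then make top-level packages importable, provided the marker exists at the root.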

Import the project modules that are now accessible, and connect to the database.

In [3]:
from shared.database import connect_to_db
from machine_learning.analysis.mlb_time_series import (
    TeamTimeSeriesAnalyzer
)
from machine_learning.analysis.mlb_feature_engineering import (
    GameFeatureGenerator
)

session = connect_to_db()

### Feature Generation

Read the relevant database tables into DataFrames, then calculate rolling statistics with a window size of 10 for use in feature generation.

In [4]:
teams_df = pd.read_sql_table('mlb_teams', session.bind)
schedule_df = pd.read_sql_table('mlb_schedule', session.bind)
offensive_stats_df = pd.read_sql_table('mlb_offensive_stats', session.bind)
defensive_stats_df = pd.read_sql_table('mlb_defensive_stats', session.bind)

rolling_stats = TeamTimeSeriesAnalyzer(
    window_size=10
).calculate_rolling_stats(
    schedule_df=schedule_df,
    teams_df=teams_df
)
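For intuition, rolling statistics of this kind can be sketched with pandas alone. This is a minimal illustration, not the implementation of `TeamTimeSeriesAnalyzer`: the function `rolling_team_stats` and the column names (`team_id`, `game_date`, `runs_scored`, `runs_allowed`, `win`) are assumptions, and the real schedule table may be shaped differently.

```python
import pandas as pd


def rolling_team_stats(games: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Hypothetical sketch: per-team rolling means over the last `window` games.

    Each rolling value includes the current row's game. Teams with fewer than
    `window` games so far are averaged over the games they have played.
    """
    games = games.sort_values(["team_id", "game_date"]).copy()
    columns = {
        "runs_scored": "rolling_runs_scored",
        "runs_allowed": "rolling_runs_allowed",
        "win": "rolling_win_pct",
    }
    for source, target in columns.items():
        games[target] = games.groupby("team_id")[source].transform(
            lambda s: s.rolling(window, min_periods=1).mean()
        )
    return games


# Tiny made-up sample: two teams, rows deliberately out of order.
sample = pd.DataFrame({
    "team_id": [1, 2, 1, 2, 1],
    "game_date": pd.to_datetime(
        ["2024-04-01", "2024-04-01", "2024-04-02", "2024-04-02", "2024-04-03"]
    ),
    "runs_scored": [2, 1, 4, 3, 6],
    "runs_allowed": [1, 5, 6, 2, 3],
    "win": [1, 0, 0, 1, 1],
})
result = rolling_team_stats(sample, window=2)
```

When such values feed a prediction for a game that has not been played yet, only games before the prediction date should contribute, for example by shifting each team's series by one row before taking the rolling mean.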

Generate the features for a given matchup, here the Los Angeles Dodgers (home) against the New York Yankees (away).

In [5]:
game_feature_generator = GameFeatureGenerator(rolling_window=10, head_to_head_window=5)

home_team_id = teams_df.loc[teams_df['name'] == 'Los Angeles Dodgers'].iloc[0]['id']
away_team_id = teams_df.loc[teams_df['name'] == 'New York Yankees'].iloc[0]['id']

features = game_feature_generator.generate_game_features(
    home_team_id=home_team_id,
    away_team_id=away_team_id,
    game_date=datetime.now(),
    rolling_stats=rolling_stats,
    offensive_stats=offensive_stats_df,
    defensive_stats=defensive_stats_df,
    schedule_df=schedule_df
)

pprint(features)

{'away_batting_avg': 0.248,
 'away_days_rest': 21.0,
 'away_era': 3.74,
 'away_obp': 0.333,
 'away_rolling_runs_allowed': 4.1,
 'away_rolling_runs_scored': 4.8,
 'away_rolling_win_pct': 0.6,
 'away_slg': 0.429,
 'away_strikeouts': 1457,
 'away_whip': 1.24,
 'h2h_away_win_pct': 0.333,
 'h2h_games_played': 3,
 'h2h_home_win_pct': 0.667,
 'home_batting_avg': 0.258,
 'home_days_rest': 21.0,
 'home_era': 3.9,
 'home_obp': 0.335,
 'home_rolling_runs_allowed': 3.5,
 'home_rolling_runs_scored': 6.2,
 'home_rolling_win_pct': 0.7,
 'home_slg': 0.446,
 'home_strikeouts': 1390,
 'home_whip': 1.23}
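One common way to consume a dictionary like this is to collapse each home/away pair into a single difference before passing it to a model. The following is a sketch under that assumption. `matchup_differentials` is a hypothetical helper, not part of the project, and whether differences are the right representation depends on the downstream model.

```python
from typing import Dict


def matchup_differentials(features: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical helper: for every home_<stat> that has a matching
    away_<stat>, return diff_<stat> = home value minus away value."""
    diffs = {}
    for key, home_value in features.items():
        if not key.startswith("home_"):
            continue
        stat = key[len("home_"):]
        away_key = f"away_{stat}"
        if away_key in features:
            diffs[f"diff_{stat}"] = home_value - features[away_key]
    return diffs


# A subset of the feature dictionary printed above.
sample = {
    "home_rolling_runs_scored": 6.2,
    "away_rolling_runs_scored": 4.8,
    "home_strikeouts": 1390,
    "away_strikeouts": 1457,
    "h2h_home_win_pct": 0.667,  # no home_/away_ pair, so it is skipped
}
diffs = matchup_differentials(sample)
```

Here a positive `diff_rolling_runs_scored` means the home team has scored more over the rolling window, while a negative `diff_strikeouts` means the home team has recorded fewer strikeouts than the away team.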
