# MLB Player Trajectory Modeling

This notebook represents the final modeling stage of the MLBPlayerTrajectories project.

Using the longitudinal player–season dataset constructed in prior notebooks, the objective here is to estimate how a hitter’s offensive performance is likely to *change* from one season to the next. Rather than modeling raw performance levels directly, player evaluation is framed as a **trajectory problem**, where the focus is on identifying meaningful shifts in performance over time.

Specifically, this notebook seeks to identify hitters who are likely to experience:
- **Breakouts**: meaningful improvement relative to prior performance  
- **Declines**: sustained drops in offensive effectiveness  
- **Bouncebacks**: returns to prior levels following a down season  
- **Stable performance**: production that remains within expected ranges  

All modeling is performed within a clean, forward-looking backtesting framework. Features are derived exclusively from information available *prior* to the prediction season, ensuring that no future information is used during training or inference. Underlying process metrics, such as plate discipline, contact quality, and batted ball characteristics, serve as model inputs, while outcome-based metrics are reserved strictly for labeling and evaluation to prevent information leakage.
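A minimal sketch of how season-to-season pairs could be built so that features come only from season *t* and the label comes from season *t+1*. The toy values, the helper name `build_pairs`, and the choice of `wRC+` as the outcome metric are assumptions for illustration, not the notebook's confirmed implementation.

```python
import pandas as pd

# Toy player-season rows; the values are made up for illustration only.
toy = pd.DataFrame({
    "IDfg":    [1, 1, 1, 2, 2],
    "Season":  [2022, 2023, 2024, 2022, 2024],
    "wRC+":    [100, 120, 110, 90, 95],
    "Barrel%": [0.08, 0.10, 0.09, 0.05, 0.06],
})

def build_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Pair each season t (feature row) with season t+1 (label) per player."""
    df = df.sort_values(["IDfg", "Season"]).copy()
    grouped = df.groupby("IDfg")
    # Look one row ahead within each player's history.
    df["next_season"] = grouped["Season"].shift(-1)
    df["next_wRC+"] = grouped["wRC+"].shift(-1)
    # Keep only truly consecutive seasons; gaps and final seasons drop out.
    consecutive = df["next_season"] == df["Season"] + 1
    pairs = df[consecutive].copy()
    # The label is the change in the outcome metric (assumed here: wRC+).
    pairs["delta_wRC+"] = pairs["next_wRC+"] - pairs["wRC+"]
    pairs["target_season"] = pairs["Season"] + 1
    return pairs

pairs = build_pairs(toy)

# Forward-looking split: train on earlier target seasons, test on a later one.
train = pairs[pairs["target_season"] <= 2023]
test = pairs[pairs["target_season"] == 2024]
```

Splitting on `target_season` rather than at random keeps every training label strictly earlier than the seasons being evaluated.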

Rather than treating trajectory outcomes as fixed classes during training, the modeling task is framed as a **change estimation problem**. Models are used to estimate the expected magnitude and direction of offensive performance change, which is then mapped into interpretable trajectory categories using analyst-defined thresholds. This approach preserves information about performance magnitude while allowing flexible and transparent classification logic.
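The mapping from an estimated change to a trajectory category could look like the sketch below. The function name `classify_trajectory`, the ±15-point thresholds, and the rule that an improvement following a down season counts as a bounceback are all hypothetical choices made for illustration; the notebook's actual thresholds are not stated in this section.

```python
def classify_trajectory(pred_delta, prior_delta=None, up=15.0, down=-15.0):
    """Map a predicted change in the outcome metric to a trajectory label.

    pred_delta:  predicted change from season t to season t+1
    prior_delta: observed change from season t-1 to season t (None if unknown)
    up, down:    analyst-defined thresholds (hypothetical defaults)
    """
    if pred_delta >= up:
        # An expected improvement right after a down season is labeled
        # a bounceback rather than a breakout.
        if prior_delta is not None and prior_delta <= down:
            return "Bounceback"
        return "Breakout"
    if pred_delta <= down:
        return "Decline"
    return "Stable"

labels = [
    classify_trajectory(20.0),                     # large expected gain
    classify_trajectory(20.0, prior_delta=-25.0),  # gain after a down year
    classify_trajectory(-18.0),                    # large expected drop
    classify_trajectory(3.0),                      # within the stable band
]
```

Because the thresholds are plain arguments, the category boundaries can be adjusted without retraining the underlying change model.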

The emphasis of this notebook is on interpretability, ranking quality, and analytical clarity rather than production deployment or black-box optimization. The goal is to support forward-looking player evaluation and comparison, not to maximize predictive complexity.

## Objective

The primary objectives of this notebook are to:

- Construct player-level historical features using prior-season and multi-season data  
- Define interpretable trajectory outcomes based on future changes in offensive performance  
- Train and evaluate models that estimate the expected change in offensive performance, from which breakout and decline risk are derived, in a forward-looking setting  
- Produce player-level predictions suitable for ranking, comparison, and downstream visualization  

The outputs of this notebook are designed to support both analytical interpretation and integration into a Power BI dashboard for player evaluation.

## Modeling Philosophy

This notebook follows a set of guiding principles:

- **Longitudinal perspective**: Players are modeled across multiple seasons rather than treated as static, single-season observations.  
- **Process over results**: Underlying skill indicators are used as predictive features, while outcome-based statistics are excluded from model inputs.  
- **Interpretability first**: Models are selected to balance predictive signal with clear, explainable drivers of improvement or decline.  
- **Reproducibility**: All transformations, assumptions, and evaluation steps are made explicit and applied consistently across seasons.  

By adhering to these principles, the analysis follows the way performance forecasting is typically approached in applied baseball analytics.

## Notebook Scope

This notebook begins with a cleaned, joined player–season dataset exported from the staging process. No additional data ingestion or SQL operations are performed here.

The workflow focuses on feature construction, trajectory labeling, and forward-looking modeling, followed by result inspection and interpretation. The notebook concludes by producing player-level outputs suitable for downstream visualization and comparative analysis.

## Load and Prepare Player–Season Data

In [3]:
import pandas as pd

df = pd.read_csv("../notebooks/data/processed/PlayerSeasonFull.csv")
df.head()

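| | IDfg | Name | Season | Team | Age | G | PA | wRC+ | WAR | OPS | ... | K% | BB/K | O-Swing% | Z-Contact% | SwStr% | EV | LA | HardHit% | Barrel% | maxEV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15640 | Aaron Judge | 2022 | NYY | 30 | 157 | 696 | 206 | 11.1 | 1.111 | ... | 0.251 | 0.63 | 0.268 | 0.852 | 0.118 | 95.8 | 14.9 | 0.611 | 0.262 | 118.4 |
| 1 | 9777 | Nolan Arenado | 2022 | STL | 31 | 148 | 620 | 149 | 7.2 | 0.891 | ... | 0.116 | 0.72 | 0.361 | 0.908 | 0.086 | 88.8 | 21.7 | 0.389 | 0.082 | 111.4 |
| 2 | 11493 | Manny Machado | 2022 | SDP | 29 | 150 | 644 | 152 | 7.1 | 0.898 | ... | 0.207 | 0.47 | 0.342 | 0.856 | 0.118 | 91.5 | 16.0 | 0.49 | 0.098 | 112.4 |
| 3 | 5417 | Jose Altuve | 2022 | HOU | 32 | 141 | 604 | 164 | 6.9 | 0.921 | ... | 0.144 | 0.76 | 0.314 | 0.91 | 0.068 | 85.9 | 16.1 | 0.297 | 0.077 | 109.8 |
| 4 | 9218 | Paul Goldschmidt | 2022 | STL | 34 | 151 | 651 | 175 | 6.8 | 0.981 | ... | 0.217 | 0.56 | 0.276 | 0.818 | 0.099 | 90.7 | 15.7 | 0.472 | 0.115 | 112.3 |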


In [4]:
df.shape

(538, 23)

In [5]:
df["Season"].value_counts().sort_index()

Season
2022    130
2023    134
2024    129
2025    145
Name: count, dtype: int64

## Column Overview
We inspect the columns that remain after the upstream SQL queries and joins.

In [6]:
df.columns.tolist()

['IDfg',
 'Name',
 'Season',
 'Team',
 'Age',
 'G',
 'PA',
 'wRC+',
 'WAR',
 'OPS',
 'wOBA',
 'xwOBA',
 'BB%',
 'K%',
 'BB/K',
 'O-Swing%',
 'Z-Contact%',
 'SwStr%',
 'EV',
 'LA',
 'HardHit%',
 'Barrel%',
 'maxEV']
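
One way to make the "process over results" principle explicit is to group the columns above by role and guard against outcome metrics slipping into the feature set. The grouping below is an assumption made for illustration: in particular, `xwOBA` is left undecided because it is derived from contact quality but is expressed on an outcome scale, and `check_feature_set` is a hypothetical helper.

```python
# Column names come from df.columns; the grouping itself is an assumption.
ID_COLS = ["IDfg", "Name", "Season", "Team"]
CONTEXT_COLS = ["Age", "G", "PA"]
OUTCOME_COLS = ["wRC+", "WAR", "OPS", "wOBA"]
PROCESS_COLS = [
    "BB%", "K%", "BB/K", "O-Swing%", "Z-Contact%", "SwStr%",
    "EV", "LA", "HardHit%", "Barrel%", "maxEV",
]
UNDECIDED_COLS = ["xwOBA"]  # left out until its role is decided

def check_feature_set(feature_cols):
    """Raise if any outcome-based column is used as a model input."""
    leaked = sorted(set(feature_cols) & set(OUTCOME_COLS))
    if leaked:
        raise ValueError(f"Outcome columns used as features: {leaked}")
    return list(feature_cols)

# Process metrics plus age pass the check; adding "wRC+" would raise.
features = check_feature_set(PROCESS_COLS + ["Age"])
```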

## Core Fields Sanity Check

We briefly inspect a small subset of identifier and performance fields to confirm data consistency.

In [7]:
df[["IDfg", "Name", "Season", "Age", "PA", "wRC+", "WAR"]].head(10)

| | IDfg | Name | Season | Age | PA | wRC+ | WAR |
|---|---|---|---|---|---|---|---|
| 0 | 15640 | Aaron Judge | 2022 | 30 | 696 | 206 | 11.1 |
| 1 | 9777 | Nolan Arenado | 2022 | 31 | 620 | 149 | 7.2 |
| 2 | 11493 | Manny Machado | 2022 | 29 | 644 | 152 | 7.1 |
| 3 | 5417 | Jose Altuve | 2022 | 32 | 604 | 164 | 6.9 |
| 4 | 9218 | Paul Goldschmidt | 2022 | 34 | 651 | 175 | 6.8 |
| 5 | 5361 | Freddie Freeman | 2022 | 32 | 708 | 157 | 6.8 |
| 6 | 11739 | J.T. Realmuto | 2022 | 31 | 562 | 129 | 6.7 |
| 7 | 18314 | Dansby Swanson | 2022 | 28 | 696 | 117 | 6.6 |
| 8 | 19556 | Yordan Alvarez | 2022 | 25 | 561 | 185 | 6.4 |
| 9 | 16252 | Trea Turner | 2022 | 29 | 708 | 128 | 6.4 |
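
Beyond eyeballing the first rows, a few programmatic checks can confirm consistency. The sketch below assumes that `(IDfg, Season)` should uniquely identify a row and that `PA` should be positive; `sanity_report` is a hypothetical helper, and the small frame reuses two rows from the output above rather than the full dataset.

```python
import pandas as pd

def sanity_report(df: pd.DataFrame, key=("IDfg", "Season")) -> dict:
    """Summarize basic consistency checks for a player-season table."""
    key = list(key)
    return {
        "n_rows": len(df),
        # Rows that repeat an earlier (player, season) key.
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
        # Total missing cells across all columns.
        "missing_values": int(df.isna().sum().sum()),
        # Plate appearances should be strictly positive.
        "nonpositive_pa": int((df["PA"] <= 0).sum()),
    }

# Two rows copied from the displayed output, for illustration.
sample = pd.DataFrame({
    "IDfg": [15640, 9777],
    "Name": ["Aaron Judge", "Nolan Arenado"],
    "Season": [2022, 2022],
    "PA": [696, 620],
    "wRC+": [206, 149],
})
report = sanity_report(sample)

# Appending a repeated row shows how a duplicate key would be flagged.
with_dup = pd.concat([sample, sample.iloc[[0]]], ignore_index=True)
dup_report = sanity_report(with_dup)
```

On the real data, the same call would be `sanity_report(df)`.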
