# 05 - 2026 Fantasy Predictions

Generate 2026 fantasy point predictions:

1. **Retrain models** on all data (2016-2025)
2. **Predict rate stats** (Fpoints/PA, Fpoints/IP) for 2026 using 2025 features
3. **Apply external projections** (PA/IP/W/L/SV/HLD) to get total fantasy points
4. **Generate final rankings**

In [54]:
import sys
import os

# Set working directory to project root
if 'notebooks' in os.getcwd():
    os.chdir('..')
sys.path.insert(0, os.getcwd())

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import shap

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error

from config.settings import PROCESSED_DATA_DIR, MODELS_DIR, RAW_DATA_DIR, RANDOM_STATE
from config.scoring import PITCHER_SCORING_TEAM

import warnings
warnings.filterwarnings('ignore')

np.random.seed(RANDOM_STATE)
print(f"Project root: {os.getcwd()}")

Project root: /Users/matthewgillies/mlb-fantasy-2026


---
## Part 1: Batter Predictions
---

### 1.1 Load Data & Retrain Model

In [55]:
# Load processed batter data
batters = pd.read_csv(f"{PROCESSED_DATA_DIR}/batters_processed.csv")
print(f"Loaded {len(batters)} batter-seasons")
print(f"Years: {batters['Season'].min()} - {batters['Season'].max()}")

# Feature columns (lag features only - these use previous year data to predict current year)
feature_cols_bat = [c for c in batters.columns if '_lag' in c]
print(f"\nFeatures: {len(feature_cols_bat)}")

Loaded 3578 batter-seasons
Years: 2016 - 2025

Features: 50


In [56]:
# Train on ALL data (2016-2025) for final model
# We use all years since we're predicting 2026, not evaluating
train_df_bat = batters.copy()

X_train_bat = train_df_bat[feature_cols_bat].copy()
y_train_bat = train_df_bat['Fpoints_PA'].copy()

# Fill NaN with median
train_medians_bat = X_train_bat.median()
X_train_bat = X_train_bat.fillna(train_medians_bat)

print(f"Training set: {len(X_train_bat)} rows (all years)")
print(f"Features: {X_train_bat.shape[1]}")

Training set: 3578 rows (all years)
Features: 50


In [57]:
# Train Random Forest (best performer from evaluation)
# Using similar params to what worked well in evaluation
rf_bat = RandomForestRegressor(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=4,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

print("Training batter model on all data...")
rf_bat.fit(X_train_bat, y_train_bat)
print("Done!")

# Store for later use
batter_model = rf_bat

Training batter model on all data...
Done!


### 1.2 Prepare 2026 Prediction Data

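For 2026 predictions, each player's 2025 stats become the `_lag1` features, and the `_lag1` values already on the 2025 row (2024 stats) become the `_lag2` features.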
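
The shift can be sketched on a toy frame. This is only an illustration under stated assumptions: the `wOBA` column and its values are made up, and `build_next_season_features` is a hypothetical helper, not a function from this project.

```python
import numpy as np
import pandas as pd

def build_next_season_features(last_season, base_features, feature_cols):
    """Shift one season forward: current value -> _lag1, old _lag1 -> _lag2."""
    X = pd.DataFrame(index=last_season.index)
    for feat in base_features:
        lag1 = f"{feat}_lag1"
        # Missing source columns become NaN so they can be imputed later
        X[lag1] = last_season[feat] if feat in last_season.columns else np.nan
        X[f"{feat}_lag2"] = last_season[lag1] if lag1 in last_season.columns else np.nan
    # Match the column order the model was trained on
    return X.reindex(columns=feature_cols)

# Hypothetical 2025 rows: current-year stat plus its existing lag1 (2024 value)
toy_2025 = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "wOBA": [0.350, 0.310],
    "wOBA_lag1": [0.340, np.nan],  # Player B has no 2024 season
})

X_2026 = build_next_season_features(toy_2025, ["wOBA"], ["wOBA_lag1", "wOBA_lag2"])
print(X_2026)
```

Player B's missing `wOBA_lag2` stays NaN here; in the notebook such gaps are filled with the training medians before prediction.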

In [58]:
# Get 2025 player data to use as features for 2026 predictions
# The 2025 row already has lag1 features (which are 2024 data)
# For 2026, we need 2025 data as lag1

# Get base features from 2025 (these become lag1 for 2026)
batters_2025 = batters[batters['Season'] == 2025].copy()
print(f"Players with 2025 data: {len(batters_2025)}")

# The feature columns we need for 2026 prediction are the _lag1 columns
# But we need to populate them with 2025 current-year values

# Get the base feature names (without _lag1 suffix)
base_features = [c.replace('_lag1', '') for c in feature_cols_bat if '_lag1' in c]
print(f"Base features needed: {len(base_features)}")

Players with 2025 data: 376
Base features needed: 25


In [59]:
# Create 2026 prediction dataframe
# Use 2025 actual values as lag1, and 2025 lag1 as lag2

pred_2026_bat = batters_2025[['IDfg', 'Name', 'Team', 'Age', 'PA', 'Fpoints_PA']].copy()
pred_2026_bat = pred_2026_bat.rename(columns={'Fpoints_PA': 'Fpoints_PA_2025', 'PA': 'PA_2025', 'Age': 'Age_2025'})

# Build feature matrix for 2026 prediction
X_2026_bat = pd.DataFrame(index=batters_2025.index)

for feat in base_features:
    # lag1 for 2026 = current value in 2025
    if feat in batters_2025.columns:
        X_2026_bat[f'{feat}_lag1'] = batters_2025[feat].values
    else:
        X_2026_bat[f'{feat}_lag1'] = np.nan
    
    # lag2 for 2026 = lag1 value in 2025 (which was 2024 data)
    lag1_col = f'{feat}_lag1'
    if lag1_col in batters_2025.columns:
        X_2026_bat[f'{feat}_lag2'] = batters_2025[lag1_col].values
    else:
        X_2026_bat[f'{feat}_lag2'] = np.nan

# Ensure column order matches training
X_2026_bat = X_2026_bat[feature_cols_bat]

# Fill missing with training medians
X_2026_bat = X_2026_bat.fillna(train_medians_bat)

print(f"2026 prediction matrix: {X_2026_bat.shape}")
print(f"NaN count: {X_2026_bat.isna().sum().sum()}")

2026 prediction matrix: (376, 50)
NaN count: 0


### 1.3 Generate Rate Predictions

In [60]:
# Predict Fpoints/PA for 2026
pred_2026_bat['Predicted_Fpoints_PA'] = batter_model.predict(X_2026_bat)

# Sort by predicted rate
pred_2026_bat = pred_2026_bat.sort_values('Predicted_Fpoints_PA', ascending=False).reset_index(drop=True)

print("\n=== 2026 Batter Rate Predictions (Fpoints/PA) ===")
print(f"Players: {len(pred_2026_bat)}")
print(f"\nTop 25 by predicted Fpoints/PA:")
display_cols = ['Name', 'Team', 'Age_2025', 'PA_2025', 'Fpoints_PA_2025', 'Predicted_Fpoints_PA']
print(pred_2026_bat[display_cols].head(25).to_string(index=False))


=== 2026 Batter Rate Predictions (Fpoints/PA) ===
Players: 376

Top 25 by predicted Fpoints/PA:
                 Name  Team  Age_2025  PA_2025  Fpoints_PA_2025  Predicted_Fpoints_PA
            Juan Soto   NYM        26      715         0.777622              0.771566
          Aaron Judge   NYY        33      679         0.882180              0.767631
        Shohei Ohtani   LAD        30      727         0.784044              0.709556
         Jose Ramirez   CLE        32      673         0.775632              0.707757
       Bobby Witt Jr.   KCR        25      687         0.671033              0.674176
       Yordan Alvarez   HOU        28      199         0.557789              0.669735
          Kyle Tucker   CHC        28      597         0.703518              0.662984
          Ketel Marte   ARI        31      556         0.705036              0.634366
         Corey Seager   TEX        31      445         0.606742              0.631847
Vladimir Guerrero Jr.   TOR        26      

### 1.4 SHAP Explainability for 2026 Batter Predictions

In [61]:
# Create SHAP explainer for batter model
batter_explainer = shap.TreeExplainer(batter_model)

def explain_batter_2026(player_name):
    """
    Show SHAP waterfall plot explaining a batter's 2026 prediction.
    
    Args:
        player_name: Player name (partial match supported)
    """
    # Find player in 2026 predictions
    mask = pred_2026_bat['Name'].str.contains(player_name, case=False)
    player_df = pred_2026_bat[mask]
    
    if len(player_df) == 0:
        print(f"Player '{player_name}' not found in 2026 predictions")
        similar = pred_2026_bat[pred_2026_bat['Name'].str.contains(player_name[:3], case=False)]['Name'].head(5)
        if len(similar) > 0:
            print(f"Similar names: {similar.tolist()}")
        return
    
    if len(player_df) > 1:
        print(f"Multiple matches: {player_df['Name'].tolist()}")
        player_df = player_df.iloc[[0]]
    
    player_row = player_df.iloc[0]
    
    # Get the feature row for this player from X_2026_bat.
    # X_2026_bat shares its index with batters_2025; match on IDfg rather than
    # Name so that two players with the same name cannot be confused.
    orig_idx = batters_2025[batters_2025['IDfg'] == player_row['IDfg']].index[0]
    X_player = X_2026_bat.loc[[orig_idx]]
    
    # Get prediction
    pred = player_row['Predicted_Fpoints_PA']
    
    print(f"\n=== 2026 Prediction: {player_row['Name']} ===")
    print(f"Team: {player_row['Team']}")
    print(f"2025 Stats: {player_row['PA_2025']:.0f} PA, {player_row['Fpoints_PA_2025']:.3f} Fpoints/PA")
    print(f"\nPredicted 2026 Fpoints/PA: {pred:.3f}")
    
    # Calculate SHAP values
    player_shap = batter_explainer(X_player)
    player_shap.feature_names = feature_cols_bat
    player_shap.data = X_player.values
    
    # Plot
    plt.figure(figsize=(10, 8))
    shap.plots.waterfall(player_shap[0], max_display=15, show=False)
    plt.title(f"SHAP: What's driving {player_row['Name']}'s 2026 prediction?")
    plt.tight_layout()
    plt.show()

print("Batter SHAP explainer ready - use explain_batter_2026('Player Name')")

Batter SHAP explainer ready - use explain_batter_2026('Player Name')


In [85]:
# Example: explain a batter's 2026 prediction
explain_batter_2026("Jac Caglianone")

Player 'Jac Caglianone' not found in 2026 predictions
Similar names: ['Jackson Merrill', 'Jackson Chourio', 'Jacob Wilson', 'Jackson Holliday', 'Jacob Young']


In [63]:
# Save rate predictions
os.makedirs('predictions', exist_ok=True)
pred_2026_bat.to_csv('predictions/batters_2026_rate_predictions.csv', index=False)
print("Saved to predictions/batters_2026_rate_predictions.csv")

Saved to predictions/batters_2026_rate_predictions.csv


---
## Part 2: Pitcher Predictions
---

### 2.1 Load Data & Retrain Model

In [64]:
# Load processed pitcher data
pitchers = pd.read_csv(f"{PROCESSED_DATA_DIR}/pitchers_processed.csv")
print(f"Loaded {len(pitchers)} pitcher-seasons")
print(f"Years: {pitchers['Season'].min()} - {pitchers['Season'].max()}")

# Feature columns
feature_cols_pit = [c for c in pitchers.columns if '_lag' in c]

# Also include arsenal columns if present
arsenal_cols = [c for c in pitchers.columns if c.startswith(('ff_', 'si_', 'sl_', 'ch_', 'cu_'))]
feature_cols_pit = feature_cols_pit + [c for c in arsenal_cols if c not in feature_cols_pit]

print(f"\nFeatures: {len(feature_cols_pit)}")

Loaded 3985 pitcher-seasons
Years: 2016 - 2025

Features: 119


In [65]:
# Train on ALL data for final model
train_df_pit = pitchers.copy()

X_train_pit = train_df_pit[feature_cols_pit].copy()
y_train_pit = train_df_pit['Fpoints_IP'].copy()

# Fill NaN with median
train_medians_pit = X_train_pit.median()
X_train_pit = X_train_pit.fillna(train_medians_pit)

print(f"Training set: {len(X_train_pit)} rows (all years)")
print(f"Features: {X_train_pit.shape[1]}")

Training set: 3985 rows (all years)
Features: 119


In [66]:
# Train XGBoost (best performer from evaluation)
xgb_pit = XGBRegressor(
    n_estimators=200,
    max_depth=7,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.6,
    gamma=5,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbosity=0
)

print("Training pitcher model on all data...")
xgb_pit.fit(X_train_pit, y_train_pit)
print("Done!")

# Store for later use
pitcher_model = xgb_pit

Training pitcher model on all data...
Done!


### 2.2 Prepare 2026 Prediction Data

In [67]:
# Get 2025 pitcher data
pitchers_2025 = pitchers[pitchers['Season'] == 2025].copy()
print(f"Pitchers with 2025 data: {len(pitchers_2025)}")

# Get base feature names
base_features_pit = list(set([c.replace('_lag1', '').replace('_lag2', '') 
                               for c in feature_cols_pit if '_lag' in c]))
print(f"Base features: {len(base_features_pit)}")

Pitchers with 2025 data: 423
Base features: 56


In [68]:
# Create 2026 prediction dataframe
pred_2026_pit = pitchers_2025[['IDfg', 'Name', 'Team', 'Age', 'IP', 'GS', 'G', 
                                'Fpoints_IP', 'W', 'L', 'SV', 'HLD']].copy()
pred_2026_pit = pred_2026_pit.rename(columns={
    'Fpoints_IP': 'Fpoints_IP_2025', 
    'IP': 'IP_2025', 
    'Age': 'Age_2025',
    'W': 'W_2025', 'L': 'L_2025', 'SV': 'SV_2025', 'HLD': 'HLD_2025'
})
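# Role heuristic: any pitcher with at least one 2025 start is labeled SP,
# so relievers who made a spot or opener start are also labeled SP
pred_2026_pit['Role'] = pred_2026_pit['GS'].apply(lambda x: 'SP' if x > 0 else 'RP')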

# Build feature matrix
X_2026_pit = pd.DataFrame(index=pitchers_2025.index)

for feat in base_features_pit:
    # lag1 for 2026 = current value in 2025
    if feat in pitchers_2025.columns:
        X_2026_pit[f'{feat}_lag1'] = pitchers_2025[feat].values
    else:
        X_2026_pit[f'{feat}_lag1'] = np.nan
    
    # lag2 for 2026 = lag1 value in 2025
    lag1_col = f'{feat}_lag1'
    if lag1_col in pitchers_2025.columns:
        X_2026_pit[f'{feat}_lag2'] = pitchers_2025[lag1_col].values
    else:
        X_2026_pit[f'{feat}_lag2'] = np.nan

# Add arsenal columns (not lagged in the processed data, so the 2025 values stand in for 2026)
for col in arsenal_cols:
    if col in pitchers_2025.columns:
        X_2026_pit[col] = pitchers_2025[col].values
    else:
        X_2026_pit[col] = np.nan

# Ensure column order matches training
X_2026_pit = X_2026_pit.reindex(columns=feature_cols_pit)

# Fill missing with training medians
X_2026_pit = X_2026_pit.fillna(train_medians_pit)

print(f"2026 prediction matrix: {X_2026_pit.shape}")
print(f"NaN count: {X_2026_pit.isna().sum().sum()}")

2026 prediction matrix: (423, 119)
NaN count: 0


### 2.3 Generate Rate Predictions

In [69]:
# Predict Fpoints/IP for 2026
pred_2026_pit['Predicted_Fpoints_IP'] = pitcher_model.predict(X_2026_pit)

# Sort by predicted rate
pred_2026_pit = pred_2026_pit.sort_values('Predicted_Fpoints_IP', ascending=False).reset_index(drop=True)

print("\n=== 2026 Pitcher Rate Predictions (Fpoints/IP) ===")
print(f"Pitchers: {len(pred_2026_pit)}")
print(f"\nTop 25 by predicted Fpoints/IP (all):")
display_cols = ['Name', 'Team', 'Role', 'IP_2025', 'Fpoints_IP_2025', 'Predicted_Fpoints_IP']
print(pred_2026_pit[display_cols].head(25).to_string(index=False))


=== 2026 Pitcher Rate Predictions (Fpoints/IP) ===
Pitchers: 423

Top 25 by predicted Fpoints/IP (all):
            Name  Team Role  IP_2025  Fpoints_IP_2025  Predicted_Fpoints_IP
    Mason Miller - - -   RP     61.2         3.196078              2.817536
 Aroldis Chapman   BOS   RP     61.1         3.425532              2.634882
      Edwin Diaz   NYM   RP     66.1         3.242057              2.593045
     Griffin Jax - - -   SP     66.0         2.272727              2.591288
   Shohei Ohtani   LAD   SP     47.0         2.638298              2.516919
  Devin Williams   NYY   RP     62.0         2.258065              2.503018
      Josh Hader   HOU   RP     52.2         3.134100              2.498071
Jeremiah Estrada   SDP   RP     73.0         2.547945              2.488457
 Garrett Crochet   BOS   SP    205.1         2.639200              2.482658
    Tarik Skubal   DET   SP    195.1         2.851358              2.480192
      Cade Smith   CLE   RP     73.2         2.754098      

In [70]:
# Show SP and RP separately
print("\n=== Top 20 Starting Pitchers ===")
sp_preds = pred_2026_pit[pred_2026_pit['Role'] == 'SP']
print(sp_preds[display_cols].head(20).to_string(index=False))

print("\n=== Top 20 Relief Pitchers ===")
rp_preds = pred_2026_pit[pred_2026_pit['Role'] == 'RP']
print(rp_preds[display_cols].head(20).to_string(index=False))


=== Top 20 Starting Pitchers ===
            Name  Team Role  IP_2025  Fpoints_IP_2025  Predicted_Fpoints_IP
     Griffin Jax - - -   SP     66.0         2.272727              2.591288
   Shohei Ohtani   LAD   SP     47.0         2.638298              2.516919
 Garrett Crochet   BOS   SP    205.1         2.639200              2.482658
    Tarik Skubal   DET   SP    195.1         2.851358              2.480192
     Cole Ragans   KCR   SP     61.2         2.362745              2.449911
   Hunter Greene   CIN   SP    107.2         2.673507              2.422286
    Zack Wheeler   PHI   SP    149.2         2.765416              2.414526
    Kyle Bradish   BAL   SP     32.0         2.875000              2.314774
     Dylan Cease   SDP   SP    168.0         1.940476              2.302171
      Chris Sale   ATL   SP    125.2         2.672524              2.287775
    Jacob deGrom   TEX   SP    172.2         2.488966              2.267852
   Logan Gilbert   SEA   SP    131.0         2.526718 

### 2.4 SHAP Explainability for 2026 Pitcher Predictions

In [71]:
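# Create SHAP explainer for pitcher model
# Note: XGBoost has compatibility issues with some SHAP versions, so a separate
# LightGBM model is trained on the same data and used only for SHAP explanations.
# It is a stand-in for the ranking model: its predictions can differ from pitcher_model's.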
lgb_pit_shap = LGBMRegressor(
    n_estimators=200,
    max_depth=10,
    learning_rate=0.05,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=-1
)
lgb_pit_shap.fit(X_train_pit, y_train_pit)
pitcher_explainer = shap.TreeExplainer(lgb_pit_shap)

def explain_pitcher_2026(player_name):
    """
    Show SHAP waterfall plot explaining a pitcher's 2026 prediction.
    
    Args:
        player_name: Player name (partial match supported)
    """
    # Find player in 2026 predictions
    mask = pred_2026_pit['Name'].str.contains(player_name, case=False)
    player_df = pred_2026_pit[mask]
    
    if len(player_df) == 0:
        print(f"Player '{player_name}' not found in 2026 predictions")
        similar = pred_2026_pit[pred_2026_pit['Name'].str.contains(player_name[:3], case=False)]['Name'].head(5)
        if len(similar) > 0:
            print(f"Similar names: {similar.tolist()}")
        return
    
    if len(player_df) > 1:
        print(f"Multiple matches: {player_df['Name'].tolist()}")
        player_df = player_df.iloc[[0]]
    
    player_row = player_df.iloc[0]
    
    # Get the feature row for this player from X_2026_pit.
    # Match on IDfg rather than Name so that two pitchers with the same name
    # cannot be confused.
    orig_idx = pitchers_2025[pitchers_2025['IDfg'] == player_row['IDfg']].index[0]
    X_player = X_2026_pit.loc[[orig_idx]]
    
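    # Get prediction from the LightGBM stand-in so the number matches the SHAP
    # waterfall; the ranking value from pitcher_model (XGBoost) is printed alongside
    pred = lgb_pit_shap.predict(X_player)[0]
    
    role = player_row['Role']
    
    print(f"\n=== 2026 Prediction: {player_row['Name']} ({role}) ===")
    print(f"Team: {player_row['Team']}")
    print(f"2025 Stats: {player_row['IP_2025']:.1f} IP, {player_row['Fpoints_IP_2025']:.3f} Fpoints/IP")
    print(f"\nPredicted 2026 Fpoints/IP (XGBoost, used for rankings): {player_row['Predicted_Fpoints_IP']:.3f}")
    print(f"Predicted 2026 Fpoints/IP (LightGBM, explained by SHAP): {pred:.3f}")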
    
    # Calculate SHAP values
    player_shap = pitcher_explainer(X_player)
    player_shap.feature_names = feature_cols_pit
    player_shap.data = X_player.values
    
    # Plot
    plt.figure(figsize=(10, 8))
    shap.plots.waterfall(player_shap[0], max_display=15, show=False)
    plt.title(f"SHAP: What's driving {player_row['Name']}'s 2026 prediction?")
    plt.tight_layout()
    plt.show()

print("Pitcher SHAP explainer ready - use explain_pitcher_2026('Player Name')")

Pitcher SHAP explainer ready - use explain_pitcher_2026('Player Name')


In [86]:
# Example: explain a pitcher's 2026 prediction
explain_pitcher_2026("Chase Burns")

Player 'Chase Burns' not found in 2026 predictions
Similar names: ['Aroldis Chapman', 'Michael King', 'Michael Wacha', 'Michael McGreevy', 'Michael Soroka']


In [73]:
# Save rate predictions
pred_2026_pit.to_csv('predictions/pitchers_2026_rate_predictions.csv', index=False)
print("Saved to predictions/pitchers_2026_rate_predictions.csv")

Saved to predictions/pitchers_2026_rate_predictions.csv


---
## Part 3: Apply External Projections

To convert rate predictions to total fantasy points, we need:
- **Batters**: PA projections
- **Pitchers**: IP, W, L, SV, HLD projections
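
The conversion is plain arithmetic: batters get rate times PA, and pitchers get rate times IP plus separately scored team-dependent stats. A minimal sketch follows, with the caveat that the weights are assumptions: the real values live in `config.scoring.PITCHER_SCORING_TEAM` and are not printed in this notebook. `W = 2` and `L = -2` happen to be consistent with the starter team-point totals printed later in the notebook, while the `SV` and `HLD` weights are invented for illustration.

```python
# Hypothetical scoring weights, for illustration only (see caveat above)
ASSUMED_TEAM_SCORING = {"W": 2, "L": -2, "SV": 5, "HLD": 2}

def batter_total(rate_per_pa, projected_pa):
    """Total batter points = predicted Fpoints/PA * projected PA."""
    return rate_per_pa * projected_pa

def pitcher_total(rate_per_ip, proj, scoring=ASSUMED_TEAM_SCORING):
    """Total pitcher points = skill points (rate * IP) + team points (W/L/SV/HLD)."""
    skill = rate_per_ip * proj["IP"]
    team = sum(proj[stat] * weight for stat, weight in scoring.items())
    return skill + team

# Made-up inputs: a 0.70 Fpoints/PA hitter over 650 PA,
# and a 2.4 Fpoints/IP starter projected for 180 IP, 12 W, 8 L
starter = {"IP": 180, "W": 12, "L": 8, "SV": 0, "HLD": 0}
print(round(batter_total(0.70, 650), 1))
print(round(pitcher_total(2.4, starter), 1))
```

Keeping skill and team points separate mirrors the notebook's split between what the model predicts (per-inning performance) and what the external projection system supplies (playing time and decisions).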

### Projection Files Used:
- `data/projections/batx_hitters_2026.csv` - BatX batter projections (PA)
- `data/projections/oopsy_pitcher_2026.csv` - OOPSY pitcher projections (IP, W, L, SV, HLD)

Both files use `PlayerId` which maps to FanGraphs `IDfg`.
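
Joining on the ID takes a little cleaning, because projection files can contain non-numeric IDs (for example the `sa...` IDs used for minor leaguers). The sketch below shows one way to do the join on made-up data; the `PA_Source` column and the `validate` check are additions for illustration and are not in the notebook's own merge.

```python
import pandas as pd

# Made-up model output keyed by FanGraphs ID
preds = pd.DataFrame({
    "IDfg": [101, 102, 103],
    "Name": ["Player A", "Player B", "Player C"],
    "PA_2025": [600, 450, 300],
})

# Made-up projection file; the 'sa...' row mimics a minor-league ID
proj = pd.DataFrame({
    "PlayerId": ["101", "sa3063134", "103"],
    "PA": [640, 500, 350],
})

proj = proj.rename(columns={"PlayerId": "IDfg", "PA": "Projected_PA"})
proj["IDfg"] = pd.to_numeric(proj["IDfg"], errors="coerce")  # non-numeric -> NaN
proj = proj.dropna(subset=["IDfg"])
proj["IDfg"] = proj["IDfg"].astype(int)

# validate raises if either side has duplicate IDs, which would duplicate rows
merged = preds.merge(proj, on="IDfg", how="left", validate="one_to_one")

# Record which rows fell back to last season's PA before filling
merged["PA_Source"] = merged["Projected_PA"].notna().map(
    {True: "projection", False: "2025 fallback"}
)
merged["Projected_PA"] = merged["Projected_PA"].fillna(merged["PA_2025"])
print(merged)
```

Tracking the fallback rows makes it easier to spot players whose totals rest on last year's playing time rather than a real projection.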

---

In [74]:
# Create projections directory
os.makedirs('data/projections', exist_ok=True)

# List available projection files (the directory is guaranteed to exist after makedirs)
proj_files = os.listdir('data/projections')
print("Available projection files:")
for f in proj_files:
    print(f"  - {f}")
    
if not proj_files:
    print("  (none found - download from FanGraphs)")

Available projection files:
  - oopsy_pitcher_2026.csv
  - batx_hitters_2026.csv


### 3.1 Load & Apply Batter Projections

In [75]:
# Load batter projections
BATTER_PROJ_FILE = 'data/projections/batx_hitters_2026.csv'

if os.path.exists(BATTER_PROJ_FILE):
    batter_proj = pd.read_csv(BATTER_PROJ_FILE)
    print(f"Loaded batter projections: {len(batter_proj)} players")
    print(f"Columns: {batter_proj.columns.tolist()}")
else:
    print(f"File not found: {BATTER_PROJ_FILE}")
    print("Using 2025 PA as fallback projection")
    batter_proj = None

Loaded batter projections: 667 players
Columns: ['Name', 'Team', 'G', 'PA', 'AB', 'H', '1B', '2B', '3B', 'HR', 'R', 'RBI', 'BB', 'IBB', 'SO', 'HBP', 'SF', 'SH', 'GDP', 'SB', 'CS', 'AVG', 'BB%', 'K%', 'BB/K', 'OBP', 'SLG', 'wOBA', 'OPS', 'ISO', 'Spd', 'BABIP', 'UBR', 'wSB', 'wRC', 'wRAA', 'wRC+', 'BsR', 'Fld', 'Off', 'Def', 'WAR', 'ADP', 'InterSD', 'InterSK', 'IntraSD', 'Vol', 'Skew', 'Dim', 'FPTS', 'FPTS/G', 'SPTS', 'SPTS/G', 'P10', 'P20', 'P30', 'P40', 'P50', 'P60', 'P70', 'P80', 'P90', 'TT10', 'TT20', 'TT30', 'TT40', 'TT50', 'TT60', 'TT70', 'TT80', 'TT90', 'NameASCII', 'PlayerId', 'MLBAMID']


In [None]:
# Merge projections with predictions
if batter_proj is not None:
    # BatX uses 'PlayerId' which maps to IDfg
    id_col = 'PlayerId'
    
    # Get PA and updated Team from BatX projections
    proj_subset = batter_proj[[id_col, 'PA', 'Team', 'Name']].copy()
    proj_subset.columns = ['IDfg', 'Projected_PA', 'Proj_Team', 'Proj_Name']
    
    # Convert to numeric, coercing non-numeric IDs (like 'sa3063134' for minor leaguers) to NaN
    proj_subset['IDfg'] = pd.to_numeric(proj_subset['IDfg'], errors='coerce')
    
    # Drop rows with non-numeric IDs
    proj_subset = proj_subset.dropna(subset=['IDfg'])
    proj_subset['IDfg'] = proj_subset['IDfg'].astype(int)
    
    print(f"BatX projections with valid IDs: {len(proj_subset)}")
    
    pred_2026_bat = pred_2026_bat.merge(
        proj_subset[['IDfg', 'Projected_PA', 'Proj_Team']], 
        on='IDfg', how='left'
    )
    
    # Check how many matched
    matched = pred_2026_bat['Projected_PA'].notna().sum()
    print(f"Matched {matched} of {len(pred_2026_bat)} players with BatX projections")
    
    # Fill missing PA with 2025 PA
    pred_2026_bat['Projected_PA'] = pred_2026_bat['Projected_PA'].fillna(pred_2026_bat['PA_2025'])
    
    # Update Team with projected team where available (keep old team for fallback)
    pred_2026_bat['Team_2025'] = pred_2026_bat['Team']  # Backup old team
    pred_2026_bat['Team'] = pred_2026_bat['Proj_Team'].fillna(pred_2026_bat['Team'])
    pred_2026_bat = pred_2026_bat.drop('Proj_Team', axis=1)
    
    # Show team changes
    team_changed = pred_2026_bat[pred_2026_bat['Team'] != pred_2026_bat['Team_2025']]
    if len(team_changed) > 0:
        print(f"\nTeam changes (BatX vs 2025): {len(team_changed)}")
        print(team_changed[['Name', 'Team_2025', 'Team']].head(10).to_string(index=False))
else:
    # Use 2025 PA as projection
    pred_2026_bat['Projected_PA'] = pred_2026_bat['PA_2025']
    pred_2026_bat['Team_2025'] = pred_2026_bat['Team']

# Calculate total projected fantasy points
pred_2026_bat['Projected_Fpoints'] = pred_2026_bat['Predicted_Fpoints_PA'] * pred_2026_bat['Projected_PA']

# Sort by total
pred_2026_bat = pred_2026_bat.sort_values('Projected_Fpoints', ascending=False).reset_index(drop=True)
pred_2026_bat['Rank'] = range(1, len(pred_2026_bat) + 1)

print("\n=== 2026 Batter Total Projections ===")
display_cols = ['Rank', 'Name', 'Team', 'Projected_PA', 'Predicted_Fpoints_PA', 'Projected_Fpoints']
print(pred_2026_bat[display_cols].head(30).to_string(index=False))

### 3.2 Load & Apply Pitcher Projections

In [77]:
# Load pitcher projections
PITCHER_PROJ_FILE = 'data/projections/oopsy_pitcher_2026.csv'

if os.path.exists(PITCHER_PROJ_FILE):
    pitcher_proj = pd.read_csv(PITCHER_PROJ_FILE)
    print(f"Loaded pitcher projections: {len(pitcher_proj)} players")
    print(f"Columns: {pitcher_proj.columns.tolist()}")
else:
    print(f"File not found: {PITCHER_PROJ_FILE}")
    print("Using 2025 stats as fallback projection")
    pitcher_proj = None

Loaded pitcher projections: 4333 players
Columns: ['Name', 'Team', 'W', 'L', 'QS', 'ERA', 'G', 'GS', 'SV', 'HLD', 'BS', 'IP', 'TBF', 'H', 'R', 'ER', 'HR', 'BB', 'IBB', 'HBP', 'SO', 'K/9', 'BB/9', 'K/BB', 'HR/9', 'K%', 'BB%', 'K-BB%', 'AVG', 'WHIP', 'BABIP', 'LOB%', 'GB%', 'HR/FB', 'FIP', 'WAR', 'RA9-WAR', 'ADP', 'InterSD', 'InterSK', 'IntraSD', 'Vol', 'Skew', 'Dim', 'FPTS', 'FPTS/IP', 'SPTS', 'SPTS/IP', 'P10', 'P20', 'P30', 'P40', 'P50', 'P60', 'P70', 'P80', 'P90', 'TT10', 'TT20', 'TT30', 'TT40', 'TT50', 'TT60', 'TT70', 'TT80', 'TT90', 'NameASCII', 'PlayerId', 'MLBAMID']


In [None]:
# Merge projections with predictions
if pitcher_proj is not None:
    # OOPSY uses 'PlayerId' which maps to IDfg
    id_col = 'PlayerId'
    
    # Required columns from OOPSY (including Team for updates)
    proj_cols = ['IP', 'W', 'L', 'SV', 'HLD', 'Team']
    
    proj_subset = pitcher_proj[[id_col] + proj_cols].copy()
    # Rename the ID column to IDfg and prefix every projection column with 'Proj_'
    proj_subset = proj_subset.rename(
        columns={id_col: 'IDfg', **{c: f'Proj_{c}' for c in proj_cols}}
    )
    
    # Convert to numeric, coercing non-numeric IDs to NaN
    proj_subset['IDfg'] = pd.to_numeric(proj_subset['IDfg'], errors='coerce')
    
    # Drop rows with non-numeric IDs
    proj_subset = proj_subset.dropna(subset=['IDfg'])
    proj_subset['IDfg'] = proj_subset['IDfg'].astype(int)
    
    print(f"OOPSY projections with valid IDs: {len(proj_subset)}")
    
    pred_2026_pit = pred_2026_pit.merge(proj_subset, on='IDfg', how='left')
    
    # Check how many matched
    matched = pred_2026_pit['Proj_IP'].notna().sum()
    print(f"Matched {matched} of {len(pred_2026_pit)} players with OOPSY projections")
    
    # Fill missing with 2025 values
    pred_2026_pit['Proj_IP'] = pred_2026_pit['Proj_IP'].fillna(pred_2026_pit['IP_2025'])
    pred_2026_pit['Proj_W'] = pred_2026_pit['Proj_W'].fillna(pred_2026_pit['W_2025'])
    pred_2026_pit['Proj_L'] = pred_2026_pit['Proj_L'].fillna(pred_2026_pit['L_2025'])
    pred_2026_pit['Proj_SV'] = pred_2026_pit['Proj_SV'].fillna(pred_2026_pit['SV_2025'])
    pred_2026_pit['Proj_HLD'] = pred_2026_pit['Proj_HLD'].fillna(pred_2026_pit['HLD_2025'])
    
    # Update Team with projected team where available
    pred_2026_pit['Team_2025'] = pred_2026_pit['Team']  # Backup old team
    pred_2026_pit['Team'] = pred_2026_pit['Proj_Team'].fillna(pred_2026_pit['Team'])
    pred_2026_pit = pred_2026_pit.drop('Proj_Team', axis=1)
    
    # Show team changes
    team_changed = pred_2026_pit[pred_2026_pit['Team'] != pred_2026_pit['Team_2025']]
    if len(team_changed) > 0:
        print(f"\nTeam changes (OOPSY vs 2025): {len(team_changed)}")
        print(team_changed[['Name', 'Team_2025', 'Team', 'Role']].head(10).to_string(index=False))
else:
    # Use 2025 stats as projection
    pred_2026_pit['Proj_IP'] = pred_2026_pit['IP_2025']
    pred_2026_pit['Proj_W'] = pred_2026_pit['W_2025']
    pred_2026_pit['Proj_L'] = pred_2026_pit['L_2025']
    pred_2026_pit['Proj_SV'] = pred_2026_pit['SV_2025']
    pred_2026_pit['Proj_HLD'] = pred_2026_pit['HLD_2025']
    pred_2026_pit['Team_2025'] = pred_2026_pit['Team']

print("Projection columns added")

In [79]:
# Calculate total projected fantasy points
# Skill-based points
pred_2026_pit['Proj_Skill_Fpoints'] = pred_2026_pit['Predicted_Fpoints_IP'] * pred_2026_pit['Proj_IP']

# Team-based points (W/L/SV/HLD)
pred_2026_pit['Proj_Team_Fpoints'] = (
    pred_2026_pit['Proj_W'] * PITCHER_SCORING_TEAM['W'] +
    pred_2026_pit['Proj_L'] * PITCHER_SCORING_TEAM['L'] +
    pred_2026_pit['Proj_SV'] * PITCHER_SCORING_TEAM['SV'] +
    pred_2026_pit['Proj_HLD'] * PITCHER_SCORING_TEAM['HLD']
)

# Total
pred_2026_pit['Projected_Fpoints'] = pred_2026_pit['Proj_Skill_Fpoints'] + pred_2026_pit['Proj_Team_Fpoints']

# Sort by total
pred_2026_pit = pred_2026_pit.sort_values('Projected_Fpoints', ascending=False).reset_index(drop=True)
pred_2026_pit['Rank'] = range(1, len(pred_2026_pit) + 1)

print("\n=== 2026 Pitcher Total Projections ===")
display_cols = ['Rank', 'Name', 'Team', 'Role', 'Proj_IP', 'Predicted_Fpoints_IP', 
                'Proj_Skill_Fpoints', 'Proj_Team_Fpoints', 'Projected_Fpoints']
print(pred_2026_pit[display_cols].head(30).to_string(index=False))


=== 2026 Pitcher Total Projections ===
 Rank               Name  Team Role  Proj_IP  Predicted_Fpoints_IP  Proj_Skill_Fpoints  Proj_Team_Fpoints  Projected_Fpoints
    1       Tarik Skubal   DET   SP    205.0              2.480192          508.439349               16.0         524.439349
    2    Garrett Crochet   BOS   SP    197.0              2.482658          489.083702               12.0         501.083702
    3      Hunter Greene   CIN   SP    183.0              2.422286          443.278344                8.0         451.278344
    4        Paul Skenes   PIT   SP    198.0              2.166109          428.889646               14.0         442.889646
    5        Dylan Cease   SDP   SP    189.0              2.302171          435.110363                6.0         441.110363
    6        Cole Ragans   KCR   SP    173.0              2.449911          423.834582                8.0         431.834582
    7          Bryan Woo   SEA   SP    200.0              1.971158          394.23158

In [80]:
# Show SP and RP rankings separately
print("\n=== Top 25 Starting Pitchers ===")
sp_final = pred_2026_pit[pred_2026_pit['Role'] == 'SP'].copy()
sp_final['SP_Rank'] = range(1, len(sp_final) + 1)
print(sp_final[['SP_Rank', 'Name', 'Team', 'Proj_IP', 'Proj_W', 'Proj_L', 'Projected_Fpoints']].head(25).to_string(index=False))

print("\n=== Top 25 Relief Pitchers ===")
rp_final = pred_2026_pit[pred_2026_pit['Role'] == 'RP'].copy()
rp_final['RP_Rank'] = range(1, len(rp_final) + 1)
print(rp_final[['RP_Rank', 'Name', 'Team', 'Proj_IP', 'Proj_SV', 'Proj_HLD', 'Projected_Fpoints']].head(25).to_string(index=False))


=== Top 25 Starting Pitchers ===
 SP_Rank               Name Team  Proj_IP  Proj_W  Proj_L  Projected_Fpoints
       1       Tarik Skubal  DET    205.0    15.0     7.0         524.439349
       2    Garrett Crochet  BOS    197.0    14.0     8.0         501.083702
       3      Hunter Greene  CIN    183.0    12.0     8.0         451.278344
       4        Paul Skenes  PIT    198.0    14.0     7.0         442.889646
       5        Dylan Cease  SDP    189.0    13.0    10.0         441.110363
       6        Cole Ragans  KCR    173.0    12.0     8.0         431.834582
       7          Bryan Woo  SEA    200.0    14.0     9.0         404.231582
       8 Cristopher Sanchez  PHI    204.0    15.0     8.0         401.102191
       9       Jacob deGrom  TEX    174.0    11.0     8.0         400.606301
      10         Chris Sale  ATL    166.0    13.0     7.0         391.770696
      11      Logan Gilbert  SEA    167.0    12.0     7.0         384.381408
      12         Logan Webb  SFG    205.0 

---
## Part 4: Save Final Rankings
---

---
## Part 3.5: Add Position Data

Load position data from ESPN rankings, with the MLB Stats API as a fallback, and attach it to the batter predictions. Names are matched exactly first, then fuzzily.
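
Because the sources are joined by name rather than by ID, small spelling differences (accents, suffixes) matter. The following is a minimal sketch of threshold-based matching with `difflib.SequenceMatcher`; the sample dictionary is made up, and `best_fuzzy_match` is a hypothetical variant that also returns the score so borderline matches can be inspected.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(name, candidates, threshold=0.85):
    """Return (key, score) for the closest candidate key, or (None, score) if below threshold."""
    target = name.lower().strip()
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, target, cand).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best if best_score >= threshold else None), best_score

# Made-up lookup table keyed by lowercased names
positions = {"ronald acuna jr.": "OF", "jose ramirez": "3B"}

# Accent difference only: clears the 0.85 threshold
print(best_fuzzy_match("Ronald Acuña Jr.", positions))

# A different surname: falls below the threshold, so no match is returned
print(best_fuzzy_match("Jose Ramos", positions))
```

A high threshold trades recall for safety: it avoids assigning one player's position to a similarly named player, at the cost of leaving some players unmatched.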

In [None]:
# Load position data from ESPN (primary) and MLB Stats API (fallback)
from difflib import SequenceMatcher
import requests

# === Step 1: Load ESPN rankings for position data (primary source) ===
ESPN_FILE = 'data/espn/espn_rankings_2026.csv'
espn_positions = {}

if os.path.exists(ESPN_FILE):
    espn_data = pd.read_csv(ESPN_FILE)
    print(f"Loaded ESPN data: {len(espn_data)} players with positions")
    
    # Create position mapping from ESPN
    for _, row in espn_data.iterrows():
        if pd.notna(row['Position']) and row['Position']:
            espn_positions[row['Name'].lower().strip()] = row['Position']
    
    print(f"ESPN positions available: {len(espn_positions)}")
else:
    print(f"ESPN file not found: {ESPN_FILE}")

# === Step 2: Fetch MLB Stats API for positions (fallback for non-ESPN players) ===
print("\nFetching MLB Stats API for additional positions...")

mlb_positions = {}
try:
    url = 'https://statsapi.mlb.com/api/v1/sports/1/players?season=2025'
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()  # report non-200 responses instead of skipping them silently
    players = resp.json().get('people', [])
    print(f"MLB API returned {len(players)} players")

    for p in players:
        name = p.get('fullName', '').lower().strip()
        pos = p.get('primaryPosition', {}).get('abbreviation', '')
        if name and pos:
            # The MLB API labels every pitcher "P" (no SP/RP split); keep it as-is
            mlb_positions[name] = pos

    print(f"MLB positions available: {len(mlb_positions)}")
except Exception as e:
    print(f"MLB API error: {e}")
    print("Will use ESPN positions only")

# === Step 3: Fuzzy matching helper ===
def fuzzy_match(name, name_dict, threshold=0.85):
    """Return the value for the best-matching key, or None if no key scores >= threshold."""
    name_lower = name.lower().strip()
    
    # Exact match first
    if name_lower in name_dict:
        return name_dict[name_lower]
    
    # Fuzzy match
    best_match, best_score = None, 0
    for key in name_dict.keys():
        score = SequenceMatcher(None, name_lower, key).ratio()
        if score > best_score:
            best_score = score
            best_match = key
    
    if best_score >= threshold:
        return name_dict[best_match]
    return None

# === Step 4: Assign positions to batters ===
def get_position(name):
    """Get position, preferring ESPN over MLB API."""
    # Try ESPN first (has multi-position eligibility)
    pos = fuzzy_match(name, espn_positions)
    if pos:
        return pos, 'ESPN'
    
    # Fall back to MLB API
    pos = fuzzy_match(name, mlb_positions)
    if pos:
        # A player in the batter list should not be a pitcher,
        # so treat an MLB API "P" as DH
        if pos == 'P':
            return 'DH', 'MLB_API'
        return pos, 'MLB_API'
    
    return 'Unknown', 'None'

# Apply to batters
positions = []
sources = []
for name in pred_2026_bat['Name']:
    pos, src = get_position(name)
    positions.append(pos)
    sources.append(src)

pred_2026_bat['Position'] = positions
pred_2026_bat['Position_Source'] = sources

# Summary
print("\nPosition assignment summary (batters):")
print(pred_2026_bat['Position_Source'].value_counts())

# Primary position for grouping
pred_2026_bat['Primary_Pos'] = pred_2026_bat['Position'].apply(
    lambda x: x.split('/')[0] if pd.notna(x) and x != 'Unknown' else 'Unknown'
)

print("\nPosition distribution:")
print(pred_2026_bat['Primary_Pos'].value_counts())

# Show sample with positions
print("\nTop batters with positions:")
print(pred_2026_bat[['Rank', 'Name', 'Team', 'Position', 'Position_Source', 'Projected_Fpoints']].head(20).to_string(index=False))

# Show players who got positions from MLB API (not ESPN)
mlb_api_players = pred_2026_bat[pred_2026_bat['Position_Source'] == 'MLB_API']
if len(mlb_api_players) > 0:
    print(f"\nPlayers with MLB API positions (not in ESPN 300): {len(mlb_api_players)}")
    print(mlb_api_players[['Name', 'Position', 'Projected_Fpoints']].head(15).to_string(index=False))
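
The matcher above compares lowercased names as-is, so spelling differences between sources (accents, periods in initials) send otherwise identical names down the fuzzy path. A minimal sketch of normalizing names before matching, assuming it is acceptable to strip diacritics and periods; `normalize_name` is a hypothetical helper and is not part of this notebook:

```python
import unicodedata
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase, strip accents and periods, collapse whitespace (hypothetical helper)."""
    # NFKD splits an accented character into its base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ' '.join(no_accents.lower().replace('.', '').split())

# Compare the similarity score with and without normalization
raw_ratio = SequenceMatcher(None, 'cristopher sánchez', 'cristopher sanchez').ratio()
norm_ratio = SequenceMatcher(
    None, normalize_name('Cristopher Sánchez'), normalize_name('Cristopher Sanchez')
).ratio()
print(f"raw: {raw_ratio:.3f}, normalized: {norm_ratio:.3f}")
```

If both the dictionary keys and the lookup name go through the same normalization, more players hit the exact-match branch and fewer depend on the 0.85 threshold.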

In [None]:
# Save final predictions with totals (including positions)
# Select columns for export
batter_export_cols = ['IDfg', 'Name', 'Team', 'Position', 'Projected_PA', 
                       'Predicted_Fpoints_PA', 'Projected_Fpoints', 'Rank']
pred_2026_bat[batter_export_cols].to_csv('predictions/batters_2026_final.csv', index=False)

pitcher_export_cols = ['IDfg', 'Name', 'Team', 'Role', 'Proj_IP', 'Proj_W', 'Proj_L', 'Proj_SV', 'Proj_HLD',
                        'Predicted_Fpoints_IP', 'Projected_Fpoints', 'Rank']
pred_2026_pit[pitcher_export_cols].to_csv('predictions/pitchers_2026_final.csv', index=False)

print("Saved final predictions:")
print("  - predictions/batters_2026_final.csv")
print("  - predictions/pitchers_2026_final.csv")

# Show sample of final batter output
print("\nBatter output sample:")
print(pred_2026_bat[['Rank', 'Name', 'Team', 'Position', 'Projected_Fpoints']].head(10).to_string(index=False))

In [None]:
# Create combined overall ranking with positions
batters_ranked = pred_2026_bat[['Name', 'Team', 'Position', 'Projected_Fpoints']].copy()
batters_ranked['Type'] = 'Batter'

pitchers_ranked = pred_2026_pit[['Name', 'Team', 'Role', 'Projected_Fpoints']].copy()
pitchers_ranked['Position'] = pitchers_ranked['Role']  # SP/RP as position
pitchers_ranked['Type'] = pitchers_ranked['Role']
pitchers_ranked = pitchers_ranked.drop('Role', axis=1)

overall = pd.concat([batters_ranked, pitchers_ranked], ignore_index=True)
overall = overall.sort_values('Projected_Fpoints', ascending=False).reset_index(drop=True)
overall['Overall_Rank'] = range(1, len(overall) + 1)

print("\n=== 2026 Overall Rankings (Top 50) ===")
print(overall[['Overall_Rank', 'Name', 'Team', 'Position', 'Type', 'Projected_Fpoints']].head(50).to_string(index=False))

In [None]:
# Save overall rankings with positions
overall_export = overall[['Overall_Rank', 'Name', 'Team', 'Position', 'Type', 'Projected_Fpoints']]
overall_export.to_csv('predictions/overall_2026_rankings.csv', index=False)
print("Saved to predictions/overall_2026_rankings.csv")

# Summary stats
print("\n=== Summary ===")
print(f"Total players ranked: {len(overall)}")
print(f"  Batters: {(overall['Type'] == 'Batter').sum()}")
print(f"  SP: {(overall['Type'] == 'SP').sum()}")
print(f"  RP: {(overall['Type'] == 'RP').sum()}")
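
Before trusting an exported ranking file, it can help to assert a few invariants on the table. The sketch below is illustrative only: `check_rankings` is a hypothetical helper, and the toy rows are made up rather than taken from the projections.

```python
import pandas as pd

def check_rankings(df, rank_col='Overall_Rank', points_col='Projected_Fpoints'):
    """Return a list of problems found in a ranking table (hypothetical helper)."""
    problems = []
    if df[points_col].isna().any():
        problems.append('missing projected points')
    # Non-strict check: tied point totals are allowed
    if not df[points_col].is_monotonic_decreasing:
        problems.append('points are not sorted high to low')
    if list(df[rank_col]) != list(range(1, len(df) + 1)):
        problems.append('ranks are not 1..n in order')
    return problems

# Toy data for illustration only; not real projections
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Projected_Fpoints': [500.0, 450.5, 450.5],
    'Overall_Rank': [1, 2, 3],
})
print(check_rankings(toy))
```

In the notebook this would be called as `check_rankings(overall)`. Note that `sort_values` uses a non-stable sort by default, so the relative order of tied players is not guaranteed to follow their input order; passing `kind='stable'` is one way to pin that down.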

---
## Part 5: Position-Specific Rankings
---

In [None]:
# Create position-specific rankings
# Map batters to fantasy-relevant positions

def get_eligible_positions(pos_str):
    """Expand position string to list of eligible positions."""
    if pd.isna(pos_str) or pos_str == 'Unknown':
        return []
    return pos_str.split('/')

# Fantasy-relevant position groups
POSITION_GROUPS = {
    'C': ['C'],
    '1B': ['1B'],
    '2B': ['2B'],
    '3B': ['3B'],
    'SS': ['SS'],
    'OF': ['OF', 'LF', 'CF', 'RF'],
    'DH': ['DH'],
    'SP': ['SP'],
    'RP': ['RP']
}

def player_eligible_for(position, pos_list):
    """Check if player is eligible for a position."""
    eligible_positions = POSITION_GROUPS.get(position, [position])
    return any(p in eligible_positions for p in pos_list)

# Create rankings for each position
position_rankings = {}

# Batter positions
batter_positions = ['C', '1B', '2B', '3B', 'SS', 'OF', 'DH']
for pos in batter_positions:
    # Find all players eligible at this position
    eligible = pred_2026_bat[pred_2026_bat['Position'].apply(
        lambda x: player_eligible_for(pos, get_eligible_positions(x))
    )].copy()
    eligible = eligible.sort_values('Projected_Fpoints', ascending=False).reset_index(drop=True)
    eligible[f'{pos}_Rank'] = range(1, len(eligible) + 1)
    position_rankings[pos] = eligible
    
# Pitcher positions (already have SP/RP rankings)
position_rankings['SP'] = sp_final.copy()
position_rankings['RP'] = rp_final.copy()

# Show top 15 at each position (DH is ranked and exported but not printed here)
for pos in ['C', '1B', '2B', '3B', 'SS', 'OF', 'SP', 'RP']:
    print(f"\n=== Top 15 {pos} ===")
    df = position_rankings[pos]
    if pos in batter_positions:
        cols = [f'{pos}_Rank', 'Name', 'Team', 'Position', 'Projected_Fpoints']
        print(df[cols].head(15).to_string(index=False))
    else:
        rank_col = 'SP_Rank' if pos == 'SP' else 'RP_Rank'
        cols = [rank_col, 'Name', 'Team', 'Proj_IP', 'Projected_Fpoints']
        print(df[cols].head(15).to_string(index=False))
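
The loop above builds one DataFrame per position. An alternative is a single long-format table with one row per player and eligible slot, ranked within each slot. This is a sketch under assumptions, not the notebook's method: `SLOT_ALIASES`, `long_df` and the toy rows are hypothetical, and rows with an `'Unknown'` position would need filtering out first.

```python
import pandas as pd

# Map specific outfield spots to the OF slot, mirroring POSITION_GROUPS above
SLOT_ALIASES = {'LF': 'OF', 'CF': 'OF', 'RF': 'OF'}

# Toy data for illustration only; not real projections
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Position': ['2B/SS', 'CF', 'SS/LF/OF'],
    'Projected_Fpoints': [480.0, 455.0, 430.0],
})

# One row per (player, slot); collapse LF/CF/RF into OF and drop repeats
long_df = (
    toy.assign(Slot=toy['Position'].str.split('/'))
       .explode('Slot')
       .reset_index(drop=True)
)
long_df['Slot'] = long_df['Slot'].replace(SLOT_ALIASES)
long_df = long_df.drop_duplicates(subset=['Name', 'Slot']).reset_index(drop=True)

# Rank within each slot, highest projection first
long_df['Slot_Rank'] = (
    long_df.groupby('Slot')['Projected_Fpoints']
           .rank(method='first', ascending=False)
           .astype(int)
)
print(long_df.sort_values(['Slot', 'Slot_Rank'])[['Slot', 'Slot_Rank', 'Name']].to_string(index=False))
```

A multi-position player appears once per slot, so filtering `long_df` on `Slot` gives the same per-position view as `position_rankings[pos]`.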

In [None]:
# Export position-specific rankings
os.makedirs('predictions/by_position', exist_ok=True)

# Export each position
for pos in ['C', '1B', '2B', '3B', 'SS', 'OF', 'DH', 'SP', 'RP']:
    df = position_rankings[pos]
    output_file = f'predictions/by_position/{pos.lower()}_2026_rankings.csv'
    
    if pos in batter_positions:
        export_cols = [f'{pos}_Rank', 'Name', 'Team', 'Position', 'Projected_PA', 
                       'Predicted_Fpoints_PA', 'Projected_Fpoints']
        export_df = df[export_cols].copy()
        export_df.columns = ['Rank', 'Name', 'Team', 'Position', 'Projected_PA', 
                             'Fpoints_PA', 'Projected_Fpoints']
    else:
        rank_col = 'SP_Rank' if pos == 'SP' else 'RP_Rank'
        export_cols = [rank_col, 'Name', 'Team', 'Proj_IP', 'Predicted_Fpoints_IP', 'Projected_Fpoints']
        export_df = df[export_cols].copy()
        export_df.columns = ['Rank', 'Name', 'Team', 'Proj_IP', 'Fpoints_IP', 'Projected_Fpoints']
    
    export_df.to_csv(output_file, index=False)

print("Saved position-specific rankings:")
for pos in ['C', '1B', '2B', '3B', 'SS', 'OF', 'DH', 'SP', 'RP']:
    df = position_rankings[pos]
    print(f"  - predictions/by_position/{pos.lower()}_2026_rankings.csv ({len(df)} players)")

---
## Summary

Generated predictions are saved to the `predictions/` folder:

| File | Description |
|------|-------------|
| `batters_2026_rate_predictions.csv` | Predicted Fpoints/PA for all batters |
| `pitchers_2026_rate_predictions.csv` | Predicted Fpoints/IP for all pitchers |
| `batters_2026_final.csv` | Batter totals with PA projections, updated teams, positions |
| `pitchers_2026_final.csv` | Pitcher totals with IP/W/L/SV/HLD projections, updated teams |
| `overall_2026_rankings.csv` | Combined rankings (batters + pitchers) with positions |

### Position-Specific Rankings (`predictions/by_position/`)

| File | Description |
|------|-------------|
| `c_2026_rankings.csv` | Catchers |
| `1b_2026_rankings.csv` | First Basemen |
| `2b_2026_rankings.csv` | Second Basemen |
| `3b_2026_rankings.csv` | Third Basemen |
| `ss_2026_rankings.csv` | Shortstops |
| `of_2026_rankings.csv` | Outfielders |
| `dh_2026_rankings.csv` | Designated Hitters |
| `sp_2026_rankings.csv` | Starting Pitchers |
| `rp_2026_rankings.csv` | Relief Pitchers |

### Data Sources
- **Teams**: Updated from BatX (batters) and OOPSY (pitchers) 2026 projections
- **Positions**: ESPN Top 300 PDF rankings (primary), with the MLB Stats API as a fallback for players ESPN does not list
- **PA/IP Projections**: BatX and OOPSY from FanGraphs

---