# MiLB Prospect Success Predictor

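This notebook builds toward a machine learning model that predicts which minor league prospects will succeed in MLB. The rankings below rely on hand-set statistical screens and a placeholder composite score; training the model itself is listed under Next Steps.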

## Methodology

### Key Insights from Research
1. **Age matters more than level** - A 22-year-old dominating AA is more impressive than a 26-year-old dominating AAA
2. **K% is more predictive than BB%** - Strikeout rate tends to be one of the strongest statistical predictors of MLB success
3. **Level of competition** - AAA/AA stats are more predictive than A-ball
4. **Power translates** - ISO (isolated power) in minors correlates with MLB power

### Features We Use
- **Age-adjusted performance**: wRC+, K%, BB%, ISO relative to league average for age
- **Skills-based metrics**: K/BB ratio, plate discipline, batted ball quality
- **Physical tools**: Power indicators, speed metrics
- **Level & experience**: Which level, how many PAs
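
The rate stats listed above can all be derived from standard counting stats. A minimal sketch, assuming a plain dict with hypothetical keys (`PA`, `AB`, `2B`, `3B`, `HR`, `BB`, `SO`); the column names actually returned by `MiLBFetcher` may differ, and `rate_stats` is a helper introduced here for illustration:

```python
def rate_stats(line):
    """Derive K%, BB%, ISO and K/BB from one batting line (a dict of counting stats)."""
    pa, ab = line["PA"], line["AB"]
    k_pct = 100.0 * line["SO"] / pa if pa else 0.0
    bb_pct = 100.0 * line["BB"] / pa if pa else 0.0
    # ISO = SLG - AVG, which reduces to extra bases per at-bat.
    extra_bases = line["2B"] + 2 * line["3B"] + 3 * line["HR"]
    iso = extra_bases / ab if ab else 0.0
    # Strikeouts per walk; infinite when the batter never walked.
    k_per_bb = line["SO"] / line["BB"] if line["BB"] else float("inf")
    return {"K%": k_pct, "BB%": bb_pct, "ISO": iso, "K/BB": k_per_bb}


# Made-up batting line, for illustration only.
example = {"PA": 500, "AB": 440, "2B": 25, "3B": 3, "HR": 20, "BB": 50, "SO": 100}
stats = rate_stats(example)
print({name: round(value, 3) for name, value in stats.items()})
```

Computing rates per plate appearance (K%, BB%) and per at-bat (ISO) keeps players with different playing time comparable.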

### Model Outputs
1. **MLB Arrival Probability** (0-100%): Will they make the majors?
2. **Expected Performance**: Projected MLB stats if they arrive
3. **Feature Importance**: Which stats matter most for prediction

In [None]:
# Setup
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data import MiLBFetcher
from src.models import ProspectPredictor
import config

pd.set_option('display.max_columns', 30)
sns.set_style('darkgrid')

print("Setup complete!")

## 1. Fetch Current MiLB Prospect Data

In [None]:
# Initialize the data fetcher and the predictor
milb_fetcher = MiLBFetcher()
predictor = ProspectPredictor()

# Fetch AAA batting stats (highest level prospects)
season = config.CURRENT_SEASON
print(f"Fetching {season} AAA batting data...")

aaa_batting = milb_fetcher.get_batting_stats(season, level="AAA", use_cache=True)

print(f"\nFetched data for {len(aaa_batting)} AAA batters")
aaa_batting.head()

In [None]:
# Check available stats
print("Available statistics:")
print(aaa_batting.columns.tolist())

## 2. Feature Engineering

Transform raw stats into predictive features.
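
One such transformation is the age differential that later cells plot. A minimal sketch of how it could be computed, assuming an expected age per level (26 for AAA matches the reference line in the age plot; the other values are illustrative guesses) and the `expected - actual` sign convention used on the plot axis. `add_age_differential` is a hypothetical helper, not necessarily what `ProspectPredictor.engineer_features` does internally:

```python
import pandas as pd

# Assumed typical ages per level; only the AAA value comes from this notebook.
EXPECTED_AGE = {"AAA": 26, "AA": 24, "A+": 23, "A": 22}


def add_age_differential(df, level):
    """Return a copy of df with age_differential = expected age - actual age.

    Positive values mean the player is young for the level.
    """
    out = df.copy()
    out["age_differential"] = EXPECTED_AGE[level] - out["Age"]
    return out


# Toy rows, for illustration only.
toy = pd.DataFrame({"Name": ["Player A", "Player B"], "Age": [22, 29]})
result = add_age_differential(toy, "AAA")
print(result)
```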

In [None]:
# Filter to qualified batters
min_pa = 100
qualified = aaa_batting[aaa_batting['PA'] >= min_pa].copy()

print(f"Qualified batters (>= {min_pa} PA): {len(qualified)}")

# Engineer features
featured_prospects = predictor.engineer_features(
    qualified,
    level='AAA',
    player_type='batter'
)

print(f"\nEngineered {len(featured_prospects.columns)} total features")
print("\nNew features created:")
new_features = [col for col in featured_prospects.columns if col not in aaa_batting.columns]
print(new_features)

In [None]:
# Examine age distribution and age-adjusted features
if 'age_differential' in featured_prospects.columns:
    plt.figure(figsize=(14, 5))
    
    # Subplot 1: Age distribution
    plt.subplot(1, 2, 1)
    plt.hist(featured_prospects['Age'], bins=20, edgecolor='black', alpha=0.7)
    plt.axvline(26, color='red', linestyle='--', label='Expected AAA age')
    plt.xlabel('Age')
    plt.ylabel('Count')
    plt.title('AAA Batter Age Distribution')
    plt.legend()
    plt.grid(alpha=0.3)
    
    # Subplot 2: Age differential
    plt.subplot(1, 2, 2)
    plt.hist(featured_prospects['age_differential'], bins=20, edgecolor='black', alpha=0.7)
    plt.axvline(0, color='red', linestyle='--', label='Average age for level')
    plt.xlabel('Age Differential (expected - actual)')
    plt.ylabel('Count')
    plt.title('Age Relative to Level')
    plt.legend()
    plt.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
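    # Summary counts (differential = expected - actual, so positive = young).
    # Cutoffs are asymmetric: more than 1 year young vs. more than 2 years old.
    young_for_level = (featured_prospects['age_differential'] > 1).sum()
    old_for_level = (featured_prospects['age_differential'] < -2).sum()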
    
    print(f"\nPlayers young for AAA: {young_for_level} ({young_for_level/len(featured_prospects)*100:.1f}%)")
    print(f"Players old for AAA: {old_for_level} ({old_for_level/len(featured_prospects)*100:.1f}%)")

## 3. Identify Elite Prospects (Traditional Metrics)

Before turning to ML predictions, let's see who stands out on traditional statistical benchmarks.

In [None]:
# Young + elite performance = top prospect profile
if 'wRC+' in featured_prospects.columns:
    elite_young = featured_prospects[
        (featured_prospects['Age'] <= 24) &
        (featured_prospects['wRC+'] >= 120)
    ].copy()
    
    elite_young = elite_young.sort_values('wRC+', ascending=False)
    
    print(f"Found {len(elite_young)} elite young prospects (Age <= 24, wRC+ >= 120)\n")
    
    if len(elite_young) > 0:
        display_cols = [
            'Name', 'Team', 'Age', 'PA',
            'AVG', 'OBP', 'SLG', 'wRC+',
            'K%', 'BB%', 'ISO', 'HR', 'SB'
        ]
        available_cols = [col for col in display_cols if col in elite_young.columns]
        
        print("Elite Young AAA Prospects:")
        print(elite_young[available_cols].head(20).to_string(index=False))

In [None]:
# Elite plate discipline (future on-base machines)
if 'K%' in featured_prospects.columns and 'BB%' in featured_prospects.columns:
    elite_discipline = featured_prospects[
        (featured_prospects['K%'] < 18) &
        (featured_prospects['BB%'] > 12)
    ].copy()
    
    elite_discipline = elite_discipline.sort_values('BB%', ascending=False)
    
    print(f"\nElite Plate Discipline (K% < 18%, BB% > 12%): {len(elite_discipline)}\n")
    
    if len(elite_discipline) > 0:
        display_cols = ['Name', 'Age', 'PA', 'AVG', 'OBP', 'K%', 'BB%', 'wRC+']
        available_cols = [col for col in display_cols if col in elite_discipline.columns]
        
        print(elite_discipline[available_cols].head(15).to_string(index=False))

In [None]:
# Power + speed combo (a rare pairing of two tools)
if 'ISO' in featured_prospects.columns and 'SB' in featured_prospects.columns:
    power_speed = featured_prospects[
        (featured_prospects['ISO'] >= 0.200) &
        (featured_prospects['SB'] >= 10)
    ].copy()
    
    power_speed = power_speed.sort_values('ISO', ascending=False)
    
    print(f"\nPower + Speed Combo (ISO >= .200, SB >= 10): {len(power_speed)}\n")
    
    if len(power_speed) > 0:
        display_cols = ['Name', 'Age', 'PA', 'AVG', 'OBP', 'SLG', 'ISO', 'HR', 'SB', 'wRC+']
        available_cols = [col for col in display_cols if col in power_speed.columns]
        
        print(power_speed[available_cols].head(15).to_string(index=False))
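
The three screens above share one pattern: filter on thresholds, sort, and keep the top rows. A sketch of a reusable helper for that pattern; `screen_prospects` and its parameters are hypothetical names introduced here, and its bounds are inclusive, whereas the discipline screen above uses strict inequalities:

```python
import pandas as pd


def screen_prospects(df, minimums=None, maximums=None, sort_by=None, top_n=20):
    """Keep rows meeting every threshold, then sort descending by sort_by.

    minimums/maximums map column name -> inclusive bound. Columns missing
    from df are skipped, mirroring the defensive checks in the cells above.
    """
    mask = pd.Series(True, index=df.index)
    for col, bound in (minimums or {}).items():
        if col in df.columns:
            mask &= df[col] >= bound
    for col, bound in (maximums or {}).items():
        if col in df.columns:
            mask &= df[col] <= bound
    result = df[mask]
    if sort_by is not None and sort_by in result.columns:
        result = result.sort_values(sort_by, ascending=False)
    return result.head(top_n)


# Toy rows, for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [23, 27, 24],
    "wRC+": [125, 140, 131],
})
elite_young = screen_prospects(
    toy, minimums={"wRC+": 120}, maximums={"Age": 24}, sort_by="wRC+"
)
print(elite_young.to_string(index=False))
```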

## 4. Visualize Key Predictive Relationships

In [None]:
# K% vs wRC+ (colored by age)
if 'K%' in featured_prospects.columns and 'wRC+' in featured_prospects.columns:
    plt.figure(figsize=(12, 8))
    
    scatter = plt.scatter(
        featured_prospects['K%'],
        featured_prospects['wRC+'],
        c=featured_prospects['Age'],
        cmap='coolwarm_r',  # Reversed map: younger = red (warm), older = blue (cool)
        alpha=0.6,
        s=100,
        edgecolors='black',
        linewidth=0.5
    )
    
    plt.colorbar(scatter, label='Age')
    
    # Reference lines
    plt.axhline(y=100, color='green', linestyle='--', alpha=0.5, label='League Average (100)')
    plt.axhline(y=120, color='orange', linestyle='--', alpha=0.5, label='Above Average (120)')
    plt.axvline(x=20, color='red', linestyle='--', alpha=0.5, label='20% K Rate')
    
    plt.xlabel('Strikeout Rate (K%)', fontsize=12)
    plt.ylabel('wRC+ (100 = league average)', fontsize=12)
    plt.title('AAA Prospect Performance vs Strikeout Rate (by Age)', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

## 5. Rank All Prospects

Combine all factors into a single ranking.

In [None]:
# Create a simple composite score
# (This is a placeholder - the ML model would be more sophisticated)

if 'wRC+' in featured_prospects.columns:
    # Scale each component to its weight so the total lands on a 0-100 scale
    
    # Performance (50 points max): a wRC+ of 150 or more earns full credit
    performance_score = (featured_prospects['wRC+'] / 150 * 50).clip(lower=0, upper=50)
    
    # Age bonus (30 points max) - younger is better
    age_score = np.where(
        featured_prospects['Age'] <= 24,
        30,
        np.where(featured_prospects['Age'] <= 26, 20, 10)
    )
    
    # Plate discipline (20 points max); assumes K% and BB% are in percent units
    if 'K%' in featured_prospects.columns and 'BB%' in featured_prospects.columns:
        discipline_score = (
            (1 - featured_prospects['K%'] / 30) * 10 +  # Lower K% = better
            (featured_prospects['BB%'] / 15) * 10  # Higher BB% = better
        ).clip(lower=0, upper=20)  # Floor at 0 so a very high K% cannot go negative
    else:
        discipline_score = 10  # Neutral midpoint when discipline stats are missing
    
    # Calculate composite
    featured_prospects['prospect_score'] = performance_score + age_score + discipline_score
    
    # Rank
    top_prospects = featured_prospects.nlargest(50, 'prospect_score')
    
    display_cols = [
        'Name', 'Team', 'Age', 'PA',
        'AVG', 'OBP', 'SLG', 'wRC+',
        'K%', 'BB%', 'ISO', 'HR', 'SB',
        'prospect_score'
    ]
    available_cols = [col for col in display_cols if col in top_prospects.columns]
    
    print("Top 50 AAA Prospects (Composite Ranking):")
    print("\nRank | " + " | ".join(available_cols))
    print("-" * 150)
    
    for idx, (_, row) in enumerate(top_prospects.iterrows(), 1):
        values = [str(row[col])[:10] for col in available_cols if col in row]
        print(f"{idx:4d} | " + " | ".join(values))
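
To make the weighting concrete, the composite can be worked through for a single batter. A sketch in plain Python that mirrors the 50/30/20 split from the cell above; `composite_score` is a hypothetical helper, the inputs are made up, K% and BB% are assumed to be in percent units (18, not 0.18), and the sketch floors each component at 0:

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def composite_score(wrc_plus, age, k_pct, bb_pct):
    """Score one batter: performance (50) + age bonus (30) + discipline (20)."""
    performance = clamp(wrc_plus / 150 * 50, 0, 50)
    if age <= 24:
        age_bonus = 30
    elif age <= 26:
        age_bonus = 20
    else:
        age_bonus = 10
    discipline = clamp((1 - k_pct / 30) * 10 + (bb_pct / 15) * 10, 0, 20)
    return performance + age_bonus + discipline


# Made-up stat lines, for illustration only.
young_hitter = composite_score(wrc_plus=135, age=23, k_pct=18, bb_pct=12)
older_hitter = composite_score(wrc_plus=90, age=27, k_pct=32, bb_pct=5)
print(f"young: {young_hitter:.1f}, older: {older_hitter:.1f}")
```

For the first line the components are 45 (performance), 30 (age) and 12 (discipline), so age alone contributes about a third of the total.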

## 6. Multi-Level Analysis

Compare prospects across different minor league levels.

In [None]:
# Fetch AA for comparison
print("Fetching AA data...")
aa_batting = milb_fetcher.get_batting_stats(season, level="AA", use_cache=True)

print(f"Fetched {len(aa_batting)} AA batters")

# Find young elite performers in AA
if 'wRC+' in aa_batting.columns:
    aa_qualified = aa_batting[aa_batting['PA'] >= 100].copy()
    
    aa_elite = aa_qualified[
        (aa_qualified['Age'] <= 23) &
        (aa_qualified['wRC+'] >= 125)
    ].sort_values('wRC+', ascending=False)
    
    print(f"\nElite AA prospects (Age <= 23, wRC+ >= 125): {len(aa_elite)}\n")
    
    if len(aa_elite) > 0:
        display_cols = ['Name', 'Team', 'Age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC+', 'K%', 'BB%', 'HR', 'SB']
        available_cols = [col for col in display_cols if col in aa_elite.columns]
        
        print("Top AA Prospects (These may be more valuable than AAA prospects!):")
        print(aa_elite[available_cols].head(20).to_string(index=False))
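
To compare levels side by side rather than one frame at a time, the per-level frames can be stacked with a level label. A sketch using toy frames that stand in for `aaa_batting` and `aa_batting`; the `Level` column and the `stack_levels` helper are hypothetical additions, not part of the fetcher's output:

```python
import pandas as pd


def stack_levels(frames_by_level, min_pa=100):
    """Concatenate per-level frames into one, tagging each row with its level."""
    tagged = []
    for level, frame in frames_by_level.items():
        qualified = frame[frame["PA"] >= min_pa].copy()
        qualified["Level"] = level
        tagged.append(qualified)
    return pd.concat(tagged, ignore_index=True)


# Toy frames, for illustration only.
toy_aaa = pd.DataFrame(
    {"Name": ["A", "B"], "Age": [25, 27], "PA": [300, 80], "wRC+": [118, 140]}
)
toy_aa = pd.DataFrame({"Name": ["C"], "Age": [21], "PA": [250], "wRC+": [128]})

combined = stack_levels({"AAA": toy_aaa, "AA": toy_aa})
# Average age and wRC+ of qualified batters at each level.
summary = combined.groupby("Level")[["Age", "wRC+"]].mean()
print(summary)
```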

## 7. Export Top Prospects

Save your rankings for later reference.

In [None]:
# Export top prospects to CSV
from src.utils import export_to_csv

if 'prospect_score' in featured_prospects.columns:
    export_cols = [
        'Name', 'Team', 'Age', 'PA',
        'AVG', 'OBP', 'SLG', 'wRC+',
        'K%', 'BB%', 'ISO', 'HR', 'SB',
        'prospect_score'
    ]
    available_export_cols = [col for col in export_cols if col in top_prospects.columns]
    
    export_to_csv(
        top_prospects[available_export_cols],
        f'top_aaa_prospects_{season}.csv'
    )
    
    print(f"\nExported top {len(top_prospects)} prospects to data/exports/")

## Summary: Prospect Evaluation Framework

### Tier 1: Elite (Ready for MLB)
- Age 24 or younger in AAA
- wRC+ >= 130
- K% < 20%, BB% > 10%
- **Action**: Add to watchlist for callups

### Tier 2: High Upside (1-2 years away)
- Age 23 or younger in AA
- wRC+ >= 120
- Good plate discipline
- **Action**: Dynasty league targets

### Tier 3: Developing (2-3 years away)
- Age 22 or younger in A+/A
- Above-average performance for level
- Toolsy (power, speed, or both)
- **Action**: Deep dynasty stashes

### Red Flags
- High K% (>25%) at any level
- Old for level (2+ years older than the level's average age)
- No power or speed tools (low ISO, low SB)
- Poor plate discipline (K% far above BB%, i.e. a high K/BB ratio)
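
The Tier 1 criteria and the red flags can be expressed as simple checks. A sketch, assuming a player is a dict with percent-unit `K%` and `BB%`. The ISO, SB and K/BB cutoffs are assumptions made for this example, since the framework does not put numbers on "low" or "poor"; the plate-discipline flag is expressed as a K/BB ratio because nearly all hitters strike out more than they walk. Both function names are hypothetical:

```python
def red_flags(player, level_avg_age):
    """Return the red flags that apply to one batter."""
    flags = []
    if player["K%"] > 25:
        flags.append("high K%")
    if player["Age"] - level_avg_age >= 2:
        flags.append("old for level")
    # Assumed cutoffs for "low ISO, low SB".
    if player["ISO"] < 0.120 and player["SB"] < 5:
        flags.append("no power or speed")
    # Assumed cutoff: more than three strikeouts per walk.
    if player["BB%"] == 0 or player["K%"] / player["BB%"] > 3:
        flags.append("poor plate discipline")
    return flags


def is_tier_one(player):
    """Tier 1 check for a AAA batter, using the thresholds listed above."""
    return (
        player["Age"] <= 24
        and player["wRC+"] >= 130
        and player["K%"] < 20
        and player["BB%"] > 10
    )


# Made-up players, for illustration only.
prospect = {"Age": 23, "wRC+": 134, "K%": 17.5, "BB%": 11.0, "ISO": 0.210, "SB": 12}
veteran = {"Age": 29, "wRC+": 105, "K%": 28.0, "BB%": 6.0, "ISO": 0.100, "SB": 2}

print(is_tier_one(prospect), red_flags(prospect, level_avg_age=26))
print(is_tier_one(veteran), red_flags(veteran, level_avg_age=26))
```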

### Next Steps
1. **Build historical dataset** - Collect past prospects + MLB outcomes
2. **Train ML model** - Use ProspectPredictor class to train
3. **Validate predictions** - Track how your rankings perform
4. **Automate updates** - Run weekly to track progress
5. **Combine with scouting** - Merge data analysis with expert rankings
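
For step 3, one simple way to track how a ranking performs is precision at k: of the top k ranked prospects, what share later reached MLB. A sketch with made-up names and outcomes; `precision_at_k` is a hypothetical helper, not part of `ProspectPredictor`:

```python
def precision_at_k(ranked_names, reached_mlb, k):
    """Share of the top-k ranked prospects who appear in the reached_mlb set."""
    top_k = ranked_names[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for name in top_k if name in reached_mlb)
    return hits / len(top_k)


# Made-up ranking (best first) and outcomes, for illustration only.
ranking = ["A", "B", "C", "D", "E"]
reached = {"A", "C", "E"}

top3 = precision_at_k(ranking, reached, k=3)
top5 = precision_at_k(ranking, reached, k=5)
print(f"precision@3 = {top3:.2f}, precision@5 = {top5:.2f}")
```

Comparing this number across seasons, or against an expert list scored the same way, gives a rough read on whether the ranking adds information.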