# MLB Breakout Candidate Detector

This notebook identifies MLB players poised for breakout seasons or positive regression using expected statistics.

## Key Concepts

### Expected Stats (xStats)
- Baseball Savant calculates **expected stats** from batted-ball quality, such as exit velocity and launch angle
- When xStats far exceed actual stats, the player has been "unlucky" (buy low)
- When actual stats far exceed xStats, the player has been "lucky" (sell high)
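
The buy-low / sell-high rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the logic inside `BreakoutDetector`: the `classify_luck` helper is hypothetical, the 20-point threshold is an assumed cutoff that mirrors the `min_gap=0.020` used later in this notebook, and the sample numbers are made up.

```python
def classify_luck(actual: float, expected: float, threshold: float = 0.020) -> str:
    """Label a player from the gap between an expected stat and the actual stat.

    Hypothetical helper for illustration; the threshold is an assumed cutoff.
    """
    gap = expected - actual
    if gap >= threshold:
        return "unlucky"  # expected well above actual: buy low
    if gap <= -threshold:
        return "lucky"    # actual well above expected: sell high
    return "neutral"


# Made-up (wOBA, xwOBA) pairs, not real player data
samples = {
    "Player A": (0.300, 0.350),
    "Player B": (0.380, 0.340),
    "Player C": (0.330, 0.335),
}
labels = {name: classify_luck(woba, xwoba) for name, (woba, xwoba) in samples.items()}
print(labels)
```

Keeping the threshold as a parameter makes it easy to tighten or loosen the definition of "unlucky" without touching the comparison itself.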

### Quality of Contact
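- **Barrel Rate**: Share of batted balls with an optimal exit velocity and launch angle combination
- **Hard Hit Rate**: Share of batted balls hit at 95+ mph
- **Average Exit Velocity**: Mean speed of the ball off the bat, a proxy for overall quality of contact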
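
As a rough sketch of how two of these metrics fall out of raw batted-ball data, the snippet below computes hard-hit rate and average exit velocity from a list of exit velocities. The `contact_summary` helper and the sample values are hypothetical; barrel rate is left out because it also depends on launch angle.

```python
from statistics import mean

HARD_HIT_MPH = 95.0  # hard-hit cutoff from the definition above


def contact_summary(exit_velocities: list) -> dict:
    """Summarize quality of contact from exit velocities in mph.

    Hypothetical helper for illustration only.
    """
    if not exit_velocities:
        raise ValueError("need at least one batted ball")
    hard_hits = sum(1 for ev in exit_velocities if ev >= HARD_HIT_MPH)
    return {
        "avg_exit_velocity": mean(exit_velocities),
        "hard_hit_rate": hard_hits / len(exit_velocities),
    }


# Made-up exit velocities for one hitter
evs = [88.0, 101.5, 95.0, 72.5, 108.0]
contact = contact_summary(evs)
print(contact)
```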

### Breakout Score
We combine multiple factors:
1. Expected stats gap (40%)
2. Quality of contact (30%)
3. Plate discipline (20%)
4. Age/upside (10%)
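
One way to read the weights above is as a weighted sum of component scores. The sketch below assumes each component has already been scaled to 0-100; that scaling, the component names, and the `breakout_score` function are assumptions for illustration and may differ from what `BreakoutDetector.calculate_breakout_score` does.

```python
# Weights from the list above; the component names are hypothetical labels
WEIGHTS = {
    "xstats_gap": 0.40,
    "contact_quality": 0.30,
    "plate_discipline": 0.20,
    "age_upside": 0.10,
}


def breakout_score(components: dict) -> float:
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up component scores for one hitter
example = {
    "xstats_gap": 80,
    "contact_quality": 70,
    "plate_discipline": 50,
    "age_upside": 90,
}
score = breakout_score(example)
print(f"Breakout score: {score:.1f}")
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as its inputs, which is what makes a cutoff such as 60 meaningful.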

In [None]:
# Setup
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data import FanGraphsFetcher, SavantLeaderboards
from src.analysis.breakout_detector import BreakoutDetector
import config

pd.set_option('display.max_columns', 30)
sns.set_style('darkgrid')

print("Setup complete!")

## 1. Fetch Current Season Data

In [None]:
# Initialize fetchers
savant = SavantLeaderboards()
detector = BreakoutDetector()

# Fetch expected stats for current season
season = config.CURRENT_SEASON
min_pa = 100  # Minimum PAs for qualified batters

print(f"Fetching {season} expected stats (min {min_pa} PA)...")
batter_xstats = savant.get_batter_expected_stats(season, min_pa=min_pa)

print(f"\nFetched data for {len(batter_xstats)} batters")
batter_xstats.head()

In [None]:
# Check available columns
print("Available statistics:")
print(batter_xstats.columns.tolist())

## 2. Find Unlucky Players (Buy Low Candidates)

These players have strong underlying metrics but poor results. They're due for positive regression.

In [None]:
# Find unlucky batters
unlucky = detector.find_unlucky_players(
    batter_xstats,
    player_type='batter',
    min_gap=0.020,  # 20+ point gap
    top_n=20
)

print(f"Found {len(unlucky)} unlucky batters\n")

# Display key stats
if len(unlucky) > 0:
    display_cols = [
        'first_name', 'last_name', 'pa',
        'ba', 'xba', 'ba_gap',
        'slg', 'xslg', 'slg_gap',
        'woba', 'xwoba', 'woba_gap'
    ]
    available_cols = [col for col in display_cols if col in unlucky.columns]
    
    print("Top 20 Unlucky Batters (Expected > Actual):")
    # A bare expression inside an `if` block is not rendered by Jupyter, so call display()
    display(unlucky[available_cols].head(20))

## 3. Visualize Expected vs Actual Stats

In [None]:
# Scatter plot: xwOBA vs actual wOBA
if 'xwoba' in batter_xstats.columns and 'woba' in batter_xstats.columns:
    plt.figure(figsize=(12, 8))
    
    # All players
    plt.scatter(batter_xstats['woba'], batter_xstats['xwoba'], alpha=0.5, s=50)
    
    # Highlight unlucky players
    if len(unlucky) > 0:
        plt.scatter(
            unlucky['woba'], 
            unlucky['xwoba'], 
            color='red', 
            s=100, 
            alpha=0.7,
            label='Unlucky Players',
            edgecolors='black'
        )
    
    # Add reference line (actual = expected)
    min_val = min(batter_xstats['woba'].min(), batter_xstats['xwoba'].min())
    max_val = max(batter_xstats['woba'].max(), batter_xstats['xwoba'].max())
    plt.plot([min_val, max_val], [min_val, max_val], 'k--', alpha=0.5, label='Perfect Match')
    
    # Add buffer zones
    plt.fill_between(
        [min_val, max_val], 
        [min_val - 0.020, max_val - 0.020],
        [min_val + 0.020, max_val + 0.020],
        alpha=0.1, 
        color='gray',
        label='±20 point range'
    )
    
    plt.xlabel('Actual wOBA', fontsize=12)
    plt.ylabel('Expected wOBA (xwOBA)', fontsize=12)
    plt.title('Expected vs Actual Performance', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Stats on the gap distribution
    gaps = batter_xstats['xwoba'] - batter_xstats['woba']
    print("\nwOBA Gap Distribution:")
    print(f"Mean gap: {gaps.mean():.3f}")
    print(f"Median gap: {gaps.median():.3f}")
    print(f"Std deviation: {gaps.std():.3f}")
    print(f"Players with 20+ point gap: {(gaps >= 0.020).sum()}")
    print(f"Players with 30+ point gap: {(gaps >= 0.030).sum()}")

## 4. Calculate Breakout Scores

This step combines the expected-stats gap, quality of contact, plate discipline, and age into a single score.

In [None]:
# Calculate breakout scores
scored_batters = detector.calculate_breakout_score(batter_xstats, player_type='batter')

# Top breakout candidates
breakout_candidates = detector.identify_breakout_candidates(
    batter_xstats,
    player_type='batter',
    min_score=60,
    top_n=30
)

print(f"Found {len(breakout_candidates)} high-probability breakout candidates\n")

if len(breakout_candidates) > 0:
    display_cols = [
        'first_name', 'last_name', 'age', 'pa',
        'ba', 'xba', 'woba', 'xwoba',
        'barrel_batted_rate', 'k_percent', 'bb_percent',
        'breakout_score'
    ]
    available_cols = [col for col in display_cols if col in breakout_candidates.columns]
    
    print("Top 30 Breakout Candidates:")
    # A bare expression inside an `if` block is not rendered by Jupyter, so call display()
    display(breakout_candidates[available_cols].head(30))

## 5. Find Overperformers (Regression Risk)

Players whose results exceed their underlying metrics are sell-high candidates.

In [None]:
# Find overperformers
overperformers = detector.find_overperforming_players(
    batter_xstats,
    player_type='batter',
    min_gap=0.020,
    top_n=20
)

print(f"Found {len(overperformers)} overperforming batters\n")

if len(overperformers) > 0:
    display_cols = [
        'first_name', 'last_name', 'pa',
        'ba', 'xba', 'ba_gap',
        'woba', 'xwoba', 'woba_gap',
        'barrel_batted_rate'
    ]
    available_cols = [col for col in display_cols if col in overperformers.columns]
    
    print("Top 20 Overperformers (Actual > Expected - Regression Risk):")
    # A bare expression inside an `if` block is not rendered by Jupyter, so call display()
    display(overperformers[available_cols].head(20))

## 6. Quality of Contact Leaders

Players with elite batted ball metrics regardless of current results.

In [None]:
# Barrel rate leaders
if 'barrel_batted_rate' in batter_xstats.columns:
    barrel_leaders = batter_xstats.nlargest(20, 'barrel_batted_rate')
    
    display_cols = [
        'first_name', 'last_name', 'pa',
        'barrel_batted_rate', 'avg_hit_speed', 'max_hit_speed',
        'ba', 'xba', 'woba', 'xwoba'
    ]
    available_cols = [col for col in display_cols if col in barrel_leaders.columns]
    
    print("Top 20 Barrel Rate Leaders:")
    print(barrel_leaders[available_cols].to_string(index=False))

In [None]:
# Exit velocity leaders
if 'avg_hit_speed' in batter_xstats.columns:
    exit_velo_leaders = batter_xstats.nlargest(20, 'avg_hit_speed')
    
    display_cols = [
        'first_name', 'last_name', 'pa',
        'avg_hit_speed', 'max_hit_speed',
        'hard_hit_percent', 'barrel_batted_rate',
        'woba', 'xwoba'
    ]
    available_cols = [col for col in display_cols if col in exit_velo_leaders.columns]
    
    print("\nTop 20 Average Exit Velocity Leaders:")
    print(exit_velo_leaders[available_cols].to_string(index=False))

## 7. Analyze a Specific Player

In [None]:
# Change this to analyze your player of interest
player_name = "Ohtani"  # Example - change to any player

# Get player summary
summary = detector.get_breakout_summary(batter_xstats, player_name, player_type='batter')

print(f"Breakout Analysis for {summary.get('player_name', player_name)}")
print("=" * 60)
print(f"Age: {summary.get('age')}")
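# Apply numeric formatting only when a score is present; formatting 'N/A' with :.1f raises ValueError
score = summary.get('breakout_score')
score_text = f"{score:.1f}" if isinstance(score, (int, float, np.number)) else "N/A"
print(f"\nBreakout Score: {score_text}")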

if 'expected_stats_gaps' in summary:
    print("\nExpected Stats Gaps:")
    for stat, value in summary['expected_stats_gaps'].items():
        if value != 'N/A':
            print(f"  {stat}: {value:.3f}")

if 'quality_of_contact' in summary:
    print("\nQuality of Contact:")
    for stat, value in summary['quality_of_contact'].items():
        print(f"  {stat}: {value}")

if 'plate_discipline' in summary:
    print("\nPlate Discipline:")
    for stat, value in summary['plate_discipline'].items():
        print(f"  {stat}: {value}")

## Summary: How to Use This Analysis

### Buy Low Targets (Unlucky Players)
- xStats significantly exceed actual stats
- Good quality of contact metrics
- Due for positive regression
- **Action**: Trade for or pick up in fantasy

### Sell High Targets (Overperformers)
- Actual stats exceed xStats by large margin
- Mediocre quality of contact
- Likely to regress negatively
- **Action**: Trade away while value is high

### True Breakout Candidates
- High breakout score (60+)
- Elite quality of contact
- Young with room to grow
- Strong plate discipline
- **Action**: Buy and hold for long-term value

### Next Steps
1. Compare these findings to current trade values
2. Track week-to-week changes in xStats gaps
3. Combine with prospect data for dynasty leagues
4. Build automated alerts when gaps exceed thresholds
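
Steps 2 and 4 can be combined into a small check that compares two snapshots of xStats gaps and flags players who newly crossed a threshold. This is a minimal sketch under stated assumptions: the `gap_alerts` function, the 30-point threshold, and the snapshot dictionaries are all hypothetical, and how the weekly snapshots get stored is left open.

```python
from typing import Dict, List, Optional, Tuple


def gap_alerts(
    previous: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.030,
) -> List[Tuple[str, Optional[float], float]]:
    """Return (player, previous_gap, current_gap) for players whose gap
    is at or above the threshold now but was not in the previous snapshot.

    Hypothetical helper; the threshold is an assumed cutoff.
    """
    alerts = []
    for player, gap in current.items():
        prev_gap = previous.get(player)  # None if the player is new this week
        newly_crossed = gap >= threshold and (prev_gap is None or prev_gap < threshold)
        if newly_crossed:
            alerts.append((player, prev_gap, gap))
    # Largest current gap first
    return sorted(alerts, key=lambda alert: alert[2], reverse=True)


# Made-up xwOBA - wOBA gaps from two weekly snapshots
last_week = {"Player A": 0.015, "Player B": 0.035}
this_week = {"Player A": 0.032, "Player B": 0.040, "Player C": 0.045}

alerts = gap_alerts(last_week, this_week)
for player, prev_gap, gap in alerts:
    prev_text = "n/a" if prev_gap is None else f"{prev_gap:.3f}"
    print(f"{player}: gap {prev_text} -> {gap:.3f}")
```

Alerting only on a new crossing, rather than on every player above the threshold, keeps the output short when the same names stay unlucky for several weeks in a row.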