# MLB Statcast Data Analysis

This notebook demonstrates how to fetch and analyze MLB Statcast data.

## What is Statcast?
Statcast provides pitch-level data including:
- **Pitch metrics**: Velocity, spin rate, movement, location
- **Batted ball data**: Exit velocity, launch angle, distance
- **Outcome data**: Result of each pitch/plate appearance
- **Contextual data**: Game situation, count, runners on base

This is the same data MLB teams use for advanced analytics.

In [None]:
# Setup
import sys
sys.path.append('..')  # Add parent directory to path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

from src.data import StatcastFetcher
import config

# Set display options
pd.set_option('display.max_columns', 50)
sns.set_style('darkgrid')

print("Setup complete!")

## 1. Initialize Data Fetcher

The `StatcastFetcher` class handles data retrieval and caching.
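
The internals of `StatcastFetcher` are not shown in this notebook. As a rough illustration of the fetch-and-cache pattern described above, the sketch below stores a DataFrame on disk keyed by its date range and reuses it on the next call. `cached_fetch`, `fake_fetch`, and the file naming scheme are hypothetical and are not taken from `src.data`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def cached_fetch(fetch_fn, start_date, end_date, cache_dir):
    """Return fetch_fn(start_date, end_date), reusing a cached copy on disk if one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one pickle file per date range.
    cache_file = cache_dir / f"statcast_{start_date}_{end_date}.pkl"

    if cache_file.exists():
        return pd.read_pickle(cache_file)

    df = fetch_fn(start_date, end_date)
    df.to_pickle(cache_file)
    return df


# Hypothetical stand-in for a slow network call; it records how often it runs.
calls = []


def fake_fetch(start_date, end_date):
    calls.append((start_date, end_date))
    return pd.DataFrame({"pitch_type": ["FF", "SL"], "release_speed": [95.1, 86.4]})


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(fake_fetch, "2024-06-01", "2024-06-07", tmp)
    second = cached_fetch(fake_fetch, "2024-06-01", "2024-06-07", tmp)

print(f"fetch calls: {len(calls)}")
```

The second call reads the pickle instead of calling `fake_fetch` again, which is why a repeated fetch for the same date range should be much faster than the first.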

In [None]:
# Initialize fetcher
fetcher = StatcastFetcher(cache_dir=config.CACHE_DIR)

print(f"Fetcher initialized. Cache directory: {config.CACHE_DIR}")

## 2. Example: Fetch Recent Statcast Data

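Let's fetch data from the last week to explore what's available. During the offseason this window contains no games, so substitute a fixed in-season date range if the result comes back empty.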

In [None]:
# Define date range (last 7 days)
end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')

print(f"Fetching data from {start_date} to {end_date}...")

# Fetch data (this may take a minute on first run)
statcast_df = fetcher.get_statcast_data(start_date, end_date)

print(f"\nFetched {len(statcast_df):,} pitches")
print(f"Columns available: {len(statcast_df.columns)}")
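
Because `datetime.now()` changes on every run, results from the cell above are not reproducible from one day to the next. One option is to compute the window from an explicit reference date. The helper below is a hypothetical convenience function, not part of `StatcastFetcher`, and the reference date is an arbitrary example.

```python
from datetime import date, timedelta


def date_window(end, days=7):
    """Return (start_date, end_date) as 'YYYY-MM-DD' strings, with start `days` days before `end`."""
    start = end - timedelta(days=days)
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")


# A fixed reference date keeps the example reproducible.
start_date, end_date = date_window(date(2024, 6, 30), days=7)
print(start_date, end_date)
```

The resulting strings can be passed to `fetcher.get_statcast_data(start_date, end_date)` in place of the values derived from `datetime.now()`.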

In [None]:
# Explore the data structure
statcast_df.head()

In [None]:
# Check key columns
key_columns = [
    'player_name',      # Pitcher name
    'batter',           # Batter ID
    'pitcher',          # Pitcher ID
    'pitch_type',       # Type of pitch (FF, SL, CH, etc.)
    'release_speed',    # Pitch velocity
    'release_spin_rate', # Spin rate
    'launch_speed',     # Exit velocity (if ball in play)
    'launch_angle',     # Launch angle (if ball in play)
    'events',           # Outcome (hit, out, etc.)
    'description',      # Pitch result (ball, strike, foul, etc.)
    'zone',             # Strike zone location
    'plate_x',          # Horizontal location
    'plate_z',          # Vertical location
]

statcast_df[key_columns].head(10)
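
Selecting a list of columns raises a `KeyError` if any of the names is absent from the DataFrame, which can happen if the upstream schema changes. A defensive variant, sketched here on a toy DataFrame rather than the real `statcast_df`, keeps only the columns that exist and reports the rest.

```python
import pandas as pd

# Toy frame standing in for statcast_df; only some of the expected columns exist.
df = pd.DataFrame({
    "pitch_type": ["FF", "SL"],
    "release_speed": [95.1, 86.4],
    "events": [None, "single"],
})

wanted = ["pitch_type", "release_speed", "release_spin_rate", "events"]

# Split the wish list into columns that are present and columns that are not.
present = [col for col in wanted if col in df.columns]
missing = [col for col in wanted if col not in df.columns]

print(f"Missing columns: {missing}")
subset = df[present]
```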

## 3. Quick Analysis: Pitch Velocity Distribution

Let's look at the distribution of pitch velocities by pitch type.

In [None]:
# Filter to common pitch types
common_pitches = statcast_df[statcast_df['pitch_type'].isin(['FF', 'SL', 'CH', 'CU', 'SI'])].copy()

# Pitch type mapping for better labels
pitch_names = {
    'FF': '4-Seam FB',
    'SI': 'Sinker',
    'SL': 'Slider',
    'CH': 'Changeup',
    'CU': 'Curveball'
}
common_pitches['pitch_name'] = common_pitches['pitch_type'].map(pitch_names)

# Create visualization
plt.figure(figsize=(12, 6))
sns.violinplot(data=common_pitches, x='pitch_name', y='release_speed')
plt.title('Pitch Velocity Distribution by Pitch Type', fontsize=14, fontweight='bold')
plt.xlabel('Pitch Type', fontsize=12)
plt.ylabel('Velocity (mph)', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Summary statistics
print("\nVelocity by Pitch Type:")
common_pitches.groupby('pitch_name')['release_speed'].describe()[['mean', 'std', 'min', 'max']].round(1)
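
The same grouping can also report how often each pitch is thrown alongside its velocity. This is a minimal sketch on made-up numbers, using named aggregations; the column names `pitches`, `mean_mph`, `max_mph`, and `usage_pct` are chosen for illustration only.

```python
import pandas as pd

# Made-up pitches standing in for common_pitches.
pitches = pd.DataFrame({
    "pitch_name": ["4-Seam FB", "4-Seam FB", "4-Seam FB", "Slider", "Slider", "Changeup"],
    "release_speed": [95.0, 96.0, 97.0, 85.0, 87.0, 84.0],
})

# Count, mean, and max velocity per pitch type, fastest pitch first.
summary = (
    pitches.groupby("pitch_name")["release_speed"]
    .agg(pitches="count", mean_mph="mean", max_mph="max")
    .sort_values("mean_mph", ascending=False)
)

# Share of all pitches in the sample, as a percentage.
summary["usage_pct"] = (summary["pitches"] / summary["pitches"].sum() * 100).round(1)
print(summary)
```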

## 4. Example: Analyze a Specific Player

Let's look up a player and analyze their recent performance.

In [None]:
# Example: look up a player by last name, then first name (swap in any player)
player_lookup = fetcher.lookup_player("Ohtani", "Shohei")
player_lookup

In [None]:
# Get player ID (use key_mlbam column)
if len(player_lookup) > 0:
    player_id = player_lookup.iloc[0]['key_mlbam']
    player_name = f"{player_lookup.iloc[0]['name_first']} {player_lookup.iloc[0]['name_last']}"
    
    print(f"Analyzing {player_name} (ID: {player_id})")
    
    # Fetch batter data from the hard-coded start date through today
    # (adjust start_date to match the season you want to analyze)
    batter_data = fetcher.get_batter_data(
        player_id=player_id,
        start_date="2024-04-01",
        end_date=datetime.now().strftime('%Y-%m-%d')
    )
    
    print(f"\nFound {len(batter_data):,} pitches faced")
    
    # Analyze batted balls
    batted_balls = batter_data[batter_data['type'] == 'X'].copy()  # 'X' = ball in play
    print(f"Batted balls in play: {len(batted_balls)}")
    
    if len(batted_balls) > 0:
        print(f"\nAverage exit velocity: {batted_balls['launch_speed'].mean():.1f} mph")
        print(f"Average launch angle: {batted_balls['launch_angle'].mean():.1f}°")
        # Drop untracked balls first so missing exit velocities don't count as soft contact
        print(f"Hard hit rate (95+ mph): {(batted_balls['launch_speed'].dropna() >= 95).mean() * 100:.1f}%")
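
The metrics printed above can be wrapped in a small function so they are computed the same way for every player. The sketch below is one possible version, run on made-up rows; `batted_ball_summary` and its `hard_hit_mph` parameter are hypothetical names, and it assumes the `type`, `launch_speed`, and `launch_angle` columns used in this notebook.

```python
import numpy as np
import pandas as pd


def batted_ball_summary(df, hard_hit_mph=95.0):
    """Summarize balls in play, ignoring rows without a tracked exit velocity."""
    in_play = df[df["type"] == "X"]  # 'X' = ball in play
    tracked = in_play.dropna(subset=["launch_speed"])
    return {
        "balls_in_play": len(in_play),
        "tracked": len(tracked),
        "avg_exit_velocity": tracked["launch_speed"].mean(),
        "avg_launch_angle": tracked["launch_angle"].mean(),
        "hard_hit_rate": (tracked["launch_speed"] >= hard_hit_mph).mean(),
    }


# Made-up pitches: four balls in play (one untracked), one strike, one ball.
sample = pd.DataFrame({
    "type": ["X", "X", "X", "X", "S", "B"],
    "launch_speed": [101.0, 95.0, 80.0, np.nan, np.nan, np.nan],
    "launch_angle": [25.0, 10.0, -5.0, np.nan, np.nan, np.nan],
})

summary = batted_ball_summary(sample)
print(summary)
```

Reporting `tracked` next to `balls_in_play` makes it visible how many batted balls were excluded from the averages.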

## 5. Advanced Analysis: Exit Velocity vs Launch Angle

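Exit velocity and launch angle together go a long way toward determining what happens to a batted ball: hard contact at moderate launch angles tends to produce the most damage, while weak contact or extreme angles usually turns into outs.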

In [None]:
# Use all batted balls from our dataset
batted_balls = statcast_df[statcast_df['type'] == 'X'].copy()
batted_balls = batted_balls.dropna(subset=['launch_speed', 'launch_angle', 'events'])

# Create scatter plot
plt.figure(figsize=(14, 8))

# Color by outcome (hits only; outs are not plotted)
for outcome in ['home_run', 'single', 'double', 'triple']:
    mask = batted_balls['events'] == outcome
    if mask.sum() > 0:
        plt.scatter(
            batted_balls[mask]['launch_speed'],
            batted_balls[mask]['launch_angle'],
            alpha=0.6,
            label=outcome.replace('_', ' ').title(),
            s=50
        )

plt.axhline(y=10, color='r', linestyle='--', alpha=0.3, label='Optimal LA Range')
plt.axhline(y=30, color='r', linestyle='--', alpha=0.3)
plt.axvline(x=95, color='g', linestyle='--', alpha=0.3, label='Hard Hit Threshold')

plt.xlabel('Exit Velocity (mph)', fontsize=12)
plt.ylabel('Launch Angle (degrees)', fontsize=12)
plt.title('Exit Velocity vs Launch Angle by Outcome', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
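
The scatter plot can be complemented with a table of hit rates by region. The sketch below reuses the 10 and 30 degree lines and the 95 mph threshold from the plot to bucket batted balls, then computes the share that went for hits in each bucket. The rows are made up, so the resulting rates are illustrative only; `la_band`, `contact`, and `is_hit` are names introduced here.

```python
import pandas as pd

# Made-up batted balls standing in for the batted_balls DataFrame.
batted = pd.DataFrame({
    "launch_speed": [102.0, 98.0, 97.0, 88.0, 85.0, 99.0, 75.0, 91.0],
    "launch_angle": [28.0, 15.0, 45.0, 12.0, -10.0, -4.0, 20.0, 50.0],
    "events": ["home_run", "double", "field_out", "single",
               "field_out", "single", "field_out", "field_out"],
})

hits = {"single", "double", "triple", "home_run"}
batted["is_hit"] = batted["events"].isin(hits)

# Label contact quality using the 95 mph hard-hit threshold.
batted["contact"] = (batted["launch_speed"] >= 95).map(
    {True: "95+ mph", False: "under 95 mph"}
)

# Bucket launch angle using the 10 and 30 degree reference lines.
batted["la_band"] = pd.cut(
    batted["launch_angle"],
    bins=[-90, 10, 30, 90],
    labels=["below 10", "10 to 30", "above 30"],
)

# Hit rate and sample size for each combination that occurs in the data.
hit_rate = (
    batted.groupby(["la_band", "contact"], observed=True)["is_hit"]
    .agg(["mean", "count"])
)
print(hit_rate)
```

With real data, the `count` column matters as much as the rate, since buckets with only a handful of batted balls give noisy estimates.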

## Next Steps

Now that you understand the basics, you can:

1. **Analyze specific players** - Track performance trends, pitch mix, etc.
2. **Compare pitchers** - Who has the best fastball? Best breaking ball?
3. **Predict outcomes** - Build models to predict hit probability, home runs, etc.
4. **Find inefficiencies** - Identify undervalued players or approaches
5. **Park effects** - Analyze how ballparks affect batted ball outcomes

Check out the other notebooks for more examples!