# 01 - Data Collection

Pull historical stats and advanced metrics using pybaseball.

## Data Sources

### Primary (FanGraphs)
- `batting_stats()` - ~318 columns of batting stats including Statcast-derived metrics
- `pitching_stats()` - ~391 columns of pitching stats

### Supplementary (Baseball Savant)
- `statcast_batter_expected_stats()` - xBA, xSLG, xwOBA, sweet_spot%, etc.
- `statcast_pitcher_expected_stats()` - expected stats allowed
- `statcast_pitcher_pitch_arsenal()` - velocity, spin, usage by pitch type
- `statcast_pitcher_arsenal_stats()` - whiff%, run values by pitch type
- `statcast_pitcher_pitch_movement()` - horizontal/vertical break

### ID Mapping
- `chadwick_register()` - cross-reference FanGraphs IDs (IDfg) with Savant IDs (MLBAM)
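
The cross-reference described above amounts to a left join on the FanGraphs ID. Below is a minimal sketch on toy data, not the project's actual join. The `key_fangraphs` and `key_mlbam` column names follow the Chadwick register as returned by pybaseball, while the IDs, the player names, and the helper `attach_mlbam_ids` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for a FanGraphs batting table (IDfg is the FanGraphs player ID).
batting = pd.DataFrame({
    "IDfg": [101, 102, 103],
    "Name": ["Player A", "Player B", "Player C"],
})

# Toy stand-in for the Chadwick register. The IDs are invented.
register = pd.DataFrame({
    "key_fangraphs": [101, 102],
    "key_mlbam": [900001, 900002],
})


def attach_mlbam_ids(fg: pd.DataFrame, reg: pd.DataFrame) -> pd.DataFrame:
    """Left-join MLBAM IDs onto a FanGraphs table, keeping unmatched rows."""
    # Deduplicate the lookup so the join cannot multiply FanGraphs rows.
    lookup = reg[["key_fangraphs", "key_mlbam"]].drop_duplicates("key_fangraphs")
    merged = fg.merge(lookup, how="left", left_on="IDfg", right_on="key_fangraphs")
    return merged.drop(columns="key_fangraphs")


mapped = attach_mlbam_ids(batting, register)

# Rows without a Savant ID are worth reviewing before merging Statcast data.
unmatched = mapped[mapped["key_mlbam"].isna()]
print(unmatched["Name"].tolist())
```

A left join keeps every FanGraphs row, so players missing from the register show up as `NaN` in `key_mlbam` instead of silently disappearing.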

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
from pybaseball import cache

# Enable caching to avoid re-downloading
cache.enable()

from src.data.collect import (
    collect_all,
    collect_fangraphs_batting,
    collect_fangraphs_pitching,
    collect_statcast_batter_expected,
    collect_statcast_pitcher_expected,
    collect_pitcher_arsenal,
    collect_pitcher_arsenal_stats,
    collect_pitcher_movement,
    collect_id_mapping,
)
from config.settings import TRAIN_START_YEAR, PREDICT_YEAR, RAW_DATA_DIR

## Run Full Collection

This collects every data source for 2015-2025 by default; the year range is set in `config/settings.py`.

**Expected runtime:** 5-15 minutes depending on network speed. Data is cached after the first run, so reruns are faster.

In [None]:
# Collect all data (2015-2025)
results = collect_all()

## Inspect Collected Data

### FanGraphs Batting

In [None]:
batting = pd.read_csv(f"{RAW_DATA_DIR}/fangraphs_batting.csv")
print(f"Shape: {batting.shape}")
print(f"\nYears: {batting['Season'].min()} - {batting['Season'].max()}")
print(f"\nSample columns (first 30):")
print(batting.columns[:30].tolist())

In [None]:
# Check key columns for fantasy points calculation
fantasy_cols = ['Name', 'Season', 'PA', 'AB', 'H', '1B', '2B', '3B', 'HR', 'R', 'RBI', 'BB', 'SO', 'SB', 'IDfg']
batting[fantasy_cols].head(10)
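
These columns feed the fantasy points calculation, which reduces to a weighted sum of counting stats. Here is a minimal sketch of that sum. The weights in `BATTING_WEIGHTS` are hypothetical placeholders rather than this project's scoring rules, and `batting_fantasy_points` is an illustrative helper.

```python
import pandas as pd

# HYPOTHETICAL scoring weights, for illustration only; substitute your league's rules.
BATTING_WEIGHTS = {
    "1B": 1.0, "2B": 2.0, "3B": 3.0, "HR": 4.0,
    "R": 1.0, "RBI": 1.0, "BB": 1.0, "SB": 2.0, "SO": -1.0,
}


def batting_fantasy_points(df: pd.DataFrame, weights: dict) -> pd.Series:
    """Return one weighted sum of counting stats per row."""
    missing = [col for col in weights if col not in df.columns]
    if missing:
        raise KeyError(f"Missing stat columns: {missing}")
    # Multiply each stat column by its weight, then add across columns.
    return df[list(weights)].mul(pd.Series(weights)).sum(axis=1)


# Invented stat line for a single player-season.
toy = pd.DataFrame({
    "1B": [100], "2B": [30], "3B": [2], "HR": [25],
    "R": [90], "RBI": [85], "BB": [60], "SB": [10], "SO": [120],
})
points = batting_fantasy_points(toy, BATTING_WEIGHTS)
print(points.iloc[0])
```

Raising on missing columns makes a renamed or dropped stat fail loudly instead of quietly scoring as zero.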

In [None]:
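# Check Statcast-derived columns in FanGraphs data.
# Exit velocity columns are matched by exact name: a bare 'ev' substring
# would also match unrelated columns such as 'Events'.
keywords = ['xba', 'xslg', 'xwoba', 'barrel', 'hardhit']
statcast_cols = [c for c in batting.columns if c.lower() in ('ev', 'maxev') or any(k in c.lower() for k in keywords)]
print(f"Statcast-derived columns in FanGraphs data: {statcast_cols}")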

### FanGraphs Pitching

In [None]:
pitching = pd.read_csv(f"{RAW_DATA_DIR}/fangraphs_pitching.csv")
print(f"Shape: {pitching.shape}")
print(f"\nYears: {pitching['Season'].min()} - {pitching['Season'].max()}")

In [None]:
# Check key columns for fantasy points calculation (skill-based only)
fantasy_cols = ['Name', 'Season', 'IP', 'SO', 'BB', 'H', 'ER', 'W', 'L', 'SV', 'IDfg']
pitching[fantasy_cols].head(10)

### Pitcher Arsenal Data

In [None]:
arsenal = pd.read_csv(f"{RAW_DATA_DIR}/savant_pitcher_arsenal.csv")
print(f"Shape: {arsenal.shape}")
print(f"\nColumns: {arsenal.columns.tolist()}")

In [None]:
# Sample: fastball velocity and usage
ff_cols = [c for c in arsenal.columns if c.startswith('ff_')]
print(f"4-seam fastball columns: {ff_cols}")
arsenal[['player_id', 'year'] + ff_cols].head()
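
Because each pitch type gets its own column prefix, per-pitch analysis is often easier in long format. The sketch below reshapes a toy table. The `ff_avg_speed` and `sl_avg_speed` column names are assumed for illustration and may not match the saved CSV exactly, and `arsenal_to_long` is a hypothetical helper.

```python
import pandas as pd

# Toy wide-format arsenal table; values are invented.
arsenal = pd.DataFrame({
    "player_id": [1, 2],
    "year": [2023, 2023],
    "ff_avg_speed": [95.1, 93.4],
    "sl_avg_speed": [86.0, None],  # pitcher 2 does not throw a slider
})


def arsenal_to_long(df: pd.DataFrame, id_cols=("player_id", "year")) -> pd.DataFrame:
    """Reshape '<pitch>_<metric>' columns into one row per pitcher and pitch type."""
    long = df.melt(id_vars=list(id_cols), var_name="column", value_name="value")
    # Split "ff_avg_speed" into pitch type "ff" and metric "avg_speed".
    parts = long["column"].str.split("_", n=1, expand=True)
    long["pitch_type"] = parts[0]
    long["metric"] = parts[1]
    # Drop pitches a pitcher does not throw (missing values).
    long = long.dropna(subset=["value"]).drop(columns="column")
    return long.reset_index(drop=True)


long_arsenal = arsenal_to_long(arsenal)
print(long_arsenal)
```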

### Player ID Mapping

In [None]:
id_map = pd.read_csv(f"{RAW_DATA_DIR}/player_id_map.csv")
print(f"Shape: {id_map.shape}")
print(f"\nColumns: {id_map.columns.tolist()}")

# Example: look up a player
id_map[id_map['name_last'].str.lower() == 'ohtani'].head()

## Verify Key Stats

Quick sanity check on a known player.

In [None]:
# Check Mike Trout's 2023 stats
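# Match the full name so another player named Trout cannot be picked up; na=False guards against missing names
trout = batting[(batting['Name'].str.contains('Mike Trout', case=False, na=False)) & (batting['Season'] == 2023)]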
if len(trout) > 0:
    print("Mike Trout 2023:")
    print(f"  PA: {trout['PA'].values[0]}")
    print(f"  HR: {trout['HR'].values[0]}")
    print(f"  R: {trout['R'].values[0]}")
    print(f"  RBI: {trout['RBI'].values[0]}")
    print(f"  SB: {trout['SB'].values[0]}")
    print(f"  SO: {trout['SO'].values[0]}")
else:
    print("Mike Trout 2023 not found (he may fall below the playing-time threshold used during collection)")
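
A single-player spot check can be complemented by table-wide consistency rules. The sketch below applies two identities that should hold for any batting line: hits equal the sum of singles, doubles, triples, and home runs, and at-bats never exceed plate appearances. The data is invented and `find_inconsistent_rows` is a hypothetical helper.

```python
import pandas as pd


def find_inconsistent_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that violate basic batting-stat identities."""
    # H must equal 1B + 2B + 3B + HR.
    hits_mismatch = df["H"] != df[["1B", "2B", "3B", "HR"]].sum(axis=1)
    # AB is a subset of PA, so it can never be larger.
    ab_exceeds_pa = df["AB"] > df["PA"]
    return df[hits_mismatch | ab_exceeds_pa]


# Invented rows: Player B has more AB than PA, which is impossible.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "PA": [600, 500],
    "AB": [540, 520],
    "H": [150, 130],
    "1B": [95, 90],
    "2B": [30, 25],
    "3B": [5, 3],
    "HR": [20, 12],
})
bad = find_inconsistent_rows(toy)
print(bad["Name"].tolist())
```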

## Summary

Data collected and saved to `data/raw/`:

| File | Description |
|------|-------------|
| `fangraphs_batting.csv` | Primary batting stats (~318 cols) |
| `fangraphs_pitching.csv` | Primary pitching stats (~391 cols) |
| `savant_batter_expected.csv` | Batter expected stats |
| `savant_pitcher_expected.csv` | Pitcher expected stats |
| `savant_pitcher_arsenal.csv` | Pitch velocity, spin, usage |
| `savant_pitcher_arsenal_stats.csv` | Pitch whiff%, run values |
| `savant_pitcher_movement.csv` | Pitch horizontal/vertical break |
| `player_id_map.csv` | FanGraphs <-> Savant ID mapping |
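
To confirm that a collection run produced every file in the table, a small check along these lines may help. It is a sketch: the file names come from the table above, while `missing_raw_files` is a hypothetical helper, and the demo runs against a temporary directory instead of the real `data/raw/`.

```python
from pathlib import Path
import tempfile

# File names taken from the summary table.
EXPECTED_FILES = [
    "fangraphs_batting.csv",
    "fangraphs_pitching.csv",
    "savant_batter_expected.csv",
    "savant_pitcher_expected.csv",
    "savant_pitcher_arsenal.csv",
    "savant_pitcher_arsenal_stats.csv",
    "savant_pitcher_movement.csv",
    "player_id_map.csv",
]


def missing_raw_files(raw_dir) -> list:
    """Return the expected file names that are absent from raw_dir."""
    raw = Path(raw_dir)
    return [name for name in EXPECTED_FILES if not (raw / name).is_file()]


# Demo: only one of the eight files exists in the temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fangraphs_batting.csv").write_text("Name,Season\n")
    missing = missing_raw_files(tmp)

print(missing)
```

In the notebook, the same function could be called with `RAW_DATA_DIR` right after `collect_all()` finishes.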