# Exploring Raw Statcast Data (2022–2025)

This notebook performs an initial exploration of raw Statcast data for batters, pulled using `pybaseball`.

The goal of this step is **not modeling**, but to:
- understand the structure and grain of the data
- validate season coverage
- identify which columns are useful for downstream aggregation
- inform the design of a clean player-season dataset

All modeling and feature engineering will be performed in later steps.

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/StatcastBatter_2022_2025.csv")

## Dataset Overview

We begin by inspecting the size, columns, and basic structure of the raw dataset.

In [2]:
df.shape    # evaluated but not shown: a cell only echoes its last expression
df.columns

Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description',
       ...
       'api_break_z_with_gravity', 'api_break_x_arm', 'api_break_x_batter_in',
       'arm_angle', 'attack_angle', 'attack_direction', 'swing_path_tilt',
       'intercept_ball_minus_batter_pos_x_inches',
       'intercept_ball_minus_batter_pos_y_inches', 'Season'],
      dtype='object', length=119)
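
Column names alone do not show which fields are actually populated. The following is a minimal sketch of a per-season missingness check. It runs on a small toy frame (`toy`, with made-up values) rather than on `df`, and the column subset is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Season": [2022, 2022, 2023, 2023],
    "release_speed": [80.8, 97.7, 95.1, np.nan],
    "attack_angle": [np.nan, np.nan, 12.0, np.nan],
})

# Share of missing values per column, split by season.
missing_share = toy.drop(columns="Season").isna().groupby(toy["Season"]).mean()
print(missing_share)
```

On the real data the same expression could be applied to `df`; columns that are fully missing in some seasons would need care when aggregating across seasons.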

## Data Grain

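Each row in this dataset represents a single pitch.  
The `description` column records the result of every pitch (for example `ball` or `called_strike`), while `events` is populated only on the pitch that ends a plate appearance (for example `field_out` or `strikeout`).

Understanding this pitch-level grain is critical before aggregating to a player-season level.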

In [3]:
df.head()

|   | pitch_type | game_date | release_speed | release_pos_x | release_pos_z | player_name | batter | pitcher | events | description | ... | api_break_z_with_gravity | api_break_x_arm | api_break_x_batter_in | arm_angle | attack_angle | attack_direction | swing_path_tilt | intercept_ball_minus_batter_pos_x_inches | intercept_ball_minus_batter_pos_y_inches | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CH | 2022-10-05 | 80.8 | -0.76 | 6.61 | Baker, Bryan | 624415 | 641329 | field_out | hit_into_play | ... | 2.68 | 1.34 | -1.34 | 59.9 | NaN | NaN | NaN | NaN | NaN | 2022 |
| 1 | FF | 2022-10-05 | 97.7 | -0.58 | 6.6 | Baker, Bryan | 643376 | 641329 | strikeout | swinging_strike | ... | 0.81 | 0.17 | 0.17 | 53.6 | NaN | NaN | NaN | NaN | NaN | 2022 |
| 2 | CH | 2022-10-05 | 84.9 | -0.55 | 6.58 | Baker, Bryan | 643376 | 641329 | NaN | ball | ... | 2.34 | 1.22 | 1.22 | 58.4 | NaN | NaN | NaN | NaN | NaN | 2022 |
| 3 | FF | 2022-10-05 | 97.2 | -0.42 | 6.6 | Baker, Bryan | 643376 | 641329 | NaN | swinging_strike | ... | 0.68 | 0.13 | 0.13 | 57.2 | NaN | NaN | NaN | NaN | NaN | 2022 |
| 4 | SL | 2022-10-05 | 86.2 | -0.55 | 6.64 | Baker, Bryan | 643376 | 641329 | NaN | called_strike | ... | 3.04 | -0.63 | -0.63 | 58.8 | NaN | NaN | NaN | NaN | NaN | 2022 |
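
The preview suggests that `description` is filled on every row, while `events` is filled only on the row that ends a plate appearance. The following is a minimal sketch of how that could be checked. It uses a toy frame (`toy`) that copies three columns from the five preview rows, and it assumes that `hit_into_play` marks a batted ball, as the first preview row suggests.

```python
import pandas as pd

# Toy frame copying the five preview rows (subset of columns only).
toy = pd.DataFrame({
    "batter": [624415, 643376, 643376, 643376, 643376],
    "events": ["field_out", "strikeout", None, None, None],
    "description": ["hit_into_play", "swinging_strike", "ball",
                    "swinging_strike", "called_strike"],
})

n_rows = len(toy)                            # total rows
n_pa_ending = toy["events"].notna().sum()    # rows that ended a plate appearance
n_in_play = (toy["description"] == "hit_into_play").sum()  # balls put in play

print(n_rows, n_pa_ending, n_in_play)
```

If the three counts differ on the full data as they do here, player-season aggregates such as plate appearances or batted balls would need to filter rows first rather than count them all.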


## Season Coverage

We verify that the dataset includes the expected seasons and confirm row counts per season.

In [4]:
df["Season"].value_counts().sort_index()

Season
2022    757507
2023    749058
2024    736231
2025    241080
Name: count, dtype: int64
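
The 2025 count is roughly a third of each earlier season, which is consistent with a partial season at the time of the pull, although row counts alone cannot confirm that. The following is a minimal sketch of a date-range check that could help. It assumes `game_date` parses as a date and uses a toy frame (`toy`) with made-up dates.

```python
import pandas as pd

# Toy stand-in for df; the dates are invented for illustration.
toy = pd.DataFrame({
    "game_date": ["2022-04-07", "2022-10-05", "2025-03-27", "2025-05-30"],
    "Season": [2022, 2022, 2025, 2025],
})
toy["game_date"] = pd.to_datetime(toy["game_date"])

# First and last game date observed in each season.
date_range = toy.groupby("Season")["game_date"].agg(["min", "max"])

# Number of rows whose calendar year disagrees with the Season label.
mismatched = (toy["game_date"].dt.year != toy["Season"]).sum()

print(date_range)
print(mismatched)
```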

## Player Presence Across Seasons

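We count the unique batters in each season as a first look at player coverage. Multi-season presence matters for
trajectory-based modeling and rolling feature construction, but per-season counts alone do not show how many batters appear in more than one season.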

In [5]:
df.groupby("Season")["batter"].nunique()

Season
2022    1620
2023    1353
2024    1223
2025     540
Name: batter, dtype: int64
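
Per-season counts do not say how many batters recur across seasons. The following is a minimal sketch of one way to measure that, by counting distinct seasons per batter. It uses a toy frame (`toy`); the ID `700001` is hypothetical.

```python
import pandas as pd

# Toy stand-in for df; 700001 is a made-up batter ID.
toy = pd.DataFrame({
    "batter": [624415, 624415, 643376, 643376, 643376, 700001],
    "Season": [2022, 2023, 2022, 2022, 2024, 2025],
})

# Number of distinct seasons in which each batter appears.
seasons_per_batter = toy.groupby("batter")["Season"].nunique()

# How many batters appear in exactly 1, 2, ... seasons.
presence = seasons_per_batter.value_counts().sort_index()
print(presence)
```

Applied to `df`, the resulting distribution would indicate how many batters have enough seasons to support rolling or trajectory features.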