# 01ExploreData

This notebook performs an initial exploration of **season-level FanGraphs batting data (2022–2025)** pulled programmatically via `pybaseball`.

The goal of this step is **not modeling**, but to:

- Confirm dataset grain (one row per player per season)
- Validate season coverage
- Inspect available columns
- Identify candidate fields for downstream storage in MySQL

All modeling and feature engineering will be performed in later steps.

In [7]:
import pandas as pd

df = pd.read_csv("../data/raw/FanGraphsBatting_2022_2025.csv")
df.head()

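|   | IDfg | Season | Name | Team | Age | G | AB | PA | H | 1B | ... | maxEV | HardHit | HardHit% | Events | CStr% | CSW% | xBA | xSLG | xwOBA | L-WAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15640 | 2022 | Aaron Judge | NYY | 30 | 157 | 570 | 696 | 177 | 87 | ... | 118.4 | 247 | 0.611 | 404 | 0.169 | 0.287 | 0.305 | 0.706 | 0.463 | 11.4 |
| 1 | 9777 | 2022 | Nolan Arenado | STL | 31 | 148 | 557 | 620 | 163 | 90 | ... | 111.4 | 190 | 0.389 | 489 | 0.155 | 0.241 | 0.266 | 0.445 | 0.339 | 7.1 |
| 2 | 11493 | 2022 | Manny Machado | SDP | 29 | 150 | 578 | 644 | 172 | 102 | ... | 112.4 | 219 | 0.49 | 447 | 0.126 | 0.243 | 0.264 | 0.447 | 0.338 | 7.4 |
| 3 | 5417 | 2022 | Jose Altuve | HOU | 32 | 141 | 527 | 604 | 158 | 91 | ... | 109.8 | 131 | 0.297 | 441 | 0.173 | 0.24 | 0.269 | 0.44 | 0.354 | 6.6 |
| 4 | 9218 | 2022 | Paul Goldschmidt | STL | 34 | 151 | 561 | 651 | 178 | 102 | ... | 112.3 | 201 | 0.472 | 426 | 0.196 | 0.295 | 0.261 | 0.482 | 0.367 | 7.0 |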


## Dataset Size

We inspect the number of rows and columns to understand the scale of the dataset.

In [15]:
df.shape

(538, 320)

## Data Grain

Each row in this dataset represents a single player-season.

The data is already aggregated at the season level, meaning no pitch-level or game-level aggregation is required.

In [16]:
df.head()

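|   | IDfg | Season | Name | Team | Age | G | AB | PA | H | 1B | ... | maxEV | HardHit | HardHit% | Events | CStr% | CSW% | xBA | xSLG | xwOBA | L-WAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15640 | 2022 | Aaron Judge | NYY | 30 | 157 | 570 | 696 | 177 | 87 | ... | 118.4 | 247 | 0.611 | 404 | 0.169 | 0.287 | 0.305 | 0.706 | 0.463 | 11.4 |
| 1 | 9777 | 2022 | Nolan Arenado | STL | 31 | 148 | 557 | 620 | 163 | 90 | ... | 111.4 | 190 | 0.389 | 489 | 0.155 | 0.241 | 0.266 | 0.445 | 0.339 | 7.1 |
| 2 | 11493 | 2022 | Manny Machado | SDP | 29 | 150 | 578 | 644 | 172 | 102 | ... | 112.4 | 219 | 0.49 | 447 | 0.126 | 0.243 | 0.264 | 0.447 | 0.338 | 7.4 |
| 3 | 5417 | 2022 | Jose Altuve | HOU | 32 | 141 | 527 | 604 | 158 | 91 | ... | 109.8 | 131 | 0.297 | 441 | 0.173 | 0.24 | 0.269 | 0.44 | 0.354 | 6.6 |
| 4 | 9218 | 2022 | Paul Goldschmidt | STL | 34 | 151 | 561 | 651 | 178 | 102 | ... | 112.3 | 201 | 0.472 | 426 | 0.196 | 0.295 | 0.261 | 0.482 | 0.367 | 7.0 |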
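
The grain can also be tested directly rather than read off the first few rows. The following is a minimal sketch, assuming that `IDfg` uniquely identifies a player and that the pair (`IDfg`, `Season`) is the intended key; neither assumption is verified elsewhere in this notebook. The `sample` frame and the helper `grain_violations` are illustrative stand-ins (built from rows displayed above) so the snippet runs on its own; in the notebook the helper would be called on `df`.

```python
import pandas as pd

# Stand-in for the notebook's `df`, built from rows displayed above.
sample = pd.DataFrame(
    {
        "IDfg": [15640, 9777, 11493],
        "Season": [2022, 2022, 2022],
        "Name": ["Aaron Judge", "Nolan Arenado", "Manny Machado"],
    }
)


def grain_violations(frame: pd.DataFrame, keys=("IDfg", "Season")) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    key_list = list(keys)
    # keep=False marks all members of a duplicated group, not just the repeats.
    mask = frame.duplicated(subset=key_list, keep=False)
    return frame.loc[mask].sort_values(key_list)


violations = grain_violations(sample)
print(f"{len(violations)} rows violate the one-row-per-player-season grain")
```

An empty result supports the one-row-per-player-season reading; any returned rows would need to be inspected before loading the data into a keyed table.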


## Season Coverage

We verify that all expected seasons are present in the dataset.

In [17]:
df["Season"].value_counts().sort_index()

Season
2022    130
2023    134
2024    129
2025    145
Name: count, dtype: int64
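
The same check can be written as an explicit comparison against the expected seasons, which fails loudly if a year is missing. This is a sketch only: `EXPECTED_SEASONS` and `missing_seasons` are names introduced here for illustration, and the `seasons` series is rebuilt from the counts printed above so the snippet is self-contained. In the notebook, `df["Season"]` would be passed instead.

```python
import pandas as pd

EXPECTED_SEASONS = frozenset({2022, 2023, 2024, 2025})


def missing_seasons(seasons: pd.Series, expected=EXPECTED_SEASONS) -> set:
    """Return the expected seasons that never appear in `seasons`."""
    present = set(seasons.astype(int).tolist())
    return set(expected) - present


# Stand-in for df["Season"], rebuilt from the per-season counts printed above.
counts = {2022: 130, 2023: 134, 2024: 129, 2025: 145}
seasons = pd.Series(
    [year for year, n in counts.items() for _ in range(n)], name="Season"
)

missing = missing_seasons(seasons)
print("missing seasons:", sorted(missing))
print("total rows:", len(seasons))
```

The per-season counts sum to 538, which matches the row count reported by `df.shape`.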

## Column Overview

FanGraphs provides a large number of season-level metrics across many dimensions of player performance.

Rather than selecting features at this stage, we inspect the available columns to understand what types of information exist in the dataset.

In [18]:
df.columns.tolist()

['IDfg',
 'Season',
 'Name',
 'Team',
 'Age',
 'G',
 'AB',
 'PA',
 'H',
 '1B',
 '2B',
 '3B',
 'HR',
 'R',
 'RBI',
 'BB',
 'IBB',
 'SO',
 'HBP',
 'SF',
 'SH',
 'GDP',
 'SB',
 'CS',
 'AVG',
 'GB',
 'FB',
 'LD',
 'IFFB',
 'Pitches',
 'Balls',
 'Strikes',
 'IFH',
 'BU',
 'BUH',
 'BB%',
 'K%',
 'BB/K',
 'OBP',
 'SLG',
 'OPS',
 'ISO',
 'BABIP',
 'GB/FB',
 'LD%',
 'GB%',
 'FB%',
 'IFFB%',
 'HR/FB',
 'IFH%',
 'BUH%',
 'wOBA',
 'wRAA',
 'wRC',
 'Bat',
 'Fld',
 'Rep',
 'Pos',
 'RAR',
 'WAR',
 'Dol',
 'Spd',
 'wRC+',
 'WPA',
 '-WPA',
 '+WPA',
 'RE24',
 'REW',
 'pLI',
 'phLI',
 'PH',
 'WPA/LI',
 'Clutch',
 'FB% (Pitch)',
 'FBv',
 'SL%',
 'SLv',
 'CT%',
 'CTv',
 'CB%',
 'CBv',
 'CH%',
 'CHv',
 'SF%',
 'SFv',
 'KN%',
 'KNv',
 'XX%',
 'PO%',
 'wFB',
 'wSL',
 'wCT',
 'wCB',
 'wCH',
 'wSF',
 'wKN',
 'wFB/C',
 'wSL/C',
 'wCT/C',
 'wCB/C',
 'wCH/C',
 'wSF/C',
 'wKN/C',
 'O-Swing%',
 'Z-Swing%',
 'Swing%',
 'O-Contact%',
 'Z-Contact%',
 'Contact%',
 'Zone%',
 'F-Strike%',
 'SwStr%',
 'BsR',
 'FA% (sc)',
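
Many of the names listed above contain characters such as `%`, `/`, `+`, spaces, or parentheses. MySQL can store such names when they are quoted with backticks, but they are awkward to work with, so it may help to flag them before choosing fields. The sketch below applies a deliberately conservative rule (letters, digits, and underscores, not starting with a digit); that rule is an assumption made here and is stricter than what MySQL itself requires. `split_by_identifier_safety` is a hypothetical helper, and `columns` holds only a few names copied from the list; in the notebook it would be `df.columns.tolist()`.

```python
import re

# A few names copied from the column list above.
columns = [
    "IDfg", "Season", "Name", "PA", "1B", "BB%", "BB/K",
    "wRC+", "-WPA", "FB% (Pitch)", "FA% (sc)",
]

# Conservative convention (an assumption, stricter than MySQL requires):
# letters, digits, and underscores only, and no leading digit.
PLAIN_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_by_identifier_safety(names):
    """Split names into those usable as-is and those needing quoting or renaming."""
    plain = [n for n in names if PLAIN_IDENTIFIER.fullmatch(n)]
    special = [n for n in names if not PLAIN_IDENTIFIER.fullmatch(n)]
    return plain, special


plain, special = split_by_identifier_safety(columns)
print("usable as-is:", plain)
print("need quoting or renaming:", special)
```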
 

## Core Fields Sanity Check

We briefly inspect a small subset of identifier and performance fields to confirm data consistency.

In [19]:
df[["IDfg", "Name", "Season", "Age", "PA", "wRC+", "WAR"]].head(10)

|   | IDfg | Name | Season | Age | PA | wRC+ | WAR |
|---|---|---|---|---|---|---|---|
| 0 | 15640 | Aaron Judge | 2022 | 30 | 696 | 206 | 11.1 |
| 1 | 9777 | Nolan Arenado | 2022 | 31 | 620 | 149 | 7.2 |
| 2 | 11493 | Manny Machado | 2022 | 29 | 644 | 152 | 7.1 |
| 3 | 5417 | Jose Altuve | 2022 | 32 | 604 | 164 | 6.9 |
| 4 | 9218 | Paul Goldschmidt | 2022 | 34 | 651 | 175 | 6.8 |
| 5 | 5361 | Freddie Freeman | 2022 | 32 | 708 | 157 | 6.8 |
| 6 | 11739 | J.T. Realmuto | 2022 | 31 | 562 | 129 | 6.7 |
| 7 | 18314 | Dansby Swanson | 2022 | 28 | 696 | 117 | 6.6 |
| 8 | 19556 | Yordan Alvarez | 2022 | 25 | 561 | 185 | 6.4 |
| 9 | 16252 | Trea Turner | 2022 | 29 | 708 | 128 | 6.4 |
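
A visual scan of ten rows can be backed by a few programmatic checks on the same fields. The sketch below is illustrative only: the age bounds (16 to 50) are assumptions chosen for this example, not rules from FanGraphs, and `core` is a stand-in built from the first five rows shown above. In the notebook the same checks would run on `df[["IDfg", "Name", "Season", "Age", "PA", "wRC+", "WAR"]]`.

```python
import pandas as pd

# Stand-in for the core-field subset, using the first five rows shown above.
core = pd.DataFrame(
    {
        "IDfg": [15640, 9777, 11493, 5417, 9218],
        "Name": ["Aaron Judge", "Nolan Arenado", "Manny Machado",
                 "Jose Altuve", "Paul Goldschmidt"],
        "Season": [2022, 2022, 2022, 2022, 2022],
        "Age": [30, 31, 29, 32, 34],
        "PA": [696, 620, 644, 604, 651],
        "wRC+": [206, 149, 152, 164, 175],
        "WAR": [11.1, 7.2, 7.1, 6.9, 6.8],
    }
)

# Each entry maps a description to a single True/False result.
checks = {
    "no missing values": bool(core.notna().all().all()),
    "Season within 2022-2025": bool(core["Season"].between(2022, 2025).all()),
    "PA is positive": bool((core["PA"] > 0).all()),
    # Age bounds are an assumption made for this sketch.
    "Age between 16 and 50": bool(core["Age"].between(16, 50).all()),
}

failed = [name for name, ok in checks.items() if not ok]
print("failed checks:", failed)
```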


## Notebook 1 Conclusion

This exploration indicates that the FanGraphs dataset provides season-level player data for 2022–2025, with no obvious problems in the fields inspected.

The data:

- is aggregated at the player–season level (538 rows, 320 columns)
- covers all four seasons, with 129 to 145 player-seasons each
- contains a broad set of offensive metrics suitable for downstream analysis

No filtering, feature selection, or modeling decisions are finalized at this stage.
Those choices are deferred to the next notebook.