Tired of using the MLB API? Downloads folder full of CSV files exported from Baseball Savant? Look no further. Here's what you can get out of FFDB:
- Access to all of the pitch-level and play-by-play data in the pitch-tracking era (beginning in the late 2000s)
- A robust database schema based on the MLB API format
- Lightning-fast queries using Apache Parquet and DuckDB
This was originally created as an internal tool for my own sabermetrics projects, but I cleaned it up just enough to make it public in case anyone else wanted to use it as well.
(You can find more about me, the author, at https://harperawl.net. Feel free to contact me with any questions you have about this project!)
python -m venv .venv
.venv\Scripts\activate
pip install -e .
- RAW_DATA_DIR
- PROCESSED_DATA_DIR
- DUCKDB_PATH
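The three names listed above look like path settings. As a rough sketch, here is one way a script could read and validate them. Treating them as environment variables is an assumption (FFDB may read them from a config file instead), and `load_ffdb_paths` and the example paths are hypothetical, not part of FFDB.

```python
import os
from pathlib import Path

# Names taken from the setup notes above; treating them as environment
# variables is an assumption, not confirmed FFDB behavior.
REQUIRED_SETTINGS = ("RAW_DATA_DIR", "PROCESSED_DATA_DIR", "DUCKDB_PATH")


def load_ffdb_paths(env=None):
    """Return the three settings as Path objects, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: Path(env[name]) for name in REQUIRED_SETTINGS}


# Example with an explicit mapping (hypothetical paths) instead of the real environment.
example = load_ffdb_paths(
    {
        "RAW_DATA_DIR": "data/raw",
        "PROCESSED_DATA_DIR": "data/processed",
        "DUCKDB_PATH": "data/ffdb.duckdb",
    }
)
print(example["DUCKDB_PATH"])
```

Failing early with the list of missing names tends to be easier to debug than a path error halfway through a long setup run.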
This command line tool helps you set up the database and refresh it as needed.
Set up the database from start to finish:
ffdb setup --start-year 2024 --end-year 2026
Refresh the database (defaults to the current year):
ffdb refresh
Python:
pip install duckdb
R:
install.packages("duckdb")
Python:
import duckdb
conn = duckdb.connect("path/to/your/database.duckdb")
# fetchall() materializes the result as a list of tuples
data = conn.execute("""
SELECT * FROM games LIMIT 100;
""").fetchall()
R:
library(duckdb)
library(dplyr)
conn <- dbConnect(duckdb(), dbdir = "path/to/your/database.duckdb", read_only = TRUE)
games <- dbGetQuery(conn, "
SELECT * FROM games LIMIT 100;
")
See the database documentation for more information on how to query the database.