
# Datasets

To address the research questions outlined in the Project Overview, high-quality data on annual free-agent classes, player statistics, biographical information, awards, and salaries is required. Two primary data sources were used for this project: the **Lahman Baseball Database** and the **MLB Stats API**.

The limitations of the raw datasets helped define the final cleaned dataset spanning **2003–2015**:

- The MLB Stats API provides **no free-agent data prior to 2003** and lacks salary data for the contracts signed.
- The Lahman Database provides **salary data only through 2017**, with salary records becoming less reliable after 2015.

To accommodate these constraints, Lahman data was pulled for the years **2000–2017**. This range ensures that player salaries can be averaged over contract lengths of at least two years and that performance statistics from prior seasons can be accumulated, including the three-year performance averages for the earliest MLB Stats API free-agent cohort (2003). Free-agent signing data from the MLB Stats API was collected for the years **2003–2015**, so that three-year performance averages and salary calculations can be performed consistently across seasons despite the lack of reliable salary data after 2017.
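The year ranges above can be checked with a short sketch. It assumes that the three-year performance window ends in the free-agent season itself and that a contract begins in the following season; both conventions, and the helper names, are illustrative assumptions rather than the project's confirmed implementation.

```python
# Year ranges stated in the text (inclusive).
LAHMAN_YEARS = range(2000, 2018)   # Lahman pull: 2000-2017
FA_SEASONS = range(2003, 2016)     # MLB Stats API free-agent classes: 2003-2015


def performance_window(season, n_years=3):
    """Seasons averaged for a free-agent class (assumed to end at `season`)."""
    return list(range(season - n_years + 1, season + 1))


def salary_window(season, contract_years):
    """Seasons covered by a contract assumed to start in `season + 1`."""
    return list(range(season + 1, season + 1 + contract_years))


def covered(years):
    """True when every requested season falls inside the Lahman pull."""
    return all(year in LAHMAN_YEARS for year in years)


# Every free-agent class has a full three-year performance window
# and at least two seasons of salary data available.
for season in FA_SEASONS:
    assert covered(performance_window(season))
    assert covered(salary_window(season, 2))

print(performance_window(2003))  # [2001, 2002, 2003]
print(salary_window(2015, 2))    # [2016, 2017]
```

Under these assumptions, the 2015 class is the last one for which a two-year contract is fully covered by the 2017 salary cutoff.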

There are other data sources with higher-fidelity contract and salary information; however, many require agreements restricting their use for predictive modeling (e.g., Baseball Reference) or are locked behind paid subscriptions. Although the Lahman Database and MLB Stats API each have important limitations, together they provide a strong, freely accessible foundation for this analysis.

**NOTE:** The descriptions below summarize the key modeling variables extracted from each dataset. Because each dataset contains many more variables, the full versions are available in the [project repository](https://github.com/nw93929/DS6021-Project).


---

# Lahman Dataset

## Overview

The Lahman Baseball Database provides a comprehensive historical record of Major League Baseball, including tables covering **players**, **batting**, **fielding**, **pitching**, **awards**, **teams**, and **salaries**. The dataset includes performance statistics through **2025** and salary data through **2017**, with the primary limitation being the absence of salary information after 2017.

This dataset contains many of the variables necessary for predicting free-agent salaries, such as performance statistics, award histories, demographic information, and historical salary data. The Lahman tables were accessed through the **pylahman** Python package, which provides a convenient interface for loading Lahman data into pandas DataFrames.

---

## Data Description

The following Lahman tables were utilized in constructing the cleaned dataset (described further in the Data Cleaning section):

* **Batting**: yearly offensive statistics
    * playerID — Unique Lahman identifier for a player
    * yearID — Season in which the statistics were recorded
    * G_batting — Games played (batting)
    * AB — At-bats
    * R — Runs scored
    * H — Hits
    * 2B — Doubles
    * 3B — Triples
    * HR — Home runs
    * RBI — Runs batted in
    * SB — Stolen bases
    * CS — Caught stealing
    * BB — Walks
    * SO — Strikeouts
    * IBB — Intentional walks
    * HBP — Hit by pitch
    * SH — Sacrifice hits
    * SF — Sacrifice flies
    * GIDP — Grounded into double plays
* **Pitching**: pitching performance metrics
    * playerID — Unique Lahman identifier for a player
    * yearID — Season in which the statistics were recorded
    * W — Wins
    * L — Losses
    * G — Games pitched
    * GS — Games started
    * CG — Complete games
    * SHO — Shutouts
    * SV — Saves
    * IPOuts — Outs pitched (innings pitched × 3)
    * H — Hits allowed
    * ER — Earned runs allowed
    * HR — Home runs allowed
    * BB — Walks allowed
    * SO — Strikeouts
    * IBB — Intentional walks
    * WP — Wild pitches
    * HBP — Hit batters
    * BK — Balks
    * BFP — Batters faced
    * GF — Games finished
    * R — Runs allowed (earned + unearned)
    * SH — Sacrifice hits allowed
    * SF — Sacrifice flies allowed
    * GIDP — Grounded into double plays allowed
    * ERA — Earned run average (averaged for modeling)
    * BAOpp — Opponents' batting average (averaged for modeling)
* **Fielding**: defensive performance metrics
    * playerID — Unique Lahman identifier for a player
    * yearID — Season in which the statistics were recorded
    * POS — Position of the player
    * InnOuts — Total defensive outs played (innings × 3)
    * PO — Putouts
    * A — Assists
    * E — Errors
    * DP — Double plays turned
    * PB — Passed balls (catchers)
    * WP — Wild pitches allowed while catching
    * ZR — Zone Rating (averaged for modeling)
* **Awards**: Awards won by players each season
    * playerID — Unique Lahman identifier for a player
    * yearID — Year the award was won
    * AwardID — Award won by the player
* **Salaries**: Salaries of players each season
    * playerID — Unique Lahman identifier for a player
    * yearID — Season in which the salary was paid
    * salary — Salary of the player that season
* **People**: Basic biographic information of players
    * playerID — Unique Lahman identifier for a player
    * yearID — Season of the record
    * birthYear — Used to calculate age at free agency
    * givenName — Full name of the player
* **Allstar**: All-Star participants for each season
    * playerID — Unique Lahman identifier for a player
    * yearID — Season of the All-Star selection
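
Several of the columns listed above are building blocks for derived quantities. The sketch below shows how `IPOuts` converts to innings pitched, how ERA follows from `ER` and `IPOuts`, and how `birthYear` gives an approximate age at free agency. The function names are hypothetical and the inputs are illustrative values, not real player records.

```python
def innings_pitched(ip_outs):
    """Convert Lahman IPOuts (outs recorded) to innings pitched."""
    return ip_outs / 3


def era(earned_runs, ip_outs):
    """Earned runs allowed per nine innings; None when no outs were recorded."""
    if ip_outs == 0:
        return None
    return 9 * earned_runs / innings_pitched(ip_outs)


def age_in_season(birth_year, season):
    """Approximate age from birth year only (ignores birth month and day)."""
    return season - birth_year


# Illustrative values, not real player records.
print(innings_pitched(600))        # 200.0
print(era(80, 600))                # 3.6
print(age_in_season(1980, 2010))   # 30
```

One design point to keep in mind: averaging per-season ERA values weights every season equally, whereas recomputing ERA from summed `ER` and `IPOuts` weights seasons by innings pitched, so the two can differ for pitchers with uneven workloads.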

---

## Additional Lahman Information
* Learn more: https://sabr.org/lahman-database/
* pylahman documentation: https://pypi.org/project/pylahman/
* Kaggle version of Lahman with table overviews: https://www.kaggle.com/datasets/dalyas/lahman-baseball-database

---

# MLB Stats API — Free Agent Data

## Overview

The **MLB Stats API**, maintained by Major League Baseball, provides an endpoint that returns **free-agent signings by year**. This allows us to gather complete free-agent classes for each offseason, which is essential for modeling player contract outcomes. Although the API also includes information on minor-league contracts, this analysis focuses exclusively on **Major League contracts**.

Free-agent signing data was retrieved using the **baseballr** package and exported as a CSV covering the years **2003–2015**. When combined with player statistics and salary data from the Lahman Dataset, this API enables the construction of a comprehensive dataset for predicting player average annual value (AAV) and contract length.
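
Because the API carries no salary figures, AAV has to be derived from Lahman salaries. A minimal sketch of one way to do that is shown below, assuming the contract starts in the season after the free-agent `season` and that AAV is the plain mean of the yearly salaries over the contract; the function name and the salary history are hypothetical.

```python
def average_annual_value(salaries_by_year, season, contract_years):
    """Mean salary over the contract seasons, or None if any season is missing.

    Assumes the contract starts in the season after `season`.
    """
    years = range(season + 1, season + 1 + contract_years)
    if contract_years < 1 or any(year not in salaries_by_year for year in years):
        return None
    return sum(salaries_by_year[year] for year in years) / contract_years


# Hypothetical salary history for one player (not real data).
salaries = {2011: 4_000_000, 2012: 6_000_000, 2013: 8_000_000}

print(average_annual_value(salaries, 2010, 3))  # 6000000.0
print(average_annual_value(salaries, 2012, 2))  # None, the 2014 salary is missing
```

Returning `None` rather than a partial average keeps incomplete contracts from silently biasing the target variable.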

---

## Data Description

Below are the key variables utilized from the API-generated CSV for modeling:

* **notes** — Contains details regarding contract length and contract type  
* **season** — The offseason year in which a player entered free agency  
    * Example: value of 2010 means the player entered free agency following the 2010 season
* **player_full_name** — The player’s full name as listed in the API  
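
Since contract length and contract type live in the free-text `notes` field, they have to be extracted before modeling. The sketch below shows one possible approach with a regular expression. The exact wording of the API's notes is not documented here, so the sample strings and the patterns they match are assumptions for illustration only.

```python
import re

# Spelled-out contract lengths that may appear in notes (assumed wording).
WORD_TO_NUM = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Matches e.g. "two-year", "3-year", "3 year".
YEARS_RE = re.compile(r"\b(\d+|" + "|".join(WORD_TO_NUM) + r")[- ]year\b")


def parse_notes(notes):
    """Extract contract length and a minor-league flag from a notes string."""
    text = notes.lower()
    match = YEARS_RE.search(text)
    years = None
    if match:
        token = match.group(1)
        years = int(token) if token.isdigit() else WORD_TO_NUM[token]
    return {"contract_years": years, "minor_league": "minor league" in text}


# Hypothetical note strings, not taken from the API.
print(parse_notes("Signed a two-year contract."))
# {'contract_years': 2, 'minor_league': False}
print(parse_notes("Signed to a minor league contract."))
# {'contract_years': None, 'minor_league': True}
```

Rows flagged as minor-league deals can then be dropped, in line with the focus on Major League contracts.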

---

## Additional Information

* MLB Stats API documentation (requires account access): [https://statsapi.mlb.com/](https://statsapi.mlb.com/)
* baseballr package information: [https://billpetti.github.io/baseballr/](https://billpetti.github.io/baseballr/)


---