# Merging Provided Data with Supplemental Data

This notebook focuses on merging two datasets to create a consolidated view of player performance. The primary data set, `data/k.csv`, contains aggregated statistics for individual pitchers across MLB seasons, while the supplemental data set, `data/supplemental-stats.csv`, provides more granular, team-specific metrics. 

## Overview of Datasets

### Provided Data (`data/k.csv`)
This data set includes 8 columns summarizing player performance for each season:
1. **MLBAMID**: MLB player ID.
2. **PlayerId**: FanGraphs player ID.
3. **Name**: Player name.
4. **Team**: Team abbreviation (or `"- - -"` for players with multiple teams in a season).
5. **Age**: Player’s age during the given season.
6. **Season**: Season year.
7. **TBF**: Total batters faced in the season.
8. **K%**: Strikeout percentage for the season.

### Supplemental Data (`data/supplemental-stats.csv`)
This data set contains 36 columns detailing additional player statistics, including granular performance metrics such as pitches per plate appearance, strike percentages, and inning-specific data. These statistics are broken down by team, which allows for a more detailed analysis of multi-team players.

## Data Consolidation Challenges
The provided data set aggregates statistics across all teams a player pitched for in a season, so it has one row per player-season. The supplemental data set retains team-specific records, so it has multiple rows for players who appeared for more than one team in a season. The multi-team indicator also differs: it is `'TOT'` in the supplemental data and `'- - -'` in the provided data.

Consider the case of **Yohan Ramírez**'s 2022 and 2024 seasons:
- In the provided data set, his statistics for each season are aggregated into a single row under `"- - -"`.
- In the supplemental data set, his stats are detailed separately for each team (`SEA`, `CLE`, and `PIT` in 2022; `NYM`, `BAL`, `LAD`, and `BOS` in 2024), alongside total rows marked `TOT`.
- **Notice that there are multiple `TOT` rows in 2022 and 2024.** The first (topmost) `TOT` row is taken to represent that player's year: its `PA` matches the `TBF` in the provided data (167 in 2022, 208 in 2024), while the remaining `TOT` rows cover only a subset of his teams.
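
As a quick sanity check on that rule, the sketch below rebuilds only the `Tm` and `PA` columns of the 2022 supplemental rows shown in the example and compares each `TOT` row with the sum of the team rows. The frame `rows` and the variable names are illustrative and are not part of the notebook's pipeline.

```python
import pandas as pd

# Tm and PA values copied from the 2022 supplemental rows for Yohan Ramírez.
rows = pd.DataFrame(
    {
        "Tm": ["TOT", "TOT", "SEA", "CLE", "PIT"],
        "PA": [167, 51, 40, 11, 116],
    }
)

is_total = rows["Tm"] == "TOT"

# Sum of the team-specific rows: 40 + 11 + 116.
team_pa = rows.loc[~is_total, "PA"].sum()

# The first TOT row versus any further TOT rows.
first_tot_pa = rows.loc[is_total, "PA"].iloc[0]
other_tot_pa = rows.loc[is_total, "PA"].iloc[1:].tolist()

print(team_pa, first_tot_pa, other_tot_pa)
```

Only the first `TOT` row (167 PA) equals the team sum; the second (51 PA) equals `SEA` plus `CLE` alone, which is why it is not used.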

### Example: Yohan Ramírez
#### Provided Data (`data/k.csv`)
|      |   MLBAMID |   PlayerId | Name          | Team   |   Age |   Season |   TBF |       K% |
|-----:|----------:|-----------:|:--------------|:-------|------:|---------:|------:|---------:|
|  267 |    670990 |      19444 | Yohan Ramírez | - - -  |    29 |     2024 |   208 | 0.216346 |
| 1299 |    670990 |      19444 | Yohan Ramírez | - - -  |    27 |     2022 |   167 | 0.191617 |

#### Supplemental Data (`data/supplemental-stats.csv`)
|      |   Rk | Name          |   Age | Tm   |   IP |   PA |   Pit |   Pit/PA |   Str |   Str% |   L/Str |   S/Str |   F/Str |   I/Str |   AS/Str |   I/Bll |   AS/Pit |   Con |   1st% |   30% |   30c |   30s |   02% |   02c |   02s |   02h |   L/SO |   S/SO |   L/SO% |   3pK |   4pW |   PAu |   Pitu |   Stru |   Season |
|-----:|-----:|:--------------|------:|:-----|-----:|-----:|------:|---------:|------:|-------:|--------:|--------:|--------:|--------:|---------:|--------:|---------:|------:|-------:|------:|------:|------:|------:|------:|------:|------:|-------:|-------:|--------:|------:|------:|------:|-------:|-------:|---------:|
| 1942 |  784 | Yohan Ramírez |    27 | TOT  | 37.1 |  167 |   620 |     3.71 |   383 |  0.618 |   0.308 |   0.164 |   0.243 |   0.285 |    0.692 |       0 |    0.427 | 0.762 |  0.569 | 0.042 |     7 |     5 | 0.281 |    47 |    26 |     3 |     12 |     20 |   0.375 |     7 |     1 |     0 |      0 |      0 |     2022 |
| 1943 |  785 | Yohan Ramírez |    27 | TOT  | 10.1 |   51 |   198 |     3.88 |   117 |  0.591 |   0.274 |   0.256 |   0.214 |   0.256 |    0.726 |       0 |    0.429 | 0.647 |  0.569 | 0.02  |     1 |     1 | 0.137 |     7 |     5 |     1 |      3 |      8 |   0.273 |     2 |     0 |     0 |      0 |      0 |     2022 |
| 1944 |  786 | Yohan Ramírez |    27 | SEA  |  8.1 |   40 |   158 |     3.95 |    92 |  0.582 |   0.217 |   0.326 |   0.217 |   0.239 |    0.783 |       0 |    0.456 | 0.583 |  0.55  | 0.025 |     1 |     1 | 0.175 |     7 |     5 |     1 |      2 |      8 |   0.2   |     2 |     0 |     0 |      0 |      0 |     2022 |
| 1945 |  787 | Yohan Ramírez |    27 | CLE  |  2   |   11 |    40 |     3.64 |    25 |  0.625 |   0.48  |   0     |   0.2   |   0.32  |    0.52  |       0 |    0.325 | 1     |  0.636 | 0     |     0 |     0 | 0     |     0 |     0 |     0 |      1 |      0 |   1     |     0 |     0 |     0 |      0 |      0 |     2022 |
| 1946 |  788 | Yohan Ramírez |    27 | PIT  | 27   |  116 |   422 |     3.64 |   266 |  0.63  |   0.323 |   0.124 |   0.256 |   0.297 |    0.677 |       0 |    0.427 | 0.817 |  0.569 | 0.052 |     6 |     4 | 0.345 |    40 |    21 |     2 |      9 |     12 |   0.429 |     5 |     1 |     0 |      0 |      0 |     2022 |
| 4097 |  794 | Yohan Ramírez |    29 | TOT  | 44.9 |  208 |   761 |     3.66 |   467 |  0.614 |   0.315 |   0.171 |   0.227 |   0.287 |    0.685 |       0 |    0.42  | 0.75  |  0.51  | 0.077 |    16 |     7 | 0.207 |    43 |    30 |     1 |      9 |     36 |   0.2   |    10 |     5 |     0 |      0 |      0 |     2024 |
| 4098 |  795 | Yohan Ramírez |    29 | TOT  |  7.1 |   33 |   133 |     4.03 |    77 |  0.579 |   0.377 |   0.182 |   0.195 |   0.247 |    0.623 |       0 |    0.361 | 0.708 |  0.303 | 0.061 |     2 |     1 | 0.03  |     1 |     1 |     0 |      0 |      7 |   0     |     0 |     1 |     0 |      0 |      0 |     2024 |
| 4099 |  796 | Yohan Ramírez |    29 | TOT  | 37.2 |  175 |   628 |     3.59 |   390 |  0.621 |   0.303 |   0.169 |   0.233 |   0.295 |    0.697 |       0 |    0.433 | 0.757 |  0.549 | 0.08  |    14 |     6 | 0.24  |    42 |    29 |     1 |      9 |     29 |   0.237 |    10 |     4 |     0 |      0 |      0 |     2024 |
| 4100 |  797 | Yohan Ramírez |    29 | NYM  |  8.1 |   41 |   147 |     3.59 |    93 |  0.633 |   0.323 |   0.247 |   0.151 |   0.28  |    0.677 |       0 |    0.429 | 0.635 |  0.537 | 0.122 |     5 |     2 | 0.195 |     8 |     4 |     0 |      3 |      8 |   0.273 |     2 |     2 |     0 |      0 |      0 |     2024 |
| 4101 |  798 | Yohan Ramírez |    29 | BAL  |  6   |   24 |   104 |     4.33 |    61 |  0.587 |   0.41  |   0.164 |   0.213 |   0.213 |    0.59  |       0 |    0.346 | 0.722 |  0.375 | 0.083 |     2 |     1 | 0.042 |     1 |     1 |     0 |      0 |      6 |   0     |     0 |     1 |     0 |      0 |      0 |     2024 |
| 4102 |  799 | Yohan Ramírez |    29 | LAD  | 29.1 |  134 |   481 |     3.59 |   297 |  0.617 |   0.296 |   0.145 |   0.259 |   0.3   |    0.704 |       0 |    0.435 | 0.794 |  0.552 | 0.067 |     9 |     4 | 0.254 |    34 |    25 |     1 |      6 |     21 |   0.222 |     8 |     2 |     0 |      0 |      0 |     2024 |
| 4103 |  800 | Yohan Ramírez |    29 | BOS  |  1.1 |    9 |    29 |     3.22 |    16 |  0.552 |   0.25  |   0.25  |   0.125 |   0.375 |    0.75  |       0 |    0.414 | 0.667 |  0.111 | 0     |     0 |     0 | 0     |     0 |     0 |     0 |      0 |      1 |   0     |     0 |     0 |     0 |      0 |      0 |     2024 |

#### Desired Output
|      |   PlayerId | Team   |   Season |   MLBAMID | Name          |   Age |   TBF |       K% |   Rk |   IP |   PA |   Pit |   Pit/PA |   Str |   Str% |   L/Str |   S/Str |   F/Str |   I/Str |   AS/Str |   I/Bll |   AS/Pit |   Con |   1st% |   30% |   30c |   30s |   02% |   02c |   02s |   02h |   L/SO |   S/SO |   L/SO% |   3pK |   4pW |   PAu |   Pitu |   Stru |
|-----:|-----------:|:-------|---------:|----------:|:--------------|------:|------:|---------:|-----:|-----:|-----:|------:|---------:|------:|-------:|--------:|--------:|--------:|--------:|---------:|--------:|---------:|------:|-------:|------:|------:|------:|------:|------:|------:|------:|-------:|-------:|--------:|------:|------:|------:|-------:|-------:|
| 1213 |      19444 | - - -  |     2022 |    670990 | Yohan Ramírez |    27 |   167 | 0.191617 |  784 | 37.1 |  167 |   620 |     3.71 |   383 |  0.618 |   0.308 |   0.164 |   0.243 |   0.285 |    0.692 |       0 |    0.427 | 0.762 |  0.569 | 0.042 |     7 |     5 | 0.281 |    47 |    26 |     3 |     12 |     20 |   0.375 |     7 |     1 |     0 |      0 |      0 |
| 1214 |      19444 | - - -  |     2023 |    670990 | Yohan Ramírez |    28 |   176 | 0.198864 |  770 | 38.1 |  177 |   708 |     4    |   428 |  0.605 |   0.332 |   0.138 |   0.266 |   0.264 |    0.668 |       0 |    0.404 | 0.794 |  0.554 | 0.045 |     8 |     2 | 0.237 |    42 |    19 |     5 |     12 |     23 |   0.343 |     4 |     5 |     0 |      0 |      0 |
| 1215 |      19444 | - - -  |     2024 |    670990 | Yohan Ramírez |    29 |   208 | 0.216346 |  794 | 44.9 |  208 |   761 |     3.66 |   467 |  0.614 |   0.315 |   0.171 |   0.227 |   0.287 |    0.685 |       0 |    0.42  | 0.75  |  0.51  | 0.077 |    16 |     7 | 0.207 |    43 |    30 |     1 |      9 |     36 |   0.2   |    10 |     5 |     0 |      0 |      0 |

---

## Workflow for Merging and Aggregation
1. **Standardization**: Harmonize column names and formats between the datasets. For example, rename the `Tm` column in the supplemental data set to `Team` for consistency, and change the multi-team indicator from `'TOT'` to `'- - -'`.
2. **Aggregation**: Consolidate team-specific rows in the supplemental data set into a single row per player-season by selecting only the _first_ `TOT` row where one exists and renaming it to `- - -` for consistency with the provided data.
3. **Joining**: Merge the aggregated supplemental data with the provided data set, aligning on the join keys (`Name`, `Season`, `Age`, and `Team`).
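
The three steps can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the frames, player names, and values are made up, and it deduplicates with `drop_duplicates` *before* the join, whereas the notebook collapses duplicates with `groupby(...).first()` *after* the join.

```python
import pandas as pd

# Toy stand-ins for the two files; names and values are illustrative only.
provided = pd.DataFrame(
    {
        "Name": ["Pitcher A", "Pitcher B"],
        "Season": [2022, 2022],
        "Age": [27, 30],
        "Team": ["- - -", "SEA"],
        "TBF": [167, 90],
    }
)
supplemental = pd.DataFrame(
    {
        "Name": ["Pitcher A"] * 4 + ["Pitcher B"],
        "Season": [2022] * 5,
        "Age": [27] * 4 + [30],
        "Tm": ["TOT", "TOT", "SEA", "PIT", "SEA"],
        "PA": [167, 51, 51, 116, 90],
    }
)

# 1. Standardization: align the team column name and the multi-team marker.
supp = supplemental.rename(columns={"Tm": "Team"})
supp["Team"] = supp["Team"].replace("TOT", "- - -")

# 2. Aggregation: keep the first row per key, i.e. the top TOT row.
keys = ["Name", "Season", "Age", "Team"]
supp = supp.drop_duplicates(subset=keys, keep="first")

# 3. Joining: left join so every provided row is preserved;
#    validate raises if the keys are not unique on either side.
merged = provided.merge(supp, on=keys, how="left", validate="one_to_one")
print(merged)
```

Passing `validate="one_to_one"` is a design choice in this sketch: it makes the join fail loudly if step 2 leaves duplicate keys, rather than silently multiplying rows.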

---

## Development Workflow

All functions demonstrated in this notebook are defined in the `bullpen.data_utils` module for clarity, reusability, and unit testing. While this notebook retains the initial development and intent of these functions, their inclusion here is primarily for transparency and ease of reference.  

For production usage, refer to the source code in the `bullpen.data_utils` module.

In [1]:
import pandas as pd

from bullpen.data_utils import DATA_DIR

print(f"{DATA_DIR=}")

DATA_DIR=PosixPath('/Users/logan/Desktop/repos/mlb-pitcher-xK/data')


## Loading Data

Refer to the companion notebook: [01a-data-processing-fixing-names.ipynb](./01a-data-processing-fixing-names.ipynb) for detailed preprocessing steps.

**Note**: While the data loading process is encapsulated in the `bullpen.data_utils.load_data()` function for modularity and reuse, the relevant code is included here to provide full transparency into the data processing pipeline. This allows for a clear understanding of the transformations and ensures traceability for debugging or further customization.

In [2]:
provided_data = pd.read_csv(DATA_DIR.joinpath("k.csv"))
supplemental_data = pd.read_csv(DATA_DIR.joinpath("supplemental-stats.csv"))


supplemental_data.Name = supplemental_data.Name.replace(
    {
        "Manny Banuelos": "Manny Bañuelos",
        "Ralph Garza": "Ralph Garza Jr.",
        "Luis Ortiz": "Luis L. Ortiz",
        "Jose Hernandez": "Jose E. Hernandez",
        "Hyeon-jong Yang": "Hyeon-Jong Yang",
        "Adrián Martinez": "Adrián Martínez",
    }
)

provided_data.Name = provided_data.Name.replace(
    {
        "Eduardo Rodriguez": "Eduardo Rodríguez",
        "Jose Alvarez": "José Álvarez",
        "Sandy Alcantara": "Sandy Alcántara",
        "Carlos Martinez": "Carlos Martínez",
        "Phillips Valdez": "Phillips Valdéz",
        "Jovani Moran": "Jovani Morán",
        "Jose Cuas": "José Cuas",
        "Jorge Alcala": "Jorge Alcalá",
        "Jhoan Duran": "Jhoan Durán",
        "Jesus Tinoco": "Jesús Tinoco",
        "Brent Honeywell": "Brent Honeywell Jr.",
        "Adrian Morejon": "Adrián Morejón",
    }
)

In [3]:
supplemental_data[supplemental_data.Name == "Yohan Ramírez"]

Unnamed: 0,Rk,Name,Age,Tm,IP,PA,Pit,Pit/PA,Str,Str%,...,02h,L/SO,S/SO,L/SO%,3pK,4pW,PAu,Pitu,Stru,Season
835,836,Yohan Ramírez,26,SEA,27.2,114,436,3.82,275,0.631,...,2,4,31,0.114,6,1,0,0,0,2021
1942,784,Yohan Ramírez,27,TOT,37.1,167,620,3.71,383,0.618,...,3,12,20,0.375,7,1,0,0,0,2022
1943,785,Yohan Ramírez,27,TOT,10.1,51,198,3.88,117,0.591,...,1,3,8,0.273,2,0,0,0,0,2022
1944,786,Yohan Ramírez,27,SEA,8.1,40,158,3.95,92,0.582,...,1,2,8,0.2,2,0,0,0,0,2022
1945,787,Yohan Ramírez,27,CLE,2.0,11,40,3.64,25,0.625,...,0,1,0,1.0,0,0,0,0,0,2022
1946,788,Yohan Ramírez,27,PIT,27.0,116,422,3.64,266,0.63,...,2,9,12,0.429,5,1,0,0,0,2022
3009,770,Yohan Ramírez,28,TOT,38.1,177,708,4.0,428,0.605,...,5,12,23,0.343,4,5,0,0,0,2023
3010,771,Yohan Ramírez,28,PIT,34.1,156,602,3.86,373,0.62,...,5,9,22,0.29,4,3,0,0,0,2023
3011,772,Yohan Ramírez,28,CHW,4.0,21,106,5.05,55,0.519,...,0,3,1,0.75,0,2,0,0,0,2023
4097,794,Yohan Ramírez,29,TOT,44.9,208,761,3.66,467,0.614,...,1,9,36,0.2,10,5,0,0,0,2024


In [4]:
supplemental_data.Tm = supplemental_data.Tm.replace("TOT", "- - -")
supplemental_data

Unnamed: 0,Rk,Name,Age,Tm,IP,PA,Pit,Pit/PA,Str,Str%,...,02h,L/SO,S/SO,L/SO%,3pK,4pW,PAu,Pitu,Stru,Season
0,1,Fernando Abad,35,BAL,17.2,82,299,3.65,183,0.612,...,1,5,5,0.500,1,0,0,0,0,2021
1,2,Cory Abbott,25,CHC,17.1,82,352,4.29,203,0.577,...,0,4,8,0.333,1,2,0,0,0,2021
2,3,Albert Abreu,25,NYY,36.2,156,642,4.12,392,0.611,...,1,9,26,0.257,7,2,0,0,0,2021
3,4,Bryan Abreu,24,HOU,36.0,161,689,4.28,407,0.591,...,3,6,30,0.167,6,4,0,0,0,2021
4,5,Domingo Acevedo,27,OAK,11.0,44,174,3.95,114,0.655,...,1,0,9,0.000,4,1,0,0,0,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4393,1090,Ryan Zeferjahn,26,LAA,17.0,64,212,3.31,137,0.646,...,0,1,17,0.056,3,1,0,0,0,2024
4394,1091,Angel Zerpa,24,KCR,53.2,239,952,3.98,602,0.632,...,5,27,22,0.551,7,2,0,0,0,2024
4395,1092,Tyler Zuber,29,TBR,3.1,15,75,5.00,43,0.573,...,1,1,3,0.250,0,0,0,0,0,2024
4396,1093,Yosver Zulueta,26,CIN,16.1,69,286,4.14,175,0.612,...,1,9,11,0.450,1,0,0,0,0,2024


In [5]:
supplemental_data[supplemental_data.Name == "Yohan Ramírez"]

Unnamed: 0,Rk,Name,Age,Tm,IP,PA,Pit,Pit/PA,Str,Str%,...,02h,L/SO,S/SO,L/SO%,3pK,4pW,PAu,Pitu,Stru,Season
835,836,Yohan Ramírez,26,SEA,27.2,114,436,3.82,275,0.631,...,2,4,31,0.114,6,1,0,0,0,2021
1942,784,Yohan Ramírez,27,- - -,37.1,167,620,3.71,383,0.618,...,3,12,20,0.375,7,1,0,0,0,2022
1943,785,Yohan Ramírez,27,- - -,10.1,51,198,3.88,117,0.591,...,1,3,8,0.273,2,0,0,0,0,2022
1944,786,Yohan Ramírez,27,SEA,8.1,40,158,3.95,92,0.582,...,1,2,8,0.2,2,0,0,0,0,2022
1945,787,Yohan Ramírez,27,CLE,2.0,11,40,3.64,25,0.625,...,0,1,0,1.0,0,0,0,0,0,2022
1946,788,Yohan Ramírez,27,PIT,27.0,116,422,3.64,266,0.63,...,2,9,12,0.429,5,1,0,0,0,2022
3009,770,Yohan Ramírez,28,- - -,38.1,177,708,4.0,428,0.605,...,5,12,23,0.343,4,5,0,0,0,2023
3010,771,Yohan Ramírez,28,PIT,34.1,156,602,3.86,373,0.62,...,5,9,22,0.29,4,3,0,0,0,2023
3011,772,Yohan Ramírez,28,CHW,4.0,21,106,5.05,55,0.519,...,0,3,1,0.75,0,2,0,0,0,2023
4097,794,Yohan Ramírez,29,- - -,44.9,208,761,3.66,467,0.614,...,1,9,36,0.2,10,5,0,0,0,2024


In [6]:
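merged = (
    provided_data.merge(
        supplemental_data,
        left_on=["Name", "Season", "Age", "Team"],
        right_on=["Name", "Season", "Age", "Tm"],
        how="left",
    )
    # Collapse duplicate matches to one row per player-team-season.
    # The top TOT row comes first in the supplemental data, so `.first()`
    # keeps its values (note: `.first()` takes the first non-null value
    # in each column).
    .groupby(["PlayerId", "Team", "Season"])
    .first()
    .reset_index()
    # `Tm` duplicates `Team` after the join.
    .drop(columns="Tm")
    .sort_values(["Name", "Season", "Team"])
)
merged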

Unnamed: 0,PlayerId,Team,Season,MLBAMID,Name,Age,TBF,K%,Rk,IP,...,02s,02h,L/SO,S/SO,L/SO%,3pK,4pW,PAu,Pitu,Stru
1106,18655,ATL,2021,621345,A.J. Minter,27,221,0.257919,696,52.1,...,44,7,11,46,0.193,11,4,0,0,0
1107,18655,ATL,2022,621345,A.J. Minter,28,271,0.346863,649,70.0,...,50,2,23,71,0.245,12,0,0,0,0
1108,18655,ATL,2023,621345,A.J. Minter,29,260,0.315385,647,64.2,...,40,4,13,69,0.159,8,1,0,0,0
1109,18655,ATL,2024,621345,A.J. Minter,30,134,0.261194,676,34.1,...,20,1,7,28,0.200,6,3,0,0,0
1177,19343,OAK,2022,640462,A.J. Puk,27,281,0.270463,773,66.1,...,48,6,22,54,0.289,15,4,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,10310,PHI,2021,554430,Zack Wheeler,31,849,0.290931,1107,213.1,...,155,15,70,176,0.283,50,6,0,0,0
249,10310,PHI,2022,554430,Zack Wheeler,32,607,0.268534,1037,153.0,...,87,5,39,124,0.239,27,2,0,0,0
250,10310,PHI,2023,554430,Zack Wheeler,33,787,0.269377,1023,192.0,...,121,8,41,170,0.194,43,3,0,0,0
251,10310,PHI,2024,554430,Zack Wheeler,34,787,0.284625,1043,200.0,...,132,11,71,153,0.317,39,10,0,0,0


In [7]:
provided_data.shape, merged.shape

((1892, 8), (1892, 39))

In [8]:
provided_data.shape[0] == merged.shape[0]

True
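
Matching row counts are what a left join produces once duplicate matches are collapsed, so on their own they do not show that every provided row found a supplemental match. One way to check this is the `indicator` parameter of `DataFrame.merge`, sketched here on toy frames (the names `left`, `right`, and `checked` and their values are illustrative):

```python
import pandas as pd

# Toy frames: Pitcher B has no counterpart on the right-hand side.
left = pd.DataFrame({"Name": ["Pitcher A", "Pitcher B"], "Season": [2022, 2022]})
right = pd.DataFrame({"Name": ["Pitcher A"], "Season": [2022], "PA": [167]})

# indicator=True adds a `_merge` column recording where each row came from.
checked = left.merge(right, on=["Name", "Season"], how="left", indicator=True)

# Rows present only in the left frame found no match.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Name", "Season"]])
```

In this notebook, the integer dtypes reported by `merged.dtypes` (for example `Rk` as `int64`) are consistent with every row having matched, since unmatched rows would introduce `NaN` and turn those columns into floats.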

In [9]:
merged[merged.Name == "Yohan Ramírez"]

Unnamed: 0,PlayerId,Team,Season,MLBAMID,Name,Age,TBF,K%,Rk,IP,...,02s,02h,L/SO,S/SO,L/SO%,3pK,4pW,PAu,Pitu,Stru
1213,19444,- - -,2022,670990,Yohan Ramírez,27,167,0.191617,784,37.1,...,26,3,12,20,0.375,7,1,0,0,0
1214,19444,- - -,2023,670990,Yohan Ramírez,28,176,0.198864,770,38.1,...,19,5,12,23,0.343,4,5,0,0,0
1215,19444,- - -,2024,670990,Yohan Ramírez,29,208,0.216346,794,44.9,...,30,1,9,36,0.2,10,5,0,0,0


In [10]:
merged.duplicated().sum()

0

In [11]:
merged.dtypes

PlayerId      int64
Team         object
Season        int64
MLBAMID       int64
Name         object
Age           int64
TBF           int64
K%          float64
Rk            int64
IP          float64
PA            int64
Pit           int64
Pit/PA      float64
Str           int64
Str%        float64
L/Str       float64
S/Str       float64
F/Str       float64
I/Str       float64
AS/Str      float64
I/Bll       float64
AS/Pit      float64
Con         float64
1st%        float64
30%         float64
30c           int64
30s           int64
02%         float64
02c           int64
02s           int64
02h           int64
L/SO          int64
S/SO          int64
L/SO%       float64
3pK           int64
4pW           int64
PAu           int64
Pitu          int64
Stru          int64
dtype: object