In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.preprocessing import StandardScaler

# Frankenstein MLB Hitter Builder
The goal of this notebook is to take in hitting and fielding data and calculate similarity scores between players based on hitting outcomes, swing style and defensive metrics. In doing so, we create a sort of Frankenstein player that combines skillsets from other players. This tool can help front offices decide which players to target in free agency by inputting players with the desired skillsets. The notebook walks through each dataset, the cleaning steps and the final function that calculates player similarity. All data comes from the three most recent MLB seasons (2023-2025): because this tool is meant for free agency, we want very recent data to help front offices make decisions on players.

Let's first load all the dataframes of statistics, which come from Baseball Savant. We load several datasets, each covering a different part of the game described above. For example, one dataframe contains hitters' percentile rankings across several statistics, and another contains the fielding run value for all qualified players.

We start by creating a function that loads the dataframes for each set of metrics we are evaluating. Each set of statistics has three CSV files, one for each of the three seasons. The function below reads these three files and concatenates them into one dataframe, which we will then clean and feed into the similarity function.

In [2]:
def load_csv(file: str) -> pd.DataFrame:
    """
    This function will take in the file name for the set of 3 files per statistic and will
    convert them to a concatenated dataframe for further use
    """
    years = [2023, 2024, 2025]
    dfs = []
    for year in years:
        df = pd.read_csv(f'data/{file}_{year}.csv')
        df_cols = df.columns.to_list()
        if 'year' not in df_cols:
            df['year'] = year
        dfs.append(df)

    df = pd.concat(dfs)
    return df
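
One detail worth knowing about `pd.concat` as it is called in `load_csv`: by default it keeps each file's original row labels, so the combined frame can contain repeated index values. That appears to be why the 2025 rows in the `df_hit` preview carry labels such as 612 instead of continuing on from the earlier seasons. The sketch below uses small made-up frames (not the Savant files) to show the difference `ignore_index=True` makes. Whether to reset the index is a design choice, and the merges later in the notebook produce a fresh index either way.

```python
import pandas as pd

# Toy stand-ins for two yearly files (made-up values, not Savant data)
df_2023 = pd.DataFrame({"player_id": [1, 2], "xwoba": [50.0, 60.0]})
df_2024 = pd.DataFrame({"player_id": [1, 3], "xwoba": [55.0, 40.0]})
df_2023["year"] = 2023
df_2024["year"] = 2024

# Default behaviour: each frame keeps its own labels, so 0 and 1 appear twice
stacked = pd.concat([df_2023, df_2024])

# ignore_index=True relabels the rows 0..n-1
stacked_reset = pd.concat([df_2023, df_2024], ignore_index=True)

print(stacked.index.tolist())        # [0, 1, 0, 1]
print(stacked_reset.index.tolist())  # [0, 1, 2, 3]
```

With repeated labels, label-based lookups such as `.loc[0]` return one row per season rather than a single row.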

Let's now apply the function to all of the datasets we will use in this project.

In [3]:
df_hit = load_csv('percentile_rankings_bat')
df_frv = load_csv('fielding-run-value')
df_batted = load_csv('batted-ball')
df_zone = load_csv('swing-take')
df_swing_path = load_csv('bat-tracking-swing-path')
df_catcher = load_csv('catcher_throwing')
df_oaa = load_csv('outs_above_average')

The first dataframe we will work with is the df_hit dataframe. This dataframe gives the percentile ranking in several different statistics of all qualified hitters. Example statistics include xwoba, xobp and brl_percent. Higher values represent a player being in the upper tier of the league in that specific statistic

In [4]:
df_hit

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,k_percent,bb_percent,whiff_percent,chase_percent,arm_strength,sprint_speed,oaa,bat_speed,squared_up_rate,swing_length
0,"Hedges, Austin",595978,2023,,,,,,,,...,,,,,,4,,,,
1,"Stowers, Kyle",669065,2023,,,,,,,,...,,,,,,15,,,,
2,"Davis, Jonathan",641505,2023,,,,,,,,...,,,,,15.0,88,,,,
3,"Isbel, Kyle",664728,2023,,,,,,,,...,,,,,42.0,56,97.0,,,
4,"Crawford, Brandon",543063,2023,,,,,,,,...,,,,,54.0,14,90.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
612,"Schwarber, Kyle",656941,2025,98.0,51.0,99.0,99.0,95.0,99.0,99.0,...,11.0,97.0,5.0,90.0,,14,,98.0,23.0,28.0
613,"Morel, Christopher",666624,2025,,,,,,,,...,,,,,93.0,66,,,,
614,"Hernández, Teoscar",606192,2025,49.0,50.0,69.0,73.0,14.0,75.0,72.0,...,29.0,7.0,8.0,36.0,63.0,69,3.0,64.0,37.0,7.0
615,"Caissie, Owen",683357,2025,,,,,,,,...,,,,,,64,,,,


Let's take a look at the number of NA values in the dataset.

In [5]:
df_hit.isna().sum()

player_name            0
player_id              0
year                   0
xwoba               1077
xba                 1077
xslg                1077
xiso                1077
xobp                1077
brl                 1077
brl_percent         1077
exit_velocity       1077
max_ev                 0
hard_hit_percent    1077
k_percent           1077
bb_percent          1077
whiff_percent       1077
chase_percent       1077
arm_strength         655
sprint_speed           0
oaa                 1060
bat_speed           1180
squared_up_rate     1180
swing_length        1180
dtype: int64

Wow, that is a pretty significant number of NA values in this dataframe. Let's investigate to see if there are any patterns. Many of the columns share the same NA count (1077), so the dataframe likely includes position players who did not have enough plate appearances to qualify. We will create a dataframe of the players missing xwoba, who likely have NA values across all of the columns with 1077 missing values.

In [6]:
df_na = df_hit[df_hit['xwoba'].isna()]

Let's now see how many players in this dataframe have NA values across most of the columns.

In [7]:
df_na.isna().sum()

player_name            0
player_id              0
year                   0
xwoba               1077
xba                 1077
xslg                1077
xiso                1077
xobp                1077
brl                 1077
brl_percent         1077
exit_velocity       1077
max_ev                 0
hard_hit_percent    1077
k_percent           1077
bb_percent          1077
whiff_percent       1077
chase_percent       1077
arm_strength         551
sprint_speed           0
oaa                  908
bat_speed           1055
squared_up_rate     1055
swing_length        1055
dtype: int64

Every player missing xwoba is also missing the other columns that have 1077 NA values, so we will filter these players out of the dataset using the dropna function. Passing how = 'all' tells the function to drop a row only if all of these values are NA.

In [8]:
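# Columns that share the 1077-NA pattern. Despite the name, these columns are
# not dropped; they are the subset used to decide which rows to drop.
cols_to_drop = [x for x in df_na.columns if x not in ['player_name', 'player_id', 'year', 'max_ev',
                                                       'arm_strength','sprint_speed', 'oaa', 'bat_speed',
                                                       'squared_up_rate', 'swing_length']]
df_hit = df_hit.dropna(how = 'all', subset = cols_to_drop)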
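
To make the effect of `how = 'all'` concrete, here is a minimal sketch on a made-up three-row frame. The column names mirror the notebook, but the values and the `batting_cols` name are purely illustrative. A row is dropped under `how="all"` only when every column in `subset` is NA, whereas `how="any"` drops it as soon as one is missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: player 1 is complete, player 2 is missing one batting column,
# player 3 is missing both batting columns
toy = pd.DataFrame({
    "player_id": [1, 2, 3],
    "xwoba": [90.0, np.nan, np.nan],
    "xba": [80.0, 70.0, np.nan],
    "sprint_speed": [50, 60, 70],
})
batting_cols = ["xwoba", "xba"]  # hypothetical subset for illustration

# how="all": drop a row only if every subset column is NA (drops player 3)
kept_all = toy.dropna(how="all", subset=batting_cols)

# how="any": drop a row if at least one subset column is NA (drops players 2 and 3)
kept_any = toy.dropna(how="any", subset=batting_cols)

print(kept_all["player_id"].tolist())  # [1, 2]
print(kept_any["player_id"].tolist())  # [1]
```

In the notebook the two options give the same result for these columns, because the missing values line up across all of them.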

Let's now see how many NA values remain in the dataset.

In [9]:
df_hit.isna().sum()

player_name           0
player_id             0
year                  0
xwoba                 0
xba                   0
xslg                  0
xiso                  0
xobp                  0
brl                   0
brl_percent           0
exit_velocity         0
max_ev                0
hard_hit_percent      0
k_percent             0
bb_percent            0
whiff_percent         0
chase_percent         0
arm_strength        104
sprint_speed          0
oaa                 152
bat_speed           125
squared_up_rate     125
swing_length        125
dtype: int64

Great, this removed all of the NA values across those columns. The arm strength, oaa, bat speed, squared up rate and swing length columns are the ones that still have NA values. However, we will not be using the bottom 4 statistics shown above (oaa, bat speed, squared up rate and swing length) because they are either uninformative for comparing hitters or already covered by another metric that will be imported later in the analysis. For example, oaa will be covered by fielding run value (FRV) later, because FRV can be used to compare players across multiple positions while oaa cannot.

In [10]:
df_hit = df_hit.drop(columns = ['squared_up_rate', 'swing_length', 'oaa', 'bat_speed'])

Let's now see how many NA values remain.

In [11]:
df_hit.isna().sum()

player_name           0
player_id             0
year                  0
xwoba                 0
xba                   0
xslg                  0
xiso                  0
xobp                  0
brl                   0
brl_percent           0
exit_velocity         0
max_ev                0
hard_hit_percent      0
k_percent             0
bb_percent            0
whiff_percent         0
chase_percent         0
arm_strength        104
sprint_speed          0
dtype: int64

We are now only missing values in the arm strength column, so let's take a look at why this might be.

In [12]:
df_hit[df_hit['arm_strength'].isna()]

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,exit_velocity,max_ev,hard_hit_percent,k_percent,bb_percent,whiff_percent,chase_percent,arm_strength,sprint_speed
22,"Bailey, Patrick",672275,2023,29.0,38.0,44.0,49.0,17.0,32.0,63.0,37.0,41,33.0,17.0,17.0,43.0,63.0,,22
59,"Ozuna, Marcell",542303,2023,96.0,88.0,98.0,98.0,82.0,98.0,98.0,87.0,96,88.0,44.0,64.0,41.0,40.0,,28
174,"Maldonado, Martín",455117,2023,1.0,1.0,13.0,47.0,1.0,26.0,51.0,32.0,49,45.0,2.0,37.0,7.0,24.0,,1
198,"Perez, Salvador",521692,2023,52.0,69.0,74.0,69.0,22.0,67.0,53.0,65.0,85,67.0,38.0,1.0,21.0,1.0,,3
220,"Vázquez, Christian",543877,2023,1.0,11.0,3.0,8.0,5.0,3.0,7.0,11.0,25,14.0,41.0,33.0,43.0,23.0,,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
550,"Stephenson, Tyler",663886,2025,34.0,6.0,44.0,65.0,30.0,37.0,90.0,61.0,30,84.0,1.0,78.0,14.0,67.0,,34
596,"Narváez, Carlos",665966,2025,13.0,9.0,26.0,40.0,10.0,37.0,51.0,75.0,68,56.0,26.0,51.0,28.0,39.0,,11
600,"Raleigh, Cal",663728,2025,92.0,16.0,96.0,98.0,81.0,98.0,99.0,79.0,93,85.0,14.0,95.0,10.0,25.0,,18
605,"Kirk, Alejandro",672386,2025,87.0,91.0,81.0,70.0,83.0,66.0,59.0,75.0,76,91.0,95.0,64.0,83.0,42.0,,2


These appear to be mostly catchers and DHs, which is expected: catchers will get their own arm_strength column later on, and DHs do not play the field often. Therefore, we will leave these values as NA for the time being and account for them in our final function to compare players. Let's go ahead and rename this arm strength column to show that it is for hitters and not catchers.

In [13]:
df_hit = df_hit.rename(columns = {'arm_strength':'arm_strength_hit'})
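
Leaving arm strength as NA for catchers and DHs means any later comparison has to cope with missing features. As a rough illustration only, and not the notebook's final similarity function, the sketch below compares two players using just the features both of them have. The helper name `nan_aware_distance` and the sample values are hypothetical.

```python
import numpy as np

def nan_aware_distance(a, b):
    """Root-mean-square difference over the features present in both vectors.

    Hypothetical helper for illustration; returns NaN when no feature is shared.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)  # features neither player is missing
    if not shared.any():
        return np.nan
    diff = a[shared] - b[shared]
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up percentile vectors: [xwoba, sprint_speed, arm_strength_hit]
fielder = [90.0, 80.0, 75.0]
catcher = [88.0, 82.0, np.nan]  # arm strength missing, as for catchers here

d = nan_aware_distance(fielder, catcher)
print(d)  # 2.0
```

Averaging over the shared features, rather than summing, keeps distances comparable between pairs that share different numbers of features.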

Next we will work with the oaa dataframe which we will use to get the position of each player in our dataset.

In [14]:
df_oaa

Unnamed: 0,"last_name, first_name",player_id,display_team_name,year,primary_pos_formatted,fielding_runs_prevented,outs_above_average,outs_above_average_infront,outs_above_average_lateral_toward3bline,outs_above_average_lateral_toward1bline,outs_above_average_behind,outs_above_average_rhh,outs_above_average_lhh,actual_success_rate_formatted,adj_estimated_success_rate_formatted,diff_success_rate_formatted
0,"Abrams, CJ",682928,Nationals,2023,SS,-6,-9,-2,0,-7,0,-3,-6,73%,75%,-2%
1,"Abreu, José",547989,Astros,2023,1B,-5,-6,0,-5,-3,2,0,-6,73%,74%,-1%
2,"Acuña Jr., Ronald",660670,Braves,2023,RF,-6,-7,-3,-2,-2,0,-3,-4,87%,89%,-2%
3,"Adames, Willy",642715,Brewers,2023,SS,12,16,7,4,4,0,10,6,78%,75%,3%
4,"Ahmed, Nick",605113,D-backs,2023,SS,4,6,0,3,3,0,4,2,77%,74%,3%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,"Witt Jr., Bobby",677951,Royals,2025,SS,18,24,10,15,-3,3,13,11,76%,72%,4%
252,"Wood, James",695578,Nationals,2025,LF,-6,-7,-1,-5,0,-2,-4,-3,87%,90%,-3%
253,"Yastrzemski, Mike",573262,---,2025,RF,-2,-3,-1,-1,1,-2,4,-7,87%,88%,-1%
254,"Young, Cole",702284,Mariners,2025,2B,-6,-9,-4,-6,0,0,-7,-1,72%,75%,-3%


Let's take a look at the columns in the dataset to see which ones we will use.

In [15]:
df_oaa.columns

Index(['last_name, first_name', 'player_id', 'display_team_name', 'year',
       'primary_pos_formatted', 'fielding_runs_prevented',
       'outs_above_average', 'outs_above_average_infront',
       'outs_above_average_lateral_toward3bline',
       'outs_above_average_lateral_toward1bline', 'outs_above_average_behind',
       'outs_above_average_rhh', 'outs_above_average_lhh',
       'actual_success_rate_formatted', 'adj_estimated_success_rate_formatted',
       'diff_success_rate_formatted'],
      dtype='object')

As mentioned above, we only need the player name, player_id, year and position from this dataframe. Everything else dives deeper into the base statistics, which we do not need for this analysis, so we will filter the dataframe down to a manageable number of columns.

In [16]:
df_oaa = df_oaa[['last_name, first_name', 'player_id', 'year', 'primary_pos_formatted']]

Check the number of NA values per column

In [17]:
df_oaa.isna().sum()

last_name, first_name    0
player_id                0
year                     0
primary_pos_formatted    0
dtype: int64

Beautiful, we have zero NA values in this dataframe, so we can merge it with the hit dataframe! We will merge on both player ID and year. Merging on player ID rather than name prevents players with very similar or identical names from interfering with the merge, and including year keeps each season's row separate since we are using three years of data. We will also include an indicator column, which tells us which rows did not find a match. Since df_hit contains most of the values we will use in this analysis, we use how = 'left': if a player has data in df_hit but not in the other dataframe, we want to keep that player in the analysis rather than drop them, as an inner merge would.

In [18]:
df_hit_oaa = df_hit.merge(df_oaa, on = ['player_id', 'year'], how = 'left', indicator = True)
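
The left merge with an indicator column can be sketched on toy data as follows. The frames and values are made up. The `validate="many_to_one"` argument is an optional extra that the notebook does not use: it asks pandas to raise a `MergeError` if the right-hand frame repeats a `(player_id, year)` pair, which would otherwise silently duplicate rows from the left frame.

```python
import pandas as pd

# Made-up hitting rows; player 2 has no matching position row
hit = pd.DataFrame({
    "player_id": [1, 2, 3],
    "year": [2023, 2023, 2024],
    "xwoba": [50.0, 60.0, 70.0],
})
pos = pd.DataFrame({
    "player_id": [1, 3],
    "year": [2023, 2024],
    "primary_pos_formatted": ["SS", "CF"],
})

merged = hit.merge(
    pos,
    on=["player_id", "year"],
    how="left",               # keep every hitting row, matched or not
    indicator=True,           # adds a _merge column: both / left_only / right_only
    validate="many_to_one",   # optional guard against duplicate keys in pos
)

counts = merged["_merge"].value_counts()
print(counts["both"], counts["left_only"])  # 2 1
```

A quick sanity check after a left merge is that the row count still equals the row count of the left frame; if it grew, the right frame contained duplicate keys.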

Let's take a look at the resulting dataframe after the merge.

In [19]:
df_hit_oaa

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,hard_hit_percent,k_percent,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,"last_name, first_name",primary_pos_formatted,_merge
0,"Andrus, Elvis",462101,2023,2.0,20.0,2.0,2.0,7.0,3.0,4.0,...,9.0,78.0,21.0,77.0,53.0,15.0,30,"Andrus, Elvis",2B,both
1,"Urías, Ramón",602104,2023,8.0,15.0,7.0,10.0,12.0,22.0,35.0,...,32.0,29.0,30.0,28.0,46.0,19.0,34,"Urías, Ramón",3B,both
2,"Chisholm Jr., Jazz",665862,2023,25.0,11.0,53.0,74.0,4.0,50.0,79.0,...,56.0,10.0,30.0,3.0,60.0,80.0,77,"Chisholm Jr., Jazz",CF,both
3,"Trout, Mike",545361,2023,95.0,83.0,92.0,91.0,96.0,62.0,95.0,...,95.0,15.0,88.0,26.0,92.0,75.0,96,"Trout, Mike",CF,both
4,"Taylor, Chris",621035,2023,28.0,11.0,32.0,54.0,32.0,29.0,62.0,...,28.0,5.0,78.0,4.0,86.0,66.0,74,"Taylor, Chris",SS,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,"Ward, Taylor",621493,2025,66.0,16.0,67.0,79.0,46.0,90.0,84.0,...,47.0,16.0,82.0,55.0,92.0,51.0,45,"Ward, Taylor",LF,both
756,"Barger, Addison",680718,2025,64.0,59.0,72.0,74.0,38.0,65.0,70.0,...,91.0,32.0,36.0,37.0,30.0,99.0,56,"Barger, Addison",3B,both
757,"Arias, Gabriel",672356,2025,8.0,4.0,28.0,50.0,2.0,48.0,70.0,...,49.0,1.0,18.0,1.0,6.0,94.0,40,"Arias, Gabriel",SS,both
758,"Schwarber, Kyle",656941,2025,98.0,51.0,99.0,99.0,95.0,99.0,99.0,...,100.0,11.0,97.0,5.0,90.0,,14,,,left_only


Let's check the indicator column to see how many rows didn't match up.

In [20]:
df_hit_oaa['_merge'].value_counts()

_merge
both          612
left_only     148
right_only      0
Name: count, dtype: int64

In [21]:
df_na_oaa = df_hit_oaa[df_hit_oaa['_merge']!='both']

In [22]:
df_na_oaa

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,hard_hit_percent,k_percent,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,"last_name, first_name",primary_pos_formatted,_merge
6,"Bailey, Patrick",672275,2023,29.0,38.0,44.0,49.0,17.0,32.0,63.0,...,33.0,17.0,17.0,43.0,63.0,,22,,,left_only
22,"Pederson, Joc",592626,2023,90.0,62.0,83.0,83.0,91.0,62.0,79.0,...,96.0,59.0,91.0,53.0,46.0,61.0,26,,,left_only
24,"Ozuna, Marcell",542303,2023,96.0,88.0,98.0,98.0,82.0,98.0,98.0,...,88.0,44.0,64.0,41.0,40.0,,28,,,left_only
67,"Maldonado, Martín",455117,2023,1.0,1.0,13.0,47.0,1.0,26.0,51.0,...,45.0,2.0,37.0,7.0,24.0,,1,,,left_only
81,"Perez, Salvador",521692,2023,52.0,69.0,74.0,69.0,22.0,67.0,53.0,...,67.0,38.0,1.0,21.0,1.0,,3,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
743,"Suzuki, Seiya",673548,2025,83.0,46.0,87.0,90.0,64.0,97.0,95.0,...,82.0,24.0,78.0,54.0,93.0,87.0,76,,,left_only
750,"Narváez, Carlos",665966,2025,13.0,9.0,26.0,40.0,10.0,37.0,51.0,...,56.0,26.0,51.0,28.0,39.0,,11,,,left_only
751,"Raleigh, Cal",663728,2025,92.0,16.0,96.0,98.0,81.0,98.0,99.0,...,85.0,14.0,95.0,10.0,25.0,,18,,,left_only
754,"Kirk, Alejandro",672386,2025,87.0,91.0,81.0,70.0,83.0,66.0,59.0,...,91.0,95.0,64.0,83.0,42.0,,2,,,left_only


Judging by the names in the preview of the dataframe, the rows that did not match were mostly catchers and DHs. Catchers do not have an oaa statistic, and neither do most DHs because they typically do not play the field. Their positions are also missing, so we will need to fill those in later, either manually or automatically. For now, let's drop the columns we no longer need.

In [23]:
df_hit_oaa = df_hit_oaa.drop(columns = ['last_name, first_name', '_merge'])

In [24]:
df_hit_oaa

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,exit_velocity,max_ev,hard_hit_percent,k_percent,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,primary_pos_formatted
0,"Andrus, Elvis",462101,2023,2.0,20.0,2.0,2.0,7.0,3.0,4.0,18.0,50,9.0,78.0,21.0,77.0,53.0,15.0,30,2B
1,"Urías, Ramón",602104,2023,8.0,15.0,7.0,10.0,12.0,22.0,35.0,49.0,51,32.0,29.0,30.0,28.0,46.0,19.0,34,3B
2,"Chisholm Jr., Jazz",665862,2023,25.0,11.0,53.0,74.0,4.0,50.0,79.0,69.0,74,56.0,10.0,30.0,3.0,60.0,80.0,77,CF
3,"Trout, Mike",545361,2023,95.0,83.0,92.0,91.0,96.0,62.0,95.0,89.0,92,95.0,15.0,88.0,26.0,92.0,75.0,96,CF
4,"Taylor, Chris",621035,2023,28.0,11.0,32.0,54.0,32.0,29.0,62.0,7.0,66,28.0,5.0,78.0,4.0,86.0,66.0,74,SS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,"Ward, Taylor",621493,2025,66.0,16.0,67.0,79.0,46.0,90.0,84.0,53.0,66,47.0,16.0,82.0,55.0,92.0,51.0,45,LF
756,"Barger, Addison",680718,2025,64.0,59.0,72.0,74.0,38.0,65.0,70.0,86.0,97,91.0,32.0,36.0,37.0,30.0,99.0,56,3B
757,"Arias, Gabriel",672356,2025,8.0,4.0,28.0,50.0,2.0,48.0,70.0,55.0,83,49.0,1.0,18.0,1.0,6.0,94.0,40,SS
758,"Schwarber, Kyle",656941,2025,98.0,51.0,99.0,99.0,95.0,99.0,99.0,98.0,98,100.0,11.0,97.0,5.0,90.0,,14,


Let's check the value counts for the position column we just merged in to confirm whether there were actually any catchers or DHs in the oaa dataframe.

In [25]:
df_hit_oaa['primary_pos_formatted'].value_counts()

primary_pos_formatted
2B    98
SS    97
CF    92
RF    85
3B    82
1B    81
LF    77
Name: count, dtype: int64
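
Note that `value_counts()` leaves out missing values by default, so the output above does not show how many players still lack a position. A small sketch with a made-up Series shows how `dropna=False` (or `isna().sum()`) surfaces them:

```python
import numpy as np
import pandas as pd

# Made-up position column with two missing entries
positions = pd.Series(["SS", "CF", np.nan, "SS", np.nan], name="primary_pos_formatted")

default_counts = positions.value_counts()             # NaN rows are ignored
with_missing = positions.value_counts(dropna=False)   # NaN gets its own row
n_missing = positions.isna().sum()

print(default_counts.sum())  # 3
print(with_missing.sum())    # 5
print(n_missing)             # 2
```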

As expected, there are no catchers or DHs, so let's move on to the next dataframe to clean and merge: fielding run value. This dataframe contains the fielding run value metric for qualifying players, which can be used to compare the defensive value of players across positions. For example, we can now compare shortstops to catchers on a level playing field to determine if both players bring similar value to their respective positions.

In [26]:
df_frv

Unnamed: 0,name,id,total_runs,inf_of_runs,range_runs,arm_runs,dp_runs,catching_runs,framing_runs,throwing_runs,...,outs_total,outs_2,outs_3,outs_4,outs_5,outs_6,outs_7,outs_8,outs_9,year
0,"Bailey, Patrick",672275,19.530234,,,,,19.530234,16.957725,4.926869,...,2297,2297,0,0,0,0,0,0,0,2023
1,"Doyle, Brenton",686668,18.775926,18.775926,12.181571,6.594356,,,,,...,3071,0,0,0,0,0,0,3071,0,2023
2,"Hedges, Austin",595978,16.333202,,,,,16.333202,14.485383,-0.228077,...,1712,1701,0,0,0,0,0,0,0,2023
3,"Swanson, Dansby",621020,15.740000,15.740000,14.775248,0.008807,0.955945,,,,...,3838,0,0,0,0,3838,0,0,0,2023
4,"Giménez, Andrés",665926,15.460235,15.460235,13.391425,0.310122,1.758688,,,,...,3924,0,0,3924,0,0,0,0,0,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,"Castellanos, Nick",592206,-12.151476,-12.151476,-10.581465,-1.570011,,,,,...,3625,0,0,0,0,0,0,0,3625,2025
295,"Ramírez, Agustín",682663,-12.656344,,,,,-12.656344,0.522107,-6.246042,...,1817,1817,0,0,0,0,0,0,0,2025
296,"Soto, Juan",665742,-12.662585,-12.662585,-10.693049,-1.969537,,,,,...,4122,0,0,0,0,0,0,0,4122,2025
297,"Adell, Jo",666176,-12.978205,-12.978205,-11.017387,-1.960818,,,,,...,3777,0,0,0,0,0,0,2172,1605,2025


In this one we only need the name, player ID, total fielding runs and year, so we will drop everything else in the dataset.

In [27]:
df_frv = df_frv[['name', 'id', 'total_runs', 'year']]

Let's take a look at the number of NA values in the dataset.

In [28]:
df_frv.isna().sum()

name          0
id            0
total_runs    0
year          0
dtype: int64

Looks like we do not have any NA values here either, so let's merge it into the dataframe compiled so far.

In [29]:
df_frv_hit_oaa = df_hit_oaa.merge(df_frv, left_on=['player_id', 'year'], right_on = ['id', 'year'], how = 'left', indicator=True)
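
Because the key is called `player_id` on the left and `id` on the right, this merge keeps both key columns, and any unmatched row puts NaN into `id`, which forces that column to float (hence values like 462101.0 in the output). One alternative, shown here on made-up frames as a sketch rather than a change to the notebook, is to rename the right-hand key first so that `on=` can share a single column:

```python
import pandas as pd

# Made-up frames: player 2 has no fielding row
hit = pd.DataFrame({"player_id": [1, 2], "year": [2023, 2023], "xwoba": [50.0, 60.0]})
frv = pd.DataFrame({"id": [1], "name": ["A, Player"], "year": [2023], "total_runs": [3.5]})

# Different key names: both player_id and id survive, and id becomes float
two_keys = hit.merge(frv, left_on=["player_id", "year"], right_on=["id", "year"], how="left")

# Renaming the key (and dropping the redundant name) leaves one integer key column
one_key = hit.merge(
    frv.rename(columns={"id": "player_id"}).drop(columns="name"),
    on=["player_id", "year"],
    how="left",
)

print(two_keys["id"].dtype)      # float64
print(one_key.columns.tolist())  # ['player_id', 'year', 'xwoba', 'total_runs']
```

The notebook's approach works as well, since the extra `name` and `id` columns are dropped a few cells later.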

In [30]:
df_frv_hit_oaa

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,primary_pos_formatted,name,id,total_runs,_merge
0,"Andrus, Elvis",462101,2023,2.0,20.0,2.0,2.0,7.0,3.0,4.0,...,21.0,77.0,53.0,15.0,30,2B,"Andrus, Elvis",462101.0,3.925308,both
1,"Urías, Ramón",602104,2023,8.0,15.0,7.0,10.0,12.0,22.0,35.0,...,30.0,28.0,46.0,19.0,34,3B,"Urías, Ramón",602104.0,-6.949099,both
2,"Chisholm Jr., Jazz",665862,2023,25.0,11.0,53.0,74.0,4.0,50.0,79.0,...,30.0,3.0,60.0,80.0,77,CF,"Chisholm Jr., Jazz",665862.0,3.439776,both
3,"Trout, Mike",545361,2023,95.0,83.0,92.0,91.0,96.0,62.0,95.0,...,88.0,26.0,92.0,75.0,96,CF,"Trout, Mike",545361.0,3.512130,both
4,"Taylor, Chris",621035,2023,28.0,11.0,32.0,54.0,32.0,29.0,62.0,...,78.0,4.0,86.0,66.0,74,SS,"Taylor, Chris",621035.0,-2.797986,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,"Ward, Taylor",621493,2025,66.0,16.0,67.0,79.0,46.0,90.0,84.0,...,82.0,55.0,92.0,51.0,45,LF,"Ward, Taylor",621493.0,-2.242588,both
756,"Barger, Addison",680718,2025,64.0,59.0,72.0,74.0,38.0,65.0,70.0,...,36.0,37.0,30.0,99.0,56,3B,"Barger, Addison",680718.0,-0.207882,both
757,"Arias, Gabriel",672356,2025,8.0,4.0,28.0,50.0,2.0,48.0,70.0,...,18.0,1.0,6.0,94.0,40,SS,"Arias, Gabriel",672356.0,2.080435,both
758,"Schwarber, Kyle",656941,2025,98.0,51.0,99.0,99.0,95.0,99.0,99.0,...,97.0,5.0,90.0,,14,,,,,left_only


Great, let's take a look at how many rows did not merge properly.

In [31]:
df_frv_hit_oaa['_merge'].value_counts()

_merge
both          701
left_only      59
right_only      0
Name: count, dtype: int64

Looks like a few rows didn't merge. My first thought is that these are the DHs we've been discussing, since they likely did not play enough innings in the field to qualify for this statistic. We will create a separate dataframe of only the rows that did not merge to investigate this.

In [32]:
df_frv_hit_oaa_na = df_frv_hit_oaa[df_frv_hit_oaa['_merge']!='both']

In [33]:
df_frv_hit_oaa_na

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,primary_pos_formatted,name,id,total_runs,_merge
22,"Pederson, Joc",592626,2023,90.0,62.0,83.0,83.0,91.0,62.0,79.0,...,91.0,53.0,46.0,61.0,26,,,,,left_only
24,"Ozuna, Marcell",542303,2023,96.0,88.0,98.0,98.0,82.0,98.0,98.0,...,64.0,41.0,40.0,,28,,,,,left_only
83,"Turner, Justin",457759,2023,65.0,60.0,68.0,64.0,61.0,42.0,23.0,...,44.0,92.0,71.0,12.0,9,,,,,left_only
88,"Morel, Christopher",666624,2023,68.0,30.0,84.0,89.0,29.0,77.0,95.0,...,49.0,2.0,43.0,99.0,82,,,,,left_only
98,"Buxton, Byron",621439,2023,35.0,3.0,52.0,81.0,19.0,50.0,90.0,...,71.0,19.0,53.0,,94,,,,,left_only
102,"Martinez, J.D.",502110,2023,90.0,75.0,96.0,97.0,47.0,89.0,98.0,...,35.0,7.0,28.0,,19,,,,,left_only
112,"Rooker, Brent",667670,2023,71.0,26.0,80.0,87.0,47.0,85.0,93.0,...,60.0,1.0,57.0,61.0,42,,,,,left_only
121,"Cabrera, Miguel",408234,2023,18.0,37.0,15.0,12.0,34.0,7.0,11.0,...,49.0,46.0,49.0,,1,,,,,left_only
132,"Jiménez, Eloy",650391,2023,32.0,55.0,48.0,45.0,20.0,63.0,58.0,...,19.0,24.0,9.0,,31,,,,,left_only
153,"Bell, Josh",605137,2023,72.0,60.0,72.0,69.0,70.0,81.0,64.0,...,72.0,46.0,54.0,2.0,17,,,,,left_only


Looking through this list of players, I can confirm these players are all DHs, which gives us the opportunity to fill in the position column for each of them for future reference.

In [34]:
# Label every unmatched row as a DH in one vectorized assignment
df_frv_hit_oaa.loc[df_frv_hit_oaa_na.index, 'primary_pos_formatted'] = 'DH'

Now that we have filled in the position of those players, we can look at who is still missing a position by filtering down to the remaining players without a position value.

We'll go ahead and call this dataframe of players with missing positions catchers for now.

In [35]:
catchers = df_frv_hit_oaa[df_frv_hit_oaa['primary_pos_formatted'].isna()]

In [36]:
catchers

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,primary_pos_formatted,name,id,total_runs,_merge
6,"Bailey, Patrick",672275,2023,29.0,38.0,44.0,49.0,17.0,32.0,63.0,...,17.0,43.0,63.0,,22,,"Bailey, Patrick",672275.0,19.530234,both
67,"Maldonado, Martín",455117,2023,1.0,1.0,13.0,47.0,1.0,26.0,51.0,...,37.0,7.0,24.0,,1,,"Maldonado, Martín",455117.0,-13.352223,both
81,"Perez, Salvador",521692,2023,52.0,69.0,74.0,69.0,22.0,67.0,53.0,...,1.0,21.0,1.0,,3,,"Perez, Salvador",521692.0,-7.446756,both
90,"Vázquez, Christian",543877,2023,1.0,11.0,3.0,8.0,5.0,3.0,7.0,...,33.0,43.0,23.0,,7,,"Vázquez, Christian",543877.0,7.528239,both
94,"Rutschman, Adley",668939,2023,93.0,95.0,77.0,56.0,97.0,68.0,39.0,...,91.0,94.0,81.0,,40,,"Rutschman, Adley",668939.0,5.876166,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
721,"Kelly, Carson",608348,2025,44.0,23.0,42.0,54.0,50.0,38.0,55.0,...,76.0,75.0,88.0,,17,,"Kelly, Carson",608348.0,2.993758,both
729,"Stephenson, Tyler",663886,2025,34.0,6.0,44.0,65.0,30.0,37.0,90.0,...,78.0,14.0,67.0,,34,,"Stephenson, Tyler",663886.0,-5.878187,both
750,"Narváez, Carlos",665966,2025,13.0,9.0,26.0,40.0,10.0,37.0,51.0,...,51.0,28.0,39.0,,11,,"Narváez, Carlos",665966.0,12.270085,both
751,"Raleigh, Cal",663728,2025,92.0,16.0,96.0,98.0,81.0,98.0,99.0,...,95.0,10.0,25.0,,18,,"Raleigh, Cal",663728.0,6.979936,both


In [37]:
catchers['player_name'].unique()

array(['Bailey, Patrick', 'Maldonado, Martín', 'Perez, Salvador',
       'Vázquez, Christian', 'Rutschman, Adley', 'Alvarez, Francisco',
       'Langeliers, Shea', 'Yoshida, Masataka', 'Judge, Aaron',
       'Gomes, Yan', 'Smith, Will', 'Stephenson, Tyler', 'Burleson, Alec',
       'Tellez, Rowdy', 'Ruiz, Keibert', 'Rogers, Jake', 'Heim, Jonah',
       'Realmuto, J.T.', 'Fraley, Jake', 'Wong, Connor',
       'Cooper, Garrett', 'Contreras, Willson', 'Murphy, Sean',
       'Grossman, Robbie', 'Raleigh, Cal', 'Sheets, Gavin',
       'Sabol, Blake', 'Moreno, Gabriel', 'Contreras, William',
       'Grandal, Yasmani', 'Díaz, Elias', 'Kirk, Alejandro', 'Naylor, Bo',
       'Fermin, Freddy', "O'Hoppe, Logan", 'Diaz, Yainer',
       "O'Hearn, Ryan", 'Wells, Austin', 'Lee, Korey', 'Jeffers, Ryan',
       'Amaya, Miguel', "d'Arnaud, Travis", 'Tauchman, Mike',
       'Quero, Edgar', 'Dingler, Dillon', 'Baldwin, Drake', 'Hicks, Liam',
       'Pagés, Pedro', 'Andujar, Miguel', 'Ramírez, Agustín',
  

Unfortunately, the list contains a mix of positions rather than only catchers, so we will fill these in manually by creating a dictionary that maps each player name to the position they play.

In [38]:
pos_dict = {'Bailey, Patrick':'C', 'Maldonado, Martín':'C', 'Perez, Salvador':'C',
       'Vázquez, Christian':'C', 'Rutschman, Adley':'C', 'Alvarez, Francisco':'C',
       'Langeliers, Shea':'C', 'Yoshida, Masataka':'LF', 'Judge, Aaron':'RF',
       'Gomes, Yan':'C', 'Smith, Will':'C', 'Stephenson, Tyler':'C', 'Burleson, Alec':'LF',
       'Tellez, Rowdy':'1B', 'Ruiz, Keibert':'C', 'Rogers, Jake':'C', 'Heim, Jonah':'C',
       'Realmuto, J.T.':'C', 'Fraley, Jake':'RF', 'Wong, Connor':'C',
       'Cooper, Garrett':'1B', 'Contreras, Willson':'C', 'Murphy, Sean':'C',
       'Grossman, Robbie':'LF', 'Raleigh, Cal':'C', 'Sheets, Gavin':'DH',
       'Sabol, Blake':'C', 'Moreno, Gabriel':'C', 'Contreras, William':'C',
       'Grandal, Yasmani':'C', 'Díaz, Elias':'C', 'Kirk, Alejandro':'C', 'Naylor, Bo':'C',
       'Fermin, Freddy':'C', "O'Hoppe, Logan":'C', 'Diaz, Yainer':'C',
       "O'Hearn, Ryan":'1B', 'Wells, Austin':'C', 'Lee, Korey':'C', 'Jeffers, Ryan':'C',
       'Amaya, Miguel':'C', "d'Arnaud, Travis":'C', 'Tauchman, Mike':'RF',
       'Quero, Edgar':'C', 'Dingler, Dillon':'C', 'Baldwin, Drake':'C', 'Hicks, Liam':'C',
       'Pagés, Pedro':'C', 'Andujar, Miguel':'3B', 'Ramírez, Agustín':'C',
       'Hays, Austin':'LF', 'Goodman, Hunter':'C', 'Rice, Ben':'DH', 'Caratini, Victor':'C',
       'Kelly, Carson':'C', 'Narváez, Carlos':'C'}

catchers = catchers.copy()  # work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
catchers["primary_pos_formatted"] = catchers["player_name"].map(pos_dict).fillna(catchers["primary_pos_formatted"])


Let's check to make sure this mapping worked.

In [39]:
catchers['primary_pos_formatted'].value_counts()

primary_pos_formatted
C     76
LF     4
RF     3
1B     3
DH     3
3B     1
Name: count, dtype: int64
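
The dictionary mapping above can be sketched on toy data. This version makes two assumptions that differ from the notebook: it keys the dictionary on `player_id` instead of name (in keeping with the earlier point that names can collide), and it fills only the gaps, leaving existing positions alone. All names, IDs and positions below are made up.

```python
import numpy as np
import pandas as pd

# Made-up players: two have no position yet
players = pd.DataFrame({
    "player_id": [10, 11, 12],
    "player_name": ["Doe, John", "Roe, Rick", "Poe, Pat"],
    "primary_pos_formatted": [np.nan, "SS", np.nan],
})
pos_by_id = {10: "C"}  # hypothetical lookup; player 12 is deliberately left out

# map() returns NaN for IDs that are not in the dictionary
mapped = players["player_id"].map(pos_by_id)

# Fill only the missing positions, aligned on the row index
players["primary_pos_formatted"] = players["primary_pos_formatted"].fillna(mapped)

# Anyone still missing was not covered by the dictionary
still_missing = players.loc[players["primary_pos_formatted"].isna(), "player_name"].tolist()
print(players["primary_pos_formatted"].tolist())  # ['C', 'SS', nan]
print(still_missing)                              # ['Poe, Pat']
```

Listing the players who are still missing after the fill is a cheap way to catch names or IDs that were left out of a hand-built dictionary.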

The following code will transfer these positions into the main dataframe

In [40]:
df_frv_hit_oaa.loc[catchers.index, "primary_pos_formatted"] = catchers["primary_pos_formatted"]

Let's now check our balance of positions.

In [41]:
df_frv_hit_oaa['primary_pos_formatted'].value_counts()

primary_pos_formatted
2B    97
SS    97
CF    92
RF    88
1B    84
3B    83
LF    81
C     76
DH    62
Name: count, dtype: int64

In [42]:
df_frv_hit_oaa.isna().sum()

player_name                0
player_id                  0
year                       0
xwoba                      0
xba                        0
xslg                       0
xiso                       0
xobp                       0
brl                        0
brl_percent                0
exit_velocity              0
max_ev                     0
hard_hit_percent           0
k_percent                  0
bb_percent                 0
whiff_percent              0
chase_percent              0
arm_strength_hit         104
sprint_speed               0
primary_pos_formatted      0
name                      59
id                        59
total_runs                59
_merge                     0
dtype: int64

Great, we no longer have any players without a position, so let's go ahead and drop the columns from the merge that we no longer need.

In [43]:
df_frv_hit_oaa = df_frv_hit_oaa.drop(columns = ['name', 'id', '_merge'])

In [44]:
df_frv_hit_oaa

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,max_ev,hard_hit_percent,k_percent,bb_percent,whiff_percent,chase_percent,arm_strength_hit,sprint_speed,primary_pos_formatted,total_runs
0,"Andrus, Elvis",462101,2023,2.0,20.0,2.0,2.0,7.0,3.0,4.0,...,50,9.0,78.0,21.0,77.0,53.0,15.0,30,2B,3.925308
1,"Urías, Ramón",602104,2023,8.0,15.0,7.0,10.0,12.0,22.0,35.0,...,51,32.0,29.0,30.0,28.0,46.0,19.0,34,3B,-6.949099
2,"Chisholm Jr., Jazz",665862,2023,25.0,11.0,53.0,74.0,4.0,50.0,79.0,...,74,56.0,10.0,30.0,3.0,60.0,80.0,77,CF,3.439776
3,"Trout, Mike",545361,2023,95.0,83.0,92.0,91.0,96.0,62.0,95.0,...,92,95.0,15.0,88.0,26.0,92.0,75.0,96,CF,3.512130
4,"Taylor, Chris",621035,2023,28.0,11.0,32.0,54.0,32.0,29.0,62.0,...,66,28.0,5.0,78.0,4.0,86.0,66.0,74,SS,-2.797986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,"Ward, Taylor",621493,2025,66.0,16.0,67.0,79.0,46.0,90.0,84.0,...,66,47.0,16.0,82.0,55.0,92.0,51.0,45,LF,-2.242588
756,"Barger, Addison",680718,2025,64.0,59.0,72.0,74.0,38.0,65.0,70.0,...,97,91.0,32.0,36.0,37.0,30.0,99.0,56,3B,-0.207882
757,"Arias, Gabriel",672356,2025,8.0,4.0,28.0,50.0,2.0,48.0,70.0,...,83,49.0,1.0,18.0,1.0,6.0,94.0,40,SS,2.080435
758,"Schwarber, Kyle",656941,2025,98.0,51.0,99.0,99.0,95.0,99.0,99.0,...,98,100.0,11.0,97.0,5.0,90.0,,14,DH,


Great, let's now move on to the next dataframe to merge into the dataset, which will be df_batted. This dataframe gives us information on the tendencies of each player's batted balls. For example, it gives us the percentage of the time a player hits a ground ball or fly ball, in addition to the percentage of the time they hit the ball to a specific field. This can give us an idea of the approach a hitter takes into each of their at-bats

In [45]:
df_batted

Unnamed: 0,id,name,bbe,gb_rate,air_rate,fb_rate,ld_rate,pu_rate,pull_rate,straight_rate,oppo_rate,pull_gb_rate,straight_gb_rate,oppo_gb_rate,pull_air_rate,straight_air_rate,oppo_air_rate,year
0,608324,"Bregman, Alex",537,0.353818,0.646182,0.310987,0.249534,0.085661,0.407821,0.370577,0.221601,0.199255,0.122905,0.031657,0.208566,0.247672,0.189944,2023
1,676801,"McCormick, Chas",289,0.422145,0.577855,0.311419,0.224913,0.041522,0.422145,0.332180,0.245675,0.249135,0.134948,0.038062,0.173010,0.197232,0.207612,2023
2,663656,"Tucker, Kyle",492,0.386179,0.613821,0.327236,0.241870,0.044715,0.441057,0.327236,0.231707,0.209350,0.128049,0.048780,0.231707,0.199187,0.182927,2023
3,665161,"Peña, Jeremy",450,0.544444,0.455556,0.188889,0.228889,0.037778,0.417778,0.355556,0.226667,0.284444,0.202222,0.057778,0.133333,0.153333,0.168889,2023
4,514888,"Altuve, Jose",290,0.493103,0.506897,0.217241,0.203448,0.086207,0.451724,0.382759,0.165517,0.262069,0.210345,0.020690,0.189655,0.172414,0.144828,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
246,682622,"Marte, Noelvi",256,0.457031,0.542969,0.253906,0.214844,0.074219,0.445312,0.320312,0.234375,0.269531,0.144531,0.042969,0.175781,0.175781,0.191406,2025
247,595777,"Profar, Jurickson",259,0.362934,0.637066,0.312741,0.193050,0.131274,0.444015,0.305019,0.250965,0.208494,0.104247,0.050193,0.235521,0.200772,0.200772,2025
248,683734,"Vaughn, Andrew",333,0.417417,0.582583,0.270270,0.252252,0.060060,0.375375,0.381381,0.243243,0.195195,0.180180,0.042042,0.180180,0.201201,0.201201,2025
249,608324,"Bregman, Alex",365,0.389041,0.610959,0.265753,0.232877,0.112329,0.463014,0.361644,0.175342,0.219178,0.134247,0.035616,0.243836,0.227397,0.139726,2025


There are a few overlapping columns here that we do not need, mainly the ground ball and air ball rates to specific fields. We already have the general ground ball and fly ball rates in addition to the field tendencies, so we would essentially double count this information if we included the field-specific splits. Therefore we will drop these columns, along with bbe (the count of batted ball events).

In [46]:
df_batted = df_batted.drop(columns = ['bbe', 'pull_gb_rate', 'straight_gb_rate', 'oppo_gb_rate',
                                      'pull_air_rate', 'straight_air_rate', 'oppo_air_rate'])
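To make the double counting concrete, the field-specific splits can be checked against the overall field rates. The sketch below uses the values displayed above for the first row of df_batted (Bregman, 2023); the tolerance is an assumption to allow for the rounding in the displayed numbers.

```python
import math

# Values copied from the first displayed row of df_batted (Bregman, 2023)
row = {
    "pull_rate": 0.407821, "pull_gb_rate": 0.199255, "pull_air_rate": 0.208566,
    "straight_rate": 0.370577, "straight_gb_rate": 0.122905, "straight_air_rate": 0.247672,
    "oppo_rate": 0.221601, "oppo_gb_rate": 0.031657, "oppo_air_rate": 0.189944,
}

tol = 1e-5  # assumed tolerance, since the displayed values are rounded to six decimals
results = {}
for field in ("pull", "straight", "oppo"):
    # The ground ball and air ball shares to a field should add up to that field's overall rate
    split_sum = row[f"{field}_gb_rate"] + row[f"{field}_air_rate"]
    results[field] = math.isclose(split_sum, row[f"{field}_rate"], abs_tol=tol)
    print(field, round(split_sum, 6), row[f"{field}_rate"], results[field])
```

For this row, each pair of splits adds up to the overall rate for that field, so the splits carry information that is already partly captured by the columns we keep.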

Great, let's check the number of NA values in this dataset

In [47]:
df_batted.isna().sum()

id               0
name             0
gb_rate          0
air_rate         0
fb_rate          0
ld_rate          0
pu_rate          0
pull_rate        0
straight_rate    0
oppo_rate        0
year             0
dtype: int64

Awesome, no NA values again, so let's go ahead and merge this with the compiled dataframe.

In [48]:
df_frv_hit_oaa_batted = df_frv_hit_oaa.merge(df_batted, left_on = ['player_id', 'year'], right_on = ['id', 'year'], how = 'left', indicator = True)

Now let's see if there are any rows that did not merge

In [49]:
df_frv_hit_oaa_batted[df_frv_hit_oaa_batted['_merge']!='both']

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,name,gb_rate,air_rate,fb_rate,ld_rate,pu_rate,pull_rate,straight_rate,oppo_rate,_merge


It looks like this one merged cleanly, so let's go ahead and drop the columns we do not need

In [50]:
df_frv_hit_oaa_batted = df_frv_hit_oaa_batted.drop(columns = ['_merge', 'name', 'id'])
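The merge, check and drop pattern used throughout this notebook can be sketched on toy data as follows. The dataframes and values here are hypothetical, and the validate argument and the row-count assertion are optional extra safety checks that the cells above do not use.

```python
import pandas as pd

# Hypothetical toy frames standing in for the compiled dataframe and a new stats table
left = pd.DataFrame(
    {"player_id": [1, 2, 3], "year": [2023, 2023, 2024], "xwoba": [50.0, 60.0, 70.0]}
)
right = pd.DataFrame({"id": [1, 2], "year": [2023, 2023], "gb_rate": [0.40, 0.45]})

merged = left.merge(
    right,
    left_on=["player_id", "year"],
    right_on=["id", "year"],
    how="left",
    indicator=True,          # adds the _merge column used for the checks
    validate="one_to_one",   # raises MergeError if a key pair repeats on either side
)

# A left merge on unique keys should never change the number of rows
assert len(merged) == len(left)

# Rows that found no partner in the right-hand table
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["player_id", "year"]])
```

If a player-year appeared twice in either table, the one_to_one validation would stop the merge instead of silently duplicating rows.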

Let's now move on to our next merge, which will be with the zone dataframe. This dataframe tells us how each player performs based on the location of the pitch. For example, runs_heart tells us how the player hits when the pitch is in the heart of the plate, or in common baseball terms, when the pitch is "right down the middle." We can also see how a player performs when the pitch is on the edge of the strike zone with runs_shadow

In [51]:
df_zone

Unnamed: 0,year,"last_name, first_name",player_id,team_id,pa,pitches,runs_all,runs_heart,runs_shadow,runs_chase,runs_waste
0,2023,"Rivera, Emmanuel",656896,109,283,1132,-7.972698,-10.113963,-11.540433,9.249256,4.432443
1,2023,"Moreno, Gabriel",672515,109,380,1544,0.133253,-3.183273,-9.783677,8.007793,5.092410
2,2023,"Rutschman, Adley",668939,110,681,2904,21.693932,5.629867,-26.349328,27.574728,14.838664
3,2023,"Henderson, Gunnar",683002,110,620,2405,12.863924,8.582633,-19.424328,13.636175,10.069445
4,2023,"Duvall, Adam",594807,111,352,1336,6.756137,6.876232,-10.717165,3.702287,6.894783
...,...,...,...,...,...,...,...,...,...,...,...
295,2025,"Davis, Henry",680779,134,283,1069,-19.462939,-13.457292,-13.556189,4.351386,3.199156
296,2025,"Gorman, Nolan",669357,138,400,1655,-7.973539,-12.064263,-9.544836,8.665801,4.969759
297,2025,"Morel, Christopher",666624,139,305,1177,-5.341804,-3.955563,-13.254566,7.402819,4.465506
298,2025,"Kiner-Falefa, Isiah",643396,141,459,1631,-14.940250,-6.657197,-15.158539,2.991070,3.884416


Let's drop some columns that we do not need, such as the team ID, plate appearances and number of pitches, since these will be uninformative to us. We also drop runs_all, since it is the total across the four zone columns we are keeping

In [52]:
df_zone = df_zone.drop(columns = ['team_id', 'pa', 'pitches', 'runs_all'])
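As a quick sanity check on dropping runs_all, the displayed values suggest it is the sum of the four zone columns. The sketch below uses the first two rows of df_zone shown above; the tolerance is an assumption to allow for floating point rounding.

```python
import math

# Values copied from the first two displayed rows of df_zone (2023)
rows = {
    "Rivera, Emmanuel": {
        "runs_all": -7.972698, "runs_heart": -10.113963, "runs_shadow": -11.540433,
        "runs_chase": 9.249256, "runs_waste": 4.432443,
    },
    "Moreno, Gabriel": {
        "runs_all": 0.133253, "runs_heart": -3.183273, "runs_shadow": -9.783677,
        "runs_chase": 8.007793, "runs_waste": 5.092410,
    },
}

zone_cols = ["runs_heart", "runs_shadow", "runs_chase", "runs_waste"]
checks = {}
for name, row in rows.items():
    zone_total = sum(row[col] for col in zone_cols)
    # Assumed tolerance of 1e-5 for rounding in the displayed values
    checks[name] = math.isclose(zone_total, row["runs_all"], abs_tol=1e-5)
    print(name, round(zone_total, 6), row["runs_all"], checks[name])
```

Both rows agree with runs_all, so keeping it alongside the four zone columns would count the same information twice.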

Let's check to see if there are any NA values in the dataset we need to clean up

In [53]:
df_zone.isna().sum()

year                     0
last_name, first_name    0
player_id                0
runs_heart               0
runs_shadow              0
runs_chase               0
runs_waste               0
dtype: int64

There are none, so let's go ahead with the merge into the compiled dataframe

In [54]:
df_frv_hit_oaa_batted_zone = df_frv_hit_oaa_batted.merge(df_zone, left_on = ['player_id', 'year'],
                                                          right_on = ['player_id', 'year'], how = 'left', indicator = True)

Let's check the merge indicator to see if any rows did not merge

In [55]:
df_frv_hit_oaa_batted_zone['_merge'].value_counts()

_merge
both          760
left_only       0
right_only      0
Name: count, dtype: int64

It looks like all of the rows merged cleanly, so we can drop the columns we do not need from the merge

In [56]:
df_frv_hit_oaa_batted_zone = df_frv_hit_oaa_batted_zone.drop(columns = ['last_name, first_name', '_merge'])

Great, let's move on to the next dataframe to merge into the dataset, which is the swing path dataframe. This dataset tells us how the batter physically swings the bat. For example, avg_bat_speed tells us the average velocity of a player's swing as it comes through the zone. In addition, swing_tilt tells us the angle of their bat relative to a common reference point as it comes through the zone. This can tell us which players have an uppercut swing or a flatter swing, and may also hint at whether their general approach is to hit line drives or home runs

In [57]:
df_swing_path

Unnamed: 0,id,name,side,avg_bat_speed,swing_tilt,attack_angle,attack_direction,ideal_attack_angle_rate,avg_intercept_y_vs_plate,avg_intercept_y_vs_batter,avg_batter_y_position,avg_batter_x_position,competitive_swings,year
0,519317,"Stanton, Giancarlo",R,80.981810,27.705285,8.370458,-0.056991,0.572539,0.242772,26.190684,25.947912,34.480968,386,2023
1,670242,"Wallner, Matt",L,77.402934,34.240172,13.354930,-1.127491,0.497409,8.472905,35.373420,26.900515,31.697891,386,2023
2,660271,"Ohtani, Shohei",L,77.382653,35.271624,12.613620,-2.032904,0.506211,-2.586794,27.522917,30.109711,28.738236,322,2023
3,660670,"Acuña Jr., Ronald",R,77.302990,34.582286,6.764888,5.193785,0.484407,1.761949,30.599879,28.837930,27.943054,481,2023
4,677594,"Rodríguez, Julio",R,77.171916,33.441041,7.823291,-4.647368,0.493204,15.785897,35.411550,19.625653,28.325287,515,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
221,624428,"Frazier, Adam",L,65.886248,39.180122,11.392098,3.022402,0.456119,2.923323,24.042431,21.119108,22.087547,809,2025
222,527038,"Flores, Wilmer",R,65.529888,32.142503,11.227196,-6.604417,0.498054,3.697373,31.382987,27.685613,25.257294,771,2025
223,805779,"Wilson, Jacob",R,63.878761,35.131265,2.826548,-1.020782,0.363384,2.020192,30.573419,28.553227,25.665803,721,2025
224,680757,"Kwan, Steven",L,63.681147,35.618400,3.242778,-2.507621,0.353249,6.951045,29.794135,22.843090,22.666925,954,2025


OK, so we only need a few columns from this dataframe: batter side, bat speed, swing tilt, attack angle and attack direction. The rest of the columns are a little too specific to add anything to the comparison between players

In [58]:
df_swing_path = df_swing_path[['id', 'name', 'side', 'avg_bat_speed', 'swing_tilt',
                                'attack_angle', 'attack_direction', 'year']]

Let's check to see if there are any NA values that need to be cleaned up here

In [59]:
df_swing_path.isna().sum()

id                  0
name                0
side                0
avg_bat_speed       0
swing_tilt          0
attack_angle        0
attack_direction    0
year                0
dtype: int64

There are none, so let's proceed with the merge, which brings us close to the final dataframe

In [60]:
df_frv_hit_oaa_batted_zone_path = df_frv_hit_oaa_batted_zone.merge(df_swing_path, left_on = ['player_id', 'year'],
                                                                   right_on = ['id', 'year'], how = 'left', indicator = True)

In [61]:
df_frv_hit_oaa_batted_zone_path

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,runs_chase,runs_waste,id,name,side,avg_bat_speed,swing_tilt,attack_angle,attack_direction,_merge
0,"Andrus, Elvis",462101,2023,2.0,20.0,2.0,2.0,7.0,3.0,4.0,...,6.873864,7.410766,,,,,,,,left_only
1,"Urías, Ramón",602104,2023,8.0,15.0,7.0,10.0,12.0,22.0,35.0,...,6.098820,7.534916,,,,,,,,left_only
2,"Chisholm Jr., Jazz",665862,2023,25.0,11.0,53.0,74.0,4.0,50.0,79.0,...,5.796056,5.583808,665862.0,"Chisholm Jr., Jazz",L,72.940250,36.151425,15.408843,-2.060608,both
3,"Trout, Mike",545361,2023,95.0,83.0,92.0,91.0,96.0,62.0,95.0,...,15.071760,7.084039,,,,,,,,left_only
4,"Taylor, Chris",621035,2023,28.0,11.0,32.0,54.0,32.0,29.0,62.0,...,11.369140,6.977911,621035.0,"Taylor, Chris",R,71.168454,39.096488,14.993788,-1.330861,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
755,"Ward, Taylor",621493,2025,66.0,16.0,67.0,79.0,46.0,90.0,84.0,...,18.862416,10.472059,621493.0,"Ward, Taylor",R,69.470514,32.118654,9.095254,-2.890135,both
756,"Barger, Addison",680718,2025,64.0,59.0,72.0,74.0,38.0,65.0,70.0,...,8.873050,5.586002,680718.0,"Barger, Addison",L,75.928020,28.809564,9.064356,-4.995597,both
757,"Arias, Gabriel",672356,2025,8.0,4.0,28.0,50.0,2.0,48.0,70.0,...,1.843013,6.032518,672356.0,"Arias, Gabriel",R,75.076463,29.076045,9.727773,1.203260,both
758,"Schwarber, Kyle",656941,2025,98.0,51.0,99.0,99.0,95.0,99.0,99.0,...,25.343806,19.711952,656941.0,"Schwarber, Kyle",L,77.337696,30.311409,14.571615,-5.779753,both


Let's take a look to see how many rows did not merge cleanly

In [62]:
df_frv_hit_oaa_batted_zone_path['_merge'].value_counts()

_merge
both          639
left_only     121
right_only      0
Name: count, dtype: int64

Looks like we have a good number of rows that did not merge cleanly, so let's take a look to see what is happening here. We will again do this by creating a separate dataframe for the rows that did not merge

In [63]:
na = df_frv_hit_oaa_batted_zone_path[df_frv_hit_oaa_batted_zone_path['_merge']!='both']

In [64]:
na

Unnamed: 0,player_name,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,runs_chase,runs_waste,id,name,side,avg_bat_speed,swing_tilt,attack_angle,attack_direction,_merge
0,"Andrus, Elvis",462101,2023,2.0,20.0,2.0,2.0,7.0,3.0,4.0,...,6.873864,7.410766,,,,,,,,left_only
1,"Urías, Ramón",602104,2023,8.0,15.0,7.0,10.0,12.0,22.0,35.0,...,6.098820,7.534916,,,,,,,,left_only
3,"Trout, Mike",545361,2023,95.0,83.0,92.0,91.0,96.0,62.0,95.0,...,15.071760,7.084039,,,,,,,,left_only
5,"Donovan, Brendan",680977,2023,81.0,92.0,64.0,37.0,90.0,17.0,25.0,...,10.950574,8.188444,,,,,,,,left_only
12,"DeJong, Paul",657557,2023,5.0,9.0,12.0,25.0,2.0,19.0,32.0,...,3.943391,6.065749,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,"Simpson, Chandler",802415,2025,16.0,98.0,6.0,1.0,62.0,1.0,1.0,...,8.174175,3.894835,,,,,,,,left_only
722,"Triolo, Jared",669707,2025,32.0,27.0,27.0,28.0,50.0,15.0,25.0,...,12.115955,4.471382,,,,,,,,left_only
725,"Dubón, Mauricio",643289,2025,5.0,46.0,3.0,4.0,17.0,1.0,2.0,...,9.206300,4.866868,,,,,,,,left_only
729,"Stephenson, Tyler",663886,2025,34.0,6.0,44.0,65.0,30.0,37.0,90.0,...,8.248199,5.052446,,,,,,,,left_only


Looks like it's a mix of players. Let's check the position counts to see whether a specific position is missing values for this dataset

In [65]:
na['primary_pos_formatted'].value_counts()

primary_pos_formatted
CF    21
C     19
2B    18
SS    14
RF    12
3B    10
DH    10
1B    10
LF     7
Name: count, dtype: int64

It's not entirely clear what is happening with the values here. It could be due to poor technology for specific teams or errors in the technology itself. There are star players with NA values, such as Mike Trout, in addition to lesser-known players such as Romy Gonzalez, so there is no obvious pattern to the NA values. Therefore, since 121 of the 760 rows (about 16%) is not a large share of the dataset, we are going to fill in the missing values with the median value for each year. Swing tendencies change from year to year, with several players now adjusting their swing path to hit more home runs, so a yearly median will help capture these trends

In [66]:
cols_to_impute = [
    "attack_angle",
    "attack_direction",
    "swing_tilt",
    "avg_bat_speed"
]

for col in cols_to_impute:
    df_frv_hit_oaa_batted_zone_path[col] = (
        df_frv_hit_oaa_batted_zone_path
        .groupby(["year"])[col]
        .transform(lambda x: x.fillna(x.median()))
    )
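The per-year median fill can be illustrated on a small toy table. The values below are hypothetical, and the was_imputed flag is an optional addition (not part of the cell above) that records which rows were filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing bat speed in each season
toy = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024, 2024, 2024],
    "avg_bat_speed": [70.0, 74.0, np.nan, 71.0, np.nan, 75.0],
})

# Optional: remember which rows were missing before filling them
toy["was_imputed"] = toy["avg_bat_speed"].isna()

# Fill each missing value with the median of its own season
toy["avg_bat_speed"] = (
    toy.groupby("year")["avg_bat_speed"]
       .transform(lambda s: s.fillna(s.median()))
)
print(toy)
```

The 2023 gap is filled with 72.0 and the 2024 gap with 73.0, so each season keeps its own center. One consequence worth keeping in mind is that all imputed players in a given year end up with identical swing values.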

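Great, let's check to ensure there are no more NA values in the rows where data did not transfer over. Since the na dataframe was created before the imputation, we select the same rows from the updated dataframe

In [67]:
# na is a copy made before the imputation, so look up the same rows in the updated dataframe
df_frv_hit_oaa_batted_zone_path.loc[na.index].isna().sum()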

player_name                0
player_id                  0
year                       0
xwoba                      0
xba                        0
xslg                       0
xiso                       0
xobp                       0
brl                        0
brl_percent                0
exit_velocity              0
max_ev                     0
hard_hit_percent           0
k_percent                  0
bb_percent                 0
whiff_percent              0
chase_percent              0
arm_strength_hit          24
sprint_speed               0
primary_pos_formatted      0
total_runs                 9
gb_rate                    0
air_rate                   0
fb_rate                    0
ld_rate                    0
pu_rate                    0
pull_rate                  0
straight_rate              0
oppo_rate                  0
runs_heart                 0
runs_shadow                0
runs_chase                 0
runs_waste                 0
id                       121
name          

Great, it looks like this worked, but we still have several NA values for the side column. These will need to be filled in manually, since a batting side cannot be imputed with a median or mean. We will deal with this later in the process; for now we will drop the leftover columns from the merge

In [68]:
df_frv_hit_oaa_batted_zone_path = df_frv_hit_oaa_batted_zone_path.drop(columns = ['_merge', 'name', 'id'])

Great, let's move on to the catcher dataframe, which contains data specific to catcher defense, such as catcher stealing runs, a run-value measure of the extra caught stealings a catcher records compared to the expectation for a league-average catcher. In addition, there is an arm strength column which gives the average speed of a catcher's throw from home to second in miles per hour

In [69]:
df_catcher

Unnamed: 0,player_id,player_name,team_name,start_year,end_year,sb_attempts,catcher_stealing_runs,caught_stealing_above_average,n_cs,rate_cs,...,pop_time,exchange_time,arm_strength,n_xcs_with_flight_over_xcs,n_xcs_with_exchange_over_xcs,n_xcs_with_accuracy_over_xcs,n_xcs_with_ground_other_over_xcs,n_xcs_with_onfly_other_over_xcs,n_xcs_with_untracked_other_over_xcs,year
0,455117,"Maldonado, Martín",HOU,2023,2023,85,1.391864,2.141329,12,0.141176,...,1.907291,0.662818,82.045695,2.608737,0.020640,0.421121,0.460068,-1.300118,-0.069120,2023
1,518595,"d'Arnaud, Travis",ATL,2023,2023,50,-2.954708,-4.545704,4,0.080000,...,1.958026,0.670359,79.237302,-0.050713,-0.646161,-1.129784,0.164626,-2.883674,0.000000,2023
2,518735,"Grandal, Yasmani",CWS,2023,2023,94,-5.832308,-8.972782,14,0.148936,...,2.032824,0.684243,74.059276,-9.039378,-3.947226,-0.174694,-0.287031,4.475547,0.000000,2023
3,521692,"Perez, Salvador",KC,2023,2023,47,-1.230398,-1.892920,5,0.106383,...,1.875000,0.628516,76.748297,-1.565024,1.328547,0.054263,-0.102824,-1.593861,-0.014022,2023
4,542194,"Bethancourt, Christian",TB,2023,2023,52,3.569891,5.492140,13,0.250000,...,1.833143,0.670024,85.130685,4.680717,-0.713384,2.210408,0.464124,-1.149725,0.000000,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,690924,"Fulford, Braxton",COL,2025,2025,24,0.115731,0.178047,6,0.250000,...,1.887368,0.734368,80.723206,0.627236,-1.821717,-1.484049,1.606202,1.250375,0.000000,2025
59,691019,"Teel, Kyle",CWS,2025,2025,43,-1.395105,-2.146315,7,0.162791,...,1.919559,0.690059,80.719943,1.964464,-1.419498,1.341067,-0.077551,-3.954797,0.000000,2025
60,693307,"Dingler, Dillon",DET,2025,2025,64,2.473274,3.805036,19,0.296875,...,1.922767,0.713383,85.320790,8.445639,-3.885599,-0.499964,0.223888,-0.369169,-0.109760,2025
61,696100,"Goodman, Hunter",COL,2025,2025,62,-0.723858,-1.113628,11,0.177419,...,1.909535,0.659163,79.796120,0.739331,-0.705639,-2.589737,1.650859,0.457645,-0.666087,2025


To be able to compare catchers to other positions such as outfield, we will only use the arm strength value here, since the rest of these statistics only apply to catchers. Let's filter the dataframe down to the player ID, name, year and arm strength

In [70]:
df_catcher = df_catcher[['player_id', 'player_name', 'year','arm_strength']]

Let's see how many NA values there are in this dataset

In [71]:
df_catcher.isna().sum()

player_id       0
player_name     0
year            0
arm_strength    0
dtype: int64

We have none, so let's merge this in with the compiled dataframe so far

In [72]:
df_frv_hit_oaa_batted_zone_path_catcher = df_frv_hit_oaa_batted_zone_path.merge(df_catcher, left_on = ['player_id', 'year'],
                                                                                right_on = ['player_id', 'year'], how = 'left',
                                                                                indicator = True)

Great, let's see how many of the rows did not transfer over

In [73]:
df_frv_hit_oaa_batted_zone_path_catcher['_merge'].value_counts()

_merge
left_only     681
both           79
right_only      0
Name: count, dtype: int64

In this case it is more useful to look at the rows that did merge, to confirm that they are the catchers in the dataset. Let's check this by creating a separate dataframe for the rows that merged

In [74]:
df_na_catch = df_frv_hit_oaa_batted_zone_path_catcher[df_frv_hit_oaa_batted_zone_path_catcher['_merge'] == 'both']

Let's look at the positions to ensure that we do indeed only have catchers

In [75]:
df_na_catch['primary_pos_formatted'].value_counts()

primary_pos_formatted
C     76
DH     3
Name: count, dtype: int64

Interesting, so we have 3 DHs in here, but this is not surprising, as catchers often move to DH later in their careers or when they get hurt. Let's take a look at who these 3 are

In [76]:
df_na_catch[df_na_catch['primary_pos_formatted']=='DH']

Unnamed: 0,player_name_x,player_id,year,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,...,runs_chase,runs_waste,side,avg_bat_speed,swing_tilt,attack_angle,attack_direction,player_name_y,arm_strength,_merge
178,"Diaz, Yainer",673237,2023,84.0,90.0,96.0,93.0,30.0,66.0,79.0,...,1.210958,5.718457,R,73.681408,30.3367,11.094366,3.992933,"Diaz, Yainer",85.018561,both
377,"Contreras, Willson",575929,2024,93.0,71.0,75.0,75.0,96.0,39.0,76.0,...,7.526626,8.155944,,71.773074,32.499119,9.939621,-1.685956,"Contreras, Willson",81.138686,both
486,"Garver, Mitch",641598,2024,10.0,1.0,7.0,44.0,16.0,30.0,55.0,...,15.501316,7.653187,,71.773074,32.499119,9.939621,-1.685956,"Garver, Mitch",77.468085,both


Yep, looking at the statistics for these 3 players, I can see all of them caught during these seasons, but most of their appearances came at DH. They played enough innings at catcher to have catching statistics, so we will leave those stats in, but we will still consider them DHs for these years since they had more appearances there. Let's drop the merge columns and rename the catcher arm strength column so it does not get mixed up with the fielder arm strength

In [77]:
df_frv_hit_oaa_batted_zone_path_catcher = df_frv_hit_oaa_batted_zone_path_catcher.drop(columns = ['player_name_y', '_merge'])
df_frv_hit_oaa_batted_zone_path_catcher = df_frv_hit_oaa_batted_zone_path_catcher.rename(columns = {'player_name_x':'player_name', 'arm_strength':'arm_strength_c'})
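The player_name_x and player_name_y columns appear because both tables carry a player_name column that is not part of the merge key. The sketch below reproduces this on hypothetical toy data and shows one alternative, trimming and renaming the right-hand table before merging, which avoids the cleanup step afterwards.

```python
import pandas as pd

# Hypothetical toy frames that share a non-key column (player_name)
left = pd.DataFrame(
    {"player_id": [1, 2], "year": [2023, 2023], "player_name": ["A", "B"]}
)
right = pd.DataFrame(
    {"player_id": [1], "year": [2023], "player_name": ["A"], "arm_strength": [82.0]}
)

# Default behavior: overlapping non-key columns get _x and _y suffixes
default = left.merge(right, on=["player_id", "year"], how="left")
print(default.columns.tolist())

# Alternative: drop the duplicate column and rename before merging
right_slim = right.drop(columns="player_name").rename(
    columns={"arm_strength": "arm_strength_c"}
)
slim = left.merge(right_slim, on=["player_id", "year"], how="left")
print(slim.columns.tolist())
```

Either route gives the same data; the second simply keeps the column names clean from the start.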

Let's see what NA values remain that we will need to fill in

In [78]:
df_frv_hit_oaa_batted_zone_path_catcher.isna().sum()

player_name                0
player_id                  0
year                       0
xwoba                      0
xba                        0
xslg                       0
xiso                       0
xobp                       0
brl                        0
brl_percent                0
exit_velocity              0
max_ev                     0
hard_hit_percent           0
k_percent                  0
bb_percent                 0
whiff_percent              0
chase_percent              0
arm_strength_hit         104
sprint_speed               0
primary_pos_formatted      0
total_runs                59
gb_rate                    0
air_rate                   0
fb_rate                    0
ld_rate                    0
pu_rate                    0
pull_rate                  0
straight_rate              0
oppo_rate                  0
runs_heart                 0
runs_shadow                0
runs_chase                 0
runs_waste                 0
side                     121
avg_bat_speed 

Let's now take a look at the total_runs column. I am thinking that the rows with missing values are all DHs, and we will likely give them a lower run value since DHs are not typically known for their defensive prowess

In [79]:
DH = df_frv_hit_oaa_batted_zone_path_catcher[df_frv_hit_oaa_batted_zone_path_catcher['total_runs'].isna()]

In [80]:
DH['primary_pos_formatted'].value_counts()

primary_pos_formatted
DH    59
Name: count, dtype: int64

As we had originally thought, these players are all DHs, so we will go ahead and fill these run values in with zero. Although it is likely OK to infer these players are below-average fielders since they play DH, assuming they are all below average would likely skew the player similarity. It is a fairer comparison to give them 0 total runs, which is also their actual value for this statistic since they do not play the field

In [81]:
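# Set the fielding run value to 0 for DHs. Note that this mask selects all 62 DH rows,
# so the 3 DH rows that already had a total_runs value are set to 0 as well
mask = df_frv_hit_oaa_batted_zone_path_catcher["primary_pos_formatted"] == "DH"
df_frv_hit_oaa_batted_zone_path_catcher.loc[mask, "total_runs"] = 0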
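The cell above zeroes total_runs for every DH row. If the intent were instead to fill only the missing values, a variant could look like the sketch below. This is an assumption about one possible alternative, shown on hypothetical toy data, not the approach the notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one DH with a missing value, one DH with a recorded value
toy = pd.DataFrame({
    "primary_pos_formatted": ["DH", "DH", "SS"],
    "total_runs": [np.nan, -1.5, 2.0],
})

missing = toy["total_runs"].isna()

# Confirm the hypothesis first: every missing value belongs to a DH
assert (toy.loc[missing, "primary_pos_formatted"] == "DH").all()

# Fill only the DH rows that are actually missing a value
mask = (toy["primary_pos_formatted"] == "DH") & missing
toy.loc[mask, "total_runs"] = 0.0
print(toy)
```

Which version is preferable depends on whether a DH's limited innings in the field should count toward the similarity score.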

Great, let's check to see how many NA values we have left

In [82]:
df_frv_hit_oaa_batted_zone_path_catcher.isna().sum()

player_name                0
player_id                  0
year                       0
xwoba                      0
xba                        0
xslg                       0
xiso                       0
xobp                       0
brl                        0
brl_percent                0
exit_velocity              0
max_ev                     0
hard_hit_percent           0
k_percent                  0
bb_percent                 0
whiff_percent              0
chase_percent              0
arm_strength_hit         104
sprint_speed               0
primary_pos_formatted      0
total_runs                 0
gb_rate                    0
air_rate                   0
fb_rate                    0
ld_rate                    0
pu_rate                    0
pull_rate                  0
straight_rate              0
oppo_rate                  0
runs_heart                 0
runs_shadow                0
runs_chase                 0
runs_waste                 0
side                     121
avg_bat_speed 

OK, we will need to manually fill in the side here, since nothing in the data tells us which side of the plate these players hit from. Let's see which players are affected and fill them in manually with a dictionary

In [83]:
no_side = df_frv_hit_oaa_batted_zone_path_catcher[df_frv_hit_oaa_batted_zone_path_catcher['side'].isna()]

In [84]:
no_side['player_name'].unique()

array(['Andrus, Elvis', 'Urías, Ramón', 'Trout, Mike', 'Donovan, Brendan',
       'DeJong, Paul', 'Duran, Jarren', 'Heyward, Jason',
       'Bader, Harrison', 'Kemp, Tony', 'Taylor, Michael A.',
       'Frazier, Adam', 'Ruiz, Esteury', 'Wiemer, Joey',
       'Vázquez, Christian', 'Buxton, Byron', 'India, Jonathan',
       'Ward, Taylor', 'Gorman, Nolan', 'Siri, Jose', 'Stephenson, Tyler',
       'McLain, Matt', 'Burleson, Alec', 'Tellez, Rowdy',
       'Yastrzemski, Mike', 'Baty, Brett', 'Profar, Jurickson',
       'Díaz, Aledmys', 'Kiner-Falefa, Isiah', 'Rojas, Josh',
       'Pratto, Nick', 'Rizzo, Anthony', 'Call, Alex', 'Laureano, Ramón',
       'Conforto, Michael', 'Farmer, Kyle', 'McCutchen, Andrew',
       'Fraley, Jake', 'Olivares, Edward', 'Berti, Jon',
       'Contreras, Willson', 'Murphy, Sean', 'Grossman, Robbie',
       'Meyers, Jake', 'Peterson, Jace', 'Marte, Starling',
       'Sheets, Gavin', 'Sabol, Blake', 'Bae, Ji Hwan', 'Franco, Wander',
       'Cronenworth, Jake', '

In [85]:
bat_side_dict = {'Andrus, Elvis':'R', 'Urías, Ramón':'R', 'Trout, Mike':'R', 'Donovan, Brendan':'L',
       'DeJong, Paul':'R', 'Duran, Jarren':'L', 'Heyward, Jason':'L',
       'Bader, Harrison':'R', 'Kemp, Tony':'L', 'Taylor, Michael A.':'R',
       'Frazier, Adam':'L', 'Ruiz, Esteury':'R', 'Wiemer, Joey':'R',
       'Vázquez, Christian':'R', 'Buxton, Byron':'R', 'India, Jonathan':'R',
       'Ward, Taylor':'R', 'Gorman, Nolan':'L', 'Siri, Jose':'R', 'Stephenson, Tyler':'R',
       'McLain, Matt':'R', 'Burleson, Alec':'L', 'Tellez, Rowdy':'L',
       'Yastrzemski, Mike':'L', 'Baty, Brett':'L', 'Profar, Jurickson':'S',
       'Díaz, Aledmys':'R', 'Kiner-Falefa, Isiah':'R', 'Rojas, Josh':'L',
       'Pratto, Nick':'L', 'Rizzo, Anthony':'L', 'Call, Alex':'R', 'Laureano, Ramón':'R',
       'Conforto, Michael':'L', 'Farmer, Kyle':'R', 'McCutchen, Andrew':'R',
       'Fraley, Jake':'L', 'Olivares, Edward':'R', 'Berti, Jon':'R',
       'Contreras, Willson':'R', 'Murphy, Sean':'R', 'Grossman, Robbie':'S',
       'Meyers, Jake':'R', 'Peterson, Jace':'L', 'Marte, Starling':'R',
       'Sheets, Gavin':'L', 'Sabol, Blake':'L', 'Bae, Ji Hwan':'L', 'Franco, Wander':'S',
       'Cronenworth, Jake':'L', 'Mateo, Jorge':'R', 'Grandal, Yasmani':'S',
       'Anderson, Brian':'R', 'Kirk, Alejandro':'R', 'Walls, Taylor':'S',
       'Kelenic, Jarred':'L', 'Blackmon, Charlie':'L', 'García Jr., Luis':'L',
       'Fermin, Freddy':'R', 'Cave, Jake':'L', 'Friedl, TJ':'L', 'Brennan, Will':'L',
       'Margot, Manuel':'R', 'Estrada, Thairo':'R', 'Díaz, Elias':'R',
       'Fitzgerald, Tyler':'R', 'Nootbaar, Lars':'L', 'Wade Jr., LaMonte':'L',
       'Moreno, Gabriel':'R', 'Kim, Ha-Seong':'R', 'Rosario, Amed':'R',
       'Correa, Carlos':'R', 'Rojas, Johan':'R', 'Drury, Brandon':'R', 'Joe, Connor':'R',
       'Espinal, Santiago':'R', 'Alvarez, Francisco':'R', 'Perdomo, Geraldo':'S',
       'Toro, Abraham':'S', 'Yoshida, Masataka':'L', 'Amaya, Miguel':'R',
       'Larnach, Trevor':'L', 'Jiménez, Eloy':'R', 'Singleton, Jon':'L',
       'Freeman, Tyler':'R', "Hayes, Ke'Bryan":'R', "d'Arnaud, Travis":'R',
       'Garver, Mitch':'R', 'Taylor, Tyrone':'R', 'Tauchman, Mike':'L',
       'Bauers, Jake':'L', 'Lile, Daylen':'L', 'Bleday, JJ':'L', 'Isbel, Kyle':'L',
       'Young, Jacob':'R', 'Iglesias, Jose':'R', 'Hicks, Liam':'L', 'Schmitt, Casey':'R',
       'Andujar, Miguel':'R', 'Sanoja, Javier':'R', 'Caballero, José':'R',
       'Westburg, Jordan':'R', 'Jeffers, Ryan':'R', 'Caratini, Victor':'S',
       'Rutschman, Adley':'S', 'Allen, Nick':'R', 'Marte, Noelvi':'R',
       'Cowser, Colton':'L', 'Simpson, Chandler':'L', 'Triolo, Jared':'R',
       'Dubón, Mauricio':'R', 'Gonzalez, Romy':'R', 'Albies, Ozzie':'S',
       'Bailey, Patrick':'S', 'Bell, Josh':'S', 'Castro, Willi':'S',
       'Collins, Isaac':'S', 'De La Cruz, Elly':'S', 'Domínguez, Jasson':'S',
       'Edman, Tommy':'S', 'Edwards, Xavier':'S', 'Happ, Ian':'S', 'Heim, Jonah':'S',
       'Lee, Brooks':'S', 'Lindor, Francisco':'S', 'Mangum, Jake':'S', 'Marte, Ketel':'S',
       'Martínez, Angel':'S', 'Polanco, Jorge':'S',
       'Pérez, Wenceel':'S', 'Quero, Edgar':'S', 'Raleigh, Cal':'S', 'Ramírez, José':'S',
       'Rengifo, Luis':'S', 'Reynolds, Bryan':'S', 'Rocchio, Brayan':'S', 'Santana, Carlos':'S',
       'Toglia, Michael':'S', 'Santander, Anthony':'S',
       'Taveras, Leodys':'S', 'Perkins, Blake':'S', 'Candelario, Jeimer':'S', 'Ruiz, Keibert':'S',
       'Moncada, Yoán':'S', 'Waters, Drew':'S', 'Rodríguez, Endy':'S'}

# Dictionary values take priority; players not in the dictionary keep their existing side
df_frv_hit_oaa_batted_zone_path_catcher["side"] = df_frv_hit_oaa_batted_zone_path_catcher["player_name"].map(bat_side_dict).fillna(df_frv_hit_oaa_batted_zone_path_catcher["side"])
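The map and fillna combination above can be easier to follow on a small example. The sketch below uses hypothetical players and a hypothetical lookup dictionary to show the two effects: missing sides get filled, and existing sides are overwritten when the player appears in the dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: A has no side, B and C already have one
toy = pd.DataFrame({"player_name": ["A", "B", "C"], "side": [np.nan, "L", "R"]})

# Hypothetical lookup: A is filled in, C is overridden (for example a switch hitter)
lookup = {"A": "R", "C": "S"}

# map() returns NaN for names not in the lookup; fillna() then restores the existing side
toy["side"] = toy["player_name"].map(lookup).fillna(toy["side"])
print(toy)
```

Player B is not in the lookup and keeps "L", while player C changes from "R" to "S". Because the lookup is keyed on the name string, two different players sharing a name would receive the same side.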

Let's check the split of batter sides

In [86]:
df_frv_hit_oaa_batted_zone_path_catcher['side'].value_counts()

side
R    403
L    276
S     81
Name: count, dtype: int64

One final item to take care of is to ensure that the arm strength of the catchers and the arm strength of the fielders are on a comparable scale, since the catchers' arm strength is measured in miles per hour while the fielders' is a percentile ranking. To merge these two values into a single comparable column, we first group the arm strength values by role (either catcher or fielder) and then take the z-score of every player within their group, which tells us whether the player is on the higher or lower end of the scale relative to their role. This gives us a value we can compare, as a catcher with a great arm and a fielder with a great arm will both have high z-scores relative to their groups. Since most DHs do not have an arm strength value, we set their z-score to 0, which represents the group average

In [87]:
# catcher vs fielder
df_frv_hit_oaa_batted_zone_path_catcher["def_group"] = np.where(
    df_frv_hit_oaa_batted_zone_path_catcher["primary_pos_formatted"] == "C",
    "C",
    "F"
)

# unified raw arm strength
df_frv_hit_oaa_batted_zone_path_catcher["arm_strength_raw"] = np.where(
    df_frv_hit_oaa_batted_zone_path_catcher["def_group"] == "C",
    df_frv_hit_oaa_batted_zone_path_catcher["arm_strength_c"],
    df_frv_hit_oaa_batted_zone_path_catcher["arm_strength_hit"]
)

# z-score within role
def zscore(s):
    m = s.mean()
    sd = s.std(ddof=0)
    if sd == 0 or np.isnan(sd):
        return s * 0.0
    return (s - m) / sd

df_frv_hit_oaa_batted_zone_path_catcher["arm_strength_z"] = (
    df_frv_hit_oaa_batted_zone_path_catcher
      .groupby("def_group")["arm_strength_raw"]
      .transform(zscore)
)

# neutralize DHs
mask_dh_missing_arm = (
    (df_frv_hit_oaa_batted_zone_path_catcher["primary_pos_formatted"] == "DH") &
    (df_frv_hit_oaa_batted_zone_path_catcher["arm_strength_hit"].isna())
)

df_frv_hit_oaa_batted_zone_path_catcher.loc[mask_dh_missing_arm, "arm_strength_z"] = 0.0
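
To make the effect of the within-group z-score concrete, here is a minimal sketch on made-up numbers. The `toy` frame and its values are hypothetical and not taken from the Savant data. It suggests how a catcher measured in mph and a fielder measured as a percentile land on the same scale once each is standardized against their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: catchers in mph, fielders as percentiles
toy = pd.DataFrame({
    "def_group": ["C", "C", "C", "F", "F", "F"],
    "arm_strength_raw": [78.0, 82.0, 86.0, 20.0, 50.0, 80.0],
})

def zscore(s: pd.Series) -> pd.Series:
    """Population z-score (ddof=0); all zeros when the group has no spread."""
    sd = s.std(ddof=0)
    if sd == 0 or np.isnan(sd):
        return s * 0.0
    return (s - s.mean()) / sd

# Standardize each player against their own group only
toy["arm_strength_z"] = (
    toy.groupby("def_group")["arm_strength_raw"].transform(zscore)
)
print(toy)
```

In this toy case the strongest catcher (86 mph) and the strongest fielder (80th percentile) receive the same z-score, even though their raw values are in different units.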

Let's rename our final dataframe to just df for simplicity.

In [88]:
df = df_frv_hit_oaa_batted_zone_path_catcher.copy()

In [89]:
df.columns

Index(['player_name', 'player_id', 'year', 'xwoba', 'xba', 'xslg', 'xiso',
       'xobp', 'brl', 'brl_percent', 'exit_velocity', 'max_ev',
       'hard_hit_percent', 'k_percent', 'bb_percent', 'whiff_percent',
       'chase_percent', 'arm_strength_hit', 'sprint_speed',
       'primary_pos_formatted', 'total_runs', 'gb_rate', 'air_rate', 'fb_rate',
       'ld_rate', 'pu_rate', 'pull_rate', 'straight_rate', 'oppo_rate',
       'runs_heart', 'runs_shadow', 'runs_chase', 'runs_waste', 'side',
       'avg_bat_speed', 'swing_tilt', 'attack_angle', 'attack_direction',
       'arm_strength_c', 'def_group', 'arm_strength_raw', 'arm_strength_z'],
      dtype='object')

Let's drop some columns we no longer need.

In [90]:
df = df.drop(columns = ['def_group', 'arm_strength_raw', 'arm_strength_hit', 'arm_strength_c'])

Finally, let's do one last check of the NA values in the dataframe.

In [91]:
df.isna().sum()

player_name              0
player_id                0
year                     0
xwoba                    0
xba                      0
xslg                     0
xiso                     0
xobp                     0
brl                      0
brl_percent              0
exit_velocity            0
max_ev                   0
hard_hit_percent         0
k_percent                0
bb_percent               0
whiff_percent            0
chase_percent            0
sprint_speed             0
primary_pos_formatted    0
total_runs               0
gb_rate                  0
air_rate                 0
fb_rate                  0
ld_rate                  0
pu_rate                  0
pull_rate                0
straight_rate            0
oppo_rate                0
runs_heart               0
runs_shadow              0
runs_chase               0
runs_waste               0
side                     0
avg_bat_speed            0
swing_tilt               0
attack_angle             0
attack_direction         0
a

The code below introduces a recency weighting for each player's statistics. It takes a weighted average in which more recent years matter more to the similarity function than earlier years (0.6 for 2025, 0.3 for 2024 and 0.1 for 2023). Recent performance can signal a player turning a corner in their career, especially for players like Cal Raleigh who may not have been as strong in 2023. Each metric in each season is multiplied by that season's weight, and the products are summed across seasons to give one weighted value per player and statistic. If a player is not present in all 3 years, the weights are renormalized so that they sum to 1 over the one or two seasons in which the player appears. We then group by player ID to get a single row per player. The position the player most recently occupied is treated as their position, and the same applies to batting side. Categorical variables are not included in the weighting since there is no meaningful way to average them. The final output is a single row for each player along with the weighted average of each statistic across the 3 years in the dataset.

In [92]:
weights = {2025: 0.6, 2024: 0.3, 2023: 0.1}
df["w"] = df["year"].map(weights)

# keep only years we want
df = df[df["w"].notna()].copy()

# renormalize weights per player
df["w_norm"] = df["w"] / df.groupby("player_id")["w"].transform("sum")

metrics = ['xwoba', 'xba', 'xslg', 'xiso',
       'xobp', 'brl', 'brl_percent', 'exit_velocity', 'max_ev',
       'hard_hit_percent', 'k_percent', 'bb_percent', 'whiff_percent',
       'chase_percent', 'sprint_speed','total_runs', 'gb_rate','air_rate',
       'fb_rate','ld_rate', 'pu_rate', 'pull_rate', 'straight_rate', 'oppo_rate',
       'runs_heart', 'runs_shadow', 'runs_chase', 'runs_waste',
       'avg_bat_speed', 'swing_tilt', 'attack_angle', 'attack_direction',
       'arm_strength_z']

# weighted values
for m in metrics:
    df[m] = df[m] * df["w_norm"]

# make sure last() means most recent year for side/position
df = df.sort_values(["player_id", "year"])

agg = (
    df.groupby("player_id", as_index=False)
      .agg(
          player_name=("player_name", "first"),
          seasons=("year", "nunique"),
          side=("side", "last"),
          position=("primary_pos_formatted", "last"),
          **{m: (m, lambda s: s.sum(min_count=1)) for m in metrics}
      )
)
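
As a sanity check on the renormalization, the following sketch applies the same 0.6/0.3/0.1 weights to a hypothetical two-player frame (the player IDs and xwoba values are made up). Player 2 has no 2023 row, so the remaining weights become 0.3/0.9 and 0.6/0.9 and still sum to 1.

```python
import pandas as pd

weights = {2025: 0.6, 2024: 0.3, 2023: 0.1}

# Hypothetical data: player 1 has all three seasons, player 2 only two
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "year": [2023, 2024, 2025, 2024, 2025],
    "xwoba": [50.0, 60.0, 70.0, 30.0, 60.0],
})

toy["w"] = toy["year"].map(weights)
# Renormalize so each player's weights sum to 1 over the seasons present
toy["w_norm"] = toy["w"] / toy.groupby("player_id")["w"].transform("sum")

# Weighted average per player
weighted = (toy["xwoba"] * toy["w_norm"]).groupby(toy["player_id"]).sum()
print(weighted)
```

Here player 1 comes out at 50(0.1) + 60(0.3) + 70(0.6) = 65, and player 2 at 30(1/3) + 60(2/3) = 50.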

Let's view the compiled dataframe to see if everything looks correct.

In [93]:
agg

,player_id,player_name,seasons,side,position,xwoba,xba,xslg,xiso,xobp,...,oppo_rate,runs_heart,runs_shadow,runs_chase,runs_waste,avg_bat_speed,swing_tilt,attack_angle,attack_direction,arm_strength_z
0,408234,"Cabrera, Miguel",1,R,DH,18.00,37.0,15.0,12.00,34.00,...,0.254753,-11.267340,-11.378386,10.545004,4.595481,68.187690,32.443072,6.239414,5.291925,0.000000
1,444482,"Peralta, David",1,L,LF,29.00,84.0,37.0,19.00,32.00,...,0.259146,-14.173137,-5.588909,5.526147,6.032973,72.132424,31.814685,5.161792,-1.703671,-0.578106
2,453568,"Blackmon, Charlie",2,L,DH,43.75,58.0,28.5,20.25,64.75,...,0.206856,-11.494330,-0.827476,14.000658,7.777908,68.571365,31.521866,9.635325,-4.700856,0.846499
3,455117,"Maldonado, Martín",1,R,C,1.00,1.0,13.0,47.00,1.00,...,0.234043,-5.327882,-23.071604,6.061845,4.536643,71.736160,31.713440,16.433472,-4.188886,0.859357
4,456781,"Solano, Donovan",1,R,1B,79.00,89.0,50.0,32.00,90.00,...,0.223729,3.631072,-18.280239,9.574619,8.991467,68.095668,36.985999,8.634592,-0.336964,-1.196007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,805367,"Meidroth, Chase",1,R,SS,21.00,57.0,4.0,3.00,64.00,...,0.282723,-23.767765,-12.808674,17.718882,9.899262,67.434461,29.379809,6.673115,2.072339,-0.921384
380,805779,"Wilson, Jacob",1,R,SS,26.00,90.0,17.0,5.00,53.00,...,0.251111,-1.935684,-1.022065,7.335145,8.355824,63.878761,35.131265,2.826548,-1.020782,1.035302
381,807713,"Shaw, Matt",1,R,3B,24.00,24.0,24.0,24.00,32.00,...,0.267327,-9.528795,-3.065717,7.403877,4.578656,69.579360,26.773627,11.064701,-2.014000,-0.578106
382,807799,"Yoshida, Masataka",2,L,DH,56.75,73.5,49.5,37.00,65.75,...,0.277196,-1.850419,-10.440524,12.366297,7.816875,71.673850,32.131734,8.934647,-1.144793,0.018531


Great, it looks like we have 384 unique hitters in the dataset. Let's change the name to a more readable "firstname lastname" format using common pandas string methods.

In [94]:
agg['player_name_lst'] = agg['player_name'].str.split(',')

In [95]:
agg['player_name_final'] = agg['player_name_lst'].apply(lambda x:x[1] + " " + x[0])
agg = agg.drop(columns = ['player_name_lst', 'player_name'])

Let's set the index to the player name to make it easier to view different parts of the data.

In [96]:
agg['player_name_final'] = agg['player_name_final'].str.strip()
agg = agg.set_index('player_name_final')
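
For reference, the same "lastname, firstname" to "firstname lastname" conversion can be sketched in one step on a standalone Series. The `names` Series below is a hypothetical stand-in for `agg['player_name']`. Splitting only on the first comma and stripping each part avoids the need for a separate strip afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the player_name column
names = pd.Series(["Cabrera, Miguel", "De La Cruz, Elly", "Ramírez, José"])

# Split once on the first comma, strip whitespace, and swap the two parts
parts = names.str.split(",", n=1, expand=True)
readable = parts[1].str.strip() + " " + parts[0].str.strip()
print(readable.tolist())
```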

That looks good from a general view. Let's now scale the variables in the dataset so that each one is on a level playing field and no single variable dominates the overall cosine similarity score. We do so with scikit-learn's StandardScaler, which converts each variable to a z-score (mean 0, standard deviation 1).

In [97]:
X = agg[metrics].copy()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled = pd.DataFrame(X_scaled, columns=metrics, index=agg.index)
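
To illustrate what this scaling does, here is a small sketch that computes column-wise z-scores by hand with pandas, using the population standard deviation (ddof=0), which is the convention StandardScaler follows. The `toy` frame, its player labels and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical weighted metrics on very different scales
toy = pd.DataFrame(
    {"avg_bat_speed": [68.0, 72.0, 76.0], "oppo_rate": [0.20, 0.25, 0.30]},
    index=["Player A", "Player B", "Player C"],
)

# Column-wise z-score with the population standard deviation (ddof=0)
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_scaled.round(3))
```

After scaling, both columns have mean 0 and standard deviation 1, so bat speed (tens of mph) no longer outweighs opposite-field rate (a fraction) in a distance or similarity calculation.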

In [98]:
X_scaled

player_name_final,xwoba,xba,xslg,xiso,xobp,brl,brl_percent,exit_velocity,max_ev,hard_hit_percent,...,oppo_rate,runs_heart,runs_shadow,runs_chase,runs_waste,avg_bat_speed,swing_tilt,attack_angle,attack_direction,arm_strength_z
Miguel Cabrera,-0.973740,-0.311189,-1.094740,-1.223324,-0.416479,-1.358764,-1.283686,-0.065250,0.351915,-0.644496,...,0.218874,-0.712410,0.123674,-0.245059,-1.335660,-1.516157,-0.016799,-1.296730,1.892059,0.008080
David Peralta,-0.571754,1.434285,-0.298572,-0.973320,-0.489701,-0.992821,-1.071572,0.151833,0.351915,0.776756,...,0.333104,-1.019815,0.996192,-1.093824,-0.768569,0.129732,-0.205084,-1.659352,-0.039528,-0.582027
Charlie Blackmon,-0.032727,0.468704,-0.606182,-0.928676,0.709316,-0.736660,-0.939000,-1.530557,-1.477575,-1.437118,...,-1.026435,-0.736423,1.713776,0.339344,-0.080191,-1.356074,-0.292823,-0.153996,-0.867096,0.872152
Martín Maldonado,-1.594991,-1.648147,-1.167119,0.026699,-1.624649,-0.663471,0.130413,-0.499415,-0.745779,-0.061418,...,-0.319586,-0.084073,-1.638582,-1.003230,-1.358872,-0.035604,-0.235421,2.133599,-0.725733,0.885278
Donovan Solano,1.255455,1.619973,0.171891,-0.509025,1.633749,-0.809849,-0.718047,0.803080,-0.471355,0.667429,...,-0.587739,0.863698,-0.916487,-0.409166,0.398558,-1.554552,1.344412,-0.490746,0.337839,-1.212755
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Chase Meidroth,-0.864107,0.431566,-1.492824,-1.544759,0.681858,-1.468547,-1.601859,-0.825039,-1.934947,-0.972477,...,0.946075,-2.034833,-0.091881,0.968152,0.756684,-1.830432,-0.934654,-1.150788,1.003084,-0.932431
Jacob Wilson,-0.681386,1.657111,-1.022361,-1.473329,0.279134,-1.358764,-1.460449,-1.621008,-1.340363,-1.628440,...,0.124191,0.274789,1.684450,-0.787895,0.147797,-3.314000,0.788673,-2.445169,0.149027,1.064875
Matt Shaw,-0.754475,-0.793979,-0.769035,-0.794745,-0.489701,-0.626877,-0.541284,-1.548647,-0.928728,-1.446228,...,0.545791,-0.528488,1.376456,-0.776272,-1.342298,-0.935501,-1.715552,0.326993,-0.125215,-0.582027
Masataka Yoshida,0.442347,1.044338,0.153796,-0.330450,0.745927,-0.462202,-0.762237,-0.571776,0.420521,-0.270962,...,0.802389,0.283809,0.265017,0.062949,-0.064819,-0.061603,-0.110086,-0.389776,0.114786,0.026996


In [99]:
X_scaled.columns

Index(['xwoba', 'xba', 'xslg', 'xiso', 'xobp', 'brl', 'brl_percent',
       'exit_velocity', 'max_ev', 'hard_hit_percent', 'k_percent',
       'bb_percent', 'whiff_percent', 'chase_percent', 'sprint_speed',
       'total_runs', 'gb_rate', 'air_rate', 'fb_rate', 'ld_rate', 'pu_rate',
       'pull_rate', 'straight_rate', 'oppo_rate', 'runs_heart', 'runs_shadow',
       'runs_chase', 'runs_waste', 'avg_bat_speed', 'swing_tilt',
       'attack_angle', 'attack_direction', 'arm_strength_z'],
      dtype='object')

This is our main function. It takes as input up to three player names that you want the frankenstein player to resemble. For example, you could ask for a player that hits like Aaron Judge, fields like Patrick Bailey and swings like Juan Soto. A similarity score is calculated for each category of statistics. The hit_like statistics are batted ball outcomes such as xwoba, xobp and gb rate. The field_like statistics are the defensive statistics: fielding run value and arm strength. Finally, the swing_like statistics are swing characteristics such as attack angle and attack direction.

Each skillset carries a weight that can be adjusted in the function call. The batted and offense weights are both associated with the hit_like player, the swing weight with the swing_like player and the defense weight with the field_like player. The defaults weight offense at 0.6, the batted ball profile at 0.2, swing mechanics at 0.1 and defense at 0.1. For each skillset, the function calculates the cosine similarity between the player you input and every other player in the dataset, multiplies it by the specified weight, and adds the result to each player's aggregate score.

You do not need to input a player for all 3 archetypes; the function defaults to the hit_like player when field_like or swing_like is not given. For example, if you input Aaron Judge as hit_like and nothing as field_like, the function will use Aaron Judge as the field_like player.

There is also an optional counterpart filter, which narrows the results to more budget-level players who are not superstars, so you may find someone with a profile relatively similar to the ones you input but at a much cheaper price tag. To use it, set counterpart to True, choose a stat that typically signals overall player performance (such as xwoba) to filter on, and choose a percentage. For example, if the percentage is 5% (0.05), the function keeps only the players below the 95th percentile of that statistic among the players remaining after any position or batting-side filter. The final output is a chart of the top n players most similar to the skillsets you specified. An example output is provided below, which helps illustrate the inputs and outputs described here.
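
Written out, the scoring rule described above can be summarized as follows. The notation is introduced here for illustration only and does not appear in the code: $x_p^{(G)}$ is the scaled feature vector of player $p$ restricted to feature group $G$, and $h$, $f$ and $s$ are the hit_like, field_like and swing_like prototypes.

$$
\begin{aligned}
\text{total\_score}(j) ={}& w_{\text{offense}} \cos\!\big(x_h^{(\text{OFF})}, x_j^{(\text{OFF})}\big)
 + w_{\text{batted}} \cos\!\big(x_h^{(\text{BAT})}, x_j^{(\text{BAT})}\big) \\
 &+ w_{\text{swing}} \cos\!\big(x_s^{(\text{SW})}, x_j^{(\text{SW})}\big)
 + w_{\text{defense}} \cos\!\big(x_f^{(\text{DEF})}, x_j^{(\text{DEF})}\big),
\qquad
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\end{aligned}
$$

Since each cosine lies between -1 and 1, a total score can be at most the sum of the weights, assuming the weights are non-negative (1.0 with the defaults).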

In [100]:
def frankenstein_recommend(
    hit_like,
    field_like=None,
    swing_like=None,
    agg=None,
    X_scaled=None,
    top_n=15,
    weights=None,          
    position=None,
    bats=None,
    min_shared = 2,
    counterpart=False,               
    counterpart_by="xwoba",          
    counterpart_top_pct=0.05
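):
    """
    Rank players by weighted cosine similarity to up to three prototype players
    and return an Altair bar chart of the top_n matches.

    hit_like drives the offense and batted-ball blocks; field_like and swing_like
    drive the defense and swing blocks and fall back to hit_like when None.
    """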
    OFFENSE = [
        "xwoba", "xba", "xslg", "xiso", "xobp",
        "brl_percent", "hard_hit_percent",
        "exit_velocity", "max_ev",
        "k_percent", "bb_percent",
        "whiff_percent", "chase_percent"
    ]

    BATTED_BALL = [
        "gb_rate", "air_rate", "fb_rate", "ld_rate", "pu_rate",
        "pull_rate", "straight_rate", "oppo_rate"
    ]

    SWING_PATH = [
        "attack_angle",
        "attack_direction",
        "swing_tilt",
        "avg_bat_speed"
    ]

    DEFENSE = [
        "total_runs",
        "arm_strength_z"
    ]

    
    if weights is None:
        weights = {"offense": 0.6, "batted": 0.2, "swing": 0.1, "defense": 0.1}

    def masked_cosine_to_all(a_vec, B, min_shared=2):
        sims = np.full(B.shape[0], np.nan, dtype=float)
        for i in range(B.shape[0]):
            b = B[i]
            mask = ~np.isnan(a_vec) & ~np.isnan(b)
            if mask.sum() < min_shared:
                continue
            a_m = a_vec[mask]
            b_m = b[mask]
            denom = np.linalg.norm(a_m) * np.linalg.norm(b_m)
            if denom == 0:
                continue
            sims[i] = np.dot(a_m, b_m) / denom
        return sims

    # helper: similarity vector vs everyone for a given prototype + feature set
    def sims_for(player_name, cols):
        cols = [c for c in cols if c in X_scaled.columns]  # in case some cols missing
        if len(cols) == 0:
            return np.zeros(len(agg), dtype=float)

        idx = agg.index.get_loc(player_name)
        A = X_scaled[cols].to_numpy()
        a_vec = A[idx]

        sims = masked_cosine_to_all(a_vec, A, min_shared=min_shared)

        # Treat "not enough overlap" as 0 contribution (so other blocks can still matter)
        return np.nan_to_num(sims, nan=0.0)

    scores_hit = np.zeros(len(agg), dtype=float)
    scores_field = np.zeros(len(agg), dtype=float)
    scores_swing = np.zeros(len(agg), dtype=float)

    # offense prototype is required
    scores_hit += weights["offense"] * sims_for(hit_like, OFFENSE)
    # if you want batted-ball similarity tied to the hitter prototype:
    scores_hit += weights["batted"] * sims_for(hit_like, BATTED_BALL)
    
    field_proto = field_like if field_like is not None else hit_like
    scores_field += weights["defense"] * sims_for(field_proto, DEFENSE)

    swing_proto = swing_like if swing_like is not None else hit_like
    scores_swing += weights["swing"] * sims_for(swing_proto, SWING_PATH)

    results = agg.copy()
    results["score_hit"] = scores_hit
    results["score_field"] = scores_field
    results["score_swing"] = scores_swing
    results['total_score'] = results["score_hit"] + results["score_field"] + results["score_swing"]

    # optional filters
    if position is not None:
        results = results[results["position"] == position]
    if bats is not None:
        results = results[results["side"] == bats]

    

    if counterpart:
        cutoff = results[counterpart_by].quantile(1 - counterpart_top_pct)
        results = results[results[counterpart_by] < cutoff]

    # exclude prototypes from output
    exclude = {hit_like, field_like, swing_like}
    exclude = {x for x in exclude if x is not None}
    results = results.drop(index=[x for x in exclude if x in results.index], errors="ignore")

    results = results[['score_hit', 'score_field', 'score_swing', 'total_score']].sort_values("total_score", ascending=False).head(top_n)

    results = results.reset_index(drop = False)

    result_melt = pd.melt(results, id_vars = ['player_name_final', 'total_score'], value_vars = [
        'score_hit', 'score_field', 'score_swing'
    ])

    if hit_like is not None and field_like is not None and swing_like is None:
        title = f'Hitters Most like {hit_like} (Offense and Swing) and {field_like} (Fielding)'
    elif hit_like is not None and field_like is None and swing_like is not None:
        title = f'Hitters Most like {hit_like} (Offense and Fielding) and {swing_like} (Swing)'
    elif hit_like is not None and field_like is not None and swing_like is not None:
        title = f'Hitters Most like {hit_like} (Offense), {field_like} (Fielding) and {swing_like} (Swing)'
    elif hit_like is not None and field_like is None and swing_like is None:
        title = f'Hitters Most like {hit_like} (Offense, Fielding and Swing)'

    chart = alt.Chart(result_melt, title = title).mark_bar().encode(
        x = alt.X('sum(value):Q', title = 'Total Score'),
        y=alt.Y("player_name_final:N",title="Player Name").sort('-x'),
        color = alt.Color('variable', title = 'Prototype Type'),
        tooltip = ['player_name_final', 'variable', 'value', 'total_score']
    )

    return chart
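
To show what the cosine similarity inside the function measures, here is a minimal vectorized sketch in NumPy on made-up data. The helper name `cosine_to_all` and the array `B` are hypothetical, and unlike `masked_cosine_to_all` this version assumes there are no missing values.

```python
import numpy as np

def cosine_to_all(a_vec: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and every row of B (no NaN handling)."""
    denom = np.linalg.norm(a_vec) * np.linalg.norm(B, axis=1)
    # Rows with zero norm get NaN instead of triggering a division warning
    return np.divide(
        B @ a_vec, denom, out=np.full(B.shape[0], np.nan), where=denom != 0
    )

# Hypothetical z-scored features: three players (rows), two metrics (columns)
B = np.array([
    [1.0, 2.0],    # prototype
    [2.0, 4.0],    # same direction, larger magnitude
    [-1.0, -2.0],  # opposite profile
])

sims = cosine_to_all(B[0], B)
print(sims)
```

The second player scores approximately 1 against the prototype even though their values are twice as large, because cosine similarity compares the shape of a profile and not its magnitude. The third player, whose profile points the opposite way, scores approximately -1.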

Here is an example call of the function. In this call we are looking for players who combine hitting outcomes most similar to Aaron Judge, the defensive attributes of Patrick Bailey and a swing like Luis Arraez. We also adjust the weighting of the attributes by passing a dictionary of weights. This call weights player similarity evenly across the offense, batted ball and defense attributes (0.3 each) and weights the swing the least at 0.1. Based on the resulting chart, Wyatt Langford most closely resembles the frankenstein player we are looking for. A majority of his score comes from his similarity to Aaron Judge in batted ball outcomes, which contributes 0.43 of his total score of 0.67.

In [101]:
chart = frankenstein_recommend(hit_like = 'Aaron Judge', field_like='Patrick Bailey', swing_like='Luis Arraez',
                       agg=agg, X_scaled=X_scaled, top_n=15, weights={"offense": 0.3, "batted": 0.3, "swing": 0.1, "defense": 0.3},
                       position=None, bats=None, counterpart=True, counterpart_by='xwoba', counterpart_top_pct=0
)
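
The call above sets counterpart=True with counterpart_top_pct=0. The sketch below reproduces the counterpart logic on a hypothetical five-player frame (the helper name `counterpart_filter`, the labels and the values are made up) to show what different percentages do.

```python
import pandas as pd

# Hypothetical xwoba values for five players
toy = pd.DataFrame(
    {"xwoba": [99.0, 90.0, 75.0, 60.0, 40.0]},
    index=["Star", "B", "C", "D", "E"],
)

def counterpart_filter(df: pd.DataFrame, by: str, top_pct: float) -> pd.DataFrame:
    """Keep rows strictly below the (1 - top_pct) quantile of column `by`."""
    cutoff = df[by].quantile(1 - top_pct)
    return df[df[by] < cutoff]

print(counterpart_filter(toy, "xwoba", 0.0).index.tolist())   # ['B', 'C', 'D', 'E']
print(counterpart_filter(toy, "xwoba", 0.25).index.tolist())  # ['C', 'D', 'E']
```

With a percentage of 0 the cutoff is the maximum value, so the strict less-than comparison removes only the player or players tied for the highest value. Larger percentages remove progressively more of the top of the distribution.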

In [102]:
chart

In [103]:
chart.save('figures/hitter_example.png')