In [1]:
import pandas as pd
import numpy as np
import altair as alt

This is our main function. It takes the names of up to three pitchers whose traits you want the Frankenstein pitcher to combine. For example, you could ask for a pitcher who has outcomes like Paul Skenes, command like Jacob deGrom and an arsenal similar to Logan Webb. If command_like or arsenal_like is left out, pitcher_like is used for that skill set as well.

A similarity score is calculated for each category of statistics. The pitcher_like statistics are pitching outcomes such as xera, strikeout percentage and whiff percentage, together with the batted ball profile (for example gb_rate and pull_rate). The command_like statistics describe how a pitcher performs in each of the Baseball Savant strike zone areas; for example, runs_heart and runs_shadow measure performance on pitches in the heart and shadow zones. The arsenal_like statistics describe the pitches in a pitcher's arsenal: how often each pitch is thrown, plus its average spin and speed (for example n_ff, ff_avg_spin and ff_avg_speed for the four-seam fastball).

Each skill set carries a weight that can be adjusted in the function call. The outcomes and batted weights both apply to the pitcher_like player, the arsenal weight applies to the arsenal_like player and the command weight applies to the command_like player. If no weights are passed, each of the four weights defaults to 0.25.

For each skill set, the function computes the cosine similarity between the player you input and every other player in the dataset, multiplies it by that skill set's weight and adds the result to each player's total score. It then moves on to the next skill set and does the same, so the total score is the weighted sum over all skill sets. Because the dataset still contains NA values, each comparison uses only the statistics that are not NA for both pitchers. The min_shared argument sets the minimum number of such shared statistics (3 by default); a comparison with fewer shared statistics contributes 0 to the score.
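To make the scoring concrete, here is a minimal sketch of a cosine similarity that skips NA features, followed by a weighted sum over skill sets. The feature vectors, the per-block similarity values and the equal weights are made up for illustration; they are not taken from the dataset, and `masked_cosine` is a simplified stand-in for the helper defined inside the function.

```python
import numpy as np

def masked_cosine(a, b, min_shared=3):
    """Cosine similarity using only features that are non-NaN in both vectors."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() < min_shared:
        return np.nan  # not enough shared features to compare
    a_m, b_m = a[mask], b[mask]
    denom = np.linalg.norm(a_m) * np.linalg.norm(b_m)
    return np.nan if denom == 0 else float(np.dot(a_m, b_m) / denom)

# Hypothetical scaled feature vectors (illustrative values only)
proto = np.array([1.0, 2.0, np.nan, 0.5])
same_shape = np.array([2.0, 4.0, 1.0, 1.0])     # proportional to proto on the shared features
sparse = np.array([np.nan, np.nan, 0.3, 0.5])   # shares only one feature with proto

sim_same = masked_cosine(proto, same_shape)     # close to 1.0
sim_sparse = masked_cosine(proto, sparse)       # NaN: fewer than min_shared features

# Weighted sum over skill sets; NaN similarities count as 0
weights = {"outcomes": 0.25, "batted": 0.25, "arsenal": 0.25, "command": 0.25}
block_sims = {"outcomes": sim_same, "batted": 0.5, "arsenal": sim_sparse, "command": -0.2}
total = sum(weights[k] * np.nan_to_num(block_sims[k], nan=0.0) for k in weights)
print(round(float(total), 3))
```

Cosine similarity compares the direction of two vectors rather than their magnitude, so two pitchers whose scaled statistics are proportional score close to 1 even if one of them is more extreme on every statistic.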
You can also filter by the pitcher's throwing hand if you would like to see only left-handed or only right-handed pitchers that are similar to the Frankenstein pitcher you created. There is also an option to toggle the counterpart filter, which removes the top performers on a chosen statistic so that the results lean toward budget-level players rather than superstars. This may help you find someone with a profile relatively similar to the ones you input but at a much lower price tag than a typical superstar. To use it, set counterpart to True, choose a statistic to filter on with counterpart_by and choose a percentage with counterpart_top_pct. For example, if the percentage is 5%, the function keeps only the players below the 95th percentile of that statistic among the players still under consideration. The final output is a chart of the top n players that are most similar to the skill sets you specified in the function call. An example output is provided below to illustrate the inputs and outputs described here. The main goal is to help front offices decide which players to sign.
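The counterpart filter described above can be sketched on toy data as follows. The helper name `drop_top_pct`, its `higher_is_better` flag and the 20-row DataFrame are hypothetical and exist only for this illustration; the notebook's function performs the equivalent quantile cut inside its body.

```python
import pandas as pd

# Hypothetical toy data: 20 pitchers with a made-up metric taking values 1..20
toy = pd.DataFrame({"metric": range(1, 21)}, index=[f"P{i}" for i in range(1, 21)])

def drop_top_pct(frame, col, top_pct=0.05, higher_is_better=True):
    """Remove the best `top_pct` share of rows on `col` using a quantile cutoff."""
    if higher_is_better:
        cutoff = frame[col].quantile(1 - top_pct)
        return frame[frame[col] < cutoff]
    cutoff = frame[col].quantile(top_pct)
    return frame[frame[col] > cutoff]

kept_high = drop_top_pct(toy, "metric", 0.05, higher_is_better=True)   # drops P20
kept_low = drop_top_pct(toy, "metric", 0.05, higher_is_better=False)   # drops P1
print(len(kept_high), len(kept_low))
```

Which side of the distribution counts as "best" depends on the statistic, so the direction has to be chosen per metric before the cut is applied.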

In [2]:
def frankenstein_pitcher_recommend(
    pitcher_like,
    command_like=None,
    arsenal_like=None,
    top_n=15,
    weights=None,
    throws=None,                     # optional filter on pitch_hand
    min_shared=3,                   # minimum shared (non-NA) features per comparison
    counterpart=False,
    counterpart_by="xera",
    counterpart_top_pct=0.05,
    counterpart_higher_is_better=False,  # intended direction flag (xera -> False, k_percent -> True); not consulted below, direction comes from the metric lists
    exclude_prototypes=True
):
    X_scaled = pd.read_csv('../../data/cleaned_player_data/cleaned_pitchers.csv')
    X_scaled = X_scaled.set_index('player_name_final')
    df = X_scaled.drop(columns = ['pitch_hand']).copy()

    OUTCOMES = [
        "xera", "k_percent", "bb_percent",
        "whiff_percent", "chase_percent",
        "hard_hit_percent", "xwoba", "xba",
        "xslg", "xiso", "xobp", "brl", "brl_percent"
    ]

    BATTED_BALL = [
        "gb_rate", "air_rate", "pull_rate",
        "straight_rate", "oppo_rate"
    ]

    COMMAND_ZONES = ["runs_heart", "runs_shadow", "runs_chase", "runs_waste"]

    ARSENAL_USAGE = ["n_ff", "n_si", "n_fc", "n_sl", "n_ch", "n_cu", "n_fs", "n_st", "n_sv"]

    ARSENAL_SHAPE = [
        "ff_avg_spin","si_avg_spin","fc_avg_spin","sl_avg_spin","ch_avg_spin","cu_avg_spin",
        "fs_avg_spin","st_avg_spin","sv_avg_spin", "ball_angle",
        "ff_avg_speed","si_avg_speed","fc_avg_speed","sl_avg_speed","ch_avg_speed","cu_avg_speed",
        "fs_avg_speed","st_avg_speed","sv_avg_speed"
    ]

    if weights is None:
        weights = {"outcomes": 0.25, "batted": 0.25, "arsenal": 0.25, "command": 0.25}

    
    def masked_cosine_to_all(a_vec, B, min_shared=3):
        sims = np.full(B.shape[0], np.nan, dtype=float)
        for i in range(B.shape[0]):
            b = B[i]
            mask = ~np.isnan(a_vec) & ~np.isnan(b)
            if mask.sum() < min_shared:
                continue
            a_m = a_vec[mask]
            b_m = b[mask]
            denom = np.linalg.norm(a_m) * np.linalg.norm(b_m)
            if denom == 0:
                continue
            sims[i] = np.dot(a_m, b_m) / denom
        return sims

    # helper: similarity vector vs everyone for a given prototype + feature set
    def sims_for(player_name, cols):
        cols = [c for c in cols if c in df.columns]  # in case some cols missing
        if len(cols) == 0:
            return np.zeros(len(df), dtype=float)

        idx = df.index.get_loc(player_name)
        A = df[cols].to_numpy()
        a_vec = A[idx]

        sims = masked_cosine_to_all(a_vec, A, min_shared=min_shared)

        # Treat "not enough overlap" as 0 contribution (so other blocks can still matter)
        return np.nan_to_num(sims, nan=0.0)

    scores_pitch = np.zeros(len(df), dtype=float)
    scores_ars = np.zeros(len(df), dtype=float)
    scores_comm = np.zeros(len(df), dtype=float)

    # Main prototype drives OUTCOMES + BATTED_BALL
    scores_pitch += weights["outcomes"] * sims_for(pitcher_like, OUTCOMES)
    scores_pitch += weights["batted"]   * sims_for(pitcher_like, BATTED_BALL)

    # Arsenal can come from arsenal_like or fall back to pitcher_like
    arsenal_proto = arsenal_like if arsenal_like is not None else pitcher_like
    scores_ars += weights["arsenal"] * sims_for(arsenal_proto, ARSENAL_USAGE + ARSENAL_SHAPE)

    # Command block can come from command_like (if provided)
    command_proto = command_like if command_like is not None else pitcher_like
    scores_comm += weights["command"] * sims_for(command_proto, COMMAND_ZONES)

    results = X_scaled.copy()
    results["score_outcomes"] = scores_pitch
    results["score_arsenal"] = scores_ars
    results["score_command"] = scores_comm
    results["total_score"] = results["score_outcomes"] + results["score_arsenal"] + results["score_command"]

    # optional throw-hand filter
    if throws is not None and "pitch_hand" in results.columns:
        results = results[results["pitch_hand"] == throws]

    # optional counterpart filter (exclude elite)
    higher_is_better = ["xera", "k_percent", "bb_percent",
        "whiff_percent", "chase_percent", "hard_hit_percent", "xwoba", "xba",
        "xslg", "xiso", "xobp", "brl", "brl_percent", "runs_heart",
        "runs_shadow", "runs_chase", "runs_waste", "ff_avg_spin", "si_avg_spin",
        "fc_avg_spin", "sl_avg_spin", "ch_avg_spin", "cu_avg_spin", "fs_avg_spin",
        "st_avg_spin", "sv_avg_spin", "ff_avg_speed", "si_avg_speed",
        "fc_avg_speed", "sl_avg_speed", "ch_avg_speed", "cu_avg_speed", "fs_avg_speed",
        "st_avg_speed", "sv_avg_speed", "oppo_rate"
    ]

    lower_is_better = [
        "air_rate", "pull_rate", "straight_rate"
    ]

    if counterpart:
        if counterpart_by in higher_is_better:
            cutoff = results[counterpart_by].quantile(1 - counterpart_top_pct)
            results = results[results[counterpart_by] < cutoff]

        elif counterpart_by in lower_is_better:
            cutoff = results[counterpart_by].quantile(counterpart_top_pct)
            results = results[results[counterpart_by] > cutoff]

        else:
            raise ValueError(f"Unknown direction for metric: {counterpart_by}")

    # exclude prototypes from output
    if exclude_prototypes:
        exclude = {pitcher_like, command_like, arsenal_like}
        exclude = {x for x in exclude if x is not None}
        results = results.drop(index=[x for x in exclude if x in results.index], errors="ignore")

    results = results[['score_outcomes', 'score_arsenal', 'score_command', 'total_score']].sort_values("total_score", ascending=False).head(top_n)

    results = results.reset_index(drop = False)

    result_melt = pd.melt(results, id_vars = ['player_name_final', 'total_score'], value_vars = [
    'score_outcomes', 'score_arsenal', 'score_command'
    ])

    if pitcher_like is not None and command_like is not None and arsenal_like is None:
        title = f'Pitchers Most like {pitcher_like} (Outcomes and Arsenal) and {command_like} (Command)'
    elif pitcher_like is not None and command_like is None and arsenal_like is not None:
        title = f'Pitchers Most like {pitcher_like} (Outcomes and Command) and {arsenal_like} (Arsenal)'
    elif pitcher_like is not None and command_like is not None and arsenal_like is not None:
        title = f'Pitchers Most like {pitcher_like} (Outcomes), {command_like} (Command) and {arsenal_like} (Arsenal)'
    elif pitcher_like is not None and command_like is None and arsenal_like is None:
        title = f'Pitchers Most like {pitcher_like} (Outcomes, Command and Arsenal)'

    chart = alt.Chart(result_melt, title = title).mark_bar().encode(
        x = alt.X('sum(value):Q', title = 'Total Score'),
        y = alt.Y('player_name_final', title = 'Player Name').sort('-x'),
        color = alt.Color('variable', title = 'Prototype Type'),
        tooltip = ['player_name_final', 'variable', 'value', 'total_score']
    )

    return chart

In [3]:
chart = frankenstein_pitcher_recommend(
    pitcher_like="Paul Skenes",
    command_like="Logan Webb",
    arsenal_like="Lucas Erceg",
    counterpart=True,
    counterpart_by="xera",
    counterpart_top_pct=0.05,
    counterpart_higher_is_better=True,
    exclude_prototypes=True,
    throws = 'L'
)

The chart below shows an example output of the function. In this call we are looking for left-handed pitchers whose pitching outcomes most resemble those of Paul Skenes, whose command resembles that of Logan Webb and whose arsenal resembles that of Lucas Erceg. In addition, we enabled the counterpart option, which filters the resulting pitchers on their xera. Because the handedness filter is applied first, the pitchers shown are below the 95th percentile of xera among the left-handed pitchers in the dataset. This can narrow the results to more budget-friendly pitchers who won't cost the team as much to sign but still have metrics similar to the elite pitchers passed to the function. In the resulting chart, Christopher Sanchez most closely resembles the Frankenstein pitcher we are looking for. Most of his score comes from his similarity to Paul Skenes in pitching outcomes, which accounts for 0.43 of the total 0.75.
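The per-block breakdown that the chart displays can also be computed directly. The sketch below uses two hypothetical pitchers with made-up scores; only the column names follow the function's output. It computes each block's share of the total score and reshapes the table into the long format used for the stacked bars.

```python
import pandas as pd

# Hypothetical scores (illustrative values, not real output)
results = pd.DataFrame({
    "player_name_final": ["Pitcher A", "Pitcher B"],
    "score_outcomes": [0.40, 0.30],
    "score_arsenal": [0.20, 0.15],
    "score_command": [0.10, 0.05],
})
blocks = ["score_outcomes", "score_arsenal", "score_command"]
results["total_score"] = results[blocks].sum(axis=1)

# Share of each block in the total score, row by row
shares = results[blocks].div(results["total_score"], axis=0)

# Long format: one row per (player, block), as used for the stacked bar chart
long = results.melt(
    id_vars=["player_name_final", "total_score"],
    value_vars=blocks,
)
print(shares.round(2))
```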

In [4]:
chart