<a href="https://colab.research.google.com/github/marcomedugno/Baseball-Projects/blob/main/Predicting_Muscle_and_Joint_Injuries_Before_They_Happen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Predicting Muscle and Joint Injuries Before They Happen***

*Code is hidden for readability. To see the code, click "Show code".*

###**IMPORTING LIBRARIES**

In [None]:
# @title
!pip install pybaseball > /dev/null 2>&1
# !pip install haversine > /dev/null 2>&1
import pandas as pd
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
from pybaseball import statcast, playerid_reverse_lookup, cache
import numpy as np
import unicodedata
import requests
from bs4 import BeautifulSoup
import io
import zipfile
from io import StringIO
import re
from datetime import timedelta
import xgboost as xgb
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import precision_recall_curve, auc
import warnings
warnings.filterwarnings('ignore', category=pd.errors.SettingWithCopyWarning)
cache.enable()

###**INTRODUCTION**

One quote that is never more true than in Major League Baseball is "Availability is the best ability." Teams may spend countless hours evaluating and developing talent, but all that effort is moot if players cannot stay healthy. Because of this, teams have invested heavily in injury prevention. For example, [the Los Angeles Dodgers partnered with the sports technology company Kitman Labs](https://dodgers.mlblogs.com/dodgers-partner-with-kitman-labs-on-injury-prevention-7a8cfdf168be) to identify, in real time and using biometric data, players who might be at risk of injury.

Unfortunately, some injuries are caused by plays such as awkward slides or hit-by-pitches, which are hard to prevent short of adding more protective gear. Other injuries may be preventable: many muscle and joint injuries can be attributed to overuse. Teams understand this to some extent, which is why far fewer pitchers these days reach 100 pitches in a start, and why the phrase "load management" has become so prevalent in sports.

In this notebook, I will attempt to estimate the probability that each player suffers a muscle or joint injury in each game of the 2023 season, using metrics that can contribute to fatigue or "overuse". If successful, this model could help managers decide when to give a player a day off to avoid a serious injury that would otherwise lead to an IL stint.

This notebook focuses solely on position players, since pitchers have very different "overuse" attributes. The target variable is whether the player suffered a muscle or joint injury that landed them on the Injured List. Of course, there are less severe injuries where a player does not land on the IL and may even play through it. These are not counted as positive cases, mainly because it is difficult to determine when they occur.

##**Calculating Baserunning Energy Expended**

One attribute that teams can use to identify potential overuse among position players is the energy they exert on the basepaths. Not everyone is a fan of this, but teams are certainly paying attention to baserunning energy expenditure. Former Mets manager Buck Showalter expressed his displeasure with the practice in an interview with Foul Territory.

In [None]:
# @title
# HTML content of the tweet to be displayed
tweet_html = """
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Buck Showalter gets SPICY on "Load Management" 🌶️<br><br>"We had a guy who had a triple and 2 doubles, and they said he probably needs a day off because he did so much running! ... You go tell Brandon Nimmo he can't play today because he hit too well last night."<br><br>▶️… <a href="https://t.co/m8qwVPEvte">pic.twitter.com/m8qwVPEvte</a></p>&mdash; Foul Territory (@FoulTerritoryTV) <a href="https://twitter.com/FoulTerritoryTV/status/1754951842079846905?ref_src=twsrc%5Etfw">February 6, 2024</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
"""

# Display the tweet in the notebook
display(HTML(tweet_html))

Showalter makes a valid point. There is a fine line when deciding to give players extra rest. Of course, the best players always want to be on the field, and fans want to see their favorite players. Some older fans might even long for a time, before more research was done on injury prevention and energy preservation, when stars played every single day, ran hard on every ball, and played through any injury. It's up to teams to take responsibility for players' health and protect them from the grind of a long season, while also balancing the emotions involved in sitting an otherwise healthy player.

My first step is to calculate the total baserunning expenditure for each player and game. I used Statcast event data to compile this information, in addition to Stathead Baseball's Game Stats Finder to aggregate all stolen base attempts during the 2023 season. I then assigned a point value system to different baserunning events. The more energy a player typically exerts on the event type, the higher the point value. The point values are pretty subjective and observational, but teams with access to player biometric data and smart devices can potentially do a better job of finding the correct point assignment for each event. The point assignments are as follows:



*   **Single** - 1
  *   Some singles require real effort and others do not, such as a single to left field (unless you're [Sean Casey](https://www.youtube.com/watch?v=XghUJ36DsVc))
*   **Double** - 2.5
  *   Reflects the sprint energy to second base
*   **Triple** - 4
  *   Maximum exertion for full-speed running
*   **Home Run** - 3
  *   Trotting, less intense but covers all bases
*   **Walk** - 0.5
  *   Minimal physical exertion
*   **Hit by Pitch** - 0.6
  *   Slightly higher than a walk to account for the potential pain and discomfort
*   **Left-Side Groundout** - 0.75
  *   Assumes the player exerts some energy running out these ground balls, as opposed to right-side groundouts
*   **Stolen Base Attempt** - 2.5
  *   Short distance, but almost always at full speed and with a slide attempt

Based on this point system, the day on which a player exerted the most baserunning energy was 4/22/2023, when Bryan De La Cruz racked up 4 doubles, 1 home run, 1 walk, 2 left-side groundouts, and 1 stolen base attempt across 2 games during the Marlins' doubleheader against the Guardians.

The most energy expended during a single game came on 6/23/2023 when Elly De La Cruz (no relation to Bryan) [hit for the cycle](https://www.youtube.com/watch?v=W3bvOAB8Vuc) and attempted 2 stolen bases.
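As a quick check on the point system, the short sketch below recomputes both totals from the event counts quoted above. The `POINTS` dictionary and the `energy_score` helper are illustrative names introduced here, not part of the notebook's pipeline.

```python
# Point values copied from the list above (subjective, as noted in the text)
POINTS = {
    "single": 1.0,
    "double": 2.5,
    "triple": 4.0,
    "home_run": 3.0,
    "walk": 0.5,
    "hit_by_pitch": 0.6,
    "left_side_groundout": 0.75,
    "stolen_base_attempt": 2.5,
}

def energy_score(counts):
    """Sum (point value x count) over a dict of event -> count."""
    return sum(POINTS[event] * n for event, n in counts.items())

# Bryan De La Cruz, 4/22/2023 (doubleheader)
bryan = energy_score({"double": 4, "home_run": 1, "walk": 1,
                      "left_side_groundout": 2, "stolen_base_attempt": 1})

# Elly De La Cruz, 6/23/2023 (cycle plus two stolen base attempts)
elly = energy_score({"single": 1, "double": 1, "triple": 1, "home_run": 1,
                     "stolen_base_attempt": 2})

print(bryan, elly)  # 17.5 15.5
```

Both values agree with the `energy_expended` column for those two rows in the notebook's output.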


In [None]:
# @title
# Fetch statcast data
statcast_data = statcast(start_dt="2023-03-30", end_dt="2023-10-01")

# Function for finding left-side groundouts
def is_left_side_groundout(row):
    if pd.notnull(row['bb_type']) and pd.notnull(row['hit_location']) and pd.notnull(row['events']):
        return ((row['bb_type'] == 'ground_ball') and
                (row['hit_location'] in [5, 6]) and
                (row['events'] in ['field_out', 'grounded_into_double_play', 'field_error']))
    else:
        return False

statcast_data['left_side_groundout'] = statcast_data.apply(is_left_side_groundout, axis=1)

# Create 'batter_team' based on conditions
statcast_data['batter_team'] = np.where(statcast_data['inning_topbot'] == 'Bot',
                                        statcast_data['home_team'],
                                        statcast_data['away_team'])


# Count baserunning-related events per batter, game date, and team
grouped = statcast_data.groupby(['batter', 'game_date','batter_team'])
result = grouped.agg(
    Total_Singles=('events', lambda x: (x == 'single').sum()),
    Total_Doubles=('events', lambda x: (x == 'double').sum()),
    Total_Triples=('events', lambda x: (x == 'triple').sum()),
    Total_Home_Runs=('events', lambda x: (x == 'home_run').sum()),
    Total_Walks=('events', lambda x: (x == 'walk').sum()),
    Total_Hit_By_Pitches=('events', lambda x: (x == 'hit_by_pitch').sum()),
    Total_Left_Side_Groundouts=('left_side_groundout', 'sum')
).reset_index()

# Perform reverse lookup to get player names for batter IDs
unique_batter_ids = statcast_data['batter'].unique()
player_details = playerid_reverse_lookup(player_ids=unique_batter_ids, key_type='mlbam')

# Concatenate 'name_first' and 'name_last' to create a full name column
player_details['full_name'] = player_details['name_first'] + ' ' + player_details['name_last']

# Create a dictionary mapping player IDs to full names
player_id_to_name = dict(zip(player_details['key_mlbam'], player_details['full_name']))

# Add batter names to the aggregated DataFrame
result['batter_name'] = result['batter'].map(player_id_to_name)

# Define the desired column order
cols = result.columns.tolist()
# Move 'batter_name' to follow 'batter'
new_order = cols[:1] + ['batter_name'] + cols[1:-1]

# Reorder the DataFrame according to the new column order
result = result[new_order]

'''
The stolen base attempts file was put together manually using Stathead Baseball's Game Stats Finder
'''

# Read the stolen base attempts Excel file into a pandas DataFrame
stolen_base_attempts = pd.read_excel('stolen_base_attempts.xlsx')

# Step 1: Ensure the 'Player' column in 'stolen_base_attempts' is free of accent marks and lowercase. Accent marks and casing can often lead to inconsistencies between names in different data files.
stolen_base_attempts['Player'] = stolen_base_attempts['Player'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii').lower()
)

# Step 2: Ensure the 'batter_name' column in 'result' is free of accent marks and lowercase
result['batter_name'] = result['batter_name'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii').lower()
)

# Make sure team abbreviations match. Different data files sometimes use different team abbreviations.
team_replacements = {
    'ARI': 'AZ',
    'CHW': 'CWS',
    'KCR': 'KC',
    'SDP': 'SD',
    'SFG': 'SF',
    'TBR': 'TB',
    'WSN': 'WSH'
}

# Replace the values in the 'Team' column
stolen_base_attempts['Team'] = stolen_base_attempts['Team'].replace(team_replacements)

# Merge stolen base attempts
result = result.merge(stolen_base_attempts, left_on=['batter_name','game_date','batter_team'], right_on=['Player','Date','Team'], how='left')

#Replace NAs in stolen base column with 0. That player did not steal a base that game.
result.update(result[['Stolen Base Attempts']].fillna(0))

result.drop(columns=['Player','Date','Team'], inplace=True)

# Creating an Energy Expended Attribute (subjective due to lack of wearable tech data)
result['energy_expended'] = (
    result['Total_Singles'] * 1 + # Some singles require effort, others do not, such as a single to left field (unless you're Sean Casey)
    result['Total_Doubles'] * 2.5 +  # Reflects the sprint energy to second base
    result['Total_Triples'] * 4 +  # Maximum exertion for full-speed running
    result['Total_Home_Runs'] * 3 +  # Trotting, less intense but covers all bases
    result['Total_Walks'] * 0.5 +  # Minimal physical exertion
    result['Total_Hit_By_Pitches'] * 0.6 +  # Slightly higher to account for the potential pain and discomfort
    result['Total_Left_Side_Groundouts'] * 0.75 + # Assumes the player exerts some energy on these ground balls; hustle is more likely than on right-side groundouts
    result['Stolen Base Attempts'] * 2.5 # Short distance, but typically always at full speed and with a slide attempt
)

# Sorting the DataFrame by 'energy_expended' in descending order
result.sort_values(by='energy_expended', ascending=False).head(10)


| | batter | batter_name | game_date | batter_team | Total_Singles | Total_Doubles | Total_Triples | Total_Home_Runs | Total_Walks | Total_Hit_By_Pitches | Total_Left_Side_Groundouts | Stolen Base Attempts | energy_expended |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23073 | 650559 | bryan de la cruz | 2023-04-22 | MIA | 0 | 4 | 0 | 1 | 1 | 0 | 2 | 1.0 | 17.5 |
| 4132 | 542932 | jon berti | 2023-09-27 | MIA | 2 | 1 | 1 | 1 | 2 | 0 | 1 | 1.0 | 15.75 |
| 45511 | 682829 | elly de la cruz | 2023-06-23 | CIN | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 2.0 | 15.5 |
| 33739 | 666969 | adolis garcia | 2023-04-22 | TEX | 0 | 2 | 0 | 3 | 0 | 1 | 0 | 0.0 | 14.6 |
| 24907 | 657041 | lane thomas | 2023-07-23 | WSH | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 4.0 | 14.5 |
| 11027 | 596115 | trevor story | 2023-08-13 | BOS | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 2.0 | 13.5 |
| 27955 | 663656 | kyle tucker | 2023-09-10 | HOU | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 2.0 | 13.5 |
| 43414 | 678662 | ezequiel tovar | 2023-09-16 | COL | 3 | 1 | 1 | 0 | 0 | 0 | 2 | 1.0 | 13.5 |
| 31950 | 665923 | esteury ruiz | 2023-04-26 | OAK | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 4.0 | 13.35 |
| 45910 | 682998 | corbin carroll | 2023-04-25 | AZ | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 2.0 | 13.0 |


The next step was to create aggregated metrics to track the baserunning energy expenditure for each player and game across 4 different periods: yesterday, last week, last month, and season-to-date. Each of these attributes will be used as a measure of fatigue in my final dataset.
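To make the four windows concrete, here is a minimal standard-library sketch for a single player. It assumes the same conventions as the notebook's aggregation code: the current day is always excluded, and the week and month windows use strict bounds on both ends. The `fatigue_features` helper and the game log are hypothetical and only illustrate the logic.

```python
from datetime import date, timedelta

def fatigue_features(games, game_date):
    """Return the four trailing-energy features for one player on `game_date`.

    `games` is a list of (date, energy) pairs for a single player.
    Only days strictly before `game_date` are counted.
    """
    def window_sum(days):
        start = game_date - timedelta(days=days)
        # Strict inequalities on both ends: start < d < game_date
        return sum(e for d, e in games if start < d < game_date)

    yesterday = game_date - timedelta(days=1)
    return {
        "yesterday": sum(e for d, e in games if d == yesterday),
        "last_week": window_sum(7),
        "last_month": window_sum(30),
        "season_to_date": sum(e for d, e in games if d < game_date),
    }

# Hypothetical game log: (date, energy expended that day)
log = [(date(2023, 4, 1), 1.0), (date(2023, 4, 2), 2.5),
       (date(2023, 4, 8), 4.0), (date(2023, 4, 9), 3.0)]

features = fatigue_features(log, date(2023, 4, 9))
print(features)
```

With strict bounds, the April 2 game falls just outside the seven-day window for April 9, so "last week" effectively covers the six days before the game.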

In [None]:
# @title
# Getting Subset of Results to use for aggregation
result = result[['batter', 'batter_name', 'game_date', 'batter_team','energy_expended']].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below

# Convert game_date to date time field
result['game_date'] = pd.to_datetime(result['game_date'])

"""
Getting Energy Expended Yesterday
"""

# Create a shifted 'game_date' column for comparison
result['previous_game_date'] = result.groupby('batter')['game_date'].shift()

# Shift the 'energy_expended' to align with the 'previous_game_date'
result['baserunning_energy_yesterday'] = result.groupby('batter')['energy_expended'].shift()

# Check if the 'game_date' is exactly one day after 'previous_game_date'
result['baserunning_energy_yesterday'] = result.apply(
    lambda row: row['baserunning_energy_yesterday'] if row['game_date'] == row['previous_game_date'] + pd.Timedelta(days=1) else 0, axis=1
)

# Drop the 'previous_game_date' column since it's no longer needed
result.drop(columns=['previous_game_date'], inplace=True)

result.reset_index(drop=True, inplace=True)

"""
Getting Energy expended last week and last month
"""

# Initialize as floats so the fractional sums assigned below do not change the column dtype
result['baserunning_energy_last_week'] = 0.0
result['baserunning_energy_last_month'] = 0.0


# Group by batter to avoid cross-batter calculations
for (batter, group) in result.groupby('batter'):
    for idx, row in group.iterrows():
        # Calculate the start dates for last week and last month
        one_week_ago = row['game_date'] - pd.Timedelta(days=7)
        one_month_ago = row['game_date'] - pd.Timedelta(days=30)

        # Filter the group for the last week and last month ranges
        last_week_rows = group[(group['game_date'] > one_week_ago) & (group['game_date'] < row['game_date'])]
        last_month_rows = group[(group['game_date'] > one_month_ago) & (group['game_date'] < row['game_date'])]

        # Sum the energy expended for each period
        result.loc[idx, 'baserunning_energy_last_week'] = last_week_rows['energy_expended'].sum()
        result.loc[idx, 'baserunning_energy_last_month'] = last_month_rows['energy_expended'].sum()

# Calculate cumulative energy expended for each batter
result['cumulative_energy'] = result.groupby('batter')['energy_expended'].cumsum()

'''
Getting Energy expended for the whole season to date
'''

# Shift the cumulative sum by one to exclude the current day's energy, filling missing values with 0
result['baserunning_energy_season_to_date'] = result.groupby('batter')['cumulative_energy'].shift(fill_value=0)

result.drop(columns='cumulative_energy', inplace=True)
result.drop(columns='energy_expended', inplace=True)

result.sort_values(by='baserunning_energy_yesterday', ascending=False).head(10)


| | batter | batter_name | game_date | batter_team | baserunning_energy_yesterday | baserunning_energy_last_week | baserunning_energy_last_month | baserunning_energy_season_to_date |
|---|---|---|---|---|---|---|---|---|
| 23074 | 650559 | bryan de la cruz | 2023-04-23 | MIA | 17.5 | 22.75 | 48.25 | 48.25 |
| 4133 | 542932 | jon berti | 2023-09-28 | MIA | 15.75 | 28.75 | 47.75 | 262.1 |
| 45512 | 682829 | elly de la cruz | 2023-06-24 | CIN | 15.5 | 29.25 | 69.0 | 69.0 |
| 33740 | 666969 | adolis garcia | 2023-04-23 | TEX | 14.6 | 22.35 | 53.1 | 53.1 |
| 24908 | 657041 | lane thomas | 2023-07-24 | WSH | 14.5 | 26.25 | 72.85 | 285.0 |
| 27956 | 663656 | kyle tucker | 2023-09-11 | HOU | 13.5 | 20.75 | 63.75 | 387.7 |
| 43415 | 678662 | ezequiel tovar | 2023-09-17 | COL | 13.5 | 19.5 | 58.75 | 326.4 |
| 31951 | 665923 | esteury ruiz | 2023-04-27 | OAK | 13.35 | 22.35 | 72.45 | 72.45 |
| 13589 | 607208 | trea turner | 2023-07-16 | PHI | 13.0 | 13.75 | 67.45 | 253.2 |
| 10814 | 596019 | francisco lindor | 2023-07-07 | NYM | 13.0 | 27.1 | 79.45 | 218.05 |


###**Calculating Torque and Total Swings**

Aside from running, another likely cause of overuse is swinging the bat. Swinging a baseball bat is a very unnatural motion for the body, so each swing puts strain on it. Of course, not all swings are created equal. An Aaron Judge swing is not the same as a Steven Kwan swing. Aaron Judge recently admitted that swinging throughout the offseason likely contributed to an abdominal injury in spring training.

> *“I think just from swinging from November all the way until now, every single day, it put some wear and tear on it,” Judge said via Brian Hoch of MLB.com. “Especially coming back after a [right] toe injury when your mechanics are a little messed up and you’re just working on some things."*

In short, torque is the rotational force applied around an axis. A high-torque swing can often lead to good results, including more power, but it also puts more stress on the body.





Torque can be calculated by the following equation:

$$\tau = r \cdot F \cdot \sin(\theta)$$

where:


*   **r** is the lever arm distance
  *   I derive this by estimating the distance between the player's shoulder and the point of contact (arm length + bat length).
*   **F** is the force applied
  *   Force equals mass × acceleration. For simplicity, and given the available data, I use the product of each player's estimated bat weight (converted from ounces to kilograms) and their average launch speed during the 2023 season as a proxy for force. Because launch speed is a velocity rather than an acceleration, the result is best read as a relative index, even though it is labeled in Nm.
*   **θ** is the angle, in radians, at which the force is applied
  *   For this, I used SwingGraphs' Vertical Bat Angle (VBA) metric, converted from degrees.


To estimate each player's arm length, bat length, and bat weight, I scraped each player's height and weight from a Newsday table and used those attributes in a simple formula to get my estimations. The following table displays the input variables that will go into my torque calculation for 10 players.
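The sketch below works through the arm-length and bat-weight estimates for one hypothetical 6'2", 190 lb hitter. The constants (22% of height for arm length; a 32 oz base bat adjusted by height above 66 inches and weight above 180 pounds, clamped to 29 to 34 oz) are taken from the estimation code later in this notebook. The function names here are illustrative, and the bat-length estimate is left out because it depends on the minimum and maximum of the scraped data.

```python
import re

def height_to_inches(height_str):
    """Convert a height string such as 6'2" to inches; return None if the format is unexpected."""
    match = re.fullmatch(r"(\d+)'(\d+)\"", height_str.strip())
    if match is None:
        return None
    feet, inches = match.groups()
    return int(feet) * 12 + int(inches)

def estimate_arm_length(height_in):
    """Arm length approximated as 22% of height (inches)."""
    return round(height_in * 0.22, 2)

def estimate_bat_weight(height_in, weight_lb):
    """Base of 32 oz, nudged up for taller and heavier players, clamped to 29-34 oz."""
    raw = 32.0 + (height_in - 66) / 10 + (weight_lb - 180) / 50
    return round(min(max(raw, 29.0), 34.0), 1)

height = height_to_inches("6'2\"")        # 74 inches
arm = estimate_arm_length(height)         # 16.28 inches
bat = estimate_bat_weight(height, 190)    # 33.0 ounces
print(height, arm, bat)
```

Because of the 34 oz cap, the largest hitters all receive the same estimated bat weight, so differences in torque among them come from the other inputs.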



In [None]:
# @title
'''
Getting player's height and weight to help with torque estimation
'''

# URL where I am scraping player height and weight
base_url = 'https://newsday.sportsdirectinc.com/baseball/mlb-players.aspx?page=/data/mlb/players/{}_players.html' # Players' heights and weights may have been updated since this code was run. In that case, the end results may differ.

# All letters in the alphabet to iterate over
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

# Initialize an empty list to store all players' data
all_players_data = []

# Iterate over each letter in the alphabet
for letter in alphabet:
    # Construct the URL for the current letter
    url = base_url.format(letter)

    # Send a GET request to the website and get the response
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the specific table by class name, which contains the player data
        players_table = soup.find_all('table', {'class': 'sdi-data-wide'})[2]  # Selecting the third table as indicated

        # Extract the rows, skipping the header row
        rows = players_table.find_all('tr')[1:]  # Skip the header

        # Iterate over each row to extract the data
        for row in rows:
            cols = row.find_all('td')
            if len(cols) > 1:  # Ensuring that the row has multiple columns
                player_name = cols[0].get_text(strip=True)
                player_team = cols[2].get_text(strip=True)
                player_height = cols[4].get_text(strip=True)
                player_weight = cols[5].get_text(strip=True)
                # Append the data to the list
                all_players_data.append({
                    'Player': player_name,
                    'Team': player_team,
                    'Height': player_height,
                    'Weight': player_weight
                })
    else:
        print(f"Failed to retrieve the webpage for letter {letter}. Status code:", response.status_code)

# Convert the list of dictionaries to a DataFrame
height_and_weight = pd.DataFrame(all_players_data)


In [None]:
# @title
'''
I calculate torque in baseball swings by converting measurements to SI units, estimating force with bat weight and launch speed, and applying these to the lever arm distance and bat angle at contact in the torque formula.
'''

def height_to_inches(height_str):
    try:
        # Check if the height string is in the expected format
        if '\'' in height_str and '"' in height_str:
            feet, inches = height_str.split('\'')
            inches = inches.replace('"', '')
            return int(feet) * 12 + int(inches)
        else:
            # Return None or a default value if the format is unexpected
            return None
    except ValueError:
        # Log the error or handle it as needed
        print(f"Cannot convert height: {height_str}")
        return None

# Apply the updated function
height_and_weight['Height_in'] = height_and_weight['Height'].apply(height_to_inches)

# Fill Height NAs with mean (assign back instead of inplace=True on a column, which newer pandas versions warn about)
average_height = height_and_weight['Height_in'].dropna().mean()
height_and_weight['Height_in'] = height_and_weight['Height_in'].fillna(average_height)

# Ensure the 'Weight' column is treated as a string
height_and_weight['Weight'] = height_and_weight['Weight'].astype(str)

# Extract the first run of digits and convert to float (raw string avoids an invalid-escape warning)
height_and_weight['Weight'] = height_and_weight['Weight'].str.extract(r'(\d+)', expand=False).astype(float)

# Fill Weight NAs with mean:
mean_weight = height_and_weight['Weight'].mean(skipna=True)  # Calculate mean weight, skipping NaN values
height_and_weight['Weight'] = height_and_weight['Weight'].fillna(mean_weight)  # Fill NaN values in 'Weight' with the calculated mean

# Function to estimate bat size
def estimate_bat_size(row):
    # Example logic for bat size estimation
    scaled_height = (row['Height_in'] - height_and_weight['Height_in'].min()) / (height_and_weight['Height_in'].max() - height_and_weight['Height_in'].min())
    scaled_weight = (row['Weight'] - height_and_weight['Weight'].min()) / (height_and_weight['Weight'].max() - height_and_weight['Weight'].min())
    estimate = 32 + (scaled_height * 2 + scaled_weight) / 3 * 4
    return round(estimate, 2)

# Apply the function to estimate bat size
height_and_weight['Estimated Bat Size (inches)'] = height_and_weight.apply(estimate_bat_size, axis=1)

# Estimate arm length from height (22% of height as a simple approximation)
height_and_weight['Estimated Arm Length (inches)'] = height_and_weight['Height_in'] * 0.22

# calculate the total distance from shoulder to contact point. This will be used as the 'Lever Arm Distance' in the torque formula
height_and_weight['Estimated Shoulder to Contact (inches)'] = height_and_weight['Estimated Arm Length (inches)'] + height_and_weight['Estimated Bat Size (inches)']

def estimate_bat_weight_simplified(row):
    # Set a base average weight for the bat
    base_weight = 32  # Average bat weight in ounces

    # Adjust based on the player's height and weight
    # Assuming taller and heavier players might prefer slightly heavier bats
    height_adjustment = (row['Height_in'] - 66) / 10  # Example adjustment for height above 66 inches
    weight_adjustment = (row['Weight'] - 180) / 50  # Example adjustment for weight above 180 pounds

    # Calculate final estimated bat weight
    estimated_weight = base_weight + height_adjustment + weight_adjustment

    # Ensure the estimated bat weight is within a realistic range
    if estimated_weight > 34:  # Cap the weight
        estimated_weight = 34
    elif estimated_weight < 29:  # Floor at 29 ounces, the weight cited here for the light bats of Rod Carew and Ozzie Smith
        estimated_weight = 29

    return round(estimated_weight, 1)  # Round to 1 decimal place for neatness

# Apply the function to estimate bat weight
height_and_weight['Estimated Bat Weight (ounces)'] = height_and_weight.apply(estimate_bat_weight_simplified, axis=1)

'''

Get Launch Speed to be used in Torque Calculation. This will be used for Force (mass x acceleration) as a substitute for acceleration. Not a perfect substitute since launch speed measures the speed of the ball, whereas we would theoretically be more interested in speed of the swing.

'''

# Prepare `launch_speed` DataFrame from `statcast_data` (copy so the assignment below does not trigger SettingWithCopyWarning)
launch_speed = statcast_data[['batter', 'launch_speed']].copy()

# Convert `launch_speed` to numeric, handling errors
launch_speed['launch_speed'] = pd.to_numeric(launch_speed['launch_speed'], errors='coerce')

# Filter out batters with fewer than 50 rows (every pitch counts as a row, including those with no launch speed)
batter_counts = launch_speed['batter'].value_counts()
launch_speed_filtered = launch_speed[launch_speed['batter'].isin(batter_counts.index[batter_counts >= 50])]

# Group by 'batter' and calculate the mean launch speed, skipping NaN values
average_launch_speed = launch_speed_filtered.groupby('batter')['launch_speed'].mean().reset_index()

# Round the average launch speed to 2 decimal places
average_launch_speed['launch_speed'] = average_launch_speed['launch_speed'].round(2)

"""

USE SWINGGRAPHS TO GET VBA (vertical bat angle). VBA WILL BE USED TO ESTIMATE THE ANGLE BETWEEN THE FORCE AND THE LEVER ARM IN THE TORQUE CALCULATION.

"""

# Read the CSV file into a pandas DataFrame
vba_data = pd.read_csv('VBA_All.csv')

# Select the last six columns, assumed to hold the monthly VBA values for 2023
months_2023_columns = vba_data.columns[-6:]

# Calculate the average for each batter in 2023
vba_data['average_vba_2023'] = vba_data[months_2023_columns].mean(axis=1).round(2)

# Drop rows with NaN in the 'average_vba_2023' column. These players did not play in 2023.
vba_data.dropna(subset=['average_vba_2023'], inplace=True)

vba_data = vba_data[['batter', 'Name', 'Team', 'average_vba_2023']]

# Merge `vba_data` with `average_launch_speed` on the 'batter' column
merged_data = pd.merge(average_launch_speed, vba_data, on='batter', how='inner')

# Rename the 'launch_speed' column to 'average_launch_speed_2023'
merged_data.rename(columns={'launch_speed': 'average_launch_speed_2023'}, inplace=True)

# Step 1: Clean the 'Player' column in 'height_and_weight'
# Remove (L) or (R) from names
height_and_weight['Player'] = height_and_weight['Player'].str.replace(r'\(L\)|\(R\)', '', regex=True)

# Remove accent marks and strip any leading/trailing whitespace
height_and_weight['Player'] = height_and_weight['Player'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii').strip()
)

# Convert names to "First Name Last Name" format
height_and_weight['Player'] = height_and_weight['Player'].apply(
    lambda name: ' '.join(name.split(', ')[::-1])
)

# Step 2: Ensure the 'Name' column in 'merged_data' is also free of accent marks
merged_data['Name'] = merged_data['Name'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
)

# Rename 'Player' to 'Name' in height_and_weight for consistency
height_and_weight_renamed = height_and_weight.rename(columns={'Player': 'Name'})

# Identify duplicates in both DataFrames
duplicate_names_merged = merged_data['Name'].duplicated(keep=False)
duplicate_names_height_weight = height_and_weight_renamed['Name'].duplicated(keep=False)

# Splitting DataFrames based on unique and duplicated 'Name' values
merged_data_unique = merged_data[~duplicate_names_merged]
merged_data_duplicates = merged_data[duplicate_names_merged]
height_and_weight_unique = height_and_weight_renamed[~duplicate_names_height_weight]
height_and_weight_duplicates = height_and_weight_renamed[duplicate_names_height_weight]

# Merge Strategy
# For unique names
merged_unique = pd.merge(height_and_weight_unique, merged_data_unique, on='Name', how='inner')
# For duplicated names
merged_duplicates = pd.merge(height_and_weight_duplicates, merged_data_duplicates, on=['Name', 'Team'], how='inner') # This extra step is needed for players with duplicate names like Carlos Perez

# Combine results
final_merged_data = pd.concat([merged_unique, merged_duplicates], ignore_index=True)

# Adjusting Columns (if 'batter' exists in columns)
if 'batter' in final_merged_data.columns:
    final_merged_data = final_merged_data[['batter'] + [col for col in final_merged_data.columns if col != 'batter']]

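# Duplicate-name players were merged on ['Name', 'Team'], so their team is in 'Team' rather than 'Team_x'.
# Coalesce the two before dropping 'Team' so those players do not end up with a missing team.
final_merged_data['Team_x'] = final_merged_data['Team_x'].fillna(final_merged_data['Team'])

# Drop specified columns
final_merged_data.drop(['Height', 'Weight', 'Height_in','Team'], axis=1, inplace=True)

final_merged_data = final_merged_data.rename(columns={'Team_x': 'Team'})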

# Print the DataFrame with all the attributes that will go into the Torque value
final_merged_data.head(10)




| | batter | Name | Team | Estimated Bat Size (inches) | Estimated Arm Length (inches) | Estimated Shoulder to Contact (inches) | Estimated Bat Weight (ounces) | average_launch_speed_2023 | Team_y | average_vba_2023 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 682928 | CJ Abrams | WAS | 33.79 | 16.28 | 50.07 | 33.0 | 81.23 | WAS | 26.95 |
| 1 | 547989 | Jose Abreu | HOU | 34.29 | 16.5 | 50.79 | 34.0 | 82.42 | HOU | 32.7 |
| 2 | 677800 | Wilyer Abreu | BOS | 33.39 | 15.4 | 48.79 | 33.1 | 83.37 | BOS | 35.4 |
| 3 | 660670 | Ronald Acuna Jr. | ATL | 33.61 | 15.84 | 49.45 | 33.1 | 86.43 | ATL | 33.38 |
| 4 | 642715 | Willy Adames | MIL | 33.87 | 16.06 | 49.93 | 33.5 | 81.27 | MIL | 33.88 |
| 5 | 677941 | Jordyn Adams | LAA | 33.71 | 16.28 | 49.99 | 32.8 | 82.97 | LAA | 32.15 |
| 6 | 656180 | Riley Adams | WAS | 34.67 | 16.72 | 51.39 | 34.0 | 84.41 | WAS | 34.17 |
| 7 | 666176 | Jo Adell | LAA | 34.13 | 16.5 | 50.63 | 33.6 | 80.82 | LAA | 28.77 |
| 8 | 501303 | Ehire Adrianza | LAA | 33.7 | 16.06 | 49.76 | 33.1 | 75.92 | ATL | 35.5 |
| 9 | 605113 | Nick Ahmed | SF | 33.87 | 16.28 | 50.15 | 33.2 | 79.23 | ARI | 28.82 |


Using the previously mentioned attributes, I was able to calculate the estimated torque per swing for each player in 2023. Unsurprisingly, Aaron Judge led the way with an estimated torque of 76.599 Nm, which was 6.628 Nm higher than the next highest value.

MLB superstars Freddie Freeman and Mike Trout were also among the top 10 in estimated torque.
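As a check on the arithmetic, the sketch below plugs the CJ Abrams row from the input table above into the torque formula. It uses the same unit conversions as the notebook (inches to meters, ounces to kilograms, degrees to radians) and the same force proxy of bat mass times average launch speed. The `estimated_torque` function is an illustrative stand-alone helper, not the notebook's own implementation.

```python
import math

INCHES_TO_METERS = 0.0254
OUNCES_TO_KILOGRAMS = 0.0283495

def estimated_torque(lever_arm_in, bat_weight_oz, launch_speed, vba_deg):
    """Torque proxy: r * (bat mass * launch speed) * sin(theta).

    Launch speed stands in for acceleration, so the result is a relative
    index rather than a true torque in newton-meters.
    """
    r = lever_arm_in * INCHES_TO_METERS                       # lever arm in meters
    force_proxy = bat_weight_oz * OUNCES_TO_KILOGRAMS * launch_speed
    return r * force_proxy * math.sin(math.radians(vba_deg))

# CJ Abrams: shoulder-to-contact 50.07 in, bat 33.0 oz, launch speed 81.23, VBA 26.95 degrees
cj_abrams = estimated_torque(50.07, 33.0, 81.23, 26.95)
print(round(cj_abrams, 1))
```

This comes out to about 43.8 on the same scale, well below Judge's 76.6.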

In [None]:
# @title
def calculate_torque(dataframe):
    # Constants for unit conversion
    inches_to_meters = 0.0254
    ounces_to_kilograms = 0.0283495

    # Convert Estimated Shoulder to Contact from inches to meters
    dataframe['Estimated Shoulder to Contact (meters)'] = dataframe['Estimated Shoulder to Contact (inches)'] * inches_to_meters

    # Convert Bat Weight from ounces to kilograms
    dataframe['Estimated Bat Weight (kilograms)'] = dataframe['Estimated Bat Weight (ounces)'] * ounces_to_kilograms

    # Convert VBA from degrees to radians
    dataframe['average_vba_2023 (radians)'] = np.radians(dataframe['average_vba_2023'])

    # Calculate Torque
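    # Note: This uses a simplified model for force using bat weight and launch speed as a proxy.
    # Launch speed stays in mph, so the result works as a relative index rather than a strict SI torque.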
    dataframe['Torque (Nm)'] = (dataframe['Estimated Shoulder to Contact (meters)'] *
                                dataframe['Estimated Bat Weight (kilograms)'] *
                                dataframe['average_launch_speed_2023'] *
                                np.sin(dataframe['average_vba_2023 (radians)']))

    return dataframe[['batter','Name', 'Team', 'Torque (Nm)']]

# Calculate the torque for each batter
result_df = calculate_torque(final_merged_data)

# Display the results
result_df.sort_values(by='Torque (Nm)',ascending=False).head(10)



Unnamed: 0,batter,Name,Team,Torque (Nm)
234,592450,Aaron Judge,NYY,76.599121
160,518692,Freddie Freeman,LAD,69.971104
284,669016,Brandon Marsh,PHI,66.896538
256,663616,Trevor Larnach,MIN,66.807835
184,682985,Riley Greene,DET,65.528761
490,545361,Mike Trout,LAA,65.244276
412,667670,Brent Rooker,OAK,63.82901
250,677008,Heston Kjerstad,BAL,63.03935
488,669707,Jared Triolo,PIT,63.005939
94,671213,Triston Casas,BOS,62.710767


While torque measures the stress that each baseball swing puts on the body, it is also important to know the frequency of this motion. Naturally, the more times someone swings, the more stress they are putting on their body.

For swing aggregations, I used the same time frames as in the earlier baserunning energy calculation: swings taken yesterday, in the last week, in the last month, and season-to-date. The working dataframe now includes torque and swing counts for each batter and game throughout the 2023 season, still sorted by baserunning energy yesterday in descending order.

In [None]:
# @title
'''

Calculate total number of swings players have taken yesterday, last week, last month, season to date

'''

# Drop the 'Name' column
result_df_modified = result_df.drop('Name', axis=1)

# Merge result and result_df_modified on the 'batter' column
merged_results = pd.merge(result, result_df_modified, on='batter')

# Columns before rearrangement
columns = list(merged_results.columns)

# Remove 'Team' from its current position and move to third column
columns.remove('Team')
columns.insert(2, 'Team')

# Reorder the DataFrame according to the new columns order
merged_results = merged_results[columns]

# Ensure game_date is in datetime format
statcast_data['game_date'] = pd.to_datetime(statcast_data['game_date'])

# Calculate total swings for each game based on at bat descriptions
statcast_data['total_swings'] = statcast_data['description'].isin([
    'hit_into_play', 'foul', 'swinging_strike', 'foul_tip', 'swinging_strike_blocked'
]).astype(int)

total_swings_grouped = statcast_data.groupby(['batter', 'game_date'])['total_swings'].sum().reset_index()

# Set the index to game_date for easier rolling window calculations
total_swings_grouped.set_index('game_date', inplace=True)

# Sum each batter's swings over the `days` days before each game.
# closed='left' excludes the current game date from its own window.
def calculate_rolling_swings(df, days):
    return df.groupby('batter')['total_swings'].transform(lambda x: x.rolling(f'{days}D', closed='left').sum())

# Yesterday (1 day back)
total_swings_grouped['swings_yesterday'] = calculate_rolling_swings(total_swings_grouped, 1)

# Last Week (7 days back)
total_swings_grouped['swings_last_week'] = calculate_rolling_swings(total_swings_grouped, 7)

# Last Month (30 days back)
total_swings_grouped['swings_last_month'] = calculate_rolling_swings(total_swings_grouped, 30)

# Season To Date assumes from the earliest date in the dataset to the current row's date
season_start = total_swings_grouped.index.min()
season_length = (total_swings_grouped.index.max() - season_start).days
total_swings_grouped['swings_season_to_date'] = calculate_rolling_swings(total_swings_grouped, season_length)

# Reset index to enable merging
total_swings_grouped.reset_index(inplace=True)

# Merge the enhanced total_swings_grouped with the additional columns into merged_results
merged_results_with_swings = pd.merge(
    merged_results,
    total_swings_grouped[['batter', 'game_date', 'swings_yesterday', 'swings_last_week', 'swings_last_month', 'swings_season_to_date']],
    on=['batter', 'game_date'],
    how='left'
)

# Drop the current team. Keep the correct team from the game date
merged_results_with_swings = merged_results_with_swings.drop('Team', axis=1)

#Replace NAs in swing columns with 0
merged_results_with_swings.update(merged_results_with_swings[['swings_yesterday', 'swings_last_week', 'swings_last_month', 'swings_season_to_date']].fillna(0))

# Display the DataFrame
merged_results_with_swings.sort_values(by='baserunning_energy_yesterday',ascending=False).head(10)


Unnamed: 0,batter,batter_name,game_date,batter_team,baserunning_energy_yesterday,baserunning_energy_last_week,baserunning_energy_last_month,baserunning_energy_season_to_date,Torque (Nm),swings_yesterday,swings_last_week,swings_last_month,swings_season_to_date
19983,650559,bryan de la cruz,2023-04-23,MIA,17.5,22.75,48.25,48.25,58.85656,14.0,44.0,133.0,133.0
2851,542932,jon berti,2023-09-28,MIA,15.75,28.75,47.75,262.1,43.987097,14.0,39.0,101.0,705.0
41771,682829,elly de la cruz,2023-06-24,CIN,15.5,29.25,69.0,69.0,58.080923,9.0,51.0,123.0,123.0
30427,666969,adolis garcia,2023-04-23,TEX,14.6,22.35,53.1,53.1,52.196117,8.0,40.0,143.0,143.0
21636,657041,lane thomas,2023-07-24,WSH,14.5,26.25,72.85,285.0,45.426925,5.0,38.0,199.0,729.0
24673,663656,kyle tucker,2023-09-11,HOU,13.5,20.75,63.75,387.7,62.366122,4.0,49.0,183.0,1059.0
39706,678662,ezequiel tovar,2023-09-17,COL,13.5,19.5,58.75,326.4,54.41958,18.0,49.0,247.0,1235.0
28649,665923,esteury ruiz,2023-04-27,OAK,13.35,22.35,72.45,72.45,51.214909,8.0,58.0,182.0,182.0
15892,641584,jake fraley,2023-06-04,CIN,13.0,20.75,66.5,119.1,55.294541,6.0,42.0,154.0,353.0
8414,596019,francisco lindor,2023-07-07,NYM,13.0,27.1,79.45,218.05,50.507968,13.0,49.0,192.0,716.0
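
The look-back windows used above can be checked on a toy series. This is a small sketch with made-up swing counts for a single batter; it shows that `closed='left'` turns each window into `[t - N days, t)`, so the game on day `t` never counts toward its own features.

```python
import pandas as pd

# Hypothetical swing counts for one batter on four game dates
swings = pd.Series(
    [10, 5, 7, 3],
    index=pd.to_datetime(['2023-04-01', '2023-04-02', '2023-04-05', '2023-04-09']),
    name='total_swings',
)

# closed='left' makes each window [t - N days, t), so the game on day t is excluded
swings_yesterday = swings.rolling('1D', closed='left').sum().fillna(0)
swings_last_week = swings.rolling('7D', closed='left').sum().fillna(0)

print(pd.DataFrame({
    'swings': swings,
    'yesterday': swings_yesterday,
    'last_week': swings_last_week,
}))
```

Empty windows come back as NaN, which is why the notebook fills missing swing counts with 0 afterwards.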


##**Adding Positions and Ages**

The next variables that I wanted to track to identify injury risks were position and age. Different positions require different amounts of activity and unique movements that may place more or less strain on the body. For example, infielders are expected to display quick, explosive movements, sprint short distances, change directions rapidly, and dive or leap for balls. Outfielders, on the other hand, must cover larger distances and occasionally may be subject to diving plays or wall collisions. Catchers are at especially high risk due to the physical demands of squatting for extended periods, making them prone to knee or back injuries. Because of this, it is common for players to change positions as they get older, such as a center fielder moving to the corner outfield or a catcher moving to first base.

Speaking of getting older, I also made sure to include age, as it's conventional wisdom that as athletes get older, their bodies are more likely to break down, leading to more frequent injury occurrences and longer recovery times. Some older players may have acquired life-long injuries at some point that they will be forced to manage for the rest of their careers. Perhaps the one advantage that older players may have when it comes to health is experience, contributing to refined technique and a better understanding of their body's needs, functions, and limitations.

primary_position simply reflects the position the player played most often during the 2023 season, not necessarily the position that they played on that date. Age, on the other hand, does reflect the player's precise age on that date.
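
As a rough illustration of how a primary position can be derived from fielder columns, here is a toy sketch with two hypothetical player IDs and only two fielding slots. It follows the same melt, count, and `idxmax` pattern as the cell below, on invented data.

```python
import pandas as pd

# Hypothetical pitch-level rows: who was playing 1B and SS on each pitch
plays = pd.DataFrame({
    'fielder_3': [101, 101, 202],
    'fielder_6': [202, 202, 101],
})
position_map = {'fielder_3': '1B', 'fielder_6': 'SS'}

# Reshape so each row is one (fielding slot, player) observation
long_form = plays.melt(var_name='slot', value_name='player_id')
long_form['position'] = long_form['slot'].map(position_map)

# Count appearances per player and position, then keep the most frequent one
counts = long_form.groupby(['player_id', 'position']).size().reset_index(name='counts')
primary = counts.loc[counts.groupby('player_id')['counts'].idxmax()]
primary_by_player = dict(zip(primary['player_id'], primary['position']))
print(primary_by_player)
```

When two positions are tied, `idxmax` keeps the first row it encounters, so ties are broken by sort order rather than by any baseball logic.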



In [None]:
# @title
# Step 1: Melt the DataFrame to transform position columns into rows
melted_data = pd.melt(statcast_data, id_vars=['game_date', 'batter'], value_vars=['fielder_2', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9'], var_name='position', value_name='player_id')

# Step 2: Map the position columns to human-readable positions
position_map = {
    'fielder_2': 'C',
    'fielder_3': '1B',
    'fielder_4': '2B',
    'fielder_5': '3B',
    'fielder_6': 'SS',
    'fielder_7': 'LF',
    'fielder_8': 'CF',
    'fielder_9': 'RF'
}
melted_data['position'] = melted_data['position'].map(position_map)

# Aggregate counts and determine primary position
position_counts = melted_data.groupby(['player_id', 'position']).size().reset_index(name='counts')
primary_position = position_counts.loc[position_counts.groupby('player_id')['counts'].idxmax()]

# Create a new DataFrame for batters and their primary positions
batter_primary_position = primary_position[['player_id', 'position']].rename(columns={'player_id': 'batter', 'position': 'primary_position'})

# Add primary position to the original df
result_with_position = merged_results_with_swings.merge(batter_primary_position, on='batter', how='left')

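# Batters who never appear at a fielding position are treated as designated hitters.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions deprecate.
result_with_position['primary_position'] = result_with_position['primary_position'].fillna('DH')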

'''
biofile.csv is a player biographical file whose PLAYERID values match the Retrosheet roster IDs merged below. Used to find BIRTHDATE, which is converted to age
'''
bio_df = pd.read_csv('biofile.csv')

# Select only the required columns
bio_df = bio_df[['PLAYERID','NICKNAME', 'LAST', 'BIRTHDATE','PLAY.LASTGAME']]

# Convert 'PLAY.LASTGAME' to datetime format
bio_df['PLAY.LASTGAME'] = pd.to_datetime(bio_df['PLAY.LASTGAME'])

# Filter to include only rows where 'PLAY.LASTGAME' is no earlier than 2023
# (.copy() avoids assigning into a slice of the original frame)
bio_df = bio_df[bio_df['PLAY.LASTGAME'].dt.year >= 2023].copy()

# Combine 'NICKNAME' and 'LAST' into one column 'FULL_NAME'
bio_df['FULL_NAME'] = bio_df['NICKNAME'] + ' ' + bio_df['LAST']

bio_df = bio_df.drop(['NICKNAME','LAST','PLAY.LASTGAME'],axis=1)

# Get a list of all column names
columns = list(bio_df.columns)

# Remove 'Full Name' from the list
columns.remove('FULL_NAME')

# Insert 'Full Name' at the desired position (second position, index 1)
columns.insert(1, 'FULL_NAME')

bio_df = bio_df[columns]

# URL of the zip file
url = 'https://www.retrosheet.org/rosters.zip'

# Send a GET request to the URL to fetch the zip file
response = requests.get(url)

# Raise an error if the download failed, so the parsing below never runs on missing data
response.raise_for_status()

# Initialize an empty string to concatenate file contents
concatenated_text = ''

# Use io.BytesIO to treat the fetched content as a file-like object for zipfile
with zipfile.ZipFile(io.BytesIO(response.content), 'r') as zip_ref:
    # Loop through each file in the zip archive
    for file in zip_ref.namelist():
        # Keep only the 2023 roster files (names ending in '2023.ROS')
        if file.endswith('2023.ROS'):
            with zip_ref.open(file) as f:
                # Append the file content, with a newline for separation
                concatenated_text += f.read().decode('utf-8') + "\n"

# Preprocess the concatenated text to replace "\r\n" with "\n"
formatted_text = concatenated_text.replace('\r\n', '\n')

# Use StringIO to treat the formatted text as a file-like object
df = pd.read_csv(StringIO(formatted_text), header=None)

# Rename columns according to the fields
df.columns = ['Player ID', 'Last Name', 'First Name', 'Batting Hand', 'Throwing Hand', 'Team', 'Position']

# Select only the required columns: Player ID and Team
# (.copy() avoids the SettingWithCopyWarning when 'Team' is modified below)
df = df[['Player ID', 'Team']].copy()

# Dictionary of teams to be replaced
team_replacements = {
    'TBA': 'TB',
    'NYA': 'NYY',
    'ANA': 'LAA',
    'CHA': 'CWS',
    'KCA': 'KC',
    'CHN': 'CHC',
    'LAN': 'LAD',
    'NYN': 'NYM',
    'SLN': 'STL',
    'SFN': 'SF',
    'SDN': 'SD',
    'ARI': 'AZ',
    'WAS': 'WSH'

}

# Replace the values in the 'Team' column
df['Team'] = df['Team'].replace(team_replacements)

# Merge bio_df with df to add the 'Team' column from df to bio_df based on matching 'PLAYERID' and 'Player ID'
bio_df_updated = pd.merge(bio_df, df[['Player ID', 'Team']], left_on='PLAYERID', right_on='Player ID', how='left')

# Drop the 'Player ID' column from the result since it's redundant
bio_df_updated.drop(columns=['Player ID'], inplace=True)

# Get the current column order
columns = list(bio_df_updated.columns)

# Remove 'Team' if it's already in the list to avoid duplication in the next step
columns.remove('Team')

# Insert 'Team' as the third column
columns.insert(2, 'Team')

# Reindex the DataFrame columns based on the new order
bio_df_updated = bio_df_updated[columns]

bio_df_updated = bio_df_updated.drop('PLAYERID',axis=1)

# Step 1: Strip accent marks from 'FULL_NAME' in 'bio_df_updated' and lowercase it
bio_df_updated['FULL_NAME'] = bio_df_updated['FULL_NAME'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii').lower()
)

bio_df_updated = bio_df_updated.drop_duplicates()

# Step 2: Apply the same normalization to 'batter_name' in 'result_with_position'
result_with_position['batter_name'] = result_with_position['batter_name'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii').lower()
)
# Function to replace '. ' with '.' in initials
def replace_initials(name):
    # Pattern to match initials e.g., 'j. ' to 'j.'
    pattern = re.compile(r'(?<=\b[A-Za-z])\. (?=[A-Za-z]\.)')
    # Replace '. ' with '.' for initials
    return pattern.sub('.', name)

# Apply the function to clean 'batter_name'
result_with_position['batter_name'] = result_with_position['batter_name'].apply(replace_initials)

# Identify Unique and Shared Names
# Count 'batter' IDs per 'batter_name'
batter_counts = result_with_position.groupby('batter_name')['batter'].nunique()

# Names with single 'batter' ID (unique players)
unique_players = batter_counts[batter_counts == 1].index

# For 'result_with_position'
result_with_position['merge_key'] = result_with_position.apply(
    lambda x: f"{x['batter_name']}-{x['batter_team']}" if x['batter_name'] not in unique_players else x['batter_name'], axis=1)

# For 'bio_df_updated'
bio_df_updated['merge_key'] = bio_df_updated.apply(
    lambda x: f"{x['FULL_NAME']}-{x['Team']}" if x['FULL_NAME'] not in unique_players else x['FULL_NAME'], axis=1)


# Step 3: Merge Data
merged_df = pd.merge(
    result_with_position,
    bio_df_updated[['merge_key', 'BIRTHDATE']],
    on='merge_key',
    how='left'
)

# Drop the 'merge_key' column as it's no longer needed
merged_df.drop(columns=['merge_key'], inplace=True)

# Convert 'BIRTHDATE' to datetime
merged_df['BIRTHDATE'] = pd.to_datetime(merged_df['BIRTHDATE'])

# Ensure 'game_date' is also in datetime format (if it's not already)
merged_df['game_date'] = pd.to_datetime(merged_df['game_date'])

# Calculate age in years
merged_df['age'] = (merged_df['game_date'] - merged_df['BIRTHDATE']).dt.days / 365.25

merged_df['age'] = merged_df['age'].round(3).astype(float)

merged_df = merged_df.drop('BIRTHDATE',axis=1)

merged_df = merged_df.drop_duplicates()

merged_df.sort_values(by='baserunning_energy_yesterday', ascending=False).head(10)



Unnamed: 0,batter,batter_name,game_date,batter_team,baserunning_energy_yesterday,baserunning_energy_last_week,baserunning_energy_last_month,baserunning_energy_season_to_date,Torque (Nm),swings_yesterday,swings_last_week,swings_last_month,swings_season_to_date,primary_position,age
22436,650559,bryan de la cruz,2023-04-23,MIA,17.5,22.75,48.25,48.25,58.85656,14.0,44.0,133.0,133.0,LF,26.349
3199,542932,jon berti,2023-09-28,MIA,15.75,28.75,47.75,262.1,43.987097,14.0,39.0,101.0,705.0,SS,33.681
45259,682829,elly de la cruz,2023-06-24,CIN,15.5,29.25,69.0,69.0,58.080923,9.0,51.0,123.0,123.0,SS,1.448
33550,666969,adolis garcia,2023-04-23,TEX,14.6,22.35,53.1,53.1,52.196117,8.0,40.0,143.0,143.0,RF,30.141
24125,657041,lane thomas,2023-07-24,WSH,14.5,26.25,72.85,285.0,45.426925,5.0,38.0,199.0,729.0,RF,27.918
27484,663656,kyle tucker,2023-09-11,HOU,13.5,20.75,63.75,387.7,62.366122,4.0,49.0,183.0,1059.0,RF,26.648
43194,678662,ezequiel tovar,2023-09-17,COL,13.5,19.5,58.75,326.4,54.41958,18.0,49.0,247.0,1235.0,SS,22.127
31690,665923,esteury ruiz,2023-04-27,OAK,13.35,22.35,72.45,72.45,51.214909,8.0,58.0,182.0,182.0,CF,24.194
18064,641584,jake fraley,2023-06-04,CIN,13.0,20.75,66.5,119.1,55.294541,6.0,42.0,154.0,353.0,RF,28.027
9936,596019,francisco lindor,2023-07-07,NYM,13.0,27.1,79.45,218.05,50.507968,13.0,49.0,192.0,716.0,SS,29.643
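
One row in the table above shows an age of 1.448, which points to a mis-recorded or mismatched birthdate rather than a real age. A simple plausibility check can surface such rows before modeling. The sketch below uses hypothetical players and assumed bounds of 18 to 50 years; neither the bounds nor the `age_suspect` column come from the notebook.

```python
import pandas as pd

# Hypothetical rows: one plausible birthdate and one that is clearly wrong
ages = pd.DataFrame({
    'batter_name': ['player a', 'player b'],
    'game_date': pd.to_datetime(['2023-01-01', '2023-06-24']),
    'BIRTHDATE': pd.to_datetime(['2000-01-01', '2022-01-11']),
})

# Same age formula as the notebook: elapsed days divided by 365.25
ages['age'] = ((ages['game_date'] - ages['BIRTHDATE']).dt.days / 365.25).round(3)

# Assumed plausibility bounds for an MLB position player; adjust as needed
MIN_AGE, MAX_AGE = 18, 50
ages['age_suspect'] = ~ages['age'].between(MIN_AGE, MAX_AGE)

print(ages[['batter_name', 'age', 'age_suspect']])
```

Flagged rows could then be corrected by hand or dropped, depending on how many there are.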


##**Team Miles Traveled**

While many of the variables discussed so far are fairly common knowledge, one that has drawn increasing interest from teams in recent years is sleep. The Chicago Cubs have equipped their players with [sleep monitoring technology](https://fatiguescience.com/blog/the-chicago-cubs-turn-to-wearable-sleep-tech-for-edge-in-mlb-pursuit/), and sleep became a [hot topic in Milwaukee](https://www.mlb.com/news/brewers-discuss-sleeping-schedules-and-routines) after the Brewers played 17 games in 17 days and Yusei Kikuchi sparked debate by revealing he typically gets 14 hours of sleep!

Although I don't have access to exactly how much sleep each player gets per night, I can hypothesize that frequent travel has a major impact on players' sleep schedules. Anyone who has ever taken a flight has probably experienced the jet lag that comes with switching time zones and the difficulty of trying to snooze on an airplane, even though the planes that MLB players travel on are far more comfortable than the average commercial flight.

To measure the impact that traveling may have on injury risk, I will add the distance traveled that each player has accumulated before each game date, again using yesterday, last week, last month, and season-to-date as time frames. First, I will need to get the stadium coordinates for each game of the 2023 season:

In [None]:
# @title
'''

Getting Stadium Coordinates for every 2023 MLB Game (need to fix international/special location games)

'''

stadium_coordinates = pd.read_csv('mlb_game_log.csv')

# Creating the home games DataFrame
home_games = statcast_data[['game_date', 'home_team', 'away_team']].drop_duplicates()
# For home games, the 'team' is the 'home_team'
home_games_home = home_games.copy()
home_games_home['team'] = home_games_home['home_team']
# For away games, the 'team' is the 'away_team', but we keep the 'home_team' column to indicate the home team
home_games_away = home_games.copy()
home_games_away['team'] = home_games_away['away_team']

# Combining the adjusted home and away rows into a single DataFrame
combined_games = pd.concat([home_games_home[['game_date', 'team', 'home_team']], home_games_away[['game_date', 'team', 'home_team']]])

combined_games.sort_values(by=['game_date', 'home_team', 'team'], inplace=True)

# Reset index
combined_games.reset_index(drop=True, inplace=True)

# Dictionary of team names to be replaced to match before merging
team_replacements = {
    'WAS': 'WSH',
    'CHW': 'CWS',
    'ARI': 'AZ',
    'KCR': 'KC',
    'SDP': 'SD',
    'SFG': 'SF',
    'TBR': 'TB'
}

# Replace the values in the 'Team' column
stadium_coordinates['Team'] = stadium_coordinates['Team'].replace(team_replacements)

# Parse the YYYYMMDD dates in the stadium log as datetimes
stadium_coordinates['Date'] = pd.to_datetime(stadium_coordinates['Date'], format='%Y%m%d')

# Ensure game_date is also a datetime so the merge keys match
combined_games['game_date'] = pd.to_datetime(combined_games['game_date'])

df = pd.merge(combined_games, stadium_coordinates, left_on=['home_team','game_date'], right_on=['Team','Date'], how='left')

df.drop(['Team','home_team'], axis=1, inplace=True)

df.head(10)

Unnamed: 0,game_date,team,Date,STADIUM LATITUDE,STADIUM LONGITUDE
0,2023-03-30,BAL,2023-03-30,42.346222,-71.09771
1,2023-03-30,BOS,2023-03-30,42.346222,-71.09771
2,2023-03-30,CHC,2023-03-30,41.948059,-87.655647
3,2023-03-30,MIL,2023-03-30,41.948059,-87.655647
4,2023-03-30,CIN,2023-03-30,39.09721,-84.506462
5,2023-03-30,PIT,2023-03-30,39.09721,-84.506462
6,2023-03-30,CWS,2023-03-30,29.757179,-95.355537
7,2023-03-30,HOU,2023-03-30,29.757179,-95.355537
8,2023-03-30,KC,2023-03-30,39.05164,-94.480431
9,2023-03-30,MIN,2023-03-30,39.05164,-94.480431


Then, using the Haversine formula, I will find the distance that each player's team traveled between consecutive games during the season. If you're curious about MLB travel distances by team, you can check [Baseball Savant's animation](https://baseballsavant.mlb.com/visuals/map), which dates back to 1974. In 2023, the Oakland Athletics traveled more than any other team with 51,527 miles, and the Milwaukee Brewers traveled the least with only 25,426 total miles.
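
As a quick sanity check on the formula, the sketch below computes the great-circle distance between the first two stadium coordinates shown in the table above. The function name `haversine_miles` is hypothetical; it uses the same approximate Earth radius as the notebook and also accepts NumPy arrays, which would allow whole columns to be processed without a row-wise `apply`.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956  # same approximate radius the notebook uses

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles; accepts scalars or NumPy arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# (longitude, latitude) pairs copied from rows 0 and 2 of the stadium table
first_stadium = (-71.09771, 42.346222)
second_stadium = (-87.655647, 41.948059)

trip_miles = float(haversine_miles(*first_stadium, *second_stadium))
print(round(trip_miles))
```

The result is a straight-line estimate over the Earth's surface, so actual flight paths will be somewhat longer.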

In [None]:
# @title
# Function to calculate distance using Haversine formula
def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 3956 # Radius of Earth in miles
    return c * r

# Sort the dataframe by team and date to calculate distances correctly
df.sort_values(['team', 'game_date'], inplace=True)

# Calculate the distance between consecutive games for each team.
df['prev_latitude'] = df.groupby('team')['STADIUM LATITUDE'].shift(1)
df['prev_longitude'] = df.groupby('team')['STADIUM LONGITUDE'].shift(1)

# Calculate distance using the Haversine formula
df['distance_traveled'] = df.apply(lambda x: haversine(x['prev_longitude'], x['prev_latitude'], x['STADIUM LONGITUDE'], x['STADIUM LATITUDE']), axis=1)

# Fill NaN distances with 0 (first game of the season for each team)
df['distance_traveled'] = df['distance_traveled'].fillna(0)

# Initialize columns for aggregated distances
df['miles_traveled_yesterday'] = np.nan
df['miles_traveled_last_week'] = np.nan
df['miles_traveled_last_month'] = np.nan
df['miles_traveled_season_to_date'] = np.nan

# For each row in the DataFrame, calculate the required distances
for index, row in df.iterrows():
    team = row['team']
    date = row['game_date']

    # "Yesterday": the trip into this game, i.e. the distance from the team's previous game site
    df.loc[index, 'miles_traveled_yesterday'] = df[(df['team'] == team) & (df['game_date'] == date)]['distance_traveled'].sum()

    # Last week
    week_ago = date - timedelta(days=7)
    df.loc[index, 'miles_traveled_last_week'] = df[(df['team'] == team) & (df['game_date'] > week_ago) & (df['game_date'] <= date)]['distance_traveled'].sum()

    # Last month
    month_ago = date - pd.DateOffset(months=1)
    df.loc[index, 'miles_traveled_last_month'] = df[(df['team'] == team) & (df['game_date'] > month_ago) & (df['game_date'] <= date)]['distance_traveled'].sum()

    # Season to date
    df.loc[index, 'miles_traveled_season_to_date'] = df[(df['team'] == team) & (df['game_date'] <= date)]['distance_traveled'].sum()

# Finalize the DataFrame
final_df = df[['game_date', 'team', 'miles_traveled_yesterday', 'miles_traveled_last_week', 'miles_traveled_last_month', 'miles_traveled_season_to_date']].copy()

# Merge the dataframes on 'game_date' and 'batter_team'/'team'
merged_df = pd.merge(merged_df, final_df, left_on=['game_date', 'batter_team'], right_on=['game_date', 'team'], how='left')

merged_df.drop(['team'], axis=1, inplace=True)

merged_df.sort_values(by='baserunning_energy_yesterday', ascending=False).head(10)

Unnamed: 0,batter,batter_name,game_date,batter_team,baserunning_energy_yesterday,baserunning_energy_last_week,baserunning_energy_last_month,baserunning_energy_season_to_date,Torque (Nm),swings_yesterday,swings_last_week,swings_last_month,swings_season_to_date,primary_position,age,miles_traveled_yesterday,miles_traveled_last_week,miles_traveled_last_month,miles_traveled_season_to_date
20021,650559,bryan de la cruz,2023-04-23,MIA,17.5,22.75,48.25,48.25,58.85656,14.0,44.0,133.0,133.0,LF,26.349,0.0,1088.450288,2197.533515,2197.533515
2858,542932,jon berti,2023-09-28,MIA,15.75,28.75,47.75,262.1,43.987097,14.0,39.0,101.0,705.0,SS,33.681,0.0,1096.901629,5925.996344,40294.405403
41837,682829,elly de la cruz,2023-06-24,CIN,15.5,29.25,69.0,69.0,58.080923,9.0,51.0,123.0,123.0,SS,1.448,0.0,891.760266,3914.744342,13556.997719
30479,666969,adolis garcia,2023-04-23,TEX,14.6,22.35,53.1,53.1,52.196117,8.0,40.0,143.0,143.0,RF,30.141,0.0,1102.311669,2968.809296,2968.809296
21676,657041,lane thomas,2023-07-24,WSH,14.5,26.25,72.85,285.0,45.426925,5.0,38.0,199.0,729.0,RF,27.918,0.0,598.196212,5130.674299,25844.248365
39767,678662,ezequiel tovar,2023-09-17,COL,13.5,19.5,58.75,326.4,54.41958,18.0,49.0,247.0,1235.0,SS,22.127,0.0,944.843192,6073.924685,34415.723755
24716,663656,kyle tucker,2023-09-11,HOU,13.5,20.75,63.75,387.7,62.366122,4.0,49.0,183.0,1059.0,RF,26.648,0.0,230.480085,5703.109873,33420.14833
28699,665923,esteury ruiz,2023-04-27,OAK,13.35,22.35,72.45,72.45,51.214909,8.0,58.0,182.0,182.0,CF,24.194,0.0,2653.751779,8336.990881,8336.990881
10865,607208,trea turner,2023-07-16,PHI,13.0,13.75,67.45,253.2,49.004213,12.0,30.0,223.0,842.0,SS,30.042,0.0,1018.076274,5998.998714,23392.626473
42236,682998,corbin carroll,2023-04-26,AZ,13.0,27.0,78.6,78.6,41.120046,11.0,51.0,163.0,163.0,RF,22.678,0.0,1269.032214,4715.561869,4715.561869
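
The row-by-row loop above can also be expressed with pandas' time-based rolling windows, which tends to be faster on larger schedules. The sketch below is one possible approach, not the notebook's method: `add_rolling_miles` is a hypothetical helper, the trip data is invented, and it assumes at most one row per team per date.

```python
import pandas as pd

def add_rolling_miles(trips):
    """Hypothetical helper: expects columns team, game_date (datetime) and
    distance_traveled, with at most one row per team per date."""
    trips = trips.sort_values(['team', 'game_date']).reset_index(drop=True)
    by_team = trips.set_index('game_date').groupby('team')['distance_traveled']
    # rolling('7D') defaults to the window (t - 7 days, t],
    # matching the loop's "> week_ago and <= date" filter
    trips['miles_traveled_last_week'] = by_team.transform(
        lambda s: s.rolling('7D').sum()).to_numpy()
    # A running total per team reproduces the season-to-date sum
    trips['miles_traveled_season_to_date'] = by_team.cumsum().to_numpy()
    return trips

# Invented trips for two teams
toy_trips = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'B', 'B'],
    'game_date': pd.to_datetime(['2023-04-01', '2023-04-02', '2023-04-05',
                                 '2023-04-09', '2023-04-01', '2023-04-03']),
    'distance_traveled': [0.0, 100.0, 200.0, 50.0, 0.0, 300.0],
})
rolled = add_rolling_miles(toy_trips)
print(rolled)
```

The loop's monthly window uses a calendar month (`pd.DateOffset(months=1)`), so a fixed window such as `'30D'` would only approximate it; that column is left out of the sketch for this reason.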


##**Add Target Variable**

Now that I have all my feature columns, it's time to add my target variable: whether the player got injured on that specific game date. I flag a game date as an injury date when it matches the injury date recorded with the player's Injured List (IL) placement.

Teams may give a player a few days of rest before deciding to put them on the Injured List. If they do decide to put that player on the IL, they can set a retroactive date, which can be no earlier than the player's last appearance. The recorded date is not guaranteed to be the exact date that the player got injured, but it is the closest estimate.

I pulled 2023 Injured List data from [FanGraphs' Injury Report](https://www.fangraphs.com/roster-resource/injury-report?season=2023&groupby=all). I then filtered out pitchers, as well as injuries that are less likely to come from overuse, such as concussions, bone fractures, syndromes, and hand injuries. The following dataframe displays the first 10 rows of my filtered 2023 injuries dataset:

In [None]:
# @title
'''

To get 2023, I manually pulled the data from Fangraphs' injury report: https://www.fangraphs.com/roster-resource/injury-report?season=2023&groupby=all

'''

file_path = 'mlb_injuries_2023.xlsx'

# Read the Excel file
injuries_df = pd.read_excel(file_path, engine='openpyxl')

# Remove any rows where the injury seems to be something that happens by chance (rather than overuse). Broken bones are often random and could be caused by a hit by pitch. Overuse does not have much to do with these.
# Also remove surgeries, as these are often lingering injuries and hard to determine when they occurred. Other miscellaneous injuries removed include Anxiety, Infections, and Concussions.
injuries_df = injuries_df[~injuries_df['Injury / Surgery'].str.contains('surgery|fractured|TBD|Anxiety|Appendicitis|Torn|concussion|Hand|Blisters|syndrome|Facial|Foot|Sprained|infection|Heel|bone|sugery|Blister|laceration|contusion|Finger', case=False, na=False)]

# Convert the 'Injury / Surgery Date' column to datetime, coercing errors to NA
injuries_df['Injury / Surgery Date'] = pd.to_datetime(injuries_df['Injury / Surgery Date'], errors='coerce')

# Remove rows where 'Pos' is 'SP' or 'RP'
injuries_df = injuries_df[~injuries_df['Pos'].isin(['SP', 'RP'])]

# Drop rows where 'Injury / Surgery Date' is NA. Need to know when injury occurred for this exercise.
injuries_df = injuries_df.dropna(subset=['Injury / Surgery Date'])

# Reset index
injuries_df.reset_index(drop=True, inplace=True)

# Renaming Injury / Surgery Date column
injuries_df.rename(columns={'Injury / Surgery Date': 'injury_date'}, inplace=True)

# Displaying the head of the injury df
injuries_df.head(10)



Unnamed: 0,Name,Team,Pos,injury_date,Injury / Surgery,Status,IL Retro Date,Eligible to Return,Return Date,Latest Update
0,Victor Robles,WSN,OF,2023-06-20,Back spasms,60-Day IL,2023-06-21,2023-08-20,,No timetable for return
1,Jake Marisnick,LAD,OF,2023-07-18,Strained hamstring,60-Day IL,2023-07-19,2023-09-17,,Rehab assignment (9/12)
2,Greg Jones,TBR,INF/OF,2023-07-22,Strained hamstring,60-Day IL,2023-09-16,2023-11-15,,Out for 2023 season
3,Otto Lopez,TOR,INF/OF,2023-07-22,Strained oblique,60-Day IL,2023-08-01,2023-09-30,,Rehab assignment (9/19)
4,Brad Miller,TEX,INF/OF,2023-08-01,Strained hamstring,60-Day IL,2023-08-02,2023-10-01,,Out for 2023 season
5,Starling Marte,NYM,OF,2023-08-05,Strained groin,10-Day IL,2023-08-06,2023-08-16,,Out for 2023 season
6,Mark Mathias,SFG,INF/OF,2023-08-13,Strained shoulder,60-Day IL,2023-08-14,2023-10-13,,Out for 2023 season
7,Billy McKinney,NYY,OF/1B,2023-08-20,Back spasms,10-Day IL,2023-08-21,2023-08-31,,No timetable for return
8,Avisaíl García,MIA,OF,2023-08-22,Strained hamstring,10-Day IL,2023-08-23,2023-09-02,,No timetable for return
9,Matt McLain,CIN,INF,2023-08-27,Strained oblique,10-Day IL,2023-08-28,2023-09-07,,Out for 2023 season


In my working dataframe, I added a new column called got_injured which is set to 1.0 if the player got hurt on that specific date, and 0.0 if the player did not get injured. The following shows the full working data frame for some rows where got_injured was 1.0:

In [None]:
# @title
'''

Adding got_injured column to original df

'''

# remove accents and lowercase
injuries_df['Name'] = injuries_df['Name'].apply(
    lambda name: unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii').lower()
)

injuries_df['got_injured'] = 1

# Merge the dataframes on 'game_date'/'injury_date' and 'batter_name'/'Name'
final_merged_df = pd.merge(merged_df, injuries_df, left_on=['batter_name', 'game_date'], right_on=['Name', 'injury_date'], how='left')

# Drop unnecessary columns
final_merged_df.drop(['Name','Team','Pos','Injury / Surgery','IL Retro Date','Eligible to Return','Return Date','Latest Update','Status','injury_date'], axis=1, inplace=True)

# If a player did not get injured, set got_injured to 0
final_merged_df['got_injured'] = final_merged_df['got_injured'].fillna(0)

# Display df head
final_merged_df.sort_values(by='got_injured', ascending=False).head(10)


Unnamed: 0,batter,batter_name,game_date,batter_team,baserunning_energy_yesterday,baserunning_energy_last_week,baserunning_energy_last_month,baserunning_energy_season_to_date,Torque (Nm),swings_yesterday,swings_last_week,swings_last_month,swings_season_to_date,primary_position,age,miles_traveled_yesterday,miles_traveled_last_week,miles_traveled_last_month,miles_traveled_season_to_date,got_injured
34733,670042,luke raley,2023-09-20,TB,0.0,4.0,38.9,243.8,49.405346,0.0,8.0,130.0,795.0,RF,29.002,0.0,1803.924066,7489.776623,36628.877052,1.0
33697,669357,nolan gorman,2023-09-12,STL,1.0,13.5,23.5,231.55,60.376649,14.0,59.0,122.0,921.0,2B,23.34,0.0,781.508593,3091.952503,24003.495354,1.0
24231,663611,nick madrigal,2023-09-16,CHC,1.0,11.6,44.35,173.4,39.894674,12.0,37.0,127.0,469.0,3B,26.533,0.0,1502.170577,2868.733266,22260.016708,1.0
26394,664056,harrison bader,2023-09-17,CIN,0.0,4.5,32.75,202.55,42.200153,0.0,18.0,131.0,597.0,CF,29.29,0.0,724.143269,5625.313257,25964.681026,1.0
5275,571745,mitch haniger,2023-09-25,SF,0.0,3.75,34.75,113.45,55.654944,3.0,21.0,123.0,421.0,LF,32.756,345.251119,1287.936193,6270.86832,42765.954288,1.0
4158,545350,jake marisnick,2023-07-18,LAD,3.5,4.1,29.7,55.7,53.607537,3.0,9.0,87.0,175.0,CF,32.301,0.0,2629.759754,5434.12884,24043.378283,1.0
16463,641856,billy mckinney,2023-08-20,NYY,0.0,2.25,19.25,62.1,58.367638,0.0,18.0,103.0,270.0,LF,28.991,0.0,1365.189755,6056.198284,27207.447644,1.0
5822,572761,matt carpenter,2023-09-10,SD,0.0,1.6,10.35,86.45,62.052398,0.0,21.0,37.0,403.0,1B,37.788,0.0,1300.771449,6330.337814,38303.451029,1.0
35636,671213,triston casas,2023-09-14,BOS,0.0,9.75,53.35,247.1,62.710767,0.0,47.0,191.0,889.0,1B,23.663,0.0,1197.289282,7098.319783,33895.476942,1.0
19342,649966,luis urias,2023-09-20,BOS,1.1,10.35,29.4,79.5,44.304188,9.0,47.0,113.0,292.0,2B,26.297,0.0,1643.041346,8137.8579,35538.518288,1.0


##**Run Model**

Finally, it's time to run the model. I split the data into training and test sets, allocating 70% of the data for training the model, and the remaining 30% of the data for testing the model.

When deciding which model to pick for this project, I focused on one key issue: my dataset is very imbalanced. There are far more instances where players did not get hurt (got_injured is 0.0) than instances where they did. With a highly imbalanced dataset, it is very difficult for a model to capture the minority class. Even players with a high relative risk of injury are unlikely to get injured on any given day, so a model will rarely predict a positive case.

I decided to use the XGBoost classifier as my model for this experiment, primarily because of its effectiveness in handling imbalanced datasets. I used XGBoost's 'scale_pos_weight' parameter, set to the ratio of non-injury rows to injury rows in the training data, to shift the algorithm's balance toward the minority class, which in this context is the set of instances where injuries occurred.

XGBoost offers some additional advantages, such as its ability to capture nonlinear relationships and interactions between features, and its efficiency when handling large datasets.

After training the model and applying it to my test set, I created a dataframe to display the highest injury risk probabilities for any player and game during the 2023 season.

Unfortunately, the model did not do well at predicting injuries on the target date itself for its highest-risk players, and the Precision-Recall AUC (a performance metric where a higher score is better) was quite low, although that is generally expected with a dataset this imbalanced. What the model did much better was flag players who were injured the next day. The highest predicted injury probability in the test set belonged to Shohei Ohtani on 09/02/2023; Ohtani was placed on the Injured List retroactive to 09/03/2023 with a strained oblique. The third highest belonged to Jake Marisnick on 07/17/2023; Marisnick was placed on the Injured List retroactive to 07/18/2023 with a strained hamstring.
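
One way to put the Precision-Recall AUC in context: for a model that ranks rows at random, the expected precision at every recall level is roughly the share of positive rows, so that share is a common reference point. The following is a minimal sketch rather than part of the pipeline in this notebook. It uses the counts and score reported in the model output further down (7 injured rows out of 13,421 test rows), and the helper name `prevalence_baseline` is hypothetical.

```python
def prevalence_baseline(n_positive: int, n_total: int) -> float:
    """Share of positive rows; the approximate PR AUC of a random ranking."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_positive / n_total


# Figures reported in the model output for the test set
n_injured = 7
n_rows = 13421
observed_pr_auc = 0.0011577552212508975

baseline = prevalence_baseline(n_injured, n_rows)
lift = observed_pr_auc / baseline

print(f"Random-ranking baseline: {baseline:.6f}")
print(f"Observed PR AUC:         {observed_pr_auc:.6f}")
print(f"Ratio to baseline:       {lift:.2f}x")
```

Read this way, the score is very low in absolute terms but still roughly double what a random ranking would be expected to achieve. With only 7 positive rows in the test set, that ratio is a noisy estimate and should be treated with caution.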

In fact, in 3 of the 10 instances with the highest predicted injury probability, the player was placed on the Injured List with a retroactive date one day after the game date in the test set. Considering that there were only 5 instances in the test set where a player got injured on the next day, out of 13,421 total instances, it is quite impressive that the model caught 3 of them among its 10 highest-risk predictions.
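
To get a rough sense of how unlikely that overlap would be by chance, one can ask: if 10 rows were drawn at random from the 13,421 test rows, 5 of which are next-day injuries, how often would at least 3 of the 10 be next-day injuries? The sketch below answers this with the hypergeometric distribution, using only the standard library. The function name `hypergeom_tail` is hypothetical, and the calculation assumes a purely random draw. It ignores the fact that the top-10 cutoff and the next-day framing were chosen after looking at the results, so it is a back-of-the-envelope check rather than a formal significance test.

```python
from math import comb


def hypergeom_tail(population: int, successes: int, draws: int, at_least: int) -> float:
    """P(X >= at_least) when drawing `draws` rows without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    hits = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(at_least, upper + 1)
    )
    return hits / total


# Counts taken from the text above
n_test_rows = 13421
n_next_day_injuries = 5
top_k = 10
observed_hits = 3

expected_hits = top_k * n_next_day_injuries / n_test_rows
p_chance = hypergeom_tail(n_test_rows, n_next_day_injuries, top_k, observed_hits)

print(f"Expected next-day injuries in a random top {top_k}: {expected_hits:.4f}")
print(f"Chance of {observed_hits} or more by luck: {p_chance:.2e}")
```

Under these assumptions, a random top 10 would be expected to contain far fewer than one next-day injury, which is why finding three stands out.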

In [None]:
# @title
# Categorical columns needing encoding
categorical_features = ['primary_position','batter_team']

# scikit-learn 1.2 renamed `sparse` to `sparse_output`; the old name was removed in 1.4
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_features = encoder.fit_transform(final_merged_df[categorical_features])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out())
final_merged_df_encoded = pd.concat([final_merged_df.drop(columns=categorical_features), encoded_df], axis=1)

#  Set feature variables
features_columns = final_merged_df_encoded.columns.drop(['batter_name','game_date', 'got_injured'])
features = final_merged_df_encoded[features_columns]

#  Set target variable
target = final_merged_df_encoded['got_injured']

#  Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
scale_pos_weight = float(np.sum(y_train == 0)) / np.sum(y_train == 1)

clf = xgb.XGBClassifier(scale_pos_weight=scale_pos_weight, use_label_encoder=False, eval_metric='logloss')
clf.fit(X_train, y_train)

# Use StratifiedKFold for cross-validation in CalibratedClassifierCV
cv = StratifiedKFold(n_splits=5)
calibrated_clf = CalibratedClassifierCV(clf, method='sigmoid', cv=cv)
calibrated_clf.fit(X_train, y_train)

y_pred_proba_calibrated = calibrated_clf.predict_proba(X_test)[:, 1]

# Evaluate model using Precision-Recall AUC
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_calibrated)
auc_score = auc(recall, precision)
print(f"Precision-Recall AUC: {auc_score}")

predictions_df = pd.DataFrame({'predicted_injury_probability': y_pred_proba_calibrated}, index=X_test.index)
result_df = final_merged_df.loc[X_test.index, ['batter_name', 'game_date', 'got_injured']].copy()
result_df = result_df.merge(predictions_df, left_index=True, right_index=True)
result_df.rename(columns={'got_injured': 'actual_injury'}, inplace=True)

# Ensure game_date is in datetime format before doing date arithmetic on it
final_merged_df['game_date'] = pd.to_datetime(final_merged_df['game_date'])

# Create a next_game column which is equal to the day after game_date
final_merged_df['next_game'] = final_merged_df['game_date'] + pd.Timedelta(days=1)

# Create a temporary DataFrame to facilitate the merge
temp_df = final_merged_df[['batter', 'game_date', 'got_injured']].copy()

# Rename the columns in temp_df to match the 'next_game' and align 'got_injured' for merging
temp_df.rename(columns={'game_date': 'next_game', 'got_injured': 'injured_next_day'}, inplace=True)

# Merge the original DataFrame with the temporary one
final_merged_df = pd.merge(final_merged_df, temp_df, on=['batter', 'next_game'], how='left')

final_merged_df.drop(['next_game'], axis=1, inplace=True)

# Ensure game_date in both DataFrames are in datetime format for proper comparison
final_merged_df['game_date'] = pd.to_datetime(final_merged_df['game_date'])
result_df['game_date'] = pd.to_datetime(result_df['game_date'])

# Perform the merge operation
result_df = pd.merge(result_df,
                     final_merged_df[['batter_name', 'game_date', 'injured_next_day']],
                     left_on=['batter_name', 'game_date'],
                     right_on=['batter_name', 'game_date'],
                     how='left')


result_df.sort_values(by='predicted_injury_probability', ascending=False, inplace=True)


#  Insert 'predicted_injury_probability' in the third column

columns = list(result_df.columns)
columns.remove('predicted_injury_probability')
columns.insert(2, 'predicted_injury_probability')
result_df = result_df[columns]

# Replace NaN values in the 'injured_next_day' column with "Did Not Play"
result_df['injured_next_day'] = result_df['injured_next_day'].fillna("Did Not Play")

# Display Highest Injury Probability Instances
print("Number of rows where got_injured is 1.0:", result_df[result_df['actual_injury'] == 1.0].shape[0])
print("Total number of rows in result_df:", result_df.shape[0])
result_df.head(55)



Precision-Recall AUC: 0.0011577552212508975
Number of rows where got_injured is 1.0: 7
Total number of rows in result_df: 13421


Unnamed: 0,batter_name,game_date,predicted_injury_probability,actual_injury,injured_next_day
6385,shohei ohtani,2023-09-02,0.20066,0.0,1.0
3578,jose siri,2023-07-29,0.200413,0.0,0.0
3458,jake marisnick,2023-07-17,0.200389,0.0,1.0
1691,rafael ortega,2023-09-14,0.200377,0.0,0.0
1453,xander bogaerts,2023-09-01,0.200377,0.0,0.0
2701,travis blankenhorn,2023-09-09,0.200377,0.0,Did Not Play
11079,brendan rodgers,2023-09-23,0.197596,0.0,0.0
4333,jake marisnick,2023-07-16,0.178594,0.0,0.0
7844,luis urias,2023-09-19,0.156898,0.0,1.0
3722,willson contreras,2023-09-18,0.139422,0.0,Did Not Play


A metric like predicted injury probability can be very valuable for managers as they set their lineups. Seeing a high injury risk next to one of their star players may be a wake-up call that the player needs a day off, or else the team risks losing that player for an extended period of time.

Of course, deciding when a player needs a day off is a very nuanced decision, and any model should be used in conjunction with other observational, non-quantifiable inputs, like simply asking the player how he feels that day. Teams can surely improve on my model by collecting data that I don't have access to, for example by surveying players or collecting biometric data through wearable technology.

In conclusion, while injuries can seem to be completely random, and sometimes they are, models like this can sometimes help us identify warning signs that a player is at a higher risk for injury. Predicted Injury Probability could be just one of many metrics that managers have available to them to make more informed lineup and in-game decisions.