# Data Collection (_pybaseball_)

The goal of this notebook is to collect batting and pitching data with the _pybaseball_ library for building an at-bat prediction model.

In [19]:
import pandas as pd
from pybaseball import statcast, playerid_lookup
from datetime import datetime

### Establishing the Batter and Pitcher Matchup: ###

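The matchup is defined by the names of one batter and one pitcher. They are hard-coded below; the commented-out `input()` calls can be restored to prompt for them instead.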

The data covers the 2024 MLB season. To use a different year, change the `start_date` and `end_date` variables below.
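As a rough sketch of how the year swap could be parameterized, the helper below builds both date strings from a season year. The function name `season_window` and the April 1 to November 30 window are assumptions made for illustration; they are not part of _pybaseball_, and actual season dates vary from year to year.

```python
from datetime import date

def season_window(year, today=None):
    """Return (start_date, end_date) as YYYY-MM-DD strings for one season.

    Assumption: April 1 through November 30 is a rough, hypothetical window.
    The end date is capped at `today` so an in-progress season stops at the
    current day.
    """
    today = today or date.today()
    start = date(year, 4, 1)
    end = min(date(year, 11, 30), today)
    if end < start:
        raise ValueError(f"The {year} window has not started yet.")
    return start.isoformat(), end.isoformat()

# A fixed "today" keeps the example reproducible.
start_date, end_date = season_window(2024, today=date(2024, 9, 12))
print(start_date, end_date)
```

Capping the end date is a design choice: with `end_date` set to today and nothing else, a query that starts in 2024 keeps growing into later seasons once that season is over.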

In [20]:
# Start date: April 1, 2024, shortly after the season opened in late March; end date: today
start_date = '2024-04-01'
end_date = datetime.now().strftime('%Y-%m-%d')

# Define the batter
batter_name = 'Shohei Ohtani' # input("Enter the batter's name: ")

# Define the pitcher
pitcher_name = 'Yoshinobu Yamamoto'  # input("Enter pitcher's name: ")

# Separate into first and last names
def name_splitter(name):
    name_parts = name.split()
    return name_parts[0], ' '.join(name_parts[1:])

batter_first_name, batter_last_name = name_splitter(batter_name)
pitcher_first_name, pitcher_last_name = name_splitter(pitcher_name)

# Get player IDs
def get_player_id(first_name, last_name):
    try:
        player_df = playerid_lookup(last_name, first_name)
        if not player_df.empty:
            print(player_df)
            return player_df['key_mlbam'].values[0]
        else:
            raise ValueError(f"Player '{first_name} {last_name}' not found.")
    except Exception as e:
        print(f"Error occurred: {e}")
        return None

batter_id = get_player_id(batter_first_name, batter_last_name)
pitcher_id = get_player_id(pitcher_first_name, pitcher_last_name)


  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs  \
0    ohtani     shohei     660271  ohtas001  ohtansh01          19755   

   mlb_played_first  mlb_played_last  
0            2018.0           2024.0  
  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs  \
0  yamamoto  yoshinobu     808967  yamay001  yamamyo01          33825   

   mlb_played_first  mlb_played_last  
0            2024.0           2024.0  
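The lookup above returns one row per name, but a common name can match several players, in which case taking the first row may pick the wrong one. The sketch below shows one possible tie-break: prefer the player who appeared most recently. The helper name `pick_most_recent` is hypothetical, and the second row of the sample table is made up purely to demonstrate the selection; only the column names and the first ID come from the output above.

```python
import pandas as pd

def pick_most_recent(player_df):
    """Return the key_mlbam of the row with the latest mlb_played_last.

    Returns None for an empty lookup table. Rows with a missing
    mlb_played_last sort to the end, so they are chosen only as a last resort.
    """
    if player_df.empty:
        return None
    latest = player_df.sort_values("mlb_played_last", ascending=False).iloc[0]
    return int(latest["key_mlbam"])

# Illustrative lookup result with two matches for the same name.
lookup = pd.DataFrame({
    "name_last": ["ohtani", "ohtani"],
    "name_first": ["shohei", "shohei"],
    "key_mlbam": [111111, 660271],        # 111111 is a made-up ID
    "mlb_played_first": [1950.0, 2018.0],
    "mlb_played_last": [1955.0, 2024.0],  # made-up years for the first row
})

print(pick_most_recent(lookup))
```

Whether "most recent" is the right rule depends on the use case; for a model of the current season it seems like a reasonable default.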


### Fetch Batting Stats:

In [21]:
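# Fetch every Statcast pitch in the date range, then keep only the batter's rows.
# pybaseball recommends calling pybaseball.cache.enable() beforehand so that
# completed subqueries can be recovered if the request is interrupted.
data = statcast(start_date, end_date)
batter_data = data[data['batter'] == batter_id]

# Display the first few rows of the batter's pitch-level data
print(batter_data.head())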


This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)

     pitch_type  game_date  release_speed  release_pos_x  release_pos_z  \
2414         FC 2024-09-11           90.6           -1.2           6.04   
2515         FC 2024-09-11           89.9          -1.36           6.03   
3029         FC 2024-09-11           89.8          -2.82           5.84   
624          SI 2024-09-11           91.6           2.31           5.81   
1439         SI 2024-09-11           93.0           2.19           5.79   

           player_name  batter  pitcher     events    description  ...  \
2414  Armstrong, Shawn  660271   542888  field_out  hit_into_play  ...   
2515  Armstrong, Shawn  660271   542888        NaN       foul_tip  ...   
3029  Thompson, Keegan  660271   624522  field_out  hit_into_play  ...   
624      Wicks, Jordan  660271   696136     single  hit_into_play  ...   
1439     Wicks, Jordan  660271   696136       walk           ball  ...   

      post_home_score  post_bat_score  post_fld_score  if_fielding_alignment  \
2414                8   
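The filter above keeps every pitch thrown to the batter, whoever threw it. To narrow the data to the specific batter and pitcher matchup, both ID columns can be combined in one boolean mask. The sketch below is a minimal illustration on a small table that reuses a few values from the output above; the helper name `matchup_pitches` is an assumption, not a _pybaseball_ function.

```python
import pandas as pd

def matchup_pitches(data, batter_id, pitcher_id):
    """Return only the rows where this batter faced this pitcher."""
    mask = (data["batter"] == batter_id) & (data["pitcher"] == pitcher_id)
    return data[mask]

# A few columns from the five rows displayed above.
sample = pd.DataFrame({
    "pitch_type": ["FC", "FC", "FC", "SI", "SI"],
    "batter": [660271] * 5,
    "pitcher": [542888, 542888, 624522, 696136, 696136],
    "events": ["field_out", None, "field_out", "single", "walk"],
})

vs_wicks = matchup_pitches(sample, batter_id=660271, pitcher_id=696136)
print(len(vs_wicks))
print(vs_wicks["events"].tolist())
```

If the two players never faced each other in the date range, the result is simply an empty DataFrame, so it may be worth checking `.empty` before passing the data on.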

In [22]:
# Observing the columns available in the dataset
print("Columns in batter_data:")
columns = batter_data.columns
print(', '.join(columns))