
# Data Collection
This notebook illustrates how the 'Statcast_data.csv' data file is collected. It details the code that uses the pybaseball library and describes metadata about the data itself.

The Statcast data is collected with pybaseball, a Python library by James LeDoux and company. The official GitHub page is here: https://github.com/jldbc/pybaseball.

This package scrapes Baseball Reference, Baseball Savant, and FanGraphs, all websites that house baseball statistics and related information. For this notebook, the package retrieves Statcast data (detailed in the Proposal document) at the individual pitch level. The data is collected according to the following criteria:

1. Identify the classes in our target suitable for overall analysis. In Statcast terms, the classes will be "called_strike", "ball", and "blocked_ball".
2. Order pitchers by the number of pitches they threw in the 2018 regular season. That is done below in the pitchers list object.
3. To get an even sample of pitches from each pitcher and a variety of pitchers, select the top 400 pitchers in our ordering and collect 350 pitches each. This sample size is chosen because our 400th-ranked pitcher, Ken Giles, threw 351 pitches in 2018. Thus, to ensure an even amount between all pitchers, each pitcher will have 350 pitches in the final dataset. The data will be collected from the entire 2018 regular season, which started on March 29 and ended on September 30.
4. Select appropriate features that can be measured during the duration of a pitch. The duration, or timeline, of a pitch is defined as the span from the moment the pitcher releases the baseball to the moment the catcher receives it. Thus, features about hitting the ball, or any other information recorded after a pitch has been thrown, are excluded. The only exception is the target, which is the result of the pitch.
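The selection criteria above can be sketched on a small, made-up pitch log. This is only an illustration under stated assumptions: the toy dataframe, `TOP_N`, and the tiny `SAMPLE_SIZE` are hypothetical stand-ins for the real Statcast data, the top 400 pitchers, and the 350-pitch sample.

```python
import pandas as pd

# Hypothetical toy pitch log; real Statcast data has many more columns.
toy = pd.DataFrame({
    "player_name": ["A"] * 5 + ["B"] * 4 + ["C"] * 2,
    "description": ["ball", "called_strike", "foul", "ball", "blocked_ball",
                    "ball", "hit_into_play", "called_strike", "ball",
                    "ball", "swinging_strike"],
})

TARGET_CLASSES = ["ball", "called_strike", "blocked_ball"]
TOP_N = 2        # stands in for the top 400 pitchers
SAMPLE_SIZE = 3  # stands in for 350 pitches per pitcher

# Rank pitchers by total pitches thrown and keep the top N.
top_pitchers = toy["player_name"].value_counts().head(TOP_N).index

# Keep only the target classes for those pitchers.
eligible = toy[toy["player_name"].isin(top_pitchers)
               & toy["description"].isin(TARGET_CLASSES)]

# Draw an equal-sized random sample from each pitcher.
balanced = pd.concat(
    [group.sample(SAMPLE_SIZE, random_state=2019)
     for _, group in eligible.groupby("player_name")],
    ignore_index=True,
)
print(balanced["player_name"].value_counts())
```

Note that the ranking uses all pitches while the sample is drawn only from the target classes, so a pitcher near the cutoff can have fewer eligible pitches than the sample size.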
## Logical execution
The logic of the data collection is based on the pybaseball functionality:

1. Grab a unique identification label for each pitcher, to be used in collecting his respective data.
2. Pull the data from Statcast through pybaseball, resulting in a pandas dataframe, based on the unique identification. This dataframe will be a random sample of 350 pitches thrown in the 2018 regular season by the particular pitcher.
3. Instantiate a dataframe by performing step 2 above. Then, loop through all of the pitchers and append their respective data to the instantiated dataframe. This will result in our final dataframe. For reference, the last pitcher will be Ken Giles.
4. Save that dataframe as a csv file for future use.
(Note from the author: The logic is not necessarily elegant, but it gets the job done. However, there are some hiccups. Due to minor bugs and errors that cropped up while looping through the pitcher names, not all 400 pitchers ended up in the dataframe. If a particular pitcher disrupted the loop, that pitcher was simply bypassed. The run resulted in 368 pitchers in the dataframe, which is still an ample amount.)
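The note above does not say which errors caused pitchers to be bypassed. One plausible cause, offered here as an assumption rather than a diagnosis, is that `DataFrame.sample` raises a `ValueError` when a pitcher has fewer eligible pitches than the sample size. The sketch below uses made-up frames and a hypothetical helper, `sample_or_skip`, to show how skipped pitchers could be recorded instead of silently dropped.

```python
import pandas as pd

SAMPLE_SIZE = 350

def sample_or_skip(frames_by_pitcher, sample_size):
    """Sample each pitcher's frame; record the pitchers that raise ValueError."""
    kept, skipped = [], []
    for name, frame in frames_by_pitcher.items():
        try:
            kept.append(frame.sample(sample_size, random_state=2019))
        except ValueError:
            # DataFrame.sample raises ValueError when sample_size
            # exceeds the number of rows (sampling without replacement).
            skipped.append(name)
    return kept, skipped

# Hypothetical stand-ins for per-pitcher Statcast pulls.
frames = {
    "enough": pd.DataFrame({"description": ["ball"] * 400}),
    "too_few": pd.DataFrame({"description": ["ball"] * 100}),
}
kept, skipped = sample_or_skip(frames, SAMPLE_SIZE)
print(len(kept), skipped)
```

Keeping a list of skipped names makes it possible to check afterwards which pitchers are missing and why.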

Let's begin the process now.

In [1]:
#import dependencies
import pybaseball
import pandas as pd
import numpy as np
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
import pathlib

PITCHER_NAMES = pathlib.Path.cwd().parent / 'references' / 'pitcher_names.txt'
DATA_FOLDER = pathlib.Path.cwd().parent / 'data'

#set up a few constants
#number of pitches
SAMPLE_SIZE = 350

#classes of the target variable
TARGET_CLASSES = ['ball', 'called_strike', 'blocked_ball']

#resulting features we want
FEATURES_TO_KEEP = ['player_name', 'p_throws', 'pitch_name', 'release_speed','release_spin_rate',
                    'release_pos_x', 'release_pos_y',
                    'release_pos_z', 'pfx_x', 'pfx_z', 'vx0','vy0', 'vz0', 
                    'ax', 'ay', 'az', 'sz_top', 'sz_bot', 
                    'release_extension','description']

In [2]:
def read_pitchers(file):
    '''
    Read in the pitcher_names.txt file and split it into a list of lists,
    where each inner list has two elements: the first and last names, respectively.
    '''
    with open(file) as f:
        # names are comma-separated; drop any newline characters
        names = [name.replace('\n', '') for name in f.read().split(',')]

    split_names = [name.split(' ') for name in names]

    print(f' Number of Pitchers: {len(names)}')
    return split_names
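To make the expected input concrete, here is a small sketch of the same parsing applied to a temporary file. The file layout (comma-separated "First Last" names, possibly spread over several lines) is inferred from the function, and the names other than Ken Giles are placeholders. `parse_names` is a hypothetical helper that mirrors the parsing logic without the file handling.

```python
import tempfile
from pathlib import Path

def parse_names(text):
    """Split comma-separated 'First Last' names into [first, last] lists."""
    names = [name.replace('\n', '') for name in text.split(',')]
    return [name.split(' ') for name in names]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'pitcher_names.txt'
    # Placeholder contents; note the trailing comma.
    path.write_text('Jane Doe,John Roe,\nKen Giles,')
    pairs = parse_names(path.read_text())

print(pairs)
```

A trailing comma in the file produces one extra entry containing only an empty string, so the printed pitcher count can be one higher than the number of real names. Names with more than two space-separated parts would also yield lists longer than two elements.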


## Using pybaseball 
Now begin the execution of the loop. This goes through steps 1-4 in the logical execution portion above.

We'll use a few constraints:
- collect 350 pitches from each pitcher so that there is balance between pitchers
- collect data from the top 400 pitchers so that there is a variety of pitchers

In [3]:


def collect_statcast(sample_size, target, features, pitcher_names):
    """
    Collect a random sample of pitches for each pitcher.

    Returns a dataframe with the columns in `features`, holding `sample_size`
    pitches whose description is in `target` for every pitcher that could be
    collected. Pitchers that raise a ValueError are skipped.
    """
    frames = []

    # NOTE: the [:2] slice limits the run to the first two pitchers;
    # drop it to loop through all the names
    for fname, lname in pitcher_names[:2]:

        # grab the lookup table rows for the pitcher
        player = playerid_lookup(lname, fname)

        # to avoid any possible errors, execute the following try statement:
        #   grab the unique identifier value
        #   get all available data in the time frame
        #   filter data to only have appropriate targets, defined above
        #   add the particular pitcher to the list of dataframes
        # if any of these steps fail, particularly the grabbing of 'ID',
        # pass on to the next pitcher
        try:
            ID = player['key_mlbam'].iloc[player['key_mlbam'].argmax()]
            df = statcast_pitcher('2018-03-29', '2018-09-30', player_id=ID)
            df = df[df['description'].isin(target)].sample(sample_size, random_state=2019)
            frames.append(df[features])
        except ValueError:
            continue

    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    if not frames:
        return pd.DataFrame(columns=features)
    return pd.concat(frames, ignore_index=True)
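The ID selection inside the `try` block can be illustrated without calling pybaseball. The sketch below uses a made-up lookup table: the `key_mlbam` column name comes from the code above, while the name columns and all values are assumptions for illustration. `pick_id` is a hypothetical helper wrapping the same expression.

```python
import pandas as pd

# Hypothetical lookup result: two players sharing a name (IDs are made up).
lookup = pd.DataFrame({
    "name_last": ["doe", "doe"],
    "name_first": ["john", "john"],
    "key_mlbam": [111111, 654321],
})

def pick_id(player):
    """Return the largest key_mlbam in the lookup result."""
    return player["key_mlbam"].iloc[player["key_mlbam"].argmax()]

chosen = pick_id(lookup)

# An empty result (no match for the name) makes argmax raise ValueError,
# which is the kind of failure the try/except in the loop guards against.
empty = lookup.iloc[0:0]
try:
    pick_id(empty)
    lookup_failed = False
except ValueError:
    lookup_failed = True

print(chosen, lookup_failed)
```

When several players share a name, this expression simply takes the row with the largest ID. That is a heuristic, so it may be worth spot-checking a few ambiguous names against the lookup table.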



In [5]:
def convert_to_csv(data):
    '''
    Save the collected dataframe as Statcast_data.csv in the raw data folder.
    '''

    data.to_csv(DATA_FOLDER / 'raw' / 'Statcast_data.csv')

#convert_to_csv(pitchers)
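As a quick check of the save step, the following sketch writes a tiny made-up dataframe to a temporary 'raw' folder and reads it back. The temporary directory and the sample row are assumptions for illustration; the real notebook writes to `DATA_FOLDER / 'raw'`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up single-row frame standing in for the collected data.
frame = pd.DataFrame({"player_name": ["Ken Giles"], "description": ["ball"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "raw"
    # to_csv does not create missing folders, so make sure the target exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "Statcast_data.csv"

    frame.to_csv(out_path)
    # The row index is written as the first column; read it back as the index.
    restored = pd.read_csv(out_path, index_col=0)

print(restored)
```

Because `to_csv` writes the row index by default, reading the file without `index_col=0` would add an extra unnamed column.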

In [6]:
def main():

    names = read_pitchers(PITCHER_NAMES)
    
    pitchers = collect_statcast(SAMPLE_SIZE, TARGET_CLASSES, FEATURES_TO_KEEP, names)

    convert_to_csv(pitchers)

In [7]:
main()

Number of Pitchers: 401
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering player lookup table. This may take a moment.
Gathering Player Data
