
# Data Collection
This notebook documents how the 'Statcast_data.csv' data file was collected. It walks through the code, which relies on the pybaseball library, and provides metadata about the data itself.

The Statcast data is collected with pybaseball, a Python library by James LeDoux and contributors. The official GitHub page is https://github.com/jldbc/pybaseball.

This package scrapes Baseball Reference, Baseball Savant, and FanGraphs, all of which host baseball statistics and related information. For this notebook specifically, the package retrieves Statcast data (detailed in the Proposal document) at the individual pitch level. The data will be collected on the following terms:

1. Identify the target classes suitable for overall analysis. In Statcast terms, the classes will be "called_strike", "ball", and "blocked_ball".
2. Rank pitchers by the number of pitches they threw in the 2018 regular season. This ordering is captured below in the list of pitcher names.
3. To get an even sample of pitches across a variety of pitchers, select the top 400 pitchers in the ranking and collect 350 pitches from each. The number 350 is chosen because the 400th-ranked pitcher, Ken Giles, threw 351 pitches in the 2018 regular season. Thus, to ensure an equal count for all pitchers, each pitcher will have 350 pitches in the final dataset. The data will be collected from the entire 2018 regular season, which started on March 29 and ended on September 30.
4. Select only features that can be measured during a pitch. The duration, or timeline, of a pitch is defined as running from the moment the pitcher releases the baseball to the moment the catcher receives it. Thus, features about hitting the ball, or any other information recorded after the pitch has been thrown, are excluded. The only exception is the target, which is the result of the pitch.
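
The sampling and target rules above can be sketched in pandas. This is only an illustration under stated assumptions, not the code used to build the dataset: it assumes the pitch result is stored in a column named `description` and the pitcher identifier in a column named `pitcher`, and the helper names `sample_pitches` and `keep_target_classes` are hypothetical. The terms above do not say whether the class filter is applied before or after sampling, so the two steps are shown separately.

```python
import pandas as pd

# Target classes named in the collection terms above
TARGET_CLASSES = ["called_strike", "ball", "blocked_ball"]


def sample_pitches(pitches: pd.DataFrame, n: int = 350, seed: int = 0) -> pd.DataFrame:
    """Draw exactly n pitches per pitcher (assumes a `pitcher` column).

    Raises ValueError if any pitcher has fewer than n pitches.
    """
    return pitches.groupby("pitcher").sample(n=n, random_state=seed)


def keep_target_classes(pitches: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose result (assumed column `description`) is a target class."""
    return pitches[pitches["description"].isin(TARGET_CLASSES)]


# Toy data: two pitchers with five pitches each (values are made up)
toy = pd.DataFrame({
    "pitcher": [1] * 5 + [2] * 5,
    "description": ["ball", "called_strike", "foul",
                    "blocked_ball", "hit_into_play"] * 2,
})

sampled = sample_pitches(toy, n=3)   # three pitches per pitcher
labeled = keep_target_classes(toy)   # rows with a target class only
```

Sampling without replacement fails for any pitcher with fewer than `n` pitches, which is why the per-pitcher count has to stay at or below the total of the lowest-ranked pitcher.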
## Logical execution
The logic of the data collection is based on the pybaseball functionality:

1. Grab a unique identification label for each pitcher, to be used in collecting his respective data.
2. Pull the data from Statcast through pybaseball, based on the unique identification, resulting in a pandas dataframe. This dataframe will be a random sample of 350 pitches thrown in the 2018 regular season by that particular pitcher.
3. Instantiate a dataframe by performing step 2 above. Then, loop through all of the pitchers and append their respective data to the instantiated dataframe. This will result in our final dataframe. For reference, the last pitcher will be Ken Giles.
4. Save that dataframe as a csv file for future use.
(Note from the author: The logic is not especially elegant, but it gets the job done. There were some hiccups, however. Minor, intermittent errors came up while looping through the pitcher names, so not all 400 pitchers ended up in the dataframe. Whenever a particular pitcher disrupted the loop, that pitcher was simply skipped. The final dataframe therefore contains 368 pitchers, which is still an ample number.)
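
The loop described above, including the behavior of skipping pitchers that raise errors, might be sketched as follows. This is a hedged illustration rather than the notebook's actual implementation: the function `collect_pitches`, the injected `fetch` callable, and the toy data are all hypothetical, and the sketch gathers per-pitcher frames in a list and concatenates once instead of appending to an initial dataframe. The commented-out fetcher shows how pybaseball's `playerid_lookup` and `statcast_pitcher` could plug in; it is not executed here because it needs network access.

```python
import pandas as pd


def collect_pitches(names, fetch, n=350, seed=0):
    """Sample n pitches per pitcher; record and skip any pitcher that fails."""
    frames, skipped = [], []
    for name in names:
        try:
            data = fetch(name)  # one dataframe of pitches per pitcher
            frames.append(data.sample(n=n, random_state=seed))
        except Exception as err:  # broad on purpose: mirror "simply skipped"
            skipped.append((name, repr(err)))
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, skipped


# A fetcher built on pybaseball might look like this (assumption, not run):
#
# def fetch(name):
#     first, last = name.split(" ", 1)
#     ids = playerid_lookup(last, first)
#     return statcast_pitcher("2018-03-29", "2018-09-30", ids["key_mlbam"].iloc[0])


def fake_fetch(name):
    """Stand-in fetcher with made-up data so the sketch runs offline."""
    if name == "Missing Pitcher":
        raise ValueError("no player id found")
    return pd.DataFrame({
        "player_name": [name] * 5,
        "release_speed": [90.0, 91.5, 89.8, 92.1, 90.7],
    })


combined, skipped = collect_pitches(
    ["Dallas Keuchel", "Missing Pitcher", "Kyle Gibson"], fake_fetch, n=3
)
# combined.to_csv("Statcast_data.csv", index=False)  # final save step
```

Keeping a `skipped` list makes it possible to report exactly which pitchers were bypassed and why, rather than only noticing a lower final count.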

Let's begin the process now.

In [3]:
#import dependencies
import pybaseball
import pandas as pd
import numpy as np
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
import pathlib

In [130]:
PITCHER_NAMES = pathlib.Path('../references/pitcher_names.txt')


# Read the comma-separated names and strip stray whitespace and newlines
with open(PITCHER_NAMES) as f:
    names = [name.strip() for name in f.read().split(',')]
print(f'Number of Pitchers: {len(names)}')

# Confirm that no name still contains a newline
assert not any('\n' in name for name in names)

Number of Pitchers: 400
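
Looking up a pitcher's identifier with `playerid_lookup` requires a last name and a first name rather than a single string. A minimal sketch of that split is shown here; the helper `split_name` and the short sample string are hypothetical, and the split on the first space is an assumption that will mishandle multi-word first names.

```python
def split_name(full_name):
    """Split 'First Last' into (last, first), dropping stray whitespace."""
    first, _, last = full_name.strip().partition(" ")
    return last, first


# Sample in the same comma-separated layout as pitcher_names.txt
raw = "Dallas Keuchel,Kyle Gibson,\nZack Greinke,JA Happ"
pairs = [split_name(name) for name in raw.split(",")]
```

Each pair can then be passed as `playerid_lookup(last, first)`.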


In [127]:
PITCHER_NAMES = pathlib.Path('../references/pitcher_names.txt')

with open(PITCHER_NAMES) as f:
    names = f.read().split(',') 

names

['Dallas Keuchel',
 'Kyle Gibson',
 'Kyle Freeland',
 'Mike Clevinger',
 'Jon Lester',
 '\nZack Greinke',
 'Gio Gonzalez',
 'Mike Foltynewicz',
 'Jhoulys Chacin',
 'Lucas Giolito',
 '\nKyle Hendricks',
 'Justin Verlander',
 'Max Scherzer',
 'Jose Quintana',
 'Patrick Corbin',
 '\nRick Porcello',
 'Sean Newcomb',
 'Reynaldo Lopez',
 'Blake Snell',
 'Tanner Roark',
 '\nCorey Kluber',
 'JA Happ',
 'Julio Teheran',
 'Luis Severino',
 'Cole Hamels',
 'Lance Lynn',
 'Jake Odorizzi',
 '\nJose Berrios',
 'Jacob deGrom',
 'Matthew Boyd',
 'Kevin Gausman',
 'Steven Matz',
 'Jon Gray',
 '\nJameson Taillon',
 'Jakob Junis',
 'Andrew Cashner',
 'Danny Duffy',
 'Jake Arrieta',
 'Charlie Morton',
 '\nZack Wheeler',
 'James Shields',
 'Tyler Anderson',
 'Jose Urena',
 'Carlos Carrasco',
 'Trevor Williams',
 '\nTyson Ross',
 'Miles Mikolas',
 'Mike Fiers',
 'Andrew Heaney',
 'Dylan Bundy',
 'Felix Hernandez',
 '\nLuis Castillo',
 'Chase Anderson',
 'David Price',
 'Derek Holland',
 'Andrew Suarez',
 'C