# Data Collection

This notebook documents how the `Statcast_data.csv` data file was collected. It walks through the pybaseball code that was used and provides metadata about the data itself.

The Statcast data is collected with pybaseball, a Python library by James LeDoux and contributors. The official GitHub page is https://github.com/jldbc/pybaseball.

This package scrapes Baseball Reference, Baseball Savant, and FanGraphs, all websites that host baseball statistics and related information. For this notebook, the package retrieves Statcast data (detailed in the Proposal document) at the individual pitch level. The data is collected according to the following criteria:

- Identify the classes in our target suitable for overall analysis. In Statcast terms, the classes are "called_strike", "ball", and "blocked_ball".
- Rank pitchers by the number of pitches they threw in the 2018 regular season. This ranking is stored below in the `pitchers` list object.
- To get an even sample of pitches from each pitcher and a variety of pitchers, select the top 400 pitchers in our ranking and collect 350 pitches each. This number is chosen because our 400th-ranked pitcher, Ken Giles, threw 351 pitches that season. Thus, to ensure an even amount across all pitchers, each pitcher will have 350 pitches in the final dataset. The data is collected from the entire 2018 regular season, which started on March 29 and ended on September 30.
- Select features that can be measured during the duration of a pitch. The duration, or timeline, of a pitch is defined as the span from the moment the pitcher releases the baseball from his hand to the moment the catcher receives it. Thus, features about hitting the ball, or any information recorded after a pitch has been thrown, are excluded. The only exception is the target, which is the result of the pitch.
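
The selection criteria above can be sketched on toy data. This is a minimal illustration, not the notebook's actual collection code: the pitcher names, the counts, and the `top_n` and `per_pitcher` values are made up, and `groupby(...).sample` assumes pandas 1.1 or later.

```python
import pandas as pd

# Toy pitch-level data; names and outcomes are made up for illustration.
pitches = pd.DataFrame({
    "player_name": ["A"] * 6 + ["B"] * 4 + ["C"] * 2,
    "description": ["ball", "called_strike", "foul", "ball", "blocked_ball", "hit_into_play",
                    "ball", "ball", "called_strike", "swinging_strike",
                    "ball", "called_strike"],
})

target_classes = ["ball", "called_strike", "blocked_ball"]
top_n = 2          # stands in for the top 400 pitchers
per_pitcher = 2    # stands in for 350 pitches each

# Rank pitchers by total pitches thrown and keep the top_n names.
top_pitchers = pitches["player_name"].value_counts().head(top_n).index

# Keep only the target classes, then draw the same number of rows per pitcher.
eligible = pitches[pitches["player_name"].isin(top_pitchers)
                   & pitches["description"].isin(target_classes)]
balanced = eligible.groupby("player_name").sample(n=per_pitcher, random_state=2019)

print(balanced["player_name"].value_counts().to_dict())
```

Note that the ranking uses all pitches, while the sample is drawn only from pitches in the target classes, so a pitcher near the bottom of the ranking may have fewer eligible pitches than the requested sample size.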

### Logical execution
The logic of the data collection is based on the pybaseball functionality:

1. Grab a unique identification label for each pitcher, to be used in collecting his respective data.
2. Pull the data from Statcast through pybaseball, based on the unique identification, resulting in a pandas dataframe. This dataframe will be a random sample of 350 pitches thrown in the 2018 regular season by the particular pitcher.
3. Instantiate a dataframe by performing step 2 for one pitcher. Then loop through all of the pitchers and append their respective data to the instantiated dataframe. This results in our final dataframe. For reference, the last pitcher will be Ken Giles.
4. Save that dataframe as a CSV file for future use.

(Note from the author: The logic is not necessarily elegant, but it gets the job done. There were some hiccups, however. Because of minor errors that cropped up while looping through the pitcher names, not all 400 pitchers ended up in the dataframe. If a particular pitcher disrupted the loop, that pitcher was simply skipped. The run resulted in 368 pitchers in the dataframe, which is still an ample amount.)
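
The loop-and-skip logic can be sketched without network access. In the sketch below, `fetch_pitches` is a hypothetical stand-in for the pybaseball lookup and download, and the pitcher names and counts are made up. It also records which pitchers were skipped and why. One possible cause of skips, offered here as an assumption rather than a diagnosis, is that `DataFrame.sample` raises a `ValueError` when fewer rows remain after filtering than the requested sample size.

```python
import pandas as pd

SAMPLE_SIZE = 3  # stands in for 350
TARGET_CLASSES = ["ball", "called_strike", "blocked_ball"]

def fetch_pitches(name):
    """Hypothetical stand-in for the pybaseball lookup and download."""
    counts = {"Pitcher A": 5, "Pitcher B": 2, "Pitcher C": 4}  # made-up pitch counts
    n = counts[name]
    outcomes = ["ball", "called_strike"] * (n // 2) + ["ball"] * (n % 2)
    return pd.DataFrame({"player_name": [name] * n, "description": outcomes})

frames, skipped = [], []
for name in ["Pitcher A", "Pitcher B", "Pitcher C"]:
    try:
        df = fetch_pitches(name)
        df = df[df["description"].isin(TARGET_CLASSES)]
        # sample() raises ValueError when fewer than SAMPLE_SIZE rows remain
        frames.append(df.sample(SAMPLE_SIZE, random_state=2019))
    except ValueError as err:
        skipped.append((name, str(err)))

# Concatenate once at the end instead of growing a dataframe inside the loop.
data = pd.concat(frames, ignore_index=True)
print(len(data), [name for name, _ in skipped])
```

Keeping the `skipped` list makes it possible to report exactly which pitchers are missing from the final dataset instead of only the total.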

### Let's begin the process now.

In [1]:
#import dependencies
import pybaseball
import pandas as pd
import numpy as np
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup

In [2]:
#define a list of pitchers to loop through. These names and their order were scraped 
#off the Baseball Savant Website and put into this list

pitchers = ["Dallas Keuchel","Kyle Gibson","Kyle Freeland","Mike Clevinger","Jon Lester",
"Zack Greinke","Gio Gonzalez","Mike Foltynewicz","Jhoulys Chacin","Lucas Giolito",
"Kyle Hendricks","Justin Verlander","Max Scherzer","Jose Quintana","Patrick Corbin",
"Rick Porcello","Sean Newcomb","Reynaldo Lopez","Blake Snell","Tanner Roark",
"Corey Kluber","JA Happ","Julio Teheran","Luis Severino","Cole Hamels","Lance Lynn","Jake Odorizzi",
"Jose Berrios","Jacob deGrom","Matthew Boyd","Kevin Gausman","Steven Matz","Jon Gray",
"Jameson Taillon","Jakob Junis","Andrew Cashner","Danny Duffy","Jake Arrieta","Charlie Morton",
"Zack Wheeler","James Shields","Tyler Anderson","Jose Urena","Carlos Carrasco","Trevor Williams",
"Tyson Ross","Miles Mikolas","Mike Fiers","Andrew Heaney","Dylan Bundy","Felix Hernandez",
"Luis Castillo","Chase Anderson","David Price","Derek Holland","Andrew Suarez","Chris Stratton",
"Marco Gonzales","Wade LeBlanc","Francisco Liriano","CC Sabathia","Ivan Nova","Matt Harvey",
"Chris Archer","Marco Estrada","Sonny Gray","Luke Weaver","Clayton Richard","Sal Romano","German Marquez",
"Gerrit Cole","Ryan Yarbrough","Mike Leake","Eduardo Rodriguez","Trevor Richards","Junior Guerra","Brad Keller",
"Wei-Yin Chen","Bartolo Colon","Masahiro Tanaka","Jack Flaherty","Noah Syndergaard","Tyler Chatwood","Jaime Barria",
"Aaron Nola","Sam Gaviglio","Clayton Kershaw","Sean Manaea","Joey Lucchesi","Stephen Strasburg","Anibal Sanchez","Dylan Covey",
"Tyler Skaggs","Dan Straily","Nick Pivetta","Mike Minor","Rich Hill","Trevor Bauer","Madison Bumgarner",
"Tyler Mahle","Chad Bettis","Vince Velasquez","Jordan Zimmermann","Michael Fulmer","Hector Santiago","Carlos Rodon",
"Ty Blach","Jason Hammel","Zach Eflin","Mike Montgomery","Zack Godley","Tyler Glasnow","Carlos Martinez","Aaron Sanchez",
"Eric Lauer","Robbie Ray","Dereck Rodriguez","John Gant","Alex Cobb","Yovani Gallardo","Kenta Maeda","Brian Johnson",
"Marcus Stroman","Alex Wood","Daniel Mengden","James Paxton","David Hess","Anthony DeSclafani","Steven Brault",
"Ian Kennedy","Walker Buehler","Antonio Senzatela","Seth Lugo","Ryan Borucki","Brent Suter","Mike Wright","Trevor Cahill",
"Nathan Eovaldi","Miguel Castro","Shane Bieber","Robbie Erlin","Edwin Jackson","Homer Bailey","Jaime Garcia","Burch Smith",
"Felix Pena","Adam Ottavino","Domingo German","Michael Wacha","Noe Ramirez","Martin Perez","Ross Stripling","Freddy Peralta",
"Jesse Chavez","Joe Musgrove","Garrett Richards","Blaine Hardy","Matt Moore","Jordan Lyles","Jordan Hicks","Heath Fillmyer",
"Clay Buchholz","Caleb Smith","Wade Miley","Jason Vargas","Hector Velazquez","Michael Lorenzen","Buck Farmer","Nick Kingham",
"Matt Koch","Brad Hand","Chad Kuhl","Brian Flynn","Steve Cishek","Hyun-Jin Ryu","Fernando Rodney","Bryan Mitchell","Nick Tropeano",
"Yefry Ramirez","Robert Gsellman","Collin McHugh","Yusmeiro Petit","Lance McCullers Jr","Matt Andriese","Joe Biagini","Doug Fister",
"Matt Barnes","Yonny Chirinos","Justin Anderson","Reyes Moronta","Josh Hader","Zach Davies","Brandon McCarthy","Elieser Hernandez",
"Jake Faria","Cam Bedrosian","Adam Plutko","Mychal Givens","Sergio Romo","Blake Parker","Jeurys Familia","Brad Peacock",
"Brad Brach","Kyle Barraclough","Chad Green","Tayron Guerrero","TJ McFarland","Drew Steckenrider","Joe Jimenez","Jose Alvarado",
"Dan Jennings","Raisel Iglesias","Brett Anderson","Shane Greene","Jake Diekman","Josh Tomlin","Trevor Hildenberger","Yoshihisa Hirano",
"Tyler Clippard","Felipe Vazquez","Eric Skoglund","Jarlin Garcia","Seunghwan Oh","Richard Rodriguez","Jim Johnson","Brad Boxberger",
"Heath Hembree","Edwin Diaz","Jesse Biddle","Kyle Crick","Jared Hughes","Adam Warren","Erick Fedde","Lou Trivino","Paul Sewald",
"Kenley Jansen","Ryan Pressly","Ryne Stanek","Matt Magill","Brad Ziegler","Jeremy Hellickson","Blake Treinen","Amir Garrett",
"Bryan Shaw","Chasen Shreve","Jonathan Holder","Alex Colome","Ryan Tepera","Alex Claudio","Brandon Kintzler","John Axford",
"Cory Gearrin","Scott Alexander","Corey Oswalt","Hansel Robles","Dan Winkler","AJ Minter","Jose Alvarez","Taylor Rogers",
"Dylan Floro","Hector Rondon","Matt Strahm","Justin Wilson","Michael Feliz","Frankie Montas","Taylor Williams","Carl Edwards Jr",
"Austin Gomber","Jefry Rodriguez","Joakim Soria","Sam Dyson","Aroldis Chapman","Jorge De La Rosa","Kevin McCarthy","Joe Kelly",
"Alex Wilson","Jose Leclerc","Austin Pruitt","Drew VerHagen","Jace Fry","Diego Castillo","Emilio Pagan","Seranthony Dominguez",
"Zach Duke","Archie Bradley","Fernando Romero","Sam Freeman","Craig Stammen","Bud Norris","Drew Pomeranz","Wade Davis",
"Will Harris","Mike Mayers","Juan Minaya","Pablo Lopez","David Hernandez","Luis Perdomo","Keone Kela","David Robertson",
"AJ Cole","Andrew Chafin","Shohei Ohtani","Edgar Santana","Ariel Jurado","Jerry Blevins","Chris Volstad","Shane Carle",
"Tanner Scott","Kirby Yates","Adam Morgan","Wandy Peralta","Tony Watson","Louis Coleman","Luis Avilan","Jalen Beeks",
"Pedro Baez","Chris Rusin","John Brebbia","Victor Arano","Greg Holland","Chris Bassitt","Dan Otero","Addison Reed",
"Tommy Hunter","Pierce Johnson","Phil Maton","Matt Grace","Wilmer Font","Roenis Elias","Jacob Barnes","Jake McGee",
"Danny Barnes","Pedro Strop","Daniel Norris","Jeremy Jeffress","Caleb Ferguson","Eddie Butler","Chaz Roe","Adam Cimber",
"Johnny Cueto","Adam Wainwright","Scott Oberg","Drew Hutchison","Hunter Strickland","Andrew Triggs","Jake Petricka",
"Justin Miller","Oliver Drake","Odrisamer Despaigne","Will Smith","Yu Darvish","Edubray Ramos","Luis Garcia",
"Brandon Woodruff","Hector Neris","Deck McGuire","Chris Devenski","Austin Bibens-Dirkx","Chasen Bradford","Ryan Madson",
"Wander Suero","Brian Duensing","Tim Hill","Carson Fulmer","Harrison Musgrave","Cody Reed","Randy Rosario","Brandon Maurer",
"Zack Britton","Nick Vincent","Framber Valdez","Thomas Pannone","Erasmo Ramirez","Aaron Loup","Chris Hatcher","Adam Conley",
"Dellin Betances","Zach McAllister","Jeff Samardzija","Jimmy Yacabonis","Luke Jackson","Daniel Stumpf","James Pazos","Luis Cessa",
"Kendall Graveman","Shawn Kelley","Juan Nicasio","Hunter Wood","Kohl Stewart","Jacob Nix","Paul Fry","Craig Kimbrel",
"Daniel Hudson","Ryan Buchter","Cody Allen","Neil Ramirez","Jorge Lopez","Chris Martin","Sean Reid-Foley","Ken Giles"]

print(f" Number of pitchers: {len(pitchers)}")

#split the full names into first name and last name
pitchers = [name.split() for name in pitchers]


 Number of pitchers: 400
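
One detail of the name handling is worth illustrating. Splitting on every space works for two-word names, but names with more than two tokens (for example "Jorge De La Rosa" or "Lance McCullers Jr" in the list above) produce extra pieces, and only the second token would be used as the last name. The helper `split_name` below is hypothetical. It only shows the string behavior and does not verify how the pybaseball lookup table stores such names.

```python
def split_name(full_name):
    """Split on the first space only: (first name, rest of the name)."""
    first, last = full_name.split(maxsplit=1)
    return first, last

for full_name in ["Ken Giles", "Jorge De La Rosa", "Lance McCullers Jr"]:
    tokens = full_name.split()          # what the cell above produces
    print(tokens[1], "|", split_name(full_name)[1])

# Giles | Giles
# De | De La Rosa
# McCullers | McCullers Jr
```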


Now begin the execution of the loop. This goes through steps 1-4 in the logical execution portion above. 



In [3]:
#set up a few constants
#number of pitches
sample_size = 350

#number of pitchers
num_of_pitchers = 400

#classes of the target variable
target_classes = ['ball', 'called_strike', 'blocked_ball']

#resulting features we want
features_to_keep = ['player_name', 'p_throws', 'pitch_name', 'release_speed','release_spin_rate',
                    'release_pos_x', 'release_pos_y',
                    'release_pos_z', 'pfx_x', 'pfx_z', 'vx0','vy0', 'vz0', 
                    'ax', 'ay', 'az', 'sz_top', 'sz_bot', 
                    'release_extension','description']

#to create data, we'll make a dataframe first in the format we want,
#then loop through all the pitcher's names
#and append their respective data to the first dataframe

#make first dataframe, with Chris Sale;
#based on documentation in pybaseball docs examples
#player_id was already given
data = statcast_pitcher('2018-03-29', '2018-09-30', player_id = 519242)

#filter data of only [ball, called_strike, blocked_ball] classes in 'description'
#and grab random sample of sample_size value
data = data[data['description'].isin(target_classes)].sample(sample_size, random_state = 2019)


#repeat process above for all pitchers
#loop through all the names
for name in pitchers[:num_of_pitchers]:
    
    #grab the lookup record(s) matching the pitcher's name
    player = playerid_lookup(str(name[1]), str(name[0]))
    
    #to avoid any possible errors, execute following try statement:
    # grab the unique identifier value
    # get all available data in time frame
    # filter data to only have appropriate targets, defined above
    # append particular pitcher to 'master' dataframe
    #if any of these steps fail, particularly the grabbing of 'ID'
    #pass on to next pitcher
    try:
        ID = player['key_mlbam'].iloc[player['key_mlbam'].argmax()]
        df = statcast_pitcher('2018-03-29', '2018-09-30', player_id = ID)
        df = df[df['description'].isin(target_classes)].sample(sample_size, random_state=2019)
        #pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        data = pd.concat([data, df], ignore_index=True)
    except ValueError:
        pass

#create a copy of final resulting dataframe    
baseball = data[features_to_keep].copy()

#verify that code above works
baseball



Gathering Player Data
Gathering player lookup table. This may take a moment.



Unnamed: 0,player_name,p_throws,pitch_name,release_speed,release_spin_rate,release_pos_x,release_pos_y,release_pos_z,pfx_x,pfx_z,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,release_extension,description
0,Chris Sale,L,2-Seam Fastball,95.1,2314.0,3.2655,54.4995,5.2575,1.7213,0.4271,-9.8035,-138.1130,0.1339,23.9464,31.0012,-27.0426,3.2971,1.5059,6.001,ball
1,Chris Sale,L,4-Seam Fastball,96.7,2324.0,3.1728,54.3094,5.3966,0.9349,1.0015,-9.0084,-140.5865,-2.4218,14.3766,32.1373,-18.6501,3.3136,1.5730,6.191,called_strike
2,Chris Sale,L,Slider,80.8,2521.0,3.3517,55.0820,5.1205,-1.0168,-0.1223,-3.7285,-117.3223,1.2140,-8.2088,25.0797,-33.8129,3.9119,1.7080,5.420,called_strike
3,Chris Sale,L,4-Seam Fastball,96.2,2329.0,3.1334,54.0207,5.2136,1.3175,0.9402,-12.0533,-139.3669,-5.1407,20.1652,36.1370,-18.9205,3.5553,1.5639,6.479,called_strike
4,Chris Sale,L,4-Seam Fastball,96.5,2437.0,3.3033,54.3597,5.0589,1.2794,0.7425,-14.0287,-139.8559,-3.3434,19.7342,30.0266,-21.9652,3.3450,1.6241,6.141,ball
5,Chris Sale,L,Slider,82.0,2405.0,3.3698,54.7927,5.1764,-1.2296,-0.2017,-3.8118,-119.0530,1.1852,-10.4098,27.0917,-34.5871,3.4820,1.7685,5.705,called_strike
6,Chris Sale,L,2-Seam Fastball,95.0,2260.0,3.4219,54.5096,4.7217,1.5492,0.2624,-9.1001,-137.8954,0.7146,21.5620,31.9468,-29.2870,3.6999,1.8906,5.990,ball
7,Chris Sale,L,Changeup,86.8,1983.0,3.1186,54.5559,4.8509,1.5110,-0.0854,-10.3517,-125.9333,-0.9164,17.9693,26.0566,-33.1188,3.4679,1.5305,5.942,called_strike
8,Chris Sale,L,Changeup,88.4,2263.0,3.3458,54.5125,4.7730,1.6507,0.1413,-11.4093,-128.0743,-0.5577,20.3268,27.3658,-30.7650,3.5189,1.7289,5.988,called_strike
9,Chris Sale,L,4-Seam Fastball,99.2,2443.0,3.2924,53.8375,5.1692,1.3019,0.8664,-16.1773,-143.4878,-7.2347,21.9161,33.1794,-18.6039,3.4287,1.4890,6.663,ball
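
When a name matches more than one player, the loop above keeps the largest `key_mlbam` among the matches, which is what `.iloc[...argmax()]` evaluates to. The sketch below contrasts that heuristic with an explicit season filter. The lookup frame is a toy with made-up values, and the column names `mlb_played_first` and `mlb_played_last` are assumed to match what `playerid_lookup` returns.

```python
import pandas as pd

# Toy lookup result for a name shared by two players; all values are made up.
player = pd.DataFrame({
    "name_last": ["smith", "smith"],
    "name_first": ["will", "will"],
    "key_mlbam": [111111, 222222],
    "mlb_played_first": [2001.0, 2012.0],
    "mlb_played_last": [2005.0, 2018.0],
})

# What the loop does: take the largest key_mlbam among the matches.
id_by_max = int(player["key_mlbam"].max())

# Alternative (an assumption, not the notebook's method):
# require that the player was active in the 2018 season.
active = player[(player["mlb_played_first"] <= 2018) & (player["mlb_played_last"] >= 2018)]
id_by_season = int(active["key_mlbam"].iloc[0])

print(id_by_max, id_by_season)
```

In this toy case both approaches pick the same ID. The season filter states the selection criterion explicitly, though it would still need handling for the case where no match, or more than one, was active in 2018.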


In [4]:
#verify that code above works: Ken Giles should be last pitcher
baseball.tail()


Unnamed: 0,player_name,p_throws,pitch_name,release_speed,release_spin_rate,release_pos_x,release_pos_y,release_pos_z,pfx_x,pfx_z,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,release_extension,description
128795,Ken Giles,R,4-Seam Fastball,96.8,2252.0,-2.0947,54.8756,6.1266,-0.5925,1.4848,3.5119,-140.7344,-5.557,-8.4711,31.5861,-11.9222,3.464,1.5896,5.624,ball
128796,Ken Giles,R,Slider,88.0,2316.0,-1.9062,56.1755,6.2282,0.287,0.1921,7.9625,-127.3866,-4.9651,1.1966,27.3637,-29.4953,3.2302,1.5342,4.324,ball
128797,Ken Giles,R,4-Seam Fastball,95.9,2512.0,-1.8896,55.3921,6.4011,-0.8028,1.5344,7.4805,-139.0689,-10.6161,-11.5094,26.4429,-11.1119,3.4374,1.3679,5.106,called_strike
128798,Ken Giles,R,4-Seam Fastball,97.7,2400.0,-2.2301,55.0385,6.0804,-1.0659,1.3443,5.986,-141.6711,-7.2205,-15.3199,35.5183,-13.2212,3.3353,1.6169,5.46,called_strike
128799,Ken Giles,R,Slider,86.0,2179.0,-1.9239,55.5515,6.1872,-0.0142,0.1733,3.7824,-124.7511,-5.4985,-0.9959,28.2202,-29.5437,3.2713,1.5507,4.951,ball
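
The `tail()` output also allows a quick arithmetic check of the author's note. The last row label is 128799, so the dataframe has 128,800 rows, which at 350 pitches per pitcher corresponds to 368 pitchers. The second half of the sketch applies the same idea to a toy frame, with made-up names, to confirm that every pitcher contributes the same number of rows.

```python
import pandas as pd

sample_size = 350
last_index = 128799              # last row label shown in the tail() output
n_rows = last_index + 1          # the index starts at 0
n_pitchers, remainder = divmod(n_rows, sample_size)
print(n_pitchers, remainder)     # 368 0

# Same check on a frame: every pitcher should contribute the same number of rows.
toy = pd.DataFrame({"player_name": ["A"] * 3 + ["B"] * 3})
counts = toy["player_name"].value_counts()
print(counts.nunique() == 1, int(counts.iloc[0]))
```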


In [5]:
#put resulting dataframe into a csv file. 

baseball.to_csv('Statcast_data.csv')
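
A short sketch of reading the file back may be useful. Because `to_csv` writes the row index by default, a plain `read_csv` returns it as an extra column named `Unnamed: 0`, whereas passing `index_col=0` restores it as the index. The example uses an in-memory buffer and a two-row toy frame rather than the real file.

```python
import io
import pandas as pd

toy = pd.DataFrame({"player_name": ["Chris Sale", "Ken Giles"],
                    "release_speed": [95.1, 96.8]})

buffer = io.StringIO()
toy.to_csv(buffer)      # default index=True writes the row labels as an unnamed first column
buffer.seek(0)

naive = pd.read_csv(buffer)                   # the labels come back as 'Unnamed: 0'
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # the labels are used as the index again

print(list(naive.columns))      # ['Unnamed: 0', 'player_name', 'release_speed']
print(list(restored.columns))   # ['player_name', 'release_speed']
```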

## Information regarding data

Because the data is technical and readers may not be familiar with the terms, the following is a list of the features with a short description of each, taken from the official documentation on the Baseball Savant website.

Note that the distance from the pitcher's rubber, where the pitcher throws, to home plate, where the batter stands, is 60 feet, 6 inches.
- `release_speed`: pitch velocity in miles per hour, reported out-of-hand.
- `release_pos_x`: horizontal release position of the ball, measured in feet from the catcher's perspective.
- `release_pos_z`: vertical release position of the ball, measured in feet from the catcher's perspective.
- `player_name`: the name of the pitcher.
- `description`: description of the resulting pitch: ball, blocked_ball, or called_strike.
- `p_throws`: the hand the pitcher throws with.
- `pfx_x`: horizontal movement, in feet, from the catcher's perspective.
- `pfx_z`: vertical movement, in feet, from the catcher's perspective.
- `vx0`: the velocity of the pitch, in feet per second, in the x-dimension, determined at y=50 feet.
- `vy0`: the velocity of the pitch, in feet per second, in the y-dimension, determined at y=50 feet.
- `vz0`: the velocity of the pitch, in feet per second, in the z-dimension, determined at y=50 feet.
- `ax`: the acceleration of the pitch, in feet per second per second, in the x-dimension, determined at y=50 feet.
- `ay`: the acceleration of the pitch, in feet per second per second, in the y-dimension, determined at y=50 feet.
- `az`: the acceleration of the pitch, in feet per second per second, in the z-dimension, determined at y=50 feet.
- `sz_top`: top of the batter's strike zone, set by the operator when the ball is halfway to the plate.
- `sz_bot`: bottom of the batter's strike zone, set by the operator when the ball is halfway to the plate.
- `release_spin_rate`: spin rate of the pitch as tracked by Statcast.
- `release_extension`: release extension of the pitch, in feet, as tracked by Statcast.
- `release_pos_y`: release position of the pitch, measured in feet from the catcher's perspective.
- `pitch_name`: the type of pitch, derived from Statcast.
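
To make the velocity components concrete, the sketch below combines `vx0`, `vy0`, and `vz0` into a single speed and converts it from feet per second to miles per hour. The helper `speed_mph` is hypothetical and is not part of pybaseball. The input values come from the first Chris Sale row in the output above, whose `release_speed` is 95.1 mph.

```python
import math

FT_PER_S_TO_MPH = 3600 / 5280  # 1 mile = 5280 ft, 1 hour = 3600 s

def speed_mph(vx0, vy0, vz0):
    """Magnitude of the velocity vector (ft/s) converted to mph."""
    return math.sqrt(vx0**2 + vy0**2 + vz0**2) * FT_PER_S_TO_MPH

# Components taken from the first Chris Sale row (release_speed = 95.1 mph).
print(round(speed_mph(-9.8035, -138.1130, 0.1339), 1))  # 94.4
```

The result is slightly below the reported release speed, which is consistent with the components being determined at y=50 feet rather than at the release point, after the ball has begun to decelerate.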

## Exclusion of possible features
Some possibly relevant features that Statcast provides were intentionally not collected. These include 'fielder_2', which is the identification number of the catcher.

'fielder_2' was excluded because the catcher is something of a wildcard. A catcher is most influential when a pitch is near the edge of the strike zone. Catchers have varying levels of 'pitch-framing' ability; pitch framing is a skill where a catcher 'presents' a pitch near the edge of the strike zone so that it looks as though the ball crossed over the plate. A good catcher can trick an umpire into calling an actual ball a strike. A poor defensive catcher may not get as many calls like this as a good defensive catcher, so it appears that including the catcher would be important. However, the intent of this analysis is to focus on the pitcher and his ability to throw a strike. As such, this analysis aims to remove as many 'human' elements from consideration as possible, including the catcher. While having a catcher feature in the dataset may contribute to predictive performance, the assumption is that professional umpires act as close to perfect as possible and can disregard a catcher's influence. This is a strong assumption, but one that will be made for the purposes of analysis. Future analysis can focus on using improved ball detection technology, such as automated strike zones and pitch trajectory information, to further replace any human elements of a pitch sequence.
