## Scraping Baseball Savant

The purpose of this notebook is to scrape [Baseball Savant](https://baseballsavant.mlb.com/)'s PITCHf/x and Statcast data and load it into a SQL database. For my SQL needs, I use a local MySQL database, managed through MySQL Workbench.

### Loading Packages

In [1]:
import pandas as pd
import numpy as np
import requests
import io
import pymysql  # driver named in the "mysql+pymysql" engine URLs below
import sqlalchemy
from sqlalchemy import create_engine
import datetime as dt

### Defining Data Types

I found it easier to let SQLAlchemy generate the DDL than to write `CREATE TABLE` statements by hand and run them in MySQL Workbench. The column types for the data that Baseball Savant provides are defined here.

In [2]:
dtypes = {'pitch_type' : sqlalchemy.types.VARCHAR(length=2),
          'game_date' : sqlalchemy.types.DATETIME(),
          'release_speed': sqlalchemy.types.Float(precision=1, asdecimal=True),
          'release_pos_x': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'release_pos_z': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'player_name': sqlalchemy.types.VARCHAR(length=64),
          'batter': sqlalchemy.types.INTEGER(),
          'pitcher': sqlalchemy.types.INTEGER(),
          'events': sqlalchemy.types.VARCHAR(length=128),
          'description': sqlalchemy.types.VARCHAR(length=1024),
          'spin_dir': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'spin_rate_deprecated': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'break_angle_deprecated': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'break_length_deprecated': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'zone': sqlalchemy.types.INTEGER(),
          'des': sqlalchemy.types.VARCHAR(length=1024),
          'game_type': sqlalchemy.types.VARCHAR(length=1),
          'stand': sqlalchemy.types.VARCHAR(length=1),
          'p_throws': sqlalchemy.types.VARCHAR(length=1),
          'home_team': sqlalchemy.types.VARCHAR(length=3),
          'away_team': sqlalchemy.types.VARCHAR(length=3),
          'type': sqlalchemy.types.VARCHAR(length=1),
          'hit_location': sqlalchemy.types.INTEGER(),
          'bb_type': sqlalchemy.types.VARCHAR(length=16),
          'balls': sqlalchemy.types.INTEGER(),
          'strikes': sqlalchemy.types.INTEGER(),
          'game_year': sqlalchemy.types.INTEGER(),
          'pfx_x': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'pfx_z': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'plate_x': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'plate_z': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'on_3b': sqlalchemy.types.INTEGER(),
          'on_2b': sqlalchemy.types.INTEGER(),
          'on_1b': sqlalchemy.types.INTEGER(),
          'outs_when_up': sqlalchemy.types.INTEGER(),
          'inning': sqlalchemy.types.INTEGER(),
          'inning_topbot': sqlalchemy.types.VARCHAR(length=3),
          'hc_x': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'hc_y': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'tfs_deprecated': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'tfs_zulu_deprecated': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'fielder_2': sqlalchemy.types.INTEGER(), 
          'umpire': sqlalchemy.types.INTEGER(),
          'sv_id': sqlalchemy.types.VARCHAR(length=16),
          'vx0': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'vy0': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'vz0': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'ax': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'ay': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'az': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'sz_top': sqlalchemy.types.Float(precision=2, asdecimal=True),
          'sz_bot': sqlalchemy.types.Float(precision=2, asdecimal=True),
          'hit_distance_sc': sqlalchemy.types.INTEGER(),
          'launch_speed': sqlalchemy.types.Float(precision=1, asdecimal=True),
          'launch_angle': sqlalchemy.types.Float(precision=1, asdecimal=True),
          'effective_speed': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'release_spin_rate': sqlalchemy.types.INTEGER(),
          'release_extension': sqlalchemy.types.Float(precision=3, asdecimal=True),
          'game_pk': sqlalchemy.types.INTEGER(),
          'pitcher.1': sqlalchemy.types.INTEGER(),
          'fielder_2.1': sqlalchemy.types.INTEGER(),
          'fielder_3': sqlalchemy.types.INTEGER(),
          'fielder_4': sqlalchemy.types.INTEGER(),
          'fielder_5': sqlalchemy.types.INTEGER(),
          'fielder_6': sqlalchemy.types.INTEGER(),
          'fielder_7': sqlalchemy.types.INTEGER(),
          'fielder_8': sqlalchemy.types.INTEGER(),
          'fielder_9': sqlalchemy.types.INTEGER(),
          'release_pos_y': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'estimated_ba_using_speedangle': sqlalchemy.types.Float(precision=3, asdecimal=True),
          'estimated_woba_using_speedangle': sqlalchemy.types.Float(precision=3, asdecimal=True),
          'woba_value': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'woba_denom': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'babip_value': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'iso_value': sqlalchemy.types.Float(precision=4, asdecimal=True),
          'launch_speed_angle': sqlalchemy.types.INTEGER(),
          'at_bat_number': sqlalchemy.types.INTEGER(),
          'pitch_number': sqlalchemy.types.INTEGER(),
          'pitch_name': sqlalchemy.types.VARCHAR(length=16),
          'home_score': sqlalchemy.types.INTEGER(),
          'away_score': sqlalchemy.types.INTEGER(),
          'bat_score': sqlalchemy.types.INTEGER(),
          'fld_score': sqlalchemy.types.INTEGER(),
          'post_away_score': sqlalchemy.types.INTEGER(),
          'post_home_score': sqlalchemy.types.INTEGER(),
          'post_bat_score': sqlalchemy.types.INTEGER(),
          'post_fld_score': sqlalchemy.types.INTEGER(),
          'if_fielding_alignment': sqlalchemy.types.VARCHAR(length=64),
          'of_fielding_alignment': sqlalchemy.types.VARCHAR(length=64)}
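
As far as I know, `to_sql` only applies the `dtype` mapping to the columns it names and falls back to inferred types for anything else, so it can help to compare a downloaded header against the mapping before loading. The following is a minimal sketch using only the standard library. The function `untyped_columns`, the sample CSV text, and the shortened `known_types` set are hypothetical stand-ins for a real download and for `dtypes.keys()`.

```python
import csv
import io


def untyped_columns(csv_text, typed_names):
    """Return header names from csv_text that have no entry in typed_names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty download yields an empty header
    return [name for name in header if name not in typed_names]


# Hypothetical, shortened stand-ins for a Savant CSV and for dtypes.keys().
sample_csv = "pitch_type,game_date,release_speed,new_column\nFF,2020-09-01,95.1,1\n"
known_types = {"pitch_type", "game_date", "release_speed"}

missing = untyped_columns(sample_csv, known_types)
print(missing)
```

In the notebook itself, the same check could be run against `df.columns` and `dtypes` right before the upload.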

### Downloading the Data and Loading to SQL

This next section required some fine-tuning for performance. Baseball Savant queries time out if they take too long, and its CSV export cuts off at a fixed number of rows. For this reason, I decided to download one day of data at a time.
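
The one-day-at-a-time approach can be sketched with the standard library alone. The helper names `daily_windows` and `savant_csv_url` are hypothetical, and the parameter set is an assumption: it keeps only the non-empty parameters from the URL used in this notebook, and I have not verified that the endpoint accepts the query without the empty ones.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://baseballsavant.mlb.com/statcast_search/csv"


def daily_windows(start, end):
    """Yield each calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def savant_csv_url(day):
    """Build a one-day query URL (window start and end are the same day)."""
    stamp = day.strftime("%Y-%m-%d")
    # Assumption: only the non-empty parameters of the notebook's URL are sent.
    params = {
        "all": "true",
        "hfGT": "R|PO|S|",
        "player_type": "pitcher",
        "game_date_gt": stamp,
        "game_date_lt": stamp,
        "min_pitches": "0",
        "min_results": "0",
        "group_by": "name",
        "sort_col": "pitches",
        "player_event_sort": "h_launch_speed",
        "sort_order": "desc",
        "min_abs": "0",
        "type": "details",
    }
    return BASE_URL + "?" + urlencode(params)


urls = [savant_csv_url(d) for d in daily_windows(date(2020, 9, 1), date(2020, 9, 3))]
print(len(urls))
print(urls[0])
```

Building the query string with `urlencode` keeps the date window readable and handles the escaping of `|` in the game-type filter.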

The queries also fail occasionally for reasons I have not identified. To handle this, the loop uses a try/except block that stops at the first date whose data pull fails. For this example, I am running six days of uploads (August 31, 2020 to September 5, 2020).

In [3]:
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="root",
                               pw="rootroot",
                               db="mlb"))

latest = pd.read_sql("SELECT max(game_date) from pitch_tracking;",con = engine)
print(latest.iloc[0,0])

2020-08-31 00:00:00
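
The query above reports the most recent `game_date` already stored. One way to avoid loading the same day twice is to start the next pull on the following day. This is a small sketch, assuming the value returned by the query behaves like a `datetime` (a pandas `Timestamp` does); `next_start_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime, timedelta


def next_start_date(latest_loaded):
    """Return the calendar day after the most recent game_date already stored."""
    return (latest_loaded + timedelta(days=1)).date()


# Stand-in for latest.iloc[0, 0] as printed above.
latest_loaded = datetime(2020, 8, 31)
print(next_start_date(latest_loaded))
```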


In [4]:
# The engine created in the previous cell is reused for every upload.
date = pd.to_datetime("08/31/2020")

while date <= pd.to_datetime("09/05/2020"):

    # Query a single day: the start and end of the window are the same date.
    day = date.strftime("%Y-%m-%d")

    url = "https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7CPO%7CS%7C=&hfSea=&hfSit=&player_type=pitcher&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt={}&game_date_lt={}&team=&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=details&".format(day, day)

    try:
        # The request sits inside the try block so that network errors also
        # stop the loop cleanly. timeout=None waits indefinitely for a response.
        s = requests.get(url, timeout=None)
        s.raise_for_status()  # treat HTTP error responses as failed pulls

        df = pd.read_csv(io.StringIO(s.content.decode('utf-8')))

        df.to_sql('pitch_tracking', con=engine, if_exists='append',
                  chunksize=1000, dtype=dtypes, index=False)

        print("Download successful for {}".format(day))

        date = date + dt.timedelta(1)
    except Exception:
        # Stop at the first failing date so the pull can be resumed from there.
        print("\n\nDownload failed at {}\n\n".format(day))
        break

Download successful for 2020-08-31
Download successful for 2020-09-01
Download successful for 2020-09-02
Download successful for 2020-09-03
Download successful for 2020-09-04
Download successful for 2020-09-05
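
The loop above stops at the first date that fails. Since the failures seem to be intermittent, an alternative is to retry a date a few times before giving up. The sketch below is one possible approach, not what this notebook does: `fetch_with_retries` and `flaky_download` are hypothetical names, and the simulated failure stands in for a real request to Baseball Savant.

```python
import time


def fetch_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay, then twice that, and so on between failed attempts,
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky download: fails twice, then returns CSV text.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "pitch_type,game_date\nFF,2020-09-01\n"


text = fetch_with_retries(flaky_download, base_delay=0.0)
print(calls["count"], len(text.splitlines()))
```

In the notebook, `fetch` could be a small function wrapping the `requests.get` call for one date, with the existing `break` kept as the fallback once the retries are exhausted.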


Baseball Savant carries data going back to 2008. The number of pitches stored in the database for each year is shown here.

In [5]:
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="root",
                               pw="rootroot",
                               db="mlb"))

pitches_by_year = pd.read_sql("""SELECT game_year, count(*) from pitch_tracking
GROUP BY game_year
ORDER BY game_year ASC;""",con = engine)

pitches_by_year

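|   | game_year | count(*) |
|---|-----------|----------|
| 0 | 2008 | 723745 |
| 1 | 2009 | 726125 |
| 2 | 2010 | 719561 |
| 3 | 2011 | 718963 |
| 4 | 2012 | 716238 |
| 5 | 2013 | 720702 |
| 6 | 2014 | 714305 |
| 7 | 2015 | 712840 |
| 8 | 2016 | 726023 |
| 9 | 2017 | 732476 |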


In [6]:
engine.dispose()