# Step00: Let's find the data

In this project, I plan to classify Max Scherzer's pitch selection based on the game situation.  Although the parameters are yet to be defined, potential factors include the balls, strikes, outs, and/or inning.  I hope to classify Max's choice of pitch type and pitch location for a given situation.  To do this, I must acquire data on every pitch he threw in the 2019 season.  This notebook documents key findings in the pursuit of this data.  Wish me luck!

<img src='../images/madmax.png' alt='Drawing' style='width: 450px;'/>

## sabr.org provides some guidance
Helpful hints provided here: https://sabr.org/sabermetrics/data

"For those of us who want to do more complicated things, Baseball Reference, awesome as it is, just isn’t enough. We need the raw data on our own computers, so we can manipulate it in ways that B-R never thought of. There are two main sources of raw data: the Lahman Database and Retrosheet."

__Leads__
- Lahman Database
- Retrosheet

## Review Lahman Database

Lahman (http://www.seanlahman.com/baseball-archive/statistics/) provides raw data in MS Access, SQL, csv, R, and SQLite formats for the 2019 season.  I'll grab the csv for an initial look at whether this data is suitable.  If it looks good, SQL or SQLite look like viable options for more sophisticated EDA in... later innings.

### Review Lahman Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
lahman = pd.read_csv('../data/lahman/core/Pitching.csv')

In [3]:
lahman.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,...,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP
0,bechtge01,1871,1,PH1,,1,2,3,3,2,...,,7,,0,146.0,0,42,,,
1,brainas01,1871,1,WS3,,12,15,30,30,30,...,,7,,0,1291.0,0,292,,,
2,fergubo01,1871,1,NY2,,0,0,1,0,0,...,,2,,0,14.0,0,9,,,
3,fishech01,1871,1,RC1,,4,16,24,24,22,...,,20,,0,1080.0,1,257,,,
4,fleetfr01,1871,1,NY2,,0,1,1,1,1,...,,0,,0,57.0,0,21,,,


Wow, this goes all the way back to 1871?  Let's take a look at 2019 only...

In [4]:
s19 = lahman.loc[(lahman.yearID == 2019)]

Can we find Max Scherzer in the playerID column?  Let's find out...

In [5]:
s19.loc[(s19.playerID == 'scherma01')]

Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,...,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP
47441,scherma01,2019,1,WAS,NL,11,7,27,27,0,...,2.0,0,7.0,0,693.0,0,59,0.0,2.0,7.0


Ok, so I cheated a little and reviewed the People.csv document with Numbers (on Mac) to discover Max's playerID (scherma01).  Looks like this data set is a dud.  It provides some great aggregated data, but one row per pitcher per season is not granular enough when I need one row per pitch.  The search continues for more granular data.
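
For the record, the playerID lookup could also be done in pandas rather than by eyeballing the file in Numbers. This is a minimal sketch, not what I actually ran: it assumes People.csv carries `playerID`, `nameFirst`, and `nameLast` columns, and it uses a tiny stand-in frame (with two made-up placeholder rows) so it runs without the file. The helper name `find_player_ids` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for People.csv; in the notebook this would be read with
# pd.read_csv from the Lahman core folder (assumed location, not verified).
people = pd.DataFrame({
    'playerID': ['scherma01', 'doejo01', 'roeja01'],  # last two are made-up placeholders
    'nameFirst': ['Max', 'John', 'Jane'],
    'nameLast': ['Scherzer', 'Doe', 'Roe'],
})


def find_player_ids(people, first, last):
    """Return playerID values whose first and last names match (case-insensitive)."""
    mask = (
        people['nameFirst'].str.lower().eq(first.lower())
        & people['nameLast'].str.lower().eq(last.lower())
    )
    return people.loc[mask, 'playerID'].tolist()


max_ids = find_player_ids(people, 'Max', 'Scherzer')
print(max_ids)
```

The function returns a list because two players can share a name, in which case a birth year or debut date would be needed to pick the right one.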

## Let's take a look at Retrosheet

Like Lahman, this site provides some great historical data about MLB, but it's not providing what's needed for this analysis. Back to the drawing board.

## MLB official API

Thanks to a helpful post from Micheal Willard (http://michealwillard.com/mlbam_api/), it appears that MLB offers access to its API.  Worth digging into this resource.

In [6]:
# Base url for MLB API
api = 'http://gd2.mlb.com/components/game/mlb/'

### Oh, imagine that, there's a Python wrapper for the MLB API... home run!

Source: https://pypi.org/project/MLB-StatsAPI/ <br>
toddrob99 (GitHub handle) provides the repo here: https://github.com/toddrob99/MLB-StatsAPI <br>
Let's import this package...

In [10]:
# !pip install MLB-StatsAPI
# !python3 -m pip install --upgrade MLB-StatsAPI

In [11]:
import statsapi

In [12]:
### toddrob99 provides the following block of code for logging
import logging
logger = logging.getLogger('statsapi')
logger.setLevel(logging.DEBUG)
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
formatter = logging.Formatter("%(asctime)s - %(levelname)8s - %(name)s(%(thread)s) - %(message)s")
ch.setFormatter(formatter)
rootLogger.addHandler(ch)

### Medium tutorial
Austin L.E. Krause provides a tutorial of toddrob99's tool here: <br>
https://medium.com/better-programming/using-the-mlb-stats-api-to-get-daily-data-88f48266656c

Let's step up to the plate and try this out... <br>
Swing and a miss, the data is too high level.

## Reddit, of course
Oh, what do we have here, toddrob99 provides some info on his Reddit page: <br>
https://www.reddit.com/r/baseball/comments/bjovz3/new_python_wrapper_for_mlb_stats_api/

...and swing and a miss.  Strike two.
Let's get some fresh pine tar on this Louisville Slugger and defend the plate...

## As said by that breathing piece of garbage, Bill O'Reilly...

<table><tr>
<td> <img src='../images/bor_sucks.gif' alt='Drawing' style='width: 450px;'/> </td>
<td> <img src='../images/bor_live.gif' alt='Drawing' style='width: 400px;'/> </td>
</tr></table>

## Said differently, let's keep looking... and it appears we've found pay dirt!
The Baseball Savant website has the data we need!  It provides the per-pitch data we've been searching for.  The link below provides data on every pitch Mad Max threw in 2019. <br>

https://baseballsavant.mlb.com/statcast_search?hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=pitcher&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&pitchers_lookup%5B%5D=453286&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_pas=0&chk_pitch_type=on&chk_pitch_result=on&chk_count=on&chk_batter_stands=on&chk_inning=on&chk_metric1_gt=on&chk_metric2_gt=on&chk_metric3_gt=on&chk_outs=on&chk_pitcher_throws=on&chk_runner_on=on&chk_metric1_lt=on&chk_metric2_lt=on&chk_metric3_lt=on#results
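
That URL is a mouthful, so here is a small sketch that pulls out the filters actually set in it. To keep it short, the URL below is trimmed to a handful of the non-empty parameters from the link above. My reading of the names is a guess, not documented fact: `hfSea` looks like the season, `hfGT` the game type, and `pitchers_lookup[]` the pitcher's MLB ID.

```python
from urllib.parse import urlparse, parse_qs

# The Savant search URL from above, trimmed to a few of its non-empty parameters
url = (
    'https://baseballsavant.mlb.com/statcast_search'
    '?hfGT=R%7C&hfSea=2019%7C&player_type=pitcher'
    '&pitchers_lookup%5B%5D=453286&group_by=name#results'
)

parts = urlparse(url)
params = parse_qs(parts.query)  # percent-decodes keys/values; blank values are dropped

for key, values in params.items():
    print(key, values)
```

Seen this way, the search boils down to three filters: a season, a game type, and one pitcher ID.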

## It's time for...

<img src='../images/closer_look.gif' alt='Drawing' style='width: 450px;'/>

In [13]:
madmax = pd.read_csv('../data/savant/madmax_2019_pitches.csv')

In [49]:
madmax.head()

Unnamed: 0,Pitch,MPH,EV (MPH),Pitcher,Batter,Dist,Spin Rate,LA (°),Zone,Date,Count,Inning,Pitch Result,PA Result
0,FF,92.7,76.7,Max Scherzer,Adam Haseley,261.0,2299.0,38.5,6.0,2019-09-24,0-0,Top 6,hit_into_play,Adam Haseley flies out to center fielder Victo...
1,CH,84.4,51.5,Max Scherzer,Scott Kingery,3.0,1494.0,-24.0,9.0,2019-09-24,1-2,Top 6,hit_into_play_no_out,Scott Kingery singles on a soft ground ball to...
2,FF,97.4,76.3,Max Scherzer,Scott Kingery,203.0,2465.0,27.1,8.0,2019-09-24,1-2,Top 6,foul,
3,SL,84.8,,Max Scherzer,Scott Kingery,,2295.0,,14.0,2019-09-24,0-2,Top 6,ball,
4,CU,79.1,69.2,Max Scherzer,Scott Kingery,164.0,2818.0,62.8,4.0,2019-09-24,0-1,Top 6,foul,
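
The `Count` and `Inning` columns above arrive as strings, while the factors I want to classify on (balls, strikes, inning) are numbers. Here is a rough sketch of how they might be split, using three rows copied from the output above. It assumes `Count` reads "balls-strikes" and `Inning` reads "half number"; the new column names are my own.

```python
import pandas as pd

# Three rows copied from the madmax.head() output (only the columns used here)
sample = pd.DataFrame({
    'Pitch': ['FF', 'CH', 'SL'],
    'Count': ['0-0', '1-2', '0-2'],
    'Inning': ['Top 6', 'Top 6', 'Top 6'],
})

# Assumption: Count is "balls-strikes"
counts = sample['Count'].str.split('-', expand=True).astype(int)
sample['balls'] = counts[0]
sample['strikes'] = counts[1]

# Assumption: Inning is "<half> <number>", e.g. "Top 6"
halves = sample['Inning'].str.split(' ', expand=True)
sample['half'] = halves[0]
sample['inning_num'] = halves[1].astype(int)

print(sample[['Pitch', 'balls', 'strikes', 'half', 'inning_num']])
```

Outs do not appear in this export, so that factor would have to come from somewhere else.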


## We are in business!
That ends Step00: Data Research.  We'll pick up in Step01: Data Cleaning and Analysis.