# Baseball Data Pipelines

This notebook walks through an analytical pipeline for the Statcast baseball data collected from www.baseballsavant.com. It establishes a baseline using two methods: Logistic Regression and Decision Tree. These two models were chosen for two main reasons: first, they serve the role of "traditional models"; second, they are, to an extent, interpretable. The ultimate goal of this project is to apply newer, more advanced modeling methods and extract knowledge with interpretation techniques. A benchmark therefore needs to be established, both in performance metrics (accuracy, AUC, etc.) and in interpretive power (such as coefficients and tree structure), in order to conclude that the proposed models and interpretation techniques provide significant advantages over these traditional methods.

The main goals of this notebook are:
- Establish performance metrics for both models on raw data.
- Investigate the interpretable aspects of both models and explore their validity.
- Establish appropriate preprocessing steps to improve model performance.
- Re-evaluate the models on processed data.
- Investigate the interpretable aspects of the models built on processed data.


Here is the outline of steps to be performed in this notebook.

1. Read in data.
    - Structure data as Pandas Dataframe
2. Ensure data quality
    - Investigate and remedy possible issues, such as missing data, poorly formatted data, data types, etc.
        - Check missing data
        - Check data types
        - Initial check on data values for anything illogical (for instance, 0 mph for pitch speed)
3. Light Exploratory Data Analysis
    - Visualize data distributions, measures of central tendencies, correlations
        - Build Histograms
        - Build Box-plots
        - Scatter plots of interesting features noted in step 2
        - Build correlation matrix
        - Check target feature balance
        - Analysis of each
4. Establish initial baselines
    - Establish models with the initial (default) hyperparameters of the scikit-learn library. Build training and testing split procedures, and train models on appropriate splits
        - If the target classes are imbalanced, balance them using a simple oversampler
        - Instantiate models using scikit-learn library
        - Build cross-validation schema
        - Implement model with cross-validation schema
        - Show evaluation metrics: classification accuracy, f1 score, AUC
5. Interpret both models
    - Look into models more in depth
        - Logistic Regression: show coefficients
        - Decision Tree: Visualize the tree in the notebook
        - Analyze both elements
6. Apply data preprocessing steps
    - Build data pipelines with preprocessing steps to improve results
        - Feature selection based on correlation analysis, statistical methods
        - Possible PCA
        - Data standardization
7. Re-Evaluate models 
    - Compare results of the raw data pipeline with the preprocessed data pipeline. For the better performing pipeline:
        - Run a simple grid search over hyperparameters to get the best performance.
        - Note: the number of leaves on the decision tree cannot be too large; we want to keep the ability to visualize and understand its decision rules.
        - Re-evaluate results using the best hyperparameters from the grid search.
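
As a rough illustration of step 4 in the outline, the sketch below cross-validates both baseline models and reports accuracy, F1, and AUC. It is only a sketch under stated assumptions: the features and target are synthetic stand-ins (not the Statcast data), the 5-fold stratified scheme and the `results` dictionary are illustrative choices, and the oversampling step is omitted because the synthetic target is roughly balanced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pitch features and a binary target.
# Assumption: the real notebook would pass the cleaned Statcast columns instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Both models keep scikit-learn's default hyperparameters
# (random_state only fixes the tree's tie-breaking for repeatability).
models = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five test folds.
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, results[name])
```

If oversampling were needed, it would have to be applied to the training folds only, so that duplicated rows never leak into the test fold.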
        


## Begin Initial Pipeline

### Start with Data Input

In [1]:
#start with all dependencies

import numpy as np
import pandas as pd
import sklearn
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score

In [4]:
# Read in the data as a pandas DataFrame
baseball = pd.read_csv('statcast_data.csv')

# The import produces an extra 'Unnamed: 21' column (a formatting issue); drop it here
baseball = baseball.drop(columns='Unnamed: 21')

# Preview the first five rows
baseball.head()

Unnamed: 0,release_speed,release_pos_x,release_pos_z,player_name,description,p_throws?,pfx_x,pfx_z,fielder_2?,vx0,...,vz0,ax,ay,az,sz_top,sz_bot,release_spin_rate,release_extension,release_pos_y,pitch_name
0,85.7,-2.5842,5.7006,Luis Severino,called_strike,R,1.084,-0.5006,596142,3.1515,...,-0.3659,10.5705,23.5605,-37.4886,3.1055,1.4646,2736.0,5.818,54.6805,1
1,88.4,-2.6028,5.5854,Luis Severino,blocked_ball,R,1.1356,-0.1705,596142,6.809,...,-6.1106,11.3649,23.893,-33.1358,3.1285,1.4876,2947.0,6.174,54.3243,1
2,86.4,-2.3152,5.7586,Luis Severino,blocked_ball,R,1.1174,-0.5522,596142,5.8596,...,-5.2221,10.3541,26.1915,-37.054,3.2894,1.6259,2876.0,5.824,54.6744,1
3,86.7,-2.4262,5.7775,Luis Severino,called_strike,R,0.7952,-0.1712,596142,2.214,...,-1.9537,7.9923,23.7258,-33.8381,3.1975,1.5798,2768.0,5.88,54.6185,1
4,98.2,-2.0388,5.8304,Luis Severino,ball,R,-0.1258,1.2586,596142,7.9476,...,-8.5002,-3.3795,29.3792,-13.1614,3.3739,1.568,2214.0,6.522,53.9761,2
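
The data-quality checks listed in step 2 of the outline (missing data, data types, illogical values) could look roughly like the following. This is a minimal sketch on a toy frame: the rows, including the missing value and the 0 mph speed, are made up for illustration, and the names `toy`, `missing_per_column`, and `illogical_speed` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one implausible speed (illustrative only).
toy = pd.DataFrame({
    "release_speed": [85.7, 0.0, 98.2, np.nan],
    "release_spin_rate": [2736.0, 2947.0, 2214.0, 2768.0],
})

# Count missing entries in each column.
missing_per_column = toy.isna().sum()

# Inspect the inferred data type of each column.
column_types = toy.dtypes

# Flag pitches whose recorded speed is zero or negative.
# Missing speeds are not flagged here because NaN comparisons are False.
illogical_speed = toy[toy["release_speed"] <= 0]

print(missing_per_column)
print(column_types)
print(illogical_speed)
```

On the real data the same three checks would be run on `baseball` instead of `toy`.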


In [3]:
baseball['player_name'].value_counts()

J.A. Happ            891
Eduardo Rodriguez    855
David Price          811
Blake Snell          810
Rick Porcello        783
Luis Severino        778
Andrew Cashner       774
Ryan Yarbrough       741
Dylan Bundy          726
Marco Estrada        663
CC Sabathia          646
Sam Gaviglio         643
Sonny Gray           642
Kevin Gausman        642
Brian Johnson        606
Chris Sale           569
Masahiro Tanaka      556
Aaron Sanchez        541
Marcus Stroman       520
Yefry Ramirez        501
Nathan Eovaldi       474
Miguel Castro        438
Jaime Garcia         418
Domingo German       411
Alex Cobb            389
Mike Wright          385
David Hess           379
Ryan Borucki         377
Joe Biagini          376
Yonny Chirinos       375
                    ... 
Reynaldo Lopez        58
Dallas Keuchel        57
Jakob Junis           56
Justin Verlander      56
Brett Anderson        55
Jacob deGrom          54
Wilmer Font           53
Brad Keller           53
Kyle Gibson           52


## Information regarding data

Because the data is highly technical and readers may not be familiar with the terminology, the following list gives each feature with a short description, taken from the official documentation on the Baseball Savant website.

Note that the distance between home plate, where the batter stands, and the pitcher’s mound, where the pitcher throws, is 60 feet 6 inches.
- Release_speed: pitch velocity, reported out-of-hand.
- Release_pos_x: horizontal release position of the ball, measured in feet from the catcher’s perspective.
- Release_pos_z: vertical release position of the ball, measured in feet from the catcher’s perspective.
- Player_name: the name of the pitcher.
- Description: description of the resulting pitch: ball, blocked_ball, called_strike.
- P_throws: hand the pitcher throws with.
- Pfx_x: horizontal movement in feet from the catcher’s perspective.
- Pfx_z: vertical movement in feet from the catcher’s perspective.
- Fielder_2: the identification number of the catcher.
- Vx0: the velocity of the pitch, in feet per second, in the x-dimension, determined at y=50 feet.
- Vy0: the velocity of the pitch, in feet per second, in the y-dimension, determined at y=50 feet.
- Vz0: the velocity of the pitch, in feet per second, in the z-dimension, determined at y=50 feet.
- ax: the acceleration of the pitch, in feet per second per second, in the x-dimension, determined at y=50 feet.
- ay: the acceleration of the pitch, in feet per second per second, in the y-dimension, determined at y=50 feet.
- az: the acceleration of the pitch, in feet per second per second, in the z-dimension, determined at y=50 feet.
- Sz_top: top of the batter’s strike zone, set by the operator when the ball is halfway to the plate.
- Sz_bot: bottom of the batter’s strike zone, set by the operator when the ball is halfway to the plate.
- Release_spin_rate: spin rate of the pitch tracked by Statcast.
- Release_extension: release extension of the pitch in feet as tracked by Statcast.
- Release_pos_y: release position of the pitch, measured in feet from the catcher’s perspective.
- Pitch_name: the type of pitch derived from Statcast, stored as an integer code (1: Slider, 2: 4-Seam fastball, 3: Changeup, 4: Cutter, 5: Sinker, 6: 2-Seam fastball, 7: Curveball, 8: Split-Finger).
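
Since Pitch_name is stored as an integer code, a small lookup can make outputs easier to read. The sketch below copies the code-to-label pairs from the Pitch_name entry above; the names `PITCH_NAMES` and `decode_pitch`, and the "Unknown" fallback for unlisted codes, are assumptions made for illustration.

```python
# Integer codes as listed in the Pitch_name entry of the feature list.
PITCH_NAMES = {
    1: "Slider",
    2: "4-Seam fastball",
    3: "Changeup",
    4: "Cutter",
    5: "Sinker",
    6: "2-Seam fastball",
    7: "Curveball",
    8: "Split-Finger",
}


def decode_pitch(code):
    """Return the pitch label for an integer code, or 'Unknown' if unlisted."""
    return PITCH_NAMES.get(int(code), "Unknown")


# Codes 1 and 2 appear in the sample rows; 9 is deliberately unlisted.
codes = [1, 1, 2, 9]
labels = [decode_pitch(c) for c in codes]
print(labels)
```

On the DataFrame itself, the same dictionary could be passed to `baseball['pitch_name'].map(PITCH_NAMES)`, which leaves unlisted codes as missing values rather than "Unknown".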

## Data formatting steps 

Looking at the data, there are some aspects we need to address:

- The target (description) has three classes: called_strike, blocked_ball, and ball. A blocked_ball is still a ball, so we'll replace blocked_ball with ball, leaving a two-class target.
- We can drop the pitcher's name from the data, as we don't need such identifying information.
- The same can be said about P_throws; since there are only two classes (R & L), this feature is most likely not contributing much information. In addition, the Release_pos_x and Release_pos_z features can convey the same type of information as P_throws.
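
The formatting steps above can be sketched as follows. The column names (`description`, `player_name`, `p_throws?`) follow the column listing in this notebook, and the three toy rows reuse values from the `head()` preview; the helper name `format_pitches` is hypothetical, and this is one possible way to apply the steps rather than the notebook's own implementation.

```python
import pandas as pd

# Toy rows standing in for the Statcast frame (a small subset of columns).
toy = pd.DataFrame({
    "release_speed": [85.7, 88.4, 98.2],
    "player_name": ["Luis Severino"] * 3,
    "description": ["called_strike", "blocked_ball", "ball"],
    "p_throws?": ["R", "R", "R"],
})


def format_pitches(df):
    """Merge blocked_ball into ball and drop the identifying columns."""
    out = df.copy()  # leave the caller's frame untouched
    out["description"] = out["description"].replace("blocked_ball", "ball")
    # errors="ignore" lets the helper run even if a column was already dropped.
    out = out.drop(columns=["player_name", "p_throws?"], errors="ignore")
    return out


clean = format_pitches(toy)
print(clean["description"].value_counts())
```

Working on a copy keeps the raw frame available for the raw-data baseline described in the outline.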

In [5]:
baseball.columns

Index(['release_speed', 'release_pos_x', 'release_pos_z', 'player_name',
       'description', 'p_throws?', 'pfx_x', 'pfx_z', 'fielder_2?', 'vx0',
       'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot', 'release_spin_rate',
       'release_extension', 'release_pos_y', 'pitch_name', 'Unnamed: 21'],
      dtype='object')

Attribute Lists:

The data were collected from the Baseball Savant website and cover pitches thrown by starting pitchers from American League East teams, namely the New York Yankees, the Boston Red Sox, the Baltimore Orioles, the Toronto Blue Jays, and the Tampa Bay Rays. In total, 30,000 pitches from the 2018 season were collected. Each pitch has corresponding measurements made by Statcast; the following is an explanation of the features that are measured.

Note that the distance between home plate, where the batter stands, and the pitcher’s mound, where the pitcher throws, is 60 feet 6 inches.
- Player_name: the name of the pitcher.
- Pitch_name: the type of pitch derived from Statcast (slider, 4-seam fastball, etc.).
- Release_speed: pitch velocity, reported out-of-hand.
- P_throws: hand the pitcher throws with.
- Release_pos_x: horizontal release position of the ball, measured in feet from the catcher’s perspective.
- Release_pos_z: vertical release position of the ball, measured in feet from the catcher’s perspective.
- Release_pos_y: release position of the pitch, measured in feet from the catcher’s perspective.
- Description: description of the resulting pitch: ball, blocked_ball, called_strike.
- Pfx_z: vertical movement in feet from the catcher’s perspective.
- Pfx_x: horizontal movement in feet from the catcher’s perspective.
- Release_spin_rate: spin rate of the pitch tracked by Statcast.
- Vx0: the velocity of the pitch, in feet per second, in the x-dimension, determined at y=50 feet.
- Vy0: the velocity of the pitch, in feet per second, in the y-dimension, determined at y=50 feet.
- Vz0: the velocity of the pitch, in feet per second, in the z-dimension, determined at y=50 feet.
- Spin_dir:
- ax: the acceleration of the pitch, in feet per second per second, in the x-dimension, determined at y=50 feet.
- ay: the acceleration of the pitch, in feet per second per second, in the y-dimension, determined at y=50 feet.
- az: the acceleration of the pitch, in feet per second per second, in the z-dimension, determined at y=50 feet.
- Sz_top: top of the batter’s strike zone, set by the operator when the ball is halfway to the plate.
- Sz_bot: bottom of the batter’s strike zone, set by the operator when the ball is halfway to the plate.
- Fielder_2: the identification number of the catcher.
