# Baseball Data Pipelines

This notebook walks through an analytical pipeline for the Statcast baseball data collected from www.baseballsavant.com. It establishes baseline performance using two methods: Logistic Regression and Decision Tree. These two models are chosen for two main reasons: first, they serve the role of "traditional models"; second, they are, to an extent, interpretable. The ultimate goal of this analytical project is to utilize newer, more advanced modeling methods and to extract knowledge with interpretation techniques. Thus, a benchmark needs to be established, both in terms of performance metrics (accuracy, AUC, etc.) and interpretive power (coefficients, tree structure), so that we can judge whether the proposed models and interpretation techniques provide significant advantages over these traditional methods.

The main goals of this notebook are:
- Establish performance metrics for both models on raw data.
- Investigate the interpretable aspects of both models and explore their validity.
- Establish appropriate preprocessing steps to improve model performance.
- Re-evaluate the models on processed data.
- Investigate the interpretable aspects of the models built on processed data.


Here is the outline of steps to be performed in this notebook.

1. Read in data.
    - Structure data as a pandas DataFrame
2. Ensure data quality
    - Investigate and remedy possible issues, such as missing data, poorly formatted data, data types, etc.
        - Check missing data
        - Check data types
        - Initial check on data values to see whether any are illogical (for instance, 0 mph for pitch speed)
3. Light Exploratory Data Analysis
    - Visualize data distributions, measures of central tendency, correlations
        - Build Histograms
        - Build Box-plots
        - Scatter plots of interesting features noted in step 2
        - Build correlation matrix
        - Check target feature balance
        - Analysis of each
4. Establish initial baselines
    - Establish models with the default hyperparameters of the scikit-learn library. Build training and testing split procedures, and train models on the appropriate splits
        - If the target is imbalanced, balance the classes using a simple oversampler
        - Instantiate models using scikit-learn library
        - Build cross-validation schema
        - Implement model with cross-validation schema
        - Show evaluation metrics: classification accuracy, F1 score, AUC
5. Interpret both models
    - Look into models more in depth
        - Logistic Regression: show coefficients
        - Decision Tree: Visualize the tree in the notebook
        - Analyze both elements
6. Apply data preprocessing steps
    - Build data pipelines with preprocessing steps to improve results
        - Feature selection based on correlation analysis, statistical methods
        - Possible PCA
        - Data standardization
7. Re-evaluate models
    - Compare results of the raw-data pipeline with the preprocessed-data pipeline. For the better-performing pipeline:
        - Run a simple grid search over hyperparameters to get the best performance.
        - Note: the number of leaves on the decision tree cannot be too large; we want to keep the ability to visualize and understand its decision rules.
        - Re-evaluate results using the best hyperparameters found by the grid search.
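
Steps 4, 5, and 7 of the outline can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: it uses a synthetic dataset from `make_classification` as a stand-in for the pitch data, assumes a binary target, and omits the oversampling step. The grid values for `max_leaf_nodes` and `min_samples_leaf` are arbitrary choices, picked only to show how the leaf count can be capped so the tree stays readable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the pitch data (assumption: numeric features, binary target).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)

# Step 4: one cross-validation schema shared by both baseline models.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]
models = {
    "logit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}

baseline = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the folds.
    baseline[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, baseline[name])

# Step 5: coefficients of a logistic regression fitted on all of the data.
logit = LogisticRegression(max_iter=1000).fit(X, y)
coefs = logit.coef_.ravel()
print(coefs)

# Step 7: small grid search that keeps the tree small enough to read.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 8, 16], "min_samples_leaf": [1, 10]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X, y)
best_tree = grid.best_estimator_
print(grid.best_params_)
print(export_text(best_tree))  # plain-text view of the decision rules
```

If an oversampler is added, it should probably be fitted inside each training fold rather than applied before splitting, so that duplicated rows do not leak into the validation folds.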
        


## Begin Initial Pipeline

### Start with Data Input

In [1]:
#start with all dependencies

import numpy as np
import pandas as pd
import sklearn
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score

In [2]:
#Read in the data as pandas dataframe

baseball = pd.read_csv('st')
baseball.head()

FileNotFoundError: File b'statcast_data.csv' does not exist
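
The error above means the CSV was not found relative to the notebook's working directory. One possible guard, sketched here with a hypothetical helper named `load_pitches`, checks the path first and reports the absolute location that was searched:

```python
from pathlib import Path

import pandas as pd


def load_pitches(csv_path):
    """Read a Statcast CSV export, failing early with the resolved path."""
    path = Path(csv_path)
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"Expected a Statcast CSV export at {path.resolve()}")
    return pd.read_csv(path)
```

With this helper, the read step would become something like `baseball = load_pitches('statcast_data.csv')`, using the file name reported in the error message.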

Attribute Lists:

The data were collected from the Baseball Savant website and cover pitches thrown by starting pitchers from American League East teams, namely the New York Yankees, the Boston Red Sox, the Baltimore Orioles, the Toronto Blue Jays, and the Tampa Bay Rays. The dataset contains 30,000 pitches from the 2018 season. Each pitch has corresponding measurements made by Statcast; the following is an explanation of the features that are measured.

Note that the distance between home plate, where the batter stands, and the pitcher's rubber on the mound, where the pitcher throws from, is 60 feet and 6 inches.
- Player_name: the name of the pitcher
- Pitch_name: the type of pitch derived from Statcast. (slider, 4-seam fastball, etc.)
- Release_speed: pitch velocity, reported out-of-hand.
- P_throws: hand the pitcher throws with.
- Release_pos_x: horizontal release position of the ball, measured in feet from the catcher's perspective.
- Release_pos_z: vertical release position of the ball, measured in feet from the catcher's perspective.
- Release_pos_y: release position of the pitch, measured in feet from the catcher's perspective.
- Description: description of the resulting pitch (e.g., ball, blocked_ball, called strike).
- Pfx_z: Vertical movement in feet from the catcher’s perspective.
- Pfx_x: Horizontal movement in feet from the catcher’s perspective.
- Release_spin_rate: Spin rate of the pitch tracked by Statcast.
- Vx0: the velocity of the pitch, measured in feet per second, in the x-dimension, determined at y=50 feet.
- Vy0: The velocity of the pitch, in feet per second, in the y-dimension, determined at y=50 feet.
- Vz0: the velocity of the pitch, in feet per second, in the z-dimension, determined at y=50 feet.
- Spin_dir: 
- ax: the acceleration of the pitch, in feet per second per second, in the x-dimension, determined at y=50 feet.
- ay: the acceleration of the pitch, in feet per second per second, in the y-dimension, determined at y=50 feet.
- az: the acceleration of the pitch, in feet per second per second, in the z-dimension, determined at y=50 feet.
- Sz_top: Top of the batter’s strike zone set by the operator when the ball is halfway to the plate
- Sz_bottom: Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate. 
- Fielder_2: the identification number of the catcher.
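
The attribute list above suggests a few of the sanity checks planned in step 2. The sketch below runs them on a small, hand-made DataFrame; the rows are invented purely to exercise the checks, and the lowercase column names are an assumption about how the CSV headers are spelled.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the values are made up and are not real Statcast data.
pitches = pd.DataFrame({
    "release_speed": [93.1, 0.0, 88.4, np.nan],
    "release_spin_rate": [2250.0, 2100.0, np.nan, 2400.0],
    "sz_top": [3.4, 3.5, 1.5, 3.3],
    "sz_bottom": [1.6, 1.5, 3.4, 1.6],
})

# Missing values per column, and the inferred data types.
missing_per_column = pitches.isna().sum()
print(missing_per_column)
print(pitches.dtypes)

# Illogical values: a pitch cannot have non-positive speed, and the top of
# the strike zone must sit above the bottom.
zero_speed = pitches["release_speed"] <= 0
inverted_zone = pitches["sz_top"] <= pitches["sz_bottom"]
suspect = pitches[zero_speed | inverted_zone]
print(suspect)
```

Comparisons against `NaN` evaluate to `False`, so rows with missing speed are not flagged by the range check and need the separate `isna` count.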
