*Tim Boudreau <br>
tim.boudreau25@gmail.com* <br>
https://github.com/timboudreau25/
***

# Predicting Batted Balls - Popups

*Note: my primary focus is popups, though the approach can easily be modified for any batted ball type.*

## Popups

There are few guarantees in baseball. Three strikes and you're out. A ball hit fair 450 feet is a home run. And a popup should be caught. Popups are nearly guaranteed outs, which makes them a valuable batted ball outcome. Knowing whether or not a pitcher can induce popups is valuable information, especially if it can be predicted. I chose to explore batted ball prediction, specifically for popups, to see if I could predict popup rates for pitchers. If producing popups is a skill, I believe a pitcher who is inducing fewer popups than predicted may be having poor luck and may be undervalued.

In [27]:
%matplotlib inline

from __future__ import division
import pandas as pd
import statsmodels.api as sm
import numpy as np
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt

## Data Source:

The data I used to form my database is from BaseballSavant (https://baseballsavant.mlb.com/about), through its Statcast search tool. My data consists of every pitch and outcome from April through August 2017.

## Data Acquisition:

### Database Formation:

I built the database locally, as a separate step from the analysis, because forming it took a few minutes on my machine. Reading from a SQL database saved me quite a bit of time each time I reran the analysis.

Due to limitations on query sizes through BaseballSavant, I could only export approximately 30,000 pitches per query, or about a week's worth of data. Before I formed my local database, I merged each week's data.

In [28]:
# ## current directory of data files

# cd = "/Data/"     # change if needed


# ## create empty dataframe

# data = pd.DataFrame()


# ## loop through each sheet and append the previous one

# for count in range(1, 23):
# 	import_data = pd.read_csv(cd + "savant_" 
# 		+ str(count) + ".csv")
# 	data = data.append(import_data)

    
# ## set any cells that say 'null' to numeric 123456789, and convert numeric columns to numeric

# data = data.replace(to_replace = 'null', value = "123456789").apply(pd.to_numeric, errors = "ignore")


## create/connect to database

path = 'Data/mlb_data.db'
conn = sqlite3.connect(path)
c = conn.cursor()


# # add dataframe to database

# data.to_sql("MLB_2017", conn, if_exists="replace")

### SQL Query in Database

The variables I used to build my model were batted ball types, pitch types, release speed, release spin rate, and both horizontal and vertical movements. I also used player names, to analyze pitchers with the largest discrepancies.

In [None]:
## execute query and close connection

data = pd.read_sql("""SELECT bb_type, player_name, pitch_type, release_speed, 
	release_spin_rate, pfx_x, pfx_z, zone
	FROM MLB_2017
	WHERE bb_type != '123456789'
	;""", conn)
c.close()
conn.close()
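
The query relies on the sentinel convention set up during database formation: cells that were `null` in the export are stored as `123456789`, so `bb_type != '123456789'` keeps only pitches that were put in play. Below is a minimal sketch of that filter against an in-memory SQLite table. The three rows are invented for illustration; only the table and column names come from the query above.

```python
import sqlite3

# In-memory database standing in for Data/mlb_data.db (rows are invented)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MLB_2017 (bb_type TEXT, player_name TEXT, release_speed REAL)")
conn.executemany(
    "INSERT INTO MLB_2017 VALUES (?, ?, ?)",
    [
        ("popup", "Pitcher A", 93.1),
        ("123456789", "Pitcher A", 88.4),   # sentinel: pitch was not put in play
        ("ground_ball", "Pitcher B", 95.0),
    ],
)

# Same WHERE clause as the notebook query: drop rows carrying the null sentinel
rows = conn.execute(
    "SELECT bb_type, player_name FROM MLB_2017 WHERE bb_type != '123456789'"
).fetchall()
conn.close()

print(rows)
```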

## Data Preparation:

### Create Dummy Variables & Clean the Data:

I wrote code to prompt for a batted ball type, but for this analysis I hard-wired it to popups. To make sure each pitcher's sample was large enough, I kept only pitchers with more than 100 balls in play. I then created a 0/1 dummy variable marking whether each batted ball was a popup. Lastly, I removed the null observations, which were stored as 123456789 because the raw export used the string "null" for empty cells.


In [None]:
## ask for batted ball type to analyze

# bbtype = raw_input("\nPlease choose which of the following batted ball types"
# 	" to predict (ground_ball, line_drive, fly_ball, popup): ")


## here i hard-wired popups, but could take any batted ball type

bbtype = "popup"

if bbtype not in ['ground_ball', 'line_drive', 'fly_ball', 'popup']:
	print("Your input was incorrect. Please match the spelling of an"
	" option from the list. Shutting down...")
	exit()


## count the amount of batted balls each pitcher has allowed in play

data['count'] = data.groupby('player_name')['player_name'].transform('count')
  
    
## create a dummy for each zone pitches were thrown in - removed for lack of impact on accuracy

# data = pd.get_dummies(data, columns = ['zone'])

    
## for pitch analysis, use a copy of the data set (pitchers with more than 100 batted balls)

indiv_pitches = data[data['count'] > 100].copy()


## select data where batted balls are popups, and set to 1 true and 0 false

indiv_pitches['batted_ball'] = (indiv_pitches['bb_type'] == bbtype).astype(int)
indiv_pitches['all_batted_ball'] = (indiv_pitches['bb_type'] != "123456789").astype(int)


## gather pitch types used in this dataset

pitch_types = indiv_pitches['pitch_type'].unique().astype(str)


## remove rows where any numeric column holds the null sentinel 123456789,
## and put the rest into the regression dataset

indiv_pitches_reg = indiv_pitches[(indiv_pitches != 123456789).all(axis=1)]
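
The filtering and labeling steps above can be sketched in plain Python. This is an illustration only: the records, the threshold of 2, and the helper name `label_batted_balls` are hypothetical, and the notebook itself does the same work with pandas on the full dataset.

```python
from collections import Counter

def label_batted_balls(records, bbtype, min_batted_balls):
    """Keep pitchers with more than `min_batted_balls` batted balls and
    attach a 0/1 indicator for whether each batted ball matches `bbtype`."""
    counts = Counter(name for name, _ in records)
    labeled = []
    for name, bb_type in records:
        if counts[name] > min_batted_balls:            # same strict '>' as the notebook
            labeled.append((name, bb_type, int(bb_type == bbtype)))
    return labeled

# Hypothetical (pitcher, batted ball type) records
records = [
    ("Pitcher A", "popup"),
    ("Pitcher A", "ground_ball"),
    ("Pitcher A", "fly_ball"),
    ("Pitcher B", "popup"),
]

labeled = label_batted_balls(records, bbtype="popup", min_batted_balls=2)
print(labeled)
```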

### Train the Model:

Because a batted ball variable is boolean with respect to a batted ball type (i.e. it either is or isn't a popup), I chose to build a logistic regression model to estimate the probability that a pitch would be a popup. In aggregate, I would then be able to predict the popup rate of a set of pitches.

I broke down my dataset into training and test sets, at random, with 80% train and 20% test. Given that the dataset is about 80,000 individual data points, I thought an 80-20 split would leave enough data to train the model while still holding out a sizable test set.

In [None]:
## randomly sort data into test and train sets and select data for regression
## (uniform draws on [0, 1), so about 80% of rows fall below .8)

rand = np.random.rand(len(indiv_pitches_reg)) < .8

train = indiv_pitches_reg[rand]
test = indiv_pitches_reg[~rand]

reg_columns = ['release_speed', 'release_spin_rate', 'pfx_x', 'pfx_z']
#               'zone_1', 'zone_2', 'zone_3', 'zone_4', 'zone_5', 'zone_6',
#               'zone_7', 'zone_8', 'zone_9', 'zone_11', 'zone_12', 
#               'zone_13', 'zone_14']


## logit regression - odds ratio into probability of popup
## note: no constant column is added, so the model is fit without an intercept

logit = sm.Logit(train['batted_ball'], train[reg_columns])
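
A random 80-20 split needs draws from a uniform distribution on [0, 1): about 80% of them fall below 0.8. Draws from a standard normal distribution fall below 0.8 slightly less often (about 79% of the time), so the choice of generator matters. Here is a small standard-library sketch of the mask, with an arbitrary seed and sample size chosen for illustration:

```python
import random

rng = random.Random(42)      # arbitrary seed so the sketch is repeatable
n = 100000                   # arbitrary sample size

# Uniform draws: True for roughly 80% of rows (the training rows)
uniform_mask = [rng.random() < 0.8 for _ in range(n)]

# Standard normal draws compared with the same cutoff give a different share
normal_mask = [rng.gauss(0.0, 1.0) < 0.8 for _ in range(n)]

train_share_uniform = sum(uniform_mask) / float(n)
train_share_normal = sum(normal_mask) / float(n)

print("uniform: %.3f  normal: %.3f" % (train_share_uniform, train_share_normal))
```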

On rare occasions, fitting the model fails with a singular-matrix error. As far as I could tell, this comes from rounding when the Hessian matrix is calculated (values carried to 16 decimal places rather than the usual 17), which leaves the Hessian singular. I added a check that tells the user when this happens and quits the program, as opposed to letting singularity errors pop up.

In [None]:
## floating-point rounding can occasionally leave the Hessian singular
## if that happens, tell the user and quit instead of showing a raw error

try:											# try fitting the model
	results = logit.fit(disp=0)
except np.linalg.LinAlgError as err:			# catch linear algebra errors
	if 'Singular matrix' in str(err):			# if singular error
		print("\n\nRounding error in Logit. Please try again.\n\n")
		exit()									# print message and quit
	else:										# if a different error occurs,
		raise									# re-raise it rather than hiding it

### Test the Model

Overall, the model performs well in aggregate: it predicts the sample's popup rate to within half a percentage point. The value in a model like this is in predicting individual pitcher popup rates, though, not the league's.

*note: on rare occasion, the randomization leaves out certain pitch types. Rerunning the randomization code fixes this.*

In [None]:
## predict on test set

y_pred = results.predict(test[reg_columns])


## use a dataframe to compare predictions to actual

prediction = pd.DataFrame({ "actual" : test['batted_ball'], 
	"projected" : y_pred})

test_projected_rate = sum(prediction['projected'])/len(test) * 100
test_actual_rate = sum(prediction['actual'])/len(test) * 100
data_set_actual_rate = sum(indiv_pitches['batted_ball'])/sum(indiv_pitches[
	'all_batted_ball']) * 100

print("\nThe entire sample's actual popup rate: %.2f%%\n"
	"The test set's actual popup rate: %.2f%%\n"
	"The test set's projected popup rate: %.2f%%\n"
	"\nTotal pitches in data set: %i\n"
	"\nPitch types used in the model: %s\n\n"
	% (data_set_actual_rate, test_actual_rate, test_projected_rate, len(indiv_pitches), pitch_types))
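
The projected rate is the mean of the per-pitch predicted probabilities, expressed as a percentage, and the actual rate is the mean of the 0/1 outcomes. A tiny sketch with invented numbers shows how the two are computed and why they only make sense as averages over many pitches:

```python
# Invented predicted probabilities and outcomes for ten batted balls
projected = [0.07, 0.06, 0.08, 0.07, 0.09, 0.05, 0.07, 0.08, 0.06, 0.07]
actual = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

projected_rate = sum(projected) / float(len(projected)) * 100
actual_rate = sum(actual) / float(len(actual)) * 100

# No single pitch gets a probability anywhere near 1, so the model cannot
# pick out which pitch becomes the popup; only the averaged rates are comparable
print("projected: %.1f%%  actual: %.1f%%" % (projected_rate, actual_rate))
```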


Below are the regression results. Because I used a logistic regression, I have to convert the coefficients into odds ratios to interpret them. As the table shows, horizontal movement has no statistically significant effect on popup rate, whereas vertical movement, velocity and spin rate do. A one-unit increase in vertical movement raises the odds of a pitch being a popup by 124%.



In [None]:
## results!! Odds ratios, because Logit coefficients are on the log-odds scale

odds = pd.DataFrame({ "Variable" : results.params.index,
                     "Change in Odds (%)" : (np.exp(results.params.values)-1) * 100,
                    "P-Value" : results.pvalues})
odds.set_index("Variable", inplace = True)


## print regression table and odds ratios

print(results.summary())
print("\n\n%s\n" % odds)
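
A logit coefficient is a change in log-odds, so `exp(coefficient) - 1` is the percentage change in the odds for a one-unit increase in that variable, not in the probability. How much the probability moves depends on the starting point. The sketch below uses a hypothetical coefficient chosen so that the odds rise by 124%, and a hypothetical baseline probability of 7%; neither is taken from the fitted model.

```python
import math

coefficient = math.log(2.24)      # hypothetical: corresponds to a +124% change in odds
baseline_probability = 0.07       # hypothetical starting popup probability

odds_change_pct = (math.exp(coefficient) - 1) * 100

# Convert probability -> odds, scale the odds, convert back to probability
baseline_odds = baseline_probability / (1 - baseline_probability)
new_odds = baseline_odds * math.exp(coefficient)
new_probability = new_odds / (1 + new_odds)

print("odds change: %.0f%%" % odds_change_pct)
print("probability: %.3f -> %.3f" % (baseline_probability, new_probability))
```

With these inputs the probability roughly doubles, which is a smaller relative change than the 124% change in odds.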

## Popup Prediction - Pitchers

To test my model, I decided to try to predict individual pitcher popup rates. I noticed that the model did extremely poorly at predicting the probability that an individual pitch would be a popup, but it did well in predicting the popup rate of a larger sample. Below, I predict the popup rate and calculate the actual popup rate of every pitcher with more than 100 batted balls in our sample.

In [None]:
## create list of pitchers and an empty dataframe to store their rates

pitchers = indiv_pitches['player_name'].unique()

bip_df = pd.DataFrame([])


## create a dataframe of pitcher names and amount of balls in play

batted_balls = indiv_pitches_reg[['player_name','count']].drop_duplicates('player_name', keep='last')[['player_name', 'count']]


## loop through each pitcher, clean the data and estimate bip_rate rates

for pitcher in pitchers:

    ## use the cleaned regression set for this pitcher, so rows holding the
    ## null sentinel 123456789 are not passed to the model as real measurements
    
    df = indiv_pitches_reg.loc[indiv_pitches_reg['player_name'] == pitcher]
    if len(df) == 0:    # skip a pitcher whose rows were all removed as nulls
        continue
    
    
    ## apply model and predict rate for each pitch
    
    df_predict = results.predict(df[reg_columns])
    
    
    ## calculate projected and actual rates
    
    total = len(df)

    projected_bip_rate = sum(df_predict)/total * 100
    actual_bip_rate = sum(df['batted_ball'])/total * 100

    
    ## create dataframe of pitcher, actual and projected rates, 
    ## difference in rates and ball in play count
    
    bip_df = pd.concat([bip_df, pd.DataFrame({ 'Player' : pitcher,
                        'Actual Rate' : round(actual_bip_rate, 2),
                        'Projected Rate' : round(projected_bip_rate, 2),
                        'Difference' : round((actual_bip_rate - projected_bip_rate), 2),
                        'Balls in Play' : batted_balls['count'][batted_balls['player_name'] == pitcher]})])

    
## reorder dataframe of results and print a header of results

bip_df = bip_df[['Player', 'Actual Rate', 'Projected Rate', 'Difference', 'Balls in Play']]
print(bip_df.head())
    

## plot the relationship between sample size and error in prediction

plt.scatter(bip_df["Balls in Play"], np.absolute(bip_df["Difference"]))
plt.xlabel('Balls in Play')
plt.ylabel('Absolute Value of Error')
plt.title('Sample Size versus Prediction Error on %s' % bbtype)


## if batted ball type is fly ball, make note that baseball savant uses different classifications

if bbtype == "fly_ball":
	print("\n\nNote: this model classifies fly balls separately from popups,"
	" unlike most public websites.\n")

As both the first rows of the table and the plot above show, the model does not predict individual pitcher popup rates accurately, but its absolute error shrinks as the sample size grows.
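
Part of that pattern is what sampling noise alone would produce. If each batted ball is treated as an independent trial that is a popup with probability p, the observed rate over n batted balls has a standard error of sqrt(p(1 - p) / n), which shrinks as n grows. Here is a short sketch, using an assumed popup probability of 7% purely for illustration:

```python
import math

def rate_standard_error(p, n):
    """Standard error, in percentage points, of an observed rate over n trials."""
    return math.sqrt(p * (1 - p) / n) * 100

p = 0.07   # assumed popup probability, for illustration only

for n in (100, 400, 579):
    print("n = %3d: +/- %.2f percentage points" % (n, rate_standard_error(p, n)))
```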

## Problems & Issues:

The model wasn't nearly as accurate as I had hoped it would be. Much of this is likely due to sample size issues, as no pitcher yet had a sample as large as the recommended minimum I calculate below (approximately 579 balls in play). Even so, the results strongly suggest that batters contribute to pitcher popup rates. To improve the model, I would need to include batter characteristics, like bat angle, swing path and bat speed.

Below is a plot comparing the distributions of projected and actual popup rates. The projected rates are centered in a similar location to the actual rates but show less variance. After limiting the sample to pitchers who have allowed more than 400 balls in play, we can see that those with a larger sample size have more accurate projections. A MedCalc reference, cited below, suggests a calculation for the minimum sample size. For this model that minimum was 579 balls in play, which no pitcher has reached yet in 2017. I will revisit this after the 2017 season, and may consider including 2016 data as well to increase sample sizes.

In [None]:
## plot actual versus projected rate frequencies by value

sns.kdeplot(bip_df['Actual Rate'], shade = True, label = "Actual")
sns.kdeplot(bip_df['Projected Rate'], shade = True, label = "Projected")
plt.title('Projected versus Actual %s' % bbtype)

In [None]:
## Set ball in play count minimum, tell the user and then plot

bip_size = 400
bip_df_larger_sample = bip_df[bip_df['Balls in Play'] > bip_size]

print("\n\nThere are %i pitchers with more than %i balls in play in our sample (batted ball type: %s).\n\n" % 
      (len(bip_df_larger_sample), bip_size, bbtype))


## plot using the sample of pitchers with a minimum number of batted ball type we want

sns.kdeplot(bip_df_larger_sample['Actual Rate'], shade = True, label = "Actual")
sns.kdeplot(bip_df_larger_sample['Projected Rate'], shade = True, label = "Projected")
plt.title('Projected versus Actual %s' % bbtype)

In [None]:
## sample size recommendation:
## MedCalc, "Logistic Regression", Accessed July 2017.
## URL: https://www.medcalc.org/manual/logistic_regression.php

## print recommended sample size

bip_proportion = sum(test['batted_ball'])/len(test)

## 10 * (number of predictors) / (proportion of positive cases); 4 predictors here
print("\nRecommended Sample Size, as per MedCalc: %i\n" % int(10 * 4 / bip_proportion))
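
The rule of thumb applied above is n = 10k / p, where k is the number of predictors (4 here) and p is the proportion of positive cases, which the notebook takes from the test set. Below is a minimal sketch. The function name is hypothetical, and the proportion 0.069 is an illustrative value picked because it reproduces the 579 figure quoted earlier, not a number read from the data.

```python
def recommended_sample_size(n_predictors, positive_proportion):
    """Minimum sample size under the 10 * k / p rule of thumb, truncated to an integer."""
    return int(10 * n_predictors / float(positive_proportion))

# 4 predictors as in reg_columns; 0.069 is an illustrative proportion
print(recommended_sample_size(4, 0.069))
```

Because p sits in the denominator, rarer outcomes need larger samples: halving the popup proportion doubles the recommended minimum.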


## Conclusion:

I believe the goal of predicting batted ball types is feasible, given the right data. The outcome of a pitch depends on both the pitcher and the batter, so a model built on statistics from only one of the two parties is incomplete. As baseball data continues to become available to the public, I will continue to explore it and revise the model.

The most significant result of this project is that predicting popup rates for individual pitchers seems possible. The value in popup rate prediction is that popups are almost guaranteed outs and pitchers who should be producing more popups than they are may be undervalued in the marketplace of MLB pitchers.

### Next Steps:

- Find data on batter characteristics (i.e. bat angle, bat speed, etc.) to include in the model.
- Explore this model further for other batted ball types.