# Project: Classify Pitches as Swinging Strikes

Swinging strikes are strongly correlated with pitcher effectiveness. This is likely because strikeouts are strongly correlated with pitching effectiveness. Previous research of mine suggested that a pitcher's swinging strikeout rate shows more variance and is more projectable (i.e., has higher $r^2$) from season to season than called strikeout rate. Furthermore, the majority of strikeouts are swinging strikeouts.

The goal of this project is to determine whether data about pitch movement, velocity, release point, and location relative to the strike zone is sufficient to classify pitches as swinging strikes. If so, this suggests that the physical characteristics of a pitch are what induce swinging strikes. It is also possible that additional factors involving the pitcher, game state, and the batter are significant influences on whether a pitch will result in a swinging strike.



In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/statcast_dumps/statcast2017.csv').drop('Unnamed: 0', axis=1)

In [3]:
deprecated = ['spin_dir','spin_rate_deprecated',
              'break_angle_deprecated','break_length_deprecated',
              'tfs_deprecated', 'tfs_zulu_deprecated','umpire']
df.drop(labels=deprecated,axis=1, inplace=True)
df.columns

Index(['index', 'pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'zone', 'des', 'game_type', 'stand', 'p_throws',
       'home_team', 'away_team', 'type', 'hit_location', 'bb_type', 'balls',
       'strikes', 'game_year', 'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'on_3b',
       'on_2b', 'on_1b', 'outs_when_up', 'inning', 'inning_topbot', 'hc_x',
       'hc_y', 'fielder_2', 'sv_id', 'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az',
       'sz_top', 'sz_bot', 'hit_distance_sc', 'launch_speed', 'launch_angle',
       'effective_speed', 'release_spin_rate', 'release_extension', 'game_pk',
       'pitcher.1', 'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5',
       'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',
       'woba_value', 'woba_denom', 'babip_value', 'iso_value',
       'launch_speed_angle',

In [4]:
print("Working...")
df1 = pd.read_csv('data/statcast_dumps/statcast2018.csv').drop('Unnamed: 0', axis=1)
df1.drop(labels=deprecated,axis=1, inplace=True)
print("Working...")
df2 = pd.read_csv('data/statcast_dumps/statcast2019.csv').drop('Unnamed: 0', axis=1)
df2.drop(labels=deprecated,axis=1, inplace=True)
print("Working...")
df3 = pd.read_csv('data/statcast_dumps/statcast2020.csv').drop('Unnamed: 0', axis=1)
df3.drop(labels=deprecated,axis=1, inplace=True)
print("Working...")
df = pd.concat([df,df1,df2,df3])
print("Finished.")

Working...
Working...
Working...
Working...
Finished.


In [5]:
del(df1)
del(df2)
del(df3)


In [6]:
try:
    df.drop('Unnamed: 0.1',axis=1,inplace=True)
except KeyError:  ##column not present; nothing to drop
    pass
print(df.dropna().shape)
print(df.isna().sum())
df.shape

(7655, 83)
index                        0
pitch_type               15340
game_date                    0
release_speed            14912
release_pos_x            12166
                         ...  
post_home_score              0
post_bat_score               0
post_fld_score               0
if_fielding_alignment    11358
of_fielding_alignment    11358
Length: 83, dtype: int64


(2355353, 83)

## What's a swinging strike?

The question isn't as simple as it sounds. A foul ball with fewer than 2 strikes has the effect of a swinging strike (unless it's also a popup); it's worse than not swinging at a ball out of the zone and no better than a called strike. With 2 strikes, a foul ball is worse than swinging at a ball out of the zone but better than a called strike. 

_I'm going to stipulate that a foul ball is not a swinging strike, but a foul tip is._ This is mostly for simplicity at present, but if this were a multiple classification problem, I would regard foul balls as a separate class.

In [7]:
def is_swinging_strike(row):
    if row['description'] in ['swinging_strike','swinging_strike_blocked','foul_tip']:
        return True
    else:
        return False
    
def is_swinging_strike2(row):
    if row['description'] == 'swinging_strike':
        return True
    elif row['description'] == 'swinging_strike_blocked':
        return True
    elif row['description'] == 'foul_tip':
        return True
    else:
        return False
    
def is_swinging_strike3(col):
    '''
    Let's just do pd.series.apply...
    '''
    if col in ['swinging_strike','swinging_strike_blocked','foul_tip']:
        return True
    else:
        return False

## Optimizing the classification

Wow. The kernel crashed horribly when I ran ```df['swinging_strike'] = df.apply(is_swinging_strike,axis=1)```. I used the cell below to find a much faster method.

In [8]:
## This was deadly slow. Like, it crashed the kernel.
#df['swinging_strike'] = df.apply(is_swinging_strike,axis=1)
#df.columns

##Let's see if we can optimize.
import time
temp = df.iloc[0:10000].copy()##a much smaller dataframe for testing time.
start = time.time()
temp.apply(is_swinging_strike,axis=1)
print(time.time() - start)
start = time.time()
temp.apply(is_swinging_strike2,axis=1)
print(time.time() - start)
start = time.time()
temp['description'].apply(is_swinging_strike3)
print(time.time() - start)
###and we have a winner.

0.3903791904449463
0.7085263729095459
0.003505229949951172


In [9]:
df['swinging_strike'] = df['description'].apply(is_swinging_strike3)
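
For what it's worth, the same labels can probably be computed without `apply` at all. The sketch below uses `Series.isin`, which is vectorized. The small frame `demo` is made up for illustration and is not the Statcast data.

```python
import pandas as pd

# Descriptions counted as swinging strikes, as stipulated above.
SWINGING = ['swinging_strike', 'swinging_strike_blocked', 'foul_tip']

# Hypothetical stand-in for df['description'].
demo = pd.DataFrame({'description': ['ball', 'foul_tip', 'swinging_strike',
                                     'called_strike', 'foul']})

# isin checks membership for the whole column at once, with no Python-level loop.
demo['swinging_strike'] = demo['description'].isin(SWINGING)
print(demo['swinging_strike'].tolist())
# [False, True, True, False, False]
```

On the full frame this would read `df['swinging_strike'] = df['description'].isin(SWINGING)`.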

In [10]:
## Time to select the data that we're using for prediction.
## Statcast data documentation at https://baseballsavant.mlb.com/csv-docs

#for e in df.columns:
#    print("'"+e+"',")##copy and paste output, then comment out.
#print(df.columns)

physical_data = ['release_speed',
 'release_pos_x',
 'release_pos_z',
 'pfx_x',
 'pfx_z',
 'plate_x',
 'plate_z',
 'vx0',
 'vy0',
 'vz0',
 'ax',
 'ay',
 'az',
 'release_pos_y',
 'release_spin_rate'
                ]


In [11]:
df.dropna(subset=physical_data).shape

(2304729, 84)

In [12]:
from sklearn.model_selection import train_test_split
data = df.dropna(subset=physical_data)

## I later found that using more than 5000 samples from the data provided no performance improvements
## but came with a significant training time cost.
## Hence, training size will be just 30_000, which leaves room to find performance gains.
## We can re-test on the whole set later

X_train, X_hold, y_train, y_hold = train_test_split(data[physical_data],
                                                    data['swinging_strike'], train_size = 30_000, test_size=10_000)
X_dev, X_test, y_dev, y_test = train_test_split(X_hold, y_hold, test_size = 0.5)
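
Since swinging strikes are a small minority of pitches, a purely random split can leave the train, dev and test sets with slightly different base rates. One option, sketched here on synthetic labels rather than the real data, is the `stratify` argument of `train_test_split`. The 11% positive rate and the names `X_demo`, `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))   # synthetic features
y_demo = np.zeros(1000, dtype=bool)
y_demo[:110] = True                   # 11% positives, an assumed base rate

# stratify keeps the class proportions (nearly) equal in both halves.
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0)

print(y_a.mean(), y_b.mean())
```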

In [13]:
print(X_train.shape)

(30000, 15)


In [14]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic = Pipeline([('scaler', StandardScaler()), ('logistic_regression', LogisticRegression(max_iter=400))])

In [15]:
import pickle

try:
    with open('logistic.pickle','rb') as temp:
        logistic = pickle.load(temp)
    raise Exception ##forces a refit even when the pickle loads; comment out to use the cached model
except Exception:
    logistic.fit(X_train,y_train)
    with open('logistic.pickle','wb') as temp:
        pickle.dump(logistic,temp)
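
The load-or-fit pattern recurs for every model in this notebook, so it could be wrapped in a helper. This is only a sketch, not code from the notebook: `load_or_fit` and its `force` flag are hypothetical names.

```python
import pickle
from pathlib import Path

def load_or_fit(model, path, X, y, force=False):
    """Return the pickled model at `path` if it exists; otherwise fit `model` and pickle it.

    Hypothetical helper: pass force=True to refit even when a pickle is present.
    """
    path = Path(path)
    if path.exists() and not force:
        with path.open('rb') as fh:
            return pickle.load(fh)
    model.fit(X, y)
    with path.open('wb') as fh:
        pickle.dump(model, fh)
    return model
```

Usage would look like `logistic = load_or_fit(logistic, 'logistic.pickle', X_train, y_train)`.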

In [16]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent').fit(X_train,y_train)
print(dummy.score(X_dev,y_dev))
print(logistic.score(X_dev,y_dev))

0.8866
0.8866


In [17]:
print(confusion_matrix(y_dev,logistic.predict(X_dev)))
print(confusion_matrix(y_dev,dummy.predict(X_dev)))

[[4433    0]
 [ 567    0]]
[[4433    0]
 [ 567    0]]


## Wow.

The logistic classifier is easily the worst classifier I have ever seen. It's no better than the dummy classifier: both score 0.8866 on the dev set, and neither ever predicts a swinging strike.

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

forest_clf = RandomForestClassifier(n_estimators=25)

In [20]:
#forest_clf.fit(X_train,y_train)

In [21]:
#forest_clf.score(X_dev,y_dev)

In [22]:
from sklearn.neural_network import MLPClassifier

In [23]:
simple_mlp_clf = MLPClassifier(activation='tanh', hidden_layer_sizes=(4),
                               alpha = 0,
                               solver='sgd',
                               max_iter=2000,
                               learning_rate_init=.03
                              )
pipe = Pipeline([('scaler', StandardScaler()),('mlp',simple_mlp_clf)])

In [24]:
try:
    pipe = pickle.load(open('pipe.pickle','rb'))
    #raise Exception ##uncomment to force fitting
except:
    print("No such pickle. Fitting pipe and pickling.")
    pipe.fit(X_train,y_train)
    pickle.dump(pipe,open('pipe.pickle','wb'))

In [25]:
def quick_analysis(model):
    print("Train score:\n{:.4f}".format(model.score(X_train,y_train)))
    print("Dev   score:\n{}".format(model.score(X_dev,y_dev)))
    print("Dev Confusion matrix:\n{}".format(confusion_matrix(y_dev,model.predict(X_dev))))
quick_analysis(pipe)

Train score:
0.8804
Dev   score:
0.8866
Dev Confusion matrix:
[[4433    0]
 [ 567    0]]


# Oooof

This surprises me. Is there really nothing that allows us to better predict the result of a pitch than assuming the most frequent class?

Let's increase our number of units in the layer.

In [26]:
simple_mlp_clf = MLPClassifier(activation='tanh', hidden_layer_sizes=(9),
                               alpha = 0,
                               solver='sgd',
                               max_iter=2000,
                               learning_rate_init=.03
                              )
pipe_large = Pipeline([('scaler', StandardScaler()),('mlp',simple_mlp_clf)])

In [27]:
try:
    pipe_large = pickle.load(open('pipe_large.pickle','rb'))
    #raise Exception ##uncomment to force                        
except:
    pipe_large.fit(X_train,y_train)
    pickle.dump(pipe_large,open('pipe_large.pickle','wb'))

In [28]:
quick_analysis(pipe_large)

Train score:
0.8804
Dev   score:
0.8866
Dev Confusion matrix:
[[4433    0]
 [ 567    0]]


In [29]:
two_layer_mlp_clf = MLPClassifier(activation='tanh', hidden_layer_sizes=(8,4),
                               alpha = 0,
                               solver='sgd',
                               max_iter=2000,
                               learning_rate_init=.03
                              )
two_layer_pipe = Pipeline([('scaler', StandardScaler()),('mlp',two_layer_mlp_clf)])

In [30]:
try:
    two_layer_pipe = pickle.load(open('two_layer_pipe.pickle','rb'))
    #two_layer_pipe.predict(X_dev.iloc[0])
    #raise Exception ##uncomment to force fitting
except:
    print("Failed to open two_layer_pipe. Fitting and pickling...")
    two_layer_pipe.fit(X_train,y_train)
    pickle.dump(two_layer_pipe,open('two_layer_pipe.pickle','wb'))


In [31]:
print(two_layer_pipe)
print(logistic)
print(pipe_large)
print(pipe)

Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',
                 MLPClassifier(activation='tanh', alpha=0,
                               hidden_layer_sizes=(8, 4),
                               learning_rate_init=0.03, max_iter=2000,
                               solver='sgd'))])
Pipeline(steps=[('scaler', StandardScaler()),
                ('logistic_regression', LogisticRegression(max_iter=400))])
Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',
                 MLPClassifier(activation='tanh', alpha=0, hidden_layer_sizes=9,
                               learning_rate_init=0.03, max_iter=2000,
                               solver='sgd'))])
Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',
                 MLPClassifier(activation='tanh', alpha=0, hidden_layer_sizes=4,
                               learning_rate_init=0.03, max_iter=2000,
                               solver='sgd'))])


## Rethinking model evaluation

I mean, I know this could be a really hard problem, but it seems a little nuts that there's literally nothing about the movement (etc.) of a pitch that would make it more likely to be a swinging strike.

### This is about probabilities and classification

Let's face it, no pitch is more than 50% likely to be a swinging strike. So, there's your issue. Sure, a 101 mph fastball at the top of the zone is more likely to be one, but even the best located 100mph fastballs aren't swinging strikes more than 50% of the time.
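
To make the point concrete: for a binary classifier, scikit-learn's `predict` picks the more probable class, which amounts to a 0.5 cutoff, so a model whose swinging-strike probabilities never reach 0.5 will never predict one. The sketch below uses made-up probabilities (not model output) and an assumed base rate of about 11% to show how a lower cutoff changes the picture.

```python
import numpy as np

# Hypothetical swinging-strike probabilities for six pitches.
proba = np.array([0.03, 0.08, 0.15, 0.22, 0.31, 0.45])

default_pred = proba > 0.5      # what predict() effectively does
base_rate_pred = proba > 0.11   # cutoff near the assumed base rate

print(default_pred.sum(), base_rate_pred.sum())
# 0 4
```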

We need to implement better measures of model accuracy. In particular, a model is good if it assigns higher probabilities to pitches that are actually swinging strikes. Two functions below assess the average deviation of the predicted swinging-strike probability from the actual outcome: one as a mean absolute error, the other as a mean squared error.
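
Two standard scikit-learn metrics capture the same idea and could be used alongside hand-rolled functions. `roc_auc_score` measures how often a true swinging strike gets a higher probability than a non-swinging strike, and `brier_score_loss` is the mean squared difference between probability and outcome. The labels and probabilities below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Invented outcomes (1 = swinging strike) and two sets of predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0])
informative = np.array([0.05, 0.10, 0.08, 0.12, 0.30, 0.06, 0.25, 0.09])
constant = np.full(8, 0.25)  # always predicts the base rate (2 of 8)

for name, p in [('informative', informative), ('constant', constant)]:
    # Higher AUC is better; lower Brier score is better.
    print(name, roc_auc_score(y_true, p), brier_score_loss(y_true, p))
```

Neither set of probabilities ever crosses 0.5, so plain accuracy would rate them identically. The ranking-based and probability-based metrics separate them.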

### Multiple Classification

Just looking at swinging strikes is fun, but the base rate of a non-swinging strike is nearly 8 times higher (4,433 to 567 in the dev set), so of course the classifier is going to favor that base rate unless there are strong swinging strike indicators.

### Revisit Data

What if there are some elements that I should have included which might be making things noisy? That is, maybe there's something which, if you added it, would make it easier to classify?

Here's what I notice in the data that I didn't include as features:
- **p_throws** whether the pitcher is right or left handed. This seems pretty important. Technically not physical data, but it could be noise that matters.
- **stand** is the side of the plate that the batter is on. That seems pretty important. 
- **count** seems like it could matter. It's not physical data, though. Still, perhaps the noise comes from batters not swinging at "tough" pitches when they are ahead. There's an additional issue with including it, too: a pitcher who gets ahead is more likely to get swinging strikes.
- **pitch_type** is the classification of the pitch as a 'slider', 'fastball', etc. (The `type` column is the shorthand pitch result, not the pitch classification.) This might matter, but it also seems to be non-physical data.

Let's revisit our data and see if the batter/pitcher handedness makes a difference to these classifications.
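
`p_throws` and `stand` are stored as letters, so they need a numeric encoding before they can join `physical_data`. A minimal sketch, assuming the values are `'L'`/`'R'` as described in the Statcast documentation. The frame `demo` and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the two handedness columns.
demo = pd.DataFrame({'p_throws': ['R', 'L', 'R', 'L'],
                     'stand':    ['R', 'R', 'L', 'L']})

# 1 for right-handed, 0 for left-handed.
demo['p_throws_R'] = (demo['p_throws'] == 'R').astype(int)
demo['stand_R'] = (demo['stand'] == 'R').astype(int)

# Same-handed matchups might behave differently from opposite-handed ones
# (an assumption worth testing, not a finding of this notebook).
demo['same_side'] = (demo['p_throws'] == demo['stand']).astype(int)

print(demo[['p_throws_R', 'stand_R', 'same_side']].values.tolist())
# [[1, 1, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```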


In [32]:
##curious
logistic_proba = logistic.predict_proba(X_train).T
pipe_proba = pipe.predict_proba(X_train).T

In [33]:
a = pipe_proba[1] > .12
b = pipe_proba[1] > .14
np.sum(b)

10057

In [34]:
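def model_standard_error(model):
    ## Despite the name, this is the mean squared error of the predicted
    ## swinging-strike probability (also known as the Brier score).
    predict = model.predict_proba(X_dev).T[1]
    error = (y_dev.astype(int) - predict)
    return np.sum(error*error)/len(error)

def model_absolute_error(model):
    ## Mean absolute difference between predicted probability and outcome.
    predict = model.predict_proba(X_dev).T[1]
    error = np.abs(y_dev.astype(int) - predict)
    return np.sum(error)/len(error)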


In [35]:
two_layer_pipe.predict(X_dev)
print("Mean Absolute Error")
for model in [dummy,logistic,pipe,pipe_large,two_layer_pipe]:
    print(model_absolute_error(model))
print("Mean Standard Error")
for model in [dummy,logistic,pipe,pipe_large,two_layer_pipe]:
    print(model_standard_error(model))


Mean Absolute Error
0.1134
0.2024668536420064
0.18913982231091123
0.1890874798571327
0.18636042405111003
Mean Standard Error
0.1134
0.09897089884541346
0.09457206923591979
0.09279901892412655
0.09276316795999799


In [36]:
for model in [dummy,logistic,pipe,pipe_large,two_layer_pipe]:
    print(model_absolute_error(model))

0.1134
0.2024668536420064
0.18913982231091123
0.1890874798571327
0.18636042405111003


In [38]:
df['description'].unique()
in_play = ['hit_into_play_score', 'hit_into_play','hit_into_play_no_out']
swinging_strike = ['swinging_strike','swinging_strike_blocked','foul_tip']
called_strike = ['called_strike']
ball = ['ball', 'blocked_ball', 'hit_by_pitch']
foul = ['foul', 'foul_pitchout']
bunt = ['missed_bunt','foul_bunt','bunt_foul_tip']

In [39]:
def basic_result(x):
    if x in in_play:
        return 1
    elif x in swinging_strike:
        return 2
    elif x in called_strike:
        return 3
    elif x in ball:
        return 4
    elif x in foul:
        return 5
    elif x in bunt:
        return 6 
    else:
        return 0
    
    
df['result'] = df['description'].apply(basic_result)

In [40]:
result_dict = {
    0 : 'uncategorized',
    1 : 'in play',
    2 : 'swinging strike',
    3 : 'called strike',
    4 : 'ball',
    5 : 'foul ball',
    6 : 'bunt'
}


df['result']


0         1
1         2
2         5
3         4
4         2
         ..
173854    3
173855    1
173856    3
173857    1
173858    4
Name: result, Length: 2355353, dtype: int64

In [41]:
##This usually fails, seemingly because df is very big.
##data2 = df.dropna(subset=physical_data)
subset = physical_data + ['result']
data2 = df[subset].copy()
data2.dropna(inplace = True)
X_train2, X_hold2, y_train2, y_hold2 = train_test_split(data2[physical_data],
                                                    data2['result'], train_size=300_000, test_size=10000)
X_dev2, X_test2, y_dev2, y_test2 = train_test_split(X_hold2, y_hold2, test_size = 0.5)

In [42]:

## max_iter= 400 did not converge and had bad results. Changed to 1000
logistic2 = Pipeline([('scaler', StandardScaler()), ('logistic_regression', LogisticRegression(max_iter=1000))])

try:
    logistic2 = pickle.load(open('logistic2.pickle','rb'))
except:
    ##FIX ME: this has not been run with max_iter = 1000.
    logistic2.fit(X_train2,y_train2)
    pickle.dump(logistic2,open('logistic2.pickle','wb'))

In [50]:
def quick_analysis_multi(model):
    print("Train score:\n{:.4f}".format(model.score(X_train2,y_train2)))
    print("Dev   score:\n{}".format(model.score(X_dev2,y_dev2)))
    x = confusion_matrix(y_dev2,model.predict(X_dev2))
    print("Dev Confusion matrix:\n{}".format(x))
    print('Recall:') 
    for i in range(x.shape[1]):
        true_positives = x[i,i]
        all_positives = x[i,:].sum()
        recall = true_positives/all_positives
        print("{}: {:.3f}".format(result_dict[i],recall))

    print('\nPrecision')
    for i in range(x.shape[1]):
        tp = x[i,i]
        precision = tp/x[:,i].sum()
        print("{}: {:.3f}".format(result_dict[i],precision))

print(result_dict)
quick_analysis_multi(logistic2)

{0: 'uncategorized', 1: 'in play', 2: 'swinging strike', 3: 'called strike', 4: 'ball', 5: 'foul ball', 6: 'bunt'}
Train score:
0.3314
Dev   score:
0.3318
Dev Confusion matrix:
[[   0    0    0    0    2    0    0]
 [   0   29    1    9  867   11    0]
 [   0    7    0   12  559   17    0]
 [   0   21    0   23  785    4    0]
 [   0   72    0   60 1592   82    0]
 [   0   14    0   15  788   15    0]
 [   0    2    0    0   13    0    0]]
Recall:
uncategorized: 0.000
in play: 0.032
swinging strike: 0.000
called strike: 0.028
ball: 0.882
foul ball: 0.018
bunt: 0.000

Precision
uncategorized: nan
in play: 0.200
swinging strike: 0.000
called strike: 0.193
ball: 0.346
foul ball: 0.116
bunt: nan
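
The `nan` entries come from dividing by zero when a class is never predicted. If that gets annoying, scikit-learn's `precision_score` and `recall_score` accept `labels` and `zero_division` arguments. Passing `labels` explicitly also keeps each value tied to the right class when a class is absent from the dev set. A sketch on toy labels (not the pitch data):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

labels = [1, 2, 4]  # hypothetical subset of the result codes
y_true = np.array([1, 1, 2, 2, 4, 4, 4, 4])
y_pred = np.array([1, 4, 4, 4, 4, 4, 4, 1])  # class 2 is never predicted

# average=None returns one value per entry in `labels`;
# zero_division=0 reports 0 instead of nan for never-predicted classes.
precision = precision_score(y_true, y_pred, labels=labels,
                            average=None, zero_division=0)
recall = recall_score(y_true, y_pred, labels=labels,
                      average=None, zero_division=0)

for lab, p, r in zip(labels, precision, recall):
    print(lab, round(float(p), 3), round(float(r), 3))
```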




In [51]:
forest_multi_clf = RandomForestClassifier(n_estimators=40, max_depth=8)##I tried this with the default 100 estimators; it trained too slowly.
forest_multi_clf.fit(X_train2,y_train2)
quick_analysis_multi(forest_multi_clf)

Train score:
0.5174
Dev   score:
0.5084
Dev Confusion matrix:
[[   0    0    0    0    2    0    0]
 [   0  441    0  177  143  156    0]
 [   0  146    0   58  300   91    0]
 [   0  357    0  276   72  128    0]
 [   0   71    0   60 1642   33    0]
 [   0  348    0  132  169  183    0]
 [   0    3    0    3    8    1    0]]
Recall:
uncategorized: 0.000
in play: 0.481
swinging strike: 0.000
called strike: 0.331
ball: 0.909
foul ball: 0.220
bunt: 0.000

Precision
uncategorized: nan
in play: 0.323
swinging strike: nan
called strike: 0.391
ball: 0.703
foul ball: 0.309
bunt: nan




In [52]:
## The gradient classifier takes forever to train. Let's train it on a smaller sample.

X_grad, X_ignore, y_grad, y_ignore = train_test_split(X_train2,y_train2,train_size=2500)

gradient_multi_clf = GradientBoostingClassifier()
gradient_multi_clf.fit(X_grad,y_grad)
quick_analysis_multi(gradient_multi_clf)

Train score:
0.4969
Dev   score:
0.4896
Dev Confusion matrix:
[[   0    0    0    0    2    0    0]
 [   0  358   41  262  123  133    0]
 [   0  104   61  105  252   71    2]
 [   0  295   21  300   78  139    0]
 [   0   37   64   66 1573   61    5]
 [   0  281   46  205  140  156    4]
 [   0    3    0    3    8    1    0]]
Recall:
uncategorized: 0.000
in play: 0.390
swinging strike: 0.103
called strike: 0.360
ball: 0.871
foul ball: 0.188
bunt: 0.000

Precision
uncategorized: nan
in play: 0.332
swinging strike: 0.262
called strike: 0.319
ball: 0.723
foul ball: 0.278
bunt: 0.000




In [80]:
mlp_4 = MLPClassifier(activation='tanh', hidden_layer_sizes=(4),
                               alpha = 0,
                               solver='sgd',
                               max_iter=5000,
                               learning_rate_init=.03
                              )

mlp_4 = Pipeline([('scaler', StandardScaler()),('mlp', mlp_4)])

In [54]:
clf_name = 'mlp_4.pickle'
try:
    #raise Exception
    mlp_4 = pickle.load(open(clf_name,'rb'))
except:
    file = open(clf_name,'wb')
    mlp_4.fit(X_train2,y_train2)
    pickle.dump(mlp_4,file)

In [55]:
quick_analysis_multi(mlp_4)

Train score:
0.4996
Dev   score:
0.4882
Dev Confusion matrix:
[[   0    0    0    0    2    0    0]
 [   0  469   44  130  108  166    0]
 [   0  170   53   33  249   90    0]
 [   0  359   19  176   92  187    0]
 [   0   30   59   51 1582   84    0]
 [   0  403   35   96  137  161    0]
 [   0    4    0    2    9    0    0]]
Recall:
uncategorized: 0.000
in play: 0.511
swinging strike: 0.089
called strike: 0.211
ball: 0.876
foul ball: 0.194
bunt: 0.000

Precision
uncategorized: nan
in play: 0.327
swinging strike: 0.252
called strike: 0.361
ball: 0.726
foul ball: 0.234
bunt: nan




In [56]:
two_layer_mlp_multi_clf = MLPClassifier(activation='tanh', hidden_layer_sizes=(8,4),
                               alpha = 0,
                               solver='sgd',
                               max_iter=5000,
                               learning_rate_init=.03
                              )
mlp_8_4 = Pipeline([('scaler', StandardScaler()),('mlp',two_layer_mlp_multi_clf)])

In [57]:
clf_name = 'mlp_8_4.pickle'
try:
    #raise Exception
    mlp_8_4 = pickle.load(open(clf_name,'rb'))
except:
    file = open(clf_name,'wb')
    mlp_8_4.fit(X_train2,y_train2)
    pickle.dump(mlp_8_4,file)

In [59]:
print(result_dict)

quick_analysis_multi(mlp_8_4)

print(confusion_matrix(y_dev2,mlp_8_4.predict(X_dev2)))
print(confusion_matrix(y_dev2,mlp_8_4.predict(X_dev2)).sum(axis=1))
print(confusion_matrix(y_dev2,mlp_8_4.predict(X_dev2)).sum(axis=0))

{0: 'uncategorized', 1: 'in play', 2: 'swinging strike', 3: 'called strike', 4: 'ball', 5: 'foul ball', 6: 'bunt'}
Train score:
0.5285
Dev   score:
0.5284
Dev Confusion matrix:
[[   0    0    0    0    2    0    0]
 [   0  418    2  200  142  155    0]
 [   0  132    6   55  291  111    0]
 [   0  256    0  357   87  133    0]
 [   0   31    3   82 1651   39    0]
 [   0  295    3  153  171  210    0]
 [   0    4    0    3    7    1    0]]
Recall:
uncategorized: 0.000
in play: 0.456
swinging strike: 0.010
called strike: 0.429
ball: 0.914
foul ball: 0.252
bunt: 0.000

Precision
uncategorized: nan
in play: 0.368
swinging strike: 0.429
called strike: 0.420
ball: 0.702
foul ball: 0.324
bunt: nan
[[   0    0    0    0    2    0    0]
 [   0  418    2  200  142  155    0]
 [   0  132    6   55  291  111    0]
 [   0  256    0  357   87  133    0]
 [   0   31    3   82 1651   39    0]
 [   0  295    3  153  171  210    0]
 [   0    4    0    3    7    1    0]]
[   2  917  595  833 1806  832  



In [60]:
mlp_12_7 = MLPClassifier(activation='tanh', hidden_layer_sizes=(12,7),
                               alpha = 0,
                               solver='sgd',
                               max_iter=5000,
                               learning_rate_init=.03
                              )
mlp_12_7 = Pipeline([('scaler', StandardScaler()),('mlp',mlp_12_7)])

In [63]:
clf_name = 'mlp_12_7.pickle'
try:
    #raise Exception
    mlp_12_7 = pickle.load(open(clf_name,'rb'))
except:
    file = open(clf_name,'wb')
    mlp_12_7.fit(X_train2,y_train2)
    pickle.dump(mlp_12_7,file)

In [64]:
quick_analysis_multi(mlp_12_7)

Train score:
0.5287
Dev   score:
0.5288
Dev Confusion matrix:
[[   0    0    0    0    2    0    0]
 [   0  418   39  182  132  146    0]
 [   0  102   52   51  260  130    0]
 [   0  274   14  336   94  115    0]
 [   0   22   39   64 1643   38    0]
 [   0  289   32  144  172  195    0]
 [   0    4    0    1    9    1    0]]
Recall:
uncategorized: 0.000
in play: 0.456
swinging strike: 0.087
called strike: 0.403
ball: 0.910
foul ball: 0.234
bunt: 0.000

Precision
uncategorized: nan
in play: 0.377
swinging strike: 0.295
called strike: 0.432
ball: 0.711
foul ball: 0.312
bunt: nan




In [61]:
mlp_15_10_7 = MLPClassifier(activation='tanh', hidden_layer_sizes=(15,10,7),
                               alpha = 0,
                               solver='sgd',
                               max_iter=5000,
                               learning_rate_init=.03
                              )
mlp_15_10_7 = Pipeline([('scaler', StandardScaler()),('mlp',mlp_15_10_7)])

In [62]:
clf_name = 'mlp_15_10_7.pickle'
try:
    #raise Exception
    mlp_15_10_7 = pickle.load(open(clf_name,'rb'))
except:
    file = open(clf_name,'wb')
    mlp_15_10_7.fit(X_train2,y_train2)
    pickle.dump(mlp_15_10_7,file)

In [79]:
#quick_analysis_multi(mlp_15_10_7)
x = confusion_matrix(y_dev2,mlp_15_10_7.predict(X_dev2))
print('Recall:')
for i in range(x.shape[1]):
    recall = x[i,i]/x[i,:].sum()
    print("{}: {:.3f}".format(result_dict[i],recall))
    
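## NOTE: the loop below prints under the label 'Precision', but what it computes is
## specificity (true negatives / actual negatives), not precision.
print('\nPrecision')
for i in range(x.shape[1]):
    negatives = x.sum()-x[i,:].sum()  ## pitches whose true class is not i
    true_negatives = (x.sum() - x[i,:].sum())-(x[:,i].sum()-x[i,i].sum())  ## negatives minus false positives
    precision = true_negatives/negatives
    print("{}: {:.3f}".format(result_dict[i],precision))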

Recall:
uncategorized: 0.000
in play: 0.475
swinging strike: 0.074
called strike: 0.465
ball: 0.910
foul ball: 0.225
bunt: 0.000

Precision
uncategorized: 1.000
in play: 0.800
swinging strike: 0.980
called strike: 0.884
ball: 0.816
foul ball: 0.915
bunt: 1.000


## Thoughts

That was a satisfying experience. Getting an MLP to substantially outperform the Logistic Regression was cool, though complex models barely outperformed simple ones. Sure, a score of .53 isn't very high, but the problem we're looking at is also really hard. There are at least two large factors in the outcome of a pitch besides the pitch itself (I think). One is the count. Major League hitters don't offer at a curveball that's heading toward the bottom corner of the strike zone in a 2-0 count. The other factor is the hitter himself. Some swing more than others, and some make contact more than others.

For fun, I want to see how much difference including the count makes.

In [93]:
df.columns
subset = physical_data + ['strikes','balls','result']
data3 = df[subset].copy()
data3.dropna(inplace = True)
X_train3, X_hold3, y_train3, y_hold3 = train_test_split(data3[subset].drop('result', axis=1),
                                                    data3['result'], test_size=10000)
X_dev3, X_test3, y_dev3, y_test3 = train_test_split(X_hold3, y_hold3, test_size = 0.5)

In [94]:
mlp_6 = MLPClassifier(activation='tanh', hidden_layer_sizes=(6),
                               alpha = 0,
                               solver='sgd',
                               max_iter=5000,
                               learning_rate_init=.03
                              )

mlp_6 = Pipeline([('scaler', StandardScaler()),('mlp', mlp_6)])

In [95]:
clf_name = 'mlp_6.pickle'
try:
    #raise Exception
    mlp_6 = pickle.load(open(clf_name,'rb'))
except:
    file = open(clf_name,'wb')
    mlp_6.fit(X_train3,y_train3)
    pickle.dump(mlp_6,file)

In [100]:
mlp_6.score(X_dev3,y_dev3)
print(confusion_matrix(y_dev3,mlp_6.predict(X_dev3)))

[[ 457   38  242  118   94    0]
 [ 120   36  113  231   52    0]
 [ 138   12  472  134   37    0]
 [  25   40   60 1653   23    0]
 [ 370   37  209  171  103    0]
 [   0    0    6    8    1    0]]


In [97]:
mlp_12_7.fit(X_train3,y_train3)

Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',
                 MLPClassifier(activation='tanh', alpha=0,
                               hidden_layer_sizes=(12, 7),
                               learning_rate_init=0.03, max_iter=5000,
                               solver='sgd'))])

In [99]:
mlp_12_7.score(X_dev3,y_dev3)
print(confusion_matrix(y_dev3,mlp_12_7.predict(X_dev3)))

[[ 426   17  228  129  149    0]
 [  96   22  103  223  108    0]
 [ 124    1  530   98   40    0]
 [  22   27   71 1657   24    0]
 [ 305   18  210  162  195    0]
 [   2    0    4    7    2    0]]


## Playtime...

Okay, I think there's not much else to be gained here unless I want to try batter id as categorical data. But while I'm sitting here with a big data set that I've played with a little, I think I want to learn a little about whether training time can be optimized better. The above was very slow. 

I ran a bunch of the code below (_sometimes in forms that I edited later._) Let's summarize what I learned:
1. The size of a hidden layer doesn't necessarily slow down learning. It seems like convergence can be reached faster if it's more complex.
1. The difference in performance of the model didn't change much at all based on whether I had 5,000 or 50,000 examples. 
> - Lesson: When playing around with this stuff and testing lots of models, it would behoove you to investigate where diminishing returns on training set size and model performance set in. Save yourself some time by getting a few parameters right on a small sample, then tune with a larger data set for fine details.
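
One way to look for that point of diminishing returns is scikit-learn's `learning_curve`, which refits a model on increasing training sizes and cross-validates each one. The sketch below uses a synthetic dataset from `make_classification` and a logistic regression so that it runs in seconds. The sizes are arbitrary choices, not the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pitch data.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))

# With cv=5, each training fold holds 1600 samples, so 1600 is the largest usable size.
sizes, train_scores, val_scores = learning_curve(
    model, X_demo, y_demo, train_sizes=[100, 400, 1600],
    cv=5, shuffle=True, random_state=0)

# Mean validation accuracy per training size; a flat tail suggests more data won't help.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(n, round(float(score), 3))
```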

In [152]:
mlp_sandbox = MLPClassifier(activation='tanh', hidden_layer_sizes=(1),
                               alpha = 0,
                               solver='sgd',
                               max_iter=5000,
                               learning_rate_init=.03
                              )
mlp_sandbox = Pipeline([('scaler', StandardScaler()),('mlp',mlp_sandbox)])

In [135]:
X_quick, X_slow, y_quick, y_slow = train_test_split(X_train3, y_train3, train_size=5_000)

In [154]:
start = time.time()
#mlp_sandbox.activation = 'relu'
mlp_sandbox.hidden_layer_sizes=(1)
#mlp_sandbox.learning_rate_init=.01
mlp_sandbox.fit(X_quick,y_quick)
end = time.time()
print(end - start) ##20.91 seconds on the first pass
print(mlp_sandbox.score(X_quick,y_quick))
mlp_sandbox.score(X_dev3,y_dev3)

4.433730602264404
0.5386


0.5428

In [151]:
start = time.time()
#mlp_sandbox.activation = 'relu'
#mlp_sandbox.hidden_layer_sizes=(12,4)
#mlp_sandbox.learning_rate_init=.01
mlp_sandbox.fit(X_quick,y_quick)
end = time.time()
print(end - start) ##20.91 seconds on the first pass
print(mlp_sandbox.score(X_quick,y_quick))
mlp_sandbox.score(X_dev3,y_dev3)

2.9261984825134277
0.5448


0.5442

In [131]:
#mlp_sandbox.activation = 'relu'
mlp_sandbox.hidden_layer_sizes=(12,4)
#mlp_sandbox.learning_rate_init=.01

start = time.time()
mlp_sandbox.fit(X_quick,y_quick)
end = time.time()
print(end - start) ##21.57 seconds on the first pass
print(mlp_sandbox.score(X_quick,y_quick))
mlp_sandbox.score(X_dev3,y_dev3)

2.0856385231018066
0.5664


0.5446

In [120]:
mlp_sandbox.activation = 'relu'
#mlp_sandbox.hidden_layer_sizes=12
mlp_sandbox.learning_rate_init=.01

start = time.time()
mlp_sandbox.fit(X_quick,y_quick)
end = time.time()
print(end - start) ##35.91 seconds on the first pass
print(mlp_sandbox.score(X_quick,y_quick))
mlp_sandbox.score(X_dev3,y_dev3)

35.90644931793213
0.56468


0.5596

In [123]:
mlp_sandbox.activation = 'tanh'
#mlp_sandbox.hidden_layer_sizes=12
mlp_sandbox.learning_rate_init=.03

mlp_sandbox.solver='sgd'
mlp_sandbox.learning_rate = 'adaptive'
start = time.time()
mlp_sandbox.fit(X_quick,y_quick)
end = time.time()
print(end - start) ##38.59 seconds on the first pass
print(mlp_sandbox.score(X_quick,y_quick))
mlp_sandbox.score(X_dev3,y_dev3)

38.59362196922302
0.56588


0.5672

In [124]:
mlp_sandbox.activation = 'tanh'
#mlp_sandbox.hidden_layer_sizes=12
mlp_sandbox.learning_rate_init=.1

mlp_sandbox.solver='sgd'
mlp_sandbox.learning_rate = 'adaptive'
start = time.time()
mlp_sandbox.fit(X_quick,y_quick)
end = time.time()
print(end - start) ##31.5 seconds on the first pass
print(mlp_sandbox.score(X_quick,y_quick))
mlp_sandbox.score(X_dev3,y_dev3)

31.500808477401733
0.56254


0.5648