## Building Off CatBoost to Determine Hit Outcomes
#### By Joe Leonard, Bennett Ross, and Nathan Moss
We are all fourth-year students at the University of Virginia. Joe is studying computer science, while Bennett and Nathan are both double-majoring in Statistics and Economics. Even combined, our experience isn't that extensive, so our modeling knowledge is mostly limited to what we have been taught in our courses.

When we first looked at this problem, our plan was to go off individually, try different modeling and optimization techniques, and then take whatever gave the best ROC AUC score and build from there. We hoped that something other than CatBoost would work, as we didn't want to directly copy Nick's original submission. We tried LightGBM, XGBoost, neural networks, Random Forest, and even basic Linear Regression (remember, our modeling background isn't too deep). Nothing scored better than simply running the CatBoost model. So we decided to learn about a technique we had never heard of before and get to fine-tuning it.

##### Imports

In [13]:
import os
import pandas as pd
import numpy as np
import catboost as cb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, RandomizedSearchCV, StratifiedKFold

train_df = pd.read_csv('train.csv')

##### Feature Engineering
We started by running Nick's sample submission, which right away landed us a better ROC AUC score than any of our own attempts. We felt pretty good about this starting point and decided to try various ways of fine-tuning it. You can see below that we one-hot encoded pitch type into indicator columns to get a better glimpse into which pitches may be more prone to certain hit outcomes. We also engineered a few new variables that we *thought* would be important to our solution, but as you'll find later on, we were pretty wrong.

In [2]:
# Formatting Pitch Type Column Using a One Hot Encoder
one_hot_encoded_df = pd.get_dummies(train_df['pitch_type'], prefix='pitch_type')
train_df = pd.concat([train_df, one_hot_encoded_df], axis=1)
train_df.drop('pitch_type', axis=1, inplace=True)

# Feature Engineering
## Derived Statistics
train_df['opposite_hand'] = train_df['is_lhp'] != train_df['is_lhb']
train_df['on_base'] = train_df['on_1b'] + train_df['on_2b'] + train_df['on_3b']
train_df['late_game'] = train_df['inning'] >= 7
train_df['pressure'] = (train_df['outs_when_up'] == 2) & ((train_df['on_2b'] == 1) | (train_df['on_3b'] == 1))
train_df['clutch'] = train_df['pressure'] & (train_df['late_game'] == 1)
train_df['fatigue'] = train_df['pitch_number'] / 110
train_df['momentum_x'] = train_df['release_speed'] * train_df['release_pos_x']
train_df['momentum_z'] = train_df['release_speed'] * train_df['release_pos_z']
train_df['total_acceleration'] = np.sqrt(train_df['ax']**2 + train_df['ay']**2 + train_df['az']**2)
train_df['pitch_break_horizontal'] = train_df['vx0'] - train_df['pfx_x']
train_df['pitch_break_vertical'] = train_df['vz0'] - train_df['pfx_z']
## Rolling Statistics
# Drop only the group level so the rolling results keep train_df's row labels and align on assignment
train_df = train_df.sort_values(by=['uid', 'pitch_number'])
train_df['avg_speed_last_5'] = train_df.groupby('uid')['release_speed'].rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_speed_last_10'] = train_df.groupby('uid')['release_speed'].rolling(window=10, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_spin_last_5'] = train_df.groupby('uid')['release_spin_rate'].rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_spin_last_10'] = train_df.groupby('uid')['release_spin_rate'].rolling(window=10, min_periods=1).mean().reset_index(level=0, drop=True)
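
One thing worth checking with grouped rolling statistics like the ones above is how the result gets re-indexed before it is assigned back. The sketch below uses a toy frame (the `group`, `order`, and `speed` columns are made up for illustration, not taken from the dataset) to show that dropping only the group level keeps values on the right rows, while dropping every index level can attach them to the wrong ones once the frame has been sorted.

```python
import pandas as pd

# Toy data: two interleaved groups, so sorting reorders the rows.
df = pd.DataFrame(
    {
        "group": ["b", "a", "b", "a"],
        "order": [1, 1, 2, 2],
        "speed": [90.0, 80.0, 94.0, 84.0],
    }
)
df = df.sort_values(["group", "order"])  # row labels are now [1, 3, 0, 2]

# Grouped rolling mean; the result is indexed by (group, original row label).
rolled = df.groupby("group")["speed"].rolling(window=2, min_periods=1).mean()

# Dropping only the group level keeps the original row labels,
# so pandas aligns each value with the row it was computed from.
df["aligned"] = rolled.reset_index(level=0, drop=True)

# Dropping every level relabels the values 0..n-1 in sorted order;
# assignment then aligns on those new labels and can hit the wrong rows.
df["misaligned"] = rolled.reset_index(drop=True)

print(df)
```

If the frame is already in sorted order with a default index, both versions give the same answer, which is why this kind of slip can go unnoticed.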

##### Initial Model
Once we had all these different features made, we felt it was time to give it a try. We kept the standard 3-fold cross validation and CatBoost model in there to give us a new baseline.

In [3]:
# Set Features and Target
feats = ['release_speed', 'release_pos_x', 'release_pos_z', 'is_lhp', 'is_lhb', 'balls', 'strikes', 
         'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b', 'outs_when_up', 'inning', 
         'is_top', 'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot', 'effective_speed', 
         'release_spin_rate', 'release_extension', 'release_pos_y', 'pitch_number', 'spin_axis', 'spray_angle', 
         'bat_speed', 'swing_length', 'pitch_type_CH', 'pitch_type_CU', 'pitch_type_EP', 'pitch_type_FA',
         'pitch_type_FC', 'pitch_type_FF', 'pitch_type_FO', 'pitch_type_FS', 'pitch_type_KC', 'pitch_type_KN', 
         'pitch_type_SC', 'pitch_type_SI', 'pitch_type_SL', 'pitch_type_ST', 'pitch_type_SV', 'opposite_hand', 
         'on_base', 'late_game', 'pressure', 'clutch', 'fatigue', 'momentum_x', 'momentum_z', 
         'total_acceleration', 'pitch_break_horizontal', 'pitch_break_vertical', 'avg_speed_last_5', 
         'avg_speed_last_10', 'avg_spin_last_5', 'avg_spin_last_10']
target = 'outcome_code'

# Model Training with K-Fold Cross Validation
folds = 3
kf = KFold(folds, shuffle=True)
outputs = pd.DataFrame()
for train_idx, test_idx in kf.split(train_df):
    train = train_df.iloc[train_idx]
    test = train_df.iloc[test_idx]
    model = cb.CatBoostClassifier(iterations=1000, verbose=False, loss_function='MultiClassOneVsAll', eval_metric='AUC')
    model.fit(train[feats], train[target])
    _df = pd.DataFrame(model.predict_proba(test[feats]), index=test.index)
    outputs = pd.concat([outputs, _df])

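# Post-Model Processing
# One-vs-all probabilities need not sum to 1, so compute each row's excess over 1 ('e')
# and subtract it from class 0 ('out') so that every row sums to 1
train_df = pd.concat([train_df, outputs], axis=1)
train_df['e'] = train_df.loc[:, range(5)].sum(axis=1).sub(1)
train_df[0] = train_df[0].sub(train_df['e'])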

# Output Summary
train_df = train_df.rename(columns={0:'out',1:'single',2:'double',3:'triple',4:'home_run'})
print(train_df.loc[:, ['uid','out','single','double','triple','home_run']].sample(10))
print(roc_auc_score(train_df[target], train_df.loc[:, ['out','single','double','triple','home_run']], multi_class='ovr'))

         uid       out    single    double    triple  home_run
23781  34542  0.790504  0.183133  0.017707  0.001052  0.007604
13609  19731  0.641020  0.179676  0.084935  0.006242  0.088127
6673    9679  0.681511  0.207263  0.060891  0.009370  0.040965
46284  67213  0.752945  0.199336  0.029713  0.002163  0.015843
26849  39137  0.708471  0.272160  0.015413  0.002471  0.001484
33060  48581  0.759034  0.106066  0.106670  0.005572  0.022658
43041  62694  0.714660  0.246058  0.029074  0.002337  0.007872
5785    8206  0.789151  0.153473  0.033952  0.006271  0.017153
21003  30519  0.403141  0.222127  0.357501  0.002014  0.015218
20528  29849  0.810001  0.125265  0.030931  0.002599  0.031204
0.7218415306415815
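
The post-model processing step above is easy to skim past, so here is a minimal sketch of what it does, using made-up numbers rather than real model output. With a one-vs-all loss, each class gets its own probability, so a row can sum to slightly more or less than 1. The notebook pushes the whole difference into the "out" column. Dividing each row by its sum is shown only as an alternative for comparison and is not what we used.

```python
import numpy as np

# Hypothetical one-vs-all scores for two batted balls over five outcome classes
# (out, single, double, triple, home run). Rows need not sum to 1.
proba = np.array(
    [
        [0.70, 0.20, 0.08, 0.01, 0.05],  # sums to 1.04
        [0.60, 0.25, 0.05, 0.01, 0.03],  # sums to 0.94
    ]
)

# Notebook approach: compute the excess over 1 and subtract it from class 0.
excess = proba.sum(axis=1) - 1.0
adjusted = proba.copy()
adjusted[:, 0] -= excess

# Alternative (not used in the notebook): rescale each row by its own sum.
normalized = proba / proba.sum(axis=1, keepdims=True)

print(adjusted.sum(axis=1))
print(normalized.sum(axis=1))
```

Both versions give rows that sum to 1, which `roc_auc_score` expects for multiclass input. The first leaves the four hit probabilities untouched, while the second shrinks or stretches every class proportionally.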


##### Pointing Out Important Statistics
Improvement! Thankfully, this gave us a better score than the original. Still, we figured we could probably do better, so let's check how important each feature was.

In [7]:
# Feature Importance Extraction (uses the model fit on the last fold)
feature_importances = model.get_feature_importance()
feature_names = model.feature_names_
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df.sort_values(by='Importance', ascending=False, inplace=True)
print(importance_df.head(15))
print(importance_df.tail(10))

               Feature  Importance
31         spray_angle   21.547941
9              plate_x    9.249407
32           bat_speed    8.895975
33        swing_length    8.399019
4               is_lhb    6.958930
10             plate_z    5.180351
23              sz_top    2.131489
28       release_pos_y    2.018287
57  total_acceleration    1.963190
21                  ay    1.722975
2        release_pos_z    1.550724
8                pfx_z    1.492220
24              sz_bot    1.390897
30           spin_axis    1.255123
19                 vz0    1.211319
          Feature  Importance
39  pitch_type_FF    0.023836
42  pitch_type_KC    0.006082
48  pitch_type_SV    0.003766
43  pitch_type_KN    0.001741
40  pitch_type_FO    0.001132
36  pitch_type_EP    0.000734
44  pitch_type_SC    0.000051
53         clutch    0.000000
37  pitch_type_FA    0.000000
52       pressure    0.000000


So sadly, we see that one-hot encoding pitch type didn't really do much, and looking back on it, this makes sense. Since pitchers have an arsenal of pitches, each pitch type is generally used about the same number of times throughout a game, apart from a heavy lean toward fastballs. Especially on a per-at-bat basis, you can't find much lean toward a specific pitch in a set of 3-7 pitches. We also see that our "clutch" and "pressure" features aren't important, two features we used to capture game situation and hoped would set us apart in the competition. That being said, we do see total_acceleration in the top 10, so not everything we engineered went to waste. After some experimentation, we found that keeping only features with an importance score above 2.2 gave the best output, so let's drop everything at or below 2.2 and see what we get.

In [8]:
# Selecting Important Features
threshold = 2.2
selected_features = importance_df[importance_df['Importance'] > threshold]['Feature'].tolist()
print(f"Selected Features: {selected_features}")

Selected Features: ['spray_angle', 'plate_x', 'bat_speed', 'swing_length', 'is_lhb', 'plate_z']
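
The filter above relies on importances from a single fitted model, so a feature sitting near the cutoff can flip in or out between runs. One option we did not use in this notebook is to average importances across folds before thresholding. The sketch below assumes the per-fold values were collected from `model.get_feature_importance()` inside the cross-validation loop. The feature names and numbers here are placeholders, not results.

```python
import pandas as pd

# Placeholder per-fold importances (one row per fold). In practice each row
# would come from the model fitted on that fold.
features = ["feat_a", "feat_b", "feat_c", "feat_d"]
fold_importances = [
    [21.5, 9.2, 8.9, 2.1],
    [20.9, 9.6, 8.4, 2.4],
    [22.1, 9.0, 9.1, 1.9],
]

importance_by_fold = pd.DataFrame(fold_importances, columns=features)

# Average across folds, then apply the same threshold as the notebook.
mean_importance = importance_by_fold.mean().sort_values(ascending=False)
threshold = 2.2
selected = mean_importance[mean_importance > threshold].index.tolist()

print(selected)
```

In this toy case `feat_d` clears the threshold in one fold but not on average, so it is dropped regardless of which fold happened to run last.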


We are left with spray_angle, plate_x, bat_speed, swing_length, is_lhb, and plate_z. Note that you may get slightly different features the next time you run the filtering, since the folds are shuffled without a fixed seed and importance can change with each model run, but you will likely get the features we got here. Looking at what we have, it is safe to say that hit outcome, at least according to our model, depends heavily on where the ball lands on the field. This is a pretty intuitive conclusion, and spray angle is the most important feature by a wide margin. Whether a batter took the ball pull side or opposite field can alter his *get-out-of-the-box speed* and possibly lead to a longer play. We also see is_lhb up there, another feature that relates to *GOTB speed*: left-handed hitters get to first base quicker than right-handed hitters because they start closer to it. These features are interesting, so let's see how the model performs now.

In [10]:
# Reloading the Data
train_df = pd.read_csv('train.csv')

# Formatting Pitch Type Column Using a One Hot Encoder
one_hot_encoded_df = pd.get_dummies(train_df['pitch_type'], prefix='pitch_type')
train_df = pd.concat([train_df, one_hot_encoded_df], axis=1)
train_df.drop('pitch_type', axis=1, inplace=True)

# Feature Engineering
## Derived Statistics
train_df['opposite_hand'] = train_df['is_lhp'] != train_df['is_lhb']
train_df['on_base'] = train_df['on_1b'] + train_df['on_2b'] + train_df['on_3b']
train_df['late_game'] = train_df['inning'] >= 7
train_df['pressure'] = (train_df['outs_when_up'] == 2) & ((train_df['on_2b'] == 1) | (train_df['on_3b'] == 1))
train_df['clutch'] = train_df['pressure'] & (train_df['late_game'] == 1)
train_df['fatigue'] = train_df['pitch_number'] / 110
train_df['momentum_x'] = train_df['release_speed'] * train_df['release_pos_x']
train_df['momentum_z'] = train_df['release_speed'] * train_df['release_pos_z']
train_df['total_acceleration'] = np.sqrt(train_df['ax']**2 + train_df['ay']**2 + train_df['az']**2)
train_df['pitch_break_horizontal'] = train_df['vx0'] - train_df['pfx_x']
train_df['pitch_break_vertical'] = train_df['vz0'] - train_df['pfx_z']
## Rolling Statistics
# Drop only the group level so the rolling results keep train_df's row labels and align on assignment
train_df = train_df.sort_values(by=['uid', 'pitch_number'])
train_df['avg_speed_last_5'] = train_df.groupby('uid')['release_speed'].rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_speed_last_10'] = train_df.groupby('uid')['release_speed'].rolling(window=10, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_spin_last_5'] = train_df.groupby('uid')['release_spin_rate'].rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_spin_last_10'] = train_df.groupby('uid')['release_spin_rate'].rolling(window=10, min_periods=1).mean().reset_index(level=0, drop=True)

# Re-Modeling
folds = 3
kf = KFold(folds, shuffle=True)
outputs = pd.DataFrame()
for train_idx, test_idx in kf.split(train_df):
    train = train_df.iloc[train_idx]
    test = train_df.iloc[test_idx]
    model = cb.CatBoostClassifier(iterations=1000, verbose=False, loss_function='MultiClassOneVsAll', eval_metric='AUC')
    model.fit(train.loc[:, selected_features], train[target])
    _df = pd.DataFrame(model.predict_proba(test.loc[:, selected_features]), index=test.index)
    outputs = pd.concat([outputs, _df])

# Post-Model Processing
train_df = pd.concat([train_df, outputs], axis=1)
train_df['e'] = train_df.loc[:, range(5)].sum(axis=1).sub(1)
train_df[0] = train_df[0].sub(train_df['e'])

# Output Summary
train_df = train_df.rename(columns={0:'out',1:'single',2:'double',3:'triple',4:'home_run'})
print(train_df.loc[:, ['uid','out','single','double','triple','home_run']].sample(10))
print(roc_auc_score(train_df[target], train_df.loc[:, ['out','single','double','triple','home_run']], multi_class='ovr'))

         uid       out    single    double    triple  home_run
21456  31193  0.714045  0.158761  0.028921  0.003255  0.095018
10939  15698  0.654540  0.257735  0.040278  0.000898  0.046548
36183  52933  0.690601  0.244863  0.042833  0.010189  0.011513
30497  44617  0.459363  0.266697  0.053373  0.001460  0.219107
2696    3751  0.629428  0.308664  0.024068  0.001499  0.036341
4658    6539  0.727001  0.250672  0.017207  0.002252  0.002868
18617  27202  0.874991  0.078869  0.016267  0.002703  0.027171
33619  49343  0.713495  0.278694  0.004941  0.001452  0.001418
14765  21361  0.508058  0.222679  0.042987  0.001040  0.225236
1074    1438  0.489152  0.215623  0.049399  0.001409  0.244417
0.7375037755191928
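
Since every score in this notebook comes from `roc_auc_score(..., multi_class='ovr')`, it helps to know what that number is. With scikit-learn's default macro averaging, it is the unweighted mean of one binary AUC per class, each treating that class as positive and everything else as negative. The sketch below checks this on synthetic three-class data (the labels and probabilities are randomly generated for illustration only).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic three-class labels and probabilities whose rows sum to 1.
y_true = np.array([0, 1, 2] * 20)
scores = rng.random((60, 3))
scores[np.arange(60), y_true] += 1.0  # nudge scores toward the true class
proba = scores / scores.sum(axis=1, keepdims=True)

# Multiclass one-vs-rest AUC with the default macro averaging.
ovr_auc = roc_auc_score(y_true, proba, multi_class="ovr")

# Same quantity by hand: one binary AUC per class, then an unweighted mean.
per_class = [
    roc_auc_score((y_true == k).astype(int), proba[:, k]) for k in range(3)
]
manual_auc = float(np.mean(per_class))

print(ovr_auc, manual_auc)
```

Because the mean is unweighted, a rare class such as triples counts just as much toward the final score as outs do.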


##### Changing the Cross-Validation Technique
Great! We see another improvement. Looks like keying in on important features really does help. One last thing we wanted to check was whether the cross-validation technique we were using was the best choice. So, we swapped the standard 3-fold cross-validation for Stratified K-Fold, which keeps the share of each outcome class roughly the same in every fold.

In [11]:
# Reloading the Data
train_df = pd.read_csv('train.csv')

# Formatting Pitch Type Column Using a One Hot Encoder
one_hot_encoded_df = pd.get_dummies(train_df['pitch_type'], prefix='pitch_type')
train_df = pd.concat([train_df, one_hot_encoded_df], axis=1)
train_df.drop('pitch_type', axis=1, inplace=True)

# Feature Engineering
## Derived Statistics
train_df['opposite_hand'] = train_df['is_lhp'] != train_df['is_lhb']
train_df['on_base'] = train_df['on_1b'] + train_df['on_2b'] + train_df['on_3b']
train_df['late_game'] = train_df['inning'] >= 7
train_df['pressure'] = (train_df['outs_when_up'] == 2) & ((train_df['on_2b'] == 1) | (train_df['on_3b'] == 1))
train_df['clutch'] = train_df['pressure'] & (train_df['late_game'] == 1)
train_df['fatigue'] = train_df['pitch_number'] / 110
train_df['momentum_x'] = train_df['release_speed'] * train_df['release_pos_x']
train_df['momentum_z'] = train_df['release_speed'] * train_df['release_pos_z']
train_df['total_acceleration'] = np.sqrt(train_df['ax']**2 + train_df['ay']**2 + train_df['az']**2)
train_df['pitch_break_horizontal'] = train_df['vx0'] - train_df['pfx_x']
train_df['pitch_break_vertical'] = train_df['vz0'] - train_df['pfx_z']
## Rolling Statistics
# Drop only the group level so the rolling results keep train_df's row labels and align on assignment
train_df = train_df.sort_values(by=['uid', 'pitch_number'])
train_df['avg_speed_last_5'] = train_df.groupby('uid')['release_speed'].rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_speed_last_10'] = train_df.groupby('uid')['release_speed'].rolling(window=10, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_spin_last_5'] = train_df.groupby('uid')['release_spin_rate'].rolling(window=5, min_periods=1).mean().reset_index(level=0, drop=True)
train_df['avg_spin_last_10'] = train_df.groupby('uid')['release_spin_rate'].rolling(window=10, min_periods=1).mean().reset_index(level=0, drop=True)

# Re-Modeling with New CV Technique
folds = 3
kf = StratifiedKFold(n_splits=folds, shuffle=True)
outputs = pd.DataFrame()
for train_idx, test_idx in kf.split(train_df, train_df[target]):
    train = train_df.iloc[train_idx]
    test = train_df.iloc[test_idx]
    model = cb.CatBoostClassifier(iterations=1000, verbose=False, loss_function='MultiClassOneVsAll', eval_metric='AUC')
    model.fit(train[selected_features], train[target])
    _df = pd.DataFrame(model.predict_proba(test[selected_features]), index=test.index)
    outputs = pd.concat([outputs, _df])

# Post-Model Processing
train_df = pd.concat([train_df, outputs], axis=1)
train_df['e'] = train_df.loc[:, range(5)].sum(axis=1).sub(1)
train_df[0] = train_df[0].sub(train_df['e'])

# Output Summary
train_df = train_df.rename(columns={0:'out',1:'single',2:'double',3:'triple',4:'home_run'})
print(train_df.loc[:, ['uid','out','single','double','triple','home_run']].sample(10))
print(roc_auc_score(train_df[target], train_df.loc[:, ['out','single','double','triple','home_run']], multi_class='ovr'))

         uid       out    single    double    triple  home_run
1897    2635  0.520172  0.172141  0.076279  0.014714  0.216694
37380  54595  0.805990  0.042502  0.044690  0.003603  0.103215
49492  71909  0.708607  0.175465  0.064639  0.002911  0.048377
10885  15618  0.552414  0.147686  0.286174  0.011597  0.002129
31000  45287  0.728233  0.242890  0.013445  0.002339  0.013092
44578  64909  0.618781  0.123345  0.104079  0.004959  0.148836
32852  47986  0.864972  0.056578  0.067320  0.002225  0.008904
39578  57580  0.552020  0.081792  0.099866  0.001812  0.264510
37500  54763  0.804616  0.154291  0.015793  0.003607  0.021693
48633  70771  0.783882  0.205723  0.007438  0.001701  0.001255
0.738911613708389
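
To illustrate why stratification might matter for a target with rare classes, here is a small sketch on synthetic labels (294 of a common class and 6 of a rare one, chosen only for illustration). It counts how many rare-class rows land in each test fold under plain `KFold` versus `StratifiedKFold`. The fixed `random_state` is an addition for repeatability and is not set in the notebook cells above.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 294 rows of class 0 and 6 rows of a rare class 1.
y = np.array([0] * 294 + [1] * 6)
X = np.zeros((len(y), 1))  # features do not affect how the splits are drawn


def rare_counts(splitter):
    """Count rare-class rows in each test fold produced by the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = rare_counts(KFold(n_splits=3, shuffle=True, random_state=0))
stratified = rare_counts(StratifiedKFold(n_splits=3, shuffle=True, random_state=0))

print("KFold:", plain)
print("StratifiedKFold:", stratified)
```

With stratification every fold gets the same number of rare-class rows, whereas plain shuffling leaves that to chance. On a dataset as large as ours the difference is small, which fits the modest change in score we saw.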


##### Conclusion
During our experimentation phase, we looked into adding more folds and introducing Randomized Search. Ultimately, we didn't find a big enough improvement to justify adding them and making our notebook even busier. Although we are sure our solution will not produce the greatest ROC AUC score, we are excited to have learned more about CatBoost modeling and to have produced an output that performs better than our baseline. We hope to eventually optimize this model further and gain more insight into how different features play a part in hit outcome.