# Pitcher/Batter Embeddings

**Author:** Joel Klein, Jacob Sauberman, Ben Perkins

**Date:** November 21, 2021

**Description:** This script loads MLB Statcast data for every pitch recorded since 2017 and generates batter and pitcher embedding vectors. The batter and the pitcher are important factors that affect which pitch type is thrown within an at-bat. Embedding vectors represent each batter and pitcher as a dense, low-dimensional vector rather than a sparse, high-dimensional one-hot encoded vector. These embeddings will eventually be used as input to an attention model that predicts the next pitch type.

**Data:** The data is scraped from *baseballsavant.com* for each year and combined together in one large data file. This file leverages the training and validation data which is prepared in the statcast_data_preparation.ipynb file in the code folder.

**Scope:** The data from the 2017 and 2018 seasons is used for training data while the 2019 season is used for validation. 2020 and 2021 seasons will be used for the testing data in the next pitch prediction model. 

There are many instances in Major League Baseball where rookie batters and pitchers receive at-bats with little or no prior pitch sequence data. To predict next pitch types in the 2019, 2020, and 2021 seasons, the batter and the pitcher need a significant number of recorded at-bats in the 2017 and 2018 seasons. Only the batters and pitchers accounting for 90% of at-bats in the 2017 and 2018 seasons were included in scope (441 batters and 512 pitchers). A 9-dimensional embedding vector is generated for each batter and for each pitcher. The final outputs of this file are two pandas data frames: a mapping of batter ids to batter embeddings and a mapping of pitcher ids to pitcher embeddings.
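
The 90% coverage filter described above can be sketched as follows. This is an illustration on toy data, not the code from statcast_data_preparation.ipynb: the function name `players_covering_share` is hypothetical, and counting rows per player id stands in for counting at-bats.

```python
import numpy as np

def players_covering_share(player_ids, share=0.9):
    """Return the most frequent ids whose rows together cover `share` of the data."""
    ids, counts = np.unique(np.asarray(player_ids), return_counts=True)
    order = np.argsort(-counts, kind="stable")          # most frequent first
    cum_share = np.cumsum(counts[order]) / counts.sum()
    # keep every player up to and including the first one that reaches the target
    n_keep = min(int(np.searchsorted(cum_share, share)) + 1, len(ids))
    return ids[order][:n_keep]

# toy example: player 10 has 6 rows, player 20 has 3, player 30 has 1
toy_ids = [10] * 6 + [20] * 3 + [30]
kept = players_covering_share(toy_ids, share=0.8)
print(kept)
```

Players outside the returned set would be dropped from scope, which is how rookies with very few recorded at-bats end up excluded.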

**Notes:** 

**Warnings:** 

**Outline:** 
  - Import Libraries
  - Global Options
  - Set Directories
  - Define Functions
  - Load Data
  - Data Preparation
  - Data Splitting
  - Data Sampling
  - One Hot Encoding
  - Data Loaders
  - Train Model
  - Final Output
  - Batter & Pitcher Embedding Analysis

## Import Libraries

In [None]:
# data manipulation
import numpy as np
import pandas as pd 
import os
import zipfile

# plotting
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import plotly.express as px

# pytorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
import torchvision
import torch.utils.data
import torchvision.transforms as transforms
from torchvision.transforms import ToTensor
from torch.utils.data import random_split
from torch import cuda
from torch.autograd import Variable
from torch.utils.data import TensorDataset

# splitting data
from sklearn.model_selection import train_test_split

# preprocessing
from sklearn.preprocessing import OneHotEncoder

# pipelines
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline

# random number generator
import random

# other
import IPython
import pickle
from os import listdir
from os.path import isfile, join

# pybaseball
!pip install pybaseball
from pybaseball import playerid_reverse_lookup



## Global Options

In [None]:
# do not show warnings
import warnings
warnings.filterwarnings('ignore')

# set pandas display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
pd.set_option('display.max_colwidth', 2000)
pd.options.display.float_format = '{:.5f}'.format

## Set Directories

In [None]:
# mount data drive on colab
from google.colab import drive
drive.mount('/content/drive')

# set folder directories to load and save data
DATA_DIR = "/content/drive/MyDrive/final-project-dl/data"
STATCAST_DATA_DIR = "/content/drive/MyDrive/final-project-dl/data/statcast"
EMBEDDINGS_DATA_DIR = "/content/drive/MyDrive/final-project-dl/data/embeddings"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Define Functions

In [None]:
##### Define data loading function ----

# define a function for loading in dataset
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    
    return df

In [None]:
#### Define training functions ----

# define function to train a model at each epoch and return train loss and accuracy
def train_epoch(epoch, clf, criterion, opt, data_loader, printlog=False, printnum=1):

  """Train pytorch model and collect training results
  
  Keyword arguments:
  
  epoch -- index of the current epoch (used only for logging)
  clf -- torch nn module
  criterion -- loss function
  opt -- torch optimizer function
  data_loader -- torch train data loader
  printlog -- indicator to print the training loss and accuracy (default=False)
  printnum -- print the output after this number of epochs
  """

  # training mode
  clf.train()

  # reset train_loss, correct predictions, and predictions
  correct = 0.0
  n = 0.0
  train_loss = 0.0

  # training via batching
  for batch_id, data in enumerate(data_loader):
      
      # load in input and target data in batch
      batter_input, pitcher_input, target = data[0].to(device), data[1].to(device), data[2].to(device)
      
      # predict the train data
      y_pred = clf(batter_input, pitcher_input)
      
      # calculate loss
      loss = criterion(y_pred, target)
      
      # zero the gradient
      opt.zero_grad()

      # backpropagation
      loss.backward()
      
      # update model weights
      opt.step()

      # append loss stat
      train_loss += loss.item()

      # returns class prediction
      pred_class = torch.max(y_pred.data, 1)[1]

      # number of correct predictions
      correct += (pred_class == target).sum().item()
        
      # append number of predictions made
      n += target.size(0)

  # print run stats at end of each epoch for average loss, accuracy
  if (printlog == True) & (epoch % printnum == 0):
    print(f'Epoch {epoch}: | Train Loss: {train_loss/len(data_loader):.5f} | Train Acc: {correct/n:.5f}')

  train_loss = train_loss/len(data_loader) # running loss
  train_accuracy = correct/n # train accuracy for epoch

  return clf, train_loss, train_accuracy

# define function to evaluate a model at each epoch and return test loss and accuracy
def evaluate_model(epoch, clf, criterion, opt, data_loader, printlog=False, printnum=1):

  """Evaluate pytorch model and collect test results
  
  Keyword arguments:
  
  epoch -- index of the current epoch (used only for logging)
  clf -- torch nn module
  criterion -- loss function
  opt -- torch optimizer function
  data_loader -- torch test data loader
  printlog -- indicator to print the test loss and accuracy (default=False)
  printnum -- print the output after this number of epochs
  """

  clf.eval() # set model in inference mode
  
  # set test_loss, correct predictions, and predictions
  test_loss = 0.0
  correct = 0.0
  n = 0.0

  with torch.no_grad():
    for i,data in enumerate(data_loader):
        
        # load in input and target data in batch
        batter_input, pitcher_input, target = data[0].to(device), data[1].to(device), data[2].to(device)
      
        # predict the test data
        y_pred = clf(batter_input, pitcher_input)
        
        # calculate loss
        loss = criterion(y_pred, target)
        
        # returns class prediction
        pred_class = torch.max(y_pred.data, 1)[1]
        
        # append loss stat
        test_loss += loss.item()
        
        # number of correct predictions
        correct += (pred_class == target).sum().item()
        
        # append test accuracy
        n += target.size(0)

  # print run stats at end of each epoch for average loss, accuracy
  if printlog and (epoch % printnum == 0):
    print(f'Epoch {epoch}: | Test Loss: {test_loss/len(data_loader):.5f} | Test Acc: {correct/n:.5f}')

  test_loss = test_loss/len(data_loader) # running loss
  test_accuracy = correct/n

  return test_loss, test_accuracy


##### Define training and evaluation function wrapper ----

# define a function to train and evaluate a classifier for a given optimizer,
# loss function, learning rate, and number of epochs
def train_and_evaluate(clf, opt, epochs, learning_rate, criterion, printlog, train_data_loader, test_data_loader, printnum):

  """Train and evaluate a pytorch model given the classifier, optimizer class,
  loss function class, and training parameters. Returns the trained model and
  the per-epoch train and test loss and accuracy.
  
  Keyword arguments:
  
  clf -- torch nn module
  opt -- torch optimizer function
  epochs -- number of epochs
  learning_rate -- learning rate for weight optimization
  criterion -- loss function
  printlog -- indicator to print the training and test loss and accuracy (default=False)
  train_data_loader -- training data loader 
  test_data_loader -- out of sample data loader for performance estimation
  printnum -- print the output after this number of epochs
  """

  # place classifier on the device
  clf.to(device)

  # show classifier architecture
  print(clf)

  # set optimizer
  opt = opt(clf.parameters(), lr=learning_rate)
  
  # set loss function
  criterion = criterion()

  # initialize vectors to store performance
  train_loss = []
  train_accuracy = []
  test_loss = []
  test_accuracy = []

  # train the model using parameters
  for epoch in range(epochs):
      clf, train_loss_epoch, train_accuracy_epoch = train_epoch(epoch, clf, criterion, opt, train_data_loader, printlog, printnum)
      test_loss_epoch, test_accuracy_epoch = evaluate_model(epoch, clf, criterion, opt, test_data_loader, printlog, printnum)
      train_loss.append(train_loss_epoch)
      train_accuracy.append(train_accuracy_epoch)
      test_loss.append(test_loss_epoch)
      test_accuracy.append(test_accuracy_epoch)

  return clf, train_loss, train_accuracy, test_loss, test_accuracy

## Load Data

Let's load in the training, validation, and test data files produced from the data preparation code. 

In [None]:
# load in each dataset
train_data = load_data(os.path.join(STATCAST_DATA_DIR, 'train_data.csv'), 'train_data')
validation_data = load_data(os.path.join(STATCAST_DATA_DIR, 'validation_data.csv'), 'validation_data')
test_data = load_data(os.path.join(STATCAST_DATA_DIR, 'test_data.csv'), 'test_data')

train_data: shape is (785777, 63)
validation_data: shape is (344941, 63)
test_data: shape is (267635, 63)


## Data Preparation

To proceed with model training, we need to encode the pitch classes as integer codes. This is done for both the training and the validation data sets.
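
One caveat with encoding each data set separately: `pd.Categorical` assigns codes from the sorted categories it finds, so the codes only line up across data sets when both contain the same set of classes. The sketch below uses made-up pitch labels to show the difference between encoding a split on its own and encoding it with the training categories; the variable names are illustrative.

```python
import pandas as pd

train_pitches = pd.Series(["SL", "FF", "CH", "FF"])
valid_pitches = pd.Series(["SL", "FF"])   # no "CH" in this toy split

# encoding each split on its own: categories are sorted per split
train_cat = pd.Categorical(train_pitches)                  # CH=0, FF=1, SL=2
valid_codes_separate = pd.Categorical(valid_pitches).codes  # FF=0, SL=1

# reusing the training categories keeps the labels aligned
valid_codes_shared = pd.Categorical(
    valid_pitches, categories=train_cat.categories
).codes

print(list(train_cat.codes))        # [2, 1, 0, 1]
print(list(valid_codes_separate))   # [1, 0]
print(list(valid_codes_shared))     # [2, 1]
```

If every pitch class appears in both the training and validation data, the two approaches give the same codes.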

In [None]:
# change the pitch type to be a numeric representation for model to handle
train_data['pitch_class'] = pd.Categorical(train_data['pitch_class'])
train_data['pitch_class'] = train_data['pitch_class'].cat.codes

# filter the dataset to only get batter, pitcher, and pitch type fields
train_data = train_data[['batter', 'pitcher','pitch_class']]

# split the input and labels
x_train = train_data[['batter', 'pitcher']]
y_train = train_data['pitch_class']

# change the pitch type to be a numeric representation for model to handle
validation_data['pitch_class'] = pd.Categorical(validation_data['pitch_class'])
validation_data['pitch_class'] = validation_data['pitch_class'].cat.codes

# filter the dataset to only get batter, pitcher, and pitch type fields
validation_data = validation_data[['batter', 'pitcher', 'pitch_class']]

# split the input and labels
x_valid = validation_data[['batter', 'pitcher']]
y_valid = validation_data['pitch_class']

## Data Sampling

There are about 786k pitches in scope from the 2017 and 2018 seasons. Let's sample 50% of this data for training. Let's also sample 50% of the 2019 pitch data to estimate model accuracy on the validation set.
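
Stratifying the sample by pitch class keeps the class proportions in the sample the same as in the full data. The plain-NumPy sketch below illustrates the idea on toy labels; it is not how scikit-learn implements `stratify`, and the function name `stratified_sample` is hypothetical.

```python
import numpy as np

def stratified_sample(labels, frac=0.5, seed=42):
    """Return sorted row indices that sample `frac` of the rows in each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)        # rows belonging to this class
        n_take = int(round(frac * len(idx)))
        picked.append(rng.choice(idx, size=n_take, replace=False))
    return np.sort(np.concatenate(picked))

# toy labels with a 60/30/10 class split
toy_labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
sample_idx = stratified_sample(toy_labels)
sample_counts = np.bincount(toy_labels[sample_idx])
print(sample_counts / len(sample_idx))   # same 60/30/10 split as the full data
```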

In [None]:
##### Data Sampling -----

# randomly sample 50% of the training data, stratified by pitch class
x_train_samp, x_1, y_train_samp, y_1 = train_test_split(x_train, y_train, stratify = y_train, train_size=.5, random_state=42)

# randomly sample 50% of the validation data, stratified by pitch class
x_valid_samp, x_1, y_valid_samp, y_1 = train_test_split(x_valid, y_valid, stratify = y_valid, train_size=.5, random_state=42)

# delete objects no longer needed in memory
import gc
gc.enable()
del x_1, y_1
gc.collect()

print('Training data size:',len(x_train_samp))
print('Validation data size:',len(x_valid_samp))

Training data size: 392888
Validation data size: 172470


In [None]:
# get the number of batters and pitchers in the data set
n_batters = len(np.unique(x_train_samp['batter']))
n_pitchers = len(np.unique(x_train_samp['pitcher']))

In [None]:
# make sure all batters in the validation set are in the training set
print(np.unique(x_valid_samp['batter'].isin(x_train_samp['batter']))) # should be True

# make sure all pitchers in the validation set are in the training set
print(np.unique(x_valid_samp['pitcher'].isin(x_train_samp['pitcher']))) # should be True

[ True]
[ True]


## One Hot Encoding

The next step is to one-hot encode the input training and validation data based on the batter and pitcher ids. Each batter and pitcher will be represented as an OHE vector.

In [None]:
##### Data preparation - OHE -----

categorical_features = ['batter','pitcher']

# create pipeline for categorical features
# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on newer releases use sparse_output=False
cat_pipeline = Pipeline([
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

# specify the column transformer for the categorical features
data_pipeline = ColumnTransformer(
    [("cat_pipeline", cat_pipeline, categorical_features)],
    remainder='passthrough')

# generate the pipeline w/ feature selection
full_data_pipeline = Pipeline([
    ("preparation", data_pipeline)
])

# fit the model pipeline on the training data and transform the training data
x_train_ohe = full_data_pipeline.fit_transform(x_train_samp, y_train_samp)

# transform the validation data
x_valid_ohe = full_data_pipeline.transform(x_valid_samp)

## Data Loaders

The batter and pitcher OHE vectors will be split into two sets: one for the batters and one for the pitchers. Each set is passed into a separate embedding layer within the network. The outputs of the embedding layers are concatenated and passed into a fully connected layer to produce the pitch prediction. Because embedding layers take integer indices rather than one-hot vectors, we use argmax to recover the index of each batter and pitcher from its one-hot row.
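
The relationship between one-hot rows, argmax indices, and embedding lookups can be illustrated with a small NumPy sketch. The weight matrix here is random and stands in for a trained embedding table; all sizes and names are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, emb_dim = 4, 3
weights = rng.normal(size=(n_players, emb_dim))   # stand-in for an embedding table

player_idx = np.array([2, 0, 3])
one_hot = np.eye(n_players)[player_idx]   # one row per pitch, one column per player

recovered_idx = one_hot.argmax(axis=1)    # argmax recovers the player's column index
lookup = weights[recovered_idx]           # row lookup, as an embedding layer does
matmul = one_hot @ weights                # equivalent dense matrix product

print(recovered_idx)                  # [2 0 3]
print(np.allclose(lookup, matmul))    # True
```

The lookup gives the same result as multiplying the one-hot matrix by the weights, without materializing the wide one-hot input inside the model.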

In [None]:
#### Data Loaders ----

# set batch size for training data loader
BATCH_SIZE = 1000

# set seed for reproducibility
torch.manual_seed(42)

# get the training vectors
x_train_batters = np.float64(x_train_ohe[:, :n_batters]) # there are 441 batters
x_train_pitchers = np.float64(x_train_ohe[:, n_batters:]) # the remaining vectors are pitchers

# get the validation vectors
x_valid_batters = np.float64(x_valid_ohe[:, :n_batters]) # there are 441 batters
x_valid_pitchers = np.float64(x_valid_ohe[:, n_batters:]) # the remaining vectors are pitchers

# load input indices and pitch labels into the training data tensor data set
train_data = TensorDataset(torch.argmax(torch.FloatTensor(x_train_batters), 1).reshape(-1, 1), torch.argmax(torch.FloatTensor(x_train_pitchers), 1).reshape(-1, 1), torch.LongTensor(np.array(y_train_samp)))

# load input indices and pitch labels into the validation data tensor data set
validation_data = TensorDataset(torch.argmax(torch.FloatTensor(x_valid_batters), 1).reshape(-1, 1), torch.argmax(torch.FloatTensor(x_valid_pitchers), 1).reshape(-1, 1), torch.LongTensor(np.array(y_valid_samp)))

# set training data loader
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

# set validation data loader
validation_loader = DataLoader(validation_data, batch_size=x_valid_pitchers.shape[0], shuffle=False, num_workers=0)

## Train Model

Before training we need to set up the GPU instance.


In [None]:
#### Set up the GPU instance ----

# set up GPU instance
device = 'cuda' if cuda.is_available() else 'cpu'

# we should be printing out CUDA
print(f"Training will occur using a {device} device")

Training will occur using a cuda device


The embedding architecture takes as input a vector of batter indices and a vector of pitcher indices. Each is passed into its own embedding layer. The outputs of the embedding layers are then reshaped and concatenated. The concatenated vector passes to the final fully connected layer to generate the pitch prediction. We are interested in the trained embedding layer weights.
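
As a rough size check for this architecture, the number of trainable parameters can be worked out by hand from the 441 batters, 512 pitchers, 9 embedding dimensions, and 6 pitch classes used in this notebook. The helper name below is hypothetical.

```python
def embedding_model_param_count(n_batters, n_pitchers, emb_dim, n_classes):
    """Count weights in two embedding tables plus one linear layer with bias."""
    batter_table = n_batters * emb_dim                 # 441 * 9 = 3969
    pitcher_table = n_pitchers * emb_dim               # 512 * 9 = 4608
    linear = 2 * emb_dim * n_classes + n_classes       # 18 * 6 + 6 = 114
    return batter_table + pitcher_table + linear

total_params = embedding_model_param_count(441, 512, 9, 6)
print(total_params)   # 8691
```

Almost all of the parameters sit in the two embedding tables, which are the part of the model kept for later use.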

In [None]:
#### Define the model architecture ----

class batterPitcherEmbed(nn.Module):
    
    def __init__(self, batter_size, pitcher_size, embedding_size, activation, classes):
        super(batterPitcherEmbed, self).__init__()
        
        # set the number of output classes
        self.classes = classes

        # embedding layers
        self.batter_embeddings = nn.Embedding(batter_size, embedding_size)
        self.pitcher_embeddings = nn.Embedding(pitcher_size, embedding_size)

        # linear layer
        self.linear_layer_1 = nn.Linear(2*embedding_size, self.classes)

        # activation function passed in by user
        self.activation = activation

    def forward(self, batter_vec, pitcher_vec):
        
        # batter embeddings
        batter_emb = self.batter_embeddings(batter_vec).view((batter_vec.size(0), -1))
        batter_emb = self.activation(batter_emb)

        # pitcher embeddings
        pitcher_emb = self.pitcher_embeddings(pitcher_vec).view((pitcher_vec.size(0), -1))
        pitcher_emb = self.activation(pitcher_emb)

        # concatenate the embeddings
        x = torch.cat((batter_emb, pitcher_emb), dim=1)

        # fully connected layer
        x = self.linear_layer_1(x)

        return x

Let's train the model with the following parameters:

- Default initialization
- Relu activation after embedding layers
- Adam optimization with 0.0005 learning rate
- Cross Entropy Loss (for multiclass classification task)
- 1000 batch size
- 50 epochs

The embeddings will be 9 dimensional.
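
A useful reference point when reading the loss values: a classifier that spreads probability evenly over K classes has a cross-entropy loss of ln K. The short sketch below works this out for the 6 pitch classes; the helper function is illustrative and is not PyTorch's implementation.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])

n_classes = 6
uniform = [1.0 / n_classes] * n_classes
uniform_loss = cross_entropy(uniform, target=2)
print(round(uniform_loss, 4))   # 1.7918
```

A loss well below this value indicates the model is doing better than guessing uniformly across pitch classes.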

In [None]:
#### Train Model ----

# set classifier
embedding_nn = batterPitcherEmbed(n_batters, n_pitchers, 9, nn.ReLU(), 6) # 441 batters, 512 pitchers, 9 dimensions embeddings, relu activation, 6 pitch type classes

# train and evaluate model with parameters
embedding_nn, train_loss, train_accuracy, validation_loss, validation_accuracy = train_and_evaluate(embedding_nn, optim.Adam, 50, 0.0005, nn.CrossEntropyLoss, True, train_loader, validation_loader, 1)

batterPitcherEmbed(
  (batter_embeddings): Embedding(441, 9)
  (pitcher_embeddings): Embedding(512, 9)
  (linear_layer_1): Linear(in_features=18, out_features=6, bias=True)
  (activation): ReLU()
)
Epoch 0: | Train Loss: 1.69726 | Train Acc: 0.30415
Epoch 0: | Test Loss: 1.67903 | Test Acc: 0.33959
Epoch 1: | Train Loss: 1.63698 | Train Acc: 0.34307
Epoch 1: | Test Loss: 1.65186 | Test Acc: 0.34840
Epoch 2: | Train Loss: 1.60354 | Train Acc: 0.35713
Epoch 2: | Test Loss: 1.62233 | Test Acc: 0.35288
Epoch 3: | Train Loss: 1.56696 | Train Acc: 0.37224
Epoch 3: | Test Loss: 1.59305 | Test Acc: 0.36364
Epoch 4: | Train Loss: 1.53188 | Train Acc: 0.38700
Epoch 4: | Test Loss: 1.56729 | Test Acc: 0.37266
Epoch 5: | Train Loss: 1.50210 | Train Acc: 0.39955
Epoch 5: | Test Loss: 1.54658 | Test Acc: 0.38189
Epoch 6: | Train Loss: 1.47804 | Train Acc: 0.40934
Epoch 6: | Test Loss: 1.53092 | Test Acc: 0.38966
Epoch 7: | Train Loss: 1.45901 | Train Acc: 0.41609
Epoch 7: | Test Loss: 1.51934 | Test

The validation accuracy settles at about 0.428 by the end of training. The majority class accounts for 34% of the pitches, so the model gains roughly 9 percentage points over always predicting the majority class, using only the batter and pitcher indices as input.
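
The comparison against the majority class can be made explicit with a small helper. The labels below are made up so that the largest class covers 34% of rows; only the 0.34 and 0.428 figures come from this notebook, and the function name is hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# toy labels: class 0 is the majority with 34 of 100 rows
toy_labels = [0] * 34 + [1] * 22 + [2] * 20 + [3] * 24
baseline = majority_baseline(toy_labels)
lift = 0.428 - baseline

print(baseline)          # 0.34
print(round(lift, 3))    # 0.088
```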

## Final Output

Let's save the model weights for later.

In [None]:
##### Save the model -----

# save model weights
torch.save(embedding_nn.state_dict(), os.path.join(EMBEDDINGS_DATA_DIR, 'batter_pitcher_weights.pth'))

The final output is a mapping of the batter ids to the 9-dimensional batter embeddings and a mapping of the pitcher ids to the 9-dimensional pitcher embeddings.

In [None]:
##### Create data frame of batter and pitcher vector embeddings -----

# batter embeddings
batter_ids = pd.DataFrame({'batter':x_train_samp['batter'], 'batter_idx':torch.argmax(torch.FloatTensor(x_train_batters), 1).cpu().detach().numpy()}).drop_duplicates().sort_values('batter_idx')
batter_embeddings = pd.DataFrame(embedding_nn.state_dict()['batter_embeddings.weight'].cpu().detach().numpy(), columns=['batter_' + str(i) for i in range(9)])
batter_embedding_map_df = pd.concat([batter_ids.reset_index(drop=True), batter_embeddings], axis=1)

# look up names once and join on the MLBAM id, because the lookup table is not returned in the order of the ids passed in
batter_names = playerid_reverse_lookup(batter_embedding_map_df['batter'].tolist())
batter_names['batter_name'] = batter_names['name_first'] + ' ' + batter_names['name_last']
batter_embedding_map_df = batter_embedding_map_df.merge(batter_names[['key_mlbam', 'batter_name']], how='left', left_on='batter', right_on='key_mlbam').drop(columns='key_mlbam')

# pitcher embeddings
pitcher_ids = pd.DataFrame({'pitcher':x_train_samp['pitcher'], 'pitcher_idx':torch.argmax(torch.FloatTensor(x_train_pitchers), 1).cpu().detach().numpy()}).drop_duplicates().sort_values('pitcher_idx')
pitcher_embeddings = pd.DataFrame(embedding_nn.state_dict()['pitcher_embeddings.weight'].cpu().detach().numpy(), columns=['pitcher_' + str(i) for i in range(9)])
pitcher_embedding_map_df = pd.concat([pitcher_ids.reset_index(drop=True), pitcher_embeddings], axis=1)

# same name lookup and join for pitchers
pitcher_names = playerid_reverse_lookup(pitcher_embedding_map_df['pitcher'].tolist())
pitcher_names['pitcher_name'] = pitcher_names['name_first'] + ' ' + pitcher_names['name_last']
pitcher_embedding_map_df = pitcher_embedding_map_df.merge(pitcher_names[['key_mlbam', 'pitcher_name']], how='left', left_on='pitcher', right_on='key_mlbam').drop(columns='key_mlbam')

Gathering player lookup table. This may take a moment.


In [None]:
##### Save batter and pitcher vector embeddings -----

# save batter embedding table
batter_embedding_map_df.to_pickle(os.path.join(EMBEDDINGS_DATA_DIR, 'batter_embedding_df.pkl'))

# save pitcher embedding table
pitcher_embedding_map_df.to_pickle(os.path.join(EMBEDDINGS_DATA_DIR, 'pitcher_embedding_df.pkl'))

## Batter & Pitcher Embedding Analysis

Let's analyze the embeddings to check their quality and to see which pitchers and batters are similar to each other based on the pitches they throw and receive, respectively. We will use t-SNE to reduce the embeddings to 2 dimensions and then plot them in 2D to look for clusters of batters and pitchers.
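
Besides a 2D projection, one simple way to check which players are similar is to rank them by cosine similarity in the original embedding space. The sketch below is an optional illustration on toy vectors, not part of the notebook's analysis; `most_similar` and the toy names are hypothetical, and it assumes no embedding row is all zeros.

```python
import numpy as np

def most_similar(embeddings, names, query, top_k=2):
    """Rank the other players by cosine similarity to `query`."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-length rows
    q = names.index(query)
    sims = unit @ unit[q]                                     # cosine similarities
    order = [int(i) for i in np.argsort(-sims) if i != q][:top_k]
    return [(names[i], float(sims[i])) for i in order]

toy_names = ["A", "B", "C", "D"]
toy_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = most_similar(toy_emb, toy_names, "A")
print(neighbors)
```

With the real data, the embedding columns of `batter_embedding_map_df` or `pitcher_embedding_map_df` and the corresponding name column could be passed in the same way.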

### tSNE analysis

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components = 2)

# batter tsne
tsne_batter = tsne.fit_transform(batter_embedding_map_df[['batter_' + str(i) for i in range(9)]])
batter_embedding_map_tsne_df = pd.concat([batter_embedding_map_df, pd.DataFrame(tsne_batter, columns = ['tsne_1','tsne_2'])], axis=1, join="inner")

# pitcher tsne
tsne_pitcher = tsne.fit_transform(pitcher_embedding_map_df[['pitcher_' + str(i) for i in range(9)]])
pitcher_embedding_map_tsne_df = pd.concat([pitcher_embedding_map_df, pd.DataFrame(tsne_pitcher, columns = ['tsne_1','tsne_2'])], axis=1, join="inner")

In [None]:
# tsne batter scatterplot
fig = px.scatter(batter_embedding_map_tsne_df, x="tsne_1", y="tsne_2", text="batter_name")

fig.update_traces(textposition='top center')

fig.update_layout(
    height=1600,
    title_text='TSNE of MLB Batters'
)

fig.show()

In [None]:
# tsne pitcher scatterplot
fig = px.scatter(pitcher_embedding_map_tsne_df, x="tsne_1", y="tsne_2", text="pitcher_name")

fig.update_traces(textposition='top center')

fig.update_layout(
    height=1600,
    title_text='TSNE of MLB Pitchers'
)

fig.show()