# LSTM Baseball Pitch Sequence Predictor

Name: John Hodge

Date: 02/05/26

This notebook uses an LSTM (Long Short-Term Memory) neural network to predict the next pitch type
from a sliding window of recent pitches and game context. LSTMs suit this task because they can
learn sequential patterns, such as pitch setup strategies, that models without memory of earlier pitches cannot represent.

## Install Dependencies

In [None]:
%pip install torch pandas numpy scikit-learn matplotlib seaborn

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## Load and Prepare Data

We load the enriched pitch data generated by the simulator and create sliding-window sequences.
Each sample consists of the last `WINDOW_SIZE` pitches (with context features) and the target
is the next pitch type.
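
The windowing described above can be sketched on toy data. This is a minimal illustration, not the notebook's own cell: `make_windows` and the toy arrays are hypothetical names and made-up values, and game boundaries are ignored here.

```python
import numpy as np

def make_windows(features, targets, window_size):
    """Pair each run of `window_size` consecutive rows with the target that follows it."""
    X, y = [], []
    for i in range(window_size, len(features)):
        X.append(features[i - window_size:i])  # rows i-window_size .. i-1
        y.append(targets[i])                   # the pitch thrown right after the window
    return np.array(X), np.array(y)

# Toy data: 6 pitches, 2 features each (values are made up for illustration)
toy_features = np.arange(12).reshape(6, 2)
toy_targets = np.array([0, 1, 0, 2, 1, 0])

X_toy, y_toy = make_windows(toy_features, toy_targets, window_size=3)
print(X_toy.shape)  # (3, 3, 2)
print(y_toy)        # [2 1 0]
```

Six pitches with a window of three yield three samples, each shaped `(window_size, num_features)`.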

In [None]:
# Load data
df = pd.read_csv('data/baseball_pitch_data.csv')
print(f'Dataset shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
print('\nPitch type distribution:')
print(df['PitchType'].value_counts(normalize=True).round(3))
df.head(10)

In [None]:
# Configuration
WINDOW_SIZE = 8  # Number of previous pitches to use as input
BATCH_SIZE = 256
EPOCHS = 20
LEARNING_RATE = 0.001
HIDDEN_SIZE = 64
NUM_LAYERS = 2
DROPOUT = 0.3

In [None]:
# Encode categorical features
pitch_encoder = LabelEncoder()
pitcher_encoder = LabelEncoder()
prev_pitch_encoder = LabelEncoder()
outcome_encoder = LabelEncoder()

df['PitchType_enc'] = pitch_encoder.fit_transform(df['PitchType'])
df['PitcherType_enc'] = pitcher_encoder.fit_transform(df['PitcherType'])
df['PreviousPitchType_enc'] = prev_pitch_encoder.fit_transform(df['PreviousPitchType'])
df['Outcome_enc'] = outcome_encoder.fit_transform(df['Outcome'])

NUM_PITCH_TYPES = len(pitch_encoder.classes_)
NUM_PITCHER_TYPES = len(pitcher_encoder.classes_)
print(f'Pitch types ({NUM_PITCH_TYPES}): {list(pitch_encoder.classes_)}')
print(f'Pitcher types ({NUM_PITCHER_TYPES}): {list(pitcher_encoder.classes_)}')

In [None]:
# Features to use in the sequence
# Each timestep has: PitchType_enc, Balls, Strikes, PitcherType_enc, PitchNumber, RunnersOn, ScoreDiff
FEATURE_COLS = ['PitchType_enc', 'Balls', 'Strikes', 'PitcherType_enc',
                'PitchNumber', 'RunnersOn', 'ScoreDiff']
NUM_FEATURES = len(FEATURE_COLS)

# Keep the raw PitchNumber before normalizing; it is needed below to detect game boundaries
raw_pitch_num = df['PitchNumber'].to_numpy().copy()

# Normalize numerical features (z-score; statistics are computed over the full dataset)
for col in ['PitchNumber', 'Balls', 'Strikes', 'ScoreDiff']:
    df[col] = (df[col] - df[col].mean()) / (df[col].std() + 1e-8)

# Create sliding window sequences
# We need to respect game boundaries: don't create windows that span across games
# We can approximate game boundaries by detecting resets in PitchNumber
# (PitchNumber resets when a new game starts)

features = df[FEATURE_COLS].values
targets = df['PitchType_enc'].values

# Detect game boundaries (indices where the raw PitchNumber does not increase)
game_starts = np.where(np.diff(raw_pitch_num, prepend=raw_pitch_num[0] + 1) <= 0)[0]
game_starts = set(game_starts)

X_sequences = []
y_targets = []

for i in range(WINDOW_SIZE, len(features)):
    # Check that none of the window indices cross a game boundary
    window_range = range(i - WINDOW_SIZE + 1, i + 1)
    if any(idx in game_starts for idx in window_range):
        continue
    X_sequences.append(features[i - WINDOW_SIZE:i])
    y_targets.append(targets[i])

X = np.array(X_sequences, dtype=np.float32)
y = np.array(y_targets, dtype=np.int64)

print(f'Sequence dataset: X={X.shape}, y={y.shape}')
print(f'Each sample: {WINDOW_SIZE} timesteps x {NUM_FEATURES} features')
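
The boundary check in the cell above is easy to get wrong by one index, so here is a small sketch of it on a made-up `PitchNumber` column covering three short games. The array values and the window size `W = 2` are assumptions chosen for illustration.

```python
import numpy as np

# Toy PitchNumber column covering three short games (made-up values)
pitch_num = np.array([1, 2, 3, 1, 2, 1, 2, 3, 4])

# Prepending pitch_num[0] + 1 forces a negative first difference,
# so index 0 is also reported as a game start
starts = np.where(np.diff(pitch_num, prepend=pitch_num[0] + 1) <= 0)[0]
print(starts)  # [0 3 5]

# A sample (window rows i-W .. i-1, target row i) stays inside one game
# only if no game start falls in rows i-W+1 .. i
W = 2
start_set = set(starts.tolist())
valid = [i for i in range(W, len(pitch_num))
         if not any(j in start_set for j in range(i - W + 1, i + 1))]
print(valid)  # [2, 7, 8]
```

A game start at the first row of the window is allowed, which is why the check begins at `i - W + 1` rather than `i - W`.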

In [None]:
# Train/test split (80/20, preserving temporal order)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f'Train: {X_train.shape[0]} samples')
print(f'Test:  {X_test.shape[0]} samples')

# Create PyTorch datasets
class PitchSequenceDataset(Dataset):
    def __init__(self, sequences, targets):
        self.sequences = torch.as_tensor(sequences, dtype=torch.float32)
        self.targets = torch.as_tensor(targets, dtype=torch.long)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.sequences[idx], self.targets[idx]

train_dataset = PitchSequenceDataset(X_train, y_train)
test_dataset = PitchSequenceDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
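
The normalization earlier in the notebook computes the mean and standard deviation over all rows, including those that land in the test split. One possible alternative, sketched here under assumptions, is to take the statistics from the training portion only. `zscore_with_train_stats` is a hypothetical helper, the values are made up, and NumPy's default population standard deviation is used (pandas' `.std()` uses the sample version).

```python
import numpy as np

def zscore_with_train_stats(values, n_train, eps=1e-8):
    """Standardize `values` using the mean/std of the first `n_train` rows only."""
    train = values[:n_train]
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (values - mean) / (std + eps)

# Toy column of raw values; the first 4 rows play the role of the training split
raw = np.array([[1.0], [2.0], [3.0], [4.0], [50.0]])
scaled = zscore_with_train_stats(raw, n_train=4)

print(scaled[:4].ravel())  # training rows, centered around zero
print(scaled[4, 0])        # the held-out row is scaled with training statistics
```

With this approach the held-out outlier does not influence the scaling applied to the training rows.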

## Define the LSTM Model

The model is a 2-layer LSTM with dropout. The final hidden state of the top layer is passed through
dropout and a fully connected layer to produce logits for the next pitch type (4-class classification).
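
To make "final hidden state" concrete, the following sketch runs a small `nn.LSTM` on random input and inspects the shapes. The sizes are toy values chosen for illustration, not the notebook's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only
batch_size, seq_len, input_size, hidden_size, num_layers = 4, 8, 7, 16, 2

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
               num_layers=num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

lstm_out, (h_n, c_n) = lstm(x)
print(lstm_out.shape)  # torch.Size([4, 8, 16]): top-layer output at every timestep
print(h_n.shape)       # torch.Size([2, 4, 16]): final hidden state of each layer

# For a unidirectional LSTM, the top layer's final hidden state
# equals the last timestep of the output sequence
same = torch.allclose(h_n[-1], lstm_out[:, -1, :])
print(same)  # True
```

Either `h_n[-1]` or `lstm_out[:, -1, :]` can therefore serve as the sequence summary in a unidirectional model.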

In [None]:
class PitchPredictor(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes, dropout=0.3):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x shape: (batch_size, seq_len, input_size)
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Use the last hidden state
        out = self.dropout(h_n[-1])
        out = self.fc(out)
        return out

model = PitchPredictor(
    input_size=NUM_FEATURES,
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYERS,
    num_classes=NUM_PITCH_TYPES,
    dropout=DROPOUT
).to(device)

print(model)
print(f'\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}')
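
The printed parameter total can be cross-checked by hand. The sketch below assumes the configuration values from earlier cells (7 input features, hidden size 64, 2 layers) and the 4 classes quoted in the model description; the helper names are hypothetical.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional torch.nn.LSTM with biases enabled."""
    total = 0
    for layer in range(num_layers):
        in_features = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        total += 4 * hidden_size * (in_features + hidden_size + 2)
    return total

def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Assumed values: 7 features, hidden size 64, 2 layers, 4 pitch classes
lstm_params = lstm_param_count(input_size=7, hidden_size=64, num_layers=2)
fc_params = linear_param_count(64, 4)
print(lstm_params, fc_params, lstm_params + fc_params)  # 51968 260 52228
```

If the dataset contains a different number of pitch types, only the fully connected term changes.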

## Train the Model

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)

train_losses = []
test_losses = []
test_accuracies = []

for epoch in range(EPOCHS):
    # Training
    model.train()
    total_train_loss = 0
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_train_loss += loss.item()

    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # Evaluation (the test split also drives the LR scheduler, so it doubles as a validation set)
    model.eval()
    total_test_loss = 0
    all_preds = []
    all_targets = []
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            total_test_loss += loss.item()
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_targets.extend(batch_y.cpu().numpy())

    avg_test_loss = total_test_loss / len(test_loader)
    test_losses.append(avg_test_loss)
    accuracy = accuracy_score(all_targets, all_preds)
    test_accuracies.append(accuracy)

    scheduler.step(avg_test_loss)

    if (epoch + 1) % 2 == 0 or epoch == 0:
        print(f'Epoch [{epoch+1}/{EPOCHS}] '
              f'Train Loss: {avg_train_loss:.4f} | '
              f'Test Loss: {avg_test_loss:.4f} | '
              f'Test Acc: {accuracy:.4f}')
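
The training loop clips gradients to a global norm of 1.0. As a rough illustration of what global-norm clipping does, here is a NumPy restatement of the idea. It is a sketch, not PyTorch's implementation, and `clip_by_global_norm` is a hypothetical helper.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient arrays so their combined L2 norm is at most `max_norm`."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

# Two made-up gradient arrays with a combined norm of 5.0
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0
```

All arrays are scaled by the same factor, so the direction of the overall update is preserved.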

## Evaluate Results

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(train_losses, label='Train Loss')
ax1.plot(test_losses, label='Test Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Test Loss')
ax1.legend()
ax1.grid(True)

ax2.plot(test_accuracies, label='Test Accuracy', color='green')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Test Accuracy Over Epochs')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

print(f'\nFinal test accuracy: {test_accuracies[-1]:.4f}')
print(f'Best test accuracy:  {max(test_accuracies):.4f} (epoch {np.argmax(test_accuracies)+1})')

In [None]:
# Confusion matrix
model.eval()
all_preds = []
all_targets = []
with torch.no_grad():
    for batch_X, batch_y in test_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        outputs = model(batch_X)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_targets.extend(batch_y.cpu().numpy())

# Classification report
target_names = list(pitch_encoder.classes_)
print('Classification Report:')
print(classification_report(all_targets, all_preds, target_names=target_names))

# Plot confusion matrix
cm = confusion_matrix(all_targets, all_preds)
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names)
disp.plot(cmap='Blues', ax=ax)  # draw on our axes; without ax, plot() opens a second figure
ax.set_title('LSTM Pitch Prediction - Confusion Matrix')
plt.show()
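
The per-class numbers in the classification report can be read directly off a confusion matrix. The sketch below uses a made-up 3-class matrix in scikit-learn's layout (rows are true classes, columns are predictions).

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true classes, columns are predictions
cm_toy = np.array([[50,  5,  5],
                   [10, 20, 10],
                   [ 5,  5, 30]])

recall = np.diag(cm_toy) / cm_toy.sum(axis=1)     # correct / all samples of that true class
precision = np.diag(cm_toy) / cm_toy.sum(axis=0)  # correct / all predictions of that class
accuracy = np.trace(cm_toy) / cm_toy.sum()

print(recall.round(3))
print(precision.round(3))
print(round(float(accuracy), 3))
```

A class with high precision but low recall is one the model predicts rarely but correctly, which is a common pattern for minority pitch types.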

## Baseline Comparison

Compare the LSTM against simple baselines:
1. **Most frequent**: Always predict the most common pitch type
2. **Previous pitch**: Predict whatever pitch was thrown last
3. **Random weighted**: Sample from the training distribution
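
The three baselines above can be sketched on toy arrays before applying them to the real split. All values here are made up, and the `*_toy` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up encoded pitch types: training history and a held-out stretch
y_train_toy = np.array([0, 0, 1, 0, 2, 0, 1, 0])
prev_toy = np.array([0, 1, 1, 0, 2])    # pitch thrown just before each test pitch
y_test_toy = np.array([0, 1, 0, 0, 2])

# 1. Most frequent: the single most common class in the training targets
counts = np.bincount(y_train_toy, minlength=3)
toy_most_freq_acc = np.mean(y_test_toy == counts.argmax())

# 2. Previous pitch: repeat whatever was thrown last
toy_prev_acc = np.mean(y_test_toy == prev_toy)

# 3. Random weighted: sample from the training class frequencies
toy_probs = counts / counts.sum()
toy_random_preds = rng.choice(3, size=len(y_test_toy), p=toy_probs)
toy_random_acc = np.mean(y_test_toy == toy_random_preds)

print(toy_most_freq_acc, toy_prev_acc)  # 0.6 0.8
```

The random-weighted score varies with the seed, so it is best read as an average over several draws rather than a single number.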

In [None]:
from collections import Counter

# Most frequent baseline
most_common = Counter(y_train).most_common(1)[0][0]
most_freq_acc = np.mean(y_test == most_common)

# Previous pitch baseline: predict the last pitch in each input window.
# PitchType_enc is the first feature and is not normalized, so it can be read back as an integer code.
# (Shifting y by one sample would be wrong here, because windows are skipped at game boundaries.)
prev_pitch_preds = X_test[:, -1, 0].astype(int)
prev_pitch_acc = np.mean(y_test == prev_pitch_preds)

# Random weighted baseline
train_dist = Counter(y_train)
total_train = sum(train_dist.values())
probs = [train_dist[i] / total_train for i in range(NUM_PITCH_TYPES)]
random_preds = np.random.choice(NUM_PITCH_TYPES, size=len(y_test), p=probs)
random_acc = np.mean(y_test == random_preds)

lstm_acc = test_accuracies[-1]

print('Baseline Comparison:')
print(f'  Most Frequent:   {most_freq_acc:.4f}')
print(f'  Previous Pitch:  {prev_pitch_acc:.4f}')
print(f'  Random Weighted: {random_acc:.4f}')
print(f'  LSTM:            {lstm_acc:.4f}')

# Bar chart
methods = ['Most Frequent', 'Previous Pitch', 'Random Weighted', 'LSTM']
accs = [most_freq_acc, prev_pitch_acc, random_acc, lstm_acc]

plt.figure(figsize=(8, 5))
bars = plt.bar(methods, accs, color=['gray', 'gray', 'gray', 'steelblue'])
plt.ylabel('Accuracy')
plt.title('Pitch Prediction: LSTM vs Baselines')
for bar, acc in zip(bars, accs):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
             f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')
plt.ylim(0, max(accs) * 1.15)
plt.grid(axis='y', alpha=0.3)
plt.show()

## Conclusion

The LSTM model uses sequential patterns in the pitch data, particularly the setup-pitch
strategies and pitcher archetype tendencies, to predict the next pitch type. By combining a sliding
window of recent pitches with contextual features (count, pitcher type, fatigue, game situation),
the LSTM can in principle capture higher-order dependencies that are out of reach for simpler
models such as an HMM or a count-only tabular model.

### Key findings:
- If the sequential patterns are learnable, the LSTM should outperform the "most frequent" baseline in the comparison above
- Pitcher archetype is a strong signal: different pitcher types have distinct pitch distributions
- Sequence strategies (e.g., Fastball-Fastball-Changeup) create patterns the LSTM can exploit

### Potential improvements:
- Attention mechanisms to weight which past pitches matter most
- Transformer architecture for longer-range dependencies
- Embedding layers for categorical features instead of raw encoding
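
As a rough sketch of the last item, categorical codes could pass through `nn.Embedding` layers before reaching the LSTM. `EmbeddedPitchInput`, the embedding sizes, and the class counts below are all hypothetical choices for illustration, not part of the notebook's model.

```python
import torch
import torch.nn as nn

class EmbeddedPitchInput(nn.Module):
    """Hypothetical input layer: embeds two categorical columns and appends the numeric ones."""
    def __init__(self, num_pitch_types, num_pitcher_types, pitch_dim=4, pitcher_dim=2):
        super().__init__()
        self.pitch_emb = nn.Embedding(num_pitch_types, pitch_dim)
        self.pitcher_emb = nn.Embedding(num_pitcher_types, pitcher_dim)

    def forward(self, pitch_ids, pitcher_ids, numeric):
        # pitch_ids, pitcher_ids: (batch, seq_len) integer codes
        # numeric: (batch, seq_len, num_numeric) float features
        return torch.cat([self.pitch_emb(pitch_ids),
                          self.pitcher_emb(pitcher_ids),
                          numeric], dim=-1)

# Assumed sizes: 4 pitch types, 3 pitcher types, 5 numeric features
layer = EmbeddedPitchInput(num_pitch_types=4, num_pitcher_types=3)
pitch_ids = torch.randint(0, 4, (2, 8))
pitcher_ids = torch.randint(0, 3, (2, 8))
numeric = torch.randn(2, 8, 5)

out = layer(pitch_ids, pitcher_ids, numeric)
print(out.shape)  # torch.Size([2, 8, 11])
```

The result would feed an `nn.LSTM` whose `input_size` equals the concatenated width (11 in this toy setup), so the model no longer treats category codes as ordered numbers.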