# Tutorial: Generating Synthetic Baseball Pitch Sequences Using Markov Models

Name: John Hodge

Date: 04/18/24 (Updated: 02/05/26)

**Updates:** Added pitcher archetypes, higher-order pitch sequence dependencies, count-dependent outcomes, fatigue modeling, and game situation context to produce richer training data for ML models.

In this tutorial, we'll explore how to simulate realistic baseball pitch sequences using Markov models enhanced with pitcher archetypes, sequential pitch patterns, and game context. By the end, we will have a synthetic dataset of baseball pitch sequences, including ball and strike counts, pitch outcomes, pitcher types, fatigue, and game situation features. This dataset can be used to train machine learning models to predict the type of pitch based on count, pitch history, and context.

## Conceptual Overview
A Markov model is a stochastic model that describes a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. For our baseball simulation, the state is the ball-strike count, and we extend the basic model with:
1. **Pitcher archetypes** — different pitcher profiles with distinct pitch distributions
2. **Higher-order dependencies** — previous pitches influence the next pitch selection via "setup pitch" strategies
3. **Count-dependent outcomes** — hit/ball/strike probabilities vary realistically with the count
4. **Fatigue modeling** — pitch selection and control degrade over the course of a game
5. **Game situation** — runners on base and score differential affect strategy; at-bat number is recorded as a feature
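
Before the extensions are layered on, it may help to see the plain first-order case. The following is a minimal sketch of a Markov chain over the ball-strike count; the outcome probabilities and the `step` and `simulate_chain` helpers are illustrative assumptions, not the values or functions used later in this tutorial.

```python
import random

# Illustrative, count-independent outcome probabilities (assumed values).
OUTCOME_PROBS = {'ball': 0.35, 'strike': 0.53, 'hit': 0.12}

def step(state, rng):
    """Advance the count by one pitch; the result depends only on `state`."""
    balls, strikes = state
    outcome = rng.choices(
        list(OUTCOME_PROBS), weights=list(OUTCOME_PROBS.values()), k=1
    )[0]
    if outcome == 'hit':
        return 'hit'
    if outcome == 'ball':
        return 'walk' if balls + 1 == 4 else (balls + 1, strikes)
    # Remaining case: a strike
    return 'strikeout' if strikes + 1 == 3 else (balls, strikes + 1)

def simulate_chain(seed=0):
    """Walk the chain from a 0-0 count until it reaches a terminal state."""
    rng = random.Random(seed)
    state, path = (0, 0), [(0, 0)]
    while isinstance(state, tuple):
        state = step(state, rng)
        path.append(state)
    return path

path = simulate_chain(seed=1)
print(path)
```

Because `step` looks only at the current count, this chain has no memory of earlier pitches. The extensions listed above deliberately add that memory and context.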

## Initialize environment

In [24]:
!pip install pandas matplotlib seaborn



In [None]:
import pandas as pd
import numpy as np
from typing import Dict, Tuple, List, Optional
import random
from matplotlib import pyplot as plt
import seaborn as sns

## Step 1: Define Pitcher Archetypes and Base Transition Probabilities

We define four pitcher archetypes with distinct pitch preferences. The base transition probabilities are rebalanced to reduce Fastball dominance (30-50% instead of 50-80%). Each archetype blends with the base probabilities to create unique pitch distributions.

We also define pitch sequence strategies — common 2-3 pitch patterns that pitchers use as "setup" sequences. When the recent pitch history matches a strategy prefix, the next pitch gets a probability boost, creating higher-order dependencies that sequence-aware models (LSTM, Transformer) can learn.
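
As a rough illustration of how a prefix match turns into a probability boost, the sketch below applies two strategies that share the prefix Fastball, Fastball to the 0-0 base probabilities and renormalizes. The `apply_boosts` helper is a hypothetical stand-in written for this example; the numbers are taken from the tables in this tutorial.

```python
def apply_boosts(probs, history, strategies):
    """Add each matching strategy's boost, then renormalize to sum to 1."""
    boosted = dict(probs)
    for prefix, next_pitch, boost in strategies:
        if len(history) >= len(prefix) and history[-len(prefix):] == prefix:
            boosted[next_pitch] += boost
    total = sum(boosted.values())
    return {pitch: p / total for pitch, p in boosted.items()}

# Two strategies that share the prefix Fastball, Fastball.
strategies = [
    (['Fastball', 'Fastball'], 'Changeup', 0.20),
    (['Fastball', 'Fastball'], 'Slider', 0.15),
]
base = {'Fastball': 0.40, 'Slider': 0.25, 'Curveball': 0.20, 'Changeup': 0.15}

after = apply_boosts(base, ['Curveball', 'Fastball', 'Fastball'], strategies)
print({pitch: round(p, 3) for pitch, p in after.items()})
```

After two fastballs, Changeup rises from 0.15 to about 0.26 and Slider from 0.25 to about 0.30, while Fastball falls from 0.40 to about 0.30. Note that a boost is additive before renormalization, so its effective size shrinks when several strategies fire at once.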

In [None]:
# --- Pitcher Archetypes ---
PITCHER_ARCHETYPES = {
    'power': {
        'pitch_bias': {'Fastball': 0.55, 'Slider': 0.25, 'Curveball': 0.10, 'Changeup': 0.10},
        'fatigue_resistance': 85,   # starts fatiguing after this many pitches
        'strike_tendency': 0.05,    # bonus strike probability
    },
    'finesse': {
        'pitch_bias': {'Fastball': 0.25, 'Slider': 0.20, 'Curveball': 0.30, 'Changeup': 0.25},
        'fatigue_resistance': 95,
        'strike_tendency': -0.03,
    },
    'slider_specialist': {
        'pitch_bias': {'Fastball': 0.35, 'Slider': 0.40, 'Curveball': 0.10, 'Changeup': 0.15},
        'fatigue_resistance': 80,
        'strike_tendency': 0.02,
    },
    'balanced': {
        'pitch_bias': {'Fastball': 0.30, 'Slider': 0.25, 'Curveball': 0.25, 'Changeup': 0.20},
        'fatigue_resistance': 90,
        'strike_tendency': 0.0,
    },
}

# --- Pitch Sequence Strategies ---
# Each strategy is a tuple of (prefix_pitches, next_pitch, boost_amount)
# If the last N pitches match the prefix, boost the next_pitch probability
PITCH_SEQUENCE_STRATEGIES = [
    # Speed change setups
    (['Fastball', 'Fastball'], 'Changeup', 0.20),
    (['Fastball', 'Fastball'], 'Slider', 0.15),
    # Progressive breaking
    (['Fastball', 'Slider'], 'Curveball', 0.20),
    # Eye-level change
    (['Curveball'], 'Fastball', 0.15),
    # Off-speed to power
    (['Changeup'], 'Fastball', 0.18),
    # Double breaking ball avoidance (pitchers rarely throw 3 breaking balls in a row)
    (['Slider', 'Slider'], 'Fastball', 0.25),
    (['Curveball', 'Curveball'], 'Fastball', 0.20),
    # Slider setup for curveball
    (['Fastball', 'Curveball'], 'Slider', 0.15),
]

# --- Base Transition Probabilities (rebalanced, Fastball reduced to 30-50%) ---
transition_probs = {
    (0, 0): {
        'pitch_probs': {'Fastball': 0.40, 'Slider': 0.25, 'Curveball': 0.20, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.30, 'strike': 0.58, 'hit': 0.12},
            'Slider': {'ball': 0.40, 'strike': 0.50, 'hit': 0.10},
            'Curveball': {'ball': 0.48, 'strike': 0.44, 'hit': 0.08},
            'Changeup': {'ball': 0.38, 'strike': 0.52, 'hit': 0.10}
        }
    },
    (0, 1): {
        'pitch_probs': {'Fastball': 0.35, 'Slider': 0.28, 'Curveball': 0.20, 'Changeup': 0.17},
        'outcomes': {
            'Fastball': {'ball': 0.28, 'strike': 0.62, 'hit': 0.10},
            'Slider': {'ball': 0.35, 'strike': 0.55, 'hit': 0.10},
            'Curveball': {'ball': 0.42, 'strike': 0.50, 'hit': 0.08},
            'Changeup': {'ball': 0.33, 'strike': 0.58, 'hit': 0.09}
        }
    },
    (0, 2): {
        'pitch_probs': {'Fastball': 0.30, 'Slider': 0.32, 'Curveball': 0.20, 'Changeup': 0.18},
        'outcomes': {
            'Fastball': {'ball': 0.32, 'strike': 0.62, 'hit': 0.06},
            'Slider': {'ball': 0.38, 'strike': 0.56, 'hit': 0.06},
            'Curveball': {'ball': 0.45, 'strike': 0.50, 'hit': 0.05},
            'Changeup': {'ball': 0.40, 'strike': 0.54, 'hit': 0.06}
        }
    },
    (1, 0): {
        'pitch_probs': {'Fastball': 0.42, 'Slider': 0.23, 'Curveball': 0.20, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.28, 'strike': 0.58, 'hit': 0.14},
            'Slider': {'ball': 0.38, 'strike': 0.50, 'hit': 0.12},
            'Curveball': {'ball': 0.45, 'strike': 0.45, 'hit': 0.10},
            'Changeup': {'ball': 0.35, 'strike': 0.53, 'hit': 0.12}
        }
    },
    (1, 1): {
        'pitch_probs': {'Fastball': 0.38, 'Slider': 0.25, 'Curveball': 0.20, 'Changeup': 0.17},
        'outcomes': {
            'Fastball': {'ball': 0.30, 'strike': 0.58, 'hit': 0.12},
            'Slider': {'ball': 0.38, 'strike': 0.50, 'hit': 0.12},
            'Curveball': {'ball': 0.44, 'strike': 0.46, 'hit': 0.10},
            'Changeup': {'ball': 0.36, 'strike': 0.52, 'hit': 0.12}
        }
    },
    (1, 2): {
        'pitch_probs': {'Fastball': 0.32, 'Slider': 0.30, 'Curveball': 0.20, 'Changeup': 0.18},
        'outcomes': {
            'Fastball': {'ball': 0.30, 'strike': 0.63, 'hit': 0.07},
            'Slider': {'ball': 0.36, 'strike': 0.57, 'hit': 0.07},
            'Curveball': {'ball': 0.43, 'strike': 0.51, 'hit': 0.06},
            'Changeup': {'ball': 0.38, 'strike': 0.55, 'hit': 0.07}
        }
    },
    (2, 0): {
        'pitch_probs': {'Fastball': 0.45, 'Slider': 0.22, 'Curveball': 0.18, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.25, 'strike': 0.57, 'hit': 0.18},
            'Slider': {'ball': 0.35, 'strike': 0.50, 'hit': 0.15},
            'Curveball': {'ball': 0.42, 'strike': 0.44, 'hit': 0.14},
            'Changeup': {'ball': 0.33, 'strike': 0.52, 'hit': 0.15}
        }
    },
    (2, 1): {
        'pitch_probs': {'Fastball': 0.40, 'Slider': 0.25, 'Curveball': 0.18, 'Changeup': 0.17},
        'outcomes': {
            'Fastball': {'ball': 0.28, 'strike': 0.57, 'hit': 0.15},
            'Slider': {'ball': 0.36, 'strike': 0.50, 'hit': 0.14},
            'Curveball': {'ball': 0.43, 'strike': 0.45, 'hit': 0.12},
            'Changeup': {'ball': 0.35, 'strike': 0.51, 'hit': 0.14}
        }
    },
    (2, 2): {
        'pitch_probs': {'Fastball': 0.35, 'Slider': 0.28, 'Curveball': 0.20, 'Changeup': 0.17},
        'outcomes': {
            'Fastball': {'ball': 0.30, 'strike': 0.62, 'hit': 0.08},
            'Slider': {'ball': 0.37, 'strike': 0.55, 'hit': 0.08},
            'Curveball': {'ball': 0.44, 'strike': 0.49, 'hit': 0.07},
            'Changeup': {'ball': 0.38, 'strike': 0.54, 'hit': 0.08}
        }
    },
    (3, 0): {
        'pitch_probs': {'Fastball': 0.50, 'Slider': 0.20, 'Curveball': 0.15, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.22, 'strike': 0.55, 'hit': 0.23},
            'Slider': {'ball': 0.30, 'strike': 0.50, 'hit': 0.20},
            'Curveball': {'ball': 0.38, 'strike': 0.44, 'hit': 0.18},
            'Changeup': {'ball': 0.28, 'strike': 0.52, 'hit': 0.20}
        }
    },
    (3, 1): {
        'pitch_probs': {'Fastball': 0.48, 'Slider': 0.22, 'Curveball': 0.15, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.25, 'strike': 0.57, 'hit': 0.18},
            'Slider': {'ball': 0.33, 'strike': 0.50, 'hit': 0.17},
            'Curveball': {'ball': 0.40, 'strike': 0.45, 'hit': 0.15},
            'Changeup': {'ball': 0.30, 'strike': 0.53, 'hit': 0.17}
        }
    },
    (3, 2): {
        'pitch_probs': {'Fastball': 0.42, 'Slider': 0.25, 'Curveball': 0.18, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.28, 'strike': 0.60, 'hit': 0.12},
            'Slider': {'ball': 0.35, 'strike': 0.53, 'hit': 0.12},
            'Curveball': {'ball': 0.42, 'strike': 0.47, 'hit': 0.11},
            'Changeup': {'ball': 0.33, 'strike': 0.55, 'hit': 0.12}
        }
    },
}
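
Hand-typed probability tables are easy to get slightly wrong, so a quick consistency check can be worthwhile. The sketch below is one possible validator, assuming a table shaped like `transition_probs`; `validate_table` is a hypothetical helper, and the `sample` dict is a one-count excerpt of the table above.

```python
import math

def validate_table(table, tol=1e-9):
    """Return a list of problems found; an empty list means the table is consistent."""
    problems = []
    for count, entry in table.items():
        if not math.isclose(sum(entry['pitch_probs'].values()), 1.0, abs_tol=tol):
            problems.append(f"{count}: pitch_probs do not sum to 1")
        for pitch in entry['pitch_probs']:
            outcomes = entry['outcomes'].get(pitch)
            if outcomes is None:
                problems.append(f"{count}: no outcomes for {pitch}")
            elif not math.isclose(sum(outcomes.values()), 1.0, abs_tol=tol):
                problems.append(f"{count}/{pitch}: outcomes do not sum to 1")
    return problems

# A one-count excerpt in the same shape as `transition_probs`.
sample = {
    (0, 0): {
        'pitch_probs': {'Fastball': 0.40, 'Slider': 0.25, 'Curveball': 0.20, 'Changeup': 0.15},
        'outcomes': {
            'Fastball': {'ball': 0.30, 'strike': 0.58, 'hit': 0.12},
            'Slider': {'ball': 0.40, 'strike': 0.50, 'hit': 0.10},
            'Curveball': {'ball': 0.48, 'strike': 0.44, 'hit': 0.08},
            'Changeup': {'ball': 0.38, 'strike': 0.52, 'hit': 0.10},
        },
    },
}
problems = validate_table(sample)
print(problems or "table is consistent")
```

In the notebook, the same function could be called as `validate_table(transition_probs)` right after the table is defined.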

## Step 2: Simulator Architecture

The `BaseballPitchSimulator` class now incorporates:

1. **Pitcher archetype blending** — The archetype's pitch bias is blended (60% archetype / 40% base) with the count-based probabilities
2. **Sequence strategy boosts** — Recent pitch history triggers probability boosts for strategic follow-up pitches
3. **Fatigue modeling** — After the pitcher's fatigue threshold, pitch selection shifts toward fastballs and ball probability increases
4. **Game situation modifiers** — Runners on base shift pitch selection toward fastballs, and score differential shifts the strike/ball balance; at-bat number is recorded as a feature but does not change the probabilities
5. **Count-dependent outcomes** — Hit probability ranges from 5-6% in pitcher's counts (0-2) to 18-23% in hitter's counts (3-0)
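
The 60/40 blend in item 1 can be worked through by hand. The sketch below does so for a power pitcher at a 0-0 count, using the base probabilities and archetype bias defined in Step 1; the `blend` function is a hypothetical standalone version of the blending step, written only for this illustration.

```python
def blend(base, bias, archetype_weight=0.6):
    """Weighted average of count-based and archetype pitch probabilities."""
    mixed = {
        p: (1 - archetype_weight) * base[p] + archetype_weight * bias[p]
        for p in base
    }
    total = sum(mixed.values())  # already ~1 when both inputs sum to 1
    return {p: v / total for p, v in mixed.items()}

base_00 = {'Fastball': 0.40, 'Slider': 0.25, 'Curveball': 0.20, 'Changeup': 0.15}
power_bias = {'Fastball': 0.55, 'Slider': 0.25, 'Curveball': 0.10, 'Changeup': 0.10}

blended = blend(base_00, power_bias)
print({p: round(v, 2) for p, v in blended.items()})
```

For this case the blend works out to roughly 49% Fastball, 25% Slider, 14% Curveball, and 12% Changeup. Because the archetype carries the larger weight, the pitcher type moves the distribution more than the count does.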

## Step 3: The Enhanced Baseball Pitch Simulator Class

In [None]:
class BaseballPitchSimulator:
    def __init__(self, transition_probs: Dict, pitcher_type: str = 'balanced'):
        self.states = [(balls, strikes) for balls in range(4) for strikes in range(3)]
        self.pitch_types = ['Fastball', 'Slider', 'Curveball', 'Changeup']
        self.transition_probs = transition_probs
        self.pitcher_type = pitcher_type
        self.archetype = PITCHER_ARCHETYPES[pitcher_type]
        self.pitch_count = 0
        self.recent_pitches: List[str] = []  # tracks pitch history within a game

    def _blend_pitch_probs(self, base_probs: Dict[str, float]) -> Dict[str, float]:
        """Blend base count probabilities with pitcher archetype bias (60/40 split)."""
        bias = self.archetype['pitch_bias']
        blended = {}
        for pitch in self.pitch_types:
            blended[pitch] = 0.4 * base_probs[pitch] + 0.6 * bias[pitch]
        # Normalize
        total = sum(blended.values())
        return {p: v / total for p, v in blended.items()}

    def _apply_sequence_strategies(self, probs: Dict[str, float]) -> Dict[str, float]:
        """Boost probabilities based on recent pitch history matching known strategies."""
        if not self.recent_pitches:
            return probs

        probs = dict(probs)  # copy
        for prefix, next_pitch, boost in PITCH_SEQUENCE_STRATEGIES:
            n = len(prefix)
            if len(self.recent_pitches) >= n:
                if self.recent_pitches[-n:] == prefix:
                    probs[next_pitch] += boost

        # Normalize
        total = sum(probs.values())
        return {p: v / total for p, v in probs.items()}

    def _apply_fatigue(self, probs: Dict[str, float],
                       outcome_probs: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
        """Shift toward fastballs and increase ball rate as pitcher fatigues."""
        threshold = self.archetype['fatigue_resistance']
        if self.pitch_count <= threshold:
            return probs, outcome_probs

        # Fatigue factor: 0 at threshold, increases linearly, caps at 0.4
        fatigue = min((self.pitch_count - threshold) / 40.0, 0.4)

        # Shift pitch selection toward fastballs
        probs = dict(probs)
        probs['Fastball'] += fatigue * 0.5
        for p in ['Slider', 'Curveball', 'Changeup']:
            probs[p] = max(probs[p] - fatigue * 0.5 / 3, 0.02)
        total = sum(probs.values())
        probs = {p: v / total for p, v in probs.items()}

        # Increase ball probability
        outcome_probs = dict(outcome_probs)
        ball_boost = fatigue * 0.15
        outcome_probs['ball'] = min(outcome_probs['ball'] + ball_boost, 0.60)
        outcome_probs['strike'] = max(outcome_probs['strike'] - ball_boost * 0.7, 0.20)
        outcome_probs['hit'] = max(outcome_probs['hit'] - ball_boost * 0.3, 0.03)
        # Normalize outcomes
        total = sum(outcome_probs.values())
        outcome_probs = {k: v / total for k, v in outcome_probs.items()}

        return probs, outcome_probs

    def _apply_situation(self, probs: Dict[str, float], outcome_probs: Dict[str, float],
                         runners_on: bool, score_diff: int) -> Tuple[Dict[str, float], Dict[str, float]]:
        """Modify probabilities based on game situation."""
        probs = dict(probs)
        outcome_probs = dict(outcome_probs)

        if runners_on:
            # Pitchers from the stretch: more fastballs, fewer slow pitches
            probs['Fastball'] += 0.08
            probs['Curveball'] = max(probs['Curveball'] - 0.04, 0.02)
            probs['Changeup'] = max(probs['Changeup'] - 0.04, 0.02)

        if score_diff >= 3:
            # Ahead by 3+: more aggressive, throw strikes
            outcome_probs['strike'] += 0.05
            outcome_probs['ball'] = max(outcome_probs['ball'] - 0.05, 0.10)
        elif score_diff <= -3:
            # Behind by 3+: more careful, nibble corners
            outcome_probs['ball'] += 0.05
            outcome_probs['strike'] = max(outcome_probs['strike'] - 0.05, 0.20)

        # Normalize both
        total_p = sum(probs.values())
        probs = {p: v / total_p for p, v in probs.items()}
        total_o = sum(outcome_probs.values())
        outcome_probs = {k: v / total_o for k, v in outcome_probs.items()}

        return probs, outcome_probs

    def simulate_pitch(self, current_state: Tuple[int, int],
                       runners_on: bool = False,
                       score_diff: int = 0) -> Tuple[str, str]:
        """Simulate a pitch with archetype, sequence, fatigue, and situation modifiers."""
        # Start with base probabilities for this count
        base_pitch_probs = self.transition_probs[current_state]['pitch_probs']

        # 1. Blend with pitcher archetype
        pitch_probs = self._blend_pitch_probs(base_pitch_probs)

        # 2. Apply sequence strategies
        pitch_probs = self._apply_sequence_strategies(pitch_probs)

        # 3. Apply fatigue and game situation to pitch selection *before* sampling,
        #    so these modifiers actually influence which pitch is thrown. The outcome
        #    dict passed here is a throwaway placeholder; its result is discarded.
        placeholder = {'ball': 1 / 3, 'strike': 1 / 3, 'hit': 1 / 3}
        pitch_probs, _ = self._apply_fatigue(pitch_probs, placeholder)
        pitch_probs, _ = self._apply_situation(
            pitch_probs, placeholder, runners_on, score_diff)

        # Select pitch type
        pitch_types = list(pitch_probs.keys())
        pitch_probabilities = list(pitch_probs.values())
        pitch_type = random.choices(pitch_types, weights=pitch_probabilities, k=1)[0]

        # Get base outcome probabilities for the selected pitch
        outcome_probs = dict(self.transition_probs[current_state]['outcomes'][pitch_type])

        # Apply archetype strike tendency
        outcome_probs['strike'] += self.archetype['strike_tendency']
        outcome_probs['ball'] -= self.archetype['strike_tendency']

        # 4. Apply fatigue and game situation to the outcome probabilities
        #    (the returned pitch probabilities are no longer needed)
        _, outcome_probs = self._apply_fatigue(pitch_probs, outcome_probs)
        _, outcome_probs = self._apply_situation(
            pitch_probs, outcome_probs, runners_on, score_diff)

        # Ensure outcome probs are valid: floor each at 0.01, then renormalize
        total = sum(outcome_probs.values())
        outcome_probs = {k: max(v / total, 0.01) for k, v in outcome_probs.items()}
        total = sum(outcome_probs.values())
        outcome_probs = {k: v / total for k, v in outcome_probs.items()}

        outcome = random.choices(
            list(outcome_probs.keys()),
            weights=list(outcome_probs.values()),
            k=1
        )[0]

        self.pitch_count += 1
        self.recent_pitches.append(pitch_type)

        return pitch_type, outcome

    def update_count(self, current_state, outcome):
        balls, strikes = current_state
        if outcome == 'ball':
            balls += 1
            if balls == 4:
                return 'walk'
        elif outcome == 'strike':
            strikes += 1
            if strikes == 3:
                return 'strikeout'
        elif outcome == 'hit':
            return 'hit'
        return (balls, strikes) if (balls < 4 and strikes < 3) else None

    def simulate_at_bat(self, runners_on: bool = False, score_diff: int = 0):
        state = (0, 0)
        sequence = []

        while True:
            pitch_type, outcome = self.simulate_pitch(state, runners_on, score_diff)
            sequence.append((state, pitch_type, outcome))
            new_state = self.update_count(state, outcome)
            if isinstance(new_state, tuple):
                state = new_state
            else:
                sequence.append(new_state)
                break

        return sequence

In [None]:
# Example usage with different pitcher archetypes:
for ptype in PITCHER_ARCHETYPES:
    simulator = BaseballPitchSimulator(transition_probs, pitcher_type=ptype)
    at_bat = simulator.simulate_at_bat(runners_on=False, score_diff=0)
    print(f"{ptype:>20}: {at_bat}")
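
The fatigue ramp used in `_apply_fatigue` can be examined on its own. The sketch below reproduces only the fatigue factor (zero up to the archetype's threshold, then linear, capped at 0.4) for the `power` and `finesse` thresholds from `PITCHER_ARCHETYPES`. The `fatigue_factor` function and its `ramp` and `cap` parameter names are hypothetical; the constants 40 and 0.4 come from the class above.

```python
def fatigue_factor(pitch_count, threshold, ramp=40.0, cap=0.4):
    """Zero up to `threshold`, then rises by 1/ramp per pitch until it hits `cap`."""
    if pitch_count <= threshold:
        return 0.0
    return min((pitch_count - threshold) / ramp, cap)

# Thresholds copied from PITCHER_ARCHETYPES['power'] and ['finesse'].
for name, threshold in [('power', 85), ('finesse', 95)]:
    row = {n: round(fatigue_factor(n, threshold), 3) for n in (80, 90, 100, 110, 120)}
    print(f"{name:>8}: {row}")
```

With these constants the cap is reached 16 pitches past the threshold. At the cap, the Fastball weight gains 0.2 and the ball probability gains 0.06, both before renormalization.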

## Step 4: Generate Dataset

Simulate full games with multiple at-bats per game. Each game has a randomly assigned pitcher archetype, and game situation features (runners on base, score differential, at-bat number, pitch number) are tracked throughout.

In [None]:
def generate_dataset(num_games: int = 3000, at_bats_per_game: int = 35):
    """Generate a dataset by simulating full games with context."""
    pitcher_types = list(PITCHER_ARCHETYPES.keys())
    data = []

    for game_id in range(num_games):
        # Assign a random pitcher archetype for this game
        pitcher_type = random.choice(pitcher_types)
        simulator = BaseballPitchSimulator(transition_probs, pitcher_type=pitcher_type)

        # Simulate game situation
        score_diff = 0  # pitcher's team score minus opponent score
        previous_pitch = 'None'  # reset at the start of every game

        for at_bat_num in range(1, at_bats_per_game + 1):
            # Randomize game situation for each at-bat
            runners_on = random.random() < 0.35  # ~35% chance runners are on
            # Score drifts randomly over the game
            if random.random() < 0.15:
                score_diff += random.choice([-1, 1, 1, 2])  # slight bias toward runs scored
            score_diff = max(min(score_diff, 8), -8)  # clamp

            pitches_before = simulator.pitch_count  # pitches thrown before this at-bat
            at_bat = simulator.simulate_at_bat(runners_on=runners_on, score_diff=score_diff)

            # Exclude the final result ('walk'/'strikeout'/'hit')
            for i, (state, pitch_type, outcome) in enumerate(at_bat[:-1], start=1):
                balls, strikes = state
                data.append([
                    balls, strikes, pitch_type, outcome,
                    pitcher_type, pitches_before + i,  # per-pitch number within the game
                    at_bat_num, int(runners_on), score_diff,
                    previous_pitch
                ])
                # Carries across at-bat boundaries, but never across games
                previous_pitch = pitch_type

    df = pd.DataFrame(data, columns=[
        'Balls', 'Strikes', 'PitchType', 'Outcome',
        'PitcherType', 'PitchNumber', 'AtBatNumber',
        'RunnersOn', 'ScoreDiff', 'PreviousPitchType'
    ])

    return df
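
If a game identifier is stored alongside each pitch, lagged features can also be built with a grouped shift, which keeps a game's first pitch from inheriting the last pitch of the preceding game. The sketch below uses a toy frame; the `GameID` and `PitchTwoBack` columns are hypothetical additions for illustration and are not part of the dataset described in this tutorial.

```python
import pandas as pd

toy = pd.DataFrame({
    'GameID':    [0, 0, 0, 1, 1],
    'PitchType': ['Fastball', 'Slider', 'Fastball', 'Curveball', 'Changeup'],
})

# Shift within each game so the lag never crosses a game boundary.
toy['PreviousPitchType'] = (
    toy.groupby('GameID')['PitchType'].shift(1).fillna('None')
)
# A second lag gives downstream models two pitches of context.
toy['PitchTwoBack'] = (
    toy.groupby('GameID')['PitchType'].shift(2).fillna('None')
)
print(toy)
```

A plain `Series.shift(1)` on the same frame would assign `Fastball` as the previous pitch for the first row of game 1, which is the leak the grouped version avoids.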

In [None]:
import os
os.makedirs('data', exist_ok=True)

# Generate the dataset
random.seed(42)
dataset = generate_dataset(num_games=3000, at_bats_per_game=35)

# Save to CSV (for ML model training)
dataset.to_csv('data/baseball_pitch_data.csv', index=False)

print(f"Dataset shape: {dataset.shape}")
print("\nPitch type distribution:")
print(dataset['PitchType'].value_counts(normalize=True).round(3))
print("\nPitcher type distribution:")
print(dataset['PitcherType'].value_counts(normalize=True).round(3))

## View Data

In [None]:
dataset.head(20)

In [None]:
# PitchType vs Outcome heatmap, and pitch distribution by pitcher type
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Overall PitchType vs Outcome heatmap
ax1 = axes[0]
df_2dhist = pd.DataFrame({
    x_label: grp['Outcome'].value_counts()
    for x_label, grp in dataset.groupby('PitchType')
})
sns.heatmap(df_2dhist, cmap='viridis', ax=ax1, annot=True, fmt='d')
ax1.set_xlabel('PitchType')
ax1.set_ylabel('Outcome')
ax1.set_title('PitchType vs Outcome (Overall)')

# PitchType distribution by PitcherType
ax2 = axes[1]
ct = pd.crosstab(dataset['PitcherType'], dataset['PitchType'], normalize='index')
ct.plot(kind='bar', ax=ax2, rot=0)
ax2.set_ylabel('Proportion')
ax2.set_title('Pitch Distribution by Pitcher Type')
ax2.legend(title='PitchType', bbox_to_anchor=(1.05, 1))

plt.tight_layout()
plt.show()

### Additional Visualizations

In [None]:
# Hit rate by count, and fastball rate by pitch number
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Hit rate by count
hit_by_count = dataset.groupby(['Balls', 'Strikes'])['Outcome'].apply(
    lambda x: (x == 'hit').mean()
).unstack(fill_value=0)
sns.heatmap(hit_by_count, cmap='YlOrRd', annot=True, fmt='.3f', ax=axes[0])
axes[0].set_title('Hit Rate by Count')
axes[0].set_xlabel('Strikes')
axes[0].set_ylabel('Balls')

# Fastball rate by pitch number (fatigue effect)
pitch_bins = pd.cut(dataset['PitchNumber'], bins=range(0, 131, 10))
fb_by_pitch_num = dataset.groupby(pitch_bins, observed=True)['PitchType'].apply(
    lambda x: (x == 'Fastball').mean()
)
fb_by_pitch_num.plot(kind='bar', ax=axes[1], rot=45)
axes[1].set_title('Fastball Rate by Pitch Count (Fatigue Effect)')
axes[1].set_ylabel('Fastball Proportion')
axes[1].set_xlabel('Pitch Number Range')

plt.tight_layout()
plt.show()
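
One further check that may be useful is an empirical pitch-to-pitch transition table, which should reveal the sequence strategies if they are taking effect. The sketch below builds one with `pd.crosstab` on a small hand-made frame; the toy rows are made up for illustration, and in the notebook the same call would be applied to the `PreviousPitchType` and `PitchType` columns of `dataset`.

```python
import pandas as pd

toy = pd.DataFrame({
    'PreviousPitchType': ['Fastball', 'Fastball', 'Fastball',
                          'Curveball', 'Curveball', 'Changeup'],
    'PitchType':         ['Changeup', 'Slider', 'Fastball',
                          'Fastball', 'Fastball', 'Fastball'],
})

# Rows: previous pitch. Columns: next pitch. Each row sums to 1.
transition = pd.crosstab(
    toy['PreviousPitchType'], toy['PitchType'], normalize='index'
)
print(transition.round(2))
```

Rows that differ noticeably from the overall pitch distribution indicate that the previous pitch carries predictive signal.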

## Conclusion

This enhanced simulator generates realistic baseball pitch sequences with learnable patterns:

- **Pitcher archetypes** create distinct pitch distributions that models can learn to identify
- **Sequence strategies** create higher-order dependencies that reward models which look at pitch history
- **Count-dependent outcomes** provide realistic hit rates (5-6% in pitcher's counts up to 18-23% in hitter's counts)
- **Fatigue modeling** creates drift in pitch selection over the course of a game
- **Game situation** adds contextual signal (runners on, score differential)

The resulting dataset has columns: `Balls, Strikes, PitchType, Outcome, PitcherType, PitchNumber, AtBatNumber, RunnersOn, ScoreDiff, PreviousPitchType` — providing rich features for both tabular models (AutoGluon) and sequence models (HMM, LSTM).
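
Before training anything on these columns, it can help to know what a trivial predictor achieves. The sketch below is one possible baseline that predicts the most frequent pitch for each count. The `majority_baseline` and `accuracy` helpers are hypothetical, and the toy rows stand in for `(Balls, Strikes, PitchType)` triples read from the generated CSV.

```python
from collections import Counter, defaultdict

def majority_baseline(rows):
    """Map each (balls, strikes) count to its most frequent pitch type."""
    counts = defaultdict(Counter)
    for balls, strikes, pitch in rows:
        counts[(balls, strikes)][pitch] += 1
    return {count: c.most_common(1)[0][0] for count, c in counts.items()}

def accuracy(model, rows):
    """Fraction of rows whose pitch matches the baseline's prediction."""
    hits = sum(model.get((b, s)) == pitch for b, s, pitch in rows)
    return hits / len(rows)

# Toy rows of (Balls, Strikes, PitchType); real rows would come from the CSV.
rows = [
    (0, 0, 'Fastball'), (0, 0, 'Fastball'), (0, 0, 'Slider'),
    (0, 2, 'Slider'), (0, 2, 'Slider'), (0, 2, 'Curveball'),
]
model = majority_baseline(rows)
print(model, round(accuracy(model, rows), 3))
```

A model that uses pitcher type, pitch history, and game situation should beat this count-only baseline if the added signals are learnable. Evaluating on rows held out from the fit would give a fairer comparison than the in-sample figure printed here.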