This notebook walks through building as advanced an MLB playoff prediction model as possible using only team-level statistics. This approach is likely inferior to a more sophisticated model built on the microfoundations of individual player data, but it should serve two purposes: providing a strong baseline against which to compare future models, and offering practice in iterating on a very simple baseline playoff model, adding complexity step by step in the hope of improving performance.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from pybaseball import standings
from pybaseball import team_batting
from pybaseball import team_pitching
from pybaseball import retrosheet

team_pitching_df = team_pitching(1970, 2021)
team_batting_df = team_batting(1970, 2021)

We first implement an extremely simple baseline playoff model that takes the winning percentages of two playoff teams as input and uses the formula


$\frac{\text{Team 1 Win \%} + (1 - \text{Team 2 Win \%})}{2} = \text{Team 1 Projected Win \%}$

to compute the probability that a team wins a given playoff matchup. We then use a Monte Carlo simulation to simulate every playoff bracket from 1970 to 2021 and compute each team's probability of attaining various outcomes (wins the World Series, loses in the first round, etc.).
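
As a rough illustration of the formula and the simulation step, here is a minimal sketch. The function names `matchup_win_prob` and `simulate_bracket`, the four-team bracket, and the win percentages are all hypothetical, and each round is decided by a single random draw rather than a best-of-N series, which is a simplifying assumption and not necessarily how the notebook's full simulation works.

```python
import random
from collections import Counter


def matchup_win_prob(team1_win_pct, team2_win_pct):
    """Baseline formula: average of team 1's win % and team 2's loss %."""
    return (team1_win_pct + (1 - team2_win_pct)) / 2


def simulate_bracket(win_pcts, n_sims=10_000, seed=0):
    """Simulate a hypothetical single-elimination bracket.

    win_pcts maps team name -> regular-season win percentage. Teams are
    paired in listed order (1st vs 2nd, 3rd vs 4th, ...), and the number
    of teams is assumed to be a power of two. Each round is one random
    draw (an assumption made to keep the sketch short).
    Returns the fraction of simulations in which each team won the title.
    """
    rng = random.Random(seed)
    teams = list(win_pcts)
    titles = Counter()
    for _ in range(n_sims):
        alive = teams
        while len(alive) > 1:
            next_round = []
            # Pair adjacent teams and advance one winner per pairing.
            for a, b in zip(alive[::2], alive[1::2]):
                p = matchup_win_prob(win_pcts[a], win_pcts[b])
                next_round.append(a if rng.random() < p else b)
            alive = next_round
        titles[alive[0]] += 1
    return {team: titles[team] / n_sims for team in teams}


# Made-up win percentages, not real data.
example = {"A": 0.600, "B": 0.550, "C": 0.580, "D": 0.540}
print(matchup_win_prob(example["A"], example["B"]))
print(simulate_bracket(example))
```

Note that the formula is symmetric: the probability that team 1 beats team 2 and the probability that team 2 beats team 1 always sum to one, so a single draw per matchup is enough to pick a winner.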

We then compare the results to the actual outcomes of these playoff brackets, using the following evaluation metric:

$\text{Error} = \sum_{\text{teams}}\sum_{\text{outcomes}}(\text{Distance from Outcome}) \cdot |\text{Actual Outcome} - \text{Projected Outcome}|$

Here, the actual and projected outcomes are arrays whose length equals the number of possible outcomes a team can have in the playoffs (World Series win, World Series loss, ALCS loss, etc.). The actual outcomes array contains zeroes and a single one, which marks the team's final bracket placement. For example, in the 2021 playoff format, the Braves avoided the Wild Card round, so their array would be [Lost in NLDS, Lost in NLCS, Lost in WS, Won WS] = [0, 0, 0, 1]. The projected outcomes array has the same format, but each entry gives the percentage of simulations in which the team finished with the corresponding placement. For example, the Braves' 2021 projected outcomes array might have looked something like [45, 22, 9, 4].

The distance from an outcome is defined as the number of 'steps' between the projected outcome and the actual outcome. For example, if the Tigers were projected to win the World Series but actually lost in the ALCS, the distance between these outcomes would be 2; if they instead lost in the World Series, the distance would be 1.
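
The metric for a single team can be sketched as follows. This is one reading of the definition, under two assumptions: the distance for each array position is its index offset from the position of the actual outcome, and the projected values are expressed as fractions so that they share a scale with the zero/one actual array. The helper name `bracket_error` and the example numbers are hypothetical.

```python
import numpy as np


def bracket_error(actual, projected):
    """Distance-weighted absolute error for one team.

    actual: one-hot array marking the team's real finish.
    projected: simulated probability of each finish, in the same order.
    """
    actual = np.asarray(actual, dtype=float)
    projected = np.asarray(projected, dtype=float)
    actual_idx = int(np.argmax(actual))
    # Number of steps between each possible outcome and the one that occurred.
    distance = np.abs(np.arange(len(actual)) - actual_idx)
    return float(np.sum(distance * np.abs(actual - projected)))


# Order: [Lost in DS, Lost in CS, Lost in WS, Won WS]; made-up probabilities.
actual = [0, 0, 0, 1]
projected = [0.45, 0.30, 0.15, 0.10]
print(bracket_error(actual, projected))
```

The total error for a bracket would then be the sum of this quantity over all playoff teams. Under this reading, the entry for the outcome that actually occurred has distance zero and so contributes nothing; probability mass placed far from the true outcome is penalized most heavily.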

In [2]:
team_pitching_df = team_pitching_df.set_index(["Season", "Team"])

In [3]:
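# Assumes a 162-game schedule; this understates win percentage in
# shortened seasons (e.g. 1981, 2020).
team_pitching_df["Win Percentage"] = team_pitching_df["W"] / 162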
team_df = team_pitching_df["Win Percentage"]
print(team_df)

Season  Team
1972    BAL     0.493827
        OAK     0.574074
1981    HOU     0.376543
1972    LAD     0.524691
        PIT     0.592593
                  ...   
2001    TEX     0.450617
1995    MIN     0.345679
2021    BAL     0.320988
1999    COL     0.444444
1996    DET     0.327160
Name: Win Percentage, Length: 1444, dtype: float64
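
Because some seasons in this range had fewer than 162 games, a fixed denominator pulls those teams' win percentages down. One possible alternative, sketched below on a small made-up frame, is to divide wins by decisions. This assumes the team pitching data also carries an `L` (losses) column next to `W`, which should be checked against the actual DataFrame before relying on it; the `toy` frame and its team codes are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for team_pitching_df; the numbers are made up.
toy = pd.DataFrame(
    {
        "Season": [2019, 2020],
        "Team": ["AAA", "BBB"],
        "W": [81, 30],
        "L": [81, 30],
    }
).set_index(["Season", "Team"])

# Wins divided by games actually decided, rather than a fixed 162.
toy["Win Percentage"] = toy["W"] / (toy["W"] + toy["L"])
# The fixed-denominator version, kept for comparison.
toy["Win Percentage (per 162)"] = toy["W"] / 162
print(toy)
```

In this toy data both teams won exactly half of their games, but only the first column reports 0.5 for the team that played a 60-game season.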


In [31]:
world_series = retrosheet.world_series_logs()

In [45]:
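# Keep only the year: the first four characters of each date value.
world_series["date"] = world_series["date"].astype(str).str[:4].astype(int)
world_series["index"] = world_series.index
ws_modern = world_series[world_series["date"] >= 1970]
# Note: this counts individual World Series games since 1970, not series.
ws_outcomes = ws_modern["date"].count()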

print(ws_outcomes)

295
