**Team Visualization**

Goal: For all 30 teams, compare the *expected* number of runs per game, as computed by the Markov chain model, to the *actual* number of runs scored per game in 2016.

In [1]:
from markov_functions import (find_team_atbats, import_raw_batting_data, run_matrix,
                            find_states, make_empty_transition_matrix, 
                            make_transition_matrix, team_markov,
                            team_markov_from_raw)

batting = import_raw_batting_data(verbose=False)

In [2]:
# Show the expected runs for 9 innings of the Minnesota Twins
print(team_markov_from_raw(batting, "MIN"))

5.81307069903


In [3]:
# Now find the expected runs for 9 innings of all 30 teams
# (This takes some time to run...)
team_codes = batting.visteam.unique()
team_runs_per_9 = {team:team_markov_from_raw(batting, team) for team in team_codes}
print(team_runs_per_9)

{'LAN': 5.4445949177522834, 'MIL': 5.9167171091630042, 'TEX': 6.0551656758336572, 'PIT': 5.5507411362189272, 'KCA': 5.0977499601756042, 'NYA': 5.1306437446854058, 'TOR': 5.5759519780219282, 'ARI': 5.9292154301695366, 'TBA': 5.5336832918836532, 'SLN': 5.8278414600815198, 'WAS': 5.9850904772612061, 'CLE': 6.2236148023533939, 'ATL': 5.0923221120316331, 'MIN': 5.8130706990279499, 'COL': 6.4761877810536195, 'OAK': 4.7050994570590925, 'MIA': 5.1848046917608146, 'PHI': 5.1206321094137026, 'DET': 6.136764025401769, 'CHN': 6.4739706857570125, 'HOU': 5.9071522836238488, 'NYN': 5.3796663787204437, 'SFN': 5.5146144643686412, 'CIN': 5.7835963264503105, 'ANA': 5.2064281316535288, 'SDN': 5.1140017257752479, 'BAL': 5.8845932407912587, 'CHA': 5.2560025844894458, 'SEA': 5.7671807993032171, 'BOS': 6.6382131279814907}


How do these predicted runs per nine innings compare to the actual 2016 runs per game? We'll check them against the team statistics at http://www.baseball-reference.com/leagues/MLB/2016.shtml. Unfortunately, that source uses team codes that differ slightly from the ones in our batting data, so we'll re-code the index of the imported data frame to match.

In [5]:
import pandas as pd
team_2016_data = pd.read_csv('team_2016_stats.csv', index_col = 'team_name')
# Data from http://www.baseball-reference.com/leagues/MLB/2016.shtml

code_exchange = {'ARI':'ARI', 'ATL':'ATL', 'BAL':'BAL', 'BOS':'BOS', 'CHC':'CHN', 'CHW':'CHA', 
                 'CIN':'CIN', 'CLE':'CLE', 'COL':'COL', 'DET':'DET', 'HOU':'HOU', 'KCR':'KCA', 
                 'LAA':'ANA', 'LAD':'LAN', 'MIA':'MIA', 'MIL':'MIL', 'MIN':'MIN', 'NYM':'NYN', 
                 'NYY':'NYA', 'OAK':'OAK', 'PHI':'PHI', 'PIT':'PIT', 'SDP':'SDN', 'SEA':'SEA', 
                 'SFG':'SFN', 'STL':'SLN', 'TBR':'TBA', 'TEX':'TEX', 'TOR':'TOR', 'WSN':'WAS'}

team_2016_data.index = [code_exchange[i] for i in team_2016_data.index]

print(team_2016_data['runs_per_game'])

ARI    4.64
ATL    4.03
BAL    4.59
BOS    5.42
CHN    4.99
CHA    4.23
CIN    4.42
CLE    4.83
COL    5.22
DET    4.66
HOU    4.47
KCA    4.17
ANA    4.43
LAN    4.48
MIA    4.07
MIL    4.14
MIN    4.46
NYN    4.14
NYA    4.20
OAK    4.03
PHI    3.77
PIT    4.50
SDN    4.23
SEA    4.74
SFN    4.41
SLN    4.81
TBA    4.15
TEX    4.72
TOR    4.69
WAS    4.71
Name: runs_per_game, dtype: float64
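
Since the plot below looks up each team by its code, it may be worth confirming that the re-coded index lines up with the codes in the batting data. A minimal sketch, assuming a three-team stand-in for the real objects (the names `stats`, `missing_from_stats`, and `extra_in_stats` are hypothetical and not part of the notebook):

```python
import pandas as pd

# Hypothetical three-team stand-ins for the notebook's team_codes,
# team_2016_data and code_exchange (values copied from the output above).
team_codes = ["CHN", "ANA", "MIN"]
stats = pd.DataFrame({"runs_per_game": [4.99, 4.43, 4.46]},
                     index=["CHC", "LAA", "MIN"])
code_exchange = {"CHC": "CHN", "LAA": "ANA", "MIN": "MIN"}

# Same re-coding step as in the notebook; raises KeyError on an unmapped code.
stats.index = [code_exchange[i] for i in stats.index]

# Both sets should be empty if the two sources agree on team codes.
missing_from_stats = set(team_codes) - set(stats.index)
extra_in_stats = set(stats.index) - set(team_codes)
print(missing_from_stats, extra_in_stats)
```

If either set is non-empty, a later lookup such as `team_2016_data.at[i, 'runs_per_game']` would fail or silently skip a team.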


In [6]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()

In [7]:
# create a new plot with default tools, using figure
p = figure(plot_width=600, plot_height=600, title="This Markov Chain Prediction is Awful!")
p.xaxis.axis_label = "Expected Runs Per Game"
p.yaxis.axis_label = "Actual Runs Per Game"

expected_runs = [team_runs_per_9[i] for i in team_codes]
#print(expected_runs)

actual_runs = [team_2016_data.at[i,'runs_per_game'] for i in team_codes]
#print(actual_runs)

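# One point per team: x = Markov chain prediction, y = actual 2016 value
p.circle(x=expected_runs, y=actual_runs, size=15, line_color="black", fill_color="black", fill_alpha=0.5)
# Reference line y = x; points below it are teams whose predicted runs exceed their actual runs
p.line(x=[4.5,7],y=[4.5,7], line_width=4, line_color="navy", alpha=0.6)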

show(p) # show the results

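It's clear from the graph that our model consistently **overestimates** the number of runs per game: every one of the 30 teams has an expected value above its actual value. Likely this is due to an error in how the transition matrices are built, which ignores non-transition events such as pitching changes. On the other hand, the relationship looks roughly linear, which is encouraging, since it suggests the model tracks which teams score more even though its overall level is too high.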
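
The overestimate and the linearity noted above can be put into numbers. The sketch below is one possible way to do it, not part of the original analysis: it takes six teams from the outputs above (predictions rounded to two decimals) and computes the mean signed error, the Pearson correlation, and a least-squares line. The variable names and the choice of teams are assumptions made for illustration; the full comparison would use all 30 teams.

```python
import numpy as np

# Predicted (Markov chain) and actual runs per game for six teams,
# copied from the outputs above; predictions rounded to two decimals.
teams = ["BOS", "COL", "CHN", "MIN", "OAK", "PHI"]
expected = np.array([6.64, 6.48, 6.47, 5.81, 4.71, 5.12])
actual = np.array([5.42, 5.22, 4.99, 4.46, 4.03, 3.77])

# Mean signed error: a positive value means the model overestimates.
bias = np.mean(expected - actual)

# Pearson correlation: how closely the points follow some straight line.
r = np.corrcoef(expected, actual)[0, 1]

# Least-squares fit of actual ~ slope * expected + intercept.
slope, intercept = np.polyfit(expected, actual, 1)

print(f"bias = {bias:.2f} runs/game, r = {r:.2f}")
print(f"actual ~ {slope:.2f} * expected + {intercept:.2f}")
```

A large positive bias together with a correlation near 1 would match what the plot suggests: the predictions are too high, but a simple linear correction might bring them much closer to the actual values.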