**Single vs Home Run**

Which is better: a team that hits a home run once every 4 at-bats (and strikes out otherwise), or a team that hits a single once every 2 at-bats?

In [7]:
import sys
import pandas as pd
sys.path.insert(0, '../')
from markov_functions import (find_team_atbats, import_raw_batting_data, run_matrix,
                            find_states, make_empty_transition_matrix, make_team_transition_matrix,
                            make_transition_matrix, team_markov, team_markov_from_raw)

batting = import_raw_batting_data(verbose = False)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

To compare the two teams, we need a separate transition matrix for each. One models a team that hits only singles, with probability `p_single` per at-bat (for now, every runner advances exactly one base on a single). The other models a team that hits only home runs, with probability `p_hr` per at-bat.

In [2]:
def make_single_transition_matrix(p_single):
    single_t_matrix = make_empty_transition_matrix()
    all_base_options = ["", "1", "2", "3", "12", "13", "23", "123"]

    # Result of an out: runners stay put and the out count goes up by one.
    for outs in range(3):
        for runners in all_base_options:
            if outs != 2:
                single_t_matrix.loc[(outs, runners),(outs+1, runners)] = 1-p_single
            else:  # outs == 2: the third out moves to the "0" state
                single_t_matrix.loc[(outs, runners), "0"] = 1-p_single

    # Only these base states are reachable when every hit is a one-base single.
    single_base_options = ["", "1", "12", "123"]
    # Result of a single:
    for outs in range(3):
        for runners in single_base_options:
            if runners == "":
                single_t_matrix.loc[(outs, runners),(outs, "1")  ] = p_single
            elif runners == "1":
                single_t_matrix.loc[(outs, runners),(outs, "12") ] = p_single
            elif runners == "12":
                single_t_matrix.loc[(outs, runners),(outs, "123")] = p_single
            else:  # Bases already loaded: they stay loaded.
                single_t_matrix.loc[(outs, runners),(outs, "123")] = p_single
    return single_t_matrix

#print(make_single_transition_matrix(p_single=0.5))

In [3]:
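def make_hr_transition_matrix(p_hr):
    hr_t_matrix = make_empty_transition_matrix()
    # A home run clears the bases, so only the bases-empty states are ever visited.
    # From each out count: stay put with p_hr (home run) or add an out with 1 - p_hr.
    hr_t_matrix.loc[(0,""),(0,"")] = p_hr
    hr_t_matrix.loc[(0,""),(1,"")] = 1 - p_hr
    hr_t_matrix.loc[(1,""),(1,"")] = p_hr
    hr_t_matrix.loc[(1,""),(2,"")] = 1 - p_hr
    hr_t_matrix.loc[(2,""),(2,"")] = p_hr
    hr_t_matrix.loc[(2,""),("0") ] = 1 - p_hr  # third out
    return hr_t_matrix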
#print(make_hr_transition_matrix(p_hr=0.5))
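A quick sanity check for any matrix built this way is that each row is a probability distribution. Since `make_empty_transition_matrix` is not shown here, the sketch below uses a plain NumPy array as a hypothetical 4-state stand-in for the home-run-only chain; `toy_hr_matrix` and `rows_sum_to_one` are illustrative names, and the self-loop on the end state is an assumption rather than something taken from `markov_functions`.

```python
import numpy as np

def toy_hr_matrix(p_hr):
    """Hypothetical 4-state stand-in for the home-run-only chain.

    States 0-2 mean "n outs, bases empty"; state 3 is the end of the half-inning.
    """
    m = np.zeros((4, 4))
    for outs in range(3):
        m[outs, outs] = p_hr          # home run: bases stay empty, outs unchanged
        m[outs, outs + 1] = 1 - p_hr  # out: next out count, or the end state
    m[3, 3] = 1.0                     # assumption: the end state is absorbing
    return m

def rows_sum_to_one(matrix, tol=1e-9):
    """Return True if every row of the matrix sums to 1 within tol."""
    return bool(np.all(np.abs(matrix.sum(axis=1) - 1.0) < tol))

print(rows_sum_to_one(toy_hr_matrix(0.25)))
```

The same check could be applied to the pandas matrices above by passing their underlying values, provided unused rows are excluded first.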

Now, let's use the `team_markov` function to see which batting matrix is expected to score more runs per 9 innings:

In [6]:
print(team_markov(make_hr_transition_matrix(p_hr=0.25), iterations=25))
print(team_markov(make_single_transition_matrix(p_single=0.5), iterations=25))

11.69664
4.4001736089


I thought the home runs would do better, but that's a much larger gap than I anticipated!

What if we change the probability of singles or home runs? Let's make a visualization.

First, we'll need to find the expected number of runs as `p_hr` and `p_single` vary between 0 and 1. We'll count by hundredths (0.01 through 0.99) and use 50 steps in our Markov chains for a little more accuracy. This will take some time to run...

In [4]:
stepsize = 1  #should go evenly into 100
prob_values = [i/100 for i in range(stepsize, 100, stepsize)]
markov_iterations = 50

# Calculate expected runs from markov chain
#single_expected_runs = [team_markov(make_single_transition_matrix(p_single=i), iterations=markov_iterations) for i in prob_values]
#print(single_expected_runs)
#hr_expected_runs = [team_markov(make_hr_transition_matrix(p_hr=i), iterations=markov_iterations) for i in prob_values]
#print(hr_expected_runs)
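To make the size of the sweep concrete, here is a small sketch of the probability grid on its own. The helper name `make_prob_grid` is hypothetical; it wraps the same list comprehension used above and adds a check for the "should go evenly into 100" comment.

```python
def make_prob_grid(stepsize):
    """Return stepsize/100, 2*stepsize/100, ... for all values strictly below 1."""
    if 100 % stepsize != 0:
        raise ValueError("stepsize should divide 100 evenly")
    return [i / 100 for i in range(stepsize, 100, stepsize)]

grid = make_prob_grid(1)
print(len(grid), grid[0], grid[-1])  # 99 0.01 0.99
print(len(grid) ** 2)                # 9801 (hr, single) pairs in the heatmap
```

With `stepsize = 1`, each list of expected runs needs 99 chain evaluations, and the endpoints 0 and 1 are never evaluated.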

Now let's arrange the data in a dictionary with three keys: `hr_prob`, `single_prob`, and `run_diff`, where `run_diff` is the expected runs for the home-run team minus the expected runs for the singles team at that pair of probabilities.

In [7]:
# Calculate the difference between home run expected value and single expected value
diff = [hr_expected_runs[i]-single_expected_runs[j]
        for i in range(len(prob_values)) 
        for j in range(len(prob_values))]

# Enter data as dictionary: repeat each home run probability once per single probability
hr_prob_list = [i for i in prob_values for _ in prob_values]
#print(hr_prob_list)

data = {'hr_prob': hr_prob_list,
         'single_prob': prob_values*len(prob_values),
         'run_diff': diff}
#print(data)

NameError: name 'hr_expected_runs' is not defined
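The cell above needs `hr_expected_runs` and `single_expected_runs` to exist before it runs. As a minimal sketch of the same long-format layout, the example below uses three made-up probabilities and made-up run values (illustrative only, not results from `team_markov`) and builds all three columns in one pass with `itertools.product`.

```python
from itertools import product

# Toy stand-ins for the real lists (values are illustrative only)
prob_values = [0.25, 0.5, 0.75]
hr_expected_runs = [1.0, 2.0, 3.0]
single_expected_runs = [0.5, 1.5, 2.5]

# Outer loop over home run probabilities, inner loop over single probabilities
rows = [
    (p_hr, p_single, hr_runs - single_runs)
    for (p_hr, hr_runs), (p_single, single_runs)
    in product(zip(prob_values, hr_expected_runs),
               zip(prob_values, single_expected_runs))
]

toy_data = {
    'hr_prob': [r[0] for r in rows],
    'single_prob': [r[1] for r in rows],
    'run_diff': [r[2] for r in rows],
}
print(toy_data['hr_prob'][:4])      # [0.25, 0.25, 0.25, 0.5]
print(toy_data['single_prob'][:4])  # [0.25, 0.5, 0.75, 0.25]
print(toy_data['run_diff'][:4])     # [0.5, -0.5, -1.5, 1.5]
```

Building each row as one tuple keeps the three columns aligned by construction, instead of relying on three separately ordered lists.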

Let's save the data dictionary for later using `np.save`, then load it back with `np.load`.

In [8]:
import numpy as np

# Save
#np.save('single_vs_hr_data.npy', data) 

# Load (np.save pickles the dict, so recent NumPy versions need allow_pickle=True)
new_data = np.load('single_vs_hr_data.npy', allow_pickle=True).item()
#print(new_data)
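For reference, here is a self-contained sketch of the save/load round trip on a toy dictionary (made-up values, written to a temporary directory rather than to `single_vs_hr_data.npy`). `np.save` stores a dictionary as a pickled 0-d object array, and recent NumPy versions refuse to unpickle object arrays unless `allow_pickle=True` is passed to `np.load`.

```python
import os
import tempfile

import numpy as np

# Toy dictionary in the same layout as the real data (values are illustrative only)
toy = {'hr_prob': [0.25, 0.25], 'single_prob': [0.5, 0.75], 'run_diff': [1.0, -0.5]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_data.npy')
    np.save(path, toy)  # the dict is wrapped in a 0-d object array and pickled
    loaded = np.load(path, allow_pickle=True).item()  # .item() unwraps the dict

print(loaded == toy)  # True
```

Only load pickled files that come from a trusted source, since unpickling can execute arbitrary code.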

Finally, let's look at a heatmap of the data using Bokeh.

In [9]:
from bokeh.io import output_notebook
from bokeh.charts import HeatMap, bins, show
from bokeh.layouts import column, gridplot
from bokeh.palettes import RdBu9 as palette
output_notebook()

In [10]:
hm = HeatMap(new_data, x='single_prob', y='hr_prob', values='run_diff',
             title='Home Runs vs Singles', palette=palette, stat=None,
             hover_text="@run_diff")
show(hm)

This plot is easy to code, but not really what I wanted. There are only a few bins and it's hard to see where home runs and singles are equal in strength. It doesn't look like there are many examples of Bokeh `HeatMap` plots online, and while struggling through the documentation I found some nice Plotly plots that looked workable. So I signed up for a free account and tried again...

In [11]:
import plotly
plotly.tools.set_credentials_file(username='dpebert7', api_key='YOUR_API_KEY')  # placeholder: use the key from your own Plotly account

In [15]:
import plotly.plotly as py
import plotly.graph_objs as go

trace = go.Heatmap(x=new_data['hr_prob'],
                   y=new_data['single_prob'],
                   z=new_data['run_diff'])
data=[trace]
#print(data)
py.iplot(data, filename='basic-heatmap')

High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~dpebert7/0 or inside your plot.ly account where it is named 'basic-heatmap'


This is much better, but there's too much gray area because the color bar's range is too wide. Rather than figuring out how to pretty it up in Python, we can add axis labels, etc. on the Plotly website by clicking the link above.

In Plotly, go to Style --> Traces --> Min Value/Max Value to make sure white corresponds to 0. Setting the range from -20 to 20 (or smaller) yields a nice, colorful graph. Among its many features, Plotly also has a nice smoothing function that looks good with this data.
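If you would rather pick the range in Python, one option is to compute a symmetric limit from the data so that 0 sits in the middle of the color scale. This is a minimal sketch on toy values (illustrative only), assuming the long-format layout used above, where `hr_prob` varies slowest.

```python
import numpy as np

# Toy long-format data in the same layout as the saved dictionary (values are illustrative only)
toy = {
    'hr_prob':     [0.25, 0.25, 0.5, 0.5],
    'single_prob': [0.25, 0.5, 0.25, 0.5],
    'run_diff':    [0.5, -3.0, 4.0, 1.0],
}

n_hr = len(set(toy['hr_prob']))
n_single = len(set(toy['single_prob']))

# Rows follow hr_prob and columns follow single_prob, because hr_prob varies slowest
z_grid = np.array(toy['run_diff']).reshape(n_hr, n_single)

# Symmetric limit: the color range [-limit, limit] is centered on 0
limit = float(np.abs(z_grid).max())
print(z_grid.shape, -limit, limit)  # (2, 2) -4.0 4.0
```

These values could then be passed as `zmin=-limit` and `zmax=limit` to `go.Heatmap`, together with a diverging colorscale such as `'RdBu'`, as an alternative to setting Min Value/Max Value by hand on the website.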