Recently, I needed to make a heatmap with a dendrogram for work.  The only libraries I could find with that particular template were [seaborn](https://stanford.edu/~mwaskom/software/seaborn/) and [plotly](https://plot.ly/) (the latter being a pretty awesome tool I only recently found out about).  However, my familiarity with [bokeh](http://bokeh.pydata.org/en/latest/) kept pulling me towards it.  I stumbled upon [this](http://stackoverflow.com/questions/23578753/how-to-make-a-cluster-style-dendrogram-in-bokeh) StackOverflow question, and it seemed like no code was available.  The more I program, the more I find myself preferring to code graphs myself rather than use a template.  So, I figured I'd take a stab at it.  Here is a heatmap clustering the MLB teams by batting statistics.  

First import the standards...

In [2]:
import pandas as pd
import numpy as np

Load the team colors and the batting data, which I downloaded from [Baseball Reference](http://www.baseball-reference.com/).  

In [3]:
colors_df = pd.read_csv('data/colors.csv', index_col=0)

df = pd.read_csv('data/2016_teams_standard_batting_8_31_16.csv', index_col=0)

colors_df = colors_df.loc[df.index]

Scale each batting stat to [0,1] so I can use color intensity to compare teams.  Do a little housekeeping on the colors data as well.  

In [5]:
df_std = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))
df_scaled = df_std * (1.0 - 0.0) + 0.0

print(df)


names = colors_df['Teams']
colors_df.drop(['Teams', 'Abbrv.1'], axis=1, inplace=True)

      R/G    G    PA    AB    R     H   2B  3B   HR  RBI  ...    SLG    OPS  \
Tm                                                        ...                 
ARI  4.58  132  5114  4631  604  1217  233  51  147  569  ...  0.430  0.752   
ATL  3.73  132  5055  4510  492  1111  237  23   92  466  ...  0.370  0.683   
BAL  4.70  132  4962  4516  621  1176  225   5  208  593  ...  0.451  0.770   
BOS  5.42  132  5153  4636  716  1319  292  23  167  685  ...  0.465  0.815   
CHC  5.08  131  5149  4470  666  1153  240  22  166  633  ...  0.433  0.778   
CHW  4.07  131  4940  4461  533  1129  219  26  132  509  ...  0.403  0.717   
CIN  4.48  131  4920  4433  587  1108  222  24  143  551  ...  0.408  0.719   
CLE  4.87  131  5007  4503  638  1190  241  23  163  604  ...  0.437  0.763   
COL  5.30  131  5062  4535  694  1240  260  36  167  661  ...  0.457  0.795   
DET  4.65  132  5018  4540  614  1207  202  25  174  592  ...  0.436  0.764   
HOU  4.56  132  5088  4528  602  1117  243  22  165 
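
To make the scaling step concrete, here is a minimal sketch of the same column-wise min-max formula on a toy table.  The team labels and numbers are made up for illustration and are not from the batting data.

```python
import pandas as pd

# Toy data (hypothetical values, not from the batting table)
toy = pd.DataFrame({"HR": [100, 150, 200], "SB": [30, 90, 60]},
                   index=["AAA", "BBB", "CCC"])

# Column-wise min-max scaling: (x - min) / (max - min)
toy_scaled = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

# Each column now runs from 0 (lowest team) to 1 (highest team)
print(toy_scaled)
```

One thing to watch for: a column where every team has the same value has a range of zero, so this formula would produce NaN for it.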


Write a function that will generate a random color from a team's color palette.  

In [4]:
def color_gen(tm):
    """Return a random color from a team's palette that is not white."""
    from random import choice
    # Drop missing entries first so we only pick from real colors
    palette = colors_df.loc[tm].dropna().tolist()
    color = '#FFFFFF'
    while color == '#FFFFFF':
        color = choice(palette)
    return color

[Partial](https://docs.python.org/2/library/functools.html) is a pretty useful function for "pre-loading" a function with some of its arguments.  Here I use partial to create a dictionary where the keys are team names and each value is the `color_gen` function with that particular team already passed as an argument.  

In [5]:
from functools import partial
color_map = dict([(tm, partial(color_gen, tm)) for tm in colors_df.index.tolist()])
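
If partial is new to you, here is a small standalone sketch of the same pattern.  The `describe` helper and the team list are hypothetical and only there to show how the binding works.

```python
from functools import partial


def describe(team, stat):
    """Hypothetical helper: format a team/stat pair as a string."""
    return f"{team}:{stat}"


# Bind the first argument now; the remaining one is supplied at call time
describe_map = {tm: partial(describe, tm) for tm in ["BOS", "CHC"]}

print(describe_map["BOS"]("HR"))
```

In `color_map` the only argument is the team, so every stored value can be called with no arguments at all, which is what the `color_map[tm]()` call later on relies on.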

I create the dendrogram using [SciPy's dendrogram](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html) function.  The function takes a linkage matrix `Z` and returns a dictionary of objects, two of which (`icoord` and `dcoord`) hold the line coordinates of the dendrogram.  A third object, `ivl`, is a list of labels mapping the original order of the teams to their order in the dendrogram.

In [6]:
from sklearn.metrics.pairwise import pairwise_distances
from scipy.cluster.hierarchy import linkage, dendrogram

X = pairwise_distances(df.values, metric='euclidean')
Z = linkage(X, 'ward')
results = dendrogram(Z, no_plot=True)
icoord, dcoord = results['icoord'], results['dcoord']
labels = list(map(int, results['ivl']))
df = df.iloc[labels]
df_scaled = df_scaled.iloc[labels]
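
A note on the input to `linkage`: as far as I understand the SciPy docs, it expects either a condensed (1-D) distance vector or a 2-D array of observations, so a square distance matrix gets treated as observations.  Below is a minimal sketch of the condensed route using `scipy.spatial.distance.pdist` on toy points.  The data is made up, and swapping this into the cell above would likely change the clustering shown in this post.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy observations (hypothetical): two tight pairs of points
obs = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# pdist returns the condensed distance vector: n*(n-1)/2 entries
condensed = pdist(obs, metric="euclidean")
Z_toy = linkage(condensed, method="ward")

# no_plot=True skips drawing and just returns the coordinate dictionary
res = dendrogram(Z_toy, no_plot=True)
print(res["ivl"])          # leaf labels, as strings, in dendrogram order
print(len(res["icoord"]))  # one 4-point line per merge
```

Each entry of `icoord` and `dcoord` holds the four points of one bracket in the tree, which is why the plotting code further down can draw them one line at a time.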


I reshape the data a bit to make it work with bokeh's tooltips, a hover tool that lets you see the data associated with a certain object on the plot.  Each cell of the heatmap becomes one row with its team, stat, position, color, and value.  

In [7]:
tms = []
batting = []
xs = []
ys = []
colors = []
alpha = []
value = []

for i, tm in enumerate(df.index):
    tms = tms + [tm]*len(df.columns)
    batting = batting + df.columns.tolist()
    xs = xs + list(np.arange(0.5, len(df.columns)+0.5))
    ys = ys + [i+0.5]*len(df.columns)
    colors = colors + [color_map[tm]() for _ in range(len(df.columns))]
    value = value + df.loc[tm].tolist()
    alpha = alpha + df_scaled.loc[tm].tolist()
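
The loop above builds one row per (team, stat) cell.  For comparison, here is a sketch of the same wide-to-long reshape using `np.repeat` and `np.tile` on a toy table.  The values are made up, and the color and alpha columns are left out to keep it short.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values): 2 teams x 2 stats
toy = pd.DataFrame({"HR": [100, 150], "SB": [30, 90]}, index=["AAA", "BBB"])
n_rows, n_cols = toy.shape

long = pd.DataFrame({
    # each team name repeated once per stat
    "tms": np.repeat(toy.index.to_numpy(), n_cols),
    # the full list of stats repeated once per team
    "batting": np.tile(toy.columns.to_numpy(), n_rows),
    # cell centers: x runs across the stats, y is fixed per team
    "xs": np.tile(np.arange(n_cols) + 0.5, n_rows),
    "ys": np.repeat(np.arange(n_rows) + 0.5, n_cols),
    # row-major flatten, so values line up with the columns above
    "value": toy.to_numpy().ravel(),
})
print(long)
```

This avoids growing Python lists inside a loop, though for 30 teams either approach is plenty fast.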


The rest is just plotting.

In [10]:
from bokeh.plotting import figure, output_file
from bokeh.models.sources import ColumnDataSource
from bokeh.models import HoverTool, SaveTool

data = pd.DataFrame(dict(
        tms=tms,
        batting=batting,
        xs=xs,
        ys=ys,
        colors=colors,
        value=value,
        alpha=alpha
              ))

source = ColumnDataSource(data)

hover = HoverTool()

hover.tooltips = [
    ("Team", "@tms"),
    ("Batting Stat", ("@batting: @value")),
    ("Value", "@value")
]

height, width = df.shape

icoord = pd.DataFrame(icoord)
icoord = icoord * (data['ys'].max() / icoord.max().max())
icoord = icoord.values

dcoord = pd.DataFrame(dcoord)
dcoord = dcoord * (data['xs'].max() / dcoord.max().max())
dcoord = dcoord.values


hm = figure(x_range=[-40, 40],
            #y_range=[0, icoord.max()],
            height=800,
            width=800,
            tools=[ 
#                     ResizeTool(), 
                    SaveTool()],
            output_backend='webgl'  # newer bokeh versions reject `webgl=True`
)

for i, d in zip(icoord, dcoord):
    d = list(map(lambda x: -x, d))
    hm.line(x=d, y=i, line_color='black')

hm.add_tools(hover)


hm.rect(x='xs', y='ys',
        height=1,
        width=1,
        fill_color='colors',
        line_color='black',
        source=source,
        line_alpha=0.2,
        fill_alpha='alpha'
        
        )
hm.text([data['xs'].max()+0.51] * len(data['tms'].unique()), 
        data['ys'].unique().tolist(), 
        text=[nm for nm in names.iloc[labels]],
        text_baseline='middle',
        text_font_size='6pt'
       
       )

hm.axis.major_tick_line_color = None
hm.axis.minor_tick_line_color = None
hm.axis.major_label_text_color = None
hm.axis.major_label_text_font_size = '0pt'
hm.axis.axis_line_color = None
hm.grid.grid_line_color = None
hm.outline_line_color = None


In [None]:
from bokeh.plotting import output_notebook, show

In [None]:
output_notebook()


In [None]:
show(hm)

In the heatmap, MLB leaders have higher color intensities.  It looks like there are two distinct clusters, with the bottom half of the heatmap having many more faded/white color spots.  Not surprisingly, a majority of the division leaders (sans the Dodgers) are in the top half of the map.

All the code and data are available on my GitHub page: https://github.com/russodanielp/bokeh_plots.