# The Height of Fitness
## Rank Correlation Notebook

Mark Baum (markmbaum@protonmail.com)

In [12]:
from os.path import join
import numpy as np
import pandas as pd

from bokeh.plotting import figure, show, output_file, save
from bokeh.layouts import column
from bokeh.embed import components

from bokeh.io import output_notebook, curdoc
from bokeh.io.export import export_png
from bokeh.models import ColumnDataSource, Span, FactorRange
from bokeh.transform import jitter
# only these two palettes are used below
from bokeh.palettes import Paired8, Set2_7

In [27]:
# config cell
saving = False
component_graphics = dict()
if not saving:
    output_notebook()
curdoc().theme = 'light_minimal'    
#choose height/age globally
var = 'height'

In [28]:
#colors for plots categorizing only by these Comp-Gender combos
competitionDivision2color = {
    'Games: Men': Paired8[1],
    'Games: Women': Paired8[5],
    'Open: Men': Paired8[3],
    'Open: Women': Paired8[7]
}
#functions for computing alpha and size values for markers, from p values, to visually signal significance values
psize = lambda p: 10 - 6*np.power(p, 1/4)
palpha = lambda p: 1 - (3/4)*np.power(p, 1/3)

In [29]:
def save_or_show(p, filename: str='') -> None:
    if saving:
        component_graphics[filename] = p
        path = join('..', 'plots', filename)
        output_file(filename=path)
        save(p)
        # swap only the file extension, not any other 'html' substring in the path
        export_png(p, filename=path.replace('.html', '.png'))
    else:
        show(p)
    return None

This notebook contains some interactive plots for exploring and comparing rank correlations between athlete height and placement in the CrossFit Games and the elite stratum of the Open. [Rank correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) measures how similarly *ordered* two vectors are. For example, if the shortest person wins a workout, the second shortest takes second, and so on down to the tallest person, who finishes last, that workout has a perfect positive rank correlation of one (1). In this case, the smallest height values correspond perfectly with the smallest placement values, so the correlation is *positive*. Conversely, if a workout's placements are strictly *descending* in height (tallest person winning, second tallest taking second, ..., shortest finishing last), then the rank correlation is exactly negative one (-1). Everything in between these extremes yields a coefficient between -1 and 1, with 0 indicating no monotonic relationship at all.
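
To make the definition concrete, here is a minimal sketch that computes a rank correlation from scratch: rank both vectors, then take the ordinary (Pearson) correlation of the ranks. The heights and placements are made up, and the helper names (`ranks`, `pearson`, `spearman`) are illustrative only. This is not the code that produced the statistics in this notebook, and in practice a library routine such as `scipy.stats.spearmanr` does the same job.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Ordinary linear correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


heights = [160, 165, 170, 175, 180]   # made-up heights in cm
shortest_first = [1, 2, 3, 4, 5]      # shortest athlete wins
tallest_first = [5, 4, 3, 2, 1]       # tallest athlete wins
mixed = [3, 1, 5, 2, 4]               # no clear ordering

r_pos = spearman(heights, shortest_first)   # perfect positive correlation
r_neg = spearman(heights, tallest_first)    # perfect negative correlation
r_mix = spearman(heights, mixed)            # somewhere in between
print(r_pos, r_neg, r_mix)
```

The first two cases reproduce the two extremes described above, and the third lands between them.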

Each rank correlation coefficient comes with a [p-value](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient#Determining_significance) assessing its statistical significance. This value roughly indicates the probability of finding, purely by accident, a correlation at least as extreme as the one observed when there is no actual relationship. If you repeatedly, randomly shuffled the results and recomputed the correlation, you would see a lot of values around 0 but occasionally some larger values too, just by chance. Each p-value gives a sense of how easily chance alone could produce its correlation, so lower values are better. Take the p-values with a grain of salt, though, especially for small samples like workouts with only 10 people. The stakes here are low and this isn't a controlled experiment, so I'm not fretting over the statistics. Just think of the p-values as rough proxies for confidence in the correlation coefficients.
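
The shuffling idea can be sketched directly as a permutation test. This is only an illustration under stated assumptions: the data are made up, there are no tied values (so the shortcut formula $1 - 6\sum d^2 / (n(n^2-1))$ applies), and the function names and the number of shuffles are my own choices. The p-values in this notebook were not necessarily computed this way.

```python
import random


def spearman_no_ties(x, y):
    """Spearman correlation via 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no tied values."""
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x), start=1)}
    rank_y = {v: r for r, v in enumerate(sorted(y), start=1)}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Fraction of random shuffles whose |correlation| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(spearman_no_ties(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(spearman_no_ties(x, shuffled)) >= observed:
            hits += 1
    # can be exactly 0 when no shuffle matches the observed value
    return hits / n_shuffles


# made-up example with 10 athletes
heights = [163, 168, 170, 172, 175, 178, 180, 183, 185, 188]
strong = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]    # placements closely track height
weak = [5, 2, 9, 1, 7, 10, 3, 6, 4, 8]      # placements barely track height

p_strong = permutation_p_value(heights, strong)
p_weak = permutation_p_value(heights, weak)
print(f"strong: p = {p_strong:.4f}, weak: p = {p_weak:.4f}")
```

With only 10 athletes, a weak correlation is matched by a large share of the shuffles, which is why the small Games samples deserve extra caution.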

The data include results for all Games and Open workouts with **more than 5** recorded height values in the individual Men's and Women's divisions. I exclude the second stage of the 2020 Games (5 people) and any other occasions where almost all the height values are unavailable. Correlations in these cases are just not very meaningful. I took the top 2500 finishers in each Open workout, so those workouts have much larger samples, but **all** of the correlation values are approximate. For example, taking the top 2500 in each Open workout was an arbitrary choice, and taking the top 200 or some other number would change the correlations a bit. More generally, lots of things affect placements and performance, and we shouldn't draw any overdramatic conclusions. As I'll point out below, there are some repeated workouts in both the Games and the Open, which help calibrate our confidence a little bit.

It's definitely best to look at this on a wider screen and not a touch screen. All of the graphics allow zooming, panning, hovering and other interaction. Use the little toolbars on the right to change the action/tool. The plots were all made using [Bokeh](https://docs.bokeh.org/en/latest/index.html) in Python.

### Rank Correlations for Individual Workouts

In [30]:
dfw = pd.read_csv(join('..', 'data', 'pro', 'workout_statistics.csv'))
#take only Men's and Women's individual divisions and workouts with >5 recorded heights
dfw = dfw[(dfw.divisionNumber < 3) & (dfw['N_'+var] > 5)]
#ignore the stage 1 waypoint
dfw = dfw[dfw.workoutName != 'Stage 1 Points']
# arrange columns for marker size/transparency
dfw['competitionDivisionColor'] = dfw.competitionDivision.map(competitionDivision2color)
dfw['p_size'] = psize(dfw['p_'+var])
dfw['p_alpha'] = palpha(dfw['p_'+var])

The following plot shows rank correlations for workouts, split up by competition (Games/Open) and division. The vertical location of each dot represents its correlation, and the horizontal scatter is totally random, just to separate the dots. The size and transparency are related to the p-value for each correlation: the more visible markers have more significant correlations (smaller p-values). I've also *reversed* the vertical axis so that markers higher up represent workouts tilted toward tall people. I think it's more intuitive this way.
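
For a feel of how p-values translate into marker size and transparency, the short sketch below evaluates the same two formulas defined near the top of the notebook (`psize` and `palpha`), rewritten with plain Python arithmetic instead of `np.power`. The sample p-values are arbitrary.

```python
def psize(p):
    """Marker size: 10 at p = 0, shrinking to 4 at p = 1."""
    return 10 - 6 * p ** (1 / 4)


def palpha(p):
    """Marker opacity: 1 at p = 0, fading to 0.25 at p = 1."""
    return 1 - (3 / 4) * p ** (1 / 3)


# arbitrary sample p-values, from highly significant to not significant at all
for p in (0.0, 0.001, 0.05, 0.5, 1.0):
    print(f"p = {p:<5}  size = {psize(p):.2f}  alpha = {palpha(p):.2f}")
```

The fractional exponents make both quantities fall off quickly for small p-values, so only the most significant correlations get large, opaque markers.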

A few things are clear at a glance:
* The most basic point: a lot of workouts appear to have significant correlations.
* Games workouts tend to have stronger correlations. This makes sense, as the Games probably represent a more diverse set of tests, but remember that they also have much smaller numbers of people in each sample than the Open.
* In all groups, workouts are sometimes better for short athletes, sometimes better for tall people, and sometimes neutral. It looks like there might be overall preferences in some groups, but it's hard to tell from this plot and none of them are *really obviously* stacked for/against short/tall athletes.

Hover over the dots to see which workouts they represent.

In [31]:
p = figure(
    title=f'Rank Correlation—Athlete {var.title()} & Workout Placement',
    y_axis_label='Spearman Rank Correlation',
    height=400,
    width=1600,
    x_range=sorted(dfw.competitionDivision.unique()),
    #active_scroll='wheel_zoom',
    active_drag='pan',
    tooltips=[
        ('year', '@year'),
        ('workout', '@workoutName'),
        ('correlation', '@c_'+var),
        ('p-value', '@p_'+var),
        ('N', '@N_'+var)
    ]
)
p.scatter(
    x=jitter('competitionDivision', width=0.6, range=p.x_range),
    y='c_'+var,
    fill_color='competitionDivisionColor',
    fill_alpha='p_alpha',
    line_color='gray',
    line_alpha='p_alpha',
    size='p_size',
    source=ColumnDataSource(dfw)
)
span = Span(
    dimension='width',
    location=0,
    line_color='gray',
    line_width=2,
    line_alpha=0.75
)
p.add_layout(span)
p.y_range.flipped = True
span.level = 'underlay'
save_or_show(p, 'scatter_workouts.html')

Below I plot the same information but arrange it differently. I sort the correlations in each group and plot them in sequence as vertical bars. This makes it easier to compare workouts in each group. Here again, the width and transparency of the bars are related to their p-values. Also remember that *negative* correlation means tall athletes did better, and because the axis is reversed, those bars point upward.

A couple of high-level points:
* It's important to remember the approximate nature of each correlation and not obsess over small differences. For example, the Fibonacci workout was repeated in adjacent years at the Games. In 2017 its correlation in the Men's division was `0.44`, but in 2018 it was `0.32`. So it looks like that workout is better for shorter athletes, but it's not clear that it is *exactly the 9th best workout for short athletes* or whatever other very specific conclusions you could make. Open workouts have better numbers, so their correlations ought to be more stable, and I'll point out an example in the next plot.
* I put vertical lines in each panel where the correlations flip from positive to negative. This is interesting because it gives a sense for what *overall fraction* of the workouts tilt toward tall/short athletes in each case. The Men's Games look pretty balanced, but the Women's Games workouts seem a little tilted toward *tall* athletes in aggregate. The Open workouts are a bit tilted toward short athletes in both divisions. Another way to see this would just be to look at the median workout, which is at the very center of each panel. For example, the median correlation at the Women's Games is a workout that slightly favors tall people.

Hover over the bars to see which workouts they represent and do your own exploring.
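
The position of the vertical divider and the "median workout" comparison can be sketched with a handful of numbers. The correlation values below are invented for illustration. The divider location mirrors the `(df['c_'+var] < 0).sum() - 0.5` expression used in the plotting cell, which places the line halfway between the last negative bar and the first non-negative one.

```python
from statistics import median

# invented, already-sorted correlations for one group of workouts
correlations = sorted([-0.31, -0.12, -0.05, 0.02, 0.08, 0.15, 0.27])

# bars sit at x = 0, 1, 2, ..., so the divider goes between two bar positions
n_negative = sum(c < 0 for c in correlations)
flip_location = n_negative - 0.5

# share of workouts tilted toward tall athletes (negative correlation)
fraction_negative = n_negative / len(correlations)

# the median workout is the bar at the center of the sorted panel
middle = median(correlations)

print(f"divider at x = {flip_location}")
print(f"fraction favoring tall athletes = {fraction_negative:.2f}")
print(f"median correlation = {middle}")
```

If the divider sits left of the panel's center, fewer than half of the workouts favor tall athletes and the median correlation is positive, and vice versa.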

In [7]:
figs = []
for K, df in dfw.groupby('competitionDivision'):
    
    df = df.sort_values(['c_'+var])
    #df = df[df['p_'+var] <= 0.1]
    source = dict(x=np.arange(len(df)))
    for col in df.columns:
        source[col] = df[col].values
    
    p = figure(
        title=f'Rank Correlation—Athlete {var.title()} & Workout Placement ({K})',
        y_axis_label='Spearman Rank Correlation',
        #active_scroll='wheel_zoom',
        height=250,
        width=1600,
        tooltips=[
            ('year', '@year'),
            ('workout', '@workoutName'),
            ('correlation', '@c_'+var),
            ('p-value', '@p_'+var),
            ('N', '@N_'+var)
        ]
    )
    p.vbar(
        x='x',
        top='c_'+var,
        fill_color='competitionDivisionColor',
        fill_alpha='p_alpha',
        line_color='gray',
        line_alpha='p_alpha',
        width='p_alpha',
        source=ColumnDataSource(source)
    )
    span = Span(
        dimension='width',
        location=0,
        line_color='gray',
        line_width=2,
        line_alpha=0.75
    )
    p.add_layout(span)
    span = Span(
        dimension='height',
        location=(df['c_'+var] < 0).sum() - 0.5,
        line_color='gray',
        line_width=1.5
    )
    p.add_layout(span)
    p.y_range.flipped = True
    span.level = 'underlay'
    p.xgrid.grid_line_alpha = 0.0
    figs.append(p)
save_or_show(column(figs), 'bar_workouts_sorted.html')

Below, I have all of the workouts arranged chronologically for each group, with some vertical lines separating each year's group of workouts. The width and transparency of the bars have the same meaning as in the plot above. Explore and compare as you like, but I'll point out another repeat workout.

Open workout 15.2 is identical to 14.2, so the correlations should be very similar. 
* We have `0.220` and `0.216` for the Men
* We have `0.186` and `0.205` for the Women

So they're very close in both cases, which is what we expect given the larger number of athletes in the Open 👍.
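
To put numbers on "very close", the snippet below computes the gap between the two correlations for each repeated workout mentioned in this notebook. All values are copied from the text above, in the order they are listed there; the dictionary labels are just for display.

```python
# (first value, second value) as quoted in the text
repeats = {
    "Open 14.2 and 15.2 (Men)": (0.220, 0.216),
    "Open 14.2 and 15.2 (Women)": (0.186, 0.205),
    "Games Fibonacci 2017 and 2018 (Men)": (0.44, 0.32),
}

# absolute difference between the two correlations for each repeat
gaps = {name: round(abs(a - b), 3) for name, (a, b) in repeats.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

The Open repeats differ by about 0.004 and 0.019, while the Games repeat differs by 0.12, consistent with the much smaller fields at the Games.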

In [8]:
figs = []
for K, df in dfw.groupby('competitionDivision'):

    # sort into a new frame; in-place sorting of a groupby slice can trigger SettingWithCopyWarning
    df = df.sort_values(['year', 'workoutNumber'])
    source = dict(
        x=list(
            zip(
                df.year.astype(str),
                df.workoutNumber.astype(str)
            )
        )
    )
    for col in df.columns:
        source[col] = df[col].values
    palette = Set2_7
    source['color'] = list(map(lambda i: palette[i % len(palette)], df.year))

    p = figure(
        title=f'Rank Correlation—Athlete {var.title()} & Workout Placement ({K})',
        y_axis_label='Spearman Rank Correlation',
        height=350,
        width=1600,
        x_range=FactorRange(
            *source['x'],
            group_padding=2
        ),
        tooltips=[
            ('year', '@year'),
            ('workout', '@workoutName'),
            ('correlation', '@c_'+var),
            ('p-value', '@p_'+var),
            ('N', '@N_'+var)
        ]
    )
    p.vbar(
        x='x',
        top='c_'+var,
        width='p_alpha',
        fill_color='competitionDivisionColor',
        fill_alpha='p_alpha',
        line_color='gray',
        line_alpha='p_alpha',
        source=ColumnDataSource(source)
    )
    span = Span(
        dimension='width',
        location=0,
        line_color='gray',
        line_width=2,
        line_alpha=0.75
    )
    p.add_layout(span)
    span.level = 'underlay'
    n = 0
    for k,sl in df.groupby('year'):
        n += len(sl) + 2
        span = Span(
            dimension='height',
            location=n-1,
            line_color='gray',
            line_width=1,
            line_alpha=0.5
        )
        p.add_layout(span)
        span.level = 'underlay'

    p.xaxis.major_label_text_alpha = 0
    p.y_range.flipped = True
    p.xgrid.grid_line_alpha = 0.0
    figs.append(p)
figs = column(figs)
save_or_show(figs, 'bar_workouts_chronological.html')

### Rank Correlations for Entire Competitions

Here's the money plot. Rank correlations can be computed for the overall results of entire competitions, instead of individual workouts. So I did that and plotted them below in the same bar-chart format. The same words of caution apply, especially for the Games, which have small sample sizes. Nevertheless... the results are pretty interesting. Some of these observations are suggested by the previous plots, but here they are:
* The *overall* bias in the Open is pretty small but is toward shorter athletes. The most slanted year was 2012, primarily because 12.1 was so bad for tall people, but some years of the Open are almost perfectly neutral with respect to height. The most balanced years were 2014, 2018, and 2020. The only clear case where the Open favored taller athletes was the Women's division in 2019, and this is because 19.1 was *really* good for tall people.
* The *overall* bias at the Games goes back and forth for the Men's individuals. Some years seem to favor tall guys, others favor short ones. Notably, the 2019 Games advantaged tall men, but the first stage of the 2020 Games went the other direction and favored shorter athletes pretty strongly.
* The Women's division at the Games usually tilts toward taller athletes. The only years with a noticeable preference for short women were 2010 and 2014, but these results are only mildly significant. Most years are neutral or favor taller athletes.

In [9]:
dfc = pd.read_csv(join('..', 'data', 'pro', 'competition_statistics.csv'))
dfc = dfc[(dfc.divisionNumber < 3) & (dfc['N_'+var] > 5)]
dfc['p_alpha'] = 1 - dfc['p_'+var]/2
dfc['competitionDivisionColor'] = dfc.competitionDivision.map(competitionDivision2color)

In [10]:
df = dfc.sort_values(['year', 'competitionDivision'])
source = dict(
    x=list(
        zip(
            df.year.astype(str),
            df.competitionDivision
        )
    )
)
for col in df.columns:
    source[col] = df[col].values

p = figure(
    title=f'Rank Correlation—Athlete {var.title()} & Event Placement',
    y_axis_label='Spearman Rank Correlation',
    height=400,
    width=1600,
    x_range=FactorRange(
        *source['x'],
        group_padding=3
    ),
    tooltips=[
        ('year', '@year'),
        ('division', '@divisionName'),
        ('competition', '@competitionType'),
        ('correlation', '@c_'+var),
        ('p-value', '@p_'+var),
        ('N', '@N_'+var)
    ]
)
p.vbar(
    x='x',
    top='c_'+var,
    width='p_alpha',
    fill_color='competitionDivisionColor',
    fill_alpha='p_alpha',
    line_color='gray',
    line_alpha='p_alpha',
    legend_group='competitionDivision',
    source=ColumnDataSource(source)
)
span = Span(
    dimension='width',
    location=0,
    line_color='gray',
    line_width=2,
    line_alpha=0.75
)
p.add_layout(span)
span.level = 'underlay'
p.xaxis.major_label_text_alpha = 0
p.y_range.flipped = True
save_or_show(p, 'bar_overall.html')

In [11]:
if saving:
    script, divs = components(component_graphics)
    with open(join('..', 'plots', 'script.js'), 'w') as ofile:
        ofile.write(script)
    print(type(divs))
    with open(join('..', 'plots', 'divs.txt'), 'w') as ofile:
        for k in divs:
            ofile.write(f'{k}\n\t{divs[k]}\n\n')

<class 'dict'>


Questions/comments about any of this? Feel free to let me know (email address is at the top).