In [None]:
import altair as alt
import numpy as np
import pandas as pd
%matplotlib inline
cc_df = pd.read_csv("combined_results.csv", index_col=0)

In [None]:
cc_df.head()

# Exploratory Analysis

In [None]:
alt.Chart(cc_df).mark_point().encode(
    x='Place_Section:Q',
    y='Place_State:Q',
    column='Year:N',
    color='Year:N'
).properties(
    width=180,
    height=180
)

In [None]:
# Difference between state and section placement for each runner.
# Positive values mean a numerically higher (worse) place at state than at section.
discrep = cc_df[["Place_Section", "Place_State", "Year"]].copy()
discrep["Place_Diff"] = discrep["Place_State"] - discrep["Place_Section"]

In [None]:
alt.Chart(discrep).mark_point().encode(
    x='Place_Section:Q',
    y='Place_Diff:Q',
    column='Year:N',
    color='Year:N'
).properties(
    width=180,
    height=180
)

Combining all years into a single plot, rather than one panel per year, may be more useful.

In [None]:
interval = alt.selection_interval()

base = alt.Chart(discrep).mark_point().encode(
    x='Place_Section:Q',
    color=alt.condition(interval, 'Year:N', alt.value('lightgray')),
    #tooltip='Name'
).properties(
    selection=interval
)

hist = alt.Chart(discrep).mark_bar().encode(
    x='count()',
    y='Year:N',
    color='Year:N'
).properties(
    width=800,
    height=80
).transform_filter(
    interval
)

scatter = base.encode(y='Place_State:Q') | base.encode(y='Place_Diff:Q')

scatter & hist

With this plot, we can see the distribution of placements in its entirety. Selecting a region in either scatter plot highlights the same points in the other, and the bar chart below shows the count of selected points by year. While this helps reveal the general trend in the data, we may still want to see how each year relates to the entire distribution. An interactive legend accomplishes this: click a year to highlight it, or shift-click to select several.

In [None]:
selector = alt.selection_multi(fields=["Year"])
color = alt.condition(selector,
                      alt.Color('Year:N', legend=None),
                      alt.value('lightgray'))

base = alt.Chart(discrep).mark_point().encode(
    x='Place_Section:Q',
    color = color
).properties(
    selection=selector
)

legend = alt.Chart(discrep).mark_point(
    filled=True, size=200).encode(
    y=alt.Y('Year:N', axis=alt.Axis(orient='right')),
    color=color).properties(selection=selector)

scatter = base.encode(y='Place_State:Q') | base.encode(y='Place_Diff:Q')

scatter | legend

As the graphs show, the distribution of the difference between placement in the section meet and placement in the state meet is cone-like: the farther an observation is from the origin, the more variable the difference appears to be. The yearly breakdowns are also very similar in distribution. 2018 and 2017 appear slightly less variable than earlier years, but judging by eye, year does not seem to play a major role in the data.
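
One way to check the cone-like pattern numerically is to bin the section placements and compare the spread of the placement difference within each bin. The sketch below is only an illustration: the helper `spread_by_bin` and the frame `demo_discrep` are hypothetical names, and the data is synthetic, generated so that the noise grows with section place. To apply it to the real data, pass `discrep` instead.

```python
import numpy as np
import pandas as pd

def spread_by_bin(df, n_bins=4):
    """Standard deviation of Place_Diff within equal-width bins of Place_Section."""
    bins = pd.cut(df["Place_Section"], bins=n_bins)
    return df.groupby(bins, observed=True)["Place_Diff"].std()

# Synthetic stand-in for `discrep` (NOT the real results):
# the noise in Place_Diff is scaled by the section place, which produces a cone.
rng = np.random.default_rng(0)
place_section = np.tile(np.arange(1, 41), 25)  # places 1..40, repeated 25 times
demo_discrep = pd.DataFrame({
    "Place_Section": place_section,
    "Place_Diff": rng.normal(0, 1, place_section.size) * place_section,
})

spread = spread_by_bin(demo_discrep)
print(spread)
```

If the standard deviation rises steadily from the first bin to the last, that supports the cone-shaped reading of the scatter plot; a roughly flat profile would argue against it.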

In [None]:
alt.Chart(cc_df).mark_point().encode(
    x='float_Time_Section:Q',
    y='float_Time_State:Q',
    column='Year:N',
    color='Year:N'
).properties(
    width=180,
    height=180
)

As expected, times at the section meet and times at the state meet are highly correlated, and the relationship is nearly 1-to-1. This is reasonable, as a runner will likely run about the same time regardless of the competition level (section vs. state).
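
The strength of this relationship can be quantified with a per-year Pearson correlation. The following is a minimal sketch on synthetic data, not the real results: `time_correlation_by_year` and `demo_times` are hypothetical names, and the synthetic times use arbitrary units. Passing `cc_df` instead would give the values for the actual meets.

```python
import numpy as np
import pandas as pd

def time_correlation_by_year(df):
    """Pearson correlation between section and state times, one value per year."""
    return pd.Series({
        year: group["float_Time_Section"].corr(group["float_Time_State"])
        for year, group in df.groupby("Year")
    })

# Synthetic stand-in for `cc_df` (NOT the real results; arbitrary time units).
rng = np.random.default_rng(1)
years = np.repeat([2016, 2017, 2018], 200)
section = rng.normal(1000, 60, years.size)
demo_times = pd.DataFrame({
    "Year": years,
    "float_Time_Section": section,
    # State time = section time plus a small independent perturbation.
    "float_Time_State": section + rng.normal(0, 15, years.size),
})

corr_by_year = time_correlation_by_year(demo_times)
print(corr_by_year.round(3))
```

Values close to 1 in every year would back up the visual impression. Note that a high correlation alone does not show the relationship is 1-to-1; that also requires a slope near 1, which a simple linear fit could check.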