# Olympics

In this example, we explore the weights and heights of Olympic athletes using hvPlot and pandas.

In [6]:
import hvplot.pandas  # noqa
import pandas as pd
    

## Data loading

First, we will load in the data.

In [None]:
df = pd.read_csv('data/athlete_events.csv')
df.head()

## Analysis

Since we are focusing on the weight and height columns of our dataset, let's remove any NaN values that are in those columns.

In [8]:
df = df.dropna(subset=["Weight", "Height"])
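To make the effect of `subset` concrete, here is a minimal sketch on a made-up three-row frame (the names and values are invented for illustration and are not taken from the dataset). Rows are dropped only when `Weight` or `Height` is missing; a missing `Medal` does not remove a row.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Height": [180.0, np.nan, 165.0],
    "Weight": [75.0, 60.0, 55.0],
    "Medal": [np.nan, "Gold", np.nan],
})

# Only Weight and Height are checked; NaN in Medal does not drop a row
cleaned = toy.dropna(subset=["Weight", "Height"])
print(cleaned["Name"].tolist())
```

Row `B` is removed because its height is missing, while rows `A` and `C` survive even though they have no medal.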

One of the first things to look at is the average height of each sport.

In [9]:
avg_height_by_sport = df.groupby('Sport')['Height'].mean()
avg_height_by_sport = avg_height_by_sport.sort_values()
avg_height_by_sport.head()

Sport
Gymnastics             162.875641
Trampolining           166.563758
Diving                 166.662196
Rhythmic Gymnastics    167.795122
Weightlifting          167.822726
Name: Height, dtype: float64
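Because the result is sorted in ascending order, the shortest and tallest sports sit at the two ends of the Series, and the gap between them is a single subtraction. The sketch below shows this on a small invented frame (the heights are made up, not taken from the dataset).

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Sport": ["Basketball", "Basketball", "Gymnastics", "Gymnastics", "Diving"],
    "Height": [200.0, 190.0, 160.0, 164.0, 170.0],
})

# Mean per sport, then ascending sort so the shortest sport comes first
toy_avg = toy.groupby("Sport")["Height"].mean().sort_values()

shortest, tallest = toy_avg.index[0], toy_avg.index[-1]
gap = toy_avg.iloc[-1] - toy_avg.iloc[0]
print(shortest, tallest, gap)
```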

Using hvPlot, we can create a horizontal bar chart with `.hvplot.barh`. We then add labels and select basketball and gymnastics, the sports with the highest and lowest average heights. Finally, we overlay the chart, the highlighted selection, and the labels with the `*` operator.

In [10]:
bar = avg_height_by_sport.hvplot.barh(by='Sport', stacked=True, height=600)
txt = avg_height_by_sport.hvplot.labels("Sport", "Height", "{Height:.0f} cm", invert=True, text_align="left", text_baseline="middle", text_color="darkred")
bar_sel = bar.select(Sport=["Basketball", "Gymnastics"]).opts(fill_color="darkred")
(bar * bar_sel * txt.select(Sport=["Basketball", "Gymnastics"])).opts(title="Average Height of each Olympic Sport, a 28 cm difference")

We can also plot the difference between the average heights of basketball and gymnastics Olympians for each year. Interestingly, the difference has been growing wider, peaking in 1992.

In [11]:
# Select Height before averaging so non-numeric columns are never passed to mean()
avg_basketball_height = df[df["Sport"] == "Basketball"].groupby("Year")["Height"].mean()
avg_gymnastics_height = df[df["Sport"] == "Gymnastics"].groupby("Year")["Height"].mean()
avg_height_diff = avg_basketball_height - avg_gymnastics_height
avg_height_diff.hvplot.line()
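The subtraction above relies on pandas aligning the two Series on their `Year` index. A minimal sketch with invented per-year averages shows what happens when a year is present for one sport but not the other: the result for that year is NaN, which a line plot would show as a gap.

```python
import pandas as pd

# Toy per-year averages, invented for illustration; 1936 has no gymnastics entry
basketball = pd.Series({1936: 188.0, 1948: 190.0, 1952: 192.0})
gymnastics = pd.Series({1948: 165.0, 1952: 164.0})

# Subtraction aligns on the index (the year); unmatched years become NaN
diff = basketball - gymnastics
print(diff)

# Year with the largest difference, ignoring the unmatched years
peak_year = diff.dropna().idxmax()
print(peak_year)
```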


We can also look at height and weight at the same time for each medal and sport combination. Earlier we overlaid plots with `*`; we can also arrange two plots next to each other with `+`. To stack them vertically, we use `.cols(1)`.

Note that the Medal selector defaults to "nan", which does not display any data on the graphs, so it needs to be changed to another value.
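One possible alternative, offered here as an assumption rather than something this notebook does, is to replace the missing medals with a placeholder label so that non-medalists get a named entry of their own. The label `"No Medal"` and the toy frame below are hypothetical. The sketch writes to a copy, since later cells rely on `Medal` still containing NaN.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004],
    "Medal": ["Gold", np.nan, np.nan],
})

# "No Medal" is a hypothetical placeholder label; assign() returns a new frame
labelled = toy.assign(Medal=toy["Medal"].fillna("No Medal"))
print(labelled["Medal"].tolist())
```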

In [14]:
df = df.sort_values("Year")
height_plot = df.hvplot("Year", "Height", groupby=["Medal", "Sport"])
weight_plot = df.hvplot("Year", "Weight", groupby=["Medal", "Sport"])
combined_plots = height_plot + weight_plot
combined_plots.cols(1)

We can further narrow down the data by looking at specific events instead of sports. 

In [15]:
df.hvplot("Year", "Weight", groupby=["Medal", "Event"])


We can go a step further and look at the difference in weight and height between medalists and non-medalists. First, we compute separate averages for medalists and non-medalists. We use the `*` operator to overlay the medalist and non-medalist data for both the height and weight plots. Then we use the `+` operator to arrange the two plots.

The resulting graphs are interesting: in the athletics men's 100 metres event, there is a distinct difference in height and weight between medalists and non-medalists.

In [16]:
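# Average only the numeric columns we plot; other columns are non-numeric
medalist_avg = df[df["Medal"].notna()].groupby(["Year", "Event"])[["Height", "Weight"]].mean()
not_medalist_avg = df[df["Medal"].isna()].groupby(["Year", "Event"])[["Height", "Weight"]].mean()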

height_medalist_plot = medalist_avg.hvplot.line(x="Year", y="Height", groupby=["Event"], label='Medalist')
height_not_medalist_plot = not_medalist_avg.hvplot.line(x="Year", y="Height", groupby=["Event"], color="red", label='Non-Medalist')
height_plot = (height_medalist_plot * height_not_medalist_plot).opts(title="Height", legend_position="bottom_right")

weight_medalist_plot = medalist_avg.hvplot.line(x="Year", y="Weight", groupby=["Event"], label='Medalist')
weight_not_medalist_plot = not_medalist_avg.hvplot.line(x="Year", y="Weight", groupby=["Event"], color="red", label='Non-Medalist')
weight_plot = (weight_medalist_plot * weight_not_medalist_plot).opts(title="Weight", legend_position="bottom_right")

combined_plots = height_plot + weight_plot

combined_plots.cols(1)
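The same split can also be reduced to numbers. The sketch below uses a small invented frame (one year, one event, made-up values) to show the `notna()` / `isna()` split and the resulting gap between the two groups. The names `medalists`, `others`, and `gap` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000],
    "Event": ["100 m"] * 4,
    "Height": [185.0, 183.0, 178.0, 176.0],
    "Weight": [80.0, 78.0, 72.0, 70.0],
    "Medal": ["Gold", "Silver", np.nan, np.nan],
})

cols = ["Height", "Weight"]
medalists = toy[toy["Medal"].notna()].groupby(["Year", "Event"])[cols].mean()
others = toy[toy["Medal"].isna()].groupby(["Year", "Event"])[cols].mean()

# Positive values mean medalists were taller or heavier on average
gap = medalists - others
print(gap)
```

Because both frames share the `(Year, Event)` index, the subtraction lines up each event and year automatically.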