# 01 — Data Exploration

This notebook covers all Day 1 exploration work:
1. Load and profile the Vickers & Vertosick (2016) dataset
2. EDA — distributions, relationships, and visualizations
3. Profile the Afonseca et al. training dataset

**Primary dataset:** Vickers & Vertosick — 2,303 recreational runners with self-reported training and race times.  
**Secondary dataset:** Afonseca et al. — 10.7M training records from 36,412 World Marathon Majors runners.

## Setup

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

from src.data_loading import load_vickers

sns.set_theme(style='whitegrid')
%matplotlib inline

---
## Part 1: Vickers Dataset — Load and Profile

Load the XLSX, standardize column names to snake_case, and check structure.

In [9]:
df = load_vickers()
print(f'Shape: {df.shape}')
df.dtypes.to_frame('dtype')

Shape: (2303, 50)


| column | dtype |
|---|---|
| id | int64 |
| adjusted | int64 |
| age | float64 |
| bmi | float64 |
| cohort1 | float64 |
| cohort2 | float64 |
| cohort3 | float64 |
| cohort4 | float64 |
| endurancecat | int64 |
| endurancespeed | int64 |


OK, everything is int64 or float64. Based on the dataset docs, though, some of these columns are really categorical codes.
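One way to act on that is to cast the integer-coded columns to pandas' `category` dtype, so summaries and plots treat them as labels rather than magnitudes. This is a minimal sketch on toy data. Which columns are actually categorical is an assumption here (`endurancecat` is picked on its name alone) and should be confirmed against the dataset documentation.

```python
import pandas as pd

# Toy frame standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "endurancecat": [1, 2, 2, 4],
    "age": [29.0, 35.0, 42.0, 51.0],
})

# Assumption: these columns hold category codes, not quantities.
categorical_cols = ["endurancecat"]

# astype with a dict converts only the listed columns and leaves the rest alone.
typed = toy.astype({col: "category" for col in categorical_cols})
print(typed.dtypes)
```

After the cast, `typed.describe(include="category")` reports counts and the most frequent level instead of a mean and standard deviation.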

In [10]:
df.describe().T

| column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 2303.0 | 1154.437256 | 667.163997 | 1.0 | 577.5 | 1153.0 | 1732.5 | 2311.0 |
| adjusted | 2303.0 | 0.939644 | 0.238197 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| age | 2303.0 | 36.421407 | 9.913292 | 16.0 | 29.0 | 35.0 | 42.0 | 74.0 |
| bmi | 2303.0 | 23.745838 | 3.052466 | 15.428386 | 21.704651 | 23.374725 | 25.231911 | 47.184647 |
| cohort1 | 929.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| cohort2 | 633.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| cohort3 | 493.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| cohort4 | 387.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| endurancecat | 2303.0 | 1.833695 | 0.656222 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| endurancespeed | 2303.0 | 3.440729 | 1.768493 | 1.0 | 2.0 | 3.0 | 4.0 | 10.0 |


Mean weekly mileage was 32 miles with a standard deviation of 17 miles, so weekly mileage varies a lot from runner to runner. I should look at that more closely during the thorough EDA.
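The ratio of the two figures, the coefficient of variation, puts a number on how large that spread is relative to the mean. This is a quick check using the rounded values quoted in the note, not the exact values from `describe()`:

```python
mean_mpw = 32.0  # mean weekly mileage quoted in the note (rounded)
std_mpw = 17.0   # standard deviation quoted in the note (rounded)

# Coefficient of variation: spread expressed as a fraction of the mean.
cv = std_mpw / mean_mpw
print(f"Coefficient of variation: {cv:.2f}")
```

A standard deviation of roughly half the mean suggests the distribution is wide and possibly right-skewed, which is worth confirming with a histogram.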

In [15]:
df.isnull().sum().to_frame("null count")

| column | null count |
|---|---|
| id | 0 |
| adjusted | 0 |
| age | 0 |
| bmi | 0 |
| cohort1 | 1374 |
| cohort2 | 1670 |
| cohort3 | 1810 |
| cohort4 | 1916 |
| endurancecat | 0 |
| endurancespeed | 0 |


Interesting. The most submitted race was the half marathon, followed by the 5k; the least submitted were the 5 mile and the 10 mile. Approximately 40% of runners submitted a marathon time, which is promising. Let's see how many of these marathon runners have other race times logged.
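Null counts can be turned into submission rates directly, which makes the comparison between races easier to read. This is a minimal sketch on toy data. The column names are the adjusted-time columns used elsewhere in this notebook, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; times are illustrative, not from the dataset.
toy = pd.DataFrame({
    "k5_ti_adj": [1500.0, np.nan, 1620.0, 1410.0],
    "mh_ti_adj": [6300.0, 7200.0, np.nan, 5900.0],
    "mf_ti_adj": [np.nan, 15300.0, np.nan, 12900.0],
})

# notna() gives True/False per cell; the column mean is the share of non-null rows.
submission_rate = toy.notna().mean().sort_values(ascending=False)
print(submission_rate.round(2))
```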

In [20]:
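# Runners with an adjusted marathon time; .copy() avoids a SettingWithCopyWarning
# when the new column is assigned below
has_marathon = df[df["mf_ti_adj"].notna()].copy()

# Adjusted-time columns for the other race distances
race_cols = ["k5_ti_adj", "k10_ti_adj", "m5_ti_adj", "m10_ti_adj", "mh_ti_adj"]
# Number of other races with a non-null adjusted time, per marathon runner
has_marathon["other_races"] = has_marathon[race_cols].notna().sum(axis=1)

has_marathon["other_races"].value_counts().sort_index().to_frame("# of other races")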

| other_races | # of other races |
|---|---|
| 1 | 436 |
| 2 | 493 |


Alright, so 493 runners have an adjusted marathon time plus adjusted times for two other races. That shaves n down significantly.
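To see how much n shrinks, the counts from the outputs above can be compared directly. This is simple arithmetic on numbers already shown in this notebook and assumes the value-counts table lists every marathon runner.

```python
n_total = 2303          # runners in the full dataset (from df.shape)
n_marathon = 436 + 493  # runners with an adjusted marathon time (sum of the table rows)
n_two_others = 493      # of those, runners with two other adjusted race times

share_of_all = n_two_others / n_total
share_of_marathoners = n_two_others / n_marathon
print(f"Share of all runners: {share_of_all:.1%}")
print(f"Share of marathon runners: {share_of_marathoners:.1%}")
```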

### Data quality checks

I'm looking for dupes and negatives, along with impossible values. All of this was self-reported to a magazine, so I want to be diligent.

In [21]:
# Duplicates
print(f'Duplicate rows: {df.duplicated().sum()}')

Duplicate rows: 0


In [25]:

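# Flag marathon times under 2 hours or over 8 hours (mf_ti_adj is in seconds).
# Rows with a null marathon time compare as False, so they are never flagged.
impossible = df[(df["mf_ti_adj"] < 120*60) | (df["mf_ti_adj"] > 480*60)]
print(f"Total impossible vals: {len(impossible)}")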


Total impossible vals: 0


Awesome. No impossible values. Vickers and Vertosick almost certainly cleaned this data before they used it, but it's always good to check.
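The check above covers only marathon times. The same idea extends to other columns with a small table of plausibility bounds. This is a sketch on toy data, and the bounds are illustrative assumptions, not values taken from the paper or the dataset docs.

```python
import pandas as pd

# Toy stand-in for df; the last row is deliberately implausible.
toy = pd.DataFrame({
    "age": [16.0, 35.0, 74.0, 130.0],
    "bmi": [15.4, 23.4, 47.2, 5.0],
})

# Assumption: hypothetical (low, high) bounds chosen for illustration only.
bounds = {"age": (10, 100), "bmi": (12, 60)}

flagged = {}
for col, (low, high) in bounds.items():
    # between() is inclusive on both ends and returns False for NaN,
    # so restrict to non-null rows before negating it.
    out_of_range = toy[col].notna() & ~toy[col].between(low, high)
    flagged[col] = int(out_of_range.sum())

print(flagged)
```

Keeping the bounds in one dictionary makes it easy to review and adjust them in a single place as more columns are profiled.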

### Vickers profiling summary

We have 493 data points to train on, which should be a decent amount for our purposes. The data seems pretty clean already: no duplicate rows and no impossible marathon times. Based on this, it's worth exploring the data further. I want to inspect the distribution of training volume, since that's a major feature. It's also worth inspecting which values are null in the data we will be working with, to make sure there are no systemic issues. EDA is next.

---
## Part 2: EDA — Distributions and Relationships

### Marathon finish time distribution

In [None]:
# Histogram of marathon finish times
fig, ax = plt.subplots(figsize=(10, 5))
# sns.histplot(df['marathon_time'], kde=True, ax=ax)
# ax.set_xlabel('Marathon Finish Time (min)')
# ax.set_title('Distribution of Marathon Finish Times')
# plt.tight_layout()
# plt.savefig('../results/figures/marathon_time_dist.png', dpi=150)

*What do you see? Is it normally distributed? Skewed?*
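One way to back up the visual impression with numbers is to compare the mean and median and compute the sample skewness with `Series.skew()`. This is a minimal sketch on made-up finish times in minutes, not values from the dataset.

```python
import pandas as pd

# Illustrative finish times in minutes with a long right tail.
times_min = pd.Series([195, 210, 225, 230, 240, 255, 270, 300, 360, 420])

mean_time = times_min.mean()
median_time = times_min.median()
skewness = times_min.skew()  # positive means a longer right tail

print(f"Mean: {mean_time:.1f}, median: {median_time:.1f}, skew: {skewness:.2f}")
```

A mean noticeably above the median together with positive skewness points to a right-skewed distribution, in which case a log transform of the target may be worth trying later.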

### Weekly mileage distribution

In [None]:
# Histogram of weekly mileage
fig, ax = plt.subplots(figsize=(10, 5))
# sns.histplot(df['weekly_mileage'], kde=True, ax=ax)
# ax.set_xlabel('Weekly Mileage')
# ax.set_title('Distribution of Weekly Training Mileage')
# plt.tight_layout()
# plt.savefig('../results/figures/weekly_mileage_dist.png', dpi=150)

*What do you see?*

### Core relationship: weekly mileage vs marathon time

In [None]:
# Scatter plot — this should show a clear negative relationship
fig, ax = plt.subplots(figsize=(10, 6))
# sns.scatterplot(data=df, x='weekly_mileage', y='marathon_time', alpha=0.5, ax=ax)
# ax.set_xlabel('Weekly Mileage')
# ax.set_ylabel('Marathon Time (min)')
# ax.set_title('Weekly Mileage vs Marathon Finish Time')
# plt.tight_layout()
# plt.savefig('../results/figures/mileage_vs_marathon.png', dpi=150)

*Describe the relationship. Be specific — e.g., "runners averaging 40+ miles/week finish ~30 min faster than those under 20 mpw."*
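A statement like the example above can be produced by binning mileage and comparing medians per band. This is a sketch on toy data. The column names follow the placeholder names in the plotting cell, and the bin edges are an assumption chosen to match the example wording.

```python
import pandas as pd

# Toy stand-in for df; values are made up for illustration.
toy = pd.DataFrame({
    "weekly_mileage": [12, 18, 25, 33, 41, 55],
    "marathon_time": [290, 275, 260, 245, 225, 210],
})

# right=False makes the bins [0, 20), [20, 40), [40, inf).
bins = [0, 20, 40, float("inf")]
labels = ["<20 mpw", "20-39 mpw", "40+ mpw"]
toy["mileage_band"] = pd.cut(toy["weekly_mileage"], bins=bins, labels=labels, right=False)

# Median finish time per band; observed=True skips bands with no runners.
band_medians = toy.groupby("mileage_band", observed=True)["marathon_time"].median()
print(band_medians)

# Gap between the lowest and highest mileage bands, in minutes.
gap_min = band_medians["<20 mpw"] - band_medians["40+ mpw"]
print(f"Median gap: {gap_min:.1f} min")
```

Medians are used rather than means so that a few very slow finishers do not dominate the comparison.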

### Half marathon time vs marathon time

In [None]:
# Scatter plot — should show strong positive correlation
fig, ax = plt.subplots(figsize=(10, 6))
# sns.scatterplot(data=df, x='half_marathon_time', y='marathon_time', alpha=0.5, ax=ax)
# ax.set_xlabel('Half Marathon Time (min)')
# ax.set_ylabel('Marathon Time (min)')
# ax.set_title('Half Marathon Time vs Marathon Finish Time')
# plt.tight_layout()
# plt.savefig('../results/figures/half_vs_marathon.png', dpi=150)

*What do you see?*

### Marathon time by sex and age group

In [None]:
# Box plot — marathon time by sex
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# sns.boxplot(data=df, x='sex', y='marathon_time', ax=axes[0])
# axes[0].set_title('Marathon Time by Sex')

# Box plot — marathon time by age group (you may need to bin ages first)
# sns.boxplot(data=df, x='age_group', y='marathon_time', ax=axes[1])
# axes[1].set_title('Marathon Time by Age Group')
# plt.tight_layout()
# plt.savefig('../results/figures/marathon_by_demographics.png', dpi=150)

*What do you see?*

### Correlation heatmap

In [None]:
# Correlation heatmap of all numeric features
fig, ax = plt.subplots(figsize=(12, 8))
# corr = df.select_dtypes(include=np.number).corr()
# sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=ax)
# ax.set_title('Feature Correlation Heatmap')
# plt.tight_layout()
# plt.savefig('../results/figures/correlation_heatmap.png', dpi=150)

*Which features are most correlated with marathon time? Which features are highly correlated with each other (redundant)?*
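With many columns, the heatmap gets hard to read, so it can help to list the strongest pairs explicitly. This is a sketch on toy data with placeholder column names. It keeps the upper triangle of the correlation matrix so each pair appears once, then sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric columns of df; values are made up.
toy = pd.DataFrame({
    "marathon_time": [290, 275, 260, 245, 225, 210],
    "half_marathon_time": [138, 131, 124, 116, 107, 100],
    "bmi": [24.0, 22.5, 25.1, 21.9, 23.3, 22.8],
})

corr = toy.corr()

# Boolean mask for the strict upper triangle (k=1 excludes the diagonal).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() blanks everything outside the mask; stack() + dropna() leaves one row per pair.
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Pairs of predictors near the top of this list are candidates for dropping or combining, since they carry largely the same information.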

### Scatter matrix of top features

In [None]:
# Pairplot of the 4-5 most promising features
# top_features = ['marathon_time', 'half_marathon_time', 'weekly_mileage', '10k_time', 'bmi']
# sns.pairplot(df[top_features].dropna(), corner=True)
# plt.savefig('../results/figures/pairplot_top_features.png', dpi=150)

*What do you see?*

### EDA summary

*Write a paragraph summarizing the key relationships and takeaways from the visualizations above.*

---
## Part 3: Afonseca Training Dataset — Profile

Quick look at the large-scale training data to decide whether it's usable within the 5-day timeline.

In [None]:
# Load the weekly-level Parquet file
# af = pd.read_parquet('../data/raw/afonseca/weekly.parquet')
# print(f'Shape: {af.shape}')
# af.dtypes

In [None]:
# af.describe()

In [None]:
# How many unique athletes? How many have marathon major participation?
# print(f'Unique athletes: {af["athlete"].nunique()}')

In [None]:
# Distribution of weekly training distance and duration

### Key limitation

The Afonseca dataset records which marathon(s) each athlete participated in but **does not include actual finish times**. To use this data for prediction, you would need to cross-reference with scraped race results — non-trivial given athletes are anonymized.

### Decision: use this dataset or set it aside?

*Write your decision and reasoning here. Update `data/README.md` accordingly.*