In [69]:
from dotenv import load_dotenv
from urllib.parse import urlencode
import requests
import os
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
import pandas as pd
import requests
import urllib3
import seaborn as sns
import numpy as np
from pandas import json_normalize

## Setup for Accessing Strava Data

When making a [Strava App](https://www.strava.com/settings/api), the access token scope is not sufficient for actually getting activity data. This is because the "/athlete/activities" API endpoint requires either a token with scope "activity:read" or "activity:read_all". Since a Strava API application only gets you a read token (which funnily enough is not sufficient to pull activity data on its own), the easiest way to overcome this is the following.

1) Make a request using your clientID through a browser to the http://www.strava.com/oauth/authorize. Make sure to set scope=activity:read or your desired scope in the request. redirect_url=http://localhost is sufficient in this request since it is whitelisted by default and does not need to actually successfully redirect
2) After making the request, you will be redirected to an OAuth page for your created Strava API application. Grant it permission to login with the requested scope.
3) view the URL returned (the actual browser probably just shows nothing since it tried to redirect to localhost) and save the code that was returned into the response.
4) Using this code, make a new request to https://www.strava.com/oauth/token, making sure to use your normal clientID, clientSecret, and code=your_code in the request and grant_type=authorization_code. This can be done in a Python script now.
5) If successful, the request should return a new access_token and a refresh_token, which can be saved into your preferred way for storing secrets (this project uses a .env file that is .gitignored).

The provided access_refresh tokens supposedly do not expire, so this process only needs to be done once. While roundabout, this appears to be the easiest way for handling this sort of issues when working with Strava's API in an isolated environment, rather than an active application with a server and browser to handle OAuth request.

In [12]:
# Set Strava URL for accessing API
activities_url = 'https://www.strava.com/api/v3/athlete/activities'

In [9]:
# Load .env file contents
load_dotenv()

True

## Request all my Strava Activity Data

Using the access token with the appropriate scope for reading my activities from Strava, I create request headers to use

In [None]:
header = {'Authorization': 'Bearer ' + os.getenv("STRAVA_ACCESS_TOKEN")}

This function loops through the Strava response data and adds it to an output array, if any were returned. This is because when hitting the 'athlete/activity' GET API, the items are returned in a list without an associated key (i.e. 'data') so the JSON can be directly iterated over or appended to a list. When all activites are parsed, the next page request will return an empty JSON item back, so iteration will stop there.

The maximum results per page that can be configured is 200, which is used to reduce the chance of hitting Strava's API limits.

Small functionality is added to check for an error of hitting the Rate Limit, since the boolean condition does not account for getting a response that is not actually containing data (since there isn't a key to check for when data is returned).

Strava has a 100 read requests / 15 minute limit on top of a 1000 read limit per day, which isn't an issue when dealing with just my activites but could be difficult if parsing for multiple athletes.

In [6]:
# Function to request activities data
def loop_through_pages(page):
    # start at page ...
    page = page
    # set new_results to True initially
    new_results = True
    # create an empty array to store our combined pages of data in
    data = []
    while new_results:
        # Give some feedback
        print(f'You are requesting page {page} of your activities data ...')
        # request a page + 200 results
        get_strava = requests.get(activities_url, headers=header, params={'per_page': 200, 'page': f'{page}'}).json()

        if 'message' in get_strava:
            if get_strava['message'] == "Rate Limit Exceeded":
                print("Rate Limited Exceeded, please wait before retrying")
                break

        # save the response to new_results to check if its empty or not and close the loop
        new_results = get_strava
        # add our responses to the data array
        data.extend(get_strava)
        # increment the page
        page += 1
    # return the combine results of our get requests
    return data

# call the function to loop through our strava pages and set the starting page at 1
my_dataset = loop_through_pages(1)

You are requesting page 1 of your activities data ...
You are requesting page 2 of your activities data ...


In [13]:
print(f'Found {len(my_dataset)} activites!')

Found 106 activites!


## Converting Activities into a proper dataset

The JSON list returned from Strava can conveniently be converted into a Pandas dataframe using ```json_normalize()```

In [None]:
# JSON list is flattened into a data frame
df = json_normalize(my_dataset).reset_index()

In [19]:
df.head()

Unnamed: 0,resource_state,name,distance,moving_time,elapsed_time,total_elevation_gain,type,sport_type,workout_type,id,...,external_id,from_accepted_tag,pr_count,total_photo_count,has_kudoed,athlete.id,athlete.resource_state,map.id,map.summary_polyline,map.resource_state
0,2,Saturday long run,16150.5,5725,6036,33.7,Run,Run,2.0,16034518965,...,4B217AF1-1BDD-4B7F-AAAE-A1CE9EE70476-activity.fit,False,7,0,False,152953032,1,a16034518965,o~maGxo~pLKi@Yk@e@?ILGOMKFaDVaCZeA_@yAc@eDMYGs...,2
1,2,Happy Friday,9116.8,3490,3672,31.9,Run,Run,0.0,16024185362,...,A12C2EF1-BE4A-4DA1-AA35-E028DD6E0985-activity.fit,False,0,0,False,152953032,1,a16024185362,g~maGho~pL@KCJI@BEGGAQQY@CG?F?CBOMo@UKM@_AEGCO...,2
2,2,Warm up/cool down to and from gym,3467.2,1342,2864,16.1,Run,Run,0.0,16014778959,...,2D0DEE1F-3E6D-4716-B013-C2BFE6EDF58D-activity.fit,False,0,0,False,152953032,1,a16014778959,w}maG~p~pLA?CPORDRANBLP@PZ?JEZMBe@^GBCAQHIJAND...,2
3,2,Gym speed,4345.2,1380,1380,0.0,Run,Run,3.0,16014794615,...,,False,0,0,False,152953032,1,a16014794615,,2
4,2,No run club :(,8103.0,2940,3132,35.5,Run,Run,3.0,16003878242,...,315B6DEB-C192-4459-85DA-61908A525528-activity.fit,False,0,0,False,152953032,1,a16003878242,u{maGpn~pLYHMEGKCUDWCWQkASgBI[OoA?CZUJMK]Ea@M]...,2


In [28]:
print(f'Shape: {df.shape}, Column Names: {df.columns}')

Shape: (106, 51), Column Names: Index(['index', 'resource_state', 'name', 'distance', 'moving_time',
       'elapsed_time', 'total_elevation_gain', 'type', 'sport_type',
       'workout_type', 'id', 'start_date', 'start_date_local', 'timezone',
       'utc_offset', 'location_city', 'location_state', 'location_country',
       'achievement_count', 'kudos_count', 'comment_count', 'athlete_count',
       'photo_count', 'trainer', 'commute', 'manual', 'private', 'visibility',
       'flagged', 'gear_id', 'start_latlng', 'end_latlng', 'average_speed',
       'max_speed', 'has_heartrate', 'heartrate_opt_out',
       'display_hide_heartrate_option', 'elev_high', 'elev_low', 'upload_id',
       'upload_id_str', 'external_id', 'from_accepted_tag', 'pr_count',
       'total_photo_count', 'has_kudoed', 'athlete.id',
       'athlete.resource_state', 'map.id', 'map.summary_polyline',
       'map.resource_state'],
      dtype='object')


Before cleaning the dataset more, we want to reduce down to some of the more relevant columns.

I am choosing to keep:

* name
* distance
* moving_time
* elapsed_time (When a run has a big difference between elapsed_time and moving_time, does it affect outputs? Running in the city often results in lots of waiting at red lights)
* total_elevation_gain
* sport_type (for filtering, even though I know all of these are or should be runs with a few activites I miscategorized as hikes)
* workout_type (distinguishing runs by recovery, long, etc.)
* start_date_local
* average_speed
* max_speed
* elev_high
* elev_low
* pr_count

Unfortunately, I have not saved much in terms of geographical data to Strava, which would have been nice to compare my performance pre/post moving (due to a large elevation change and environment difference) but that will have to wait for another time.

In [282]:
# Define columns to keep
keep_columns = ["name","distance","moving_time","elapsed_time","total_elevation_gain","type","sport_type", "workout_type", "start_date_local", "average_speed","max_speed","elev_high","elev_low", "pr_count"]

# Drop columns and save into a reduced df
df_reduce_cols = df[keep_columns]

I know I have some biking and strength workouts saved in here so lets filter down to just my runs. Additionally, I am chosing to drop walks as well since they may skew analysis of runs. Virtual Runs is the category I used to classify treadmill running so those should stay.

In [283]:
df_reduce_cols.type.unique()

array(['Run', 'Walk', 'VirtualRun', 'Workout', 'Ride'], dtype=object)

In [284]:
# Save length of df to compare
len_before = len(df_reduce_cols)

# Filter to only running activities
df_reduce_cols = df_reduce_cols[df_reduce_cols['type'].isin(["Run", "VirtualRun"])]

df_reduce_cols.head()

Unnamed: 0,name,distance,moving_time,elapsed_time,total_elevation_gain,type,sport_type,workout_type,start_date_local,average_speed,max_speed,elev_high,elev_low,pr_count
0,Saturday long run,16150.5,5725,6036,33.7,Run,Run,2.0,2025-10-04T15:22:14Z,2.821,4.82,11.0,1.4,7
1,Happy Friday,9116.8,3490,3672,31.9,Run,Run,0.0,2025-10-03T17:13:26Z,2.612,5.42,12.7,1.5,0
2,Warm up/cool down to and from gym,3467.2,1342,2864,16.1,Run,Run,0.0,2025-10-02T18:08:42Z,2.584,6.72,14.8,4.0,0
3,Gym speed,4345.2,1380,1380,0.0,Run,Run,3.0,2025-10-02T16:35:37Z,3.149,0.0,,,0
4,No run club :(,8103.0,2940,3132,35.5,Run,Run,3.0,2025-10-01T18:06:35Z,2.756,7.6,23.9,3.0,0


In [285]:
print(f'Length before filter: {len_before} vs. length after filter: {len(df_reduce_cols)}')

Length before filter: 106 vs. length after filter: 99


Now that the data is reduced down to the columns I'm interested in, I will start with some clean up tasks.

1) Distance is given in meters. This is okay to keep for now but I want to add a column with the distance in miles.
2) moving_time is in seconds. This is also fine, but I'd like to get an hh:mm:ss timestamp that would be useful for a visualization
3) Same as #2 but for the moving time as well.
4) Adding a new column that indicates whether a run took place in the gym. This is done by checking for NaN values in the elevation fields, since those activites are manually entered by me and I know are consistent since this is my data. If possible I would further distinguish between gym vs. treadmill vs. traditional run but there isn't a very good way to do it without parsing my activity descriptions.
5) Adding 0 instead of NaN for elevation columns.
6) Convert all elevation columns (low, high, total) to feet from meters
7) Convert average_speed and max_speed to mph (from m/s)
8) Convert workout_type to string where 0: None (default), 1: Race, 2: Long Run,3: Workout
9) Clean up time stamp by splitting date/time components

In [29]:
# Define some simple helper functions to apply to columns
def time_in_hhmmss(time):
    return str(datetime.timedelta(seconds=time))

def convert_speed(speed):
    return speed * 2.23

def meters_to_miles(distance):
    return distance * 0.000621371192

def meters_to_feet(distance):
    return distance * 3.28084

def clean_date_time(timestamp):
    return timestamp.replace('T', ' ')[:-1]

The data cleaning pipeline can be constructed below in the following steps

In [287]:
# 1) Add column for distance in miles
print("Converting meters to miles")
df_reduce_cols.loc[:, ['distance_miles']] = df_reduce_cols.distance.apply(meters_to_miles)

# 2) Convert moving_time to hh:mm:ss
df_reduce_cols.loc[:, ["moving_time_str"]] = df_reduce_cols.moving_time.apply(time_in_hhmmss)

# 3) Convert ellapsed_time to hh:mm:ss
df_reduce_cols.loc[:, ["elapsed_time_str"]] = df_reduce_cols.elapsed_time.apply(time_in_hhmmss)

# 4) Add new column is_gym_run to check if run took place on treadmill or at gym
df_reduce_cols.loc[:, ["is_gym_run"]] = df_reduce_cols.elev_high.where(df_reduce_cols.elev_high.isna(), False).fillna(True)

# 5) Replace NaNs in elev_low and elev_high with 0
df_reduce_cols.loc[:, ["elev_low", "elev_high"]] = df_reduce_cols[["elev_low", "elev_high"]].fillna(0)

# 6) Convert elevations from meters to feet
df_reduce_cols.loc[:, ["elev_low"]] = df_reduce_cols.elev_low.apply(meters_to_feet)
df_reduce_cols.loc[:, ["elev_high"]] = df_reduce_cols.elev_high.apply(meters_to_feet)
df_reduce_cols.loc[:, ["total_elevation_gain"]] = df_reduce_cols.total_elevation_gain.apply(meters_to_feet)


# 7) Convert average_speed and max_speed from m/s to mph
df_reduce_cols.loc[:, ["average_speed"]] = df_reduce_cols.average_speed.apply(convert_speed)
df_reduce_cols.loc[:, ["max_speed"]] = df_reduce_cols.max_speed.apply(convert_speed)

# 8) Convert workout_type to categorical string
df_reduce_cols.loc[:, ["workout_type"]] = df_reduce_cols.workout_type.fillna(0) # Set any NaN workout types to a none workout
df_reduce_cols.loc[:, ["workout_type_str"]] = df_reduce_cols.workout_type.map({0:"None", 1:"Race", 2:"Long Run", 3:"Workout"})

# 9) Break start_date_local into two separate columns
df_reduce_cols = df_reduce_cols.rename(columns={"start_date_local":"start_date_full"})
df_reduce_cols.loc[:, ["start_date_full"]] = df_reduce_cols.start_date_full.apply(clean_date_time)
df_reduce_cols.loc[:, ["start_date_local"]] = pd.to_datetime(df_reduce_cols.start_date_full).dt.date
df_reduce_cols.loc[:, ["start_time_local"]] = pd.to_datetime(df_reduce_cols.start_date_full).dt.time

# 10) Drop columns not needed anymore after cleaning
cols_to_drop = ["moving_time", "elapsed_time", "start_date_full", "workout_type", "type", "sport_type"]
df_reduce_cols.drop(columns=cols_to_drop, inplace=True)

# 11) Rename columns now that old ones are dropped and save to run_df_cleaned
run_df_cleaned = df_reduce_cols.rename(columns={"moving_time_str":"moving_time", "elapsed_time_str":"elapsed_time", "workout_type_str":"workout_type"})

Converting meters to miles


  df_reduce_cols.loc[:, ["is_gym_run"]] = df_reduce_cols.elev_high.where(df_reduce_cols.elev_high.isna(), False).fillna(True)


The cleaned dataset appears below with all the listed changes made.

In [288]:
run_df_cleaned.head()

Unnamed: 0,name,distance,total_elevation_gain,average_speed,max_speed,elev_high,elev_low,pr_count,distance_miles,moving_time,elapsed_time,is_gym_run,workout_type,start_date_local,start_time_local
0,Saturday long run,16150.5,110.564308,6.29083,10.7486,36.08924,4.593176,7,10.035455,1:35:25,1:40:36,False,Long Run,2025-10-04,15:22:14
1,Happy Friday,9116.8,104.658796,5.82476,12.0866,41.666668,4.92126,0,5.664917,0:58:10,1:01:12,False,,2025-10-03,17:13:26
2,Warm up/cool down to and from gym,3467.2,52.821524,5.76232,14.9856,48.556432,13.12336,0,2.154418,0:22:22,0:47:44,False,,2025-10-02,18:08:42
3,Gym speed,4345.2,0.0,7.02227,0.0,0.0,0.0,0,2.699982,0:23:00,0:23:00,True,Workout,2025-10-02,16:35:37
4,No run club :(,8103.0,116.46982,6.14588,16.948,78.412076,9.84252,0,5.034971,0:49:00,0:52:12,False,Workout,2025-10-01,18:06:35


In [289]:
# Confirm that we don't have any NA values hiding
run_df_cleaned.isna().sum()

name                    0
distance                0
total_elevation_gain    0
average_speed           0
max_speed               0
elev_high               0
elev_low                0
pr_count                0
distance_miles          0
moving_time             0
elapsed_time            0
is_gym_run              0
workout_type            0
start_date_local        0
start_time_local        0
dtype: int64

This dataset is saved to a csv for reading to avoid having to rescrape from Strava repeatedly.

In a future update of this project, I would like to create a Python script that will do the same scraping, starting from my last activity and get all my new data, that can be executed on a week interval (i.e. every Sunday as a cronjob).

In [290]:
# Save the output as a csv for re-reading in the future
run_df_cleaned.to_csv('data/greg_runs.csv', index=False)

## Getting my current PRs

While the newly created dataset above will be used for visualizing my run data, I also want to do some predictive modeling since I am preparing for my first marathon in Feb. 2026. To do this, first, I will add my personal PRs. manually extracted from Strava.

For whatever reason, the only way to get PR data is to pull each activity by ID, parse through best effort splits, identify the ones for the fixed distances and save it that way, which is not feasable with hundreds of activites and a 100 requests per 15 minutes limit.

In [14]:
my_prs = pd.DataFrame(columns=['1mile', '5K','10K','HalfMarathon'])

my_prs.loc[0] = [466, 1496, 3320, 7179]

my_prs

Unnamed: 0,1mile,5K,10K,HalfMarathon
0,466,1496,3320,7179


These PRs are saved for logging purposes. When my race is closer, I will update this manually and save them for further comparisons later down the line.

In [48]:
my_prs.to_csv('data/greg_prs_Oct25.csv', index=False)

## Visualizing the data with Tableau

Rather than plot in mathplotlib, I wanted to incorporate the data into a Tableau Dashboard that can be embedded into my personal website. This project is under development and the current version can be found at [here]().

I intend to keep the Tableau Dashboard updated by turning this notebook into a python script that can be run weekly to pull new data that is fed into the dashboard, keeping the visualizations up to date with my current progress. This part of the proejct is also under development currently.

The in-progress Tableau file is included in this repository: ```figs/RunningDataVisualizations.twb```

## Predicting my first marathon with sklearn LinearRegression()

In order to fit a model to estimate my marathon time from my current training PRs, I need data. I used Kaggle to source marathon datasets which contained splits for various points in the marathon. The idea is to take the splits, and use them as inputs to fit a regression model to predict marathon times.

The following datasets are found in this projct (along with links to their original sources):

* data/hk_elite.csv - Elite runner times from the 2016 Hong Kong Marathon (pulled from MelvinCheung's [Kaggle Post](https://www.kaggle.com/datasets/melvincheung/hong-kong-marathon-2016))
* data/hk_normal.csv - Result times for non-elite field from 2016 Hong Kong Marathon (pulled from MelvinCheung's [Kaggle Post](https://www.kaggle.com/datasets/melvincheung/hong-kong-marathon-2016))

### Data Preparation

In [146]:
hk_elite = pd.read_csv('data/hk_elite.csv')
hk_normal = pd.read_csv('data/hk_normal.csv')

In [147]:
hk_full = pd.concat([hk_elite, hk_normal])
hk_full.head()

Unnamed: 0,Overall Position,Gender Position,Category Position,Category,Race No,Country,Official Time,Net Time,10km Time,Half Way Time,30km Time
0,1,1,1,MMS,21080,Kenya,2:12:12,2:12:11,0:30:35,1:04:48,1:33:36
1,2,2,1,MMI,14,Kenya,2:12:14,2:12:13,0:30:34,1:04:48,1:33:36
2,3,3,2,MMI,2,Ethiopia,2:12:20,2:12:18,0:30:35,1:04:49,1:33:36
3,4,4,2,MMS,21077,Kenya,2:12:29,2:12:27,0:30:35,1:04:48,1:33:36
4,5,5,3,MMI,18,Ethiopia,2:12:47,2:12:46,0:30:34,1:04:48,1:33:36


In [148]:
print(f'Full Hong Kong 2016 data has columns {hk_full.columns} and contains {hk_full.shape[0]} records')

Full Hong Kong 2016 data has columns Index(['Overall Position', 'Gender Position', 'Category Position', 'Category',
       'Race No', 'Country ', 'Official Time', 'Net Time', '10km Time',
       'Half Way Time', '30km Time'],
      dtype='object') and contains 12849 records


When reducing the full dataset down to the times we are interested in, na times are dropped as they represent a DNF which we don't want to predict for in this case.

In [149]:
hk_train = hk_full[['10km Time', 'Half Way Time', 'Official Time']].dropna()

hk_train.head()

Unnamed: 0,10km Time,Half Way Time,Official Time
0,0:30:35,1:04:48,2:12:12
1,0:30:34,1:04:48,2:12:14
2,0:30:35,1:04:49,2:12:20
3,0:30:35,1:04:48,2:12:29
4,0:30:34,1:04:48,2:12:47


In [150]:
def timestamp_to_seconds(time):
    dt = datetime.strptime(time, "%H:%M:%S")
    delta = dt - datetime(1900, 1, 1)
    return delta.total_seconds()

In [151]:
def preprocess_hongkong_data(df):
    df.loc[:, '10K'] = df['10km Time'].apply(timestamp_to_seconds)
    df.loc[:, 'HalfMarathon'] = df['Half Way Time'].apply(timestamp_to_seconds)
    df.loc[:, 'FullMarathon'] = df['Official Time'].apply(timestamp_to_seconds)

    # Drop string times since we do not need them anymore
    df.drop(columns=['10km Time', 'Half Way Time', 'Official Time'], inplace=True)

    return df

hk_train = preprocess_hongkong_data(hk_train)
hk_train.head()

Unnamed: 0,10K,HalfMarathon,FullMarathon
0,1835.0,3888.0,7932.0
1,1834.0,3888.0,7934.0
2,1835.0,3889.0,7940.0
3,1835.0,3888.0,7949.0
4,1834.0,3888.0,7967.0


### Fit a simple regression model on the Hong Kong Dataset then predict my time

This process is just an example of what a process could look like for this type of prediction. Realistically speaking, the training data and my numbers are not the same: race splits from the actual marathon are different from my PRs that come from a variety of different work outs/races/etc.

Additionally, this is predicting off my current splits (Oct. 2025) which **hopefully** will improve by Feb. 2026. It will be interesting to update

A better process would be to work with a dataset of athletes work outs and use that to build a dataset that could be used to model with. Strava's data is not public and requires direct authorization from an athlete to retrieve their activities, so in my position it is difficult to have an easy to access dataset.

There do exist some athlete workout datasets on Kaggle that I want to explore more into and refine this prediction and see how it compares once I complete my race in February.

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import preprocessing

After loading in sklearn imports for the LinearRegression model, the Hong Kong dataset is prepared to be used as training inputs

In [91]:
# Full Marathon is dropped from X since it is the predicted variable
X = hk_train.drop(columns=['FullMarathon'])
y = hk_train.FullMarathon

# Use train_test_split to divide into training and test data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3, random_state=33)

# Drop columns that the Hong Kong dataset does not include in my pr splits for predicting.
X_Greg = my_prs.drop(columns=['1mile', '5K'])

# Check to ensure that train and test data have same length so there are no errors with the model
assert(len(X_train) == len(y_train))
# Make sure columns are named the same so they will predict properly
assert(X_Greg.columns.all() == X.columns.all())

The model is then created and fitted on the training data

In [113]:
model_full = LinearRegression()
model_full.fit(X_train, y_train)

0,1,2
,fit_intercept,True
,copy_X,True
,tol,1e-06
,n_jobs,
,positive,False


After fitting the model, we can predict for the test set and use means_squared_error to see the quality of the fit

In [114]:
predictions = model_full.predict(X_test)

print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))

mean_squared_error :  727576.9909034839
mean_absolute_error :  649.470364476865


### Some quick analysis based on MSE output

The model is missing by around 850 seconds on making predictions, which is approximately 14:30 off for each actual prediction. I want to try and see why this may be, since in race splits should be a decent predictor of final time, especially when the halfway point is included.

My current working theory is the weight of elite runners vs. amateur runners and the difference in their output results in data that is not consistent. Elite runners train to pace themselves and it is easy to see consistency in their splits. For example, from the hk_elite.csv dataset, if we look at the splits from the first row (the person who won the event), we see the following:

10K: 0:30:35 -> 1835s / 10km = 183.5 s/km
HalfMarathon (21KM): 1:04:48 -> 3888s / 21km = 185.1 s/km
30K (approximately 3/4): 1:33:36 -> 5616s / 30km = 187.2 s/km
FullMarathon (42KM): 2:12:12 -> 7932s / 42km = 188.8 s/km

The top runner in the field was able to essentially maintain their pace for the entire marathon with only about ~6 seconds of variance in his pace, which is incredible consistency.

In comparison, looking at the splits from the very middle runner from the hk_normal dataset, runner 3616 we see:

10K: 1:10:35 -> 4235 / 10km = 423.5 s/km
HalfMarathon (21KM): 2:26:14 -> 8774s / 21km = 417.8 s/km
30K (approximately 3/4): 3:29:50 -> 12,590 / 30km = 419.6 s/km
FullMarathon (42KM): 5:02:48 -> 18,168s / 42km = 432.57 s/km

For the first 30K, this runner did a great job of staying pretty consistent and even kicked up the pace through the half and 3/4 marks. However, the last quarter took its toll and the pace increased by about 10-15 km/s, which doesn't seem like a lot but given the distance of the race, it adds up.

My theory that I will test later is that by keeping the elite and amateur data separate when fitting models, the prediction ability on the elite data will be greater than that of the data that includes amateur runners since there is much more variability on the outcomes of the race, including injuries, fatigue, lack of training, or people who simply run marathons because they are fun and run a bit and walk most of it.

### Predicting my Full Marathon time using Linear Regression model trained on Full Hong Kong 2016 data

In [117]:
greg_full = model_full.predict(X_Greg)

# Convert the predicted time in seconds into a timestamp
greg_predicted_time = datetime.strptime(str(timedelta(seconds=greg_full[0])), '%H:%M:%S.%f')

After putting my splits through the model, we see the following result

In [116]:
print(f"The Linear Regression model with full Hong Kong marathon data predicted a time of: {greg_predicted_time.strftime('%H:%M:%S')} for my first marathon based on my current splits")

The Linear Regression model with full Hong Kong marathon data predicted a time of: 04:21:42 for my first marathon based on my current splits


### Normal vs. Elite Regression Models

Based on the previous analysis, I want to essentially repeat the process but keep the Elite and Normal runner data separate

In [None]:
# Clean up existing data frame with full elite runner data from Hong Kong 2016
hk_elite_clean = preprocess_hongkong_data(hk_elite[['10km Time', 'Half Way Time', 'Official Time']].dropna())
hk_elite_clean.head()

Unnamed: 0,10K,HalfMarathon,FullMarathon
0,1835.0,3888.0,7932.0
1,1834.0,3888.0,7934.0
2,1835.0,3889.0,7940.0
3,1835.0,3888.0,7949.0
4,1834.0,3888.0,7967.0


In [158]:
# Create train test splits for elite runner data
X_elite = hk_elite_clean.drop(columns=['FullMarathon'])
y_elite = hk_elite_clean.FullMarathon

X_elite_train, X_elite_test, y_elite_train, y_elite_test = train_test_split(X_elite, y_elite)

In [None]:
# Clean up existing data frame with full normal runner data from Hong Kong 2016
hk_normal_clean = preprocess_hongkong_data(hk_normal[['10km Time', 'Half Way Time', 'Official Time']].dropna())
hk_normal_clean.head()

Unnamed: 0,10K,HalfMarathon,FullMarathon
0,2330.0,4995.0,10265.0
1,2375.0,5049.0,10444.0
2,2453.0,5223.0,10714.0
3,2572.0,5434.0,11014.0
4,2442.0,5251.0,11090.0


In [154]:
# Create train test splits for normal runner data
X_normal = hk_normal_clean.drop(columns=['FullMarathon'])
y_normal = hk_normal_clean.FullMarathon

X_normal_train, X_normal_test, y_normal_train, y_normal_test = train_test_split(X_normal, y_normal)

First we will repeat the model fitting predicting loop with the normal runner data

In [None]:
# Fitting and predicting using only the normal data
model_normal = LinearRegression()
model_normal.fit(X_normal_train, y_normal_train)
predictions_normal = model_normal.predict(X_normal_test)

print('mean_squared_error : ', mean_squared_error(y_normal_test, predictions_normal))
print('mean_absolute_error : ', mean_absolute_error(y_normal_test, predictions_normal))

mean_squared_error :  679181.8199889385
mean_absolute_error :  653.1483757322314


We do see a slight performance improvement by separating the elite runners from the normal runners. The model may fit better since the elite runners would act as more of an outlier in comparison to an average amateur marathon participant.

Next, we'll do the same for the elite runner data

In [159]:
# Fitting and predicting using only the elite runner data
model_elite = LinearRegression()
model_elite.fit(X_elite_train, y_elite_train)
predictions_elite = model_normal.predict(X_elite_test)

print('mean_squared_error : ', mean_squared_error(y_elite_test, predictions_elite))
print('mean_absolute_error : ', mean_absolute_error(y_elite_test, predictions_elite))

mean_squared_error :  799953.9948755463
mean_absolute_error :  718.9900577104567


Surprisingly, we see worse performance (based on these metrics) from just having the elite data included. It is hard to hypothesize why the elite data would be worse, especially since my initial intuition was that the model fitted to only elite runner data would do better due to a higher expectation of consistency in the pacing.

It would be worthwhile to look deeper into the datasets and potentially recreate the same pace variation calculation for all rows to see if there really is a significant difference. It is possible as well that the classification for an "elite" runner in this case was quite broad (there are 5k rows in the elite dataset) and some athletes that are more borderline could influence the data. This could be tested by filtering using the provided overall_position column to only fit a model on the top 100 or 200 data points to see if there is a performance gap even within the elite category.

### Predicting my Full Marathon time with a Linear Regression model trained on only normal runner data from Hong Kong 2016

In [156]:
greg_normal = model_normal.predict(X_Greg)
greg_predicted_time_normal = datetime.strptime(str(timedelta(seconds=greg_normal[0])), '%H:%M:%S.%f')
print(f"The Linear Regression model with full Hong Kong marathon data predicted a time of: {greg_predicted_time_normal.strftime('%H:%M:%S')} for my first marathon based on my current splits")

The Linear Regression model with full Hong Kong marathon data predicted a time of: 04:24:40 for my first marathon based on my current splits


This model actually predicted a slightly slower time for me, which to be fair, may be closer to the truth. It is my first race so there is a chance I overpace myself and collapse in the later part of the race. Even though the MSE is still high, there may be truth in this model since predicting the outputs of a casual runner is so chaotic and difficult without more insight into their preparation, goals, and training regimen.

### Predicting my Full Marathon time with a Linear Regression model trained on only normal runner data from Hong Kong 2016

In [160]:
greg_elite = model_elite.predict(X_Greg)
greg_predicted_time_elite = datetime.strptime(str(timedelta(seconds=greg_elite[0])), '%H:%M:%S.%f')
print(f"The Linear Regression model with full Hong Kong marathon data predicted a time of: {greg_predicted_time_elite.strftime('%H:%M:%S')} for my first marathon based on my current splits")

The Linear Regression model with full Hong Kong marathon data predicted a time of: 04:21:34 for my first marathon based on my current splits


The output of the elite model predicted almost the same time as the first full model, just slightly faster (approx. 20s difference).

## Simple Conclusions

As stated before, this methodology is not expected to be especially reliable for a few reasons:

* Comparing training splits to race splits is not necessarily an even comparison (my 10K pr split is likely going to be faster than my 10K split in a marathon)
* Race splits, especially for amateur runners like myself, are likely to be very varied since there are lots of factors that may influence them. It is a hard thing to predict for and be confident in the prediction.

Future improvements to this process could include the following:

* Instead of using race splits to predict, use a training dataset of workout data and aggregate my Strava workout. It is likely that preparation information (how many times a week you ran, distance per week, elevation ran per week, etc.) would be a better (and more interesting!!) predictor of a runner's potential.
* Construct a better marathon dataset. The Hong Kong dataset is nice and convenient since it was available on Kaggle and included splits that matched up with data I had, but further optimizations could be done to either improve this dataset or build a better overall one.
* Automate the scraping into a weekly job that can be run to keep my activites log up to date. This is not used in the prediction part of the project but would be nice to keep the Tableau dashboard up to date.

Potential problems that may come up when continuing work:

* Lack of quality running datasets on Kaggle - There is a great [Ultramarathon dataset](https://www.kaggle.com/datasets/aiaiaidavid/the-big-dataset-of-ultra-marathon-running) with over 7 million records. However, for marathon data, there exists multiple smaller ones (especially for Boston Marathon data) but nothing of the same scale. A comparable dataset would be a massive project but could be helpful to work with.
* Even bigger lack of runner training data - Since Strava only allows you to access the API (at least the interesting endpoints of it) for users who have authenticated with OAuth, I am essentially limited to my data, data from friends I know, or people who have made similar projects and uploaded their activities data to Kaggle, GitHub etc. There are some public [athlete datasets](https://www.kaggle.com/datasets/girardi69/marathon-time-predictions) made from a few Strava users, but nothing at a particularly large scale.
  