# "Run Girl Run": Visualising My Strava Activities with APIs

## Introduction

The purpose of this project is to explore my sports activity patterns throughout the last year. I like to go for long runs, play sports such as tennis and padel, and sometimes go to the gym when it's rainy outside. The platform I use to keep track of my physical activities is Strava, an American internet service for tracking physical exercise that also has social media features. It's a nice platform where I can look back at the average pace of my fast runs, look at the maps of my long runs, and stay motivated by exchanging kudos with my friends!

In order to analyse my activity in JupyterLab, I need to retrieve the data from my Strava account (through the API they provide) and download it locally. I read [this](https://towardsdatascience.com/using-the-strava-api-and-pandas-to-explore-your-activity-data-d94901d9bfde) online article, which helped me download the necessary tools to test the Strava API, get access and refresh tokens, and finally get the data with Python. I would highly recommend it to anybody who's new to APIs and would like to give it a go.

Once I had done that, I started exploring my data with publicly available libraries such as Pandas and NumPy. After cleansing the data, I aimed to answer the following questions:

- What's the most frequent activity that I've performed in the past year?
- What's the most common run distance I've performed in the past few years?
- Has my average pace decreased during my fast, 5km runs throughout the past year?
- In which country/ county did I exercise the most in the past years?

## Getting our Data

In [None]:
import calendar
from datetime import datetime
from IPython.display import Image, display
import matplotlib.pyplot as plt
import numpy as np
import os 
import pandas as pd
import seaborn as sns

In [None]:
from utils import (
    getting_access_and_refresh_tokens, 
    get_dataset,
    convert_to_hhmmss,
    start_date_to_y_m_d_t,
    converting_num_months_to_strings,
    sorting_values_by_year,
    converting_metres_into_kms,
    creating_a_total_column,
    plotting_table_into_grouped_bar_chart,
    remove_rows_by_sport_type,
    creating_bins,
    calculating_basic_statistics,
    plotting_a_histogram
)

In [None]:
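# Reading the Strava API credentials from environment variables so that secrets stay out of the notebook
# (the variable names below are an assumption; use whichever names you export)
access_token = getting_access_and_refresh_tokens(
    client_id=os.environ["STRAVA_CLIENT_ID"],
    client_secret=os.environ["STRAVA_CLIENT_SECRET"],
    refresh_token=os.environ["STRAVA_REFRESH_TOKEN"],
)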

In [None]:
# Getting the dataset 
df_original = get_dataset(access_token) # applying the get_dataset function 
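
The helpers `getting_access_and_refresh_tokens` and `get_dataset` live in `utils` and are not shown here. As a rough sketch of what helpers like these usually do with the Strava API: a refresh token is exchanged for a short-lived access token at `https://www.strava.com/oauth/token`, and the activities are then fetched page by page from `https://www.strava.com/api/v3/athlete/activities`. The function names below are hypothetical, and the page-fetching function is passed in as an argument so that the paging logic can be tried without any network access.

```python
TOKEN_URL = "https://www.strava.com/oauth/token"
ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"


def build_refresh_payload(client_id, client_secret, refresh_token):
    """Form fields to POST to TOKEN_URL in order to obtain a fresh access token."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }


def fetch_all_activities(fetch_page, per_page=100):
    """Call fetch_page(page, per_page) until a page comes back empty."""
    activities = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        if not batch:
            break
        activities.extend(batch)
        page += 1
    return activities


# Stand-in for a real HTTP call (made-up data): two pages, then an empty one
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
all_activities = fetch_all_activities(lambda page, per_page: fake_pages.get(page, []))
print(len(all_activities))
```

In a real run, `fetch_page` would send a GET request to `ACTIVITIES_URL` with the `page` and `per_page` query parameters and an `Authorization: Bearer <access token>` header.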

In [None]:
# Creating a copy of the original dataset 
df = df_original.copy()

## Cleaning the main dataframe

In [None]:
# Converting moving_time into hours, minutes, and seconds
df["moving_time"] = df["moving_time"].apply(convert_to_hhmmss)
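
`convert_to_hhmmss` is defined in `utils` and not shown here. Strava reports `moving_time` in seconds, so a minimal sketch of such a conversion could look like the following (the function name and the exact output format are assumptions):

```python
def seconds_to_hhmmss(total_seconds):
    """Format a duration given in seconds as HH:MM:SS."""
    total_seconds = int(total_seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(seconds_to_hhmmss(3725))  # 01:02:05
```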

In [None]:
# Splitting the 'start_date' into years, months, days, and times 
df[["year", "month", "day", "time"]] = df["start_date"].apply(start_date_to_y_m_d_t)

In [None]:
# Converting number of months into names of the month 
df = converting_num_months_to_strings(df)
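
The two helpers used above, `start_date_to_y_m_d_t` and `converting_num_months_to_strings`, are also part of `utils`. As an illustration only, here is how a single timestamp could be split and its month turned into a name, assuming `start_date` is an ISO 8601 string such as `2023-05-14T08:30:00Z` (the function name is hypothetical):

```python
import calendar
from datetime import datetime


def split_start_date(start_date):
    """Split an ISO 8601 UTC timestamp into (year, month name, day, time)."""
    parsed = datetime.strptime(start_date, "%Y-%m-%dT%H:%M:%SZ")
    month_name = calendar.month_name[parsed.month]  # 1 -> 'January', ..., 12 -> 'December'
    return parsed.year, month_name, parsed.day, parsed.strftime("%H:%M:%S")


print(split_start_date("2023-05-14T08:30:00Z"))
```

Note that `calendar.month_name` follows the current locale, so the names are in English only under an English (or default) locale.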

In [None]:
# Sorting all values by year
df = df.sort_values("year")

In [None]:
# Converting the distance column from metres into kms
df = converting_metres_into_kms(df)

## Visualising our Data

#### Sport by Month Frequency Table (2023)

Let's now answer the first of our research questions: what's the most frequent activity that I've performed in the past year? We will start by creating a dataframe filtered to 2023.

In [None]:
# Filtering data based on the 'year' column
year_to_filter = 2023
# Selecting only activities performed in 2023; .copy() returns an independent dataframe that can be modified without a SettingWithCopyWarning
filtered_df = df[df['year'] == year_to_filter].copy()

Before creating the frequency table, let's cleanse the data.

In [None]:
# Labelling padel sessions, identified by their activity name, since Strava does not have padel as a sport yet
filtered_df.loc[filtered_df['name'].str.contains('Padel', na=False), 'sport_type'] = 'Padel'

# Merging gym-style activities into 'Workout' to reduce the number of distinct 'sport_type' values
filtered_df['sport_type'] = filtered_df['sport_type'].replace(
    {'WeightTraining': 'Workout', 'HighIntensityIntervalTraining': 'Workout'}
)

In [None]:
# Removing activities practiced only once or twice
sport_types_to_remove = ['Yoga', 'Hike', 'Walk', 'Badminton']
filtered_df = remove_rows_by_sport_type(filtered_df, sport_types_to_remove)

In [None]:
# Calculating the frequency of sports practiced by month
# size() counts the activities in each (sport_type, month) group; unstack() pivots the months into columns, and fill_value=0 fills the months without activities
frequency_table = filtered_df.groupby(['sport_type', 'month']).size().unstack(fill_value=0)

In [None]:
# Creating a 'total' column to sort all values from biggest to smallest  
frequency_table = creating_a_total_column(frequency_table)
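
To make the logic of the frequency table concrete, here is the same idea in plain Python on a handful of made-up activities: count the activities per (sport, month) pair, sum the counts per sport to get a total, and sort the sports from most to least frequent. This is only an illustration of the technique, not the implementation of `creating_a_total_column`, which is defined in `utils`.

```python
from collections import Counter

# Made-up (sport_type, month) pairs, one per activity
activities = [
    ("Run", "May"), ("Run", "May"), ("Run", "June"),
    ("Padel", "July"), ("Padel", "August"), ("Tennis", "May"),
]

# Counting activities per (sport, month) pair, like groupby(...).size()
pair_counts = Counter(activities)

# Summing the monthly counts per sport, like a 'total' column
totals = Counter()
for (sport, _month), count in pair_counts.items():
    totals[sport] += count

# Sorting sports from most to least frequent
ranking = totals.most_common()
print(ranking)
```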

The most frequent activity in the past year is running, followed by padel. Let's now use this frequency table to create a grouped bar chart.

In [None]:
# Plotting a frequency table into a grouped bar chart
frequency_table = plotting_table_into_grouped_bar_chart(frequency_table)
frequency_table

Runs are the most frequent activity, followed by padel. The month with the highest number of runs is May, possibly because I was training for a Trail Half Marathon which took place at the beginning of June. This could also explain why June was the month with the fewest runs, since I was going through post-race recovery. We can also observe a greater variety of sports in August and September, probably because the nice weather allows for multiple outdoor activities. I only discovered padel midway through the summer, which explains why padel is one of the most common activities from July onwards.

## Exploring Runs

Let's now explore the runs, since running is the most common activity. It would be good to explore the distances of my runs over the past few years. I'm going to use the unfiltered dataframe so that all the runs I did before 2023 are included. This will help us answer our second research question: "What's the most common run distance I've performed in the past few years?"

In [None]:
# Filtering the dataframe to only select run activities (.copy() returns an independent dataframe that is safe to modify)
run_df = df[df['sport_type'] == 'Run'].copy()

In [None]:
# Creating bins to categorise my runs' distances 
bin_intervals = [0, 5, 10, 15, 20, 25, 30]
run_df = creating_bins(run_df, 'distance_km', bin_intervals, labels = ["0-5", "5-10", "10-15", "15-20", "20-25", "25-30"])
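
`creating_bins` is another `utils` helper that is not shown here. Assuming it assigns each distance to a right-closed interval (the default behaviour of pandas' `pd.cut`), the labelling of a single distance can be sketched in plain Python as follows. The function name is hypothetical.

```python
from bisect import bisect_left

bin_edges = [0, 5, 10, 15, 20, 25, 30]
bin_labels = ["0-5", "5-10", "10-15", "15-20", "20-25", "25-30"]


def label_for_distance(distance_km):
    """Return the label of the right-closed bin (low, high] containing distance_km, or None."""
    if distance_km <= bin_edges[0] or distance_km > bin_edges[-1]:
        return None  # outside all bins
    return bin_labels[bisect_left(bin_edges, distance_km) - 1]


for distance in [5.0, 5.2, 21.1, 42.2]:
    print(distance, label_for_distance(distance))
```

With right-closed bins, a run of exactly 5km falls into `"0-5"`, and anything longer than 30km gets no label at all, so the upper edge would need extending if longer runs should be counted.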

In [None]:
# Calculating basic statistics of my runs' distances
column_name = 'distance_km'
calculating_basic_statistics(run_df, column_name)

In [None]:
# Creating a kernel density plot
plt.figure(figsize=(10, 6))
sns.kdeplot(run_df['distance_km'], fill=True)  # 'fill' replaces the deprecated 'shade' argument

# Set labels and title
plt.xlabel('Distance (km)')
plt.ylabel('Density')
plt.title('Kernel Density Plot of My Run Distances')

# Display the plot
plt.show()

As shown by the statistics and the plot above, the distribution of my run distances is positively skewed. The mean lies between 5 and 10km, whereas the median is 5km. The mode is also 5km, so the most common distance I've run in the past few years is 5km. This might be because it's been more convenient for me to go for a quick 5km run during my lunch break than to go for a long, 30km run, which only happened on very rare weekends.
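
As a small illustration of what a positive skew looks like in numbers, here is a sketch on toy distances (not the Strava data): a few long runs pull the mean above the median, while the median and the mode stay at the most typical distance.

```python
import statistics

# Toy distances in km: many short runs and a few long ones
distances_km = [3, 5, 5, 5, 5, 6, 8, 10, 21]

mean_km = statistics.mean(distances_km)
median_km = statistics.median(distances_km)
mode_km = statistics.mode(distances_km)

print(f"mean={mean_km:.2f}, median={median_km}, mode={mode_km}")
print("positively skewed" if mean_km > median_km else "not positively skewed")
```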

It might be worth exploring my 5km runs further to check whether my average pace has decreased at all, since 5km runs are the ones I've been doing the most. This will help us answer our third research question: "Has my average pace decreased during my fast, 5km runs throughout the past year?"

#### Exploring 5km Runs

In [None]:
# Filtering runs into 5km runs only 
short_run_df = run_df[run_df['distance_km'] == 5]
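
One caveat about the filter above: if `distance_km` holds unrounded GPS distances (an assumption, since `converting_metres_into_kms` is not shown), an exact `== 5` comparison would match very few runs, because a "5km" run is often recorded as 4.98 or 5.12km. A tolerance-based check is a possible alternative. The sketch below uses plain Python, with a hypothetical function name and an arbitrary tolerance of 0.25km.

```python
def is_about_5k(distance_km, target_km=5.0, tolerance_km=0.25):
    """Return True when the distance is within tolerance_km of target_km."""
    return abs(distance_km - target_km) <= tolerance_km


# Made-up distances in km
distances = [4.98, 5.0, 5.12, 5.6, 10.01]
five_k_runs = [d for d in distances if is_about_5k(d)]
print(five_k_runs)
```

On a pandas column, the same idea could be written as `run_df['distance_km'].between(4.75, 5.25)`.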

In [None]:
# Sorting rows chronologically 
short_run_df = short_run_df.sort_values("start_date")

In [None]:
# Computing my average pace 
short_run_df['average_pace'] = 1 / (short_run_df['average_speed'] * 60 / 1000) # converting the average speed from m/sec to km/min to compute pace
short_run_df = short_run_df[(short_run_df['average_pace'] < 10) & (short_run_df['average_pace'] > 3)] # filtering out outliers 
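
To double-check the unit conversion above: Strava reports `average_speed` in metres per second, so multiplying by 60 gives metres per minute, dividing by 1000 gives kilometres per minute, and taking the inverse gives minutes per kilometre. The sketch below applies the same formula to a single made-up speed and formats the result as `m:ss`. Both function names are hypothetical.

```python
def speed_to_pace_min_per_km(speed_m_per_s):
    """Convert a speed in m/s into a pace in minutes per kilometre."""
    return 1 / (speed_m_per_s * 60 / 1000)


def format_pace(pace_min_per_km):
    """Format a pace in decimal minutes as 'm:ss /km'."""
    total_seconds = round(pace_min_per_km * 60)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d} /km"


pace = speed_to_pace_min_per_km(3.7)  # made-up average speed in m/s
print(format_pace(pace))  # 4:30 /km
```

Keep in mind that pace is an inverse measure: a lower pace value means a faster run.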

In [None]:
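# Creating a line chart of my average pace over time
plt.figure(figsize=(10, 6))
plt.plot(short_run_df['start_date'], short_run_df['average_pace'], marker='o', linestyle='-', color='b')
plt.xlabel('Start date')
plt.ylabel('Average pace (min/km)')
plt.title('Average Pace of My 5km Runs')
ticks = plt.xticks(rotation=45)
plt.show()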

As we assumed, it seems like my average pace has been decreasing throughout the past year, meaning that I have been getting faster. From November 2022 to mid-February 2023, there was a sharp decline in average pace. This might be because I started running with friends who gave me some useful feedback on how to run better. However, I also started running slower around July 2023, probably because I did not sign up for any new race after the one I had in June.

## Exploring Routes

Let's look at the routes of my runs and cycling activities on a map. Strava routes are stored as GPX files, a standard file format used to store and exchange GPS data and tracks. To bulk export my files, I used this article. GPX files are usually compressed (`.gz`), so I wrote some code to decompress them. The files that were not compressed are moved as they are, so that all GPX files end up in a separate GPX files directory before being used. This will help us answer our last question: "In which country/ county did I exercise the most in the past years?"

#### Moving GPX Files into a GPX Files Directory

In [1]:
import gzip
import shutil
import os
from pathlib import Path

In [2]:
# Creating a directory of the activities 
activities_directory = Path('/Users/giorgiadimiccoli/Desktop/repos/my_strava_runs/my_strava_archive/activities/')

# Creating a new directory to store all the gpx files 
gpx_files = Path('/Users/giorgiadimiccoli/Desktop/repos/my_strava_runs/my_strava_archive/gpx_files/')
gpx_files.mkdir(parents=True, exist_ok=True) # exist_ok=True avoids an error if the directory already exists

In [3]:
# Extracting GPX files and decompressing gzip-compressed GPX files
def extracting_gpx_files(activities_directory, gpx_files):

    # Iterating through each file in the activities directory
    for file in activities_directory.iterdir():

        # Skipping anything that is not a GPX file
        if '.gpx' not in file.name:
            continue

        # Compressed GPX files contain '.gpx' in their name but end with '.gz'
        if file.name.endswith('.gz'):

            # Dropping the trailing '.gz' to name the decompressed file
            unzipped_file = gpx_files / file.name[:-len('.gz')]

            # Reading the gzip-compressed file in binary mode and copying its decompressed content to the new file
            with gzip.open(file, 'rb') as f_in, open(unzipped_file, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)

        else:
            # Moving the uncompressed file from the 'activities' directory to the 'gpx_files' directory
            shutil.move(str(file), str(gpx_files / file.name))

In [4]:
extracting_gpx_files(activities_directory, gpx_files)

Now that we've cleansed and stored all the GPX files in a new directory, we can iterate through each file path and turn them into a dataframe to better manipulate and analyse each activity's track.

In [None]:
# Creating a function that takes in a file path and returns a dataframe
def gpx_files_to_df(file_path):

    # Parsing the gpx file (the 'with' block closes the file once it has been read)
    with open(file_path) as gpx_file:
        gpx = gpxpy.parse(gpx_file)

    # Creating an empty list for each info we want to store in the df
    lats = []
    longs = []
    elevs = []
    times = []

    # Iterating through each point in the first segment of the first track of the gpx file
    for point in gpx.tracks[0].segments[0].points:

        # Appending each info from the point to the corresponding list
        lats.append(point.latitude)
        longs.append(point.longitude)
        elevs.append(point.elevation)
        times.append(point.time)

    # Creating a dataframe with the information as names of the columns
    my_df = pd.DataFrame(
        {
            "longitude": longs,
            "latitude": lats,
            "elevation": elevs,
            "time": times,
        }
    )
    return my_df

In [None]:
import gpxpy

# Creating an empty list of dataframes
list_of_dfs = []

# Looping over each GPX file in the 'gpx_files' directory
# ('gpx_files' is a Path object, so it cannot be joined to a file name with '+'; glob() yields full paths directly)
for file_path in sorted(gpx_files.glob("*.gpx")):
    # Applying the 'gpx_files_to_df' function to the file and assigning the result to 'dfed_file'
    dfed_file = gpx_files_to_df(file_path)
    # Appending each 'dfed_file' to the list of dataframes
    list_of_dfs.append(dfed_file)

Now it's time to create our maps! Let's import folium, a package that allows us to create interactive geographic visualisations.
Since keeping the interactive maps in the Jupyter Notebook would take too much memory, I created the maps first, took screenshots of them, removed the maps from the notebook, and loaded the screenshots instead.

#### Creating Maps and Strava Routes

In [None]:
# Using geospatial analysis to visualize my Strava routes with the folium package
import folium

route_map = folium.Map( # creating a map through folium.Map 
    location=[51.752457, -1.230000], # giving coordinates 
    zoom_start=13, # setting where the zoom starts 
    tiles='OpenStreetMap', # setting the map tile 
    width='100%', # choosing the width of the map 
    height='100%' # choosing the height of the map 
)

In [None]:
# Adding every point of every route to the map as a circle marker
for route_df in list_of_dfs: # iterating through the list of dataframes (named 'route_df' so that the main 'df' is not overwritten)
    for _, row in route_df.iterrows(): # iterating through each row of the current dataframe
        folium.CircleMarker( # adding a circle marker at each latitude and longitude combination
            location=[row['latitude'], row['longitude']],
            radius=3, # setting the circle radius to 3
        ).add_to(route_map) # adding the markers to the map

In [None]:
# Connecting the points of each route with a polyline to properly represent the route
route_map = folium.Map( # creating a fresh map through folium.Map
    location=[51.752457, -1.230000], # giving coordinates of Oxford to start with
    zoom_start=13, # setting where the zoom starts
    tiles='OpenStreetMap', # setting the map tile
    width='100%', # choosing the width of the map
    height='100%' # choosing the height of the map
)

# Drawing one polyline per route (looping over 'df' itself would iterate over column names and draw each line several times)
for route_df in list_of_dfs: # iterating through the list of dataframes
    coordinates = [tuple(x) for x in route_df[['latitude', 'longitude']].to_numpy()] # extracting geolocation info as a list of (latitude, longitude) tuples
    folium.PolyLine( # creating the polyline with the list of tuples
        coordinates,
        weight=3 # choosing the line thickness in pixels
    ).add_to(route_map) # adding it to the route_map
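
The maps above start from hard-coded coordinates for Oxford. For routes recorded elsewhere, one possible approach is to derive the starting location from the track itself, for example as the midpoint of its bounding box. The sketch below uses a made-up track and a hypothetical function name.

```python
def bounding_box_center(points):
    """Return the (latitude, longitude) midpoint of the bounding box around the points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2


# Made-up track as (latitude, longitude) pairs
track = [(51.750, -1.260), (51.760, -1.240), (51.754, -1.250)]
center = bounding_box_center(track)
print(center)
```

The result could then be passed to `folium.Map` as `location=list(center)`. Folium maps also offer a `fit_bounds` method, which adjusts the view to a pair of corner coordinates.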

#### Showing my Strava Routes

In [None]:
# Getting the current working directory (root directory)
root_directory = os.getcwd()

# Iterating over all files in the root directory
for filename in os.listdir(root_directory):
    # Checking if the file has an Open Street Map tile layer
    if "_openstreetmap_" in filename:
        # Displaying the Open Street Map images using IPython.display.Image
        img_path = os.path.join(root_directory, filename)
        display(Image(filename=img_path))

Above, we can see the routes of my Strava activities on the maps of Exmoor and Snowdonia.

To reduce visual noise as much as possible, we could use a different tile layer named `'CartoDBPositron'`. Let's apply it to the Oxfordshire map, since it's most probably the one with the highest activity density, given that I am currently living in Oxford.

In [None]:
# Displaying a map with the tile layer 'CartoDBPositron'
display(Image(filename='oxf_map_card_screenshot.png'))

Let's also create a dark map with the tile layer `'CartoDBDark_Matter'`. Let's do that for the Cambridge data, since it has an interesting shape.

In [None]:
# Displaying a map with the tile layer 'CartoDBDark_Matter'
display(Image(filename='cam_dark_map_screenshot.png'))

It looks like the majority of my activities in the past year were performed in Oxfordshire, which is the county where I'm currently living. Some other activities were performed in Exmoor and Wales for trail half marathons, and in Cambridge, where I used to live.
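
The conclusion above is based on reading the maps by eye. A rough way to back it up with numbers would be to assign each activity's starting point to the nearest of a few reference places and count the results. The sketch below does this with the haversine (great-circle) distance. The reference coordinates are approximate city centres, the starting points are made up, and the nearest city is only a proxy for the county.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

# Approximate city-centre coordinates (assumed reference points)
REFERENCE_PLACES = {
    "Oxford": (51.752, -1.258),
    "Cambridge": (52.205, 0.119),
}


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius


def nearest_place(point):
    """Return the name of the reference place closest to the point."""
    return min(REFERENCE_PLACES, key=lambda name: haversine_km(point, REFERENCE_PLACES[name]))


# Made-up starting points of three activities
start_points = [(51.76, -1.25), (51.74, -1.27), (52.20, 0.12)]
activity_counts = Counter(nearest_place(p) for p in start_points)
print(activity_counts.most_common())
```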

## Conclusion

The purpose of this project was to explore the physical activities I performed throughout the past few years. We retrieved the data from my Strava account using the API they provide and downloaded it locally to explore it, cleanse it, and analyse it.

Here's what we've discovered so far:
    
- The most frequent activity that I've performed in the past year was running. This might be because of my training for multiple half marathons.
- The most common run distance I've performed in the past few years is 5km. This might be because I found it more convenient to run shorter distances due to a busy lifestyle.
- My average pace decreased during my fast, 5km runs throughout the past year, fluctuating between 4:30 /km and 5:00 /km. A reason could be that I set myself the goal of improving my running performance for the Trail Half Marathon in June 2023, and thus aimed at running my 5km runs better and faster.
- Oxfordshire is the county where I exercised the most in the past years, followed by Cambridgeshire and some areas of Exmoor and Wales.

Here are some recommendations for further data exploration:
- Design some time series visualisations to track my fitness progress over time, including metrics such as heart rate and calories burned.
- Create a radar chart or parallel coordinate plots to compare my performance with that of my partner or my friends across different statistical categories.
- Plot a chart to compare the rankings of different teams during padel tournaments based on performance metrics such as defense, offense, and/or overall rating.