# Running

I love running. I also collect a lot of data about my runs through my Garmin watch. It's about time I did something interesting with it.

## Where to get the data from?

Garmin Connect and Strava are the two obvious places to retrieve my running history. Both give the user the option to bulk export their data history.

- Garmin - [https://www.garmin.com/en-US/account/datamanagement/exportdata/](https://www.garmin.com/en-US/account/datamanagement/exportdata/)
- Strava - [https://www.strava.com/athlete/delete_your_account](https://www.strava.com/athlete/delete_your_account) (don't be scared by the page name)

Both exports have to be compiled on request, so they can take a while. Each service emails you a download link once the archive is ready. For me, Strava took about 5 minutes and Garmin about 20. After a quick look through both folders, the Strava export seems a lot more intuitive, so for now I'll focus on the files from there.

## File types

There are several file types across the folders:

- `csv` - the activities summary is in this format
- `gpx` - routes and some activities are stored as this
- `fit.gz` - most of the activities are in this format. These are gzip-compressed `.fit` files.
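
A quick way to check which file types an export actually contains is to count files by extension. This is only a sketch: `count_file_types` is a helper name made up for illustration, and it assumes file names have no dots other than those in the extension (true of Strava's numeric activity names as far as I can tell).

```python
from collections import Counter
from pathlib import Path


def count_file_types(folder):
    """Count files under `folder` by their full extension (e.g. '.fit.gz')."""
    counts = Counter()
    for path in Path(folder).rglob('*'):
        if path.is_file():
            # join all suffixes so 'run.fit.gz' counts as '.fit.gz', not '.gz'
            counts[''.join(path.suffixes)] += 1
    return counts
```

Calling `count_file_types('running-data-exports/Strava')` (or wherever the export was unzipped) returns a `Counter` keyed by extension.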

## Running overview

The activities summary file provides top-level statistics about my activities to date. I've tackled this aspect of the data in an [R script](https://github.com/patricktudor/running/blob/main/Activity%20summary%20visualisations.R) because tidyverse is epic.

## Individual activity data

In this notebook I'm going to focus on tackling the activity `gpx` and `fit.gz` files.

In [1]:
# import packages
import pandas as pd
import os
import fnmatch
import glob
import gzip

from tqdm import tqdm
tqdm.pandas()

from pathlib import Path

First get a list of activities that are runs.

In [70]:
# open file
activities = pd.read_csv("running-data-exports/Strava/activities.csv")

# get filenames for run activities recorded as .fit.gz
# (na=False treats activities with no file as non-matches)
metrics_run = activities.loc[
    (activities['Activity Type'] == 'Run')
    & (activities['Filename'].str.endswith('fit.gz', na=False)),
    ['Activity ID', 'Filename'],
]
run_ids = metrics_run['Filename'].to_list()

# filenames in the CSV are relative to the export folder
data_folder = Path('running-data-exports/Strava/')

# build the full path to each run file
run_files = [data_folder / run for run in run_ids]

# alternative method: find every .fit.gz file under the working directory with glob
# (this picks up all activity types, not just runs)
# note that recursive=True is required for '**' to match nested directories
activity_files = glob.glob('**/*.fit.gz', recursive=True)


In [71]:
my_directory = os.getcwd()
print(my_directory)

c:\Users\ptudor\Documents\GitHub\running


In [72]:
len(run_files)

813

In [5]:
run_files[0:5]

[WindowsPath('running-data-exports/Strava/activities/143805738.tcx.gz'),
 WindowsPath('running-data-exports/Strava/activities/143805734.tcx.gz'),
 WindowsPath('running-data-exports/Strava/activities/143859088.tcx.gz'),
 WindowsPath('running-data-exports/Strava/activities/143859096.tcx.gz'),
 WindowsPath('running-data-exports/Strava/activities/143859092.tcx.gz')]

In [67]:
len(activity_files)

1267

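## .FIT files

A `.fit.gz` file is a gzip-compressed `.fit` file. FIT stands for Flexible and Interoperable Data Transfer. It is a binary format from Garmin / ANT for storing data recorded by fitness and health devices.

This is how to open a `.fit.gz` file. Because the contents are binary, printing them line by line just shows raw bytes: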

In [7]:
# select one file
my_run = activity_files[8]

with gzip.open(my_run, 'r') as run:
    for line in run:
        print(line) 

b'\x0e\x10\xf4\x03\xb2\x05\x00\x00.FIT\xec\xce@\x00\x00\x00\x00\x06\x03\x04\x8c\x04\x04\x86\x01\x02\x84\x02\x02\x84\x05\x02\x84\x00\x01\x00\x00>M \xe7\x8f\xa4H3\x01\x00W\x06\xff\xff\x04A\x00\x001\x00\x02\x00\x02\x84\x01\x01\x02\x01J\x01\xffB\x00\x00\x15\x00\x05\xfd\x04\x86\x03\x04\x86\x00\x01\x00\x01\x01\x00\x04\x01\x02\x02\x90\xa4H3\x00\x00\x00\x00\x00\x00\x00C\x00\x00\x17\x00\x14\xfd\x04\x86\x03\x04\x8c\x07\x04\x86\x08\x04\x86\x0f\x04\x86\x10\x04\x86\x02\x02\x84\x04\x02\x84\x05\x02\x84\n'
b'\x02\x84\x15\x02\x8b\x00\x01\x02\x01\x01\x02\x06\x01\x02\t\x01\x02\x0b\x01\x02\x14\x01\n'
b'\x16\x01\x00\x17\x01\x02\x19\x01\x00\x03\x90\xa4H3>M \xe7\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01\x00W\x06J\x01\xff\xff\x00\x00\x00\xff\xff\xff\xff\x00\xff\xff\x05\x03\x90\xa4H3\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x01\x00\x99\x06^\x01\xff\xff\x00\x00\x01\x00\xff\xff\xff\x00\xff\xff\x05\x03\x90\xa4H3\x00\x00\x00\x00\xff\xff\xff\xff\xff\x
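
The raw bytes aren't completely opaque: a FIT file starts with a small fixed-layout header that includes the `.FIT` signature visible in the output above. As a rough sketch (the function name `parse_fit_header` is my own, and this only decodes the first 12 bytes, ignoring the optional header CRC that follows), the header can be unpacked with `struct`:

```python
import struct


def parse_fit_header(data):
    """Decode the fixed-layout header at the start of a FIT file."""
    header_size = data[0]
    # '<' = little-endian: protocol version (1 byte), profile version (2 bytes),
    # data size (4 bytes), then the 4-byte '.FIT' signature
    protocol, profile, data_size, signature = struct.unpack('<BHI4s', data[1:12])
    if signature != b'.FIT':
        raise ValueError('not a FIT file')
    return {
        'header_size': header_size,
        'protocol_version': protocol,
        'profile_version': profile,
        'data_size': data_size,
    }


# the first 12 bytes of the output above
header = parse_fit_header(b'\x0e\x10\xf4\x03\xb2\x05\x00\x00.FIT')
print(header)
```

This is only useful as a sanity check that a file really is a FIT file. Decoding the messages that follow the header is a much bigger job.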

A chap called David Cooper has created a Python library called [fitparse](https://github.com/dtcooper/python-fitparse) to parse .FIT files. Let's install it and see how it can help.

In [8]:
# install the package
# pip install fitparse

import fitparse

The following script is taken from his GitHub page to show how to view the data in a .FIT file.

In [12]:
# original version

my_run = activity_files[0]

with gzip.open(my_run, 'r') as run:
    fitfile = fitparse.FitFile(run)

    # this next bit is taken from dtcooper github page

    # Iterate over all messages of type "record"
    # (other types include "device_info", "file_creator", "event", etc)
    for record in fitfile.get_messages("record"):
        # Records can contain multiple pieces of data (ex: timestamp, latitude, longitude, etc)
        for data in record:

            # Print the name and value of the data (and the units if it has any)
            if data.units:
                print(" * {}: {} ({})".format(data.name, data.value, data.units))
            else:
                print(" * {}: {}".format(data.name, data.value))

        print("---")

08:44:10
---
 * altitude: 27.399999999999977 (m)
 * cadence: 89 (rpm)
 * distance: 495.97 (m)
 * enhanced_altitude: 27.399999999999977 (m)
 * enhanced_speed: 3.266 (m/s)
 * fractional_cadence: 0.0 (rpm)
 * position_lat: 615709511 (semicircles)
 * position_long: -47441449 (semicircles)
 * speed: 3.266 (m/s)
 * timestamp: 2017-03-18 08:44:15
---
 * altitude: 28.200000000000045 (m)
 * cadence: 89 (rpm)
 * distance: 519.33 (m)
 * enhanced_altitude: 28.200000000000045 (m)
 * enhanced_speed: 3.284 (m/s)
 * fractional_cadence: 0.0 (rpm)
 * position_lat: 615708592 (semicircles)
 * position_long: -47445217 (semicircles)
 * speed: 3.284 (m/s)
 * timestamp: 2017-03-18 08:44:22
---
 * altitude: 31.600000000000023 (m)
 * cadence: 88 (rpm)
 * distance: 542.12 (m)
 * enhanced_altitude: 31.600000000000023 (m)
 * enhanced_speed: 3.303 (m/s)
 * fractional_cadence: 0.0 (rpm)
 * position_lat: 615707143 (semicircles)
 * position_long: -47448328 (semicircles)
 * speed: 3.303 (m/s)
 * timestamp: 2017-03-18 0
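
The positions in the output above are in semicircles rather than degrees. FIT stores angles as 32-bit integers where 2^31 semicircles correspond to 180 degrees, so converting is a single multiplication. A minimal sketch (the function name is my own):

```python
def semicircles_to_degrees(semicircles):
    """Convert a FIT position value to decimal degrees (2**31 semicircles = 180 degrees)."""
    return semicircles * 180 / 2**31


# the first position from the output above
lat = semicircles_to_degrees(615709511)
lon = semicircles_to_degrees(-47441449)
print(round(lat, 4), round(lon, 4))
```

The same function could be applied to the latitude and longitude columns once the data is in a dataframe.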

I'll use this as a guide to capture the data for my runs and save it in a dataframe.

In [73]:
# empty list to store run dfs
run_dfs = []

for run_index, myrun in tqdm(enumerate(run_files)):

    # unzip file
    with gzip.open(myrun, 'r') as runfile:
        fitfile = fitparse.FitFile(runfile)

        # record counter
        rec_count = 0

        # I couldn't find another way to count the number
        # of records in a run
        for record in fitfile.get_messages("record"):
            rec_count += 1

        # set up default lists
        runs = ['NA'] * rec_count
        records = ['NA'] * rec_count
        alts = ['NA'] * rec_count
        cads = ['NA'] * rec_count
        dists = ['NA'] * rec_count
        en_alts = ['NA'] * rec_count
        en_speeds = ['NA'] * rec_count
        frac_cads = ['NA'] * rec_count
        pos_lats = ['NA'] * rec_count
        pos_longs = ['NA'] * rec_count
        speeds = ['NA'] * rec_count
        times = ['NA'] * rec_count

        # add record data to lists
        for record_index, record in enumerate(fitfile.get_messages("record")):
            runs[record_index] = run_index + 1
            records[record_index] = record_index + 1
            for data in record:
                if data.name == 'altitude':
                    alts[record_index] = data.value
                elif data.name == 'cadence':
                    cads[record_index] = data.value
                elif data.name == 'distance':
                    dists[record_index] = data.value
                elif data.name == 'enhanced_altitude':
                    en_alts[record_index] = data.value
                elif data.name == 'enhanced_speed':
                    en_speeds[record_index] = data.value
                elif data.name == 'fractional_cadence':
                    frac_cads[record_index] = data.value
                elif data.name == 'position_lat':
                    pos_lats[record_index] = data.value
                elif data.name == 'position_long':
                    pos_longs[record_index] = data.value
                elif data.name == 'speed':
                    speeds[record_index] = data.value
                elif data.name == 'timestamp':
                    times[record_index] = data.value

        # create dictionary
        d = {
            'Run': runs,
            'Record': records,
            'Timestamp': times,
            'Latitude': pos_lats,
            'Longitude': pos_longs,
            'Speed': speeds,
            'Enhanced_speed': en_speeds,
            'Distance': dists,
            'Altitude': alts,
            'Enhanced_altitude': en_alts,
            'Cadence': cads,
            'Fractional_candence': frac_cads,
        }
        run_dfs.append(pd.DataFrame(d))

# create main dataframe
my_run_data = pd.concat(run_dfs, ignore_index = True)

print('Finished!')


813it [03:42,  3.65it/s]
Finished!
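
The loop above works, but the separate counting pass and the long `if`/`elif` chain could be avoided by building one dictionary per record. Below is a sketch of that alternative, not the version that produced the output above. The names `FIELDS`, `records_to_rows` and `FakeField` are made up for illustration, `FakeField` is a stand-in so the example runs without a real file, and missing values become `None` rather than the string `'NA'`.

```python
from collections import namedtuple

# field names the notebook keeps from each "record" message
FIELDS = ['timestamp', 'position_lat', 'position_long', 'speed',
          'enhanced_speed', 'distance', 'altitude', 'enhanced_altitude',
          'cadence', 'fractional_cadence']


def records_to_rows(records, run_number):
    """Turn an iterable of records into a list of dicts, one per record."""
    rows = []
    for record_number, record in enumerate(records, start=1):
        # start every row with None so missing fields stay empty
        row = {'run': run_number, 'record': record_number}
        row.update(dict.fromkeys(FIELDS))
        for data in record:
            # keep only the fields listed above; ignore anything else
            if data.name in FIELDS:
                row[data.name] = data.value
        rows.append(row)
    return rows


# hypothetical stand-in for a fitparse field, just for demonstration
FakeField = namedtuple('FakeField', ['name', 'value'])
demo_records = [
    [FakeField('speed', 3.266), FakeField('cadence', 89)],
    [FakeField('speed', 3.284), FakeField('heart_rate', 150)],
]
rows = records_to_rows(demo_records, run_number=1)
print(rows[0]['speed'], rows[1]['cadence'])
```

With a real file, the idea would be to call `records_to_rows(fitfile.get_messages("record"), run_index + 1)` inside the `with` block, collect the rows from every run into one list, and pass that list to `pd.DataFrame` once at the end.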


In [77]:
# save to csv
my_run_data.to_csv('running-data-exports/my_run_data.csv', index = False)

In [76]:
my_run_data.tail(20)

| | Run | Record | Timestamp | Latitude | Longitude | Speed | Enhanced_speed | Distance | Altitude | Enhanced_altitude | Cadence | Fractional_candence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 666944 | 813 | 1131 | 2021-03-30 07:33:06 | 615868292 | -46918829 | 3.695 | 3.695 | 10161.03 | 4.0 | 4.0 | 92 | 0.0 |
| 666945 | 813 | 1132 | 2021-03-30 07:33:07 | 615868336 | -46919440 | 3.704 | 3.704 | 10164.6 | 3.8 | 3.8 | 92 | 0.0 |
| 666946 | 813 | 1133 | 2021-03-30 07:33:10 | 615868373 | -46921129 | 3.704 | 3.704 | 10174.39 | 3.0 | 3.0 | 92 | 0.0 |
| 666947 | 813 | 1134 | 2021-03-30 07:33:12 | 615868570 | -46922182 | 3.658 | 3.658 | 10180.77 | 2.4 | 2.4 | 93 | 0.0 |
| 666948 | 813 | 1135 | 2021-03-30 07:33:18 | 615869046 | -46925810 | 3.63 | 3.63 | 10202.3 | 2.2 | 2.2 | 92 | 0.0 |
| 666949 | 813 | 1136 | 2021-03-30 07:33:19 | 615869235 | -46926456 | 3.639 | 3.639 | 10206.43 | 2.8 | 2.8 | 92 | 0.0 |
| 666950 | 813 | 1137 | 2021-03-30 07:33:22 | 615869244 | -46928051 | 3.63 | 3.63 | 10216.33 | 4.2 | 4.2 | 92 | 0.0 |
| 666951 | 813 | 1138 | 2021-03-30 07:33:23 | 615868913 | -46928610 | 3.592 | 3.592 | 10220.8 | 4.6 | 4.6 | 92 | 0.0 |
| 666952 | 813 | 1139 | 2021-03-30 07:33:24 | 615868639 | -46928910 | 3.574 | 3.574 | 10223.87 | 4.6 | 4.6 | 91 | 0.5 |
| 666953 | 813 | 1140 | 2021-03-30 07:33:27 | 615867472 | -46928956 | 3.564 | 3.564 | 10235.56 | 3.8 | 3.8 | 91 | 0.5 |


In [58]:
print('runs has', len(runs), 'values')
print('records has', len(records), 'values')
print('times has', len(times), 'values')
print('pos_lats has', len(pos_lats), 'values')
print('pos_longs', len(pos_longs), 'values')
print('speeds', len(speeds), 'values')
print('en_speeds', len(en_speeds), 'values')
print('dists', len(dists), 'values')
print('alts', len(alts), 'values')
print('en_alts', len(en_alts), 'values')
print('cads', len(cads), 'values')
print('frac_cads', len(frac_cads), 'values')

runs has 856 values
records has 856 values
times has 856 values
pos_lats has 856 values
pos_longs 856 values
speeds 856 values
en_speeds 856 values
dists 856 values
alts 856 values
en_alts 856 values
cads 856 values
frac_cads 856 values
