# Analyzing jogging data recorded in TCX files

I recently got back into the habit of regular jogging, and I have been tracking my progress with a free account at mapmyrun.com and their iPhone app. The site, at least in the free tier, offers only limited ways to explore your data, but it can export each workout in the TCX ([Training Center XML](https://en.wikipedia.org/wiki/Training_Center_XML)) format. The purpose of this notebook is to play around with that exported data.

## Initializing the notebook

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import os
from lxml import objectify
from IPython.display import set_matplotlib_formats

set_matplotlib_formats('retina')

## Extracting the data from the TCX files to Pandas DataFrames

As the name of the format suggests, the TCX files are XML data. Peeking into the file with a text editor quickly reveals the tag hierarchy and makes it easy to write functions to import data summarizing the run:

In [2]:
def read_run_summary(filename):
    """Reads the summary data of a run out of a .tcx file and returns them as
    a Pandas Series.
    """
    # Read the XML file
    tree = objectify.parse(filename)
    root = tree.getroot()
    
    # Based on my running habits, I can safely assume here that each file
    # contains only a single lap
    lap = root.Activities.Activity.Lap
    start_time = pd.to_datetime(lap.attrib['StartTime'])
    total_time_seconds = float(lap.TotalTimeSeconds)
    distance_meters = float(lap.DistanceMeters)
    maximum_speed = float(lap.MaximumSpeed) # in what units?
    
    return pd.Series({
        'start_time': start_time,
        'total_time_seconds': total_time_seconds,
        'distance_meters': distance_meters,
        'maximum_speed': maximum_speed
    })

In [3]:
read_run_summary("data/Ran 2.51 mi on 07_23_18.tcx.txt")

start_time            2018-07-23 22:48:58
total_time_seconds                   1423
distance_meters                   4046.63
maximum_speed                     6.52455
dtype: object
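
As a quick sanity check on these summary values, the average speed and pace follow directly from the total time and the distance. The sketch below hard-codes the numbers from the output above, and the helper name `pace_min_per_km` is made up for illustration. The average works out to roughly 2.8 in meters per second, well below the recorded maximum of about 6.5, which would be consistent with `maximum_speed` also being in meters per second, although that remains an assumption rather than something the file confirms.

```python
def pace_min_per_km(total_time_seconds, distance_meters):
    """Average pace in minutes per kilometer (hypothetical helper)."""
    return (total_time_seconds / 60.0) / (distance_meters / 1000.0)

# Values copied from the summary output above
total_time_seconds = 1423.0
distance_meters = 4046.63

average_speed = distance_meters / total_time_seconds  # meters per second
pace = pace_min_per_km(total_time_seconds, distance_meters)

print(f"average speed: {average_speed:.2f} m/s")
print(f"average pace:  {pace:.2f} min/km")
```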

The app seems to record your position and altitude at one-second intervals, storing them as "trackpoints".

In [6]:
def read_run_trackpoints(filename):
    """Reads the trackpoint data from a .tcx file and returns it as a Pandas
    DataFrame."""
    tree = objectify.parse(filename)
    root = tree.getroot()
    
    timestamp, altitude, latitude, longitude, distance = [], [], [], [], []
    for trackpoint in root.Activities.Activity.Lap.Track.iterchildren():
        timestamp.append(str(trackpoint.Time))
        altitude.append(float(trackpoint.AltitudeMeters))
        latitude.append(float(trackpoint.Position.LatitudeDegrees))
        longitude.append(float(trackpoint.Position.LongitudeDegrees))
        distance.append(float(trackpoint.DistanceMeters))
    
    # Timestamps are kept as ISO 8601 strings; apply pd.to_datetime to the
    # 'timestamp' column if datetime arithmetic is needed
    
    return pd.DataFrame({
        'timestamp': timestamp,
        'altitude_meters': altitude,
        'latitude': latitude,
        'longitude': longitude,
        'distance_meters': distance
    })

In [7]:
read_run_trackpoints("data/Ran 2.51 mi on 07_23_18.tcx.txt").head()

| | timestamp | altitude_meters | latitude | longitude | distance_meters |
|---|---|---|---|---|---|
| 0 | 2018-07-23T22:48:59+00:00 | 23.78 | 41.215496 | -73.103262 | 0.0 |
| 1 | 2018-07-23T22:49:00+00:00 | 23.86 | 41.21549 | -73.103247 | 1.443456 |
| 2 | 2018-07-23T22:49:01+00:00 | 23.92 | 41.215487 | -73.103236 | 2.383505 |
| 3 | 2018-07-23T22:49:02+00:00 | 23.97 | 41.215488 | -73.103227 | 3.176702 |
| 4 | 2018-07-23T22:49:03+00:00 | 24.02 | 41.215491 | -73.103216 | 4.138012 |
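
To get a feel for how the cumulative `distance_meters` column relates to the recorded coordinates, one can compute the great-circle distance between two consecutive trackpoints. The sketch below uses the haversine formula with a mean Earth radius of 6371 km, which is an approximation; how the app actually computes distance is not known, and the function name `haversine_meters` is only illustrative.

```python
import math

EARTH_RADIUS_METERS = 6371000.0  # mean Earth radius; an approximation

def haversine_meters(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_METERS * math.asin(math.sqrt(a))

# First two trackpoints from the table above: (latitude, longitude)
p0 = (41.215496, -73.103262)
p1 = (41.21549, -73.103247)

step = haversine_meters(*p0, *p1)
print(f"{step:.2f} m between the first two trackpoints")
```

The result is roughly 1.4 m, close to the 1.44 m the file records for the second trackpoint, so the distance column looks broadly consistent with the coordinates at this step.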


Let's also write a function for collecting the summary of each workout in a folder:

In [8]:
def read_all_runs(directory="data/"):
    """Returns a Pandas DataFrame containing a summary of all the runs in the
    given directory.
    """
    
    runs, files = [], []
    for path, _, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(path, filename)
            try:
                run = read_run_summary(filepath)
                runs.append(run)
                files.append(filepath)
            except Exception:
                # if there are any files we can't read, just ignore them
                pass
    
    # Also include a column for the filename, so that we can easily find it
    # if we want to read the details
    summary = pd.DataFrame(runs)
    summary['tcx_file'] = files
    
    return summary

In [9]:
runs = read_all_runs()
runs

| | start_time | total_time_seconds | distance_meters | maximum_speed | tcx_file |
|---|---|---|---|---|---|
| 0 | 2018-05-23 22:52:26 | 1605.0 | 3992.541062 | 4.335162 | data/Ran 2.48 mi on 05_23_18.tcx.txt |
| 1 | 2018-06-02 19:07:53 | 1211.0 | 3155.537341 | 3.56024 | data/Ran 1.96 mi on 06_02_18.tcx.txt |
| 2 | 2018-07-10 22:57:22 | 1616.0 | 4007.507962 | 5.197376 | data/Ran 2.49 mi on 07_10_18.tcx.txt |
| 3 | 2018-07-18 23:04:41 | 1414.0 | 3967.338735 | 4.607775 | data/Ran 2.47 mi on 07_18_18.tcx.txt |
| 4 | 2018-05-20 21:01:44 | 1610.0 | 4087.17049 | 3.989439 | data/Ran 2.54 mi on 05_20_18.tcx.txt |
| 5 | 2018-06-08 22:24:42 | 1477.0 | 3987.133667 | 4.384211 | data/Ran 2.48 mi on 06_08_18.tcx.txt |
| 6 | 2018-06-11 22:51:03 | 1455.0 | 3961.545097 | 4.529141 | data/Ran 2.46 mi on 06_11_18.tcx.txt |
| 7 | 2018-07-16 22:52:48 | 1502.0 | 4048.417486 | 6.421372 | data/Ran 2.52 mi on 07_16_18.tcx.txt |
| 8 | 2018-04-29 20:16:46 | 2110.0 | 4965.212483 | 3.503269 | data/Ran 3.09 mi on 04_29_18.tcx.txt |
| 9 | 2018-07-23 22:48:58 | 1423.0 | 4046.631114 | 6.524549 | data/Ran 2.51 mi on 07_23_18.tcx.txt |


The runs are in no particular order, because `os.walk` yields filenames in an arbitrary, filesystem-dependent order.
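
One way to put the runs in chronological order is to sort the summary by `start_time` and renumber the rows. The sketch below is a minimal illustration on three rows copied from the table above rather than on the full `runs` DataFrame; the variable name `runs_sorted` is an assumption made for the example.

```python
import pandas as pd

# Three rows copied from the summary table above, in their original order
runs = pd.DataFrame({
    'start_time': pd.to_datetime([
        '2018-05-23 22:52:26',
        '2018-06-02 19:07:53',
        '2018-04-29 20:16:46',
    ]),
    'distance_meters': [3992.541062, 3155.537341, 4965.212483],
})

# Sort by start time and renumber the rows from zero
runs_sorted = runs.sort_values('start_time').reset_index(drop=True)
print(runs_sorted)
```

The same `sort_values` call should work on the full summary, since `read_run_summary` already parses `start_time` with `pd.to_datetime`.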