# Collecting and preparing data before analysis
How to start? I have several .gpx files, one for each run, collected over several years, and I want to generate some visualizations from those data. To do so I'm using the pandas library.

Before we start, the few things I know about those gps paths are the following:
+ one path per week *when I have the data*
+ it starts from different locations, but each path is a loop *i.e. the path comes back to its origin*
+ it almost always reaches the same middle point *i.e. the same gps location*
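
The last two assumptions can be checked on any path. Below is a minimal sketch, not part of my module: it assumes a path is a list of (latitude, longitude) tuples in degrees, and the function names, the 200 m tolerance and the coordinates are all made up for illustration. It uses the haversine formula with only the standard library.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_loop(points, tolerance_m=200):
    """True when the last point lies within tolerance_m of the first one (hypothetical threshold)."""
    return haversine_m(points[0], points[-1]) <= tolerance_m


def reaches(points, target, tolerance_m=200):
    """True when at least one point of the path lies within tolerance_m of target."""
    return any(haversine_m(p, target) <= tolerance_m for p in points)


# made-up coordinates, for illustration only
path = [(45.4700, -73.6200), (45.4780, -73.6105), (45.4701, -73.6201)]
print(is_loop(path))
print(reaches(path, (45.4780, -73.6105)))
```

The tolerance matters because a gps device rarely records exactly the same coordinates twice, even when standing at the same spot.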

## Collect data
This is the part of the project that takes a huge amount of time.

I ended up gathering .gpx files, one for each run. Here I just renamed each file according to the date the run was done.

### Check data files to be processed

In [1]:
!ls ../data/gpx/*gpx | wc -l

     108


### Import a few modules

In [2]:
import sys
import numpy as np
import pandas as pd
import gpxpy
import gpxpy.gpx
import matplotlib.pyplot as plt
import geopy.distance
import glob
import os
import importlib
from matplotlib.animation import FuncAnimation
from datetime import datetime

Below I load my own module.

In [3]:
sys.path.append("../my_modules")
import toolToReadGPX as ttrgpx

path_data     = "../data/"
path_data_csv = "../data/csv/"

# Rename files

I want all files to be as follows **RunRite_year_month_day.gpx**.

I will:
+ list the files
+ lowercase all the file names
+ rewrite the **RunRite** prefix
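
The renaming rule in the list above can be written as a pure function, which makes it easy to check the result before touching any file. This is only a sketch: `normalized_name` and its parameters are hypothetical names, and it assumes the prefix always occupies the first 7 characters of the file name.

```python
import os


def normalized_name(path, prefix="RunRite", prefix_len=7):
    """Return path with its file name lowercased and its first prefix_len characters replaced by prefix."""
    head, tail = os.path.split(path)
    return os.path.join(head, prefix + tail.lower()[prefix_len:])


# hypothetical file name, only to show the effect of the rule
print(normalized_name("../data/gpx/RUNRITE_2018_01_18.GPX"))
```

Applying the function twice gives the same name as applying it once, so re-running the renaming cell on already renamed files is harmless.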

In [4]:
# list and rename the files
ll = glob.glob(path_data+"gpx/*.gpx")
ll.sort()
for src in ll:
    head, tail = os.path.split(src)
    # lowercase the name and replace its first 7 characters with "RunRite"
    dst = os.path.join(head, "RunRite" + tail.lower()[7:])
    os.rename(src, dst)

print(len(ll))

108


## Format data in pandas DataFrame

Once I found the right Python libraries, I created some tools (i.e. my own modules) to load those .gpx files, to get some information about them and to save them as pandas DataFrames. The idea is to not have to re-load all my source data files each time I want to do something with them.

I'm using:
+ https://pypi.org/project/gpxpy/
+ https://pandas.pydata.org/

In [5]:
importlib.reload(ttrgpx)

# Select gpx file
list_all_files = glob.glob(path_data+"/gpx/RunRite*.gpx")
list_all_files.sort()
print("There is {0:1.0f} files to process.".format(len(list_all_files)))

# convert list of files to a list of dataFrame
list_all_files_df = ttrgpx.fun_listPath_gpx2pd(list_all_files)

# get number of run
nb_run = len(list_all_files_df)

There is 108 files to process.


In [6]:
# get the average distance over all the runs
vec_run_distance = np.zeros(nb_run)
for i in np.arange(nb_run):
    vec_run_distance[i] = list_all_files_df[i].iloc[-1,4]

average_run_distance = np.mean(vec_run_distance / 1000)

print("Average distance per run for all the gpx paths listed: {0:1.2f}km.".format(average_run_distance))

Average distance per run for all the gpx paths listed: 10.98km.
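
The `iloc[-1, 4]` lookup above presumably reads the last value of the cumulative distance column, i.e. the total length of the run in metres. Here is a small sketch of that relation on plain lists, assuming each run is described by its per-step distances in metres; the function names and the values are made up.

```python
from itertools import accumulate


def cumulative(step_distances_m):
    """Running sum of per-step distances, in metres."""
    return list(accumulate(step_distances_m))


def average_run_km(runs_step_distances_m):
    """Average total length of several runs, in kilometres."""
    totals_m = [cumulative(steps)[-1] for steps in runs_step_distances_m]
    return sum(totals_m) / len(totals_m) / 1000


# made-up step distances (metres) for two short runs
runs = [[0.0, 400.0, 600.0], [0.0, 1500.0, 1500.0]]
print(cumulative(runs[0]))
print(average_run_km(runs))
```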


Check what one gpx path looks like:

# Reduce data size and save data as csv

Here I will reduce the data size, as I don't need so many points (e.g. more than 1000 here): each trace is reduced to **x** points per gpx path.

I will then save the downsampled paths as csv files.
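
The internals of the downsampling function are not shown here. One simple way to reduce a trace to a fixed number of points is to keep evenly spaced rows, always including the first and the last one. The sketch below illustrates that idea only; `downsample_indices` is a hypothetical helper, not the function from my module.

```python
def downsample_indices(n_points, number_of_sample):
    """Return number_of_sample evenly spaced indices in [0, n_points - 1], keeping both ends."""
    if number_of_sample >= n_points:
        return list(range(n_points))  # nothing to drop
    if number_of_sample == 1:
        return [0]
    last = n_points - 1
    # integer arithmetic guarantees that the first index is 0 and the last is n_points - 1
    return [i * last // (number_of_sample - 1) for i in range(number_of_sample)]


idx = downsample_indices(1000, 100)
print(len(idx), idx[0], idx[-1])
```

With a pandas DataFrame `df`, the selected rows could then be taken with `df.iloc[downsample_indices(len(df), 100)]`.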

In [27]:
importlib.reload(ttrgpx)

# reduce the size
list_all_files_ReSample_df = []

for df in list_all_files_df:
    ReSample_df = ttrgpx.fun_DownSample_gpx(df, number_of_sample = 100) # <--- number of points each trace is reduced to
    list_all_files_ReSample_df.append(ReSample_df)

print(len(list_all_files_df))

# check one gpx path
list_all_files_ReSample_df[0].describe()

108


| | latitude | longitude | elevation | distance | cumulative_distance |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.468929 | -73.624705 | 73.31 | 2.423878 | 4926.576085 |
| std | 0.005701 | 0.008375 | 8.502757 | 1.046578 | 2765.450691 |
| min | 45.458722 | -73.640089 | 60.0 | 0.0 | 0.0 |
| 25% | 45.464626 | -73.632082 | 66.0 | 2.256777 | 2636.791994 |
| 50% | 45.469419 | -73.622529 | 71.0 | 2.621083 | 5002.988528 |
| 75% | 45.47401 | -73.618103 | 78.0 | 2.987005 | 7267.802797 |
| max | 45.47808 | -73.610549 | 93.0 | 4.714748 | 9549.687209 |


In [8]:
head_tail = os.path.split(ll[0])
print(head_tail[1][0:-4])

RunRite_2018_01_18


# Save as csv

In [30]:
for c, df in enumerate(list_all_files_ReSample_df):
    # name each csv after the gpx file its DataFrame was loaded from
    head_tail = os.path.split(list_all_files[c])

    path_to_downSample_data = path_data_csv+head_tail[1][0:-4]+"_downSample.csv"
    df.to_csv(path_to_downSample_data, index=False)

In [19]:
!ls ../data/csv/*.csv | wc -l

      70


Check that the csv files have been created.
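
A file count alone does not tell which runs are missing. A possible complement, sketched below, compares the run names on both sides. It assumes the naming convention used in this notebook (a `_downSample` suffix on the csv files); `runs_without_csv` is a hypothetical helper and the file names in the example are made up.

```python
import os


def run_name(path, suffix="_downSample"):
    """File name without folder, extension or trailing suffix."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem[:-len(suffix)] if stem.endswith(suffix) else stem


def runs_without_csv(gpx_paths, csv_paths):
    """Return the sorted run names that have a .gpx file but no matching .csv file."""
    return sorted({run_name(p) for p in gpx_paths} - {run_name(p) for p in csv_paths})


# made-up file lists, for illustration only
gpx = ["../data/gpx/RunRite_2018_01_18.gpx", "../data/gpx/RunRite_2018_01_25.gpx"]
csv = ["../data/csv/RunRite_2018_01_18_downSample.csv"]
print(runs_without_csv(gpx, csv))
```

In the notebook, the two arguments could be the lists returned by `glob.glob` on the gpx and csv folders.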

In [29]:
# load one file
new_df = pd.read_csv(path_data_csv+"RunRite_2024_08_01_downSample.csv")
new_df.describe()

| | latitude | longitude | elevation | distance | cumulative_distance |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.485571 | -73.590099 | 73.934 | 2.220142 | 5646.144833 |
| std | 0.007232 | 0.01437 | 57.76649 | 1.51415 | 3018.497491 |
| min | 45.472547 | -73.610438 | 18.0 | 0.0 | 0.0 |
| 25% | 45.480106 | -73.603666 | 22.6 | 0.0 | 3187.357596 |
| 50% | 45.486018 | -73.592992 | 50.9 | 2.789315 | 5899.710849 |
| 75% | 45.491422 | -73.577565 | 117.7 | 3.4621 | 7855.971007 |
| max | 45.496528 | -73.564141 | 188.0 | 4.735848 | 11043.15038 |


In [13]:
list_all_files_ReSample_df[-1].describe()

| | latitude | longitude | elevation | distance | cumulative_distance |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.485571 | -73.590099 | 73.934 | 2.220142 | 5646.144833 |
| std | 0.007232 | 0.01437 | 57.76649 | 1.51415 | 3018.497491 |
| min | 45.472547 | -73.610438 | 18.0 | 0.0 | 0.0 |
| 25% | 45.480106 | -73.603666 | 22.6 | 0.0 | 3187.357596 |
| 50% | 45.486018 | -73.592992 | 50.9 | 2.789315 | 5899.710849 |
| 75% | 45.491422 | -73.577565 | 117.7 | 3.4621 | 7855.971007 |
| max | 45.496528 | -73.564141 | 188.0 | 4.735848 | 11043.15038 |


In [14]:
print(list_all_files[0],list_all_files[-1])

../data//gpx/RunRite_2024_08_01.gpx ../data//gpx/RunRite_2024_08_01.gpx


In [21]:
print(list_all_files_ReSample_df[0]["cumulative_distance"].iloc[-1])
print(list_all_files_ReSample_df[-1]["cumulative_distance"].iloc[-1])

9549.68720853391
11043.150380348541


# Now the data visualization can start

First, let's have a look at the data I have and how they are organized.

In folders I have:
+ *../data/gpx/* the .gpx files for each run, each of them named **RunRite_year_month_day.gpx**
+ *../data/csv/* the .csv files for each run, reduced in size compared to the .gpx files, each of them named **RunRite_year_month_day_downSample.csv**
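
Since the date of each run is encoded in the file name, it can be recovered without opening the file, which is handy for ordering or grouping runs by year. A minimal sketch, assuming every file follows the naming convention above; `run_date` is a hypothetical helper.

```python
import os
from datetime import datetime


def run_date(path):
    """Extract the date of a run from a RunRite_year_month_day[_downSample] file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    year, month, day = stem.split("_")[1:4]
    # datetime() raises ValueError on an impossible date, which catches badly named files
    return datetime(int(year), int(month), int(day)).date()


print(run_date("../data/csv/RunRite_2018_01_18_downSample.csv"))
```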

# Re-load the data

I want to have:
+ a list of all file names
+ a list of pandas DataFrames where each element is for a single run

In [22]:
# list of file names
path_csv_files = "../data/csv/"
list_csv_files  = glob.glob(path_csv_files+"*.csv")
list_csv_files.sort()

# list of pandas DataFrames, one per run
list_run_df = [pd.read_csv(f) for f in list_csv_files]

list_run_df[0].describe()

| | latitude | longitude | elevation | distance | cumulative_distance |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.468929 | -73.624705 | 73.31 | 2.423878 | 4926.576085 |
| std | 0.005701 | 0.008375 | 8.502757 | 1.046578 | 2765.450691 |
| min | 45.458722 | -73.640089 | 60.0 | 0.0 | 0.0 |
| 25% | 45.464626 | -73.632082 | 66.0 | 2.256777 | 2636.791994 |
| 50% | 45.469419 | -73.622529 | 71.0 | 2.621083 | 5002.988528 |
| 75% | 45.47401 | -73.618103 | 78.0 | 2.987005 | 7267.802797 |
| max | 45.47808 | -73.610549 | 93.0 | 4.714748 | 9549.687209 |


# Add extra data cleaning

I need to:
+ clean the data to keep only the gps trace of the run
+ remove all points recorded before the start and after the arrival
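
One possible heuristic for this cleaning, which I have not settled on, is to drop the leading and trailing points where the device is recording but barely moving. The sketch below works on the per-step distances in metres; `moving_slice`, the 1 m threshold and the values are assumptions made for illustration.

```python
def moving_slice(step_distances_m, min_step_m=1.0):
    """Return (start, stop) indices bounding the part of the trace where the runner moves.

    A point counts as stationary when the step leading to it is shorter than min_step_m.
    """
    moving = [i for i, d in enumerate(step_distances_m) if d >= min_step_m]
    if not moving:
        return 0, 0  # no movement at all
    # keep the point just before the first real step: it is where the run starts
    return max(moving[0] - 1, 0), moving[-1] + 1


steps = [0.0, 0.2, 0.1, 3.0, 2.8, 3.1, 0.3, 0.0]  # made-up values, in metres
start, stop = moving_slice(steps)
print(steps[start:stop])
```

This would not remove a walk to or from the meeting point, so a real cleaning step would probably need more than a distance threshold.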

# Adding missing data

Sometimes I don't have the data for a run, so I get them from other runners. As we don't all use the same device to record our runs, we don't always have the same extra information. Therefore I skipped the _time_ attached to each (longitude, latitude) point.