# Collecting and preparing data before analysis
How to start? I have several .gpx files, one per run, collected over several years, and I want to generate some visualizations from those data. To do so I'm using the pandas library.

Before we start, here are the few things I know about those GPS paths:
+ one path per week *when I have the data*
+ the starting location varies, but each path is a loop *i.e. the path comes back to its origin*
+ it almost always reaches the same middle point *i.e. the same GPS location*
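
The "each path is a loop" assumption can be checked numerically. Below is a minimal sketch, assuming a path is a list of `(latitude, longitude)` pairs. The helper names `haversine_m` and `is_loop` and the 100 m tolerance are hypothetical choices for illustration, and the haversine formula stands in for a more accurate geodesic distance such as the one provided by geopy.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_loop(points, tolerance_m=100.0):
    """Return True when the last point lies within tolerance_m of the first one."""
    (lat0, lon0), (lat1, lon1) = points[0], points[-1]
    return haversine_m(lat0, lon0, lat1, lon1) <= tolerance_m


# A tiny made-up path that leaves its origin and comes back close to it
path = [(45.471885, -73.613648), (45.480000, -73.600000), (45.471900, -73.613650)]
print(is_loop(path))
```

The same distance function could also be used to test whether a path passes near the shared middle point.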

## Collect data
This is the part of the project that requires a huge amount of time.

I ended up gathering .gpx files, one for each run. Here I just renamed each file according to the date of the run.

### Check data files to be processed

In [1]:
!ls /Users/jeremiegerhardt/Documents/dev/datavizRun/data/gpx/*gpx | wc -l

      99


### Import a few modules

In [1]:
import sys
import numpy as np
import pandas as pd
import gpxpy
import gpxpy.gpx
import matplotlib.pyplot as plt
import geopy.distance
import glob
import os
import importlib
from matplotlib.animation import FuncAnimation
from datetime import datetime

Below I'm loading my own module.

In [2]:
sys.path.append("../my_modules")
import toolToReadGPX as ttrgpx

path_data     = "../data/"
path_data_csv = "../data/csv/"

# Rename files

I want every file to be named **RunRite_year_month_day.gpx**.

I will:
+ list the files
+ lower-case each file name
+ rewrite the **RunRite** prefix

In [10]:
# list and rename the files
ll = glob.glob(path_data+"gpx/*.gpx")
for src in ll:
    head, tail = os.path.split(src)
    # lower-case the name, then replace its first 7 characters (the old prefix) with "RunRite"
    dst = os.path.join(head, "RunRite"+tail.lower()[7:])
    os.rename(src, dst)

# just to verify I renamed the files as I wanted
ll2 = glob.glob(path_data+"gpx/*.gpx")
ll2.sort()
for c, l in enumerate(ll2):
    print(c, l)

0 ../data/gpx/RunRite_2021_07_08.gpx
1 ../data/gpx/RunRite_2021_08_12.gpx
2 ../data/gpx/RunRite_2021_10_21.gpx
3 ../data/gpx/RunRite_2021_11_11.gpx
4 ../data/gpx/RunRite_2021_11_18.gpx
5 ../data/gpx/RunRite_2021_11_25.gpx
6 ../data/gpx/RunRite_2021_12_02.gpx
7 ../data/gpx/RunRite_2021_12_09.gpx
8 ../data/gpx/RunRite_2022_02_17.gpx
9 ../data/gpx/RunRite_2022_02_24.gpx
10 ../data/gpx/RunRite_2022_03_03.gpx
11 ../data/gpx/RunRite_2022_03_10.gpx
12 ../data/gpx/RunRite_2022_03_17.gpx
13 ../data/gpx/RunRite_2022_03_31.gpx
14 ../data/gpx/RunRite_2022_04_07.gpx
15 ../data/gpx/RunRite_2022_04_14.gpx
16 ../data/gpx/RunRite_2022_04_28.gpx
17 ../data/gpx/RunRite_2022_05_05.gpx
18 ../data/gpx/RunRite_2022_05_19.gpx
19 ../data/gpx/RunRite_2022_06_09.gpx
20 ../data/gpx/RunRite_2022_06_30.gpx
21 ../data/gpx/RunRite_2022_07_14.gpx
22 ../data/gpx/RunRite_2022_08_11.gpx
23 ../data/gpx/RunRite_2022_08_18.gpx
24 ../data/gpx/RunRite_2022_08_25.gpx
25 ../data/gpx/RunRite_2022_09_01.gpx
26 ../data/gpx/RunRite
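
The slice `[7:]` in the renaming cell only works when the old prefix is exactly seven characters long. A slightly more defensive variant, sketched below, rebuilds the name from the date found in the file name. It assumes every file name ends with `year_month_day.gpx` separated by underscores; the helper `canonical_name` and its pattern are hypothetical and not part of my module.

```python
import re

# Matches a trailing "YYYY_MM_DD.gpx", whatever the case of the extension
DATE_PATTERN = re.compile(r"(\d{4})_(\d{2})_(\d{2})\.gpx$", re.IGNORECASE)


def canonical_name(filename, prefix="RunRite"):
    """Return 'RunRite_YYYY_MM_DD.gpx', or None when no date is found in the name."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        return None
    year, month, day = match.groups()
    return f"{prefix}_{year}_{month}_{day}.gpx"


for name in ["runrite_2021_07_08.GPX", "RunRite_2022_03_10.gpx", "notes.txt"]:
    print(name, "->", canonical_name(name))
```

Files for which the function returns `None` can be left untouched instead of being renamed blindly.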

## Format data in pandas DataFrame

Once I found the right Python libraries, I created some tools (i.e. my own modules) to load those .gpx files, extract some information from them and save them as pandas DataFrames. The idea is to not have to reload all my source data files each time I want to do something with them.

I'm using:
+ https://pypi.org/project/gpxpy/
+ https://pandas.pydata.org/
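
The module itself is not shown in this notebook, and it relies on gpxpy for the parsing. To illustrate what is being read from a .gpx file, here is a dependency-free sketch that uses the standard library XML parser instead of gpxpy. The function `read_track_points` is hypothetical, the XML wrapper is a hand-written GPX 1.1 fragment, and the two points reuse values from the first run.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# GPX 1.1 namespace; other GPX versions use a different URI (assumption: files are GPX 1.1)
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="45.471885" lon="-73.613648"><ele>65.5</ele><time>2021-07-08T23:00:29Z</time></trkpt>
    <trkpt lat="45.472342" lon="-73.613935"><ele>68.1</ele><time>2021-07-08T23:01:22Z</time></trkpt>
  </trkseg></trk>
</gpx>"""


def read_track_points(gpx_text):
    """Return one dict per track point with time, latitude, longitude and elevation."""
    root = ET.fromstring(gpx_text)
    rows = []
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        ele = pt.find("gpx:ele", GPX_NS)
        time = pt.find("gpx:time", GPX_NS)
        rows.append({
            # "Z" is replaced so that older Python versions can parse the timestamp
            "time": datetime.fromisoformat(time.text.replace("Z", "+00:00")) if time is not None else None,
            "latitude": float(pt.get("lat")),
            "longitude": float(pt.get("lon")),
            "elevation": float(ele.text) if ele is not None else None,
        })
    return rows


for row in read_track_points(SAMPLE_GPX):
    print(row)
```

Passing the returned list to `pd.DataFrame(...)` gives a frame with the `time`, `latitude`, `longitude` and `elevation` columns; the distance and duration columns are then derived from those.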

In [3]:
importlib.reload(ttrgpx)

# Select gpx files (path_data already ends with "/")
list_all_files = glob.glob(path_data+"gpx/RunRite*.gpx")
list_all_files.sort()
print("There is {0:1.0f} files to process.".format(len(list_all_files)))

# convert the list of files to a list of DataFrames
list_all_files_df = ttrgpx.fun_listPath_gpx2pd(list_all_files)

# get the number of runs
nb_run = len(list_all_files_df)

# total distance of each run, in meters: last value of the cumulative_distance column
vec_run_distance = np.zeros(nb_run)
for i in np.arange(nb_run):
    vec_run_distance[i] = list_all_files_df[i]["cumulative_distance"].iloc[-1]

# average distance over all the runs, converted to km
average_run_distance = np.mean(vec_run_distance / 1000)

print("Average distance per run for all the gpx paths listed: {0:1.2f}km.".format(average_run_distance))

There is 99 files to process.
Average distance per run for all the gpx paths listed: 11.00km.


Check what one gpx path looks like:

In [15]:
print(ll2[0])
list_all_files_df[0].head()

../data/gpx/RunRite_2021_07_08.gpx


| | time | latitude | longitude | elevation | distance | cumulative_distance | duration | cumulative_duration |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-07-08 23:00:29+00:00 | 45.471885 | -73.613648 | 65.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2021-07-08 23:00:39+00:00 | 45.471885 | -73.613648 | 65.5 | 0.0 | 0.0 | 10.0 | 10.0 |
| 2 | 2021-07-08 23:00:41+00:00 | 45.471885 | -73.613648 | 65.5 | 0.0 | 0.0 | 2.0 | 12.0 |
| 3 | 2021-07-08 23:00:44+00:00 | 45.471885 | -73.613648 | 65.5 | 0.0 | 0.0 | 3.0 | 15.0 |
| 4 | 2021-07-08 23:00:47+00:00 | 45.471885 | -73.613648 | 65.5 | 0.0 | 0.0 | 3.0 | 18.0 |


# Reduce data size and save data as csv

Here I reduce the data size, as I don't need so many points (e.g. more than 1000 here): each gpx path is reduced to **x** points.

Then I save the downsampled paths as csv files.

In [16]:
importlib.reload(ttrgpx)

# reduce the size
list_all_files_ReSample_df = []

for df in list_all_files_df:
    ReSample_df = ttrgpx.fun_DownSample_gpx(df, number_of_sample = 100) # <--- number of points each trace is reduced to
    list_all_files_ReSample_df.append(ReSample_df)

print(len(list_all_files_df))

# check one gpx path
list_all_files_ReSample_df[0].describe()

99


| | latitude | longitude | elevation | distance | cumulative_distance | duration | cumulative_duration |
|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.486056 | -73.608386 | 111.534 | 7.560835 | 4968.481528 | 2.65 | 1936.9 |
| std | 0.006519 | 0.008316 | 43.288075 | 2.648653 | 2886.22647 | 0.903137 | 1125.162538 |
| min | 45.471885 | -73.621947 | 55.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 45.481418 | -73.614622 | 76.95 | 6.164782 | 2617.697671 | 2.0 | 980.0 |
| 50% | 45.487698 | -73.610363 | 98.4 | 7.555856 | 4840.667083 | 3.0 | 1894.0 |
| 75% | 45.491495 | -73.600949 | 138.025 | 9.576444 | 7395.307861 | 3.0 | 2908.0 |
| max | 45.493624 | -73.59247 | 197.4 | 14.036485 | 10112.057737 | 8.0 | 3901.0 |
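
`fun_DownSample_gpx` lives in the module and is not shown here. One simple way to reduce a trace to a fixed number of points is to keep the rows found at evenly spaced positions, always including the first and the last one. The sketch below illustrates that idea; it is an assumption about the approach, not necessarily what the module does, and the helper name `evenly_spaced_indices` is hypothetical.

```python
def evenly_spaced_indices(n_rows, n_samples):
    """Return n_samples row positions spread evenly over range(n_rows).

    The first and last rows are always kept. When the trace already has
    n_samples rows or fewer, every row is kept.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if n_rows <= n_samples:
        return list(range(n_rows))
    if n_samples == 1:
        return [0]
    # Integer arithmetic avoids any floating-point rounding surprises
    return [(i * (n_rows - 1)) // (n_samples - 1) for i in range(n_samples)]


indices = evenly_spaced_indices(1000, 100)
print(len(indices), indices[0], indices[-1])
```

With a DataFrame `df`, the reduced trace would then be `df.iloc[evenly_spaced_indices(len(df), 100)]`. Rows are picked rather than averaged, so per-step columns such as `distance` keep their original values.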


In [17]:
head_tail = os.path.split(ll2[0])
print(head_tail[1][0:-4])

RunRite_2021_07_08


In [18]:
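# save as csv
# list_all_files is the sorted list the DataFrames were built from, so both lists are aligned
for src_path, df in zip(list_all_files, list_all_files_ReSample_df):
    # file name without its ".gpx" extension
    base_name = os.path.splitext(os.path.basename(src_path))[0]
    path_to_downSample_data = path_data_csv+base_name+"_downSample.csv"
    df.to_csv(path_to_downSample_data, index=False)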

In [19]:
!ls /Users/jeremiegerhardt/Documents/ProjectDataVizRun/data/csv/*csv | wc -l

      99


Check that the csv files have been created.

In [20]:
!head /Users/jeremiegerhardt/Documents/ProjectDataVizRun/data/csv/RunRite_2021_07_08_downSample.csv

time,latitude,longitude,elevation,distance,cumulative_distance,duration,cumulative_duration
2021-07-08 23:00:29+00:00,45.471885,-73.613648,65.5,0.0,0.0,0.0,0.0
2021-07-08 23:01:22+00:00,45.472342,-73.613935,68.1,6.056304992277719,66.56976698168943,2.0,53.0
2021-07-08 23:01:56+00:00,45.473051,-73.614586,71.9,10.180549429785133,167.4458083534726,3.0,87.0
2021-07-08 23:02:32+00:00,45.473857,-73.615365,72.3,8.235967277669209,284.54733842398986,3.0,123.0
2021-07-08 23:03:56+00:00,45.474595,-73.614795,75.2,5.950272512784583,378.94393548738793,2.0,207.0
2021-07-08 23:04:47+00:00,45.475372,-73.614055,77.4,6.119016014265821,483.69095523612566,2.0,258.0
2021-07-08 23:05:21+00:00,45.476161,-73.613453,80.3,10.330146936556138,584.0896037604091,3.0,292.0
2021-07-08 23:06:17+00:00,45.476886,-73.612815,82.1,6.342151685346756,680.5195539891736,2.0,348.0
2021-07-08 23:06:53+00:00,45.477781,-73.611963,83.1,9.565859415246766,801.2419182187023,3.0,384.0


In [21]:
# load one file
new_df = pd.read_csv(path_data_csv+"RunRite_2021_07_08_downSample.csv")
new_df.describe()

| | latitude | longitude | elevation | distance | cumulative_distance | duration | cumulative_duration |
|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.486056 | -73.608386 | 111.534 | 7.560835 | 4968.481528 | 2.65 | 1936.9 |
| std | 0.006519 | 0.008316 | 43.288075 | 2.648653 | 2886.22647 | 0.903137 | 1125.162538 |
| min | 45.471885 | -73.621947 | 55.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 45.481418 | -73.614622 | 76.95 | 6.164782 | 2617.697671 | 2.0 | 980.0 |
| 50% | 45.487698 | -73.610363 | 98.4 | 7.555856 | 4840.667083 | 3.0 | 1894.0 |
| 75% | 45.491495 | -73.600949 | 138.025 | 9.576444 | 7395.307861 | 3.0 | 2908.0 |
| max | 45.493624 | -73.59247 | 197.4 | 14.036485 | 10112.057737 | 8.0 | 3901.0 |


In [22]:
list_all_files_ReSample_df[0].describe()

| | latitude | longitude | elevation | distance | cumulative_distance | duration | cumulative_duration |
|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 45.486056 | -73.608386 | 111.534 | 7.560835 | 4968.481528 | 2.65 | 1936.9 |
| std | 0.006519 | 0.008316 | 43.288075 | 2.648653 | 2886.22647 | 0.903137 | 1125.162538 |
| min | 45.471885 | -73.621947 | 55.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 45.481418 | -73.614622 | 76.95 | 6.164782 | 2617.697671 | 2.0 | 980.0 |
| 50% | 45.487698 | -73.610363 | 98.4 | 7.555856 | 4840.667083 | 3.0 | 1894.0 |
| 75% | 45.491495 | -73.600949 | 138.025 | 9.576444 | 7395.307861 | 3.0 | 2908.0 |
| max | 45.493624 | -73.59247 | 197.4 | 14.036485 | 10112.057737 | 8.0 | 3901.0 |
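
The two `describe()` outputs above match by eye. The same check can be automated; the sketch below writes a DataFrame to an in-memory csv, reads it back and compares the numeric columns. The helper `csv_round_trip_matches` is hypothetical and the small frame only reuses a few values from the first run.

```python
import io

import pandas as pd


def csv_round_trip_matches(df):
    """Return True when the numeric columns of df survive a to_csv / read_csv round trip."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    reloaded = pd.read_csv(buffer)

    numeric = df.select_dtypes(include="number").reset_index(drop=True)
    try:
        # check_exact=False tolerates tiny floating-point differences
        pd.testing.assert_frame_equal(
            numeric, reloaded[numeric.columns], check_exact=False, check_dtype=False
        )
    except AssertionError:
        return False
    return True


demo = pd.DataFrame({
    "time": ["2021-07-08 23:00:29+00:00", "2021-07-08 23:01:22+00:00"],
    "latitude": [45.471885, 45.472342],
    "longitude": [-73.613648, -73.613935],
    "cumulative_duration": [0.0, 53.0],
})
print(csv_round_trip_matches(demo))
```

The `time` column is left out of the comparison because it comes back from the csv as plain text unless it is parsed again.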


# Now the data visualization can start

First, let's have a look at the data I have and how they are organized.

In folders I have:
+ *../data/gpx/* the .gpx files for each run, each of them named **RunRite_year_month_day.gpx**
+ *../data/csv/* the .csv files for each run, reduced in size compared to the .gpx files, each of them named **RunRite_year_month_day_downSample.csv**

# Re-load the data

I want to have:
+ a list of all file names
+ a list of pandas DataFrames where each element is a single run

In [24]:
# list of file names
path_csv_files = "../data/csv/"
list_csv_files  = glob.glob(path_csv_files+"*.csv")
list_csv_files.sort()

# list of pandas DataFrames, one per csv file
list_run_df = [pd.read_csv(f) for f in list_csv_files]

list_run_df[0].head()

| | time | latitude | longitude | elevation | distance | cumulative_distance | duration | cumulative_duration |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-07-08 23:00:29+00:00 | 45.471885 | -73.613648 | 65.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2021-07-08 23:01:22+00:00 | 45.472342 | -73.613935 | 68.1 | 6.056305 | 66.569767 | 2.0 | 53.0 |
| 2 | 2021-07-08 23:01:56+00:00 | 45.473051 | -73.614586 | 71.9 | 10.180549 | 167.445808 | 3.0 | 87.0 |
| 3 | 2021-07-08 23:02:32+00:00 | 45.473857 | -73.615365 | 72.3 | 8.235967 | 284.547338 | 3.0 | 123.0 |
| 4 | 2021-07-08 23:03:56+00:00 | 45.474595 | -73.614795 | 75.2 | 5.950273 | 378.943935 | 2.0 | 207.0 |
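
Since every csv file name encodes the date of the run, the list of file names can also be turned into a list of run dates, which is handy for ordering or labelling plots. Here is a minimal sketch, assuming the **RunRite_year_month_day_downSample.csv** naming described above; the helper `run_date_from_path` is hypothetical.

```python
import os
from datetime import datetime


def run_date_from_path(path, prefix="RunRite_", suffix="_downSample.csv"):
    """Extract the run date from a path such as '../data/csv/RunRite_2021_07_08_downSample.csv'."""
    name = os.path.basename(path)
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError("unexpected file name: " + name)
    # Keep only the 'year_month_day' part between the prefix and the suffix
    stamp = name[len(prefix):-len(suffix)]
    return datetime.strptime(stamp, "%Y_%m_%d").date()


print(run_date_from_path("../data/csv/RunRite_2021_07_08_downSample.csv"))
```

Applied to each entry of `list_csv_files`, this gives one date per DataFrame in `list_run_df`, in the same order.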
