# Data Science: Analysing GPX Data
## Comparing Data Scraped from a Web API with the GPX Data Provided

Installing the GPX library:

In [3]:
pip install gpxpy

Note: you may need to restart the kernel to use updated packages.


Importing necessary packages:

In [4]:
from bs4 import BeautifulSoup
import numpy as np
from math import sin, cos, sqrt, atan2
import pandas as pd
import json, requests, urllib
import gpxpy
import gpxpy.gpx
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
r = 6373 # Radius of Earth in km

Defining a function to retrieve the following information of a given track:
- Each track segment's name
- The number of points in each segment of the track
- The total distance of each segment in the track
- The total uphill elevation and downhill demotion in the segments of each track

In [5]:
def gpx_2_df(file):
    # Reading the file as UTF-8 (assumed here, as is usual for GPX/XML) keeps accented names intact
    with open(file, encoding="utf-8") as f:
        gpx = gpxpy.parse(f)
    # Establishing the list within which all information is stored
    routes = []
    # ---------------------------------------------------------------
    # Establishing the previous point's elevation, carried from one point to the next
    elev1 = 0
    # ---------------------------------------------------------------
    for track in gpx.tracks:
        # Eliminating number tags and "(Developed ...)" tags from names
        names = track.name.split(" ")
        if names[-1] == "(Developed)":
            new_names = " ".join(names[1:names.index("(Developed)")])
        else:
            new_names = " ".join(names[1:names.index("(Developed")])
        # Replacing en dashes in names with plain hyphens
        new_names = new_names.replace("–", "-")
        pt_count = 0
        for segment in track.segments:
            # Setting the segment distance to start at zero for each new segment
            seg_dist = 0
            lat1 = 0
            long1 = 0
            # Setting the uphill elevation and downhill demotion values to zero for each segment
            positive = 0
            negative = 0
            for point in segment.points:
                # Counting amount of points in a given segment
                pt_count += 1
                # ----------------------------------------------------------------
                # Distance calcs - Establishing the new 2nd coordinate pair of a new point
                lat2 = point.latitude
                long2 = point.longitude
                lat = lat2 - lat1
                long = long2 - long1
                a = (sin(lat/2))**2 + cos(lat1)*cos(lat2)*(sin(long/2))**2
                c = 2*atan2(sqrt(a), sqrt(1-a))
                pt_dist = (r*c) / 100
                # Setting 1st point distance to zero as it is the start point & has no distance
                if lat1 == 0:
                    pt_dist = 0
                # Setting 2nd coordinate of previous calculation to be 1st coordinate of the next
                lat1 = lat2
                long1 = long2
                # Adding the distance between two points to the total segment distance
                seg_dist += pt_dist
                # ----------------------------------------------------------------
                # Elevation calcs - Rounding the 2nd point's elevation or demotion to 2 decimal places
                elev2 = round(float(point.elevation), 2)
                # Finding the total upward or downward slope between 2 points
                elev = elev2 - elev1
                # Setting the 2nd point to the 1st point for the next calculation
                elev1 = elev2
                # Counting total uphill elevation
                if elev > 0:
                    positive += elev
                # Counting total downhill demotion
                else:
                    negative += elev
        # Storing the information retrieved for this track as a dictionary
        routes.append({"Segment Name": new_names,
                       "No. of Points": pt_count,
                       "Distance (km)": seg_dist,
                       "Total Uphill Elevation (m)": positive,
                       "Total Downhill Demotion (m)": -negative})
    df = pd.DataFrame(routes)
    return df
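
The distance step in the function above is based on the haversine formula. Python's `math.sin` and `math.cos` expect angles in radians, whereas gpxpy reports latitude and longitude in decimal degrees. As a point of comparison, here is a minimal sketch of the same step with an explicit degree-to-radian conversion. The helper name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this illustration; they are not taken from the notebook.

```python
from math import radians, sin, cos, sqrt, atan2

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not the notebook's value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    # Convert degrees to radians before applying any trigonometric function
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return EARTH_RADIUS_KM * c


# One degree of latitude along a meridian is roughly 111.2 km
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

With a helper like this, the distance between consecutive points is returned directly in kilometres, so no further scaling is needed.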

Testing the defined function by transforming some of the GPX files to data frames:

In [6]:
trk1 = gpx_2_df("ev1.gpx")
trk2 = gpx_2_df("ev2.gpx")
trk6 = gpx_2_df("ev6.gpx")

Defining a function to find the segment of a given route with the largest distance:

In [7]:
def longest_seg(df):
    # Selecting the row with the maximum distance, using the column name rather than its position
    seg = df.loc[df["Distance (km)"].idxmax()]
    # Printing a message with the longest segment's name and distance
    print("The longest segment in this track is %s.\nDistance = %.2f km\n"
         % (seg["Segment Name"], seg["Distance (km)"]))

In [8]:
longest_seg(trk1)
longest_seg(trk2)
longest_seg(trk6)

The longest segment in this track is Kilboghavn - Nesna.
Distance = 95.30 km

The longest segment in this track is Höxter - Wittenberg.
Distance = 120.50 km

The longest segment in this track is Tuttlingen - Ulm.
Distance = 76.89 km



The <i>hilliest</i> segment in a given track is of interest, as it indicates how difficult the track may be to cycle.

It is difficult to define exactly what constitutes a <i>hilly</i> track. Here, a track with little elevation and demotion is assumed to be relatively flat. Thus, to evaluate the segments in each track, a <b>hill factor</b> is defined.

The hill factor is the sum of a segment's total uphill elevation and total downhill demotion. The "hilliest" segment is the one with the largest hill factor, i.e. the most combined climbing <b>and</b> descending.

Defining a function to calculate the hill factor:

In [9]:
def hill_factor(up_list, down_list):
    hill_factors = []
    for i in range(0, len(up_list)):
        # Adding both elevation and demotion of a segment to define a "hill factor"
        factor = up_list[i] + down_list[i]
        # Adding each hill constant to a list to align the hill factor with its segment
        hill_factors.append(factor)
    return hill_factors

Defining a function to find the "hilliest" segment of a given track:

In [10]:
def hilliest_seg(df):
    # Making lists of each uphill and downhill value within the DataFrame
    up_list = df["Total Uphill Elevation (m)"].values.tolist()
    down_list = df["Total Downhill Demotion (m)"].values.tolist()
    # Calculating the hill factor (elevation + demotion) of every segment
    hill_factors = hill_factor(up_list, down_list)
    # Finding the position in the list of the max hill factor
    list_place = hill_factors.index(max(hill_factors))
    # Selecting that segment's row by position, so that two segments sharing
    # the same uphill value cannot be confused with each other
    seg_list = df.iloc[list_place].tolist()
    print("The hilliest segment in this track is %s.\nElevation = %.2f m\nDemotion = %.2f m\n"
         % (seg_list[0], seg_list[3], seg_list[4]))

In [11]:
hilliest_seg(trk1)
hilliest_seg(trk2)
hilliest_seg(trk6)

The hilliest segment in this track is Kilboghavn - Nesna.
Elevation = 1692.40 m
Demotion = 1702.10 m

The hilliest segment in this track is Höxter - Wittenberg.
Elevation = 2188.10 m
Demotion = 2207.70 m

The hilliest segment in this track is Tuttlingen - Ulm.
Elevation = 627.10 m
Demotion = 803.50 m
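
As a worked check of the hill factor arithmetic, the elevation and demotion figures printed above can be added by hand. The snippet below is only an illustration: the keys `ev1`, `ev2` and `ev6` are labels chosen here for the three tracks, and comparing segments across different tracks is not part of the original analysis.

```python
# (uphill m, downhill m) of the hilliest segment printed for each track
hilliest_by_track = {
    "ev1": (1692.40, 1702.10),
    "ev2": (2188.10, 2207.70),
    "ev6": (627.10, 803.50),
}

# Hill factor = total uphill elevation + total downhill demotion
hill_factors = {trk: up + down for trk, (up, down) in hilliest_by_track.items()}

# The label with the largest hill factor
hilliest = max(hill_factors, key=hill_factors.get)

for trk, factor in hill_factors.items():
    print("%s: hill factor = %.2f m" % (trk, factor))
```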



<b>Analysis Q</b>: What is the longest stage in EuroVelo 6?

In [12]:
longest_seg(trk6)

The longest segment in this track is Tuttlingen - Ulm.
Distance = 76.89 km



<b>Analysis Q</b>: What is the stage in EuroVelo 1 with the most uphill?

In [13]:
hilliest_seg(trk1)
# Check to see if the "hilliest" segment is also the segment with the most uphill elevation
max_uphill = trk1["Total Uphill Elevation (m)"].max()
print("The maximum uphill elevation was independently checked and is shown to be %.2f m" % max_uphill)

The hilliest segment in this track is Kilboghavn - Nesna.
Elevation = 1692.40 m
Demotion = 1702.10 m

The maximum uphill elevation was independently checked and is shown to be 1692.40 m


# Holiday Finder

It would be useful to find a run of flat stages where a cyclist could rest without having to worry about tough routes immediately ahead. To that end, the 3 consecutive stages with the lowest combined hill factor are sought.

In [14]:
def find_rest_stop(df):
    # Making lists of each uphill and downhill value within the DataFrame
    up_list = df["Total Uphill Elevation (m)"].values.tolist()
    down_list = df["Total Downhill Demotion (m)"].values.tolist()
    # Finding the sum of elevation and demotion for each segment
    hill_factors = hill_factor(up_list, down_list)
    # Establishing an empty list where the values for 3 grouped segments will be placed
    segments_grouped = []
    # Finding the sum of each group of 3 consecutive route segments
    for j in range(0, (len(hill_factors)-2)):
        segments_grouped.append(hill_factors[j]+hill_factors[j+1]+hill_factors[j+2])
    # Finding the minimum sum of 3 consecutive segment elevations and demotions
    small_seg = min(segments_grouped)
    # Finding the location within the list where the minimum value occurs. This is where the rest stop section begins
    list_place = segments_grouped.index(small_seg)
    # Selecting the start, middle and end segment names by position, so that two
    # segments sharing the same uphill value cannot be confused with each other
    names = df["Segment Name"].tolist()
    seg_start, seg_mid, seg_end = names[list_place:list_place+3]
    # Printing the sought information in an easily readable manner
    print("The ideal rest stop would be before the segment of %s begins." % seg_start)
    print("This will allow for a relatively flat route through %s up to %s." % (seg_mid, seg_end))
    print("The average sum of elevation and demotion for each segments is no higher than %.2f m." % (small_seg/3))
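
The search above is a sliding-window sum over consecutive segments. A generic version of the idea could look like the following sketch. `best_window` is a hypothetical helper, not part of the notebook, and the hill factors in the usage lines are made-up values.

```python
def best_window(values, size, pick=min):
    """Return (start_index, window_sum) of the run of `size` consecutive values chosen by `pick`."""
    if size <= 0 or size > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    # A list of n values has n - size + 1 windows of `size` consecutive items
    sums = [sum(values[i:i + size]) for i in range(len(values) - size + 1)]
    best = pick(sums)
    # index() returns the first window that reaches the chosen sum
    return sums.index(best), best


hill_factors = [310.0, 95.5, 120.0, 152.5, 400.0]  # made-up values
start, total = best_window(hill_factors, 3)
print(start, total)  # prints: 1 368.0
```

Passing `pick=max` returns the window with the largest sum instead, which is the shape of the "most uphill" question.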

<b>Analysis Q</b>: What are the 3 flattest continuous stages in EuroVelo 2?

In [15]:
find_rest_stop(trk2)

The ideal rest stop would be before the segment of Athlone - Kinnegad begins.
This will allow for a relatively flat route through Kinnegad - Maynooth up to Maynooth - Dublin.
The average sum of elevation and demotion for each segments is no higher than 122.67 m.


<b>Analysis Q</b>: Find the 5 most uphill continuous stages in EuroVelo 1.

In [16]:
# Making a list of the elevation data for ease of use
up_list = trk1["Total Uphill Elevation (m)"].values.tolist()
# Summing the elevation of each segment with that of the 4 segments that follow it
# (a list of n segments has n-4 groups of 5 consecutive segments)
uphill_grouped = [sum(up_list[k:k+5]) for k in range(len(up_list)-4)]
# Establishing the maximum uphill sum of every group of 5 segments
most_uphills = max(uphill_grouped)
# Finding the place within the list that this max value lies in. This tells us the first of the 5 segments
list_place = uphill_grouped.index(most_uphills)
# Extracting the names of the 5 consecutive segments by position
seg_names = trk1["Segment Name"].iloc[list_place:list_place+5].tolist()
# Printing the results
print("The 5 most uphill continuous stages in EuroVelo 1 are:\n%s" % " -> ".join(seg_names))
print("The average elevation for each of the 5 segments = %.2f m" % (most_uphills/5))

The 5 most uphill continuous stages in EuroVelo 1 are:
Floro - Forde -> Forde - Askvoll -> Hellevik - Rysjedalsvika -> Rutledal - Slovag -> Leirvag - Lunde
The average elevation for each of the 5 segments = 675.18 m


# Task 3: Testing the Accuracy of Distance Estimates

After investigating the mapping APIs suggested in the assignment brief, the TomTom API was chosen to assess the accuracy of the distances calculated from the GPX data.

<b>TomTom API:</b> https://developer.tomtom.com

<b>NOTE:</b> As the assignment brief warns, this API is rate-limited, so API calls need to be paused between working sessions.

## TomTom

The skeleton of a directions request is as follows:

> https://{baseURL}/routing/{versionNumber}/calculateRoute/{routePlanningLocations}/{contentType}?{sectionType}&key={api_key}

In [17]:
prefix = "https://api.tomtom.com/routing/1/calculateRoute/"
contentType = "json?"
sectionType = "sectionType=travelMode&travelMode=bicycle&"
api_key = "key=d6W04aTCnailhAZCJqNfeblTIIEG4OZX"
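
The request skeleton above can also be assembled with the standard library. This is a minimal sketch rather than the notebook's method: `build_route_url` and the `YOUR_KEY` placeholder are hypothetical, the coordinates are made up, and the query parameters are simply the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_route_url(start, end, key, travel_mode="bicycle"):
    """Assemble a calculateRoute URL from two (lat, lon) tuples, following the skeleton above."""
    base = "https://api.tomtom.com/routing/1/calculateRoute/"
    # routePlanningLocations has the form "lat,lon:lat,lon"
    locations = "%s,%s:%s,%s" % (start[0], start[1], end[0], end[1])
    # urlencode keeps the insertion order of the dictionary and escapes the values
    query = urlencode({"sectionType": "travelMode", "travelMode": travel_mode, "key": key})
    return base + locations + "/json?" + query


# Made-up coordinates and a placeholder key
print(build_route_url((47.3, 12.8), (47.35, 13.2), "YOUR_KEY"))
```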

First, the API is given only the start and end coordinates of each route segment (not the points in between). This shows how much the distance changes when the API is free to choose the route between the two coordinates.

Defining a function to extract the starting and ending coordinates of each segment in a track described by a GPX file:

In [18]:
def start_end_coords(file):
    with open(file) as f:
        gpx = gpxpy.parse(f)
    # Establishing the list within which the start and end coordinates of the segments are stored
    TT_waypoints = []
    # ---------------------------------------------------------------
    for track in gpx.tracks:
        for segment in track.segments:
            points = []
            for point in segment.points:
                lat = round(point.latitude,7)
                long = round(point.longitude,7)
                points.append(str(lat)+","+str(long))
            # Joining the first and last coordinates of the segment as "lat,long:lat,long"
            start_end = points[0]+":"+points[-1]
        # Storing the start:end pair for this track
        TT_waypoints.append(start_end)
    return TT_waypoints

Defining a function to use the TomTom mapping API to measure the distance from one set of coordinates to another and store all of these distance measurements in a DataFrame:

<b>NOTE:</b> <i>NaN</i> values are kept in the DataFrame. In each row they mark either a distance that the API could not find or the column that does not apply (a distance is under- or over-estimated, never both). Keeping every row also means that the column means are calculated over the full set of segments, with pandas skipping the <i>NaN</i> entries.

In [19]:
def API_distances(file):
    TT_distances = []
    df_coords = start_end_coords(file)
    for coord in df_coords:
        # Building the request URL: prefix + "start:end/" + content type + options + key
        routePlanningLocations = coord+"/"
        url = prefix + routePlanningLocations + contentType + sectionType + api_key
        try:
            response = requests.get(url)
            data = json.loads(response.text)
            # Converting the route length from metres to kilometres
            route_length = (data["routes"][0]["summary"]["lengthInMeters"])/1000
        # Network errors, invalid JSON or a response without a route are recorded as "Not Found"
        except (requests.RequestException, ValueError, KeyError, IndexError):
            route_length = "Not Found"
        TT_distances.append({"TomTom Distance (km)":route_length})
    df = pd.DataFrame(TT_distances)
    return df
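
Because the API is rate-limited, one option is to pause and retry when a request is throttled. The sketch below illustrates the idea and is not part of the original workflow. `fetch_with_backoff` is a hypothetical helper, and it assumes throttling is signalled by HTTP status 429 (the standard "Too Many Requests" code), which has not been checked against TomTom's documentation here.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); wait and retry while the status is 429."""
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < retries:
            # Double the pause after each throttled attempt
            sleep(delay * 2 ** attempt)
    # Still throttled after the last retry: hand the final response back to the caller
    return status, body
```

Here `fetch` is any callable that returns a `(status, body)` pair, for example a small wrapper around `requests.get` that returns `(response.status_code, response.text)`. Passing the callable in keeps the retry logic testable without network access.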

Defining a function to compare distances summarised in Task 1 with distances calculated in Task 3 using the TomTom mapping API:

In [20]:
def compare_w_api(file):
    # Establishing an empty list to store comparison information
    compare = []
    # Establishing two DataFrames with route distances using functions defined in Tasks 1 and 3
    gpx_df = gpx_2_df(file)
    TT_df = API_distances(file)
    # Turning the two distance columns in each DataFrame into lists
    TT_list = TT_df["TomTom Distance (km)"].values.tolist()
    gpx_list = gpx_df["Distance (km)"].values.tolist()
    for a in range(0, len(gpx_list)):
        if TT_list[a] == "Not Found":
            compare.append({"% Underestimated":np.nan,
                           "% Overestimated":np.nan})
        else:
            if gpx_list[a] < TT_list[a]:
                compare.append({"% Underestimated":100-(gpx_list[a]/TT_list[a])*100,
                                "% Overestimated":np.nan})
            else:
                compare.append({"% Underestimated":np.nan,
                                "% Overestimated":100-(TT_list[a]/gpx_list[a])*100})
    # Joining the DataFrames to display calculated distances along with API-retrieved distances
    gpx_df = gpx_df.join(TT_df)
    compare_df = pd.DataFrame(compare)
    compare_df = gpx_df.join(compare_df)
    # Removing elevation and demotion columns from new comparison DataFrame
    compare_df.drop("Total Uphill Elevation (m)", inplace=True, axis=1)
    compare_df.drop("Total Downhill Demotion (m)", inplace=True, axis=1)
    # Counting all segment distances that were not found by the API
    print("Distances not found = %i" % (compare_df["TomTom Distance (km)"]=="Not Found").sum())
    print("Average %% Underestimated = %.2f%%" % (compare_df["% Underestimated"].mean()))
    print("Average %% Overestimated = %.2f%%" % (compare_df["% Overestimated"].mean()))
    return compare_df
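
The percentage calculation inside the comparison can be isolated as a small helper. This is a sketch only: `percent_difference` is a hypothetical name, and the two sample rows are distances taken from the `ev14.gpx` results in this notebook.

```python
def percent_difference(gpx_km, api_km):
    """Return ("under" or "over", percentage) for a GPX distance compared with an API distance."""
    if gpx_km < api_km:
        # The GPX distance is shorter than the API distance: underestimated
        return "under", 100 - (gpx_km / api_km) * 100
    # Otherwise the GPX distance is at least as long as the API distance: overestimated
    return "over", 100 - (api_km / gpx_km) * 100


# Zell Am See - St Johann im Pongau: 30.72411 km (GPX) against 39.353 km (TomTom)
print("%s by %.2f%%" % percent_difference(30.72411, 39.353))
# Szentgotthard - Vasvar: 47.914727 km (GPX) against 47.059 km (TomTom)
print("%s by %.2f%%" % percent_difference(47.914727, 47.059))
```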

In [21]:
compare_w_api("ev14.gpx")

Distances not found = 0
Average % Underestimated = 15.60%
Average % Overestimated = 1.79%


| | Segment Name | No. of Points | Distance (km) | TomTom Distance (km) | % Underestimated | % Overestimated |
|---|---|---|---|---|---|---|
| 0 | Zell Am See - St Johann im Pongau | 54 | 30.72411 | 39.353 | 21.926893 | NaN |
| 1 | St Johann im Pongau - Liezen | 145 | 82.290975 | 93.533 | 12.019314 | NaN |
| 2 | Liezen - World Heritage Graz | 203 | 116.944107 | 144.003 | 18.790507 | NaN |
| 3 | World Heritage Graz - Szentgotthard | 158 | 74.269141 | 74.771 | 0.671195 | NaN |
| 4 | Szentgotthard - Vasvar | 67 | 47.914727 | 47.059 | NaN | 1.785938 |
| 5 | Vasvar - Keszthely | 57 | 45.528641 | 57.057 | 20.204987 | NaN |
| 6 | Keszthely - Balatonfuzfo | 112 | 65.362961 | 80.155 | 18.454293 | NaN |
| 7 | Balatonfuzfo - Velence | 66 | 51.411212 | 62.015 | 17.098747 | NaN |


In [22]:
compare_w_api("ev19.gpx")

Distances not found = 0
Average % Underestimated = 35.44%
Average % Overestimated = 9.25%


| | Segment Name | No. of Points | Distance (km) | TomTom Distance (km) | % Underestimated | % Overestimated |
|---|---|---|---|---|---|---|
| 0 | Langres - Pouilly-en-bassigny | 45 | 19.466521 | 30.428 | 36.024315 | NaN |
| 1 | Pouilly-en-bassigny - Montigny-le-Roi | 15 | 5.650435 | 10.636 | 46.874436 | NaN |
| 2 | Montigny-le-Roi - Bourmont | 37 | 14.786136 | 27.137 | 45.513006 | NaN |
| 3 | Bourmont - Neufchâteau | 43 | 14.643477 | 20.381 | 28.151334 | NaN |
| 4 | Neufchâteau - Vaucouleurs | 51 | 18.031749 | 31.445 | 42.656228 | NaN |
| 5 | Vaucouleurs - Commercy | 29 | 11.018337 | 19.474 | 43.420268 | NaN |
| 6 | Commercy - St-Mihiel | 28 | 9.172713 | 18.686 | 50.91131 | NaN |
| 7 | St-Mihiel - Verdun memorial | 37 | 18.851096 | 35.786 | 47.322706 | NaN |
| 8 | Verdun memorial - Dun-Sur-Meuse | 45 | 21.750029 | 34.575 | 37.093191 | NaN |
| 9 | Dun-Sur-Meuse - Stenay | 19 | 8.538632 | 14.366 | 40.563611 | NaN |


# Discussion

The percentage by which the GPX-based calculation underestimates route distance, relative to the TomTom mapping API, varies widely: the average is 15.60% for ev14.gpx and 35.44% for ev19.gpx. Several factors may contribute to this.

## API Method
The way the calculated distances were compared with the API measurements contributes to this variation. The API was given only the start and end coordinates of each route segment and calculated its own route between them. It would be more accurate to request the distance between each pair of consecutive points and add these up to obtain the total segment distance. However, when this was attempted, the API failed to find too many of the point-to-point distances to yield credible results.

Additionally, the TomTom API is rate-limited, so the comparison was designed to keep the number of requests to a minimum.

The API method therefore plays a part in this variation, and not all of the blame can be placed on the manual coordinate distance measurements.

## Manual Distance Measurements
The distances calculated in this Jupyter Notebook use a method that takes the curvature of the Earth into account. Even so, each point-to-point distance is a straight-line measurement between consecutive latitude and longitude points, so bends in the road between recorded points and other real-world details are not captured.

Because the route type was set to bicycle, the TomTom API distance calculations do account for these details. This could be another source of variation in the distance results.

In conclusion, the measured distances varied to a degree from the API-measured distances. However, some distances were not found by the API and the percentage 