# Importing and Parsing Strava Datasets

This notebook parses all Strava GPX files under `./datasets` and builds the SQLite database described below.

## Step 0: Global Parameters


In [1]:
database_path = 'bike_data.db'
epsg_code = 32613

## Step 1: Import required libraries

In [2]:
import gpxpy
import datetime
from math import sqrt, floor
import numpy as np
import pandas as pd
import os, re
import sqlite3 as sql
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
from pyproj import CRS, Proj, Transformer
import geopy.distance

## Step 2: Remove the database and re-initialize

In [3]:
if os.path.exists( database_path ):
    print('Removing {}'.format( database_path ))
    os.remove( database_path )
    
conn = sql.connect( database_path )

Removing bike_data.db


Set up the projection from geographic coordinates to the UTM zone selected by `epsg_code`.

In [4]:
#  Set up the projection transformer
crs = CRS.from_epsg( epsg_code )
proj = Transformer.from_crs(crs.geodetic_crs, crs)
#  Raw string avoids the invalid-escape warning for \d
utm_zone = int(re.findall(r"\d+", crs.utm_zone)[0])
print('UTM Zone: {}'.format(utm_zone))

UTM Zone: 13

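As a sanity check on the zone printed above, the zone number can also be derived without pyproj. The sketch below relies on two conventions: EPSG codes 32601 through 32660 denote the WGS 84 / UTM northern-hemisphere zones 1 through 60, and standard zones are 6 degrees of longitude wide, counted eastward from 180°W. Both helper names are hypothetical, and the longitude formula ignores the special zones around Norway and Svalbard.

```python
from math import floor

def utm_zone_from_epsg(epsg_code):
    """Zone number for a WGS 84 / UTM north EPSG code (32601-32660)."""
    if not 32601 <= epsg_code <= 32660:
        raise ValueError('Not a WGS 84 / UTM north code: {}'.format(epsg_code))
    return epsg_code - 32600

def utm_zone_from_longitude(longitude):
    """Standard 6-degree zone for a longitude in [-180, 180)."""
    return int(floor((longitude + 180.0) / 6.0)) + 1

print(utm_zone_from_epsg(32613))            # 13
print(utm_zone_from_longitude(-104.1234))   # 13
```

Both values agree with the `UTM Zone: 13` output, which suggests the sample longitudes used in this notebook fall inside the configured zone.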

### Database Tables

#### sector_list

This table contains the list of sectors, each sector's name, and the number of points in its polygon.

| index          | sector_name  | sector_id    |  number_points |
| :------------- | :----------- | :----------- | :------------- |
|  Integer       | String       | String       | Integer        |
|  0             | Sector 1     | sector_1     | 14             |

* Sector Name:
  * Descriptive name of the sector.
* Sector-ID:
  * Name of the table (see `sector_X` below) that holds the sector's polygon points.
* Number-Points:
  * Number of points in the sector polygon.

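Reading `sector_list` back might look like the sketch below. It uses an in-memory SQLite database seeded with the sample row from the table above so that it runs on its own; `demo_conn` and `sector_tables` are hypothetical names, and in the notebook the same query would run against `conn`.

```python
import sqlite3

# In-memory stand-in for bike_data.db so the sketch is self-contained
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE sector_list ("index" INTEGER, sector_name TEXT, '
                  'sector_id TEXT, number_points INTEGER)')
demo_conn.execute('INSERT INTO sector_list VALUES (?, ?, ?, ?)',
                  (0, 'Sector 1', 'sector_1', 14))

# Map each descriptive name to the table that holds its polygon points
sector_tables = dict(demo_conn.execute(
    'SELECT sector_name, sector_id FROM sector_list ORDER BY "index"'))
print(sector_tables)
```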
#### sector_X

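Each sector has its own table listing the vertices of its polygon. The cell in Step 3 also stores `gridZone`, `isNorth`, `easting`, and `northing` columns for each vertex.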

| index   | latitude    | longitude   | elevation  |
| :------ | :---------- | :---------- | :--------- |
| Integer | Float       | Float       | Float      |
| 0       | 38.12345    | -104.1243   | 1713       |

#### point_list

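This table contains the full list of points collected from GPX files. The loading cell in Step 4 writes these columns with camelCase names (`stepDist`, `timeDiffSec`, `sectorId`, `datasetId`) and adds `gridZone`, `easting`, `northing`, and `elapsedDist`.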

| index   | longitude | latitude  | elevation | timestamp                 | step_dist    | time_diff_sec | sector_id | dataset                      | dataset_id |
| :------ | :-------- | :-------- | :-------- | :------------------------ | :----------- | :------------ | :-------- | :--------------------------- | :--------- |
| Integer | Float     | Float     | Float     | String                    | Float        | Float         | String    | String                       | Integer    |
|   3     | -104.1234 | 39.12345  | 1713.123  | 2020-11-20 00:03:18+00:00 | 3.7842052780 | 1             | sector_1  | ./datasets/ride.20201120.gpx | 1          |

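Once points carry a sector and a step distance, per-sector totals reduce to a `GROUP BY`. The sketch below is an assumption about how the table might be queried, not part of the notebook: it uses an in-memory stand-in with the snake_case column names from the table above, and the four rows are made up for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE point_list (sector_id TEXT, step_dist REAL, '
                  'time_diff_sec REAL, dataset_id INTEGER)')

# Hypothetical rows, invented for this example only
demo_rows = [('sector_1', 0.0, 0.0, 1),
             ('sector_1', 4.0, 1.0, 1),
             ('sector_2', 6.0, 2.0, 1),
             ('sector_2', 3.0, 1.0, 1)]
demo_conn.executemany('INSERT INTO point_list VALUES (?, ?, ?, ?)', demo_rows)

# Total distance and time spent in each sector for one dataset
query = ('SELECT sector_id, SUM(step_dist), SUM(time_diff_sec) '
         'FROM point_list WHERE dataset_id = ? '
         'GROUP BY sector_id ORDER BY sector_id')
summary = demo_conn.execute(query, (1,)).fetchall()
for sector_id, metres, seconds in summary:
    print('{}: {:.1f} m in {:.0f} s'.format(sector_id, metres, seconds))
```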
## Step 3: Load the Sector Polygons

To classify track points, each one must be assigned to a sector. The sector KML file stores each sector as a KML polygon. This cell parses that file and writes the sectors to the database.

In [5]:
# Load the sector map
import KML_Parser
kml_file = KML_Parser.Bike_Sector_KML_File()
bike_sector_polygons = kml_file.Parse_KML( 'bike_sectors.kml' )

#  Create a table list
table_list = ['sector_' + str(x) for x in range(0, len(bike_sector_polygons))]
dataset = { 'sector_name'  : [ x['name'] for x in bike_sector_polygons ],
            'sector_id'    : table_list,
            'number_points': [ len(x['polygon']) for x in bike_sector_polygons ] }
pd.DataFrame( data = dataset ).to_sql( 'sector_list', conn )

#  Create a table for each sector
for counter, sector in enumerate( bike_sector_polygons ):
    table_name = table_list[counter]
    dataset = { 'latitude':  [ x[1] for x in sector['polygon'] ],
                'longitude': [ x[0] for x in sector['polygon'] ],
                'gridZone':  [ utm_zone for x in sector['polygon'] ],
                'isNorth':   [ True for x in sector['polygon'] ],
                'easting':   [ 0    for x in sector['polygon'] ],
                'northing':  [ 0    for x in sector['polygon'] ],
                'elevation': [ x[2] for x in sector['polygon'] ] }
    
    #  Compute UTM Coordinates (the transformer takes latitude, then longitude)
    for x in range( 0, len(sector['polygon'])):
        (easting, northing) = proj.transform( sector['polygon'][x][1],
                                              sector['polygon'][x][0] )
        dataset['easting'][x]  = easting
        dataset['northing'][x] = northing
    
    pd.DataFrame( data = dataset ).to_sql( table_name, conn )
    
    #  Create Shapely Polygon (longitude, latitude) to Aid Point-in-Polygon Searches
    sector['shape']     = Polygon( [ (x[0], x[1]) for x in sector['polygon'] ] )
    sector['sector_id'] = table_name

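The `KML_Parser` module is not shown in this notebook. Judging only from how its result is used above (a list of dicts with a `name` and a `polygon` of longitude, latitude, elevation triples), a parser could be sketched with the standard library as follows. The function name `parse_kml_polygons` and the sample document are hypothetical; the sketch assumes KML 2.2 namespaces and reads only the first `coordinates` element of each polygon, which is normally the outer boundary.

```python
import xml.etree.ElementTree as ET

KML_NS = {'kml': 'http://www.opengis.net/kml/2.2'}

def parse_kml_polygons(kml_text):
    """Return [{'name': str, 'polygon': [(lon, lat, alt), ...]}, ...]."""
    root = ET.fromstring(kml_text)
    sectors = []
    for placemark in root.iter('{http://www.opengis.net/kml/2.2}Placemark'):
        polygon_el = placemark.find('.//kml:Polygon', KML_NS)
        if polygon_el is None:
            continue  # skip placemarks that are not polygons
        name = placemark.findtext('kml:name', default='', namespaces=KML_NS)
        coords_text = polygon_el.findtext('.//kml:coordinates', default='',
                                          namespaces=KML_NS)
        points = []
        # KML stores whitespace-separated "lon,lat[,alt]" tuples
        for token in coords_text.split():
            parts = [float(v) for v in token.split(',')]
            if len(parts) == 2:
                parts.append(0.0)  # altitude is optional in KML
            points.append(tuple(parts))
        sectors.append({'name': name, 'polygon': points})
    return sectors

# Hypothetical single-sector document for illustration
sample_kml = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document><Placemark>
<name>Sector 1</name>
<Polygon><outerBoundaryIs><LinearRing><coordinates>
-104.13,38.12,1713 -104.12,38.12,1713 -104.12,38.13,1713 -104.13,38.12,1713
</coordinates></LinearRing></outerBoundaryIs></Polygon>
</Placemark></Document></kml>"""

sectors = parse_kml_polygons(sample_kml)
print(sectors[0]['name'], len(sectors[0]['polygon']))
```

Note the longitude-first ordering: it is why the cell above reads `x[0]` as longitude and `x[1]` as latitude.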

To map points to sectors, we need a lookup function. It returns the `sector_id` of the first polygon that contains the point, or `None` if no polygon does.

In [6]:
def Find_Sector( sector_polygon_list, point ):
    '''Return the sector_id of the first polygon containing point, else None.'''
    
    #  Iterate over each polygon
    for sector in sector_polygon_list:
        if sector['shape'].contains( point ):
            return sector['sector_id']
    
    #  The point lies outside every sector
    return None

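Shapely handles the geometry test inside `contains`. To illustrate the idea, here is a plain-Python sketch of even-odd ray casting, a common point-in-polygon technique. It is not Shapely's implementation, the function names and the two unit-square sectors are hypothetical, and points exactly on a boundary may be classified differently than Shapely would classify them.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd ray casting: count edges crossed by a ray going in +x."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def find_sector_sketch(sectors, x, y):
    """Return the id of the first sector containing (x, y), else None."""
    for sector in sectors:
        if point_in_polygon(x, y, sector['vertices']):
            return sector['sector_id']
    return None

# Two adjacent unit squares standing in for sector polygons (x = lon, y = lat)
demo_sectors = [
    {'sector_id': 'sector_0', 'vertices': [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {'sector_id': 'sector_1', 'vertices': [(1, 0), (2, 0), (2, 1), (1, 1)]},
]

print(find_sector_sketch(demo_sectors, 0.5, 0.5))  # sector_0
print(find_sector_sketch(demo_sectors, 1.5, 0.5))  # sector_1
print(find_sector_sketch(demo_sectors, 3.0, 0.5))  # None
```

The last call shows the case worth handling downstream: a track point outside every sector yields `None` rather than a sector id.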
## Step 4: Load the GPX Data and write to the database

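Loading the data takes three steps:
* Load each GPX dataset.
* For each point in a dataset:
  * Assign a sector.
  * Compute the step distance, elapsed distance, and time difference from the previous point.
  * Project the point to UTM.
* Write all points to the database.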

In [7]:
#  Look for GPX Files
dataset_id = 0
columns = ['longitude',
           'latitude',
           'gridZone',
           'easting',
           'northing',
           'elevation',
           'timestamp', 
           'stepDist',
           'elapsedDist',
           'timeDiffSec',
           'sectorId',
           'dataset',
           'datasetId']

#  Collect one dict per point; DataFrame.append was removed in pandas 2.0
#  and copied the whole frame on every call
rows = []
for root, dirs, files in os.walk( "./datasets", topdown=False ):
    for name in files:
        fname = os.path.join( root, name )
        if os.path.splitext( fname )[-1] == '.gpx':
            print('Loading: {}, Dataset-ID: {}'.format( fname, dataset_id ))
            
            #  Close the file as soon as it has been parsed
            with open( fname ) as gpx_file:
                gpx_data = gpxpy.parse( gpx_file )
            
            #  Only the first segment of the first track is read
            point_data = gpx_data.tracks[0].segments[0].points
            
            prev_point = None
            distance_elapsed = 0
            for point in point_data:
                
                #  Assign a sector-id to the point
                sector_id = Find_Sector( bike_sector_polygons, Point([point.longitude, point.latitude]) )
                time_diff = 0
                distance_prev = 0
                
                #  The first point of a dataset has no predecessor
                if prev_point is not None:
                    distance_prev = geopy.distance.geodesic( (prev_point.latitude, prev_point.longitude),
                                                             (     point.latitude,      point.longitude) ).m
                    distance_elapsed += distance_prev
                    time_diff = ( point.time - prev_point.time).total_seconds()
                    
                (easting,northing) = proj.transform( point.latitude, point.longitude )
                #print( 'Point: {}, {}, Easting: {}, Northing: {}'.format( point.latitude, 
                #                                                          point.longitude,
                #                                                          easting,
                #                                                          northing))
                    
                rows.append({'longitude'   : point.longitude, 
                             'latitude'    : point.latitude,
                             'gridZone'    : utm_zone,
                             'easting'     : easting,
                             'northing'    : northing,
                             'elevation'   : point.elevation,
                             'timestamp'   : str(point.time),
                             'stepDist'    : distance_prev,
                             'elapsedDist' : distance_elapsed,
                             'timeDiffSec' : time_diff,
                             'sectorId'    : sector_id,
                             'dataset'     : fname,
                             'datasetId'   : int(dataset_id) })
                prev_point = point
                
            dataset_id += 1

#  Build the frame once, then write it to the database
df = pd.DataFrame( rows, columns=columns )
df.to_sql( 'point_list', conn )
print('Database Written to Disk')

Loading: ./datasets/ride.20201123.gpx, Dataset-ID: 0
Loading: ./datasets/ride.20201120.gpx, Dataset-ID: 1
Loading: ./datasets/ride.20201130.gpx, Dataset-ID: 2
Loading: ./datasets/ride.20201209.gpx, Dataset-ID: 3
Database Written to Disk

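The per-point bookkeeping in the cell above (step distance, running distance, and time difference) can be tried in isolation with the sketch below. To keep it free of third-party packages it swaps geopy's geodesic distance for the haversine formula on a spherical Earth, so distances will differ slightly from the notebook's values. The three track points and all names are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Hypothetical track: (latitude, longitude, time), heading due north
start = datetime(2020, 11, 20, 0, 3, 18)
track = [(38.1230, -104.1240, start),
         (38.1231, -104.1240, start + timedelta(seconds=1)),
         (38.1233, -104.1240, start + timedelta(seconds=3))]

rows = []
prev = None
elapsed = 0.0
for lat, lon, time in track:
    step = 0.0
    dt = 0.0
    # The first point has no predecessor, so its step and time stay at zero
    if prev is not None:
        step = haversine_m(prev[0], prev[1], lat, lon)
        dt = (time - prev[2]).total_seconds()
        elapsed += step
    rows.append({'stepDist': step, 'elapsedDist': elapsed, 'timeDiffSec': dt})
    prev = (lat, lon, time)

for row in rows:
    print(row)
```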