# 2. Analyzing Human Trajectories with Spark


## Objective

In the previous tutorial you learnt how to use Spark for loading, cleansing and visualizing urban data, working with the [Geolife trajectory dataset][2]. In this second tutorial, you will explore new capabilities of Spark for detecting human trajectories and computing some of their basic properties (distance, duration, etc.).

[2]: https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/

### Preparing the Environment

In [3]:
# For printing several outputs in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Connect to Spark cluster
import pyspark
sc


## User Transportation Modes

Besides GPS coordinates, the Geolife trajectory dataset contains information about **user transportation modes** (e.g., bicycle, bus, taxi, etc.). This information was manually provided by project participants and is stored in files called `labels.txt`. **There is one labels.txt file per user folder**. 

A labels.txt file is composed of lines having the structure:

    (date1, time1, date2, time2, transport)

The pairs `date1+time1` and `date2+time2` represent the **start time** and **end time** of a trajectory, respectively. Let's verify this by loading and exploring the first lines of a labels.txt file.

In [6]:
rdd1 = sc.textFile( 'data/010/labels.txt' )
rdd1.take(5)

['Start Time\tEnd Time\tTransportation Mode',
 '2007/06/26 11:32:29\t2007/06/26 11:40:29\tbus',
 '2008/03/28 14:52:54\t2008/03/28 15:59:59\ttrain',
 '2008/03/28 16:00:00\t2008/03/28 22:02:00\ttrain',
 '2008/03/29 01:27:50\t2008/03/29 15:59:59\ttrain']

As you can see, the first line of the file is a header (metadata). Let's remove it by filtering out the lines that do not represent a trajectory.
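The header filtering and line parsing can be sketched as follows. This is a minimal sketch, not the tutorial's own solution: the names `notLabelMetadata`, `parseLabelLine` and `parseLabelsRDD` are hypothetical and simply mirror the naming of the trace helpers, and the timestamp format is inferred from the sample lines printed above.

```python
from datetime import datetime

# Assumptions taken from the sample output above:
# the header line starts with 'Start Time' and fields are tab-separated.
HEADER_PREFIX = 'Start Time'
TIME_FORMAT = '%Y/%m/%d %H:%M:%S'   # e.g. 2007/06/26 11:32:29

def notLabelMetadata(line):
    """Keep only non-empty lines that are not the header."""
    return bool(line.strip()) and not line.startswith(HEADER_PREFIX)

def parseLabelLine(line):
    """Turn 'start<TAB>end<TAB>mode' into a dict with datetime values."""
    start, end, mode = line.strip().split('\t')
    return {
        'start': datetime.strptime(start, TIME_FORMAT),
        'end': datetime.strptime(end, TIME_FORMAT),
        'mode': mode,
    }

def parseLabelsRDD(rdd):
    """Accepts any RDD of raw labels.txt lines, e.g. sc.textFile(...)."""
    return rdd.filter(notLabelMetadata).map(parseLabelLine)
```

With a running Spark context, `parseLabelsRDD( sc.textFile('data/010/labels.txt') ).take(3)` should return the first labeled trajectories. Since both times are parsed as `datetime` objects, the duration of a labeled trajectory is simply `label['end'] - label['start']`.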

In [None]:
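import gmaps

# Load three traces of user 010 and merge them into a single RDD
rdd1 = sc.textFile( 'data/010/Trajectory/20080618003409.plt' )
rdd2 = sc.textFile( 'data/010/Trajectory/20081219114010.plt' )
rdd3 = sc.textFile( 'data/010/Trajectory/20080928160000.plt' )

rdd4 = rdd1.union( rdd2 ).union( rdd3 )

# notTraceMetadata and parseLogLine are assumed to be the helpers
# defined in the previous tutorial
crds = rdd4.filter( notTraceMetadata ).map( parseLogLine ).map( 
    lambda cor: ( cor['lat'], cor['lon'] )
).collect()

# Plot the collected (lat, lon) pairs as a heatmap
fig = gmaps.figure()
fig.add_layer( gmaps.heatmap_layer(crds) )
fig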

In [None]:


from math import radians, sin, cos, sqrt, atan2
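
The tutorial does not show the distance computation itself. The imports in the cell above match those commonly used for the haversine (great-circle) formula, so the sketch below assumes that this is the intended approach. The function name `haversineKm` and the Earth radius constant are illustrative choices, not part of the original notebook.

```python
from math import radians, sin, cos, sqrt, atan2

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversineKm(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    a = min(1.0, a)  # guard against rounding slightly above 1
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return EARTH_RADIUS_KM * c
```

The length of a trace could then be approximated by summing `haversineKm` over consecutive locations, once they are ordered by timestamp.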



In [None]:
import folium
from folium.plugins import HeatMap


rdd1 = sc.textFile( 'data/010/Trajectory/20080618003409.plt' )
rdd2 = sc.textFile( 'data/010/Trajectory/20081219114010.plt' )

_data = rdd1.union( rdd2 ).filter( notTraceMetadata ).map( parseLogLine ).map( 
    lambda cor: [ cor['lat'], cor['lon'] ]
).collect()



m = folium.Map(location=_data[0], tiles='Mapbox', API_key='pk.eyJ1IjoiamF2aWVyYWVzcGlub3NhIiwiYSI6ImNpdmgwaDhmdDAwejQyeW8wMWswbzg3YTcifQ.g7D5-oF4zE4r-oZidRTWBA')
HeatMap( _data ).add_to(m)

m

### Grouping Traces per User

So far, you are capable of answering these questions:

* How many locations does the trace ```20080618003409.plt``` contain?
* How many locations did user ```010``` produce?
* How many locations did all users produce?


In [None]:
rdd1 = sc.textFile( 'data/010/Trajectory/20080618003409.plt' )
rdd2 = sc.textFile( 'data/010/Trajectory/*.plt' )
rdd3 = sc.textFile( 'data/*/Trajectory/*.plt' )

parseTraceRDD( rdd1 ).count()
parseTraceRDD( rdd2 ).count()
parseTraceRDD( rdd3 ).count()

What about the **number of locations per user**? This information is not contained in the files themselves but in their paths.

One way of dealing with this situation is by using ```sc.wholeTextFiles(path)```. This operation returns a ```(key, value)``` pair RDD, where ```key``` is the PATH to a file and ```value``` its content.

In [None]:
rdd4 = sc.wholeTextFiles( 'data/*/Trajectory/*.plt' )

rdd4.map(
    lambda kv: ( kv[0], type(kv[1]) )
).take(3)


You can now extract the user and trace identifiers from the file path.

In [None]:
import re

def parsePath(path):
    # Capture the user folder and the trace name, e.g. .../010/Trajectory/20080618003409.plt
    m = re.match(r'.*/(\d+)/Trajectory/(\d+)\.plt$', path)
    return ( m.group(1), m.group(2) )

def parseWholeLog(content):
    lines = content.strip().split('\n')
    return [ parseLogLine(line) for line in lines[6:] ]


rdd4 = sc.wholeTextFiles( 'data/*/Trajectory/*.plt' )

def f1(kv): return (parsePath(kv[0]), parseWholeLog(kv[1])) 
def f2(kv): return kv

locationsRDD = rdd4.map( f1 ).flatMapValues( f2 )
locationsRDD.count()
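
The cell above counts all locations, while the question raised earlier was the number of locations per user. One possible way to answer it is sketched below, assuming `locationsRDD` holds `((user, trace), location)` pairs as built above. The helper names `toUserCount` and `countLocationsPerUser` are hypothetical.

```python
from operator import add

def toUserCount(kv):
    """((user, trace), location) -> (user, 1)"""
    (user, _trace), _location = kv
    return (user, 1)

def countLocationsPerUser(locations_rdd):
    """Expects a pair RDD shaped like locationsRDD; returns (user, count) pairs."""
    return locations_rdd.map(toUserCount).reduceByKey(add)
```

Calling `countLocationsPerUser( locationsRDD ).collect()` should then return one `(user, count)` pair per user folder.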

### Visualizing GPS Traces

#### Plotting Traces

In [None]:
rdd4 = sc.wholeTextFiles( 'data/*/Trajectory/*.plt' )

def f1(kv): return (parsePath(kv[0]), parseWholeLog(kv[1])) 
def f2(kv): return kv

locationsRDD = rdd4.map( f1 ).flatMapValues( f2 )


In [None]:
def toGeoJSON(geometry, locations):
    coords = [ [loc['lon'], loc['lat']] for loc in locations ]
    return {
        "type": "FeatureCollection",
        "features": [{
            "type": "Feature",
            "geometry": {
                "type": geometry, 
                "coordinates": coords if geometry == "LineString" else coords.pop()
            }
        }]
    }

locs  = rdd4.map( f1 ).take(10)
jsons = [ toGeoJSON( 'LineString', loc[1] ) for loc in locs ]


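# Draw each trace as its own GeoJSON layer
# (the loop variable is not called json, to avoid shadowing the standard module)
fig = gmaps.figure()
for geojson in jsons:
    fig.add_layer( gmaps.geojson_layer( geojson ) )
fig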


#### Initial and Final Locations

Recall that traces represent sequences of ```<lat, lon, ts>``` locations. The **initial/final location** of a trace thus corresponds to the location with the **min/max timestamp**.

* Projecting and ordering timestamps

In [None]:
# locationsRDD holds ((user, trace), location) pairs, so project the values first
timestampsRDD = locationsRDD.values().map(
    lambda loc: loc['ts']
).sortBy( 
    lambda ts: ts 
)

timestampsRDD.take(6)

* Identifying min/max timestamps

In [None]:
min_ts = timestampsRDD.min()
max_ts = timestampsRDD.max()

min_ts, max_ts

* Looking up the location with the min/max timestamp
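
This last step could be sketched as follows. The sketch assumes that `locationsRDD` holds `((user, trace), location)` pairs and that every location is a dict with a comparable `'ts'` value, as in the cells above. The helper names `earlier`, `later`, `traceEndpoints` and `globalEndpoints` are hypothetical.

```python
def earlier(a, b):
    """Return the location with the smaller timestamp."""
    return a if a['ts'] <= b['ts'] else b

def later(a, b):
    """Return the location with the larger timestamp."""
    return a if a['ts'] >= b['ts'] else b

def traceEndpoints(locations_rdd):
    """((user, trace), location) pairs -> ((user, trace), (first, last)) pairs."""
    first = locations_rdd.reduceByKey(earlier)
    last = locations_rdd.reduceByKey(later)
    return first.join(last)

def globalEndpoints(locations_rdd):
    """Locations with the min/max timestamp over all traces."""
    values = locations_rdd.values()
    by_ts = lambda loc: loc['ts']
    return values.min(key=by_ts), values.max(key=by_ts)
```

`traceEndpoints( locationsRDD ).take(3)` gives the initial and final location of each trace, while `globalEndpoints( locationsRDD )` returns the locations matching `min_ts` and `max_ts` computed above. Reducing per key avoids sorting the whole dataset just to find two elements.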