## Introduction
This notebook parses raw GPS data. In some apps, trip recording starts and stops when the user says so, and the travel mode is also reported by the user. This causes several problems, such as users travelling on a different mode from the one they reported, or trips lasting several hours because the user forgot to turn off the app.

In [1]:
import geopandas as gpd
import pandas as pd
from datetime import datetime
import numpy as np
import math
import matplotlib.pyplot as plt

### Import and Format Data

In [2]:
# Load the data files (here the first 20) and concatenate them
import glob
path  = '.data/' # replace with data path
files = glob.glob(path + '*.geojson')
list_file=[]
for i in files[0:20]:
    a=gpd.read_file(i)
    list_file.append(a)
data_app=pd.concat(list_file)

In [3]:
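# Drop duplicated records after concatenating all the files and compare the shapes before and after
# (.copy() makes `data` an independent frame, so later column assignments do not warn)
data=data_app.drop_duplicates(subset=['activity_id','user_id','date'], keep='first').copy()
data.shape,data_app.shape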

((1767000, 5), (1922959, 5))

In [5]:
# Reset the index so that it is consecutive again
data=data.reset_index(drop=True)

In [6]:
# Unpack geometry into lat and lon
data.loc[:,'lat']=data['geometry'].apply(lambda x: x.y)
data.loc[:,'lon']=data['geometry'].apply(lambda x: x.x)


In [7]:
# Put the date into timestamp format
data.loc[:,'date_1']=data['date'].apply(lambda x: datetime.strptime(x,'%Y-%m-%dT%H:%M:%S'))
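
As a possible alternative to the row-wise `strptime` call, pandas can parse the whole column in one step. A minimal sketch on made-up timestamps (the `sample` frame is hypothetical; the format string is the one used above):

```python
import pandas as pd

# Hypothetical sample with the same 'date' string format as the app data
sample = pd.DataFrame({'date': ['2019-03-01T08:15:30', '2019-03-01T08:15:35']})

# Vectorized parsing; giving the format explicitly avoids format inference
sample['date_1'] = pd.to_datetime(sample['date'], format='%Y-%m-%dT%H:%M:%S')

# Difference between the two parsed timestamps
print(sample['date_1'].iloc[1] - sample['date_1'].iloc[0])
```

Strings that do not match the format raise an error by default, which can help to spot malformed records early.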

In [2]:
# Sort the data in chronological order within each activity, from the oldest to the newest date
data.sort_values(['activity_id','date_1'],inplace=True)

In [13]:
# Function to calculate the time elapsed from the start of an activity to each recorded point (as a timedelta)
def total_second(arr):
    return (arr-arr.min())

In [14]:
# Use the total_second function to compute the time elapsed since the start of each activity
cum= data[['date_1']].groupby(data['activity_id']).transform(total_second)

In [23]:
cum.rename(columns={'date_1':'delta_seconds'},inplace=True)

In [24]:
# Add a column where we see the seconds from the start of the trip for each recorded point
merge=pd.merge(data,cum,left_index=True,right_index=True,how='outer')

In [3]:
data=merge.copy()

In [33]:
# Use the timedelta method total_seconds (note that this is not our function total_second) to convert each timedelta to a float number of seconds
data['delta_seconds']=data['delta_seconds'].apply(lambda x: x.total_seconds())
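
The transform, rename, merge and conversion steps above could also be collapsed into two vectorized operations. This is only a sketch on a made-up frame (`sample` and its values are hypothetical), not a drop-in replacement tested on the app data:

```python
import pandas as pd

# Hypothetical sample: two activities with already parsed timestamps
sample = pd.DataFrame({
    'activity_id': ['a', 'a', 'a', 'b', 'b'],
    'date_1': pd.to_datetime([
        '2019-03-01 08:00:00', '2019-03-01 08:00:10', '2019-03-01 08:01:00',
        '2019-03-02 09:30:00', '2019-03-02 09:30:05',
    ]),
})

# Start time of each activity, broadcast back to every row of that activity
start = sample.groupby('activity_id')['date_1'].transform('min')

# Timedelta -> float seconds in one vectorized step
sample['delta_seconds'] = (sample['date_1'] - start).dt.total_seconds()

print(sample['delta_seconds'].tolist())  # [0.0, 10.0, 60.0, 0.0, 5.0]
```

Because `transform` keeps the original index, no merge is needed to bring the result back into the frame.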

### Parse Trajectories

***
This program extracts "stay points" from a sequence of traces of an anonymous individual user. The spatiotemporal inputs are:
* user1--time stamp (in seconds), lon, lat (in degrees)
* par(1)--roaming distance threshold (in km)
* par(2)--stay time threshold (in seconds).

The algorithm was first proposed by R. Hariharan and K. Toyama (2004) in “Project Lachesis: parsing and modeling location histories.” The program here is a slightly revised version of the R. Hariharan and K. Toyama (2004) algorithm, written by Shan Jiang (shanjang@mit.edu), May 2013. For more advanced algorithms treating phone data for a similar purpose, please refer to Jiang, S., G. Fiore, Y. Yang, J. Ferreira, E. Frazzoli, and M. C. González (2013). "A Review of Urban Computing for Mobile Phone Traces: Current Methods, Challenges and Opportunities." Proceedings of the ACM SIGKDD International Workshop on Urban Computing. Chicago, IL, USA.
***

We took the code described above and modified it for our purpose: we were more interested in identifying moving points than stay points.

### Setting the spatial and temporal threshold parameters

* 1st is the spatial parameter, in km
* 2nd is the temporal parameter, in seconds

Please try with different parameters and compare differences.
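
To make the two thresholds concrete, here is a simplified stay-point sketch in the spirit of the description above. It is not the code used later in this notebook: the helper names `haversine_km` and `stay_points`, the 6371 km radius and the sample track are all assumptions made for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km (haversine formula, assumed mean Earth radius)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def stay_points(times, lats, lons, diam_km=0.1, dur_s=300):
    """Return (first_index, last_index) pairs of detected stays."""
    stays = []
    i, n = 0, len(times)
    while i < n:
        # Extend j while each point remains within diam_km of the anchor point i
        j = i + 1
        while j < n and haversine_km(lats[i], lons[i], lats[j], lons[j]) <= diam_km:
            j += 1
        # Points i..j-1 are within the roaming distance; keep them if they last long enough
        if times[j - 1] - times[i] >= dur_s:
            stays.append((i, j - 1))
            i = j
        else:
            i += 1
    return stays

# Hypothetical track: moving north, standing still for 420 s, then moving again
times = [0, 60, 120, 180, 300, 450, 600, 660, 720]
lats = [40.00, 40.01, 40.02, 40.03, 40.03, 40.03, 40.03, 40.04, 40.05]
lons = [-3.70] * 9

print(stay_points(times, lats, lons))             # [(3, 6)]
print(stay_points(times, lats, lons, dur_s=600))  # []
```

With the 300 s threshold the pause is reported as a stay; raising the threshold to 600 s makes it disappear, which is the kind of difference worth checking when tuning the parameters.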

### Detect Stay points

In [77]:
def stay(user1):
    """Label the points of one activity.

    Returns one label per point: points in the k-th detected moving segment get k
    (1, 2, ...) and all other points keep 0. If no stay is detected, every point gets 1.
    """
    user1 = user1.reset_index()
    par = [0.1, 300]  # [roaming distance threshold (km), stay time threshold (s)]
    maxt = len(user1['delta_seconds'])  # Number of records in the activity
    diam = par[0]
    dur = par[1]
    stay_ind = [0] * maxt

    # Distance between each point and the previous one (the first point has none, so NaN)
    user1.loc[1:, 'dest_lat'] = user1['lat'].tolist()[:-1]
    user1.loc[1:, 'dest_lon'] = user1['lon'].tolist()[:-1]
    user1.loc[1:, 'dist'] = user1.apply(
        lambda row: pos2dist(row['lat'], row['lon'], row['dest_lat'], row['dest_lon']), axis=1)
    dist = user1['dist'].tolist()

    # Detect the stays
    i = 0
    move = 1
    start_move = 0
    stop = 0
    while i <= maxt - 2:
        if dist[i] > diam:
            i = i + 1
        else:
            start_stay = i
            end_stay = i + 1
            # Roaming distance: extend the stay while points remain within diam of point i
            for j in range(i + 2, maxt):
                dist_ij = pos2dist(user1['lat'][i], user1['lon'][i],
                                   user1['lat'][j], user1['lon'][j])
                if dist_ij > diam:
                    end_stay = j - 1
                    break
                if j == maxt - 1:  # last valid index, because Python indexing starts at 0
                    end_stay = j
            # Time duration of the candidate stay
            d_t = user1['delta_seconds'][end_stay] - user1['delta_seconds'][start_stay]
            if d_t >= dur:
                if start_stay == 0:
                    start_move = end_stay - 1
                else:
                    stay_ind[start_move:start_stay - 1] = [move] * (start_stay - 1 - start_move)
                    move = move + 1
                    stop = stop + 1
                    start_move = end_stay + 1
                i = end_stay + 1
                stop = stop + 1
            else:
                i = i + 1
    if stop == 0:
        stay_ind = [1] * maxt
    return stay_ind

### Calculate distances

In [78]:
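def pos2dist(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (spherical law of cosines); returns -1.0 if any input is NaN."""
    if math.isnan(lat1) or math.isnan(lat2) or math.isnan(lon1) or math.isnan(lon2):
        dist = -1.0
    elif lat1 == lat2 and lon1 == lon2:
        dist = 0.0
    else:
        R_aver = 6374  # Earth radius in km as set in this code (the mean radius is usually quoted as about 6371 km)
        lat1 = math.radians(lat1)
        lon1 = math.radians(lon1)
        lat2 = math.radians(lat2)
        lon2 = math.radians(lon2)
        aux = (math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2) + math.sin(lat1) * math.sin(lat2))
        if aux >= 1:  # guard against rounding pushing the cosine to 1 or above
            dist = 0.0
        else:
            dist = R_aver * math.acos(aux)
    return dist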
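
For reference, the expression evaluated in the last branch of `pos2dist` is the spherical law of cosines, with latitudes $\varphi$ and longitudes $\lambda$ in radians and $R$ the radius stored in `R_aver`:

$$
d = R \,\arccos\bigl(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_1 - \lambda_2)\bigr)
$$

For points only a few metres apart the argument of $\arccos$ is very close to 1, where the formula loses precision. This is presumably why the code treats any value at or above 1 as a distance of 0.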

### Calculate Speeds

In [79]:
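def speed(leg):
    """Return (mean speed, mean of the speeds <= 30 km/h, list of point-to-point speeds), all in km/h."""
    leg = leg.reset_index(drop=True)  # the positional lookups below need a 0..n-1 index
    Time = leg['delta_seconds'] / 3600  # time in hours for input in seconds
    DT = []
    DR = []
    speeds = []
    for i in range(len(leg['delta_seconds']) - 1):
        DT.append(Time[i + 1] - Time[i])
        lat1 = leg['lat'][i]
        lat2 = leg['lat'][i + 1]
        lon1 = leg['lon'][i]
        lon2 = leg['lon'][i + 1]
        DR.append(pos2dist(lat1, lon1, lat2, lon2))  # output in km
        speeds.append(DR[i] / DT[i])
    # Aggregate once, after the loop, so a leg with fewer than two points yields NaN instead of an error
    speed_filter = pd.Series(speeds, dtype=float)
    mean_speed = sum(speeds) / len(speeds) if speeds else float('nan')
    mean_speed_filt = speed_filter[speed_filter <= 30].mean()
    return mean_speed, mean_speed_filt, speeds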
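
The point-to-point speeds can also be computed without an explicit loop. The sketch below swaps in a vectorized haversine distance rather than `pos2dist`; the helper name `haversine_km`, the 6371 km radius and the sample leg are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Vectorized great-circle distance in km (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Hypothetical leg: three points on the same meridian, 60 s apart
leg = pd.DataFrame({
    'delta_seconds': [0.0, 60.0, 120.0],
    'lat': [40.000, 40.002, 40.004],
    'lon': [-3.7, -3.7, -3.7],
})

# Distance (km) and elapsed time (h) between each point and the previous one
dist_km = haversine_km(leg['lat'].shift(), leg['lon'].shift(), leg['lat'], leg['lon'])
dt_h = leg['delta_seconds'].diff() / 3600

# Speed in km/h; the first row has no previous point and stays NaN
leg['speed_kmh'] = dist_km / dt_h
print(leg['speed_kmh'].round(2).tolist())
```

Filtering implausible values, as the 30 km/h cut in `speed` does, then becomes a boolean mask on the `speed_kmh` column.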

In [80]:
stays=data.groupby(['activity_id']).apply(stay)

In [82]:
# backup=stays

In [74]:
## Transform the result into the data frame
# a_2=a
# a_3=a_2.reset_index()
# a_3.rename(columns={0:'stays'},inplace=True)
# a_3.set_index('activity_id',inplace=True)
# s = a_3.apply(lambda x: pd.Series(x['stays']),axis=1).stack().reset_index(level=1, drop=True)
# s.reset_index().to_csv('stays.csv')
# data.loc[:,'stays']=s.tolist()

In [4]:
# See the result as a dataframe
df=stays.reset_index()
df.rename(columns={0:'stays'},inplace=True)
df.set_index('activity_id',inplace=True)

In [97]:
# Break the list so that each element is a new row
stays_series=df.apply(lambda x: pd.Series(x['stays']),axis=1).stack().reset_index(level=1, drop=True)

In [5]:
stays_df=stays_series.reset_index()
stays_df.rename(columns={0:'stays'},inplace=True)

In [110]:
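# Finally, we add the result to our dataframe. A merge on 'activity_id' alone would pair every
# point of an activity with every label of that activity, so the labels are attached by position.
# The stable sort puts the rows in the order in which groupby handed them to stay().
final_data=data.sort_values('activity_id',kind='mergesort').reset_index(drop=True)
final_data['stays']=stays_df['stays'].tolist()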
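
When several rows share the same `activity_id`, a merge on that key alone pairs every point with every label of the activity. The small sketch below, on made-up data, shows the row count growing and a positional alternative that assumes both tables are in the same order:

```python
import pandas as pd

# Hypothetical data: activity 'a' has 3 points, activity 'b' has 2
data = pd.DataFrame({'activity_id': ['a', 'a', 'a', 'b', 'b'],
                     'lat': [1.0, 2.0, 3.0, 4.0, 5.0]})
labels = pd.DataFrame({'activity_id': ['a', 'a', 'a', 'b', 'b'],
                       'stays': [1, 1, 0, 0, 0]})

# Key-only merge: every point is paired with every label of its activity
merged = pd.merge(data, labels, on='activity_id', how='outer')
print(len(merged))  # 13 rows (3*3 + 2*2) instead of 5

# Positional assignment keeps one label per point, provided both are in the same order
aligned = data.copy()
aligned['stays'] = labels['stays'].to_numpy()
print(len(aligned))  # 5
```

Comparing `len()` before and after a merge is a cheap way to catch this kind of many-to-many join.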

In [116]:
# Export the results, including the 'stays' column, to be used in step 2
final_data.to_csv('data_parsed.csv')