## Instance: GPSAnalytics()
This notebook walks through the steps needed to compute descriptive statistics on the staypoint dataframe. It also requires the leg dataframe, for instance to compute trip distances.

The objective for the library user is to get two dataframes:
- the df at the end of Part 1, obtained through the following pipeline:
    - `check_inputs(leg, staypoint)` (to be done): a small function that checks whether the input data have the required columns and otherwise asks the user to adapt the input data
    - `split_overnight()`
    - `spatial_clustering()`
    - `get_metrics()`
    
- the df at the end of PART 2
    - `get_daily_metrics()`

In [None]:
import math
import random
import pandas as pd
import geopandas as gpd
from geopandas import sjoin
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import DistanceMetric
from sklearn.cluster import DBSCAN
import datetime
import time
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool
from functions_preprocessing import *
#from functions_prep import *

pd.set_option('display.max_columns', 999)

#nrows = 5000

Load staypoints

In [None]:
%%time
# READ FILES
act = pd.read_pickle('sample_data/staypoint_sample_panel.pkl').reset_index()
act.rename(columns={'IDNO':'user_id', 'id':'activity_id'}, inplace=True)
del act['type']

In [None]:
act.head(2)

In [None]:
%%time
# Extract longitude and latitude into separate columns
act['lon'] = act['geometry'].apply(lambda point: point.x)
act['lat'] = act['geometry'].apply(lambda point: point.y)
#Parse the activity df to datetime and geopandas
act = parse_time_geo_data(act, geo_columns=['lon','lat'], datetime_format='%Y-%m-%d %H:%M:%S', CRS2='EPSG:2056')
del act['geometry']

In [None]:
act.head(3)

Load legs

In [None]:
%%time
leg = pd.read_pickle('sample_data/leg_sample_panel.pkl').reset_index()
leg.rename(columns={'id':'leg_id', 'IDNO':'user_id'}, inplace=True)
leg['started_at'] = pd.to_datetime(leg['started_at'])
leg['finished_at'] = pd.to_datetime(leg['finished_at'])

# Add the leg destination activity_id
leg = find_next_activity_id(leg, act)

# Add a 'length' column in meters
leg = gpd.GeoDataFrame(leg, geometry='geometry', crs='EPSG:4326')  # WGS 84 lon/lat
leg['length'] = leg.to_crs(crs='EPSG:2056').length

# Compute the duration in seconds, then store it in minutes in a 'duration' column
leg['duration'] = (leg['finished_at'] - leg['started_at']).dt.total_seconds() / 60

leg.head(2)


## Part 1

**Data format**

To run Part 1, you need a staypoint df and a leg df with at least the following columns:
```python
staypoint.columns = ['activity_id', 'started_at', 'finished_at',
       'purpose', 'user_id', 'lon', 'lat']
```
```python
leg.columns = ['leg_id', 'started_at', 'finished_at',
       'detected_mode', 'mode', 'user_id', 'geometry', 'next_activity_id',
       'length', 'duration']
```
Pay particular attention to the format of the datetime and geometry columns.
Having activities with `purpose == 'home'` also helps complete the calculations.
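
The `check_inputs(leg, staypoint)` step is still to be written. A minimal sketch of what it could look like is given here, assuming it only needs to verify the column lists above and the dtype of the two time columns. The constant names and the error messages are illustrative assumptions, not part of the library.

```python
import pandas as pd

# Required columns, copied from the lists above
STAYPOINT_COLUMNS = ['activity_id', 'started_at', 'finished_at',
                     'purpose', 'user_id', 'lon', 'lat']
LEG_COLUMNS = ['leg_id', 'started_at', 'finished_at', 'detected_mode', 'mode',
               'user_id', 'geometry', 'next_activity_id', 'length', 'duration']


def check_inputs(leg, staypoint):
    """Hypothetical helper: raise a ValueError describing what the user must adapt."""
    problems = []
    for name, frame, required in [('staypoint', staypoint, STAYPOINT_COLUMNS),
                                  ('leg', leg, LEG_COLUMNS)]:
        # Report every required column that is absent
        missing = [c for c in required if c not in frame.columns]
        if missing:
            problems.append(f"{name} is missing columns: {missing}")
        # The time columns must already be parsed as datetimes
        for col in ('started_at', 'finished_at'):
            if col in frame.columns and not pd.api.types.is_datetime64_any_dtype(frame[col]):
                problems.append(f"{name}['{col}'] is not a datetime column")
    if problems:
        raise ValueError('; '.join(problems))
    return True
```

Calling `check_inputs(leg, act)` before Part 1 would then fail early with a readable message instead of a `KeyError` deep inside the pipeline.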

**XYT instance implementation**

The output of Part 1 is an extended staypoint df with extra columns:
```python
extended_staypoint = GPSAnalytics().metrics()
extended_staypoint.columns = ['leg_id', 'started_at', 'finished_at',
       'detected_mode', 'mode', 'user_id', 'geometry', 'next_activity_id',
       'length', 'duration', 'cluster', 'cluster_size', 'cluster_info', 'location_id',
       'peak', 'first_dep', 'last_arr', 'home_loop', 'daily_trip_dist',
       'num_trip', 'max_dist', 'min_dist', 'max_dist_from_home',
       'dist_from_home', 'home_location_id', 'weekday']

```

- `GPSAnalytics().metrics.split_overnight()`

In [None]:
%%time
#Split each overnight activity into the last activity of one day and the first activity of the next
act = split_overnight(act)
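
The body of `split_overnight` lives in `functions_preprocessing` and is not shown in this notebook. As a rough illustration of the idea, the sketch below cuts a single activity interval at each midnight, using the 23:59:59 / 00:00:01 day boundaries that the clean-up step later in Part 1 relies on. The helper name and its signature are assumptions made for this example.

```python
from datetime import datetime, time, timedelta


def split_overnight_interval(started_at, finished_at):
    """Hypothetical helper: cut one [started_at, finished_at] interval at each midnight."""
    pieces = []
    current = started_at
    # Emit one piece per calendar day until the final day is reached
    while current.date() < finished_at.date():
        end_of_day = datetime.combine(current.date(), time(23, 59, 59))
        pieces.append((current, end_of_day))
        current = datetime.combine(current.date() + timedelta(days=1), time(0, 0, 1))
    # Last piece, skipped if the activity ended before 00:00:01
    if current <= finished_at:
        pieces.append((current, finished_at))
    return pieces


# An activity from 22:00 to 07:30 the next morning becomes two pieces
pieces = split_overnight_interval(datetime(2021, 1, 1, 22, 0), datetime(2021, 1, 2, 7, 30))
for start, end in pieces:
    print(start, '->', end)
```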

- `GPSAnalytics().metrics.spatial_clustering()`

In [None]:
%%time
#Aggregate the locations of most visited places per user_id and per imputed_purpose
act = spatial_clustering(act, purpose_col='purpose')

#Label the Most Visited Places
act = cluster_info(act,purpose_col='purpose')

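#Aggregate the 100m-neighbouring lon,lat pairs
#eps is 100 m expressed in radians (distance / Earth radius in metres); haversine expects [lat, lon] in radians
db = DBSCAN(eps=100/6371000, min_samples=2, metric='haversine', algorithm='ball_tree')
cl = db.fit_predict(np.deg2rad(act[['lat','lon']]))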
for cluster in np.unique(cl):
    if cluster != -1:
        act.loc[cl == cluster, 'lon'] = act.loc[cl == cluster, 'lon'].mean()
        act.loc[cl == cluster, 'lat'] = act.loc[cl == cluster, 'lat'].mean()
#Add location id for unique lon,lat pairs
act['location_id'] = np.nan
for counter, lon, lat in act.groupby(['lon','lat'], as_index=False).size()[['lon','lat']].itertuples(index=True):
    act.loc[(act.lon == lon) & (act.lat == lat), 'location_id'] = counter
act.head(10)
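
With the haversine metric, the `eps` radius of DBSCAN is an angle in radians, so a distance in metres has to be divided by the Earth radius (about 6,371 km). The standard-library sketch below shows that conversion on made-up coordinates. The function name and the sample points are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_rad(lat1, lon1, lat2, lon2):
    """Great-circle angle in radians between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


eps = 100 / EARTH_RADIUS_M  # 100 m expressed as an angle in radians

# 0.0005 degrees of latitude is roughly 56 m; 0.01 degrees is roughly 1.1 km
near = haversine_rad(46.5200, 6.6300, 46.5205, 6.6300)
far = haversine_rad(46.5200, 6.6300, 46.5300, 6.6300)

print(near < eps)  # the two close points would be neighbours
print(far < eps)   # the distant point would not
```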

- `GPSAnalytics().metrics.get_metrics()`

In [None]:
#Derive additional variables
df = act.copy()
#DAILY USER_ID: Add user_ids per day
df.insert(1, 'user_id_day', df['user_id'] + '_' + df.started_at.dt.year.astype(str) + df.started_at.dt.month.astype(str).str.zfill(2) + df.started_at.dt.day.astype(str).str.zfill(2))
#PEAK HOURS: Add boolean if trip starts (i.e. activity ends) in peak hour
morning = [datetime.datetime(2021,1,1,6,30).time(),datetime.datetime(2021,1,1,9,0).time()]
noon = [datetime.datetime(2021,1,1,12,0).time(),datetime.datetime(2021,1,1,14,0).time()]
evening = [datetime.datetime(2021,1,1,16,30).time(),datetime.datetime(2021,1,1,19,0).time()]
df['peak'] = 0
df.loc[(df.finished_at.dt.time > morning[0]) & (df.finished_at.dt.time < morning[1]), 'peak'] = 'morning_peak'
df.loc[(df.finished_at.dt.time > noon[0]) & (df.finished_at.dt.time < noon[1]), 'peak'] = 'noon_peak'
df.loc[(df.finished_at.dt.time > evening[0]) & (df.finished_at.dt.time < evening[1]), 'peak'] = 'evening_peak'
#Get the time of first departure / last arrival
df['first_dep'] = np.nan
df['last_arr'] = np.nan
df['home_loop'] = 0
df['daily_trip_dist'] = np.nan
df['num_trip'] = np.nan
df['max_dist'] = np.nan
df['min_dist'] = np.nan
df['max_dist_from_home'] = np.nan
df['dist_from_home'] = np.nan
df['home_location_id'] = np.nan

location = act[['lon','lat','location_id']].copy().sort_values('location_id').reset_index(drop=True)
location.drop_duplicates(ignore_index=True,inplace=True)
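#haversine expects [lat, lon] in radians; the result is scaled by the Earth radius in km
od_matrix_kms = pd.DataFrame(DistanceMetric.get_metric('haversine').pairwise(np.radians(location[['lat','lon']].to_numpy()))*6373, columns=location.location_id.unique(), index=location.location_id.unique())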


#Create functions to be parallelized
def func1(df):
    import datetime
    import pandas as pd
    import numpy as np
    
    ## Clean up the first / last activity of the day
    #CASE 1. Manage cases where the first/last act is at home but started sometime in the morning/afternoon --> set started_at == 00:00:01 / finished_at == 23:59:59
    #Set a df with only the first/last activities of user-days (NB DO NOT RESET OR IGNORE INDEXES): 
    first_act = df.loc[(df.drop_duplicates(subset=['user_id_day'], keep='first').index)].copy()
    last_act = df.loc[(df.drop_duplicates(subset=['user_id_day'], keep='last').index)].copy()
    
    case1f = first_act[(first_act.started_at.dt.time > datetime.time(0, 0, 1)) & (first_act.started_at.dt.time < datetime.time(12, 0, 0)) & (first_act.imputed_purpose == 'Home')]
    case1l = last_act[(last_act.finished_at.dt.time > datetime.time(12, 0, 0)) & (last_act.finished_at.dt.time < datetime.time(23, 59, 59)) & (last_act.imputed_purpose == 'Home')]
    df.loc[case1f.index, 'started_at'] = pd.to_datetime(df.loc[case1f.index, 'started_at'].dt.date.astype(str)+"T00:00:01Z", format="%Y-%m-%dT%H:%M:%SZ")
    df.loc[case1l.index, 'finished_at'] = pd.to_datetime(df.loc[case1l.index, 'finished_at'].dt.date.astype(str)+"T23:59:59Z", format="%Y-%m-%dT%H:%M:%SZ")
    #Recalculate the durations
    df.loc[case1l.index.append(case1f.index), 'duration'] = (df.loc[case1l.index.append(case1f.index), 'finished_at'] - df.loc[case1l.index.append(case1f.index), 'started_at']) / np.timedelta64(1, 's')
    
    #CASE 2. Drop all the user_id_day for which the first / last activity does not start / end at 00:00:01 / 23:59:59
    len_before = len(df)
    #Set a df with only the first/last activities of user-days (NB DO NOT RESET OR IGNORE INDEXES): 
    first_act = df.loc[(df.drop_duplicates(subset=['user_id_day'], keep='first').index)].copy()
    last_act = df.loc[(df.drop_duplicates(subset=['user_id_day'], keep='last').index)].copy()
            
    #Clean the user_id_day with home only
    index_to_drop = []
    for user_id in df.user_id_day.unique():
        try:
            if (len(df.loc[df.user_id_day == user_id, 'location_id'].unique()) == 1) & ('Home' in df.loc[df.user_id_day == user_id, 'imputed_purpose'].unique()):
                index_to_drop.extend(df.loc[(df.user_id_day == user_id)].index[1:].tolist())
                #Collapse the home-only day into one activity: total duration, earliest start, latest end
                df.loc[(df.user_id_day == user_id), 'duration'] = df.loc[(df.user_id_day == user_id), 'duration'].sum()
                df.loc[(df.user_id_day == user_id), 'started_at'] = df.loc[(df.user_id_day == user_id), 'started_at'].min()
                df.loc[(df.user_id_day == user_id), 'finished_at'] = df.loc[(df.user_id_day == user_id), 'finished_at'].max()
        except (ValueError):
            continue
    df = df.loc[~df.index.isin(index_to_drop)].copy()
    df.reset_index(inplace=True, drop=True)
    
    df = df.loc[(~df.user_id_day.isin(first_act.loc[first_act.started_at.dt.time != datetime.time(0, 0, 1), 'user_id_day'].array))]
    df = df.loc[(~df.user_id_day.isin(last_act.loc[last_act.finished_at.dt.time != datetime.time(23, 59, 59), 'user_id_day'].array))]
    #len_after = len(df)
    #print('Warning: clean up operation reduced the df lenght by -' + str("{:.1f}".format((len_before-len_after)*100/len_before)) + ' %')
    ##Most user-days start at home:
    #print('But note that there are still some weird cases like a day starting with Shopping. Those cases are however very few :' + "\n" + df.drop_duplicates(subset=['user_id_day'], keep='first').groupby(by='imputed_purpose').count()['user_id_day'].to_string())
    
    return df


##NOTE: the leg data is needed here. Passing it into the function is not good practice, but multiprocessing requires it.
##NOTE: this function also depends on many hard-coded column names.
def func2(df, leg):
    import datetime
    import geopandas
    #Compute miscellaneous additional variables
    for user_id in df.user_id_day.unique():
        try:
            if len(df.loc[(df.user_id_day == user_id)]) > 1:
                df.loc[(df.user_id_day == user_id), 'first_dep'] = df.loc[(df.user_id_day == user_id), 'finished_at'].dt.time.min()
                df.loc[(df.user_id_day == user_id), 'last_arr'] = df.loc[(df.user_id_day == user_id), 'started_at'].dt.time.max()
            if sum(df.loc[(df.user_id_day == user_id), 'imputed_purpose'].isin(['Home', 'home'])) > 1:
                df.loc[(df.user_id_day == user_id), 'home_loop'] = sum(df.loc[(df.user_id_day == user_id), 'imputed_purpose'].isin(['Home', 'home'])) - 1
            #find from legs the actual trip distance
            date = datetime.datetime.strptime(user_id[-8:], "%Y%m%d").date() #retrieve the date of the concerned activities from the user_id_day string
            condition1 = leg.next_activity_id.isin(df.loc[(df.user_id_day == user_id), 'activity_id'].tolist()) #spot all the legs tracked to reach the activities
            condition2 = leg.started_at.dt.date == date #match the dates
            df.loc[(df.user_id_day == user_id), 'daily_trip_dist'] = leg.loc[(condition1) & (condition2), 'length'].sum()    
            #return the number of trips between activities
            df.loc[(df.user_id_day == user_id), 'num_trip'] = len(df[df.user_id_day == user_id]) - 1
        except (ValueError):
            pass
            
    return df

def func3(df, od_matrix_kms):
    from sklearn.metrics import DistanceMetric
    import numpy as np
    import math
    
    #Compute the distances between locations
    for user_id in df.user_id_day.unique():
        try:
            #Compute max/min distance between all activity locations
            od_pairs = df.loc[df.user_id_day == user_id, ['lon', 'lat']].drop_duplicates(ignore_index=True)
            od_dist = DistanceMetric.get_metric('haversine').pairwise(np.radians(od_pairs[['lat','lon']].to_numpy()))*6373 #haversine expects radians
            if len(od_dist) > 1:
                df.loc[df.user_id_day == user_id, 'max_dist'] = od_dist.max().astype(int) * 1000
                df.loc[df.user_id_day == user_id, 'min_dist'] = od_dist[np.nonzero(od_dist)].min().astype(int) * 1000 #get the min among non-null values
            else:
                df.loc[df.user_id_day == user_id, 'max_dist'] = 0
                df.loc[df.user_id_day == user_id, 'min_dist'] = 0
            #Compute max distance from home
            home_locations = df.loc[(df.user_id_day == user_id) & (df.imputed_purpose.str.lower() == "home"), ['location_id', 'cluster_size']]
            if len(home_locations.location_id.unique()) > 1:
                home_id = home_locations.loc[home_locations.cluster_size.idxmax(), 'location_id']
            else:
                home_id = home_locations.location_id.mean()
            if math.isnan(home_id) == False:
                all_id = df.loc[(df.user_id_day == user_id) & (df.location_id != home_id), 'location_id']
                df.loc[df.user_id_day == user_id, 'max_dist_from_home'] = od_matrix_kms.loc[od_matrix_kms.index.isin(all_id), home_id].max() * 1000
                df.loc[df.user_id_day == user_id, 'home_location_id'] = home_id
        except (ValueError):
            continue
            
    return df
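
The peak-hour labelling at the top of the cell above tags an activity by the time at which it ends, which is when the following trip starts. The sketch below restates that rule for a single time value with the same three windows, strict bounds, and default of `0`. The dictionary and function names are assumptions for this example.

```python
from datetime import time

# Same windows as in the cell above
PEAKS = {
    'morning_peak': (time(6, 30), time(9, 0)),
    'noon_peak': (time(12, 0), time(14, 0)),
    'evening_peak': (time(16, 30), time(19, 0)),
}


def peak_label(finished_at_time):
    """Hypothetical helper: return the peak label for an activity end time, or 0 if off-peak."""
    for label, (start, end) in PEAKS.items():
        # Strict comparison: an activity ending exactly on a bound is off-peak
        if start < finished_at_time < end:
            return label
    return 0


print(peak_label(time(8, 0)))    # morning_peak
print(peak_label(time(9, 0)))    # 0
print(peak_label(time(17, 15)))  # evening_peak
```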

In [None]:
%%time
import datetime
from functools import partial
#RUN PARALLEL FUNCTIONS FUNC1, FUNC2 & FUNC3
#BE CAREFUL: THIS PART CAN TAKE A LONG TIME
#MAKE A PROGRESS BAR ?

df.rename(columns={'purpose':'imputed_purpose'}, inplace=True)

for func in [func1, func2, func3]:
    cores = mp.cpu_count()
    #split the df in as many array as the machine has cores
    user_ids = np.array_split(df.user_id_day.unique(), cores, axis=0)
    df_split = []
    for u in user_ids:
        df_split.append(df.loc[df.user_id_day.isin(u.tolist())])
    # create the multiprocessing pool
    pool = Pool(cores)
    # process the DataFrame by mapping function to each df across the pool
    if func == func2:
        func2_partial = partial(func2, leg=leg)
        df_out = np.vstack(pool.map(func2_partial, df_split))
    elif func == func3:
        func3_partial = partial(func3, od_matrix_kms=od_matrix_kms)
        df_out = np.vstack(pool.map(func3_partial, df_split))
    else:
        df_out = np.vstack(pool.map(func, df_split))
    
    # return the df
    df = pd.DataFrame(df_out, columns=df.columns)
    
    # close down the pool and join
    pool.close()
    pool.join()
    pool.clear()
    
    if func == func2:
        #drop the days with only one observation and a short connection duration
        df.drop(df[(df.first_dep.isna()) & (df.duration < 43200)].index, inplace=True) #43200sec is 12 hours
        #Add weekdays
        df['weekday'] = df.started_at.dt.weekday
        #SORT VALUES
        df.sort_values(by=['user_id_day','started_at'], inplace=True, ignore_index=True)
        df['cluster_size'] = df['cluster_size'].astype(int)
    
    if func == func3:
        df.reset_index(inplace=True, drop=True)
        for index, row in df.iterrows():
            df.loc[index, 'dist_from_home'] = get_distance(row['location_id'], row['home_location_id'], od_matrix_kms)
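
The loop above follows a split / apply / combine pattern: whole user-days are distributed across chunks so that no day is cut in half, each chunk is processed independently, and the results are stacked back together. The sketch below shows the same pattern on a toy frame, run sequentially for simplicity. `apply_by_user_day` and `count_trips` are hypothetical names, and the use of `pd.concat` to reassemble the chunks is a choice made for this example.

```python
import numpy as np
import pandas as pd


def apply_by_user_day(df, func, n_chunks):
    """Hypothetical helper: run func on chunks made of whole user-days, then reassemble."""
    keys = np.array_split(df['user_id_day'].unique(), n_chunks)
    chunks = [df.loc[df['user_id_day'].isin(k)] for k in keys if len(k) > 0]
    # Sequential here; a pool's map(func, chunks) could replace this list comprehension
    results = [func(chunk.copy()) for chunk in chunks]
    return pd.concat(results, ignore_index=True)


def count_trips(chunk):
    """Number of trips per user-day = number of activities minus one."""
    chunk['num_trip'] = chunk.groupby('user_id_day')['started_at'].transform('count') - 1
    return chunk


toy = pd.DataFrame({
    'user_id_day': ['A_20210101'] * 3 + ['B_20210101'] * 2 + ['C_20210102'],
    'started_at': pd.to_datetime([
        '2021-01-01 00:00:01', '2021-01-01 08:30:00', '2021-01-01 18:00:00',
        '2021-01-01 00:00:01', '2021-01-01 12:15:00',
        '2021-01-02 00:00:01',
    ]),
})

out = apply_by_user_day(toy, count_trips, n_chunks=2)
print(out)
```

`pd.concat` keeps the dtype of each column, so datetime columns remain usable with the `.dt` accessor after the chunks are recombined.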

In [None]:
#df.to_pickle('sample_data/extended_staypoint_sample_panel.pkl')

In [None]:
df.head()
print(df)

## Part 2

- `GPSAnalytics().metrics.get_daily_metrics()`

Aggregate per day

In [None]:
import pandas as pd
import numpy as np

def daily_metrics(df):
    """
    Construct a matrix of daily descriptive statistics.

    Args:
    - df: DataFrame containing the relevant columns from the extended staypoint df (output of Part 1).

    Returns:
    - DataFrame indexed by user_id_day, with one row of daily metrics per user-day.
    """
    # Select relevant columns
    daily_act = df[['user_id_day', 'first_dep', 'last_arr', 'home_loop', 'daily_trip_dist', 'peak', 'num_trip', 'max_dist_from_home', 'weekday']].copy()

    # Convert 'first_dep' and 'last_arr' to minutes since midnight
    daily_act.loc[daily_act.first_dep.notnull(), 'first_dep'] = (pd.to_datetime(daily_act.loc[daily_act.first_dep.notnull(), 'first_dep'], format="%H:%M:%S") - np.datetime64('1900-01-01')).dt.total_seconds().div(60).astype(int)
    daily_act.loc[daily_act.last_arr.notnull(), 'last_arr'] = (pd.to_datetime(daily_act.loc[daily_act.last_arr.notnull(), 'last_arr'], format="%H:%M:%S") - np.datetime64('1900-01-01')).dt.total_seconds().div(60).astype(int)

    # Remove duplicate rows based on 'user_id_day'
    daily_act.drop_duplicates(subset=['user_id_day'], keep='first', ignore_index=True, inplace=True)

    # Create new columns 'am_peak', 'pm_peak', and 'noon_peak'
    daily_act['am_peak'] = daily_act['pm_peak'] = daily_act['noon_peak'] = 0
    daily_act.loc[daily_act.peak == 'morning_peak', 'am_peak'] = 1
    daily_act.loc[daily_act.peak == 'evening_peak', 'pm_peak'] = 1
    daily_act.loc[daily_act.peak == 'noon_peak', 'noon_peak'] = 1

    # Drop the 'peak' column
    daily_act.drop('peak', inplace=True, axis=1)

    # Set 'user_id_day' as the index
    daily_act.set_index('user_id_day', inplace=True)

    return daily_act


In [None]:
daily_metrics(df)
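
`daily_metrics` stores `first_dep` and `last_arr` as whole minutes since midnight, dropping any leftover seconds through the integer cast. For a single `datetime.time` value the same conversion can be sketched as follows. The helper name is an assumption for this example.

```python
from datetime import time


def minutes_since_midnight(t):
    """Hypothetical helper: whole minutes elapsed since 00:00:00, seconds truncated."""
    return t.hour * 60 + t.minute


print(minutes_since_midnight(time(7, 45, 30)))   # 465
print(minutes_since_midnight(time(23, 59, 59)))  # 1439
```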