# Readme Panel Lémanique
This notebook takes the raw 2023 GPS data of the Panel Lémanique and applies a first pass of cleaning and filtering. Specifically, two levels of "signal loss" are flagged in the columns 'low_quality_legs_1' (approx. 5% of traces removed) and 'low_quality_legs_2' (approx. 7% of traces removed).
## GPS tracking
The GPS data come from the 21-day collection phase in spring 2023.
## Base files used
- Raw: gps_panel_lemanique_by_motion_tag.csv
- Lines: lines.geojson, file provided by Elisa Tirindelli on 17/08/23, renamed legs.geojson
- Points: points.geojson, file provided by Elisa Tirindelli on 17/08/23, renamed staypoints.geojson
- Linking files: Localisation_domicile.csv by Florian Masse (found on the LASUR server)
- Geoinformation: Verkehrszonen_Schweiz_NPVM_2017.shp, traffic zones of the Swiss passenger transport model
- Questionnaire: EPFL_vague1_v4.csv, provided by Alexis Gumy on 21/09/23
## Cleaning by Elisa
- removal of trips that are repeated more than once for a given person (to remove bias from the data)
- removal of "triangle" trips (trips that return to the same place several times), i.e. trips that depart from (or arrive at) the same place twice
- removal of legs (within a single trip) that depart more than half an hour after the previous leg
- correction of trips that contain more than one leg on the same line (all consecutive legs on a line are grouped into a single leg)
- correction of the trip travel time (and distance), recomputed to account for the legs that were removed
- to consider: trips that include more than 3 or 4 public transport legs (rather unlikely)
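
As an illustration of the half-hour rule in the list above, here is a minimal sketch in plain Python. Elisa's code is not part of this notebook, so the record layout (`start`, `end`), the helper names `drop_late_legs` and `t`, and the reading of the rule (a leg is dropped when it starts more than 30 minutes after the end of the previously kept leg) are all assumptions.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=30)  # assumed reading of the "half an hour" rule


def drop_late_legs(legs):
    """Keep the legs of one trip, dropping any leg that starts more than
    MAX_GAP after the end of the previously kept leg.

    `legs` is a list of dicts with hypothetical keys 'start' and 'end'
    (datetime objects), assumed to be sorted by 'start'.
    """
    kept = []
    for leg in legs:
        if kept and leg["start"] - kept[-1]["end"] > MAX_GAP:
            continue  # gap too long: this leg is dropped
        kept.append(leg)
    return kept


def t(hhmm):
    """Hypothetical helper: build a datetime on an arbitrary day."""
    return datetime.strptime("2023-05-01 " + hhmm, "%Y-%m-%d %H:%M")


trip = [
    {"start": t("08:00"), "end": t("08:10")},
    {"start": t("08:15"), "end": t("08:40")},  # starts 5 min after the previous leg: kept
    {"start": t("09:30"), "end": t("09:45")},  # starts 50 min after the previous leg: dropped
]
kept_legs = drop_late_legs(trip)
print(len(kept_legs))
```

After such a filter, the trip duration and distance would have to be recomputed from the kept legs, as the list above notes.
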
## Additional cleaning by Marc-Edouard (see the notebook below)
- Remove legs with signal loss (i.e. discontinuous legs) -> loss of 530 traces out of 669808
- Segment the data by canton to ease data handling
- Handle non-geolocated legs (beeline between OD)
- Compute the average number of observations per respondent

## Spotted issues
- 'CH14886', 'CH15539' are duplicates in the vague1_v4 file
- 'FR13508', 'CH8035', 'CH14765' are not in the vague1_v4 file but appear in the gps file
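
Issues of this kind can be detected with a quick consistency check between the two ID lists. The sketch below uses plain Python and made-up IDs; the function name `find_id_issues` is hypothetical and this is not necessarily how the issues above were found.

```python
from collections import Counter


def find_id_issues(survey_ids, gps_ids):
    """Return (IDs duplicated in the survey, IDs present in GPS but absent from the survey)."""
    counts = Counter(survey_ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    missing = sorted(set(gps_ids) - set(counts))
    return duplicates, missing


# Toy data, not taken from the real files
survey_ids = ["CH1", "CH2", "CH2", "FR3"]
gps_ids = ["CH1", "CH2", "FR3", "FR4", "FR4"]

duplicates, missing = find_id_issues(survey_ids, gps_ids)
print(duplicates, missing)
```
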

In [1]:
import geopandas as gpd
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

from shapely import geometry, ops
from shapely.geometry import MultiLineString, LineString, Point
import os

import time

## Data loading and preparation

In [2]:
%%time

# Define the CRS you want to use (e.g., EPSG:4326 for WGS84)
target_crs = 'EPSG:4326'

# load geo data

raw = pd.read_pickle('gps_panel_lemanique_by_motion_tag.pkl').filter(items=['id','type','started_at','finished_at'])

legs = pd.read_pickle('legs.pkl').reset_index()
legs = gpd.GeoDataFrame(legs, geometry="geometry")

points = pd.read_pickle('staypoints.pkl')
points = gpd.GeoDataFrame(points, geometry="geometry")

## Load the geodata from the source files (csv / geojson) if no pickles were created
#raw = pd.read_csv('gps_panel_lemanique_by_motion_tag.csv')
#legs = gpd.read_file('legs.geojson', crs=target_crs)
#points = gpd.read_file('staypoints.geojson', crs=target_crs)

## Retrieve the started_at and finished_at from original files as Elisa's were missing
##For legs
#legs = pd.read_pickle('legs.pkl')
#legs = pd.merge(raw[['id','started_at','finished_at']], legs, on='id', how='right')
#legs = legs.filter(items=['id', 'started_at', 'finished_at', 'type', 'strtd__',
#       'dtctd_m', 'mode', 'IDNO', 'geometry'])
#legs.rename(columns={'strtd__':'started_at_timezone', 'dtctd_m':'detected_mode'}, inplace=True)
#legs.set_index('id').to_pickle('legs.pkl')
## For staypoints
#points = pd.read_pickle('staypoints.pkl')
#points = pd.merge(raw[['id','started_at','finished_at']], points, on='id', how='right')
#points = points.filter(items=['id', 'started_at', 'finished_at', 'type', 'strtd__',
#                              'purpose', 'IDNO', 'geometry'])
#points.rename(columns={'strtd__':'started_at_timezone'}, inplace=True)
#points.set_index('id').to_pickle('staypoints.pkl')


#load survey data
full_survey = pd.read_csv('../Vague1/EPFL_vague1_v4.csv', low_memory=False)
dom =  pd.read_csv('../Vague1/Localisation_domicile.csv', low_memory=False)
dom = gpd.GeoDataFrame(dom, geometry=gpd.points_from_xy(dom.dom_long, dom.dom_lat), crs=target_crs)
del dom['dom_long']
del dom['dom_lat']

#load official data
TAZ = gpd.read_file('../Vague1/Verkehrszonen_Schweiz_NPVM_2017_shp/Verkehrszonen_Schweiz_NPVM_2017.shp')
TAZ = TAZ[['ID_Agglo', 'N_Agglo', 'N_KT', 'ID_Gem', 'geometry']]
TAZ = TAZ.to_crs(crs=target_crs)

CPU times: user 19.6 s, sys: 11.7 s, total: 31.3 s
Wall time: 34.6 s


In [3]:
raw = pd.read_pickle('gps_panel_lemanique_by_motion_tag.pkl')
raw.head()

Unnamed: 0,id,type,started_at,started_at_timezone,finished_at,finished_at_timezone,length,detected_mode,mode,purpose,geometry,confirmed_at,started_on,misdetected_completely,merged,created_at,updated_at,started_at_in_timezone,finished_at_in_timezone,confirmed_at_in_timezone,created_at_in_timezone,updated_at_in_timezone,IDNO
0,75074f7e-43cf-45ba-85a1-870d3ef09a4e,Stay,2023-04-30 12:12:28,Europe/Zurich,2023-05-01 05:10:22,Europe/Zurich,,,,home,0020000001000010e6401a25305bc9b07a40475dbf1402...,2023-05-21 11:26:59,2023-04-30,False,False,2023-05-01 15:29:57,2023-05-21 11:26:59,2023-04-30 12:12:28,2023-05-01 05:10:22,2023-05-21 11:26:59,2023-05-01 15:29:57,2023-05-21 11:26:59,CH3181
1,051d1613-b29a-4598-90d4-2365bf58132b,Track,2023-05-01 05:10:22,Europe/Zurich,2023-05-01 05:16:18,Europe/Zurich,4999.0,Mode::Car,Mode::Car,,0020000002000010e600000043401a25305bc9b07a4047...,2023-05-01 19:41:23,2023-05-01,False,False,2023-05-01 15:29:57,2023-05-02 18:32:09,2023-05-01 05:10:22,2023-05-01 05:16:18,2023-05-01 19:41:23,2023-05-01 15:29:57,2023-05-02 18:32:09,CH3181
2,4311fb87-42d1-4950-b330-f06a18459bc1,Stay,2023-05-01 05:16:18,Europe/Zurich,2023-05-01 15:11:05,Europe/Zurich,,,,work,0020000001000010e6401a4772606fac60404759fb7c42...,2023-05-01 19:41:19,2023-05-01,False,False,2023-05-01 15:29:58,2023-05-01 19:41:19,2023-05-01 05:16:18,2023-05-01 15:11:05,2023-05-01 19:41:19,2023-05-01 15:29:58,2023-05-01 19:41:19,CH3181
3,92957703-71b4-4ae0-a1d2-3a2862d74a43,Track,2023-05-01 15:11:05,Europe/Zurich,2023-05-01 15:20:08,Europe/Zurich,5668.0,Mode::Car,Mode::Car,,0020000002000010e600000069401a4772606fac604047...,2023-05-01 19:41:16,2023-05-01,False,False,2023-05-01 15:29:58,2023-05-01 19:41:16,2023-05-01 15:11:05,2023-05-01 15:20:08,2023-05-01 19:41:16,2023-05-01 15:29:58,2023-05-01 19:41:16,CH3181
4,a274be2e-e5ea-46b2-a32a-32a711dca4e0,Stay,2023-05-01 15:20:08,Europe/Zurich,2023-05-01 17:41:37,Europe/Zurich,,,,home,0020000001000010e6401a2520382ec59040475dc168eb...,2023-05-01 19:41:09,2023-05-01,False,False,2023-05-01 19:29:18,2023-05-01 19:41:09,2023-05-01 15:20:08,2023-05-01 17:41:37,2023-05-01 19:41:09,2023-05-01 19:29:18,2023-05-01 19:41:09,CH3181


In [4]:
pd.DataFrame(raw.IDNO.unique()).to_csv('list_IDNO_tracking_gps.csv', index=False)

In [5]:
raw.sample(n=100).to_csv('sample_panel_lemanic_kanaha.csv', index=False)

In [6]:
(raw[raw['type']=='Stay'])

Unnamed: 0,id,type,started_at,started_at_timezone,finished_at,finished_at_timezone,length,detected_mode,mode,purpose,geometry,confirmed_at,started_on,misdetected_completely,merged,created_at,updated_at,started_at_in_timezone,finished_at_in_timezone,confirmed_at_in_timezone,created_at_in_timezone,updated_at_in_timezone,IDNO
0,75074f7e-43cf-45ba-85a1-870d3ef09a4e,Stay,2023-04-30 12:12:28,Europe/Zurich,2023-05-01 05:10:22,Europe/Zurich,,,,home,0020000001000010e6401a25305bc9b07a40475dbf1402...,2023-05-21 11:26:59,2023-04-30,False,False,2023-05-01 15:29:57,2023-05-21 11:26:59,2023-04-30 12:12:28,2023-05-01 05:10:22,2023-05-21 11:26:59,2023-05-01 15:29:57,2023-05-21 11:26:59,CH3181
2,4311fb87-42d1-4950-b330-f06a18459bc1,Stay,2023-05-01 05:16:18,Europe/Zurich,2023-05-01 15:11:05,Europe/Zurich,,,,work,0020000001000010e6401a4772606fac60404759fb7c42...,2023-05-01 19:41:19,2023-05-01,False,False,2023-05-01 15:29:58,2023-05-01 19:41:19,2023-05-01 05:16:18,2023-05-01 15:11:05,2023-05-01 19:41:19,2023-05-01 15:29:58,2023-05-01 19:41:19,CH3181
4,a274be2e-e5ea-46b2-a32a-32a711dca4e0,Stay,2023-05-01 15:20:08,Europe/Zurich,2023-05-01 17:41:37,Europe/Zurich,,,,home,0020000001000010e6401a2520382ec59040475dc168eb...,2023-05-01 19:41:09,2023-05-01,False,False,2023-05-01 19:29:18,2023-05-01 19:41:09,2023-05-01 15:20:08,2023-05-01 17:41:37,2023-05-01 19:41:09,2023-05-01 19:29:18,2023-05-01 19:41:09,CH3181
6,11c2c50a-98b8-4608-924e-31644efaed2d,Stay,2023-05-01 17:53:05,Europe/Zurich,2023-05-01 19:11:15,Europe/Zurich,,,,family_friends,0020000001000010e6401a71caf485787a4047654899cc...,2023-05-02 04:17:36,2023-05-01,False,False,2023-05-01 22:49:53,2023-05-02 04:17:39,2023-05-01 17:53:05,2023-05-01 19:11:15,2023-05-02 04:17:36,2023-05-01 22:49:53,2023-05-02 04:17:39,CH3181
8,3111307b-2939-4ae3-b6dc-e54ba07ebfa6,Stay,2023-05-01 19:24:23,Europe/Zurich,2023-05-02 05:13:13,Europe/Zurich,,,,home,0020000001000010e6401a2536a0a5fd5440475dbe54ab...,2023-05-02 18:28:43,2023-05-01,False,False,2023-05-02 10:23:29,2023-05-02 18:28:43,2023-05-01 19:24:23,2023-05-02 05:13:13,2023-05-02 18:28:43,2023-05-02 10:23:29,2023-05-02 18:28:43,CH3181
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1231360,7b5c65d4-7e86-43d5-a789-2b7236eb0ed3,Stay,2023-04-26 10:45:53,Europe/Zurich,2023-04-26 11:07:18,Europe/Zurich,,,,leisure,0020000001000010e640201721fa25cb6c4047b2779cef...,,2023-04-26,False,False,2023-04-26 13:59:07,2023-04-26 13:59:07,2023-04-26 10:45:53,2023-04-26 11:07:18,,2023-04-26 13:59:07,2023-04-26 13:59:07,CH23198
1231362,4e8f34fb-02d5-43be-b506-9855a3511b47,Stay,2023-04-26 11:18:01,Europe/Zurich,2023-04-26 11:22:16,Europe/Zurich,,,,wait,0020000001000010e640201a1aa74b33044047b220c3c4...,,2023-04-26,False,False,2023-04-26 13:59:07,2023-04-26 13:59:07,2023-04-26 11:18:01,2023-04-26 11:22:16,,2023-04-26 13:59:07,2023-04-26 13:59:07,CH23198
1231364,88de2000-4484-4303-a3b0-b9cf06af24f1,Stay,2023-04-26 11:37:41,Europe/Zurich,2023-04-26 13:59:01,Europe/Zurich,,,,unknown,0020000001000010e640200ab637ca71a94047aba7496a...,,2023-04-26,False,False,2023-04-26 16:00:14,2023-04-26 16:00:14,2023-04-26 11:37:41,2023-04-26 13:59:01,,2023-04-26 16:00:14,2023-04-26 16:00:14,CH23198
1231366,0d365bfb-457a-4498-bf2c-d83b3588285a,Stay,2023-04-26 15:40:22,Europe/Zurich,2023-04-26 15:47:27,Europe/Zurich,,,,errand,0020000001000010e6401b6632fd3bb8cc40474098695f...,,2023-04-26,False,False,2023-04-27 07:54:17,2023-04-27 07:54:17,2023-04-26 15:40:22,2023-04-26 15:47:27,,2023-04-27 07:54:17,2023-04-27 07:54:17,CH23198


In [7]:
#Q4_1_1_R : How many conventional cars in working order does your household have?
#Q4_1_2_R : How many electric/hybrid cars in working order does your household have?
#Q5 : What is the engine type of the car you use most often?
#Q6 : What is the engine type of the second car you use most often?
#Q7 : Can you use a household car whenever you wish?
#Q8 : Do you ever borrow a car from people close to you (friends, family, etc.) for your own trips or those of your household?
#Q9 : Do you ever use a car-sharing service (such as Mobility or Citiz)?

survey = full_survey[['IDNO','canton_dep','AGGLO_CH_dom','Pays','Groupe', 'Weight', 'ID_COM', 'permis_auto', 'revenu', 'revenuFR','revenuCH','age','formation','Genre_actuel','KLASSE_ARE_dom', 'KLASSE_ARE_trav', 'pays_trav', 'Q4_1_1_R','Q4_1_2_R', 'Q5_R', 'Q6_R', 'Q7', 'Q8', 'Q9']]

##  Merge geodata with survey data

TRANSFORM MULTILINESTRINGS INTO LINESTRINGS

In [8]:
legs.geometry.geom_type.value_counts()

LineString         667106
MultiLineString      2702
Name: count, dtype: int64

In [9]:
# Rewrite continuous MultiLineString into LineString geometries
legs['geometry'] = legs['geometry'].apply(lambda geom: ops.linemerge(geom) if isinstance(geom, MultiLineString) else geom)
# Remove the discontinuous Multilinestrings (only the discontinuous lines remain after the previous operation)
# note: an alternative would be to explode the discontinuous multiline, but then we don't have the departure / arrival time: legs.explode(index_parts=True)
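# .copy() makes the filtered frame independent, so adding a column below does not raise a SettingWithCopyWarning
legs = legs.loc[legs.geometry.geom_type != 'MultiLineString',:].copy()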
# Point counts for each LineString
legs['point_per_linestring'] = legs['geometry'].apply(lambda geom: len(geom.coords))
#legs['dep_coordinates'] = legs['geometry'].apply(lambda geom: geom.coords[0])
legs.geometry.geom_type.value_counts()

LineString    669278
Name: count, dtype: int64
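
To make the idea behind this step concrete, here is a simplified sketch in plain Python of what "continuous" means for a multi-part line. It is not shapely's `linemerge`: the hypothetical helper `merge_contiguous_parts` only handles parts that are already in order, and coordinates are plain tuples.

```python
def merge_contiguous_parts(parts):
    """Chain the parts of a multi-part line into one coordinate list.

    Each part is a list of (x, y) tuples. Returns the merged list when every
    part starts exactly where the previous one ends, or None when there is a
    gap (the case that is dropped in the notebook).
    """
    merged = list(parts[0])
    for part in parts[1:]:
        if part[0] != merged[-1]:
            return None  # discontinuity: the parts cannot be chained
        merged.extend(part[1:])  # skip the shared endpoint
    return merged


continuous = [[(0, 0), (1, 0)], [(1, 0), (2, 0)]]
broken = [[(0, 0), (1, 0)], [(5, 0), (6, 0)]]

merged_ok = merge_contiguous_parts(continuous)
merged_gap = merge_contiguous_parts(broken)
print(merged_ok, merged_gap)
```
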

In [10]:
#legs = legs.merge(survey, on='IDNO', how='left')

## Split the population per Canton
The code below yields the following segmentation (share of legs per `canton_dep`):
- VD        58.5%
- GE        19.6%
- FRA       16.4%
- GG_FRA    5.5%

In [11]:
# Spatial join between the domicile geolocation and the TAZ information
canton_dom = gpd.sjoin(dom, TAZ, how='left')
del canton_dom['ID_Gem']
del canton_dom['index_right']
# Since the TAZ info only covers Switzerland, we label French residents manually
survey = survey.merge(canton_dom, on='IDNO', how='left')
survey.loc[survey.Pays == '2', 'N_KT'] = 'FRANCE'
# Rename columns to be more explicit
survey.rename(columns={'ID_Agglo':'dom_ID_Agglo','N_Agglo':'dom_N_Agglo','N_KT':'dom_N_KT'}, inplace=True)

In [12]:
## WRONG (kept for reference). One way to infer the home canton would be to take only the first departure of each day, but the results are not convincing...

## We still have missing cantons of residence, mostly because the respondent did not answer the address question (Q14)
## Below we infer the address from the GPS data, i.e. the most recurrent canton among the trip departures
#
## We use the column dep_coordinates and do a spatial join on TAZ info
#departure = gpd.GeoDataFrame(legs[['IDNO','dep_coordinates']], geometry=gpd.points_from_xy(legs.dep_coordinates.str[0], legs.dep_coordinates.str[1]), crs=target_crs)
#del departure['dep_coordinates']
#departure = gpd.sjoin(departure, TAZ[['N_KT','geometry']], how='left')
#del departure['index_right']
## Then we find the most recurrent canton among all departures for each user
#kt_dep = departure[['IDNO', 'N_KT']].groupby(by=['IDNO']).agg(pd.Series.mode).reset_index()
#kt_dep.rename(columns={'N_KT':'dom_KT'},inplace=True)
## Join it to the survey dataframe
#survey = pd.merge(survey, kt_dep, on='IDNO')
## We can now complement the missing survey info with the one above
#survey.loc[survey.dom_N_KT.isna(), 'dom_N_KT'] = survey.loc[survey.dom_N_KT.isna(), 'dom_KT']
#survey.head()
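
For reference, the "most recurrent canton per user" idea from the commented cell can be sketched in plain Python as follows. The helper name `most_frequent_canton` and the toy data are hypothetical. Ties are broken by first occurrence here, whereas `pd.Series.mode` can return several values for a tie.

```python
from collections import Counter, defaultdict


def most_frequent_canton(departures):
    """Map each user to the canton that appears most often among their departures.

    `departures` is an iterable of (IDNO, canton) pairs; canton is None when
    the departure point falls outside the zones (e.g. outside Switzerland).
    """
    per_user = defaultdict(Counter)
    for idno, canton in departures:
        if canton is not None:
            per_user[idno][canton] += 1
    return {idno: counts.most_common(1)[0][0] for idno, counts in per_user.items()}


# Toy data, not taken from the real files
departures = [("A", "VD"), ("A", "VD"), ("A", "GE"), ("B", "GE"), ("B", None)]
inferred = most_frequent_canton(departures)
print(inferred)
```
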

In [13]:
# Merge to staypoint and leg dataframe
# We eventually use the column "canton_dep" provided by Alexis Gumy as he created the weights based on this one.
legs = pd.merge(legs, survey[['IDNO','canton_dep']], on='IDNO', how='left')
# Replace categories with name of canton
legs.replace({'canton_dep':{'1':'GE', '2':'VD', '5':'GG_FRA','6':'FRA'}}, inplace=True)
# Drop IDNO CH14886 and CH15539, which are duplicated in the survey with conflicting info
legs = legs.loc[(~legs.IDNO.isin(['CH15539', 'CH14886']))]
# Drop IDNO 'FR13508', 'CH8035', 'CH14765' that are not in the survey but in gps
legs = legs.loc[~legs.IDNO.isin(['FR13508', 'CH8035', 'CH14765'])]
legs.reset_index(drop=True, inplace=True)
## Delete the coordinate columns that create geometry conflicts later on
#del legs['dep_coordinates']
legs.head(2)

Unnamed: 0,id,started_at,finished_at,type,started_at_timezone,detected_mode,mode,IDNO,geometry,point_per_linestring,canton_dep
0,051d1613-b29a-4598-90d4-2365bf58132b,2023-05-01 05:10:22,2023-05-01 05:16:18,Track,Europe/Zurich,Mode::Car,Mode::Car,CH3181,"LINESTRING (6.53632 46.73239, 6.53632 46.73239...",67,VD
1,92957703-71b4-4ae0-a1d2-3a2862d74a43,2023-05-01 15:11:05,2023-05-01 15:20:08,Track,Europe/Zurich,Mode::Car,Mode::Car,CH3181,"LINESTRING (6.56977 46.70299, 6.56977 46.70299...",105,VD


In [14]:
legs.canton_dep.value_counts(normalize=True)

canton_dep
VD        0.584818
GE        0.196025
FRA       0.164199
GG_FRA    0.054958
Name: proportion, dtype: float64

In [15]:
# Specify the column based on which you want to split the DataFrame
split_column = 'canton_dep'

output_directory = 'gps_canton'

# Get unique values from the split column
unique_values = legs[split_column].unique()

# Check if the output directory already exists
if os.path.exists(output_directory):
    print(f"Output directory '{output_directory}' already exists. Aborting.")
else:
    # Create the output directory
    os.makedirs(output_directory)

    # Split the DataFrame into smaller DataFrames based on unique values
    split_dataframes = {value: legs[legs[split_column] == value] for value in unique_values}

    # Save each smaller DataFrame as a pickle file (GeoJSON export kept commented out below)
    for value, df in split_dataframes.items():
        # Reset the index
        #df = df.reset_index(drop=True)
        
        # Save as pickle file
        pickle_filename = os.path.join(output_directory, f'legs_{value}.pkl')
        df.to_pickle(pickle_filename)
        print(f"Saved '{pickle_filename}'")

        # Save as GeoJSON file
        #geojson_filename = os.path.join(output_directory, f'legs_{value}.geojson')
        #df.to_file(geojson_filename, driver='GeoJSON')
        #print(f"Saved '{geojson_filename}'")

    #Split legs_VD in three even df
    legs_vd = pd.read_pickle('gps_canton/legs_VD.pkl')
    parts_legs_vd = np.array_split(legs_vd, 3)
    output_directory = 'gps_canton'
    for i, part in enumerate(parts_legs_vd):
        part.to_pickle(os.path.join(output_directory, f'legs_VD_part_{i+1}.pkl'))

Output directory 'gps_canton' already exists. Aborting.
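
The VD subset is split into three parts with `np.array_split`, which accepts a length that is not divisible by the number of parts. The plain-Python sketch below mimics how such a near-even split distributes the remainder: the first `len % n` parts get one extra element. The helper name `split_evenly` is hypothetical.

```python
def split_evenly(items, n_parts):
    """Split a list into n_parts consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # the first `extra` parts are one longer
        parts.append(items[start:start + size])
        start += size
    return parts


parts = split_evenly(list(range(10)), 3)
sizes = [len(p) for p in parts]
print(sizes)
```
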


## Nettoyage des pertes de signal

In [16]:
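## DEFINE THE MODE GROUPS SO THAT DIFFERENT THRESHOLDS APPLY TO EACH GROUP

# Two threshold levels are defined (e.g., road_psr_1 and road_psr_2); level 2 uses lower thresholds and is therefore the stricter one.
# psa: threshold on the absolute signal loss (largest gap between consecutive points, in meters)
# psr: threshold on the relative signal loss (largest gap divided by the leg length)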

mode_road = ['Mode::Car', 'Mode::Motorbike', 'Mode::Bus', 'Mode::Tram', 'Mode::Subway', 'Mode::KickScooter', 'Mode::LightRail', 'Mode::Other','Mode::TaxiUber', 'Mode::Carsharing', 'Mode::Ecar']
road_psa_1 = 8000
road_psr_1 = 0.8
road_psa_2 = 5000
road_psr_2 = 0.6

mode_rail = ['Mode::Train','Mode::RegionalTrain']
rail_psa_1 = 100000
rail_psr_1 = 0.65
rail_psa_2 = 85000
rail_psr_2 = 0.50

mode_active = ['Mode::Bicycle', 'Mode::Ebicycle', 'Mode::Walk']
active_psa_1 = 800
active_psr_1 = 0.8
active_psa_2 = 750
active_psr_2 = 0.7

mode_plane_boat = ['Mode::Boat', 'Mode::Airplane']
plane_boat_psa_1 = 0
plane_boat_psr_1 = 0
plane_boat_psa_2 = 0
plane_boat_psr_2 = 0

# Create dictionaries for each category
road_category_1 = {'modes': mode_road, 'psa': road_psa_1, 'psr': road_psr_1}
rail_category_1 = {'modes': mode_rail, 'psa': rail_psa_1, 'psr': rail_psr_1}
active_category_1 = {'modes': mode_active, 'psa': active_psa_1, 'psr': active_psr_1}
plane_boat_category_1 = {'modes': mode_plane_boat, 'psa': plane_boat_psa_1, 'psr': plane_boat_psr_1}

road_category_2 = {'modes': mode_road, 'psa': road_psa_2, 'psr': road_psr_2}
rail_category_2 = {'modes': mode_rail, 'psa': rail_psa_2, 'psr': rail_psr_2}
active_category_2 = {'modes': mode_active, 'psa': active_psa_2, 'psr': active_psr_2}
plane_boat_category_2 = {'modes': mode_plane_boat, 'psa': plane_boat_psa_2, 'psr': plane_boat_psr_2}

# Store the categories as a list of two dictionaries: index 0 is threshold level 1, index 1 is threshold level 2
signal_categories = [{
    'road': road_category_1,
    'rail': rail_category_1,
    'mode_active': active_category_1,
    'mode_plane_boat': plane_boat_category_1},{
    'road': road_category_2,
    'rail': rail_category_2,
    'mode_active': active_category_2,
    'mode_plane_boat': plane_boat_category_2}]

# Example: accessing the values of the 'road' category at threshold level 2 (index 1)
print("Category: road")
print("Modes:", signal_categories[1]['road']['modes'])
print("PSA:", signal_categories[1]['road']['psa'])
print("PSR:", signal_categories[1]['road']['psr'])

Category: road
Modes: ['Mode::Car', 'Mode::Motorbike', 'Mode::Bus', 'Mode::Tram', 'Mode::Subway', 'Mode::KickScooter', 'Mode::LightRail', 'Mode::Other', 'Mode::TaxiUber', 'Mode::Carsharing', 'Mode::Ecar']
PSA: 5000
PSR: 0.6


In [17]:
# Function to calculate the maximum distance between two consecutive points in a LineString
# (the result is in meters only if the geometry uses a projected CRS in meters, e.g. EPSG:2056)
def calculate_max_distance(line):
    # Extract the coordinates of the LineString into a list of points
    points = list(line.coords)
    
    # Initialize a list to store the distances between consecutive points
    distances = []

    # Iterate through the points to calculate and store the distances
    for i in range(len(points) - 1):
        point1 = Point(points[i])
        point2 = Point(points[i + 1])
        distance = point1.distance(point2)
        distances.append(distance)

    # Find the maximum distance from the list of distances
    max_distance = max(distances)
    
    return max_distance
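
The two signal-loss indicators used in the next cell (the largest gap between consecutive points, and that gap divided by the leg length) can be illustrated without shapely. The sketch below works on plain coordinate tuples; the helper name `max_gap_and_ratio` is hypothetical, and returning 0.0 for a zero-length line is an assumption of this sketch, not the notebook's behavior.

```python
import math


def max_gap_and_ratio(coords):
    """Return (largest gap between consecutive points, that gap / total length).

    `coords` is a list of (x, y) tuples in a projected CRS, so distances are
    in the CRS unit (meters for EPSG:2056).
    """
    if len(coords) < 2:
        raise ValueError("a line needs at least two points")
    gaps = [math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(coords, coords[1:])]
    length = sum(gaps)
    max_gap = max(gaps)
    # Assumption of this sketch: a zero-length line gets a ratio of 0.0
    ratio = max_gap / length if length > 0 else 0.0
    return max_gap, ratio


# A 1000 m leg with an 800 m jump in the middle
coords = [(0, 0), (100, 0), (900, 0), (1000, 0)]
max_gap, ratio = max_gap_and_ratio(coords)
print(max_gap, ratio)
```

A leg like this one, where a single jump accounts for 80% of the length, is the kind of trace the relative threshold is meant to catch.
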

In [20]:
%%time

# Load pickle files
KT = ['GG_FRA', 'VD_part_1', 'VD_part_2', 'VD_part_3', 'GE', 'FRA'] #,'GG_FRA', 'VD_part_1', 'VD_part_2', 'VD_part_3', 'GE', 'FRA'
intput_directory = 'gps_canton'

for KT_ in KT:
    legs_ = pd.read_pickle(os.path.join(intput_directory, f'legs_{KT_}.pkl'))
    legs_.to_crs(crs="EPSG:2056", inplace=True)
    # Apply the calculate_max_distance function to the GeoSeries and store the result in a new column
    legs_['max_signlalloss_meters'] = legs_.apply(lambda row: calculate_max_distance(row['geometry']), axis=1)
    
    # Compute the length of each leg
    legs_['length_leg'] = legs_['geometry'].apply(lambda geom: geom.length)
    
    # Compute the relative signal loss
    legs_['rel_max_signalloss'] = legs_['max_signlalloss_meters'].div(legs_['length_leg'])
    
    
    # Add a column to flag the legs that we want to filter out
    legs_['low_quality_legs_1'] = 0
    legs_['low_quality_legs_2'] = 0

    # Flag the low quality legs
    print('ECHANTILLON : ', KT_)
    print('-------------------------')

    for k, signal_categories_ in enumerate(signal_categories):
        if k == 0:
            threshold_col = 'low_quality_legs_1'
        elif k == 1:
            threshold_col = 'low_quality_legs_2'
        print('NIVEAU DU SEUIL : ', k+1)
        print('-------------------------')
        for cat in signal_categories_:
            #if cat == 'road':
            #    continue
            #else:
            legs_.loc[(legs_['mode'].isin(signal_categories_[cat]['modes'])) & 
                     ((legs_.max_signlalloss_meters > signal_categories_[cat]['psa']) | 
                     (legs_.rel_max_signalloss > signal_categories_[cat]['psr'])), threshold_col] = 1
    
            lost_traces = len(legs_.loc[(legs_['mode'].isin(signal_categories_[cat]['modes'])) 
                              & (legs_[threshold_col] == 1)]) / len(legs_.loc[legs_['mode'].isin(signal_categories_[cat]['modes'])]) * 100
            
            print('Category : ', cat, 
                  ' | PSA : ', signal_categories_[cat]['psa'], 
                  ' | PSR : ', signal_categories_[cat]['psr'], 
                  ' \n Part des traces perdues : ', round(lost_traces, 1), '%')
        print('-------------------------')
    
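    # Identify the users whose signal acquisition is consistently poor (mean signal loss above the 99th percentile)
    # Note: both lists below use the same 0.99 quantile, so they are currently identical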
    legs_avg_signal_loss = legs_[['IDNO', 'max_signlalloss_meters', 'rel_max_signalloss']].groupby('IDNO').mean()#.unique()
    legs_avg_signal_loss.reset_index(inplace=True)
    list_of_bad_users_1 = legs_avg_signal_loss.loc[(legs_avg_signal_loss.rel_max_signalloss >
                                                  legs_avg_signal_loss.rel_max_signalloss.quantile(0.99)) |
                                                 (legs_avg_signal_loss.max_signlalloss_meters > 
                                                  legs_avg_signal_loss.max_signlalloss_meters.quantile(0.99)), 'IDNO'].unique()
    list_of_bad_users_2 = legs_avg_signal_loss.loc[(legs_avg_signal_loss.rel_max_signalloss >
                                                  legs_avg_signal_loss.rel_max_signalloss.quantile(0.99)) |
                                                 (legs_avg_signal_loss.max_signlalloss_meters > 
                                                  legs_avg_signal_loss.max_signlalloss_meters.quantile(0.99)), 'IDNO'].unique()
    legs_.loc[legs_.IDNO.isin(list_of_bad_users_1), 'low_quality_legs_1'] = 1
    legs_.loc[legs_.IDNO.isin(list_of_bad_users_2), 'low_quality_legs_2'] = 1
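    # Note: the share printed below divides the number of flagged users by the number of legs, not by the number of users
    print('List of users with constant low signal quality : ', list_of_bad_users_2, 
          '\n eq. to ', round(len(list_of_bad_users_2) / len(legs_) * 100, 2), '%')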
    print('-------------------------')

    #SAVE TO PICKLES
    cols_to_save = ['id','started_at', 'finished_at', 'type', 'started_at_timezone','detected_mode', 'mode', 
                    'IDNO', 'geometry','canton_dep','low_quality_legs_1', 'low_quality_legs_2']
    legs_[cols_to_save].to_crs(crs='EPSG:4326').to_pickle(os.path.join(intput_directory, f'legs_{KT_}_filtered.pkl'))

ECHANTILLON :  GG_FRA
-------------------------
NIVEAU DU SEUIL :  1
-------------------------
Category :  road  | PSA :  8000  | PSR :  0.8  
 Part des traces perdues :  4.2 %
Category :  rail  | PSA :  100000  | PSR :  0.65  
 Part des traces perdues :  5.4 %
Category :  mode_active  | PSA :  800  | PSR :  0.8  
 Part des traces perdues :  4.7 %
Category :  mode_plane_boat  | PSA :  0  | PSR :  0  
 Part des traces perdues :  100.0 %
-------------------------
NIVEAU DU SEUIL :  2
-------------------------
Category :  road  | PSA :  5000  | PSR :  0.6  
 Part des traces perdues :  7.4 %
Category :  rail  | PSA :  85000  | PSR :  0.5  
 Part des traces perdues :  7.6 %
Category :  mode_active  | PSA :  750  | PSR :  0.7  
 Part des traces perdues :  6.8 %
Category :  mode_plane_boat  | PSA :  0  | PSR :  0  
 Part des traces perdues :  100.0 %
-------------------------
List of users with constant low signal quality :  ['FR2129' 'FR503' 'FR6906' 'FR8311'] 
 eq. to  0.01 %
--------------
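
The flagging rule applied in the cell above can be summarized as: a leg is low quality if its largest gap exceeds the absolute threshold (PSA) or its relative gap exceeds the relative threshold (PSR) of its mode category. Below is a minimal sketch with the level-1 values from this notebook; the function name `is_low_quality` and the example legs are hypothetical.

```python
# Level-1 thresholds as defined earlier in this notebook
THRESHOLDS_LEVEL_1 = {
    "road": {"psa": 8000, "psr": 0.8},
    "rail": {"psa": 100000, "psr": 0.65},
    "mode_active": {"psa": 800, "psr": 0.8},
    "mode_plane_boat": {"psa": 0, "psr": 0},
}


def is_low_quality(category, max_gap_m, rel_gap, thresholds=THRESHOLDS_LEVEL_1):
    """Flag a leg when either the absolute or the relative signal loss is too high."""
    limits = thresholds[category]
    return max_gap_m > limits["psa"] or rel_gap > limits["psr"]


flags = [
    is_low_quality("road", 9000, 0.3),            # absolute gap above 8000 m
    is_low_quality("road", 2000, 0.3),            # both indicators below the thresholds
    is_low_quality("mode_active", 500, 0.9),      # relative gap above 0.8
    is_low_quality("mode_plane_boat", 1.0, 0.001),  # thresholds of 0 flag any positive gap
]
print(flags)
```

With thresholds of 0, every plane or boat leg with a positive gap is flagged, which is consistent with the 100% loss reported for that category above.
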

### Check the loss

In [21]:
len(legs_[legs_.low_quality_legs_1 ==1]) / len(legs_)

0.05813624971519708

In [22]:
len(legs_[legs_.low_quality_legs_2 == 1]) / len(legs_)

0.08455684666210982

### Test the different thresholds (for sensitivity analysis)

In [23]:
for cat in signal_categories[0]:
    print('\n---')
    print(cat)
    df_sorted = legs_.loc[legs_['mode'].isin(signal_categories[0][cat]['modes'])].sort_values(by=["rel_max_signalloss", "max_signlalloss_meters"])
    # Calculate the number of rows to keep 95% of the data
    num_rows_to_keep = int(0.95 * len(df_sorted))
    
    # Select the rows that represent the lowest 95% of the data (sorted by relative signal loss first)
    df_filtered = df_sorted.iloc[:num_rows_to_keep]
    
    # Get the threshold values for "rel_max_signalloss" and "max_signlalloss_meters"
    print('Threshold for max_signlalloss_meters', df_filtered["max_signlalloss_meters"].max())
    print('Threshold for rel_max_signalloss', df_filtered["rel_max_signalloss"].max())


---
road
Threshold for max_signlalloss_meters 308916.2503783547
Threshold for rel_max_signalloss 0.5062897101520215

---
rail
Threshold for max_signlalloss_meters 255756.0155114675
Threshold for rel_max_signalloss 0.7364980586836456

---
mode_active
Threshold for max_signlalloss_meters 63194.8939271471
Threshold for rel_max_signalloss 0.7096603823953925

---
mode_plane_boat
Threshold for max_signlalloss_meters 10660778.423309473
Threshold for rel_max_signalloss 0.9999959660046349
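
The cell above derives empirical thresholds by sorting on the relative signal loss first and keeping the lowest 95% of rows. The plain-Python sketch below reproduces that logic on made-up records (hypothetical keys `abs` and `rel`, hypothetical function name `empirical_thresholds`) and shows one consequence: a leg with a small relative loss but a very large absolute gap stays in the kept set and drives the absolute threshold.

```python
def empirical_thresholds(legs, keep_share=0.95):
    """Return (max absolute gap, max relative gap) among the kept share of legs.

    Legs are sorted by relative gap first, then by absolute gap, as in the cell above.
    """
    ordered = sorted(legs, key=lambda leg: (leg["rel"], leg["abs"]))
    n_keep = int(keep_share * len(ordered))
    kept = ordered[:n_keep]
    return max(leg["abs"] for leg in kept), max(leg["rel"] for leg in kept)


# Toy data: 20 legs with increasing gaps ...
legs = [{"abs": 100 * i, "rel": i / 20} for i in range(1, 21)]
# ... except the first one, which has a small relative gap but a huge absolute gap
legs[0] = {"abs": 50000, "rel": 0.05}

abs_threshold, rel_threshold = empirical_thresholds(legs)
print(abs_threshold, rel_threshold)
```

This may explain why the absolute thresholds printed above are so large compared with the PSA values chosen earlier; sorting each indicator separately would give per-indicator quantiles instead.
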


In [24]:
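# Note: the numerator counts active-mode legs above the level-2 thresholds, while the denominator counts road-mode legs
len(legs_.loc[(legs_['mode'].isin(mode_active)) & 
            ((legs_.max_signlalloss_meters > 750) | 
            (legs_.rel_max_signalloss > 0.7))]) / len(legs_.loc[legs_['mode'].isin(mode_road)])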

0.080697628845777

### Re-import the pickles, save them as shapefiles and recombine the pickles that were split by canton

In [27]:
%%time
# List of pickle file names
pickle_files = [
    'legs_FRA_filtered',
    'legs_GG_FRA_filtered',
    'legs_VD_part_1_filtered',
    'legs_VD_part_2_filtered',
    'legs_VD_part_3_filtered',
    'legs_GE_filtered'
]

intput_directory = 'gps_canton'

output_directory = 'gps_canton/shp'
os.makedirs(output_directory, exist_ok=True)  # do not fail if the directory already exists

# Load and concatenate the pickle files
dfs = []
for file in pickle_files:
    df = pd.read_pickle(os.path.join(intput_directory, f'{file}.pkl'))
    df.to_crs(crs=target_crs).to_file(os.path.join(output_directory, f'{file}.shp'))
    dfs.append(df)

# Concatenate the DataFrames into a single DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Dump the combined DataFrame to a pickle file
combined_df.to_pickle('gps_canton/legs_filtered.pkl')
combined_df.to_crs(crs=target_crs).to_file('gps_canton/shp/legs_filtered.shp')



CPU times: user 4min 42s, sys: 30.3 s, total: 5min 12s
Wall time: 5min 37s


### For more testing

In [None]:
## USE THESE LINES TO SAVE EXAMPLES OF HIGH SIGNAL LOSS
## (column and variable names updated to the ones defined above; the level-1 thresholds are assumed here)
#legs_.loc[legs_.max_signlalloss_meters > legs_.max_signlalloss_meters.quantile(0.75)].to_crs(crs=target_crs).to_file('gg_fra_75_quant.geojson', driver='GeoJSON')

#legs_.loc[(legs_['mode'].isin(mode_road)) & (legs_.max_signlalloss_meters < road_psa_1) & (legs_.rel_max_signalloss < road_psr_1)].to_crs(crs=target_crs).to_file('gg_fra_mode_road_.geojson', driver='GeoJSON')

#legs_.loc[(legs_['mode'].isin(mode_active)) & (legs_.max_signlalloss_meters > active_psa_1) & (legs_.rel_max_signalloss > active_psr_1)].to_crs(crs=target_crs).to_file('gg_fra_mode_active.geojson', driver='GeoJSON')


In [None]:
##We can access the coordinates of a Linestring as follows:
#legs.geometry[30831] reads the object linestring
#type(legs.geometry[30831]) must be a shapely.geometry.linestring.LineString
#list(legs_.geometry[30831].coords) displays the coordinate tuples from which we can compute the point-to-point distance