# A protocol for movement data exploration

This notebook presents a systematic movement data exploration protocol. 

Following Zuur et al.'s (2010) example, our protocol consists of a series of flexible steps that should be treated as questions to be asked of the data. The steps are grouped by problem types, based on our extension of the typology of movement data quality problems by Andrienko et al. (2016).

The protocol starts with steps exploring elementary records, before it advances to steps looking into intermediate segments of consecutive records, and finally, to the steps requiring whole trajectories. This approach reflects the typical flow of movement data processing, since raw data is usually provided as elementary records that need to be processed to connect consecutive records into continuous tracks that can then be split to extract individual trajectories. Consequently, choices made along the processing chain, for example, regarding how tracks are split into trajectories, will affect results of later steps (Andrienko et al. 2013, p371).  

At each step of the protocol, we first describe the data problem and its potential causes, then explain the potential consequences if the problem is not identified, and finally, propose suitable exploratory analysis methods.

* **A Missing data**
 * A-1 Spatial gaps & outliers
 * A-2 Temporal gaps & outliers
 * A-3 Spatiotemporal gaps
 * A-4 Attribute gaps
 * A-5 Gaps in trajectories
* **B Precision problems**
 * B-1 Coordinate imprecision
 * B-2 Timestamp imprecision
* **C Consistency problems**
 * C-1 Sampling heterogeneity
 * C-2 Mover heterogeneity
 * C-3 Tracker heterogeneity
* **D Accuracy problems**
 * D-1 Mover identity issues
 * D-2 Spatial inaccuracy 
 * D-3 Temporal inaccuracy


## Setup

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import warnings
warnings.filterwarnings('ignore')

In [None]:
FIGSIZE = (600,400)
SMSIZE = 300
COLOR = 'darkblue'
COLOR_HIGHLIGHT = 'red'
COLOR_BASE = 'grey'

In [None]:
from math import sin, cos, atan2, radians, degrees, sqrt, pi
from datetime import datetime, date
import numpy as np
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import datashader as ds
import holoviews as hv
from shapely.geometry import Point, LineString
from holoviews.operation.datashader import datashade, spread
from holoviews.element import tiles
from holoviews import opts, dim 
import hvplot
import hvplot.pandas  # registers the .hvplot accessor on pandas objects

R_EARTH = 6371000  # radius of earth in meters
C_EARTH = 2 * R_EARTH * pi  # circumference
BG_TILES = tiles.CartoLight()

pd.set_option('use_inf_as_na', True)

In [None]:
def plot_single_mover(df, mover_id, the_date):
    tmp = df[(df.id==mover_id) & (df.index.date==the_date)]
    gdf = gpd.GeoDataFrame(tmp.drop(['x', 'y'], axis=1), crs='EPSG:3857', geometry=[Point(xy) for xy in zip(tmp.x, tmp.y)])
    # movingpandas is imported as mpd in the setup cell
    plot = mpd.Trajectory(gdf, 1).hvplot(title=f'Mover {mover_id} ({the_date})', c='speed_m/s', cmap='RdYlBu',  colorbar=True, clim=(0,15), 
                                         line_width=5, width=FIGSIZE[0], height=FIGSIZE[1], tiles='CartoLight')
    return plot

In [None]:
input_files = [
    'E:/Geodata/AISDK/raw_ais/aisdk_20170701.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20170702.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20170703.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20170704.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20170705.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20170706.csv',
    'E:/Geodata/AISDK/raw_ais/aisdk_20180101.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20180102.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20180103.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20180104.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20180105.csv',
    #'E:/Geodata/AISDK/raw_ais/aisdk_20180106.csv'
]

In [None]:
df = pd.read_csv(input_files[0], nrows=100)

In [None]:
df.head()

In [None]:
df['SOG'].hist(bins=100, figsize=(15,3))

In [None]:
frames = []
for input_file in input_files[:2]: 
    a = pd.read_csv(input_file, usecols=['# Timestamp', 'MMSI', 'Latitude', 'Longitude', 'SOG', 'Type of mobile', 'Ship type', 'Navigational status'])
    # keep only moving Class A transponders
    a = a[(a['Type of mobile'] == 'Class A') & (a.SOG>0)]
    frames.append(a.drop(columns=['Type of mobile', 'SOG']))

# DataFrame.append was removed in pandas 2.0, so the per-file frames are concatenated instead
df = pd.concat(frames)
    
df.rename(columns={'# Timestamp':'time', 'MMSI':'id', 'Latitude':'lat', 'Longitude':'lon', 'Ship type':'shiptype', 'Navigational status':'navstat'}, inplace=True)
df['time'] = pd.to_datetime(df['time'], format='%d/%m/%Y %H:%M:%S')

In [None]:
df.loc[:, 'x'], df.loc[:, 'y'] = ds.utils.lnglat_to_meters(df.lon, df.lat)

df.set_index('time', inplace=True)

df['navstat'] = df['navstat'].astype('category')
df['shiptype'] = df['shiptype'].astype('category')

In [None]:
df.head()

In [None]:
print('Number of records: {} million'.format(round(len(df)/1000000)))

## A) Missing data

Checking for missing data is a common starting point for exploring new movement datasets. At this early stage, we usually start with raw location records that have yet to be aggregated into trajectories. Therefore, initial analyses look at elementary position records.

The following protocol steps target issues of missing data with respect to movement data's spatial, temporal, and attribute dimensions.


### A-1) Spatial gaps & outliers

To gain an overview, the analysis should start from the whole time span before drilling down. Spatial context (usually in the form of base maps) is essential when assessing spatial extent and gaps because context influences movement.

#### Spatial spread / extent & outliers (whole territory / all movers / whole time span)

This step addresses the question if the dataset covers the expected spatial extent. This can be as simple as checking the minimum and maximum coordinate values of the raw records. However, it is not uncommon to encounter spurious location records or outliers that are not representative of the actual covered extent. These outliers may be truly erroneous positions but can also be correct positions that happen to be located outside the usual extent. Looking at elementary position records only, it is usually not possible to distinguish these two cases. It is therefore necessary to take note of these outliers and investigate further in later steps.

TODO: consequences 

Classic scatter plots (or point maps) are helpful at this step. Point density maps (often called heat maps) on their default settings tend to hide outliers and are therefore not recommended.
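
As a complement to the point map, the following minimal sketch compares the full extent with a quantile-based "core" extent to list outlier candidates. The synthetic records, the 1% and 99% quantiles, and the names `core_extent` and `candidates` are assumptions made for illustration; they are not part of the protocol itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical records within a plausible observation area
records = pd.DataFrame({
    'lon': rng.uniform(8, 13, 1000),
    'lat': rng.uniform(54, 58, 1000),
})
records.loc[len(records)] = [0.0, 0.0]  # one spurious record at (0, 0)

# Full extent from minimum and maximum coordinates
full_extent = records.agg(['min', 'max'])

# Assumed "usual" extent: 1% to 99% quantiles per coordinate
core_extent = records.quantile([0.01, 0.99])
lower, upper = core_extent.loc[0.01], core_extent.loc[0.99]

# Records outside the core extent are candidates for closer inspection,
# not automatically errors
outside = ((records < lower) | (records > upper)).any(axis=1)
candidates = records[outside]

print(full_extent)
print(f'{len(candidates)} outlier candidates')
```

The flagged records still include correct positions near the edge of the covered area, so they should only be noted for investigation in later steps.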

In [None]:
print(f'Spatial extent: lon_min={df.lon.min()}, lon_max={df.lon.max()}, lat_min={df.lat.min()}, lat_max={df.lat.max()}')

In [None]:
def plot_basic_scatter(df, color='darkblue', title='', width=FIGSIZE[0], height=FIGSIZE[1], size=2):
    opts.defaults(opts.Overlay(active_tools=['wheel_zoom']))
    pts = df.hvplot.scatter(x='x', y='y', datashade=True, cmap=[color, color], frame_width=width, frame_height=height, title=str(title))
    return BG_TILES * spread(pts, px=size)

In [None]:
plot_basic_scatter(df)

In [None]:
df = df[(df.lon>-90) & (df.lon<90) & (df.lat>0) & (df.lat<80)]

In [None]:
cropped_df = df[(df.lon>0) & (df.lon<20) & (df.lat>52) & (df.lat<60)].copy()  # copy so the column assignments below do not act on a view of df
cropped_df['navstat'] = cropped_df['navstat'].astype('category')
cropped_df['shiptype'] = cropped_df['shiptype'].astype('category')
plot_basic_scatter(cropped_df)

#### Spatial gaps (selected areas / all movers / whole time span)

This step addresses the question if there are spatial gaps in the data coverage. Depending on the type of movers, gaps in certain spatial contexts are to be expected. For example, we wouldn't expect taxi locations in lakes. Other gaps may indicate issues with the data collection process or the data export used to generate the analysis dataset. Therefore, it is essential to evaluate these gaps in their spatial context using base maps showing relevant geographic features, such as the road network for vehicle data or navigation markers for vessel data. The visualization scale determines the size of gaps that can be discovered. However, there are practical limits to exploring ever more detailed scales, since the number of gaps keeps growing as the scale becomes finer.

TODO: consequences

Point density maps are helpful since they make it easy to identify areas with low densities, ignoring occasional outliers.

In [None]:
def plot_point_density(df, title='', width=FIGSIZE[0], height=FIGSIZE[1]):
    opts.defaults(opts.Overlay(active_tools=['wheel_zoom']))
    pts = df.hvplot.scatter(x='x', y='y', title=str(title), datashade=True, frame_width=width, frame_height=height)
    return BG_TILES * pts

In [None]:
plot_point_density(df)

### A-2) Temporal gaps & outliers

#### Temporal extent & outliers (whole territory / all movers / whole time span)

This step addresses the question if the dataset covers the expected temporal extent. Similar to exploring the spatial extent, the obvious step is to determine the minimum and maximum timestamps first. Since GPS tracking requires accurate clocks to function, time information on the tracker is usually reliable. However, it is not guaranteed that these timestamps make it through the whole data collection and (pre)processing chain leading up to the exploratory analysis. For example, in some cases, tracker (or sender) time is replaced by receiver or storage time. Thus clock errors on the receiving or storage devices can result in unexpected timestamps.

TODO: consequences

Temporal charts, particularly record counts over time, are helpful to gain a first impression of the overall temporal extent and whether it is continuous or split into multiple time frames with little or no data in between.

In [None]:
print(f'Temporal extent: {df.index.min()} to {df.index.max()}')

In [None]:
TIME_SAMPLE = '15min'

df['id'].resample(TIME_SAMPLE).count()\
    .hvplot(title=f'Number of records per {TIME_SAMPLE}', width=FIGSIZE[0])

#### Temporal gaps in linear sequence & temporal cycles (whole territory / all movers / time spans)

This step addresses the question if there are temporal gaps in the dataset. Temporal gaps can be due to scheduled breaks in data collection, deliberate choices during data export, as well as unintended issues during data collection or (pre)processing. Similar to exploring spatial gaps, the temporal scale influences which size of gaps can be discovered. Temporal gaps can be one-time events or exhibit reoccurring patterns. For example, daily and weekly cycles are typical for human movement data.

TODO: consequences

Two-dimensional time histograms are helpful at this step.

In [None]:
counts_df = df['id'].groupby([df.index.hour, pd.Grouper(freq='d')]).count().to_frame(name='n')
counts_df.rename_axis(['hour', 'day'], inplace=True)
counts_df.hvplot.heatmap(title='Record count', x='hour', y='day', C='n', width=FIGSIZE[0])

### A-3) Spatiotemporal changes / gaps

While the previous two steps looked at spatial gaps over the whole time span or temporal gaps for the whole territory, this step aims to explore spatiotemporal changes and gaps.

#### Changing extent

This step addresses the question whether there are changes in spatial extent over time. Changing spatial extent may be due to planned extensions or reductions of the data collection / observation area. Similarly, the extent is also expected to shift if the movers collectively change their location, as is the case, for example, with tracks of migrating birds.

TODO: consequences

Small multiples are helpful since they provide a quick way to compare extents during different time spans.

In [None]:
def plot_multiple_by_day(df, day):
    return plot_basic_scatter(df[df.index.date==day], title=day, width=SMSIZE, height=SMSIZE)
    
def plot_multiples_by_day(df):
    days = df.index.to_period('D').unique()
    a = None
    for a_day in days:
        a_day = a_day.to_timestamp().date()
        plot = plot_multiple_by_day(df, a_day)
        if a is None: a = plot
        else: a = a  + plot
    return a

In [None]:
plot_multiples_by_day(df).cols(2)

In [None]:
plot_multiples_by_day(cropped_df).cols(2)

In [None]:
def plot_multiple_by_hour_of_day(df, hour, fun):
    return fun(df[df.index.hour==hour], title=hour, width=SMSIZE, height=SMSIZE)
    
def plot_multiples_by_hour_of_day(df, hours=range(0,24), fun=plot_basic_scatter):
    a = None
    for hour in hours:
        plot = plot_multiple_by_hour_of_day(df, hour, fun)
        if a is None: a = plot
        else: a = a + plot
    return a

In [None]:
#plot_multiples_by_hour_of_day(df[df.shiptype=='Fishing']).cols(2)
plot_multiples_by_hour_of_day(df, hours=[6,7,8,9]).cols(2)

#### Temporary gaps

This step addresses the question whether there are temporary gaps in the overall spatial coverage. Like temporary changes in the overall extent, temporary gaps can be due to mover behavior, as well as planned and unplanned changes of the data collection or (pre)processing workflows.

TODO: consequences

Small multiples of density maps or animated density maps are helpful at this step.

In [None]:
plot_multiples_by_hour_of_day(cropped_df, hours=[6,7,8,9], fun=plot_point_density).cols(2)

### A-4) Attribute gaps

Some attributes may only be available during certain time spans or in certain areas.

#### Spatial attribute gaps

This step addresses the question if there are areas with missing attribute data. Locally missing attribute data can be due to heterogeneous data collection system setups.

TODO: consequences

The methods used to explore spatial extent and gaps can be adapted to missing attribute data.

In [None]:
CATEGORY = 'shiptype' #'navstat'

cats = df[CATEGORY].unique()
#[cat for cat in cats]

In [None]:
cmap = {} 
for cat in cats:
    cmap[cat] = COLOR_BASE
cmap['Unknown value'] = COLOR_HIGHLIGHT
cmap['Undefined'] = COLOR_HIGHLIGHT

In [None]:
def plot_categorized_scatter(df, cat, title='', width=SMSIZE, height=SMSIZE, cmap=cmap):
    opts.defaults(opts.Overlay(active_tools=['wheel_zoom']))
    pts = df.hvplot.scatter(x='x', y='y', datashade=True, by=cat, colormap=cmap, legend=True, frame_width=width, frame_height=height, title=str(title))
    return BG_TILES * pts

In [None]:
unknown = df[(df[CATEGORY]=='Unknown value') | (df[CATEGORY]=='Undefined')]
known = df[(df[CATEGORY]!='Unknown value') & (df[CATEGORY]!='Undefined')]

( plot_categorized_scatter(df, CATEGORY, title='Categorized', width=SMSIZE, height=SMSIZE, cmap=cmap) + 
  plot_basic_scatter(unknown, COLOR_HIGHLIGHT, title='Unknown only', width=SMSIZE, height=SMSIZE, size=1) +
  plot_basic_scatter(known, COLOR_BASE, title='Known only', width=SMSIZE, height=SMSIZE, size=1)
)

#### Temporal attribute gaps

This step addresses the question if there are temporary gaps in attribute data. Changes to the data collection or (pre)processing workflow can affect which attributes are available during certain time spans.

TODO: consequences

The methods used to explore temporal extent and gaps can be adapted to missing attribute data.

In [None]:
plot_multiples_by_day(unknown).cols(2)

### DATA PREPARATION: Computing segment information

In [None]:
def time_difference(row):
    t1 = row['prev_t']
    t2 = row['t']
    return (t2-t1).total_seconds()

def speed_difference(row):
    return row['speed_m/s'] - row['prev_speed']

def acceleration(row):
    if row['diff_t_s'] == 0:
        return None
    return row['diff_speed'] / row['diff_t_s']

def spherical_distance(lon1, lat1, lon2, lat2):
    delta_lat = radians(lat2 - lat1)
    delta_lon = radians(lon2 - lon1)
    a = sin(delta_lat/2) * sin(delta_lat/2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(delta_lon/2) * sin(delta_lon/2)
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    dist = R_EARTH * c
    return dist

def distance_to_prev(row):
    return spherical_distance(row['prev_lon'], row['prev_lat'], row['lon'], row['lat'])
    
def distance_to_next(row):
    return spherical_distance(row['next_lon'], row['next_lat'], row['lon'], row['lat'])

def direction(row):
    lon1, lat1, lon2, lat2 = row['prev_lon'], row['prev_lat'], row['lon'], row['lat']
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    delta_lon = radians(lon2 - lon1)
    x = sin(delta_lon) * cos(lat2)
    y = cos(lat1) * sin(lat2) - (sin(lat1) * cos(lat2) * cos(delta_lon))
    initial_bearing = atan2(x, y)
    initial_bearing = degrees(initial_bearing)
    compass_bearing = (initial_bearing + 360) % 360
    return compass_bearing

def angular_difference(row):
    diff = abs(row['prev_dir'] - row['dir'])
    if diff > 180:
        diff = abs(diff - 360)
    return diff 

def compute_segment_info(df):
    df = df.copy()
    df['t'] = df.index
    df['prev_t'] = df.groupby('id')['t'].shift()
    df['diff_t_s'] = df.apply(time_difference, axis=1)
    df['prev_lon'] = df.groupby('id')['lon'].shift()
    df['prev_lat'] = df.groupby('id')['lat'].shift()
    df['prev_x'] = df.groupby('id')['x'].shift()
    df['prev_y'] = df.groupby('id')['y'].shift()
    df['diff_x'] = df['x'] - df['prev_x']
    df['diff_y'] = df['y'] - df['prev_y']
    df['next_lon'] = df.groupby('id')['lon'].shift(-1)
    df['next_lat'] = df.groupby('id')['lat'].shift(-1)
    df['dist_prev_m'] = df.apply(distance_to_prev, axis=1)
    df['dist_next_m'] = df.apply(distance_to_next, axis=1)
    df['speed_m/s'] = df['dist_prev_m']/df['diff_t_s']
    df['prev_speed'] = df.groupby('id')['speed_m/s'].shift()
    df['diff_speed'] = df.apply(speed_difference, axis=1)
    df['acceleration'] = df.apply(acceleration, axis=1)
    df['dir'] = df.apply(direction, axis=1)
    df['prev_dir'] = df.groupby('id')['dir'].shift()
    df['diff_dir'] = df.apply(angular_difference, axis=1)
    df = df.drop(columns=['prev_x', 'prev_y', 'next_lon', 'next_lat', 'prev_speed', 'prev_dir'])
    return df

In [None]:
%%time

try:
    segment_df = pd.read_pickle('./segments.pkl')
except FileNotFoundError:  # no cached segments yet, compute and cache them
    segment_df = compute_segment_info(cropped_df)
    segment_df.to_pickle("./segments.pkl")

In [None]:
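# Assign two vessels the same ID (1); mover 1 is plotted in the non-unique IDs step
easteregg = cropped_df[(cropped_df.id==636092484) | (cropped_df.id==636092478)].copy()
easteregg['id'] = 1
# DataFrame.append was removed in pandas 2.0
segment_df = pd.concat([segment_df, compute_segment_info(easteregg)])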

### A-5) Gaps in trajectories

Depending on the method used for splitting tracks into trajectories, the resulting trajectories can include gaps. These gaps can be due to technical failure of the tracking device, the mover leaving the observable area, deliberate deactivation of the tracking device, or (pre)processing issues. 

TODO: consequences

TODO: method

In [None]:
GAP_MIN = 10000
GAP_MAX = 100000

segment_df['is_gap'] = ( (segment_df['dist_prev_m']>GAP_MIN) & (segment_df['dist_prev_m']<GAP_MAX) ) | ( (segment_df['dist_next_m']>GAP_MIN) & (segment_df['dist_next_m']<GAP_MAX) ) 
segment_df['id_by_gap'] = segment_df.groupby("id")['is_gap'].transform(lambda x: x.ne(x.shift()).cumsum())

In [None]:
grouped = [df[['x','y']] for name, df in segment_df[segment_df.is_gap].groupby(['id', 'id_by_gap']) ]
path = hv.Path(grouped, kdims=['x','y'])
plot = datashade(path, cmap=COLOR_HIGHLIGHT).opts(frame_height=FIGSIZE[1], frame_width=FIGSIZE[0])
BG_TILES * plot

## B) Precision problems

Precision issues in movement data may affect both spatial coordinates as well as timestamps of records. 

The following protocol steps target issues of excessively truncated coordinates and timestamps. 


### B-1) Coordinate imprecision

This step addresses the question if the coordinates have been truncated excessively. Given the limited accuracy of conventional GPS, one may argue that there is little benefit to more than five decimal places if coordinates are reported as latitude and longitude. However, rounding or truncating coordinates can lead to stair-shaped trajectories, particularly in densely sampled datasets. 

TODO: consequences

Direction histograms are useful for revealing excessively truncated coordinates, which result in an over-representation of direction values at 45 degree steps. 
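
The 45 degree effect can be reproduced with synthetic data. The following minimal sketch assumes a hypothetical densely sampled planar track in arbitrary units and rounds its coordinates to whole units; the helper names `direction_deg` and `share_on_45_degree_steps` are illustrative only.

```python
import numpy as np

def direction_deg(dx, dy):
    """Compass-style direction (0 = north, clockwise) of planar steps."""
    return (np.degrees(np.arctan2(dx, dy)) + 360) % 360

def share_on_45_degree_steps(xy):
    """Share of non-zero segments whose direction is a multiple of 45 degrees."""
    d = np.diff(xy, axis=0)
    d = d[(d != 0).any(axis=1)]  # ignore zero-length segments
    remainder = direction_deg(d[:, 0], d[:, 1]) % 45
    on_step = np.minimum(remainder, 45 - remainder) < 1e-6
    return on_step.mean()

rng = np.random.default_rng(42)
# Hypothetical track: steps are small compared to the rounding unit of 1
steps = rng.normal(loc=(0.3, 0.2), scale=0.5, size=(5000, 2))
track = np.cumsum(steps, axis=0)

share_raw = share_on_45_degree_steps(track)
share_rounded = share_on_45_degree_steps(np.round(track))

print(f'Raw coordinates:     {share_raw:.1%} of directions on 45 degree steps')
print(f'Rounded coordinates: {share_rounded:.1%} of directions on 45 degree steps')
```

Because most rounded steps move by at most one unit per axis, nearly all remaining directions fall on the eight multiples of 45 degrees, which is the pattern a direction histogram makes visible.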

In [None]:
segment_df['dir'][segment_df.dist_prev_m>0].hvplot.hist(bins=72, title='Histogram of directions')

### B-2) Timestamp imprecision 

#### Truncated timestamps

This step addresses the question if timestamps have been truncated excessively. Imprecise timestamps are the result of undue truncation or rounding in the data collection or (pre)processing workflow. Truncation can result in multiple position records of the same mover referring to the same time (Andrienko et al. 2016). Consecutive records with identical timestamps but different positions result in zero-length time deltas between the affected records and thus in division-by-zero errors when computing speeds. If positions are sparsely sampled, moderate truncation (for example, of milliseconds) will not result in multiple records with identical timestamps. 

TODO: consequences

Counts of records per timestamp and mover ID can help identify cases of excessively truncated timestamps.
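
The idea can be illustrated on a toy example before applying it to the full dataset. This sketch uses hypothetical records (the values are invented) and the same column names `id`, `t`, and `x` as the segment data; it marks records that share a timestamp with another record of the same mover and shows the resulting zero-length time delta.

```python
import pandas as pd

# Hypothetical records: mover 1 reports twice within the same truncated second
records = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    't': pd.to_datetime(['2017-07-01 00:00:00', '2017-07-01 00:00:00',
                         '2017-07-01 00:00:10', '2017-07-01 00:00:00',
                         '2017-07-01 00:00:05']),
    'x': [0.0, 5.0, 50.0, 100.0, 120.0],
})

# keep=False marks every record of a duplicated (id, t) pair
shared_timestamp = records.duplicated(subset=['id', 't'], keep=False)
n_per_mover = records[shared_timestamp].groupby('id').size()

# Time deltas between consecutive records of the same mover, in seconds
diff_t_s = records.groupby('id')['t'].diff().dt.total_seconds()

print(n_per_mover)
print(f'Zero-length time deltas: {(diff_t_s == 0).sum()}')
```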

In [None]:
non_zero_movement = segment_df[segment_df.dist_prev_m>0]

n_per_id_t = non_zero_movement[['id', 't', 'x']].groupby(['id', 't']).count().reset_index()
n_per_id_t['x'].plot.hist(title='Counts of records per timestamp and mover ID', log=True)
#n_per_id_t.groupby('x').count().hvplot(title='Counts of records per timestamp and mover ID', y='id', logy=True)  # line plot not ideal
#n_per_id_t['x'].hvplot.hist(title='Counts of records per timestamp and mover ID', logy=True)  # upstream bug in log scale

In [None]:
duplicates_per_id = n_per_id_t[n_per_id_t.x>1].drop(columns=['t']).groupby(['id']).count().rename(columns={'x':'n'})
duplicates_per_id['n'].plot.hist(title='Count of duplicate timestamps per mover ID', log=True)

## C) Consistency problems

Datasets may not be as consistent with regard to collection parameters and covered movers as analysts expect. These problems usually cannot be detected from elementary position records. Therefore, intermediate segments or overall trajectories are needed. 

The following protocol steps target issues of heterogeneous sampling intervals and unexpected heterogeneous mover types. 


### C-1) Sampling heterogeneity

#### Heterogeneous sampling intervals

This step addresses the question whether the sampling frequency is stable. Some tracking systems provide records at regular time intervals. Other systems have rule-based sampling strategies. For example, in the Automatic Identification System (AIS), updates are more frequent when objects move quickly than when they stand still. Some GPS trackers may skip positions during straight-line movement (Andrienko et al. 2016). Other systems work on a best-effort basis with a target sampling interval that may be exceeded if the system is busy.

TODO: consequences

Histograms of sampling intervals help determine whether sampling intervals are stable and, if yes, what the typical sampling interval is. If not, they show the range of observed sampling intervals.
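
In addition to the overall histogram, a per-mover summary can show whether all movers share the same interval. The following minimal sketch uses hypothetical records for two movers and pandas only; the mover IDs, timestamps, and the `summary` table are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records: mover 'a' is sampled regularly, mover 'b' has one long interval
records = pd.DataFrame({
    'id': ['a'] * 4 + ['b'] * 4,
    't': pd.to_datetime([
        '2017-07-01 00:00:00', '2017-07-01 00:00:10',
        '2017-07-01 00:00:20', '2017-07-01 00:00:30',
        '2017-07-01 00:00:00', '2017-07-01 00:00:10',
        '2017-07-01 00:03:10', '2017-07-01 00:03:20',
    ]),
})

# Intervals are only meaningful between consecutive records of the same mover
records = records.sort_values(['id', 't'])
records['diff_t_s'] = records.groupby('id')['t'].diff().dt.total_seconds()

# Median, minimum, and maximum sampling interval per mover, in seconds
summary = records.groupby('id')['diff_t_s'].agg(['median', 'min', 'max'])
print(summary)
```

Both movers have the same median interval here, so only the maximum reveals the irregular sampling of mover `b`.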

In [None]:
segment_df.diff_t_s.hvplot.hist(title='Histogram of intervals between consecutive records (in seconds)', bins=100)

In [None]:
segment_df[segment_df.diff_t_s<=120].diff_t_s.hvplot.hist(title='Histogram of intervals between consecutive records (in seconds)', bins=60)

In [None]:
segment_df.hvplot.scatter(title='Coordinate change plot', x='diff_x', y='diff_y', datashade=True, 
                          xlim=(-1000,1000), ylim=(-1000,1000), frame_width=FIGSIZE[1], frame_height=FIGSIZE[1])

### C-2) Mover heterogeneity

#### Heterogeneous mover types

This step addresses the question whether the dataset contains heterogeneous types of movers. Datasets of human movement are expected to contain a mix of different transport modes. Other datasets, such as floating car data (FCD), are expected to be more homogeneous, for example, to only contain car movements. However, errors in the collection process can invalidate this assumption. For example, if mobile (as opposed to built-in) trackers are used, they may be removed from vehicles and carried around by other means of transport. Other sources of heterogeneity are not due to errors but may still surprise analysts. For example, AIS datasets also contain tracks from search and rescue craft, which include helicopters.

TODO: consequences

Scatterplots of different combinations of trajectory characteristics, such as total length, median speed, median direction change, and typical acceleration can help gain a better understanding of how heterogeneous the movers in a dataset are. 

In [None]:
non_zero_speed = segment_df[(segment_df['speed_m/s']>0.1)]
daily = non_zero_speed.groupby(['id', pd.Grouper(freq='d')]).agg({'dist_prev_m':'sum', 'speed_m/s':'median'}) 

daily.hvplot.scatter(title='Daily travelled distance over median speed (m/s)', x='dist_prev_m', y='speed_m/s', 
                    hover_cols=['id','time'], frame_width=FIGSIZE[1], frame_height=FIGSIZE[1], alpha=0.3, 
                    xlim=(-100000,1500000), ylim=(-10,100))

In [None]:
def plot_paths(original_df, title='', add_bg=True):
    grouped = [df[['x','y']] for name, df in original_df.groupby(['id']) ]
    path = hv.Path(grouped, kdims=['x','y'])
    plot = datashade(path, cmap=COLOR_HIGHLIGHT).opts(title=title, frame_height=FIGSIZE[1], frame_width=FIGSIZE[0])
    if add_bg:
        return BG_TILES * plot
    else: 
        return plot

In [None]:
speedsters = daily[daily['speed_m/s']>20].reset_index().id.unique()
speedsters = segment_df[segment_df.id.isin(speedsters)]
plot_paths(speedsters, title='Speedsters')

In [None]:
longdist = daily[daily['dist_prev_m']>800000].reset_index().id.unique()
longdist = segment_df[segment_df.id.isin(longdist)]
plot_paths(longdist, title='Long distance travelers')

#grouped = [df[['x','y']] for name, df in longdist.groupby(['id']) ]
#path = hv.Path(grouped, kdims=['x','y'])
#plot = datashade(path, cmap=COLOR_HIGHLIGHT).opts(title='Long distance travelers', frame_height=FIGSIZE[1], frame_width=FIGSIZE[0])
#BG_TILES * plot

### DATA PREPARATION: Computing trajectory information

In [None]:
MINIMUM_NUMBER_OF_RECORDS = 100
MINIMUM_SPEED_MS = 1

def reset_values_at_daybreaks(tmp, columns):
    tmp['ix'] = tmp.index
    tmp['zero'] = 0
    ix_first = tmp.groupby(['id', pd.Grouper(freq='d')]).first()['ix']
    for col in columns:
        tmp[col] = tmp['zero'].where(tmp['ix'].isin(ix_first), tmp[col])
    tmp = tmp.drop(columns=['zero', 'ix'])
    return tmp

tmp = segment_df.copy()
tmp['acceleration_abs'] = np.abs(tmp['acceleration'])
tmp['diff_speed_abs'] = np.abs(tmp['diff_speed'])
tmp = tmp.replace([np.inf, -np.inf], np.nan)

tmp = reset_values_at_daybreaks(tmp, ['diff_t_s','dist_prev_m','diff_speed_abs','acceleration_abs'])

traj_df = tmp.groupby(['id', pd.Grouper(freq='d')]) \
    .agg({'diff_t_s':['median', 'sum'], 
          'speed_m/s':['median','std'],
          'diff_dir':['median','std'], 
          'dist_prev_m':['median', 'sum'], 
          'diff_speed_abs':['max'], 
          'acceleration_abs':['median','max','mean','std'], 
          't':['min','count'],
         'shiptype':lambda x:x.value_counts().index[0]}) 

traj_df.columns = ["_".join(x) for x in traj_df.columns.to_flat_index()]  # flatten (column, aggregate) pairs
traj_df = traj_df.rename(columns={'t_count':'n', 'shiptype_<lambda>':'shiptype', 
                                  'diff_t_s_sum':'duration_s', 'dist_prev_m_sum':'length_m'})
traj_df['length_km'] = traj_df['length_m'] / 1000
traj_df['duration_h'] = traj_df['duration_s'] / 3600
traj_df['t_min_h'] = traj_df['t_min'].dt.hour + traj_df['t_min'].dt.minute / 60

traj_df = traj_df[traj_df.n>=MINIMUM_NUMBER_OF_RECORDS]
traj_df = traj_df[traj_df['speed_m/s_median']>=MINIMUM_SPEED_MS]
traj_df

In [None]:
hvplot.scatter_matrix(
    traj_df[['length_km', 'speed_m/s_median', 'duration_h', 'acceleration_abs_mean', 'diff_dir_median']]
)

### C-3) Tracker heterogeneity

#### Heterogeneous trackers

This step addresses the question whether the dataset contains records from devices with different tracking characteristics. Devices with GPS tracking capabilities vary widely in performance. For example, when data is collected using smartphone apps, coordinates may have passed through a variety of (not always fully transparent) preprocessing steps that depend on the operating system version and hardware manufacturer (citation needed).  

TODO: consequences

Effects of heterogeneous trackers can be hard to distinguish from effects of heterogeneous movers. Tracker heterogeneity may result in sampling rates and/or spatial accuracy that differ between movers. Furthermore, differences in the availability of additional attribute data within movement records may point towards tracker heterogeneity. 

In [None]:
traj_df[(traj_df['diff_t_s_median']<=120) & (traj_df['speed_m/s_median']>0)] \
    .hvplot.scatter(
        title='Median sampling interval over median speed', alpha=0.3,
        x='diff_t_s_median', y='speed_m/s_median', hover_cols=['id','time'], #datashade=True,
        frame_width=FIGSIZE[1], frame_height=FIGSIZE[1], ylim=(-10,100))

## D) Accuracy problems 

Incorrect mover identities, coordinates, and timestamps can affect movement data analyses in a variety of ways. These problems usually cannot be detected from elementary position records. Therefore, intermediate segments or overall trajectories are needed. 

The following protocol steps target issues of mover identity, as well as spatial and temporal inaccuracy. 


### D-1) Mover identity issues 

Reliable mover identifiers are needed to identify which movement data records belong to the same mover. Identity issues occur when ids are not unique, i.e. if multiple movers are assigned the same identifier. A single mover may also be referred to by multiple different identifiers, either at the same time or due to changes over time. This can happen because the data collection system or (pre)processing workflow reassigns identifiers based on business rules or in regular time intervals. 

#### Non-unique IDs

This step addresses the question whether the dataset contains cases of non-unique identifiers. Due to misconfiguration of trackers or (pre)processing errors, the same identifier may be assigned to multiple movers simultaneously (or to different movers over time which is covered by the Unstable IDs step). Simultaneous non-unique IDs result in trajectories that connect location records by multiple movers traveling on their distinct paths. The resulting trajectory therefore jumps between locations along these different paths. Consequently, the trajectory assumes a zig-zag shape and speeds derived from consecutive location records assume unrealistic values. 

TODO: consequences

Scatterplots of trajectory length and direction change are useful to identify cases of non-unique IDs.
Since movers with identical IDs rarely travel in close vicinity, potential candidates for non-unique IDs are characterized by long trajectories with high direction change values. 
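
The zig-zag effect can be illustrated with two hypothetical vessels that travel east on parallel courses 10 km apart and report under the same identifier. This is a minimal planar sketch; the helper `median_direction_change` and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def median_direction_change(x, y):
    """Median absolute change of direction (degrees) between consecutive segments."""
    dx, dy = np.diff(x), np.diff(y)
    direction = (np.degrees(np.arctan2(dx, dy)) + 360) % 360
    change = np.abs(np.diff(direction))
    change = np.where(change > 180, 360 - change, change)
    return np.median(change)

t = np.arange(0, 100, 10)  # seconds
# Two hypothetical vessels heading east at 10 m/s on parallel courses
vessel_a = pd.DataFrame({'t': t, 'x': 10.0 * t, 'y': 0.0})
vessel_b = pd.DataFrame({'t': t + 5, 'x': 10.0 * t, 'y': 10000.0})

# Sharing one ID means the records are interleaved into a single trajectory
merged = pd.concat([vessel_a, vessel_b]).sort_values('t')

single_change = median_direction_change(vessel_a['x'].to_numpy(), vessel_a['y'].to_numpy())
merged_change = median_direction_change(merged['x'].to_numpy(), merged['y'].to_numpy())

print(f'Single vessel: {single_change:.1f} degrees')
print(f'Shared ID:     {merged_change:.1f} degrees')
```

The merged trajectory jumps back and forth between the two courses, so its median direction change approaches 180 degrees while each vessel on its own shows none.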

In [None]:
traj_df.hvplot.scatter(
    title='Trajectory length over direction difference (median)', alpha=0.3,
    x='length_km', y='diff_dir_median', hover_cols=['id','time'], #datashade=True,
    frame_width=FIGSIZE[1], frame_height=250#, ylim=(-10,100)
) + traj_df.sort_values(by='length_km', ascending=False)[:10][['length_km', 'speed_m/s_median', 'diff_dir_median']].hvplot.table(
    title='Top 10 trajectories - length', frame_width=FIGSIZE[1])

In [None]:
plot_single_mover(segment_df, 1, date(2017,7,1)) 

#### Unstable IDs

This step addresses the question whether mover identifiers are stable and for how long they remain stable. Some data sources do not provide permanently stable identifiers. Systems may reassign identifiers based on business rules or in regular time intervals. For example, taxi floating car systems may not include stable vehicle IDs, instead relying on trip IDs that are reassigned whenever a taxi finishes a trip. 

TODO: consequences

Scatterplots of trajectory duration versus start time are useful to find out how often IDs change and whether they tend to change at the same time. 

In [None]:
hvplot.scatter_matrix(traj_df[['t_min_h', 'duration_h']])

### D-2) Spatial inaccuracy 

Coordinate errors range from basic noise due to the inherent inaccuracy of GPS to unrealistic jumps caused by technical errors or deliberate action. 

#### Outliers with unrealistic jumps

This step addresses the question of whether trajectories contain unrealistic jumps that require data cleaning. These jumps result in unrealistic derived speed values. The limit for being unrealistic depends on the use case. For example, for ground-based transport, Fillekes et al. (2019) set the limit at 330 km/h based on the maximum speed of German high-speed trains. 

TODO: consequences

Histograms of derived speed between consecutive location records are useful to see if there is a long tail of high speed values.  
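
Once a speed limit has been chosen, the records in the long tail can be counted directly. The following minimal sketch assumes a segment table with a `speed_m/s` column as used in this notebook and applies the 330 km/h limit cited above purely as an example. The toy data and the helper name `flag_unrealistic_speeds` are hypothetical, and a different limit will be appropriate for other types of movers.

```python
import pandas as pd

MAX_SPEED_KMH = 330                  # example limit from Fillekes et al. (2019)
MAX_SPEED_MS = MAX_SPEED_KMH / 3.6   # convert km/h to m/s

# Toy segments (hypothetical values)
toy_segment_df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'speed_m/s': [5.2, 250.0, 6.1, 12.0, 91.0],
})

def flag_unrealistic_speeds(segment_df, max_speed_ms=MAX_SPEED_MS):
    """Return the records whose derived speed exceeds the given limit."""
    return segment_df[segment_df['speed_m/s'] > max_speed_ms]

jumps = flag_unrealistic_speeds(toy_segment_df)
share = len(jumps) / len(toy_segment_df)
print(jumps)
print(f'{share:.0%} of records exceed {MAX_SPEED_KMH} km/h')
```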


In [None]:
segment_df['speed_m/s'].hvplot.hist(
    title='Histogram of speed between consecutive records', bins=100, frame_width=FIGSIZE[1], frame_height=250
) + segment_df.sort_values(by='speed_m/s', ascending=False)[:10][['id', 'speed_m/s']].hvplot.table(
    title='Top 10 records - speed', frame_width=FIGSIZE[1])

In [None]:
plot_single_mover(segment_df, 218057000, date(2018,1,1))

In [None]:
plot_single_mover(segment_df, 219348000, date(2017,7,1))

#### Jitter / noise

This step addresses the question of how noisy the trajectories are. GPS noise causes a systematic "overestimation of distance" when the sampling frequency is high (Ranacher et al. 2016 IJGIS). On the other hand, distances are underestimated when the sampling frequency is low. Distance and derived speed values are therefore insufficient to understand noise unless the sampling frequency is evaluated as well. Noise also affects trajectories of movers that are standing still, where it appears as fake jittery movement. 
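
Both effects can be reproduced with a small simulation. The following is an illustrative sketch under stated assumptions, not an analysis of the dataset: a mover travels once around a circle with a radius of 1 km, positions are in metres, and noise is modelled as independent Gaussian errors with a standard deviation of 5 m. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

RADIUS_M = 1000.0
TRUE_LENGTH_M = 2 * np.pi * RADIUS_M   # length of the true circular path

def path_length(xy):
    """Sum of straight-line distances between consecutive points."""
    steps = np.diff(xy, axis=0)
    return np.sqrt((steps ** 2).sum(axis=1)).sum()

def circle_points(n_segments):
    """Points on the circle, closing the loop after n_segments segments."""
    angles = np.linspace(0, 2 * np.pi, n_segments + 1)
    return np.column_stack([RADIUS_M * np.cos(angles), RADIUS_M * np.sin(angles)])

# High sampling frequency with noise: every short step is inflated by the noise
dense_noisy = circle_points(1000)
dense_noisy = dense_noisy + rng.normal(scale=5.0, size=dense_noisy.shape)

# Low sampling frequency without noise: straight chords cut the curve short
sparse = circle_points(4)

print(f'true length:           {TRUE_LENGTH_M:.0f} m')
print(f'dense, noisy sampling: {path_length(dense_noisy):.0f} m')
print(f'sparse sampling:       {path_length(sparse):.0f} m')
```

The densely sampled noisy track is longer than the true path, while the sparsely sampled track is shorter, which is why the sampling frequency has to be considered together with distance and speed values.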

TODO: consequences

Trajectory plots are an intuitive way to evaluate small sets of trajectories. However, in some cases, trajectories that appear noisy can reflect real movement patterns. For example, vessel routes may have a zig-zag shape in case of adverse weather conditions (Patroumpas et al. 2017). 

In [None]:
traj_df.hvplot.scatter(
    title='Direction difference median over standard deviation', alpha=0.3,
    x='diff_dir_median', y='diff_dir_std', hover_cols=['id','time'], #datashade=True,
    frame_width=FIGSIZE[1], frame_height=250, ylim=(-10,100)
) + traj_df.sort_values(by='diff_dir_median', ascending=False)[:10][['diff_dir_median','diff_dir_std']].hvplot.table(
    title='Top 10 trajectories - direction difference', frame_width=FIGSIZE[1])

In [None]:
plot_single_mover(segment_df, 244063000, date(2018,1,1))

In [None]:
plot_single_mover(segment_df, 220614000, date(2018,1,1))

### D-3) Temporal inaccuracy

Timestamp errors potentially affect the synchronization between trajectories as well as the order of records within individual trajectories. 

#### Time zone and daylight saving issues

This step addresses the question of how time zones and daylight saving affect the dataset. In some datasets, time zone information may be included with each timestamp. Usually, however, this is not the case and analysts have to resort to metadata or documentation, which are not always comprehensive or reliable. Time zone issues can be hard to detect, particularly if the dataset contains tracks from multiple time zones but the zone information got lost along the way. These issues may be discovered due to unexpected derived movement patterns, for example, significant numbers of people leaving their homes in the middle of the night or excessive movement of nocturnal animals during the day. 

TODO: consequences

Temporal histograms of record counts are helpful to detect gaps or double counting when daylight saving goes into and out of effect. Temporal heatmaps of movement properties, such as speed, can help recognize time zone issues.
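
The daylight saving effect on hourly record counts can be illustrated with synthetic data. The following minimal sketch assumes one record per hour in UTC and converts it to local time. The time zone `Europe/Copenhagen` and the helper name `local_hour_counts` are assumptions made for illustration only. The two dates are the days on which daylight saving time started and ended in the EU in 2018.

```python
import pandas as pd

TZ = 'Europe/Copenhagen'   # assumed local time zone, for illustration only

def local_hour_counts(utc_start, local_day, tz=TZ, hours=48):
    """Count hourly UTC records per local hour of day on the given local date."""
    offsets = pd.to_timedelta(list(range(hours)), unit='h')
    utc = pd.Timestamp(utc_start, tz='UTC') + offsets
    local = utc.tz_convert(tz)
    on_day = local[local.strftime('%Y-%m-%d') == local_day]
    return pd.Series(on_day.hour).value_counts().sort_index()

# Daylight saving starts: one local hour is skipped
spring = local_hour_counts('2018-03-24 12:00', '2018-03-25')
# Daylight saving ends: one local hour occurs twice
autumn = local_hour_counts('2018-10-27 12:00', '2018-10-28')

print('hours without records in spring:', sorted(set(range(24)) - set(spring.index)))
print('hours with double counts in autumn:', autumn[autumn > 1].index.tolist())
```

A gap in spring and a double count in autumn at the same hour of day indicate that timestamps are stored in local time with daylight saving. If neither appears, the timestamps are more likely in UTC or in a fixed-offset time zone.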

In [None]:
tmp = segment_df[segment_df['speed_m/s']>1]
hourly = tmp['id'].groupby([tmp.index.hour, pd.Grouper(freq='d')]).count().to_frame(name='n')
hourly.rename_axis(['hour', 'day'], inplace=True)
hourly.hvplot.heatmap(title='Count of records with speed > 1m/s', x='hour', y='day', C='n', width=FIGSIZE[0])

#### Out-of-sequence positions

This step addresses the question of whether records belonging to a trajectory appear out of sequence. A closely related problem is when a mover appears at two different locations at the same time. These problems can happen in systems that do not provide tracker timestamps and instead use receiver or storage time. For example, the Automatic Identification System (AIS) protocol does not transmit tracker timestamps and instead provides only offsets (in seconds) from the previously transmitted message, which is insufficient to establish temporal order "since positional updates from a single vessel may come from a series of base stations (those within range of its antenna along the route)" (Patroumpas et al. 2017). 

TODO: consequences

Scatterplots of acceleration versus direction change are helpful to distinguish out-of-sequence problems from large jumps caused by location errors, since out-of-sequence records result in sudden reversals of the movement direction. 
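
When records are available in their arrival (storage) order, both problems can also be checked directly on the timestamps. The following minimal sketch assumes a record table with `id`, `t`, `lon`, and `lat` columns. The toy data and the helper name `flag_sequence_problems` are hypothetical.

```python
import pandas as pd

# Toy records in arrival order (hypothetical values)
toy_df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2],
    't': pd.to_datetime([
        '2018-01-01 10:00:00', '2018-01-01 10:00:20', '2018-01-01 10:00:10',
        '2018-01-01 10:00:30',
        '2018-01-01 10:00:00', '2018-01-01 10:00:00', '2018-01-01 10:00:10',
    ]),
    'lon': [10.00, 10.02, 10.01, 10.03, 11.00, 11.50, 11.01],
    'lat': [55.00, 55.00, 55.00, 55.00, 56.00, 56.00, 56.00],
})

def flag_sequence_problems(df):
    """Flag records that go back in time or share a timestamp at different places."""
    out = df.copy()
    # A negative time difference to the previous record of the same mover
    out['out_of_sequence'] = out.groupby('id')['t'].diff() < pd.Timedelta(0)
    # Same mover and timestamp, but not the same coordinates
    same_time = out.duplicated(subset=['id', 't'], keep=False)
    same_place = out.duplicated(subset=['id', 't', 'lon', 'lat'], keep=False)
    out['conflicting_position'] = same_time & ~same_place
    return out

flagged = flag_sequence_problems(toy_df)
print(flagged[flagged['out_of_sequence'] | flagged['conflicting_position']])
```

This check only works before the records are sorted by time. After sorting, out-of-sequence records can only be recognised indirectly, for example through the direction reversals described above.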

In [None]:
traj_df.hvplot.scatter(
    title='Direction difference (median) over absolute acceleration (median)', alpha=0.3,
    x='diff_dir_median', y='acceleration_abs_median', hover_cols=['id','time'], #datashade=True,
    frame_width=FIGSIZE[1], frame_height=250#, ylim=(-10,100)
) + traj_df.sort_values(by='diff_dir_median', ascending=False)[:10][['diff_dir_median','diff_dir_std']].hvplot.table(
    title='Top 10 trajectories - direction difference', frame_width=FIGSIZE[1])

In [None]:
plot_single_mover(segment_df, 308322000, date(2017,7,1))

In [None]:
plot_single_mover(segment_df, 265615040, date(2017,7,1))

# Appendix -- Experiments

In [None]:
tmp = 

In [None]:

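from math import atan2, cos, radians, sin, sqrt

from shapely.geometry import LineString

def compute_distance(row):
    """Haversine distance between the previous and the current position of a record.

    R_EARTH (the Earth radius) is expected to be defined earlier in the notebook;
    the result is in the same unit as R_EARTH.
    """
    lon1 = row['prev_lon']
    lat1 = row['prev_lat']
    lon2 = row['lon']
    lat2 = row['lat']
    delta_lat = radians(lat2 - lat1)
    delta_lon = radians(lon2 - lon1)
    a = sin(delta_lat/2) * sin(delta_lat/2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(delta_lon/2) * sin(delta_lon/2)
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    dist = R_EARTH * c
    return dist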

def connect_pts(row):
    lon1 = row['prev_lon']
    lat1 = row['prev_lat']
    lon2 = row['lon']
    lat2 = row['lat']
    return LineString([(lon1, lat1), (lon2, lat2)])

def find_gaps(df, min_dist, max_dist):
    if len(df)<2:
        return None
    i = df.copy()
    i = i.assign(prev_lon=i.lon.shift())
    i = i.assign(prev_lat=i.lat.shift())
    i = i.assign(dist=i.apply(compute_distance, axis=1))
    i = i[(i.dist>min_dist) & (i.dist<max_dist)]
    if len(i)==0: 
        return None
    i = i.assign(geometry=i.apply(connect_pts, axis=1))
    return i

def make_gap_gdf(df, min_dist, max_dist):
    # Collect the gap segments of each mover, skipping movers without gaps
    gaps = []
    for the_id in df.id.unique():
        gaps_df = find_gaps(df[df.id==the_id], min_dist, max_dist)
        if gaps_df is not None:
            gaps.append(gaps_df)
    if gaps:
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        return gpd.GeoDataFrame(pd.concat(gaps), geometry='geometry')

def plot_gaps(df, min_dist, max_dist, width=FIGSIZE[0], height=FIGSIZE[1]):
    gaps_gdf = make_gap_gdf(df, min_dist, max_dist)
    if gaps_gdf is not None:
        plot = gaps_gdf.hvplot(geo=True, color=COLOR_HIGHLIGHT, frame_width=width, frame_height=height)
        return tiles.OSM() * plot   

In [None]:
sample_df = df[(df['id']==304752000) | (df['id']==257024000)]
sample_df.head()

In [None]:
# Use a separate loop variable so that the global df is not overwritten
for name, group in sample_df.groupby('id'):
    print(f'{name}: {len(group)}')

In [None]:
grouped = [df[['id','x','y']] for name, df in sample_df.groupby(['id', pd.Grouper(freq='d')])]
path = hv.Path(grouped, kdims=['x','y'], vdims=['id']).opts(line_width=2, width=600, color=COLOR)
plot = datashade(path).opts(frame_height=FIGSIZE[1], frame_width=FIGSIZE[0])
BG_TILES * plot

In [None]:
grouped = [df[['x','y']] for name, df in cropped_df.groupby(['id', pd.Grouper(freq='d')]) if len(df)>100]
path = hv.Path(grouped, kdims=['x','y'])
plot = datashade(path).opts(frame_height=FIGSIZE[1], frame_width=FIGSIZE[0])
BG_TILES * plot

In [None]:
tmp = cropped_df

a = None
for the_id in tmp.id.unique():
    i = tmp[tmp.id==the_id].copy()
    i = i.assign(prev_lon=i.lon.shift())
    i = i.assign(prev_lat=i.lat.shift())
    i = i.assign(dist=i.apply(compute_distance, axis=1))    
    #i = find_gaps(i, 1000, 10000000)
    plot = hv.Path(i, kdims=['x','y'], vdims=['id', 'dist']).opts(color='dist', line_width=4)
    plot = datashade(plot, normalization='linear', aggregator=ds.by('id', ds.min("dist")))
    #plot = datashade(plot, normalization='linear')
    if a is None: a = plot
    else: a = a * plot
tiles.OSM() * a


In [None]:
a * sample_df.hvplot.scatter(x='x', y='y', datashade=True, by='id', frame_width=FIGSIZE[0], frame_height=FIGSIZE[1])

In [None]:
datashade(hv.Path(sample_df, kdims=['x','y']), normalization='linear', aggregator=ds.any())

In [None]:
tmp = sample_df[['id','x','y']]
hv.Path(tmp[tmp.id==304752000], kdims=['x','y'])

In [None]:
xxx = compute_segment_info(cropped_df)
xxx[xxx.is_gap]

In [None]:
def reset_values_at_daybreaks(tmp, columns):
    tmp['ix'] = tmp.index
    tmp['zero'] = 0
    ix_first = tmp.groupby(['id', pd.Grouper(freq='d')]).first()['ix']
    for col in columns:
        tmp[col] = tmp['zero'].where(tmp['ix'].isin(ix_first), tmp[col])
    tmp = tmp.drop(columns=['zero', 'ix'])
    return tmp

tmp = segment_df[segment_df.id==5322].copy()
tmp['acceleration_abs'] = np.abs(tmp['acceleration'])
tmp['diff_speed_abs'] = np.abs(tmp['diff_speed'])

tmp = reset_values_at_daybreaks(tmp, ['diff_t_s','dist_prev_m','diff_speed_abs','acceleration_abs'])

traj_df = tmp.groupby(['id', pd.Grouper(freq='d')]) \
    .agg({'diff_t_s':['median', 'sum'], 
          'speed_m/s':['median','var'],
          'diff_dir':['median'], 
          'dist_prev_m':['median', 'sum'], 
          'diff_speed_abs':['max'], 
          'acceleration_abs':['median','max','mean','var'], 
          't':['min','count'],
         'shiptype':lambda x:x.value_counts().index[0]}) 

traj_df.columns = ["_".join(x) for x in traj_df.columns.to_flat_index()]
traj_df = traj_df.rename(columns={'t_count':'n', 'shiptype_<lambda>':'shiptype', 
                                  'diff_t_s_sum':'duration_s', 'dist_prev_m_sum':'length_m'})
traj_df['length_km'] = traj_df['length_m'] / 1000
traj_df['duration_h'] = traj_df['duration_s'] / 3600
traj_df['t_min_h'] = traj_df['t_min'].dt.hour + traj_df['t_min'].dt.minute / 60

traj_df = traj_df[traj_df.n>=100]
traj_df


In [None]:
ix_first