<div class="frontmatter text-center">
<h1>Geospatial Data Science</h1>
<h2>Lecture 11: Individual mobility</h2>
<h3>IT University of Copenhagen, Spring 2022</h3>
<h3>Instructor: Michael Szell</h3>
</div>

# Source
This notebook was adapted from:
* scikit-mobility tutorials: https://github.com/scikit-mobility/tutorials/tree/master/mda_masterbd2020

## What is scikit-mobility?

a library to analyze <font color="blue">*mobility data*</font>, suited for working with:

- **trajectories** composed of lat/long points (e.g., GPS data)
- **fluxes** of movements between places (e.g., OD matrix)


In [None]:
# import the library
import skmob
import warnings
import geopandas as gpd
import pandas as pd
from skmob.tessellation import tilers
from skmob.utils import plot
import matplotlib.pyplot as plt
from tqdm import tqdm
from stats_utils import *

warnings.filterwarnings('ignore')
tess_style = {'color':'gray', 'fillColor':'gray', 'opacity':0.2}

scikit-mobility provides two user-friendly data structures that extend the *pandas* `DataFrame`:

- `TrajDataFrame` - for spatio-temporal <font color="blue">**trajectories**</font>
- `FlowDataFrame` - for <font color="blue">**fluxes**</font> mapped into a tessellation


### What can you do with scikit-mobility?

- **Preprocessing** of mobility data
- **Measuring** individual and collective behaviours
- <font color="grey">**Assessing** privacy risk</font>
- <font color="grey">**Predicting** migration flows</font>
- <font color="grey">**Generating** synthetic trajectories</font>
    

## `TrajDataFrame`


Each row describes a trajectory's point and contains the following columns:

- `lat` - latitude of the point
- `lng` - longitude of the point
- `datetime` - date and time of the point

For multi-user data sets, there are two *optional* columns:

- `uid` - identifier of the user to whom the trajectory belongs
- `tid` - identifier for the trajectory

A `TrajDataFrame` can be created from:

- a python list or *numpy* array
- a python dictionary
- a *pandas* `DataFrame`
- a text file

### From a `list`

In [None]:
# From a list
data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
             [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
             [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
             [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
data_list

A list has no column names, so we must pass the positional indexes of the mandatory columns using the arguments `latitude`, `longitude` and `datetime`.

In [None]:
tdf = skmob.TrajDataFrame(data_list, 
                          latitude=1, longitude=2, 
                          datetime=3)
print(type(tdf))
tdf

### From a `DataFrame`

In [None]:
# import the pandas library
import pandas as pd 
# build a dataframe from the 2D list
data_df = pd.DataFrame(data_list, 
                       columns=['user', 'latitude', 'lng', 'hour']) 

In [None]:
print(type(data_df)) # type of the structure
data_df.head() # head of the DataFrame

Note that: 
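- the column names in `data_df` don't match the required names (`lat`, `lng`, `datetime`)
- you must specify which columns hold the mandatory fields using the arguments `latitude`, `longitude` and `datetime`; a column that already has the required name (here `lng`) needs no argument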

In [None]:
# Create a TrajDataFrame from a DataFrame
tdf = skmob.TrajDataFrame(data_df, 
                          latitude='latitude', 
                          datetime='hour', 
                          user_id='user')

print(type(tdf))
tdf.head()

### From a text file

Class `TrajDataFrame` has a method `from_file` to construct the object from an input text file.

Let's try with a subsample of the <font color="blue">**GeoLife**</font> trajectories. The whole dataset can be found [here](https://www.microsoft.com/en-us/download/details.aspx?id=52367).

In [None]:
# create a TrajDataFrame from a dataset of trajectories 
tdf = skmob.TrajDataFrame.from_file(
    'files/geolife_sample.txt.gz', sep=',')
print(type(tdf))

In [None]:
# explore the TrajDataFrame
tdf.head(5)

### Attributes of a `TrajDataFrame`


- `crs`: the coordinate reference system. Default: `epsg:4326` (lat/long)
- `parameters`: a dictionary for storing any additional properties you need

In [None]:
tdf.crs

In [None]:
tdf.parameters

In [None]:
# add your own parameter
tdf.parameters['compress'] = {'thre': 10}
tdf.parameters

The columns of a `TrajDataFrame` have specific types. Compare the `dtypes` of the plain `DataFrame` with those of the `TrajDataFrame`:

In [None]:
# In the DataFrame
print(type(data_df))
data_df.dtypes

In [None]:
print(type(tdf)) # In the TrajDataFrame
tdf.dtypes

In [None]:
tdf.lat.head()

### Write and read 

scikit-mobility provides dedicated functions to write a `TrajDataFrame` to a file and to read it back.

#### Writing a `TrajDataFrame` to a file

- includes the `parameters` and `crs` attributes
- preserves `dtype` of columns with timestamps (time zone info is lost though).

In [None]:
skmob.write(tdf, './tdf.json')

In [None]:
tdf.parameters

#### Reading a `TrajDataFrame` from a JSON file

In [None]:
# read the file written before
tdf2 = skmob.read('./tdf.json') 
tdf2[:4]

`dtype`s and the `parameters` and `crs` attributes are preserved

In [None]:
print(tdf2.dtypes)
tdf2.parameters

### Plotting trajectories and flows

*scikit-mobility* relies on the *folium* library to plot:
- trajectories
- flows
- tessellations

In [None]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, 
                    tiles='Stamen Toner', start_end_markers=True)

## `FlowDataFrame`

Each row describes a flow and contains the columns:

- `origin`: ID of the origin tile
- `destination`: ID of the destination tile
- `flow`: number of people travelling from `origin` to `destination`
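
To make the three columns concrete, here is a minimal plain-Python sketch of how individual trips aggregate into flows. The trip list and the tile IDs are made up for illustration, and this is not how scikit-mobility builds a `FlowDataFrame` internally.

```python
from collections import Counter

# Hypothetical trips: (origin tile ID, destination tile ID), one per traveller.
trips = [
    ("36001", "36061"),
    ("36001", "36061"),
    ("36061", "36001"),
    ("36001", "36001"),
]

# Count how many trips share the same origin-destination pair.
flow_counts = Counter(trips)

# One row per (origin, destination) pair, mirroring the three columns above.
flows = [
    {"origin": o, "destination": d, "flow": n}
    for (o, d), n in sorted(flow_counts.items())
]

for row in flows:
    print(row)
```

Each dictionary corresponds to one row of an origin-destination table; trips that start and end in the same tile simply have `origin == destination`.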

<!-- NOTE: `FlowDataFrame` is a dataframe way of having Origin-Destination Matrix. -->

### Tessellation
Each `FlowDataFrame` is associated  with a <font color="blue">**tessellation**</font>, i.e., a `GeoDataFrame` that  contains two columns:
- `tile_ID`, identifier of a location
- `geometry`, geometric shape of the location

### Creating a `FlowDataFrame`

The `FlowDataFrame` can be created from:

- a python list or a numpy array
- a *pandas* `DataFrame`
- a python dictionary
- a text file


### From a file

The `from_file` method creates a `FlowDataFrame` from a text file with the format:
    
- `origin`, `destination`, `flow`, `datetime` (optional)


In [None]:
tessellation = gpd.GeoDataFrame.from_file(
    "files/NY_counties_2011.geojson") # load a tessellation

# create a FlowDataFrame from a file and a tessellation
fdf = skmob.FlowDataFrame.from_file(
    "files/NY_commuting_flows_2011.csv",
    tessellation=tessellation, tile_id='tile_id', sep=",")

In [None]:
fdf.head()

In [None]:
fdf.dtypes

In [None]:
# The tessellation is an attribute of the FlowDataFrame
fdf.tessellation.head() 

### Plot the tessellation

In [None]:
fdf.plot_tessellation(popup_features=['tile_ID', 'population']) 

### Plot the flows

In [None]:
fdf.plot_flows(flow_color='green')

### Plot tessellation and flows

In [None]:
map_f = fdf.plot_tessellation(style_func_args=tess_style)
fdf[fdf['origin'] == '36061'].plot_flows(map_f=map_f, flow_exp=0., flow_popup=True)

# Mobility measures
- We load data of *checkins* made by users on **Brightkite**
- Brightkite is a location-based social network (LBSN)
- The dataset is freely available at the SNAP website: https://snap.stanford.edu/data/loc-brightkite.html

In [None]:
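# load the dataset using pandas
# header=None assumes the file has no header row, as in the SNAP distribution;
# header=0 together with names= would silently discard the first record
df = pd.read_csv("files/loc-brightkite_totalCheckins.txt.gz", sep='\t', header=None, nrows=500000, 
                 names=['user', 'check-in_time', "latitude", "longitude", 
                        "location id"])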

# convert the pandas DataFrame into an skmob TrajDataFrame
tdf = skmob.TrajDataFrame(df, latitude='latitude', 
            longitude='longitude', datetime='check-in_time', user_id='user')
print(tdf.shape)
tdf.head()

In [None]:
print("number of users:\t", len(tdf.uid.unique()))
print("number of records:\t", len(tdf))

# Individual measures

- computed on the trajectories of a <u>single individual</u>
- quantify standard *mobility patterns*
- examples: 
    - radius of gyration
    - jump lengths
    - max distance
    - individual mobility network

## Radius of gyration $r_g$
characteristic distance traveled by an individual:

$$r_g = \sqrt{\frac{1}{N} \sum_{i=1}^N (\mathbf{r}_i - \mathbf{r}_{cm})^2}$$

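Here $N$ is the number of points recorded for the individual, $\mathbf{r}_i$ is the position vector of the $i$-th point, and $\mathbf{r}_{cm}$ is the position vector of the *center of mass* of the set of locations visited by the individual.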
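
The formula can be sketched in plain Python as follows. This is an illustration under stated assumptions, not scikit-mobility's implementation: the center of mass is approximated by the mean latitude and mean longitude, distances are haversine great-circle distances, and the helper names `haversine_km` and `radius_of_gyration_km` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radius_of_gyration_km(points):
    """points: list of (lat, lng) tuples visited by one individual."""
    n = len(points)
    # Centre of mass, approximated as the mean latitude and mean longitude.
    lat_cm = sum(lat for lat, _ in points) / n
    lng_cm = sum(lng for _, lng in points) / n
    # Root mean square distance of the points from the centre of mass.
    squared = [haversine_km(lat, lng, lat_cm, lng_cm) ** 2
               for lat, lng in points]
    return math.sqrt(sum(squared) / n)


# The four GeoLife points from the list example earlier in this notebook.
points = [(39.984094, 116.319236), (39.984198, 116.319322),
          (39.984224, 116.319402), (39.984211, 116.319389)]
rg = radius_of_gyration_km(points)
print(f"r_g = {rg:.4f} km")
```

A user who never moves has $r_g = 0$; the further and more evenly the points spread around the center of mass, the larger $r_g$ becomes.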

In [None]:
from skmob.measures.individual import radius_of_gyration

In [None]:
rg_df = radius_of_gyration(tdf)

In [None]:
# let's plot the distribution of the radius of gyration
fig = plt.figure(figsize=(4, 4))
rg_list = list(rg_df.radius_of_gyration[rg_df.radius_of_gyration >= 1.0])
x, y = zip(*lbpdf(1.5, rg_list))
plt.plot(x, y, marker='o')
plt.xlabel('$r_g$ [km]', fontsize=20);plt.ylabel('P($r_g$)', fontsize=20)
plt.grid(alpha=0.2);
plt.loglog();
plt.show()
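
The helper `lbpdf` used in the plotting cell above comes from the local `stats_utils` module, whose source is not shown in this notebook. Assuming it returns a log-binned probability density as `(x, y)` pairs, with the first argument being the ratio between consecutive bin edges, a stand-in could look like the sketch below. The name `log_binned_pdf` and the binning details are assumptions, not the actual `stats_utils` code.

```python
import math


def log_binned_pdf(ratio, values):
    """Return (bin centre, density) pairs using bins whose width grows by `ratio`.

    Hypothetical stand-in for `lbpdf`; assumes every value is >= 1.
    """
    counts = {}
    for v in values:
        # Bin k covers the interval [ratio**k, ratio**(k + 1)).
        k = int(math.floor(math.log(v, ratio)))
        counts[k] = counts.get(k, 0) + 1

    total = len(values)
    pdf = []
    for k in sorted(counts):
        left, right = ratio ** k, ratio ** (k + 1)
        centre = math.sqrt(left * right)  # geometric mean suits a log axis
        # Divide by the bin width so that wide bins are not over-weighted.
        density = counts[k] / (total * (right - left))
        pdf.append((centre, density))
    return pdf


pdf = log_binned_pdf(2.0, [1, 1.5, 2.5, 3, 5])
for centre, density in pdf:
    print(f"{centre:.3f}\t{density:.3f}")
```

Logarithmic bins are the usual choice for heavy-tailed quantities such as $r_g$, because equal-width bins leave the tail almost empty and noisy on a log-log plot.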

## Jump lengths
- a jump length is the distance between two consecutive visits of an individual
- given a `TrajDataFrame`, skmob computes the lengths for each individual independently
- use the `jump_lengths` function
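
As a rough illustration of what is being measured, the sketch below computes jump lengths per user in plain Python. The records are made up, the haversine helper and the variable names are hypothetical, and scikit-mobility's own implementation may differ in detail.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical records: (uid, lat, lng, timestamp), deliberately unsorted.
records = [
    (1, 39.984094, 116.319236, "2008-10-23 13:53:05"),
    (1, 39.984224, 116.319402, "2008-10-23 13:53:11"),
    (2, 0.0, 0.0, "2008-10-23 10:00:00"),
    (1, 39.984198, 116.319322, "2008-10-23 13:53:06"),
    (2, 0.0, 1.0, "2008-10-23 11:00:00"),
]

# Group the points by user so each individual is handled independently.
by_user = defaultdict(list)
for uid, lat, lng, ts in records:
    by_user[uid].append((ts, lat, lng))

jump_lengths_km = {}
for uid, pts in by_user.items():
    pts.sort()  # fixed-format timestamps sort chronologically as strings
    # Distance between each point and the next one in time.
    jump_lengths_km[uid] = [
        haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(pts, pts[1:])
    ]

print(jump_lengths_km)
```

A user with $n$ points yields $n - 1$ jump lengths, which is why sorting by time within each user matters before taking consecutive pairs.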

In [None]:
from skmob.measures.individual import jump_lengths

In [None]:
jl_df = jump_lengths(tdf) # disable progress bar with show_progress=False
jl_df.head(4)

In [None]:
# merge=True puts the distances of all individuals into a single list
jl_list = jump_lengths(tdf, merge=True)
type(jl_list)

In [None]:
# let's plot the distribution of jump lengths
fig = plt.figure(figsize=(4, 4))
d_list = [dist for dist in jl_list[:10000] if dist >= 1]
x, y = zip(*lbpdf(1.5, d_list))
plt.plot(x, y, marker='o')
plt.xlabel('jump length [km]', fontsize=15);plt.ylabel('P(jump length)', fontsize=15)
plt.grid(alpha=0.2);plt.loglog();plt.show()

### Distances

- maximum distance traveled by each individual `maximum_distance`


In [None]:
from skmob.measures.individual import max_distance_from_home, distance_straight_line, maximum_distance

In [None]:
md_df = maximum_distance(tdf)
md_df.head()

In [None]:
# let's plot the distribution
fig, ax1 = plt.subplots(1, 1)
ax1.hist(md_df.maximum_distance, bins=50, rwidth=0.8)
ax1.set_xlabel('maximum distance [km]', fontsize=15)


## Individual mobility network
a network where: 
- nodes represent locations visited by the individual
- directed edges represent trips between the locations made by the individual 
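
The idea can be sketched without any library: consecutive visits form directed edges, and repeated trips along the same edge are counted. The location labels below are made up for illustration; scikit-mobility identifies locations by their coordinates instead.

```python
from collections import Counter

# Hypothetical time-ordered sequence of locations visited by one individual.
visits = ["home", "work", "gym", "home", "work", "home"]

# Each pair of consecutive visits is one trip along a directed edge.
edge_trips = Counter(zip(visits, visits[1:]))

# The nodes are the distinct locations.
nodes = set(visits)

for (src, dst), n_trips in edge_trips.most_common():
    print(f"{src} -> {dst}: {n_trips} trip(s)")
```

Because the edges are directed, `home -> work` and `work -> home` are counted separately.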

In [None]:
from skmob.measures.individual import individual_mobility_network

In [None]:
imn_df = individual_mobility_network(tdf)
imn_df.head()

In [None]:
an_imn = imn_df[imn_df.uid == 2]
an_imn.sort_values(by='n_trips', ascending=False).head(5)

# Collective measures

- are computed on the trajectories of a <u>population of individuals</u>
- quantify standard *mobility patterns*
- examples: 
    - visits per time unit
    - origin destination matrix

## Visits per location

number of visits to a location made by the population of individuals
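
In plain Python the measure amounts to counting records per coordinate pair, regardless of which user made the visit. The check-ins below are made up, and treating a location as an exact `(lat, lng)` pair is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical check-ins from several users: (uid, lat, lng).
checkins = [
    (1, 39.984094, 116.319236),
    (2, 39.984094, 116.319236),
    (2, 40.000000, 116.300000),
    (3, 39.984094, 116.319236),
    (3, 40.000000, 116.300000),
    (3, 41.500000, 115.000000),
]

# Count visits per location, ignoring who made them.
visits = Counter((lat, lng) for _, lat, lng in checkins)

# Locations ranked from most to least visited.
ranked = visits.most_common()
for (lat, lng), n_visits in ranked:
    print(lat, lng, n_visits)
```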

In [None]:
from skmob.measures.collective import visits_per_location

In [None]:
vpl_df = visits_per_location(tdf)
vpl_df.head()

In [None]:
fig = plt.figure(figsize=(4, 4))
x, y = zip(*lbpdf(1.5, list(vpl_df.n_visits)))
plt.plot(x, y, marker='o')
plt.xlabel('visits per location', fontsize=15)
plt.loglog() 
plt.show()

### Many other measures can be computed with scikit-mobility.
#### See the documentation: https://scikit-mobility.github.io/scikit-mobility/reference/measures.html