Author: Luca Pappalardo, Giovanni Mauro

Geospatial Analytics, Master degree in Data Science and Business Informatics, University of Pisa

# Geospatial Analytics - Lesson 2: Fundamental Concepts

In this lesson, we will learn how to handle spatial data in Python using Shapely, Geopandas, and scikit-mobility.

1. [Shapely](#shapely)
2. [Geopandas](#geopandas)
3. [scikit-mobility](#scikitmobility)

<a id='shapely'></a>
# Shapely

In [**Shapely**](https://shapely.readthedocs.io/en/stable/manual.html), the most fundamental geometric objects are `Point`, `LineString` and `Polygon`, the basic ingredients when working with spatial data in vector format.

Basic knowledge of using Shapely is fundamental for understanding how geometries are stored and handled in `GeoPandas` and `scikit-mobility`.

Geometric objects consist of coordinate tuples where:

- `Point` represents a single point in space. Points can be either two-dimensional $(x, y)$ or three-dimensional $(x, y, z)$

- `LineString` (i.e., a line) represents a sequence of points joined together to form a line. Hence, a line consists of a list of at least two coordinate tuples

- `Polygon` represents a filled area that consists of a list of at least three coordinate tuples that form the exterior ring and an optional list of hole polygons

It is also possible to have a collection of geometric objects (e.g., `Polygon`s with multiple parts):

- `MultiPoint` object: represents a collection of `Point`s and consists of a list of coordinate-tuples

- `MultiLineString` object: represents a collection of `LineString`s and consists of a list of line-like sequences

- `MultiPolygon` object: represents a collection of `Polygon`s and consists of a list of polygon-like sequences, each built from an exterior ring and an optional list of holes

In [1]:
# Import necessary geometric objects from shapely module
from shapely.geometry import Point, LineString, Polygon

### Point

In [2]:
# Create Point geometric object(s) with coordinates
point1 = Point(2.2, 4.2)
point2 = Point(7.2, -25.1)
point3 = Point(9.26, -2.456)
point3D = Point(9.26, -2.456, 0.57)

In [None]:
point1

In [None]:
print(point1)
point1.geom_type

In [None]:
print(list(point1.coords))
print(point1.x)
print(point1.y)

In [None]:
# Calculate the distance between point1 and point2
dist = point1.distance(point2)

# Print out a nicely formatted info message
print(f"Distance between the points is {round(dist, 2)} units")
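Note that Shapely's `distance` is a planar (Cartesian) distance: the coordinates are treated as plain $x$/$y$ numbers, not as longitude and latitude. As a sanity check, the following plain-Python sketch recomputes the value for the two points above using only the standard library. The helper name `euclidean_distance` is ours and is not part of Shapely.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two (x, y) tuples."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

# Same coordinates as point1 and point2 above
dist_check = euclidean_distance((2.2, 4.2), (7.2, -25.1))
print(round(dist_check, 2))
```

The result is expressed in the same units as the input coordinates, which is why the message above says "units" rather than meters.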

### Line

In [None]:
# Create a LineString from our Point objects
line = LineString([point1, point2, point3])

# It is also possible to produce the same outcome using coordinate tuples
line2 = LineString([(2.2, 4.2), (7.2, -25.1), (9.26, -2.456)])

# Check if lines are identical
line == line2 

In [None]:
line

In [None]:
print(list(line.coords))
print(line.length)
print(line.centroid)
print(line.xy)
print(line.xy[0]) 
print(line.xy[1])

### Polygon

In [None]:
# Create a Polygon from the coordinates
poly = Polygon([(2.2, 4.2), (7.2, -25.1), (9.26, -2.456)])

poly

In [None]:
poly.area
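To see where this number comes from, the area of a simple polygon without holes can be computed from its exterior ring with the shoelace formula. The sketch below is plain Python and is not how Shapely computes areas internally (Shapely delegates to the GEOS library); the function name `shoelace_area` is ours.

```python
def shoelace_area(coords):
    """Area of a simple polygon given its exterior ring as (x, y) tuples."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Same coordinates as the triangle `poly` above
triangle = [(2.2, 4.2), (7.2, -25.1), (9.26, -2.456)]
area_check = shoelace_area(triangle)
print(round(area_check, 3))
```

As with distances, the area is in squared coordinate units, so it is only meaningful in square meters when the coordinates are in a metric projection.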

In [None]:
# Create a Polygon based on information from the Shapely points
poly2 = Polygon([[p.x, p.y] for p in [point1, point2, point3]])
poly2

In [None]:
poly == poly2

In [15]:
# Define the outer border
border = [(-180, 90), (-180, -90), (180, -90), (180, 90)]

In [None]:
# Outer polygon
world = Polygon(shell=border)
print(world)

In [None]:
world

### Polygon attributes and functions
We can again access different attributes directly from the `Polygon` object itself that can be really useful for many analyses, such as `area`, `centroid`, bounding box (`bounds`), `exterior`, and exterior-length (`exterior.length`). 

Here, we can see a few of the available attributes and how to access them:

In [None]:
# Print the outputs
print(f"Polygon centroid: {world.centroid}")
print(f"Polygon Area: {world.area}")
print(f"Polygon Bounding Box: {world.bounds}")
print(f"Polygon Exterior: {world.exterior}")
print(f"Polygon Exterior Length: {world.exterior.length}")

<a id="geometrycollections"></a>
## Geometry collections
On some occasions it is useful to store multiple geometries (e.g., several points or several polygons) in a single feature. For example, when a country is composed of several islands, the polygons share the same country-level attributes, and it is reasonable to store that country as a geometry collection that contains all the polygons. The attribute table would then contain one row of country-level attributes, and the geometry related to those attributes would represent several polygons.

In Shapely, collections of `Point`s are implemented by using a `MultiPoint` object, collections of `LineString`s by using a `MultiLineString` object, and collections of `Polygon`s by a `MultiPolygon` object.

In [19]:
from shapely.geometry import Point, LineString, Polygon
from shapely.geometry import MultiPoint, MultiLineString, MultiPolygon

In [None]:
point1, point2, point3 = (2.2, 4.2), (7.2, -25.1), (9.26, -2.456)
    
# Create a MultiPoint object of our points 1,2 and 3
multi_point = MultiPoint([point1, point2, point3])

# It is also possible to pass coordinate tuples inside
multi_point2 = MultiPoint([(2.2, 4.2), (7.2, -25.1), (9.26, -2.456)])

# We can also create a MultiLineString with two lines
line1 = LineString([point1, point2])
line2 = LineString([point2, point3])
multi_line = MultiLineString([line1, line2])

# Print object definitions
print(multi_point)
print(multi_line)

In [None]:
multi_point

In [None]:
multi_line

`MultiPolygon`s are constructed in a similar manner. Let’s create a bounding box for “the world” by combining two separate polygons that represent the western and eastern hemispheres.

In [None]:
# Let's create the exterior of the western part of the world
west_exterior = [(-180, 90), (-180, -90), (0, -90), (0, 90)]

# Let's create a hole --> remember there can be multiple holes, thus we need to have a list of hole(s). 
# Here we have just one.
west_hole = [[(-170, 80), (-170, -80), (-10, -80), (-10, 80)]]

# Create the Polygon
west_poly = Polygon(shell=west_exterior, holes=west_hole)

# Print object definition
print(west_poly)

In [None]:
west_poly

Shapely also has a tool for creating a bounding box based on minimum and maximum $x$ and $y$ coordinates. Instead of using the `Polygon` constructor, let’s use the box constructor for creating the polygon:

In [25]:
from shapely.geometry import box

In [None]:
# Specify the bbox extent (lower-left corner coordinates and upper-right corner coordinates)
min_x, min_y = 0, -90
max_x, max_y = 180, 90

# Create the polygon using Shapely
east_poly = box(minx=min_x, miny=min_y, maxx=max_x, maxy=max_y)

# Print object definition
print(east_poly)

In [None]:
east_poly

Finally, we can combine the two polygons into a `MultiPolygon`:

In [None]:
# Let's create our MultiPolygon. We can pass multiple Polygon -objects into our MultiPolygon as a list
multi_poly = MultiPolygon([west_poly, east_poly])

# Print object definition
print(multi_poly)

In [None]:
multi_poly

We can check if we have a "valid" `MultiPolygon`, i.e., if the individual polygons do not intersect with each other. Here, because the two polygons share a common edge along the 0-meridian, we should NOT have a valid `MultiPolygon`.

We can check the validity of an object from the `is_valid` attribute that tells if the polygons or lines intersect with each other. This can be really useful information when trying to find topological errors from your data:

In [None]:
print(f"Is polygon valid? {multi_poly.is_valid}")

<a id="geopandas"></a>
# Geopandas

[**Geopandas**](http://geopandas.org/) makes it possible to work with geospatial data in Python in a relatively easy way. Geopandas combines the capabilities of the data analysis library pandas with other packages like Shapely and fiona for managing spatial data.

The main data structures in geopandas are `GeoSeries` and `GeoDataFrame` which extend the capabilities of `Series` and `DataFrame`s from pandas. This means that we can use all our pandas skills also when working with geopandas!

The main difference between `GeoDataFrame`s and pandas `DataFrame`s is that a `GeoDataFrame` should contain one column for geometries. By default, the name of this column is `'geometry'`. The geometry column is a `GeoSeries` which contains the geometries (`Point`, `LineString`, `Polygon`) as shapely objects.

![](https://autogis-site.readthedocs.io/en/latest/_images/geodataframe.png)

In [None]:
import geopandas as gpd

<a id="readingshapefile"></a>
## Reading a Shapefile

In [None]:
fp = "data/L2_data/NLS/2018/L4/L41/L4132R.shp/m_L4132R_p.shp"
# Read file using gpd.read_file()
data = gpd.read_file(fp)
data.head()

In [None]:
type(data)

As you might guess, the column names are in Finnish. Let’s select only the useful columns and rename them into English:

In [None]:
data = data[['RYHMA', 'LUOKKA',  'geometry']]
colnames = {'RYHMA':'GROUP', 'LUOKKA':'CLASS'}
# Reassign instead of renaming in place, since `data` is a column subset
data = data.rename(columns=colnames)

data.head()

In [None]:
data.plot()

Here we see that our data variable is a `GeoDataFrame`. `GeoDataFrame` extends the functionality of `pandas.DataFrame` so that spatial data can be handled with the same approaches and data structures as in pandas (hence the name geopandas).

It is always a good idea to explore your data on a map as well. Creating a simple map from a `GeoDataFrame` is easy: the `.plot()` method of geopandas draws a map based on the geometries of the data. Under the hood, geopandas uses matplotlib for plotting.

In [None]:
import matplotlib.pyplot as plt

# Plot the data
fig, ax = plt.subplots(figsize=(20, 20))
data.plot(ax=ax)

<a id="geometriesgeopandas"></a>
## Geometries in Geopandas
Geopandas takes advantage of Shapely’s geometric objects. Geometries are stored in a column called geometry that is a default column name for storing geometric information in geopandas.

In [None]:
data['geometry'].head()

The geometry column contains familiar looking values, namely Shapely `Polygon` objects. Since the spatial data is stored as Shapely objects, it is possible to use Shapely methods when dealing with geometries in geopandas.

In [None]:
# Access the geometry on the first row of data
data.at[0, "geometry"]

In [None]:
# Print information about the area 
print("Area:", round(data.at[0, "geometry"].area, 0), "square meters")

Iterate over the GeoDataFrame rows using the `iterrows()` function. For each row, print the area of the polygon:

In [None]:
for i, row in data.iterrows():
    area = row['geometry'].area
    print("Area:", round(area, 0), "square meters")

As you see from here, all pandas methods, such as the `iterrows()` function, are directly available in Geopandas without the need to call pandas separately because Geopandas is an extension for pandas.

In practice, it is not necessary to use the `iterrows()` approach to calculate the area for all features. `GeoDataFrame`s and `GeoSeries` have an `area` attribute that returns the area of every feature at once:

In [None]:
data.area

<a id="writingdata"></a>
## Writing data into a shapefile
It is possible to export `GeoDataFrame`s into various data formats using the `to_file()` method. In our case, we want to export subsets of the data into Shapefiles (one file for each feature class).

Let’s first select one class (class number 36200, “Lake water”) from the data as a new `GeoDataFrame`:

In [None]:
# Select a class
selection = data[data["CLASS"]==36200]
selection.head()

In [None]:
fig = plt.figure(figsize=(6, 6))
ax = plt.axes()
selection.plot(ax=ax)

Write this layer into a new Shapefile using the `to_file()` method of the `GeoDataFrame`:

In [None]:
import os

# Create a output path for the data
output_fp = "created_files/Class_36200.shp"

# Create the folder if it does not exist
os.makedirs('created_files', exist_ok=True)

# Write those rows into a new file (the default output file format is Shapefile)
selection.to_file(output_fp)

<a id='sjoins'></a>
## Spatial joins
A **spatial join** uses binary predicates such as `intersects` and `crosses` to combine two `GeoDataFrame`s based on the spatial relationship between their geometries.

A common use case might be a spatial join between a `Point` and a `Polygon` where you want to retain the point geometries and grab the attributes of the intersecting polygons.

![](https://web.natur.cuni.cz/~langhamr/lectures/vtfg1/mapinfo_1/about_gis/Image23.gif)

### Types of spatial joins
Geopandas currently supports the following types of spatial joins. We refer to `left_df` and `right_df`, which correspond to the two dataframes passed in as arguments.

#### Left outer join
In a LEFT OUTER JOIN (`how='left'`), we keep all rows from the left and duplicate them if necessary to represent multiple hits between the two dataframes. We retain attributes of the right if they intersect and lose right rows that do not intersect. A left outer join implies that we are interested in retaining the geometries of the left.

#### Right outer join
In a RIGHT OUTER JOIN (`how='right'`), we keep all rows from the right and duplicate them if necessary to represent multiple hits between the two dataframes. We retain attributes of the left if they intersect and lose left rows that don’t intersect. A right outer join implies that we are interested in retaining the geometries of the right.

#### Inner join
In an INNER JOIN (`how='inner'`), we keep rows from the right and left only where their binary predicate is `True`. We duplicate them if necessary to represent multiple hits between the two dataframes. We retain attributes of the right and left only if they intersect and lose all rows that do not. An inner join implies that we are interested in retaining the geometries of the left.
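The difference between a left and an inner spatial join can be sketched in plain Python. The example below is an illustration under simplifying assumptions, not the geopandas implementation: regions are axis-aligned boxes standing in for real polygons, unmatched rows get `None` where geopandas would use `NaN`, and all names and values are hypothetical.

```python
# Hypothetical toy data: points as (name, x, y), regions as axis-aligned
# boxes (name, minx, miny, maxx, maxy) standing in for real polygons.
points = [("a", 1, 1), ("b", 5, 5), ("c", 9, 9)]
regions = [("R1", 0, 0, 2, 2), ("R2", 4, 4, 6, 6)]

def intersects(point, region):
    """True if the point lies inside or on the boundary of the box."""
    _, x, y = point
    _, minx, miny, maxx, maxy = region
    return minx <= x <= maxx and miny <= y <= maxy

def spatial_join(left, right, how="inner"):
    """Pair each left point with every right region it intersects."""
    rows = []
    for point in left:
        matches = [region[0] for region in right if intersects(point, region)]
        if matches:
            # One output row per hit, so a left row may be duplicated
            rows.extend((point[0], name) for name in matches)
        elif how == "left":
            rows.append((point[0], None))  # keep unmatched left rows
    return rows

left_rows = spatial_join(points, regions, how="left")
inner_rows = spatial_join(points, regions, how="inner")
print(left_rows)
print(inner_rows)
```

Point `c` falls in no region, so it survives only in the left join.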

![](https://www.golinuxcloud.com/wp-content/uploads/types_joins.png)

In [45]:
%matplotlib inline
from shapely.geometry import Point

In [None]:
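# NYC Boros
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0;
# on recent versions the separate geodatasets package offers
# geodatasets.get_path('nybb') as a replacement
zippath = gpd.datasets.get_path('nybb')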
polydf = gpd.read_file(zippath)
polydf

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
polydf.plot(ax = ax)

In [None]:
polydf_lat_lng = polydf.to_crs(epsg=4326)

polydf_lat_lng

In [None]:
polydf_lat_lng.plot()

In [None]:
polydf.total_bounds
# tuple containing minx, miny, maxx, maxy values 
# for the bounds of the series as a whole.

In [52]:
# Generate some points
b = [int(x) for x in polydf.total_bounds]
N = 8
pointdf = gpd.GeoDataFrame([
    {'geometry': Point(x, y), 'value1': x + y, 'value2': x - y}
    for x, y in zip(range(b[0], b[2], int((b[2] - b[0]) / N)),
                    range(b[1], b[3], int((b[3] - b[1]) / N)))])

# Make sure they're using the same projection reference
pointdf.crs = polydf.crs

In [None]:
pointdf

In [None]:
pointdf.plot()

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize = (20, 20))

ax = plt.axes()
ax = polydf.plot(ax = ax) 
ax = pointdf.plot(ax=ax, color='r')
ax.plot()

In [None]:
join_left_df = pointdf.sjoin(polydf, how="left")
join_left_df
# Note the NaNs where the point did not intersect a boro
# Note that the default join predicate is 'intersects'

In [None]:
join_left_df.plot()

In [None]:
join_inner_df = pointdf.sjoin(polydf, how="inner")
join_inner_df
# Note the lack of NaNs; dropped anything that didn't intersect

In [None]:
join_inner_df.plot()

We’re not limited to using the `intersects` binary predicate. Other Shapely geometry methods that return a Boolean (e.g., `within`, `contains`, `crosses`) can be used by specifying the `predicate` keyword argument.

In [None]:
pointdf.sjoin(polydf, how="left", predicate="within")
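The two predicates differ for points that lie exactly on a polygon's boundary: such a point intersects the polygon but is not within it. The following plain-Python sketch mimics this behaviour for an axis-aligned box. The helpers `box_intersects` and `box_within` are hypothetical stand-ins for the Shapely predicates and only cover this special case.

```python
def box_intersects(x, y, minx, miny, maxx, maxy):
    """Point is inside the box or on its boundary."""
    return minx <= x <= maxx and miny <= y <= maxy

def box_within(x, y, minx, miny, maxx, maxy):
    """Point is strictly inside the box (boundary excluded)."""
    return minx < x < maxx and miny < y < maxy

unit_box = (0, 0, 1, 1)
on_edge = (0, 0.5)     # lies on the left edge of the box
inside = (0.5, 0.5)    # lies in the interior

print(box_intersects(*on_edge, *unit_box), box_within(*on_edge, *unit_box))
print(box_intersects(*inside, *unit_box), box_within(*inside, *unit_box))
```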

# Exercise 1

- Retrieve the shapefile of Italian regions

- Retrieve the list of Italian regional capitals, with coordinates

- Assign the correct capital to each region

In [None]:
import pandas as pd

capitals_df = pd.read_csv('data/italian_regional_capitals.csv')

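# Create a GeoDataFrame and declare that the coordinates are WGS84 longitude/latitude
# (the CRS must be set on the GeoDataFrame, not on the pandas DataFrame)
capitals_gdf = gpd.GeoDataFrame(
    capitals_df,
    geometry=gpd.points_from_xy(capitals_df.longitude, capitals_df.latitude),
    crs="EPSG:4326",
)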

capitals_gdf.drop(['latitude', 'longitude', 'region'], axis=1, inplace=True)

capitals_gdf

In [None]:
region_gdf = gpd.read_file("data/it.json")

region_gdf

In [68]:
region_gdf.crs = capitals_gdf.crs

In [69]:
joined_df = capitals_gdf.sjoin(region_gdf, how="left", predicate="within")

In [None]:
joined_df.plot()

In [None]:
joined_df

In [None]:
joined_df = joined_df.dropna()
joined_df

In [None]:
fig, ax = plt.subplots(figsize=(20,20))

region_gdf.plot(ax=ax, color='white', edgecolor='black')
joined_df.plot(ax=ax, color='red')

In [None]:
import pandas as pd

capitals_df = pd.read_csv('data/italian_regional_capitals.csv')

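# Create a GeoDataFrame and declare that the coordinates are WGS84 longitude/latitude
# (the CRS must be set on the GeoDataFrame, not on the pandas DataFrame)
capitals_gdf = gpd.GeoDataFrame(
    capitals_df,
    geometry=gpd.points_from_xy(capitals_df.longitude, capitals_df.latitude),
    crs="EPSG:4326",
)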

print(capitals_gdf)

In [None]:
region_gdf = gpd.read_file("data/it.json")

region_gdf.plot()

In [None]:
region_gdf

In [60]:
region_gdf.crs = capitals_gdf.crs

In [61]:
joined_df = capitals_gdf.sjoin(region_gdf, how="left", predicate="within")

In [62]:
joined_df = joined_df.dropna()

In [None]:
fig, ax = plt.subplots(figsize=(40, 40))

region_gdf.plot(ax=ax, color='white', edgecolor='black')

joined_df.plot(ax=ax, color='red', markersize=100)

In [None]:
fig, ax = plt.subplots(figsize=(40, 40))

region_gdf.plot(ax=ax, color='white', edgecolor='black')

capitals_gdf.plot(ax=ax, color='red', markersize=100)

plt.show()

<a id="scikitmobility"></a>
# scikit-mobility

![GitHub Repo stars](https://img.shields.io/github/stars/scikit-mobility/scikit-mobility?style=social)
![GitHub](https://img.shields.io/github/license/scikit-mobility/scikit-mobility)
![GitHub release (latest by date)](https://img.shields.io/github/v/release/scikit-mobility/scikit-mobility)

[**scikit-mobility**](https://github.com/scikit-mobility/scikit-mobility) is a python library that provides scientists and practitioners with an environment to:

1. load and represent mobility data, both at the individual and the collective level, through easy-to-use data structures
(`TrajDataFrame` and `FlowDataFrame`); 
2. visualize trajectories and flows on interactive maps;
3. clean and preprocess mobility data using state-of-the-art techniques, such as trajectory clustering, compression, segmentation, and filtering;
4. analyze mobility data by using the main measures characterizing mobility patterns both at the individual and at the collective level, such as the computation of travel and characteristic distances, object and location entropies, location frequencies, waiting times, origin-destination matrices, and more;
5. run the most popular mechanistic generative models to simulate individual mobility, such as the Exploration and Preferential Return model (EPR) and its variants, and commuting and migratory flows, such as the Gravity Model and the Radiation Model;
6. estimate the privacy risk associated with the analysis of a given mobility dataset through the simulation of the reidentification risk associated with a vast repertoire of privacy attacks.

- scikit-mobility is publicly available on GitHub at the following link: https://github.com/scikit-mobility/scikit-mobility.

- the documentation describing all the classes and functions of scikit-mobility
is available at https://scikit-mobility.github.io/scikit-mobility/.

The paper describing scikit-mobility may be found at: https://www.jstatsoft.org/article/view/v103i04

In [79]:
# import the library
import skmob

skmob.__version__

import pandas as pd
import geopandas as gpd

<a id="datastructures"></a>
## Data Structures

scikit-mobility provides two data structures to deal with raw trajectories and flows between places: 
- `TrajDataFrame`, for spatio-temporal trajectories; 
- `FlowDataFrame`, for mobility flows.

Both data structures extend the DataFrame implemented in the data analysis library [pandas](https://pandas.pydata.org/). Thus, both `TrajDataFrame` and `FlowDataFrame` inherit all the functionality provided by the `DataFrame`, including its efficient routines for reading and writing tabular data (e.g., mobility datasets).

The current version of the library is designed to work with the latitude and longitude system (`epsg:4326`). Therefore, the Haversine formula is used by default when the library’s functions compute distances. 
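For reference, the Haversine formula gives the great-circle distance between two latitude/longitude points on a sphere. The sketch below is a plain-Python version under the assumption of a mean Earth radius of 6371 km; the function name and the radius constant are ours, and scikit-mobility's own implementation may differ in details.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two lat/lng points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude along the equator
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(one_degree, 2))
```

Unlike the planar distance used by Shapely, this distance accounts for the curvature of the Earth, which is why it is the appropriate default for coordinates in `epsg:4326`.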

<a id="trajdataframe"></a>
### The `TrajDataFrame`

Mobility data describe the movements of a set of objects during a period of observation. The objects may represent individuals, private vehicles, boats, and even players on a sports field. 

Mobility data are generally collected in an automatic way as a by-product of human activity on electronic devices (e.g., mobile phones, GPS devices, social
networking platforms, video cameras) and stored as trajectories, a temporally ordered sequence of spatio-temporal points where an object stopped in or went through. 

A `TrajDataFrame` is an extension of the pandas DataFrame that has specific column names and data types. Each row in a `TrajDataFrame` describes a point of a trajectory and contains the following columns:

- `lat` - latitude of the point
- `lng` - longitude of the point
- `datetime` - date and time of the point

For multi-user data sets, there are two optional columns:

- `uid` - identifier of the user to whom the trajectory belongs
- `tid` - identifier for the trajectory
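Before turning to the library, the idea of a trajectory as a temporally ordered sequence of points per user can be sketched with the standard library alone. The records below are hypothetical and deliberately out of order; this is only an illustration of the data model, not of how `TrajDataFrame` stores data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records in arbitrary order: (uid, lat, lng, datetime string)
records = [
    (1, 39.984224, 116.319402, "2008-10-23 13:53:11"),
    (2, 43.7696, 11.2558, "2008-10-23 09:00:00"),
    (1, 39.984094, 116.319236, "2008-10-23 13:53:05"),
    (1, 39.984198, 116.319322, "2008-10-23 13:53:06"),
]

# Group the points by user identifier
trajectories = defaultdict(list)
for uid, lat, lng, ts in records:
    point = (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), lat, lng)
    trajectories[uid].append(point)

# Order each user's points in time (tuples compare on the datetime first)
for uid in trajectories:
    trajectories[uid].sort()

print([p[0].second for p in trajectories[1]])
```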


#### Creating a `TrajDataFrame`

A `TrajDataFrame` can be created from:

- a python list or numpy array
- a python dictionary
- a pandas DataFrame
- a text file

##### From a Python list

In [None]:
# From a list
data_list = [[1, 39.984094, 116.319236, '2008-10-23 13:53:05'],
             [1, 39.984198, 116.319322, '2008-10-23 13:53:06'],
             [1, 39.984224, 116.319402, '2008-10-23 13:53:11'],
             [1, 39.984211, 116.319389, '2008-10-23 13:53:16']]
data_list

We must set the indexes of the mandatory columns using arguments `latitude`, `longitude` and `datetime`.

In [None]:
tdf = skmob.TrajDataFrame(data_list, 
                          latitude=1, longitude=2, 
                          datetime=3)
print(type(tdf))
tdf

##### From a pandas DataFrame

In [None]:
# build a dataframe from the 2D list
data_df = pd.DataFrame(data_list, columns=['user', 'lat', 'lng', 'hour'])
print(type(data_df)) # type of the structure
data_df.head() # head of the DataFrame

Note that:

- the names of the columns in `data_df` do not match the names required by `TrajDataFrame`
- you must specify the names of the mandatory columns using the arguments `latitude`, `longitude` and `datetime`

In [None]:
# Create a TrajDataFrame from a DataFrame
tdf = skmob.TrajDataFrame(data_df, datetime='hour', user_id='user', latitude='lat', longitude='lng')
print(type(tdf))
tdf.head()

Columns of a `TrajDataFrame` have specific types:

In [None]:
# In the DataFrame
print(type(data_df))
data_df.dtypes

In [None]:
print(type(tdf)) # In the TrajDataFrame
tdf.dtypes

In [None]:
tdf['lat'].head()

##### From a URL

In [None]:
# create a TrajDataFrame from a dataset of trajectories 
url = "https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz"
tdf = skmob.TrajDataFrame.from_file(url)
print(type(tdf))
tdf.head()

#### Attributes of a TrajDataFrame
- `crs`: the coordinate reference system. Default: epsg:4326 (lat/long)
- `parameters`: a dictionary in which you can store any additional properties you need

In [None]:
# WGS84 datum
print(tdf.crs)
print(tdf.parameters)

In [None]:
# add your own parameter
tdf.parameters['analyzed'] = 1
tdf.parameters

#### Visualizing a `TrajDataFrame`

In [None]:
tdf.plot_trajectory(hex_color='red')

<a id="tessellation"></a>
### Tessellation
In mobility tasks, the geography is often discretized by mapping the coordinates to a *tessellation*, i.e., a covering of the
bi-dimensional space using a countable number of geometric shapes (e.g., squares, hexagons), called tiles, with no overlaps
and no gaps. 

For instance, for the analysis or prediction of mobility flows, a spatial tessellation is used to aggregate flows of people moving among locations (the tiles of the tessellation). 
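The mapping from coordinates to tiles can be sketched for a square grid in a few lines of plain Python. This is a simplified illustration: the tile size is expressed in degrees rather than meters, the grid origin is an arbitrary assumption, and the function name `square_tile_id` is ours and not part of scikit-mobility.

```python
import math

def square_tile_id(lat, lng, min_lat, min_lng, tile_size_deg):
    """Return the (row, col) index of the square tile containing the point."""
    row = math.floor((lat - min_lat) / tile_size_deg)
    col = math.floor((lng - min_lng) / tile_size_deg)
    return row, col

# Hypothetical grid anchored at (43.7, 11.2) with 0.01-degree tiles
tile_a = square_tile_id(43.7696, 11.2558, 43.7, 11.2, 0.01)
tile_b = square_tile_id(43.7699, 11.2551, 43.7, 11.2, 0.01)
tile_c = square_tile_id(43.7801, 11.2558, 43.7, 11.2, 0.01)
print(tile_a, tile_b, tile_c)
```

Two nearby points fall into the same tile and therefore count as the same location, while the third point, about one kilometer further north, lands in a different tile.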

#### Creating tessellations given a city name and a tile size

##### Squared tessellations

In [93]:
from skmob.tessellation.tilers import tiler
from skmob.utils.plot import plot_gdf

In [None]:
tess_squared = tiler.get('squared', base_shape='Florence, Italy', meters=1000)
print("tiles = %s" %len(tess_squared))
tess_squared.head()

In [None]:
plot_gdf(tess_squared, zoom=11)

In [None]:
tess_squared = tiler.get('squared', base_shape='Florence, Italy', meters=200)
print("tiles = %s" %len(tess_squared))
tess_squared.head()

In [None]:
plot_gdf(tess_squared, zoom=11)

##### Hexagonal tessellation

In [None]:
tess_h3 = tiler.get('h3_tessellation', base_shape='Florence, Italy', meters=1000)
print("tiles = %s" %len(tess_h3))
tess_h3.head()

In [None]:
plot_gdf(tess_h3, zoom=11)

In [None]:
tess_h3 = tiler.get('h3_tessellation', base_shape='Florence, Italy', meters=200)
print("tiles = %s" %len(tess_h3))
tess_h3.head()

In [None]:
plot_gdf(tess_h3, zoom=11)

<a id="flowdataframe"></a>
### The `FlowDataFrame`

Origin-destination matrices, aka *flows*, are another common representation of mobility data. While trajectories refer to movements of single objects, flows refer to aggregated movements of objects between a set of locations. An example of flows is the daily commuting flows between the neighbourhoods of a city.

In scikit-mobility, an origin-destination matrix is described by a `FlowDataFrame`, an extension of the pandas DataFrame that has specific column names and data types. 

A row in a `FlowDataFrame` represents a flow of objects between two locations, described by three mandatory columns: 
- `origin` (any type), 
- `destination` (any type),
- `flow` (type: integer). 

As introduced in the Tessellation section above, flows are aggregated among locations that correspond to the tiles of a spatial tessellation.

For this reason, each `FlowDataFrame` is associated with a spatial tessellation, a [geopandas](https://geopandas.org/) GeoDataFrame that contains two mandatory columns: 
- `tile_ID` (any type) indicates the identifier of
a location; 
- `geometry` indicates the geometric shape that describes the location on a territory (e.g., a square, a hexagon, the shape of a neighborhood).

Each location identifier in the origin and destination columns of a `FlowDataFrame` must be present in the associated spatial tessellation. Otherwise, the library raises an exception. 

Similarly, scikit-mobility raises an exception if the type of the `origin` and `destination` columns in the `FlowDataFrame` and the type of
the `tile_ID` column in the associated tessellation are different.
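The relationship between trips, flows, and the tessellation can be sketched in plain Python. The example aggregates hypothetical (origin, destination) trips into flow counts and rejects trips that refer to an unknown tile, mirroring the consistency check described above. The names `build_flows`, `tile_ids`, and `trips` are ours; the exception raised by scikit-mobility may be of a different type.

```python
from collections import Counter

# Hypothetical tessellation tile identifiers and individual trips
tile_ids = {"A", "B", "C"}
trips = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "A"), ("A", "B")]

def build_flows(trips, tile_ids):
    """Aggregate (origin, destination) trips into flow counts."""
    for origin, destination in trips:
        # Every location must exist in the associated tessellation
        if origin not in tile_ids or destination not in tile_ids:
            raise ValueError(f"unknown tile in trip {(origin, destination)}")
    return Counter(trips)

flows = build_flows(trips, tile_ids)
for (origin, destination), flow in sorted(flows.items()):
    print(origin, destination, flow)
```

Each printed line corresponds to one row of an origin-destination table with the three columns `origin`, `destination`, and `flow`.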

#### Creating a `FlowDataFrame`

Each `FlowDataFrame` is paired with a spatial tessellation. So, we must first create or load a spatial tessellation, which is a geopandas `GeoDataFrame`.

In [None]:
url = "https://raw.githubusercontent.com/scikit-mobility/tutorials/master/mda_masterbd2020/data/NY_counties_2011.geojson"
tessellation = gpd.read_file(url) # load a tessellation
tessellation.head()

In [None]:
plot_gdf(tessellation, zoom=6)

In [None]:
tessellation = tessellation.explode()
plot_gdf(tessellation, zoom=6)

#### Tip
Once you have a `GeoDataFrame` or a `GeoSeries` (i.e., just the `geometry` column), you can construct a squared tessellation on it.
(This does not currently work for the h3 tessellation because of a bug.)

In [None]:
ny_tess_squared = tiler.get('squared', base_shape=tessellation, meters=10000)
print("tiles = %s" %len(ny_tess_squared))
ny_tess_squared.head()

In [None]:
plot_gdf(ny_tess_squared, zoom=7)

Then, we can create a `FlowDataFrame` from a file/url, specifying the spatial tessellation it refers to using argument `tessellation`. 

Also, you must specify the name of the column in the tessellation `GeoDataFrame` containing the identifier of the locations.

In [None]:
url = "https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/NY_commuting_flows_2011.csv"
fdf = skmob.FlowDataFrame.from_file(url, tessellation=tessellation, tile_id='tile_id')
fdf.head()

In [None]:
fdf.tessellation

In [None]:
fdf.dtypes

In [None]:
type(fdf)

You can access the spatial tessellation associated with the created `FlowDataFrame` using the attribute `.tessellation`.

In [None]:
# The tessellation is an attribute of the FlowDataFrame
fdf.tessellation.head()

In [None]:
fdf['origin'].unique()

In [None]:
tessellation['tile_id'].unique()

In [None]:
fdf.plot_flows(tiles = 'cartodbpositron')

In [None]:
fdf.plot_tessellation(tiles = 'cartodbpositron')

In [None]:
map_f = fdf.plot_tessellation(tiles='cartodbpositron')
fdf.plot_flows(map_f=map_f)

# Exercise

Download your mobility data from https://takeout.google.com/

- Create and visualize your tdf

- Create your personal flow_df

In [117]:
import json
from datetime import datetime

In [120]:
with open('data/Takeout/2022_OCTOBER.json', 'r') as f:
    data = json.load(f)

In [None]:
data

In [None]:
data.keys()

In [None]:
data['timelineObjects'][2].keys()

In [None]:
data['timelineObjects'][2]['placeVisit']

In [None]:
place_visits = []

for item in data.get('timelineObjects', []):

    if 'placeVisit' in item:
        visit = item['placeVisit']
        
        # Extract relevant details from the place visit
        location = visit.get('location', {})
        lat = location.get('latitudeE7', 0) / 1e7
        lon = location.get('longitudeE7', 0) / 1e7
        name = location.get('name', 'Unknown Place')
        address = location.get('address', 'Unknown Address')
        place_id = location.get('placeId', 'N/A')
        
        # Extract visit duration details
        start_timestamp = visit['duration'].get('startTimestamp')
        end_timestamp = visit['duration'].get('endTimestamp')
        
        # Convert timestamps to datetime format
        if start_timestamp:
            start_timestamp = datetime.fromisoformat(start_timestamp.replace("Z", "+00:00"))
        if end_timestamp:
            end_timestamp = datetime.fromisoformat(end_timestamp.replace("Z", "+00:00"))
        
        # Append place visit data to the list
        place_visits.append([1, lat, lon, start_timestamp, end_timestamp, name, address, place_id])


# Create a DataFrame from the extracted place visit data
df = pd.DataFrame(place_visits, columns=['Uid', 'lat', 'lon', 'start_timestamp', 'end_timestamp', 'name', 'address', 'place_id'])

# Display the DataFrame
df
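Two conversions in the cell above deserve a closer look: coordinates are stored as integers scaled by $10^7$ (the `E7` suffix), and timestamps end in `Z`, which is replaced by an explicit UTC offset before parsing. The sketch below applies both steps to a single hypothetical record that only mimics the fields used above; the values are made up.

```python
from datetime import datetime

# Hypothetical record shaped like the fields used above (values are made up)
sample_visit = {
    "location": {"latitudeE7": 437695604, "longitudeE7": 112558136},
    "duration": {"startTimestamp": "2022-10-01T08:30:00Z"},
}

# E7 integers are degrees multiplied by 10**7
sample_lat = sample_visit["location"]["latitudeE7"] / 1e7
sample_lon = sample_visit["location"]["longitudeE7"] / 1e7

# Replace the trailing "Z" with "+00:00" so that fromisoformat() accepts it
# on Python versions that do not parse "Z" directly
sample_start = datetime.fromisoformat(
    sample_visit["duration"]["startTimestamp"].replace("Z", "+00:00")
)

print(sample_lat, sample_lon, sample_start.isoformat())
```

The resulting datetime is timezone-aware (UTC), which is worth keeping in mind when comparing it with naive timestamps.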

In [None]:
len(df)

In [None]:
#create a TrajDataFrame from the DataFrame

tdf = skmob.TrajDataFrame(df, datetime='start_timestamp', user_id='Uid', latitude='lat', longitude='lon')
tdf

In [None]:
tdf.plot_trajectory(hex_color='red', zoom=10, weight=2, tiles='cartodbpositron', max_points=1000)


In [None]:
activity_segments = []

for item in data.get('timelineObjects', []):

    if 'activitySegment' in item:
        segment = item['activitySegment']

        # Extract relevant details from the start location of the activity segment
        location = segment.get('startLocation', {})
        lat = location.get('latitudeE7', 0) / 1e7
        lon = location.get('longitudeE7', 0) / 1e7
        name = location.get('name', 'Unknown Place')
        address = location.get('address', 'Unknown Address')
        place_id = location.get('placeId', 'N/A')

        # Extract segment duration details
        start_timestamp = segment['duration'].get('startTimestamp')
        end_timestamp = segment['duration'].get('endTimestamp')

        # Convert timestamps to datetime format
        if start_timestamp:
            start_timestamp = datetime.fromisoformat(start_timestamp.replace("Z", "+00:00"))
        if end_timestamp:
            end_timestamp = datetime.fromisoformat(end_timestamp.replace("Z", "+00:00"))

        # Append activity segment data to the list
        activity_segments.append([1, lat, lon, start_timestamp, end_timestamp, name, address, place_id])


# Create a DataFrame from the extracted activity segment data
df = pd.DataFrame(activity_segments, columns=['Uid', 'lat', 'lon', 'start_timestamp', 'end_timestamp', 'name', 'address', 'place_id'])

# Display the DataFrame
df

In [None]:
tdf = skmob.TrajDataFrame(df, datetime='start_timestamp', user_id='Uid', latitude='lat', longitude='lon')

tdf.plot_trajectory(hex_color='red', zoom=10, weight=2, tiles='cartodbpositron', max_points=1000)