# Pollution data analysis example

<img align="right" src="https://movingpandas.github.io/movingpandas/assets/img/movingpandas.png">

This tutorial uses data published by the Department of Computer Science and Engineering, Indian Institute of Technology Delhi, specifically: [Delhi Pollution Dataset](http://cse.iitd.ac.in/pollutiondata/delhi). The workflow consists of the following steps:

1. Establishing an overview by visualizing raw input data records
2. Converting data into trajectories
3. Removing problematic trajectories using ObservationGapSplitter and filtering by speed
4. Plotting cleaned trajectories
5. Assigning H3 cell IDs to each trajectory point
6. Plotting H3 cells as polygons with pollution measurements

Some of the H3 steps are based on this [Medium article](https://medium.com/@jesse.b.nestler/how-to-convert-h3-cell-boundaries-to-shapely-polygons-in-python-f7558add2f63) on converting H3 cell boundaries to Shapely polygons.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import shapely as shp
import hvplot.pandas
import matplotlib.pyplot as plt
import h3
import folium

from geopandas import GeoDataFrame, read_file
from shapely.geometry import Point, LineString, Polygon
from datetime import datetime, timedelta
from holoviews import opts, dim
from os.path import exists
from urllib.request import urlretrieve

import warnings

warnings.filterwarnings("ignore")

plot_defaults = {"linewidth": 5, "capstyle": "round", "figsize": (9, 3), "legend": True}
opts.defaults(
    opts.Overlay(active_tools=["wheel_zoom"], frame_width=300, frame_height=500)
)
hvplot_defaults = {"tiles": None, "cmap": "Viridis", "colorbar": True}

mpd.show_versions()

## Loading pollution data

In [None]:
%%time
df = pd.read_csv("../data/2021-01-30_all.zip", index_col=0)
print(f"Finished reading {len(df)} records")

Let's see what the data looks like:

In [None]:
df.head()

In [None]:
df.plot(c="pm2_5", x="long", y="lat", kind="scatter")

Let's create trajectories:

In [None]:
tc = mpd.TrajectoryCollection(df, "deviceId", t="dateTime", x="long", y="lat")
print(tc)

## Removing problematic trajectories

We use fine particulate matter (PM2.5, column `pm2_5`) as an indicator of air pollution and compute its mean per trajectory:

In [None]:
traj_gdf = tc.to_traj_gdf(agg={"pm2_5": "mean"})

In [None]:
traj_gdf.plot("pm2_5_mean", cmap="YlOrRd", linewidth=0.7, legend=True, aspect=1)

Let's remove as many problematic trajectories as we can. First, we split each trajectory wherever consecutive observations are more than 10 minutes apart:

In [None]:
split = mpd.ObservationGapSplitter(tc).split(gap=timedelta(minutes=10))
split
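To illustrate the idea behind splitting at observation gaps, here is a minimal pure-Python sketch. The function `split_on_gaps` and the sample timestamps are hypothetical and only approximate the concept; the actual `ObservationGapSplitter` may differ in details such as how it treats very short segments.

```python
from datetime import datetime, timedelta


def split_on_gaps(timestamps, gap):
    """Group sorted timestamps into segments, starting a new segment
    whenever two consecutive timestamps are more than `gap` apart."""
    segments = []
    current = []
    for t in timestamps:
        if current and t - current[-1] > gap:
            segments.append(current)
            current = []
        current.append(t)
    if current:
        segments.append(current)
    return segments


# Hypothetical observation times, in minutes after 08:00
start = datetime(2021, 1, 30, 8, 0)
stamps = [start + timedelta(minutes=m) for m in [0, 1, 2, 30, 31, 50]]

segments = split_on_gaps(stamps, gap=timedelta(minutes=10))
print([len(s) for s in segments])
```

With a 10-minute threshold, the 28-minute and 19-minute pauses each start a new segment, so one device track becomes three shorter ones.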

In [None]:
split = split.add_speed(units=("km", "h"))

In [None]:
traj_gdf = split.to_traj_gdf(agg={"pm2_5": "mean", "speed": "max"})

A speed above 108 km/h (30 m/s) seems unlikely for a bus, so let's filter out trajectories whose maximum speed is 108 km/h or more:

In [None]:
traj_gdf = traj_gdf[traj_gdf.speed_max < 108]
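Because `add_speed` was called with `units=("km", "h")`, the speed values are in km/h, so any threshold has to be expressed in the same unit. A minimal sketch of the conversion follows; the helper names are hypothetical and not part of MovingPandas.

```python
KMH_PER_MS = 3.6  # 1 m/s = 3600 m per hour = 3.6 km/h


def ms_to_kmh(speed_ms):
    """Convert a speed from metres per second to kilometres per hour."""
    return speed_ms * KMH_PER_MS


def kmh_to_ms(speed_kmh):
    """Convert a speed from kilometres per hour to metres per second."""
    return speed_kmh / KMH_PER_MS


# The 30 m/s limit mentioned above, expressed in km/h
max_bus_speed_kmh = ms_to_kmh(30)
print(max_bus_speed_kmh)
```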

## Plotting trajectories

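Let's plot the resulting trajectories. We first convert the timestamp columns to strings, since datetime values may not serialize cleanly for the interactive map, and round the numeric columns for readability: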

In [None]:
traj_gdf["start_t"] = traj_gdf["start_t"].astype(str)
traj_gdf["end_t"] = traj_gdf["end_t"].astype(str)

In [None]:
traj_gdf = traj_gdf.round(2)

In [None]:
traj_gdf.explore(
    "pm2_5_mean",
    tiles="CartoDB positron",
    cmap="YlOrRd",
    linewidth=0.7,
    legend=True,
    aspect=1,
)

## Assigning H3 cell IDs to trajectory points

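Let's again filter by realistic speed, this time at the level of individual points. Since the speed column was computed in km/h, we reuse the 108 km/h threshold:

In [None]:
point_gdf = split.to_point_gdf()

In [None]:
# Speed is in km/h (see add_speed above), so 108 km/h corresponds to 30 m/s
point_gdf = point_gdf[point_gdf.speed < 108]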

In [None]:
point_gdf["x"] = point_gdf.geometry.x
point_gdf["y"] = point_gdf.geometry.y

We can assign H3 cell IDs to each point in a trajectory:

In [None]:
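res = 7  # H3 resolution; higher values produce smaller cells
point_gdf["h3_cell"] = point_gdf.apply(
    # h3 3.x API; in h3 4.x this function is named latlng_to_cell
    lambda r: str(h3.geo_to_h3(r.y, r.x, res)), axis=1
)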
point_gdf.head()

We can use the mean PM2.5 value per H3 cell as a pollution measurement:

In [None]:
h3_df_mean = point_gdf.groupby(["h3_cell"])["pm2_5"].mean().round(0).reset_index()
h3_df_mean = h3_df_mean.rename(columns={"pm2_5": "pm2_5_mean"})
h3_df_mean.head()

We can also use the maximum PM2.5 value per H3 cell as a pollution measurement:

In [None]:
h3_df_max = point_gdf.groupby(["h3_cell"])["pm2_5"].max().reset_index()
h3_df_max = h3_df_max.rename(columns={"pm2_5": "pm2_5_max"})
h3_df_max.head()
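As an alternative to two separate group-by operations, both statistics could be computed in a single pass with pandas named aggregation. The sketch below uses a small hypothetical DataFrame (the cell IDs and PM2.5 values are made up) in place of `point_gdf`.

```python
import pandas as pd

# Hypothetical sample standing in for point_gdf
points = pd.DataFrame(
    {
        "h3_cell": ["cell_a", "cell_a", "cell_b"],
        "pm2_5": [100.0, 200.0, 50.0],
    }
)

# One row per cell, with mean and maximum as separate columns
h3_stats = (
    points.groupby("h3_cell")["pm2_5"]
    .agg(pm2_5_mean="mean", pm2_5_max="max")
    .reset_index()
)
print(h3_stats)
```

Keeping both values in one table means a single geometry column would be enough when the cells are later turned into polygons.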

## Visualizing pollution measurements

Let's create polygons with pollution data:

In [None]:
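def cell_to_shapely(cell):
    # h3 3.x API; in h3 4.x this function is named cell_to_boundary
    coords = h3.h3_to_geo_boundary(cell)
    # H3 returns (lat, lng) pairs, while Shapely expects (x, y) = (lng, lat)
    flipped = tuple(coord[::-1] for coord in coords)
    return Polygon(flipped)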


h3_geoms_mean = h3_df_mean["h3_cell"].apply(cell_to_shapely)
h3_gdf_mean = gpd.GeoDataFrame(data=h3_df_mean, geometry=h3_geoms_mean, crs=4326)

h3_geoms_max = h3_df_max["h3_cell"].apply(cell_to_shapely)
h3_gdf_max = gpd.GeoDataFrame(data=h3_df_max, geometry=h3_geoms_max, crs=4326)
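The coordinate flip in `cell_to_shapely` is needed because H3 reports cell boundaries as (latitude, longitude) pairs, whereas Shapely and GeoPandas expect (x, y), that is (longitude, latitude). A minimal sketch of that step on its own, using a hypothetical three-vertex boundary rather than a real H3 cell:

```python
def latlng_ring_to_xy(boundary):
    """Convert a sequence of (lat, lng) pairs into (x, y) = (lng, lat) pairs."""
    return [(lng, lat) for lat, lng in boundary]


# Hypothetical boundary vertices near Delhi, in (lat, lng) order
boundary = ((28.60, 77.20), (28.61, 77.21), (28.62, 77.20))

xy = latlng_ring_to_xy(boundary)
print(xy)
```

Skipping the flip would place the polygons at swapped coordinates, far away from the trajectories on the map.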

Let's plot the results for mean pollution data:

In [None]:
h3_gdf_mean.explore("pm2_5_mean", cmap="YlOrRd")

We can plot polygons and trajectories together:

In [None]:
# Use `m` rather than `map` to avoid shadowing Python's built-in map()
m = h3_gdf_mean.explore("pm2_5_mean", cmap="YlOrRd", name="PM2.5 mean")

traj_gdf.explore(m=m, name="Bus trajectories")

folium.TileLayer("Cartodb Positron").add_to(m)

folium.LayerControl().add_to(m)

m

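Lastly, let's plot mean and maximum values next to each other for comparison. Both GeoDataFrames contain a `geometry` column, so we first rename one of them to avoid a duplicate column name when concatenating: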

In [None]:
h3_gdf_max = h3_gdf_max.rename(columns={"geometry": "geometry1"})

In [None]:
pollution = pd.concat([h3_gdf_mean, h3_gdf_max], axis=1)

In [None]:
(
    pollution.hvplot.polygons(
        geo=True, tiles="OSM", c="pm2_5_mean", alpha=0.8, title="Mean pollution data"
    )
    + pollution.hvplot.polygons(
        geo=True, tiles="OSM", c="pm2_5_max", alpha=0.8, title="Maximum pollution data"
    )
)

## Continue exploring MovingPandas

1. [Bird migration analysis](bird-migration.ipynb)
1. [Ship data analysis](ship-data.ipynb)
1. [Horse collar data exploration](horse-collar.ipynb)
1. [OSM traces](osm-traces.ipynb)
1. [Soccer game](soccer-game.ipynb)
1. [Mars rover & heli](mars-rover.ipynb)
1. [Ever Given](ever-given.ipynb)
1. [Iceberg](iceberg.ipynb) 
1. [Pollution data](pollution-data.ipynb)