# Simple ML model using Overture Maps data

In this notebook, we aim to predict the distance to the nearest bicycle-sharing station in three Spanish cities: Madrid, Seville, and Valencia.

Our goal is to develop a model capable of accurately predicting this distance based on geospatial features, which can assist in urban planning and enhance the accessibility of bicycle-sharing services.

The task involves several key steps:

1. **Loading bicycle-sharing stations**: 
We begin by loading geospatial data for bicycle-sharing stations from OpenStreetMap using [`OSMOnlineLoader`](../../loaders/osm_online_loader/) for the specified cities. We also visualize the locations of these stations.

2. **Regionalization using H3 hexagons**:
The cities are regionalized using H3 hexagons with [`H3Regionalizer`](../../regionalizers/h3_regionalizer/), enabling us to divide them into smaller, manageable regions. Next, we buffer the generated H3 regions and compute the distance from each hexagon to the nearest bicycle-sharing station.

3. **Feature extraction and embedding**:
We load additional geospatial features from Overture Maps using [`OvertureMapsLoader`](../../loaders/overture_maps_loader/) and join them with the H3 regions (with [`IntersectionJoiner`](../../joiners/intersection_joiner/)). We then generate embeddings for these regions using [`ContextualCountEmbedder`](../../embedders/contextual_count_embedder/), which will be used as input features for our machine learning model.

4. **Training and evaluating the model**: 
We prepare the data for training and validation, standardize the features, and train an XGBoost regression model to predict the distance to the nearest bicycle-sharing station. Data from Madrid is used for training and data from Seville for validation; Valencia is not seen during training and is used only for prediction.

5. **Prediction and visualization**:
Finally, the trained model predicts the distance to the nearest bicycle-sharing station for all three cities. The predicted distances and prediction errors are visualized on maps to evaluate the model's performance.

<div class="admonition info">
    <p class="admonition-title">Prerequisites</p>
    <p>
    <ul>
        <li>12 GB of RAM</li>
        <li>
            Installed libraries: 
            <code>srai[osm,overturemaps,plotting]</code>, 
            <code>contextily</code>, <code>seaborn</code>, 
            <code>scikit-learn</code>, <code>xgboost</code>, <code>pypalettes</code>
        </li>
    </ul>
    </p>
</div>

<a target="_blank" href="https://colab.research.google.com/github/kraina-ai/srai/blob/main/examples/use_cases/simple_ml_with_overture_maps_data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
# Uncomment the line below to install the required packages
# (e.g. when running in a new environment or Google Colab)

# ! pip install srai[osm,overturemaps,plotting] contextily seaborn scikit-learn xgboost pypalettes

## Import required libraries

In [None]:
import contextily as cx
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd
import pyarrow as pa
import seaborn as sns
import xgboost as xgb
from h3 import int_to_str, str_to_int
from h3ronpy import grid_disk_aggregate_k
from pypalettes import load_cmap
from sklearn.preprocessing import StandardScaler

from srai.embedders import ContextualCountEmbedder
from srai.h3 import h3_to_shapely_geometry
from srai.joiners import IntersectionJoiner
from srai.loaders import OSMOnlineLoader
from srai.loaders.overturemaps_loader import OvertureMapsLoader
from srai.neighbourhoods import H3Neighbourhood
from srai.regionalizers import H3Regionalizer, geocode_to_region_gdf

In [None]:
SEED = 71

## Define regions of interest

Geocode three Spanish cities using the [`geocode_to_region_gdf`](../../../api/regionalizers/#srai.regionalizers.geocode_to_region_gdf) function.

In [None]:
cities_names = ["Madrid", "Seville", "Valencia"]
regions = geocode_to_region_gdf(cities_names)
regions.index = cities_names
regions

## Load bicycle-sharing stations data

Load the locations of bicycle-sharing stations from OpenStreetMap using [`OSMOnlineLoader`](../../loaders/osm_online_loader/) for the defined regions.

We will use the `{"amenity": "bicycle_rental"}` filter to get only locations where you can rent a bicycle (see the [`amenity=bicycle_rental`](https://wiki.openstreetmap.org/wiki/Tag:amenity%3Dbicycle_rental) definition on the OSM wiki).

In [None]:
bicycle_stations = OSMOnlineLoader().load(area=regions, tags={"amenity": "bicycle_rental"})
bicycle_stations

Let's join the locations of the bicycle-sharing stations with the city geometries using [`IntersectionJoiner`](../../joiners/intersection_joiner/) to group them per city.

In [None]:
bicycle_stations_in_city = IntersectionJoiner().transform(
    regions, bicycle_stations, return_geom=True
)

bicycle_stations_per_city = {}
for city_name in cities_names:
    bicycle_stations_per_city[city_name] = bicycle_stations_in_city.loc[city_name]

Let's visualize the Madrid data on the map.

In [None]:
bicycle_stations_per_city["Madrid"].explore(tiles="CartoDB Positron")

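## Regionalize the cities using H3 hexagons

Convert the station geometries into H3 cells with [`H3Regionalizer`](../../regionalizers/h3_regionalizer/), buffer those cells, and compute the distance (in H3 cells) from each hexagon to the nearest bicycle-sharing station.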

In [None]:
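# H3 resolution of the hexagons used as regions
H3_RESOLUTION = 11
# Cells up to this many rings away from a station are kept for plots and evaluation
H3_PREDICTION_RANGE = 10
# Distances are clamped to this value; farther cells are excluded from training
H3_PREDICTION_MAX_DISTANCE = 10
# Neighbourhood radius (in rings) used by the embedder, also added to the buffer
H3_NEIGHBOURS = 5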


def buffer_h3_cells_with_aggregation(h3_regions):
    """Expand H3 regions and calculate minimal distance to origin cells."""
    return (
        pa.table(
            grid_disk_aggregate_k(
                h3_regions.index.map(str_to_int),
                H3_NEIGHBOURS + H3_PREDICTION_RANGE,
                "min",
            )
        )
        .to_pandas()
        .rename(columns={"k": "distance_to_station", "cell": "region_id"})
    )


h3_regionalizer = H3Regionalizer(resolution=H3_RESOLUTION)
h3_regions_gdfs = []
for city_name, bicycle_stations_data in bicycle_stations_per_city.items():
    city_h3_regions = h3_regionalizer.transform(bicycle_stations_data)

    expanded_city_h3_regions = buffer_h3_cells_with_aggregation(city_h3_regions)
    expanded_city_h3_regions["region_id"] = expanded_city_h3_regions["region_id"].map(int_to_str)
    expanded_city_h3_regions = expanded_city_h3_regions.set_index("region_id")
    expanded_city_h3_regions["city"] = city_name
    expanded_city_h3_regions["clamped_distance_to_station"] = expanded_city_h3_regions[
        "distance_to_station"
    ].clip(0, H3_PREDICTION_MAX_DISTANCE)
    expanded_city_h3_regions = gpd.GeoDataFrame(
        expanded_city_h3_regions,
        geometry=h3_to_shapely_geometry(expanded_city_h3_regions.index),
        crs=4326,
    )
    h3_regions_gdfs.append(expanded_city_h3_regions)

h3_regions = pd.concat(h3_regions_gdfs)

h3_regions
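
The `"min"` aggregation in the cell above can be pictured as a multi-source breadth-first search: every station cell starts at distance 0, and each surrounding cell receives the number of rings to its closest station. The block below is only a conceptual sketch on an idealized hexagonal grid in axial `(q, r)` coordinates. It does not call `h3ronpy`, it ignores H3 specifics such as pentagon cells, and the function name and the sample cells are hypothetical.

```python
from collections import deque

# The six neighbours of a hexagon in axial (q, r) coordinates.
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def min_distance_to_origins(origins, max_k):
    """Return {cell: rings to the nearest origin} for all cells within max_k rings."""
    distances = {cell: 0 for cell in origins}
    queue = deque(distances)
    while queue:
        q, r = queue.popleft()
        k = distances[(q, r)]
        if k == max_k:
            continue  # do not expand past the requested buffer size
        for dq, dr in HEX_DIRECTIONS:
            neighbour = (q + dq, r + dr)
            if neighbour not in distances:  # first visit is the shortest one
                distances[neighbour] = k + 1
                queue.append(neighbour)
    return distances


# Two hypothetical "station" cells, buffered by 2 rings.
station_cells = [(0, 0), (5, 0)]
distance_to_station = min_distance_to_origins(station_cells, max_k=2)
```

A cell lying between the two stations, such as `(3, 0)`, is assigned the distance to whichever station is closer, which is the behaviour the notebook relies on when it stores `k` as `distance_to_station`.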

In [None]:
cmap = load_cmap("Temps", cmap_type="continuous", reverse=False)

ax = h3_regions[
    (h3_regions.city == "Madrid") & (h3_regions.distance_to_station <= H3_PREDICTION_RANGE)
].plot(
    column="clamped_distance_to_station",
    figsize=(20, 20),
    cmap=cmap,
    alpha=0.6,
    legend=True,
    legend_kwds={
        "shrink": 0.3,
        "location": "bottom",
        "label": "Distance to station",
        "pad": -0.05,
    },
)
h3_regions[
    (h3_regions.city == "Madrid") & (h3_regions.distance_to_station > H3_PREDICTION_RANGE)
].plot(ax=ax, color="gray", alpha=0.3)
bicycle_stations_per_city["Madrid"].representative_point().plot(ax=ax, color="black", markersize=1)

cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
ax.set_axis_off()
ax.set_title("Distance to the nearest bike station in Madrid", fontsize=20)
plt.show()

In [None]:
OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES = {
    ("base", "infrastructure"): 1,
    ("base", "land"): 1,
    ("base", "land_use"): 1,
    ("base", "water"): 1,
    ("transportation", "segment"): 2,
    ("buildings", "building"): 2,
    ("places", "place"): 1,
}

features = OvertureMapsLoader(
    release="2024-12-18.0",
    theme_type_pairs=list(OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES.keys()),
    hierarchy_depth=list(OVERTURE_MAPS_HIERARCHY_DEPTH_VALUES.values()),
    include_all_possible_columns=False,
).load(area=h3_regions)
features

# If you want to use OpenStreetMap data instead, you can use `OSMPbfLoader`
# with the `GEOFABRIK_LAYERS` filter

# from srai.loaders.osm_loaders.filters import GEOFABRIK_LAYERS
# from srai.loaders.osm_loaders import OSMPbfLoader
# features = OSMPbfLoader().load(area=h3_regions, tags=GEOFABRIK_LAYERS)

In [None]:
joint = IntersectionJoiner().transform(regions=h3_regions, features=features)
joint

In [None]:
embeddings = ContextualCountEmbedder(
    neighbourhood=H3Neighbourhood(),
    neighbourhood_distance=H3_NEIGHBOURS,
    concatenate_vectors=False,
    count_subcategories=False,
).transform(regions_gdf=h3_regions, features_gdf=features, joint_gdf=joint)

# If you are using OpenStreetMap data, remember to remove bicycle-sharing stations from the dataset
# embeddings = embeddings.drop(columns=[c for c in embeddings.columns if "bicycle_rental" in c])

embeddings
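
To give an intuition for what a contextual count embedding adds over plain per-cell counts, the sketch below enriches each cell's count with the average count of its neighbours at each distance, discounted as the distance grows. It is a simplified illustration and not the `ContextualCountEmbedder` implementation: the inverse-square weighting, the 1-D chain of cells, and all names are assumptions of this sketch, and the library's exact averaging and weighting may differ.

```python
def contextual_counts(counts, neighbours_at, max_distance):
    """Add distance-discounted average neighbour counts to each cell's own count."""
    result = {}
    for cell, own_count in counts.items():
        total = float(own_count)
        for distance in range(1, max_distance + 1):
            ring = [counts[n] for n in neighbours_at(cell, distance) if n in counts]
            if ring:
                ring_average = sum(ring) / len(ring)
                # Assumed weighting: farther rings contribute less.
                total += ring_average / (distance + 1) ** 2
        result[cell] = total
    return result


def chain_neighbours(cell, distance):
    """Neighbours at an exact distance on a 1-D chain of integer cell ids."""
    return [cell - distance, cell + distance]


cafe_counts = {0: 4, 1: 0, 2: 8}  # hypothetical feature counts per cell
embedding = contextual_counts(cafe_counts, chain_neighbours, max_distance=2)
```

Cell `1` has no features of its own, yet it ends up with a non-zero value because its neighbours do. This is why empty hexagons next to dense areas remain informative to the model.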

In [None]:
TARGET = "distance_to_station"
TRAIN_CITY = "Madrid"
VALIDATION_CITY = "Seville"

In [None]:
madrid_data = h3_regions[h3_regions["city"] == "Madrid"].merge(
    embeddings, left_index=True, right_index=True
)

seville_data = h3_regions[h3_regions["city"] == "Seville"].merge(
    embeddings, left_index=True, right_index=True
)

valencia_data = h3_regions[h3_regions["city"] == "Valencia"].merge(
    embeddings, left_index=True, right_index=True
)


madrid_data.shape, seville_data.shape, valencia_data.shape

In [None]:
madrid_data.head()

In [None]:
x_madrid = StandardScaler().fit_transform(madrid_data[embeddings.columns])
y_madrid = madrid_data[TARGET]

x_seville = StandardScaler().fit_transform(seville_data[embeddings.columns])
y_seville = seville_data[TARGET]

x_valencia = StandardScaler().fit_transform(valencia_data[embeddings.columns])
y_valencia = valencia_data[TARGET]

x_madrid.shape, y_madrid.shape, x_seville.shape, y_seville.shape, x_valencia.shape, y_valencia.shape
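
Standardization rescales every feature column to zero mean and unit variance. The minimal sketch below reproduces that for a single column using only the standard library; it assumes the population standard deviation, which is what `StandardScaler` uses by default. The helper name and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_column(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(value - mean) / std for value in values]


building_counts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature column
scaled = standardize_column(building_counts)
```

Note that the cell above fits a separate scaler for each city, so every city's features are standardized with that city's own mean and standard deviation.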

In [None]:
mask = y_madrid <= H3_PREDICTION_MAX_DISTANCE
dtrain = xgb.DMatrix(x_madrid[mask], label=y_madrid[mask])
mask = y_seville <= H3_PREDICTION_MAX_DISTANCE
dval = xgb.DMatrix(x_seville[mask], label=y_seville[mask])

# Set parameters for XGBoost
params = {
    "objective": "reg:squarederror",
    "eta": 0.01,
    "max_depth": 8,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": SEED,
}

# Train the model
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    verbose_eval=50,
    early_stopping_rounds=100,
    evals=[(dtrain, "train"), (dval, "valid")],
)
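
With `early_stopping_rounds=100`, training stops once the validation metric has not improved for 100 consecutive rounds. When several evaluation sets are passed, XGBoost uses the last entry of `evals` for this check, which is why `(dval, "valid")` comes last. The logic can be sketched in plain Python as follows; the function name and the scores are hypothetical, and the sketch assumes a metric where lower is better, such as RMSE.

```python
def best_round_with_early_stopping(validation_scores, patience):
    """Return (best_round, best_score) for a metric where lower is better."""
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(validation_scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_score


# Hypothetical validation RMSE per boosting round.
rmse_per_round = [3.0, 2.5, 2.2, 2.3, 2.25, 2.4, 2.1]
best = best_round_with_early_stopping(rmse_per_round, patience=3)
```

In this toy run the loop stops before it reaches the last score, so a later improvement is never seen. This is the trade-off that a larger patience value reduces at the cost of longer training.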

In [None]:
bst.best_iteration, bst.best_score

In [None]:
concatenated_regions = pd.concat([madrid_data, seville_data, valencia_data])[
    [TARGET, "city", "geometry"]
]

concatenated_regions["predicted_distance_to_station"] = (
    np.concatenate(
        [
            bst.predict(xgb.DMatrix(x_madrid)),
            bst.predict(xgb.DMatrix(x_seville)),
            bst.predict(xgb.DMatrix(x_valencia)),
        ]
    )
    .round(2)
    .clip(min=0)
)

concatenated_regions = concatenated_regions[concatenated_regions[TARGET] <= H3_PREDICTION_RANGE]

concatenated_regions["prediction_error"] = concatenated_regions[
    "predicted_distance_to_station"
] - concatenated_regions[TARGET].clip(upper=H3_PREDICTION_MAX_DISTANCE)

concatenated_regions

In [None]:
cmap = load_cmap("pal12", reverse=True, keep_first_n=H3_PREDICTION_MAX_DISTANCE + 1)

_, axs = plt.subplots(2, 3, figsize=(12, 8), sharey=True, sharex=True, dpi=600)

axs[0, 0].set_ylabel("Predicted distance to station")
axs[1, 0].set_ylabel("Predicted distance to station")

for idx, city_name in enumerate(cities_names):
    expected_values = concatenated_regions[concatenated_regions["city"] == city_name][TARGET]
    mask = expected_values <= H3_PREDICTION_MAX_DISTANCE
    expected_values = expected_values[mask]
    predicted_values = concatenated_regions[concatenated_regions["city"] == city_name][
        "predicted_distance_to_station"
    ][mask]

    sns.regplot(
        x=expected_values,
        y=predicted_values,
        ax=axs[0, idx],
        scatter=True,
        order=2,
        scatter_kws=dict(
            alpha=0.02,
            color=[cmap.colors[_y] for _y in expected_values],
        ),
        line_kws=dict(
            color="black",
        ),
        x_jitter=0.1,
    )
    sns.violinplot(
        x=expected_values,
        y=predicted_values,
        ax=axs[1, idx],
        fill=True,
        palette=cmap.colors,
        hue=expected_values,
        legend=False,
    )
    title = city_name
    if city_name == TRAIN_CITY:
        title += " (train)"
    elif city_name == VALIDATION_CITY:
        title += " (validation)"

    axs[0, idx].set_title(title)
    axs[0, idx].set_xlabel(None)
    axs[1, idx].set_xlabel("Distance to station")

axs[0, 0].set_ylim(bottom=0)
axs[1, 0].set_ylim(bottom=0)

plt.tight_layout()

In [None]:
cmap = load_cmap("Temps", cmap_type="continuous", reverse=False)

for city_name in cities_names:
    city_data = concatenated_regions[concatenated_regions.city == city_name]
    ax = city_data.plot(
        column="predicted_distance_to_station",
        figsize=(20, 20),
        cmap=cmap,
        alpha=0.8,
        legend=True,
        legend_kwds={
            "shrink": 0.3,
            "location": "bottom",
            "label": "Predicted distance to station",
            "pad": -0.05,
        },
        vmin=max(0, city_data["predicted_distance_to_station"].min()),
        vmax=city_data["predicted_distance_to_station"].max(),
    )
    bicycle_stations_per_city[city_name].representative_point().plot(
        ax=ax, color="black", markersize=3, alpha=0.4
    )

    cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
    ax.set_axis_off()

    title = f"Predicted distance to the nearest bike station in {city_name}"
    if city_name == TRAIN_CITY:
        title += " (train)"
    elif city_name == VALIDATION_CITY:
        title += " (validation)"

    ax.set_title(title, fontsize=20)

    plt.show()

In [None]:
cmap = load_cmap("TangerineBlues", cmap_type="continuous")

for city_name in cities_names:
    city_data = concatenated_regions[concatenated_regions.city == city_name].copy()
    city_data["normalized_prediction_error"] = (
        city_data["prediction_error"].apply(
            lambda x, city_data=city_data: (
                -x / city_data["prediction_error"].min()
                if x < 0
                else x / city_data["prediction_error"].max()
            )
        )
        + 1
    ) / 2

    city_data["normalized_prediction_error_alpha"] = (
        city_data["normalized_prediction_error"] - 0.5
    ).abs() * 2

    ax = city_data.plot(
        column="normalized_prediction_error",
        figsize=(20, 20),
        cmap=cmap,
        alpha=city_data["normalized_prediction_error_alpha"],
        legend=True,
        legend_kwds={
            "shrink": 0.3,
            "location": "bottom",
            "label": "Distance prediction error",
            "pad": -0.05,
            "ticks": [0, 0.5, 1],
            "format": mticker.FixedFormatter(
                [
                    city_data["prediction_error"].min().round(2),
                    "0",
                    city_data["prediction_error"].max().round(2),
                ]
            ),
        },
    )
    bicycle_stations_per_city[city_name].representative_point().plot(
        ax=ax, color="black", markersize=3, alpha=0.4
    )

    cx.add_basemap(ax, crs=h3_regions.crs, source=cx.providers.CartoDB.PositronNoLabels, zoom=13)
    ax.set_axis_off()

    title = f"Distance prediction error in {city_name}"
    if city_name == TRAIN_CITY:
        title += " (train)"
    elif city_name == VALIDATION_CITY:
        title += " (validation)"

    ax.set_title(title, fontsize=20)

    plt.show()
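
The normalization in the last cell maps negative and positive errors onto the two halves of a diverging colormap, so that 0.5 always means "no error", and then derives the opacity from the distance to that midpoint. The sketch below restates the same arithmetic as standalone functions with hypothetical names and sample values. Unlike the cell above, it handles an error of exactly zero explicitly to avoid a possible division by zero.

```python
def normalize_error(error, min_error, max_error):
    """Map an error to [0, 1] so that 0.5 always means 'no error'."""
    if error < 0:
        scaled = -error / min_error  # min_error is negative, so this lands in [-1, 0)
    elif error > 0:
        scaled = error / max_error  # lands in (0, 1]
    else:
        scaled = 0.0
    return (scaled + 1) / 2


def error_alpha(normalized_error):
    """Opacity grows with the distance from the 'no error' midpoint."""
    return abs(normalized_error - 0.5) * 2


errors = [-4.0, -1.0, 0.0, 1.5, 3.0]  # hypothetical prediction errors
normalized = [normalize_error(e, min(errors), max(errors)) for e in errors]
alphas = [error_alpha(n) for n in normalized]
```

Because underestimates and overestimates are scaled separately by their own extremes, the two ends of the colorbar can represent errors of different magnitude, which is why the legend labels them with the actual minimum and maximum.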