In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import warnings

warnings.simplefilter("ignore")

<div class="main-title">
<h1>Machine Learning Applications</h1>
<p>Transfer learning and clustering</p>
</div>

## Transfer learning with bicycle rental stations

In this part we will see:
* How to use a pre-trained hex2vec model with `srai`
* How to train a classification model based on `srai` embeddings
* How to use `srai` to gather training data

In [3]:
# srai components used in this lesson
from srai.loaders import OSMOnlineLoader, OSMPbfLoader
from srai.regionalizers import geocode_to_region_gdf
from srai.joiners import IntersectionJoiner
from srai.embedders import Hex2VecEmbedder
from srai.regionalizers import H3Regionalizer

# classification model using scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# plotting utilities
from srai.plotting import plot_numeric_data
import plotly.express as px

from utils import CB_SAFE_PALLETE
import pandas as pd

### Experiment description

We train a classifier to recognise H3 regions that contain a bicycle rental station in the source city (Wrocław), and then apply it to the target city (Basel) to score candidate station locations.

In [4]:
SOURCE_CITY = "Wrocław, Poland"
TARGET_CITY = "Basel, Switzerland"
H3_RESOLUTION = 10

bike_stations_osm_tag = {"amenity": "bicycle_rental"}

In [5]:
source_area = geocode_to_region_gdf(SOURCE_CITY)
target_area = geocode_to_region_gdf(TARGET_CITY)

### Downloading bike rental stations

In [6]:
loader = OSMOnlineLoader()
stations = loader.load(source_area, bike_stations_osm_tag)
stations.explore(height=600, tiles="CartoDB positron")

Downloading amenity: bicycle_rental: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.02it/s]


### Load pre-trained embedding model

We use a pre-trained hex2vec model, which we trained on all Polish cities with more than 50k inhabitants. The models are available for download; the link is in [our repo](https://github.com/kraina-ai/srai#pre-trained-models-usage). For this tutorial, the model for resolution 10 is already downloaded and placed in the `models` directory.

In [7]:
embedder = Hex2VecEmbedder.load(f"models/hex2vec_{H3_RESOLUTION}_poland_50k")
embedder.expected_output_features

0           aeroway_aerodrome
1               aeroway_apron
2                aeroway_gate
3              aeroway_hangar
4             aeroway_helipad
                ...          
720    waterway_tidal_channel
721    waterway_turning_point
722      waterway_water_point
723        waterway_waterfall
724             waterway_weir
Length: 725, dtype: object

We need to translate these feature names into OSM tags, leaving out bicycle rental stations, since they are what we want to predict.
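
The splitting rule can be sketched on a handful of names. The feature names below come from the listing above, plus `amenity_bicycle_rental`; the helper name `features_to_osm_tags` is hypothetical, and `defaultdict` is just one way to write the grouping.

```python
from collections import defaultdict


def features_to_osm_tags(feature_names, skip=()):
    """Group 'key_value' feature names into an OSM tag filter {key: [values]}."""
    tags = defaultdict(list)
    for name in feature_names:
        if name in skip:
            continue
        # Split on the first underscore only: values such as
        # "tidal_channel" contain underscores of their own.
        key, value = name.split("_", 1)
        tags[key].append(value)
    return dict(tags)


example = features_to_osm_tags(
    ["aeroway_apron", "aeroway_gate", "amenity_bicycle_rental", "waterway_tidal_channel"],
    skip={"amenity_bicycle_rental"},
)
print(example)
```

Splitting with `maxsplit=1` matters here: a plain `split("_")` would break `waterway_tidal_channel` into three parts instead of a key and a value.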

In [8]:
embedder_osm_tags = {}

for element in embedder.expected_output_features:
    if element == 'amenity_bicycle_rental':
        continue
    key, value = element.split('_', 1)
    if key not in embedder_osm_tags:
        embedder_osm_tags[key] = [value]
    else:
        embedder_osm_tags[key].append(value)

### Load features from OSM and prepare regions

We need to load OSM features to calculate embeddings for our cities. We will use `OSMPbfLoader` this time, since it is faster than `OSMOnlineLoader` when there are many tags to download.

In [9]:
# load features
train_features = OSMPbfLoader().load(source_area, embedder_osm_tags)
# split into regions
train_regions = H3Regionalizer(resolution=H3_RESOLUTION).transform(source_area)
# join regions and features
train_joint = IntersectionJoiner().transform(train_regions, train_features)
# calculate embeddings
train_embeddings = embedder.transform(train_regions, train_features, train_joint)

[Wrocław, Lower Silesian Voivodeship, Poland] Counting pbf features: 2878335it [00:09, 298600.91it/s]
[Wrocław, Lower Silesian Voivodeship, Poland] Parsing pbf file #1: 100%|███████████████████| 2878335/2878335 [00:58<00:00, 49508.13it/s]


### Assign bike rental stations to regions to create training data for machine learning

In [11]:
bikes_joint = IntersectionJoiner().transform(train_regions, stations)

Select regions with stations as positive samples

In [12]:
positive_samples = train_regions.join(bikes_joint, how="inner")
positive_samples = positive_samples.reset_index().drop(columns=["feature_id"]).set_index("region_id")
positive_samples["is_positive"] = True
len(positive_samples)

223

The remaining regions are negative samples

In [13]:
negative_samples = train_regions.copy()
negative_samples["is_positive"] = False
negative_samples.loc[positive_samples.index, "is_positive"] = True
negative_samples = negative_samples[~negative_samples["is_positive"]]
len(negative_samples)

21114

This is very imbalanced! Let's undersample the negative class, keeping three negatives per positive, so that the model can be trained.
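
To put numbers on the imbalance, here is a small arithmetic sketch using the two counts printed above. The 3:1 ratio and `test_size=0.2` are the values used in the cells that follow; the variable names are illustrative only.

```python
import math

# Counts reported by the cells above.
n_positive = 223
n_negative = 21114

# Share of positive regions before undersampling.
share_before = n_positive / (n_positive + n_negative)

# Keep three negatives per positive.
n_negative_kept = 3 * n_positive
n_total = n_positive + n_negative_kept
share_after = n_positive / n_total

# With a float test_size, train_test_split rounds the test set size up.
n_test = math.ceil(0.2 * n_total)

print(f"positives before undersampling: {share_before:.2%}")
print(f"samples after undersampling: {n_total} ({share_after:.0%} positive)")
print(f"test set size at test_size=0.2: {n_test}")
```

Undersampling throws away most of the negative regions, so the reported metrics describe the balanced subset rather than the full city.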

In [14]:
negative_undersampled = negative_samples.sample(n=3 * len(positive_samples), random_state=42)
negative_undersampled

| region_id | geometry | is_positive |
|---|---|---|
| 8a1e2041b9a7fff | POLYGON ((16.96955 51.05518, 16.96920 51.05457... | False |
| 8a1e20470227fff | POLYGON ((17.11079 51.13765, 17.11045 51.13704... | False |
| 8a1e2045158ffff | POLYGON ((17.13916 51.08148, 17.13882 51.08087... | False |
| 8a1e20442d47fff | POLYGON ((17.15939 51.11020, 17.15904 51.10959... | False |
| 8a1e20430d1ffff | POLYGON ((16.83801 51.13481, 16.83766 51.13420... | False |
| ... | ... | ... |
| 8a1e20419557fff | POLYGON ((16.98588 51.06145, 16.98554 51.06083... | False |
| 8a1e204e5b97fff | POLYGON ((17.06842 51.07148, 17.06807 51.07087... | False |
| 8a1e204220effff | POLYGON ((16.90902 51.14892, 16.90867 51.14831... | False |
| 8a1e2041838ffff | POLYGON ((16.96905 51.06130, 16.96871 51.06069... | False |


We can inspect the training data on a map

In [15]:
train_data = pd.concat([positive_samples, negative_undersampled])
train_data.explore("is_positive", cmap=CB_SAFE_PALLETE, zoom_start=14, height=600, tiles="CartoDB positron")

### Train classifier

In [16]:
X = train_embeddings.loc[train_data.index].to_numpy()
y = train_data["is_positive"].astype(int).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [17]:
classifier = SVC(probability=True)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred_proba = classifier.predict_proba(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.96      0.92       134
           1       0.83      0.64      0.73        45

    accuracy                           0.88       179
   macro avg       0.86      0.80      0.82       179
weighted avg       0.87      0.88      0.87       179
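
In the report, class 1 stands for regions with a station: a precision of 0.83 means most regions flagged as stations really have one, while a recall of 0.64 means roughly a third of the station regions are missed. As a reminder of how these columns are defined, here is a plain-Python sketch. The labels are made up for illustration and are not the notebook's predictions; `binary_report` is a hypothetical helper, not part of scikit-learn.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1, support) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


# Made-up labels, for illustration only.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 1]

precision, recall, f1, support = binary_report(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```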



### Transfer knowledge to Basel

Let's repeat the embedding steps for the target city

In [18]:
target_regions = H3Regionalizer(resolution=H3_RESOLUTION).transform(target_area)
target_features = OSMPbfLoader().load(target_area, embedder_osm_tags)
target_joint = IntersectionJoiner().transform(target_regions, target_features)
target_embeddings = embedder.transform(target_regions, target_features, target_joint)

[Basel, Basel-City, Switzerland] Counting pbf features: 636239it [00:02, 266739.95it/s]
[Basel, Basel-City, Switzerland] Parsing pbf file #1: 100%|██████████████████████████████████| 636239/636239 [00:14<00:00, 42549.76it/s]


Now let's find the regions with a high predicted probability of being a station location
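
Besides plotting the scores, one might want a shortlist of the highest-scoring regions. The sketch below uses hypothetical region ids and probabilities, and `top_k_regions` is an illustrative helper rather than anything provided by `srai`.

```python
import heapq

# Hypothetical region ids and probabilities, for illustration only.
station_proba_by_region = {
    "region_a": 0.12,
    "region_b": 0.91,
    "region_c": 0.47,
    "region_d": 0.78,
}


def top_k_regions(proba_by_region, k):
    """Return the k (region_id, probability) pairs with the highest probability."""
    return heapq.nlargest(k, proba_by_region.items(), key=lambda item: item[1])


top_two = top_k_regions(station_proba_by_region, k=2)
print(top_two)
```

Once the `station_proba` column exists on `target_regions`, the pandas equivalent would be `target_regions.nlargest(k, "station_proba")`.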

In [20]:
# align embeddings with the row order of target_regions before scoring
station_probas = classifier.predict_proba(target_embeddings.loc[target_regions.index].to_numpy())

target_regions["station_proba"] = station_probas[:, 1]
plot_numeric_data(target_regions, "station_proba", colormap="Spectral_r")

### Way better results

Kamil's earlier project took this task more seriously. He used a larger selection of cities and obtained great results. See them here:

https://t.ly/gPEt9