In [None]:
%load_ext autoreload
%autoreload 2

<div class="main-title">
<h1>Machine Learning Applications</h1>
<p>Transfer learning and clustering</p>
</div>

## Transfer learning with bicycle rental stations

In this part we will see:
* How to use a pre-trained hex2vec model with srai
* How to train a classification model based on srai embeddings
* How to use srai to gather training data

In [None]:
# srai components used in this lesson
from srai.loaders import OSMOnlineLoader, OSMPbfLoader
from srai.regionalizers import H3Regionalizer, geocode_to_region_gdf
from srai.joiners import IntersectionJoiner
from srai.embedders import Hex2VecEmbedder

# classification model using scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

from utils import CB_SAFE_PALLETE
import pandas as pd

### Experiment description

In [None]:
SOURCE_CITY = "Wrocław, Poland"
TARGET_CITY = "Basel, Switzerland"
H3_RESOLUTION = 10

bike_stations_osm_tag = {"amenity": "bicycle_rental"}

In [None]:
source_area = geocode_to_region_gdf(SOURCE_CITY)
target_area = geocode_to_region_gdf(TARGET_CITY)

### Downloading bike rental stations

In [None]:
loader = OSMOnlineLoader()
stations = loader.load(source_area, bike_stations_osm_tag)
stations.explore()

### Load pre-trained embedding model

We use a pre-trained hex2vec model. It was trained by us on all Polish cities with more than 50k inhabitants. Models are available for download; the link is in [our repo](https://github.com/kraina-ai/srai#pre-trained-models-usage). For this tutorial, the model for resolution 10 is already downloaded and placed in the `models` directory.

In [None]:
embedder = Hex2VecEmbedder.load(f"models/hex2vec_{H3_RESOLUTION}_poland_50k")
embedder.expected_output_features

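We need to translate those features into OSM tags. Each feature name has the form `key_value`, so splitting on the first underscore recovers the tag key and its value.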

In [None]:
embedder_osm_tags = {}

for element in embedder.expected_output_features:
    key, value = element.split('_', 1)
    if key not in embedder_osm_tags:
        embedder_osm_tags[key] = [value]
    else:
        embedder_osm_tags[key].append(value)
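
As a quick illustration of the loop above, the same grouping can be sketched on a few made-up feature names. The names in `example_features` are hypothetical and only stand in for `embedder.expected_output_features`; `defaultdict` is used here as a compact alternative to the explicit `if`/`else`.

```python
from collections import defaultdict

# hypothetical feature names, standing in for embedder.expected_output_features
example_features = ["amenity_bicycle_rental", "amenity_cafe", "leisure_park"]

example_osm_tags = defaultdict(list)
for element in example_features:
    # split only on the first underscore, so values such as
    # "bicycle_rental" keep their own underscores
    key, value = element.split("_", 1)
    example_osm_tags[key].append(value)

# convert back to a plain dict of the form {key: [value, ...]}
example_osm_tags = dict(example_osm_tags)
print(example_osm_tags)
```

The result has the same `{key: [values]}` shape as `embedder_osm_tags`, which is what the loaders in the next section receive as a tag filter.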

### Load features from OSM and prepare regions

We need to load features to calculate embeddings for our cities. We will use OSMPbfLoader this time, since it is faster than OSMOnlineLoader when we have a lot of tags to download.

In [None]:
# load features
train_features = OSMPbfLoader().load(source_area, embedder_osm_tags)
# split into regions
train_regions = H3Regionalizer(resolution=H3_RESOLUTION).transform(source_area)
# join regions and features
train_joint = IntersectionJoiner().transform(train_regions, train_features)
# calculate embeddings
train_embeddings = embedder.transform(train_regions, train_features, train_joint)

### Assign bike rental stations to regions to create training data for machine learning

In [None]:
bikes_joint = IntersectionJoiner().transform(train_regions, stations)

Select regions with stations as positive samples

In [None]:
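positive_samples = train_regions.join(bikes_joint, how="inner")
positive_samples = positive_samples.reset_index().drop(columns=["feature_id"]).set_index("region_id")
# a region with several stations appears once per station; keep one row per region
positive_samples = positive_samples[~positive_samples.index.duplicated()].copy()
positive_samples["is_positive"] = True
len(positive_samples)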

The remaining regions are negative samples.

In [None]:
negative_samples = train_regions.copy()
negative_samples["is_positive"] = False
negative_samples.loc[positive_samples.index, "is_positive"] = True
negative_samples = negative_samples[~negative_samples["is_positive"]]
len(negative_samples)

This is very imbalanced! Let's undersample the negative class, keeping three negatives for each positive, so that the model can be trained.

In [None]:
negative_undersampled = negative_samples.sample(n=3 * len(positive_samples), random_state=42)
negative_undersampled
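
The undersampling step can be tried in isolation on a toy table. This is only a sketch: the `toy_*` names, the 100 regions and the 5 positives are made up, and the 3:1 ratio simply mirrors the cell above.

```python
import pandas as pd

# toy stand-in for the regions table: 100 regions, 5 of them with a station
toy_regions = pd.DataFrame(
    {"is_positive": [True] * 5 + [False] * 95},
    index=pd.Index([f"region_{i}" for i in range(100)], name="region_id"),
)

toy_positive = toy_regions[toy_regions["is_positive"]]
toy_negative = toy_regions[~toy_regions["is_positive"]]

# keep three negatives per positive, sampled without replacement
toy_negative_undersampled = toy_negative.sample(n=3 * len(toy_positive), random_state=42)

toy_train = pd.concat([toy_positive, toy_negative_undersampled])
print(toy_train["is_positive"].value_counts())
```

Note that `sample` raises an error when `n` is larger than the number of available rows, so this ratio only works while negatives outnumber positives at least three to one.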

We can see the training data on the map.

In [None]:
train_data = pd.concat([positive_samples, negative_undersampled])
train_data.explore("is_positive", cmap=CB_SAFE_PALLETE, zoom_start=14)

### Train classifier

In [None]:
X = train_embeddings.loc[train_data.index].to_numpy()
y = train_data["is_positive"].astype(int).to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
classifier = SVC(probability=True)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred_proba = classifier.predict_proba(X_test)

print(classification_report(y_test, y_pred))
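
The cell above computes `y_pred_proba` but only reports metrics based on the hard predictions in `y_pred`. One possible way to use the probabilities is a threshold-independent metric such as ROC AUC. The sketch below is not part of the original experiment: it uses synthetic data from `make_classification` so that it runs on its own, and all `toy_*` names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic, imbalanced data (roughly 3 negatives per positive)
# standing in for the embeddings and labels
X_toy, y_toy = make_classification(
    n_samples=400, n_features=16, weights=[0.75, 0.25], random_state=42
)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

toy_classifier = SVC(probability=True, random_state=42)
toy_classifier.fit(X_toy_train, y_toy_train)

# column 1 holds the probability of the positive class (label 1)
toy_proba = toy_classifier.predict_proba(X_toy_test)[:, 1]
toy_auc = roc_auc_score(y_toy_test, toy_proba)
print(f"ROC AUC: {toy_auc:.3f}")
```

In the notebook itself, the equivalent call would presumably be `roc_auc_score(y_test, y_pred_proba[:, 1])`.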

### Transfer knowledge to Basel

Let's repeat the embedding steps for the target city.

In [None]:
target_regions = H3Regionalizer(resolution=H3_RESOLUTION).transform(target_area)
target_features = OSMPbfLoader().load(target_area, embedder_osm_tags)
target_joint = IntersectionJoiner().transform(target_regions, target_features)
target_embeddings = embedder.transform(target_regions, target_features, target_joint)

And now let's find regions with a high score for a station location.

In [None]:
station_probas = classifier.predict_proba(target_embeddings.to_numpy())

target_regions["add_station"] = station_probas[:, 1] > 0.7
target_regions.explore("add_station", cmap=CB_SAFE_PALLETE)
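
To make the thresholding step concrete: `predict_proba` returns one column per class, ordered as in `classifier.classes_`, so with labels 0 and 1 the second column is the probability that a region is a good station location. A minimal sketch with made-up numbers (the `toy_*` names and values are illustrative only):

```python
import numpy as np

# made-up probabilities for three regions; each row is [P(class 0), P(class 1)]
toy_probas = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.35, 0.65],
])

THRESHOLD = 0.7  # same cut-off as in the cell above

# select the positive-class column and compare it with the threshold
toy_add_station = toy_probas[:, 1] > THRESHOLD
print(toy_add_station)
```

Only the second region passes the cut-off here. Lowering `THRESHOLD` marks more regions as candidates, at the cost of more false positives.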

### Way better results

Kamil's past project took this task more seriously. He used a larger selection of cities and obtained great results. See them here:

https://t.ly/gPEt9

## Highway2Vec Clustering and similarity search

In this part we will see:
<!-- * How to use a pre-trained hex2vec model with srai
* How to train classification model based on srai embeddings
* How to use srai to gather training data -->