## Probabilistic Clustering using K-Means
So far, station usage types were derived from a single k-means clustering applied to features aggregated over the full observation period of each station.

However, the temporal coverage differs between stations, as shown below.

As a consequence, cluster assignments may depend on the chosen observation window. To assess the temporal robustness of the clustering, we recompute the feature vectors and cluster assignments for a restricted time interval.

In [None]:
from data_io.loader.data_loader import DataLoader
import polars as pl

dl = DataLoader(city="Stadt_Heidelberg")

rows = []

for station in dl.get_bicyle_stations():
    bd = dl.get_bicycle(station_name=station)
    min_date, max_date = bd.date_range()
    rows.append({
        "station": station,
        "start_date": min_date,
        "end_date": max_date,
    })

df_dates = pl.DataFrame(rows)

df_dates = df_dates.with_columns(
    (pl.col("end_date") - pl.col("start_date"))
        .dt.total_days()
        .alias("summed_days")
)

df_dates

### Temporal robustness check

We focus on the period 2021–2024, since all stations are continuously observed from 2021 onwards (excluding Ernst-Walz-Brücke West – alt, which was removed in 2018).
The resulting clustering is compared to the clustering obtained from the full data set.

Cluster consistency is quantified using the Adjusted Rand Index (ARI), which measures the agreement between two clusterings while correcting for chance agreement.
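To make the chance correction concrete, the following is a minimal pure-Python sketch of the ARI computed from pair counts. It is an illustration only: the function name `adjusted_rand_index` is hypothetical and is not assumed to match the implementation behind `cluster_ari`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two label sequences over the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare, treat as perfect agreement

    # Number of item pairs that share a cluster in both, in A, and in B.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # upper bound of the index
    if maximum == expected:
        return 1.0  # both clusterings are trivial and identical in structure
    return (pairs_joint - expected) / (maximum - expected)


# Cluster IDs are arbitrary: a pure relabelling still counts as full agreement.
same_partition = adjusted_rand_index([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
# Two clusterings that split every pair apart score below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which stations share a cluster, it can be applied to two k-means runs directly, without first matching their numeric cluster IDs.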

In [None]:
N_CLUSTERS = 3

In [None]:
from analysis.visualization.characterisation.clustering import cluster_ari
from analysis.visualization.characterisation.clustering import kmeans_clustering
from analysis.visualization.characterisation.features import build_feature_df

interval_21_24 = ("2021-01-01", "2024-01-01")
features_21_24 = build_feature_df(dl, interval_21_24)

features_full = build_feature_df(dl)

clustering_full = kmeans_clustering(features=features_full, k=N_CLUSTERS)
clustering_21_24 = kmeans_clustering(features=features_21_24, k=N_CLUSTERS)

score = cluster_ari(clustering_full, clustering_21_24)
print(f"Adjusted Rand Index (full period vs. 2021-2024): {score:.3f}")

The ARI indicates that cluster assignments are sensitive to the selected time window.
This behaviour is expected, as traffic patterns may evolve over time and aggregated features depend on the available observation period.

We do not interpret this variability as a weakness of the feature set; instead, it motivates a probabilistic interpretation of station usage types.

### Probabilistic station classification
Instead of assigning each station to a single fixed cluster, we repeatedly apply k-means clustering over a sequence of time windows and compute cluster membership probabilities from the resulting assignments.

For each station $s$ and usage type $u$, we define:

$$
\begin{align*}
P(u \mid s) = \frac{\text{Number of assignments of station } s \text{ to type } u}{\text{Total number of clustering runs}}
\end{align*}
$$
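The definition above amounts to counting labels per station. A minimal sketch with made-up assignments follows; the station names, the `runs` structure, and the function `membership_probabilities` are hypothetical and only illustrate the counting. As an assumption, each station is normalised by the number of runs in which it appears, which coincides with the total number of runs when every station is present in every window.

```python
from collections import Counter

# Hypothetical assignments: one usage type per station and clustering run.
runs = [
    {"A": "utilitarian", "B": "mixed"},
    {"A": "utilitarian", "B": "recreational"},
    {"A": "utilitarian", "B": "mixed"},
    {"A": "mixed", "B": "mixed"},
]


def membership_probabilities(runs):
    """Relative frequency of each usage type per station across runs."""
    counts = {}
    totals = Counter()
    for run in runs:
        for station, usage_type in run.items():
            counts.setdefault(station, Counter())[usage_type] += 1
            totals[station] += 1
    return {
        station: {u: n / totals[station] for u, n in type_counts.items()}
        for station, type_counts in counts.items()
    }


probs = membership_probabilities(runs)
for station, distribution in probs.items():
    print(station, distribution)
```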

#### Clustering probabilities

To obtain a probabilistic and temporally robust station classification, k-means clustering is applied repeatedly using either a sliding or a cumulative time window.

- **Sliding window (`sliding`)**: Clustering is performed on a fixed-length window (e.g. 24 months) that is shifted forward month by month.
- **Cumulative window (`cumulative`)**: Clustering is performed on an expanding window with a fixed start date.

In the following, we use a two-year sliding window that is shifted forward month by month, so consecutive windows overlap by 23 months and differ only slightly. A fixed-length window captures temporal changes in station usage more effectively than the cumulative approach, which increasingly aggregates long-term behaviour.
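The difference between the two modes comes down to how the window start is chosen. The sketch below shows one way to derive the bounds for a given end date. The helpers `add_months` and `window_bounds` are hypothetical and are not the project's `make_interval`; in particular, returning `None` while a full sliding window does not yet fit is an assumption.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date by a (possibly negative) number of months."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def window_bounds(start: date, end: date, mode: str, window_months: int = 24):
    """Return (window_start, window_end), or None if no valid window exists."""
    if mode == "cumulative":
        # Expanding window: the start date stays fixed.
        return (start, end) if end > start else None
    if mode == "sliding":
        # Fixed-length window: the start date moves together with the end date.
        window_start = add_months(end, -window_months)
        return (window_start, end) if window_start >= start else None
    raise ValueError(f"unknown mode: {mode!r}")


dataset_start = date(2016, 1, 1)
print(window_bounds(dataset_start, date(2019, 6, 1), "sliding"))
print(window_bounds(dataset_start, date(2019, 6, 1), "cumulative"))
```

With these assumptions, the sliding window always spans exactly `window_months` months, while the cumulative window grows with every step.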

In [None]:
from analysis.visualization.characterisation.clustering import monthly_dates, make_interval

DATASET_START = "2016-01-01"
DATASET_END = "2025-01-01"
TIME_SERIES_MODE = "sliding"
WINDOW_MONTHS = 24


dates = monthly_dates(start=DATASET_START, end=DATASET_END)
for d in dates:
    interval = make_interval(start=DATASET_START, end=d, mode=TIME_SERIES_MODE, window_months=WINDOW_MONTHS)
    if interval is None:
        continue
    print(interval)

### Cluster Interpretation

K-means assigns arbitrary numeric cluster IDs without semantic meaning.  
To obtain interpretable usage types, clusters are labelled post hoc based on their aggregated feature characteristics.

For each cluster, mean feature values (DPI, WSD, SDI) are computed and standardised.

A scalar utilitarian score is defined as

$$
\text{Utilitarian Score} = \text{DPI} + \text{WSD} - \text{SDI}
$$

Clusters are ordered by this score and labelled as **recreational**, **mixed**, and **utilitarian** from low to high values.  
Labels are derived from the full-period feature representation.
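The labelling rule can be sketched in a few lines. The cluster means below are invented for illustration, and standardising each feature as a z-score across the cluster means is an assumption about what "standardised" means here; the names `cluster_means`, `zscores`, and `labels` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical per-cluster mean feature values (not taken from the data set).
cluster_means = {
    0: {"DPI": 0.8, "WSD": 0.6, "SDI": 0.2},
    1: {"DPI": 0.3, "WSD": 0.1, "SDI": 0.9},
    2: {"DPI": 0.5, "WSD": 0.4, "SDI": 0.5},
}


def zscores(values):
    """Standardise values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


cluster_ids = sorted(cluster_means)
standardised = {
    feature: dict(zip(cluster_ids, zscores([cluster_means[c][feature] for c in cluster_ids])))
    for feature in ("DPI", "WSD", "SDI")
}

# Utilitarian score = DPI + WSD - SDI on the standardised cluster means.
score = {
    c: standardised["DPI"][c] + standardised["WSD"][c] - standardised["SDI"][c]
    for c in cluster_ids
}

# Lowest score -> recreational, highest score -> utilitarian.
ordered = sorted(cluster_ids, key=score.get)
labels = dict(zip(ordered, ["recreational", "mixed", "utilitarian"]))
print(labels)
```

Because the labels follow from the score ordering, they stay meaningful even when k-means permutes the numeric cluster IDs between runs.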

In [None]:
X = build_feature_df(dl)

EXCLUDE = {"station", "valid", "cluster", "date"}

FEATURES = [
    c for c in X.columns
    if c not in EXCLUDE and X[c].dtype in (pl.Float32, pl.Float64)
]

In [None]:
from analysis.visualization.characterisation.clustering import cluster_timeseries_usage, usage_probabilities

usage = cluster_timeseries_usage(
    loader=dl,
    k=N_CLUSTERS,
    features=FEATURES,
    start=DATASET_START,
    end=DATASET_END,
    mode=TIME_SERIES_MODE,
    window_months=WINDOW_MONTHS
)

usage_probs = usage_probabilities(usage).sort(["station", "probability"], descending=True)

In [None]:
from analysis.visualization.characterisation.plotting import plot_cluster_probabilities_ci

plot_cluster_probabilities_ci(cluster_probs_ci=usage_probs, station_col="station", prob_col="probability", lo_col="ci_low", hi_col="ci_high")

In [None]:
top_per_usage_type = (
    usage_probs
    .sort("probability", descending=True)
    .group_by("usage_type")
    .head(20)  
)

top_per_usage_type

### Temporal Stability of Station Usage Types

To quantify the temporal stability of station usage patterns, we compute the Shannon entropy of the cluster membership distribution for each station. For a station $s$ with usage probabilities $P(u \mid s)$, the entropy is defined as

$$
H(s) = -\sum_{u} P(u \mid s)\,\log P(u \mid s)
$$
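As a quick numerical illustration of this definition, the sketch below evaluates the entropy for three made-up probability vectors. It assumes the natural logarithm and treats terms with $P(u \mid s) = 0$ as contributing zero; `shannon_entropy` is a hypothetical helper, not the project's `entropy` function.

```python
from math import log


def shannon_entropy(probabilities):
    """Shannon entropy in nats; terms with p = 0 contribute nothing."""
    return -sum(p * log(p) for p in probabilities if p > 0)


stable = shannon_entropy([1.0, 0.0, 0.0])      # always the same usage type
leaning = shannon_entropy([0.8, 0.15, 0.05])   # mostly one usage type
uniform = shannon_entropy([1 / 3, 1 / 3, 1 / 3])  # maximally ambiguous

print(f"stable={stable:.3f}, leaning={leaning:.3f}, uniform={uniform:.3f}")
```

Under these assumptions the entropy ranges from 0 for a station that is always assigned to the same type up to $\log 3$ for a station that is spread evenly over three usage types.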

In [None]:
from analysis.visualization.characterisation.helpers import entropy, dominant_usage_per_station

entropy_df = entropy(usage_probs=usage_probs)
dominant_usage = dominant_usage_per_station(usage_probs=usage_probs)

entropy_labeled = (
    entropy_df
    .join(dominant_usage, on="station", how="left")
    .sort("entropy")
)

entropy_labeled

Stations with near-deterministic usage probabilities show minimal entropy, indicating stable temporal behaviour.

Higher entropy values occur predominantly for mixed-use stations (see plot below), reflecting structurally variable or context-dependent usage patterns.

In [None]:
from analysis.visualization.characterisation.plotting import plot_usage_entropy

plot_usage_entropy(entropy_df=entropy_labeled)

### Map View of Clustering
Let's look at how the clustering appears on a map.

#### Dynamic Map

In [None]:
from analysis.visualization.characterisation.map.dynamic_map import bicycle_station_cluster_map
from IPython.display import display, clear_output

clear_output(wait=True)
display(bicycle_station_cluster_map(loader=dl, usage_probs=usage_probs))

#### Static Map

In [None]:
from analysis.visualization.characterisation.map.static_map import plot_bicycle_usage_map

fig, ax = plot_bicycle_usage_map(
    usage_probs=usage_probs,
    loader=dl,
    city_geojson_path="analysis/visualization/characterisation/map/stadtgrenze_heidelberg.geojson",
    save=False,
    zoom=0.37,
    shift_y=0,
    shift_x=0.015,
)