# Example 7: K-Medoids Clustering

K-medoids clustering is a **constructive** (Workflow Type 2) algorithm that partitions the feature space into $k$ clusters and selects the **medoid** of each cluster as a representative period. Unlike k-means, which produces synthetic centroids, k-medoids always selects actual data points, which makes it a natural fit for representative period selection.
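
To make the centroid/medoid distinction concrete, here is a small NumPy sketch. The five feature vectors are made-up values, and the Euclidean distance is an assumption chosen for illustration; it does not use `energy_repset`.

```python
import numpy as np

# Hypothetical 2-D feature vectors for five periods (illustrative values only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [4.0, 4.0]])

# k-means style representative: the mean, which is generally not a row of X.
centroid = X.mean(axis=0)
centroid_is_data_point = bool((X == centroid).all(axis=1).any())

# Medoid: the row with the smallest total distance to all other rows.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoid_idx = int(dist.sum(axis=1).argmin())
medoid = X[medoid_idx]

print("centroid:", centroid, "| is a data point:", centroid_is_data_point)
print("medoid:  ", medoid, "| row index:", medoid_idx)
```

The centroid lands between the points, while the medoid is always one of the original rows, so it maps back to a real period.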

Key properties:

- **Internal objective**: minimizes the within-cluster sum of squares (WCSS)
- **Weights**: pre-computed as cluster-size fractions ($w_j = n_j / N$)
- **No external ObjectiveSet needed**: the algorithm has its own built-in objective
- **Fast**: converges in a few iterations for typical problem sizes
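
The properties above can be illustrated with a minimal NumPy sketch of alternating k-medoids. This is not the implementation behind `rep.KMedoidsSearch`, whose internals are not shown here. The function name `k_medoids`, the squared Euclidean distance used for the WCSS, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating k-medoids on the rows of X (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all rows.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    medoids = rng.choice(n, size=k, replace=False)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = d2[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:  # possible only with duplicate points
                continue
            # Update step: pick the member with the smallest total distance
            # to the other members of its cluster.
            within = d2[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids

    labels = d2[:, medoids].argmin(axis=1)
    wcss = float(d2[np.arange(n), medoids[labels]].sum())
    weights = np.bincount(labels, minlength=k) / n  # w_j = n_j / N
    return medoids, labels, wcss, weights

# Toy data: nine points in three loose groups (made-up values).
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [20, 0], [20, 1], [21, 0]], dtype=float)
medoids, labels, wcss, weights = k_medoids(X, k=3, seed=42)
print("medoids:", medoids, "weights:", weights, "WCSS:", wcss)
```

Because the result depends on the random initialization, a production implementation would typically use a better seeding strategy or several restarts.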

In [None]:
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'

In [None]:
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)

---
## Monthly K-Medoids

We select 4 representative months from 12 using k-medoids clustering on statistical features. The algorithm partitions the 12 months into 4 clusters and picks the medoid of each.

In [None]:
context = rep.ProblemContext(df_raw=df_raw, slicer=rep.TimeSlicer(unit="month"))

workflow = rep.Workflow(
    feature_engineer=rep.StandardStatsFeatureEngineer(),
    search_algorithm=rep.KMedoidsSearch(k=4, random_state=42),
)
experiment = rep.RepSetExperiment(context, workflow)
result = experiment.run()

print(f"Selection: {result.selection}")
print(f"WCSS:      {result.scores['wcss']:.4f}")
print(f"Weights:   { {str(k): round(v, 3) for k, v in result.weights.items()} }")

In [None]:
if 'cluster_info' in result.diagnostics:
    print("Cluster membership:")
    for info in result.diagnostics['cluster_info']:
        print(f"  Cluster {info['cluster']}: medoid={info['medoid']}, "
              f"size={info['size']}, members={info['members']}")

In [None]:
fig = diag.ResponsibilityBars().plot(result.weights, show_uniform_reference=True)
fig.update_layout(title='K-Medoids: Responsibility Weights (Cluster Fractions)')
fig.show()

In [None]:
feature_ctx = experiment.feature_context
cols = list(feature_ctx.df_features.columns[:2])

fig = diag.FeatureSpaceScatter2D().plot(
    feature_ctx.df_features, x=cols[0], y=cols[1], selection=result.selection
)
fig.update_layout(title='K-Medoids: Feature Space (First Two Features)')
fig.show()

In [None]:
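# Recreate the monthly slicer to map the selected months back to timestamps
slicer = rep.TimeSlicer(unit="month")
selected_idx = slicer.get_indices_for_slice_combi(df_raw.index, result.selection)
df_sel = df_raw.loc[selected_idx]  # raw data restricted to the selected months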

fig = diag.DistributionOverlayECDF().plot(df_raw['load'], df_sel['load'])
fig.update_layout(title='K-Medoids: ECDF Overlay (Load)')
fig.show()

---
## Effect of k

More clusters mean lower WCSS (tighter clusters), but also more representative periods and therefore less compression. Let's compare k=3, k=4, and k=6.
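
The trade-off can be sketched without the library by brute-forcing the best medoid set for a small problem. The random feature matrix below is a synthetic stand-in for 12 monthly feature vectors, `best_wcss` is a hypothetical helper, and WCSS is assumed to mean the sum of squared Euclidean distances to the nearest medoid, which may differ from the library's definition.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))  # synthetic stand-in for 12 monthly feature vectors
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

def best_wcss(d2, k):
    """Lowest achievable WCSS over all k-subsets of medoids (brute force)."""
    n = d2.shape[0]
    return min(
        float(d2[:, list(c)].min(axis=1).sum())
        for c in combinations(range(n), k)
    )

# k = 12 is the degenerate case: every month represents itself.
wcss_by_k = {k: best_wcss(d2, k) for k in (3, 4, 6, 12)}
for k, value in wcss_by_k.items():
    print(f"k={k:>2}  best WCSS={value:.4f}  compression={12 / k:.1f}x")
```

With an exhaustive search the optimal WCSS can only stay the same or drop as k grows, reaching zero at k = 12. A heuristic search with a fixed seed usually shows the same trend but does not guarantee it.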

In [None]:
results_by_k = {}
for k in [3, 4, 6]:
    wf = rep.Workflow(
        feature_engineer=rep.StandardStatsFeatureEngineer(),
        search_algorithm=rep.KMedoidsSearch(k=k, random_state=42),
    )
    res = rep.RepSetExperiment(context, wf).run()
    results_by_k[k] = res

print(f"{'k':>3}  {'WCSS':>10}  {'Selection'}")
print("-" * 50)
for k, res in results_by_k.items():
    print(f"{k:>3}  {res.scores['wcss']:>10.4f}  {res.selection}")

---
## Summary

K-medoids clustering is a good default when you want a **standard, fast, well-understood** clustering-based selection without additional constraints. It works well for monthly or weekly slicing and produces cluster-size-proportional weights automatically.

For contiguous temporal segments, use [CTPC](ex6_constructive_algorithms.ipynb) instead. For multi-day subsequences, use the [Snippet](ex6_constructive_algorithms.ipynb) algorithm.