# Example 1: Getting Started

A minimal end-to-end workflow that selects **4 representative months** from a year of hourly time-series data.

This example walks through the five pillars of the `energy-repset` framework:

| Pillar | Component | Choice in this example |
|--------|-----------|------------------------|
| **F** — Feature Space | How periods are compared | Statistical summaries (mean, std, min, max, quantiles, ramps) |
| **O** — Objective | What "representative" means | Wasserstein distance (marginal distribution fidelity) |
| **S** — Selection Space | What we pick from | All 4-of-12 monthly combinations (495 candidates) |
| **R** — Representation | How selected periods stand in for the year | Uniform weights (each month = 1/4 of the year) |
| **A** — Search Algorithm | How we find the best selection | Exhaustive generate-and-test with weighted-sum policy |

In [11]:
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio
pio.renderers.default = 'notebook_connected'  # render Plotly figures inline

## Load data

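One year (2015) of hourly time series with four variables: electricity demand (`load`) and capacity factors for onshore wind (`onwind`), offshore wind (`offwind`), and solar (`solar`). The source file also contains a `prices` column, which is dropped because this example does not use it.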

In [12]:
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
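# Parse the first column as a DatetimeIndex and name the column axis 'variable'.
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
# Drop the price series; only load and the capacity factors are used below.
df_raw = df_raw.drop(columns='prices')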
df_raw

| variable | load | onwind | offwind | solar |
|----------|------|--------|---------|-------|
| 2015-01-01 00:00:00 | 41.151 | 0.1566 | 0.7030 | 0.0 |
| 2015-01-01 01:00:00 | 40.135 | 0.1659 | 0.6875 | 0.0 |
| 2015-01-01 02:00:00 | 39.106 | 0.1746 | 0.6535 | 0.0 |
| 2015-01-01 03:00:00 | 38.765 | 0.1745 | 0.6803 | 0.0 |
| 2015-01-01 04:00:00 | 38.941 | 0.1826 | 0.7272 | 0.0 |
| ... | ... | ... | ... | ... |
| 2015-12-31 19:00:00 | 47.719 | 0.1388 | 0.4434 | 0.0 |
| 2015-12-31 20:00:00 | 45.911 | 0.1211 | 0.4023 | 0.0 |
| 2015-12-31 21:00:00 | 45.611 | 0.1082 | 0.4171 | 0.0 |
| 2015-12-31 22:00:00 | 43.762 | 0.1026 | 0.4716 | 0.0 |


## Define the problem context

The `TimeSlicer` divides the year into candidate periods — here, **12 calendar months**. The `ProblemContext` bundles the raw data and slicing logic into a single object that flows through the entire pipeline.

In [13]:
slicer = rep.TimeSlicer(unit="month")
context = rep.ProblemContext(df_raw=df_raw, slicer=slicer)
print(f"Candidate slices: {context.get_unique_slices()}")

Candidate slices: [Period('2015-01', 'M'), Period('2015-02', 'M'), Period('2015-03', 'M'), Period('2015-04', 'M'), Period('2015-05', 'M'), Period('2015-06', 'M'), Period('2015-07', 'M'), Period('2015-08', 'M'), Period('2015-09', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'), Period('2015-12', 'M')]
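
As a rough illustration of what slicing by month means, the sketch below groups a year of hourly timestamps into calendar months using only the standard library. It does not call `TimeSlicer`, and the names `timestamps` and `slices` are hypothetical; how the library implements slicing internally is not shown here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# One timestamp per hour of 2015 (not a leap year: 365 * 24 = 8760 hours).
start = datetime(2015, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(365 * 24)]

# Group timestamps by (year, month); each key plays the role of one candidate slice.
slices = defaultdict(list)
for ts in timestamps:
    slices[(ts.year, ts.month)].append(ts)

for key in sorted(slices):
    print(key, len(slices[key]), "hours")
```

Months differ in length (672 hours in February 2015, 744 in January), which is worth keeping in mind whenever selected months are later treated as equally sized.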


## Pillar F: Feature engineering

Before we can compare months, we need a numerical representation. `StandardStatsFeatureEngineer` computes a set of statistical summaries (mean, std, min, max, quantiles, ramp rates) per variable per month. This transforms each month into a fixed-length feature vector.

In [14]:
feature_engineer = rep.StandardStatsFeatureEngineer()
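
To make the idea of a fixed-length feature vector concrete, here is a minimal pure-Python sketch that reduces one variable in one period to a handful of summary statistics. The function `summarize` and the exact statistics it returns are assumptions chosen for illustration; they are not the feature set or the implementation of `StandardStatsFeatureEngineer`.

```python
import statistics

def summarize(values):
    """Reduce one variable's values in one period to a fixed set of statistics."""
    q25, q50, q75 = statistics.quantiles(values, n=4)
    # Ramp: absolute change between consecutive time steps.
    ramps = [abs(b - a) for a, b in zip(values, values[1:])]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "mean_abs_ramp": statistics.fmean(ramps),
    }

# Two periods of different length yield feature vectors of the same length.
short_period = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
long_period = summarize([2.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0])
print(len(short_period), len(long_period))
```

Because every period maps to the same set of keys, months with different numbers of hours become directly comparable.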

## Pillar O: Objective

We use a single score component: **Wasserstein fidelity**. It measures how closely the marginal distributions of the selected months match those of the full year. A lower distance means a better match.

With only one objective, the selection policy is straightforward: pick the combination with the lowest distance.

In [15]:
objective_set = rep.ObjectiveSet({
    'wasserstein': (1.0, rep.WassersteinFidelity()),
})
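
For intuition, the first-order Wasserstein distance between two one-dimensional samples equals the area between their empirical cumulative distribution functions. The sketch below computes that quantity in pure Python. The function `wasserstein_1d` is a hypothetical helper, and nothing here is taken from `WassersteinFidelity`, which may normalize values or aggregate across variables differently.

```python
from bisect import bisect_right

def wasserstein_1d(a, b):
    """First-order Wasserstein distance between two empirical 1-D samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The empirical CDFs are step functions that only change at sample values.
    grid = sorted(set(a_sorted) | set(b_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        total += abs(cdf_a - cdf_b) * (right - left)
    return total

# Toy numbers standing in for "full year" and "selected months" values.
full_year = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
selection = [0.0, 2.0, 5.0]
print(wasserstein_1d(full_year, selection))
```

Identical samples give a distance of zero, and shifting every value of a sample by a constant gives a distance equal to that shift.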

## Pillars S + A: Selection space and search

`ExhaustiveCombiGen` enumerates all $\binom{12}{4} = 495$ ways to pick 4 months from 12. Each candidate is scored by the objective, and `WeightedSumPolicy` (trivial here with one component) picks the winner.

In [16]:
k = 4
combi_gen = rep.ExhaustiveCombiGen(k=k)
policy = rep.WeightedSumPolicy()
search_algorithm = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
    objective_set, policy, combi_gen
)
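
The generate-and-test pattern can be sketched with the standard library alone: enumerate every 4-of-12 combination, score each one, and keep the minimum. The data (`monthly_value`) and the scoring function (`toy_score`) below are made up for illustration and are far simpler than the Wasserstein objective; the sketch does not use `ExhaustiveCombiGen` or `WeightedSumPolicy`.

```python
from itertools import combinations
from math import comb

months = list(range(1, 13))

# Hypothetical toy data: a single number per month.
monthly_value = {m: float(m) for m in months}
year_mean = sum(monthly_value.values()) / len(months)

def toy_score(selection):
    """Absolute gap between the selection's mean and the full-year mean."""
    sel_mean = sum(monthly_value[m] for m in selection) / len(selection)
    return abs(sel_mean - year_mean)

# Generate: every way to choose 4 of the 12 months.
candidates = list(combinations(months, 4))
print(len(candidates), comb(12, 4))

# Test: keep the candidate with the lowest score.
best = min(candidates, key=toy_score)
print(best, toy_score(best))
```

Exhaustive search is affordable here because 495 candidates is small; the number of combinations grows quickly with more candidate periods or a larger `k`.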

## Pillar R: Representation model

With uniform weights, each selected month represents exactly 1/4 of the year. This is the simplest model — no cluster assignment, no optimization of weights. It places the full burden on the *selection itself* being intrinsically representative.

In [17]:
representation_model = rep.UniformRepresentationModel()
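
A minimal sketch of uniform weighting, assuming nothing about the internals of `UniformRepresentationModel`: each of the `k` selected periods receives weight `1/k`, and any annual quantity is estimated as the weighted sum of the per-period values. The helper `uniform_weights` and the numbers in `period_means` are hypothetical.

```python
def uniform_weights(selection):
    """Give every selected period the same share of the year."""
    k = len(selection)
    return {period: 1.0 / k for period in selection}

# Period labels written as plain strings for illustration.
selection = ("2015-01", "2015-02", "2015-05", "2015-06")
weights = uniform_weights(selection)

# Made-up per-period means of some variable, used to estimate an annual mean.
period_means = {"2015-01": 50.0, "2015-02": 52.0, "2015-05": 44.0, "2015-06": 46.0}
estimate = sum(weights[p] * period_means[p] for p in selection)
print(weights)
print(estimate)
```

Since the weights are fixed, the quality of the estimate depends entirely on which periods were selected.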

## Run the workflow

In [18]:
workflow = rep.Workflow(feature_engineer, search_algorithm, representation_model)
experiment = rep.RepSetExperiment(context, workflow)
result = experiment.run()

Iterating over combinations: 100%|██████████| 495/495 [00:02<00:00, 215.34it/s]


## Inspect results

In [19]:
print(f"Selected months: {result.selection}")
print(f"Weights: {result.weights}")
print(f"Wasserstein score: {result.scores['wasserstein']:.4f}")

Selected months: (Period('2015-01', 'M'), Period('2015-02', 'M'), Period('2015-05', 'M'), Period('2015-06', 'M'))
Weights: {Period('2015-01', 'M'): 0.25, Period('2015-02', 'M'): 0.25, Period('2015-05', 'M'): 0.25, Period('2015-06', 'M'): 0.25}
Wasserstein score: 0.0684


## Diagnostic: responsibility weights

The bar chart below shows the weight assigned to each selected month. With uniform representation, all bars are equal at 0.25. The dashed line indicates the "ideal" uniform reference.

In [20]:
fig = diag.ResponsibilityBars().plot(result.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights (Uniform)')
fig.show()