# Example 2: Feature Space Exploration

This notebook explores how **feature engineering** and **objective choice** shape the selection process. We:

1. Start with statistical features and examine why dimensionality reduction is needed
2. Use PCA to compress the feature space, guided by variance analysis
3. Run two experiments with different objectives to see how they influence the result
4. Compare the two selections side-by-side in feature space

Key concepts introduced:
- Manual feature engineering chain (stats then PCA)
- PCA variance analysis to choose the number of components
- Multi-objective selection with `CentroidBalance` vs `DiversityReward`
- `SelectionComparisonScatterMatrix` for comparing multiple selections

In [1]:
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'

In [2]:
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)

slicer = rep.TimeSlicer(unit="month")
context = rep.ProblemContext(df_raw=df_raw, slicer=slicer)
print(f"{len(context.get_unique_slices())} candidate monthly slices")

12 candidate monthly slices


## Statistical features: a first look

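We begin with `StandardStatsFeatureEngineer`, which computes summary statistics (mean, std, quantiles, ramp rates, etc.) for each variable and slice, plus pairwise correlations between variables (the `corr__` columns in the output). This gives us a multi-dimensional profile for every candidate month. The values are not in raw units: the negative means suggest the features are standardised across slices.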

In [3]:
stats_eng = rep.StandardStatsFeatureEngineer()
context_stats = stats_eng.run(context)
print(f"{context_stats.df_features.shape[1]} statistical features for {context_stats.df_features.shape[0]} slices")
context_stats.df_features.head(3)

38 statistical features for 12 slices


| | mean__load | mean__onwind | mean__offwind | mean__solar | std__load | std__onwind | std__offwind | std__solar | q10__load | q10__onwind | ... | ramp_std__load | ramp_std__onwind | ramp_std__offwind | ramp_std__solar | corr__load__onwind | corr__load__offwind | corr__load__solar | corr__onwind__offwind | corr__onwind__solar | corr__offwind__solar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01 | 0.790427 | 1.623676 | 1.22718 | -1.400283 | 0.552545 | 1.875624 | 1.933594 | -1.648768 | 0.623443 | 0.404962 | ... | 0.484976 | -0.042419 | 0.637891 | -1.853069 | -0.161411 | -1.234681 | -0.919195 | 0.790412 | 0.522001 | 0.45811 |
| 2015-02 | 1.454759 | -0.178153 | 0.03289 | -0.779126 | -1.266598 | -0.386469 | 0.387692 | -0.710218 | 1.967376 | 0.269301 | ... | -0.11069 | -0.511887 | 0.024703 | -0.629509 | -0.948116 | -1.024592 | -0.810737 | -1.246019 | 0.26222 | 1.687519 |
| 2015-03 | 0.845602 | 0.381202 | 0.172626 | -0.018479 | -1.705158 | 0.914114 | 0.375525 | 0.222239 | 1.09817 | -0.158519 | ... | -0.465317 | 0.921492 | 0.768165 | 0.394653 | 0.160418 | -0.492397 | 0.132615 | 0.368639 | -0.497747 | -1.376189 |
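
To make the idea of a per-slice feature table concrete, here is a rough sketch using only pandas on synthetic data. The chosen statistics (mean, std, 10 % quantile, std of hourly ramps), the `stat__variable` naming, and the z-scoring step are assumptions that mirror the column names in the output, not the library's actual implementation.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for one year (hypothetical stand-in for df_raw).
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
df = pd.DataFrame(
    rng.random((len(index), 2)), index=index, columns=["load", "solar"]
)

def monthly_stats(df: pd.DataFrame) -> pd.DataFrame:
    """One row per month, one column per (statistic, variable) pair."""
    months = df.index.to_period("M")
    parts = {
        "mean": df.groupby(months).mean(),
        "std": df.groupby(months).std(),
        "q10": df.groupby(months).quantile(0.10),
        # Ramp = hour-to-hour change; its spread measures variability.
        "ramp_std": df.diff().groupby(months).std(),
    }
    frames = [part.add_prefix(f"{name}__") for name, part in parts.items()]
    return pd.concat(frames, axis=1)

features = monthly_stats(df)
# Z-score each feature across the 12 months so scales are comparable.
features_z = (features - features.mean()) / features.std()
print(features_z.shape)
```

Z-scoring across slices would explain why a quantity such as mean solar output can appear negative in the table: each value is expressed relative to the average month.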


### How do the months compare on average values?

A scatter matrix of just the mean features shows how months relate in terms of their average load, onshore wind, offshore wind, and solar generation. This is a useful starting point, but we want fidelity across *all* statistical dimensions — not just means.

In [4]:
mean_cols = [c for c in context_stats.df_features.columns if c.startswith('mean__')]
fig = diag.FeatureSpaceScatterMatrix().plot(context_stats.df_features, dimensions=mean_cols)
fig.update_layout(title='Scatter Matrix: Monthly Means')
fig.show()

### The curse of dimensionality

With only 12 data points and 38 features, many of which are highly correlated, distance-based comparisons become unreliable. The feature correlation heatmap reveals substantial redundancy:

In [5]:
fig = diag.FeatureCorrelationHeatmap().plot(context_stats.df_features, method='pearson')
fig.update_layout(title='Feature Correlation Matrix (Statistical Features)')
fig.show()
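
Why distances become unreliable can be illustrated with a small NumPy experiment. The sketch below draws 12 random points and compares the smallest and largest pairwise distance in 2 and in 38 dimensions. It uses independent Gaussian noise, which is an assumption made for illustration; the notebook's features are correlated, so the effect on the real data may be weaker.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int,
                      n_trials: int = 200, seed: int = 0) -> float:
    """Mean ratio of smallest to largest pairwise distance for random points."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(n_trials):
        x = rng.standard_normal((n_points, n_dims))
        # Full pairwise Euclidean distance matrix via broadcasting.
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        pairs = d[np.triu_indices(n_points, k=1)]
        ratios.append(pairs.min() / pairs.max())
    return float(np.mean(ratios))

low = distance_contrast(n_points=12, n_dims=2)
high = distance_contrast(n_points=12, n_dims=38)
print(f"min/max distance ratio: 2 dims = {low:.2f}, 38 dims = {high:.2f}")
```

A ratio close to 1 means the nearest and the farthest pair of months are almost equally far apart, so "closest match" carries little information.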

## Dimensionality reduction with PCA

PCA projects the correlated features onto orthogonal axes ordered by explained variance. Keeping only the leading components shrinks the feature space while retaining most of its structure.

In [6]:
pca_full = rep.PCAFeatureEngineer()
context_full_pca = pca_full.run(context_stats)

fig = diag.PCAVarianceExplained(pca_full).plot(show_cumulative=True)
fig.update_layout(title='PCA Variance Explained')
fig.show()

The cumulative curve bends clearly at around 4 components; beyond that, each additional PC adds very little. Four PCs therefore capture the essential structure while keeping the feature space compact.
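
The variance analysis itself can be reproduced with plain NumPy. The sketch below computes explained-variance ratios from the singular values of centred data and picks the smallest number of components that reaches a threshold. The synthetic data (4 latent factors) and the 90 % threshold are assumptions for illustration; the notebook chooses the number of components by eye from the plot.

```python
import numpy as np

def explained_variance_ratio(x: np.ndarray) -> np.ndarray:
    """Share of total variance carried by each principal component."""
    centred = x - x.mean(axis=0)
    # Squared singular values of the centred data are proportional
    # to the variances along the principal components.
    s = np.linalg.svd(centred, compute_uv=False)
    var = s**2
    return var / var.sum()

# Synthetic stand-in: 12 slices, 38 features driven by 4 latent factors.
rng = np.random.default_rng(0)
latent = rng.standard_normal((12, 4))
mixing = rng.standard_normal((4, 38))
x = latent @ mixing + 0.1 * rng.standard_normal((12, 38))

ratio = explained_variance_ratio(x)
cumulative = np.cumsum(ratio)
# Smallest number of components reaching the (hypothetical) 90 % threshold.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(np.round(cumulative[:6], 3), n_components)
```

Note that with 12 slices, centring leaves at most 11 components with non-zero variance, no matter how many input features there are.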

In [7]:
fig = diag.FeatureSpaceScatterMatrix().plot(
    context_full_pca.df_features, dimensions=['pc_0', 'pc_1', 'pc_2', 'pc_3']
)
fig.update_layout(title='Scatter Matrix: First 4 Principal Components')
fig.show()

## Narrowing the feature space

Based on the variance analysis, we keep the first 4 PCs and create a dedicated 4-PC feature context that both experiments below share.

In [8]:
pca_4 = rep.PCAFeatureEngineer(n_components=4)
context_4pc = pca_4.run(context_stats)
context_4pc.df_features

| | pc_0 | pc_1 | pc_2 | pc_3 |
|---|---|---|---|---|
| 2015-01 | 6.184021 | 0.96444 | -0.653615 | 0.223279 |
| 2015-02 | 1.934236 | -3.160402 | -1.649667 | 2.709401 |
| 2015-03 | 1.081686 | 0.569398 | -3.265472 | 0.251567 |
| 2015-04 | -4.15623 | 1.690578 | -0.116559 | -0.684499 |
| 2015-05 | -3.367315 | 3.007706 | 1.534711 | 0.538564 |
| 2015-06 | -4.609243 | 0.098181 | 0.590968 | -0.152467 |
| 2015-07 | -2.18337 | 2.267731 | -1.127886 | 0.872675 |
| 2015-08 | -5.217145 | -1.117851 | -0.008774 | -0.607658 |
| 2015-09 | -0.651842 | -0.811346 | 0.338338 | -0.044285 |
| 2015-10 | -1.720184 | -4.710915 | 1.411115 | -1.246839 |


## Experiment A: Balanced selection

Our first objective set combines:
- **Wasserstein fidelity**: marginal distribution similarity
- **Correlation fidelity**: preservation of cross-variable dependencies
- **Centroid balance**: penalises selections whose feature centroid deviates from the data centroid

The `ParetoMaxMinStrategy` picks the Pareto-optimal combination that maximises the *worst* objective — a balanced, conservative choice.

In [9]:
obj_balanced = rep.ObjectiveSet({
    'wasserstein': (0.5, rep.WassersteinFidelity()),
    'correlation': (0.5, rep.CorrelationFidelity()),
    'centroid_balance': (0.5, rep.CentroidBalance()),
})

k = 3
search_a = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
    obj_balanced,
    rep.ParetoMaxMinStrategy(),
    rep.ExhaustiveCombiGen(k=k),
)
repr_model = rep.KMedoidsClustersizeRepresentation()

workflow_a = rep.Workflow(pca_4, search_a, repr_model)
exp_a = rep.RepSetExperiment(context_4pc, workflow_a)
result_a = exp_a.run()

print(f"Selected months (A): {result_a.selection}")
print(f"Weights (A):         {result_a.weights}")
print(f"Scores (A):          {result_a.scores}")

Iterating over combinations: 100%|██████████| 220/220 [00:01<00:00, 201.66it/s]

Selected months (A): (Period('2015-01', 'M'), Period('2015-04', 'M'), Period('2015-09', 'M'))
Weights (A):         {Period('2015-01', 'M'): 0.25, Period('2015-04', 'M'): 0.4166666666666667, Period('2015-09', 'M'): 0.3333333333333333}
Scores (A):          {'wasserstein': 0.20025433806827211, 'correlation': 0.0231217738568774, 'centroid_balance': 0.7982187924943351}
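
The progress bar reports 220 combinations, which is the number of ways to choose 3 months out of 12. The selection rule described above (keep the Pareto-optimal candidates, then pick the one whose worst objective is best) can be sketched as follows. This sketch assumes every objective is expressed as "higher is better" and min-max normalised over the Pareto front; how `ParetoMaxMinStrategy` handles objective directions and scaling internally is not shown in this notebook, so treat the details as assumptions.

```python
import numpy as np

def pareto_mask(scores: np.ndarray) -> np.ndarray:
    """True for rows not dominated by any other row (higher is better)."""
    n = scores.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is >= everywhere and > somewhere.
        better_eq = np.all(scores >= scores[i], axis=1)
        strictly = np.any(scores > scores[i], axis=1)
        keep[i] = not np.any(better_eq & strictly)
    return keep

def max_min_choice(scores: np.ndarray) -> int:
    """Index of the Pareto-optimal row whose worst normalised score is largest."""
    front = np.flatnonzero(pareto_mask(scores))
    sub = scores[front]
    span = sub.max(axis=0) - sub.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant objectives
    norm = (sub - sub.min(axis=0)) / span
    return int(front[np.argmax(norm.min(axis=1))])

# Four hypothetical candidates scored on two objectives (higher is better).
scores = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.4, 0.4]])
print(pareto_mask(scores), max_min_choice(scores))
```

In this toy case the last candidate is dominated by the second one, and the second one wins because neither of its objectives is poor, even though it is best at nothing.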





## Experiment B: Diversity-focused selection

We replace `CentroidBalance` with `DiversityReward`, which favours selections that are maximally spread out in feature space (large pairwise distances). This can pull the selection towards extreme months rather than central ones.

In [10]:
obj_diverse = rep.ObjectiveSet({
    'wasserstein': (0.5, rep.WassersteinFidelity()),
    'correlation': (0.5, rep.CorrelationFidelity()),
    'diversity': (0.5, rep.DiversityReward()),
})

search_b = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
    obj_diverse,
    rep.ParetoMaxMinStrategy(),
    rep.ExhaustiveCombiGen(k=k),
)

workflow_b = rep.Workflow(pca_4, search_b, rep.KMedoidsClustersizeRepresentation())
exp_b = rep.RepSetExperiment(context_4pc, workflow_b)
result_b = exp_b.run()

print(f"Selected months (B): {result_b.selection}")
print(f"Weights (B):         {result_b.weights}")
print(f"Scores (B):          {result_b.scores}")

Iterating over combinations: 100%|██████████| 220/220 [00:01<00:00, 216.62it/s]

Selected months (B): (Period('2015-07', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'))
Weights (B):         {Period('2015-07', 'M'): 0.5833333333333334, Period('2015-10', 'M'): 0.16666666666666666, Period('2015-11', 'M'): 0.25}
Scores (B):          {'wasserstein': 0.17951351867468845, 'correlation': 0.04652147452533339, 'diversity': 9.132970566687343}
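
The two objectives swapped between Experiments A and B measure different things, which a toy example can make tangible. The sketch below defines a centroid deviation (distance between the selection's centroid and the centroid of all points) and a mean pairwise distance as a diversity measure. Both formulas are plausible readings of the descriptions above, not the library's actual definitions of `CentroidBalance` and `DiversityReward`.

```python
import numpy as np

def centroid_deviation(features: np.ndarray, selection: list) -> float:
    """Distance between the selection's centroid and the centroid of all points."""
    gap = features[selection].mean(axis=0) - features.mean(axis=0)
    return float(np.linalg.norm(gap))

def mean_pairwise_distance(features: np.ndarray, selection: list) -> float:
    """Average Euclidean distance between all pairs of selected points."""
    pts = features[selection]
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return float(d[np.triu_indices(len(selection), k=1)].mean())

# Toy 2-D feature space: three central points and three at the extremes.
features = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0],
                     [5.0, 5.0], [-5.0, 5.0], [0.0, -10.0]])
selections = {"central": [0, 1, 2], "extreme": [3, 4, 5], "one_sided": [1, 3, 4]}

for name, sel in selections.items():
    print(name,
          round(centroid_deviation(features, sel), 2),
          round(mean_pairwise_distance(features, sel), 2))
```

The central and the extreme selection are both perfectly balanced around the data centroid, yet only the extreme one is spread out, while the one-sided selection is fairly spread out but off-centre. A diversity objective and a centroid objective can therefore rank the same candidates very differently.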





## Comparing the two selections

How does the objective shape the result? Let's compare which months each experiment chose.

In [11]:
set_a = set(result_a.selection)
set_b = set(result_b.selection)

print(f"Experiment A (Balanced):  {result_a.selection}")
print(f"Experiment B (Diverse):   {result_b.selection}")
print(f"Overlap:                  {set_a & set_b or 'none'}")
print(f"Only in A:                {set_a - set_b or 'none'}")
print(f"Only in B:                {set_b - set_a or 'none'}")

Experiment A (Balanced):  (Period('2015-01', 'M'), Period('2015-04', 'M'), Period('2015-09', 'M'))
Experiment B (Diverse):   (Period('2015-07', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'))
Overlap:                  none
Only in A:                {Period('2015-09', 'M'), Period('2015-01', 'M'), Period('2015-04', 'M')}
Only in B:                {Period('2015-11', 'M'), Period('2015-07', 'M'), Period('2015-10', 'M')}


### Side-by-side in feature space

The `SelectionComparisonScatterMatrix` plots both selections on top of the full feature space. Distinct markers and colours make it easy to spot where the objectives push the selection.

In [12]:
fig = diag.SelectionComparisonScatterMatrix().plot(
    context_4pc.df_features,
    selections={
        'A: Balanced': result_a.selection,
        'B: Diverse': result_b.selection,
    },
    dimensions=['pc_0', 'pc_1', 'pc_2', 'pc_3'],
)
fig.update_layout(title='Selection Comparison in PCA Feature Space')
fig.show()

## Per-experiment diagnostics

### Responsibility weights

The `KMedoidsClustersizeRepresentation` assigns each representative a weight proportional to the number of months it "covers". The dashed line marks the uniform weight (1/k) for reference.

In [13]:
fig = diag.ResponsibilityBars().plot(result_a.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights — Experiment A (Balanced)')
fig.show()

In [14]:
fig = diag.ResponsibilityBars().plot(result_b.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights — Experiment B (Diverse)')
fig.show()
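
The weights printed for Experiment A (0.25, 0.4167, 0.3333) correspond to 3, 5, and 4 of the 12 months. A minimal sketch of such coverage weights is shown below. It assumes each month is assigned to its nearest representative by Euclidean distance in feature space; the library's exact assignment rule may differ.

```python
import numpy as np

def coverage_weights(features: np.ndarray, representatives: list) -> np.ndarray:
    """Share of all points whose nearest representative is each selected point."""
    reps = features[representatives]
    # Distance from every point (rows) to every representative (columns).
    d = np.linalg.norm(features[:, None, :] - reps[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(representatives))
    return counts / len(features)

# Six toy points on a line; points 1, 3 and 5 act as representatives.
features = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [20.0]])
weights = coverage_weights(features, representatives=[1, 3, 5])
print(weights)
```

Here the first representative covers three points, the second two, and the third only itself, so the weights are 3/6, 2/6, and 1/6.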

### Distribution fidelity (ECDF grid)

The ECDF grid shows, for every variable at once, how well the selection's marginal distribution tracks that of the full year. Gaps indicate value ranges that the selection under- or over-represents.

In [15]:
selected_idx_a = context.slicer.get_indices_for_slice_combi(context.df_raw.index, result_a.selection)
df_sel_a = context.df_raw.loc[selected_idx_a]

fig = diag.DistributionOverlayECDFGrid().plot(context.df_raw, df_sel_a)
fig.update_layout(title='Distribution Fidelity — Experiment A (Balanced)')
fig.update_xaxes(matches=None)
fig.update_yaxes(matches=None)
fig.show()
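
The gap between two ECDF curves has a direct quantitative reading: in one dimension, the area between them equals the 1-Wasserstein distance between the two samples. The sketch below computes that area with NumPy. Whether `WassersteinFidelity` uses exactly this quantity (for example with per-variable scaling) is not stated in this notebook, so the link to the library's score is an assumption.

```python
import numpy as np

def wasserstein_1d(u: np.ndarray, v: np.ndarray) -> float:
    """Area between the empirical CDFs of two 1-D samples."""
    u, v = np.sort(u), np.sort(v)
    grid = np.sort(np.concatenate([u, v]))
    widths = np.diff(grid)
    # ECDF values on each interval [grid[i], grid[i+1]).
    cdf_u = np.searchsorted(u, grid[:-1], side="right") / len(u)
    cdf_v = np.searchsorted(v, grid[:-1], side="right") / len(v)
    return float(np.sum(np.abs(cdf_u - cdf_v) * widths))

full_year = np.array([0.0, 1.0, 2.0, 3.0])
selection = np.array([1.0, 2.0])  # misses both tails of the full sample
print(wasserstein_1d(full_year, selection))  # → 0.5
```

The selection in this toy case covers only the middle of the range, and the distance of 0.5 reflects the mass that would have to be moved from the tails (0 and 3) to the nearest selected values.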

### Correlation preservation

The heatmap shows the *difference* between the correlation matrix of the selection and that of the full year. Values near zero (light) mean the selection preserves that cross-variable relationship well.

In [16]:
fig = diag.CorrelationDifferenceHeatmap().plot(
    context.df_raw, df_sel_a, method='pearson', show_lower_only=True
)
fig.update_layout(title='Correlation Difference — Experiment A (Balanced)')
fig.show()
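
The quantity behind such a heatmap can be sketched in a few lines of NumPy. The sign convention (selection minus full data) and the root-mean-square summary over the lower triangle are assumptions chosen for illustration, as is the synthetic data.

```python
import numpy as np

def correlation_difference(full: np.ndarray, selected: np.ndarray) -> np.ndarray:
    """Pearson correlation matrix of the selection minus that of the full data."""
    return np.corrcoef(selected, rowvar=False) - np.corrcoef(full, rowvar=False)

# Synthetic data: 3 variables, the first two positively correlated.
rng = np.random.default_rng(0)
full = rng.standard_normal((1000, 3))
full[:, 1] += full[:, 0]
selected = full[:250]  # hypothetical "selection": the first quarter of rows

diff = correlation_difference(full, selected)
# Single summary number: root-mean-square over the lower triangle.
lower = diff[np.tril_indices(3, k=-1)]
print(np.round(diff, 3), float(np.sqrt(np.mean(lower**2))))
```

The diagonal is always zero because every variable correlates perfectly with itself in both data sets, which is why showing only the lower triangle loses no information.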

### Diurnal profiles

The plot shows the average hourly shape (hours 0 to 23) of each variable for the full year and for the selection. A close match indicates that the selection captures typical within-day patterns.

In [17]:
fig = diag.DiurnalProfileOverlay().plot(
    context.df_raw, df_sel_a, variables=['load', 'onwind', 'offwind', 'solar']
)
fig.update_layout(title='Diurnal Profiles — Experiment A (Balanced)')
fig.show()
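
A diurnal profile of this kind is a group-by on the hour of day. The sketch below computes it with pandas for a synthetic series and measures the largest hourly gap between a selection and the full year. The data and the choice of April as the "selection" are hypothetical.

```python
import numpy as np
import pandas as pd

def diurnal_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Mean value of each variable for every hour of the day (0 to 23)."""
    return df.groupby(df.index.hour).mean()

# Synthetic hourly series with a daytime "solar" bump, identical every day.
index = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
hours = index.hour.to_numpy()
solar = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None)
df_full = pd.DataFrame({"solar": solar}, index=index)
df_sel = df_full.loc["2015-04"]  # pretend April was selected

profile_full = diurnal_profile(df_full)
profile_sel = diurnal_profile(df_sel)
# Largest hourly gap between the selection's and the full-year profile.
max_gap = float((profile_full - profile_sel).abs().max().max())
print(max_gap)
```

Because every synthetic day is identical, the gap is zero here; on real data, a seasonal variable such as solar would show a visible gap whenever the selected months are not spread across the seasons.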