**For correct rendering, view this notebook in [nbviewer](https://nbviewer.org/github/markuskrecik/preference-dynamics-learning/blob/main/notebooks/20_data_exploration.ipynb)**

# Data Exploration

This notebook performs exploratory data analysis (EDA) on the generated preference dynamics dataset. Since the data is generated synthetically, I can keep the data cleaning step to a minimum.

**This notebook:**
- Gives a first visual impression of the generated time series
- Cleans divergent time series
- Analyzes dataset statistics like parameter distributions and correlations

In [1]:
%load_ext autoreload
%autoreload 2


import numpy as np
import pandas as pd
import plotly.express as px
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)

from preference_dynamics.data import DataConfig, DataManager
from preference_dynamics.data.transformer import DeleteSampleTransformer
from preference_dynamics.visualization import plot_time_series
from preference_dynamics.utils import get_param_names, get_var_names

pd.set_option("display.float_format", "{:.2f}".format)

## Load Dataset

Let's continue the analysis with 2 actions.


In [2]:
n_actions = 2
data_dir = f"data/n{n_actions}"

config = DataConfig(data_dir=data_dir, load_if_exists=False)
dm = DataManager(config=config)
raw_data = dm.load_raw()

## Visualize Dataset

Let's get a first intuition of what the synthetic data looks like by plotting the first 3 time series.

In the first one, actions **cannot** be performed in parallel, so the simulated person switches between them regularly, depending on how desired each action currently is.

In the second example, both actions can be performed in parallel.

The third one is like the first, but the person can switch faster between actions.

In [3]:
for i, sample in enumerate(raw_data[:3]):
    plot_time_series(sample, title=f"Time series {i}", height=400, width=600)

## Clean Data

There might still be some divergent time series in the dataset, which we'd like to remove before final processing.
The heuristic for stability checking is already quite good, so only a few exponentially divergent samples remain.
Time series with large amplitudes but no exponential growth are stable and will be kept.

In [4]:
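def is_divergent(sample, threshold=1e3):
    # Flags a sample as soon as any variable exceeds the threshold in absolute value.
    # This is only a coarse pre-filter: stable samples with large amplitudes are flagged too.
    return np.any(np.abs(sample.time_series) > threshold)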


divergent_samples = [{"idx": i, "sample": s} for i, s in enumerate(raw_data) if is_divergent(s)]

print(f"Found {len(divergent_samples)} possibly divergent samples.")

for s in divergent_samples:
    # print(f"{repr(s['sample'].parameters)}\n{repr(s['sample'].initial_conditions)}")
    plot_time_series(s["sample"], title=f"Time series {s['idx']}", height=400, width=500)

# Chosen from visual inspection:
remove_positions = [59, 151, 594, 956, 853, 1194, 2609, 2834, 3296, 4607, 6364, 7749, 7951, 9890]

Found 28 possibly divergent samples.
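
The amplitude threshold flags stable high-amplitude samples as well, which is why the final selection is made by eye. A numerical check could support that inspection. The following is only a minimal sketch under my own assumptions: `envelope_growth_rate` is a hypothetical helper (not part of `preference_dynamics`), it works on a single 1D variable, and the synthetic signals are made up for illustration.

```python
import numpy as np


def envelope_growth_rate(signal, n_windows=10, eps=1e-12):
    """Slope of log(peak amplitude) per window; hypothetical helper for illustration."""
    signal = np.abs(np.asarray(signal, dtype=float))
    # The peak amplitude in each consecutive chunk approximates the envelope.
    envelope = np.array([chunk.max() for chunk in np.array_split(signal, n_windows)])
    # For exponential growth, log(envelope) is roughly linear in the window index.
    slope, _intercept = np.polyfit(np.arange(n_windows), np.log(envelope + eps), deg=1)
    return slope


t = np.linspace(0.0, 10.0, 1000)
diverging = np.exp(0.8 * t) * np.sin(5.0 * t)  # amplitude grows exponentially
bounded = 2000.0 * np.sin(5.0 * t)  # large but constant amplitude

# Both signals exceed the 1e3 threshold, but only one has a clearly positive slope.
print(envelope_growth_rate(diverging))
print(envelope_growth_rate(bounded))
```

A clearly positive slope points to exponential growth, while a slope near zero points to a bounded oscillation. Any cut-off on the slope would still have to be validated against the plots above.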


Let's re-instantiate the `DataManager` with a data transformer that deletes specific samples.
Other transformers can be chained as well, as will be done in the following notebooks.

In [5]:
config = DataConfig(
    data_dir=data_dir,
    load_if_exists=False,
    transformers=[
        DeleteSampleTransformer(remove_positions=remove_positions),
    ],
)
dm = DataManager(config=config)

# setup() will (re-)load all raw data, split it into train/val/test sets,
# apply transformations, and save the processed data.
dm.setup()

<preference_dynamics.data.manager.DataManager at 0x7af29bbdf390>

## Time Series Statistics

The summary statistics across all time series below show that desires ($u_i$) and actions ($a_i$) are similarly distributed across all splits, which suggests that the dataset is large enough for the splits to be representative.
The validation and test statistics deviate somewhat more because these splits are smaller, as expected. Still, their means and quantiles stay close to those of the training set.

In [6]:
var_names = get_var_names(n_actions)

for name, samples in dm.splits.items():
    all_ts = pd.DataFrame(
        np.concatenate([s.time_series.T for s in samples], axis=0), columns=var_names
    )
    print("Statistics for split:", name)
    display(all_ts.describe())

Statistics for split: train


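|       | u_0       | u_1       | a_0       | a_1       |
|-------|-----------|-----------|-----------|-----------|
| count | 1404990.0 | 1404990.0 | 1404990.0 | 1404990.0 |
| mean  | 11.03     | 10.48     | 5.0       | 4.99      |
| std   | 52.37     | 56.26     | 23.38     | 22.93     |
| min   | -9.98     | -9.98     | 0.0       | 0.0       |
| 25%   | 0.71      | 0.73      | 0.83      | 0.8       |
| 50%   | 3.34      | 3.46      | 1.8       | 1.81      |
| 75%   | 9.22      | 9.27      | 4.26      | 4.12      |
| max   | 17522.84  | 17050.44  | 9169.63   | 9645.98   |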


Statistics for split: val


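|       | u_0      | u_1      | a_0      | a_1      |
|-------|----------|----------|----------|----------|
| count | 301098.0 | 301098.0 | 301098.0 | 301098.0 |
| mean  | 10.06    | 9.46     | 4.45     | 4.71     |
| std   | 31.54    | 27.02    | 12.2     | 15.1     |
| min   | -9.97    | -9.93    | 0.0      | 0.0      |
| 25%   | 0.64     | 0.59     | 0.8      | 0.8      |
| 50%   | 3.26     | 3.13     | 1.78     | 1.77     |
| 75%   | 8.64     | 8.63     | 3.82     | 3.98     |
| max   | 2992.85  | 1349.9   | 1680.35  | 2202.2   |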


Statistics for split: test


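|       | u_0      | u_1      | a_0      | a_1      |
|-------|----------|----------|----------|----------|
| count | 301098.0 | 301098.0 | 301098.0 | 301098.0 |
| mean  | 11.03    | 10.14    | 4.77     | 4.66     |
| std   | 35.11    | 27.51    | 13.89    | 11.79    |
| min   | -9.94    | -9.75    | 0.0      | 0.0      |
| 25%   | 0.71     | 0.82     | 0.78     | 0.83     |
| 50%   | 3.57     | 3.45     | 1.79     | 1.78     |
| 75%   | 9.45     | 9.12     | 4.0      | 4.03     |
| max   | 1441.19  | 684.42   | 598.56   | 454.3    |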
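
To make "close to the training set" a bit more concrete, the quartiles from the tables above can be compared directly. This is a small worked example for $u_0$ only. The numbers are copied from the `describe()` output, and normalizing by the training interquartile range is my own choice of scale, not something the pipeline does.

```python
import numpy as np

# Quartiles (25%, 50%, 75%) of u_0, copied from the describe() tables above.
u0_quartiles = {
    "train": np.array([0.71, 3.34, 9.22]),
    "val": np.array([0.64, 3.26, 8.64]),
    "test": np.array([0.71, 3.57, 9.45]),
}

# Interquartile range of the training split, used as the reference scale.
train_iqr = u0_quartiles["train"][2] - u0_quartiles["train"][0]

# Absolute quartile shift of each held-out split relative to the training IQR.
relative_shift = {
    name: np.abs(q - u0_quartiles["train"]) / train_iqr
    for name, q in u0_quartiles.items()
    if name != "train"
}

for name, shift in relative_shift.items():
    print(name, np.round(shift, 3))
```

For $u_0$, every quartile of `val` and `test` lies within 7% of the training interquartile range. The same comparison could be repeated for the other variables.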


## Parameter Distributions

The parameter sampler for the ODE solver rejected samples that led to unstable solutions, so I have to check whether the parameter space is still covered sufficiently.
Luckily, I don't have to care about outliers or other anomalies, since I chose the parameter ranges for the synthetic data myself.

As the histograms below show, most parameters and all initial conditions are uniformly sampled. The exceptions are the matrix elements of $\Pi$: small diagonal values and large negative off-diagonal values lead to instability, so these regions are sampled less densely. I therefore expect slightly degraded learning performance in these ranges.

`val` and `test` splits have similar distributions.

In [7]:
train_samples = dm.splits["train"]

param_ic_names = get_param_names(n_actions, ic=True)
param_ic_data = [
    [s.parameters.values for s in train_samples],
    [s.initial_conditions.values for s in train_samples],
]
param_ic_df = pd.DataFrame(
    np.concatenate(param_ic_data, axis=1),
    columns=param_ic_names,
    index=[sample.sample_id for sample in train_samples],
)
param_ic_df_long = param_ic_df.melt()


fig = px.histogram(
    param_ic_df_long,
    title="Parameter Distributions",
    labels={"_value": "Param/IC Value", "_variable": "Parameter"},
    facet_col="variable",
    facet_col_wrap=4,
    height=800,
)
fig.update_xaxes(matches=None, showticklabels=True)

fig.show()
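
The visual impression of coverage can be backed by a simple number per parameter. The following is a minimal sketch, not part of `preference_dynamics`: `min_bin_coverage` is a hypothetical helper, the range bounds are assumed to be the known sampler limits, and the data here is synthetic.

```python
import numpy as np


def min_bin_coverage(values, low, high, bins=10):
    """Smallest bin count divided by the count expected under uniform sampling."""
    counts, _edges = np.histogram(values, bins=bins, range=(low, high))
    expected = len(values) / bins
    return counts.min() / expected


rng = np.random.default_rng(0)
uniform = rng.uniform(0.0, 1.0, size=10_000)

# Mimic a sampler that rejects most draws in part of the range:
# keep everything above 0.2, but only about 10% of the draws below it.
keep = (uniform >= 0.2) | (rng.uniform(size=uniform.size) < 0.1)
thinned = uniform[keep]

print(min_bin_coverage(uniform, 0.0, 1.0))
print(min_bin_coverage(thinned, 0.0, 1.0))
```

A value near 1 means that even the emptiest bin is about as full as uniform sampling would predict, while a value near 0 marks a region that is barely covered.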

As shown below, the distributions of the determinants $\det(\Pi)$ and $\det(\Gamma)$ show a sharp drop at zero. In particular, no samples have $\det(\Pi) < 0$, which suggests that $\det(\Pi) \ge 0$ is a necessary condition for stability. This can help in designing a more efficient stability check for the sampler.

In [8]:
px.histogram(
    [
        {"det(Π)": np.linalg.det(t.parameters.Pi), "det(Γ)": np.linalg.det(t.parameters.Gamma)}
        for t in train_samples
    ],
    title="Determinant distribution of Π & Γ",
    labels={"value": "Determinant"},
)
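
If the determinant condition holds up, it could serve as a cheap pre-filter that runs before the expensive stability check. The following is only a sketch under assumptions: the function name and the sampling range are hypothetical and do not reflect the actual sampler, and $\det(\Pi) \ge 0$ is an empirical observation from this dataset, not a proven criterion.

```python
import numpy as np


def sample_pi_candidate(rng, n_actions=2, low=-1.0, high=1.0, max_tries=1000):
    """Draw a candidate Pi, rejecting draws with det(Pi) < 0 before any ODE solve.

    Hypothetical pre-filter: the function name and the sampling range are
    assumptions for illustration, not the sampler used to build this dataset.
    """
    for _ in range(max_tries):
        candidate = rng.uniform(low, high, size=(n_actions, n_actions))
        # Cheap necessary-condition check; the full stability check still has to follow.
        if np.linalg.det(candidate) >= 0:
            return candidate
    raise RuntimeError("No candidate with det(Pi) >= 0 found.")


rng = np.random.default_rng(0)
pi = sample_pi_candidate(rng)
print(pi)
print(np.linalg.det(pi))
```

Since the condition is at best necessary and not sufficient, candidates that pass this filter would still need the full stability check.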

## Parameter Correlations

Because unstable samples are rejected during sampling, the elements of $\Pi$ are weakly correlated, as expected. We also see some correlation between $\Pi$ and $\Gamma$ elements: larger negative off-diagonal elements of $\Gamma$ create spikes in the time series, and samples with such spikes are also filtered out by thresholding.

Overall, these weak correlations should not impact learning performance.


In [9]:
px.imshow(
    param_ic_df.corr(),
    title="Parameter & Initial Condition Correlation Matrix",
    color_continuous_scale="RdBu_r",
    zmin=-0.3,
    zmax=0.3,
)
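
To read the strongest correlations off as numbers instead of colors, the correlation matrix can be flattened into a ranked list of pairs. This is a minimal sketch using a small synthetic DataFrame. `top_correlations` is a hypothetical helper, and in this notebook it would be applied to `param_ic_df`.

```python
import numpy as np
import pandas as pd


def top_correlations(df, k=5):
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears exactly once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(k)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "a": a,
        "b": a + 0.1 * rng.normal(size=500),  # strongly correlated with "a"
        "c": rng.normal(size=500),  # independent of both
    }
)

print(top_correlations(demo, k=3))
```

The result is a Series indexed by column pairs, so the weak correlations discussed above can be checked against a chosen tolerance.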

## Summary

This EDA:
- Gives intuition about the generated data
- Ensures a clean dataset without divergent samples
- Shows that the parameter space is covered sufficiently
- Verifies that the parameter correlations are low enough for stable learning

**Future extensions:**
- Analyze identifiability of parameters through the Fisher Information Matrix: Which parameters can actually be predicted from the time series?

**Next steps** (see `30_feature_engineering.ipynb`):
- Create and analyze features of the time series
- Set up a baseline linear regression model with only these features