5 changes: 5 additions & 0 deletions clawbench/dynamics.py
Original file line number Diff line number Diff line change
@@ -60,6 +60,7 @@ class Dynamics:
     pca_trajectory: np.ndarray | None = None  # (n_steps, 2)
     bigram_transitions: dict[str, dict[str, float]] = field(default_factory=dict)
     memory_depth: float = 0.0  # I(X_t; X_{t-2} | X_{t-1})
+    renyi_d2: float = 0.0


 @dataclass
@@ -287,6 +288,7 @@ def compute_dynamics(transcript: Transcript) -> Dynamics:
     }

     ci = 0.5
+    renyi_d2 = 0.0
     if n > 2:
         cov = np.cov(X.T)
         eigvals = np.maximum(np.linalg.eigvalsh(cov), 0)
@@ -295,6 +297,8 @@ def compute_dynamics(transcript: Transcript) -> Dynamics:
             p = eigvals / tv
             pr = 1.0 / np.sum(p**2)
             ci = 1.0 - (pr - 1) / (X.shape[1] - 1)
+            sum_p2 = np.sum(p**2)
+            renyi_d2 = float(-np.log2(sum_p2)) if sum_p2 > 0 else 0.0

     h = _entropy(dict(fam_acc))
     er = err_count / n if n else 0
@@ -320,6 +324,7 @@ def compute_dynamics(transcript: Transcript) -> Dynamics:
         constraint_index=ci,
         bigram_transitions=_compute_bigram_transitions(families),
         memory_depth=_conditional_mi(families),
+        renyi_d2=renyi_d2,
     )


15 changes: 10 additions & 5 deletions clawbench/dynamics_archive.py
@@ -102,11 +102,16 @@ def discover_model_roots(archive_dir: Path) -> dict[str, Path]:
     if _is_task_collection_root(archive_dir):
         return {archive_dir.name: archive_dir}

-    roots = {
-        child.name: child
-        for child in sorted(archive_dir.iterdir())
-        if child.is_dir() and _is_task_collection_root(child)
-    }
+    roots = {}
+    for child in sorted(archive_dir.iterdir()):
+        if not child.is_dir():
+            continue
+        if _is_task_collection_root(child):
+            roots[child.name] = child
+        else:
+            for subchild in sorted(child.iterdir()):
+                if subchild.is_dir() and _is_task_collection_root(subchild):
+                    roots[f"{child.name}/{subchild.name}"] = subchild
     return roots


182 changes: 182 additions & 0 deletions docs/long_term_dynamics.md

Large diffs are not rendered by default.

100 changes: 100 additions & 0 deletions docs/semantic_spatiotemporal_dynamics.md
@@ -0,0 +1,100 @@
# Semantic Spatio-Temporal Dynamics Analysis

## 1. Introduction: Bridging Space and Time

Evaluating iterative, long-running Large Language Model (LLM) agents requires understanding two fundamentally different axes of their behavior:
1. **The Semantic Space (What the agent is doing)**: The distribution of tasks, intents, and prompts the agent interacts with.
2. **The Temporal Dynamics (How the agent evolves)**: The trajectory of the agent over time, characterized by its ability to converge on solutions versus drifting into unrecoverable hallucination loops.

Evaluating these dimensions in isolation creates a blind spot. **Raw temporal dynamics metrics treat all tasks in an arbitrary benchmark equally.** If a benchmark dataset over-represents simple, tightly constrained tasks, the agent's overall dynamic stability will look artificially robust. Conversely, if it over-represents open-ended creative tasks, the agent may look more chaotic than it is in practice.

The **Semantic Spatio-Temporal Dynamics** framework solves this by fusing these two methodologies. It maps the geometry of the agent's time-series trajectories directly onto a debiased, user-aligned semantic manifold, projecting abstract mathematical stability metrics onto concrete operational realities.

---

## 2. The Spatial Dimension: Task Distribution Reweighting

Evaluation datasets ($Q$) inherently suffer from distribution shifts compared to true real-world usage ($P$). To correct this, we stratify and reweight the semantic space of tasks.

### 2.1 NLU/NLI Semantic Clustering
We embed the natural language instructions of each task $q_i$ with dense NLU embedding models to capture semantic intent, and use Natural Language Inference (NLI) to detect entailment and redundancy between tasks.
Using a clustering algorithm (e.g., HDBSCAN), we partition the dataset into $K$ distinct functional strata: $\mathcal{C} = \{C_1, C_2, \dots, C_K\}$.

### 2.2 Importance Weighting (Radon-Nikodym Derivatives)
Let $Q(C_k)$ be the empirical fraction of the evaluation dataset belonging to cluster $C_k$, and $P(C_k)$ be the target real-world probability of that cluster. We compute the importance weight (Radon-Nikodym derivative) for any task $i$ in stratum $k_i$ as:
$$ \rho_{k_i} = \frac{P(C_{k_i})}{Q(C_{k_i})} $$
This scaling factor ensures that over-represented tasks are suppressed, and under-represented but critical real-world tasks are amplified.
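
As a minimal sketch of this ratio, the snippet below uses a two-cluster example with the same values as the `profiles/*.json` files added in this change. The function name `importance_weights` is hypothetical and is not part of ClawBench.

```python
def importance_weights(target: dict[str, float], empirical: dict[str, float]) -> dict[str, float]:
    """Return rho_k = P(C_k) / Q(C_k) for every cluster present in the benchmark."""
    weights = {}
    for cluster, q in empirical.items():
        if q <= 0:
            raise ValueError(f"cluster {cluster!r} has no benchmark mass")
        # Clusters absent from the target distribution get weight 0.
        weights[cluster] = target.get(cluster, 0.0) / q
    return weights


Q = {"math": 0.80, "code": 0.20}  # empirical benchmark fractions
P = {"math": 0.20, "code": 0.80}  # target real-world fractions
rho = importance_weights(P, Q)
print(rho)  # math is down-weighted to 0.25, code is up-weighted to 4.0
```

A cluster with $P(C_k) > 0$ but $Q(C_k) = 0$ cannot be corrected by reweighting, because the benchmark contains no tasks to amplify.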

---

## 3. The Temporal Dimension: Long-Term Trajectory Dynamics

As an agent iteratively reasons and invokes tools, its transcript generates a sequence of discrete actions $x_t$. We project this sequence into a continuous $d$-dimensional behavioral feature space to analyze its geometry.

### 3.1 Attractor Geometry and The Constraint Index $C(q)$
For a given task $q$, we measure how tightly the agent's trajectory is bound to an attractor basin using three core metrics:
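* **Participation Ratio (PR) & Rényi Dimension ($D_2$)**: We extract the eigenspectrum $\{\lambda_i\}$ of the trajectory's covariance matrix and normalize it to $p_i = \lambda_i / \sum_j \lambda_j$. The participation ratio is $\mathrm{PR} = 1 / \sum_i p_i^2$, and $D_2 = -\log_2 \sum_i p_i^2 = \log_2 \mathrm{PR}$ is the order-2 Rényi entropy of that spectrum. We use it as a proxy for the effective dimensionality of the phase space explored by the agent; it is not the correlation dimension obtained from a correlation integral.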
* **Response Entropy ($H$)**: The Shannon entropy over the eigenspectrum (or discrete action distribution) measuring the intrinsic uncertainty and diffusion of the agent.
* **Bayesian Optimal Prediction Score (BOPS)**: A measure of inter-run predictability, proxying how consistently the agent targets the maximum a posteriori (MAP) trajectory.

These are standardized and fused into the **Constraint Index $C(q)$**, where a high $C(q)$ implies tight bounded behavior (a strong point attractor).
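
The eigenspectrum quantities can be sketched in a few lines. The helper below is hypothetical: it takes covariance eigenvalues as input and mirrors the arithmetic that `compute_dynamics` in `clawbench/dynamics.py` applies to them. In particular, its `ci` value is derived from the participation ratio alone; standardization and fusion with $H$ and BOPS are not shown.

```python
import math


def spectrum_metrics(eigvals: list[float]) -> dict[str, float]:
    """Participation ratio, D2 and a PR-based constraint index from covariance eigenvalues."""
    clipped = [max(v, 0.0) for v in eigvals]  # guard against tiny negative eigenvalues
    total = sum(clipped)
    d = len(clipped)
    if total <= 0 or d < 2:
        raise ValueError("need at least two dimensions and positive total variance")
    p = [v / total for v in clipped]      # normalized spectrum, sums to 1
    sum_p2 = sum(x * x for x in p)
    pr = 1.0 / sum_p2                     # ranges from 1 to d
    return {
        "pr": pr,
        "d2": -math.log2(sum_p2),         # equals log2(pr)
        "ci": 1.0 - (pr - 1.0) / (d - 1), # 1 for one dominant direction, 0 for a flat spectrum
    }


flat = spectrum_metrics([1.0, 1.0, 1.0, 1.0])    # variance spread evenly over 4 directions
peaked = spectrum_metrics([1.0, 0.0, 0.0, 0.0])  # all variance in one direction
```

For the flat spectrum the participation ratio is 4 and $D_2$ is 2 bits, giving `ci = 0`; for the peaked spectrum the participation ratio is 1 and `ci = 1`.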

### 3.2 Perturbation Sensitivity (Lyapunov Proxy)
To test robustness, we generate semantically identical but lexically perturbed prompts $q'$. We track the divergence between the original trajectory $e_t$ and the perturbed trajectory $e'_t$ over time, extracting a Lyapunov-like proxy:
$$ \widehat{\lambda}(t) = \frac{1}{t}\log\frac{D_t+\epsilon}{D_0+\epsilon} $$
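Here $D_t$ is the distance between $e_t$ and $e'_t$ at step $t$, $D_0$ is the initial distance, and $\epsilon$ is a small constant that keeps the ratio finite. A positive $\widehat{\lambda}(t)$ indicates chaotic sensitivity, where tiny prompt variations cause exponentially diverging behavior; a negative value indicates that the two trajectories contract toward each other.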
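
A small sketch of the proxy, assuming the per-step distances between the original and perturbed trajectories have already been computed. The function name, the default `eps`, and the example distances are illustrative assumptions.

```python
import math


def lyapunov_proxy(distances: list[float], eps: float = 1e-8) -> list[float]:
    """Return lambda_hat(t) = (1/t) * log((D_t + eps) / (D_0 + eps)) for t = 1..T.

    distances[0] is D_0, so the list must contain at least one element.
    """
    d0 = distances[0] + eps
    return [
        math.log((d_t + eps) / d0) / t
        for t, d_t in enumerate(distances[1:], start=1)
    ]


diverging = lyapunov_proxy([0.1, 0.2, 0.4, 0.8], eps=0.0)  # distance doubles each step
contracting = lyapunov_proxy([0.8, 0.4, 0.2], eps=0.0)     # distance halves each step
```

When the distance doubles at every step, each $\widehat{\lambda}(t)$ is close to $\ln 2$; when it halves, each value is close to $-\ln 2$.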

### 3.3 Dynamical Regimes
Trajectories are ultimately classified into distinct dynamical regimes:
* **Trapped**: Collapsing into a highly recurrent, localized subset of actions.
* **Limit Cycle**: Bounded drift with quasi-periodic revisits to states.
* **Wandering/Diffusive**: Unbounded expansion with low predictability and high entropy.

---

## 4. Spatio-Temporal Fusion: The Hajek Estimator

The core step is to apply the spatial weights ($\rho_{k_i}$) to the per-task temporal properties ($D_i$) to estimate the *expected real-world dynamics*.

For any dynamic property $D$ (not to be confused with the divergence $D_t$ of Section 3.2), the debiased expectation under the real-world user distribution $P$ is approximated by the Hajek (self-normalized importance-weighted) estimator, which is consistent but can be biased in finite samples:
$$ \mathbb{E}_{P}[D] \approx \frac{\sum_{i=1}^N \rho_{k_i} D_i}{\sum_{i=1}^N \rho_{k_i}} $$
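
The estimator is a weighted mean with the weights renormalized to sum to one. The sketch below applies it to a regime indicator; `hajek_mean` and the ten-run example are hypothetical and only illustrate the formula.

```python
def hajek_mean(values: list[float], weights: list[float]) -> float:
    """Self-normalized importance-weighted mean: sum(w * v) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive total mass")
    return sum(w * v for v, w in zip(values, weights)) / total


# Hypothetical benchmark: 8 math runs (rho = 0.25) followed by 2 code runs (rho = 4.0).
weights = [0.25] * 8 + [4.0] * 2
# Indicator of the wandering regime: 1 of the 8 math runs and 1 of the 2 code runs.
wandering = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

raw = sum(wandering) / len(wandering)     # unweighted benchmark rate
debiased = hajek_mean(wandering, weights) # rate under the target distribution
```

In this example the raw rate is 20%, while the debiased estimate is about 42.5%, because the failure is concentrated in the cluster that dominates real usage.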

### Key Fused Metrics
1. **Expected Regime Probability ($E_P[\text{Regime} = r]$)**: Instead of stating "20% of benchmark trajectories hit a chaotic wandering regime," this estimates the probability that a *deployed user* will experience that failure mode.
2. **Debiased Survival Curves ($S_{debiased}(t)$)**: A weighted Kaplan-Meier estimate. If simple, high-survival tasks are overrepresented in the benchmark, the raw curve is overly optimistic. The debiased curve corrects for this and gives an estimate of the expected time-to-failure under $P$.
3. **Expected Chaos ($E_P[\widehat{\lambda}]$) & Predictability ($E_P[C(q)]$)**: The importance-weighted averages of prompt fragility and of constraint, respectively.
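
One way to realize the weighted Kaplan-Meier curve is to replace the counts in the risk set and the event set with sums of importance weights. The sketch below is an assumption about how that could look, not the ClawBench implementation; times are taken to be turn indices, and runs with `failed=False` are treated as censored.

```python
from collections import defaultdict


def weighted_kaplan_meier(
    times: list[float], failed: list[bool], weights: list[float]
) -> list[tuple[float, float]]:
    """Return [(t, S(t))] at each distinct failure time, using weighted risk sets."""
    events = defaultdict(float)  # weighted failures at each time
    exits = defaultdict(float)   # weighted runs leaving the risk set at each time
    for t, f, w in zip(times, failed, weights):
        exits[t] += w
        if f:
            events[t] += w
    at_risk = sum(weights)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if events[t] > 0:
            survival *= 1.0 - events[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # failures and censored runs both leave the risk set
    return curve


# Hypothetical runs: one code task fails at turn 2, four math tasks are censored at turn 10.
times = [2, 10, 10, 10, 10]
failed = [True, False, False, False, False]
raw_curve = weighted_kaplan_meier(times, failed, [1.0] * 5)
debiased_curve = weighted_kaplan_meier(times, failed, [4.0, 0.25, 0.25, 0.25, 0.25])
```

With unit weights the survival estimate after turn 2 is about 0.8; with the importance weights it drops to about 0.2, which is the optimism gap described in item 2.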

---

## 5. Expanding the Spatial Definition: State, Action, and Conditioned Survival

While the standard formulation defines "Space" via the NLU embedding of the *initial prompt*, this framework is naturally extensible to other spatial dimensions of the trajectory:

* **Action Space (Tools Called)**: Stratifying trajectories based on the specific tools invoked (e.g., isolating all runs where `edit_file` or `bash` was called).
* **Intermediate State Space**: Stratifying based on the environment state or agent memory (e.g., isolating runs where a `SyntaxError` was encountered).

This is where **Time-to-Event (Survival Analysis)** becomes especially useful. Because ClawBench logs the full trajectory state, we can compute conditional expectations of dynamic properties. Rather than only asking "What is the expected survival time of this task?", we can condition on any combination of logged variables:
* $\mathbb{E}[\text{Time-to-Failure} \mid \text{Tool} = \text{bash}]$
* $P(\text{Regime} = \text{Limit Cycle} \mid \text{State} = \text{SyntaxError})$

By using Stratified Kaplan-Meier curves or Cox Proportional Hazards models with time-dependent covariates, researchers can isolate the exact state-action transitions that induce catastrophic drift.

---

## 6. Interpretation and Impact for Researchers

Merging these dimensions unlocks powerful theoretical and practical insights:

* **Kaplan-Yorke and Hidden Fragility**: If the Spatio-Temporal fusion reveals a high expected Rényi dimension $D_2$ and high Lyapunov sensitivity $\widehat{\lambda}$, the deployed agent lacks a definitive "point attractor" for real-world tasks. An agent might appear stable on a benchmark, but if its chaotic trajectories align heavily with the most frequent user tasks, its operational stability is critically low.
* **Ergodicity and Markovian Traps**: LLMs are generally non-ergodic due to absorbing states (completing a task or hitting turn limits). However, when trapped in a limit cycle, they suffer from context blindness, collapsing into a destructive Markovian loop. The Spatio-Temporal framework identifies exactly *which semantic regions* trigger these non-ergodic traps, allowing researchers to surgically apply early-stopping heuristics rather than blanket constraints.
* **Task-Sensitivity Mutual Information $I(q; \lambda)$**: We expect high mutual information between a task's Constraint Index $C(q)$ and its perturbation sensitivity $\widehat{\lambda}$: tightly constrained tasks should yield deep attractor basins with near-zero sensitivity. The Spatio-Temporal framework then indicates where *prompt engineering matters most*, namely on the loosely constrained tasks that dominate a user's target distribution.

---

## 7. Implementation Pipeline

The Spatio-Temporal decomposition is fully operationalized through a bridging script that ingests the outputs of both upstream modules:

1. **Spatial Baseline**: `scripts/compute_posterior_weights.py` computes the weights $\rho_i$ based on NLU clusters and user schemas.
2. **Temporal Baseline**: `scripts/run_posterior_dynamics_pipeline.py` computes the unweighted survival, regimes, and constraint indices.
3. **Spatio-Temporal Fusion**: `scripts/compute_debiased_dynamics.py` applies the Hajek estimators to produce the final `debiased_regimes_probability` and `debiased_expected_C_q`.
93 changes: 93 additions & 0 deletions docs/task_distribution_reweighting.md
@@ -0,0 +1,93 @@
# Aligning LLM Evaluations with Reality: Debiasing via Task Distribution Reweighting
## Investigating Semantic Task Clustering and Stratified Reweighting for Real-World Accuracy

Evaluation benchmarks often suffer from severe distribution shifts compared to real-world usage. A dataset might consist of 80% mathematics tasks and 20% coding tasks, whereas an actual user's interaction distribution might be exactly the opposite (20% math, 80% code). Evaluating an LLM on the raw dataset yields a biased performance estimate that over-indexes on specific capabilities while under-representing others. This document outlines an empirical framework to debias evaluation scores by clustering tasks using Natural Language Understanding (NLU) and Natural Language Inference (NLI) models, and reweighting these task strata to match true usage distributions.

---

## 1. Introduction: The Need for Distribution Alignment

**Key question: Does our benchmark score actually reflect the user's experience?**

Standard evaluation paradigms treat every task in a dataset equally, computing an unweighted mean over all instances. However, evaluation datasets are typically constructed via programmatic generation or scraping, leading to arbitrary internal distributions that do not reflect operational reality.

If a system is deployed where coding represents the vast majority of user queries, a math-heavy benchmark will misjudge the model's practical utility. We therefore treat the evaluation dataset as a biased sample from a broader semantic space, and apply **stratified reweighting** to correct this bias, moving from a static dataset score to a dynamic, user-aligned capability metric.

---

## 2. Methodology: Clustering and Stratification

### 2.1 Task Representation and NLU Clustering
To reweight a dataset, we first need to map its internal composition. We map each task/prompt $q_i$ into a semantic space using pre-trained NLU models to identify latent capabilities.

* **Dense NLU Embeddings:** We extract representations for each task instruction using modern embedding models to capture semantic intent.
* **NLI for Semantic Equivalence:** We employ Natural Language Inference (NLI) models to evaluate pairs of tasks. If task $A$ entails the capabilities required by task $B$, we can aggressively group similar prompts to prevent over-counting highly redundant queries.
* **Stratification:** We apply clustering algorithms (e.g., HDBSCAN) on the semantic representations to partition the dataset into $K$ distinct functional clusters (strata), $\mathcal{C} = \{C_1, C_2, \dots, C_K\}$, representing distinct capability areas (e.g., "Math Word Problems", "Code Refactoring", "Information Retrieval").

> **Implementation:** Computed in `scripts/cluster_tasks_nlu.py` using embedding and NLI models to output a cluster assignment mapping for all benchmark tasks.

### 2.2 Estimating True Usage Distributions
Let $P_{eval}$ be the empirical distribution of tasks in the evaluation dataset, and $P_{user}$ be the target real-world usage distribution. We determine the proportion of each cluster $k$ in both:
* $w_{eval}^{(k)}$: The fraction of tasks in the evaluation set that belong to cluster $C_k$.
* $w_{user}^{(k)}$: The fraction of tasks in the expected user distribution that belong to cluster $C_k$.

If a cluster makes up 80% of the benchmark but only 20% of user interactions, it is heavily over-represented.

> **Implementation:** Computed in `scripts/compute_distribution_weights.py` by comparing the empirical cluster sizes against a provided user telemetry schema.
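
The evaluation-side fractions $w_{eval}^{(k)}$ follow directly from the cluster assignment mapping. The sketch below assumes that mapping is a plain dictionary from task id to cluster label; the helper name and the task ids are hypothetical.

```python
from collections import Counter


def cluster_fractions(assignments: dict[str, str]) -> dict[str, float]:
    """Map each cluster label to the fraction of tasks assigned to it."""
    if not assignments:
        raise ValueError("no tasks to count")
    counts = Counter(assignments.values())
    n = sum(counts.values())
    return {cluster: count / n for cluster, count in counts.items()}


# Hypothetical assignment of ten task ids to two clusters.
assignments = {f"task_{i}": "math" for i in range(8)}
assignments.update({f"task_{i}": "code" for i in range(8, 10)})
w_eval = cluster_fractions(assignments)
```

Here `w_eval` comes out as 80% math and 20% code, matching the over-representation example above. The user-side fractions $w_{user}^{(k)}$ come from an external profile and are not derived from the benchmark.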

### 2.3 Stratified Importance Reweighting
We compute a debiased performance metric by applying Inverse Probability Weighting (IPW) to the task strata. If a model achieves an average success rate $S_k$ on cluster $C_k$, the naive unweighted dataset score is simply $\sum_k w_{eval}^{(k)} S_k$.

The debiased, user-aligned score corrects for this by scaling by the true usage rates:

$$ S_{debiased} = \sum_{k=1}^K w_{user}^{(k)} S_k $$

Alternatively, we can assign an importance weight $\rho_i$ to each individual task $i$ belonging to cluster $C_k$:

$$ \rho_i = \frac{w_{user}^{(k)}}{w_{eval}^{(k)}} $$

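This yields the weighted expected score $\mathbb{E}_{q \sim P_{user}} [ \text{Score}(q) ] \approx \frac{1}{N} \sum_{i=1}^N \rho_i \text{Score}(q_i)$, which coincides with $S_{debiased}$ when every cluster with nonzero user weight appears in the evaluation set.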

> **Implementation:** Weights are integrated during metric aggregation in `clawbench.evaluation.debiased_metrics`.
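
A worked example of both forms, using the 80/20 math/code split from the introduction. The per-task outcomes and the helper `stratified_score` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def stratified_score(cluster_scores: dict[str, float], cluster_weights: dict[str, float]) -> float:
    """Compute sum_k w^(k) * S_k over the clusters in cluster_weights."""
    return sum(w * cluster_scores[k] for k, w in cluster_weights.items())


w_eval = {"math": 0.80, "code": 0.20}
w_user = {"math": 0.20, "code": 0.80}
# Hypothetical outcomes: 6 of 8 math tasks and 1 of 2 code tasks succeed.
tasks = [("math", 1)] * 6 + [("math", 0)] * 2 + [("code", 1), ("code", 0)]

# Per-cluster success rates S_k.
S = {
    k: sum(s for c, s in tasks if c == k) / sum(1 for c, _ in tasks if c == k)
    for k in w_eval
}
naive = stratified_score(S, w_eval)     # plain benchmark mean
debiased = stratified_score(S, w_user)  # user-aligned score

# Per-task form: rho_i = w_user / w_eval for the task's cluster, averaged over N tasks.
rho = {k: w_user[k] / w_eval[k] for k in w_eval}
per_task = sum(rho[c] * s for c, s in tasks) / len(tasks)
```

The naive score is about 0.70, while both debiased forms give about 0.55, since the weaker code cluster carries most of the user weight.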

---

## 3. Advanced Capabilities: Inter-Task Similarity and Overlap

Beyond simple clustering, NLU and NLI models allow us to construct a full **Task Similarity Graph**.

1. **Redundancy Penalties:** If a cluster contains near-identical tasks (as measured by bidirectional NLI entailment), we can down-weight individual tasks within that cluster to avoid "capability farming" where a model succeeds only because the same question is asked 50 times in slightly different ways.
2. **Cross-Cluster Leakage:** Tasks may not neatly fit into orthogonal clusters. By computing soft-assignments or probabilities $P(C_k \mid q_i)$, we can allocate fractional weights, allowing complex multi-step reasoning tasks to contribute to the scores of multiple capabilities (e.g., a prompt requiring both Python coding and mathematical proofs).

> **Implementation:** Computed via graph-based adjacency matrices in `clawbench.evaluation.task_graph`.
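
One possible reading of the fractional allocation is that each task contributes to a cluster's score in proportion to its membership probability $P(C_k \mid q_i)$. The sketch below illustrates that reading only; the function name and the data are hypothetical, and `clawbench.evaluation.task_graph` may aggregate differently.

```python
def soft_cluster_scores(
    scores: list[float], memberships: list[dict[str, float]]
) -> dict[str, float]:
    """Compute S_k = sum_i P(C_k | q_i) * score_i / sum_i P(C_k | q_i)."""
    num: dict[str, float] = {}
    den: dict[str, float] = {}
    for score, probs in zip(scores, memberships):
        for cluster, p in probs.items():
            num[cluster] = num.get(cluster, 0.0) + p * score
            den[cluster] = den.get(cluster, 0.0) + p
    # Skip clusters that received no membership mass.
    return {k: num[k] / den[k] for k in den if den[k] > 0}


scores = [1.0, 0.0, 1.0]
memberships = [
    {"code": 1.0},
    {"math": 1.0},
    {"code": 0.5, "math": 0.5},  # a task needing both coding and a proof
]
S_soft = soft_cluster_scores(scores, memberships)
```

The mixed task lifts the math score from 0 to one third while leaving the code score at 1, so a single multi-capability prompt informs both strata.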

---

## 4. Pipeline Implementation: Debiasing Computation

The theoretical framework is operationalized through a series of analysis scripts designed to run sequentially after the core evaluation rollouts are complete:

* **`cluster_tasks_nlu.py`**: Embeds task instructions and clusters them into distinct semantic strata. Uses NLI models to verify similarity within clusters and builds the Task Similarity Graph.
* **`compute_distribution_weights.py`**: Compares the cluster assignments against a reference user distribution profile to compute the importance weights $\rho_i$ for each task.
* **`debiased_evaluation.py`**: Aggregates the raw execution traces and applies the computed importance weights to produce the final, debiased performance metrics.
* **`generate_reweighting_report.py`**: Renders the comparative diagnostics into a markdown summary (`EVAL_REPORT_DEBIASED.md`), highlighting which capabilities were inflated by dataset bias and presenting the true expected performance under user conditions.

---

## 5. Interpretation and Impact

Framing dataset evaluation through the lens of usage distributions guards against over-fitting capabilities to skewed benchmarks. By combining NLU-based task clusters with stratified IPW reweighting, we bring our metrics closer to the expected real-world performance of the agentic system, to the extent that the assumed user distribution is accurate.

This approach highlights a critical distinction: a model might be "State of the Art" on an arbitrary academic dataset, but severely underperform when re-weighted to match the exact operational footprint of an end-user.

---

## 6. Space-Time Decomposition

While the techniques described above debias single-step task success, they can also be combined with long-term dynamic metrics (the "Time" axis) to compute the expected real-world dynamical behavior of the agent. By applying the Radon-Nikodym derivatives ($\rho_i$) to temporal characteristics like Kaplan-Meier survival curves, Constraint Index $C(q)$, and regime probabilities (e.g., trapped, limit cycle, or wandering), we generate a **Space-Time Decomposition**.

This fusion applies the Hajek estimator to time-series properties, for example the probability of regime $r$:
$$ \mathbb{E}_{P}[\text{Regime} = r] \approx \frac{\sum_{i=1}^N \rho_{k_i} \mathbf{1}(\text{regime}_i = r)}{\sum_{i=1}^N \rho_{k_i}} $$
The result estimates how likely a model is to fall into an unrecoverable hallucination loop under actual user workload conditions.

> **Implementation:** Operationalized via `scripts/compute_debiased_dynamics.py` which takes the weights from this spatial framework and applies them to the outputs of the temporal dynamics framework.
4 changes: 4 additions & 0 deletions profiles/empirical_topic_distribution.json
@@ -0,0 +1,4 @@
{
"math": 0.80,
"code": 0.20
}
4 changes: 4 additions & 0 deletions profiles/radon_nikodym_weights.json
@@ -0,0 +1,4 @@
{
"math": 0.25,
"code": 4.0
}
4 changes: 4 additions & 0 deletions profiles/user_target_distribution.json
@@ -0,0 +1,4 @@
{
"math": 0.20,
"code": 0.80
}