Rolled-back df.start() leaves failed duroxide.executions rows that inflate df.metrics()

## Summary

`df.metrics()` appears to count failed rows in `duroxide.executions`, including orphan executions left behind when the transaction that called `df.start()` is rolled back. `df.instances` and `df.list_instances('failed')`, by contrast, only show persisted workflow instances.

As a result, the failed-instance count disagrees across public APIs after a rolled-back start.

## Observed

Tested against current `main` at `11ac64e3adb64c14386be5c737b3a3806d873fc4`.

After rollback-oriented tests, the counts diverged:

```text
source               total  completed  failed  running
df.metrics()         399    392        7       0
df.instances         396    392        4       0
duroxide.executions  399    392        7       0
```

The extra failed rows were in `duroxide.executions` with no matching row in `df.instances`:

```sql
SELECT
  e.instance_id,
  e.execution_id,
  e.status,
  left(e.output, 180) AS output_prefix,
  i.id AS df_instance_id
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed'
  AND i.id IS NULL
ORDER BY e.instance_id;
```

Example output prefix:

```text
Instance <id> not found after 5s (transaction may have been rolled back)
```

So `df.metrics()` reports these as failed instances even though `df.instances` and `df.list_instances('failed')` do not expose them as failed workflow instances.
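In the table above, both the total and the failed count differ by 3 between `df.metrics()` and `df.instances` (399 vs 396, 7 vs 4), which would be consistent with three orphan executions. A minimal sketch to check that directly, assuming the same columns as the query above (the `orphan_failed_executions` alias is made up for this example):

```sql
-- Count failed executions that have no matching persisted instance.
-- If orphans fully explain the gap, this should equal the difference
-- between the two failed counts in the table above (7 - 4).
SELECT count(*) AS orphan_failed_executions
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed'
  AND i.id IS NULL;
```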

## Repro Shape

One way to trigger this is to start a workflow inside a transaction that later rolls back, wait for the worker to observe the missing instance, then compare the metrics API with `df.instances`.

```sql
BEGIN;
SELECT df.start('SELECT 1', 'rollback-metrics-probe');
ROLLBACK;

-- wait long enough for the worker to record the missing-instance failure;
-- the failure message mentions a 5s lookup, so 10s here is only a guess
SELECT pg_sleep(10);

SELECT * FROM df.metrics();

SELECT status, count(*)
FROM df.instances
GROUP BY status;

SELECT e.instance_id, e.status, e.output, i.id AS df_instance_id
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed'
  AND i.id IS NULL;
```

## Expected

Either:

- `df.metrics()` should count the same persisted workflow instances that `df.instances` / `df.list_instances()` expose, or
- the docs should clearly state that `df.metrics().failed_instances` includes lower-level failed `duroxide.executions`, including orphan executions created by rolled-back starts.

For dashboards and alerting, the current behavior makes rollback probes look like durable workflow failures.
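As an illustration of the first option, a failed count restricted to persisted instances could look like the following. This is only a sketch of the intended semantics, not the actual implementation behind `df.metrics()`. It assumes PostgreSQL (for the `FILTER` clause) and the `instance_id`, `status`, and `id` columns used in the queries above; the `failed_persisted` and `failed_orphan` aliases are hypothetical.

```sql
-- Split failed executions by whether a persisted instance exists.
-- DISTINCT counts instances rather than executions, in case one
-- instance has more than one failed execution row.
SELECT
  count(DISTINCT e.instance_id) FILTER (WHERE i.id IS NOT NULL) AS failed_persisted,
  count(DISTINCT e.instance_id) FILTER (WHERE i.id IS NULL)     AS failed_orphan
FROM duroxide.executions e
LEFT JOIN df.instances i ON i.id = e.instance_id
WHERE e.status = 'Failed';
```

Reporting the two numbers separately would let dashboards alert on `failed_persisted` while still surfacing rolled-back starts.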

## Notes

From a quick source read, `df.metrics()` appears to come from the generated `get_system_metrics()` path and counts failed rows in `duroxide.executions`. That explains why it can diverge from `df.instances` after rollback.

