Skip to content

Enabling efficient paginated listing of df.instances for external clients #146

@erdemtuna

Description

@erdemtuna

Hi everyone,

We've started to build Durable Pipelines in in the VS Code PostgreSQL extension. We hit a few rough edges in df.instances that we'd love your help smoothing. They all share one root cause: external clients (like us) can't paginate df.instances efficiently today.

The first durable surface our users will see in the MVP is a paginated Instances list sourced straight from df.instances. We identified three independent gaps surfaced. They land cleanly in any order, so we've ranked them by impact for convenience, but whichever you pick up first unblocks us.

Where we got stuck

There's no index on df.instances.created_at. The canonical list API df.list_instances has no pagination or total count support, and it omits timestamps. The random 8 character id can't serve as a proxy for chronological ordering either. So any client that wants a paginated, timestamped Instances list in chronological order has two options: (a) sort the full table on every page request, or (b) fetch a bounded window in one batch and paginate on the client as a workaround.

Root cause: three independent gaps

Gap 1: df.instances has no index for time ordering

Per src/lib.rs:190, 193, the only df.* indexes today are:

CREATE INDEX IF NOT EXISTS idx_instances_status ON df.instances(status);
CREATE INDEX IF NOT EXISTS idx_nodes_instance ON df.nodes(instance_id);

Notably absent: any index supporting ORDER BY created_at or completed_at. Every chronological query against df.instances requires a full table sort.

We considered using id as a proxy for chronological order to lean on the PK index. But short_id() (src/types.rs:91-101) keeps only the trailing 8 hex characters of a fresh UUIDv4. Fully random, no time component. So ORDER BY id is arbitrary, not chronological. No proxy is possible.

Gap 2: df.list_instances lacks pagination and total count

Per src/monitoring.rs:17-107, df.list_instances(status_filter, limit_count):

  • Runs SELECT id, label FROM df.instances ORDER BY created_at DESC LIMIT $N (gated by Gap 1).
  • Caps at 10 000 rows.
  • Has no OFFSET and no total count return value.
  • Enriches each row with a client.get_instance_info(id) call per row.

There's no canonical API a downstream client can call to render "Page X of Y".

Gap 3: df.list_instances omits timestamps

The function returns (instance_id, label, function_name, status, execution_count, output). It does not return created_at, completed_at, or duration. Any UI that wants to show "when did this run?" has to run a second query against df.instances to enrich. That's PK keyed, so it's cheap, but it's still an extra round trip on every list call. And the data is right there in the function.

Our current workaround

In MVP, we plan to:

  1. Call df.list_instances(NULL, 200) to fetch a bounded recent window (still gated by Gap 1, but only one sort per list call rather than one per page click).
  2. Enrich via SELECT id, created_at, completed_at FROM df.instances WHERE id = ANY($ids). PK keyed, fast (workaround for Gap 3).
  3. Paginate on the client through the 200 row window (Prev / current page / Next, page size 20).
  4. Surface a banner when the window cap is hit: "Showing the 200 most recent instances. Extended pagination ships in a later release."

Functional, but it caps user visible history at 200 rows and still incurs a full table sort on every list operation. We want to swap in real server side pagination as soon as the upstream supports it.

Proposed changes

1. Add a (created_at DESC, id) index on df.instances

CREATE INDEX IF NOT EXISTS idx_instances_created_at_desc_id
    ON df.instances (created_at DESC, id);

Smallest change with the largest payoff. Unblocks efficient ORDER BY created_at DESC for every downstream client, including the existing df.list_instances.

2. Add a paginated server side list function

Just a starting point to react to, we're flexible on the exact shape (keyset vs offset, total_count snapshot vs has_more, etc.):

df.list_instances_paginated(
    status_filter text DEFAULT NULL,
    limit_count   int  DEFAULT 100,
    after_cursor  text DEFAULT NULL
) RETURNS TABLE (
    instance_id     text,
    label           text,
    function_name   text,
    status          text,
    execution_count bigint,
    output          text,
    created_at      timestamptz,
    completed_at    timestamptz,
    total_count     bigint,
    next_cursor     text
);

Happy to align on whatever idiom fits the extension best.

3. Return created_at and completed_at from df.list_instances

The smallest of the three. Even if (2) is out of scope, this alone would eliminate our enrichment query.


If any of these are welcome but lower priority on your side, give us a shout. We're happy to send PRs.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions