Search-within-a-facet (facet_query typeahead) for high-cardinality facets beyond the maxFacetValues cap #533

@ddeboer

Context

The keyed DatasetFacets GraphQL surface returns a bounded set of facet buckets per field, capped by the engine’s maxFacetValues (CompileOptions.maxFacetValues → Typesense max_facet_values). The Dataset Register consumer sets this to a fixed ceiling (recently raised from 250 to 2000). Beyond the ceiling, Typesense returns only the top-N buckets by count and silently drops the rest — a facet value that exists in the data but ranks below the cutoff becomes unreachable through the UI, with no signal that anything is missing.

The value is not client-parameterizable: maxFacetValues is set once at engine construction and is not part of the SearchQuery IR or the GraphQL schema, so every request gets the deployment’s fixed ceiling. The consumer’s facet search box is currently client-side only — it filters the already-fetched top-N buckets, so it cannot reach anything the cap dropped.
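The limitation can be illustrated with a small sketch. The `Bucket` type, both helper functions, and the sample data are hypothetical and are not the consumer's actual code; they only mimic a capped bucket list followed by a client-side filter.

```typescript
// Illustrative only: Bucket and both helpers are hypothetical names.
interface Bucket {
  value: string;
  count: number;
}

// Mimics the engine cap: keep only the n buckets with the highest counts.
function topBuckets(all: Bucket[], n: number): Bucket[] {
  return [...all].sort((a, b) => b.count - a.count).slice(0, n);
}

// Mimics a client-side facet search box: substring match over fetched buckets.
function filterFetched(fetched: Bucket[], text: string): Bucket[] {
  const needle = text.trim().toLowerCase();
  return fetched.filter((b) => b.value.toLowerCase().includes(needle));
}

// Made-up sample data: "genealogy" exists but ranks below the cutoff.
const allBuckets: Bucket[] = [
  { value: 'geography', count: 120 },
  { value: 'history', count: 95 },
  { value: 'genealogy', count: 3 },
];

const fetched = topBuckets(allBuckets, 2);
// The filter only sees the fetched slice, so the low-ranked value is not found.
const hits = filterFetched(fetched, 'genea');
```

Here `hits` is empty even though `genealogy` exists in the data, because the filter never sees buckets the cap removed.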

Measured production cardinality (Dataset Register, ~2.6k datasets) shows the previous cap of 250 was genuinely hit: keyword ≈ 838 distinct values, class ≈ 786, publisher ≈ 303. Raising the cap to 2000 covers these today with headroom, so this is not urgent. It does not scale, though: a literal facet like keyword growing into the tens of thousands would make “return every bucket on every page load” too heavy for the per-facet fan-out (one engine search per facet, per page).

Proposal

Add a per-facet value-query (search-within-a-facet / typeahead) so a client can search a facet’s full value space without prefetching all buckets:

  • @lde/search — extend the query IR so a facet request can carry an optional value-query string (and optionally return a truncation signal / total distinct count, which Typesense already exposes as facet_counts[].stats.total_values).
  • @lde/search-typesense — compile that into Typesense facet_query, composed with the existing skip-own-filter per-facet search so typeahead counts still respect the other active filters.
  • @lde/search-api-graphql — expose the value-query in the schema, with any client-supplied bucket limit capped server-side (it is a public endpoint, so a client must not be able to request an unbounded number of buckets).
  • Consumer (Dataset Register) — promote the client-side facet filter to a server-backed lookup and add the UI wiring.
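A minimal sketch of how the IR extension and its compilation might look. Everything named here (`FacetRequest`, `ActiveFilter`, `clampMaxValues`, `compileFacetSearch`, the ceiling of 50) is an assumption made for illustration and is not the existing @lde/search API. Only the Typesense parameter names (`facet_by`, `facet_query`, `max_facet_values`, `filter_by`) are real, and the output covers just the facet-related subset of a search request.

```typescript
// Hypothetical IR shape; the real @lde/search types are not shown in this issue.
interface FacetRequest {
  field: string;
  valueQuery?: string; // optional typeahead text
  maxValues?: number; // client-requested bucket limit, untrusted
}

// Hypothetical representation of an active filter, keyed by the field it constrains.
interface ActiveFilter {
  field: string;
  expression: string; // already-compiled Typesense filter_by fragment
}

// Assumed server-side ceiling for typeahead responses.
const SERVER_MAX_FACET_VALUES = 50;

// Clamp a client-supplied limit into [1, ceiling]; fall back to the ceiling if absent or invalid.
function clampMaxValues(
  requested: number | undefined,
  ceiling: number = SERVER_MAX_FACET_VALUES,
): number {
  if (requested === undefined || !Number.isFinite(requested)) return ceiling;
  return Math.min(Math.max(1, Math.floor(requested)), ceiling);
}

// Build the facet-related Typesense parameters for one per-facet search.
function compileFacetSearch(
  facet: FacetRequest,
  filters: ActiveFilter[],
): Record<string, string | number> {
  const params: Record<string, string | number> = {
    q: '*',
    facet_by: facet.field,
    max_facet_values: clampMaxValues(facet.maxValues),
  };

  // Skip-own-filter: apply every active filter except the one on this facet's field.
  const others = filters.filter((f) => f.field !== facet.field);
  if (others.length > 0) {
    params.filter_by = others.map((f) => f.expression).join(' && ');
  }

  // Typesense expects facet_query as "<field>:<text>".
  const valueQuery = facet.valueQuery?.trim();
  if (valueQuery) {
    params.facet_query = `${facet.field}:${valueQuery}`;
  }
  return params;
}
```

Clamping in the compile step rather than in the client keeps the public endpoint bounded whatever limit a request asks for.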

Caveat: stored value vs. translated label

Typesense facet_query matches the stored facet value. This works directly for literal-valued facets (e.g. keyword), but not for IRI-valued facets whose labels are resolved/translated on the client (e.g. publisher, class) — a facet_query on the raw IRI will not match what the user types. Server-backed typeahead for those facets needs a searchable indexed label field first, so the natural first slice is the literal high-cardinality facets only.
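The split between facets that can and cannot use server-backed typeahead could be encoded roughly as below. The `FacetConfig` type, the `labelField` property, and the `publisher_label` field name are hypothetical. Whether an indexed label field would itself be faceted or searched some other way is left open here.

```typescript
// Hypothetical facet descriptor; labelField is an assumed name for an indexed, searchable label.
type FacetConfig =
  | { field: string; kind: 'literal' }
  | { field: string; kind: 'iri'; labelField?: string };

// Returns the stored field a value-query could match against, or null when
// the facet stores raw IRIs and no searchable label has been indexed.
function typeaheadTarget(facet: FacetConfig): string | null {
  if (facet.kind === 'literal') return facet.field;
  return facet.labelField ?? null;
}

const keywordTarget = typeaheadTarget({ field: 'keyword', kind: 'literal' });
const publisherTarget = typeaheadTarget({ field: 'publisher', kind: 'iri' });
```

With this shape, `keyword` is eligible immediately, while `publisher` stays on the client-side filter until a label field exists.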

Current mitigation

maxFacetValues raised to 2000 in the consumer, which covers present cardinality (max ≈ 838) with headroom. Optionally surface stats.total_values as a “showing X of Y” indicator so any future truncation is never silent.
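A sketch of the “showing X of Y” indicator, assuming the consumer has access to the raw Typesense `facet_counts` entry. The `FacetCounts` interface models only the response fields used here, and `truncationLabel` is a hypothetical helper name.

```typescript
// Subset of a Typesense facet_counts entry; only the fields used below are modelled.
interface FacetCounts {
  field_name: string;
  counts: { value: string; count: number }[];
  stats: { total_values?: number };
}

// Returns an indicator string when fewer buckets were returned than exist,
// or null when nothing was truncated (or the total is unknown).
function truncationLabel(facet: FacetCounts): string | null {
  const total = facet.stats.total_values;
  const shown = facet.counts.length;
  if (total === undefined || total <= shown) return null;
  return `Showing ${shown} of ${total}`;
}
```

Returning null for the untruncated case lets the UI omit the indicator entirely when all values are present.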

When to pick this up

When a literal (non-IRI) facet’s distinct cardinality grows large enough that returning its full bucket list on every page load becomes a payload/latency problem, or when product wants genuine search-within-a-facet UX rather than filtering a prefetched slice.
