Context
The keyed DatasetFacets GraphQL surface returns a bounded set of facet buckets per field, capped by the engine’s maxFacetValues (CompileOptions.maxFacetValues → Typesense max_facet_values). The Dataset Register consumer sets this to a fixed ceiling (recently raised from 250 to 2000). Beyond the ceiling, Typesense returns only the top-N buckets by count and silently drops the rest — a facet value that exists in the data but ranks below the cutoff becomes unreachable through the UI, with no signal that anything is missing.
The value is not client-parameterizable: maxFacetValues is set once at engine construction and is not part of the SearchQuery IR or the GraphQL schema, so every request gets the deployment’s fixed ceiling. The consumer’s facet search box is currently client-side only — it filters the already-fetched top-N buckets, so it cannot reach anything the cap dropped.
Measured production cardinality (Dataset Register, ~2.6k datasets) shows the earlier cap of 250 was genuinely hit: keyword ≈ 838 distinct values, class ≈ 786, publisher ≈ 303. Raising the cap to 2000 covers these today with headroom, so this is not urgent. It does not scale, though: a literal facet like keyword growing into the tens of thousands would make “return every bucket on every page load” too heavy for the per-facet fan-out (one engine search per facet, per page).
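To make the failure mode concrete, here is a minimal sketch of why a client-side search box cannot recover a value the cap dropped. The `FacetBucket` shape, both helper functions, and the sample data are hypothetical illustrations, not the actual @lde/search types or production values.

```typescript
// Hypothetical bucket shape; not the actual @lde/search types.
interface FacetBucket {
  value: string;
  count: number;
}

// Mimics the engine-side cap: keep only the top-N buckets by count.
function topBuckets(all: FacetBucket[], maxFacetValues: number): FacetBucket[] {
  return [...all].sort((a, b) => b.count - a.count).slice(0, maxFacetValues);
}

// Mimics the consumer's client-side search box: it only sees fetched buckets.
function filterFetched(fetched: FacetBucket[], needle: string): FacetBucket[] {
  const n = needle.toLowerCase();
  return fetched.filter((b) => b.value.toLowerCase().includes(n));
}

// Made-up data for illustration only.
const allKeywords: FacetBucket[] = [
  { value: "archaeology", count: 40 },
  { value: "maps", count: 25 },
  { value: "photographs", count: 12 },
  { value: "numismatics", count: 1 }, // rare value, ranks below the cutoff
];

const fetched = topBuckets(allKeywords, 3);
const hits = filterFetched(fetched, "numis");
console.log(hits.length); // 0: the value exists in the data but was dropped by the cap
```

The same search against the full value space would find the bucket, which is the gap a server-backed value-query is meant to close.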
Proposal
Add a per-facet value-query (search-within-a-facet / typeahead) so a client can search a facet’s full value space without prefetching all buckets:
- @lde/search: extend the query IR so a facet request can carry an optional value-query string (and optionally return a truncation signal or total distinct count, which Typesense already exposes as facet_counts[].stats.total_values).
- @lde/search-typesense: compile that into Typesense facet_query, composed with the existing skip-own-filter per-facet search so typeahead counts still respect the other active filters.
- @lde/search-api-graphql: expose the value-query in the schema, with any client-supplied bound clamped server-side (it is a public endpoint, so a client must not be able to request an unbounded result).
- Consumer (Dataset Register): promote the client-side facet filter to a server-backed lookup and add the UI wiring.
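The compile step could look roughly like the sketch below. `FacetRequest`, `compileFacetSearch`, `SERVER_MAX_FACET_VALUES`, and the `limit` field are hypothetical names chosen for illustration, and the example filter string is made up; only the Typesense parameter names (`q`, `facet_by`, `facet_query`, `max_facet_values`, `filter_by`) are taken from Typesense itself.

```typescript
// Hypothetical IR extension; field names are assumptions, not the @lde/search API.
interface FacetRequest {
  field: string;
  valueQuery?: string; // optional search-within-a-facet string
  limit?: number; // client-supplied bound, clamped server-side
}

// Subset of the Typesense search parameters used by this sketch.
interface TypesenseFacetParams {
  q: string;
  facet_by: string;
  max_facet_values: number;
  facet_query?: string;
  filter_by?: string;
}

const SERVER_MAX_FACET_VALUES = 2000; // assumed deployment ceiling

function compileFacetSearch(
  request: FacetRequest,
  otherFilters: string[], // filters of every facet except request.field (skip-own-filter)
): TypesenseFacetParams {
  const requested =
    request.limit !== undefined && Number.isFinite(request.limit)
      ? Math.trunc(request.limit)
      : SERVER_MAX_FACET_VALUES;
  const params: TypesenseFacetParams = {
    q: "*",
    facet_by: request.field,
    // Clamp to [1, ceiling] so a public client cannot ask for an unbounded list.
    max_facet_values: Math.min(Math.max(1, requested), SERVER_MAX_FACET_VALUES),
  };
  const valueQuery = request.valueQuery?.trim();
  if (valueQuery) {
    // Typesense expects "<field>:<query>" for facet_query.
    params.facet_query = `${request.field}:${valueQuery}`;
  }
  if (otherFilters.length > 0) {
    params.filter_by = otherFilters.join(" && ");
  }
  return params;
}

const params = compileFacetSearch(
  { field: "keyword", valueQuery: "arch", limit: 50000 },
  ["language:=nl"], // hypothetical active filter on another facet
);
console.log(params.facet_query, params.max_facet_values);
```

Because the facet's own filter is left out of `otherFilters`, the typeahead counts keep respecting the other active filters, as the per-facet search already does.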
Caveat: stored value vs. translated label
Typesense facet_query matches the stored facet value. This works directly for literal-valued facets (e.g. keyword), but not for IRI-valued facets whose labels are resolved/translated on the client (e.g. publisher, class) — a facet_query on the raw IRI will not match what the user types. Server-backed typeahead for those facets needs a searchable indexed label field first, so the natural first slice is the literal high-cardinality facets only.
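The mismatch can be illustrated with a small sketch. `matchesStoredValue` is a deliberately simplified stand-in (case-insensitive substring) and not Typesense's actual matching logic; the IRI, label, and literal value are made-up examples.

```typescript
// Client-side label lookup for an IRI-valued facet (illustrative data).
const publisherLabels = new Map<string, string>([
  ["https://example.org/org/42", "National Archives"],
]);

// Simplified stand-in for server-side matching against the stored facet value.
function matchesStoredValue(storedValue: string, typed: string): boolean {
  return storedValue.toLowerCase().includes(typed.toLowerCase());
}

const typed = "archives";
const storedIri = "https://example.org/org/42";

// Literal facet: the stored value is the text the user sees, so it matches.
const literalMatch = matchesStoredValue("archives and records", typed);

// IRI facet: the stored value is the IRI, which does not contain the typed label.
const iriMatch = matchesStoredValue(storedIri, typed);

// Only a lookup against the resolved label would match, and the server
// cannot do that unless the label is indexed alongside the IRI.
const labelMatch = matchesStoredValue(publisherLabels.get(storedIri) ?? "", typed);

console.log(literalMatch, iriMatch, labelMatch);
```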
Current mitigation
maxFacetValues raised to 2000 in the consumer, which covers present cardinality (max ≈ 838) with headroom. Optionally surface stats.total_values as a “showing X of Y” indicator so any future truncation is never silent.
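A minimal sketch of the indicator, assuming the consumer has access to the relevant `facet_counts` entry of the Typesense response. `FacetCount` models only the fields used here, and `truncationLabel` is a hypothetical helper name.

```typescript
// Subset of a Typesense facet_counts entry; only the fields used below.
interface FacetCount {
  field_name: string;
  counts: { value: string; count: number }[];
  stats?: { total_values?: number };
}

// Returns a "Showing X of Y" label when buckets were truncated, otherwise null.
function truncationLabel(facet: FacetCount): string | null {
  const shown = facet.counts.length;
  const total = facet.stats?.total_values;
  if (total === undefined || total <= shown) return null;
  return `Showing ${shown} of ${total}`;
}

// Made-up response fragment for illustration.
const truncated: FacetCount = {
  field_name: "keyword",
  counts: [
    { value: "maps", count: 25 },
    { value: "photographs", count: 12 },
  ],
  stats: { total_values: 838 },
};

console.log(truncationLabel(truncated)); // "Showing 2 of 838"
```

Returning `null` when nothing was cut keeps the UI quiet in the common case and only surfaces the label when buckets are actually missing.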
When to pick this up
When a literal (non-IRI) facet’s distinct cardinality grows large enough that returning its full bucket list on every page load becomes a payload/latency problem, or when product wants genuine search-within-a-facet UX rather than filtering a prefetched slice.
Related
multi_search) — a value-query typeahead would ride the same per-facet fan-out.