explorer: 'Samples in View' counter is the fetch budget, not the real count (#201 Part 1) #206

@rdhyee

Originally filed as Part 1 of #201. Splitting out as a dedicated issue since #201 was closed by #203 / #205 (which fixed Part 2 only).

Symptom

The "Samples in View" stat box reads exactly 5,000 in dense regions, which is the value of DEFAULT_POINT_BUDGET at explorer.qmd:418. In Cyprus (lat ≈ 34.99, lng ≈ 33.70), a direct DuckDB query against data.isamples.org/isamples_202601_samples_map_lite.parquet returns 23,421 samples in a ±0.1° box, so the counter underreports by roughly 4.7x there. The cluster is one dense site (almost certainly Polis Excavations, OPENCONTEXT source).
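
For anyone wanting to re-check the density figure, here is a minimal sketch of how such a box count could be built. The helper name `boxCountSQL`, the `https://` scheme on the parquet URL, and the exact box center are assumptions for illustration; the issue does not record the precise query that was run, so this is not guaranteed to reproduce the same number.

```javascript
// Hypothetical helper (not in explorer.qmd): build a count(*) query for a
// square box of +/- halfWidthDeg degrees around a point.
function boxCountSQL(parquetUrl, lat, lng, halfWidthDeg) {
  // Fixed precision avoids floating-point noise such as 34.890000000000004.
  const f = (x) => x.toFixed(4);
  return [
    'SELECT count(*) AS n',
    `FROM read_parquet('${parquetUrl}')`,
    `WHERE latitude BETWEEN ${f(lat - halfWidthDeg)} AND ${f(lat + halfWidthDeg)}`,
    `  AND longitude BETWEEN ${f(lng - halfWidthDeg)} AND ${f(lng + halfWidthDeg)}`,
  ].join('\n');
}

// Assumed URL scheme and approximate center taken from the coordinates above.
const sql = boxCountSQL(
  'https://data.isamples.org/isamples_202601_samples_map_lite.parquet',
  34.99, 33.70, 0.1
);
console.log(sql);
```

The resulting string can be pasted into any DuckDB client that has HTTP parquet access enabled.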

Root cause

explorer.qmd:1530-1538 — the point-mode viewport query:

SELECT pid, label, source, latitude, longitude, place_name, result_time
FROM read_parquet('${lite_url}')
WHERE latitude BETWEEN ${padded.south} AND ${padded.north}
  AND longitude BETWEEN ${padded.west} AND ${padded.east}
  ${sourceFilterSQL('source')}
  ${facetFilterSQL()}
LIMIT 5000

explorer.qmd:1557:

updateStats('Samples', cachedData.length, cachedData.length, ..., 'Samples in View', 'Samples in View');

cachedData.length IS the row count of the LIMIT 5000 result. The counter therefore tops out at 5000 by construction.

Secondary smells:

  • No ORDER BY before LIMIT → which 5000 rows return is undefined (probably stable in DuckDB-on-parquet but not contractual).
  • The label says "in View" but the fetch uses a padded (30%) viewport (explorer.qmd:1514-1522). Even ignoring the cap, the meaning of the count is loose.
  • renderSamplePoints plots all of cachedData including rows outside the actual viewport.
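
To make the padded-versus-actual distinction concrete, here is a rough sketch of filtering fetched rows down to the unpadded viewport before counting them. The function name `rowsInViewport` and the `{ south, north, west, east }` shape of `view` are assumptions modeled on the `padded.*` fields in the query above, not code from explorer.qmd.

```javascript
// Hypothetical helper: keep only rows inside the real (unpadded) viewport.
// Assumes view = { south, north, west, east } in degrees and that the
// viewport does not cross the antimeridian (west <= east).
function rowsInViewport(rows, view) {
  return rows.filter((r) =>
    r.latitude >= view.south && r.latitude <= view.north &&
    r.longitude >= view.west && r.longitude <= view.east);
}

// Illustrative rows only: one inside the viewport, one in the padding margin.
const exampleRows = [
  { pid: 'a', latitude: 35.00, longitude: 33.70 },
  { pid: 'b', latitude: 35.25, longitude: 33.70 },
];
const exampleView = { south: 34.9, north: 35.1, west: 33.6, east: 33.8 };
const visible = rowsInViewport(exampleRows, exampleView);
console.log(visible.length); // 1
```

Counting `visible.length` instead of `cachedData.length` would tighten the label's meaning, though it still would not lift the 5000-row cap.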

Fix directions (from Codex retrospective on #203)

In rough order of effort:

  1. Honest relabel (cheapest): change the label to "Samples Loaded (max N)" and wire the budget value into the label. Counter stops lying.
  2. Compute real count alongside: a fast SELECT count(*) against the same WHERE (no LIMIT) is cheap on the lite parquet via DuckDB-WASM range reads. Display "X loaded / Y total in view", with explicit signaling when Y > X.
  3. Adaptive aggregation: if real count > budget, fall back to a cluster-style representation or surface a "too dense to render individually — Y samples here" affordance.
  4. Add ORDER BY pid to the point query so the 5000 subset is at least deterministic across browsers and sessions.

Direction 2 (real-count alongside) is probably the right user-visible answer; direction 4 is independent and could ship with any of the others.
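
A possible shape for directions 2 and 4 together, as a sketch only: both queries are derived from one shared WHERE clause so the count and the point fetch cannot drift apart, and the label signals when the real count exceeds what was loaded. `buildViewportQueries` and `formatSamplesInView` are hypothetical names, and `whereSQL` stands in for the viewport and filter predicates already assembled in explorer.qmd.

```javascript
// Hypothetical helper: derive the count query and the point query from the
// same WHERE clause. ORDER BY pid makes the loaded subset deterministic.
function buildViewportQueries(liteUrl, whereSQL, budget) {
  const from = `FROM read_parquet('${liteUrl}') WHERE ${whereSQL}`;
  const columns = 'pid, label, source, latitude, longitude, place_name, result_time';
  return {
    countSQL: `SELECT count(*) AS n ${from}`,
    pointsSQL: `SELECT ${columns} ${from} ORDER BY pid LIMIT ${budget}`,
  };
}

// Hypothetical helper: build the stat-box text from the number of rows
// loaded, the real in-view total, and the fetch budget.
function formatSamplesInView(loaded, total, budget) {
  const fmt = (n) => n.toLocaleString('en-US');
  if (total > loaded) {
    return `${fmt(loaded)} loaded / ${fmt(total)} total in view (capped at ${fmt(budget)})`;
  }
  return `${fmt(total)} in view`;
}

console.log(formatSamplesInView(5000, 23421, 5000));
// 5,000 loaded / 23,421 total in view (capped at 5,000)
```

One trade-off to weigh: ORDER BY forces DuckDB to consider every matching row before applying the limit, so the point query may get slower in dense regions than the unordered version.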

Acceptance

  • Counter accurately represents the in-view sample count, or is unambiguously labeled as a capped/loaded count.
  • Cyprus deep-link (#v=1&lat=34.9957&lng=33.6798&alt=15212&mode=point) shows a number that does not silently understate the real density.
  • No regression in cluster-mode "Samples in View" (which already counts viewport intersections correctly).
