
Sampling Cache: ensure consolidated values are reported; cache all sites, not only >threshold#5783

Merged: aerosol merged 3 commits into master from sampling-cache-all on Oct 20, 2025

Member

@aerosol aerosol commented Oct 7, 2025

Changes

Current sampling issues aside, for a consolidated view consisting of n sites, none of which exceeds the sampling threshold alone, a complete sum must be calculated.
The cache refresh query has been changed, and luckily the change makes it run much faster (1s ballpark now). So at the tiny expense of extra ETS memory, we now have all site IDs available for any estimation efforts.
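
To illustrate why the complete sum matters, here is a minimal sketch and not the actual implementation. The module name, the plain map standing in for the ETS-backed cache, and the 10m threshold value are all assumptions made for the example.

```elixir
defmodule ConsolidatedEstimateSketch do
  # Assumed value; the discussion below mentions a 10m cutoff for 30d traffic.
  @threshold 10_000_000

  # `cache` is a plain map of site_id => events ingested in the last 30 days,
  # standing in for the real cache. Sites missing from the map count as 0.
  def consolidated_30d(cache, site_ids) do
    site_ids
    |> Enum.map(&Map.get(cache, &1, 0))
    |> Enum.sum()
  end

  # True when the consolidated 30d traffic reaches the assumed threshold.
  def above_threshold?(cache, site_ids) do
    consolidated_30d(cache, site_ids) >= @threshold
  end
end

cache = %{1 => 4_000_000, 2 => 4_000_000, 3 => 4_000_000}

# No single site reaches 10m, but the consolidated view does.
IO.inspect(ConsolidatedEstimateSketch.consolidated_30d(cache, [1, 2, 3]))
# prints 12000000
IO.inspect(ConsolidatedEstimateSketch.above_threshold?(cache, [1, 2, 3]))
# prints true
```

If only sites above the threshold were cached, all three lookups here would miss and the sum could not be computed.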

https://3.basecamp.com/5308029/buckets/43891605/card_tables/cards/9147369994

Tests

  • Automated tests have been added
  • This PR does not require tests

Changelog

  • Entry has been added to changelog
  • This PR does not make a user-facing change

Documentation

  • Docs have been updated
  • This change does not need a documentation update

Dark mode

  • The UI has been tested both in dark and light mode
  • This PR does not change the UI

above_threshold_only? = Keyword.get(opts, :above_threshold_only?, true)

case super(key, opts) do
result when is_integer(result) and above_threshold_only? and result >= @threshold ->
Contributor

Hmm I don't really understand why this conditional is there in the first place. It looks like if a site has 9m events in the last 30d, it is excluded from sampling?

So if we're querying 3 years of data, the traffic estimate would be 3 * 12 * 9m = 324m but due to this early exclusion we wouldn't apply sampling? If true, it doesn't make much sense to me.

Sorry, I know this is not exactly a comment on the changes made in this PR, which preserve existing behaviour; it just stood out to me when reviewing.
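
The arithmetic in the question can be spelled out as a small sketch. The module and function names are hypothetical, and treating one month as one 30-day window is the same simplification the 3 * 12 * 9m figure makes.

```elixir
defmodule PeriodEstimateSketch do
  # Scales a 30-day event count to a query period given in months,
  # treating each month as one 30-day window.
  def estimate(events_30d, months)
      when is_integer(events_30d) and is_integer(months) do
    events_30d * months
  end
end

# 9m events per 30 days, queried over 3 years (36 months).
IO.inspect(PeriodEstimateSketch.estimate(9_000_000, 3 * 12))
# prints 324000000
```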

Member Author

> Hmm I don't really understand why this conditional is there in the first place. It looks like if a site has 9m events in the last 30d, it is excluded from sampling?

Correct, it's how it works right now.

Member Author

> So if we're querying 3 years of data, the traffic estimate would be 3 * 12 * 9m = 324m but due to this early exclusion we wouldn't apply sampling? If true, it doesn't make much sense to me.

No, if we're querying 3 years of data, the traffic estimate would be nil, because 9m is below the cutoff. Only sites with >10m events in the last 30d are currently included in sampling, which is a very small number of sites if you query SamplingCache.size() on prod...
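
A rough sketch of the lookup behaviour described here, built around the guard from the diff. Only the first clause comes from the reviewed code; the module name, the map standing in for the cache, the 10m threshold value, and the second and third clauses are assumptions made for illustration.

```elixir
defmodule ThresholdLookupSketch do
  # Assumed value, based on the 10m figure mentioned in this thread.
  @threshold 10_000_000

  # `cache` is a plain map of site_id => 30d event count.
  def get(cache, site_id, opts \\ []) do
    above_threshold_only? = Keyword.get(opts, :above_threshold_only?, true)

    case Map.get(cache, site_id) do
      # Default mode: only values at or above the threshold are reported.
      result when is_integer(result) and above_threshold_only? and result >= @threshold ->
        result

      # Assumed clause: with the option disabled, any cached value is returned.
      result when is_integer(result) and not above_threshold_only? ->
        result

      # Assumed clause: everything else yields no estimate.
      _ ->
        nil
    end
  end
end

cache = %{1 => 9_000_000, 2 => 12_000_000}

IO.inspect(ThresholdLookupSketch.get(cache, 1))
# prints nil
IO.inspect(ThresholdLookupSketch.get(cache, 2))
# prints 12000000
IO.inspect(ThresholdLookupSketch.get(cache, 1, above_threshold_only?: false))
# prints 9000000
```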

Member Author

@zoldar WDYT? Now that we have all 30d values, can we make the estimate any more accurate?

Contributor

I think that when populating the cache, we should filter by a fraction of the sample threshold, something like:

having: selected_as(:events_ingested) > ^(Sampling.default_sample_threshold() / 12)

This way we'd still skip sampling for ranges where the estimate falls below the default threshold, but we'd account for at least the most common long-term queries.

Though, on second thought, this does not save us from very long period queries against sites just under that 30d threshold.

To really address that, we'd have to somehow account for sites.stats_start_date in that cache refresh query. I'm not sure how yet, though.
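
As a plain-Elixir stand-in for the `having` filter suggested above (not an Ecto query), the fractional cutoff could be sketched like this. The module name, the map input, and the 10m default threshold value are assumptions; the real value comes from Sampling.default_sample_threshold().

```elixir
defmodule FractionalCutoffSketch do
  # Assumed stand-in for Sampling.default_sample_threshold().
  @default_sample_threshold 10_000_000

  # The 30d cutoff is a fraction of the threshold; 12 mirrors the "/ 12" idea.
  def cutoff(divisor \\ 12), do: @default_sample_threshold / divisor

  # Keeps only sites whose 30d event count exceeds the fractional cutoff.
  def sites_to_cache(events_by_site, divisor \\ 12) do
    limit = cutoff(divisor)

    events_by_site
    |> Enum.filter(fn {_site_id, events} -> events > limit end)
    |> Map.new()
  end
end

events = %{1 => 9_000_000, 2 => 900_000, 3 => 500_000}

# The cutoff is roughly 833k, so sites 1 and 2 are kept and site 3 is dropped.
IO.inspect(FractionalCutoffSketch.sites_to_cache(events))
```

With a divisor of 12, a site needs roughly a twelfth of the threshold in 30 days to be cached, so a 12-month query on it can still be estimated against the full threshold.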

Contributor

@zoldar zoldar Oct 7, 2025

If it's doable, then definitely yes 👍 Then we would lower the risk of missing sites just under the threshold. It's still not an ironclad guarantee, as there might be sites with a very seasonal traffic pattern that fly under the radar during quieter months, but it would be an improvement for sure.

Member Author

okay, I'll work on that here

Contributor

Looks good to me 👍 Have you checked the time/resources it takes to execute the cache query in production now?

Member Author

@aerosol aerosol Oct 8, 2025

First query can go as high as 6s, second seems to be cached and finishes in around 1s. No problem I guess.

@aerosol aerosol force-pushed the sampling-cache-all branch from 45d4389 to 34d6967 Compare October 8, 2025 06:57
@aerosol aerosol changed the title from "Sampling Cache: ensure consolidated values above threshold are reported" to "Sampling Cache: ensure consolidated values above threshold are reported; cache all sites, not only >threshold" Oct 20, 2025
@aerosol aerosol added this pull request to the merge queue Oct 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 20, 2025
@aerosol aerosol changed the title from "Sampling Cache: ensure consolidated values above threshold are reported; cache all sites, not only >threshold" to "Sampling Cache: ensure consolidated values are reported; cache all sites, not only >threshold" Oct 20, 2025
@aerosol aerosol added this pull request to the merge queue Oct 20, 2025
Merged via the queue into master with commit 5976a6a Oct 20, 2025
16 checks passed
@aerosol aerosol deleted the sampling-cache-all branch November 24, 2025 12:09