Sampling Cache: ensure consolidated values are reported; cache all sites, not only >threshold#5783
Conversation
above_threshold_only? = Keyword.get(opts, :above_threshold_only?, true)

case super(key, opts) do
  result when is_integer(result) and above_threshold_only? and result >= @threshold ->
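For context, the guard being discussed behaves roughly like this. This is a simplified sketch, not the actual Plausible source: the module name, the plain-map cache, and the `@threshold` constant are all illustrative assumptions.

```elixir
# Sketch of the cache lookup guard under discussion (names and the
# map-based cache are illustrative, not the exact implementation).
defmodule SamplingCacheSketch do
  # Assumed 30-day event threshold above which a site is considered
  # for sampling (~10m in the discussion below).
  @threshold 10_000_000

  def get(key, cache, opts \\ []) do
    above_threshold_only? = Keyword.get(opts, :above_threshold_only?, true)

    case Map.get(cache, key) do
      result when is_integer(result) and above_threshold_only? and result >= @threshold ->
        # Site is over the threshold: report its 30d event count.
        result

      _ ->
        # Under the threshold (or not cached): no traffic estimate,
        # so sampling is effectively skipped for this site.
        nil
    end
  end
end
```

The review comments below question exactly this early exclusion: a site just under the 30d threshold yields `nil` regardless of how long the queried period is.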
Hmm I don't really understand why this conditional is there in the first place. It looks like if a site has 9m events in the last 30d, it is excluded from sampling?
So if we're querying 3 years of data, the traffic estimate would be 3 * 12 * 9m = 324m, but due to this early exclusion we wouldn't apply sampling? If true, it doesn't make much sense to me.
Sorry I know not exactly a comment on the changes made in this PR which preserves existing behaviour, it just stood out to me when reviewing.
> Hmm I don't really understand why this conditional is there in the first place. It looks like if a site has 9m events in the last 30d, it is excluded from sampling?
Correct, it's how it works right now.
> So if we're querying 3 years of data, the traffic estimate would be 3 * 12 * 9m = 324m but due to this early exclusion we wouldn't apply sampling? If true, it doesn't make much sense to me.
No, if we're querying 3 years of data, the traffic estimate would be nil, because the site only has 9m events in the last 30d. Only sites with >10m are currently included in sampling, which is a very small number of sites if you query SamplingCache.size() on prod...
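The arithmetic behind the concern can be made concrete. This is a hypothetical back-of-envelope extrapolation, not code from the PR:

```elixir
# Hypothetical extrapolation: scale a 30d event count to a 3-year
# query range. Values mirror the numbers in the discussion above.
events_30d = 9_000_000
months_queried = 3 * 12
estimate = events_30d * months_queried
# estimate is 324_000_000, far above the ~10m sampling threshold,
# yet the site never enters the cache because 9m < 10m over 30 days,
# so the lookup returns nil and no sampling is applied.
```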
@zoldar WYT? having now all 30d values, can we make the estimate any more accurate?
I think that when populating the cache, we should filter by a fraction of the sample threshold, something like:
`having: selected_as(:events_ingested) > ^(Sampling.default_sample_threshold() / 12)` - we'd still skip sampling for ranges where the estimate goes below the default threshold, but we'd account for at least the most common long-term queries.
Though, on second thought, this does not save us from very long period queries against sites just under that 30d threshold.
To really address that, we'd have to somehow account for `sites.stats_start_date` in that cache refresh query - I'm not really sure how yet.
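The proposed refresh-query filter might look roughly like this. This is an Ecto sketch only: the table name, column names, and the 30-day window expression are assumptions, not the PR's actual query.

```elixir
# Sketch of the proposed cache refresh query with the /12 fraction
# filter (Ecto syntax; table and column names are illustrative).
import Ecto.Query

threshold = Sampling.default_sample_threshold()

from(e in "events",
  where: e.timestamp >= ago(30, "day"),
  group_by: e.site_id,
  select: %{site_id: e.site_id, events_ingested: count(e.site_id)},
  # Keep any site whose 30d traffic could exceed the threshold when
  # extrapolated to roughly a year-long query range (30d * 12 ≈ 360d).
  having: count(e.site_id) > ^div(threshold, 12)
)
```

The `/12` divisor trades cache size for coverage: the cache grows to include sites at a twelfth of the threshold, so that year-scale extrapolations can still cross it.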
If it's doable, then definitely yes 👍 Then we would lower the risk of missing sites just under the threshold. It's still not an ironclad guarantee, as there might be sites with a very seasonal traffic pattern that fly under the radar during quieter months, but still, it would be an improvement for sure.
okay, I'll work on that here
Looks good to me 👍 Have you checked the time/resources it takes to execute the cache query in production now?
The first query can go as high as 6s; the second seems to be cached and finishes in around 1s. No problem, I guess.
45d4389 to
34d6967
Changes
Current sampling issues aside, for a consolidated view consisting of n sites, none of which exceeds the sampling threshold alone, a complete sum must be calculated.
The cache refresh query has been changed, and luckily it now runs much faster (in the 1s ballpark). So, at the small expense of extra ETS memory, we've got all site IDs available for any estimation efforts.
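With all site counts cached, the consolidated estimate can simply sum the per-site values. A minimal sketch, assuming a plain-map cache and a hypothetical module/function name:

```elixir
# Sketch: the traffic estimate for a consolidated view is the sum of
# per-site cached 30d counts (names and cache shape are hypothetical).
defmodule ConsolidatedEstimate do
  def traffic_estimate(site_ids, cache) do
    site_ids
    |> Enum.map(&Map.get(cache, &1, 0))
    |> Enum.sum()
  end
end

# Three sites, each under the 10m threshold individually:
# ConsolidatedEstimate.traffic_estimate(
#   [1, 2, 3],
#   %{1 => 4_000_000, 2 => 5_000_000, 3 => 3_000_000}
# )
# => 12_000_000 - above the threshold, so sampling should apply to the
#    consolidated view even though no single site would qualify.
```

This is exactly why caching only above-threshold sites was insufficient: the sum over n sub-threshold sites can itself exceed the threshold.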
https://3.basecamp.com/5308029/buckets/43891605/card_tables/cards/9147369994
- Tests
- Changelog
- Documentation
- Dark mode