# Cherry-pick #1344 to main: Fix ClickHouse OOM in `/internal/metrics` #1402
Conversation
## Summary

Fixes the Sentry `StackAssertionError: Failed to load monthly active users for internal metrics` crash (a ClickHouse OOM at the 7.2 GiB per-query cap) and, while here, applies two related optimizations to other queries in the same route. Also adds a local benchmark harness that validates correctness and measures peak memory / duration before and after.

## Root cause (the original Sentry error)

`loadMonthlyActiveUsers` was written as `SELECT user_id … GROUP BY user_id`, with the counting done in Node via a `Set`. On a large project, that ships millions of user_ids back to Node. Two failure modes stacked:

1. **Result materialization** — every distinct user_id had to be buffered on the server before streaming to Node (~20 MiB of result for 450k users; much more at real scale).
2. **`JSONExtract(toJSONString(data), 'is_anonymous', 'UInt8')`** — `toJSONString(data)` re-serializes the entire nested JSON column per row, billions of times, just to pull one boolean. This dominates bytes read.

Combined, on a single partition read from S3-backed MergeTree, this can exceed ClickHouse's 7.2 GiB per-query memory cap — exactly what the Sentry trace showed.

## Changes

### 1. Fix the MAU query (`loadMonthlyActiveUsers`)

Moved the counting to the server with `uniqExact(sipHash64(normalized_user_id))` and pulled the JS-side normalization (`lower`, `trim`, `isUuid`) into SQL. Picked `sipHash64` after benchmarking 7 variants — it's exact (at ≪2³² users) and halves the `uniqExact` hash state vs. raw string keys. A sketch of the reworked query's shape appears at the end of this section.

### 2. Replace `JSONExtract(toJSONString(data), …)` with a direct `CAST(data.is_anonymous, …)`

Applied everywhere the pattern appeared in the metrics route:

- `loadDailyActiveUsers`
- the `analyticsUserJoin` subquery
- the `nonAnonymousAnalyticsUserFilter`
- `analyticsOverview:topRegion`
- `analyticsOverview:online`

Semantics are preserved: `coalesce(CAST(data.is_anonymous, 'Nullable(UInt8)'), 0)` matches the `JSONExtract(…, 'UInt8')` behavior when the field is missing.

### 3. Server-aggregate the split queries

`loadDailyActiveUsersSplit` and `loadDailyActiveTeamsSplit` used to ship 1.2M+ `(day, user_id)` rows back to Node just so the JS could bucket them into new / retained / reactivated. Rewrote both as one CTE-style query that returns 31 rows (one per day in the 30-day window) with the counts precomputed.

**Minor semantic shift** (documented inline in `route.tsx`): "new" is now based on the user's first-ever `$token-refresh` event rather than their Postgres `signedUpAt`. The two agree for users who log in immediately after sign-up (the common case). They disagree for the rare edge case of an account that existed pre-window but never generated a `$token-refresh` until now — the old code classified it as "reactivated," the new code classifies it as "new." Judged acceptable; can be revisited.

The Postgres round-trips for `ProjectUser.signedUpAt` / `Team.createdAt` are no longer needed for the split, and the ~76 MiB wire ship is gone.

### 4. Benchmark harness (`apps/backend/scripts/benchmark-internal-metrics.ts`)

Local-only tool with three modes:

- **MAU equivalence matrix** — 13 edge cases (empty, dedup, anonymous filter, window boundary, null user_id, non-UUID user_id, case variation, project isolation, missing/null `is_anonymous`, wrong event_type). Asserts the OLD pipeline and the NEW query return the **same set** of users, not just the same count.
- **MAU perf** — OLD vs. NEW plus 6 other candidate variants (inline regex, UUID keys, sipHash64, HLL sketches); reads `memory_usage` / `read_rows` / `result_bytes` from `system.query_log` for each and prints a ranked table.
- **Full-route benchmark** (`BENCH_ROUTE_QUERIES=1`) — runs every ClickHouse query in `/internal/metrics` in three stages (BEFORE, AFTER, candidate OPTIMIZED) against the same seed and prints per-query deltas plus endpoint-level totals.

Seeds under a synthetic `project_id` so real data is never touched; cleans up on exit via `ALTER TABLE … DELETE`.
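For concreteness, here is a sketch of the reworked MAU query's shape. It is assembled from the description above and the diff excerpts further down this page rather than copied from `route.tsx`; in particular, the `lower`/`trimBoth` normalization expression is an assumption:

```sql
-- Sketch (assumed shape): MAU counted entirely server-side.
-- Node receives one integer instead of millions of user_ids.
SELECT uniqExact(sipHash64(normalized_user_id)) AS mau
FROM (
    SELECT lower(trimBoth(assumeNotNull(user_id))) AS normalized_user_id  -- JS normalization moved into SQL (assumed expression)
    FROM analytics_internal.events
    WHERE event_type = '$token-refresh'
      AND project_id = {projectId:String}
      AND branch_id = {branchId:String}
      AND user_id IS NOT NULL
      AND event_at >= {since:DateTime}
      AND event_at < {untilExclusive:DateTime}
      -- direct subcolumn read replaces per-row JSONExtract(toJSONString(data), ...):
      AND ({includeAnonymous:UInt8} = 1
           OR coalesce(CAST(data.is_anonymous, 'Nullable(UInt8)'), 0) = 0)
)
WHERE match(normalized_user_id, {uuidRe:String})  -- the isUuid check, moved into SQL
```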
## Benchmark results

### MAU query alone

Ran at two scales; set-equality verified (the new query identifies the same individual users, not just the same count).

| seed | MAU | peak memory (old → new) | bytes read | duration |
|---|---|---|---|---|
| 500k events | 89,939 | 158.7 MiB → 46.7 MiB (**3.4×**, −70%) | 175.7 MiB → 63.0 MiB (2.8×) | 483 ms → 76 ms (**6.4×**) |
| 2.5M events | 449,990 | 439.2 MiB → 281.4 MiB (1.56×, −36%) | 865.0 MiB → 310.9 MiB (2.8×) | 783 ms → 126 ms (**6.2×**) |

MAU variant bake-off at 2.5M events (exact variants are all set-equal to OLD; v5/v6 are approximate):

| variant | memory | duration | notes |
|---|---|---|---|
| v0_old (baseline) | 440 MiB | 567 ms | — |
| v1_uniqExact_string | 284 MiB | 110 ms | naive fix |
| v3_uniqExact_toUUID | 244 MiB | 153 ms | UUID keys, slower per-row |
| **v4_uniqExact_sipHash64** | **125 MiB** | **95 ms** | **shipped** |
| v5_uniq (HLL, approx) | 30 MiB | 86 ms | −0.25% error |
| v6_uniqCombined (approx) | 31 MiB | 67 ms | −0.15% error |

### Full `/internal/metrics` route (2.7M events; 300k users + page-views + clicks + teams)

Ranked by BEFORE peak memory:

| query | mem BEFORE | mem AFTER | Δ mem | dur BEFORE | dur AFTER | Δ dur |
|---|---|---|---|---|---|---|
| analyticsOverview:topReferrers | 588.1 MiB | 411.1 MiB | 1.43× | 1833 ms | 110 ms | **16.66×** |
| analyticsOverview:totalVisitors | 584.3 MiB | 403.5 MiB | 1.45× | 1829 ms | 121 ms | 15.12× |
| analyticsOverview:dailyEvents | 584.1 MiB | 403.7 MiB | 1.45× | 1897 ms | 140 ms | 13.55× |
| loadUsersByCountry | 393.1 MiB | 385.4 MiB | ≈same | 74 ms | 80 ms | ≈same |
| loadDailyActiveUsersSplit | 363.4 MiB | 396.8 MiB | *+9%* | 1966 ms | 356 ms | 5.52× |
| analyticsOverview:topRegion | 269.9 MiB | 106.4 MiB | 2.54× | 1602 ms | 65 ms | 24.65× |
| loadDailyActiveUsers | 268.3 MiB | 84.0 MiB | 3.19× | 1111 ms | 44 ms | 25.25× |
| loadDailyActiveTeamsSplit | 59.6 MiB | 78.1 MiB | *+31%* | 70 ms | 123 ms | *+76%* |
| loadMonthlyActiveUsers | 54.9 MiB | 54.9 MiB | ≈same | 68 ms | 56 ms | ≈same |
| analyticsOverview:online | 18.4 MiB | 5.8 MiB | 3.17× | 58 ms | 4 ms | 14.50× |

**Endpoint-level totals**

| metric | BEFORE | AFTER | Δ |
|---|---|---|---|
| Sum of peak ClickHouse memory | 3.11 GiB | 2.28 GiB | **−27%** |
| **Max query duration** (endpoint wall-clock floor) | **1966 ms** | **356 ms** | **−82%** (5.5×) |
| Sum of query durations (total CPU) | 10508 ms | 1099 ms | **−90%** (9.6×) |
| Bytes read | 10.70 GiB | 4.55 GiB | −57% |
| Bytes shipped to Node | 94.8 MiB | 44.2 KiB | **−99.95%** |

Both split queries show a small memory *regression* at this seed size: the new server-side window-function + self-join has its own state cost that's near break-even with "materialize and ship" at 300k users. At prod scale the ~76 MiB ship saving dominates, and duration is unambiguously better either way.
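A sketch of the general shape of the new server-side split query, for reference. This is inferred from the `w`/`f` aliases and the `lagInFrame` + `first_date` join referenced elsewhere in this PR; the exact bucket predicates and filters are assumptions, not the production SQL:

```sql
-- Sketch (assumed shape): classify each day's active entities server-side.
-- Returns ~31 pre-aggregated rows instead of 1.2M+ (day, user_id) rows.
SELECT
    day,
    countIf(first_date = day)                        AS new,         -- first-ever $token-refresh is today
    countIf(first_date < day AND prev_day = day - 1) AS retained,    -- also active the previous day
    countIf(first_date < day AND prev_day < day - 1) AS reactivated  -- active before, but not the previous day
FROM (
    SELECT
        entity_id,
        day,
        lagInFrame(day) OVER (PARTITION BY entity_id ORDER BY day) AS prev_day
    FROM (
        SELECT assumeNotNull(user_id) AS entity_id, toDate(event_at) AS day
        FROM analytics_internal.events
        WHERE event_type = '$token-refresh' -- plus the same project/branch/window/anon filters as above
        GROUP BY entity_id, day
    )
) AS w
LEFT JOIN (
    SELECT assumeNotNull(user_id) AS entity_id, toDate(min(event_at)) AS first_date
    FROM analytics_internal.events
    WHERE event_type = '$token-refresh' -- plus the same filters, with an unbounded window start
    GROUP BY entity_id
) AS f USING (entity_id)
WHERE day >= {since:Date}
GROUP BY day
ORDER BY day
```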
## Why we don't drop the `analyticsUserJoin` in this PR

The benchmark includes an OPTIMIZED stage that drops the LEFT JOIN and trusts `e.data.is_anonymous` directly, which would shave another **1.2 GiB / 1.9× duration** off the endpoint. **But we can't ship that here** — an audit of the client tracker (`packages/js/src/lib/stack-app/apps/implementations/event-tracker.ts`) confirmed `is_anonymous` is never set on client-emitted `$page-view` / `$click` events. The JOIN is currently load-bearing. A follow-up PR will enrich `is_anonymous` at the batch ingest endpoint using `auth.user.is_anonymous`; after one metrics-window cycle (~30 days) the JOIN can be dropped.

## Follow-up work (out of scope for this PR)

- **Batch-endpoint enrichment** + drop the analytics-overview LEFT JOIN (est. a further −53% endpoint memory and −46% duration, per the benchmark).
- **Teams-split hash-variant count mismatch** — the `sipHash64(team_id)` variant of the teams split shows a count discrepancy vs. the string-keyed version in the benchmark. Not blocking, since the teams split is only #8 by memory, but it needs a root-cause pass before that particular optimization ships.
- **`loadUsersByCountry` window bound** — currently scans every `$token-refresh` event ever recorded for the tenancy (no time filter). Bounding to 30 days would stop memory growing with project age, but changes semantics ("country of latest login ever" → "country of latest login in the last 30 days"). Deferred because it's product-facing.

## Snapshot changes in `internal-metrics.test.ts.snap`

The `should return metrics data with users` test signs in 10 users today, then deletes one of them mid-test. Two small snapshot values change on today's date; both are just a reclassification of that single deleted user — the total (10 active users) is unchanged.

- **`daily_active_users_split.new[today]`: 9 → 10** — all 10 users really did sign in for the first time today. The old code only counted 9 because the deleted user's Postgres row was gone by the time the metrics query ran, so the old classifier couldn't see they were created today. The new query looks at ClickHouse events directly, sees that the deleted user's first event was today, and counts them as new like everyone else.
- **`daily_active_users_split.reactivated[today]`: 1 → 0** — no user was "reactivated" today; nobody was active on an earlier day and came back. The old "1" was the deleted user falling into this bucket by default (the old classifier had no other rule that fit them). The new code correctly reports zero.

Totals match either way (9 + 1 = 10 + 0): we're moving one deleted user out of the "returning visitor" bucket into the "brand-new user" bucket, which is what they actually were.

## Test plan

- [x] `pnpm typecheck` and `pnpm lint` pass on the backend package
- [x] MAU equivalence matrix: 13/13 cases return the same set of users (not just the same count) between the OLD and NEW pipelines
- [x] Set-equality verified at 500k-MAU perf scale
- [x] Full-route benchmark confirms the expected memory / duration improvements
- [ ] Sanity-check the dashboard rendering after deploy (split charts, MAU counter, analytics overview)
- [ ] Monitor Sentry for the assertion error — should drop to zero
## Summary by CodeRabbit

- **Performance Improvements**
  - Monthly and daily active metrics are now computed entirely server-side for faster queries and reduced client-side processing.
- **Bug Fixes**
  - More consistent handling of anonymous/missing IDs and stricter ID filtering to improve accuracy across edge cases.
- **Tests**
  - Added a comprehensive benchmark and validation harness to measure query performance and verify result equivalence across variants.
## Greptile Summary

This cherry-pick fixes the ClickHouse `MEMORY_LIMIT_EXCEEDED` OOM behind the Sentry `Failed to load monthly active users for internal metrics` error by moving the MAU count server-side and optimizing the other `/internal/metrics` queries. The one snapshot change (the deleted user shifts from "reactivated" → "new") is intentional and documented; a related semantic shift, where the new query derives "new" from an entity's first `$token-refresh` event rather than Postgres `signedUpAt`, is flagged in Issue 2 below.

**Confidence Score: 4/5.** Safe to merge; fixes a live OOM with no correctness regressions beyond one documented and two minor semantic edge cases. All findings are P2: one benchmark incongruence (the matrix validates v1, not the shipped v4 sipHash64) and one undocumented semantic shift in "new" classification for anonymous→non-anonymous transitions. No data loss, no broken contracts, and totals are preserved in both cases. P2s alone cap confidence at 4/5.

Files to check: `apps/backend/scripts/benchmark-internal-metrics.ts` — `NEW_QUERY` in `runNew`/`runNewSet` should mirror the production sipHash64 query; `apps/backend/src/app/api/latest/internal/metrics/route.tsx` — confirm the intended behaviour of `first_date` classification when `anonFilter` is applied.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Client
    participant Route as /internal/metrics route
    participant CH as ClickHouse
    participant PG as Postgres (replica)

    Note over Route,PG: BEFORE (OOM path)
    Client->>Route: GET /internal/metrics
    Route->>CH: SELECT assumeNotNull(user_id) GROUP BY user_id
    CH-->>Route: N x 36-byte user_id rows (7.2 GiB cap exceeded)
    Route->>PG: SELECT projectUserId, signedUpAt IN (activeUserIds)
    PG-->>Route: user creation dates
    Route-->>Client: split response

    Note over Route,PG: AFTER (this PR)
    Client->>Route: GET /internal/metrics
    Route->>CH: SELECT uniqExact(sipHash64(normalized_user_id)) AS mau
    CH-->>Route: single integer row
    Route->>CH: loadDailyActiveSplitFromClickhouse (lagInFrame + LEFT JOIN first_date)
    CH-->>Route: pre-aggregated new/retained/reactivated counts per day
    Note over PG: No Prisma fan-out for split classification
    Route-->>Client: split response
```
**Prompt To Fix All With AI**

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
apps/backend/scripts/benchmark-internal-metrics.ts:127-141
**Benchmark validates v1, not the shipped v4**
`NEW_QUERY` (used by `runNew` and `runNewSet`) uses `uniqExact(normalized_user_id)` — string-based exact count. The production `loadMonthlyActiveUsers` ships `uniqExact(sipHash64(normalized_user_id))` (v4). The 13-case equivalence matrix in `runMatrix` therefore validates v1 against the old query, not the variant actually running in production. The v4 hash variant is tested in `runPerf` via the `VARIANTS` array, but that is a statistical perf comparison, not a logical equivalence check. If a dataset ever produced a `sipHash64` collision (64-bit, birthday-paradox probability ~N²/2⁶⁵), the matrix would not catch the discrepancy. Consider making `NEW_QUERY` match `loadMonthlyActiveUsers` exactly, or adding a v4-specific equivalence assertion in `runPerf`.
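One concise way to close this gap, sketched below: have the harness run a single query that computes both the exact string-keyed count and the sipHash64-keyed count over the same normalized input, and assert they match. This is an illustrative shape, not the script's actual code; the inner subquery mirrors the diff excerpt further down, and the normalization expression is an assumption:

```sql
-- Sketch (assumed shape): v4-specific equivalence assertion.
-- If v4_hashed != v1_exact, sipHash64 collided on this dataset.
SELECT
    uniqExact(normalized_user_id)            AS v1_exact,
    uniqExact(sipHash64(normalized_user_id)) AS v4_hashed
FROM (
    SELECT lower(trimBoth(assumeNotNull(user_id))) AS normalized_user_id
    FROM analytics_internal.events
    WHERE event_type = '$token-refresh'
      AND project_id = {projectId:String}
      -- ... remaining filters identical to the production MAU query
)
WHERE match(normalized_user_id, {uuidRe:String})
```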
### Issue 2 of 2
apps/backend/src/app/api/latest/internal/metrics/route.tsx:244-256
**`first_date` bounded by `anonFilter` changes "new" semantics for anonymous→non-anonymous users**
The `f` subquery that computes `first_date` applies the same `${anonFilter}` as `w`, so when `includeAnonymous = false`, `first_date` is the date of the entity's first *non-anonymous* `$token-refresh` event (not their actual sign-up date). A user who spent months as anonymous and then converted will appear as "new" on the day of their first non-anonymous event inside the window, even though their Postgres `signedUpAt` is much older. The old code looked up `signedUpAt` from Postgres for this classification. The PR documents the deleted-user reclassification in the snapshot, but this analogous case (anonymous→non-anonymous transition) could silently inflate "new" counts. Worth verifying this is the intended behaviour and documenting it alongside the existing snapshot note.
Diff context for Issue 1 (`apps/backend/scripts/benchmark-internal-metrics.ts`):

```ts
      AND project_id = {projectId:String}
      AND branch_id = {branchId:String}
      AND user_id IS NOT NULL
      AND event_at >= {since:DateTime}
      AND event_at < {untilExclusive:DateTime}
      AND ({includeAnonymous:UInt8} = 1 OR coalesce(CAST(data.is_anonymous, 'Nullable(UInt8)'), 0) = 0)
    )
    WHERE match(normalized_user_id, {uuidRe:String})
`;

type QueryParams = {
  projectId: string,
  branchId: string,
  since: Date,
  untilExclusive: Date,
```
Diff context for Issue 2 (`apps/backend/src/app/api/latest/internal/metrics/route.tsx`):

```sql
LEFT JOIN (
    SELECT
        assumeNotNull(${idCol}) AS entity_id,
        toDate(min(event_at)) AS first_date
    FROM analytics_internal.events
    WHERE event_type = '$token-refresh'
      AND project_id = {projectId:String}
      AND branch_id = {branchId:String}
      AND ${idCol} IS NOT NULL
      AND event_at < {untilExclusive:DateTime}
      ${anonFilter}
    GROUP BY entity_id
) AS f USING (entity_id)
```
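If the anonymous→non-anonymous behaviour is judged unintended, one possible fix (a sketch under that assumption, not committed code) is to drop `${anonFilter}` from this `f` subquery only, so that `first_date` reflects the entity's true first event while the windowed activity in `w` stays filtered:

```sql
-- Sketch (assumed): first_date over ALL $token-refresh events, so an
-- anonymous-then-converted user keeps their original first-seen date.
LEFT JOIN (
    SELECT
        assumeNotNull(${idCol}) AS entity_id,
        toDate(min(event_at)) AS first_date
    FROM analytics_internal.events
    WHERE event_type = '$token-refresh'
      AND project_id = {projectId:String}
      AND branch_id = {branchId:String}
      AND ${idCol} IS NOT NULL
      AND event_at < {untilExclusive:DateTime}
      -- (anonFilter intentionally omitted here)
    GROUP BY entity_id
) AS f USING (entity_id)
```

Either behaviour changes what "new" means for entities whose only pre-window events were anonymous, so whichever is chosen should be documented alongside the existing snapshot note.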
## Summary

Cherry-pick of #1344 (`85ae4b1c9`) from `dev` onto `main` to address user-reported ClickHouse errors on `/internal/metrics` without merging the rest of `dev`.

The original PR fixes the `MEMORY_LIMIT_EXCEEDED` (7.2 GiB cap) hit by `loadMonthlyActiveUsers` and applies related optimizations to other queries on the same route. Endpoint-level: −27% peak memory, −82% max query duration, −99.95% bytes shipped to Node. See #1344 for the full root-cause analysis, benchmark methodology, and results.

## Why cherry-pick rather than merge dev

Users are hitting the OOM now and we don't want to ship the other 42 commits on `dev` to `main` yet. The patch applied cleanly — no other dev commits between the main/dev divergence point and #1344 touched `apps/backend/src/app/api/latest/internal/metrics/route.tsx` or the snapshot file.

## Files

- `apps/backend/src/app/api/latest/internal/metrics/route.tsx` — query rewrites
- `apps/e2e/tests/backend/endpoints/api/v1/__snapshots__/internal-metrics.test.ts.snap` — one documented snapshot reclassification (the deleted user moves from "reactivated" to "new"; totals match either way — see #1344, "Fix ClickHouse OOM in MAU query + optimize /internal/metrics route")
- `apps/backend/scripts/benchmark-internal-metrics.ts` — local-only benchmark harness (new file)

## Test plan

- [ ] Monitor Sentry for `Failed to load monthly active users for internal metrics` — should drop to zero