Move internal metrics queries to ClickHouse replica#1463

Merged
mantrakp04 merged 4 commits into dev from metrics-clickhouse-migration on May 21, 2026

@mantrakp04
Collaborator

@mantrakp04 mantrakp04 commented May 21, 2026

Summary

  • Move loadTotalUsers, loadAuthOverview, and loadRecentlyActiveUsers off direct Postgres queries to read from the ClickHouse analytics_internal tables.
  • Route the remaining projectUser.findMany reads in loadActiveUsersByCountry and loadRecentlyActiveUsers through $replica().
  • loadRecentlyActiveUsers falls back to an empty list on ClickHouse query failure (captured via captureError) rather than failing the whole metrics endpoint.

Test plan

  • Hit the internal metrics endpoint on a tenancy with users/teams and confirm totals, daily series, and recently-active users match the previous Postgres-backed numbers.
  • Verify the 30-day daily-users series fills zero-activity days correctly.
  • Simulate a ClickHouse failure for the recently-active query and confirm the endpoint still responds with the rest of the payload.

Summary by CodeRabbit

  • Bug Fixes & Improvements
    • Improved metrics aggregation for more consistent reporting.
    • More accurate active-user and total-user time series with missing days zero-filled.
    • Authentication overview updated with clearer counts for verified, unverified, and anonymous users.
    • Performance improvements: recently-active and overview calculations run more efficiently and in parallel.

Switches loadTotalUsers, loadAuthOverview, and loadRecentlyActiveUsers
to read from the ClickHouse analytics tables instead of hitting Postgres
directly, and routes the remaining Postgres reads through $replica().
@vercel

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
stack-auth-hosted-components | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-auth-mcp | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-auth-skills | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-backend | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-dashboard | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-demo | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-docs | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-preview-backend | Ready | Preview, Comment | May 21, 2026 9:30pm
stack-preview-dashboard | Ready | Preview, Comment | May 21, 2026 9:30pm

@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 57ebe67f-3a1b-41f3-9480-b88121ee865a

📥 Commits

Reviewing files that changed from the base of the PR and between 2fcc281 and 9738d04.

📒 Files selected for processing (1)
  • apps/backend/src/app/api/latest/internal/metrics/route.tsx

📝 Walkthrough

The internal metrics route now reads selected Postgres lookups from Prisma's read replica and computes daily totals and auth overview aggregates from ClickHouse within {since, untilExclusive}, mapping daily results into a fixed 31-day series with zero-filled missing days.

Changes

Internal metrics aggregation data source migration

Layer / File(s) Summary
Read replica for country aggregation
apps/backend/src/app/api/latest/internal/metrics/route.tsx
loadActiveUsersByCountry now uses Prisma's read replica (prisma.$replica().projectUser.findMany) for the Postgres enrichment call; ClickHouse selection remains.
Daily user count aggregation from ClickHouse
apps/backend/src/app/api/latest/internal/metrics/route.tsx
loadTotalUsers now queries ClickHouse analytics_internal.users FINAL constrained by {since, untilExclusive}, groups counts by day, maps results into a fixed-length 31-day output, and zero-fills missing days.
Recently active users read-replica switch
apps/backend/src/app/api/latest/internal/metrics/route.tsx
loadRecentlyActiveUsers switches its Postgres projectUser.findMany call to use the Prisma read replica (prisma.$replica()); ordering-by-lastActiveAt and take: 5 behavior is preserved.
Auth overview totals from ClickHouse
apps/backend/src/app/api/latest/internal/metrics/route.tsx
loadAuthOverview replaces tenancy-specific Postgres $queryRaw totals with ClickHouse aggregations over analytics_internal.users FINAL, analytics_internal.teams FINAL, and analytics_internal.contact_channels FINAL to compute total, anonymous, verified/unverified users and total teams; daily/monthly split loaders remain parallel.
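
The zero-filling described for loadTotalUsers can be illustrated with a minimal sketch. It is not the code from route.tsx: the `DailyCountRow` shape, the `zeroFillDailySeries` helper, and the sample rows are hypothetical, and the sketch assumes the query returns one row per UTC day that had at least one signup.

```typescript
// Hypothetical row shape: one row per UTC day that had at least one signup.
type DailyCountRow = { day: string; count: number };

// Format a Date as a UTC calendar day (YYYY-MM-DD).
function toUtcDay(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// Build a fixed-length series ending on `lastDay` (inclusive), using 0 for
// every day that is absent from the sparse query result.
function zeroFillDailySeries(
  rows: DailyCountRow[],
  lastDay: Date,
  days: number,
): DailyCountRow[] {
  const countsByDay = new Map<string, number>();
  for (const row of rows) countsByDay.set(row.day, row.count);

  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  const lastUtcMidnight = Date.UTC(
    lastDay.getUTCFullYear(),
    lastDay.getUTCMonth(),
    lastDay.getUTCDate(),
  );

  const series: DailyCountRow[] = [];
  for (let offset = days - 1; offset >= 0; offset--) {
    const day = toUtcDay(new Date(lastUtcMidnight - offset * MS_PER_DAY));
    series.push({ day, count: countsByDay.get(day) ?? 0 });
  }
  return series;
}

// Two sparse rows expand into a 31-entry series; the gap day gets a zero.
const filledSeries = zeroFillDailySeries(
  [
    { day: "2026-05-19", count: 3 },
    { day: "2026-05-21", count: 1 },
  ],
  new Date("2026-05-21T12:00:00Z"),
  31,
);
```

Keying the lookup on a UTC day string keeps the JS-side loop independent of the server's local time zone, which matters if the ClickHouse query also groups by UTC day.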

Sequence Diagram(s)

sequenceDiagram
  participant Client as MetricsRoute (API)
  participant CH as ClickHouse
  participant PR as Prisma.$replica()
  Client->>CH: windowed aggregations (users, teams, contacts) / per-day signed_up_at counts
  CH-->>Client: aggregated results
  Client->>PR: projectUser.findMany enrichment (user IDs) for country/recent lists
  PR-->>Client: enriched ProjectUser records
  Client->>Client: assemble response (zero-fill 31-day series, compute auth splits)
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hexclave/stack-auth#1420: Updates internal-metrics e2e test expectations and polling behavior to match ClickHouse-backed active-user query semantics.
  • hexclave/stack-auth#1457: Modifies the same route module to constrain ClickHouse-backed aggregations to the metrics window using event_at bounds and windowed semantics.

Suggested reviewers

  • N2D4
  • nams1570

Poem

🐰 I hop through ClickHouse rows and replica streams,
Counting days and filling zeros in dreams.
Five recent hops, countries traced anew,
Totals tallied clean, auth splits in view.
A carrot of metrics, crunchy and true.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title directly and clearly describes the main change: migrating internal metrics queries to use ClickHouse replica instead of direct Postgres queries.
Description check | ✅ Passed | The description covers the key changes (migration to ClickHouse, replica routing, error handling), includes a test plan with specific validation steps, and is well-structured and complete.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@greptile-apps
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR migrates loadTotalUsers and loadAuthOverview from direct Postgres queries to ClickHouse analytics_internal tables, and routes the projectUser.findMany calls in loadActiveUsersByCountry and loadRecentlyActiveUsers through $replica(). The ClickHouse queries add proper zero-filling for the 30-day series and run in parallel via Promise.all.

  • loadTotalUsers: Replaced a Postgres window-function query with a ClickHouse aggregate + a JS-side zero-fill loop; uses getClickhouseAdminClientForMetrics() consistently with the rest of the file.
  • loadAuthOverview: Split the single Postgres multi-sub-select into two parallel ClickHouse queries (users and teams) plus the existing parallel helpers; downstream arithmetic (unverified_users = nonAnonymousTotal − verifiedNonAnonymousUsers) holds as long as the ETL keeps the two tables in sync.
  • Replica routing: loadActiveUsersByCountry and loadRecentlyActiveUsers now use prisma.$replica() to offload read traffic.
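
The downstream arithmetic mentioned for loadAuthOverview can be sketched as follows. This is an illustration only: the `UserAggregates` and `AuthSplit` shapes and the `computeAuthSplit` helper are hypothetical, and the clamp to zero is an added assumption to show one way to tolerate the two tables briefly drifting out of sync, not something the PR is stated to do.

```typescript
// Hypothetical aggregate shape, as it might come back from the users query.
type UserAggregates = {
  totalUsers: number;
  anonymousUsers: number;
  verifiedNonAnonymousUsers: number;
};

type AuthSplit = {
  nonAnonymousTotal: number;
  verifiedUsers: number;
  unverifiedUsers: number;
  anonymousUsers: number;
};

function computeAuthSplit(agg: UserAggregates): AuthSplit {
  const nonAnonymousTotal = agg.totalUsers - agg.anonymousUsers;
  // unverified = nonAnonymousTotal - verifiedNonAnonymousUsers.
  // Assumption: clamp at zero so a lagging table cannot yield a negative count.
  const unverifiedUsers = Math.max(
    0,
    nonAnonymousTotal - agg.verifiedNonAnonymousUsers,
  );
  return {
    nonAnonymousTotal,
    verifiedUsers: agg.verifiedNonAnonymousUsers,
    unverifiedUsers,
    anonymousUsers: agg.anonymousUsers,
  };
}

// In-sync tables: 10 total, 2 anonymous, 5 verified, so 3 are unverified.
const inSyncSplit = computeAuthSplit({
  totalUsers: 10,
  anonymousUsers: 2,
  verifiedNonAnonymousUsers: 5,
});

// Drifted tables: more verified users than non-anonymous users are reported.
const driftedSplit = computeAuthSplit({
  totalUsers: 10,
  anonymousUsers: 2,
  verifiedNonAnonymousUsers: 9,
});
```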

Confidence Score: 4/5

Safe to merge; the ClickHouse migrations are logically correct and the parallel Promise.all structure is sound, but whether NULL signed_up_at values in the ETL are coalesced before landing in ClickHouse determines whether the daily-users series undercounts.

The ClickHouse queries are well-structured, zero-filling is handled correctly in JS, and the replica routing additions are straightforward. The open question is whether the ETL pipeline populates signed_up_at via COALESCE(signedUpAt, createdAt) — if not, users whose signedUpAt was NULL in Postgres will be silently dropped from the loadTotalUsers chart, reproducing an undercount that the old Postgres query did not have.

apps/backend/src/app/api/latest/internal/metrics/route.tsx — specifically the signed_up_at filter in loadTotalUsers and whether the ETL guarantees that column is always non-NULL.
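
The undercount concern can be reproduced with a small in-memory sketch. It does not model the real ETL or ClickHouse schema: the `UserRow` shape and `countByDay` helper are hypothetical, and the second call stands in for a `COALESCE(signedUpAt, createdAt)` applied before the rows are counted.

```typescript
// Hypothetical user row; signedUpAt may be NULL in the source table.
type UserRow = { id: string; signedUpAt: Date | null; createdAt: Date };

// Count rows per UTC day using whichever timestamp `pick` selects.
// A null timestamp mirrors filtering on a NULL column: the row is dropped.
function countByDay(
  rows: UserRow[],
  pick: (row: UserRow) => Date | null,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const timestamp = pick(row);
    if (timestamp === null) continue;
    const day = timestamp.toISOString().slice(0, 10);
    counts.set(day, (counts.get(day) ?? 0) + 1);
  }
  return counts;
}

function totalCount(counts: Map<string, number>): number {
  return Array.from(counts.values()).reduce((sum, n) => sum + n, 0);
}

const sampleUsers: UserRow[] = [
  { id: "a", signedUpAt: new Date("2026-05-20T08:00:00Z"), createdAt: new Date("2026-05-20T08:00:00Z") },
  { id: "b", signedUpAt: null, createdAt: new Date("2026-05-21T09:00:00Z") },
];

// Without coalescing, user "b" silently disappears from the series.
const withoutCoalesce = countByDay(sampleUsers, (row) => row.signedUpAt);

// With the equivalent of COALESCE(signedUpAt, createdAt), both users count.
const withCoalesce = countByDay(sampleUsers, (row) => row.signedUpAt ?? row.createdAt);
```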

Important Files Changed

Filename | Overview
apps/backend/src/app/api/latest/internal/metrics/route.tsx | Migrates loadTotalUsers and loadAuthOverview to ClickHouse and adds $replica() routing; logic is sound, though a NULL signed_up_at in the ETL will silently drop rows from the daily-users series (flagged in a prior thread).

Sequence Diagram

sequenceDiagram
    participant Handler as GET /internal/metrics
    participant CH as ClickHouse (analytics_internal)
    participant PG as Postgres Replica

    Handler->>CH: loadTotalUsers (users FINAL, 30-day window)
    CH-->>Handler: daily signup counts

    Handler->>CH: loadAuthOverview usersRow (users FINAL)
    Handler->>CH: loadAuthOverview teamsRow (teams FINAL)
    Handler->>CH: loadDailyActiveUsersSplit / loadDailyActiveTeamsSplit / loadMAU
    CH-->>Handler: aggregate counts + daily splits

    Handler->>PG: loadActiveUsersByCountry ($replica)
    PG-->>Handler: user rows for GeoIP enrichment

    Handler->>PG: loadRecentlyActiveUsers ($replica, top-5 by lastActiveAt)
    PG-->>Handler: 5 most recently active users

Reviews (3): Last reviewed commit: "Fix remaining getClickhouseAdminClient r..."

Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx Outdated
Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx
Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/src/app/api/latest/internal/metrics/route.tsx`:
- Around line 544-545: The loadRecentlyActiveUsers function re-computes its own
timestamp causing inconsistent 30-day windows; change its signature to accept a
request-scoped Date (e.g., now: Date) and replace new Date() with that parameter
when calling getMetricsWindowBounds, then update all callers (including the
other occurrence noted around line 1585) to pass the same request-scoped now so
all loaders share the identical metrics window; reference
loadRecentlyActiveUsers(tenancy: Tenancy, includeAnonymous: boolean = false) and
Tenancy to locate and update the function and its call sites.
- Around line 550-573: The ClickHouse query in route.tsx that builds
`recently_active` currently filters events by `event_at >= {since}` and
`event_at < {untilExclusive}`, changing semantics by excluding users whose last
`$token-refresh` is older than 30 days; update the query used to compute
recently_active in the clickhouseClient.query call (the template string with
SELECT assumeNotNull(user_id) AS user_id, max(event_at) AS last_active ...) to
remove the time window filters (`AND event_at >= {since}` and `AND event_at <
{untilExclusive}`) or otherwise ensure it uses an unbounded time span so it
returns the latest RECENTLY_ACTIVE_USERS_LIMIT users regardless of age (you can
keep the other filters like projectId, branchId, includeAnonymous and keep using
formatClickhouseDateTimeParam elsewhere but do not restrict the token-refresh
selection by since/untilExclusive for this recently_active computation).
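
The first finding, sharing one request-scoped timestamp across loaders, can be sketched like this. The real signature of getMetricsWindowBounds is not shown in this thread, so `windowBoundsFor` below is a hypothetical stand-in, and the two loaders are simplified placeholders that only return the window they would query.

```typescript
type MetricsWindow = { since: Date; untilExclusive: Date };

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical stand-in for getMetricsWindowBounds: a window of `days` UTC
// days ending at the start of the UTC day after `now`.
function windowBoundsFor(now: Date, days: number = 30): MetricsWindow {
  const startOfToday = Date.UTC(
    now.getUTCFullYear(),
    now.getUTCMonth(),
    now.getUTCDate(),
  );
  const untilExclusive = new Date(startOfToday + MS_PER_DAY);
  const since = new Date(untilExclusive.getTime() - days * MS_PER_DAY);
  return { since, untilExclusive };
}

// Placeholder loaders: each takes the request-scoped `now` as a parameter
// instead of calling `new Date()` itself.
function loadTotalUsersWindow(now: Date): MetricsWindow {
  return windowBoundsFor(now);
}

function loadRecentlyActiveWindow(now: Date): MetricsWindow {
  return windowBoundsFor(now);
}

// The handler captures `now` once and passes the same value to every loader,
// so a request that straddles midnight cannot produce two different windows.
function handleMetricsRequest(now: Date = new Date()) {
  return {
    totalUsers: loadTotalUsersWindow(now),
    recentlyActive: loadRecentlyActiveWindow(now),
  };
}

const sharedWindows = handleMetricsRequest(new Date("2026-05-21T23:59:59.999Z"));
```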
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7618506d-37a7-495d-9388-512a4f11872d

📥 Commits

Reviewing files that changed from the base of the PR and between b8fc04b and 156427e.

📒 Files selected for processing (1)
  • apps/backend/src/app/api/latest/internal/metrics/route.tsx

Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx Outdated
Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR migrates internal admin metrics endpoints away from direct Postgres aggregation queries by reading user/team/auth aggregates from the ClickHouse analytics_internal tables, while also routing remaining Postgres reads through the Prisma replica client. It also makes the “recently active users” portion resilient to ClickHouse query failures by returning an empty list instead of failing the entire endpoint.

Changes:

  • Moved loadTotalUsers and loadAuthOverview aggregates from Postgres SQL to ClickHouse queries over analytics_internal.*.
  • Updated remaining Postgres lookups in metrics to use prisma.$replica() (e.g., user joins for country/live/recent lists).
  • Reworked “recently active users” to be driven by ClickHouse $token-refresh activity, with a ClickHouseError-only fallback to [].
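
A fallback that swallows only ClickHouse failures, as described in the last bullet, might look like the sketch below. It is a sketch under stated assumptions: `ClickHouseQueryError` is a locally defined placeholder rather than the error class exported by any ClickHouse client, and `reportError` stands in for an error-capture hook such as the captureError mentioned in the PR description.

```typescript
// Hypothetical error type standing in for whatever the ClickHouse client throws.
class ClickHouseQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ClickHouseQueryError";
  }
}

function isClickHouseQueryError(error: unknown): error is ClickHouseQueryError {
  return error instanceof Error && error.name === "ClickHouseQueryError";
}

type RecentUser = { id: string; lastActiveAt: Date };

// Run the query; on a ClickHouse failure report it and return [], but let
// every other error (for example a programming bug) propagate to the caller.
async function loadRecentUsersWithFallback(
  query: () => Promise<RecentUser[]>,
  reportError: (label: string, error: unknown) => void,
): Promise<RecentUser[]> {
  try {
    return await query();
  } catch (error) {
    if (isClickHouseQueryError(error)) {
      reportError("recently-active-users", error);
      return [];
    }
    throw error;
  }
}
```

Narrowing the catch to one error type keeps the endpoint resilient to analytics outages without hiding unrelated failures behind an empty list.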

Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx Outdated
Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx Outdated
Removed ClickHouse query fallback and directly utilized Prisma for fetching recently active users. The function now orders results by last active date and limits the output to the top 5 users, improving performance and simplifying error handling.
Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/backend/src/app/api/latest/internal/metrics/route.tsx (1)

324-324: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix missing/incorrect ClickHouse client usage in metrics route.

  • loadTotalUsers calls getClickhouseAdminClient() (~line 324), but apps/backend/src/app/api/latest/internal/metrics/route.tsx only imports getClickhouseAdminClientForMetrics (no import for getClickhouseAdminClient), so getClickhouseAdminClient is undefined here.
  • Same issue in loadAuthOverview (~line 1389).
  • Update both call sites to getClickhouseAdminClientForMetrics() or explicitly import getClickhouseAdminClient and document why the non-metrics client is required.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/src/app/api/latest/internal/metrics/route.tsx` at line 324, The
code calls getClickhouseAdminClient() in loadTotalUsers and loadAuthOverview but
only imports getClickhouseAdminClientForMetrics, so replace the undefined calls
with getClickhouseAdminClientForMetrics() at both call sites (or alternatively
add an explicit import for getClickhouseAdminClient if the non-metrics admin
client is actually required and add a short comment explaining why), and update
any related type/usages to match the chosen client to ensure the client variable
(e.g., clickhouseClient) is correctly instantiated and used.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c560751d-5994-4334-9030-a7643a97c047

📥 Commits

Reviewing files that changed from the base of the PR and between db86c34 and 2fcc281.

📒 Files selected for processing (1)
  • apps/backend/src/app/api/latest/internal/metrics/route.tsx

Comment thread apps/backend/src/app/api/latest/internal/metrics/route.tsx
3 participants