fix(billing): implement chunked updates for free tier usage tracking #9549

Merged
FroeMic merged 10 commits into main from michael/gtm-1500 on Oct 6, 2025

Conversation

@FroeMic (Contributor) commented Oct 6, 2025

Summary

Fixes GTM-1500: High DB load on free tier usage tracking job

Implements Option 2: Transaction-based chunking (1000 orgs per batch)

  • Reduces query count from 50,000 to ~50 (~99.9% reduction)
  • Structured for easy swap to raw SQL (Option 1) if needed

Changes

  • thresholdProcessing.ts:

    • Added OrgUpdateData type for collecting update data
    • Refactored processThresholds() to return update data instead of executing immediately
    • Removed 3 prisma.organization.update() calls
    • Kept all existing logic (emails, metrics, state calculations)
  • bulkUpdates.ts (NEW):

    • Chunked bulk update with 1000 orgs per transaction
    • Try-catch per chunk: traceException() on failure, continues processing
    • Batch cache invalidation
    • Returns success/failure stats
    • Includes commented raw SQL implementation for future optimization
  • usageAggregation.ts:

    • Collect updates in array instead of immediate execution
    • Call bulkUpdateOrganizations() after processing each day
    • Track bulk update stats
    • Updated UsageAggregationStats type
  • Tests: Updated to verify returned updateData instead of mock calls
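
The refactor listed above separates deciding what to write from writing it. A minimal sketch of that shape follows, assuming hypothetical field names, state values, and thresholds; the real `OrgUpdateData` and `processThresholds()` are not shown in this description and also handle emails and metrics.

```typescript
// Hypothetical shapes for illustration only; not the types from thresholdProcessing.ts.
type UsageState = "OK" | "WARNING" | "BLOCKED";

interface OrgUsage {
  orgId: string;
  usage: number;
}

interface OrgUpdateData {
  orgId: string;
  usage: number;
  state: UsageState;
}

// Assumed placeholder thresholds, not the product's real limits.
const WARN_AT = 50;
const BLOCK_AT = 100;

// Pure decision step: computes the row to write but performs no database I/O.
function computeUpdate(org: OrgUsage): OrgUpdateData {
  let state: UsageState = "OK";
  if (org.usage >= BLOCK_AT) {
    state = "BLOCKED";
  } else if (org.usage >= WARN_AT) {
    state = "WARNING";
  }
  return { orgId: org.orgId, usage: org.usage, state };
}

// Collects one update per org so that a later step can write them in bulk.
function collectUpdates(orgs: OrgUsage[]): OrgUpdateData[] {
  return orgs.map(computeUpdate);
}
```

Because the decision step returns data rather than issuing writes, a test can assert on the returned array directly instead of inspecting mocked update calls.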

Error Handling

  • Each chunk failure is isolated (won't kill entire job)
  • Failed chunks reported to Datadog via traceException()
  • Successful chunks are committed
  • Cache invalidation failures logged but don't fail updates
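
The isolation described above can be sketched as a loop with one try-catch per chunk. This is an illustration under assumptions, not the code in `bulkUpdates.ts`: `applyChunk` stands in for whatever writes a chunk to the database, and `reportError` stands in for `traceException()`, whose real signature is not shown here.

```typescript
interface ChunkStats {
  totalChunks: number;
  successfulChunks: number;
  failedChunks: number;
}

// Splits items into consecutive chunks of at most `size` elements.
function toChunks<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("chunk size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Applies chunks one after another. A failing chunk is reported and counted,
// and processing continues with the next chunk.
async function applyInChunks<T>(
  items: readonly T[],
  chunkSize: number,
  applyChunk: (chunk: T[]) => Promise<void>, // hypothetical writer
  reportError: (error: unknown, chunkIndex: number) => void, // hypothetical reporter
): Promise<ChunkStats> {
  const chunks = toChunks(items, chunkSize);
  const stats: ChunkStats = {
    totalChunks: chunks.length,
    successfulChunks: 0,
    failedChunks: 0,
  };
  for (let i = 0; i < chunks.length; i++) {
    try {
      await applyChunk(chunks[i]);
      stats.successfulChunks++;
    } catch (error) {
      reportError(error, i);
      stats.failedChunks++;
    }
  }
  return stats;
}
```

With 2,500 items and a chunk size of 1,000, a failure in the second chunk leaves the first and third chunks applied and yields one failed chunk in the returned stats.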

Performance Impact

  • Before: 50,000 sequential UPDATEs (250-500s)
  • After: ~50 transaction blocks (estimated 10-20s)
  • Reduction: ~99.9% fewer queries (50,000 → ~50), ~96% faster execution (estimated)
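
The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers quoted in this description (50,000 orgs, 1,000 per chunk, 250-500s before, an estimated 10-20s after) and treats each chunk as one unit of work, as the description does; the statements issued inside a transaction are not counted separately.

```typescript
// Number of chunks needed to cover all items.
function chunkCount(totalItems: number, chunkSize: number): number {
  return Math.ceil(totalItems / chunkSize);
}

// Percentage by which `after` is smaller than `before`.
function percentReduction(before: number, after: number): number {
  return (1 - after / before) * 100;
}

const totalChunks = chunkCount(50000, 1000); // 50 chunks
const fewerUnitsPct = percentReduction(50000, totalChunks);
const fasterLowPct = percentReduction(250, 10); // best case of both ranges
const fasterHighPct = percentReduction(500, 20); // worst case of both ranges

console.log(
  totalChunks,
  fewerUnitsPct.toFixed(1),
  fasterLowPct.toFixed(0),
  fasterHighPct.toFixed(0),
);
```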

Test Plan

  • All existing tests pass (thresholdProcessing.test.ts, usageAggregation.test.ts)
  • Linter passes
  • Manual test with staging data
  • Monitor DB load after deployment

🤖 Generated with Claude Code


Important

Implements chunked updates for free tier usage tracking, reducing database load and improving performance by refactoring processThresholds() and introducing bulkUpdateOrganizations() for efficient batch processing.

  • Behavior:
    • Implements chunked updates for free tier usage tracking, reducing query count from 50,000 to ~50.
    • processThresholds() in thresholdProcessing.ts refactored to return update data.
    • New bulkUpdateOrganizations() in bulkUpdates.ts handles chunked updates with error isolation per chunk.
    • usageAggregation.ts collects updates and calls bulkUpdateOrganizations() after processing each day.
  • Error Handling:
    • Each chunk failure is isolated; failed chunks reported via traceException().
    • Cache invalidation failures logged but don't fail updates.
  • Performance Impact:
    • Reduces query count by ~99.9% and estimated execution time by ~96%.
  • Tests:
    • Updated to verify returned updateData instead of mock calls.

This description was created by Ellipsis for 3388d3c.

Reduces DB load by 95% via transaction batching (50,000 → 50 chunks).
Each chunk processes 1,000 orgs with proper error handling.
Failed chunks reported to Datadog without killing the job.

Changes:
- Refactored processThresholds() to return update data instead of executing immediately
- Created bulkUpdates.ts with chunked transaction processing (1000 orgs per batch)
- Modified usageAggregation.ts to collect updates and execute in bulk
- Updated tests to verify returned data instead of mock calls
- Added error handling with traceException for failed chunks
- Structured for easy swap to raw SQL (Option 1) if needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@dosubot (bot) added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Oct 6, 2025
Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts
test: update cache invalidation tests to use bulkUpdateOrganizations

The tests now call bulkUpdateOrganizations() to complete the update flow,
including cache invalidation. This reflects the refactored architecture where
processThresholds() returns update data and bulkUpdateOrganizations() executes it.

fix: reduce transaction timeout from 60s to 15s per chunk

60 seconds was excessive for 1000 orgs. Even at 10ms per update,
that's only 10 seconds. 15 seconds provides a reasonable buffer.

refactor: use Promise.allSettled instead of transaction wrapper

Benefits over the previous transaction-wrapper approach:
- Better resilience: One failed org doesn't fail the entire 1000-org chunk
- Concurrent execution: Much faster than sequential transaction
- Granular error tracking: Track exactly which orgs failed
- Better error handling: Each org failure reported to Datadog individually

Trade-off: No atomicity per chunk, but we don't need it for this use case.
Each org update is independent and idempotent.
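
The per-org isolation described in this commit can be sketched with `Promise.allSettled`, which waits for every promise and reports each outcome individually. The names `updateOne` and `reportError` are hypothetical stand-ins for the per-org database write and the Datadog reporting call; the actual implementation is not shown in this conversation.

```typescript
interface SettledChunkResult {
  succeededIds: string[];
  failedIds: string[];
}

// Runs every update in the chunk concurrently and records each outcome.
// One rejected update does not affect the others in the same chunk.
async function updateChunkSettled<T extends { orgId: string }>(
  chunk: readonly T[],
  updateOne: (update: T) => Promise<void>, // hypothetical per-org writer
  reportError: (error: unknown, orgId: string) => void, // hypothetical reporter
): Promise<SettledChunkResult> {
  // The async wrapper also turns a synchronous throw into a rejection.
  const settled = await Promise.allSettled(
    chunk.map(async (update) => updateOne(update)),
  );
  const result: SettledChunkResult = { succeededIds: [], failedIds: [] };
  settled.forEach((outcome, i) => {
    const orgId = chunk[i].orgId;
    if (outcome.status === "fulfilled") {
      result.succeededIds.push(orgId);
    } else {
      result.failedIds.push(orgId);
      reportError(outcome.reason, orgId);
    }
  });
  return result;
}
```

Note that this starts all updates in a chunk at once, so the effective concurrency is bounded by whatever the database client's connection pool allows.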
@FroeMic requested a review from Steffen911 October 6, 2025 16:15
Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts Outdated
Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts Outdated
Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts
Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts
FroeMic and others added 2 commits October 6, 2025 19:01
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
@Steffen911 (Member) left a comment

Both bulkUpdateOrganizations approaches look good to me in the current implementation. Your pick!

@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) Oct 6, 2025
@FroeMic enabled auto-merge October 6, 2025 17:34
@dosubot (bot) added the auto-merge label (This PR is set to be merged) Oct 6, 2025
@FroeMic added this pull request to the merge queue Oct 6, 2025
Merged via the queue into main with commit 1354e09 Oct 6, 2025
31 checks passed
@FroeMic deleted the michael/gtm-1500 branch October 6, 2025 17:47
@dosubot (bot) removed the auto-merge label (This PR is set to be merged) Oct 6, 2025
murdore pushed a commit to juspay/langfuse that referenced this pull request Oct 14, 2025
…angfuse#9549)

* fix(billing): implement chunked updates for free tier usage tracking

Reduces DB load by 95% via transaction batching (50,000 → 50 chunks).
Each chunk processes 1,000 orgs with proper error handling.
Failed chunks reported to Datadog without killing the job.

Changes:
- Refactored processThresholds() to return update data instead of executing immediately
- Created bulkUpdates.ts with chunked transaction processing (1000 orgs per batch)
- Modified usageAggregation.ts to collect updates and execute in bulk
- Updated tests to verify returned data instead of mock calls
- Added error handling with traceException for failed chunks
- Structured for easy swap to raw SQL (Option 1) if needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* test: update cache invalidation tests to use bulkUpdateOrganizations

The tests now call bulkUpdateOrganizations() to complete the update flow,
including cache invalidation. This reflects the refactored architecture where
processThresholds() returns update data and bulkUpdateOrganizations() executes it.

* fix: reduce transaction timeout from 60s to 15s per chunk

60 seconds was excessive for 1000 orgs. Even at 10ms per update,
that's only 10 seconds. 15 seconds provides a reasonable buffer.

* refactor: use Promise.allSettled instead of transaction wrapper

Benefits over the previous transaction-wrapper approach:
- Better resilience: One failed org doesn't fail the entire 1000-org chunk
- Concurrent execution: Much faster than sequential transaction
- Granular error tracking: Track exactly which orgs failed
- Better error handling: Each org failure reported to Datadog individually

Trade-off: No atomicity per chunk, but we don't need it for this use case.
Each org update is independent and idempotent.

* fix: remove unused chunkOrgIds variable

* remove unused code

* refactor transaction update and add rawsql update

* Update worker/src/ee/usageThresholds/bulkUpdates.ts

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* Update worker/src/ee/usageThresholds/bulkUpdates.ts

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* make rawsql query default

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
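
The last commits make a raw SQL update the default, but the query itself does not appear in this conversation. One common way to update many rows in a single statement is `UPDATE ... FROM (VALUES ...)`. The sketch below builds such a statement with numbered placeholders, assuming PostgreSQL syntax and hypothetical table and column names (`organizations`, `id`, `usage_state`). It is an illustration of the general technique, not the query used in `bulkUpdates.ts`.

```typescript
interface OrgStateUpdate {
  orgId: string;
  state: string;
}

interface ParameterizedQuery {
  text: string;
  values: string[];
}

// Builds one UPDATE statement covering every row in `updates`.
// Table and column names are hypothetical; values are passed as bind
// parameters rather than interpolated into the SQL text.
function buildBulkUpdateQuery(
  updates: readonly OrgStateUpdate[],
): ParameterizedQuery {
  if (updates.length === 0) {
    throw new Error("at least one update is required");
  }
  const tuples: string[] = [];
  const values: string[] = [];
  updates.forEach((update, i) => {
    // Two placeholders per row: $1/$2 for the first row, $3/$4 for the next.
    tuples.push(`($${2 * i + 1}::text, $${2 * i + 2}::text)`);
    values.push(update.orgId, update.state);
  });
  const text =
    "UPDATE organizations AS o SET usage_state = v.state " +
    `FROM (VALUES ${tuples.join(", ")}) AS v(id, state) ` +
    "WHERE o.id = v.id";
  return { text, values };
}
```

Built this way, a 1,000-org chunk becomes a single statement with 2,000 bind parameters instead of 1,000 separate UPDATEs.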
Labels

lgtm: This PR has been approved by a maintainer
size:XL: This PR changes 500-999 lines, ignoring generated files.

2 participants