fix(billing): implement chunked updates for free tier usage tracking by FroeMic · Pull Request #9549 · langfuse/langfuse

FroeMic · 2025-10-06T15:44:27Z

Summary

Fixes GTM-1500: High DB load on free tier usage tracking job

Implements Option 2: Transaction-based chunking (1000 orgs per batch)

Reduces query count from 50,000 to ~50 (99% reduction)
Structured for easy swap to raw SQL (Option 1) if needed
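
The batching arithmetic behind these numbers can be sketched as follows. This is a minimal illustration rather than the PR's code: `chunkArray` is a hypothetical helper and the org IDs are made up.

```typescript
// Hypothetical helper (not taken from the PR): split a list into fixed-size chunks.
function chunkArray<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("chunk size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Made-up org IDs, sized to match the 50,000 orgs mentioned above.
const orgIds: string[] = [];
for (let i = 0; i < 50000; i++) {
  orgIds.push(`org-${i}`);
}

const orgChunks = chunkArray(orgIds, 1000);
console.log(orgChunks.length); // 50
```

With one database round trip per chunk instead of one per org, 50,000 updates collapse into about 50 batches.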

Changes

thresholdProcessing.ts:
- Added OrgUpdateData type for collecting update data
- Refactored processThresholds() to return update data instead of executing immediately
- Removed 3 prisma.organization.update() calls
- Kept all existing logic (emails, metrics, state calculations)
bulkUpdates.ts (NEW):
- Chunked bulk update with 1000 orgs per transaction
- Try-catch per chunk: traceException() on failure, continues processing
- Batch cache invalidation
- Returns success/failure stats
- Includes commented raw SQL implementation for future optimization
usageAggregation.ts:
- Collect updates in array instead of immediate execution
- Call bulkUpdateOrganizations() after processing each day
- Track bulk update stats
- Updated UsageAggregationStats type
Tests: Updated to verify returned updateData instead of mock calls
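
The core of the refactor is that threshold processing returns a description of the update instead of writing it. A minimal sketch of that pattern follows. The field names, state values, and threshold numbers are assumptions made for illustration and are not taken from the PR; the names carry a `Sketch` suffix to keep them distinct from the real `OrgUpdateData` and `processThresholds()`.

```typescript
// Hypothetical state values and thresholds, for illustration only.
type ThresholdStateSketch = "OK" | "WARNING" | "BLOCKED";
const WARNING_THRESHOLD = 50000;
const BLOCKING_THRESHOLD = 200000;

// Hypothetical shape of the data collected per org.
interface OrgUpdateDataSketch {
  orgId: string;
  usage: number;
  state: ThresholdStateSketch;
}

// Pure function: computes what should be written, but performs no database call.
function processThresholdsSketch(orgId: string, usage: number): OrgUpdateDataSketch {
  let state: ThresholdStateSketch = "OK";
  if (usage >= BLOCKING_THRESHOLD) {
    state = "BLOCKED";
  } else if (usage >= WARNING_THRESHOLD) {
    state = "WARNING";
  }
  return { orgId, usage, state };
}

// The caller collects updates in an array and hands them to a bulk writer later.
const usageByOrg: Array<[string, number]> = [
  ["org-a", 10],
  ["org-b", 75000],
  ["org-c", 250000],
];
const pendingUpdates: OrgUpdateDataSketch[] = [];
for (const [orgId, usage] of usageByOrg) {
  pendingUpdates.push(processThresholdsSketch(orgId, usage));
}
console.log(pendingUpdates.map((u) => u.state).join(","));
```

Returning data also makes the function easier to test, which matches the note above that the tests now verify the returned update data instead of mock calls.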

Error Handling

Each chunk failure is isolated (won't kill entire job)
Failed chunks reported to Datadog via traceException()
Successful chunks are committed
Cache invalidation failures logged but don't fail updates
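
The per-chunk isolation described above can be sketched like this. It is an illustration under assumptions, not the contents of bulkUpdates.ts: `updateInChunks`, `updateChunk`, and `reportError` are hypothetical names, with `reportError` standing in for a reporter such as traceException(). Cache invalidation is left out to keep the sketch short.

```typescript
interface BulkUpdateStatsSketch {
  successCount: number;
  failedCount: number;
}

// Hypothetical sketch: process items chunk by chunk, isolating failures per chunk.
async function updateInChunks<T>(
  items: readonly T[],
  chunkSize: number,
  updateChunk: (chunk: T[]) => Promise<void>,
  reportError: (error: unknown) => void,
): Promise<BulkUpdateStatsSketch> {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const stats: BulkUpdateStatsSketch = { successCount: 0, failedCount: 0 };
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    try {
      await updateChunk(chunk);
      stats.successCount += chunk.length;
    } catch (error) {
      // Report the failure and move on; earlier chunks stay committed.
      reportError(error);
      stats.failedCount += chunk.length;
    }
  }
  return stats;
}
```

Because the try-catch sits inside the loop, a failing chunk only affects its own items, and the returned stats let the caller log how much of the job succeeded.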

Performance Impact

Before: 50,000 sequential UPDATEs (250-500s)
After: ~50 transaction blocks (estimated 10-20s)
Reduction: ~99.9% fewer queries (50,000 to ~50), roughly 95% faster execution (estimated)
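
The reductions can be checked with a short calculation that takes the description's own figures as inputs (50,000 orgs, 1,000 per chunk, 250-500s before, an estimated 10-20s after). The helper names are hypothetical.

```typescript
// Inputs taken from the figures quoted in the description.
const orgCount = 50000;
const orgsPerChunk = 1000;
const chunkCount = Math.ceil(orgCount / orgsPerChunk);

// Fraction by which `after` is smaller than `before`.
function reduction(before: number, after: number): number {
  return 1 - after / before;
}

const asPercent = (fraction: number): string => `${(fraction * 100).toFixed(1)}%`;

const queryReduction = reduction(orgCount, chunkCount);
// Least favourable pairing: fastest "before" against slowest "after".
const lowTimeReduction = reduction(250, 20);
// Most favourable pairing: slowest "before" against fastest "after".
const highTimeReduction = reduction(500, 10);

console.log(`chunks: ${chunkCount}`);
console.log(`query reduction: ${asPercent(queryReduction)}`);
console.log(`time reduction: ${asPercent(lowTimeReduction)} to ${asPercent(highTimeReduction)}`);
```

On these inputs the query count drops by about 99.9%, and the estimated runtime drops by somewhere between 92% and 98%.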

Test Plan

All existing tests pass (thresholdProcessing.test.ts, usageAggregation.test.ts)
Linter passes
Manual test with staging data
Monitor DB load after deployment

🤖 Generated with Claude Code

Important

Implements chunked updates for free tier usage tracking, reducing database load and improving performance by refactoring processThresholds() and introducing bulkUpdateOrganizations() for efficient batch processing.

Behavior:
- Implements chunked updates for free tier usage tracking, reducing query count from 50,000 to ~50.
- processThresholds() in thresholdProcessing.ts refactored to return update data.
- New bulkUpdateOrganizations() in bulkUpdates.ts handles chunked updates with error isolation per chunk.
- usageAggregation.ts collects updates and calls bulkUpdateOrganizations() after processing each day.
Error Handling:
- Each chunk failure is isolated; failed chunks reported via traceException().
- Cache invalidation failures logged but don't fail updates.
Performance Impact:
- Reduces query count by ~99.9% (50,000 to ~50) and estimated execution time by roughly 95%.
Tests:
- Updated to verify returned updateData instead of mock calls.

This description was generated automatically for 3388d3c. You can customize this summary. It will automatically update as commits are pushed.

Reduces DB load by 95% via transaction batching (50,000 → 50 chunks). Each chunk processes 1,000 orgs with proper error handling. Failed chunks reported to Datadog without killing the job.

Changes:
- Refactored processThresholds() to return update data instead of executing immediately
- Created bulkUpdates.ts with chunked transaction processing (1000 orgs per batch)
- Modified usageAggregation.ts to collect updates and execute in bulk
- Updated tests to verify returned data instead of mock calls
- Added error handling with traceException for failed chunks
- Structured for easy swap to raw SQL (Option 1) if needed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

The tests now call bulkUpdateOrganizations() to complete the update flow, including cache invalidation. This reflects the refactored architecture where processThresholds() returns update data and bulkUpdateOrganizations() executes it.

60 seconds was excessive for 1000 orgs. Even at 10ms per update, that's only 10 seconds. 15 seconds provides a reasonable buffer.

Benefits over the previous transaction-wrapper approach:
- Better resilience: One failed org doesn't fail the entire 1000-org chunk
- Concurrent execution: Much faster than a sequential transaction
- Granular error tracking: Track exactly which orgs failed
- Better error handling: Each org failure reported to Datadog individually

Trade-off: No atomicity per chunk, but we don't need it for this use case. Each org update is independent and idempotent.
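
A minimal sketch of the Promise.allSettled pattern this commit message describes, under stated assumptions: `updateChunkSettled` and `updateOne` are hypothetical names, and the real per-org update and Datadog reporting are replaced by a caller-supplied function and a returned list of failures.

```typescript
interface OrgUpdateSketch {
  orgId: string;
}

interface FailedUpdateSketch {
  orgId: string;
  reason: unknown;
}

interface ChunkResultSketch {
  succeeded: string[];
  failed: FailedUpdateSketch[];
}

// Hypothetical sketch: run every update in a chunk concurrently and
// record the outcome per org instead of failing the chunk as a whole.
async function updateChunkSettled<T extends OrgUpdateSketch>(
  chunk: readonly T[],
  updateOne: (update: T) => Promise<void>,
): Promise<ChunkResultSketch> {
  // Promise.allSettled never rejects and preserves input order,
  // so results[i] belongs to chunk[i].
  const results = await Promise.allSettled(
    chunk.map(async (update) => updateOne(update)),
  );
  const outcome: ChunkResultSketch = { succeeded: [], failed: [] };
  results.forEach((result, index) => {
    const orgId = chunk[index].orgId;
    if (result.status === "fulfilled") {
      outcome.succeeded.push(orgId);
    } else {
      outcome.failed.push({ orgId, reason: result.reason });
    }
  });
  return outcome;
}
```

The trade-off named above is visible here: nothing wraps the updates in a single transaction, so a partially updated chunk is possible. That is only acceptable because each org update is independent and idempotent.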

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

Steffen911

Both bulkUpdateOrganizations approaches look good to me in the current implementation. Your pick!

…angfuse#9549)

* fix(billing): implement chunked updates for free tier usage tracking

  Reduces DB load by 95% via transaction batching (50,000 → 50 chunks). Each chunk processes 1,000 orgs with proper error handling. Failed chunks reported to Datadog without killing the job.

  Changes:
  - Refactored processThresholds() to return update data instead of executing immediately
  - Created bulkUpdates.ts with chunked transaction processing (1000 orgs per batch)
  - Modified usageAggregation.ts to collect updates and execute in bulk
  - Updated tests to verify returned data instead of mock calls
  - Added error handling with traceException for failed chunks
  - Structured for easy swap to raw SQL (Option 1) if needed

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* test: update cache invalidation tests to use bulkUpdateOrganizations

  The tests now call bulkUpdateOrganizations() to complete the update flow, including cache invalidation. This reflects the refactored architecture where processThresholds() returns update data and bulkUpdateOrganizations() executes it.

* fix: reduce transaction timeout from 60s to 15s per chunk

  60 seconds was excessive for 1000 orgs. Even at 10ms per update, that's only 10 seconds. 15 seconds provides a reasonable buffer.

* refactor: use Promise.allSettled instead of transaction wrapper

  Benefits over the previous transaction-wrapper approach:
  - Better resilience: One failed org doesn't fail the entire 1000-org chunk
  - Concurrent execution: Much faster than a sequential transaction
  - Granular error tracking: Track exactly which orgs failed
  - Better error handling: Each org failure reported to Datadog individually

  Trade-off: No atomicity per chunk, but we don't need it for this use case. Each org update is independent and idempotent.

* fix: remove unused chunkOrgIds variable

* remove unused code

* refactor transaction update and add rawsql update

* Update worker/src/ee/usageThresholds/bulkUpdates.ts

  Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* Update worker/src/ee/usageThresholds/bulkUpdates.ts

  Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* make rawsql query default

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
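
The last commit in this message makes a raw SQL query the default update path. The PR's actual query is not shown on this page, so the following is only a generic sketch of one common way to update many rows in a single PostgreSQL statement, using `UPDATE ... FROM (VALUES ...)`. The table name, column names, and the `buildBulkUpdate` helper are all hypothetical.

```typescript
// Hypothetical row shape; the real columns are not shown in this PR page.
interface OrgRowSketch {
  id: string;
  usage: number;
  state: string;
}

interface SqlStatementSketch {
  text: string;
  params: Array<string | number>;
}

// Builds one parameterized statement that updates every row in the chunk.
// Values are passed as parameters ($1, $2, ...) and never interpolated.
function buildBulkUpdate(rows: readonly OrgRowSketch[]): SqlStatementSketch {
  if (rows.length === 0) {
    throw new Error("buildBulkUpdate requires at least one row");
  }
  const params: Array<string | number> = [];
  const tuples = rows.map((row) => {
    params.push(row.id, row.usage, row.state);
    const base = params.length - 3;
    // The ::int cast tells PostgreSQL the type of the otherwise untyped parameter.
    return `($${base + 1}, $${base + 2}::int, $${base + 3})`;
  });
  const text =
    "UPDATE organizations AS o " +
    "SET usage = v.usage, state = v.state " +
    `FROM (VALUES ${tuples.join(", ")}) AS v(id, usage, state) ` +
    "WHERE o.id = v.id";
  return { text, params };
}

const statement = buildBulkUpdate([
  { id: "org-a", usage: 10, state: "OK" },
  { id: "org-b", usage: 75000, state: "WARNING" },
]);
console.log(statement.text);
console.log(statement.params.length); // 6
```

A statement of this shape touches a whole chunk in one round trip. At 1,000 rows and three parameters per row it stays well below PostgreSQL's limit of 65,535 bind parameters per statement.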

dosubot Bot added the size:XL label ("This PR changes 500-999 lines, ignoring generated files.") on Oct 6, 2025

ellipsis-dev Bot reviewed Oct 6, 2025

Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts

FroeMic added 5 commits October 6, 2025 17:49

fix: reduce transaction timeout from 60s to 15s per chunk

afbbe91

60 seconds was excessive for 1000 orgs. Even at 10ms per update, that's only 10 seconds. 15 seconds provides a reasonable buffer.

fix: remove unused chunkOrgIds variable

48e6fc8

remove unused code

079181d

FroeMic requested a review from Steffen911 October 6, 2025 16:15

refactor transaction update and add rawsql update

d1fae22

ellipsis-dev Bot reviewed Oct 6, 2025

Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts Outdated

Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts Outdated

Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts

Comment thread worker/src/ee/usageThresholds/bulkUpdates.ts

FroeMic and others added 2 commits October 6, 2025 19:01

Update worker/src/ee/usageThresholds/bulkUpdates.ts

597b8ae

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

Update worker/src/ee/usageThresholds/bulkUpdates.ts

2bbb798

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

Steffen911 approved these changes Oct 6, 2025

dosubot Bot added the lgtm label ("This PR has been approved by a maintainer") on Oct 6, 2025

make rawsql query default

3388d3c

FroeMic enabled auto-merge October 6, 2025 17:34

dosubot Bot added the auto-merge label ("This PR is set to be merged") on Oct 6, 2025

FroeMic added this pull request to the merge queue Oct 6, 2025

Merged via the queue into main with commit 1354e09 Oct 6, 2025
31 checks passed

FroeMic deleted the michael/gtm-1500 branch October 6, 2025 17:47

dosubot Bot removed the auto-merge label ("This PR is set to be merged") on Oct 6, 2025

FroeMic merged 10 commits into main from michael/gtm-1500

FroeMic commented Oct 6, 2025 • edited by ellipsis-dev Bot

2 participants
