feat: add bucket-based processing to PR analysis snapshot IN-1180 #4166
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
  TYPE COPY
  TARGET_DATASOURCE pull_requests_analyzed
- COPY_MODE replace
+ COPY_MODE append
Append duplicates full unbucketed run
High Severity
With COPY_MODE append, any on-demand run that omits bucket_id still scans the full dataset and appends every PR row. The previous replace mode cleared the target first. Appending onto an already populated pull_requests_analyzed duplicates keys and skews downstream averages and counts that read the table without a snapshotId filter.
Reviewed by Cursor Bugbot for commit 153af27.
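To make the skew in this finding concrete, here is a minimal in-memory model of the two copy modes. It is only a sketch: the row shape and the `hours_to_merge` column are hypothetical stand-ins, not the real `pull_requests_analyzed` schema, and `copy` is an illustrative helper, not a Tinybird API.

```python
from statistics import mean

# Hypothetical rows: (pr_id, hours_to_merge). Not the real schema.
existing = [("pr-1", 10.0), ("pr-2", 20.0)]  # already in the target table
full_run = [("pr-1", 10.0), ("pr-2", 20.0), ("pr-3", 60.0)]  # unbucketed scan


def copy(target, rows, mode):
    """Model a copy job: 'replace' discards the target, 'append' keeps it."""
    if mode not in ("replace", "append"):
        raise ValueError(f"unknown mode: {mode}")
    return list(rows) if mode == "replace" else list(target) + list(rows)


replaced = copy(existing, full_run, "replace")
appended = copy(existing, full_run, "append")

replace_count = len(replaced)
append_count = len(appended)
replace_avg = mean(hours for _, hours in replaced)
append_avg = mean(hours for _, hours in appended)
```

In this toy case the append result holds five rows for three PRs, and the average shifts because the two previously loaded PRs are counted twice.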
  TYPE COPY
  TARGET_DATASOURCE pull_requests_analyzed
- COPY_MODE replace
+ COPY_MODE append
Re-run bucket appends duplicate PRs
High Severity
COPY_MODE append has no idempotency for a given bucket_id. Re-running the same bucket after a successful copy writes another copy of the same PR rows (same keys and snapshotId). The hourly merger unions historical rows without deduplicating identical keys, so duplicates can remain in pull_requests_analyzed and inflate analytics.
Reviewed by Cursor Bugbot for commit 153af27.
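One possible mitigation for the re-run problem is to deduplicate on the row key before analytics read the data. The sketch below assumes a row is identified by a PR identifier plus `snapshotId`; the field names `pr_id` and `snapshot_id` are hypothetical, and nothing in this PR states that such a step exists.

```python
def dedupe(rows, key_fields=("pr_id", "snapshot_id")):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows produced by one bucket run.
bucket_rows = [
    {"pr_id": "pr-1", "snapshot_id": "s1"},
    {"pr_id": "pr-2", "snapshot_id": "s1"},
]

# The same bucket appended twice, as in an accidental re-run.
table = bucket_rows + bucket_rows
deduped = dedupe(table)
```

Deduplicating at read time hides the duplicates but does not remove them from storage, so an idempotent write path would still be preferable where available.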
  TYPE COPY
  TARGET_DATASOURCE pull_requests_analyzed
- COPY_MODE replace
+ COPY_MODE append
Merger during partial bucket load
High Severity
Append mode exposes partially loaded data in pull_requests_analyzed while buckets 0–N are still running. The hourly pull_request_analysis_snapshot_merger_copy job uses COPY_MODE replace and treats whatever is already in the table as the historical baseline. If it runs before every bucket finishes, the replace output can permanently under-represent PRs until a full rebootstrap.
Reviewed by Cursor Bugbot for commit 153af27.
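A completeness check on the orchestration side is one way to avoid the race described in this finding. The sketch below is hypothetical: it assumes the caller tracks which bucket IDs have finished, and both function names are invented for illustration rather than taken from this PR or from Tinybird.

```python
def missing_buckets(completed, num_buckets):
    """Return the bucket IDs in [0, num_buckets) that have not completed."""
    return sorted(set(range(num_buckets)) - set(completed))


def merger_may_run(completed, num_buckets):
    """Allow the merger only once every bucket has been loaded."""
    return not missing_buckets(completed, num_buckets)


partial = missing_buckets({0, 1, 3}, 5)
ready = merger_may_run(range(5), 5)
not_ready = merger_may_run({0, 1}, 5)
```

How the set of completed buckets is recorded (job logs, a status table, manual sign-off) is left open here, since the PR does not describe it.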
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit 18f7b98.
required=False,
)
}}
{% end %}
Bucket gate ignores num_buckets
Medium Severity
Sharding is enabled when only bucket_id is defined, while num_buckets can default independently per template. A run with bucket_id but a missing or different num_buckets than other bucket runs mis-partitions segments, leaving gaps or double-processing PR data across the append loads.
Reviewed by Cursor Bugbot for commit 18f7b98.
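The gaps and double-processing in this finding can be reproduced with plain modular arithmetic. In the sketch below the integers 0 to 29 stand in for `cityHash64(segmentId)` values (an assumption made so the result is easy to check by hand), and each run is a hypothetical `(bucket_id, num_buckets)` pair.

```python
from collections import Counter

segment_hashes = range(30)  # stand-ins for cityHash64(segmentId) values


def coverage(runs):
    """Count how many runs select each hash; a run is (bucket_id, num_buckets)."""
    counts = Counter({h: 0 for h in segment_hashes})
    for bucket_id, num_buckets in runs:
        for h in segment_hashes:
            if h % num_buckets == bucket_id:
                counts[h] += 1
    return counts


# Every run agrees on num_buckets = 5.
consistent = coverage([(b, 5) for b in range(5)])

# The run for bucket 3 was launched with num_buckets = 4 instead of 5.
mixed = coverage([(0, 5), (1, 5), (2, 5), (3, 4), (4, 5)])
gaps = sorted(h for h, c in mixed.items() if c == 0)
doubles = sorted(h for h, c in mixed.items() if c > 1)
```

With consistent parameters every hash is selected exactly once. With the one mismatched run, hashes 8, 13, 18, and 28 are never processed, while 7, 11, 15, 19, and 27 are processed twice.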


Summary
Adds bucket-based (sharded) processing to the PR analysis initial snapshot pipe to avoid hitting memory limits when processing large datasets. Each run can target a subset of segments via the bucket_id and num_buckets parameters; once all buckets complete, the hourly snapshot merger takes over. Also switches the copy mode from replace to append to support incremental runs.

Changes
- Add the % (templated query) marker to all NODEs so Tinybird evaluates the {% if defined(bucket_id) %} conditionals
- Add a cityHash64(segmentId) % num_buckets = bucket_id filter to every NODE, allowing callers to process one bucket at a time
- Switch COPY_MODE from replace to append to support incremental bucket runs

Type of change
JIRA ticket
https://linuxfoundation.atlassian.net/browse/IN-1180
Note
Medium Risk
Changes how pull_requests_analyzed is bootstrapped: append mode plus partial bucket runs can leave incomplete or duplicate snapshot data if operations skip buckets or omit a full reset.

Overview
The PR analysis initial snapshot Tinybird pipe can now be run in sharded passes using optional bucket_id and num_buckets (default 5), filtering each node on cityHash64(segmentId) % num_buckets so large backfills stay within memory limits.

Every upstream SQL node is switched to templated queries (% plus Jinja) so the bucket filter applies consistently across opened, lifecycle, and patchset nodes.

COPY_MODE moves from replace to append, so each bucket run adds rows instead of wiping pull_requests_analyzed; operators are expected to run all buckets, then rely on the hourly merger as before.

Reviewed by Cursor Bugbot for commit 18f7b98. Bugbot is set up for automated code reviews on this repo.
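The bucket filter described in the overview can be sketched outside Tinybird as follows. This is an illustration under stated assumptions: `stable_hash64` uses BLAKE2b from the Python standard library as a stand-in for ClickHouse's `cityHash64` and does not produce the same values, and the segment IDs are made up.

```python
import hashlib


def stable_hash64(value: str) -> int:
    """Stand-in for cityHash64: a stable 64-bit hash, NOT the same values."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_bucket(segment_ids, bucket_id=None, num_buckets=5):
    """Mirror the optional filter: without bucket_id, nothing is filtered."""
    if bucket_id is None:
        return list(segment_ids)
    return [s for s in segment_ids if stable_hash64(s) % num_buckets == bucket_id]


segments = [f"segment-{i}" for i in range(100)]  # hypothetical segment IDs
buckets = [select_bucket(segments, b, 5) for b in range(5)]
unbucketed = select_bucket(segments)
```

Because each ID hashes to exactly one remainder, the five bucket runs together cover every segment once, provided all runs use the same num_buckets.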