feat: add bucket-based processing to PR analysis snapshot IN-1180 by gaspergrom · Pull Request #4166 · linuxfoundation/crowd.dev

gaspergrom · 2026-06-03T19:25:31Z

Summary

Adds bucket-based (sharded) processing to the PR analysis initial snapshot pipe to avoid hitting memory limits when processing large datasets. Each run can target a subset of segments via bucket_id and num_buckets parameters; once all buckets complete, the hourly snapshot merger takes over. Also switches the copy mode from replace to append to support incremental runs.

Changes

Added % (templated query) marker to all NODEs so Tinybird evaluates the {% if defined(bucket_id) %} conditionals
Added cityHash64(segmentId) % num_buckets = bucket_id filter to every NODE, allowing callers to process one bucket at a time
Changed COPY_MODE from replace to append to support incremental bucket runs
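
To illustrate the bucket filter described above, here is a minimal Python sketch of the partitioning idea. It is not the pipe's code: `stable_hash64` is a stand-in for ClickHouse's `cityHash64` (any deterministic hash shows the same property), and the row shape and helper names are hypothetical.

```python
import hashlib


def stable_hash64(value: str) -> int:
    """Stand-in for cityHash64 (assumption): a deterministic 64-bit hash."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bucket_of(segment_id: str, num_buckets: int) -> int:
    """Bucket a segment belongs to, mirroring hash(segmentId) % num_buckets."""
    return stable_hash64(segment_id) % num_buckets


def rows_for_bucket(rows, bucket_id: int, num_buckets: int):
    """Keep only the rows whose segment hashes into the requested bucket."""
    return [r for r in rows if bucket_of(r["segmentId"], num_buckets) == bucket_id]


# Hypothetical rows: 50 PRs spread over 7 segments.
rows = [{"segmentId": f"seg-{i % 7}", "pr": i} for i in range(50)]
num_buckets = 5
shards = [rows_for_bucket(rows, b, num_buckets) for b in range(num_buckets)]

# Every row lands in exactly one shard, so running all buckets covers the data once.
total_rows = sum(len(s) for s in shards)
```

Because the bucket depends only on `segmentId`, all rows of a segment are processed in the same run, which is what lets each run work on a self-contained subset.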

Type of change

JIRA ticket

https://linuxfoundation.atlassian.net/browse/IN-1180

Note

Medium Risk
Changes how pull_requests_analyzed is bootstrapped: append mode plus partial bucket runs can leave incomplete or duplicate snapshot data if operations skip buckets or omit a full reset.

Overview
The PR analysis initial snapshot Tinybird pipe can now be run in sharded passes using optional bucket_id and num_buckets (default 5), filtering each node on cityHash64(segmentId) % num_buckets so large backfills stay within memory limits.

Every upstream SQL node is switched to templated queries (% plus Jinja) so the bucket filter applies consistently across opened, lifecycle, and patchset nodes. COPY_MODE moves from replace to append, so each bucket run adds rows instead of wiping pull_requests_analyzed; operators are expected to run all buckets, then rely on the hourly merger as before.
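
As a rough model of the optional parameters described in this overview, the sketch below renders the filter only when `bucket_id` is supplied and falls back to `num_buckets = 5` otherwise. The function name, the leading `AND`, and the range check are assumptions for illustration; the real pipe expresses this with Tinybird template conditionals, not Python.

```python
from typing import Optional


def bucket_filter(bucket_id: Optional[int] = None, num_buckets: int = 5) -> str:
    """Return the SQL fragment for one bucket, or '' for an unsharded run."""
    if bucket_id is None:
        # Mirrors the "bucket_id not defined" branch: no sharding filter at all.
        return ""
    if num_buckets <= 0 or not 0 <= bucket_id < num_buckets:
        # Hypothetical guard: an out-of-range bucket would silently match nothing.
        raise ValueError(f"bucket_id {bucket_id} is outside 0..{num_buckets - 1}")
    return f"AND cityHash64(segmentId) % {num_buckets} = {bucket_id}"


unsharded = bucket_filter()
third_bucket = bucket_filter(bucket_id=2)
```

The same fragment has to be applied in every node (opened, lifecycle, patchset) so that each run sees a consistent subset of segments.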

Reviewed by Cursor Bugbot for commit 18f7b98. Bugbot is set up for automated code reviews on this repo.

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>

cursor · 2026-06-03T19:27:59Z

 TYPE COPY
 TARGET_DATASOURCE pull_requests_analyzed
-COPY_MODE replace
+COPY_MODE append


Append duplicates full unbucketed run

High Severity

With COPY_MODE append, any on-demand run that omits bucket_id still scans the full dataset and appends every PR row. The previous replace mode cleared the target first. Appending onto an already populated pull_requests_analyzed duplicates keys and skews downstream averages and counts that read the table without a snapshotId filter.

Reviewed by Cursor Bugbot for commit 153af27.
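
The effect described in this comment can be reproduced with a small in-memory model. The sketch below is only an illustration: `copy_run`, the `prId` and `hoursToMerge` fields, and the sample values are hypothetical, and the table is a plain list rather than a Tinybird datasource.

```python
from statistics import mean


def copy_run(target, source_rows, mode):
    """Model a copy job: 'replace' overwrites the target, 'append' adds to it."""
    if mode == "replace":
        return list(source_rows)
    if mode == "append":
        return list(target) + list(source_rows)
    raise ValueError(f"unknown COPY_MODE: {mode}")


# Full dataset the pipe would scan when bucket_id is omitted.
source = [
    {"prId": "a", "hoursToMerge": 2.0},
    {"prId": "b", "hoursToMerge": 10.0},
    {"prId": "c", "hoursToMerge": 3.0},
]
# Target already populated by earlier runs (two of the three PRs).
existing = source[:2]

replaced = copy_run(existing, source, "replace")
appended = copy_run(existing, source, "append")

true_avg = mean(r["hoursToMerge"] for r in replaced)    # 3 rows
skewed_avg = mean(r["hoursToMerge"] for r in appended)  # 5 rows, a and b counted twice
```

With these sample values the row count goes from 3 to 5 and the average shifts from 5.0 to 5.4, which is the kind of skew a reader without a `snapshotId` filter would see.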

cursor · 2026-06-03T19:27:59Z

 TYPE COPY
 TARGET_DATASOURCE pull_requests_analyzed
-COPY_MODE replace
+COPY_MODE append


Re-run bucket appends duplicate PRs

High Severity

COPY_MODE append has no idempotency for a given bucket_id. Re-running the same bucket after a successful copy writes another copy of the same PR rows (same keys and snapshotId). The hourly merger unions historical rows without deduplicating identical keys, so duplicates can remain in pull_requests_analyzed and inflate analytics.

Reviewed by Cursor Bugbot for commit 153af27.
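
One possible shape for the missing idempotency is to skip rows whose key is already present. The sketch below assumes a `(prId, snapshotId)` key, which is a guess at what "same keys and snapshotId" means here; it models the table as a list and is not something the pipe or the merger currently does.

```python
def append_once(target, new_rows, key=("prId", "snapshotId")):
    """Append only rows whose key tuple is not already in the target."""
    seen = {tuple(row[field] for field in key) for row in target}
    for row in new_rows:
        row_key = tuple(row[field] for field in key)
        if row_key not in seen:
            seen.add(row_key)
            target.append(row)
    return target


# Hypothetical output of one bucket run.
bucket_rows = [
    {"prId": "a", "snapshotId": "2026-06-03"},
    {"prId": "b", "snapshotId": "2026-06-03"},
]

table = []
append_once(table, bucket_rows)
size_after_first_run = len(table)
append_once(table, bucket_rows)  # re-running the same bucket is now a no-op
size_after_rerun = len(table)
```

In a columnar store the equivalent would more likely be a deduplicating engine or a delete-then-append per bucket; which of those fits this datasource is not established by the PR.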

cursor · 2026-06-03T19:27:59Z

 TYPE COPY
 TARGET_DATASOURCE pull_requests_analyzed
-COPY_MODE replace
+COPY_MODE append


Merger during partial bucket load

High Severity

Append mode exposes partially loaded data in pull_requests_analyzed while buckets 0–N are still running. The hourly pull_request_analysis_snapshot_merger_copy job uses COPY_MODE replace and treats whatever is already in the table as the historical baseline. If it runs before every bucket finishes, the replace output can permanently under-represent PRs until a full rebootstrap.

Reviewed by Cursor Bugbot for commit 153af27.
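
A guard against this race could be as simple as refusing to run the merger until every bucket has reported completion. The sketch below is hypothetical: the PR does not describe any completion tracking, so `completed` stands for whatever record an operator or scheduler keeps.

```python
def missing_buckets(completed, num_buckets):
    """Bucket ids in 0..num_buckets-1 that have not finished yet."""
    return sorted(set(range(num_buckets)) - set(completed))


def merger_may_run(completed, num_buckets):
    """The merger is safe only once no bucket is missing."""
    return not missing_buckets(completed, num_buckets)


# Buckets 0, 1 and 3 have finished; 2 and 4 are still running.
partial = {0, 1, 3}
still_missing = missing_buckets(partial, 5)
can_merge_partial = merger_may_run(partial, 5)
can_merge_full = merger_may_run({0, 1, 2, 3, 4}, 5)
```

The alternative is operational: pause the hourly job for the duration of the backfill and resume it after the last bucket lands.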

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 18f7b98.

cursor · 2026-06-04T06:44:44Z

+                    required=False,
+                )
+            }}
+        {% end %}


Bucket gate ignores num_buckets

Medium Severity

Sharding is enabled when only bucket_id is defined, while num_buckets can default independently per template. A run with bucket_id but a missing or different num_buckets than other bucket runs mis-partitions segments, leaving gaps or double-processing PR data across the append loads.

Reviewed by Cursor Bugbot for commit 18f7b98.
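
To make the mis-partitioning concrete, the sketch below counts how often each hash value is selected when one run uses a different `num_buckets` than the others. Small integers stand in for `cityHash64(segmentId)` values, and both helper names are hypothetical; the validation function shows one way a caller could require the two parameters together.

```python
def coverage(hashes, runs):
    """Count how many (bucket_id, num_buckets) runs select each hash value."""
    counts = {h: 0 for h in hashes}
    for bucket_id, num_buckets in runs:
        for h in hashes:
            if h % num_buckets == bucket_id:
                counts[h] += 1
    return counts


def validate_bucket_params(bucket_id, num_buckets):
    """Hypothetical guard: the two parameters must be given together and in range."""
    if (bucket_id is None) != (num_buckets is None):
        raise ValueError("bucket_id and num_buckets must be provided together")
    if bucket_id is not None and not 0 <= bucket_id < num_buckets:
        raise ValueError("bucket_id must be in 0..num_buckets-1")


hashes = range(20)  # stand-ins for cityHash64(segmentId)
# Bucket 1 is run with num_buckets=4 by mistake; the others use 5.
runs = [(0, 5), (1, 4), (2, 5), (3, 5), (4, 5)]
counts = coverage(hashes, runs)

gaps = sorted(h for h, c in counts.items() if c == 0)     # never processed
doubles = sorted(h for h, c in counts.items() if c > 1)   # appended twice
```

With these inputs, hashes 6, 11 and 16 are never processed and 5, 9, 13 and 17 are processed twice, so a single inconsistent run produces both gaps and duplicates under append mode.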

feat: add bucket-based processing to PR analysis snapshot (IN-1180)

ab3f339

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>

gaspergrom self-assigned this Jun 3, 2026

Copilot AI review requested due to automatic review settings June 3, 2026 19:25

Copilot AI reviewed Jun 3, 2026

Merge branch 'main' into feat/IN-1180-pull-request-analysis-initial-s…

153af27

…napshot

cursor Bot reviewed Jun 3, 2026

style: format bucket filter params in PR analysis pipe

18f7b98

Signed-off-by: Gašper Grom <gasper.grom@gmail.com>

Copilot AI review requested due to automatic review settings June 4, 2026 06:42

Copilot AI reviewed Jun 4, 2026

cursor Bot reviewed Jun 4, 2026

gaspergrom merged commit 225d3fe into main Jun 4, 2026
16 checks passed

gaspergrom deleted the feat/IN-1180-pull-request-analysis-initial-snapshot branch June 4, 2026 06:45

gaspergrom merged 3 commits into main from feat/IN-1180-pull-request-analysis-initial-snapshot


2 participants
