feat: github enricher v2 [CM-1211]#4165
Conversation
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
|
|
There was a problem hiding this comment.
Pull request overview
This PR upgrades the packages_worker GitHub repository enricher to “v2” by switching from PAT-based auth to GitHub App installation tokens, adding rate-limit diagnostics/handling, and improving enrichment throughput by streaming DB reads and batching DB writes.
Changes:
- Added GitHub App authentication utilities (JWT minting, installation discovery, installation-token caching) plus a one-time startup rate-limit capacity diagnostic.
- Reworked the enrichment loop to stream repos from DB with configurable concurrency, use installation tokens round-robin, and handle rate limits by parking installations and re-queuing work.
- Added a
skip_enrichmentflag (DB + worker logic) to permanently exclude repos that fail with permanent errors, and optimized DB updates with a single bulk UPDATE per flush.
Reviewed changes
Copilot reviewed 8 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| services/apps/packages_worker/src/enricher/updateEnrichedRepos.ts | Replaces per-row updates with a bulk UPDATE and adds a helper to mark repos as skip_enrichment. |
| services/apps/packages_worker/src/enricher/types.ts | Makes several repo fields nullable and adds optional rateLimit info to enrichment results. |
| services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts | Implements streaming DB batching, write buffering, concurrency, installation selection, and rate-limit parking/requeue behavior. |
| services/apps/packages_worker/src/enricher/githubAppAuth.ts | New module for GitHub App JWT auth, installation token caching, installation discovery, and rate-limit diagnostics. |
| services/apps/packages_worker/src/enricher/fetchLightRepo.ts | Adds request timeouts via AbortController, improves rate-limit/IP-allowlist detection, and returns nullable numeric/boolean fields plus GraphQL rateLimit info. |
| services/apps/packages_worker/src/config.ts | Adds GitHub App config loader and new concurrency/timeout envs (but still contains legacy enricher env parsing). |
| services/apps/packages_worker/src/bin/github-repos-enricher.ts | Switches CLI entrypoint from PAT tokens to GitHub App installations and runs startup diagnostics. |
| services/apps/packages_worker/package.json | Adds jsonwebtoken (+ types) dependency for GitHub App JWT creation. |
| pnpm-lock.yaml | Locks new JWT deps and includes dependency resolution updates. |
| backend/src/osspckgs/migrations/V1780496100__add_skip_enrichment_to_repos.sql | Adds skip_enrichment column to support permanently skipping repos with permanent enrichment failures. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
| export function getEnricherConfig() { | ||
| return { | ||
| updateIntervalHours: requireEnvInt('ENRICHER_REPO_UPDATE_INTERVAL_HOURS'), | ||
| idleSleepSec: requireEnvInt('ENRICHER_IDLE_SLEEP_SEC'), | ||
| concurrency: parseInt(process.env.ENRICHER_CONCURRENCY ?? '80', 10), | ||
| fetchTimeoutMs: parseInt(process.env.ENRICHER_FETCH_TIMEOUT_MS ?? '10000', 10), | ||
| } |
| async flush(): Promise<number> { | ||
| const batch = [...this.results] | ||
| const skips = [...this.skipUrls] | ||
| this.lastFlushAt = Date.now() | ||
| await Promise.all([bulkUpdateEnrichedRepos(this.qx, batch), markReposSkipped(this.qx, skips)]) | ||
| // Clear only after both writes succeed — preserves items if the DB call throws | ||
| this.results.splice(0, batch.length) | ||
| this.skipUrls.splice(0, skips.length) | ||
| return batch.length | ||
| } |
| ) | ||
| } | ||
|
|
||
| async flush(): Promise<number> { |
There was a problem hiding this comment.
WriteBuffer is shared across all workers but flush() has no mutex. Two workers can both snapshot this.results and call bulkUpdateEnrichedRepos with the same data, then both splice — the second splice may drop items added in the gap. Is this gonna be run in multiple containers at once or just one at a time?
There was a problem hiding this comment.
Good catch! Added an isFlushing guard on shouldFlush() so concurrent flushes are not possible.
Is this gonna be run in multiple containers at once or just one at a time?
A single container
|
|
||
| const fillQueue = (): void => { | ||
| if (pendingFetch || dbDone) return | ||
| pendingFetch = fetchDbBatch(qx, cursor, config.updateIntervalHours).then((rows) => { |
There was a problem hiding this comment.
.then() never runs if there is an error → pendingFetch stays as rejected promise → workers hit await pendingFetch at line 183 → throws → Promise.all at line 194 rejects → pool dies on one transient DB error.
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit f5ebe13. Configure here.
| export function getEnricherConfig() { | ||
| return { | ||
| updateIntervalHours: requireEnvInt('ENRICHER_REPO_UPDATE_INTERVAL_HOURS'), | ||
| idleSleepSec: requireEnvInt('ENRICHER_IDLE_SLEEP_SEC'), | ||
| concurrency: parseInt(process.env.ENRICHER_CONCURRENCY ?? '80', 10), | ||
| fetchTimeoutMs: parseInt(process.env.ENRICHER_FETCH_TIMEOUT_MS ?? '10000', 10), | ||
| } | ||
| } |
| let pendingFetch: Promise<void> | null = null | ||
| let logTimer = Date.now() | ||
|
|
||
| const fillQueue = (): void => { | ||
| if (pendingFetch || dbDone) return | ||
| pendingFetch = fetchDbBatch(qx, cursor, config.updateIntervalHours) | ||
| .then((rows) => { | ||
| pendingFetch = null | ||
| if (rows.length === 0) { | ||
| dbDone = true | ||
| } else { | ||
| queue.push(...rows) | ||
| cursor = rows[rows.length - 1].id | ||
| } | ||
| }) | ||
| .catch((err) => { | ||
| pendingFetch = null | ||
| log.warn({ err }, 'DB batch fetch failed, will retry') | ||
| }) | ||
| } |
| 2. The GitHub enricher polls `repos` for rows where `skip_enrichment = false` and `last_synced_at IS NULL` (never enriched) or `last_synced_at < NOW() - INTERVAL '<configurable hours>'` (stale). The re-sync interval is controlled via `ENRICHER_REPO_UPDATE_INTERVAL_HOURS`. | ||
| 3. The enricher updates those rows with full metadata and sets `last_synced_at`. Permanently unreachable repos (deleted, private, IP-allowlisted) are marked `skip_enrichment = true` and not retried. |
| function selectInstallation( | ||
| installationIds: number[], | ||
| parkedUntil: Map<number, number>, | ||
| roundRobinIdx: { value: number }, | ||
| ): { installationId: number; waitMs: number } { |
| export async function getInstallationToken( | ||
| appId: string, | ||
| privateKeyPem: string, | ||
| installationId: number, | ||
| ): Promise<string> { | ||
| const cached = tokenCache.get(installationId) | ||
| const fiveMinFromNow = new Date(Date.now() + 5 * 60 * 1000) | ||
|
|
||
| if (cached && cached.expiresAt > fiveMinFromNow) { | ||
| return cached.token | ||
| } |
Aligns with enricher-v2 (#4165). getDockerhubConfig drops the ENRICHER_GITHUB_TOKENS PAT pool; the entrypoint now calls getGithubAppConfig + resolveInstallations + fetchRateLimitDiagnostics, and the discovery fan-out runs one worker per installation id with parkedUntil keyed on installationId. getInstallationToken is called per-request so token refresh/caching is shared with the enricher. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Joan Reyero <joan@reyero.io>

This pull request introduces several improvements and new features to the GitHub repository enrichment workflow, focusing on robust GitHub App authentication, rate limit handling, and configurability. The most significant change is the addition of GitHub App authentication with support for multiple installations and rate limit diagnostics. It also adds new environment-based configuration options, enhances error handling, and introduces new dependencies for JWT handling.
Key changes:
GitHub App Authentication & Rate Limit Handling
githubAppAuth.tsmodule implementing GitHub App JWT authentication, installation token caching, rate limit diagnostics, and installation discovery. This enables the enrichment process to use a pool of GitHub App installations for API requests, improving scalability and reliability.runEnrichmentLoop.ts) to use installation tokens, select installations round-robin with rate limit awareness, and handle rate limit errors by parking installations until reset.Configuration Enhancements
getGithubAppConfig()inconfig.tsto load GitHub App credentials and installation ID overrides from environment variables. Added new configuration options for enrichment concurrency and fetch timeouts. [1] [2]Enrichment & Error Handling Improvements
fetchLightRepo()to support request timeouts, propagate rate limit and IP allowlist errors, and include rate limit info in the result. Numeric fields are now nullable for better error handling. [1] [2] [3] [4] [5]Dependency Updates
jsonwebtokenand@types/jsonwebtokenas dependencies for JWT creation and validation. [1] [2] [3]pnpm-lock.yamlwith the new dependencies and adjusted AWS SDK peer dependency resolutions. [1] [2] [3] [4] [5] [6] [7] [8] [9]Database Schema Change
skip_enrichmentboolean column to therepostable to mark repositories that should be excluded from future enrichment attempts.Note
Medium Risk
Requires a DB migration and new GitHub App secrets before deploy; bulk repo updates and skip semantics change which URLs get retried and how metadata is written at scale.
Overview
GitHub repos enricher v2 replaces comma-separated PATs with a GitHub App installation token pool, adds a high-concurrency streaming worker, and stops retrying repos that will never enrich.
Auth and config now use
CROWD_GITHUB_APP_ID/ base64CROWD_GITHUB_PRIVATE_KEY(optionalENRICHER_GITHUB_INSTALLATION_IDS), withgetGithubAppConfig()and startup GraphQL rate-limit diagnostics.ENRICHER_GITHUB_TOKENSand page/batch sizing tied to token count are gone in favor ofENRICHER_CONCURRENCYandENRICHER_FETCH_TIMEOUT_MS.The enrichment loop is reworked: cursor-backed DB queue (
skip_enrichment = false), many parallel workers, round-robin installation selection with rate-limit parking and re-queue, buffered bulk JSON updates (~500 rows / 5s), andmarkReposSkippedfor permanentNOT_FOUND/AUTH/MALFORMED(and bad URLs). Flyway addsrepos.skip_enrichment; the ADR documents the new flow.fetchLightRepogains abort timeouts, IP-allowlist →AUTH, and nullable numeric/boolean fields;jsonwebtokenis a new dependency. Lockfile also picks upjsonwebtokentypes and minor AWS SDK peer-resolution churn unrelated to enricher logic.Reviewed by Cursor Bugbot for commit 16dfba5. Bugbot is set up for automated code reviews on this repo. Configure here.