Skip to content

feat: github enricher v2 [CM-1211]#4165

Merged
mbani01 merged 9 commits into
mainfrom
fix/github_enricher
Jun 4, 2026
Merged

feat: github enricher v2 [CM-1211]#4165
mbani01 merged 9 commits into
mainfrom
fix/github_enricher

Conversation

@mbani01
Copy link
Copy Markdown
Contributor

@mbani01 mbani01 commented Jun 3, 2026

This pull request introduces several improvements and new features to the GitHub repository enrichment workflow, focusing on robust GitHub App authentication, rate limit handling, and configurability. The most significant change is the addition of GitHub App authentication with support for multiple installations and rate limit diagnostics. It also adds new environment-based configuration options, enhances error handling, and introduces new dependencies for JWT handling.

Key changes:

GitHub App Authentication & Rate Limit Handling

  • Added a new githubAppAuth.ts module implementing GitHub App JWT authentication, installation token caching, rate limit diagnostics, and installation discovery. This enables the enrichment process to use a pool of GitHub App installations for API requests, improving scalability and reliability.
  • Updated the enrichment loop (runEnrichmentLoop.ts) to use installation tokens, select installations round-robin with rate limit awareness, and handle rate limit errors by parking installations until reset.

Configuration Enhancements

  • Introduced getGithubAppConfig() in config.ts to load GitHub App credentials and installation ID overrides from environment variables. Added new configuration options for enrichment concurrency and fetch timeouts. [1] [2]

Enrichment & Error Handling Improvements

  • Improved fetchLightRepo() to support request timeouts, propagate rate limit and IP allowlist errors, and include rate limit info in the result. Numeric fields are now nullable for better error handling. [1] [2] [3] [4] [5]
  • Enhanced retry logic and error categorization in the enrichment loop to distinguish between permanent and transient failures, and to mark repositories as skipped when appropriate.

Dependency Updates

  • Added jsonwebtoken and @types/jsonwebtoken as dependencies for JWT creation and validation. [1] [2] [3]
  • Updated pnpm-lock.yaml with the new dependencies and adjusted AWS SDK peer dependency resolutions. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Database Schema Change

  • Added a new skip_enrichment boolean column to the repos table to mark repositories that should be excluded from future enrichment attempts.

Note

Medium Risk
Requires a DB migration and new GitHub App secrets before deploy; bulk repo updates and skip semantics change which URLs get retried and how metadata is written at scale.

Overview
GitHub repos enricher v2 replaces comma-separated PATs with a GitHub App installation token pool, adds a high-concurrency streaming worker, and stops retrying repos that will never enrich.

Auth and config now use CROWD_GITHUB_APP_ID / base64 CROWD_GITHUB_PRIVATE_KEY (optional ENRICHER_GITHUB_INSTALLATION_IDS), with getGithubAppConfig() and startup GraphQL rate-limit diagnostics. ENRICHER_GITHUB_TOKENS and page/batch sizing tied to token count are gone in favor of ENRICHER_CONCURRENCY and ENRICHER_FETCH_TIMEOUT_MS.

The enrichment loop is reworked: cursor-backed DB queue (skip_enrichment = false), many parallel workers, round-robin installation selection with rate-limit parking and re-queue, buffered bulk JSON updates (~500 rows / 5s), and markReposSkipped for permanent NOT_FOUND / AUTH / MALFORMED (and bad URLs). Flyway adds repos.skip_enrichment; the ADR documents the new flow.

fetchLightRepo gains abort timeouts, IP-allowlist → AUTH, and nullable numeric/boolean fields; jsonwebtoken is a new dependency. Lockfile also picks up jsonwebtoken types and minor AWS SDK peer-resolution churn unrelated to enricher logic.

Reviewed by Cursor Bugbot for commit 16dfba5. Bugbot is set up for automated code reviews on this repo. Configure here.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Copilot AI review requested due to automatic review settings June 3, 2026 17:35
@mbani01 mbani01 self-assigned this Jun 3, 2026
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the packages_worker GitHub repository enricher to “v2” by switching from PAT-based auth to GitHub App installation tokens, adding rate-limit diagnostics/handling, and improving enrichment throughput by streaming DB reads and batching DB writes.

Changes:

  • Added GitHub App authentication utilities (JWT minting, installation discovery, installation-token caching) plus a one-time startup rate-limit capacity diagnostic.
  • Reworked the enrichment loop to stream repos from DB with configurable concurrency, use installation tokens round-robin, and handle rate limits by parking installations and re-queuing work.
  • Added a skip_enrichment flag (DB + worker logic) to permanently exclude repos that fail with permanent errors, and optimized DB updates with a single bulk UPDATE per flush.

Reviewed changes

Copilot reviewed 8 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
services/apps/packages_worker/src/enricher/updateEnrichedRepos.ts Replaces per-row updates with a bulk UPDATE and adds a helper to mark repos as skip_enrichment.
services/apps/packages_worker/src/enricher/types.ts Makes several repo fields nullable and adds optional rateLimit info to enrichment results.
services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts Implements streaming DB batching, write buffering, concurrency, installation selection, and rate-limit parking/requeue behavior.
services/apps/packages_worker/src/enricher/githubAppAuth.ts New module for GitHub App JWT auth, installation token caching, installation discovery, and rate-limit diagnostics.
services/apps/packages_worker/src/enricher/fetchLightRepo.ts Adds request timeouts via AbortController, improves rate-limit/IP-allowlist detection, and returns nullable numeric/boolean fields plus GraphQL rateLimit info.
services/apps/packages_worker/src/config.ts Adds GitHub App config loader and new concurrency/timeout envs (but still contains legacy enricher env parsing).
services/apps/packages_worker/src/bin/github-repos-enricher.ts Switches CLI entrypoint from PAT tokens to GitHub App installations and runs startup diagnostics.
services/apps/packages_worker/package.json Adds jsonwebtoken (+ types) dependency for GitHub App JWT creation.
pnpm-lock.yaml Locks new JWT deps and includes dependency resolution updates.
backend/src/osspckgs/migrations/V1780496100__add_skip_enrichment_to_repos.sql Adds skip_enrichment column to support permanently skipping repos with permanent enrichment failures.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread services/apps/packages_worker/src/config.ts Outdated
Comment thread services/apps/packages_worker/src/config.ts Outdated
Comment thread services/apps/packages_worker/src/enricher/githubAppAuth.ts
mbani01 and others added 2 commits June 3, 2026 18:46
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 requested a review from themarolt June 3, 2026 17:47
@mbani01 mbani01 marked this pull request as ready for review June 3, 2026 17:47
Copilot AI review requested due to automatic review settings June 3, 2026 17:47
Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 10 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
Comment thread services/apps/packages_worker/src/config.ts
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
Comment thread services/apps/packages_worker/src/enricher/fetchLightRepo.ts
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Copilot AI review requested due to automatic review settings June 3, 2026 18:02
Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +40 to 46
export function getEnricherConfig() {
return {
updateIntervalHours: requireEnvInt('ENRICHER_REPO_UPDATE_INTERVAL_HOURS'),
idleSleepSec: requireEnvInt('ENRICHER_IDLE_SLEEP_SEC'),
concurrency: parseInt(process.env.ENRICHER_CONCURRENCY ?? '80', 10),
fetchTimeoutMs: parseInt(process.env.ENRICHER_FETCH_TIMEOUT_MS ?? '10000', 10),
}
Comment on lines +112 to +121
async flush(): Promise<number> {
const batch = [...this.results]
const skips = [...this.skipUrls]
this.lastFlushAt = Date.now()
await Promise.all([bulkUpdateEnrichedRepos(this.qx, batch), markReposSkipped(this.qx, skips)])
// Clear only after both writes succeed — preserves items if the DB call throws
this.results.splice(0, batch.length)
this.skipUrls.splice(0, skips.length)
return batch.length
}
)
}

async flush(): Promise<number> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriteBuffer is shared across all workers but flush() has no mutex. Two workers can both snapshot this.results and call bulkUpdateEnrichedRepos with the same data, then both splice — the second splice may drop items added in the gap. Is this gonna be run in multiple containers at once or just one at a time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Added an isFlushing guard on shouldFlush() so concurrent flushes are not possible.

Is this gonna be run in multiple containers at once or just one at a time?

A single container


const fillQueue = (): void => {
if (pendingFetch || dbDone) return
pendingFetch = fetchDbBatch(qx, cursor, config.updateIntervalHours).then((rows) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.then() never runs if there is an error → pendingFetch stays as rejected promise → workers hit await pendingFetch at line 183 → throws → Promise.all at line 194 rejects → pool dies on one transient DB error.

mbani01 and others added 2 commits June 4, 2026 11:15
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Copilot AI review requested due to automatic review settings June 4, 2026 10:16
Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
themarolt
themarolt previously approved these changes Jun 4, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f5ebe13. Configure here.

Comment thread services/apps/packages_worker/src/enricher/runEnrichmentLoop.ts
@mbani01 mbani01 changed the title feat: github enricher v2 feat: github enricher v2 [CM-1211] Jun 4, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 5 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +40 to 47
export function getEnricherConfig() {
return {
updateIntervalHours: requireEnvInt('ENRICHER_REPO_UPDATE_INTERVAL_HOURS'),
idleSleepSec: requireEnvInt('ENRICHER_IDLE_SLEEP_SEC'),
concurrency: parseInt(process.env.ENRICHER_CONCURRENCY ?? '80', 10),
fetchTimeoutMs: parseInt(process.env.ENRICHER_FETCH_TIMEOUT_MS ?? '10000', 10),
}
}
Comment on lines +170 to +189
let pendingFetch: Promise<void> | null = null
let logTimer = Date.now()

const fillQueue = (): void => {
if (pendingFetch || dbDone) return
pendingFetch = fetchDbBatch(qx, cursor, config.updateIntervalHours)
.then((rows) => {
pendingFetch = null
if (rows.length === 0) {
dbDone = true
} else {
queue.push(...rows)
cursor = rows[rows.length - 1].id
}
})
.catch((err) => {
pendingFetch = null
log.warn({ err }, 'DB batch fetch failed, will retry')
})
}
Comment on lines +258 to +259
2. The GitHub enricher polls `repos` for rows where `skip_enrichment = false` and `last_synced_at IS NULL` (never enriched) or `last_synced_at < NOW() - INTERVAL '<configurable hours>'` (stale). The re-sync interval is controlled via `ENRICHER_REPO_UPDATE_INTERVAL_HOURS`.
3. The enricher updates those rows with full metadata and sets `last_synced_at`. Permanently unreachable repos (deleted, private, IP-allowlisted) are marked `skip_enrichment = true` and not retried.
Comment on lines +20 to +24
function selectInstallation(
installationIds: number[],
parkedUntil: Map<number, number>,
roundRobinIdx: { value: number },
): { installationId: number; waitMs: number } {
Comment on lines +60 to +70
export async function getInstallationToken(
appId: string,
privateKeyPem: string,
installationId: number,
): Promise<string> {
const cached = tokenCache.get(installationId)
const fiveMinFromNow = new Date(Date.now() + 5 * 60 * 1000)

if (cached && cached.expiresAt > fiveMinFromNow) {
return cached.token
}
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 merged commit 757af33 into main Jun 4, 2026
15 checks passed
@mbani01 mbani01 deleted the fix/github_enricher branch June 4, 2026 10:43
joanreyero added a commit that referenced this pull request Jun 5, 2026
Aligns with enricher-v2 (#4165). getDockerhubConfig drops the
ENRICHER_GITHUB_TOKENS PAT pool; the entrypoint now calls getGithubAppConfig +
resolveInstallations + fetchRateLimitDiagnostics, and the discovery fan-out
runs one worker per installation id with parkedUntil keyed on installationId.
getInstallationToken is called per-request so token refresh/caching is shared
with the enricher.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants