feat: osv advisories ingestion#4149
Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability. Please add a Jira issue key to your PR title.
Pull request overview
Adds an osv-sync sub-worker to packages_worker that ingests OSV bulk advisories (npm + Maven), normalizes them into packages-db advisory tables, and derives the denormalized packages.has_critical_vulnerability flag based on ecosystem-specific version comparisons and scored severity.
Changes:
- Introduces OSV ingestion pipeline (download/parse/score/upsert) and a post-pass derivation step for has_critical_vulnerability.
- Adds a Maven ComparableVersion-style comparator and npm semver comparator plus unit/integration tests (Vitest).
- Adds docs (ADRs) and local/docker service wiring for running osv-sync.
Reviewed changes
Copilot reviewed 24 out of 27 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| services/apps/packages_worker/vitest.config.ts | Adds Vitest configuration for the packages_worker test suite. |
| services/apps/packages_worker/src/osv/versionCompare.ts | Implements ecosystem-specific version comparison (npm semver + Maven-like comparator). |
| services/apps/packages_worker/src/osv/upsertAdvisory.ts | Writes normalized OSV advisories/packages/ranges to packages-db via transactional upserts. |
| services/apps/packages_worker/src/osv/types.ts | Defines OSV raw/normalized types and a FetchError for ingestion. |
| services/apps/packages_worker/src/osv/index.ts | Orchestrates per-ecosystem sync loop with retries and post-pass critical-flag derivation. |
| services/apps/packages_worker/src/osv/fetchEcosystemZip.ts | Streams OSV zip download to disk and iterates JSON entries for parsing. |
| services/apps/packages_worker/src/osv/extractSeverity.ts | Extracts/seeds severity and CVSS score from OSV records (MAL-/V3/qualitative fallback). |
| services/apps/packages_worker/src/osv/deriveCriticalFlag.ts | Recomputes packages.has_critical_vulnerability by checking latest_version against critical ranges. |
| services/apps/packages_worker/src/osv/cvssScoring.ts | Implements inline CVSS v3.1 base-score calculation. |
| services/apps/packages_worker/src/osv/tests/versionCompare.test.ts | Unit tests for npm and Maven version ordering. |
| services/apps/packages_worker/src/osv/tests/parseOsvRecord.test.ts | Unit tests for OSV record parsing behaviors (name splitting, allowlist, range flattening). |
| services/apps/packages_worker/src/osv/tests/extractSeverity.test.ts | Unit tests for severity extraction paths (MAL-, V3, V4-only qualitative fallback). |
| services/apps/packages_worker/src/osv/tests/deriveCriticalFlag.integration.test.ts | DB-backed integration tests for end-to-end derivation behavior (skipped without env). |
| services/apps/packages_worker/src/osv/tests/cvssScoring.test.ts | Reference-vector tests to pin CVSS scoring implementation. |
| services/apps/packages_worker/src/config.ts | Adds OSV-specific worker config sourced from env vars. |
| services/apps/packages_worker/src/bin/osv-sync.ts | Adds standalone entrypoint binary for the OSV sync worker with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and dependencies/devDependencies for osv-sync + tests. |
| scripts/services/osv-sync.yaml | Adds docker-compose service definition for running osv-sync locally/composed. |
| pnpm-lock.yaml | Locks newly added dependencies (semver/unzipper/vitest and transitive deps). |
| docs/adr/README.md | Registers ADRs 0003–0005 in the ADR index. |
| docs/adr/0005-cvss-scoring-strategy.md | Documents CVSS scoring strategy (inline v3.1, defer v4). |
| docs/adr/0004-standalone-bin-vs-temporal-for-batch-sub-workers.md | Documents rationale for standalone-bin execution shape for batch sub-workers. |
| docs/adr/0003-has-critical-vulnerability-semantics.md | Documents semantics for has_critical_vulnerability and derivation strategy. |
| backend/src/osspckgs/migrations/V1779871327__add_has_critical_vulnerability_to_packages.sql | Adds the has_critical_vulnerability column + partial index to packages-db. |
| backend/src/osspckgs/migrations/V1779871303__add_cvss_source_to_advisories.sql | Adds advisories.cvss_source for score provenance. |
| backend/.env.dist.local | Adds default local env vars for running osv-sync. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
Signed-off-by: Joan Reyero <joan@reyero.io>
Adds the osv-sync sub-worker inside packages_worker. Pulls OSV's daily per-ecosystem zip for npm and Maven, normalizes each record, and upserts into advisories, advisory_packages, and advisory_affected_ranges (transactional UPSERT, idempotent on osv_id + the range unique index). MAL- malicious-package reports are ingested with cvss=NULL and cvss_source='osv_malicious_package'.

A deriveCriticalFlag step runs at the end of each pass and flips packages.has_critical_vulnerability TRUE iff a critical advisory (cvss>=7.0 OR osv_id LIKE 'MAL-%') has an affected range covering the package's current latest_version, using ecosystem-specific comparators (semver for npm, ComparableVersion-style for Maven). See ADR-0003 for the semantics.

CVSS scoring computes v3.1 base scores inline from the FIRST spec; v4 numeric scoring is deferred (V4-only records fall back to the qualitative tag from database_specific.severity).

Verified locally against the full OSV dataset (226,258 advisories; log4shell CVSS=10.0, lodash CVE-2021-23337 CVSS=7.2; 213,414 MAL- entries ingested).

Signed-off-by: Joan Reyero <joan@reyero.io>
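The criticality rule in this commit message (cvss>=7.0 OR osv_id LIKE 'MAL-%') can be sketched as a small predicate. This is a minimal illustration, not the worker's actual code: the `AdvisoryLike` shape and the function name are assumptions made for the example.

```typescript
// Hypothetical row shape; the real advisory type is not shown in this PR page.
interface AdvisoryLike {
  osvId: string
  cvss: number | null
}

// Threshold taken from the commit message (cvss >= 7.0).
const CRITICAL_CVSS_THRESHOLD = 7.0

// True when the advisory counts as critical: either a malicious-package
// report (MAL- id prefix, which is ingested with cvss = NULL) or a scored
// advisory at or above the threshold.
function isCriticalAdvisory(a: AdvisoryLike): boolean {
  if (a.osvId.startsWith('MAL-')) return true
  return a.cvss !== null && a.cvss >= CRITICAL_CVSS_THRESHOLD
}
```

Whether an advisory is critical is only half of the flag: the package's latest_version must also fall inside one of that advisory's affected ranges.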
Adds vitest with five test files covering the OSV pipeline:
- cvssScoring: 10 cases pinning the inline v3.1 implementation against FIRST-published scores (log4shell 10.0, shellshock 9.8, heartbleed 7.5, and others). Catches future regressions in the formula.
- extractSeverity: MAL- short-circuit, V3 vector path, V4-only fall through, qualitative fallback, malformed-vector handling.
- parseOsvRecord: Maven groupId:artifactId split, npm @scope/ split, ecosystem allowlist filter, range flattening (introduced -> fixed, introduced -> last_affected, MAL- always-vulnerable, GIT skipped), multi-affected[] coalescing.
- versionCompare: npm semver ordering + coercion; Maven dotted versions, qualifier ranks (alpha < beta < milestone < rc < snapshot < ga/final < sp), qualifier aliases, numeric > alpha at same depth, cross-ecosystem null.
- deriveCriticalFlag (integration, real packages-db, skipped without DB env): lodash 4.17.20 flips TRUE / 5.0.0 clears, log4j-core 2.14.1 flips TRUE / 2.17.0 clears, MAL- target flips via the osv_id LIKE prefix override, catch-up resolver populates advisory_packages.package_id for late-arriving packages, and a regression guard around the Maven 1.0-final edge case.

The versionCompare suite caught a real bug: compareMaven used a num:0 pad for missing tokens in both kinds of comparison. That made 1.0-final < 1.0 (should be equal: 'final' is an alias for the empty 'ga' qualifier) and 1.0 > 1.0-sp1 (should be less: 'sp' outranks 'ga'). Fixed by picking the pad type based on the other side's kind (num:0 vs str:''), matching the Maven ComparableVersion algorithm.

Also verified out-of-band (not in suite):
- Idempotency: rerunning OSV sync leaves advisories, advisory_packages, advisory_affected_ranges row counts and the md5 hash of (osv_id, cvss, cvss_source) bit-identical.
- SIGINT mid-pass: shutdown handler runs, current batch flushes, derive + sleep skip, process exits 0.

68 tests / 5 files pass; lint + prettier + tsc clean.
Signed-off-by: Joan Reyero <joan@reyero.io>
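The qualifier ranking listed for the versionCompare suite can be illustrated with a small lookup. This is a sketch of the ordering only, not the PR's compareMaven: function names are hypothetical, the alias table follows Maven's documented ComparableVersion aliases, and unknown qualifiers return null here as a simplification.

```typescript
// Known qualifiers in ascending order. The empty string stands for the
// plain release, which is what 'ga' and 'final' alias to.
const QUALIFIER_ORDER = ['alpha', 'beta', 'milestone', 'rc', 'snapshot', '', 'sp']

// Aliases as documented for Maven's ComparableVersion.
const QUALIFIER_ALIASES = new Map<string, string>([
  ['ga', ''],
  ['final', ''],
  ['release', ''],
  ['cr', 'rc'],
])

// Rank of a qualifier, or null when it is not one of the known ones.
function qualifierRank(q: string): number | null {
  const lower = q.toLowerCase()
  const canonical = QUALIFIER_ALIASES.get(lower) ?? lower
  const idx = QUALIFIER_ORDER.indexOf(canonical)
  return idx === -1 ? null : idx
}

// Returns -1, 0, or 1, or null if either side is unknown (simplification).
function compareQualifierSketch(a: string, b: string): number | null {
  const ra = qualifierRank(a)
  const rb = qualifierRank(b)
  if (ra === null || rb === null) return null
  return Math.sign(ra - rb)
}
```

Under this ordering a missing qualifier ('') ties with 'final' and sorts below 'sp', which is the behavior the pad-type fix restores.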
ADR-0004 captures the standalone-bin vs Temporal decision for batch sub-workers in packages_worker (OSV uses standalone; npm package sync will use Temporal). ADR-0005 captures the CVSS scoring strategy (inline v3.1 from the FIRST spec, V4 numeric scoring deferred to a follow-up, qualitative fallback in the meantime). Both record the alternatives that were considered and rejected so the next engineer touching these areas has the rationale in one place. Signed-off-by: Joan Reyero <joan@reyero.io>
- fetchEcosystemZip: move clearTimeout to cover the pipeline body
stream, not just the headers fetch. Map pipeline rejection to a
NETWORK FetchError so withRetry handles stalled mid-transfer
connections instead of hanging past the daily window.
- index.ts: hoist counters and buffer into the withRetry closure so a
transient retry restarts from zero. UPSERT is idempotent on osv_id,
so re-flushed batches are safe.
- index.ts: switch error/warn logs from err.message to { err } so the
structured logger preserves stack and metadata, matching the rest
of the service.
- extractSeverity.ts: rewrite the lede comment to match ADR-0005
(V4 numeric scoring deferred; v1 skips V4 entirely and falls
through to qualitative for V4-only records).
- V1779871303 migration: list all four cvss_source values so the
schema doc matches the contract in types.ts and ADR-0005.
- deriveCriticalFlag integration test: extend HAVE_DB to require
CROWD_PACKAGES_DB_DATABASE and CROWD_PACKAGES_DB_PASSWORD too, so
half-set envs skip cleanly instead of failing in beforeAll.
Signed-off-by: Joan Reyero <joan@reyero.io>
cvssScoring.ts read `v.S` directly into the `s === 'C' ? ... : ...`
branches but never validated it. A vector missing `S` or carrying an
invalid value like `S:X` would silently take the Scope:Unchanged
branch in every formula and return a wrong numeric score instead of
null. The 10 reference-vector tests didn't catch it because every
test vector had a valid S:U or S:C.
This is the exact failure mode ADR-0005 named as the headline risk
of choosing inline scoring over the cvss npm package — wrong scores
feed advisories.is_critical and packages.has_critical_vulnerability,
i.e. the entire security overlay.
Fix: validate `s` against {U, C} up front and return null otherwise.
Added two regression tests covering the missing-S and invalid-S
paths.
Caught by Cursor's bot review on cbaf41d.
Signed-off-by: Joan Reyero <joan@reyero.io>
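For readers unfamiliar with the formula, the Scope check described above can be shown inside a compact CVSS v3.1 base-score calculation. This is an independent sketch written from the published FIRST v3.1 equations, not the PR's cvssScoring.ts; all names are made up for the example, and it accepts only the `CVSS:3.1` prefix.

```typescript
const AV: Record<string, number> = { N: 0.85, A: 0.62, L: 0.55, P: 0.2 }
const AC: Record<string, number> = { L: 0.77, H: 0.44 }
const UI: Record<string, number> = { N: 0.85, R: 0.62 }
const CIA: Record<string, number> = { H: 0.56, L: 0.22, N: 0 }
const PR_UNCHANGED: Record<string, number> = { N: 0.85, L: 0.62, H: 0.27 }
const PR_CHANGED: Record<string, number> = { N: 0.85, L: 0.68, H: 0.5 }

// Weight lookup that yields undefined for a missing or invalid metric value.
function weight(table: Record<string, number>, key: string | undefined): number | undefined {
  if (key === undefined || !Object.prototype.hasOwnProperty.call(table, key)) return undefined
  return table[key]
}

// Roundup as defined in the v3.1 spec (integer arithmetic to avoid float drift).
function roundUp1(x: number): number {
  const i = Math.round(x * 100000)
  return i % 10000 === 0 ? i / 100000 : (Math.floor(i / 10000) + 1) / 10
}

function cvss31BaseScoreSketch(vector: string): number | null {
  const parts = vector.split('/')
  if (parts[0] !== 'CVSS:3.1') return null
  const m = new Map<string, string>()
  for (const part of parts.slice(1)) {
    const [k, v] = part.split(':')
    if (k && v) m.set(k, v)
  }
  // Validate Scope up front: anything other than U or C yields null
  // instead of silently taking the Scope:Unchanged branch.
  const s = m.get('S')
  if (s !== 'U' && s !== 'C') return null
  const changed = s === 'C'
  const av = weight(AV, m.get('AV'))
  const ac = weight(AC, m.get('AC'))
  const pr = weight(changed ? PR_CHANGED : PR_UNCHANGED, m.get('PR'))
  const ui = weight(UI, m.get('UI'))
  const c = weight(CIA, m.get('C'))
  const i = weight(CIA, m.get('I'))
  const a = weight(CIA, m.get('A'))
  if (av === undefined || ac === undefined || pr === undefined || ui === undefined) return null
  if (c === undefined || i === undefined || a === undefined) return null
  const iss = 1 - (1 - c) * (1 - i) * (1 - a)
  const impact = changed ? 7.52 * (iss - 0.029) - 3.25 * Math.pow(iss - 0.02, 15) : 6.42 * iss
  if (impact <= 0) return 0
  const exploitability = 8.22 * av * ac * pr * ui
  const raw = changed ? 1.08 * (impact + exploitability) : impact + exploitability
  return roundUp1(Math.min(raw, 10))
}
```

Returning null for an unusable vector keeps a malformed record from producing a plausible-looking score.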
The unique index on advisory_affected_ranges shipped in V1779710880 keyed on (advisory_package_id, COALESCE(introduced_version, '')), which is strictly narrower than the natural uniqueness of a range tuple, and narrower than the principle locked in osv-plan §2 #1 ("one package has many version ranges; no denormalization"). dedupeRanges in upsertAdvisory.ts was keying on introduced_version alone to match that index, with the side effect that two ranges sharing an introduced_version but differing in fixed_version or last_affected (cross-distro patches, partial fixes) silently collapsed to the first occurrence. When the surviving range was the narrower one, isInRange returned FALSE for versions inside the wider window: a missed critical alert.

Three changes:
- V1779897650__widen_advisory_affected_ranges_unique_index.sql: drop the narrow unique index (located via pg_indexes since the initial migration didn't name it) and replace with the full-tuple unique index over (advisory_package_id, COALESCE(introduced_version,''), COALESCE(fixed_version,''), COALESCE(last_affected,'')).
- upsertAdvisory.ts dedupeRanges: key on the full tuple so the application-side pre-flight matches the database constraint. Exported for unit testing.
- upsertAdvisory.test.ts: 5 cases pinning the new semantics (same-introduced-different-fixed preserved, same-introduced-different-last_affected preserved, identical-tuple collapsed, null-introduced disambiguated by other fields, first-wins on truly identical tuples).

ADR-0006 captures the decision and the alternatives considered (coalesce-to-widest at parse time, drop the constraint, dedup at query time). Cursor's bot review on 1b978ac surfaced the bug.

Signed-off-by: Joan Reyero <joan@reyero.io>
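A full-tuple dedupe along the lines described in this commit might look like the sketch below. It assumes a simplified `RangeLike` shape and a hypothetical function name; the real dedupeRanges signature is not shown on this page.

```typescript
// Hypothetical range shape, mirroring the three structured columns.
interface RangeLike {
  introducedVersion: string | null
  fixedVersion: string | null
  lastAffected: string | null
}

// Key on the whole tuple, mapping null to '' the same way the
// COALESCE(..., '') expressions in the unique index do.
function rangeKey(r: RangeLike): string {
  return JSON.stringify([r.introducedVersion ?? '', r.fixedVersion ?? '', r.lastAffected ?? ''])
}

// First occurrence wins on truly identical tuples; ranges that differ in
// any one of the three fields are all preserved.
function dedupeRangesSketch(ranges: RangeLike[]): RangeLike[] {
  const seen = new Set<string>()
  const out: RangeLike[] = []
  for (const r of ranges) {
    const key = rangeKey(r)
    if (seen.has(key)) continue
    seen.add(key)
    out.push(r)
  }
  return out
}
```

Keeping the application-side key identical to the index expression means a batch that passes the pre-flight cannot trip the database constraint on its own rows.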
ADR-0001 §Worker architecture (decided 2026-05-25) standardized packages_worker sub-workers on the Temporal shape: workflows.ts + activities.ts + schedule.ts per sub-worker, registered against the shared packages-worker task queue, with ScheduleOverlapPolicy.SKIP so a slow run does not queue a second concurrent execution.

This branch originally implemented osv-sync as a standalone bin following the github-repos-enricher precedent and proposed ADR-0004 to codify that pattern. Rebasing onto current main surfaced the conflict: github-repos-enricher is now explicitly listed as a legacy exception with migration deferred, and osv/ is named alongside npm/ and deps-dev/ as Temporal-based from the start. ADR-0004 is removed (superseded by ADR-0001 before merging).

Changes:
- src/osv/workflows.ts: osvSync workflow definition. Loops the ecosystems passed via schedule args, calls osvSyncEcosystem activity per ecosystem (per-ecosystem failure logged but does not abort the pass), then osvDeriveCriticalFlag. proxyActivities with retry policy (3 attempts, exp backoff) and nonRetryableErrorTypes for NOT_FOUND / PARSE.
- src/osv/activities.ts: osvSyncEcosystem and osvDeriveCriticalFlag activities wrapping the existing pure-function pipeline (fetchEcosystemZip + parseOsvRecord + upsertAdvisoryBatch + deriveCriticalFlag). Heartbeats every 1000 records keep Temporal aware during the ~1-hour npm pass. FetchError(NOT_FOUND|PARSE) is translated to ApplicationFailure.nonRetryable so RetryPolicy short-circuits.
- src/osv/schedule.ts: scheduleOsvSync registers an idempotent daily schedule at 03:30 UTC (offset from npm-registry-ingest at 03:15). 4-hour workflowExecutionTimeout, SKIP overlap.
- src/workflows/index.ts + src/activities.ts: re-export the new workflow + activities for the worker bundle / runtime.
- src/bin/packages-worker.ts: call scheduleOsvSync on startup alongside scheduleNpmIngest.

Deleted:
- src/bin/osv-sync.ts (entry point no longer needed)
- src/osv/index.ts (runOsvSync sleep loop replaced by Temporal)
- scripts/services/osv-sync.yaml (separate compose service gone; packages-worker container serves osv via the shared task queue)
- docs/adr/0004 (superseded by ADR-0001 §Worker architecture)

Env vars no longer needed (replaced by Temporal scheduling):
- OSV_SYNC_INTERVAL_HOURS (cron expression owns cadence)
- OSV_IDLE_SLEEP_SEC (no internal sleep loop)
- getOsvConfig() removed from src/config.ts

The pure-function pipeline survives untouched: extractSeverity, cvssScoring, parseOsvRecord, fetchEcosystemZip, upsertAdvisory, deriveCriticalFlag, versionCompare, types. All 68 unit tests still pass. The Temporal activity is a thin orchestrator over the same functions, so the migration is shape-only: no business logic changes.

Signed-off-by: Joan Reyero <joan@reyero.io>
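The "heartbeat every 1000 records" pattern from the activity can be sketched without the Temporal SDK by injecting the heartbeat as a callback. Everything here is illustrative: the function name and signature are assumptions, and in a real activity the callback would presumably wrap the SDK's heartbeat call.

```typescript
// Processes records one by one and reports progress every `every` records.
// `heartbeat` is injected so the sketch stays standalone and testable.
function processWithHeartbeat<T>(
  records: Iterable<T>,
  handle: (record: T) => void,
  heartbeat: (processed: number) => void,
  every = 1000,
): number {
  let processed = 0
  for (const record of records) {
    handle(record)
    processed += 1
    // Signal liveness at a fixed record cadence rather than on a timer,
    // so a stalled loop stops heartbeating and the timeout can fire.
    if (processed % every === 0) heartbeat(processed)
  }
  return processed
}
```

Because the cadence is tied to record count, the heartbeat timeout has to be longer than the slowest expected gap between two heartbeats, including any work done before the first record is handled.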
d2e92b5 to
0898c23
Compare
Joana introduced docs/adr/0001-oss-packages-design-decisions.md (PR #4151, merged 2026-05-27) as the single living record for the oss-packages domain, with one section per decision and a Changelog at the bottom. Our slice landed three standalone ADRs (0003 / 0005 / 0006) before that consolidation merged. This rebases them into her structure.

Folded into ADR-0001:
- CVSS scoring strategy (was ADR-0005): inline v3.1, qualitative fallback, v4 deferred, Scope-metric validation. Adjacent to §OSV.
- has_critical_vulnerability semantics (was ADR-0003): option (b) + MAL- override, comparator-driven derivation, idempotent recompute. Resolves the prior open question on this flag.
- advisory_affected_ranges uniqueness scope (was ADR-0006): full tuple unique index restores the "no denormalization" invariant.

Other changes:
- Updated the Scope and current status table at the top of ADR-0001 to list the three new decisions and remove the "flag deferred" qualifier on OSV.
- Removed the open question on has_critical_vulnerability, now decided in §has_critical_vulnerability semantics.
- Added a 2026-05-28 Changelog entry documenting the fold and the removal of standalone ADR-0004 (Temporal vs standalone-bin, superseded by §Worker architecture before merge).
- Rewrote all in-tree ADR cross-references (6 source files, 3 migration SQL files) to point at the new section anchors instead of the removed standalone ADR ids.
- Updated docs/adr/README.md to list only ADR-0001 (matches Joana's intent in PR #4151).

All 68 unit tests still pass.

Signed-off-by: Joan Reyero <joan@reyero.io>
Two fixes derived from the alignment audit against ADR-0001 (Joana's oss-packages living ADR, merged in PR #4151):

1. Lowercase ecosystem ('Maven' -> 'maven')

ADR-0001 §OSV "Ecosystem normalization" stores ecosystems lowercase ('npm' and 'maven', not 'Maven'). The OSV worker preserved OSV's titlecase 'Maven' on the way in, leaving 7,957 Maven rows in advisory_packages that would never join correctly against a lowercase-storing packages table.

Changes:
- parseOsvRecord lowercases the ecosystem at the OSV boundary before allowlist check, splitName, and storage. Allowlist set is lowercase in env (OSV_ECOSYSTEMS=npm,maven) and in tests.
- splitName branches on 'maven' instead of 'Maven'.
- deriveCriticalFlag.ts catch-up SQL CASE expression uses 'maven'.
- New migration V1779951727 backfills existing Maven rows in advisory_packages (and packages, which is empty today but reads cheaply).

2. Populate advisories.source and source_url

The consolidated initial schema has two columns the upsert path didn't set: source ('OSV' for everything this worker writes; granular GHSA / NVD / NSWG attribution belongs to the future deps.dev BQ worker) and source_url (canonical OSV URL: https://osv.dev/vulnerability/<id>).

Changes:
- NormalizedAdvisory grew source: string and sourceUrl: string | null.
- parseOsvRecord sets source: 'OSV' and sourceUrl: osvSourceUrl(id).
- upsertAdvisory INSERT and ON CONFLICT UPDATE include both columns.

All 68 unit tests still pass; migration applied locally and confirmed the existing Maven rows backfilled to maven without unique-index collisions.

Signed-off-by: Joan Reyero <joan@reyero.io>
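The boundary normalization described in this commit (lowercase the ecosystem first, then split the name, then build the source URL) could be sketched as follows. The helper names carry a `Sketch` suffix because the real parseOsvRecord, splitName, and osvSourceUrl bodies are not shown here; in particular, keeping the leading `@` in the npm namespace is an assumption.

```typescript
interface SplitName {
  namespace: string | null
  name: string
}

// Lowercase once, at the OSV boundary, so every later step sees 'maven'.
function normalizeEcosystemSketch(raw: string): string {
  return raw.toLowerCase()
}

// Maven names are 'groupId:artifactId'; scoped npm names are '@scope/name'.
function splitNameSketch(ecosystem: string, fullName: string): SplitName {
  if (ecosystem === 'maven') {
    const idx = fullName.indexOf(':')
    if (idx > 0) return { namespace: fullName.slice(0, idx), name: fullName.slice(idx + 1) }
  }
  if (ecosystem === 'npm' && fullName.startsWith('@')) {
    const idx = fullName.indexOf('/')
    // Assumption: the namespace keeps its leading '@'.
    if (idx > 0) return { namespace: fullName.slice(0, idx), name: fullName.slice(idx + 1) }
  }
  return { namespace: null, name: fullName }
}

// Canonical OSV URL format quoted in the commit message.
function osvSourceUrlSketch(id: string): string {
  return `https://osv.dev/vulnerability/${id}`
}
```

Note that passing the un-normalized 'Maven' to the split step falls through to the no-namespace branch, which is the same class of miss the commit fixes.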
Cursor pointed out that osvDeriveCriticalFlag was calling getActivityConfig() and eagerly validating OSV_BULK_BASE_URL, OSV_TMP_DIR, and OSV_BATCH_SIZE — env vars only the sync activity uses. Running the derive activity in isolation (e.g. for testing or debugging) failed with a misleading "Missing required env var" pointing at vars the derive never needed. Split into getSyncConfig() and getDeriveConfig(), each activity reads only its own env. Matches the same shape applied to proxyActivities in workflows.ts last round — sync and derive are independent activities and should have independent contracts. No behavior change in normal scheduled runs (the workflow sets up the full env anyway); the cleanup is for isolated/debug paths. 72 unit tests still pass. Signed-off-by: Joan Reyero <joan@reyero.io>
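The split between getSyncConfig() and getDeriveConfig() might look roughly like this. It is a sketch under assumptions: the env is passed in explicitly for testability, the returned field names are invented, and `OSV_DERIVE_PAGE_SIZE` is a hypothetical variable used only to show that derive reads nothing the sync activity owns.

```typescript
type Env = Record<string, string | undefined>

function requireEnv(env: Env, key: string): string {
  const value = env[key]
  if (value === undefined || value === '') throw new Error(`Missing required env var ${key}`)
  return value
}

// Rejects NaN, zero, negatives, and non-integers so a typo cannot
// propagate as NaN into batching logic.
function requirePositiveInt(env: Env, key: string): number {
  const raw = requireEnv(env, key)
  const n = Number(raw)
  if (!Number.isInteger(n) || n <= 0) throw new Error(`Env var ${key} must be a positive integer, got "${raw}"`)
  return n
}

// Only the sync activity needs the download and batching settings.
function getSyncConfigSketch(env: Env) {
  return {
    bulkBaseUrl: requireEnv(env, 'OSV_BULK_BASE_URL'),
    tmpDir: requireEnv(env, 'OSV_TMP_DIR'),
    batchSize: requirePositiveInt(env, 'OSV_BATCH_SIZE'),
  }
}

// Derive reads nothing from the sync set. OSV_DERIVE_PAGE_SIZE is a
// hypothetical optional knob with a default, purely for illustration.
function getDeriveConfigSketch(env: Env) {
  const hasPageSize = env['OSV_DERIVE_PAGE_SIZE'] !== undefined
  return { pageSize: hasPageSize ? requirePositiveInt(env, 'OSV_DERIVE_PAGE_SIZE') : 500 }
}
```

With this shape, calling the derive config on an empty environment succeeds, while the sync config still fails fast with the name of the variable it actually needs.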
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Joan Reyero <joan@reyero.io>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Joan Reyero <joan@reyero.io>
Defensive guard against an upstream OSV record (or zip-bomb payload) that would otherwise OOM the worker on `file.buffer()`. Real OSV entries are well under 100 KB; the 10 MB cap is ~200x the observed maximum and catches the unlikely-but-not-impossible case where the upstream emits a malformed or maliciously-large record. Surfaced as a PARSE FetchError so withRetry gives up immediately (retrying the same payload won't help). Final third of the three Copilot low-severity findings on the split-config commit. The other two (de-duping OSV_ECOSYSTEMS in schedule.ts and rewording the "re-registers" doc comment) were already addressed by Cursor's autofix in 5c9b86f and c91840d. 72 unit tests still pass. Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot caught a real silent-failure path: OSV's GCS bucket is case-sensitive (`Maven/all.zip` exists, `maven/all.zip` 404s), and the env value flows straight into the download URL. An operator typo like `OSV_ECOSYSTEMS=npm,maven` would deterministically 404 every day, surface as `NOT_FOUND` (non-retryable), and the workflow would silently miss Maven advisories from then on with nothing but a log line as evidence. Now scheduleOsvSync validates each input against a small VALID_ECOSYSTEMS list (currently `npm` + `Maven`) and refuses to register the schedule on a mismatch. The error suggests the canonical case when the lowercased forms match — common "did you mean Maven?" recovery — and lists the supported set otherwise. Adding a new ecosystem in the future means adding it to the list. 72 unit tests still pass. Signed-off-by: Joan Reyero <joan@reyero.io>
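The validation described here, including the "did you mean" hint, can be illustrated with a short sketch. The function name and error wording are assumptions; only the list contents (`npm`, `Maven`) come from the commit message.

```typescript
// Canonical, case-sensitive ecosystem names as they appear in the OSV bucket paths.
const VALID_ECOSYSTEMS: readonly string[] = ['npm', 'Maven']

// Throws on the first unknown ecosystem. When the input matches a valid
// entry case-insensitively, the error suggests the canonical spelling.
function validateEcosystemsSketch(inputs: string[]): string[] {
  for (const input of inputs) {
    if (VALID_ECOSYSTEMS.includes(input)) continue
    const canonical = VALID_ECOSYSTEMS.find((e) => e.toLowerCase() === input.toLowerCase())
    if (canonical !== undefined) {
      throw new Error(`Unknown OSV ecosystem "${input}"; did you mean "${canonical}"?`)
    }
    throw new Error(`Unknown OSV ecosystem "${input}"; supported: ${VALID_ECOSYSTEMS.join(', ')}`)
  }
  return inputs
}
```

Failing at schedule registration turns a daily silent 404 into a startup error that an operator sees immediately.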
    function isInRange(ecosystem: string, version: string, range: RangeRow): boolean {
      const introduced = range.introduced_version
      if (introduced && introduced !== '0') {
        const c = compareVersion(ecosystem, version, introduced)
        if (c === null || c < 0) return false
      }

      if (range.fixed_version) {
        const c = compareVersion(ecosystem, version, range.fixed_version)
        if (c === null || c >= 0) return false
      }

      if (range.last_affected) {
        const c = compareVersion(ecosystem, version, range.last_affected)
        if (c === null || c > 0) return false
      }

      // MAL- ranges often have introduced=null/0 with both fixed and last_affected
      // null. That collapses to "always vulnerable": the early returns above never
      // fire, so we fall through to true here.
      return true
    }
    const buffer = await file.buffer()
    // Real OSV entries are well under 100 KB. A 10 MB cap is ~200x the
    // observed max and catches the (admittedly unlikely) case where a bad
    // upstream record or a zip-bomb-style payload would otherwise cause the
    // worker to OOM on file.buffer(). We surface it as PARSE so withRetry
    // gives up immediately; retrying the same payload won't help.
    if (buffer.length > MAX_ENTRY_BYTES) {
      throw new FetchError(
        'PARSE',
        `Entry ${ecosystem}/${file.path} exceeds ${MAX_ENTRY_BYTES} bytes (got ${buffer.length})`,
      )
    }
A zip bomb will decompress into memory first (file.buffer), and the MAX_ENTRY_BYTES guard fires too late to prevent the OOM. Maybe we can use file.uncompressedSize instead?
Good catch — fixed in 9db1c44. Now checking file.uncompressedSize from the central directory before file.buffer() is called, so a small-compressed/huge-uncompressed bomb gets rejected without decompression. The post-decompress check is kept as defense in depth in case the central directory lies.
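The two-stage guard from this thread (check the declared size before decompressing, then re-check the actual size afterwards) could be sketched like this. The `ZipEntryLike` interface and the local error class are stand-ins invented for the example, and the exact byte value behind the 10 MB cap is an assumption.

```typescript
// Minimal stand-in for a zip central-directory entry; only the fields used here.
interface ZipEntryLike {
  path: string
  uncompressedSize: number
  buffer(): Promise<Uint8Array>
}

// Local stand-in for the worker's PARSE FetchError.
class EntryTooLargeError extends Error {}

// Assumption: the 10 MB cap expressed as 10 * 1024 * 1024 bytes.
const MAX_ENTRY_BYTES = 10 * 1024 * 1024

// Stage 1: reject on the size declared in the central directory,
// before any decompression happens.
function assertDeclaredSizeAllowed(
  entry: Pick<ZipEntryLike, 'path' | 'uncompressedSize'>,
  maxBytes: number = MAX_ENTRY_BYTES,
): void {
  if (entry.uncompressedSize > maxBytes) {
    throw new EntryTooLargeError(`Entry ${entry.path} declares ${entry.uncompressedSize} bytes (cap ${maxBytes})`)
  }
}

// Stage 2: decompress, then re-check the real length as defense in depth
// in case the declared size was wrong.
async function readEntryCapped(entry: ZipEntryLike, maxBytes: number = MAX_ENTRY_BYTES): Promise<Uint8Array> {
  assertDeclaredSizeAllowed(entry, maxBytes)
  const data = await entry.buffer()
  if (data.length > maxBytes) {
    throw new EntryTooLargeError(`Entry ${entry.path} exceeds ${maxBytes} bytes (got ${data.length})`)
  }
  return data
}
```

The pre-check bounds memory only as far as the archive's metadata is truthful, which is why the second check stays in place.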
- Add companion partial index on advisory_packages (ecosystem, package_name) WHERE package_id IS NULL so the resolveMissingPackageIds catch-up UPDATE uses an index scan instead of a seq scan over the full table.
- Narrow upsertAdvisoryBatch tx granularity from per-batch (~2500 stmts) to per-record so advisory_packages row locks are held briefly and a Temporal cancel mid-batch only loses the in-flight record.
- Drop the lossy semver.coerce fallback in compareNpm; under-flag over mis-flag so a malformed introduced/fixed boundary like "1.2-junk-3" doesn't mint a false-positive critical match.
- Guard the zip-bomb path by checking file.uncompressedSize from the central directory BEFORE calling file.buffer(), since the buffer() call decompresses into memory first and would OOM before the post-decompress size check could fire.

Signed-off-by: Joan Reyero <joan@reyero.io>
Per CLAUDE.md, all database queries should live in services/libs/data-access-layer rather than inlined in the worker. This move puts every advisories / advisory_packages / advisory_affected_ranges query used by packages_worker/src/osv into a new data-access-layer/src/packages/osv.ts so they can be reused (e.g. by the future deps.dev BQ worker that will write the raw range columns) and so schema-aware refactors stay in one place. No behavior change: query strings, parameters, and ordering are kept byte-for-byte; deriveCriticalFlag and upsertAdvisoryBatch just delegate to the new functions. Existing unit + integration tests pass unchanged. Signed-off-by: Joan Reyero <joan@reyero.io>
    // Only delete OSV-derived rows: rows with at least one of
    // introduced/fixed/last_affected populated AND no deps.dev-source raw text
    // columns. The deps.dev BQ worker (future) is expected to populate range_raw
    // / unaffected_raw on rows of its own; we must not wipe those on every OSV
    // pass.
Right — comment was wrong, behavior is fine. Fixed in 5904d0e. The SQL only checks the deps.dev raw columns, and that IS the ownership rule (OSV-owned = no deps.dev raw cols). The earlier comment over-narrowed it by also requiring at least one structured column populated, which the SQL doesn't and shouldn't enforce — some MAL- "always vulnerable" ranges have all three structured cols NULL but are still OSV-owned and need to be deleted/re-inserted on resync. Comment now matches the predicate.
    -- Drives the resolveMissingPackageIds catch-up UPDATE in deriveCriticalFlag:
    -- the query filters WHERE package_id IS NULL and joins on (ecosystem,
    -- package_name), so the planner needs an index whose predicate matches the
    -- WHERE clause to avoid a seq scan over the full table. The non-partial
    -- (ecosystem, package_name) index above can't be used here because it doesn't
    -- prove package_id IS NULL.
Fair — fixed in 5904d0e. Verified locally: with the partial index dropped, EXPLAIN on the catch-up UPDATE shows Index Scan using advisory_packages_ecosystem_package_name_idx ... Filter: (package_id IS NULL) (so the non-partial index is reachable). Reworded the comment to say what the partial index actually buys: selectivity as the table grows — it stays O(unresolved), the non-partial would still need to scan/filter every (ecosystem, package_name) match.
…ents

- Bump osvSyncEcosystem heartbeatTimeout from 5m to 15m. The first heartbeat only fires after the full ecosystem zip is downloaded, but DOWNLOAD_TIMEOUT_MS in fetchEcosystemZip is 10m. On a slow CDN a healthy download could exceed 5m of silence and Temporal would kill the activity as unresponsive; 15m leaves 5m of headroom past the download cap.
- Reword the partial-index comment on advisory_packages. The earlier text said the non-partial index "can't be used" for the catch-up UPDATE, but Postgres can use it with a Filter on package_id IS NULL. The real reason for the partial index is selectivity: it stays O(unresolved) as the table grows.
- Reword the deleteOsvOnlyRanges comment in data-access-layer. The earlier text claimed the SQL also required at least one of introduced/fixed/last_affected to be populated; the SQL only checks the deps.dev raw columns. Behavior is correct; the comment now matches.

Heartbeat fix is the only behavior change; the other two are doc-only.

Signed-off-by: Joan Reyero <joan@reyero.io>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5904d0e.
Per themarolt's review on #4149, packages-db queries belong in services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow, upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from @crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx orchestrator. Query strings unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Joan Reyero <joan@reyero.io>

Summary
Adds the `osv-sync` Temporal workflow to `packages_worker`: daily-scheduled ingest of OSV advisories for npm and Maven into packages-db (`advisories` + `advisory_packages` + `advisory_affected_ranges`), followed by derivation of `packages.has_critical_vulnerability`. Engineer-5 slice of the Tier 2 / Project Osprey sprint.

Verified end-to-end locally (2026-05-28): 226k advisories ingested across two idempotent passes (bit-identical md5), Maven download works, lowercase storage, comparator activates, log4j-core 2.14.1 flag flips TRUE.
Architecture
Per ADR-0001 §Worker architecture (Temporal-based sub-workers on the shared `packages-worker` task queue, schedules with `ScheduleOverlapPolicy.SKIP`):
- `src/osv/workflows.ts`: `osvSync` workflow. Loops the ecosystems passed via schedule args, calls one activity per ecosystem (per-ecosystem failure does not abort the pass), then `osvDeriveCriticalFlag`. Two separate `proxyActivities` configs: sync uses a 5-min `heartbeatTimeout` because it emits one heartbeat per ~1000 records; derive uses `startToCloseTimeout` only because it's a single tight loop with no heartbeat.
- `src/osv/activities.ts`: `osvSyncEcosystem` (streamed download + parse + batched UPSERT, heartbeats every 1000 records) and `osvDeriveCriticalFlag` (paged comparator-driven flag flip). `FetchError(NOT_FOUND|PARSE)` translated to `ApplicationFailure.nonRetryable` so the retry policy short-circuits. Env vars validated via `requirePositiveInt` so a typo can't propagate as `NaN`.
- `src/osv/schedule.ts`: idempotent registration of `osv-advisories-sync` at `30 3 * * *` UTC (offset from `15 3 * * *` for `npm-registry-ingest`), 4h `workflowExecutionTimeout`.

The pure-function pipeline (`cvssScoring`, `extractSeverity`, `parseOsvRecord`, `fetchEcosystemZip`, `upsertAdvisory`, `deriveCriticalFlag`, `versionCompare`, `types`) is shared by the activity layer and unit-testable in isolation.

Schema
All OSV-related schema lives in a single migration, `V1779710880__initial_schema.sql`, following the team's "edit the initial migration in place during pre-production" convention (consistent with PR #4148 absorbing earlier auxiliary migrations and PR #4151 consolidating ADRs). The following changes are folded directly into the initial schema rather than layered as separate migrations:
- `advisories.cvss_source` text column with four documented values (`osv_cvss_v3`, `osv_cvss_v4`, `osv_qualitative_fallback`, `osv_malicious_package`)
- `packages.has_critical_vulnerability` bool with partial index on TRUE (uncomments the deferred stub)
- `advisory_affected_ranges` unique index widened from `(advisory_package_id, COALESCE(introduced_version,''))` to the full tuple `(advisory_package_id, COALESCE(introduced_version,''), COALESCE(fixed_version,''), COALESCE(last_affected,''))`
- `advisory_affected_ranges` table header: the stale "future range-parsing workstream" note is replaced with the OSV-vs-deps.dev-BQ split now reflected in the code

The OSV upsert path only deletes range rows where `range_raw IS NULL AND unaffected_raw IS NULL`, so any future deps.dev BQ rows (which write only the raw text columns) are not clobbered on each OSV sync pass.

ADRs
Three OSV-related decisions are folded into Joana's living ADR-0001 (`docs/adr/0001-oss-packages-design-decisions.md`) as sibling sections under §OSV, matching her style:
- … `NULL`). Scope (`S`) metric validated up front.
- `has_critical_vulnerability` semantics: option (b) (TRUE iff `latest_version` is inside a critical advisory's affected range), plus a `MAL-*` id-prefix override so malicious-package reports flip the flag even with `cvss = NULL`. Resolves the prior open question in ADR-0001.
- `advisory_affected_ranges` uniqueness scope: full-tuple uniqueness restores the `osv-plan §2 #1` invariant ("one package has many version ranges; no denormalization").

ADR-0004 (the standalone-bin vs Temporal proposal that was on the branch briefly) was removed before merging because ADR-0001 §Worker architecture (decided 2026-05-25, after this branch was cut) standardized sub-workers on Temporal.
Tests
72 vitest tests pass (`pnpm test` in `services/apps/packages_worker`):
- `cvssScoring.test.ts` (12): FIRST reference vectors + regression guards for missing-`S` and `S:X`.
- `extractSeverity.test.ts` (8): MAL- short-circuit, V3 vector, V4-only fall-through, qualitative tag, malformed input.
- `parseOsvRecord.test.ts` (15): name splits, allowlist, range flattening, multi-affected coalescing, `versions[]` → discrete ranges conversion (+2 new), redundant `versions[]` ignored when `ranges[]` is non-empty.
- `versionCompare.test.ts` (33): npm semver + Maven qualifier ranks + `null` return for unparseable input + regression guard rejecting titlecase `'Maven'`.
- `upsertAdvisory.test.ts` (5): `dedupeRanges` full-tuple regression guards.
- `deriveCriticalFlag.integration.test.ts` (7, skipped without DB env): lodash, log4j-core, MAL- override, `1.0-final` regression guard, catch-up resolver.

Local verification (2026-05-28)
End-to-end Temporal workflow run twice via `temporal schedule trigger`:

Idempotency confirmed: md5 over `(osv_id, cvss, cvss_source)` is bit-identical across both clean passes. After seeding a `log4j-core@2.14.1` fixture into `packages`, the third pass correctly flipped `has_critical_vulnerability=TRUE`, exercising both the lowercase Maven storage path and the comparator activation through deriveCriticalFlag.

Bot review summary
All inline threads resolved across the iteration cycle. Highlights:
- … `versionCompare === 'Maven'` line was missed); fixed in `de1fce65a` along with `compareMaven` returning `null` for unparseable input, `requirePositiveInt` env validation, and the activity-level allowlist lowercasing.
- … `versions[]` being dropped silently (HIGH); fixed in this round.
- … `upsertOne` N+1 perf comment is still deferred. With `LOG_LEVEL=info` the worker runs ~2.5 min per pass at OSV scale, so it's no longer urgent; the comparator catch-up resolver already makes derivation cheap.

Known gaps
- … `cvss = NULL` (V4-only with no qualitative tag).
- … `upsertOne` N+1 (~5 round-trips per advisory) is deferred to a follow-up. Daily-cron scale; ~2.5 min per pass at LOG_LEVEL=info.

Outstanding before merge
- … `.github/workflows/pr-title-jira-key-lint.yml` will reject. Either add a `CM-XXXX` ticket or get the lint relaxed for this branch.
- … `main`.

Test plan
- … `services/apps/packages_worker`.
- `pnpm test` passes the 6 unit test files (integration suite skipped without DB env).
- `./scripts/cli scaffold up` then `pnpm run start:packages-worker` registers `osv-advisories-sync` schedule on Temporal.
- `temporal schedule trigger --schedule-id osv-advisories-sync` runs the workflow end-to-end, ~2.5 min.

🤖 Generated with Claude Code
Note
High Risk
Security-critical path: advisory ingestion, CVSS derivation, and version-range matching drive has_critical_vulnerability; schema and uniqueness changes affect downstream security queries.
Overview
Adds a daily OSV bulk sync to `packages_worker`: a Temporal `osvSync` workflow (cron `30 3 * * *`, overlap SKIP) downloads npm/Maven `all.zip` feeds, normalizes records (CVSS v3 inline scoring, ecosystem allowlist, range/`versions[]` handling), and upserts into `packages-db` via a new DAL module. A follow-on activity derives `packages.has_critical_vulnerability` using semver/Maven comparators, critical advisories, and `MAL-*` overrides.

Schema (folded into the initial oss-packages migration): enables `has_critical_vulnerability` with a partial index; adds `advisories.cvss_source`; widens `advisory_affected_ranges` uniqueness to the full introduced/fixed/last_affected tuple; adds a partial index for unresolved `advisory_packages.package_id`.

Config/docs: `OSV_*` env vars in `.env.dist.local`; ADR-0001 updated with CVSS, flag semantics, and range-uniqueness decisions.

Tooling: vitest + dependencies (`semver`, `unzipper`, `@temporalio/activity`) and broad unit/integration tests for parsing, scoring, dedupe, and derive behavior.

Reviewed by Cursor Bugbot for commit 5904d0e.