feat: osv advisories ingestion#4149

Merged: joanreyero merged 21 commits into main from feat/osv-advisories on May 29, 2026

@joanreyero joanreyero commented May 27, 2026

Summary

Adds the osv-sync Temporal workflow to packages_worker — daily-scheduled ingest of OSV advisories for npm and Maven into packages-db (advisories + advisory_packages + advisory_affected_ranges), followed by derivation of packages.has_critical_vulnerability. Engineer-5 slice of the Tier 2 / Project Osprey sprint.

Verified end-to-end locally (2026-05-28): 226k advisories ingested across two idempotent passes (bit-identical md5), Maven download works, lowercase storage, comparator activates, log4j-core 2.14.1 flag flips TRUE.

Architecture

Per ADR-0001 §Worker architecture (Temporal-based sub-workers on the shared packages-worker task queue, schedules with ScheduleOverlapPolicy.SKIP):

  • src/osv/workflows.ts: osvSync workflow. Loops the ecosystems passed via schedule args, calls one activity per ecosystem (per-ecosystem failure does not abort the pass), then osvDeriveCriticalFlag. Two separate proxyActivities configs: sync uses a 5-min heartbeatTimeout because it emits one heartbeat per ~1000 records; derive uses startToCloseTimeout only because it's a single tight loop with no heartbeat.
  • src/osv/activities.ts: osvSyncEcosystem (streamed download + parse + batched UPSERT, heartbeats every 1000 records) and osvDeriveCriticalFlag (paged comparator-driven flag flip). FetchError(NOT_FOUND|PARSE) is translated to ApplicationFailure.nonRetryable so the retry policy short-circuits. Env vars are validated via requirePositiveInt so a typo can't propagate as NaN.
  • src/osv/schedule.ts — idempotent registration of osv-advisories-sync at 30 3 * * * UTC (offset from 15 3 * * * for npm-registry-ingest), 4h workflowExecutionTimeout.
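
To illustrate the env validation mentioned for activities.ts, here is a minimal sketch of what a requirePositiveInt helper could look like. The PR only names the function; the signature (variable name plus an env map) and the error messages below are assumptions made for this example.

```typescript
// Hypothetical signature: the PR names requirePositiveInt but does not show it.
function requirePositiveInt(
  name: string,
  env: Record<string, string | undefined>,
): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') {
    throw new Error(`Missing required env var ${name}`);
  }
  const trimmed = raw.trim();
  // Accept plain decimal digits only, so values such as "10k", "1e3" or "-5"
  // are rejected instead of being partially parsed.
  if (!/^\d+$/.test(trimmed)) {
    throw new Error(`Env var ${name} must be a positive integer, got "${raw}"`);
  }
  const value = Number(trimmed);
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`Env var ${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}
```

In a worker this would presumably be called with process.env, for example requirePositiveInt('OSV_BATCH_SIZE', process.env), so a mistyped value fails at startup rather than surfacing later as NaN.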

The pure-function pipeline (cvssScoring, extractSeverity, parseOsvRecord, fetchEcosystemZip, upsertAdvisory, deriveCriticalFlag, versionCompare, types) is shared by the activity layer and unit-testable in isolation.

Schema

All OSV-related schema lives in a single migration — V1779710880__initial_schema.sql — following the team's "edit the initial migration in place during pre-production" convention (consistent with PR #4148 absorbing earlier auxiliary migrations and PR #4151 consolidating ADRs). The following changes are folded directly into the initial schema rather than layered as separate migrations:

  • advisories.cvss_source text column with four documented values (osv_cvss_v3, osv_cvss_v4, osv_qualitative_fallback, osv_malicious_package)
  • packages.has_critical_vulnerability bool with partial index on TRUE (uncomments the deferred stub)
  • advisory_affected_ranges unique index widened from (advisory_package_id, COALESCE(introduced_version,'')) to the full tuple (advisory_package_id, COALESCE(introduced_version,''), COALESCE(fixed_version,''), COALESCE(last_affected,''))
  • Comment cleanup on the advisory_affected_ranges table header — the stale "future range-parsing workstream" note is replaced with the OSV-vs-deps.dev-BQ split now reflected in the code

The OSV upsert path only deletes range rows where range_raw IS NULL AND unaffected_raw IS NULL, so any future deps.dev BQ rows (which write only the raw text columns) are not clobbered on each OSV sync pass.

ADRs

Three OSV-related decisions are folded into Joana's living ADR-0001 (docs/adr/0001-oss-packages-design-decisions.md) as sibling sections under §OSV, matching her style:

  • §CVSS scoring strategy — inline v3.1 from the FIRST spec; v4 numeric scoring deferred (V4-only records fall back to the qualitative tag or NULL). Scope (S) metric validated up front.
  • §has_critical_vulnerability semantics — option (b) (TRUE iff latest_version is inside a critical advisory's affected range), plus a MAL-* id-prefix override so malicious-package reports flip the flag even with cvss = NULL. Resolves the prior open question in ADR-0001.
  • §advisory_affected_ranges uniqueness scope — full-tuple uniqueness restores the osv-plan §2 #1 invariant ("one package has many version ranges; no denormalization").

ADR-0004 (the standalone-bin vs Temporal proposal that was on the branch briefly) was removed before merging because ADR-0001 §Worker architecture (decided 2026-05-25, after this branch was cut) standardized sub-workers on Temporal.

Tests

72 vitest tests pass (pnpm test in services/apps/packages_worker):

  • cvssScoring.test.ts (12) — FIRST reference vectors + regression guards for missing-S and S:X.
  • extractSeverity.test.ts (8) — MAL- short-circuit, V3 vector, V4-only fall-through, qualitative tag, malformed input.
  • parseOsvRecord.test.ts (15) — name splits, allowlist, range flattening, multi-affected coalescing, versions[] → discrete ranges conversion (+2 new), redundant versions[] ignored when ranges[] is non-empty.
  • versionCompare.test.ts (33) — npm semver + Maven qualifier ranks + null return for unparseable input + regression guard rejecting titlecase 'Maven'.
  • upsertAdvisory.test.ts (5) — dedupeRanges full-tuple regression guards.
  • deriveCriticalFlag.integration.test.ts (7, skipped without DB env) — lodash, log4j-core, MAL- override, 1.0-final regression guard, catch-up resolver.

Local verification (2026-05-28)

End-to-end Temporal workflow run twice via temporal schedule trigger:

| Pass | npm read / kept / skipped | npm duration | Maven | derive |
|---|---|---|---|---|
| 1 | 219,720 / 219,719 / 1 | 154s | 6,607 / 6,607 / 6s | flipped=0 cleared=0 |
| 2 | 219,720 / 219,719 / 1 | 161s | 6,607 / 6,607 / 6s | flipped=0 cleared=0 |
| 3 (with log4j fixture) | | | | flipped=1 (log4j-core) |

Idempotency confirmed: md5 over (osv_id, cvss, cvss_source) is bit-identical across both clean passes. After seeding a log4j-core@2.14.1 fixture into packages, the third pass correctly flipped has_critical_vulnerability=TRUE, exercising both the lowercase Maven storage path and the comparator activation through deriveCriticalFlag.

Bot review summary

All inline threads resolved across the iteration cycle. Highlights:

  • Two consecutive Cursor/Copilot reviews caught a real high-severity Maven-casing bug after the first lowercasing pass (the versionCompare === 'Maven' line was missed); fixed in de1fce65a along with compareMaven returning null for unparseable input, requirePositiveInt env validation, and the activity-level allowlist lowercasing.
  • Latest round caught a derive-activity heartbeat-timeout risk (CRITICAL), the deps.dev cross-source DELETE risk (HIGH), and versions[] being dropped silently (HIGH); fixed in this round.
  • The earlier upsertOne N+1 perf comment is still deferred. With LOG_LEVEL=info the worker runs ~2.5 min per pass at OSV scale, so it's no longer urgent; the comparator catch-up resolver already makes derivation cheap.

Known gaps

  • CVSS v4 numeric scoring deferred (ADR-0001 §CVSS scoring strategy). ~1.1% of advisories land with cvss = NULL (V4-only with no qualitative tag).
  • The upsertOne N+1 (~5 round-trips per advisory) is deferred to a follow-up. Daily-cron scale; ~2.5 min per pass at LOG_LEVEL=info.

Outstanding before merge

  • No JIRA key in PR title. .github/workflows/pr-title-jira-key-lint.yml will reject. Either add a CM-XXXX ticket or get the lint relaxed for this branch.
  • All commits SSH-signed and DCO sign-off trailered. Branch is rebased on current main.

Test plan

  • CI: lint, prettier, tsc green in services/apps/packages_worker.
  • CI: pnpm test passes the 6 unit test files (integration suite skipped without DB env).
  • Local: ./scripts/cli scaffold up then pnpm run start:packages-worker registers osv-advisories-sync schedule on Temporal.
  • Local: temporal schedule trigger --schedule-id osv-advisories-sync runs the workflow end-to-end, ~2.5 min.
  • Local: spot-check a known critical CVE (log4shell CVE-2021-44228 → CVSS 10.0; lodash CVE-2021-23337 → 7.2).

🤖 Generated with Claude Code


Note

High Risk
Security-critical path: advisory ingestion, CVSS derivation, and version-range matching drive has_critical_vulnerability; schema and uniqueness changes affect downstream security queries.

Overview
Adds a daily OSV bulk sync to packages_worker: a Temporal osvSync workflow (cron 30 3 * * *, overlap SKIP) downloads npm/Maven all.zip feeds, normalizes records (CVSS v3 inline scoring, ecosystem allowlist, range/versions[] handling), and upserts into packages-db via a new DAL module. A follow-on activity derives packages.has_critical_vulnerability using semver/Maven comparators, critical advisories, and MAL-* overrides.

Schema (folded into the initial oss-packages migration): enables has_critical_vulnerability with a partial index; adds advisories.cvss_source; widens advisory_affected_ranges uniqueness to the full introduced/fixed/last_affected tuple; adds a partial index for unresolved advisory_packages.package_id. Config/docs: OSV_* env vars in .env.dist.local; ADR-0001 updated with CVSS, flag semantics, and range-uniqueness decisions. Tooling: vitest + dependencies (semver, unzipper, @temporalio/activity) and broad unit/integration tests for parsing, scoring, dedupe, and derive behavior.

Reviewed by Cursor Bugbot for commit 5904d0e.

Copilot AI review requested due to automatic review settings May 27, 2026 14:51

CLAassistant commented May 27, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

Copilot AI left a comment

Pull request overview

Adds an osv-sync sub-worker to packages_worker that ingests OSV bulk advisories (npm + Maven), normalizes them into packages-db advisory tables, and derives the denormalized packages.has_critical_vulnerability flag based on ecosystem-specific version comparisons and scored severity.

Changes:

  • Introduces OSV ingestion pipeline (download/parse/score/upsert) and a post-pass derivation step for has_critical_vulnerability.
  • Adds a Maven ComparableVersion-style comparator and npm semver comparator plus unit/integration tests (Vitest).
  • Adds docs (ADRs) and local/docker service wiring for running osv-sync.

Reviewed changes

Copilot reviewed 24 out of 27 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| services/apps/packages_worker/vitest.config.ts | Adds Vitest configuration for the packages_worker test suite. |
| services/apps/packages_worker/src/osv/versionCompare.ts | Implements ecosystem-specific version comparison (npm semver + Maven-like comparator). |
| services/apps/packages_worker/src/osv/upsertAdvisory.ts | Writes normalized OSV advisories/packages/ranges to packages-db via transactional upserts. |
| services/apps/packages_worker/src/osv/types.ts | Defines OSV raw/normalized types and a FetchError for ingestion. |
| services/apps/packages_worker/src/osv/index.ts | Orchestrates per-ecosystem sync loop with retries and post-pass critical-flag derivation. |
| services/apps/packages_worker/src/osv/fetchEcosystemZip.ts | Streams OSV zip download to disk and iterates JSON entries for parsing. |
| services/apps/packages_worker/src/osv/extractSeverity.ts | Extracts/seeds severity and CVSS score from OSV records (MAL-/V3/qualitative fallback). |
| services/apps/packages_worker/src/osv/deriveCriticalFlag.ts | Recomputes packages.has_critical_vulnerability by checking latest_version against critical ranges. |
| services/apps/packages_worker/src/osv/cvssScoring.ts | Implements inline CVSS v3.1 base-score calculation. |
| services/apps/packages_worker/src/osv/tests/versionCompare.test.ts | Unit tests for npm and Maven version ordering. |
| services/apps/packages_worker/src/osv/tests/parseOsvRecord.test.ts | Unit tests for OSV record parsing behaviors (name splitting, allowlist, range flattening). |
| services/apps/packages_worker/src/osv/tests/extractSeverity.test.ts | Unit tests for severity extraction paths (MAL-, V3, V4-only qualitative fallback). |
| services/apps/packages_worker/src/osv/tests/deriveCriticalFlag.integration.test.ts | DB-backed integration tests for end-to-end derivation behavior (skipped without env). |
| services/apps/packages_worker/src/osv/tests/cvssScoring.test.ts | Reference-vector tests to pin CVSS scoring implementation. |
| services/apps/packages_worker/src/config.ts | Adds OSV-specific worker config sourced from env vars. |
| services/apps/packages_worker/src/bin/osv-sync.ts | Adds standalone entrypoint binary for the OSV sync worker with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and dependencies/devDependencies for osv-sync + tests. |
| scripts/services/osv-sync.yaml | Adds docker-compose service definition for running osv-sync locally/composed. |
| pnpm-lock.yaml | Locks newly added dependencies (semver/unzipper/vitest and transitive deps). |
| docs/adr/README.md | Registers ADRs 0003–0005 in the ADR index. |
| docs/adr/0005-cvss-scoring-strategy.md | Documents CVSS scoring strategy (inline v3.1, defer v4). |
| docs/adr/0004-standalone-bin-vs-temporal-for-batch-sub-workers.md | Documents rationale for standalone-bin execution shape for batch sub-workers. |
| docs/adr/0003-has-critical-vulnerability-semantics.md | Documents semantics for has_critical_vulnerability and derivation strategy. |
| backend/src/osspckgs/migrations/V1779871327__add_has_critical_vulnerability_to_packages.sql | Adds the has_critical_vulnerability column + partial index to packages-db. |
| backend/src/osspckgs/migrations/V1779871303__add_cvss_source_to_advisories.sql | Adds advisories.cvss_source for score provenance. |
| backend/.env.dist.local | Adds default local env vars for running osv-sync. |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/osv/extractSeverity.ts Outdated
Comment thread backend/src/osspckgs/migrations/V1779871303__add_cvss_source_to_advisories.sql Outdated
Comment thread services/apps/packages_worker/src/osv/index.ts Outdated
Comment thread services/apps/packages_worker/src/osv/index.ts Outdated
Comment thread services/apps/packages_worker/src/osv/upsertAdvisory.ts
Comment thread services/apps/packages_worker/src/osv/fetchEcosystemZip.ts
Comment thread services/apps/packages_worker/src/osv/cvssScoring.ts
Copilot AI review requested due to automatic review settings May 27, 2026 15:39

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/osv/extractSeverity.ts
Comment thread services/apps/packages_worker/src/osv/cvssScoring.ts
Comment thread services/apps/packages_worker/src/osv/upsertAdvisory.ts
Comment thread services/apps/packages_worker/src/osv/versionCompare.ts
Signed-off-by: Joan Reyero <joan@reyero.io>
Adds the osv-sync sub-worker inside packages_worker. Pulls OSV's daily per-ecosystem
zip for npm and Maven, normalizes each record, and upserts into advisories,
advisory_packages, and advisory_affected_ranges (transactional UPSERT, idempotent
on osv_id + the range unique index). MAL- malicious-package reports are ingested
with cvss=NULL and cvss_source='osv_malicious_package'.

A deriveCriticalFlag step runs at the end of each pass and flips
packages.has_critical_vulnerability TRUE iff a critical advisory
(cvss>=7.0 OR osv_id LIKE 'MAL-%') has an affected range covering
the package's current latest_version, using ecosystem-specific
comparators (semver for npm, ComparableVersion-style for Maven).
See ADR-0003 for the semantics.
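
The derivation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's deriveCriticalFlag: the type names, the treatment of an introduced value of "0" or null as "from the first version", the choice to treat unparseable versions as not in range, and the stand-in compareDotted comparator (which only handles plain dotted numbers) are all hypothetical.

```typescript
// Comparator contract assumed here: negative, zero, positive, or null if unparseable.
type Compare = (a: string, b: string) => number | null;

interface AffectedRangeRow {
  introduced: string | null;
  fixed: string | null;
  lastAffected: string | null;
}

interface AdvisoryRow {
  osvId: string;
  cvss: number | null;
  ranges: AffectedRangeRow[];
}

// Critical per the commit message: cvss >= 7.0 OR a MAL- id prefix.
function isCriticalAdvisory(osvId: string, cvss: number | null): boolean {
  return osvId.startsWith('MAL-') || (cvss !== null && cvss >= 7.0);
}

// introduced <= version < fixed, or introduced <= version <= lastAffected.
function isInRange(version: string, range: AffectedRangeRow, compare: Compare): boolean {
  if (range.introduced !== null && range.introduced !== '0') {
    const lower = compare(version, range.introduced);
    if (lower === null || lower < 0) return false;
  }
  if (range.fixed !== null) {
    const upper = compare(version, range.fixed);
    return upper !== null && upper < 0;
  }
  if (range.lastAffected !== null) {
    const upper = compare(version, range.lastAffected);
    return upper !== null && upper <= 0;
  }
  return true; // no upper bound: every later version is affected
}

function hasCriticalVulnerability(
  latestVersion: string,
  advisories: AdvisoryRow[],
  compare: Compare,
): boolean {
  return advisories.some(
    (a) =>
      isCriticalAdvisory(a.osvId, a.cvss) &&
      a.ranges.some((r) => isInRange(latestVersion, r, compare)),
  );
}

// Stand-in comparator for the example only: plain dotted numeric versions.
function compareDotted(a: string, b: string): number | null {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  if (pa.some((n) => Number.isNaN(n)) || pb.some((n) => Number.isNaN(n))) return null;
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return Math.sign(d);
  }
  return 0;
}
```

Passing the comparator in as a parameter keeps the range check independent of the ecosystem, which mirrors the split between semver for npm and the ComparableVersion-style comparator for Maven.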

CVSS scoring computes v3.1 base scores inline from the FIRST spec;
v4 numeric scoring is deferred (V4-only records fall back to the
qualitative tag from database_specific.severity).
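
The precedence described here (MAL- short-circuit, then a V3 vector, then the qualitative tag, with V4 skipped) can be sketched as below. The record shape follows the OSV schema's severity[] entries; everything else is an assumption for illustration: the function signature, the injected scoreV3 and qualitativeToScore callbacks (the PR does not show how a qualitative tag maps to a number), the fall-through on an unscorable V3 vector, and returning a null cvssSource when nothing applies.

```typescript
type CvssSource =
  | 'osv_cvss_v3'
  | 'osv_cvss_v4'
  | 'osv_qualitative_fallback'
  | 'osv_malicious_package';

interface Severity {
  cvss: number | null;
  cvssSource: CvssSource | null;
}

interface OsvRecordLike {
  id: string;
  severity?: { type: string; score: string }[];
  database_specific?: { severity?: string };
}

function extractSeverity(
  record: OsvRecordLike,
  scoreV3: (vector: string) => number | null, // hypothetical scorer callback
  qualitativeToScore: (tag: string) => number | null, // hypothetical mapping callback
): Severity {
  // 1. Malicious-package reports: no score, dedicated provenance value.
  if (record.id.startsWith('MAL-')) {
    return { cvss: null, cvssSource: 'osv_malicious_package' };
  }
  // 2. A CVSS v3 vector, when present and scorable.
  const v3 = (record.severity ?? []).find((s) => s.type === 'CVSS_V3');
  if (v3 !== undefined) {
    const score = scoreV3(v3.score);
    if (score !== null) return { cvss: score, cvssSource: 'osv_cvss_v3' };
  }
  // 3. CVSS_V4 entries are intentionally not scored (deferred),
  //    so V4-only records fall through to the qualitative tag.
  const tag = record.database_specific?.severity;
  if (tag !== undefined) {
    const score = qualitativeToScore(tag.toUpperCase());
    if (score !== null) return { cvss: score, cvssSource: 'osv_qualitative_fallback' };
  }
  // 4. Nothing usable: the advisory lands with cvss = NULL.
  return { cvss: null, cvssSource: null };
}
```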

Verified locally against the full OSV dataset (226,258 advisories;
log4shell CVSS=10.0, lodash CVE-2021-23337 CVSS=7.2; 213,414 MAL-
entries ingested).

Signed-off-by: Joan Reyero <joan@reyero.io>
Adds vitest with five test files covering the OSV pipeline:

- cvssScoring: 10 cases pinning the inline v3.1 implementation against
  FIRST-published scores (log4shell 10.0, shellshock 9.8, heartbleed 7.5,
  and others). Catches future regressions in the formula.
- extractSeverity: MAL- short-circuit, V3 vector path, V4-only fall through,
  qualitative fallback, malformed-vector handling.
- parseOsvRecord: Maven groupId:artifactId split, npm @scope/ split,
  ecosystem allowlist filter, range flattening (introduced -> fixed,
  introduced -> last_affected, MAL- always-vulnerable, GIT skipped),
  multi-affected[] coalescing.
- versionCompare: npm semver ordering + coercion; Maven dotted versions,
  qualifier ranks (alpha < beta < milestone < rc < snapshot < ga/final < sp),
  qualifier aliases, numeric > alpha at same depth, cross-ecosystem null.
- deriveCriticalFlag (integration, real packages-db, skipped without DB env):
  lodash 4.17.20 flips TRUE / 5.0.0 clears, log4j-core 2.14.1 flips TRUE /
  2.17.0 clears, MAL- target flips via the osv_id LIKE prefix override,
  catch-up resolver populates advisory_packages.package_id for late-arriving
  packages, and a regression guard around the Maven 1.0-final edge case.

The versionCompare suite caught a real bug: compareMaven used a num:0 pad
for missing tokens in both kinds of comparison. That made 1.0-final < 1.0
(should be equal: 'final' is an alias for the empty 'ga' qualifier) and
1.0 > 1.0-sp1 (should be less: 'sp' outranks 'ga'). Fixed by picking the
pad type based on the other side's kind (num:0 vs str:''), matching the
Maven ComparableVersion algorithm.
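
The padding fix can be illustrated with a deliberately simplified comparator. This is not the PR's compareMaven and not the full Maven ComparableVersion algorithm (it uses a flat token list, has no nested sub-lists, and omits most alias handling); the tokenizer, the names, and the ordering of unknown qualifiers are assumptions. It only demonstrates choosing the pad by the other side's kind.

```typescript
type Token = { kind: 'num'; value: number } | { kind: 'str'; value: string };

// Qualifier ranks as listed in this commit; '' is the empty release qualifier.
const QUALIFIER_RANK = new Map<string, number>([
  ['alpha', 0], ['beta', 1], ['milestone', 2], ['rc', 3], ['snapshot', 4],
  ['', 5], ['ga', 5], ['final', 5], ['sp', 6],
]);
const UNKNOWN_RANK = 7; // assumption: unknown qualifiers sort after known ones

function tokenize(version: string): Token[] {
  const parts = version.toLowerCase().match(/\d+|[a-z]+/g) ?? [];
  return parts.map((p): Token =>
    /^\d+$/.test(p) ? { kind: 'num', value: Number(p) } : { kind: 'str', value: p },
  );
}

// The pad takes the kind of the token it is compared against.
function padFor(other: Token): Token {
  return other.kind === 'num' ? { kind: 'num', value: 0 } : { kind: 'str', value: '' };
}

function compareTokens(a: Token, b: Token): number {
  if (a.kind === 'num' && b.kind === 'num') return Math.sign(a.value - b.value);
  if (a.kind === 'str' && b.kind === 'str') {
    const ra = QUALIFIER_RANK.get(a.value) ?? UNKNOWN_RANK;
    const rb = QUALIFIER_RANK.get(b.value) ?? UNKNOWN_RANK;
    if (ra !== rb) return Math.sign(ra - rb);
    if (ra !== UNKNOWN_RANK) return 0;
    return a.value < b.value ? -1 : a.value > b.value ? 1 : 0;
  }
  return a.kind === 'num' ? 1 : -1; // numeric beats qualifier at the same depth
}

function compareMavenSketch(a: string, b: string): number | null {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.length === 0 || tb.length === 0) return null; // unparseable input
  const n = Math.max(ta.length, tb.length);
  for (let i = 0; i < n; i++) {
    const x: Token | undefined = ta[i];
    const y: Token | undefined = tb[i];
    let left: Token;
    let right: Token;
    if (x !== undefined && y !== undefined) { left = x; right = y; }
    else if (x !== undefined) { left = x; right = padFor(x); }
    else if (y !== undefined) { left = padFor(y); right = y; }
    else break;
    const c = compareTokens(left, right);
    if (c !== 0) return c;
  }
  return 0;
}
```

With a fixed num:0 pad, the missing token in 1.0 would be compared as a number against 'final' or 'sp', and a number outranks any qualifier. Padding with the empty qualifier instead makes 1.0 equal to 1.0-final and lower than 1.0-sp1.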

Also verified out-of-band (not in suite):
- Idempotency: rerunning OSV sync leaves advisories, advisory_packages,
  advisory_affected_ranges row counts and the md5 hash of
  (osv_id, cvss, cvss_source) bit-identical.
- SIGINT mid-pass: shutdown handler runs, current batch flushes,
  derive + sleep skip, process exits 0.

68 tests / 5 files pass; lint + prettier + tsc clean.

Signed-off-by: Joan Reyero <joan@reyero.io>
ADR-0004 captures the standalone-bin vs Temporal decision for batch
sub-workers in packages_worker (OSV uses standalone; npm package sync
will use Temporal). ADR-0005 captures the CVSS scoring strategy
(inline v3.1 from the FIRST spec, V4 numeric scoring deferred to a
follow-up, qualitative fallback in the meantime).

Both record the alternatives that were considered and rejected so the
next engineer touching these areas has the rationale in one place.

Signed-off-by: Joan Reyero <joan@reyero.io>
- fetchEcosystemZip: move clearTimeout to cover the pipeline body
  stream, not just the headers fetch. Map pipeline rejection to a
  NETWORK FetchError so withRetry handles stalled mid-transfer
  connections instead of hanging past the daily window.
- index.ts: hoist counters and buffer into the withRetry closure so a
  transient retry restarts from zero. UPSERT is idempotent on osv_id,
  so re-flushed batches are safe.
- index.ts: switch error/warn logs from err.message to { err } so the
  structured logger preserves stack and metadata, matching the rest
  of the service.
- extractSeverity.ts: rewrite the lede comment to match ADR-0005
  (V4 numeric scoring deferred; v1 skips V4 entirely and falls
  through to qualitative for V4-only records).
- V1779871303 migration: list all four cvss_source values so the
  schema doc matches the contract in types.ts and ADR-0005.
- deriveCriticalFlag integration test: extend HAVE_DB to require
  CROWD_PACKAGES_DB_DATABASE and CROWD_PACKAGES_DB_PASSWORD too, so
  half-set envs skip cleanly instead of failing in beforeAll.

Signed-off-by: Joan Reyero <joan@reyero.io>
cvssScoring.ts read `v.S` directly into the `s === 'C' ? ... : ...`
branches but never validated it. A vector missing `S` or carrying an
invalid value like `S:X` would silently take the Scope:Unchanged
branch in every formula and return a wrong numeric score instead of
null. The 10 reference-vector tests didn't catch it because every
test vector had a valid S:U or S:C.

This is the exact failure mode ADR-0005 named as the headline risk
of choosing inline scoring over the cvss npm package — wrong scores
feed advisories.is_critical and packages.has_critical_vulnerability,
i.e. the entire security overlay.

Fix: validate `s` against {U, C} up front and return null otherwise.
Added two regression tests covering the missing-S and invalid-S
paths.

Caught by Cursor's bot review on cbaf41d.
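
For reference, a compact sketch of a CVSS v3.1 base-score calculation with the Scope check done up front. The metric weights, formulas, and roundup rule follow the FIRST v3.1 specification; the function name, the vector parsing, and the decision to accept both CVSS:3.0 and CVSS:3.1 prefixes are assumptions for this example rather than the contents of cvssScoring.ts.

```typescript
const AV = new Map<string, number>([['N', 0.85], ['A', 0.62], ['L', 0.55], ['P', 0.2]]);
const AC = new Map<string, number>([['L', 0.77], ['H', 0.44]]);
const UI = new Map<string, number>([['N', 0.85], ['R', 0.62]]);
const CIA = new Map<string, number>([['H', 0.56], ['L', 0.22], ['N', 0]]);
const PR_UNCHANGED = new Map<string, number>([['N', 0.85], ['L', 0.62], ['H', 0.27]]);
const PR_CHANGED = new Map<string, number>([['N', 0.85], ['L', 0.68], ['H', 0.5]]);

// Roundup as defined in the v3.1 spec: smallest one-decimal value >= x,
// computed on integers to avoid floating-point drift.
function roundUp1(x: number): number {
  const scaled = Math.round(x * 100000);
  return scaled % 10000 === 0 ? scaled / 100000 : (Math.floor(scaled / 10000) + 1) / 10;
}

function scoreCvssV3(vector: string): number | null {
  const parts = vector.split('/');
  if (!/^CVSS:3\.[01]$/.test(parts[0] ?? '')) return null;
  const m = new Map<string, string>();
  for (const part of parts.slice(1)) {
    const [key, value] = part.split(':');
    if (!key || !value) return null;
    m.set(key, value);
  }
  // Validate Scope before it is used in any formula branch.
  const s = m.get('S');
  if (s !== 'U' && s !== 'C') return null;

  const av = AV.get(m.get('AV') ?? '');
  const ac = AC.get(m.get('AC') ?? '');
  const pr = (s === 'C' ? PR_CHANGED : PR_UNCHANGED).get(m.get('PR') ?? '');
  const ui = UI.get(m.get('UI') ?? '');
  const c = CIA.get(m.get('C') ?? '');
  const i = CIA.get(m.get('I') ?? '');
  const a = CIA.get(m.get('A') ?? '');
  if (av === undefined || ac === undefined || pr === undefined || ui === undefined) return null;
  if (c === undefined || i === undefined || a === undefined) return null;

  const iss = 1 - (1 - c) * (1 - i) * (1 - a);
  const impact =
    s === 'U' ? 6.42 * iss : 7.52 * (iss - 0.029) - 3.25 * Math.pow(iss - 0.02, 15);
  const exploitability = 8.22 * av * ac * pr * ui;
  if (impact <= 0) return 0;
  const raw = s === 'U' ? impact + exploitability : 1.08 * (impact + exploitability);
  return roundUp1(Math.min(raw, 10));
}
```

With this shape, a vector missing S or carrying S:X returns null instead of silently taking the Scope:Unchanged branch, which is the failure mode this commit fixes.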

Signed-off-by: Joan Reyero <joan@reyero.io>
The unique index on advisory_affected_ranges shipped in V1779710880
keyed on (advisory_package_id, COALESCE(introduced_version, '')) —
strictly narrower than the natural uniqueness of a range tuple, and
narrower than the principle locked in osv-plan §2 #1 ("one package
has many version ranges; no denormalization").

dedupeRanges in upsertAdvisory.ts was keying on introduced_version
alone to match that index, with the side effect that two ranges
sharing an introduced_version but differing in fixed_version or
last_affected (cross-distro patches, partial fixes) silently
collapsed to the first occurrence. When the surviving range was
the narrower one, isInRange returned FALSE for versions inside the
wider window — a missed critical alert.

Three changes:

- V1779897650__widen_advisory_affected_ranges_unique_index.sql:
  drop the narrow unique index (located via pg_indexes since the
  initial migration didn't name it) and replace with the full-tuple
  unique index over (advisory_package_id,
  COALESCE(introduced_version,''), COALESCE(fixed_version,''),
  COALESCE(last_affected,'')).
- upsertAdvisory.ts dedupeRanges: key on the full tuple so the
  application-side pre-flight matches the database constraint.
  Exported for unit testing.
- upsertAdvisory.test.ts: 5 cases pinning the new semantics
  (same-introduced-different-fixed preserved, same-introduced-
  different-last_affected preserved, identical-tuple collapsed,
  null-introduced disambiguated by other fields, first-wins on
  truly identical tuples).

ADR-0006 captures the decision and the alternatives considered
(coalesce-to-widest at parse time, drop the constraint, dedup at
query time). Cursor's bot review on 1b978ac surfaced the bug.
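
A minimal sketch of full-tuple deduplication, mirroring the COALESCE(..., '') columns of the widened unique index. The interface and field names are assumptions for the example; only the keying rule and the first-wins behavior come from this commit message.

```typescript
interface AffectedRange {
  introducedVersion: string | null;
  fixedVersion: string | null;
  lastAffected: string | null;
}

// Keeps the first occurrence of each (introduced, fixed, last_affected) tuple.
function dedupeRanges(ranges: AffectedRange[]): AffectedRange[] {
  const seen = new Set<string>();
  const out: AffectedRange[] = [];
  for (const r of ranges) {
    // JSON.stringify of the tuple avoids delimiter collisions between fields;
    // null maps to '' to match the index's COALESCE.
    const key = JSON.stringify([
      r.introducedVersion ?? '',
      r.fixedVersion ?? '',
      r.lastAffected ?? '',
    ]);
    if (seen.has(key)) continue;
    seen.add(key);
    out.push(r);
  }
  return out;
}
```

Keying on introducedVersion alone would collapse two ranges such as (1.0.0, 1.2.0) and (1.0.0, 1.5.0) into whichever came first, which is the missed-alert case described above.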

Signed-off-by: Joan Reyero <joan@reyero.io>
ADR-0001 §Worker architecture (decided 2026-05-25) standardized
packages_worker sub-workers on the Temporal shape: workflows.ts +
activities.ts + schedule.ts per sub-worker, registered against the
shared packages-worker task queue, with ScheduleOverlapPolicy.SKIP
so a slow run does not queue a second concurrent execution.

This branch originally implemented osv-sync as a standalone bin
following the github-repos-enricher precedent and proposed ADR-0004
to codify that pattern. Rebasing onto current main surfaced the
conflict: github-repos-enricher is now explicitly listed as a
legacy exception with migration deferred, and osv/ is named
alongside npm/ and deps-dev/ as Temporal-based from the start.
ADR-0004 is removed (superseded by ADR-0001 before merging).

Changes:

- src/osv/workflows.ts: osvSync workflow definition. Loops the
  ecosystems passed via schedule args, calls osvSyncEcosystem
  activity per ecosystem (per-ecosystem failure logged but does not
  abort the pass), then osvDeriveCriticalFlag. proxyActivities with
  retry policy (3 attempts, exp backoff) and nonRetryableErrorTypes
  for NOT_FOUND / PARSE.
- src/osv/activities.ts: osvSyncEcosystem and osvDeriveCriticalFlag
  activities wrapping the existing pure-function pipeline
  (fetchEcosystemZip + parseOsvRecord + upsertAdvisoryBatch +
  deriveCriticalFlag). Heartbeats every 1000 records keep Temporal
  aware during the ~1-hour npm pass. FetchError(NOT_FOUND|PARSE) is
  translated to ApplicationFailure.nonRetryable so RetryPolicy
  short-circuits.
- src/osv/schedule.ts: scheduleOsvSync registers an idempotent
  daily schedule at 03:30 UTC (offset from npm-registry-ingest at
  03:15). 4-hour workflowExecutionTimeout, SKIP overlap.
- src/workflows/index.ts + src/activities.ts: re-export the new
  workflow + activities for the worker bundle / runtime.
- src/bin/packages-worker.ts: call scheduleOsvSync on startup
  alongside scheduleNpmIngest.

Deleted:
- src/bin/osv-sync.ts (entry point no longer needed)
- src/osv/index.ts (runOsvSync sleep loop replaced by Temporal)
- scripts/services/osv-sync.yaml (separate compose service gone;
  packages-worker container serves osv via the shared task queue)
- docs/adr/0004 (superseded by ADR-0001 §Worker architecture)

Env vars no longer needed (replaced by Temporal scheduling):
- OSV_SYNC_INTERVAL_HOURS (cron expression owns cadence)
- OSV_IDLE_SLEEP_SEC (no internal sleep loop)
- getOsvConfig() removed from src/config.ts

The pure-function pipeline survives untouched — extractSeverity,
cvssScoring, parseOsvRecord, fetchEcosystemZip, upsertAdvisory,
deriveCriticalFlag, versionCompare, types. All 68 unit tests still
pass. The Temporal activity is a thin orchestrator over the same
functions, so the migration is shape-only — no business logic
changes.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 28, 2026 06:14
@joanreyero joanreyero force-pushed the feat/osv-advisories branch from d2e92b5 to 0898c23 Compare May 28, 2026 06:14

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 31 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/osv/activities.ts Outdated
Comment thread services/apps/packages_worker/src/osv/types.ts Outdated
Joana introduced docs/adr/0001-oss-packages-design-decisions.md
(PR #4151, merged 2026-05-27) as the single living record for the
oss-packages domain, with one section per decision and a Changelog
at the bottom. Our slice landed three standalone ADRs (0003 / 0005
/ 0006) before that consolidation merged. This rebases them into
her structure.

Folded into ADR-0001:

- CVSS scoring strategy (was ADR-0005) — inline v3.1, qualitative
  fallback, v4 deferred, Scope-metric validation. Adjacent to
  §OSV.
- has_critical_vulnerability semantics (was ADR-0003) — option (b)
  + MAL- override, comparator-driven derivation, idempotent
  recompute. Resolves the prior open question on this flag.
- advisory_affected_ranges uniqueness scope (was ADR-0006) — full
  tuple unique index restores the "no denormalization" invariant.

Other changes:

- Updated the Scope and current status table at the top of ADR-0001
  to list the three new decisions and remove the "flag deferred"
  qualifier on OSV.
- Removed the open question on has_critical_vulnerability — now
  decided in §has_critical_vulnerability semantics.
- Added a 2026-05-28 Changelog entry documenting the fold and the
  removal of standalone ADR-0004 (Temporal vs standalone-bin —
  superseded by §Worker architecture before merge).
- Rewrote all in-tree ADR cross-references (6 source files, 3
  migration SQL files) to point at the new section anchors instead
  of the removed standalone ADR ids.
- Updated docs/adr/README.md to list only ADR-0001 (matches Joana's
  intent in PR #4151).

All 68 unit tests still pass.

Signed-off-by: Joan Reyero <joan@reyero.io>
Comment thread docs/adr/0001-oss-packages-design-decisions.md
Two fixes derived from the alignment audit against ADR-0001 (Joana's
oss-packages living ADR, merged in PR #4151):

1. Lowercase ecosystem ('Maven' -> 'maven')

ADR-0001 §OSV "Ecosystem normalization" stores ecosystems lowercase
('npm' and 'maven', not 'Maven'). The OSV worker preserved OSV's
titlecase 'Maven' on the way in, leaving 7,957 Maven rows in
advisory_packages that would never join correctly against a
lowercase-storing packages table.

Changes:
- parseOsvRecord lowercases the ecosystem at the OSV boundary before
  allowlist check, splitName, and storage. Allowlist set is lowercase
  in env (OSV_ECOSYSTEMS=npm,maven) and in tests.
- splitName branches on 'maven' instead of 'Maven'.
- deriveCriticalFlag.ts catch-up SQL CASE expression uses 'maven'.
- New migration V1779951727 backfills existing Maven rows in
  advisory_packages (and packages, which is empty today but reads
  cheaply).
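
The boundary normalization described in these changes could look roughly like the sketch below: lowercase the ecosystem first, then apply the allowlist and the name split. The function name, the return shape, and the choice to keep the leading '@' in the npm namespace are assumptions; the PR does not show parseOsvRecord or splitName.

```typescript
interface NormalizedPackage {
  ecosystem: string;
  namespace: string | null;
  name: string;
}

// Returns null when the (lowercased) ecosystem is not in the allowlist.
function normalizePackage(
  osvEcosystem: string,
  osvName: string,
  allowlist: ReadonlySet<string>,
): NormalizedPackage | null {
  // OSV reports 'Maven' in titlecase; storage uses lowercase.
  const ecosystem = osvEcosystem.toLowerCase();
  if (!allowlist.has(ecosystem)) return null;

  if (ecosystem === 'maven') {
    // groupId:artifactId
    const idx = osvName.indexOf(':');
    if (idx > 0) {
      return { ecosystem, namespace: osvName.slice(0, idx), name: osvName.slice(idx + 1) };
    }
  }
  if (ecosystem === 'npm' && osvName.startsWith('@')) {
    // @scope/name
    const idx = osvName.indexOf('/');
    if (idx > 1) {
      return { ecosystem, namespace: osvName.slice(0, idx), name: osvName.slice(idx + 1) };
    }
  }
  return { ecosystem, namespace: null, name: osvName };
}
```

Doing the lowercase step before the allowlist check means an allowlist of npm,maven matches OSV's titlecase 'Maven' records without a second casing rule elsewhere.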

2. Populate advisories.source and source_url

The consolidated initial schema has two columns the upsert path didn't
set: source ('OSV' for everything this worker writes; granular GHSA /
NVD / NSWG attribution belongs to the future deps.dev BQ worker) and
source_url (canonical OSV URL: https://osv.dev/vulnerability/<id>).

Changes:
- NormalizedAdvisory grew source: string and sourceUrl: string | null.
- parseOsvRecord sets source: 'OSV' and sourceUrl: osvSourceUrl(id).
- upsertAdvisory INSERT and ON CONFLICT UPDATE include both columns.

All 68 unit tests still pass; migration applied locally and confirmed
the existing Maven rows backfilled to maven without unique-index
collisions.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 28, 2026 07:07
@joanreyero joanreyero requested a review from themarolt May 28, 2026 09:01
Cursor pointed out that osvDeriveCriticalFlag was calling
getActivityConfig() and eagerly validating OSV_BULK_BASE_URL,
OSV_TMP_DIR, and OSV_BATCH_SIZE — env vars only the sync activity
uses. Running the derive activity in isolation (e.g. for testing
or debugging) failed with a misleading "Missing required env var"
pointing at vars the derive never needed.

Split into getSyncConfig() and getDeriveConfig(), each activity
reads only its own env. Matches the same shape applied to
proxyActivities in workflows.ts last round — sync and derive are
independent activities and should have independent contracts.

No behavior change in normal scheduled runs (the workflow sets up
the full env anyway); the cleanup is for isolated/debug paths.

72 unit tests still pass.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 28, 2026 09:09

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/osv/schedule.ts Outdated
Comment thread services/apps/packages_worker/src/osv/schedule.ts Outdated
Comment thread services/apps/packages_worker/src/osv/fetchEcosystemZip.ts
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 28, 2026 09:16
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Joan Reyero <joan@reyero.io>

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/osv/schedule.ts
Defensive guard against an upstream OSV record (or zip-bomb payload)
that would otherwise OOM the worker on `file.buffer()`. Real OSV
entries are well under 100 KB; the 10 MB cap is ~200x the observed
maximum and catches the unlikely-but-not-impossible case where the
upstream emits a malformed or maliciously-large record. Surfaced as
a PARSE FetchError so withRetry gives up immediately (retrying the
same payload won't help).

Last of the three Copilot low-severity findings on the
split-config commit. The other two (de-duping OSV_ECOSYSTEMS in
schedule.ts and rewording the "re-registers" doc comment) were
already addressed by Cursor's autofix in 5c9b86f and c91840d.

72 unit tests still pass.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot caught a real silent-failure path: OSV's GCS bucket is
case-sensitive (`Maven/all.zip` exists, `maven/all.zip` 404s), and
the env value flows straight into the download URL. An operator
typo like `OSV_ECOSYSTEMS=npm,maven` would deterministically 404
every day, surface as `NOT_FOUND` (non-retryable), and the workflow
would silently miss Maven advisories from then on with nothing but
a log line as evidence.

Now scheduleOsvSync validates each input against a small
VALID_ECOSYSTEMS list (currently `npm` + `Maven`) and refuses to
register the schedule on a mismatch. The error suggests the
canonical case when the lowercased forms match — common "did you
mean Maven?" recovery — and lists the supported set otherwise.

Adding a new ecosystem in the future means adding it to the list.
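
The validation described above could look roughly like the following. This is a sketch under assumptions, not the code in schedule.ts: the function name `validateEcosystems`, the trimming of inputs, and the exact error wording are hypothetical; only `VALID_ECOSYSTEMS` and its two entries come from the commit message.

```typescript
// Canonical, case-sensitive names as they appear in the OSV bucket paths.
const VALID_ECOSYSTEMS = ['npm', 'Maven'] as const
type Ecosystem = (typeof VALID_ECOSYSTEMS)[number]

function validateEcosystems(inputs: string[]): Ecosystem[] {
  const result: Ecosystem[] = []
  for (const raw of inputs) {
    const input = raw.trim()

    // Exact, case-sensitive match: accept (and de-duplicate).
    const exact = VALID_ECOSYSTEMS.find((e) => e === input)
    if (exact !== undefined) {
      if (!result.includes(exact)) result.push(exact)
      continue
    }

    // Same letters, wrong case: suggest the canonical spelling.
    const canonical = VALID_ECOSYSTEMS.find(
      (e) => e.toLowerCase() === input.toLowerCase(),
    )
    if (canonical !== undefined) {
      throw new Error(`Unknown OSV ecosystem "${input}". Did you mean "${canonical}"?`)
    }

    // No close match: list the supported set.
    throw new Error(
      `Unknown OSV ecosystem "${input}". Supported: ${VALID_ECOSYSTEMS.join(', ')}`,
    )
  }
  return result
}
```

Failing at schedule registration turns the daily silent 404 into a single loud startup error.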

72 unit tests still pass.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 28, 2026 09:30
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +28 to +49
function isInRange(ecosystem: string, version: string, range: RangeRow): boolean {
  const introduced = range.introduced_version
  if (introduced && introduced !== '0') {
    const c = compareVersion(ecosystem, version, introduced)
    if (c === null || c < 0) return false
  }

  if (range.fixed_version) {
    const c = compareVersion(ecosystem, version, range.fixed_version)
    if (c === null || c >= 0) return false
  }

  if (range.last_affected) {
    const c = compareVersion(ecosystem, version, range.last_affected)
    if (c === null || c > 0) return false
  }

  // MAL- ranges often have introduced=null/0 with both fixed and last_affected
  // null. That collapses to "always vulnerable" — the early returns above never
  // fire, so we fall through to true here.
  return true
}
Comment thread services/apps/packages_worker/src/osv/versionCompare.ts
Comment thread services/apps/packages_worker/src/osv/deriveCriticalFlag.ts Outdated
Comment thread services/apps/packages_worker/src/osv/deriveCriticalFlag.ts Outdated
Comment thread services/apps/packages_worker/src/osv/upsertAdvisory.ts Outdated
Comment thread services/apps/packages_worker/src/osv/versionCompare.ts Outdated
Comment on lines +89 to +100
const buffer = await file.buffer()
// Real OSV entries are well under 100 KB. A 10 MB cap is ~200x the
// observed max and catches the (admittedly unlikely) case where a bad
// upstream record or a zip-bomb-style payload would otherwise cause the
// worker to OOM on file.buffer(). We surface it as PARSE so withRetry
// gives up immediately — retrying the same payload won't help.
if (buffer.length > MAX_ENTRY_BYTES) {
  throw new FetchError(
    'PARSE',
    `Entry ${ecosystem}/${file.path} exceeds ${MAX_ENTRY_BYTES} bytes (got ${buffer.length})`,
  )
}
Collaborator

A zip bomb will decompress into memory first (file.buffer), and the MAX_ENTRY_BYTES guard fires too late to prevent the OOM. Maybe we can use file.uncompressedSize instead?

Contributor Author

Good catch — fixed in 9db1c44. Now checking file.uncompressedSize from the central directory before file.buffer() is called, so a small-compressed/huge-uncompressed bomb gets rejected without decompression. The post-decompress check is kept as defense in depth in case the central directory lies.
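
For readers following the thread, the two-stage guard described in this reply can be sketched as below. This is not the code from 9db1c44: `ZipEntry` is a hypothetical local interface standing in for the zip library's entry type, `readEntryBounded` is an invented name, and the exact value of `MAX_ENTRY_BYTES` is assumed from the "10 MB" figure in the commit message.

```typescript
type FetchErrorKind = 'NOT_FOUND' | 'PARSE'

class FetchError extends Error {
  readonly kind: FetchErrorKind
  constructor(kind: FetchErrorKind, message: string) {
    super(message)
    this.kind = kind
    this.name = 'FetchError'
  }
}

// Hypothetical shape of a zip entry: size as declared in the central
// directory, plus a method that decompresses the whole entry into memory.
interface ZipEntry {
  path: string
  uncompressedSize: number
  buffer(): Promise<Uint8Array>
}

const MAX_ENTRY_BYTES = 10 * 1024 * 1024 // assumed value for the 10 MB cap

async function readEntryBounded(ecosystem: string, file: ZipEntry): Promise<Uint8Array> {
  // Stage 1: trust-but-verify the declared size BEFORE decompressing anything.
  if (file.uncompressedSize > MAX_ENTRY_BYTES) {
    throw new FetchError(
      'PARSE',
      `Entry ${ecosystem}/${file.path} declares ${file.uncompressedSize} bytes (cap ${MAX_ENTRY_BYTES})`,
    )
  }

  const buffer = await file.buffer()

  // Stage 2: defense in depth in case the central directory understated the size.
  if (buffer.length > MAX_ENTRY_BYTES) {
    throw new FetchError(
      'PARSE',
      `Entry ${ecosystem}/${file.path} exceeds ${MAX_ENTRY_BYTES} bytes (got ${buffer.length})`,
    )
  }
  return buffer
}
```

The second check still runs after decompression, so it cannot prevent an OOM on its own; it only catches entries whose declared size was wrong but whose real size is still small enough to fit in memory.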

- Add companion partial index on advisory_packages (ecosystem, package_name)
  WHERE package_id IS NULL so the resolveMissingPackageIds catch-up UPDATE
  uses an index scan instead of a seq scan over the full table.
- Narrow upsertAdvisoryBatch tx granularity from per-batch (~2500 stmts) to
  per-record so advisory_packages row locks are held briefly and a Temporal
  cancel mid-batch only loses the in-flight record.
- Drop the lossy semver.coerce fallback in compareNpm; under-flag over
  mis-flag so a malformed introduced/fixed boundary like "1.2-junk-3"
  doesn't mint a false-positive critical match.
- Guard the zip-bomb path by checking file.uncompressedSize from the
  central directory BEFORE calling file.buffer(), since the buffer() call
  decompresses into memory first and would OOM before the post-decompress
  size check could fire.
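
The "under-flag over mis-flag" choice in the third bullet can be illustrated with a strict comparator that returns `null` for anything it cannot parse, instead of coercing it to a nearby version. This sketch is an assumption-laden stand-in: `parseStrict` and `compareStrict` are hypothetical names, and the parser deliberately accepts only plain `MAJOR.MINOR.PATCH` strings, which is far narrower than real semver (no prerelease or build metadata).

```typescript
type Triple = [number, number, number]

// Accepts only MAJOR.MINOR.PATCH with no leading zeros; everything else is null.
function parseStrict(version: string): Triple | null {
  const m = /^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$/.exec(version)
  if (m === null) return null
  return [Number(m[1]), Number(m[2]), Number(m[3])]
}

// Returns -1, 0 or 1, or null when either side is not strictly parseable.
function compareStrict(a: string, b: string): number | null {
  const pa = parseStrict(a)
  const pb = parseStrict(b)
  if (pa === null || pb === null) return null

  const [aMajor, aMinor, aPatch] = pa
  const [bMajor, bMinor, bPatch] = pb
  if (aMajor !== bMajor) return aMajor < bMajor ? -1 : 1
  if (aMinor !== bMinor) return aMinor < bMinor ? -1 : 1
  if (aPatch !== bPatch) return aPatch < bPatch ? -1 : 1
  return 0
}
```

Because `isInRange` treats a `null` comparison as "not in range", a malformed boundary such as `1.2-junk-3` leaves the package unflagged rather than producing a false-positive critical match.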

Signed-off-by: Joan Reyero <joan@reyero.io>
Per CLAUDE.md, all database queries should live in
services/libs/data-access-layer rather than inlined in the worker. This
move puts every advisories / advisory_packages / advisory_affected_ranges
query used by packages_worker/src/osv into a new
data-access-layer/src/packages/osv.ts so they can be reused (e.g. by the
future deps.dev BQ worker that will write the raw range columns) and so
schema-aware refactors stay in one place.

No behavior change: query strings, parameters, and ordering are kept
byte-for-byte; deriveCriticalFlag and upsertAdvisoryBatch just delegate
to the new functions. Existing unit + integration tests pass unchanged.

Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings May 28, 2026 14:06
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 30 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment on lines +140 to +144
// Only delete OSV-derived rows: rows with at least one of
// introduced/fixed/last_affected populated AND no deps.dev-source raw text
// columns. The deps.dev BQ worker (future) is expected to populate range_raw
// / unaffected_raw on rows of its own; we must not wipe those on every OSV
// pass.
Contributor Author

Right — comment was wrong, behavior is fine. Fixed in 5904d0e. The SQL only checks the deps.dev raw columns, and that IS the ownership rule (OSV-owned = no deps.dev raw cols). The earlier comment over-narrowed it by also requiring at least one structured column populated, which the SQL doesn't and shouldn't enforce — some MAL- "always vulnerable" ranges have all three structured cols NULL but are still OSV-owned and need to be deleted/re-inserted on resync. Comment now matches the predicate.
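
The ownership rule stated in this reply can be expressed as an in-memory predicate, shown here only to make the rule concrete. The actual SQL in deleteOsvOnlyRanges is not reproduced in this thread; `isOsvOwned` and the `RangeRow` shape are hypothetical, with column names taken from the code comment under review.

```typescript
// Assumed row shape; every column is nullable.
interface RangeRow {
  introduced_version: string | null
  fixed_version: string | null
  last_affected: string | null
  range_raw: string | null // deps.dev-source raw text
  unaffected_raw: string | null // deps.dev-source raw text
}

// OSV-owned means "no deps.dev raw columns populated". The structured columns
// are deliberately NOT consulted, so all-NULL MAL- ranges still count as OSV-owned.
function isOsvOwned(row: RangeRow): boolean {
  return row.range_raw === null && row.unaffected_raw === null
}
```

A row with all three structured columns NULL and no raw columns is therefore deleted and re-inserted on resync, while any row carrying deps.dev raw text is left alone.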

Comment on lines +617 to +622
-- Drives the resolveMissingPackageIds catch-up UPDATE in deriveCriticalFlag:
-- the query filters WHERE package_id IS NULL and joins on (ecosystem,
-- package_name), so the planner needs an index whose predicate matches the
-- WHERE clause to avoid a seq scan over the full table. The non-partial
-- (ecosystem, package_name) index above can't be used here because it doesn't
-- prove package_id IS NULL.
Contributor Author

Fair — fixed in 5904d0e. Verified locally: with the partial index dropped, EXPLAIN on the catch-up UPDATE shows Index Scan using advisory_packages_ecosystem_package_name_idx ... Filter: (package_id IS NULL) (so the non-partial index is reachable). Reworded the comment to say what the partial index actually buys: selectivity as the table grows — it stays O(unresolved), the non-partial would still need to scan/filter every (ecosystem, package_name) match.

Comment thread services/apps/packages_worker/src/osv/workflows.ts Outdated
Comment thread services/apps/packages_worker/src/osv/deriveCriticalFlag.ts
…ents

- Bump osvSyncEcosystem heartbeatTimeout from 5m to 15m. The first heartbeat
  only fires after the full ecosystem zip is downloaded, but DOWNLOAD_TIMEOUT_MS
  in fetchEcosystemZip is 10m. On a slow CDN a healthy download could exceed
  5m of silence and Temporal would kill the activity as unresponsive; 15m
  leaves 5m of headroom past the download cap.
- Reword the partial-index comment on advisory_packages. The earlier text said
  the non-partial index "can't be used" for the catch-up UPDATE — Postgres
  can use it with a Filter on package_id IS NULL. The real reason for the
  partial index is selectivity: it stays O(unresolved) as the table grows.
- Reword the deleteOsvOnlyRanges comment in data-access-layer. The earlier
  text claimed the SQL also required at least one of
  introduced/fixed/last_affected to be populated; the SQL only checks the
  deps.dev raw columns. Behavior is correct; the comment now matches.

Heartbeat fix is the only behavior change; the other two are doc-only.
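
The timeout arithmetic in the first bullet can be written down as a small invariant check. This is a sketch only: `HEARTBEAT_TIMEOUT_MS` and `heartbeatHeadroomMs` are hypothetical names, and the real values live in the Temporal activity options and in fetchEcosystemZip rather than in shared constants like these.

```typescript
const MINUTE_MS = 60 * 1000

// Longest phase during which the activity emits no heartbeat (the download cap).
const DOWNLOAD_TIMEOUT_MS = 10 * MINUTE_MS
// Heartbeat timeout after the bump described above.
const HEARTBEAT_TIMEOUT_MS = 15 * MINUTE_MS

// Positive result: a healthy download can finish before the activity is
// considered unresponsive. Negative result: it can be killed mid-download.
function heartbeatHeadroomMs(heartbeatTimeoutMs: number, maxSilentPhaseMs: number): number {
  return heartbeatTimeoutMs - maxSilentPhaseMs
}

const headroomAfterFix = heartbeatHeadroomMs(HEARTBEAT_TIMEOUT_MS, DOWNLOAD_TIMEOUT_MS)
const headroomBeforeFix = heartbeatHeadroomMs(5 * MINUTE_MS, DOWNLOAD_TIMEOUT_MS)
```

Here `headroomAfterFix` is 5 minutes, while `headroomBeforeFix` is negative, which is the failure mode the bump addresses.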

Signed-off-by: Joan Reyero <joan@reyero.io>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5904d0e.

Comment thread services/libs/data-access-layer/src/packages/osv.ts
@joanreyero joanreyero merged commit af6a460 into main May 29, 2026
15 checks passed
@joanreyero joanreyero deleted the feat/osv-advisories branch May 29, 2026 08:12
joanreyero added a commit that referenced this pull request Jun 3, 2026
Per themarolt's review on #4149, packages-db queries belong in
services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now
imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow,
upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from
@crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx
orchestrator. Query strings unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>

5 participants