feat(asset): lifecycle Phase 0 — stale detection + snooze + dry-run#58

Merged
0xmanhnv merged 1 commit into develop from feat/asset-lifecycle-phase0
Apr 23, 2026

@0xmanhnv
Collaborator

RFC-004 Phase 0: assets that no scanner has re-observed transition automatically from active to stale. Operators see the status in the UI (badge work lands in a companion UI PR) and can snooze the worker per-asset to silence false positives during maintenance.

Backward-compat guarantee: a tenant with AssetLifecycleSettings.Enabled = false sees ZERO behavior change. Even the default after upgrade is false — admins must explicitly opt in AND run a successful dry-run first. This pairs with the range/type validation to prevent the pathological "enable on 2-year-old tenant → 1M assets go stale overnight" scenario.

All 5 critical safety rails from the RFC edge-case analysis are in this single PR rather than shipped incrementally:

  • E2.6 manual-reactivation flap: Asset.lifecycle_paused_until + per-tenant ManualReactivationGraceDays. A manual Activate bumps lifecycle_paused_until so the worker does not re-demote the asset on its next tick. Operators can also set a custom snooze duration via SnoozeLifecycle(d) (7d / 30d / 90d / forever).
  • E3.1 integration-offline storm: worker.hasRecentIngest check skips an entire tenant when no asset has been seen for 48h. Prevents a crashed agent from demoting 10K assets in one pass.
  • E4.4 race between worker and ingest: MarkSeen always resets status to active (unless manual override or archived). Whoever writes last heals the row.
  • E7.1 first-enable risk: AssetLifecycleSettings.Validate rejects Enabled=true when DryRunCompletedAt is nil. The dry-run endpoint stamps the timestamp on success, unlocking the toggle.
  • E9.5 operator intent: manual_status_override column. When true, worker refuses to write status. Operator owns the asset.

What's in:

  • Domain: StatusStale constant, Asset lifecycle fields + methods (MarkStale, SnoozeLifecycle, ManualOverride toggle, MarkSeen auto-reactivate with server-side timestamp).
  • tenant.Settings.AssetLifecycle — enabled flag, thresholds with min/max bounds, excluded source types, pause-on-integration-fail.
  • Migration 000165: adds 'stale' to status CHECK, adds lifecycle_paused_until + manual_status_override columns, partial index for the worker's hot query (CONCURRENTLY so no table lock).
  • internal/app/asset/lifecycle_worker.go — atomic UPDATE with COALESCE-safe null handling, GREATEST across last_seen/updated_at for manually-edited assets, EXISTS filter for asset_sources to protect manual/import-only assets.
  • internal/infra/controller/asset_lifecycle.go — daily cron with per-tenant iteration, isolates failures so one broken tenant does not halt the fleet.
  • TenantService.UpdateAssetLifecycleSettings + full before/after audit diff; StampAssetLifecycleDryRunCompleted unlocks the enable toggle after a successful preview.
  • HTTP: GET/PUT /settings/asset-lifecycle + POST /dry-run, all behind RequireTeamAdmin, DisallowUnknownFields on the body.
  • Audit actions: ActionAssetMarkedStale, ActionAssetReactivated, ActionAssetLifecycleSnoozed, ActionAssetLifecycleUnsnoozed, ActionTenantAssetLifecycleUpdated, ActionAssetLifecycleRun.
  • 23 unit tests covering the transition matrix, bounds validation, excluded source types, first-enable dry-run gate, server-time enforcement, manual-override bypass, archive terminal state, snooze expiry math.

What's NOT in (deferred to later phases):

  • Phase 1: stale → inactive transition after a second threshold.
  • Phase 1.5: SLA pause on deactivated asset's findings (per-finding override + tenant default "continue" / "pause" / "pause_review").
  • Phase 2: archive tier (terminal, manual restore only).
  • UI work (badges, settings page, snooze menu) — separate PR in the ui repo so review can be parallelized.
  • Repository persistence for lifecycle_paused_until + manual_status_override: the Asset entity exposes getters/setters but the Postgres adapter and Reconstitute signature are untouched here. The worker's UPDATE statement writes the columns directly. A follow-up wires these into the Asset repository for API reads.

Phase 0 ships the whole mechanism safely rather than feature-flipped in stages — every guard on the list is required for correct behavior even at v1 so there is no "partial ship" that would leave a tenant exposed to the failure modes above.

@0xmanhnv 0xmanhnv merged commit 4827fa2 into develop Apr 23, 2026
9 checks passed