feat(asset): lifecycle Phase 0 - stale detection + snooze + dry-run #58
Merged
RFC-004 Phase 0: assets that no scanner has re-observed transition automatically from active to stale. Operators see the status in the UI (badge work lands in a companion UI PR) and can snooze the worker per-asset to silence false positives during maintenance.

Backward-compat guarantee: a tenant with AssetLifecycleSettings.Enabled = false sees ZERO behavior change. The default after upgrade is also false: admins must explicitly opt in AND run a successful dry-run first. This pairs with the range/type validation to prevent the pathological "enable on 2-year-old tenant → 1M assets go stale overnight" scenario.

All 5 critical safety rails from the RFC edge-case analysis are in this single PR rather than shipped incrementally:

- E2.6 manual-reactivation flap: Asset.lifecycle_paused_until + per-tenant ManualReactivationGraceDays. A manual Activate bumps this so the worker does not re-demote on the next tick. Operators can also set a custom snooze duration via SnoozeLifecycle(d) (7d / 30d / 90d / forever).
- E3.1 integration-offline storm: the worker.hasRecentIngest check skips an entire tenant when no asset has been seen for 48h. Prevents a crashed agent from demoting 10K assets in one pass.
- E4.4 race between worker and ingest: MarkSeen always resets status to active (unless manual override or archived). Whoever writes last heals the row.
- E7.1 first-enable risk: AssetLifecycleSettings.Validate rejects Enabled=true when DryRunCompletedAt is nil. The dry-run endpoint stamps the timestamp on success, unlocking the toggle.
- E9.5 operator intent: manual_status_override column. When true, the worker refuses to write status. The operator owns the asset.

What's in:

- Domain: StatusStale constant, Asset lifecycle fields + methods (MarkStale, SnoozeLifecycle, ManualOverride toggle, MarkSeen auto-reactivate with server-side timestamp).
- tenant.Settings.AssetLifecycle: enabled flag, thresholds with min/max bounds, excluded source types, pause-on-integration-fail.
- Migration 000165: adds 'stale' to the status CHECK, adds lifecycle_paused_until + manual_status_override columns, and adds a partial index for the worker's hot query (CONCURRENTLY, so no table lock).
- internal/app/asset/lifecycle_worker.go: atomic UPDATE with COALESCE-safe null handling, GREATEST across last_seen/updated_at for manually-edited assets, EXISTS filter on asset_sources to protect manual/import-only assets.
- internal/infra/controller/asset_lifecycle.go: daily cron with per-tenant iteration; isolates failures so one broken tenant does not halt the fleet.
- TenantService.UpdateAssetLifecycleSettings + full before/after audit diff; StampAssetLifecycleDryRunCompleted unlocks the enable toggle after a successful preview.
- HTTP: GET/PUT /settings/asset-lifecycle + POST /dry-run, all behind RequireTeamAdmin, with DisallowUnknownFields on the body.
- Audit actions: ActionAssetMarkedStale, ActionAssetReactivated, ActionAssetLifecycleSnoozed, ActionAssetLifecycleUnsnoozed, ActionTenantAssetLifecycleUpdated, ActionAssetLifecycleRun.
- 23 unit tests covering the transition matrix, bounds validation, excluded source types, first-enable dry-run gate, server-time enforcement, manual-override bypass, archive terminal state, snooze expiry math.

What's NOT in (deferred to later phases):

- Phase 1: stale → inactive transition after a second threshold.
- Phase 1.5: SLA pause on a deactivated asset's findings (per-finding override + tenant default "continue" / "pause" / "pause_review").
- Phase 2: archive tier (terminal, manual restore only).
- UI work (badges, settings page, snooze menu): separate PR in the ui repo so review can be parallelized.
- Repository persistence for lifecycle_paused_until + manual_status_override: the Asset entity exposes getters/setters, but the Postgres adapter and Reconstitute signature are untouched here. The worker's UPDATE statement writes the columns directly. A follow-up wires these into the Asset repository for API reads.
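For reviewers who want the E7.1 first-enable gate in concrete terms, here is a minimal sketch of how such a check could look. It is not the code in this PR: only AssetLifecycleSettings, Enabled, DryRunCompletedAt and Validate are named in the description. The StaleThresholdDays field, the 7/365 bounds, the ErrDryRunRequired sentinel and the StampDryRunCompleted helper are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// AssetLifecycleSettings mirrors the fields named in the description.
// StaleThresholdDays is a hypothetical field name.
type AssetLifecycleSettings struct {
	Enabled            bool
	StaleThresholdDays int
	DryRunCompletedAt  *time.Time // nil until a dry-run has succeeded
}

// Assumed bounds; the real min/max values are not stated in the description.
const (
	minStaleDays = 7
	maxStaleDays = 365
)

// ErrDryRunRequired is a hypothetical sentinel for the first-enable gate.
var ErrDryRunRequired = errors.New("asset lifecycle: a successful dry-run is required before enabling")

// Validate checks the threshold bounds first, then refuses Enabled=true
// while no dry-run timestamp has been stamped.
func (s AssetLifecycleSettings) Validate() error {
	if s.StaleThresholdDays < minStaleDays || s.StaleThresholdDays > maxStaleDays {
		return fmt.Errorf("asset lifecycle: stale threshold %d days outside [%d, %d]",
			s.StaleThresholdDays, minStaleDays, maxStaleDays)
	}
	if s.Enabled && s.DryRunCompletedAt == nil {
		return ErrDryRunRequired
	}
	return nil
}

// StampDryRunCompleted records a server-side timestamp, which is what
// unlocks the enable toggle in this sketch.
func (s *AssetLifecycleSettings) StampDryRunCompleted() {
	now := time.Now().UTC()
	s.DryRunCompletedAt = &now
}

func main() {
	s := AssetLifecycleSettings{Enabled: true, StaleThresholdDays: 30}
	fmt.Println("before dry-run:", s.Validate()) // rejected by the gate

	s.StampDryRunCompleted()
	fmt.Println("after dry-run:", s.Validate()) // passes
}
```

Checking bounds before the gate means an out-of-range threshold is reported even while the feature is still disabled, which is one way to keep a bad value from sitting in the settings until the day the toggle is flipped.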
Phase 0 ships the whole mechanism safely rather than feature-flipped in stages — every guard on the list is required for correct behavior even at v1 so there is no "partial ship" that would leave a tenant exposed to the failure modes above.