Skip to content

docs: design doc for auto-removal of unused launcher image hashes#3488

Merged
barakeinav1 merged 9 commits into
mainfrom
docs/auto-remove-launcher-hashes-design
Jun 15, 2026
Merged

docs: design doc for auto-removal of unused launcher image hashes#3488
barakeinav1 merged 9 commits into
mainfrom
docs/auto-remove-launcher-hashes-design

Conversation

@barakeinav1

@barakeinav1 barakeinav1 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Part of #3381

Design-only PR — proposes usage-based expiry for allowed_launcher_image_hashes:

  • "Use" signal: a launcher hash is "used" whenever a node's hourly attestation resubmission (submit_participant_info) verifies against it — the contract refreshes that entry's last_attested timestamp there. Nodes already send these, so no node-side changes
  • Expiry: max(added, last_attested) + TTL < now, TTL default 14 days (config field, validated >= DEFAULT_EXPIRATION_DURATION_SECONDS)
  • Enforcement: every read of the allowed list filters out expired entries, so an expired hash is rejected immediately at TTL lapse
  • Storage reclaimed by a detached self-call from verify_tee to a #[private] cleanup method (no new public API, cannot fail the host tx)
  • Never-used hashes expire too (e.g. a newly voted launcher image that no node ever migrated to); recovery is a threshold re-vote
  • Unanimous vote_remove_launcher_hash stays as manual early-removal override; measurements and MPC docker-image expiry out of scope

Remaining open question for reviewers: TTL length (14 days vs. 30 — see the Open questions section). Implementation follows in a separate PR.

Proposes a usage-based TTL mechanism (refresh-on-attestation, read-time
filtering, lazy sweep in verify_tee) to evict launcher hashes no
participant has used for 14 days, replacing the unanimous-vote-only
removal path for routine cleanup.

Part of #3381

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a design document proposing usage-based expiry for launcher image hashes so the contract can automatically evict stale allowed_launcher_image_hashes entries over time, reducing long-term accumulation without requiring unanimous removal votes.

Changes:

  • Introduces a draft design for adding added / last_attested timestamps to allowed launcher entries and expiring them after max(added, last_attested) + TTL.
  • Proposes “refresh-on-use” via submit_participant_info and read-time filtering to reject expired hashes immediately.
  • Describes an inline, lazy sweep during verify_tee as housekeeping to delete expired entries from storage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +53 to +55
- A hash backing a **valid attestation is never expired**: TTL (14d) >
attestation validity (7d), so any valid attestation refreshed its entry
within 7 days. Enforced as a config validation (`TTL > 7d`).
Comment on lines +111 to +121
## Open questions

1. **TTL = 14 days** implies *vote a launcher in at most 14 days before
migrating to it*. Acceptable, or prefer 30 days?
2. **Inline sweep vs. detached promise** — the issue's acceptance criteria
suggest running cleanup in a detached promise, so that a mass-expiry can
never consume enough gas to fail the transaction it piggybacks on. This doc
instead sweeps inline in `verify_tee`, betting that
`allowed_launcher_images` stays tiny forever (entries only enter via rare
operator votes; sweeping ~4 entries is negligible gas). Does anyone see a
future where this list grows to many entries? If not, inline stands.
Comment on lines +42 to +44
3. **Lazy sweep** — `verify_tee` deletes expired entries from storage
(keeping at least the newest). Housekeeping only; expiry is already
enforced by (2).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m afraid we could potentially run out of gas if we inline it. Instead, we could introduce a public contract method, similar to this one, that cleans up the expired entries from storage:

/// Removes all expired code hashes and returns the number of removed entries.
/// Ensures that at least one (the latest) proposal always remains in the whitelist.
pub fn cleanup_expired_hashes(&mut self, tee_upgrade_deadline_duration: Duration) {
let valid_entries = self.valid_entries(tee_upgrade_deadline_duration);
self.allowed_tee_proposals = valid_entries;
}

Alternatively, we could spawn it as a separate self-call via Promise::new(...).function_call(CLEAN_EXPIRED_LAUNCHER_HASHES, ...).detach(), similar to what we do here:

mpc/crates/contract/src/lib.rs

Lines 1088 to 1136 in cf72a5a

// Spawn a promise to clean up votes from non-participants.
// Note: MpcContract::vote_update uses filtering to ensure correctness even if this cleanup fails.
Promise::new(env::current_account_id())
.function_call(
method_names::REMOVE_NON_PARTICIPANT_UPDATE_VOTES.to_string(),
vec![],
NearToken::from_yoctonear(0),
Gas::from_tgas(self.config.remove_non_participant_update_votes_tera_gas),
)
.detach();
// Spawn a promise to drop votes cast by non-participants.
Promise::new(env::current_account_id())
.function_call(
method_names::CLEAN_TEE_STATUS.to_string(),
vec![],
NearToken::from_yoctonear(0),
Gas::from_tgas(self.config.clean_tee_status_tera_gas),
)
.detach();
// Spawn a bounded sweep over stored attestations to prune invalid / expired entries.
Promise::new(env::current_account_id())
.function_call(
method_names::CLEAN_INVALID_ATTESTATIONS.to_string(),
serde_json::to_vec(&serde_json::json!({
"max_scan": RESHARE_CLEAN_INVALID_ATTESTATIONS_MAX_SCAN
}))
.unwrap(),
NearToken::from_yoctonear(0),
Gas::from_tgas(self.config.clean_invalid_attestations_tera_gas),
)
.detach();
// Spawn a promise to clean up orphaned node migrations for non-participants
Promise::new(env::current_account_id())
.function_call(
method_names::CLEANUP_ORPHANED_NODE_MIGRATIONS.to_string(),
vec![],
NearToken::from_yoctonear(0),
Gas::from_tgas(self.config.cleanup_orphaned_node_migrations_tera_gas),
)
.detach();
// Spawn a promise to clean up foreign chain data for non-participants
Promise::new(env::current_account_id())
.function_call(
method_names::CLEAN_FOREIGN_CHAIN_DATA.to_string(),
vec![],
NearToken::from_yoctonear(0),
Gas::from_tgas(self.config.clean_foreign_chain_data_tera_gas),
)
.detach();

@barakeinav1 barakeinav1 Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to add a new public API. We already have a lot.
I don't think that running out of gas is a real concern, since the launcher whitelist is very small (and we control the size via voting). but I'm ok with moving this to a private method and using a Promise if you think it is better.

Let get another reviewer on this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A private method with a promise seems like a manageable situation. If this starts accumulating and running out of gas it doesn't bring down any other functionality. I'm happy with that solution.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the doc to go with the detached Promise + #[private] method (your option b): verify_tee spawns a detached self-call to a new #[private] clean_expired_launcher_hashes, gas reserved via a clean_expired_launcher_hashes_tera_gas config knob like the existing post-resharing cleanups. Skipped the public-method option to avoid growing the public API. Since read-time filtering already enforces expiry, a failed/out-of-gas sweep is harmless and just retries next verify_tee.

@netrome netrome left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought so far, haven't read the full proposal. Will continue later in the afternoon.

Comment on lines +14 to +17
The insight: **not using a launcher is itself a vote.** Every node already
resubmits an attestation hourly (`submit_participant_info`), proving which
launcher it runs. The contract can observe disuse and evict stale hashes
without any vote.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I wonder if we should do the same for node image hashes. I don't think those should be auto-removed since we may have long periods during upgrades where we want to allow two concurrent node versions.

@barakeinav1 barakeinav1 Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do want add such a thing for nodes, then we will need to add voting mechanism to remove the nodes (same as we have for the launcher). To prevent an outdated node from staying forever.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly, which does make sense to me. In practice I'd expect all operators to update in due time and auto-eviction working well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to think if this is in addition to or instead of the auto 7 day removall we have.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't change this for now, but we can think of this for the future.

netrome
netrome previously approved these changes Jun 15, 2026

@netrome netrome left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, as long as we're aligned the implementation of the lazy sweep should be through a promise. Could be worth highlighting in this doc - but we could also view it as an implementation detail. I'm fine with both.

Replaces the inline verify_tee sweep with a detached self-call to a new
#[private] clean_expired_launcher_hashes method, so cleanup can never fail
the host transaction and adds no public API. Resolves the open question on
inline vs. promise cleanup.
@barakeinav1 barakeinav1 marked this pull request as ready for review June 15, 2026 07:12
@barakeinav1 barakeinav1 requested review from netrome and pbeza June 15, 2026 07:12
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown

Pull request overview

Design-only PR introducing a draft proposal for usage-based expiry of allowed_launcher_image_hashes. The design adds added/last_attested timestamps, refreshes last_attested on each successful submit_participant_info, filters expired entries at read time, and reclaims storage via a detached #[private] sweep spawned from verify_tee. The doc has already absorbed earlier review feedback (detached-promise sweep over inline, no new public API, TTL invariant tied to DEFAULT_EXPIRATION_DURATION_SECONDS).

Changes:

  • New file: docs/design/auto-remove-launcher-hashes-design.md proposing schema, lifecycle, safety invariants, migration, alternatives, and one open question (TTL = 14d vs 30d).

Reviewed changes

Per-file summary
File Description
docs/design/auto-remove-launcher-hashes-design.md New design doc; no code changes.

Findings

I verified the doc's load-bearing claims against the current code:

  • mpc_attestation::attestation::DEFAULT_EXPIRATION_DURATION_SECONDS = 7d, distinct from DEFAULT_TEE_UPGRADE_DEADLINE_DURATION_SECONDS in crates/contract/src/config.rs — the disambiguation in the doc is accurate (crates/mpc-attestation/src/attestation.rs:24, crates/contract/src/config.rs:7).
  • AllowedDockerImageHashes::valid_entries does return the newest entry when all are expired (rposition + get(cutoff_index..)), so the fallback claim in §2 mirrors the existing pattern (crates/contract/src/tee/proposal.rs:170).
  • The detached-promise + tera-gas-knob pattern in vote_reshared is the precedent the doc invokes (crates/contract/src/lib.rs:1147).

Non-blocking (design clarifications worth folding in before implementation):

  • docs/design/auto-remove-launcher-hashes-design.md:63 — "Recovery: vote_add_launcher_hash for an already-present entry resets its added timestamp" implies a behavior change to AllowedLauncherImages::add, which today returns false and is a no-op when the launcher already exists (crates/contract/src/tee/proposal.rs:282). Worth calling this out explicitly so the implementation PR doesn't leave the recovery path silently broken: either change add to upsert/refresh added on a re-vote, or document an alternative recovery (e.g. unanimous remove + add) and update the "Slow rollout" scenario accordingly.
  • docs/design/auto-remove-launcher-hashes-design.md:41 — "Refresh on use" is scoped to submit_participant_info only. The verify_tee / vote_new_parameters re-verify paths (crates/contract/src/lib.rs:896, 1617) successfully re-verify stored attestations against the launcher set but, under this design, would not refresh last_attested. The safety invariant still holds (because the underlying attestation itself expires at 7d, forcing a fresh submit_participant_info), but it's worth stating this asymmetry explicitly in the doc — readers will otherwise wonder why "re-verify" doesn't count as "use."
  • docs/design/auto-remove-launcher-hashes-design.md:41 — Implementation will need a way to identify which AllowedLauncherImage entry owns the matched compose hash. The current attestation.verify(...) -> AcceptedAttestation flow (crates/contract/src/tee/tee_state.rs:159) doesn't surface that linkage to the caller; flagging this so the implementation PR plumbs it through rather than re-deriving by recomputing compose hashes per entry.
  • docs/design/auto-remove-launcher-hashes-design.md:137 — The Open questions section still omits the observability item the PR description asks reviewers to consider (and that Auto-remove unused launcher image hashes from the contract #3381's AC raises). Either add it as Q2 or note in the PR description that it was deliberately resolved (e.g. "logs only, no events").
  • docs/design/auto-remove-launcher-hashes-design.md:118 — Migration sets added = last_attested = migration time, which means every currently stale testnet hash gets a fresh 14-day reprieve before it ages out. That's the conservative choice but worth being explicit that "stale testnet hashes age out" only happens after the post-migration TTL.

✅ Approved — design is internally consistent, safety invariants check out against the actual constants, and the detached-promise sweep matches the established vote_reshared cleanup convention. The notes above are clarifications for the implementation PR, not merge blockers.

@barakeinav1 barakeinav1 enabled auto-merge June 15, 2026 08:55
@barakeinav1 barakeinav1 added this pull request to the merge queue Jun 15, 2026
Merged via the queue into main with commit bc8bd95 Jun 15, 2026
15 checks passed
@barakeinav1 barakeinav1 deleted the docs/auto-remove-launcher-hashes-design branch June 15, 2026 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants