
Fix missing load snapshot after manual snapshot update #2955

Merged
mschuwalow merged 2 commits into main from snapshot-update-fix on Mar 11, 2026
Conversation


@mschuwalow mschuwalow commented Mar 10, 2026

This fixes two bugs:

Duplicate enqueued updates

When the agent has no pending work in the active queue and checks pending_updates, it always enqueued a manual update whenever a pending update existed. That is incorrect in two ways:

  • if the pending update is an automatic update, a manual update was enqueued immediately afterwards
  • if the pending update is a manual update, a duplicate manual update was enqueued. If the manual update made it to the pending_updates queue (instead of the pending invocation queue), the save-snapshot was already performed and the agent is ready for restart / load-snapshot

This is a bit difficult to reproduce in our current tests, as it only shows up in a particular interleaving due to a race with the interrupt / restart that happens after an update is enqueued. The bug can be reproduced consistently if that restart is commented out.
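The fixed dispatch can be sketched as a minimal model (all type and function names here are illustrative, not the real golem-worker-executor types):

```rust
// Minimal model of the fix; names are illustrative, not real golem types.
#[derive(Debug, PartialEq)]
enum PendingUpdate {
    Automatic,
    Manual,
}

#[derive(Debug, PartialEq)]
enum Action {
    /// apply the automatic update directly, no extra invocation enqueued
    ApplyAutomatic,
    /// the manual update already reached pending_updates, meaning
    /// save-snapshot was performed; restart and load-snapshot
    RestartAndLoadSnapshot,
    /// nothing pending
    Idle,
}

fn next_action(pending_update: Option<PendingUpdate>) -> Action {
    match pending_update {
        // the buggy version enqueued a manual update invocation in both
        // of the Some branches, causing the duplicates described above
        Some(PendingUpdate::Automatic) => Action::ApplyAutomatic,
        Some(PendingUpdate::Manual) => Action::RestartAndLoadSnapshot,
        None => Action::Idle,
    }
}
```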

Manual updates do not call load-snapshot on replay start

Manual updates did not call load-snapshot on replay start; instead they resumed directly from the update-succeeded oplog entry. This caused agents to fail because they were left uninitialized: the original initialize is skipped due to the update, but load-snapshot was never called. The fix is to do the same as for the automatic snapshots and track the last save-snapshot payload that needs to be loaded.
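The tracking idea can be sketched against a simplified oplog model (entry and field names are illustrative; only the save-snapshot / update-succeeded relationship comes from the description above):

```rust
// Simplified oplog model; entry and field names are illustrative.
#[derive(Debug)]
enum OplogEntry {
    SaveSnapshot { payload: Vec<u8> },
    UpdateSucceeded,
    Other,
}

// Scan the oplog and return the payload of the save-snapshot belonging to
// the last *successful* manual update. This is the payload the agent must
// pass to load-snapshot before replay starts.
fn last_manual_snapshot_payload(oplog: &[OplogEntry]) -> Option<&[u8]> {
    let mut candidate: Option<&[u8]> = None;
    let mut confirmed: Option<&[u8]> = None;
    for entry in oplog {
        match entry {
            OplogEntry::SaveSnapshot { payload } => candidate = Some(payload.as_slice()),
            // only a snapshot confirmed by update-succeeded counts
            OplogEntry::UpdateSucceeded => confirmed = candidate.take(),
            OplogEntry::Other => {}
        }
    }
    confirmed
}
```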

@mschuwalow mschuwalow self-assigned this Mar 10, 2026
@mschuwalow mschuwalow force-pushed the snapshot-update-fix branch 3 times, most recently from 88e7b4d to 56ad0f0 on March 10, 2026 00:49
.last_automatic_snapshot_index
{
// automatic snapshots are only considered until the first failure
// if there are updates, ignore the automatic snapshot temporarily to catch issues earlier
Contributor (author):

I assume that is why the pending_update.is_none() case was there

@mschuwalow mschuwalow force-pushed the snapshot-update-fix branch 2 times, most recently from ee5ab96 to 893eb8b on March 10, 2026 01:04
@mschuwalow mschuwalow force-pushed the snapshot-update-fix branch from 893eb8b to 4a9a347 on March 10, 2026 01:35
@mschuwalow mschuwalow marked this pull request as ready for review March 10, 2026 01:36
@mschuwalow mschuwalow requested a review from vigoo March 10, 2026 09:27
CommandOutcome::Continue => continue,
other => break other,
}
if status.pending_updates.front().is_some() {
Contributor:

Are we sure that there is always an AgentInvocation::ManualUpdate enqueued and processed for saving the snapshot before we reach here? Probably this logic became a bit obscure through all the refactorings.
What if there are multiple manual updates enqueued? When we perform the first, and restart, what enqueues the thing in the command queue?

mschuwalow (Contributor, author) commented Mar 10, 2026:

Yes. pending_updates is only based on the pending_update oplog entry, which for manual snapshot updates is only created during the following path:

agent receives update request via grpc -> agent writes a manual update pending_agent_invocation oplog entry -> agent processes the invocation -> agent writes a pending_update oplog entry

For automatic updates the pending_update oplog entry is written immediately
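The two paths can be sketched as a toy function (hypothetical sketch; only pending_agent_invocation and pending_update come from the discussion above, the rest is illustrative):

```rust
// Hypothetical sketch of which oplog entries each update path writes, in
// order; entry names follow the discussion above.
fn oplog_entries_for_update(manual: bool) -> Vec<&'static str> {
    if manual {
        // manual path: the pending_update entry only appears after the
        // agent has processed the manual-update invocation (which is
        // where save-snapshot runs)
        vec![
            "pending_agent_invocation (manual update)",
            "... invocation processed, save-snapshot performed ...",
            "pending_update",
        ]
    } else {
        // automatic path: the pending_update entry is written immediately
        vec!["pending_update"]
    }
}
```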

OplogEntry::Snapshot {
data, mime_type, ..
} => (data, mime_type),
OplogEntry::PendingUpdate {
Contributor:

I understand this fixes the main issue, and now that we have the snapshot-based recovery this is probably a nice way to fix it, but I'm very confused that there was no machinery for this earlier.

I think it was supposed to be something like this:

  • once a manual update succeeds, it adds a skipped region from the beginning to the update point so that oplog part is ignored
  • so on next recovery, we reach the OplogEntry::PendingUpdate { SnapshotBased } again, and the recovery code calls load-snapshot so our state is restored
  • the rest of the oplog gets replayed

So I don't fully understand why this did not work, and whether it is a problem or not that now we have two ways to recover from a snapshot-based update. (As I think you did not touch the old one)

Contributor (author):

I believe this was always broken or at least broken for a very long time.

PendingUpdate is a hint entry, and I don't see any logic in replay that would call load-snapshot. This is surprising to me too; maybe I missed something.

Contributor:

My understanding of how it work(ed) on main:

I think the part that's hard to see/understand is the skipped region logic that now 100% lives in the worker status calculation. Previously it was explicitly set in various points of the above update logic, making it easier to follow (but the current way of calculating everything directly from the oplog is definitely the correct way)

Contributor (author):

What you are saying is exactly correct, but it's not doing this part on any following replays, as pending_updates will no longer contain the update (the oplog now has a successful update entry).

here we call load-snapshot (the agent sdk is supposed to internally initialize the agent as part of this) (https://github.com/golemcloud/golem/blob/main/golem-worker-executor/src/durable_host/mod.rs#L1048-L1055)

record_resume_worker(start.elapsed());
let replay_result = async {
if let SnapshotRecoveryResult::Failed =
Self::try_load_snapshot(store, instance).await
Contributor:

Maybe I misunderstand this part, but for automatic updates we should never use snapshots; we should always replay from the beginning.

Contributor (author):

This is only using snapshots created by manual updates, not the automatic snapshots (the logic for that lives in the WorkerConfig creation).

I don't think we can, in general, replay the parts of the oplog that were skipped as part of a manual update with a new component version (as manual snapshot updates can freely break oplog backwards compatibility). So I think replaying only from the last successful manual snapshot update is correct here.
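Assuming we track the oplog index of the last successful manual-update snapshot, picking the replay start point could look like this (hypothetical names and indexing, not the real executor code):

```rust
// Hypothetical sketch: choose where replay begins. Everything before a
// snapshot-based (manual) update may be incompatible with the new
// component version, so replay resumes just after the snapshot.
fn replay_start_index(last_manual_update_snapshot_index: Option<u64>) -> u64 {
    match last_manual_update_snapshot_index {
        Some(idx) => idx + 1, // skip the potentially incompatible prefix
        None => 0,            // no manual update: replay the whole oplog
    }
}
```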

pub last_snapshot_index: Option<OplogIndex>,
/// Index of the last manual update snapshot. The agent will call load_snapshot
/// on this payload before starting replay.
pub last_manual_snapshot_index: Option<OplogIndex>,
Contributor:

The name is a bit confusing to me without the comments, how about

  • last_manual_update_snapshot_index
  • last_snapshot_index (or we can include automatic in it; what I'd like is that one name is about updates and the other has nothing to do with updates, and automatic updates are a thing which has nothing to do with this)

mschuwalow (Contributor, author) commented Mar 10, 2026:

let's do last_manual_update_snapshot_index + last_automatic_snapshot_index 👍

Will update. Done.

@mschuwalow mschuwalow merged commit 91d5df9 into main Mar 11, 2026
50 of 51 checks passed
@mschuwalow mschuwalow deleted the snapshot-update-fix branch March 11, 2026 14:04
@github-actions github-actions bot locked and limited conversation to collaborators Mar 11, 2026