Skip to content
This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Integration tests reliability fixes #1505

Merged
merged 7 commits into from
Dec 3, 2021
Merged

Conversation

chkeita
Copy link
Contributor

@chkeita chkeita commented Dec 1, 2021

Summary of the Pull Request

Multiple fixes for #1454

  • Check if the node is in the don state before reimaging. This gives the node the opportunity to process the StopCommand before we reimage it
  • Log node done event as warning when the node is deleted from the database
  • Log heartbeat from deleted nodes as warning

Validation Steps Performed

run check-pr

@chkeita chkeita linked an issue Dec 2, 2021 that may be closed by this pull request
@chkeita chkeita marked this pull request as ready for review December 2, 2021 21:18
@chkeita chkeita changed the title only reimage nodes that are in the done state Integration tests reliability fixes Dec 2, 2021
Copy link
Member

@ranweiler ranweiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of way the existing code is written, I find it hard to see how the change to the default behavior of stop_if_complete() interact with various other flows in the system.

Generally: what other system behaviors are determined by that function, and how did we test that certain error cases are still handled correctly?

Specifically: I have a vague concern that there may be error scenarios where we really want to reimage an unresponsive node, but we've now made ourselves rely on its cooperation, and so reimaging will never happen. I don't have an example, but I'd like to think through and test some cases like that.

@chkeita
Copy link
Contributor Author

chkeita commented Dec 3, 2021

@ranweiler

The change i introduced make it so that we don't force the state to done when stopping a task. We rely on the machine to report the state for that.
for the case of unresponsive node, the state is forced to done before reimaging so my change will not prevent the reimage

@chkeita chkeita closed this Dec 3, 2021
@chkeita chkeita reopened this Dec 3, 2021
@chkeita chkeita merged commit 08691c0 into microsoft:main Dec 3, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jan 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spurious error telemetry when stopping Windows generator tasks
4 participants