In this failed test run, we encountered an edge case where a broken node couldn't be removed from the cluster because of the node group's minimum size restriction. This causes CA to fall into a loop in which it repeatedly attempts to remove this node. The impact of this bug is significantly reduced by the liveness probe (see below).
Scenario:
1. a node group with size n and minSize <= n, with one of the nodes in this node group remaining unregistered
2. after 15 minutes of normal operation, CA will attempt to remove the node
3. if the node group is still at or below its minSize, CA will repeat this attempt in every iteration (and keep failing)
4. within 10-20 minutes, the liveness probe will fail due to the repeated failures and restart CA
5. go to 2.
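The loop in steps 2 and 3 can be illustrated with a minimal sketch. It assumes a hypothetical `nodeGroup` type whose `deleteNode` method refuses any deletion while the group is at or below `minSize`; these names are made up for illustration and are not the Cluster Autoscaler or cloud provider API.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeGroup is a hypothetical stand-in for a cloud provider node group.
// It is not the real Cluster Autoscaler type.
type nodeGroup struct {
	minSize    int
	targetSize int
}

var errMinSize = errors.New("node group is at or below its minimum size")

// deleteNode mimics a provider that rejects a deletion whenever the
// group is at or below minSize, even if the node is broken.
func (g *nodeGroup) deleteNode() error {
	if g.targetSize <= g.minSize {
		return errMinSize
	}
	g.targetSize--
	return nil
}

// failedAttempts runs up to the given number of loop iterations and
// counts the removal attempts that fail. It stops at the first success.
func failedAttempts(g *nodeGroup, iterations int) int {
	failures := 0
	for i := 0; i < iterations; i++ {
		if err := g.deleteNode(); err == nil {
			break
		}
		failures++
	}
	return failures
}

func main() {
	g := &nodeGroup{minSize: 3, targetSize: 3}
	fmt.Println(failedAttempts(g, 5)) // 5: every attempt fails at minSize

	g.targetSize++                    // a scale-up of the affected group
	fmt.Println(failedAttempts(g, 5)) // 0: the removal now succeeds
}
```

Nothing in the failing branch changes state, so under these assumptions the same error recurs on every iteration until something else raises the group's size.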
Impact:
- each restart by the liveness probe gives a 15-minute window in which the problem can resolve itself if a demand for scale-up of the affected node group occurs (once the affected node group's size is increased, it should be possible to remove the unregistered node)
- however, if cluster activity causes only other node groups to be scaled up or down, CA works approximately 50% of the time (15 minutes on, ~15 minutes off)
- in e2e tests, it causes all the remaining scenarios to fail.
Proposed solutions:
- implement a backoff when attempting to remove an unregistered node
- extend the cloudprovider interface to allow overriding the min size restriction on node deletion
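As a rough illustration of the first proposal, here is a minimal sketch of a per-node exponential backoff. The `removalBackoff` type, its methods, and the one-minute and thirty-minute durations are all assumptions made up for this example; they are not taken from the Cluster Autoscaler code base or from the eventual fix.

```go
package main

import (
	"fmt"
	"time"
)

// removalBackoff is a hypothetical helper that tracks, per node name,
// the earliest time at which the next removal attempt is allowed.
type removalBackoff struct {
	initial time.Duration
	max     time.Duration
	delay   map[string]time.Duration
	nextTry map[string]time.Time
}

func newRemovalBackoff(initial, max time.Duration) *removalBackoff {
	return &removalBackoff{
		initial: initial,
		max:     max,
		delay:   map[string]time.Duration{},
		nextTry: map[string]time.Time{},
	}
}

// allowed reports whether a removal attempt may be made at time now.
// A node with no recorded failure is always allowed.
func (b *removalBackoff) allowed(node string, now time.Time) bool {
	return !now.Before(b.nextTry[node])
}

// recordFailure doubles the node's delay (starting at initial, capped
// at max) and schedules the next permitted attempt.
func (b *removalBackoff) recordFailure(node string, now time.Time) {
	d, seen := b.delay[node]
	if !seen {
		d = b.initial
	} else {
		d *= 2
	}
	if d > b.max {
		d = b.max
	}
	b.delay[node] = d
	b.nextTry[node] = now.Add(d)
}

// recordSuccess clears all backoff state for the node.
func (b *removalBackoff) recordSuccess(node string) {
	delete(b.delay, node)
	delete(b.nextTry, node)
}

func main() {
	b := newRemovalBackoff(time.Minute, 30*time.Minute)
	now := time.Unix(0, 0) // arbitrary fixed start time for the example

	b.recordFailure("node-1", now)
	fmt.Println(b.allowed("node-1", now.Add(30*time.Second))) // false
	fmt.Println(b.allowed("node-1", now.Add(time.Minute)))    // true
}
```

With something like this in place, the main loop would skip the removal while `allowed` returns false instead of failing in every iteration, which should keep the repeated errors from tripping the liveness probe. Whether the backoff state should survive a restart is left open here.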
After discussion and a fix attempt, it was agreed that the exact conditions needed to trigger this are rare enough that it is not a release blocker. It will be fixed in a patch release.
cc @MaciekPytel