Upgrade manager stuck in no InService instances. #324
Thanks @ameyajoshi99. Also, you mention that all instances are in StandBy — are there no new instances InService? When an instance is set to standby, a new one should automatically launch and become InService.
This is happening because of a caching issue. Now (in the latest code on master) we are flushing the ec2 object upon every reconcile operation, so you shouldn't see this issue anymore.
This message means that among the instances in the batch (the ones considered for rotation), all have been set to StandBy and none are InService anymore. This was done to avoid corner cases where we would end up setting the same instances to StandBy multiple times, for which the AWS APIs return an error. Bottom line: a launch template rollout could get stuck, fail, or not start at all because upgrade-manager could be operating on stale data. We have fixed that, and the next release should address this issue. Please try the latest code from the upgrade-manager master branch and let us know if you still have the issue.
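The caching fix described above — discarding the cached EC2 view on every reconcile so the controller never acts on stale instance state — can be sketched roughly like this (a minimal illustration with hypothetical names, not the actual upgrade-manager code):

```go
package main

import "fmt"

// reconciler sketches the fix: any cached instance state is discarded at the
// start of every reconcile, so rotation decisions are always made against a
// freshly described ASG rather than stale data.
type reconciler struct {
	cache map[string]string // instanceID -> lifecycle state (may be stale between reconciles)
}

func (r *reconciler) reconcile() {
	// Flush the cached EC2 view before doing anything else.
	r.cache = map[string]string{}
	// ...re-describe the ASG here and proceed with fresh data...
	fmt.Println("cache flushed, entries:", len(r.cache))
}

func main() {
	r := &reconciler{cache: map[string]string{"i-aaa": "Standby"}}
	r.reconcile()
}
```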
Would be good to add a hotfix release with the latest changes.
@eytan-avisror agreed. I am OOO today. Will consider doing it tomorrow.
@eytan-avisror / @shreyas-badiger Thanks for the response, will try out the fix.
I tried out the master branch. Seems like that did work for a cluster with 15 nodes.
Looks like an instance which is not InService (either in Pending or Terminating state) is being attempted for standby. I shall look into it. Can you share the complete logs in a file?
upgrade-manager.log |
@ameyajoshi99 I looked into the logs. Let me explain what is happening. I am explaining everything in detail so that you also understand and are encouraged to contribute. NOTE: This comment explains the process, and the next comment talks about the error you are facing.

Process: Once a CR is admitted, we start processing the nodes for rotation batch-by-batch. The batch size is determined by the CR's strategy (e.g. maxUnavailable). How do we select a batch? These steps are followed when we are processing a batch (Eager mode; Lazy mode skips the waiting in steps 3, 4, 5.)
The last step marks the completion of processing a batch. We repeat the above steps until we achieve the final state for the CR, i.e. all the instances in the ASG have the same launch config/launch template as the ASG.
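The batch-by-batch rotation process described above can be sketched as follows. This is an illustrative simulation under stated assumptions — batch selection by outdated launch template, replacement simulated in place — with hypothetical names, not the actual upgrade-manager controller code:

```go
package main

import "fmt"

// node is a simplified stand-in for an ASG instance.
type node struct {
	id       string
	template string // launch template/config version the instance was launched with
}

const desired = "lt-v2" // the ASG's current launch template in this sketch

// rotate processes nodes batch-by-batch: pick instances whose template is
// outdated, rotate the batch (in the real controller: set to StandBy, wait for
// replacements to be InService, drain, terminate), and repeat until every
// instance matches the ASG's launch template (the CR's final state).
func rotate(asg []node, batchSize int) []node {
	for {
		// Select a batch of instances with an outdated template.
		var batch []int
		for i, n := range asg {
			if n.template != desired && len(batch) < batchSize {
				batch = append(batch, i)
			}
		}
		if len(batch) == 0 {
			return asg // final state reached: all instances match the ASG template
		}
		// Simulate each batch instance being replaced by a fresh one.
		for _, i := range batch {
			asg[i] = node{id: asg[i].id + "-new", template: desired}
		}
		fmt.Printf("rotated batch of %d\n", len(batch))
	}
}

func main() {
	asg := []node{{"i-1", "lt-v1"}, {"i-2", "lt-v1"}, {"i-3", "lt-v2"}}
	out := rotate(asg, 2)
	fmt.Println(len(out), "nodes, all on", desired)
}
```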
Now, talking about the error you are facing:
We added this check to make sure that step 2 mentioned above doesn't hit errors while setting instances to StandBy. The right thing to do is to set only the instances that are InService to StandBy. Basically, the bug is in this line:
Instead, it should have been:
Can you make this change, test it, and send out a PR for this?
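The fix being discussed — filtering the batch down to instances that are actually InService before asking AWS to move them to standby — can be sketched like this. Names and types are illustrative, not the actual upgrade-manager code or the AWS SDK:

```go
package main

import "fmt"

// Instance is a simplified stand-in for an EC2/ASG instance record.
type Instance struct {
	ID             string
	LifecycleState string // e.g. "InService", "Pending", "Terminating", "Standby"
}

// filterInService keeps only instances that are currently InService, so a
// standby call is never attempted on Pending/Terminating/Standby instances
// (which the AWS API rejects -- the error seen in this issue).
func filterInService(batch []Instance) []string {
	var ids []string
	for _, in := range batch {
		if in.LifecycleState == "InService" {
			ids = append(ids, in.ID)
		}
	}
	return ids
}

func main() {
	batch := []Instance{
		{ID: "i-aaa", LifecycleState: "InService"},
		{ID: "i-bbb", LifecycleState: "Pending"},
		{ID: "i-ccc", LifecycleState: "Terminating"},
	}
	// Only i-aaa is eligible to be set to standby.
	fmt.Println(filterInService(batch))
}
```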
@shreyas-badiger Thanks a lot for the detailed explanation, that was really helpful. I will make the change you suggested and test it out on the cluster. If it works well, I'll add a PR here.
About the batch: we have not specified any strategy type, so the if & else-if in upgrade-manager/controllers/upgrade.go Line 419 in 8e0f67d
is not coming into the picture, so the result of CalculateMaxUnavailable is getting used. We have 6 total nodes in 1 ASG and maxUnavailable is 20%. intstr.GetValueFromIntOrPercent is producing 2 instead of 1.2 since it uses "ceil".
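The rounding described above — 20% of 6 nodes is 1.2, which rounds up to a batch size of 2 — can be shown with a plain re-implementation of the ceiling math. This mirrors what `intstr.GetValueFromIntOrPercent` does when asked to round up, but it is a standalone illustration, not the k8s.io/apimachinery helper itself:

```go
package main

import (
	"fmt"
	"math"
)

// maxUnavailable computes ceil(totalNodes * percent / 100), the same rounding
// the comment above observes: a percentage-based maxUnavailable is rounded up,
// so it never produces a zero-sized batch for a nonzero percentage.
func maxUnavailable(totalNodes, percent int) int {
	return int(math.Ceil(float64(totalNodes) * float64(percent) / 100.0))
}

func main() {
	// 6 nodes at 20%: 1.2 rounds up to 2.
	fmt.Println(maxUnavailable(6, 20))
}
```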
@shreyas-badiger, I did testing with the suggested change. There are no errors in the background. The rollout was successful.
@ameyajoshi99 I will merge the PR as soon as the CI is fixed. |
PR - #329 |
@shreyas-badiger Done. |
* fix error 'failed to set instances to stand-by' Signed-off-by: Ameya Joshi <v-ameyaj@zillowgroup.com> Co-authored-by: Ameya Joshi <v-ameyaj@zillowgroup.com> Co-authored-by: Shreyas Badiger <7680410+shreyas-badiger@users.noreply.github.com>
Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
What happened:
We are using release v1.0.4 of upgrade-manager. The manager is not able to complete the rollout, especially for clusters with roughly more than 20 nodes.
We've seen a couple of errors in the logs.
In both cases, these logs start to repeat until the manager fails. Failure occurs after around 1 hour.
When we look at the ASG, the nodes from the logs are in the Standby state. The timestamps of the upgrade-manager logs for setting a node to standby match the times in the ASG, so we can say upgrade-manager did set the nodes to standby. However, upgrade-manager keeps emitting the above logs and is stuck on that error until it fails.
The manager shows the same logs even if we manually drain and delete the node.
What you expected to happen:
Upgrade manager should roll out the nodes.
Environment: