
Upgrade manager stuck in no InService instances. #324

Closed
ameyajoshi99 opened this issue Apr 7, 2022 · 15 comments · Fixed by #329


@ameyajoshi99
Contributor

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened:
We are using release v1.0.4 of upgrade manager. The manager is not able to complete the rollout, especially for clusters with roughly more than 20 nodes.

We've seen a couple of errors in the logs:

  1. failed to set instances to stand-by:
{"level":"info","ts":1649087216.0083492,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-0c485e03bd870299e","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-097374badf1782ccb","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-097374badf1782ccb is not in InService.\n\tstatus code: 400, request id: a477d57c-af4e-44a6-8f3b-f89d711e1f35","name":"upgrade-manager/asg-k8s-1-02022012012062550510000000a"}
  2. no InService instances in the batch:
{"level":"info","ts":1649160753.7560294,"logger":"controllers.RollingUpgrade","msg":"selecting batch for rotation","batch size":1,"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
{"level":"info","ts":1649160753.7560575,"logger":"controllers.RollingUpgrade","msg":"rotating batch","instances":["i-0738c1f7e01cf2ce7"],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
{"level":"info","ts":1649160753.7560735,"logger":"controllers.RollingUpgrade","msg":"no InService instances in the batch","batch":["i-0738c1f7e01cf2ce7"],"instances(InService)":[],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}

In both cases, these logs repeat until the manager fails; failure takes around 1 hour.
When we look at the ASG, the nodes from the logs are in the Standby state. The timestamps in the upgrade manager logs for setting nodes to standby match the times in the ASG, so we can say the upgrade manager did set the nodes to standby. However, the upgrade manager keeps emitting the logs above and stays stuck on that error until it fails.
The manager shows the same logs even if we manually drain and delete the node.

What you expected to happen:
Upgrade manager should roll out the nodes.

Environment:

  • rolling-upgrade-controller version 1.0.4
  • Kubernetes version: 1.21
@eytan-avisror
Contributor

Thanks @ameyajoshi99
We've addressed some issues around LaunchTemplate caching in #322 which is not in the latest release.
Could you try out the :master tag and see if that works better?
We can create a release with this fix if needed.

Also, you mention that all instances are in StandBy. Are there no new instances InService? When an instance is set to StandBy, a new one should automatically launch and become InService.
Can you look at the ASG's activity history to see if there was a failure to launch new instances for some reason?
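For reference, a minimal aws-sdk-go sketch of pulling that activity history (illustrative only; not upgrade-manager code, and the ASG name is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String("my-asg"), // placeholder ASG name
		MaxRecords:           aws.Int64(20),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range out.Activities {
		// A StatusCode of "Failed" or "Cancelled" points at launch problems.
		fmt.Printf("%v %s %s\n", a.StartTime, aws.StringValue(a.StatusCode), aws.StringValue(a.Description))
	}
}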

CC @shreyas-badiger

@shreyas-badiger
Collaborator

shreyas-badiger commented Apr 7, 2022

failed to set instances to standby

This is happening because of a caching issue. In the latest code on master, we flush the EC2 object on every reconcile operation, so you shouldn't see this issue anymore.
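A rough sketch of the idea (the type and helper names here are hypothetical, not upgrade-manager's actual API): rebuild the cloud view at the top of every reconcile instead of reusing a cached one.

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// RollingUpgradeReconciler is a minimal stand-in for the real reconciler type.
type RollingUpgradeReconciler struct{}

// cloudState is a hypothetical per-reconcile snapshot of the ASG and its instances.
type cloudState struct {
	inServiceInstanceIDs []string
}

// discover would re-run DescribeAutoScalingGroups/DescribeInstances (omitted here).
func discover(ctx context.Context, asgName string) (*cloudState, error) {
	return &cloudState{}, nil
}

func (r *RollingUpgradeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fresh snapshot on every reconcile: no stale instance lifecycle states.
	cloud, err := discover(ctx, req.Name)
	if err != nil {
		return ctrl.Result{}, err
	}
	_ = cloud // ... batch selection and rotation operate on this fresh view ...
	return ctrl.Result{}, nil
}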

no InService instances in batch

This message means that all the instances in the batch (those considered for rotation) have already been set to 'StandBy' and are no longer 'InService'. This check was added to avoid corner cases where we would end up setting the same instances to 'StandBy' multiple times, for which the AWS APIs return an error.
So this isn't a final message or a log message of concern. You should look at the next few lines for "New nodes yet to join" or "new instances yet to join".

Bottom line: a launch template rollout could get stuck, fail, or never start because upgrade-manager could be operating on stale data. We have fixed that, and the next release should address this issue.

Please try the latest code from the upgrade-manager master branch, and let us know if you still have the issue.

@eytan-avisror
Contributor

It would be good to add a hotfix release with the latest changes.

@shreyas-badiger
Collaborator

@eytan-avisror agreed. I am OOO today. Will consider doing it tomorrow.

@ameyajoshi99
Contributor Author

@eytan-avisror / @shreyas-badiger Thanks for the response, will try out the fix.

@ameyajoshi99
Contributor Author

I tried out the master branch. It did seem to work for a cluster with 15 nodes.
However, I noticed a few errors in the log. These did not fail the job.

{"level":"info","ts":1650889090.4844146,"logger":"controllers.RollingUpgrade","msg":"***Reconciling***"}
{"level":"info","ts":1650889090.4844556,"logger":"controllers.RollingUpgrade","msg":"operating on existing rolling upgrade","scalingGroup":"asg-3","update strategy":{"type":"randomUpdate","mode":"eager","maxUnavailable":"20%","drainTimeout":300},"name":"upgrade-manager/asg-3-220220330103350910100000024"}
{"level":"info","ts":1650889090.540366,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-abcd","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-pqrs","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-pqrs is not in InService.\n\tstatus code: 400, request id: 99b4774a-cc2d-4a18-9b09-5ae3855cc904","name":"upgrade-manager/asg-3-120220330103350918000000025"}

@shreyas-badiger
Collaborator

Looks like an instance which is not InService (either in a pending or terminating state) is being attempted for StandBy. I shall look into it. Can you share the complete logs in a file?

@ameyajoshi99
Contributor Author

upgrade-manager.log
Hi, attached the log file.

@shreyas-badiger
Collaborator

shreyas-badiger commented Apr 29, 2022

@ameyajoshi99 I looked into the logs. Let me explain what is happening. I am explaining everything in detail so that you also understand and are encouraged to contribute.

NOTE: This comment explains the process; the next comment talks about the error you are facing.

Process:

Flow charts (attached as images; they show the numbered per-batch steps referenced below)

Once a CR is admitted, we start processing the nodes for rotation batch-by-batch. The batch size is determined by maxUnavailable. (In your case, maxUnavailable is set to 20%, which works out to a batch size of 2.)

How do we select a batch?
A batch is selected either uniformly across all AZs or randomly, depending on the update strategy.
Priority is given to instances that are already InProgress, a tag that gets attached to an instance when we process it for the first time.
Something like this:
{"level":"info","ts":1651060486.7610822,"logger":"controllers.RollingUpgrade","msg":"found in-progress instances","instances":["i-abcdefg","i-pqrstu"]}

These steps (shown in the flow charts above) are followed when we are processing a batch. (This describes Eager mode; Lazy mode skips the waiting in steps 3, 4 and 5.)

The last step marks the completion of processing a batch. We repeat the above steps until we reach the final state for the CR, i.e. all the instances in the ASG have the same launch config / launch template as the ASG.
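A minimal sketch of that completion condition using aws-sdk-go's autoscaling types (the helper itself is illustrative, and it ignores "$Latest"/"$Default" version resolution and launch-configuration ASGs):

package controllers

import "github.com/aws/aws-sdk-go/service/autoscaling"

// allInstancesUpToDate reports whether every instance in the group references
// the same launch template ID and version as the group itself.
func allInstancesUpToDate(group *autoscaling.Group) bool {
	want := group.LaunchTemplate
	if want == nil {
		return false // launch-configuration ASGs would be compared by name instead
	}
	for _, inst := range group.Instances {
		lt := inst.LaunchTemplate
		if lt == nil ||
			*lt.LaunchTemplateId != *want.LaunchTemplateId ||
			*lt.Version != *want.Version {
			return false
		}
	}
	return true
}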

@shreyas-badiger
Collaborator

shreyas-badiger commented Apr 29, 2022

Now, talking about the error you are facing:

{"level":"info","ts":1651060486.9737825,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by",
"instances":[{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-pqrstu","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},
"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-abcdefg","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-abcdefg is not in InService.\n\tstatus code: 400, request id: 54bf7c35-c9e3-438f-90ea-ea6b760afd29","name":"upgrade-manager/asg-3-220220330103350910100000024"}

We added this check to make sure step 2 mentioned above doesn't hit errors while setting instances to StandBy. (The AWS API to set an instance to StandBy is allowed only on instances that are InService.)

I am doing the right thing by setting only the instances in the batch that are InService to InProgress.
However, I am not following the same approach when setting instances to StandBy.

Basically the bug is in this line:

....
if err := r.SetBatchStandBy(batchInstanceIDs); err != nil {
...
...

Instead, it should have been:

....
if err := r.SetBatchStandBy(inServiceInstanceIDs); err != nil {
...
...
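For context, inServiceInstanceIDs would be derived by filtering the batch on lifecycle state, roughly like this (a sketch; the exact helper in upgrade-manager may differ):

package controllers

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// inServiceIDs keeps only the instances that the standby API will accept,
// i.e. those whose LifecycleState is "InService".
func inServiceIDs(batch []*autoscaling.Instance) []*string {
	var ids []*string
	for _, inst := range batch {
		if aws.StringValue(inst.LifecycleState) == autoscaling.LifecycleStateInService {
			ids = append(ids, inst.InstanceId)
		}
	}
	return ids
}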

Can you make this change, test it, and send out a PR for this?

@ameyajoshi99
Contributor Author

ameyajoshi99 commented May 2, 2022

@shreyas-badiger Thanks a lot for the detailed explanation, that was really helpful. I will make the change you suggested and test it out on the cluster. If it works well, I'll add a PR here.

About the batch: we have not specified any strategy type, so the if & else-if in

if r.RollingUpgrade.UpdateStrategyType() == v1alpha1.RandomUpdateStrategy {

does not come into the picture.
So the result of CalculateMaxUnavailable gets used.

We have 6 total nodes in 1 ASG and maxUnavailable is 20%. intstr.GetValueFromIntOrPercent produces 2 from 1.2 since it uses "ceil".
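That rounding is easy to verify against the apimachinery helper directly; a minimal runnable check with the same values as the cluster above:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	maxUnavailable := intstr.FromString("20%")
	// roundUp=true applies ceil: 0.2 * 6 = 1.2 -> 2
	batchSize, err := intstr.GetValueFromIntOrPercent(&maxUnavailable, 6, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(batchSize) // prints 2
}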

@ameyajoshi99
Contributor Author

@shreyas-badiger, I tested with the suggested change. There are no errors anymore, and the rollout was successful.
Here is PR: #329
Thanks a lot.

@shreyas-badiger
Collaborator

@ameyajoshi99 I will merge the PR as soon as the CI is fixed.

@shreyas-badiger
Collaborator

PR - #329
@ameyajoshi99 Please fix the DCO.

@ameyajoshi99
Contributor Author

@shreyas-badiger Done.

shreyas-badiger added a commit that referenced this issue May 6, 2022
* fix error 'failed to set instances to stand-by'

Signed-off-by: Ameya Joshi <v-ameyaj@zillowgroup.com>

Co-authored-by: Ameya Joshi <v-ameyaj@zillowgroup.com>
Co-authored-by: Shreyas Badiger <7680410+shreyas-badiger@users.noreply.github.com>