
Upgrade manager stuck in no InService instances. #324

Closed
ameyajoshi99 opened this issue Apr 7, 2022 · 15 comments · Fixed by #329


@ameyajoshi99
Contributor

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened:
We are using release v1.0.4 of upgrade manager. The manager is not able to complete the rollout, especially for clusters with roughly more than 20 nodes.

We've seen a couple of errors in the logs:

  1. failed to set instances to stand-by:
{"level":"info","ts":1649087216.0083492,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-0c485e03bd870299e","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-097374badf1782ccb","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-097374badf1782ccb is not in InService.\n\tstatus code: 400, request id: a477d57c-af4e-44a6-8f3b-f89d711e1f35","name":"upgrade-manager/asg-k8s-1-02022012012062550510000000a"}
  2. no InService instances in the batch:
{"level":"info","ts":1649160753.7560294,"logger":"controllers.RollingUpgrade","msg":"selecting batch for rotation","batch size":1,"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
{"level":"info","ts":1649160753.7560575,"logger":"controllers.RollingUpgrade","msg":"rotating batch","instances":["i-0738c1f7e01cf2ce7"],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
{"level":"info","ts":1649160753.7560735,"logger":"controllers.RollingUpgrade","msg":"no InService instances in the batch","batch":["i-0738c1f7e01cf2ce7"],"instances(InService)":[],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}

In both cases, these logs repeat until the manager fails; failure takes around 1 hour.
When we look at the ASG, the nodes from the logs are in the Standby state. The timestamps in the upgrade manager logs for setting nodes to standby match the times in the ASG, so we can say the upgrade manager did set the nodes to standby. However, the upgrade manager keeps emitting the logs above and stays stuck on that error until it fails.
The manager shows the same logs even if we manually drain and delete the node.

What you expected to happen:
Upgrade manager should roll out the nodes.

Environment:

  • rolling-upgrade-controller version 1.0.4
  • Kubernetes version: 1.21
@eytan-avisror
Contributor

Thanks @ameyajoshi99
We've addressed some issues around LaunchTemplate caching in #322 which is not in the latest release.
Could you try out the :master tag and see if that works better?
We can create a release with this fix if needed.

Also, you mention that all instances are in StandBy. Are there no new instances InService? When an instance is set to StandBy, a new one should automatically launch and become InService.
Can you look at the ASG's activity history to see if there was a failure to launch new instances for some reason?
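For reference, a minimal aws-sdk-go sketch of pulling that activity history (illustrative only; not upgrade-manager code, and the ASG name is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
		AutoScalingGroupName: aws.String("my-asg"), // placeholder ASG name
		MaxRecords:           aws.Int64(20),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range out.Activities {
		// A StatusCode of "Failed" or "Cancelled" points at launch problems.
		fmt.Printf("%v %s %s\n", a.StartTime, aws.StringValue(a.StatusCode), aws.StringValue(a.Description))
	}
}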

CC @shreyas-badiger

@shreyas-badiger
Collaborator

shreyas-badiger commented Apr 7, 2022

failed to set instances to standby

This is happening because of a caching issue. In the latest code on master, we flush the EC2 object on every reconcile operation, so you shouldn't see this issue anymore.
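A rough sketch of the idea (the type and helper names here are hypothetical, not upgrade-manager's actual API): rebuild the cloud view at the top of every reconcile instead of reusing a cached one.

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// RollingUpgradeReconciler is a minimal stand-in for the real reconciler type.
type RollingUpgradeReconciler struct{}

// cloudState is a hypothetical per-reconcile snapshot of the ASG and its instances.
type cloudState struct {
	inServiceInstanceIDs []string
}

// discover would re-run DescribeAutoScalingGroups/DescribeInstances (omitted here).
func discover(ctx context.Context, asgName string) (*cloudState, error) {
	return &cloudState{}, nil
}

func (r *RollingUpgradeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fresh snapshot on every reconcile: no stale instance lifecycle states.
	cloud, err := discover(ctx, req.Name)
	if err != nil {
		return ctrl.Result{}, err
	}
	_ = cloud // ... batch selection and rotation operate on this fresh view ...
	return ctrl.Result{}, nil
}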

no InService instances in batch

This message means that all the instances in the batch (those considered for rotation) have already been set to 'StandBy' and are no longer 'InService'. This check was added to avoid corner cases where we would end up setting the same instances to 'StandBy' multiple times, for which the AWS APIs return an error.
So this isn't a final message or a log message of concern. You should look at the next few lines for "New nodes yet to join" or "new instances yet to join".

Bottom line: a launch template rollout could get stuck, fail, or never start because upgrade-manager could be operating on stale data. We have fixed that, and the next release should address this issue.

Please try the latest code from the upgrade-manager master branch, and let us know if you still have the issue.

@eytan-avisror
Contributor

It would be good to add a hotfix release with the latest changes.

@shreyas-badiger
Collaborator

@eytan-avisror agreed. I am OOO today. Will consider doing it tomorrow.

@ameyajoshi99
Contributor Author

@eytan-avisror / @shreyas-badiger Thanks for the response, will try out the fix.

@ameyajoshi99
Contributor Author

I tried out the master branch. It did seem to work for a cluster with 15 nodes.
However, I noticed a few errors in the log. These did not fail the job.

{"level":"info","ts":1650889090.4844146,"logger":"controllers.RollingUpgrade","msg":"***Reconciling***"}
{"level":"info","ts":1650889090.4844556,"logger":"controllers.RollingUpgrade","msg":"operating on existing rolling upgrade","scalingGroup":"asg-3","update strategy":{"type":"randomUpdate","mode":"eager","maxUnavailable":"20%","drainTimeout":300},"name":"upgrade-manager/asg-3-220220330103350910100000024"}
{"level":"info","ts":1650889090.540366,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-abcd","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-pqrs","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-pqrs is not in InService.\n\tstatus code: 400, request id: 99b4774a-cc2d-4a18-9b09-5ae3855cc904","name":"upgrade-manager/asg-3-120220330103350918000000025"}

@shreyas-badiger
Collaborator

Looks like an instance which is not InService (either in a pending or terminating state) is being attempted for StandBy. I shall look into it. Can you share the complete logs in a file?

@ameyajoshi99
Contributor Author

upgrade-manager.log
Hi, attached the log file.

@shreyas-badiger
Collaborator

shreyas-badiger commented Apr 29, 2022

@ameyajoshi99 I looked into the logs. Let me explain what is happening. I am explaining everything in detail so that you also understand and are encouraged to contribute.

NOTE: This comment explains the process; the next comment talks about the error you are facing.

Process:

Flow charts (attached as images; they show the numbered per-batch steps referenced below)

Once a CR is admitted, we start processing the nodes for rotation batch-by-batch. The batch size is determined by maxUnavailable. (In your case, maxUnavailable is set to 20%, which works out to a batch size of 2.)

How do we select a batch?
A batch is selected either uniformly across all AZs or randomly, depending on the update strategy.
Priority is given to instances that are already InProgress, a tag that gets attached to an instance when we process it for the first time.
Something like this:
{"level":"info","ts":1651060486.7610822,"logger":"controllers.RollingUpgrade","msg":"found in-progress instances","instances":["i-abcdefg","i-pqrstu"]}

These steps (shown in the flow charts above) are followed when we are processing a batch. (This describes Eager mode; Lazy mode skips the waiting in steps 3, 4 and 5.)

The last step marks the completion of processing a batch. We repeat the above steps until we reach the final state for the CR, i.e. all the instances in the ASG have the same launch config / launch template as the ASG.
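A minimal sketch of that completion condition using aws-sdk-go's autoscaling types (the helper itself is illustrative, and it ignores "$Latest"/"$Default" version resolution and launch-configuration ASGs):

package controllers

import "github.com/aws/aws-sdk-go/service/autoscaling"

// allInstancesUpToDate reports whether every instance in the group references
// the same launch template ID and version as the group itself.
func allInstancesUpToDate(group *autoscaling.Group) bool {
	want := group.LaunchTemplate
	if want == nil {
		return false // launch-configuration ASGs would be compared by name instead
	}
	for _, inst := range group.Instances {
		lt := inst.LaunchTemplate
		if lt == nil ||
			*lt.LaunchTemplateId != *want.LaunchTemplateId ||
			*lt.Version != *want.Version {
			return false
		}
	}
	return true
}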

@shreyas-badiger
Collaborator

shreyas-badiger commented Apr 29, 2022

Now, talking about the error you are facing:

{"level":"info","ts":1651060486.9737825,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by",
"instances":[{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-pqrstu","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},
"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-abcdefg","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-abcdefg is not in InService.\n\tstatus code: 400, request id: 54bf7c35-c9e3-438f-90ea-ea6b760afd29","name":"upgrade-manager/asg-3-220220330103350910100000024"}

We added this check to make sure step 2 mentioned above doesn't hit errors while setting instances to StandBy. (The AWS API to set an instance to StandBy is allowed only on instances that are InService.)

I am doing the right thing by setting only the instances in the batch that are InService to InProgress.
However, I am not following the same approach when setting instances to StandBy.

Basically the bug is in this line:

....
if err := r.SetBatchStandBy(batchInstanceIDs); err != nil {
...
...

Instead, it should have been:

....
if err := r.SetBatchStandBy(inServiceInstanceIDs); err != nil {
...
...
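For context, inServiceInstanceIDs would be derived by filtering the batch on lifecycle state, roughly like this (a sketch; the exact helper in upgrade-manager may differ):

package controllers

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// inServiceIDs keeps only the instances that the standby API will accept,
// i.e. those whose LifecycleState is "InService".
func inServiceIDs(batch []*autoscaling.Instance) []*string {
	var ids []*string
	for _, inst := range batch {
		if aws.StringValue(inst.LifecycleState) == autoscaling.LifecycleStateInService {
			ids = append(ids, inst.InstanceId)
		}
	}
	return ids
}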

Can you make this change, test it, and send out a PR for this?

@ameyajoshi99
Contributor Author

ameyajoshi99 commented May 2, 2022

@shreyas-badiger Thanks a lot for the detailed explanation, that was really helpful. I will make the change you suggested and test it out on the cluster. If it works well, I'll add a PR here.

About the batch: we have not specified any strategy type, so the if & else-if in

if r.RollingUpgrade.UpdateStrategyType() == v1alpha1.RandomUpdateStrategy {

does not come into the picture.
So the result of CalculateMaxUnavailable gets used.

We have 6 total nodes in 1 ASG and maxUnavailable is 20%. intstr.GetValueFromIntOrPercent produces 2 from 1.2 since it uses "ceil".
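That rounding is easy to verify against the apimachinery helper directly; a minimal runnable check with the same values as the cluster above:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	maxUnavailable := intstr.FromString("20%")
	// roundUp=true applies ceil: 0.2 * 6 = 1.2 -> 2
	batchSize, err := intstr.GetValueFromIntOrPercent(&maxUnavailable, 6, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(batchSize) // prints 2
}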

@ameyajoshi99
Contributor Author

@shreyas-badiger, I tested with the suggested change. There are no errors anymore, and the rollout was successful.
Here is PR: #329
Thanks a lot.

@shreyas-badiger
Collaborator

@ameyajoshi99 I will merge the PR as soon as the CI is fixed.

@shreyas-badiger
Collaborator

PR - #329
@ameyajoshi99 Please fix the DCO.

@ameyajoshi99
Contributor Author

@shreyas-badiger Done.

shreyas-badiger added a commit that referenced this issue May 6, 2022
* fix error 'failed to set instances to stand-by'

Signed-off-by: Ameya Joshi <v-ameyaj@zillowgroup.com>

Co-authored-by: Ameya Joshi <v-ameyaj@zillowgroup.com>
Co-authored-by: Shreyas Badiger <7680410+shreyas-badiger@users.noreply.github.com>