Issues with health #976

Closed
pintonunes opened this issue Oct 18, 2019 · 32 comments · Fixed by #1000
@pintonunes

Hi,

I've been facing some issues with KUDO health checks.
To my understanding, KUDO checks the resources' health right after sending the request to Kubernetes. In the case of a Deployment it checks:
if obj.Spec.Replicas != nil && obj.Status.ReadyReplicas == *obj.Spec.Replicas

In my case I'm using a KUDO parameter to update an image tag on a deployment.
I did some watches on the Kubernetes API, and on the first changes the object already has the new image, but spec.replicas is still equal to status.readyReplicas, so KUDO marks the plan as COMPLETED.

Does this make any sense to you guys?

I'm guessing this might happen on any deployment change other than the number of replicas.
I think you can replicate this very easily with a simple update plan that changes the image tag of a deployment.
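
To make the race concrete, here is a minimal Go sketch of a readiness check of the shape quoted above (not KUDO's actual code; the package and function name are just for illustration):

```go
package health

import appsv1 "k8s.io/api/apps/v1"

// Naive readiness check of the shape quoted above. Immediately after an
// image-only patch, the pods of the old ReplicaSet are still Ready, so
// Status.ReadyReplicas still equals *Spec.Replicas and this returns true
// before the rollout has even started.
func deploymentLooksHealthy(d *appsv1.Deployment) bool {
	return d.Spec.Replicas != nil && d.Status.ReadyReplicas == *d.Spec.Replicas
}
```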

Environment:

  • Kubernetes version (use kubectl version): 1.14.2
  • Kudo version (use kubectl kudo version): 0.7.2
@gerred
Member

gerred commented Oct 18, 2019

Hey @pintonunes, good to know you're still seeing this in 0.7.2. We shipped 0.7.5, but haven't added any health check changes in the meantime. I'll investigate.

@gerred
Member

gerred commented Oct 18, 2019

@pintonunes and just confirming, the plan doesn't run and your deployment doesn't update, yeah? Or does it just switch the image on the deployment, the deployment starts to do its rollout, and all the while KUDO is reporting COMPLETED instead of IN_PROGRESS?

@pintonunes
Author

pintonunes commented Oct 18, 2019 via email

@gerred
Member

gerred commented Oct 18, 2019

Seems racy. OK, I think I have enough to write a regression test around. Thank you!

@alenkacz
Contributor

@pintonunes yeah, this is potentially a duplicate of #664, right?

@alenkacz alenkacz self-assigned this Oct 22, 2019
@alenkacz
Contributor

I need to be able to depend on 1.16 to do that meaningfully, so I'll wait for the version bump we have in progress anyway. cc @gerred

@pintonunes
Author

Yes, this is a dup of #664.

@pintonunes
Author

Awesome guys! Will test as soon as 0.8.0 comes out.

@gerred
Member

gerred commented Oct 28, 2019

@pintonunes lucky day - I cut https://github.com/kudobuilder/kudo/releases/tag/v0.8.0-pre.0 a little bit ago. It's still raw and not everything is documented yet, but please try it out and give it a shot. I'll keep this issue open until you confirm it works for your needs (and then please close this).

@gerred gerred reopened this Oct 28, 2019
@pintonunes
Author

Guys, I'm afraid this is still not solved. I'm using KUDO 0.8.0 now and:

2019/11/11 21:12:50 InstanceController: Received Reconcile request for instance "p109"
2019/11/11 21:12:50 InstanceController: Nothing to do, no plan in progress for instance personals-poc/p109
2019/11/11 21:17:50 InstanceController: Received Reconcile request for instance "p109"
2019/11/11 21:17:50 Instance: instance personals-poc/p109 has updated parameters from map[started:true version:11.0.606.0] to map[started:true version:11.0.607.0]
2019/11/11 21:17:50 InstanceController: Going to start execution of plan deploy on instance personals-poc/p109
2019/11/11 21:17:50 InstanceController: Going to proceed in execution of active plan deploy on instance personals-poc/p109
2019/11/11 21:17:50 HealthUtil: Unknown type *v1.Pod is marked healthy by default
2019/11/11 21:17:50 HealthUtil: Deployment p109-platform is NOT healthy. Waiting for deployment spec update to be observed...
2019/11/11 21:17:50 TaskExecution: object {
"Namespace": "personals-poc",
"Name": "p109-platform"
} is NOT healthy: Waiting for deployment spec update to be observed...
2019/11/11 21:17:50 PlanExecution: 'deploy-platform' step(s) (instance: personals-poc/p109) of the deploy.deploy-platform are not ready
2019/11/11 21:17:50 InstanceController: Received Reconcile request for instance "p109"
2019/11/11 21:17:50 InstanceController: Going to proceed in execution of active plan deploy on instance personals-poc/p109
2019/11/11 21:17:50 Deployment p109-platform is marked healthy
2019/11/11 21:17:50 HealthUtil: Unknown type *v1.Service is marked healthy by default
2019/11/11 21:17:50 HealthUtil: Unknown type *v1beta1.Ingress is marked healthy by default
2019/11/11 21:17:50 PlanExecution: personals-poc/p109 all phases of the plan deploy are ready
2019/11/11 21:17:50 InstanceController: Received Reconcile request for instance "p109"

If you need more info from my side, just let me know.

@gerred
Member

gerred commented Nov 11, 2019

@pintonunes Do you have a repo that we can use to reproduce this?

@pintonunes
Author

Let me do some tests on my side to be able to provide you something to work on.
From my perspective this is a racy thing. Sometimes the plan waits, sometimes it doesn't.
I think this is related to the readiness probe. I have a readiness probe running every 15s.
Sometimes I can see Kubernetes marking the deployment as not ready immediately; sometimes it takes a little more time.
Probably when it takes more time is when KUDO sees the deployment as ready... can't tell for now.
Does this make any sense?

@gerred
Member

gerred commented Nov 11, 2019

I think so. I think we'll need to write whatever that flaky test might end up being, or run it through a debug build a few times. I'll go look through our deployment health check to see what's happening there.

@alenkacz
Contributor

@pintonunes it would be awesome to have some small thing that helps us reproduce. Right now we're using the same code for deployment readiness as kubectl does, so I would hope that's already pretty solid, but of course everything can be buggy :)

@pintonunes
Author

Sure.. Let me find some time for that.

@porridge
Member

I noticed we're passing revision=0 to DeploymentStatusViewer.Status, so could it be that we're simply seeing the original deployment object revision, effectively racing with the deployment controller before it has started to act on the change?

@porridge
Member

@pintonunes it would be great if you could just dump your operator as is, censoring out pieces that you are not comfortable sharing. This would allow us to attempt reproducing your issue.

@alenkacz
Contributor

alenkacz commented Nov 15, 2019

https://github.com/kubernetes/kubernetes/blob/75d51896125b1cafd9e995212b02ccda9b8a0aed/staging/src/k8s.io/kubectl/pkg/polymorphichelpers/rollout_status.go#L67

Oh yeah, @porridge, I think you might be right there. After checking the code again and the way kubectl uses it (https://github.com/kubernetes/kubectl/blob/7b01e2757cc74b1145976726b05cc1108ad2911d/pkg/cmd/rollout/rollout_status.go), if we don't put the revision there then we might be asserting health on the previous one. I need to check when the revision is actually created, but I assume it's done asynchronously via some controller, so it will most probably be racy.

I originally assumed from this https://github.com/kubernetes/kubectl/blob/7b01e2757cc74b1145976726b05cc1108ad2911d/pkg/cmd/rollout/rollout_status.go#L116 that the latest revision is fine, since we're interested in the latest :)

@alenkacz
Contributor

okay, this is where the new revision gets created https://github.com/kubernetes/kubernetes/blob/6a19261e96b12c83e6c69d15e2dcca8089432838/pkg/controller/deployment/sync.go#L143

Looks like we might need KUDO to know what revision we are rolling to, so potentially at the time of execution we could store the target revision in some annotation?

@porridge
Member

Hm, it's not clear to me how to provide this revision either... in kubectl it needs to be provided via a command-line flag to be non-zero.

Maybe we should actually be looking at deployment.Generation or deployment.Status.ObservedGeneration and comparing it against the generation returned at the time of updating the deployment? Either way, logging that value might help us debug this.
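
A rough sketch of that comparison (hypothetical helper, not existing KUDO code; appliedGeneration would come from the response when the deployment update was persisted):

```go
package health

import appsv1 "k8s.io/api/apps/v1"

// appliedGeneration is assumed to be the metadata.generation the API server
// returned when the updated Deployment was persisted (it is bumped
// synchronously on spec changes). The deployment controller copies the
// generation it has acted on into status.observedGeneration, so once this
// holds, the controller has at least seen the new spec.
func specUpdateObserved(d *appsv1.Deployment, appliedGeneration int64) bool {
	return d.Status.ObservedGeneration >= appliedGeneration
}
```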

@alenkacz
Contributor

@porridge hmm, I did a little bit more digging.

So revision seems to be a number very tightly related to the actual rollout - if your change does not require pod restarts, starting new pods, or killing pods, the revision stays the same and so does the generation. I tested this by updating Metadata.Label on my deployment.

As for generation, that is actually a number set by the API server, which is good because that's synchronous. Also, the generation changes only if the spec changes, but a change in the spec does not mean that some pods will be rolled out (you can test this by changing the revisionHistoryLimit field in the spec, which increments the generation but does not increment the revision).

So with all that, what I am kind of leaning toward is:

  • apply deployment change
  • read the value of generation observed after persisting
  • then the health check should wait until observedGeneration >= that value; when that condition is true, I think it should be guaranteed that the deployment controller has already seen the new deployment. At that point, we can read the actual value of the revision and verify the deployment is done as we do right now (see the sketch below).

WDYT? :)
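
A minimal sketch of that flow, assuming a recent client-go client; the applyAndWaitForDeployment helper is hypothetical and this is not KUDO's actual implementation:

```go
package health

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// applyAndWaitForDeployment illustrates the proposal above: apply the change,
// remember the generation the API server assigned synchronously, wait until
// the deployment controller has observed that generation, and only then run
// the usual rollout checks.
func applyAndWaitForDeployment(ctx context.Context, c kubernetes.Interface, d *appsv1.Deployment) error {
	updated, err := c.AppsV1().Deployments(d.Namespace).Update(ctx, d, metav1.UpdateOptions{})
	if err != nil {
		return err
	}
	target := updated.Generation // bumped synchronously by the API server on spec changes

	return wait.PollImmediate(time.Second, 5*time.Minute, func() (bool, error) {
		cur, err := c.AppsV1().Deployments(d.Namespace).Get(ctx, d.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Until the controller reports it has seen the new spec, any
		// replica-count check would still be looking at the old rollout.
		if cur.Status.ObservedGeneration < target {
			return false, nil
		}
		// Now the usual rollout checks are safe to evaluate.
		done := cur.Spec.Replicas != nil &&
			cur.Status.UpdatedReplicas == *cur.Spec.Replicas &&
			cur.Status.ReadyReplicas == *cur.Spec.Replicas &&
			cur.Status.Replicas == *cur.Spec.Replicas
		return done, nil
	})
}
```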

@pintonunes
Author

pintonunes commented Nov 19, 2019

@alenkacz, sorry but I didn't have time to give you a working example where this happens.
Anyway, my operator is a very simple one. The only things to mention are that it has an initContainer, it uses Windows containers, and I have a readiness probe that runs a command. The only operations we do on the deployment are:

  • Changing the container tag (using a parameter on the operatorVersion). BTW, this is where the issue happens more often.
  • Scaling the replica count to 0. The issue sometimes also happens here when we scale back to 1.

At the creation of the deployment I have never seen the issue.

Anyway, I'm available to test any build you have that might fix the issue, and also for a WebEx if you want to see it happening live.

@porridge
Member

porridge commented Dec 3, 2019

@alenkacz I had misconfigured my email filters and only saw your message today. What you wrote SGTM.

However, it seems as if it's not just us who need this logic, so perhaps there is code somewhere in k8s already that implements this algorithm?

@alenkacz
Contributor

alenkacz commented Dec 3, 2019

@porridge yeah, I wonder if that exists... I did not find it though, feel free to fish around :)

@alenkacz alenkacz assigned alenkacz and unassigned alenkacz Dec 16, 2019
@alenkacz
Contributor

alenkacz commented Dec 16, 2019

@pintonunes can you provide KUDO controller logs from a situation where the health check does not work?

I spent some time trying to reproduce this and was not able to. I even tried scaling down:

  • had one operator with 5 replicas in deployment
  • upgraded to a deployment with 1 instance
  • the deployment scaled down and KUDO waited for that to happen, see attached logs...

I was testing on master (which is 0.9.0), but that hopefully should not make a difference, as we did not change the behavior as far as I am aware. The Generation log lines are extra, as I wanted to see whether we're seeing the right generation of the object inside the health module.

2019/12/16 15:51:44 InstanceController: Going to start execution of plan upgrade on instance default/i1
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment spec update to be observed...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment spec update to be observed...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:45 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:45 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:45 Generation: 2
2019/12/16 15:51:45 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:45 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:45 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:45 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:45 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:45 Generation: 2
2019/12/16 15:51:45 Deployment upgrade-deployment is marked healthy
2019/12/16 15:51:45 PlanExecution: default/i1 all phases of the plan upgrade are ready
2019/12/16 15:51:45 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:45 InstanceController: Nothing to do, no plan in progress for instance default/i1

@pintonunes
Author

Hi @alenkacz, let me try with 0.9.0 and send you the logs if it fails.

@alenkacz
Contributor

@pintonunes thanks, that would be helpful. Even 0.8.0 logs might be enough though; I just wonder what the health module was saying when marking your deployment as done.

@alenkacz
Contributor

alenkacz commented Dec 19, 2019

Okay, I think with #1187 I definitely fixed some issues we had when evaluating health. Do you by any chance have the operator you can share for testing? If not, that PR will probably merge tomorrow, so you could test it then if you have time :) @pintonunes

@alenkacz
Contributor

This is fixed on master and will be released in 0.10.0; please reopen if you can still see the issue. If you have an operator to reproduce it with, I would love to try that :-)

@pintonunes
Author

pintonunes commented Dec 20, 2019 via email

@alenkacz
Contributor

No pressure, send it over in January. I am happy to reopen if there are still more issues :)
