Issues with health #976

Closed
pintonunes opened this issue Oct 18, 2019 · 32 comments · Fixed by #1000
@pintonunes

Hi,

I've been facing some issues with KUDO health checks.
To my understanding, KUDO checks the resources' health right after sending the request to Kubernetes. In the case of a Deployment it checks:
if obj.Spec.Replicas != nil && obj.Status.ReadyReplicas == *obj.Spec.Replicas

In my case I'm using a KUDO parameter to update an image tag on a deployment.
I did some watches on the Kubernetes API, and on the first changes the object already has the new image, but spec.replicas is still equal to status.readyReplicas, so KUDO marks the plan as COMPLETED.

Does this make any sense to you guys?

I'm guessing this might happen on any deployment change other than the number of replicas.
I think you can replicate this very easily with a simple update plan that changes the image tag of a deployment.
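
To make the race concrete, here is a minimal Go sketch of a readiness check of the shape quoted above (not KUDO's actual code; the package and function name are just for illustration):

```go
package health

import appsv1 "k8s.io/api/apps/v1"

// Naive readiness check of the shape quoted above. Immediately after an
// image-only patch, the pods of the old ReplicaSet are still Ready, so
// Status.ReadyReplicas still equals *Spec.Replicas and this returns true
// before the rollout has even started.
func deploymentLooksHealthy(d *appsv1.Deployment) bool {
	return d.Spec.Replicas != nil && d.Status.ReadyReplicas == *d.Spec.Replicas
}
```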

Environment:

  • Kubernetes version (use kubectl version): 1.14.2
  • Kudo version (use kubectl kudo version): 0.7.2
@gerred
Member

gerred commented Oct 18, 2019

Hey @pintonunes, good to know you're still seeing this in 0.7.2. We shipped 0.7.5, but haven't added any health check changes in the meantime. I'll investigate.

@gerred
Member

gerred commented Oct 18, 2019

@pintonunes and just confirming, the plan doesn't run and your deployment doesn't update, yeah? Or does it just switch the image on the deployment, the deployment starts to do its rollout, and all the while KUDO is reporting COMPLETED instead of IN_PROGRESS?

@pintonunes
Author

pintonunes commented Oct 18, 2019 via email

@gerred
Member

gerred commented Oct 18, 2019

Seems racy. OK, I think I have enough to write a regression test around. Thank you!

@alenkacz
Contributor

@pintonunes yeah, this is potentially a duplicate of #664, right?

@alenkacz alenkacz self-assigned this Oct 22, 2019
@alenkacz
Contributor

I need to be able to depend on 1.16 to do that meaningfully, so I'll wait for the version bump we have in progress anyway. cc @gerred

@pintonunes
Author

Yes, this is a dup of #664.

@pintonunes
Author

Awesome guys! Will test as soon as 0.8.0 comes out.

@gerred
Member

gerred commented Oct 28, 2019

@pintonunes lucky day - I cut https://github.com/kudobuilder/kudo/releases/tag/v0.8.0-pre.0 a little bit ago. It's still raw and not everything is documented yet, but please try it out and give it a shot. I'll keep this issue open until you confirm it works for your needs (and then please close this).

@gerred gerred reopened this Oct 28, 2019
@pintonunes
Author

Guys, I'm afraid this is still not solved. I'm using KUDO 0.8.0 now and:

2019/11/11 21:12:50 InstanceController: Received Reconcile request for instance "p109"
2019/11/11 21:12:50 InstanceController: Nothing to do, no plan in progress for instance personals-poc/p109
2019/11/11 21:17:50 InstanceController: Received Reconcile request for instance "p109"
2019/11/11 21:17:50 Instance: instance personals-poc/p109 has updated parameters from map[started:true version:11.0.606.0] to map[started:true version:11.0.607.0]
2019/11/11 21:17:50 InstanceController: Going to start execution of plan deploy on instance personals-poc/p109
2019/11/11 21:17:50 InstanceController: Going to proceed in execution of active plan deploy on instance personals-poc/p109
2019/11/11 21:17:50 HealthUtil: Unknown type *v1.Pod is marked healthy by default
2019/11/11 21:17:50 HealthUtil: Deployment p109-platform is NOT healthy. Waiting for deployment spec update to be observed...
2019/11/11 21:17:50 TaskExecution: object {
"Namespace": "personals-poc",
"Name": "p109-platform"
} is NOT healthy: Waiting for deployment spec update to be observed...
2019/11/11 21:17:50 PlanExecution: 'deploy-platform' step(s) (instance: personals-poc/p109) of the deploy.deploy-platform are not ready
2019/11/11 21:17:50 InstanceController: Received Reconcile request for instance "p109"
2019/11/11 21:17:50 InstanceController: Going to proceed in execution of active plan deploy on instance personals-poc/p109
2019/11/11 21:17:50 Deployment p109-platform is marked healthy
2019/11/11 21:17:50 HealthUtil: Unknown type *v1.Service is marked healthy by default
2019/11/11 21:17:50 HealthUtil: Unknown type *v1beta1.Ingress is marked healthy by default
2019/11/11 21:17:50 PlanExecution: personals-poc/p109 all phases of the plan deploy are ready
2019/11/11 21:17:50 InstanceController: Received Reconcile request for instance "p109"

If you need more info from my side, just let me know.

@gerred
Member

gerred commented Nov 11, 2019

@pintonunes Do you have a repo that we can use to reproduce this?

@pintonunes
Author

Let me do some tests on my side to be able to provide you something to work on.
From my perspective this is a racy thing. Sometimes the plan waits, sometimes it doesn't.
I think this is related to the readiness probe. I have a readiness probe running every 15s.
Sometimes I can see Kubernetes marking the deployment as not ready immediately; sometimes it takes a little more time.
Probably when it takes more time is when KUDO sees the deployment as ready... can't tell for now.
Does this make any sense?

@gerred
Member

gerred commented Nov 11, 2019

I think so. I think we'll need to write whatever that flaky test might end up being, or run it through a debug build a few times. I'll go look through our deployment health check to see what's happening there.

@alenkacz
Contributor

@pintonunes it would be awesome to have some small thing that helps us reproduce. Right now we're using the same code for deployment readiness as kubectl does, so I would hope that's already pretty solid, but of course everything can be buggy :)

@pintonunes
Author

Sure.. Let me find some time for that.

@porridge
Member

I noticed we're passing revision=0 to DeploymentStatusViewer.Status, so could it be that we're simply seeing the original deployment object revision, effectively racing with the deployment controller before it has started to act on the change?

@porridge
Member

@pintonunes it would be great if you could just dump your operator as is, censoring out pieces that you are not comfortable sharing. This would allow us to attempt reproducing your issue.

@alenkacz
Contributor

alenkacz commented Nov 15, 2019

https://github.com/kubernetes/kubernetes/blob/75d51896125b1cafd9e995212b02ccda9b8a0aed/staging/src/k8s.io/kubectl/pkg/polymorphichelpers/rollout_status.go#L67

Oh yeah, @porridge, I think you might be right there. After checking the code again and the way kubectl uses it (https://github.com/kubernetes/kubectl/blob/7b01e2757cc74b1145976726b05cc1108ad2911d/pkg/cmd/rollout/rollout_status.go), if we don't put the revision there then we might be asserting health on the previous one. I need to check when the revision is actually created, but I assume it's done asynchronously via some controller, so it will most probably be racy.

I originally assumed from this https://github.com/kubernetes/kubectl/blob/7b01e2757cc74b1145976726b05cc1108ad2911d/pkg/cmd/rollout/rollout_status.go#L116 that the latest revision is fine, since we're interested in the latest :)

@alenkacz
Contributor

okay, this is where the new revision gets created https://github.com/kubernetes/kubernetes/blob/6a19261e96b12c83e6c69d15e2dcca8089432838/pkg/controller/deployment/sync.go#L143

Looks like we might need KUDO to know what revision we are rolling to, so potentially at the time of execution we could store the target revision in some annotation?

@porridge
Member

Hm, it's not clear to me how to provide this revision either... in kubectl it needs to be provided via a command-line flag to be non-zero.

Maybe we should actually be looking at deployment.Generation or deployment.Status.ObservedGeneration and comparing it against the generation returned at the time of updating the deployment? Either way, logging that value might help us debug this.
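
A rough sketch of that comparison (hypothetical helper, not existing KUDO code; appliedGeneration would come from the response when the deployment update was persisted):

```go
package health

import appsv1 "k8s.io/api/apps/v1"

// appliedGeneration is assumed to be the metadata.generation the API server
// returned when the updated Deployment was persisted (it is bumped
// synchronously on spec changes). The deployment controller copies the
// generation it has acted on into status.observedGeneration, so once this
// holds, the controller has at least seen the new spec.
func specUpdateObserved(d *appsv1.Deployment, appliedGeneration int64) bool {
	return d.Status.ObservedGeneration >= appliedGeneration
}
```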

@alenkacz
Contributor

@porridge hmm, I did a little bit more digging.

So revision seems to be a number very tightly related to the actual rollout - if your change does not require pod restarts, starting new pods, or killing pods, the revision stays the same and so does the generation. I tested this by updating Metadata.Label on my deployment.

As for generation, that is actually a number set by the API server, which is good because that's synchronous. Also, the generation changes only if the spec changes, but a change in the spec does not mean that some pods will be rolled out (you can test this by changing the revisionHistoryLimit field in the spec, which increments the generation but does not increment the revision).

So with all that, what I am kind of leaning toward is:

  • apply deployment change
  • read the value of generation observed after persisting
  • then the health check should wait until observedGeneration >= that value; when that condition is true, I think it should be guaranteed that the deployment controller has already seen the new deployment. At that point, we can read the actual value of the revision and verify the deployment is done as we do right now (see the sketch below).

WDYT? :)
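
A minimal sketch of that flow, assuming a recent client-go client; the applyAndWaitForDeployment helper is hypothetical and this is not KUDO's actual implementation:

```go
package health

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// applyAndWaitForDeployment illustrates the proposal above: apply the change,
// remember the generation the API server assigned synchronously, wait until
// the deployment controller has observed that generation, and only then run
// the usual rollout checks.
func applyAndWaitForDeployment(ctx context.Context, c kubernetes.Interface, d *appsv1.Deployment) error {
	updated, err := c.AppsV1().Deployments(d.Namespace).Update(ctx, d, metav1.UpdateOptions{})
	if err != nil {
		return err
	}
	target := updated.Generation // bumped synchronously by the API server on spec changes

	return wait.PollImmediate(time.Second, 5*time.Minute, func() (bool, error) {
		cur, err := c.AppsV1().Deployments(d.Namespace).Get(ctx, d.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Until the controller reports it has seen the new spec, any
		// replica-count check would still be looking at the old rollout.
		if cur.Status.ObservedGeneration < target {
			return false, nil
		}
		// Now the usual rollout checks are safe to evaluate.
		done := cur.Spec.Replicas != nil &&
			cur.Status.UpdatedReplicas == *cur.Spec.Replicas &&
			cur.Status.ReadyReplicas == *cur.Spec.Replicas &&
			cur.Status.Replicas == *cur.Spec.Replicas
		return done, nil
	})
}
```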

@pintonunes
Author

pintonunes commented Nov 19, 2019

@alenkacz, sorry but I didn't have time to give you a working example where this happens.
Anyway, my operator is a very simple one. The only things to mention are that it has an initContainer, it uses Windows containers, and I have a readiness probe that runs a command. The only operations we do on the deployment are:

  • Changing the container tag (using a parameter on the operatorVersion). BTW, this is where the issue happens more often.
  • Scaling the replica count to 0. The issue sometimes also happens here when we scale back to 1.

At the creation of the deployment I have never seen the issue.

Anyway, I'm available to test any build you have that might fix the issue, and also for a WebEx if you want to see it happening live.

@porridge
Member

porridge commented Dec 3, 2019

@alenkacz I had misconfigured my email filters and only saw your message today. What you wrote SGTM.

However, it seems as if it's not just us who need this logic, so perhaps there is code somewhere in k8s already that implements this algorithm?

@alenkacz
Contributor

alenkacz commented Dec 3, 2019

@porridge yeah, I wonder if that exists... I did not find it though, feel free to fish around :)

@alenkacz alenkacz assigned alenkacz and unassigned alenkacz Dec 16, 2019
@alenkacz
Contributor

alenkacz commented Dec 16, 2019

@pintonunes can you provide KUDO controller logs from a situation where the health check does not work?

I spent some time trying to reproduce this and was not able to. I even tried scaling down:

  • had one operator with 5 replicas in deployment
  • upgraded to a deployment with 1 instance
  • the deployment scaled down and KUDO waited for that to happen, see attached logs...

I was testing on master (which is 0.9.0), but that hopefully should not make a difference, as we did not change the behavior as far as I am aware. The Generation log lines are extra, as I wanted to see whether we're seeing the right generation of the object inside the health module.

2019/12/16 15:51:44 InstanceController: Going to start execution of plan upgrade on instance default/i1
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment spec update to be observed...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment spec update to be observed...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 0 out of 1 new replicas have been updated...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:44 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:44 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:44 Generation: 2
2019/12/16 15:51:44 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:44 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:45 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:45 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:45 Generation: 2
2019/12/16 15:51:45 HealthUtil: Deployment upgrade-deployment is NOT healthy. Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:45 TaskExecution: object default/upgrade-deployment is NOT healthy: Waiting for deployment "upgrade-deployment" rollout to finish: 1 old replicas are pending termination...
2019/12/16 15:51:45 PlanExecution: 'everything' step(s) (instance: default/i1) of the upgrade.upgrade are not ready
2019/12/16 15:51:45 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:45 InstanceController: Going to proceed in execution of active plan upgrade on instance default/i1
2019/12/16 15:51:45 Generation: 2
2019/12/16 15:51:45 Deployment upgrade-deployment is marked healthy
2019/12/16 15:51:45 PlanExecution: default/i1 all phases of the plan upgrade are ready
2019/12/16 15:51:45 InstanceController: Received Reconcile request for instance "i1"
2019/12/16 15:51:45 InstanceController: Nothing to do, no plan in progress for instance default/i1

@pintonunes
Author

Hi @alenkacz, let me try with 0.9.0 and send you the logs if it fails.

@alenkacz
Contributor

@pintonunes thanks, that would be helpful. Even 0.8.0 logs might be enough though; I just wonder what the health module was saying when marking your deployment as done.

@alenkacz
Contributor

alenkacz commented Dec 19, 2019

Okay, I think with #1187 I definitely fixed some issues we had when evaluating health. Do you by any chance have the operator you can share for testing? If not, that PR will probably merge tomorrow, so you could test it then if you have time :) @pintonunes

@alenkacz
Contributor

This is fixed on master and will be released in 0.10.0; please reopen if you can still see the issue. If you have an operator to reproduce it with, I would love to try that :-)

@pintonunes
Author

pintonunes commented Dec 20, 2019 via email

@alenkacz
Contributor

No pressure, send it over in January. I am happy to reopen if there are still more issues :)
