
Wait for pod to exist up to 60 seconds before erroring #155

Merged
merged 1 commit into jenkinsci:master from chancez:fix_pod_query on May 4, 2017

Conversation

chancez
Contributor

@chancez chancez commented Apr 11, 2017

Most actions that create resources in Kubernetes are asynchronous: the controller-manager needs to update resources and perform extra work asynchronously after an API call is made to create a given resource. As a result, it is possible to do a POST + GET and not see the resource just POSTed.

The current code incorrectly assumed that if the CreatePod API call succeeded, the pod would exist by the time we wait for containers to become ready. This pull request updates the logic to query for the pod multiple times before deciding it doesn't exist.
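
Roughly, the change amounts to a retry loop like the sketch below. This is illustrative rather than the exact diff; it assumes the fabric8 KubernetesClient the plugin already uses, and the class/method names and the 60s/1s constants are just for the example.

```java
// Sketch only: poll for the pod instead of trusting a single GET right after the POST.
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class PodWaiter {
    static Pod waitForPod(KubernetesClient client, String namespace, String name)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + 60_000L; // wait up to 60 seconds
        while (System.currentTimeMillis() < deadline) {
            Pod pod = client.pods().inNamespace(namespace).withName(name).get();
            if (pod != null) {
                return pod; // the pod we just POSTed is now visible via GET
            }
            Thread.sleep(1_000L); // brief pause before querying again
        }
        throw new IllegalStateException("Pod " + namespace + "/" + name + " never appeared");
    }
}
```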

@iocanel
Contributor

iocanel commented Apr 12, 2017

The ContainerExecDecorator is used as part of the container step. This step is meant to be called from within a node that corresponds to a kubernetes pod.

So, the only cases I can see where the pod would not be there are:

  • manual deletion
  • wrong usage of container

In both cases, waiting won't change things.

Am I missing something here?
Does this pull request actually fix the issue you have?
Can you please share your pipeline, so that we can better understand your use case?

@chancez
Contributor Author

chancez commented Apr 12, 2017

I am using the pipeline steps correctly: I am calling the container step within my node step, which is within my podTemplate step.

This definitely fixes the issues I am experiencing; without this patch my pipeline will not succeed on some of my clusters. The error I was getting is here: https://gist.github.com/chancez/bd80345f8e84b2bc79b70c824f2e7cea

I think you are missing something, which is that Kubernetes works asynchronously, and thus it's possible that after creating a pod, it is not immediately visible. The creation is asynchronous, so if you create it and query for it right afterwards, it may not have shown up yet.

In the case of the error, I am seeing that the pod is created, both containers start and are ready, but the plugin has failed because it didn't see the pod by the time it checked. The same pipeline worked on current clusters running Kubernetes 1.4, but not on my newest cluster running Kubernetes 1.5.5. I can literally launch my job, exec into both pods, and watch the logs, but Jenkins will fail, and my pod will get killed by the plugin after the job fails.

With this PR, I see it query once and not find the pod, and then the 2nd time it queries, the pod does exist. I'm able to see this using the log statements I added in this PR. It generally only takes 2 tries before the pod is visible, so it's almost certainly just a race condition between create + get.
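
To make the race concrete, this is roughly what the sequence looks like with the fabric8 client. It is illustrative only, not the plugin's exact code, and the class/method names are made up for the example:

```java
// Illustrative only: a successful create does not guarantee the very next GET sees the pod.
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class CreateGetRace {
    static void demonstrate(KubernetesClient client, String ns, Pod podSpec) {
        Pod created = client.pods().inNamespace(ns).create(podSpec);      // POST succeeds
        Pod seen = client.pods().inNamespace(ns)
                .withName(created.getMetadata().getName()).get();         // GET can still be null here
        System.out.println("visible immediately after create? " + (seen != null));
        // A second GET a moment later returns the pod, which is why retrying fixes the failure.
    }
}
```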

This is my pipeline: https://gist.github.com/chancez/b033ea166e5dea7a3862beb1ff86eb7d

@syndesisci

The creation is asynchronous, so if you create it and query for it right afterwards, it may not have shown up yet.

If the pod is not up and running I can't see how anything within the node block will even run. Unless of course we have a bug or something within the ProvisionCallback, but even then I am not so sure.

@carlossg: Your thoughts?

@chancez
Contributor Author

chancez commented Apr 12, 2017

You're right, I'm a bit unclear myself after looking at the actual slave provisioning code + the callback. It looks pretty much correct to me, but I'm definitely seeing that the query at https://github.com/jenkinsci/kubernetes-plugin/pull/155/files#diff-8a1ab34452b01252e89ffeee0b2eda45R163 returns null the first time, basically always.

One thing that I noticed is that https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L655-L658 sleeps before it even tries to query the pod, so I bet you would see a null pod there if you moved the sleep to after the query. Not that it changes anything in this scenario, but I bet that sleep is hiding a race in some cases, so I wouldn't be surprised if there were more races.
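
As a sketch of what I mean about the ordering (not the actual KubernetesCloud code; the class name and probe method are hypothetical): a query made before the sleep is where a null pod would show up, while an unconditional sleep before the first query tends to hide that race.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class SleepOrdering {
    static void probe(KubernetesClient client, String ns, String name)
            throws InterruptedException {
        Pod beforeSleep = client.pods().inNamespace(ns).withName(name).get();
        System.out.println("visible before sleep? " + (beforeSleep != null)); // often false right after create
        Thread.sleep(1_000L);
        Pod afterSleep = client.pods().inNamespace(ns).withName(name).get();
        System.out.println("visible after sleep?  " + (afterSleep != null));  // usually true by now
    }
}
```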

@carlossg
Contributor

I think it is reasonable to wait until the pod is not null, if k8s returns that.

@ladventure

When I use the kubernetes cloud to build, why does it need at least 8s, waiting 6s until k8s creates the pod?

@chancez
Contributor Author

chancez commented Apr 27, 2017

I've learned a little more about how the API servers work, and found more evidence of why this might be happening. Basically, each API server does a lot of caching of the results from etcd, and running multiple API servers is what makes this more likely to happen: the resource may exist after creation but not yet be in the cache of every API server, which causes some delay in seeing resources despite them existing.

@carlossg carlossg merged commit 873f343 into jenkinsci:master May 4, 2017
@chancez chancez deleted the fix_pod_query branch August 5, 2017 01:46