
Wait for pod to exist up to 60 seconds before erroring #155

Merged
merged 1 commit into jenkinsci:master from chancez:fix_pod_query on May 4, 2017

Conversation

chancez
Contributor

@chancez chancez commented Apr 11, 2017

Most actions that create resources in Kubernetes are asynchronous: the controller-manager needs to update resources and perform extra work asynchronously after an API call is made to create a given resource. As a result, it is possible to do a POST + GET and not see the resource just POSTed.

The current code incorrectly assumed that if the CreatePod API call succeeded, the pod would exist by the time we wait for containers to become ready. This pull request updates the logic to query for the pod multiple times before deciding it doesn't exist.
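
Roughly, the change amounts to a retry loop like the sketch below. This is illustrative rather than the exact diff; it assumes the fabric8 KubernetesClient the plugin already uses, and the class/method names and the 60s/1s constants are just for the example.

```java
// Sketch only: poll for the pod instead of trusting a single GET right after the POST.
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class PodWaiter {
    static Pod waitForPod(KubernetesClient client, String namespace, String name)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + 60_000L; // wait up to 60 seconds
        while (System.currentTimeMillis() < deadline) {
            Pod pod = client.pods().inNamespace(namespace).withName(name).get();
            if (pod != null) {
                return pod; // the pod we just POSTed is now visible via GET
            }
            Thread.sleep(1_000L); // brief pause before querying again
        }
        throw new IllegalStateException("Pod " + namespace + "/" + name + " never appeared");
    }
}
```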

@iocanel
Contributor

iocanel commented Apr 12, 2017

The ContainerExecDecorator is used as part of the container step. This step is meant to be called from within a node that corresponds to a kubernetes pod.

So, the only cases I can see where the pod would not be there are:

  • manual deletion
  • wrong usage of container

In both cases, waiting won't change things.

Am I missing something here?
Does this pull request actually fix the issue you have?
Can you please share your pipeline, so that we can better understand your use case?

@chancez
Contributor Author

chancez commented Apr 12, 2017

I am using the pipeline steps correctly: I am calling the container step within my node step, which is within my podTemplate step.

This definitely fixes the issues I am experiencing; without this patch my pipeline will not succeed on some of my clusters. The error I was getting is here: https://gist.github.com/chancez/bd80345f8e84b2bc79b70c824f2e7cea

I think you are missing something, which is that Kubernetes works asynchronously, and thus it's possible that after creating a pod, it is not immediately visible. The creation is asynchronous, so if you create it and query for it right afterwards, it may not have shown up yet.

In the case of the error, I am seeing that the pod is created, both containers start and are ready, but the plugin has failed because it didn't see the pod by the time it checked. The same pipeline worked on current clusters running Kubernetes 1.4, but not on my newest cluster running Kubernetes 1.5.5. I can literally launch my job, exec into both pods, and watch the logs, but Jenkins will fail, and my pod will get killed by the plugin after the job fails.

With this PR, I see it query once and not find the pod, and then the 2nd time it queries, the pod does exist. I'm able to see this using the log statements I added in this PR. It generally only takes 2 tries before the pod is visible, so it's almost certainly just a race condition between create + get.
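
To make the race concrete, this is roughly what the sequence looks like with the fabric8 client. It is illustrative only, not the plugin's exact code, and the class/method names are made up for the example:

```java
// Illustrative only: a successful create does not guarantee the very next GET sees the pod.
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class CreateGetRace {
    static void demonstrate(KubernetesClient client, String ns, Pod podSpec) {
        Pod created = client.pods().inNamespace(ns).create(podSpec);      // POST succeeds
        Pod seen = client.pods().inNamespace(ns)
                .withName(created.getMetadata().getName()).get();         // GET can still be null here
        System.out.println("visible immediately after create? " + (seen != null));
        // A second GET a moment later returns the pod, which is why retrying fixes the failure.
    }
}
```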

This is my pipeline: https://gist.github.com/chancez/b033ea166e5dea7a3862beb1ff86eb7d

@syndesisci

The creation is asynchronous, so if you create it and query for it right afterwards, it may not have shown up yet.

If the pod is not up and running I can't see how anything within the node block will even run. Unless of course we have a bug or something within the ProvisionCallback, but even then I am not so sure.

@carlossg: Your thoughts?

@chancez
Contributor Author

chancez commented Apr 12, 2017

You're right, I'm a bit unclear myself after looking at the actual slave provisioning code + the callback. It looks pretty much correct to me, but I'm definitely seeing that the query at https://github.com/jenkinsci/kubernetes-plugin/pull/155/files#diff-8a1ab34452b01252e89ffeee0b2eda45R163 returns null the first time, basically always.

One thing that I noticed is that https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L655-L658 sleeps before it even tries to query the pod, so I bet you would see a null pod there if you moved the sleep to after the query. Not that it changes anything in this scenario, but I bet that sleep is hiding a race in some cases, so I wouldn't be surprised if there were more races.
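
As a sketch of what I mean about the ordering (not the actual KubernetesCloud code; the class name and probe method are hypothetical): a query made before the sleep is where a null pod would show up, while an unconditional sleep before the first query tends to hide that race.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class SleepOrdering {
    static void probe(KubernetesClient client, String ns, String name)
            throws InterruptedException {
        Pod beforeSleep = client.pods().inNamespace(ns).withName(name).get();
        System.out.println("visible before sleep? " + (beforeSleep != null)); // often false right after create
        Thread.sleep(1_000L);
        Pod afterSleep = client.pods().inNamespace(ns).withName(name).get();
        System.out.println("visible after sleep?  " + (afterSleep != null));  // usually true by now
    }
}
```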

@carlossg
Contributor

I think it is reasonable to wait until the pod is not null, if k8s returns that.

@ladventure

When I use the kubernetes cloud to build, why does it need at least 8s, waiting 6s until k8s creates the pod?

@chancez
Contributor Author

chancez commented Apr 27, 2017

I've learned a little more about how the API servers work, and found more evidence of why this might be happening. Basically, each API server does a lot of caching of the results from etcd, and running multiple API servers is what makes this more likely to happen: the resource may exist after creation but not yet be in the cache of every API server, which causes some delay in seeing resources despite them existing.

@carlossg carlossg merged commit 873f343 into jenkinsci:master May 4, 2017
@chancez chancez deleted the fix_pod_query branch August 5, 2017 01:46