Better support for sidecar containers in batch jobs #25908
Comments
/sub |
Also, using a liveness probe as suggested here http://stackoverflow.com/questions/36208211/sidecar-containers-in-kubernetes-jobs doesn't work, since the pod will be considered failed and the overall job will not be considered successful. |
How about we declare a job "success probe" so that the Job controller can probe it to detect success, instead of waiting for the pod to return 0. |
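Purely as illustration of the success-probe idea above, a hypothetical Job spec might look like the following. No such field exists in the Job API; the `successProbe` name and everything under it are invented for this sketch.

```yaml
# Hypothetical sketch only -- `successProbe` is not a real Job field.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job            # illustrative name
spec:
  successProbe:           # hypothetical: Job is complete once this probe passes
    httpGet:
      path: /done
      port: 8080
    periodSeconds: 10
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: my-job:latest   # illustrative image
```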
Can a probe run against a container that has already exited, or would there be a race? Another option is to designate certain exit codes as having special meaning. Both "success for the entire pod" and "failure for the entire pod" are possibilities. This would need to be on the Pod object, so that is a big API change. On Thu, Sep 22, 2016 at 1:41 PM, Ming Fang notifications@github.com wrote:
|
@erictune Good point; we can't probe an exited container. Can we designate a particular container in the pod as the "completion" container, so that when that container exits we can say the job is completed? The sidecar containers tend to be long-lived, for things like log shipping and monitoring. |
Have you looked into point 3 of this doc, described in detail in here, where you basically don't set
Personally, these look to me more like an RS rather than a Job, but that's my personal opinion, and most importantly I don't know the full details of your setup. Generally, the following discussions, #17244 and #30243, touch this topic as well. |
@soltysh the link you sent above, point 3 references pod completion and not container completion. |
The two containers can share an emptyDir; the first container can write an "I'm exiting now" message to a file, and the other can exit when it sees that message. |
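The handshake described above can be simulated locally with two shell processes standing in for the two containers (a background subshell plays the "main" container, the foreground loop plays the sidecar). The file name and temp directory are illustrative; in a real pod the directory would be the shared emptyDir mount.

```shell
#!/bin/sh
# Local simulation of the shared-emptyDir "tombstone" handshake.
tmpdir=$(mktemp -d)   # stands in for the shared emptyDir volume

# "main" container: do its work, then leave a tombstone file on exit
(
  echo "main: work done"
  touch "$tmpdir/main-terminated"
) &

# "sidecar" container: poll for the tombstone, then exit cleanly
while [ ! -f "$tmpdir/main-terminated" ]; do
  sleep 1
done
echo "sidecar: exiting"

wait
rm -rf "$tmpdir"
```

In a real Job the polling loop would wrap the sidecar's entrypoint, as the fuller bash example later in this thread shows.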
@erictune I have a use case which I think falls in this bucket, and I am hoping you could guide me in the right direction, since there doesn't seem to be any officially recommended way to solve this problem. I am using the client-go library to code everything below.

I have a job that basically runs a tool in a one-container pod. As soon as the tool finishes running, it is supposed to produce a results file. I can't seem to capture this results file, because as soon as the tool finishes running the pod is deleted and I lose the results file. I was able to capture this results file if I used But, I understand that's not recommended and ideal for production containers. So, I used

So, should I be solving my problem using the sidecar container pattern as well? Basically, do what you suggested above: start 2 containers in the pod whenever the job starts; 1 container runs the job, and as soon as the job gets done it drops a message that gets picked up by the other container, which then grabs the result file and stores it somewhere.

I fail to understand why we would need 2 containers in the first place. Why can't the job container do all this by itself? That is, finish the job, save the results file somewhere, then access it/read it and store it somewhere. |
@anshumanbh I'd suggest you:
|
@soltysh I don't want the file to be stored permanently. On every run, I just want to compare that result with the last result. So, the way I was thinking of doing this was committing to a github repository on every run and then doing a diff to see what changed. So, in order to do that, I just need to store the result temporarily somewhere so that I can access it to send it to Github. Make sense? |
@anshumanbh perfectly clear, and still that doesn't fall into the category of side-car container. All you want to achieve is currently doable with what jobs provide. |
@soltysh so considering I want to go for option 3 from the list you suggested above, how would I go about implementing it? The problem I am facing is that as soon as the job finishes, the container exits and I lose the file. If I don't have the file, how do I upload it to a shared drive like S3/Google Drive/Dropbox? I can't modify the job's code to automatically upload it somewhere before it quits, so unfortunately I would have to first run the job and then save the file somewhere. |
If you can't modify the job's code, you need to wrap it in such a way that you're able to upload the file. If what you're working with is an image already, just extend it with the copying code. |
@soltysh yes, that makes sense. I could do that. However, the next question I have is: suppose I need to run multiple jobs (think of it as running different tools), and none of these tools have the uploading part built into them. So now I would have to build that wrapper and extend each one of those tools with the uploading part. Is there a way I can write the wrapper/extension once and use it for all the tools? Wouldn't the sidecar pattern fit in that case? |
Yeah, it could. Although I'd try the multiple-containers-inside-the-same-pod pattern. In other words, your pod runs the job container and, alongside it, an additional one waiting for the output and uploading it. Not sure how feasible this is, but you can give it a try already. |
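The write-once wrapper discussed in the preceding comments can be sketched as a small shell function that runs the wrapped tool unchanged, uploads the output, and preserves the tool's exit code for the Job. The `upload_results` helper and the `/out/results.json` path are placeholders; the real copying code (e.g. an `aws s3 cp` call) would go there.

```shell
#!/bin/sh
# Generic wrapper sketch: run whatever tool the job image ships,
# then upload its output. All names here are illustrative.

upload_results() {
  # Stand-in for the real upload command (e.g. `aws s3 cp "$1" s3://...`)
  echo "uploading $1"
}

run_with_upload() {
  "$@"                            # run the wrapped tool unchanged
  status=$?                       # remember its exit code
  upload_results /out/results.json
  return $status                  # preserve the tool's verdict for the Job
}

run_with_upload echo "tool finished"
```

Because the wrapper takes the tool's command line as arguments, the same image layer can be reused across different tools by only changing the container's `args`.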
Gentle ping -- sidecar awareness would make management of microservice proxies such as Envoy much more pleasant. Is there any progress to share?

The current state of things is that each container needs bundled tooling to coordinate lifetimes, which means we can't directly use upstream container images. It also significantly complicates the templates, as we have to inject extra argv and mount points.

An earlier suggestion was to designate some containers as a "completion" container. I would like to propose the opposite -- the ability to designate some containers as "sidecars". When the last non-sidecar container in a Pod terminates, the Pod should send TERM to the sidecars.

Example config:

containers:
- name: main
image: gcr.io/some/image:latest
command: ["/my-batch-job/bin/main", "--config=/config/my-job-config.yaml"]
- name: envoy
image: lyft/envoy:latest
sidecar: true
command: ["/usr/local/bin/envoy", "--config-path=/my-batch-job/etc/envoy.json"] |
For reference, here's the bash madness I'm using to simulate the desired sidecar behavior:

containers:
- name: main
image: gcr.io/some/image:latest
command: ["/bin/bash", "-c"]
args:
- |
trap "touch /tmp/pod/main-terminated" EXIT
/my-batch-job/bin/main --config=/config/my-job-config.yaml
volumeMounts:
- mountPath: /tmp/pod
name: tmp-pod
- name: envoy
image: gcr.io/our-envoy-plus-bash-image:latest
command: ["/bin/bash", "-c"]
args:
- |
/usr/local/bin/envoy --config-path=/my-batch-job/etc/envoy.json &
CHILD_PID=$!
(while true; do if [[ -f "/tmp/pod/main-terminated" ]]; then kill $CHILD_PID; fi; sleep 1; done) &
wait $CHILD_PID
if [[ -f "/tmp/pod/main-terminated" ]]; then exit 0; fi
volumeMounts:
- mountPath: /tmp/pod
name: tmp-pod
readOnly: true
volumes:
- name: tmp-pod
emptyDir: {} |
@jmillikin-stripe I like this idea, although I'm not sure how it squares with the principle of not treating some containers in a Pod differently, or with introducing dependencies between them. I'll defer to @erictune for the final call. Although, have you checked #17244? Would this type of solution fit your use case? This is what @erictune mentioned a few comments before:
|
I think Kubernetes may need to be flexible about the principle of not treating containers differently. We (Stripe) don't want to retrofit third-party code such as Envoy to have Lamprey-style lifecycle hooks, and trying to adopt an Envelope-style exec inversion would be much more complex than letting the Kubelet terminate specific sidecars.
I'm very strongly opposed to Kubernetes or Kubelet interpreting error codes at a finer granularity than "zero or non-zero". Borglet's use of exit code magic numbers was an unpleasant misfeature, and it would be much worse in Kubernetes where a particular container image could be either a "main" or "sidecar" in different Pods. |
Maybe additional lifecycle hooks would be sufficient to solve this? Could be:
This could also provide a means to define custom policies to restart a container, or even to start containers that are not started by default, allowing some daisy-chaining of containers (when container a finishes, start container b). |
Also missing this. We run a job every 30 minutes that needs a VPN client for connectivity, but there seem to be a lot of use cases where this could be very useful (for example, stuff that needs kubectl proxy). Currently, I am using EDIT: in my use case, it would be completely sufficient to have some property in the job spec to mark a container as the terminating one, and have the job monitor that one for exit status and kill the remaining ones. |
@Datamance at this point the KEP to address this issue looks like it's indefinitely on hold. I posted a while back this comment, which was my old solution. I'm not trying to push my own stuff here, just that comment has been lost in the "100 more comments..." of github and figured resurfacing it may be useful again. |
@nrmitchi thanks for reposting that. I'm one who had overlooked it in the sea of comments and this looks like a fantastic near-term solution. |
We figured out a different approach: if you add the following to your Pod containers:
then you will be able to grep the PID in other containers; we run the following at the end of our main container: |
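The snippets this comment refers to did not survive in the thread. Most likely the trick relies on the real `shareProcessNamespace` pod field, which makes every container's processes visible to the others; a sketch under that assumption (container names, images, and the `pgrep` target are all illustrative):

```yaml
# Guess at the elided hack: a shared PID namespace lets the main
# container signal the sidecar directly when it finishes.
spec:
  shareProcessNamespace: true
  containers:
  - name: main
    image: my-job:latest              # illustrative image
    command: ["/bin/sh", "-c"]
    args:
    - |
      /bin/run-job                    # the real workload (placeholder)
      # With a shared process namespace, the sidecar's process is
      # visible here, so signal it by name once the job is done.
      kill -TERM "$(pgrep -o envoy)"
  - name: envoy
    image: envoyproxy/envoy:v1.14.1   # illustrative sidecar
```

This avoids the shared-volume tombstone entirely, at the cost of requiring `pgrep`/`kill` in the main image and a sidecar that shuts down cleanly on SIGTERM.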
@krancour glad it helped. If you look at the network in that repository, there are a couple of forks which are almost definitely in a better place than my original one, and might be better to build off of/use. IIRC the lemonade-hq fork had some useful additions. |
@nrmitchi, I've been glancing at the code, but it may be quicker to just ask you... Can you briefly comment on whatever pre-requisites may exist that aren't mentioned in the README? For instance, do the images your sidecars are based on require any special awareness of this workaround? e.g. Do they need to listen on a specific port for a signal from the controller? Or perhaps must they include a certain shell (bash?) |
@krancour I'll preface my response with a note that this solution was written a couple of years ago, and my memory may be a bit rusty.

It was designed at the time such that the containers in question did not need to be aware of the workaround. We were using primarily third-party applications in sidecars (for example, I believe stripe/veneur was one), and did not want to fork/modify them. The only requirement of the sidecars is that they properly listen for a SIGTERM signal and then shut down. I recall having some issues with third-party code running in sidecars that expected a different signal and had to be worked around, but really the controller should have just allowed the signal sent to be specified (i.e., SIGINT instead of SIGTERM). They don't need to listen on any ports for a signal, since the controller uses an |
@ruiyang2015 thanks for this hack. |
That is part of why I asked... I guess, specifically, I am wondering if this doesn't work for containers based on images that are built `FROM scratch`. |
@krancour Fair point, I never went and tested it with containers that were based off of `scratch`. |
Sure, but as long as it's exec'ing, it's always going to be dependent on some binary that's present in the container, and for a scratch container there's nothing there except whatever you put there explicitly. Given that limitation, I can't use this for a use case where the containers that are running might be totally arbitrary and specified by a third party. Oh-- and then I have Windows containers in play, too.

I'll mention what I am going to settle on instead. It's probably too heavy-handed for most use cases, but I'm mentioning it in case someone else's use case is similar enough to mine to get away with this...

I can afford the luxury of simply deleting a pod whose "primary" container has exited, as long as I record the exit status first. So I'm going to end up writing a controller that will monitor some designated (via annotation) container for completion, record its success or failure in a datastore that already tracks "job" status, and then delete the pod entirely. For good measure, I will probably put a little delay on the pod delete to maximize my central log aggregation's chances of getting the last few lines of the primary container's output before it's torpedoed.

Heavy-handed, but may work for some. |
@krancour Totally true. As is, the controller will not work for arbitrary use cases. Honestly, I never went back and attempted to abstract some of the implementation to support other cases, because I really thought that the previously mentioned KEP would have been merged and made the need for this functionality moot. |
This seems to be a feature which will appear as a kubernetes feature. The issue is discussed at length in [this issue](kubernetes/kubernetes#25908) and there is an [open PR](kubernetes/enhancements#912) which should help close it. Signed-off-by: Ian Allison <iana@pims.math.ca>
Given that this issue is like 4 years old, the KEP hasn't gone anywhere yet, and the state of the art is a hacky inline shell script replacing every entrypoint, I decided to codify the "standard" hack (tombstones in a shared volume) into a Go binary that can be easily baked into container images using a multi-stage build. https://github.com/karlkfi/kubexit There are a few ways to use it:
Edit: v0.2.0 now supports “birth dependencies” (delayed start) and “death dependencies” (self-termination). |
Drive-by comment: this looks exactly like kubernetes/enhancements#753 |
@vanzin as has been noted before, that KEP is on indefinite hold. |
My use case for this is that Vault provides credentials for a CronJob to run. Once the task is done, the Vault sidecar is still running with the Job in a pending state, and that triggers the monitoring system into thinking that something is wrong. It's a shame what happened to the KEP. |
I just got hit by this issue when using the Cloud SQL Proxy, and I'm in disbelief that there isn't a way to tell a (Cron)Job which containers to take into account to consider it completed. I was expecting something like:

template:
  spec:
    targets:
    - container1
    - container2

But after reading the documentation, nothing. There's no way. Then I found this issue along with some others about sidecars in Jobs, and also nothing except yet another sidecar with kubexit or some manual process monitoring/killing. How is this frozen and not a priority instead? |
Note that AWS ECS has had the `essential` attribute on container definitions for years: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html |
@mvanholsteijn that's interesting. Something that I think has gone on with this issue is that the perfect has become the enemy of the good. There are a lot of people who don't need anything more than to be able to designate a container (or n containers?) as "the important container" among multiple containers in a pod. If we could do just that, then jobs, for instance, could be considered complete when the designated container(s) exit.

But the issue has turned into defining container startup and shutdown order, etc., etc. That sounds like it would be a great feature, but my intuition says that's not what most of us following this issue need. Our needs are basic, and a simple boolean field for marking a container "essential" is really all we need-- and would not likely interfere with efforts in the future to express container startup and shutdown order. |
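To make the scope of this ask concrete: the boolean field the comment describes might look like the sketch below, loosely modeled on the ECS flag. No such field exists in the Kubernetes API; `essential` and everything around it are hypothetical.

```yaml
# Hypothetical sketch only -- `essential` is not a real Kubernetes field.
template:
  spec:
    containers:
    - name: main
      essential: true          # hypothetical: Job completes when this exits
      image: my-job:latest     # illustrative image
    - name: envoy
      essential: false         # hypothetical: terminated once all essential
                               # containers have exited
      image: envoyproxy/envoy:v1.14.1
```

A field like this says nothing about startup or shutdown ordering, which is why it would not conflict with the broader lifecycle work discussed elsewhere in the thread.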
Thank you for your response. How fast can you get the "essential" attribute in? I consider Kubernetes the father of the concept of running multiple containers together as a "pod". The simple fact that this concept breaks as soon as you run a Job or CronJob, and has been left broken for almost 5 years, is beyond my comprehension. |
please move discussions to the KEP:

please see the latest updates to the KEP here:

last update was here:

/close |
For an alpha feature, shouldn't the first try solve the Job conundrum? Most if not all jobs have one container per pod (before sidecars get inserted). That would be a huge leap into getting an idea of how to solve it for other workloads too, in my modest opinion. |
Consider a Job with two containers in it -- one which does the work and then terminates, and another which isn't designed to ever explicitly exit but provides some sort of supporting functionality like log or metric collection.
What options exist for doing something like this? What options should exist?
Currently the Job will keep running as long as the second container keeps running, which means that the user has to modify the second container in some way to detect when the first one is done so that it can cleanly exit as well.
This question was asked on Stack Overflow a while ago with no better answer than to modify the second container to be more Kubernetes-aware, which isn't ideal. Another customer has recently brought this up to me as a pain point for them.
@kubernetes/goog-control-plane @erictune