Better support for sidecar containers in batch jobs #25908

Closed
a-robinson opened this issue May 19, 2016 · 122 comments
@a-robinson (Member) commented May 19, 2016

Consider a Job with two containers in it -- one which does the work and then terminates, and another which isn't designed to ever explicitly exit but provides some sort of supporting functionality like log or metric collection.

What options exist for doing something like this? What options should exist?

Currently the Job will keep running as long as the second container keeps running, which means that the user has to modify the second container in some way to detect when the first one is done so that it can cleanly exit as well.

This question was asked on Stack Overflow a while ago with no better answer than to modify the second container to be more Kubernetes-aware, which isn't ideal. Another customer has recently brought this up to me as a pain point for them.

@kubernetes/goog-control-plane @erictune

@soltysh (Contributor) commented May 23, 2016

/sub

@erictune erictune added the sig/apps label Jul 7, 2016
@mingfang commented Sep 22, 2016

Also, using a liveness probe as suggested here http://stackoverflow.com/questions/36208211/sidecar-containers-in-kubernetes-jobs doesn't work, since the pod will be considered failed and the overall job will not be considered successful.

@mingfang commented Sep 22, 2016

How about we declare a job success probe, so that the Job can probe it to detect success instead of waiting for the pod to return 0?
Once the probe returns success, the pod can be terminated.

@erictune (Member) commented Sep 22, 2016

Can a probe run against a container that has already exited, or would there be a race while it is being torn down?

Another option is to designate certain exit codes as having special meaning. Both "success for the entire pod" and "failure for the entire pod" would be useful.

This would need to be on the Pod object, so that is a big API change.


@mingfang commented Sep 23, 2016

@erictune Good point; we can't probe an exited container.

Can we designate a particular container in the pod as the "completion" container so that when that container exits we can say the job is completed?

The sidecar containers tend to be long-lived, for things like log shipping and monitoring.
We can force terminate them once the job is completed.

@soltysh (Contributor) commented Sep 26, 2016

Can we designate a particular container in the pod as the "completion" container so that when that container exits we can say the job is completed?

Have you looked at this doc, point 3, described in detail here, where you basically don't set .spec.completions, and as soon as the first pod finishes with a 0 exit code the job is done?

The sidecar containers tend to be long-lived, for things like log shipping and monitoring.
We can force terminate them once the job is completed.

Personally, these look to me more like a ReplicaSet than a Job, but that's my personal opinion and, most importantly, I don't know the full details of your setup.

Generally, there are the following discussions, #17244 and #30243, which touch on this topic as well.
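For reference, the "don't set .spec.completions" option from point 3 might look roughly like the sketch below; the image and command are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: single-completion-job
spec:
  # .spec.completions intentionally left unset: the Job is treated as
  # complete as soon as one Pod terminates successfully.
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example/worker:latest   # placeholder image
        command: ["/bin/run-work"]     # placeholder command
```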

@mingfang commented Sep 26, 2016

@soltysh the link you sent above (point 3) references pod completion, not container completion.

@erictune erictune added the sig/node label Oct 6, 2016
@erictune (Member) commented Oct 6, 2016

The two containers can share an emptyDir; the first container can write an "I'm exiting now" message to a file, and the other can exit when it sees that message.
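This shared-file handshake can be tried outside Kubernetes. The sketch below simulates the two containers with two subshells sharing a temp directory, which stands in for the emptyDir volume:

```shell
workdir=$(mktemp -d)   # stands in for the shared emptyDir volume

# "Sidecar": polls for the tombstone file and exits once it appears.
( while [ ! -f "$workdir/main-terminated" ]; do sleep 0.2; done
  echo "sidecar: exiting" ) &
sidecar_pid=$!

# "Main": does its work, then writes the tombstone on exit via a trap.
( trap 'touch "$workdir/main-terminated"' EXIT
  echo "main: work finished" )

wait "$sidecar_pid"
echo "pod would now be complete"
```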

@anshumanbh commented Feb 10, 2017

@erictune I have a use case which I think falls in this bucket and I am hoping you could guide me in the right direction since there doesn't seem to be any official recommended way to solve this problem.

I am using the client-go library to code everything below:

So, I have a job that basically runs a tool in a one-container pod. As soon as the tool finishes running, it is supposed to produce a results file. I can't seem to capture this results file, because as soon as the tool finishes running, the pod is deleted and I lose the results file.

I was able to capture this results file if I used HostPath as a VolumeSource and since I am running minikube locally, the results file gets saved onto my workstation.

But I understand that's neither recommended nor ideal for production containers. So I used EmptyDir as suggested above. But again, if I do that, I can't really capture the file because it gets deleted along with the pod itself.

So, should I be solving my problem using the sidecar container pattern as well?

Basically, do what you suggested above. Start 2 containers in the pod whenever the job starts. 1 container runs the job and as soon as the job gets done, drops a message that gets picked up by the other container which then grabs the result file and stores it somewhere?

I fail to understand why we would need 2 containers in the first place. Why can't the job container do all this by itself? That is, finish the job, save the results file somewhere, access it/read it and store it somewhere.

@soltysh (Contributor) commented Feb 14, 2017

@anshumanbh I'd suggest you either:

  1. use persistent storage to save the result file,
  2. use a hostPath mount, which is almost the same as 1 (and you've already tried it), or
  3. upload the result file to a known remote location (S3, Google Drive, Dropbox), generally any kind of shared drive
@anshumanbh commented Feb 14, 2017

@soltysh I don't want the file to be stored permanently. On every run, I just want to compare the result with the last result. So the way I was thinking of doing this was committing to a GitHub repository on every run and then doing a diff to see what changed. In order to do that, I just need to store the result temporarily somewhere so that I can access it to send it to GitHub. Make sense?

@soltysh (Contributor) commented Feb 20, 2017

@anshumanbh perfectly clear, and still that doesn't fall into the category of a sidecar container. All you want to achieve is currently doable with what Jobs provide.

@anshumanbh commented Feb 22, 2017

@soltysh so considering I want to go for option 3 from the list you suggested above, how would I go about implementing it?

The problem I am facing is that as soon as the job finishes, the container exits and I lose the file. If I don't have the file, how do I upload it to a shared drive like S3/Google Drive/Dropbox? I can't modify the job's code to automatically upload it somewhere before it quits, so unfortunately I would have to first run the job and then save the file somewhere.

@soltysh (Contributor) commented Feb 23, 2017

If you can't modify the job's code, you need to wrap it in such a way that you're able to upload the file. If what you're working with is already an image, just extend it with the copying code.
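A generic wrapper entrypoint along those lines might look like the sketch below. Here `upload` is a stub standing in for whatever transfer step you use in practice (an `aws s3 cp`, a git push, a curl to a shared drive), and `echo` stands in for the unmodified tool:

```shell
dest=$(mktemp -d)                          # stands in for S3 / Google Drive / Dropbox
upload() { cp "$1" "$dest/result.txt"; }   # stub uploader for the demo

# Wrapper: run the unmodified tool, capture its output file, and upload it
# before the container would exit.
run_and_upload() {
  outfile=$(mktemp)
  "$@" > "$outfile"                        # the tool itself, unchanged
  upload "$outfile"
}

run_and_upload echo "scan complete"        # "echo" stands in for the real tool
cat "$dest/result.txt"
```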

@anshumanbh commented Feb 23, 2017

@soltysh yes, that makes sense. I could do that. However, the next question I have is - suppose I need to run multiple jobs (think about it as running different tools) and none of these tools have the uploading part built in them. So, now, I would have to build that wrapper and extend each one of those tools with the uploading part. Is there a way I can just write the wrapper/extension once and use it for all the tools?

Wouldn't the side car pattern fit in that case?

@soltysh (Contributor) commented Feb 23, 2017

Yeah, it could. Although I'd try the multiple-containers-inside-the-same-pod pattern first. In other words, your pod runs the job container and, alongside it, an additional one that waits for the output and uploads it. I'm not sure how feasible this is, but you can give it a try already.

@jmillikin-stripe (Contributor) commented Jun 14, 2017

Gentle ping -- sidecar awareness would make management of microservice proxies such as Envoy much more pleasant. Is there any progress to share?

The current state of things is that each container needs bundled tooling to coordinate lifetimes, which means we can't directly use upstream container images. It also significantly complicates the templates, as we have to inject extra argv and mount points.

An earlier suggestion was to designate some containers as a "completion" container. I would like to propose the opposite -- the ability to designate some containers as "sidecars". When the last non-sidecar container in a Pod terminates, the Pod should send TERM to the sidecars. This would be analogous to the "background thread" concept found in many threading libraries, e.g. Python's Thread.daemon.

Example config, when container main ends the kubelet would kill envoy:

containers:
  - name: main
    image: gcr.io/some/image:latest
    command: ["/my-batch-job/bin/main", "--config=/config/my-job-config.yaml"]
  - name: envoy
    image: lyft/envoy:latest
    sidecar: true
    command: ["/usr/local/bin/envoy", "--config-path=/my-batch-job/etc/envoy.json"]
@jmillikin-stripe (Contributor) commented Jun 14, 2017

For reference, here's the bash madness I'm using to simulate desired sidecar behavior:

containers:
  - name: main
    image: gcr.io/some/image:latest
    command: ["/bin/bash", "-c"]
    args:
      - |
        trap "touch /tmp/pod/main-terminated" EXIT
        /my-batch-job/bin/main --config=/config/my-job-config.yaml
    volumeMounts:
      - mountPath: /tmp/pod
        name: tmp-pod
  - name: envoy
    image: gcr.io/our-envoy-plus-bash-image:latest
    command: ["/bin/bash", "-c"]
    args:
      - |
        /usr/local/bin/envoy --config-path=/my-batch-job/etc/envoy.json &
        CHILD_PID=$!
        (while true; do if [[ -f "/tmp/pod/main-terminated" ]]; then kill $CHILD_PID; fi; sleep 1; done) &
        wait $CHILD_PID
        if [[ -f "/tmp/pod/main-terminated" ]]; then exit 0; fi
    volumeMounts:
      - mountPath: /tmp/pod
        name: tmp-pod
        readOnly: true
volumes:
  - name: tmp-pod
    emptyDir: {}
@soltysh (Contributor) commented Aug 2, 2017

I would like to propose the opposite -- the ability to designate some containers as "sidecars". When the last non-sidecar container in a Pod terminates, the Pod should send TERM to the sidecars.

@jmillikin-stripe I like this idea, although I'm not sure if this follows the principle of treating some containers differently in a Pod or introducing dependencies between them. I'll defer to @erictune for the final call.

Although, have you checked #17244, would this type of solution fit your use-case? This is what @erictune mentioned a few comments before:

Another option is to designate certain exit codes as having special meaning.

@jmillikin-stripe (Contributor) commented Aug 2, 2017

@jmillikin-stripe I like this idea, although I'm not sure if this follows the principle of treating some containers differently in a Pod or introducing dependencies between them. I'll defer to @erictune for the final call.

I think Kubernetes may need to be flexible about the principle of not treating containers differently. We (Stripe) don't want to retrofit third-party code such as Envoy to have Lamprey-style lifecycle hooks, and trying to adopt an Envelope-style exec inversion would be much more complex than letting the Kubelet terminate specific sidecars.

Although, have you checked #17244, would this type of solution fit your use-case? This is what @erictune mentioned a few comments before:

Another option is to designate certain exit codes as having special meaning.

I'm very strongly opposed to Kubernetes or Kubelet interpreting error codes at a finer granularity than "zero or non-zero". Borglet's use of exit code magic numbers was an unpleasant misfeature, and it would be much worse in Kubernetes where a particular container image could be either a "main" or "sidecar" in different Pods.

@msperl commented Aug 5, 2017

Maybe additional lifecycle hooks would be sufficient to solve this?

They could be:

  • PostStop: a means to trigger lifecycle events on other containers in the pod (i.e. trigger a stop)
  • PeerStopped: a signal that a "peer" container in the pod has died, possibly with the exit code as an argument

This could also provide a means to define custom policies for restarting a container, or even for starting containers that are not started by default, to allow some daisy-chaining of containers (when container a finishes, then start container b).

@oxygen0211 commented Sep 6, 2017

Also missing this. We run a job every 30 minutes that needs a VPN client for connectivity, and there seem to be a lot of use cases where this could be very useful (for example, anything that needs kubectl proxy). Currently I am using jobSpec.concurrencyPolicy: Replace as a workaround, but of course this only works if (a) you can live without parallel job runs and (b) the job execution time is shorter than the scheduling interval.

EDIT: in my use case, it would be completely sufficient to have some property in the job spec to mark a container as the terminating one, and have the job monitor that one for exit status and kill the remaining ones.
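For anyone reaching for the same workaround, the relevant part of the spec looks roughly like the sketch below (names and images are placeholders; note that concurrencyPolicy sits on the CronJob spec, not the Job spec):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vpn-batch-job
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Replace   # a new run replaces a previous, still-running one
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: example/worker:latest   # placeholder
          - name: vpn-client
            image: example/vpn:latest      # placeholder sidecar
```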

@nrmitchi commented Mar 12, 2020

@Datamance at this point the KEP to address this issue looks like it's indefinitely on hold.

I posted this comment a while back with my old solution. I'm not trying to push my own stuff here; it's just that the comment has been lost in the "100 more comments..." of GitHub, and I figured resurfacing it might be useful again.

@krancour (Member) commented Mar 12, 2020

@nrmitchi thanks for reposting that. I'm one who had overlooked it in the sea of comments and this looks like a fantastic near-term solution.

@ruiyang2015 commented Mar 12, 2020

We figured out a different approach. If you add the following to your Pod's containers:

    securityContext:
      capabilities:
        add:
        - SYS_PTRACE

then you will be able to find the PIDs of processes in the other containers. We run the following at the end of our main container:

    sql_proxy_pid=$(pgrep cloud_sql_proxy) && kill -INT $sql_proxy_pid
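The kill-by-name step can be sanity-checked locally. In the sketch below, a backgrounded sleep stands in for the proxy process; TERM is used instead of INT only because backgrounded children of a non-interactive shell ignore INT:

```shell
# Stand-in for the long-running sidecar process (e.g. cloud_sql_proxy):
sleep 300 &
sidecar_pid=$!

# Last step of the "main" container: find the sidecar by name and signal it.
proxy_pid=$(pgrep -x sleep) && kill -TERM $proxy_pid

sleep 1
if kill -0 "$sidecar_pid" 2>/dev/null; then state=running; else state=terminated; fi
echo "sidecar is $state"
```

In a real Pod this also requires a shared process namespace so pgrep can see the other container's processes.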

@nrmitchi commented Mar 12, 2020

@krancour glad it helped. If you look at the network in that repository there are a couple forks which are almost-definitely in a better place than my original one, and might be better to build off of/use.

IIRC the lemonade-hq fork had some useful additions.

@krancour (Member) commented Mar 12, 2020

@nrmitchi, I've been glancing at the code, but it may be quicker to just ask you...

Can you briefly comment on whatever pre-requisites may exist that aren't mentioned in the README?

For instance, do the images your sidecars are based on require any special awareness of this workaround? e.g. Do they need to listen on a specific port for a signal from the controller? Or perhaps must they include a certain shell (bash?)

@nrmitchi commented Mar 12, 2020

@krancour I'll preface my response with a note that this solution was written a couple years ago and my memory may be a bit rusty.

It was designed at the time so that the containers in question did not need to be aware of the workaround. We were primarily using third-party applications as sidecars (for example, I believe stripe/veneur was one) and did not want to fork or modify them.

The only requirement of the sidecars is that they properly listen for a SIGTERM signal and then shut down. I recall having some issues with third-party code running in sidecars that expected a different signal and had to be worked around, but really the controller should have just allowed the signal to be specified (i.e., SIGINT instead of SIGTERM).

They don't need to listen on any ports for a signal, since the controller uses an exec to signal the main process of the sidecar directly. IIRC, at the time that functionality was copied out of the Kubernetes code because it didn't exist in the client. I believe it exists in the official client now, and the controller should probably be updated.

@sbocinec commented Mar 12, 2020

We figured out a different approach. If you add the following to your Pod's containers:

    securityContext:
      capabilities:
        add:
        - SYS_PTRACE

then you will be able to find the PIDs of processes in the other containers. We run the following at the end of our main container:

    sql_proxy_pid=$(pgrep cloud_sql_proxy) && kill -INT $sql_proxy_pid

@ruiyang2015 thanks for this hack.
If anyone is implementing it, though, be sure to understand the implications of sharing a process namespace between the containers.

@krancour (Member) commented Mar 12, 2020

@nrmitchi

uses an exec to signal the main process of the sidecar directly

That is part of why I asked... I guess, specifically, I am wondering whether this fails for containers based on images that are built FROM scratch.

@nrmitchi commented Mar 12, 2020

@krancour Fair point. I never tested it with containers built off of scratch. Looking at the code (or my original version; this could have changed in the forks), it looks like it's going to be dependent on bash, but it should be possible to modify that.

@krancour (Member) commented Mar 12, 2020

it's going to be dependent on bash, but should be able to be modified

Sure, but as long as it's exec'ing, it's always going to depend on some binary being present in the container, and in a scratch container there's nothing there except whatever you put there explicitly. 🤷‍♂

Given that limitation, I can't use this for a use case where the containers that are running might be totally arbitrary and specified by a third party. Oh-- and then I have Windows containers in play, too.

I'll mention what I am going to settle on instead. It's probably too heavy-handed for most use cases, but I'm mentioning it in case someone else's use case is similar enough to mine to get away with this...

I can afford the luxury of simply deleting a pod whose "primary" container has exited, as long as I record the exit status first. So I'm going to end up writing a controller that will monitor some designated (via annotation) container for completion, record its success or failure in a datastore that already tracks "job" status, and then delete the pod entirely.

For good measure, I will probably put a little delay on the pod delete to maximize my central log aggregation's chances of getting the last few lines of the primary container's output before it's torpedoed.

Heavy-handed, but may work for some.

@nrmitchi commented Mar 13, 2020

@krancour Totally true. As is, the controller will not work for arbitrary use cases. Honestly, I never went back and attempted to abstract some of the implementation to support other cases, because I really thought that the previously mentioned KEP would have been merged, making the need for this functionality moot.

ianabc added a commit to pimsmath/hubtraf that referenced this issue Mar 21, 2020
This seems to be on its way to becoming a Kubernetes feature.
The issue is discussed at length in [this
issue](kubernetes/kubernetes#25908) and there
is an [open PR](kubernetes/enhancements#912)
which should help close it.

Signed-off-by: Ian Allison <iana@pims.math.ca>
@karlkfi (Contributor) commented Apr 26, 2020

Given that this issue is like 4 years old, the KEP hasn’t gone anywhere yet, and the state of the art is a hacky inline shell script replacing every entrypoint, I decided to codify the “standard” hack (tombstones in a shared volume) into a Go binary that can be easily baked into container images using a multi-stage build.

https://github.com/karlkfi/kubexit

There’s a few ways to use it:

  1. Bake it into your images
  2. Side load it using an init container and an ephemeral volume.
  3. Provision it on each node and side load it into containers using a host bind mount

Edit: v0.2.0 now supports “birth dependencies” (delayed start) and “death dependencies” (self-termination).

@vanzin commented Jul 9, 2020

Drive-by comment: this looks exactly like kubernetes/enhancements#753

@muru commented Jul 10, 2020

@vanzin as has been noted before, that KEP is on indefinite hold.

@andbuitra commented Dec 1, 2020

My use case for this is that Vault provides credentials for a CronJob. Once the task is done, the Vault sidecar is still running and the Job sits in a pending state, which triggers the monitoring system into thinking something is wrong. It's a shame what happened to the KEP.

@gonssal commented Jan 30, 2021

I just got hit by this issue when using the Cloud SQL Proxy, and I'm in disbelief that there isn't a way to tell a (Cron)Job which containers to take into account when deciding it has completed. I was expecting something like:

template:
  spec:
    targets:
    - container1
    - container2

But after reading the documentation: nothing. There's no way. Then I found this issue, along with some others about sidecars in Jobs, and again nothing, except yet another sidecar with kubexit or some manual process monitoring/killing.

How is this frozen and not a priority instead?

@mvanholsteijn commented Feb 6, 2021

Note that AWS ECS has had the essential property on containers since the inception of ECS. If a container is marked essential and it fails or stops for any reason, all other containers that are part of the task are stopped too.

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html

@krancour (Member) commented Feb 6, 2021

@mvanholsteijn that's interesting.

Something that I think has gone on with this issue is that the perfect has become the enemy of the good.

There are a lot of people who don't need anything more than to be able to designate a container (or n containers?) as "the important container" among multiple containers in a pod. If we could do just that, then jobs, for instance, could be considered complete when the designated container(s) exit.

But the issue has turned into defining container startup and shutdown order, etc. That sounds like it would be a great feature, but my intuition says it's not what most of us following this issue need. Our needs are basic: a simple boolean field for marking a container "essential" is really all we require, and it would not likely interfere with future efforts to express container startup and shutdown order.

@mvanholsteijn commented Feb 7, 2021

Thank you for your response. How fast can you get the essential feature implemented and released?

I consider Kubernetes the father of the concept of running multiple containers together as a "pod". The simple fact that this concept breaks as soon as you run a Job or CronJob, and has been left broken for almost 5 years, is beyond my comprehension.

@dims (Member) commented Feb 7, 2021

@dims dims closed this Feb 7, 2021
Workloads automation moved this from Backlog to Done Feb 7, 2021
@davi5e commented Feb 7, 2021

For an alpha feature, shouldn't the first attempt solve the Job conundrum? Most, if not all, Jobs have one container per pod (before sidecars get injected).

That would be a huge leap into getting an idea of how to solve it for other workloads too, in my modest opinion.
