Pod controlled by a Job does not exit after main container completes #1869

Closed
maorfr opened this issue Nov 22, 2018 · 27 comments

@maorfr

maorfr commented Nov 22, 2018

Bug Report

What is the issue?

Pods that are controlled by Jobs are not terminating when the main container exits

How can it be reproduced?

Create a Job with a linkerd sidecar container.
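
For example, a minimal Job along these lines (names and image are placeholders) should be enough to reproduce it:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled # injects the linkerd-proxy sidecar
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          # The main container exits right away, but the pod stays Running
          # because the proxy sidecar never exits.
          command: ["sh", "-c", "echo done"]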

Logs, error output, etc

Main container logs are as usual; sidecar container logs are as usual.

linkerd check output

Status check results are [ok]

Environment

  • Kubernetes Version: 1.11.4
  • Cluster Environment: AWS (kops)
  • Host OS: Container Linux by CoreOS 1911.3.0 (Rhyolite)
  • Linkerd version: edge-18.11.2 (client and server)

Possible solution

I think that when a container within a pod controlled by a Job completes, the sidecar should exit as well.

Additional context

The sidecar was created using linkerd inject

@grampelberg
Contributor

We need something from kubernetes/kubernetes#25908 to move forward with this.

@stale

stale bot commented Mar 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 3, 2019
@stale stale bot closed this as completed Mar 18, 2019
@christianhuening
Contributor

whoop, reopen.
@grampelberg would it make sense to change the MutatingWebhookConfiguration for the time being to only inject pods belonging to Deployments and DaemonSets? I am leaving out StatefulSets since there's also this bug: #2266

@kivagant-ba

Should this really be closed?

@christianhuening
Contributor

I still deem this problematic, too

@kivagant-ba

@christianhuening, by any chance, do you know of any workaround better than this one:
kubernetes/kubernetes#25908 (comment)
?

@christianhuening
Contributor

christianhuening commented Aug 27, 2019

We just don't inject Jobs atm. This is not an issue right now, since we only use jobs for initialization when deploying environments.

@kivagant-ba

Do you know if it's possible to blacklist jobs at the namespace level or somewhere in the L5d configuration?

@christianhuening
Contributor

Provide the linkerd.io/inject annotation and set it to disabled on the job's pod spec.
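
Something like this on the Job's pod template (a sketch; names and image are placeholders, only the annotation matters here):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: disabled # opt this pod out of proxy injection
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: my-job-image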

@kivagant-ba

This requires direct access to each Job specification (or a custom admission controller). Thank you for your feedback.

@alexklibisz

alexklibisz commented Mar 5, 2020

I believe I found a solution to this that doesn't require waiting for a new k8s feature or significantly altering the main job process.

Pods have a shareProcessNamespace setting. This lets containers in a pod see and kill the processes running in other containers.

The solution: Assume you can identify the process id for the main workload in your job/cronjob. Then you can add your own sidecar container that checks to see if your job process is running, sleeps, and repeats until the job process exits. Once it exits, you kill the linkerd2-proxy process, which makes that container exit, and successfully ends the job/cronjob.

Here's an example which assumes your job process is called java. I assume it would work for any other process; you just have to be able to find its process id by running pgrep <name-of-my-process>.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-java-job-that-uses-linkerd2-injection
spec:
  template:
    metadata:
      annotations:
        # Inject linkerd2 proxy sidecar.
        linkerd.io/inject: enabled
    spec:
      containers:
        # This is your main workload. In this case let's assume it's a java process.
        - name: job
          image: com.foo.bar/my-java-job:latest
          resources:
            limits:
              memory: ...
              cpu: ...
            requests:
              memory: ...
              cpu: ...
        # This sidecar monitors the java process that runs the main job and kills the linkerd-proxy once java exits.
        # Note that it's necessary to set `shareProcessNamespace: true` in `spec.template.spec` for this to work.
        - name: linkerd-terminator
          image: ubuntu:19.04
          command:
            - sh
            - "-c"
            - |
              /bin/bash <<'EOSCRIPT'
              set -e
              # Check every 5 seconds whether the java process is still running; exit the loop once it's gone.
              while true; do pgrep java || break; sleep 5; done
              # After the java process exits, kill the linkerd2-proxy process so this container (and the pod) can complete.
              kill $(pgrep linkerd2-proxy)
              EOSCRIPT
          resources:
            limits:
              cpu: 10m
              memory: 20M
            requests:
              cpu: 10m
              memory: 20M
      shareProcessNamespace: true # Don't forget this part!

For context, we are running k8s version 1.15.7.

@Enrico2

Enrico2 commented Mar 9, 2020

@alexklibisz I tried this approach with a deployment for a similar issue (#3751), and the kill command is not permitted:

/bin/bash: line 5: kill: (405) - Operation not permitted

The base image is ubuntu:bionic-20200112

did you encounter any permissions issues you had to work around?

@alexklibisz

@alexklibisz I tried this approach with a deployment for a similar issue (#3751), and the kill command is not permitted:

/bin/bash: line 5: kill: (405) - Operation not permitted

The base image is ubuntu:bionic-20200112

did you encounter any permissions issues you had to work around?

IIRC, I saw some permission errors when I forgot to add the shareProcessNamespace setting.

@Enrico2

Enrico2 commented Mar 9, 2020

I do have that setting; the termination process can see the PID of the other process. AFAICT, the issue is that linkerd-proxy and the termination process are run by different users.

@alexklibisz

I do have that setting, the termination process can see the pid of the other process. Afaict, the issue is that linkerd-proxy and the termination process are run by different users.

In my case, the java process was definitely owned by a non-root user (user id 1001, IIRC), the kill command was definitely run by the root user in the termination container, and I believe that linkerd2-proxy was owned by a non-root user as well. But I would have to double-check that last one and don't have my work computer right now.
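
If it really is a user mismatch, one thing that might be worth trying (an untested sketch, assuming the proxy runs as Linkerd's default UID 2102) is running the terminator container as the same user so kill is permitted:

        - name: linkerd-terminator
          image: ubuntu:19.04
          securityContext:
            runAsUser: 2102 # assumed: same UID as linkerd-proxy, so it may signal the proxy process
          # same watch-and-kill command/resources as in the example above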

@Enrico2

Enrico2 commented Mar 10, 2020 via email

@alexklibisz

Yeah I didn't set up the termination process to run as root.

Got it. So it's working now?

@Enrico2

Enrico2 commented Mar 10, 2020 via email

@kumargauravin

Any updates on this? I am new to both cron jobs and linkerd. If you could spare a moment to share progress on this issue, that would be great.

@grampelberg
Contributor

@kumargauravin the upstream Kubernetes issue is still open and there's not anything we can do on the Linkerd side.

@alexklibisz

alexklibisz commented Jun 12, 2020

Can confirm the original solution I posted is still working fine after about three months.

We also have some crons running on Argo Workflows with linkerd sidecars. shareProcessNamespace doesn't seem to be an available option in Argo workflow specifications. We were able to get Argo to kill the sidecars only after setting the right annotations on the job template:

templates:
  - name: job
    metadata:
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/skip-outbound-ports: "443" # annotation values must be strings
    container:
      image: ...

The key part is skip-outbound-ports. I set this up a while ago, so I don't remember the precise reasoning. It was some sort of deadlock where the Argo sidecar container couldn't kill the linkerd sidecar container because Argo was trying to communicate over 443, which was proxied by linkerd, so linkerd refused to die because it still had open connections over 443, etc. Fun stuff!

@Esardes

Esardes commented Aug 14, 2020

@alexklibisz thanks for the workaround!
I got it working on a k8s job, but couldn't figure out how to make it work in an Argo flow (shareProcessNamespace doesn't seem to fit anywhere). Any chance you'd have time to share a gist for it?

@alexklibisz

@alexklibisz thanks for the workaround!
I got it working on a k8s job, but couldn't figure out how to make it work in an Argo flow (shareProcessNamespace doesn't seem to fit anywhere). Any chance you'd have time to share a gist for it?

I updated the comment above.

@electrical

For future reference. A shutdown hook was added in linkerd/linkerd2-proxy#811

@laukaichung

For future reference. A shutdown hook was added in linkerd/linkerd2-proxy#811

Is there an example or documentation for this feature?

@wmorgan
Member

wmorgan commented May 13, 2021

@laukaichung Good catch, I don't think we documented this very well. Would you mind filing an issue to https://github.com/linkerd/website so that we can track this?
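
In the meantime, the rough shape of it (a sketch, assuming the proxy admin port is the default 4191 and the shutdown endpoint is available in your proxy version) is for the job's main container to POST to the proxy's shutdown endpoint once its real work is done:

# run the real workload (placeholder path), capture its exit code,
# then ask the proxy sidecar to shut down so the pod can complete
/path/to/my-job
status=$?
curl -s -X POST http://localhost:4191/shutdown
exit $status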

@rachelvwood

@wmorgan @laukaichung I didn't find an existing one, so I created an issue about the missing documentation.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021