
Using Istio with CronJobs #11659

Open
Stono opened this issue Feb 11, 2019 · 54 comments

Labels
area/networking kind/enhancement lifecycle/staleproof

Comments

@Stono (Contributor) commented Feb 11, 2019

Hey all,
I have an issue with Istio when used in conjunction with CronJobs or Jobs: when the primary container completes, the Job never completes because istio-proxy is still running:

NAME                                  READY     STATUS    RESTARTS   AGE
backup-at-uk-1549872000-7hrx7         1/2       Running   0          34m

I tried adding the following to the end of the primary container's script, as suggested by @costinm in #6324, but that doesn't work (envoy exits, the proxy container doesn't):

curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit
OK

This seems to cause Envoy to exit correctly; however, the pilot-agent process in the istio-proxy container is still running:

istio-proxy@backup-at-uk-1549872000-7hrx7:/$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
istio-p+       1  0.0  0.0  32640 18820 ?        Ssl  08:00   0:00 /usr/local/bin/pilot-agent proxy sidecar --concurrency 1 --configPath /etc/istio/proxy --binaryPath /usr/local/bin/envoy --serviceCluster helm-solr-backup --drainDuration

Despite it no longer listening:

istio-proxy@backup-at-uk-1549872000-7hrx7:/$ netstat -plunt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name

The main container can't send a SIGTERM to istio-proxy because it doesn't have permission to do so (quite rightly), so I'm a little stuck.

The only hacky thing I can think of doing is adding a readinessProbe to istio-proxy which checks whether it's listening and, if it isn't, sends the SIGTERM.

Thoughts?

@huikang commented Feb 11, 2019

Same issue for me.

@Stono (Contributor, Author) commented Feb 11, 2019

For those who are interested, we worked around this by having the sidecar injector add a livenessProbe to the istio-proxy container:

    livenessProbe:
      exec:
        command:
          - /usr/local/bin/liveness.sh
      initialDelaySeconds: 3
      periodSeconds: 10
      failureThreshold: 5

And then the script looks like this:

#!/bin/bash
set -e
# If Envoy has exited but pilot-agent is still running, terminate the sidecar.
if ! pidof envoy &>/dev/null; then
  if pidof pilot-agent &>/dev/null; then
    echo "Envoy is not running, exiting istio-proxy"
    kill -s TERM $(pidof pilot-agent)
    exit 1
  fi
fi

# Fail the probe if the proxy is not listening on its capture port.
if ! netstat -plunt | grep 15001; then
  echo "Istio-proxy not listening on 15001"
  exit 1
fi
exit 0

@huikang commented Feb 11, 2019

@Stono thanks for sharing your workaround. Are the above script and the livenessProbe added to the CronJob YAML file? I'm asking because I don't understand how to add a livenessProbe to the sidecar injector for istio-proxy. Thanks.

@huikang commented Feb 11, 2019

Hi @Stono, it is still unclear to me how the liveness probe can be added to the istio-proxy container (my understanding is that the istio-proxy image is not managed by the end user).

Could you point me to some online resources? Thanks.
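
For illustration, a rough sketch of where such a probe can live: the sidecar injector's ConfigMap (istio-sidecar-injector in istio-system) holds the template used to render the injected istio-proxy container, and a livenessProbe can be added to that container there. The excerpt below is an assumption of the general shape, not the stock template (which varies by Istio version), and the liveness.sh script still has to be made available inside the proxy container, e.g. via a custom proxy image or a mounted ConfigMap:

# Hypothetical excerpt of the injector template's istio-proxy container;
# the probe fields are standard Kubernetes, but the surrounding template differs per version.
containers:
- name: istio-proxy
  image: docker.io/istio/proxyv2:1.1.0   # version as appropriate
  # ... existing args, env, ports ...
  livenessProbe:
    exec:
      command:
      - /usr/local/bin/liveness.sh   # must exist in (or be mounted into) the proxy image
    initialDelaySeconds: 3
    periodSeconds: 10
    failureThreshold: 5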

@liamawhite (Member) commented Feb 12, 2019

I would imagine native Istio support for this would require a similar approach: exposing an additional path on the pilot-agent server that tells it to shut down. This would still require the batch job to call that endpoint after it finishes, though.

I wonder if there is something else we could hook into in Kube that would automatically call this endpoint once the job is finished?

@Stono (Contributor, Author) commented Feb 12, 2019

@liamawhite do you know of any way to make istio-proxy exit with a 0 status code? At the moment, if I send a SIGTERM I get 137, which causes a CronJob failure. I need a signal that will shut it down with exit code 0.

@Bessonov commented Feb 13, 2019

Similar issue: #11045

@liamawhite (Member) commented Feb 14, 2019

I think the latest in release-1.1 should return 0. I will try to find some time to verify.

@Stono (Contributor, Author) commented Apr 3, 2019

Seems to be fine for me in 1.1.1

@Stono closed this Apr 3, 2019
@Stono reopened this Apr 3, 2019

@Stono (Contributor, Author) commented Apr 3, 2019

By the way folks, we have recently started doing this in our pod spec:

            command: ["/bin/bash", "-c"]
            args:
              - |
                trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit" EXIT
                while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
                sleep 2
                {{ $.Values.cron.command }}

This will:

  • Wait for proxy to be ready
  • Tell it to quit when done

@mikesimons commented Apr 30, 2019

@Stono Have you observed the sidecar living for a while after receiving quitquitquit? Ours are living for another minute or so before exiting (although 15000 gets closed immediately).

We're also getting log spam on Completed jobs until we delete them:

info	Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused

@stale (bot) commented Jul 29, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale (bot) added the stale label Jul 29, 2019

@Bessonov commented Jul 31, 2019

Activity!

@howardjohn (Member) commented Sep 22, 2019

In 1.3 we added a new /quitquitquit endpoint to pilot-agent which should resolve this. It's not perfect since it requires some manual action, but I think a longer-term solution would depend on kubernetes/kubernetes#65502 or #11366 or similar. If there is anything else we can do to improve this in the short term, I would be happy to take a look.
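
For reference, a minimal sketch of calling that endpoint from the job's own command, assuming the default pilot-agent status port 15020 (the batch command below is a placeholder):

#!/bin/sh
# Run the real work first and remember its exit code.
/path/to/your-batch-command   # placeholder for the real job command
code=$?
# Ask the sidecar to shut down; ignore curl failures so the job's own
# exit code is what the Job controller sees.
curl -fsS -X POST http://127.0.0.1:15020/quitquitquit || true
exit $code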

@SuleimanWA commented Nov 9, 2020

@yuenwah That's what I ended up doing. Note that you have to put it on the pod template, not the job template or anywhere else, or the annotation won't do anything. For example:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  ...
spec:
  ...
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # disable istio on the pod due to this issue:
            # https://github.com/istio/istio/issues/11659
            sidecar.istio.io/inject: "false"

This is a good solution if RBAC is not enabled; once you enable it, this will fail because the pod is considered out of the mesh (of course, only if the CronJob is accessing mesh-internal services).
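
For illustration, if an uninjected CronJob still has to call mesh-internal services while mTLS is STRICT, one option is a PERMISSIVE exception for the workload it talks to. A minimal sketch with hypothetical names (note that authorization policies requiring mTLS principals would still reject the uninjected pod, since it has no identity):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  # hypothetical names; scope this to the service the CronJob calls
  name: allow-plaintext-from-uninjected-jobs
  namespace: target-namespace
spec:
  selector:
    matchLabels:
      app: target-service
  mtls:
    mode: PERMISSIVE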

@hobbytp commented Mar 29, 2021

One possible solution for a Job is to call "curl -XPOST http://localhost:15000/quitquitquit" to exit the Envoy proxy when the job's task has finished. This has been supported since Istio 1.3.

kubectl -n hobby apply -f -<<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: sleep
spec:
  template:
    spec:
      containers:
      - name: sleep
        image: governmentpaas/curl-ssl
        command: ["/bin/sh","-c"]
        args: [ "/bin/sleep 10; curl -XPOST http://localhost:15000/quitquitquit" ]        
        imagePullPolicy: IfNotPresent
      restartPolicy: Never
  backoffLimit: 4
EOF

BTW, some other people use "pkill -f /usr/local/bin/pilot-agent" in their container command, together with shareProcessNamespace: true in the Pod spec (not the Job spec), to achieve the same goal, but I think this one (/quitquitquit) is better.
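
A rough sketch of that pkill variant, for comparison. Assumptions: the job image provides pkill, and the job container has enough privilege (typically running as root) to signal pilot-agent across the shared PID namespace:

apiVersion: batch/v1
kind: Job
metadata:
  name: sleep-pkill
spec:
  template:
    spec:
      # Let containers in the pod see and signal each other's processes.
      shareProcessNamespace: true
      containers:
      - name: sleep
        image: alpine
        command: ["/bin/sh", "-c"]
        # Do the work, then terminate the sidecar's pilot-agent.
        args: ["sleep 10; pkill -f /usr/local/bin/pilot-agent"]
        imagePullPolicy: IfNotPresent
      restartPolicy: Never
  backoffLimit: 4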

@howardjohn (Member) commented Jun 1, 2021

One note for folks that cannot change their Docker images or use a shell: you can do some tricks with shared volumes to mount new binaries into your container. For example:

apiVersion: v1
kind: Pod
metadata:
  name: shell
spec:
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  initContainers:
  - name: scuttle-init
    image: howardjohn/scuttle-shell
    volumeMounts:
    - mountPath: /var/lib/scuttle/bin/
      name: scuttle-bootstrap
    command:
    - cp
    - /scuttle
    - /var/lib/scuttle/bin/scuttle
  containers:
  - name: shell
    image: howardjohn/alpine-shell
    imagePullPolicy: IfNotPresent
    command:
      - /var/lib/scuttle/bin/scuttle
      - /bin/sleep
      - "5"
    env:
    - name: ISTIO_QUIT_API
      value: http://localhost:15020
    - name: ENVOY_ADMIN_API
      value: http://localhost:15000
    volumeMounts:
    - mountPath: /var/lib/scuttle/bin/
      name: scuttle-bootstrap
  volumes:
  - name: scuttle-bootstrap
    emptyDir: {}

Along with https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#custom-templates-experimental you can probably make this automagically injected. The main problem is that you may not know which command to run (if the pod only sets args and not command).

@hzxuzhonghu (Member) commented Dec 25, 2021

/close

Prioritization automation moved this from P1 to Done Dec 25, 2021
@Bessonov commented Dec 25, 2021

@hzxuzhonghu is there more information?

@mmerickel commented Dec 25, 2021

Was there a fix committed somewhere? What does it mean to mark this done?

@hzxuzhonghu (Member) commented Dec 27, 2021

@mmerickel commented Dec 27, 2021

Thanks - I've been using the quitquitquit hack in my containers for over a year, since I started using Istio, and it's quite tedious to maintain and to remind team members of the pitfalls when deploying into Istio. I've been following this issue because most Helm charts etc. do not ship with Istio workarounds like this, and I'm hoping Istio will come up with a solution that wraps one-off pods so they can shut down correctly without manual modifications. I'd hope this issue stays open.

@howardjohn (Member) commented Dec 28, 2021

I think this should still be open; the above is a workaround. It should be automatic.

@howardjohn reopened this Dec 28, 2021

@tomfankhaenel commented Jan 18, 2022

This is indeed an issue worth solving within Istio itself. One does not simply think of this kind of behavior when deploying Istio. I was expecting this could be solved by adding some kind of magic Istio label or annotation.

@rdavyd commented Jan 21, 2022

Ran into an issue on k8s version 1.22.5 using the workaround that calls the /quitquitquit endpoint: the Pod shows as Completed even when the main job container finishes with an error.
Didn't test that on earlier k8s versions.

@howardjohn (Member) commented Jan 21, 2022

Pod shows as Completed, even when the main job container finishes with an error.

That is how Jobs work in Kubernetes regardless of Istio. I think there is some config somewhere to tweak it.

@rdavyd commented Jan 21, 2022

@howardjohn Did some digging and testing and did not find a way to do such a tweak. It looks like k8s takes the displayed Pod status from one of the containers. I tested a Pod with two containers and without istio-proxy, and its status always reflected the exit status of the first container in the array.
But when istio-proxy is added (it is injected as the second container), this behavior changes and the Pod status displays the exit status of istio-proxy.
I also didn't find an Envoy API call like /quitquitquit that forces an exit with a non-zero status.

Update: I think I got to the bottom of this. The displayed Pod status message is calculated when it is returned to the client, and in our case it is unstable and sometimes wrong: it reflects the termination reason of the last container in the pod.Status.ContainerStatuses array. I don't think the order of this array is controlled by the user.

https://github.com/kubernetes/kubernetes/blob/5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e/pkg/printers/internalversion/printers.go#L812-L813

The link is for version 1.22.5, but in master I see the same.
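
In practice this means the STATUS column of kubectl get pods can be misleading for multi-container job pods; the per-container exit codes are more reliable. For example:

# Print each container's name and terminated exit code for a given pod
kubectl get pod <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.exitCode}{"\n"}{end}'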

@sathieu commented Feb 8, 2022

Hello, I've written an "operator" to handle this problem until keystone containers are added to k8s.

See: https://gitlab.com/kubitus-project/kubitus-pod-cleaner-operator/-/blob/main/README.md

@ceastman-r7 commented Mar 16, 2022

This seems to maintain the error code from the job:

x=$(echo $?); curl -fsI -X POST http://localhost:15020/quitquitquit && exit $x
