Using Istio with CronJobs #11659

Closed
Stono opened this issue Feb 11, 2019 · 73 comments
Labels
area/networking kind/enhancement lifecycle/staleproof Indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed

Comments

@Stono
Contributor

Stono commented Feb 11, 2019

Hey all,
I have an issue with Istio when used in conjunction with CronJobs or Jobs, in that when the primary pod completes, the "Job" never completes because istio-proxy is still running:

NAME                                  READY     STATUS    RESTARTS   AGE
backup-at-uk-1549872000-7hrx7         1/2       Running   0          34m

I tried adding the following to the end of the primary pod script as suggested by @costinm in #6324, but that doesn't work (envoy exits, proxy doesn't):

curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit
OK

This seems to cause Envoy to exit correctly; however, the pilot-agent process in the istio-proxy container is still running:

istio-proxy@backup-at-uk-1549872000-7hrx7:/$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
istio-p+       1  0.0  0.0  32640 18820 ?        Ssl  08:00   0:00 /usr/local/bin/pilot-agent proxy sidecar --concurrency 1 --configPath /etc/istio/proxy --binaryPath /usr/local/bin/envoy --serviceCluster helm-solr-backup --drainDuration

Despite it no longer listening:

istio-proxy@backup-at-uk-1549872000-7hrx7:/$ netstat -plunt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name

The main pod can't send a SIGTERM to istio-proxy because it doesn't have permission to do so (quite rightly) so I'm a little stuck.

The only hacky thing I can think of doing is adding a readinessProbe to istio-proxy which checks to see if it's listening and if it isn't, sends the SIGTERM.

Thoughts?

@Stono
Contributor Author

Stono commented Feb 11, 2019

Associated: kubernetes/kubernetes#25908

@huikang

huikang commented Feb 11, 2019

Same issue for me.

@Stono
Contributor Author

Stono commented Feb 11, 2019

For those who are interested, we worked around this by adding a livenessProbe to the sidecar injector for istio-proxy:

    livenessProbe:
      exec:
        command:
          - /usr/local/bin/liveness.sh
      initialDelaySeconds: 3
      periodSeconds: 10
      failureThreshold: 5

And then the script looks like this:

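#!/bin/bash
set -e
# If Envoy has exited but pilot-agent is still running, terminate pilot-agent.
if ! pidof envoy &>/dev/null; then
  if pidof pilot-agent &>/dev/null; then
    echo "Envoy is not running, exiting istio-proxy"
    kill -s TERM $(pidof pilot-agent)
    exit 1
  fi
fi

# Fail the probe if nothing is listening on the proxy port 15001.
if ! netstat -plunt | grep 15001; then
  echo "Istio-proxy not listening on 15001"
  exit 1
fi
exit 0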

@huikang

huikang commented Feb 11, 2019

@Stono thanks for sharing your workaround. Are the above script and the livenessProbe added to the CronJob YAML file? I am asking because I could not understand how to add a livenessProbe to the sidecar injector for istio-proxy. Thanks.

@Stono
Contributor Author

Stono commented Feb 11, 2019 via email

@huikang

huikang commented Feb 11, 2019

Hi, @Stono, it is still unclear to me how the liveness probe can be added to the istio-proxy pod (my understanding is that the istio-proxy image is not managed by the end user).

Could you point me to some online resources? Thanks.

@Stono
Contributor Author

Stono commented Feb 11, 2019 via email

@liamawhite
Member

liamawhite commented Feb 12, 2019

I would imagine native Istio support for this would require a similar approach: exposing an additional path on the pilot-agent server that tells it to shut down. The batch job would still need to call that endpoint after it finishes, though.

I wonder if there is something else we could hook into in Kube that would automatically call this endpoint once the job is finished?

@Stono
Contributor Author

Stono commented Feb 12, 2019

@liamawhite do you know of any way to make istio-proxy exit with a 0 status code? At the moment, if I send a SIGTERM I get 137, which causes a CronJob failure. I need an exit signal that will shut it down with a 0.
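
As background on the 137 above: by the usual shell convention, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128. So 137 corresponds to signal 9 (SIGKILL), while a process ended by SIGTERM would normally report 143. A minimal sketch for decoding such statuses follows; `describe_exit` is a hypothetical helper name, not part of Istio or Kubernetes.

```shell
# Hypothetical helper: explain an exit status using the 128+N signal convention.
describe_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # `kill -l N` prints the name of signal number N without the SIG prefix.
    echo "terminated by signal ${sig} (SIG$(kill -l "$sig"))"
  else
    echo "exited with status ${code}"
  fi
}
```

For example, `describe_exit 137` reports signal 9, which may suggest the container was killed rather than shut down by the SIGTERM itself.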

@Bessonov

Similar issue: #11045

@liamawhite
Member

I think the latest in release-1.1 should return 0. I will try to find some time to verify.

@Stono
Contributor Author

Stono commented Apr 3, 2019

Seems to be fine for me in 1.1.1

@Stono Stono closed this as completed Apr 3, 2019
@Stono Stono reopened this Apr 3, 2019
@Stono
Contributor Author

Stono commented Apr 3, 2019

By the way, folks, we have recently started doing this in our pod spec:

            command: ["/bin/bash", "-c"]
            args:
              - |
                trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit" EXIT
                while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
                sleep 2
                {{ $.Values.cron.command }}

This will:

  • Wait for proxy to be ready
  • Tell it to quit when done
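
One possible refinement of the wait step is to bound it, so a Job does not hang forever if the sidecar never becomes ready. This is only a sketch: the readiness URL is the one from the snippet above, while `wait_for_proxy` and `READY_TIMEOUT` are hypothetical names introduced for illustration.

```shell
# Readiness URL as used in the pod spec above.
READY_URL="http://127.0.0.1:15020/healthz/ready"
# Assumed upper bound in seconds (hypothetical setting, not an Istio option).
READY_TIMEOUT="${READY_TIMEOUT:-60}"

# Poll the sidecar once per second; give up after READY_TIMEOUT seconds.
wait_for_proxy() {
  waited=0
  until curl -s -f "$READY_URL" >/dev/null; do
    if [ "$waited" -ge "$READY_TIMEOUT" ]; then
      echo "sidecar not ready after ${READY_TIMEOUT}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Possible use inside the container command (left commented out so this block
# only defines the function):
#   trap 'curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit' EXIT
#   wait_for_proxy || exit 1
#   ...run the actual job here...
```

Failing fast with a non-zero status lets the Job's own retry policy decide what happens next, instead of leaving the pod stuck in the wait loop.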

@mikesimons

@Stono Have you observed the sidecar living for a while after receiving quitquitquit? Ours are living for another minute or so before exiting (although 15000 gets closed immediately).

We're also getting log spam on Completed jobs until we delete them:

info	Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused

@Stono
Contributor Author

Stono commented Apr 30, 2019 via email

@stale

stale bot commented Jul 29, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 29, 2019
@Bessonov

Activity!

@howardjohn
Member

In 1.3 we added a new /quitquitquit endpoint to pilot agent which should resolve this. It's not perfect since it requires some manual action, but I think a longer term solution would depend on kubernetes/kubernetes#65502 or #11366 or similar. If there is anything else we can do to improve this in the short term, I would be happy to take a look.

@rdavyd

rdavyd commented Jan 21, 2022

@howardjohn I did some digging and testing and did not find a way to do such a tweak. It looks like k8s takes the displayed pod status from one of the containers. I tested a pod with 2 containers and without istio-proxy, and its status always returned the exit status of the first container in the array.
But when istio-proxy is added (it is injected as the second container), this behavior changes and the pod status displays the exit status of istio-proxy.
I also didn't find an Envoy API call like /quitquitquit that forces an exit with a non-zero status.

Update: I think I got to the bottom of this. The pod status display message is calculated when it is returned to the client, and in our case it is unstable and sometimes wrong. It returns the termination reason of the last container in the pod.Status.ContainerStatuses array. I don't think the order of this array is controlled by the user.

https://github.com/kubernetes/kubernetes/blob/5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e/pkg/printers/internalversion/printers.go#L812-L813

The link is for version 1.22.5, but in master I see the same.

@sathieu

sathieu commented Feb 8, 2022

Hello, I've written an "operator" to handle this problem until keystone containers are added to k8s.

See: https://gitlab.com/kubitus-project/kubitus-pod-cleaner-operator/-/blob/main/README.md

@ceastman-r7

ceastman-r7 commented Mar 16, 2022

This seems to maintain the error code from the job:

x=$(echo $?); curl -fsI -X POST http://localhost:15020/quitquitquit || true && exit $x

@andrewhibbert

Seems it is not always the same port. The following works for me

x=(echo $?); curl -fsI -X POST http://localhost:15000/quitquitquit && exit $x

@jacekszlachtass

The dollar sign is missing in the above command:

x=$(echo $?); curl -fsI -X POST http://localhost:15000/quitquitquit && exit $x
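
Since the port apparently differs between setups, the two one-liners above could be combined into a sketch that tries both ports and always returns the job's own exit status, even when neither endpoint answers. The ports and curl flags come from the comments above; `quit_sidecar`, `run_then_quit` and `QUIT_PORTS` are hypothetical names.

```shell
# Ports mentioned in this thread: 15020 and 15000.
QUIT_PORTS="${QUIT_PORTS:-15020 15000}"

# Try each port in turn until one quitquitquit call succeeds.
quit_sidecar() {
  for port in $QUIT_PORTS; do
    if curl -fsI -X POST --max-time 2 "http://localhost:${port}/quitquitquit" >/dev/null; then
      return 0
    fi
  done
  echo "could not reach a quitquitquit endpoint" >&2
  return 1
}

# Run the given command, ask the sidecar to quit, and return the command's status.
run_then_quit() {
  job_rc=0
  "$@" || job_rc=$?
  quit_sidecar || true   # a failed quit request must not mask the job's status
  return "$job_rc"
}
```

A container command could then end with something like `run_then_quit ./backup.sh`, where `./backup.sh` stands in for the real job.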

@kschu91

kschu91 commented Jul 15, 2022

Just if anyone is searching for a simple solution: https://github.com/AOEpeople/kubernetes-sidecar-cleaner

We have developed a simple app to clean up the istio-proxies in completed CronJobs.

@ZiaUrRehman-GBI

@kschu91 where should this controller be installed? Inside the istio namespace?

@brunocascio

brunocascio commented Feb 8, 2023

Is there any plan to implement something at istio level rather than using hacky solutions or a 3rd party operator?

@mmerickel

https://istio.io/latest/blog/2022/introducing-ambient-mesh/ can't come soon enough. :-)

@ceastman-r7

There is a kubernetes enhancement that should allow orderly startup and shutdown of multi container pods: kubernetes/enhancements#3759

or ambient :)

@nickzelei

nickzelei commented Feb 9, 2023

Just if anyone is searching for a simple solution: https://github.com/AOEpeople/kubernetes-sidecar-cleaner

We have developed a simple app to clean up the istio-proxies in completed CronJobs.

This worked perfectly for me - thank you for your contribution!

@ZiaUrRehman-GBI - you can deploy it anywhere. The helmchart is deployed with a cluster role so it moderates pods in the entire cluster. At least that's how it works with the helmchart that @kschu91 is providing.
You'd have to roll your own deployment if you wanted it to be scoped to a specific namespace.

@brunocascio

So there's no official solution.

I prefer to use a script to notify Istio rather than installing an unofficial or hacky operator 😃

Thanks everyone

@afirth

afirth commented Aug 30, 2023

This should be addressed by https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/#what-are-sidecar-containers-in-1-28

@wleese wleese mentioned this issue Sep 14, 2023
@bmitchinson

Not a solution by any means, but a reminder that you can opt out of istio when appropriate by using the sidecar.istio.io/inject: "false" pod annotation.
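
As a rough illustration of where that annotation goes in a CronJob, the sketch below prints a manifest with the annotation on the pod template (not on the CronJob itself). The function name, schedule, image and command are hypothetical placeholders.

```shell
# Prints a CronJob manifest whose pods opt out of sidecar injection.
# All values other than the annotation are illustrative placeholders.
render_cronjob_without_sidecar() {
  name=$1
  cat <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ${name}
spec:
  schedule: "0 8 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ${name}
              image: busybox
              command: ["sh", "-c", "echo backup done"]
EOF
}

# Possible use (commented out so this block only defines the function):
#   render_cronjob_without_sidecar backup-at-uk | kubectl apply -f -
```

The obvious trade-off is that such pods run outside the mesh, so this only fits jobs that do not need mesh features.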

@yasensim

Sidecar as a first-class citizen is in alpha behind a feature gate in Kubernetes 1.28 and is beta and enabled by default in Kubernetes 1.29. If/when Istio implements restartPolicy, the istio-proxy lifecycle will align with the main container. This will allow running Jobs on Istio.
For reference HERE .

cc: @howardjohn

@howardjohn
Member

Istio already supports it behind the ENABLE_NATIVE_SIDECARS flag. Note that I highly recommend only using it on K8s 1.29, as some of the lifecycle is broken in 1.28.
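
A small guard along these lines could keep the flag off on older clusters. This is a sketch under assumptions: it assumes ENABLE_NATIVE_SIDECARS is set as an istiod environment variable through `values.pilot.env` (worth checking against the Istio documentation for your version), and both function names are hypothetical. It only prints the command instead of running it.

```shell
# Returns 0 when a Kubernetes version such as "v1.29.3" or "1.29" is at least 1.29.
meets_native_sidecar_minimum() {
  version=${1#v}
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  minor=${minor%%[!0-9]*}   # strip suffixes such as the "+" in "29+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 29 ]; }
}

# Print (not run) the assumed install command when the version is new enough.
print_enable_command() {
  if meets_native_sidecar_minimum "$1"; then
    echo "istioctl install --set values.pilot.env.ENABLE_NATIVE_SIDECARS=true"
  else
    echo "Kubernetes $1 is older than 1.29; leaving native sidecars disabled" >&2
    return 1
  fi
}
```

For instance, `print_enable_command v1.28.5` prints nothing on stdout and returns a non-zero status.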

@kmushegi

Just if anyone is searching for a simple solution: https://github.com/AOEpeople/kubernetes-sidecar-cleaner

We have developed a simple app to clean up the istio-proxies in completed CronJobs.

This is exactly what I needed, thank you!!

@kschu91 where should this controller be installed? Inside the istio namespace?

I installed it into the istio-system namespace since the watcher watches all namespaces: https://github.com/AOEpeople/kubernetes-sidecar-cleaner/blob/main/main.go#L54 and runs with a clusterrole

@Stono
Contributor Author

Stono commented Aug 24, 2024

I'm going to close this supperrrr old issue.
Native sidecars is the solution (albeit some gotchas i've talked about here https://karlstoney.com/moving-to-native-sidecars/)

@Stono Stono closed this as completed Aug 24, 2024
@Callek

Callek commented Aug 28, 2024

I'm going to close this supperrrr old issue. Native sidecars is the solution (albeit some gotchas i've talked about here https://karlstoney.com/moving-to-native-sidecars/)

@Stono from that linked article they actually recommend not using it with istio:

Bit of an update here; I wouldn't do it with Istio just yet! There's a reasonably nasty bug I've stumbled across which means outbound connections from your app will start seeing Connection: close. This has pretty horrid interactions with some HTTP connection pools.

@Stono
Contributor Author

Stono commented Aug 28, 2024

I know, I wrote it :)

My point was more that there won't be a "solution" to this multi year problem without sidecar containers, the actual solution is Sidecar containers. So the choices are either:

Either approach currently has some daemons, but Sidecar containers is less hacky and moving in the correct direction so will only get better with time.

@ruslanguns

Hi @Stono, thanks for your article! I’m still a bit confused about the best way to set this up in a cron job. Could you provide some guidance on how to properly configure the startupProbe with Istio Proxy? What would be the best approach for this?

@Stono
Contributor Author

Stono commented Aug 28, 2024 via email
