Using Istio with CronJobs #11659

Closed
Stono opened this issue Feb 11, 2019 · 19 comments
Contributor

@Stono Stono commented Feb 11, 2019

Hey all,
I have an issue with Istio when it is used with CronJobs or Jobs. When the primary container completes, the Job never completes because the istio-proxy sidecar is still running:

NAME                                  READY     STATUS    RESTARTS   AGE
backup-at-uk-1549872000-7hrx7         1/2       Running   0          34m

I tried adding the following to the end of the primary container's script, as suggested by @costinm in #6324, but that doesn't work (envoy exits, the istio-proxy container doesn't):

curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit
OK

This seems to cause envoy to exit correctly; however, the pilot-agent process in the istio-proxy container is still running:

istio-proxy@backup-at-uk-1549872000-7hrx7:/$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
istio-p+       1  0.0  0.0  32640 18820 ?        Ssl  08:00   0:00 /usr/local/bin/pilot-agent proxy sidecar --concurrency 1 --configPath /etc/istio/proxy --binaryPath /usr/local/bin/envoy --serviceCluster helm-solr-backup --drainDuration

Despite it no longer listening:

istio-proxy@backup-at-uk-1549872000-7hrx7:/$ netstat -plunt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name

The main container can't send a SIGTERM to istio-proxy because it doesn't have permission to do so (quite rightly), so I'm a little stuck.

The only hacky thing I can think of doing is adding a readinessProbe to istio-proxy which checks to see if it's listening and if it isn't, sends the SIGTERM.

Thoughts?

Contributor Author

@Stono Stono commented Feb 11, 2019


@huikang huikang commented Feb 11, 2019

Same issue for me.

Contributor Author

@Stono Stono commented Feb 11, 2019

For those who are interested, we worked around this by adding a livenessProbe to the sidecar injector for istio-proxy:

    livenessProbe:
      exec:
        command:
          - /usr/local/bin/liveness.sh
      initialDelaySeconds: 3
      periodSeconds: 10
      failureThreshold: 5

And then the script looks like this:

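#!/bin/bash
# Liveness check for the istio-proxy container.
set -e

# If envoy has exited but pilot-agent is still running, ask pilot-agent to stop.
if ! pidof envoy &>/dev/null; then
  if pidof pilot-agent &>/dev/null; then
    echo "Envoy is not running, exiting istio-proxy"
    # pidof may print several PIDs, so the substitution is left unquoted on purpose.
    kill -s TERM $(pidof pilot-agent)
    exit 1
  fi
fi

# Fail the probe if nothing is listening on port 15001.
if ! netstat -plunt | grep -q 15001; then
  echo "Istio-proxy not listening on 15001"
  exit 1
fi
exit 0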
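
One caveat with the port check in the script above: a bare match on 15001 would also match a longer number such as 150010. A minimal sketch of a stricter match follows; `port_is_listening` is a hypothetical helper name, and it assumes `netstat -plunt`-style output where the local address ends in `:PORT` followed by whitespace.

```shell
# Hypothetical helper: succeeds if the netstat-style text on stdin
# contains an address ending in exactly ":<port>".
# Usage inside the probe: netstat -plunt | port_is_listening 15001
port_is_listening() {
  # $1 is the port number; the trailing whitespace class prevents
  # 15001 from matching 150010 and similar.
  grep -Eq ":$1[[:space:]]"
}
```

In the liveness script, the last check could then read `if ! netstat -plunt | port_is_listening 15001; then ...`.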


@huikang huikang commented Feb 11, 2019

@Stono thanks for sharing your workaround. Are the above script and the livenessProbe added to the CronJob YAML file? I am asking because I could not understand how to add a livenessProbe to the sidecar injector for istio-proxy. Thanks.

Contributor Author

@Stono Stono commented Feb 11, 2019


@huikang huikang commented Feb 11, 2019

Hi, @Stono, it is still unclear to me how the liveness probe can be added to the istio-proxy pod (my understanding is that the istio-proxy image is not managed by the end user).

Could you point me to some online resources? Thanks.

Contributor Author

@Stono Stono commented Feb 11, 2019

Member

@liamawhite liamawhite commented Feb 12, 2019

I would imagine native Istio support for this would require a similar approach: exposing an additional path on the pilot-agent server that tells it to shut down. This would still require the batch job to call that endpoint after it has finished, though.

I wonder if there is something else we could hook into in Kube that would automatically call this endpoint once the job is finished?

Contributor Author

@Stono Stono commented Feb 12, 2019

@liamawhite do you know of any way to make istio-proxy exit with a 0 status code? At the moment, if I send a SIGTERM I get 137, which causes a CronJob failure. I need an exit signal that will shut it down with a 0.
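
A note on reading the 137: by the usual shell convention, an exit status above 128 means the process was terminated by a signal, with the signal number being the status minus 128. Under that convention 137 corresponds to signal 9 (KILL), while a process ended by TERM would report 143. The small sketch below decodes a status this way; `describe_exit` is a hypothetical helper name, and why a KILL was delivered here is not established in this thread.

```shell
# Hypothetical helper: explain an exit status using the
# "128 + signal number" convention.
describe_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # `kill -l <number>` prints the signal's name, e.g. KILL for 9.
    echo "terminated by signal $sig ($(kill -l "$sig"))"
  else
    echo "exited with status $code"
  fi
}
```

For example, `describe_exit 137` reports signal 9, which is why a Job treats that container as failed rather than completed.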


@Bessonov Bessonov commented Feb 13, 2019

Similar issue: #11045

Member

@liamawhite liamawhite commented Feb 14, 2019

I think the latest in release-1.1 should return 0. I will try to find some time to verify.

Contributor Author

@Stono Stono commented Apr 3, 2019

Seems to be fine for me in 1.1.1

@Stono Stono closed this Apr 3, 2019
@Stono Stono reopened this Apr 3, 2019
Contributor Author

@Stono Stono commented Apr 3, 2019

We have started doing this recently btw folks in our pod spec:

            command: ["/bin/bash", "-c"]
            args:
              - |
                trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit" EXIT
                while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
                sleep 2
                {{ $.Values.cron.command }}

This will:

  • Wait for proxy to be ready
  • Tell it to quit when done
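
The trap-on-EXIT part of this pattern can be tried without a cluster. The sketch below is a minimal illustration under stated assumptions: `request_sidecar_shutdown` is a hypothetical stand-in for the curl call to /quitquitquit, and `run_job` is a hypothetical wrapper. It shows that the EXIT trap fires whether the job succeeds or fails, and that the wrapper still returns the job's own exit status.

```shell
# Hypothetical stand-in for:
#   curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit
request_sidecar_shutdown() {
  echo "sidecar: quit requested"
}

# Hypothetical wrapper: run the given command in a subshell with an
# EXIT trap, so the shutdown request is sent on success and on failure.
run_job() {
  (
    trap request_sidecar_shutdown EXIT
    "$@"
    # Exit explicitly with the job's status; the trap runs afterwards
    # and does not change it.
    exit $?
  )
}

# Example usage (commented out so the block only defines functions):
#   run_job my-backup-command --some-flag
```

Because the status is preserved, a failing job is still reported as failed to Kubernetes even though the sidecar was asked to quit.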

@mikesimons mikesimons commented Apr 30, 2019

@Stono Have you observed the sidecar living for a while after receiving quitquitquit? Ours are living for another minute or so before exiting (although 15000 gets closed immediately).

We're also getting log spam on Completed jobs until we delete them:

info	Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused
Contributor Author

@Stono Stono commented Apr 30, 2019


@stale stale bot commented Jul 29, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 29, 2019

@Bessonov Bessonov commented Jul 31, 2019

Activity!

Member

@howardjohn howardjohn commented Sep 22, 2019

In 1.3 we added a new /quitquitquit endpoint to pilot agent which should resolve this. It's not perfect since it requires some manual action, but I think a longer term solution would depend on kubernetes/kubernetes#65502 or #11366 or similar. If there is anything else we can do to improve these in the short term I would be happy to take a look.


@drshade drshade commented Oct 21, 2019

Took me a while to figure this out, so I wanted to help any others who are battling with this. My approach was to copy @Stono's snippet above, but change the port number to 15020 (to hit the pilot agent, not the envoy proxy directly). This is only available in Istio 1.3 onwards.

command: ["/bin/bash", "-c"]
args:
 - |
   trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15020/quitquitquit" EXIT
   while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
   echo "Ready!"
   < your job >

Hope that helps!
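
One thing to be aware of: the readiness loop in the snippet above waits forever if the proxy never becomes ready. A possible variation, sketched here as an assumption and not as something from the thread, is to bound the wait. `wait_until` is a hypothetical helper that retries any command a fixed number of times.

```shell
# Hypothetical helper: retry a command up to <attempts> times,
# sleeping <delay> seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if it never does.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (commented out; the URL is the one used in the thread):
#   wait_until 60 1 curl -s -f http://127.0.0.1:15020/healthz/ready || exit 1
```

With 60 attempts and a 1 second delay, the job would give up after roughly a minute instead of hanging indefinitely.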
