Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add liveness and readiness to ingress and egress #1055

Closed
wants to merge 9 commits into from

Conversation

@andraxylia
Copy link
Contributor

commented Oct 7, 2017

Reduce likelihood of 404 for Ingress and Egress by ensuring the container does not receive traffic when envoy does not run #1038

Matches istio/old_pilot_repo#1410

The perfect fix is ADS, which will come later.

Add liveness and readiness to ingress and egress
Ingress and Egress will not be ready to receive traffic until envoy starts.
@istio-merge-robot

This comment has been minimized.

Copy link

commented Oct 7, 2017

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
We suggest the following additional approver: geeknoid

Assign the PR to them by writing /assign @geeknoid in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@andraxylia andraxylia changed the title Add liveliness and readiness to ingress and egress Add liveness and readiness to ingress and egress Oct 7, 2017
Copy link
Contributor

left a comment

See below, also please put release notes

readinessProbe:
exec:
command:
- curl http://localhost:15000/clusters && curl http://localhost:15000/listeners && curl http://localhost:15000/routes

This comment has been minimized.

Copy link
@ldemailly

ldemailly Oct 7, 2017

Contributor

1 url should be enough for a liveness check no? (Does envoy have a /ready or one of those is the last one ready)

This comment has been minimized.

Copy link
@andraxylia

andraxylia Oct 7, 2017

Author Contributor

you are right, one is enough, will change.
It does not have /ready, we can suggest an improvement.

Copy link
Member

left a comment

could you explain why this change can help? I thought adding liveness prob to the user's microservices can help...not sure how it can help when adding to ingress and egress pods.

@istio-testing

This comment has been minimized.

Copy link
Collaborator

commented Oct 7, 2017

@andraxylia: Adding do-not-merge/release-note-label-needed because the release note process has not been followed.

One of the following labels is required "release-note", "release-note-action-required", or "release-note-none".
Please see: https://github.com/kubernetes/community/blob/master/contributors/devel/pull-requests.md#write-release-notes-if-needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andraxylia andraxylia removed the do-not-merge label Oct 7, 2017
command:
- /bin/sh
- -c
- curl http://localhost:15000/clusters && curl http://localhost:15000/listeners && curl http://localhost:15000/routes

This comment has been minimized.

Copy link
@ldemailly

ldemailly Oct 7, 2017

Contributor

why not use the standard http probe?

This comment has been minimized.

Copy link
@andraxylia

andraxylia Oct 7, 2017

Author Contributor

Right, this is another option, but we are running out of time for Monday morning.

This comment has been minimized.

Copy link
@andraxylia

andraxylia Oct 7, 2017

Author Contributor

Actually, this requires exposing port 15000. It is not a better option.

This comment has been minimized.

Copy link
@mandarjog

mandarjog Oct 7, 2017

Contributor

I think performing checks without opening up new ports is preferred.

This comment has been minimized.

Copy link
@ldemailly

ldemailly Oct 7, 2017

Contributor

the port is a good argument, I don't know how to configure it so it's only reachable from the probe so for readiness maybe this is a good approach, if it was actually checking something (which it isn't except for something listening on 15000)

@costinm
costinm approved these changes Oct 7, 2017
Copy link
Contributor

left a comment

Looks good - not perfect, but going in the right direction, we'll need those probes for production-ready. In a patch release we'll likely need to add the same thing to kube-inject.

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 7, 2017

This fix has little risk because it is only in the yaml files as opposed to code, something that user can edit easily.

@ldemailly

This comment has been minimized.

Copy link
Contributor

commented Oct 7, 2017

we should not merge this as is, it spawns 4 processes every N seconds for no reason

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 7, 2017

The readiness is only executed at startup.

For liveness, it spawns curl every periodSeconds. I do not see this as a problem, given the benefits. If it is a problem, we could just add just the readiness and find some other solution for liveness. Any proposals?

@ldemailly

This comment has been minimized.

Copy link
Contributor

commented Oct 7, 2017

Again, the proposal is still the same which is to use the built-in kubernetes http probe support instead of a shell and 3 curl processes - why did you pick the custom path, maybe I am missing something?

Readyness should check the output of the url and that it got all the current/latest config from pilot but that seems like a 0.3 feature.(or rather envoy should not bind to the port before being fully ready, either way this should be designed/documented first)

Did you check with the non _debug image ? I don't think it has curl.

command:
- /bin/sh
- -c
- /usr/bin/curl http://localhost:15000/clusters

This comment has been minimized.

Copy link
@mandarjog

mandarjog Oct 7, 2017

Contributor

Can we be consistent with /usr/bin/curl or curl.

This comment has been minimized.

Copy link
@ldemailly

ldemailly Oct 7, 2017

Contributor

doesn't need sh -c and should directly curl or should actually check for the http return code from the url and not just that it can connect (if envoy returns a 503 it will still "pass")

all in all I think liveness/readyness for envoy/pilot needs some thought and work not trying to rush something without a concrete case (envoy crashing is supposed to be detected by pilot-agent; adding ingresses/scaling (where readyness would be necessary)) is a 0.3 feature

command:
- /bin/sh
- -c
- curl http://localhost:15000/clusters && curl http://localhost:15000/listeners && curl http://localhost:15000/routes

This comment has been minimized.

Copy link
@mandarjog

mandarjog Oct 7, 2017

Contributor

I think performing checks without opening up new ports is preferred.

@linsun

This comment has been minimized.

Copy link
Member

commented Oct 8, 2017

Can we not use curl cmd to do health check as that is additional cmd to run at very frequent level? Can we leverage liveness for http request for the port(s) that istio-ingress and istio-egress already expose? https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2017

Changed to use curl directly for liveness. For readiness, we still need to run it from "sh -c", because the yaml syntax does not allow the 3 commands, nor it accepts the Unix ``.

As explained already, using curl is preferable to opening a new port. Do you have any data suggesting any particular issue with curl? Any computation requires some CPU, since when is this a problem? This is how liveness probe work, they involve running a command, at a certain frequency, and, 20s is not very frequent. Health checks in general are not perfect, since there is still a delay until the next probe, but they are used.

Unfortunately we cannot leverage anther port opened by envoy, since there is none I know of, and it is envoy not running we want to detect. Envoy crashing is supposed to be detected by pilot-agent, but as it seems from the issues reported by user, this does not work properly. Health checks in k8s have been introduce to help with this. Running two long running processes in the same container is an anti-pattern, this design is flawed to start with, please see issue istio/old_pilot_repo#968 currently blocked by lack of Key Discovery Service (aka SDS).

And, yes, it is fine if the liveness probe work when envoy returns 503, it means envoy is there. It also means it will not return 404, which is what we are trying to avoid in this PR.

@ldemailly There is nothing more to design for readiness or liveness, maybe to fine tune, and this solution is a step forward. If you have a better solution, you can replace this solution with the better one when you have it ready.

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2017

/test istio-presubmit

@costinm

This comment has been minimized.

Copy link
Contributor

commented Oct 9, 2017

- /bin/sh
- -c
- curl http://localhost:15000/clusters && curl http://localhost:15000/listeners && curl http://localhost:15000/routes
initialDelaySeconds: 3

This comment has been minimized.

Copy link
@wattli

wattli Oct 9, 2017

Contributor

Thanks for putting this together.

Are these numbers (seconds) default value? Just curious where we get this.

This comment has been minimized.

Copy link
@mandarjog

mandarjog Oct 9, 2017

Contributor

@andraxylia we should use curl -f so that curl actually fails on a non 200.

This comment has been minimized.

Copy link
@mandarjog

mandarjog Oct 9, 2017

Contributor

And also check if those endpoint return empty.

@rshriram

This comment has been minimized.

Copy link
Member

commented Oct 9, 2017

fwiw, I saw mentions of Envoy crashing. The issue reported by the user was of 404s and not a hung connection. 404s come from an Envoy that has no routes (but running nevertheless). Whether or not pilot agent exists is immaterial to the issue. And we have known about this issue for 3 months now (ever since RDS came in the picture).

and this PR, as is, is insufficient to tackle the missing route issue. Merely curl-ing /listeners, /clusters, /routes is not useful, since they return 200 OK, but contain empty listener array, empty route array or empty cluster array. The proper solution is to use a simple script that curls and checks for non empty output. Alternatively, add this functionality into the pilot agent itself.

If all you need is a small delay, just skip all the curl stuff. and add an initialDelaySeconds: 10 and a simple exec command that does echo or ls..

@linsun

This comment has been minimized.

Copy link
Member

commented Oct 9, 2017

Personally I haven't hit this issue myself, I've installed istio at least 5 times in the past week or so. The workaround is simple, wait a few seconds if user hits the ingress before the pod is fully coming up and live. (let me know if I'm missing something here.).

Given @rshriram also believes this can't really fix 404 as you won't get 404 if the envoy crashes, I don't think we should spin a new release build just for this PR, plus we need to validate if curl exists in istio proxy (non-debug) before this can go in, or switch to a simpler check like @rshriram suggested. We also need all tests passing before this PR can go in.

@costinm

This comment has been minimized.

Copy link
Contributor

commented Oct 9, 2017

+1 on a simple sleep, no curl.

I was able to repro the 404 on envoy startup, so did Andra. I was also able to reproduce once the 'envoy dead but agent not detecting it' - unfortunately I was not able to debug or get a core dump, and it only happened once. It's ok if we postpone the 'liveness' part, it is not going to result in 404s - but we'll need to add it soon.

@costinm

This comment has been minimized.

Copy link
Contributor

commented Oct 9, 2017

I agree with Shriram on adding some more complete /healtz code to pilot-agent, but I think until this
is ready a simple delay to wait for routes to be ready before accepting traffic is needed, if it can't make
it in this build maybe we can have a dot release soon after.

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2017

I increased the initalDelay to 10s, this is equivalent with sleep 10s, without the code changes.

curl is better than a simple ls command, at least it ensures envoy is there. I do not understand what people have against curl.

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2017

And curl is present in this version we are shipping. When we will ship something different, we will make sure curl is there.

exec:
command:
- curl
- -f

This comment has been minimized.

Copy link
@ldemailly

ldemailly Oct 9, 2017

Contributor

👍 for no sh and curl -f

command:
- /bin/sh
- -c
- curl -f http://localhost:15000/clusters && curl -f http://localhost:15000/listeners && curl -f http://localhost:15000/routes

This comment has been minimized.

Copy link
@ayj

ayj Oct 9, 2017

Contributor

curl -f http://localhost:15000/clusters should be sufficient here to verify envoy is running, though it isn't sufficient to completely avoid 404 if RDS hasn't occurred yet. You'd need to query the /stats endpoint and grep for number of successful RDS attempts.

This comment has been minimized.

Copy link
@wattli

wattli Oct 9, 2017

Contributor

Why not query all three as it is for now?

This comment has been minimized.

Copy link
@ayj

ayj Oct 9, 2017

Contributor

It gives the false impression that it querying multiple paths is meaningful for the purposes of verifying enovy is running AFAICT.

This comment has been minimized.

Copy link
@rshriram

rshriram Oct 9, 2017

Member

Actually, the stats is a very good idea. Much more easier than checking for non empty jsons.. May be we can do the whole thing in one shot? like

test -z `curl http://localhost:15000/stats| grep update_success | cut -d ':' -f2| tr -d ' '|uniq|grep 0`

which is basically going to get LDS, RDS, CDS stats and check the update_success fields (they are usually like foo.bar.update_success: 1 - hence the cut -d .. stuff). Then if its all >0, uniq will squeeze the list and should typically have "1" or "1" and "2" on separate lines. But if it any of these *DS are 0, then uniq will have

0
1``` or some line with 0. The final grep pulls this 0 out and if its true, the test -z fails.

This can go in as readiness probe, which will make sure that the stuff runs until pod is ready, and not forever.

This comment has been minimized.

Copy link
@rshriram

rshriram Oct 9, 2017

Member
test -z $(curl -s http://localhost:15000/stats|grep update_success|cut -d ':' -f2|tr -d ' '|sort -n|uniq|grep -e "^0$")

tested and works.. if non zero, it has non zero (fail) return code.. and if all update success are >0, test -z returns true

This comment has been minimized.

Copy link
@rshriram

rshriram Oct 9, 2017

Member

update: don't use the expression above, because we have a cluster out.orig-dst-cluster-tcp that does not use SDS.. hence its update_success field is always 0.

This comment has been minimized.

Copy link
@andraxylia

andraxylia Oct 9, 2017

Author Contributor

Using ` with the yaml files does not work.
The only way is run run the command with sh -c "long command"

This comment has been minimized.

Copy link
@ldemailly

ldemailly Oct 9, 2017

Contributor

if possible we should avoid running 8 commands to get a ok/not ok status (for instance 1 awk/python/... can do this; or pilot-agent should do it...). As this is only for startup maybe it's ok though.

@andraxylia

This comment has been minimized.

Copy link
Contributor Author

commented Oct 9, 2017

/tickle bot
why is the bot not updating the branch automatically?

@ldemailly

This comment has been minimized.

Copy link
Contributor

commented Oct 9, 2017

it doesn't update the branch, it does a merge automatically if there are no conflicts but you need an lgtm from release folks for that too happen

@linsun
linsun approved these changes Oct 10, 2017
Copy link
Member

left a comment

Thank you for working on this @andraxylia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.