
cmd/contour: Envoy Shutdown Manager #2227

Merged: 4 commits into projectcontour:master on Feb 20, 2020

Conversation

stevesloka (Member)

Fixes #145 by adding a new set of commands to Contour that watch the Envoy Prometheus endpoint and block the pod from terminating while there are open connections.

Also adds a sample Grafana panel which shows the open connections by listener:

[screenshot: Grafana panel showing active connections per Envoy listener]
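
For readers skimming the PR, the blocking behavior described above amounts to polling Envoy's Prometheus stats until the active downstream connection count drops to zero (or a deadline passes). The following is a minimal sketch of that loop, not the merged code; the helper names, the stats address, the polling interval, and the zero-connection exit condition are illustrative assumptions:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// activeConnections scrapes an Envoy Prometheus endpoint and sums every
// envoy_http_downstream_cx_active sample it finds. Illustrative only.
func activeConnections(promURL string) (int, error) {
	resp, err := http.Get(promURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	total := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "envoy_http_downstream_cx_active") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += int(v)
	}
	return total, scanner.Err()
}

func main() {
	// Assumed stats address; the real deployment wires this up differently.
	const promURL = "http://localhost:9001/stats/prometheus"

	// Block (for example, from a preStop hook) until open connections drain.
	for {
		n, err := activeConnections(promURL)
		if err == nil && n == 0 {
			fmt.Println("no open connections, allowing the pod to terminate")
			return
		}
		fmt.Printf("waiting: open connections = %d, err = %v\n", n, err)
		time.Sleep(5 * time.Second)
	}
}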

codecov bot commented Feb 13, 2020

Codecov Report

Merging #2227 into master will decrease coverage by 0.88%.
The diff coverage is 23.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2227      +/-   ##
==========================================
- Coverage   78.24%   77.35%   -0.89%     
==========================================
  Files          57       58       +1     
  Lines        5070     5154      +84     
==========================================
+ Hits         3967     3987      +20     
- Misses       1017     1080      +63     
- Partials       86       87       +1
Impacted Files Coverage Δ
cmd/contour/contour.go 3.94% <0%> (-0.22%) ⬇️
cmd/contour/shutdownmanager.go 25% <25%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c364699...572fdc2.

@stevesloka stevesloka marked this pull request as ready for review February 13, 2020 21:31
@stevesloka (Member, Author)

I'm thinking of adding a flow diagram to the docs to explain the sequence of events and the options.

@youngnick (Member)

I think a flow diagram is an excellent idea.

@stevesloka stevesloka added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 14, 2020
@stevesloka stevesloka force-pushed the shutdownmanager branch 3 times, most recently from c81adb9 to 38f157d Compare February 16, 2020 18:36
@jpeach (Contributor) commented Feb 17, 2020

I'm reviewing now.

@jpeach (Contributor) left a comment

I particularly liked how you bundled the examples and documentation changes in this PR :)

Resolved review threads:
site/docs/master/shutdown-manager.md (5 threads)
cmd/contour/shutdownmanager.go
internal/metrics/parser.go (3 threads)
go.mod

@youngnick (Member) left a comment

LGTM overall with a couple of questions. I think the docs are great, nice work.

Resolved review threads:
site/docs/master/shutdown-manager.md
cmd/contour/shutdownmanager.go (2 threads)

@davecheney (Contributor) commented Feb 19, 2020 via email

@jpeach (Contributor) left a comment

A few nits around log message formatting and so forth. I'd like a bit more clarity around how we should handle errors posting to the health check fail URL though.

Resolved review thread: cmd/contour/shutdownmanager.go

Review thread on cmd/contour/shutdownmanager.go:
envoyAdminURL := fmt.Sprintf("http://%s:%d/healthcheck/fail", s.envoyHost, s.envoyPort)

// Send shutdown signal to Envoy to start draining connections
err := shutdownEnvoy(envoyAdminURL)

@jpeach (Contributor):

@stevesloka Did we reach a resolution here? What is meant to happen to the shutdown process if we couldn't fail the healthcheck out?

Resolved review threads:
cmd/contour/shutdownmanager.go (3 threads)
internal/metrics/parser.go (2 threads)

Review thread on internal/metrics/parser.go:
const prometheusStat = "envoy_http_downstream_cx_active"

func prometheusLabels() []string {
	return []string{"ingress_http", "ingress_https"}
}

Contributor:

Can you use ENVOY_HTTP_LISTENER and ENVOY_HTTPS_LISTENER here?

@stevesloka (Member, Author):

The ENV vars? No, these strings match labels in the prometheus metrics.
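
(For context: those strings appear as label values in Envoy's Prometheus output, roughly like envoy_http_downstream_cx_active{envoy_http_conn_manager_prefix="ingress_http"} 4, where the exact label name and sample value are illustrative rather than taken from this PR.)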

Contributor:

I meant these. But I suppose these strings are coded in so many places that one more doesn't hurt :)

Resolve this if you want to keep this as is.

@stevesloka (Member, Author):

Oh, I see. Hmm, possibly. As it's written now we'd get an import cycle by referencing those, since the contour package already references the metrics one.

@stevesloka (Member, Author):

Maybe we can lift the const somewhere else. It feels bad to add more to the shutdown manager file.

@stevesloka (Member, Author)

Did we reach a resolution here? What is meant to happen to the shutdown process if we couldn't fail the healthcheck out?

@jpeach If we can't tell Envoy to start draining connections then the readiness probe will fail and the pod will wait the full terminationGracePeriodSeconds before dropping all the connections that Envoy has open.

@jpeach (Contributor) commented Feb 20, 2020

If we can't tell Envoy to start draining connections then the readiness probe will fail and the pod will wait the full terminationGracePeriodSeconds before dropping all the connections that Envoy has open.

@stevesloka So in that case, are we worse off than before? That is, Envoy isn't draining, but the shutdown is going to take the full termination period. If we don't retry here, we are guaranteed to consume the full grace period, right? Whereas if we retry, we might succeed and at least start the draining.

@stevesloka (Member, Author)

So in that case, are we worse off than before?

Worse off than before what? I'm not following. The current preStop hook does the exact same call and has the exact same behavior.

That is, envoy isn't draining, but the shutdown is going to take the full termination period. If we don't retry here, we are guaranteed to consume the full grace period, right?

Yes, we'll use the entire grace period unless somehow the connections all drain by themselves.

Whereas if we retry, we might succeed and at least start the draining.

I'd prefer to make this a new issue to follow up on. I honestly don't see this as a case that would be hit. I do think it's possible, but I'd prefer not to hold this PR up on it. Thoughts?

@jpeach (Contributor) commented Feb 20, 2020

Worse off than before what? I'm not following. The current preStop hook does the exact same call and has the exact same behavior.

The current preStop hook calls the healthcheck fail once and then pod shutdown continues. This preStop blocks pod shutdown until the connection count converges. Previously, the preStop would never hold up pod shutdown.

I'd prefer to make this a new issue to follow up on.

Sure, that seems fine.

@stevesloka (Member, Author)

Retry issue: #2262
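
For anyone following the retry discussion above, a retry wrapper around the healthcheck-fail POST could look roughly like the sketch below. The function name, attempt count, and backoff handling are assumptions for illustration; this is not the code merged in this PR, nor necessarily what #2262 will implement.

package shutdown

import (
	"fmt"
	"net/http"
	"time"
)

// shutdownEnvoyWithRetry posts to Envoy's /healthcheck/fail admin endpoint,
// retrying a few times instead of giving up after the first failure.
// Illustrative sketch only.
func shutdownEnvoyWithRetry(envoyAdminURL string, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Post(envoyAdminURL, "", nil)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // Envoy accepted the request and will start draining
			}
			lastErr = fmt.Errorf("POST %s returned HTTP %d", envoyAdminURL, resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(backoff)
	}
	return lastErr
}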

@stevesloka stevesloka merged commit c438eb8 into projectcontour:master Feb 20, 2020
@stevesloka stevesloka deleted the shutdownmanager branch February 20, 2020 21:37
args:
- envoy
- shutdown-manager
image: docker.io/projectcontour/contour:master

Contributor:

should be versioned or :latest

Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes.

Successfully merging this pull request may close: Cleanly draining envoy with DaemonSet/NLB deployment (#145)

4 participants