
cmd/contour: Envoy Shutdown Manager #2227

Merged: 4 commits into projectcontour:master on Feb 20, 2020

Conversation

stevesloka (Member)

Fixes #145 by adding a new set of commands to Contour that watch the Envoy Prometheus endpoint and block the pod from terminating while there are open connections.

Also adds a sample Grafana panel which shows the open connections by listener:

[screenshot: Grafana panel showing active connections per Envoy listener]
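
For readers skimming the PR, the blocking behavior described above amounts to polling Envoy's Prometheus stats until the active downstream connection count drops to zero (or a deadline passes). The following is a minimal sketch of that loop, not the merged code; the helper names, the stats address, the polling interval, and the zero-connection exit condition are illustrative assumptions:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// activeConnections scrapes an Envoy Prometheus endpoint and sums every
// envoy_http_downstream_cx_active sample it finds. Illustrative only.
func activeConnections(promURL string) (int, error) {
	resp, err := http.Get(promURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	total := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "envoy_http_downstream_cx_active") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += int(v)
	}
	return total, scanner.Err()
}

func main() {
	// Assumed stats address; the real deployment wires this up differently.
	const promURL = "http://localhost:9001/stats/prometheus"

	// Block (for example, from a preStop hook) until open connections drain.
	for {
		n, err := activeConnections(promURL)
		if err == nil && n == 0 {
			fmt.Println("no open connections, allowing the pod to terminate")
			return
		}
		fmt.Printf("waiting: open connections = %d, err = %v\n", n, err)
		time.Sleep(5 * time.Second)
	}
}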

codecov bot commented Feb 13, 2020

Codecov Report

Merging #2227 into master will decrease coverage by 0.88%.
The diff coverage is 23.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2227      +/-   ##
==========================================
- Coverage   78.24%   77.35%   -0.89%     
==========================================
  Files          57       58       +1     
  Lines        5070     5154      +84     
==========================================
+ Hits         3967     3987      +20     
- Misses       1017     1080      +63     
- Partials       86       87       +1
Impacted Files Coverage Δ
cmd/contour/contour.go 3.94% <0%> (-0.22%) ⬇️
cmd/contour/shutdownmanager.go 25% <25%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c364699...572fdc2.

@stevesloka stevesloka marked this pull request as ready for review February 13, 2020 21:31
@stevesloka (Member, Author)

I'm thinking of adding a flow diagram to the docs to explain the sequence of events and the options.

@youngnick (Member)

I think a flow diagram is an excellent idea.

@stevesloka stevesloka added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 14, 2020
@stevesloka stevesloka force-pushed the shutdownmanager branch 3 times, most recently from c81adb9 to 38f157d Compare February 16, 2020 18:36
@jpeach (Contributor) commented Feb 17, 2020

I'm reviewing now.

@jpeach (Contributor) left a comment

I particularly liked how you bundled the examples and documentation changes in this PR :)

Resolved review threads:
site/docs/master/shutdown-manager.md (5 threads)
cmd/contour/shutdownmanager.go
internal/metrics/parser.go (3 threads)
go.mod

@youngnick (Member) left a comment

LGTM overall with a couple of questions. I think the docs are great, nice work.

Resolved review threads:
site/docs/master/shutdown-manager.md
cmd/contour/shutdownmanager.go (2 threads)

@davecheney (Contributor) commented Feb 19, 2020 via email

@jpeach (Contributor) left a comment

A few nits around log message formatting and so forth. I'd like a bit more clarity around how we should handle errors posting to the health check fail URL though.

Resolved review thread: cmd/contour/shutdownmanager.go

Review thread on cmd/contour/shutdownmanager.go:
envoyAdminURL := fmt.Sprintf("http://%s:%d/healthcheck/fail", s.envoyHost, s.envoyPort)

// Send shutdown signal to Envoy to start draining connections
err := shutdownEnvoy(envoyAdminURL)

@jpeach (Contributor):

@stevesloka Did we reach a resolution here? What is meant to happen to the shutdown process if we couldn't fail the healthcheck out?

Resolved review threads:
cmd/contour/shutdownmanager.go (3 threads)
internal/metrics/parser.go (2 threads)

Review thread on internal/metrics/parser.go:
const prometheusStat = "envoy_http_downstream_cx_active"

func prometheusLabels() []string {
	return []string{"ingress_http", "ingress_https"}
}

Contributor:

Can you use ENVOY_HTTP_LISTENER and ENVOY_HTTPS_LISTENER here?

@stevesloka (Member, Author):

The ENV vars? No, these strings match labels in the prometheus metrics.
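
(For context: those strings appear as label values in Envoy's Prometheus output, roughly like envoy_http_downstream_cx_active{envoy_http_conn_manager_prefix="ingress_http"} 4, where the exact label name and sample value are illustrative rather than taken from this PR.)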

Contributor:

I meant these. But I suppose these strings are coded in so many places that one more doesn't hurt :)

Resolve this if you want to keep this as is.

@stevesloka (Member, Author):

Oh, I see. Hmm, possibly. As it's written now we'd get an import cycle by referencing those, since the contour package already references the metrics one.

@stevesloka (Member, Author):

Maybe we can lift the const somewhere else. It feels bad to add more to the shutdown manager file.

@stevesloka (Member, Author)

Did we reach a resolution here? What is meant to happen to the shutdown process if we couldn't fail the healthcheck out?

@jpeach If we can't tell Envoy to start draining connections then the readiness probe will fail and the pod will wait the full terminationGracePeriodSeconds before dropping all the connections that Envoy has open.

@jpeach (Contributor) commented Feb 20, 2020

If we can't tell Envoy to start draining connections then the readiness probe will fail and the pod will wait the full terminationGracePeriodSeconds before dropping all the connections that Envoy has open.

@stevesloka So in that case, are we worse off than before? That is, Envoy isn't draining, but the shutdown is going to take the full termination period. If we don't retry here, we are guaranteed to consume the full grace period, right? Whereas if we retry, we might succeed and at least start the draining.

@stevesloka (Member, Author)

So in that case, are we worse off than before?

Worse off than before what? I'm not following. The current preStop hook does the exact same call and has the exact same behavior.

That is, envoy isn't draining, but the shutdown is going to take the full termination period. If we don't retry here, we are guaranteed to consume the full grace period, right?

Yes, we'll use the entire grace period unless somehow the connections all drain by themselves.

Whereas if we retry, we might succeed and at least start the draining.

I'd prefer to make this a new issue to follow up on. I honestly don't see this as a case that would be hit. I do think it's possible, but I'd prefer not to hold this PR up on it. Thoughts?

@jpeach (Contributor) commented Feb 20, 2020

Worse off than before what? I'm not following. The current preStop hook does the exact same call and has the exact same behavior.

The current preStop hook calls the healthcheck fail once and then pod shutdown continues. This preStop blocks pod shutdown until the connection count converges. Previously, the preStop would never hold up pod shutdown.

I'd prefer to make this a new issue to follow up on.

Sure, that seems fine.

@stevesloka (Member, Author)

Retry issue: #2262
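
For anyone following the retry discussion above, a retry wrapper around the healthcheck-fail POST could look roughly like the sketch below. The function name, attempt count, and backoff handling are assumptions for illustration; this is not the code merged in this PR, nor necessarily what #2262 will implement.

package shutdown

import (
	"fmt"
	"net/http"
	"time"
)

// shutdownEnvoyWithRetry posts to Envoy's /healthcheck/fail admin endpoint,
// retrying a few times instead of giving up after the first failure.
// Illustrative sketch only.
func shutdownEnvoyWithRetry(envoyAdminURL string, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Post(envoyAdminURL, "", nil)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // Envoy accepted the request and will start draining
			}
			lastErr = fmt.Errorf("POST %s returned HTTP %d", envoyAdminURL, resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(backoff)
	}
	return lastErr
}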

@stevesloka stevesloka merged commit c438eb8 into projectcontour:master Feb 20, 2020
@stevesloka stevesloka deleted the shutdownmanager branch February 20, 2020 21:37
args:
- envoy
- shutdown-manager
image: docker.io/projectcontour/contour:master

Contributor:

should be versioned or :latest

Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes.

Successfully merging this pull request may close: Cleanly draining envoy with DaemonSet/NLB deployment (#145)

4 participants