
Gracefully drain connections when stopping the gateway #1203

Conversation

norbjd
Contributor

@norbjd norbjd commented Feb 2, 2024

Changes

  • 🐛 fix gateway preStop hook
  • 🎁 gracefully drain connections when stopping the gateway

/kind enhancement

This PR supersedes #1200 (see that PR for more details; TL;DR: the current preStop hook doesn't work because curl is not installed in the gateway image).

When the gateway stops, we want to gracefully drain existing connections to prevent in-flight requests from being rejected with 5xx errors. To do so, we just need to drain the listeners via Envoy's admin interface and its /drain_listeners?graceful endpoint.

The /drain_listeners?graceful endpoint starts the graceful drain process but returns immediately. The drain process is configured with --drain-time-s $(DRAIN_TIME_SECONDS) --drain-strategy immediate, so the drain finishes after DRAIN_TIME_SECONDS. Connections that are still open after DRAIN_TIME_SECONDS are closed automatically; this duration can be tuned through the DRAIN_TIME_SECONDS environment variable. Users might also need to adjust terminationGracePeriodSeconds (default: 30s) depending on their use case, as it only makes sense to have terminationGracePeriodSeconds >= DRAIN_TIME_SECONDS.
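
For illustration, here is a minimal sketch of what such a preStop hook could look like in the gateway Deployment. It assumes the Envoy admin interface is reachable on 127.0.0.1:9001 and that bash is available in the image; the exact port, paths, and script in this PR may differ.

lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - |
        # Ask the Envoy admin interface to start a graceful listener drain.
        # bash's /dev/tcp is used because curl/wget/nc are not shipped in the image.
        exec 3<>/dev/tcp/127.0.0.1/9001
        printf 'POST /drain_listeners?graceful HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\nConnection: close\r\n\r\n' >&3
        cat <&3 > /dev/null
        # Keep the hook running for the whole drain window, so SIGTERM is only
        # delivered after DRAIN_TIME_SECONDS has elapsed.
        sleep "${DRAIN_TIME_SECONDS}"

This matches the ~15 second gap between the drain request and ENVOY_SIGTERM in the logs below, since kubelet only sends SIGTERM once the preStop hook returns.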

This change has been tested manually (in addition to the new e2e test), and the logs look like the following. Here, I stopped the gateway manually (kubectl delete pod) at around 2024-02-02T19:17:22, while there were some in-flight requests. The drain started just after (around 2024-02-02T19:17:23). SIGTERM was then sent 15 seconds later (the time for the preStop hook to finish, since it ends with a sleep for DRAIN_TIME_SECONDS). During those 15 seconds, my in-flight requests were able to finish properly.

[2024-02-02T19:17:20.238Z] "GET /ready HTTP/1.1" 200 - 0 5 0 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02T19:17:23.857Z] "POST /drain_listeners?graceful HTTP/1.1" 200 - 0 3 0 - "-" "-" "-" "localhost" "-"
[2024-02-02T19:17:25.238Z] "GET /ready HTTP/1.1" 200 - 0 5 0 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02T19:17:30.238Z] "GET /ready HTTP/1.1" 200 - 0 5 1 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02T19:17:35.238Z] "GET /ready HTTP/1.1" 200 - 0 5 0 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02 19:17:38.865][1][warning][main] [source/server/server.cc:862] caught ENVOY_SIGTERM
[2024-02-02 19:17:38.865][1][info][main] [source/server/server.cc:993] shutting down server instance
[2024-02-02 19:17:38.865][1][info][main] [source/server/server.cc:928] main dispatch loop exited
[2024-02-02 19:17:38.866][1][warning][config] [./source/common/config/grpc_stream.h:153] StreamAggregatedResources gRPC config stream to xds_cluster closed: 13,
[2024-02-02 19:17:38.867][1][info][main] [source/server/server.cc:980] exiting

Fixes #1118.

Release Note

Gracefully drain connections when stopping the gateway

@knative-prow knative-prow bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 2, 2024
@norbjd norbjd mentioned this pull request Feb 2, 2024

codecov bot commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.31%. Comparing base (575fdab) to head (86fcaeb).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1203      +/-   ##
==========================================
+ Coverage   60.63%   62.31%   +1.67%     
==========================================
  Files          24       24              
  Lines        2002     1632     -370     
==========================================
- Hits         1214     1017     -197     
+ Misses        726      553     -173     
  Partials       62       62              


@norbjd
Contributor Author

norbjd commented Feb 2, 2024

/retest

config/300-gateway.yaml: review comment (outdated, resolved)
@norbjd norbjd requested a review from skonto March 2, 2024 18:50
@skonto
Contributor

skonto commented Mar 11, 2024

Hi @norbjd, thanks again for your work and patience.

I like the approach taken by the Contour project (projectcontour/contour#145), as it does not depend on configuring timeouts. Istio has implemented a similar approach in istio/istio#35059. I see a few steps to get there, at least to some extent:

a) a documented design like https://github.com/projectcontour/contour/blob/main/design/envoy-shutdown.md
b) a sidecar that handles termination based on code similar to https://github.com/projectcontour/contour/blob/main/cmd/contour/shutdownmanager.go

I prefer the Go approach over bash logic for waiting for active connections, because it gets a bit "dirtier" in bash; a bash approach was also proposed here: projectcontour/contour#2177 (comment).
So that part (even in bash) should definitely be in another PR.

In the meantime we can make things configurable (see projectcontour/contour#2177 (comment)), as this PR also tries to do.

Thus we could do:

  1. Keep the healthcheck fail endpoint, as it signals k8s to also remove the pod from the Service, and make the endpoint accessible locally only.

Btw I agree that the envoy image only contains perl, so no nc, socat, curl, etc 🤷. However, you can expose the admin
interface locally only (rather than via a pipe) and target 127.0.0.1:9001 directly. Would that work?

clusters:
  - name: service_stats
    connect_timeout: 0.250s
    type: STATIC
    load_assignment:
      cluster_name: service_stats
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 127.0.0.1
                    port_value: 9001
...

admin:
  access_log:
  - name: envoy.access_loggers.stdout
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9001
  2. Pass the drain parameter drain-time-s with some configurable time, as you do in the PR (see the sketch after this list).

  3. I am not sure about the drain strategy (https://github.com/istio/istio/blob/master/pkg/envoy/proxy.go#L130). Istio sets it to immediate.

--drain-strategy <string>
(optional) Determine behaviour of Envoy during the hot restart drain sequence. During the drain sequence, the drain manager encourages draining through terminating connections on request completion, sending “Connection: CLOSE” on HTTP1, and sending GOAWAY on HTTP2.

gradual: (default) The percentage of requests encouraged to drain increases to 100% as the drain time elapses.
immediate: All requests are encouraged to drain as soon as the drain sequence begins.

I guess we want to notify clients asap (like you have it in the PR).

  4. terminationGracePeriodSeconds should be larger than the drain period, for sure.

  5. Add an e2e test that checks the correctness of what we have.
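
For reference, a sketch of how the drain flags and the configurable drain time could be wired into the gateway container; the container name, image tag, and config path below are placeholders, not the actual kourier manifest:

containers:
- name: kourier-gateway                     # illustrative name
  image: envoyproxy/envoy:v1.29-latest      # illustrative tag
  command: ["/usr/local/bin/envoy"]
  args:
  - --config-path
  - /tmp/config/envoy-bootstrap.yaml        # placeholder path
  - --drain-time-s
  - $(DRAIN_TIME_SECONDS)                   # expanded by Kubernetes from the env var below
  - --drain-strategy
  - immediate                               # encourage clients to reconnect as soon as the drain starts
  env:
  - name: DRAIN_TIME_SECONDS
    value: "15"

With the gradual strategy, the share of requests encouraged to drain would instead ramp up to 100% over the drain window, so clients are notified later on average.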

@skonto
Contributor

skonto commented Mar 27, 2024

@norbjd gentle ping. Do you have cycles to drive the effort here?

@skonto
Contributor

skonto commented Mar 28, 2024

@norbjd hi, please let me know if you have the cycles to work on the PR.

@skonto
Contributor

skonto commented Apr 3, 2024

@norbjd gentle ping!

@norbjd
Contributor Author

norbjd commented Apr 7, 2024

Whoops, really sorry, I've been busy at work these last few weeks. I'll check again right now! Thanks for the detailed answer :)

@knative-prow knative-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 7, 2024
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2024
test/e2e-kind.sh: review comment (outdated, resolved)
@ReToCode
Member

I can put a comment next to it saying it must be greater than DRAIN_TIME_SECONDS, let me know.

Yes please :)

In general I think that approach is good. Could you please also:

  • Update the PR description to match the current status of the code
  • Make sure all tests pass successfully

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@skonto
Contributor

skonto commented Apr 22, 2024

@norbjd hi, gentle ping. Please take a look at @ReToCode's suggestions and let's move on with this one. It has been open for quite some time.

@skonto
Contributor

skonto commented Apr 26, 2024

@norbjd gentle ping

@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2024
Patch env and terminationGracePeriodSeconds at the same time
- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods
@norbjd
Contributor Author

norbjd commented Apr 27, 2024

/retest

@norbjd
Contributor Author

norbjd commented Apr 27, 2024

Thanks for the review, I have fixed the last requested changes ✔️

norbjd/net-kourier@7170cb6...gateway-prestop-hook-wait-until-all-incoming-requests-are-finished

TL;DR:

  • add a comment near terminationGracePeriodSeconds to ensure users set a value greater than DRAIN_TIME_SECONDS (see the sketch below)
  • run the gracefulshutdown e2e test at the end (otherwise this test conflicted with the others, which is what caused the flaky tests): all e2e tests now pass
  • the PR description has been updated
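
For context, the kind of comment meant here looks roughly like this in the gateway Deployment spec (values illustrative, not the exact diff):

# Must be greater than DRAIN_TIME_SECONDS, otherwise the pod is killed
# before the graceful drain in the preStop hook can finish.
terminationGracePeriodSeconds: 30
containers:
- name: kourier-gateway
  env:
  - name: DRAIN_TIME_SECONDS
    value: "15"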

@norbjd norbjd requested a review from ReToCode April 27, 2024 14:10
Member

@ReToCode ReToCode left a comment


/lgtm
/approve

Thanks for doing this @norbjd

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2024

knative-prow bot commented Apr 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: norbjd, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2024
@knative-prow knative-prow bot merged commit cc30bec into knative-extensions:main Apr 29, 2024
42 checks passed
skonto pushed a commit to skonto/net-kourier that referenced this pull request Apr 29, 2024
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image

* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)

* Drain listeners with appropriate endpoint

* Simpler drain + sleep

* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds

* Use a perl script (no need to open the admin HTTP interface!)

* Use bash instead of perl in preStop hook

* Review @skonto comments: use socket address for admin cluster

* [WIP] add graceful shutdown test and tweak CI to just run that test

* [WIP] Fix gracefulshutdown_test.go

* [WIP] try to fix race condition and lint

* [WIP] use initialTimeout + debug

* [WIP] fix gracefulshutdown_test.go logic

* [WIP] refacto and add some comments to clarify

* [WIP] fix lint

* [WIP] reintroduce kind-e2e-upgrade.yaml

* [WIP] add test case when request takes a little longer than the drain time

* [WIP] fix compilation issue

* [WIP] FIx compilation issue (again)

* [WIP] hopefully fix data race

* [WIP] refacto and hopefully fix race condition (use sync.Map)

* [WIP] fix compilation issue

* [WIP] Handle EOF

* [WIP] check gateway pod has been removed + manual debugging

* [WIP] debugging

* [WIP] more debugging

* [WIP] more debugging

* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF

* [WIP] remove debugging related stuff

* Revert all unnecessary changes made for testing

* Revert unnecessary change (livenessProbe)

* Scale to 1 replica

* Typo

* Run gracefulshutdown test first (speed up feedback loop)

* Add a comment for terminationGracePeriodSeconds

* Don't update deployment twice

Patch env and terminationGracePeriodSeconds at the same time

* Fix bad patch

* Run gracefulshutdown test at the end

- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods

* Fix gracefulshutdown test

* Fix gracefulshutdown test

* Lint
openshift-merge-bot bot pushed a commit to openshift-knative/net-kourier that referenced this pull request Apr 30, 2024
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* run hack/update-deps.sh

* update openshift files

---------

Co-authored-by: norbjd <norbjd@users.noreply.github.com>
@norbjd norbjd deleted the gateway-prestop-hook-wait-until-all-incoming-requests-are-finished branch May 1, 2024 09:26
skonto added a commit to skonto/net-kourier that referenced this pull request May 14, 2024
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

openshift-merge-bot bot pushed a commit to openshift-knative/net-kourier that referenced this pull request May 14, 2024
* [SRVKS-1171] [RELEASE-1.12] Graceful shutdown (#130)

* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* update deps

---------

Co-authored-by: norbjd <norbjd@users.noreply.github.com>