
Gracefully drain connections when stopping the gateway #1203

Conversation

norbjd
Contributor

@norbjd norbjd commented Feb 2, 2024

Changes

  • 🐛 fix gateway preStop hook
  • 🎁 gracefully drain connections when stopping the gateway

/kind enhancement

This PR supersedes #1200 (see that PR for more details; TL;DR: the current preStop hook doesn't work because curl is not installed in the gateway image).

When the gateway stops, we want to gracefully drain existing connections to prevent in-flight requests from being rejected with 5xx errors. To do so, we just need to drain the listeners via Envoy's admin interface and its /drain_listeners?graceful endpoint.

The /drain_listeners?graceful endpoint starts the graceful drain process but returns immediately. The drain process is configured with --drain-time-s $(DRAIN_TIME_SECONDS) --drain-strategy immediate, so the drain finishes after DRAIN_TIME_SECONDS. Connections that are still open after DRAIN_TIME_SECONDS are closed automatically; this duration can be tuned through the DRAIN_TIME_SECONDS environment variable. Users might also need to adjust terminationGracePeriodSeconds (default: 30s) depending on their use case, as it only makes sense to have terminationGracePeriodSeconds >= DRAIN_TIME_SECONDS.
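
For illustration, here is a minimal sketch of what such a preStop hook could look like in the gateway Deployment. It assumes the Envoy admin interface is reachable on 127.0.0.1:9001 and that bash is available in the image; the exact port, paths, and script in this PR may differ.

lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - |
        # Ask the Envoy admin interface to start a graceful listener drain.
        # bash's /dev/tcp is used because curl/wget/nc are not shipped in the image.
        exec 3<>/dev/tcp/127.0.0.1/9001
        printf 'POST /drain_listeners?graceful HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\nConnection: close\r\n\r\n' >&3
        cat <&3 > /dev/null
        # Keep the hook running for the whole drain window, so SIGTERM is only
        # delivered after DRAIN_TIME_SECONDS has elapsed.
        sleep "${DRAIN_TIME_SECONDS}"

This matches the ~15 second gap between the drain request and ENVOY_SIGTERM in the logs below, since kubelet only sends SIGTERM once the preStop hook returns.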

This change has been tested manually (in addition to the new e2e test), and the logs look like the following. Here, I stopped the gateway manually (kubectl delete pod) at around 2024-02-02T19:17:22, while there were some in-flight requests. The drain started just after (around 2024-02-02T19:17:23). SIGTERM was then sent 15 seconds later (the time for the preStop hook to finish, since it ends with a sleep for DRAIN_TIME_SECONDS). During those 15 seconds, my in-flight requests were able to finish properly.

[2024-02-02T19:17:20.238Z] "GET /ready HTTP/1.1" 200 - 0 5 0 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02T19:17:23.857Z] "POST /drain_listeners?graceful HTTP/1.1" 200 - 0 3 0 - "-" "-" "-" "localhost" "-"
[2024-02-02T19:17:25.238Z] "GET /ready HTTP/1.1" 200 - 0 5 0 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02T19:17:30.238Z] "GET /ready HTTP/1.1" 200 - 0 5 1 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02T19:17:35.238Z] "GET /ready HTTP/1.1" 200 - 0 5 0 0 "-" "kube-probe/1.28" "***" "internalkourier" "/tmp/envoy.admin"
[2024-02-02 19:17:38.865][1][warning][main] [source/server/server.cc:862] caught ENVOY_SIGTERM
[2024-02-02 19:17:38.865][1][info][main] [source/server/server.cc:993] shutting down server instance
[2024-02-02 19:17:38.865][1][info][main] [source/server/server.cc:928] main dispatch loop exited
[2024-02-02 19:17:38.866][1][warning][config] [./source/common/config/grpc_stream.h:153] StreamAggregatedResources gRPC config stream to xds_cluster closed: 13,
[2024-02-02 19:17:38.867][1][info][main] [source/server/server.cc:980] exiting

Fixes #1118.

Release Note

Gracefully drain connections when stopping the gateway

@knative-prow knative-prow bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 2, 2024
@norbjd norbjd mentioned this pull request Feb 2, 2024

codecov bot commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.31%. Comparing base (575fdab) to head (86fcaeb).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1203      +/-   ##
==========================================
+ Coverage   60.63%   62.31%   +1.67%     
==========================================
  Files          24       24              
  Lines        2002     1632     -370     
==========================================
- Hits         1214     1017     -197     
+ Misses        726      553     -173     
  Partials       62       62              


@norbjd
Contributor Author

norbjd commented Feb 2, 2024

/retest

config/300-gateway.yaml: review comment (outdated, resolved)
@norbjd norbjd requested a review from skonto March 2, 2024 18:50
@skonto
Contributor

skonto commented Mar 11, 2024

Hi @norbjd, thanks again for your work and patience.

I like the approach taken by the Contour project (projectcontour/contour#145), as it does not depend on configuring timeouts. Istio has implemented a similar approach in istio/istio#35059. I see a few steps to get there, at least to some extent:

a) a documented design like https://github.com/projectcontour/contour/blob/main/design/envoy-shutdown.md
b) a sidecar that handles termination based on code similar to https://github.com/projectcontour/contour/blob/main/cmd/contour/shutdownmanager.go

I prefer the Go approach over bash logic for waiting for active connections, because it gets a bit "dirtier" in bash; a bash approach was also proposed here: projectcontour/contour#2177 (comment).
So that part (even in bash) should definitely be in another PR.

In the meantime we can make things configurable (see projectcontour/contour#2177 (comment)), as this PR also tries to do.

Thus we could do:

  1. Keep the healthcheck fail endpoint, as it signals k8s to also remove the pod from the Service, and make the endpoint accessible locally only.

Btw I agree that the envoy image only contains perl, so no nc, socat, curl, etc 🤷. However, you can expose the admin
interface locally only (rather than via a pipe) and target 127.0.0.1:9001 directly. Would that work?

clusters:
  - name: service_stats
    connect_timeout: 0.250s
    type: STATIC
    load_assignment:
      cluster_name: service_stats
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 127.0.0.1
                    port_value: 9001
...

admin:
  access_log:
  - name: envoy.access_loggers.stdout
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9001
  2. Pass the drain parameter drain-time-s with some configurable time, as you do in the PR (see the sketch after this list).

  3. I am not sure about the drain strategy (https://github.com/istio/istio/blob/master/pkg/envoy/proxy.go#L130). Istio sets it to immediate.

--drain-strategy <string>
(optional) Determine behaviour of Envoy during the hot restart drain sequence. During the drain sequence, the drain manager encourages draining through terminating connections on request completion, sending “Connection: CLOSE” on HTTP1, and sending GOAWAY on HTTP2.

gradual: (default) The percentage of requests encouraged to drain increases to 100% as the drain time elapses.
immediate: All requests are encouraged to drain as soon as the drain sequence begins.

I guess we want to notify clients asap (like you have it in the PR).

  4. terminationGracePeriodSeconds should be larger than the drain period, for sure.

  5. Add an e2e test that checks the correctness of what we have.
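
For reference, a sketch of how the drain flags and the configurable drain time could be wired into the gateway container; the container name, image tag, and config path below are placeholders, not the actual kourier manifest:

containers:
- name: kourier-gateway                     # illustrative name
  image: envoyproxy/envoy:v1.29-latest      # illustrative tag
  command: ["/usr/local/bin/envoy"]
  args:
  - --config-path
  - /tmp/config/envoy-bootstrap.yaml        # placeholder path
  - --drain-time-s
  - $(DRAIN_TIME_SECONDS)                   # expanded by Kubernetes from the env var below
  - --drain-strategy
  - immediate                               # encourage clients to reconnect as soon as the drain starts
  env:
  - name: DRAIN_TIME_SECONDS
    value: "15"

With the gradual strategy, the share of requests encouraged to drain would instead ramp up to 100% over the drain window, so clients are notified later on average.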

@skonto
Contributor

skonto commented Mar 27, 2024

@norbjd gentle ping. Do you have cycles to drive the effort here?

@skonto
Contributor

skonto commented Mar 28, 2024

@norbjd hi, please let me know if you have the cycles to work on the PR.

@skonto
Contributor

skonto commented Apr 3, 2024

@norbjd gentle ping!

@norbjd
Contributor Author

norbjd commented Apr 7, 2024

Whoops, really sorry, I've been busy at work these last few weeks. I'll check again right now! Thanks for the detailed answer :)

@knative-prow knative-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 7, 2024
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2024
test/e2e-kind.sh: review comment (outdated, resolved)
@ReToCode
Member

I can put a comment next to it saying it must be greater than DRAIN_TIME_SECONDS, let me know.

Yes please :)

In general I think that approach is good. Could you please also:

  • Update the PR description to match the current status of the code
  • Make sure all tests pass successfully

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@skonto
Contributor

skonto commented Apr 22, 2024

@norbjd hi, gentle ping. Please take a look at @ReToCode's suggestions and let's move on with this one. It has been open for quite some time.

@skonto
Contributor

skonto commented Apr 26, 2024

@norbjd gentle ping

@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2024
Patch env and terminationGracePeriodSeconds at the same time
- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods
@norbjd
Contributor Author

norbjd commented Apr 27, 2024

/retest

@norbjd
Contributor Author

norbjd commented Apr 27, 2024

Thanks for the review, I have fixed the last requested changes ✔️

norbjd/net-kourier@7170cb6...gateway-prestop-hook-wait-until-all-incoming-requests-are-finished

TL;DR:

  • add a comment near terminationGracePeriodSeconds to ensure users set a value greater than DRAIN_TIME_SECONDS (see the sketch below)
  • run the gracefulshutdown e2e test at the end (otherwise this test conflicted with the others, which is what caused the flaky tests): all e2e tests now pass
  • the PR description has been updated
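
For context, the kind of comment meant here looks roughly like this in the gateway Deployment spec (values illustrative, not the exact diff):

# Must be greater than DRAIN_TIME_SECONDS, otherwise the pod is killed
# before the graceful drain in the preStop hook can finish.
terminationGracePeriodSeconds: 30
containers:
- name: kourier-gateway
  env:
  - name: DRAIN_TIME_SECONDS
    value: "15"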

@norbjd norbjd requested a review from ReToCode April 27, 2024 14:10
Member

@ReToCode ReToCode left a comment


/lgtm
/approve

Thanks for doing this @norbjd

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2024

knative-prow bot commented Apr 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: norbjd, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2024
@knative-prow knative-prow bot merged commit cc30bec into knative-extensions:main Apr 29, 2024
42 checks passed
skonto pushed a commit to skonto/net-kourier that referenced this pull request Apr 29, 2024
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image

* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)

* Drain listeners with appropriate endpoint

* Simpler drain + sleep

* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds

* Use a perl script (no need to open the admin HTTP interface!)

* Use bash instead of perl in preStop hook

* Review @skonto comments: use socket address for admin cluster

* [WIP] add graceful shutdown test and tweak CI to just run that test

* [WIP] Fix gracefulshutdown_test.go

* [WIP] try to fix race condition and lint

* [WIP] use initialTimeout + debug

* [WIP] fix gracefulshutdown_test.go logic

* [WIP] refacto and add some comments to clarify

* [WIP] fix lint

* [WIP] reintroduce kind-e2e-upgrade.yaml

* [WIP] add test case when request takes a little longer than the drain time

* [WIP] fix compilation issue

* [WIP] FIx compilation issue (again)

* [WIP] hopefully fix data race

* [WIP] refacto and hopefully fix race condition (use sync.Map)

* [WIP] fix compilation issue

* [WIP] Handle EOF

* [WIP] check gateway pod has been removed + manual debugging

* [WIP] debugging

* [WIP] more debugging

* [WIP] more debugging

* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF

* [WIP] remove debugging related stuff

* Revert all unnecessary changes made for testing

* Revert unnecessary change (livenessProbe)

* Scale to 1 replica

* Typo

* Run gracefulshutdown test first (speed up feedback loop)

* Add a comment for terminationGracePeriodSeconds

* Don't update deployment twice

Patch env and terminationGracePeriodSeconds at the same time

* Fix bad patch

* Run gracefulshutdown test at the end

- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods

* Fix gracefulshutdown test

* Fix gracefulshutdown test

* Lint
openshift-merge-bot bot pushed a commit to openshift-knative/net-kourier that referenced this pull request Apr 30, 2024
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* run hack/update-deps.sh

* update openshift files

---------

Co-authored-by: norbjd <norbjd@users.noreply.github.com>
@norbjd norbjd deleted the gateway-prestop-hook-wait-until-all-incoming-requests-are-finished branch May 1, 2024 09:26
skonto added a commit to skonto/net-kourier that referenced this pull request May 14, 2024
* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

openshift-merge-bot bot pushed a commit to openshift-knative/net-kourier that referenced this pull request May 14, 2024
* [SRVKS-1171] [RELEASE-1.12] Graceful shutdown (#130)

* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* update deps

---------

Co-authored-by: norbjd <norbjd@users.noreply.github.com>