
Pod loses network connection (connection refused errors) during graceful shutdown period #105703

@nilesh-telang

What happened?

Hi,

Our backend infrastructure uses Kubernetes pods to perform analysis of data. As the flow of data increases and decreases through the day, the Kubernetes autoscaler scales the pods up and down.

The analysis operation on a single workload can take up to six minutes. So that in-flight scan operations can complete, the pods:

  1. have terminationGracePeriodSeconds set to 360 seconds
  2. capture the SIGTERM event and prevent the pod from accepting new requests

            SpringApplication springApplication = new SpringApplication(ProductApplication.class);
            springApplication.addListeners(new GracefulShutdownListener());

            // GracefulShutdownListener.onApplicationEvent(ContextClosedEvent event):
            //   1. Prevent new requests from being sent
            //   2. Sleep for 6 minutes
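
The two listener steps above (stop accepting new requests, then wait) can be sketched without a fixed six-minute sleep. The following is a minimal illustration only, not the listener used in this issue: `DrainGate` and its methods are hypothetical names, and it uses only the JDK. It counts in-flight operations and waits until they finish or a timeout expires.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not from the issue): tracks in-flight work during shutdown. */
final class DrainGate {
    private final AtomicBoolean accepting = new AtomicBoolean(true);
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Registers one operation; returns false if the gate is already closed. */
    boolean tryBegin() {
        // Count first, then check, so a concurrent close() cannot miss this operation.
        inFlight.incrementAndGet();
        if (!accepting.get()) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    /** Marks one previously registered operation as finished. */
    void end() {
        inFlight.decrementAndGet();
    }

    /** Stops accepting new operations (step 1 of the listener). */
    void close() {
        accepting.set(false);
    }

    /** Polls until no work is in flight; returns false on timeout or interrupt (step 2). */
    boolean awaitDrained(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (inFlight.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DrainGate gate = new DrainGate();
        gate.tryBegin();                       // one scan is running
        gate.close();                          // shutdown signal received
        System.out.println(gate.tryBegin());   // new work is rejected
        gate.end();                            // the scan finishes
        System.out.println(gate.awaitDrained(360, TimeUnit.SECONDS));
    }
}
```

In a listener like the one above, `close()` followed by `awaitDrained(360, TimeUnit.SECONDS)` would presumably replace the fixed sleep. This only shortens the wait; it does not address the connection refused errors described below.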
    

The issue is that although the pods wait 6 minutes for in-flight operations to complete, those operations fail on outbound network communication. For example, the pod gets a Connection Refused error when communicating with SNS during the graceful shutdown phase.

Investigation reveals that the pods fail in external network communication with multiple applications during the graceful shutdown phase.

I have gone through multiple tickets and documents in a similar area:

  1. Network becomes unavailable on terminating pods #44956
  2. Graceful terminations within Kubernetes ardanlabs/service#189
  3. https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/

One comment on the same area seems to be #86280 (comment)

I am raising this ticket as we require the ability to create new external connections (SNS) during graceful shutdown.

Primarily, the document https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/ describes a B flow that starts on pod deletion and removes the pod from iptables. It seems to me that this may result in the pod not being able to create new network connections. If this is the case, we would need a mechanism to delay the B flow until the grace period is over.

I am not sure if sleeping in the pre-stop hook is the recommended mechanism to prevent pods from losing their network connection during the graceful shutdown phase.

What did you expect to happen?

Any Kubernetes pod, during its graceful shutdown period, should have unrestricted access to required resources (network) and the ability to create new connections. This ability need not be the default and could be enabled through configuration.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start a pod which perpetually creates a new connection with and updates an external entity
  2. Terminate the pod manually

Connection refused errors should be seen on attempts to create new connections.
Connection refused errors may also be seen for the readiness probe.
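
A reproducer along the lines of step 1 could look like the sketch below. It is an assumption-laden illustration, not the workload from this issue: `ConnectionProbe` is a hypothetical name, the default target `example.com:443` is a placeholder, and the 360-second shutdown hook simply mirrors the grace period described earlier. It opens a fresh TCP connection once per second and keeps doing so after SIGTERM, because the JVM keeps other threads running while shutdown hooks execute.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Instant;

/** Hypothetical reproducer: repeatedly opens new outbound TCP connections. */
final class ConnectionProbe {

    /** Opens and closes one connection; returns null on success, else a failure description. */
    static String tryConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return null;
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder target; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "example.com";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 443;

        // Keep the JVM alive for the assumed grace period after SIGTERM,
        // so the loop below continues probing during graceful shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println(Instant.now() + " SIGTERM received, still probing");
            try {
                Thread.sleep(360_000L);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }));

        while (true) {
            String failure = tryConnect(host, port, 2000);
            System.out.println(Instant.now() + " " + (failure == null ? "connected" : failure));
            Thread.sleep(1000);
        }
    }
}
```

Run in a pod with a matching terminationGracePeriodSeconds, the timestamps of any failures logged after the "SIGTERM received" line would show whether new connections are refused during the grace period.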

We have not found this to be always reproducible in our test clusters; however, it is seen very frequently in our clusters serving continuous data.

Anything else we need to know?

Evidence that the pod is losing the ability to communicate with external components:

  1. Connection refused errors to multiple components during the graceful shutdown phase.
  2. Pod events display a connection refused event for the readiness probe a minute after shutdown:
```
38m         Normal    Killing             pod/scan-434f242-fr23e   Stopping container xyz
38m         Normal    Killing             pod/scan-434f242-fr23e   Stopping container pqr
38m         Normal    Killing             pod/scan-434f242-fr23e   Stopping container abc
38m         Warning   FailedPreStopHook   pod/s434f242-fr23e   Exec lifecycle hook ([]) for Container "abc" in Pod "scan-434f242-fr23e_dss(19451a4f-2w32-4221-we32-4a3b0b169a7c)" failed - error: command '' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: exec: \"\": executable file not found in $PATH: unknown\r\n"
37m         Warning   Unhealthy           pod/scan-434f242-fr23e   Readiness probe failed: Get http://10.111.2.71:8080/actuator/health/readiness: dial tcp 10.203.9.71:8080: connect: connection refused
33m         Warning   Unhealthy           pod/scan-434f242-fr23e   Readiness probe failed: Get http://10.111.2.71:4191/ready: dial tcp 10.203.9.71:4191: connect: connection refused
38m         Warning   Unhealthy           pod/scan-434f242-fr23e   Liveness probe failed: Get http://10.112.2.71:4191/live: dial tcp 10.111.2.71:4191: connect: connection refused
38m         Warning   Unhealthy           pod/scan-434f242-fr23e   Liveness probe failed: Get http://10.111.2.71:8080/actuator/health/liveness: dial tcp 10.203.9.71:8080: connect: connection refused
```

This does not always reproduce.

### Kubernetes version

<details>

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}

Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-eks-087e67", GitCommit:"087e67e479962798594218dc6d99923f410c145e", GitTreeState:"clean", BuildDate:"2021-07-31T01:39:55Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
```

</details>

Cloud provider

AWS

OS version

# cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

Install tools

Details

Container runtime (CRI) and version (if applicable)

Docker container

Related plugins (CNI, CSI, ...) and versions (if applicable)

Details

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
kind/support: Categorizes issue or PR as a support question.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
