Our backend infrastructure uses Kubernetes pods to analyze data. As the flow of data rises and falls through the day, the Kubernetes autoscaler scales the pods up and down.
The analysis operation on a single workload can take up to six minutes. To allow in-flight scan operations to complete successfully, the pods:
- have `terminationGracePeriodSeconds` set to 360 seconds
- capture the SIGTERM event and stop accepting new requests

    SpringApplication springApplication = new SpringApplication(ProductApplication.class);
    springApplication.addListeners(new GracefulShutdownListener());

`GracefulShutdownListener` -> `onApplicationEvent(ContextClosedEvent event)`:
1. Prevent new requests from being sent
2. Sleep for 6 minutes
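
The two listener steps above could be sketched in plain Java roughly as follows. This is only an illustration, not our actual listener: `InFlightTracker` and its method names are hypothetical, and the sketch waits until in-flight work has drained (up to a deadline) instead of sleeping for a fixed six minutes. It does not address the network failures described in this issue.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: rejects new work once shutdown starts and lets the
// shutdown path wait for in-flight work, bounded by a timeout.
class InFlightTracker {
    private final Object lock = new Object();
    private boolean shuttingDown = false;
    private int inFlight = 0;

    // Called before starting a scan; returns false once shutdown has begun.
    boolean tryBegin() {
        synchronized (lock) {
            if (shuttingDown) {
                return false;
            }
            inFlight++;
            return true;
        }
    }

    // Called when a scan finishes, successfully or not.
    void end() {
        synchronized (lock) {
            inFlight--;
            lock.notifyAll();
        }
    }

    // Step 1: prevent new requests from being accepted.
    void beginShutdown() {
        synchronized (lock) {
            shuttingDown = true;
        }
    }

    // Step 2: wait until all in-flight work is done or the timeout expires.
    // Returns true if everything drained in time.
    boolean awaitDrained(Duration timeout) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        synchronized (lock) {
            while (inFlight > 0) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                TimeUnit.NANOSECONDS.timedWait(lock, remainingNanos);
            }
            return true;
        }
    }
}
```

A listener's `onApplicationEvent` could then call `beginShutdown()` followed by `awaitDrained(Duration.ofSeconds(360))`, assuming request handlers wrap each scan in `tryBegin()` / `end()`.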
The issue is that although the pods wait 6 minutes for in-flight operations to complete, these operations fail on outbound network communication. For example, the pod gets a `Connection Refused` error when communicating with SNS during the graceful shutdown phase.
Investigation reveals that pods fail in external network communication with multiple applications during the 'graceful shutdown' phase.
I have gone through multiple tickets and documents in a similar area. One comment on the same area seems to be #86280 (comment).
I am raising this ticket as we require the ability to create new external connections (SNS) during graceful shutdown.
Primarily, I see in the document https://freecontent.manning.com/handling-client-requests-properly-with-kubernetes/ that a B flow is started on pod deletion, which removes the pod from iptables. It seems to me that this may result in the pod not being able to create new network connections. If this is the case, we would need a mechanism to delay the B flow until the grace period is over.
I am not sure whether sleeping in the pre-stop hook is the recommended mechanism to prevent pods from losing network connectivity during the graceful shutdown phase.
What did you expect to happen?
Any Kubernetes pod in its graceful shutdown period should have unrestricted access to required resources (network) and the ability to create new connections. This need not be the default behavior; it could be enabled through a configuration setting.
How can we reproduce it (as minimally and precisely as possible)?
1. Start a pod that perpetually opens new connections to, and updates, an external entity.
2. Terminate the pod manually.
3. Attempts to create new connections should fail with connection refused.
4. A connection refused error may also be seen for the readiness probe.

We have not found this to be always reproducible in our test clusters; however, it occurs very frequently in our clusters serving continuous data.
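
A minimal sketch of the kind of workload meant in step 1, assuming a plain TCP connect is enough to show the failure. The class name `ConnectionProbe`, the default target `example.com:443`, and the 360-second hold in the shutdown hook are placeholders for illustration, not our production code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Instant;

// Hypothetical reproducer: opens a fresh TCP connection every second and
// keeps doing so after SIGTERM, for the length of the grace period.
class ConnectionProbe {

    // Opens a new connection and reports whether the connect succeeded.
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "OK";
        } catch (IOException e) {
            return "FAIL " + e.getMessage();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String host = args.length > 0 ? args[0] : "example.com"; // placeholder target
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 443;
        long graceMillis = 360_000L; // assumed to match terminationGracePeriodSeconds

        // The JVM runs shutdown hooks on SIGTERM. Blocking in the hook keeps
        // the JVM alive, so the loop below continues probing until the hook
        // returns or the kubelet sends SIGKILL.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println(Instant.now() + " shutdown requested, probing continues");
            try {
                Thread.sleep(graceMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        while (true) {
            System.out.println(Instant.now() + " " + probe(host, port, 2000));
            Thread.sleep(1000);
        }
    }
}
```

If the behavior described here occurs, the log should switch from `OK` to `FAIL` lines some time after the shutdown message, while the pod is still within its grace period.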
Anything else we need to know?
Evidence indicating that the pod is losing the ability to communicate with external components:
- Connection refused errors to multiple components during the graceful shutdown phase.
- Pod events show a connection refused event for the readiness probe a minute after shutdown:
```
38m Normal Killing pod/scan-434f242-fr23e Stopping container xyz
38m Normal Killing pod/scan-434f242-fr23e Stopping container pqr
38m Normal Killing pod/scan-434f242-fr23e Stopping container abc
38m Warning FailedPreStopHook pod/s434f242-fr23e Exec lifecycle hook ([]) for Container "abc" in Pod "scan-434f242-fr23e_dss(19451a4f-2w32-4221-we32-4a3b0b169a7c)" failed - error: command '' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: exec: \"\": executable file not found in $PATH: unknown\r\n"
37m Warning Unhealthy pod/scan-434f242-fr23e Readiness probe failed: Get http://10.111.2.71:8080/actuator/health/readiness: dial tcp 10.203.9.71:8080: connect: connection refused
33m Warning Unhealthy pod/scan-434f242-fr23e Readiness probe failed: Get http://10.111.2.71:4191/ready: dial tcp 10.203.9.71:4191: connect: connection refused
38m Warning Unhealthy pod/scan-434f242-fr23e Liveness probe failed: Get http://10.112.2.71:4191/live: dial tcp 10.111.2.71:4191: connect: connection refused
38m Warning Unhealthy pod/scan-434f242-fr23e Liveness probe failed: Get http://10.111.2.71:8080/actuator/health/liveness: dial tcp 10.203.9.71:8080: connect: connection refused
```
This does not always reproduce.
### Kubernetes version
<details>
```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-eks-087e67", GitCommit:"087e67e479962798594218dc6d99923f410c145e", GitTreeState:"clean", BuildDate:"2021-07-31T01:39:55Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
```

</details>
Cloud provider
AWS
OS version

    # cat /etc/os-release
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"

Install tools
Container runtime (CRI) and version (if applicable)
Docker container
Related plugins (CNI, CSI, ...) and versions (if applicable)