Using proxy-next-upstream when proxy-connect-timeout happens? #4944
You can see exactly what nginx is doing by reading the logs. When a retry happens, the $upstream_addr field becomes a list in which you can see the IPs of the pods, along with the status code that triggered each retry.
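For illustration, a custom log_format along these lines makes the retry trail visible; the variable semantics are standard nginx, while the format name and sample addresses below are only illustrative:

```nginx
# On a retried request, $upstream_addr and $upstream_status each hold one
# comma-separated entry per attempt, e.g.:
#   upstream_addr="10.0.1.5:8080, 10.0.2.7:8080" upstream_status="504, 200"
log_format retrytrail '$remote_addr "$request" '
                      'upstream_addr="$upstream_addr" '
                      'upstream_status="$upstream_status" '
                      'request_time=$request_time';
```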
@aledbf I changed my config to ensure that proxy_next_upstream_timeout is greater than proxy_connect_timeout, and I think it is working now. Thanks for the tip about the upstream addr being an array. In the traces and logs I mentioned, the upstream addr field only has one entry.
@aledbf After testing, proxy_next_upstream does work, but the behaviour is wrong. With proxy_next_upstream_tries set to 3 and a proxy_connect_timeout of 3s, I get a consistent 9s latency on my 504 errors. It was still failing, and upon checking $upstream_addr, the 3 pod IPs used were all identical. But it also works at times! This is my config:

```yaml
proxy-connect-timeout: "3"
proxy-next-upstream: "error timeout http_502 http_503 http_504 non_idempotent"
proxy-next-upstream-timeout: "10"
# Fix to prevent status code 499
http-snippet: |
  proxy_ignore_client_abort on;
```
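As a sketch of how these keys fit together in the controller ConfigMap (the ConfigMap name and namespace below are assumptions, and the sizing comments just restate the 3 × 3s arithmetic from the comment above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller    # name/namespace are assumptions
  namespace: ingress-nginx
data:
  proxy-connect-timeout: "3"        # bounds each individual connect attempt
  proxy-next-upstream: "error timeout http_502 http_503 http_504 non_idempotent"
  proxy-next-upstream-tries: "3"    # worst case: 3 attempts x 3s = 9s
  proxy-next-upstream-timeout: "10" # must cover the whole 9s retry window
```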
@bzon I am sorry, but what are you asking exactly?
This means you only have one pod running.
This means you have more than one. If you increase the tries, you will see the same IP:port more than once.
But there were 2 available pods running during that trace. How can this happen? Maybe DNS? 🤔 Anyway, I can now close this issue, since the question in the title has been answered.
There is no DNS involved in ingress-nginx. We use the client-go library to reach the API server and get the endpoints (pods) directly. Please check https://kubernetes.github.io/ingress-nginx/how-it-works/
But I'm pretty sure that there were at least 2 available pods running, because I was doing all these tests by hand: deleting pods manually while the other 2 pods were receiving requests as well...
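One way to verify what the controller can see is to watch the Endpoints object of the Service during such a test; `my-service` below is a placeholder name:

```sh
# Watch endpoint churn while pods are being deleted
kubectl get endpoints my-service --watch
```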
Same issue here. In our case, a "chaos controller" terminated one node, and we could see ingress 504 errors. There were 2 pods: one that died and one that was on a healthy node. There were 3 ingresses in total, with the same situation in all of them. In a 30s period, there were a total of 79 requests to the dying pod, out of which:
Any idea why the same ingress controller has a different retry setup for the same service? These three cases on the same ingress were mixed in time (it was not that all case-1 requests came first, then case 2, then case 3 at the end; they were evenly distributed in time).
@mikksoone I have the same issue as you. Did you somehow manage to get this problem under control? Do you make use of next-upstream retries?
@mikksoone @bzon same issue here
Same issue here: multiple pods running, but our logs indicate that in some cases the request fails after consistently retrying the failed pod (I guess with 3 pods, of which 1 failed, and a maximum of 3 retries, the odds are 1 in 9 (3×3)). Maybe the Lua script only updates the list of endpoints every so often, or in a multi-node cluster it takes a while for all nodes to discover the pod's unavailability?
Same issue here: some of the timeout requests are retried, some are not. No solution so far.
It's 2022 and I have the same issue: an Nginx ingress controller behind an ALB, targeting a microservice running 3 instances. When doing a node upgrade, we get the odd 504 error (not many, but still a handful, like 10). When looking in detail, we end up with the same conclusion: it is retrying the same address even though we have at least 2 available instances at all times (a PDB with maxUnavailable set to 1).
Is there an existing bug report for the case where only one pod is running? I didn't change any default config.
Found the solution myself: #5524 (comment)
@yifeng-cerebras "proxy_next_upstream_timeout 0;" means no timeout. You are making 3 tries, each with an unlimited timespan, which is consistent with your result...
@bellasys thanks for the reply
From the nginx docs on proxy_next_upstream_timeout:
> Limits the time during which a request can be passed to the next server. The 0 value turns off this limitation.

To me this reads that his service will not apply any overall time limit to the process of trying the next upstreams. So it should only be limited by the 5-second connect timeout, which is applied to each connection attempt (3 attempts in total, or 15 seconds).
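In plain nginx terms, the combination under discussion would look roughly like this; it is a sketch rather than the exact config the controller renders, and the upstream name is illustrative:

```nginx
location / {
    proxy_connect_timeout       5s;  # bounds each individual connect attempt
    proxy_next_upstream         error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries   3;   # at most 3 attempts in total
    proxy_next_upstream_timeout 0;   # 0 turns off the overall retry deadline
    # worst case: 3 attempts x 5s connect timeout = 15s before the client
    # finally receives a 504
    proxy_pass http://upstream_balancer;  # upstream name illustrative
}
```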
When an upstream pod changes its IP for some reason, such as a pod or node restart, we get a small fraction of 504 errors with a consistent 5-second timeout. In the error logs and traces it is very clear that the upstream pod IP address used by Nginx no longer exists. With pod disruption budgets, we almost always have at least 3 available replicas of these upstream pods.

We have `proxy_connect_timeout=5s`. These are the other settings I've extracted from the Nginx ingress conf file. Can we use `proxy_next_upstream` and hope that it goes to the next available pod?
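For reference, here is a sketch of the per-Ingress variant of these retry settings via annotations; the annotation keys are the standard ingress-nginx ones, while the resource names, host, and values are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress            # placeholder
  annotations:
    # retry on connect errors/timeouts and on 5xx responses from the upstream
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
    nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: "20"
spec:
  rules:
  - host: example.com              # placeholder
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service       # placeholder
            port:
              number: 80
```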