
Using proxy-next-upstream when proxy-connect-timeout happens? #4944

Closed
bzon opened this issue Jan 17, 2020 · 20 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

bzon commented Jan 17, 2020

When an upstream pod changes its IP for some reason, such as a pod or node restart, we get a small fraction of 504 errors with a consistent 5-second timeout. In the error logs and traces it is very clear that the upstream pod IP address used by Nginx no longer exists. With pod disruption budgets, we almost always have at least 3 available replicas of these upstream pods.

We have proxy_connect_timeout=5s. These are the other settings I've extracted from the Nginx ingress conf file.

proxy_connect_timeout                   5s;
proxy_send_timeout                      60s;
proxy_read_timeout                      60s;
proxy_next_upstream                     error timeout http_502 http_503 http_504 non_idempotent;
proxy_next_upstream_timeout             1;
proxy_next_upstream_tries               3;

Can we use proxy_next_upstream and hope that it goes to the next available pod?

bzon added the kind/support label Jan 17, 2020
aledbf (Member) commented Jan 17, 2020

Can we use proxy_next_upstream and hope that it goes to the next available pod?

You can see exactly what nginx is doing by reading the logs. When a retry is required, the $upstream_addr field becomes an array in which you can see the IPs of the pods and also the status code that triggered the retry:
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/log-format/
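
A hypothetical example of those list-valued fields for a retried request (the addresses and timings are made up; the exact layout depends on the configured log-format-upstream):

upstream_addr: 10.0.1.12:8080, 10.0.2.34:8080
upstream_response_time: 5.000, 0.007
upstream_status: 504, 200

Here the first attempt hit the connect timeout and was logged as 504, and the retry against a second pod returned 200.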

bzon (Author) commented Jan 17, 2020

@aledbf I changed my config to ensure that proxy_next_upstream_timeout is more than proxy_connect_timeout and I think it is working now.

Thanks for the tip about the upstream_addr being an array. In the traces and logs I mentioned, the upstream_addr field only has one entry.
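
A minimal sketch of that change using the ingress-nginx ConfigMap keys (the values are illustrative, not the exact ones from this cluster):

proxy-connect-timeout: "5"
proxy-next-upstream: "error timeout http_502 http_503 http_504 non_idempotent"
# must be larger than proxy-connect-timeout, otherwise the retry window is
# already exhausted by the time the first connect attempt times out
proxy-next-upstream-timeout: "15"
proxy-next-upstream-tries: "3"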

bzon (Author) commented Jan 17, 2020

@aledbf the proxy_next_upstream retry works after testing, but the behaviour seems wrong.

With proxy_next_upstream_tries set to 3 and a proxy_connect_timeout of 3 s, I get a consistent 9 s latency (3 tries × 3 s connect timeout) with my 504 errors.

[screenshot: trace showing the consistent 9 s latency]

It was still failing and upon checking the $upstream_addr, the 3 pod IPs used are all identical.

[screenshot: $upstream_addr showing three identical pod IPs]

But it also works!

[screenshot: a request successfully retried against a different pod]

This is my config.

proxy-connect-timeout: "3"
proxy-next-upstream: "error timeout http_502 http_503 http_504 non_idempotent"
proxy-next-upstream-timeout: "10"

# Fix to prevent status code 499
http-snippet: |
  proxy_ignore_client_abort on;

aledbf (Member) commented Jan 17, 2020

@bzon I am sorry but what are you asking exactly?

aledbf (Member) commented Jan 17, 2020

It was still failing and upon checking the $upstream_addr, the 3 pod IPs used are all identical.

This means you only have one pod running

aledbf (Member) commented Jan 17, 2020

But it also works!

This means you have more than one. If you increase the tries you will see the same IP:port more than once

bzon (Author) commented Jan 17, 2020

This means you only have one pod running

But there were 2 available pods running during that trace time. How can this happen? Maybe DNS? 🤔

Anyway, I can now close this issue since the question in the title has been answered.

bzon closed this as completed Jan 17, 2020
aledbf (Member) commented Jan 17, 2020

Maybe DNS?

There is no DNS involved in ingress-nginx. We use the client-go library to reach the api server and get the endpoints (pods) directly. Please check https://kubernetes.github.io/ingress-nginx/how-it-works/

bzon (Author) commented Jan 17, 2020

There is no DNS involved in ingress-nginx. We use the client-go library to reach the api server and get the endpoints (pods) directly. Please check https://kubernetes.github.io/ingress-nginx/how-it-works/

But I'm pretty sure there were at least 2 available pods running, because I was doing all these tests manually: deleting pods by hand while the 2 other pods were still receiving requests...

mikksoone commented

Same issue here. In our case, a "chaos controller" terminated one node, and we could see ingress 504 errors. There were 2 pods: one that died and one on a healthy node. We have a total of 3 ingresses, and the same situation occurred in all of them.

In a 30 s period there were a total of 79 requests routed to the dying pod, of which:

  1. 51 requests got 200 OK from the second pod after 5 s:
     upstream_addr: 100.115.211.230:14444, 100.104.239.13:14444
     upstream_response_length: 0, 0
     upstream_response_time: 5.000, 0.008
     upstream_status: 504, 200
  2. 12 requests got 200 OK after 10 s (two retries against the failed pod, third retry against the healthy pod):
     upstream_addr: 100.115.211.230:14444, 100.115.211.230:14444, 100.104.239.13:14444
     upstream_response_length: 0, 0, 1480
     upstream_response_time: 5.000, 5.000, 0.004
     upstream_status: 504, 504, 200
  3. 13 requests got 504 after 15 s (three retries, all to the dying pod):
     upstream_addr: 100.115.211.230:14444, 100.115.211.230:14444, 100.115.211.230:14444
     upstream_response_length: 0, 0, 0
     upstream_response_time: 5.000, 5.000, 5.000
     upstream_status: 504, 504, 504

Any idea why the same ingress controller shows different retry behaviour for the same service? These three cases on the same ingress were mixed in time (it was not that all case-1 requests came first, then case 2, then case 3 at the end; they were evenly distributed over time).

bzon (Author) commented Dec 22, 2020

@mikksoone I have the same issue as you. Did you somehow manage to get this problem under control? Do you make use of next-upstream retries?

lonre commented Jan 4, 2021

@mikksoone @bzon same issue here

erdebee commented Aug 9, 2021

Same issue here: multiple pods running, but our logs indicate that in some cases (I guess with 3 pods of which 1 failed and a maximum of 3 retries, the odds are 1 in 9 (3x3)) the request fails after consistently calling the failed node. Maybe the Lua script only updates the list of nodes every so often, or with a multi-node cluster it takes a while for all nodes to discover the pod's unavailability?

tiejunhu commented

Same issue here: some of the timed-out requests are retried, some are not. No solution so far.

tontondematt commented Sep 30, 2022

It's 2022 and I have the same issue. Nginx ingress controller behind an ALB, targeting a microservice running 3 instances. When doing a node upgrade we get the odd 504 error (not many, but still a handful, like 10). When looking at it in detail we end up with the same conclusion: it is retrying on the same address even though we have at least 2 available instances at all times (a PDB with maxUnavailable set to 1).

upstream_addr 100.65.187.226:8080, 100.65.187.226:8080, 100.65.187.226:8080
upstream_response_length 0, 0, 0
upstream_response_time 5.000, 5.000, 5.000
upstream_status 504, 504, 504
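
For reference, a minimal sketch of the kind of PodDisruptionBudget described above (the name and labels are placeholders, not taken from this report):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-microservice-pdb       # placeholder
spec:
  maxUnavailable: 1               # at most one instance down during voluntary disruptions
  selector:
    matchLabels:
      app: my-microservice        # placeholder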

yifeng-cerebras commented

Is there an existing bug report for the case where only one pod is running?
I get this infinite retry against the same IP over and over again.
Is retrying expected to be turned off when there is only one pod?

I didn't change any of the default config:

		proxy_connect_timeout                   5s;
		proxy_send_timeout                      60s;
		proxy_read_timeout                      60s;

		proxy_buffering                         off;
		proxy_buffer_size                       4k;
		proxy_buffers                           4 4k;

		proxy_max_temp_file_size                1024m;

		proxy_request_buffering                 on;
		proxy_http_version                      1.1;

		proxy_cookie_domain                     off;
		proxy_cookie_path                       off;

		# In case of errors try the next upstream server before returning an error
		proxy_next_upstream                     error timeout;
		proxy_next_upstream_timeout             0;
		proxy_next_upstream_tries               3;

		grpc_pass grpc://upstream_balancer;

		proxy_redirect                          off;


The corresponding log entry, condensed (the raw line repeats the same upstream for every attempt):

upstream_addr: 192.168.65.97:9000, repeated for all 63 attempts
upstream_response_length: 0 for every attempt
upstream_response_time: roughly 60 s per attempt (60.048, 60.039, 60.032, ..., 35.283 for the final attempt)
upstream_status: 504 for every attempt, with "-" as the final entry
request id: 37071e75abfc39a2ca0045eb9626d4e4

yifeng-cerebras commented

Found the solution myself: #5524 (comment)

bellasys commented May 17, 2023

@yifeng-cerebras "proxy_next_upstream_timeout 0;" means no timeout. You are making 3 tries, each with an unlimited timespan, which is consistent with your result...
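
A minimal sketch of what explicitly bounding the retries could look like with the stock nginx directives (the values are illustrative, not a recommendation from this thread):

proxy_next_upstream             error timeout;
proxy_next_upstream_tries       3;      # at most 3 attempts in total
proxy_next_upstream_timeout     30s;    # stop passing the request to further upstreams after 30 s

With both limits non-zero, whichever one is reached first ends the retry loop.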

yifeng-cerebras commented May 17, 2023

@bellasys thanks for the reply.
The workaround for my use case is to turn off grpc_next_upstream rather than proxy_next_upstream.
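
For reference, a minimal sketch of one way that could be wired up, assuming the controller still permits the configuration-snippet annotation (the Ingress name is a placeholder):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-grpc-service    # placeholder
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      grpc_next_upstream off;

In stock nginx, grpc_next_upstream accepts "off", which disables passing a failed request to the next server.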

sstaley-hioscar commented

@yifeng-cerebras "proxy_next_upstream_timeout 0;" means no timeout. You are making 3 tries, each with an unlimited timespan, which is consistent with your result...

From the nginx docs:

Limits the time during which a request can be passed to the next server. The 0 value turns off this limitation.

To me this reads that his setup will not apply any overall time limit to the process of trying next upstreams. So it should only be limited by the 5-second connect timeout, which is applied to each connection attempt (3 in total, or 15 seconds).
