
Using proxy-next-upstream when proxy-connect-timeout happens? #4944

Closed
bzon opened this issue Jan 17, 2020 · 20 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

bzon commented Jan 17, 2020

When an upstream pod changes its IP for some reason, such as a pod or node restart, we get a small fraction of 504 errors with a consistent 5-second timeout. In the error logs and traces it is very clear that the upstream pod IP address used by Nginx no longer exists. With pod disruption budgets, we almost always have at least 3 available replicas of these upstream pods.

We have proxy_connect_timeout=5s. These are the other settings I've extracted from the Nginx ingress conf file.

proxy_connect_timeout                   5s;
proxy_send_timeout                      60s;
proxy_read_timeout                      60s;
proxy_next_upstream                     error timeout http_502 http_503 http_504 non_idempotent;
proxy_next_upstream_timeout             1;
proxy_next_upstream_tries               3;

Can we use proxy_next_upstream and hope that it goes to the next available pod?

bzon added the kind/support label Jan 17, 2020
aledbf (Member) commented Jan 17, 2020

Can we use proxy_next_upstream and hope that it goes to the next available pod?

You can see exactly what nginx is doing by reading the logs. When a retry is required, the $upstream_addr field becomes an array in which you can see the IPs of the pods and also the status code that triggered the retry:
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/log-format/
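
A hypothetical example of those list-valued fields for a retried request (the addresses and timings are made up; the exact layout depends on the configured log-format-upstream):

upstream_addr: 10.0.1.12:8080, 10.0.2.34:8080
upstream_response_time: 5.000, 0.007
upstream_status: 504, 200

Here the first attempt hit the connect timeout and was logged as 504, and the retry against a second pod returned 200.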

bzon (Author) commented Jan 17, 2020

@aledbf I changed my config to ensure that proxy_next_upstream_timeout is more than proxy_connect_timeout and I think it is working now.

Thanks for the tip about the upstream_addr being an array. In the traces and logs I mentioned, the upstream_addr field only has one entry.
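
A minimal sketch of that change using the ingress-nginx ConfigMap keys (the values are illustrative, not the exact ones from this cluster):

proxy-connect-timeout: "5"
proxy-next-upstream: "error timeout http_502 http_503 http_504 non_idempotent"
# must be larger than proxy-connect-timeout, otherwise the retry window is
# already exhausted by the time the first connect attempt times out
proxy-next-upstream-timeout: "15"
proxy-next-upstream-tries: "3"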

bzon (Author) commented Jan 17, 2020

@aledbf the proxy_next_upstream retry works after testing, but the behaviour seems wrong.

With proxy_next_upstream_tries set to 3 and a proxy_connect_timeout of 3 s, I get a consistent 9 s latency (3 tries × 3 s connect timeout) with my 504 errors.

[screenshot: trace showing the consistent 9 s latency]

It was still failing and upon checking the $upstream_addr, the 3 pod IPs used are all identical.

[screenshot: $upstream_addr showing three identical pod IPs]

But it also works!

[screenshot: a request successfully retried against a different pod]

This is my config.

proxy-connect-timeout: "3"
proxy-next-upstream: "error timeout http_502 http_503 http_504 non_idempotent"
proxy-next-upstream-timeout: "10"

# Fix to prevent status code 499
http-snippet: |
  proxy_ignore_client_abort on;

aledbf (Member) commented Jan 17, 2020

@bzon I am sorry but what are you asking exactly?

aledbf (Member) commented Jan 17, 2020

It was still failing and upon checking the $upstream_addr, the 3 pod IPs used are all identical.

This means you only have one pod running

aledbf (Member) commented Jan 17, 2020

But it also works!

This means you have more than one. If you increase the tries you will see the same IP:port more than once

bzon (Author) commented Jan 17, 2020

This means you only have one pod running

But there were 2 available pods running during that trace time. How can this happen? Maybe DNS? 🤔

Anyway, I can now close this issue since the question in the title has been answered.

bzon closed this as completed Jan 17, 2020
aledbf (Member) commented Jan 17, 2020

Maybe DNS?

There is no DNS involved in ingress-nginx. We use the client-go library to reach the api server and get the endpoints (pods) directly. Please check https://kubernetes.github.io/ingress-nginx/how-it-works/

bzon (Author) commented Jan 17, 2020

There is no DNS involved in ingress-nginx. We use the client-go library to reach the api server and get the endpoints (pods) directly. Please check https://kubernetes.github.io/ingress-nginx/how-it-works/

But I'm pretty sure there were at least 2 available pods running, because I was doing all these tests manually: deleting pods by hand while the 2 other pods were still receiving requests...

mikksoone commented

Same issue here. In our case, a "chaos controller" terminated one node, and we could see ingress 504 errors. There were 2 pods: one that died and one on a healthy node. We have a total of 3 ingresses, and the same situation occurred in all of them.

In a 30 s period there were a total of 79 requests routed to the dying pod, of which:

  1. 51 requests got 200 OK from the second pod after 5 s:
     upstream_addr: 100.115.211.230:14444, 100.104.239.13:14444
     upstream_response_length: 0, 0
     upstream_response_time: 5.000, 0.008
     upstream_status: 504, 200
  2. 12 requests got 200 OK after 10 s (two retries against the failed pod, third retry against the healthy pod):
     upstream_addr: 100.115.211.230:14444, 100.115.211.230:14444, 100.104.239.13:14444
     upstream_response_length: 0, 0, 1480
     upstream_response_time: 5.000, 5.000, 0.004
     upstream_status: 504, 504, 200
  3. 13 requests got 504 after 15 s (three retries, all to the dying pod):
     upstream_addr: 100.115.211.230:14444, 100.115.211.230:14444, 100.115.211.230:14444
     upstream_response_length: 0, 0, 0
     upstream_response_time: 5.000, 5.000, 5.000
     upstream_status: 504, 504, 504

Any idea why the same ingress controller shows different retry behaviour for the same service? These three cases on the same ingress were mixed in time (it was not that all case-1 requests came first, then case 2, then case 3 at the end; they were evenly distributed over time).

bzon (Author) commented Dec 22, 2020

@mikksoone I have the same issue as you. Did you somehow manage to get this problem under control? Do you make use of next-upstream retries?

lonre commented Jan 4, 2021

@mikksoone @bzon same issue here

erdebee commented Aug 9, 2021

Same issue here: multiple pods running, but our logs indicate that in some cases (I guess with 3 pods of which 1 failed and a maximum of 3 retries, the odds are 1 in 9 (3x3)) the request fails after consistently calling the failed node. Maybe the Lua script only updates the list of nodes every so often, or with a multi-node cluster it takes a while for all nodes to discover the pod's unavailability?

tiejunhu commented

Same issue here: some of the timed-out requests are retried, some are not. No solution so far.

tontondematt commented Sep 30, 2022

It's 2022 and I have the same issue. Nginx ingress controller behind an ALB, targeting a microservice running 3 instances. When doing a node upgrade we get the odd 504 error (not many, but still a handful, like 10). When looking at it in detail we end up with the same conclusion: it is retrying on the same address even though we have at least 2 available instances at all times (a PDB with maxUnavailable set to 1).

upstream_addr 100.65.187.226:8080, 100.65.187.226:8080, 100.65.187.226:8080
upstream_response_length 0, 0, 0
upstream_response_time 5.000, 5.000, 5.000
upstream_status 504, 504, 504
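
For reference, a minimal sketch of the kind of PodDisruptionBudget described above (the name and labels are placeholders, not taken from this report):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-microservice-pdb       # placeholder
spec:
  maxUnavailable: 1               # at most one instance down during voluntary disruptions
  selector:
    matchLabels:
      app: my-microservice        # placeholder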

yifeng-cerebras commented

Is there an existing bug report for the case where only one pod is running?
I get this infinite retry against the same IP over and over again.
Is retrying expected to be turned off when there is only one pod?

I didn't change any of the default config:

		proxy_connect_timeout                   5s;
		proxy_send_timeout                      60s;
		proxy_read_timeout                      60s;

		proxy_buffering                         off;
		proxy_buffer_size                       4k;
		proxy_buffers                           4 4k;

		proxy_max_temp_file_size                1024m;

		proxy_request_buffering                 on;
		proxy_http_version                      1.1;

		proxy_cookie_domain                     off;
		proxy_cookie_path                       off;

		# In case of errors try the next upstream server before returning an error
		proxy_next_upstream                     error timeout;
		proxy_next_upstream_timeout             0;
		proxy_next_upstream_tries               3;

		grpc_pass grpc://upstream_balancer;

		proxy_redirect                          off;


The corresponding log entry, condensed (the raw line repeats the same upstream for every attempt):

upstream_addr: 192.168.65.97:9000, repeated for all 63 attempts
upstream_response_length: 0 for every attempt
upstream_response_time: roughly 60 s per attempt (60.048, 60.039, 60.032, ..., 35.283 for the final attempt)
upstream_status: 504 for every attempt, with "-" as the final entry
request id: 37071e75abfc39a2ca0045eb9626d4e4

yifeng-cerebras commented

Found the solution myself: #5524 (comment)

bellasys commented May 17, 2023

@yifeng-cerebras "proxy_next_upstream_timeout 0;" means no timeout. You are making 3 tries, each with an unlimited timespan, which is consistent with your result...
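
A minimal sketch of what explicitly bounding the retries could look like with the stock nginx directives (the values are illustrative, not a recommendation from this thread):

proxy_next_upstream             error timeout;
proxy_next_upstream_tries       3;      # at most 3 attempts in total
proxy_next_upstream_timeout     30s;    # stop passing the request to further upstreams after 30 s

With both limits non-zero, whichever one is reached first ends the retry loop.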

yifeng-cerebras commented May 17, 2023

@bellasys thanks for the reply.
The workaround for my use case is to turn off grpc_next_upstream rather than proxy_next_upstream.
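
For reference, a minimal sketch of one way that could be wired up, assuming the controller still permits the configuration-snippet annotation (the Ingress name is a placeholder):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-grpc-service    # placeholder
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      grpc_next_upstream off;

In stock nginx, grpc_next_upstream accepts "off", which disables passing a failed request to the next server.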

sstaley-hioscar commented

@yifeng-cerebras "proxy_next_upstream_timeout 0;" means no timeout. You are making 3 tries, each with an unlimited timespan, which is consistent with your result...

From the nginx docs:

Limits the time during which a request can be passed to the next server. The 0 value turns off this limitation.

To me this reads that his setup will not apply any overall time limit to the process of trying next upstreams. So it should only be limited by the 5-second connect timeout, which is applied to each connection attempt (3 in total, or 15 seconds).
