HAProxy 2.4.17 reload issue #1854

Closed
sameergn opened this issue Sep 6, 2022 · 19 comments
Labels
status: feedback required (The developers are waiting for a reply from the reporter.) · type: bug (This issue describes a bug.)

Comments

sameergn commented Sep 6, 2022

Detailed Description of the Problem

We are seeing issues with the HAProxy reload operation in 2.4.17 when triggered via the systemctl reload command.
No such issues were observed with version 2.0.17.
Some requests get no HTTP response, or HAProxy times out with an "sD" termination state as if retrying, yet the retry counters are zero.
From the timings, it looks like HAProxy received the response from the backend server quickly but somehow waited 10 minutes before forwarding it. (Our server timeout is 10 minutes.)

Is this a known issue, and has it been fixed in later releases?
Is it related to http://git.haproxy.org/?p=haproxy-2.4.git;a=commitdiff;h=1f8342f ?

Expected Behavior

Traffic should not fail during an HAProxy 2.4.17 reload operation.

Steps to Reproduce the Behavior

Send continuous traffic to HAProxy 2.4.17.
Reload HAProxy a few times.
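For reference, a minimal reproduction sketch along these lines (the frontend address, request path, and loop counts are placeholders, not taken from the report):

# Drive continuous traffic at the frontend (placeholder host/port/path).
while true; do
    curl -s -o /dev/null -w "%{http_code}\n" http://FRONTEND_HOST:PORT/SOME_PATH
done &

# In another shell, reload HAProxy a few times while the traffic is running.
for i in 1 2 3 4 5; do
    sudo systemctl reload haproxy
    sleep 30
done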

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

global
   log         127.0.0.1 local2 debug
   lua-load    /etc/haproxy/cors.lua
   chroot      /var/lib/haproxy
   pidfile     /var/run/haproxy.pid
   maxconn     1000
   nbproc  4
   cpu-map  1 0
   cpu-map  2 1
   cpu-map  3 2
   cpu-map  4 3

defaults
   log    global
   mode   http
   option httplog
   option dontlognull
   # It seems with current configuration we don't need it https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#4-option%20persist
   # option persist
   option redispatch
   option forwardfor

   option http-keep-alive
   #pretend keep-alive closes connection on backend, but we want to keep it open for keep-alive interval
   #option http-pretend-keepalive
   timeout http-keep-alive 30s

   # Connection timeout is to accommodate losses of TCP packets; it should be a few ms in general on a LAN
   timeout connect 5s
   timeout check   5s
   timeout client  30s
   timeout server  600s

   # HAProxy will retry only on conn-failure or empty-response conditions
   retry-on conn-failure empty-response
   retries 2

frontend localnodes
   maxconn     1000

   # HAProxy will retry only on GET requests
   http-request disable-l7-retry unless METH_GET

Output of haproxy -vv

HAProxy version 2.4.17-9f97155 2022/05/13 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2026.
Known bugs: http://www.haproxy.org/bugs/bugs-2.4.17.html
Running on: Linux 3.10.0-1160.66.1.el7.x86_64 #1 SMP Wed May 18 16:02:34 UTC 2022 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE=1 USE_LIBCRYPT=1 USE_CRYPT_H=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_SYSTEMD=1 USE_PROMEX=1
  DEBUG   =

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE -PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 -CLOSEFROM +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT -QUIC +PROMEX -MEMORY_PROFILING

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.4.2
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 4.8.5 20150623 (Red Hat 4.8.5-44)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2       flags=HTX|CLEAN_ABRT|HOL_RISK|NO_UPG
            fcgi : mode=HTTP       side=BE        mux=FCGI     flags=HTX|HOL_RISK|NO_UPG
       <default> : mode=HTTP       side=FE|BE     mux=H1       flags=HTX
              h1 : mode=HTTP       side=FE|BE     mux=H1       flags=HTX|NO_UPG
       <default> : mode=TCP        side=FE|BE     mux=PASS     flags=
            none : mode=TCP        side=FE|BE     mux=PASS     flags=NO_UPG

Available services : prometheus-exporter
Available filters :
	[SPOE] spoe
	[CACHE] cache
	[FCGI] fcgi-app
	[COMP] compression
	[TRACE] trace

Last Outputs and Backtraces

No response

Additional Information

No response

sameergn added the status: needs-triage and type: bug labels on Sep 6, 2022

sameergn commented Sep 7, 2022

HAProxy log showing 10 min processing time and server timeout (sD) but no retries:
Sep 5 07:49:40 localhost haproxy[5876]: [05/Sep/2022:07:39:40.709] 0/0/0/94/600095 200 2747 - - sD-- 2/2/0/0/0 0/0 "POST PATH_REMOVED HTTP/1.1" th:0/37/37

Log Format
%t %TR/%Tw/%Tc/%Tr/%Ta\ %ST\ %B\ %CC\ %CS\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ %{+Q}r th:%Th/%Ti/%Tq

capflam commented Sep 7, 2022

The mentioned commit is unrelated; it fixes a bug on peers that does not affect normal traffic. Do you know if it only happens on reload or not? It may be related to 372b38f. You should try the latest 2.4 release (2.4.18) to be sure. If it also happens outside a reload, you can try to get the output of the show sess all CLI command to identify sessions that have been waiting for several minutes.
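For reference, one way to capture that output is over the stats socket (the socket path below is a placeholder and assumes an admin-level stats socket is declared in the global section):

# Dump all active sessions of the running process (placeholder socket path).
echo "show sess all" | sudo socat stdio unix-connect:/var/run/haproxy.sock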

capflam added the status: feedback required label and removed the status: needs-triage label on Sep 7, 2022

sameergn commented Sep 7, 2022

We are focusing on the hitless-reload scenario.
We will update once we test with version 2.4.18.

sameergn commented Sep 8, 2022

We captured the output of show sess all every 2 seconds while HAProxy 2.4.17 was stuck for 10 minutes.
We grouped the output by the combination of source=IP1:PORT1 and server=BACKEND_SERVER (values are masked):
long_running_session_masked.txt

Please let us know if you can find any insights in the output.
Thanks!

sameergn commented Sep 8, 2022

Same issue was observed on HAProxy 2.4.18 as well.

capflam commented Sep 9, 2022

Thanks for the feedback. Your show sess all output is unfortunately useless because it was performed on the new process. To get it from the old process, you may open a connection on the CLI before the reload and keep it open via the prompt command. This way you will be able to run commands on the old process. Note that the default client timeout for the CLI is 10s.
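For reference, a minimal way to do this (the socket path is a placeholder and assumes a stats socket is declared in the global section):

# Open an interactive CLI session BEFORE the reload and leave it connected.
sudo socat readline unix-connect:/var/run/haproxy.sock
# On that CLI, type:
#   prompt            (switches to interactive mode so the connection is kept open)
#   show sess all     (run after the reload; it still addresses the old process)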

Darlelet commented Sep 9, 2022

To complete capflam's answer: if the 10s CLI timeout bothers you during your tests, you can adjust it with the following directive under the global section of your config file:
stats timeout <DESIRED_TIMEOUT_IN_SECONDS>s

This way you won't unintentionally lose the old process' CLI.
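A minimal sketch of what that could look like (the socket path and the 10-minute value are illustrative, not taken from the report):

global
   # Admin-level stats socket so the CLI can be used (placeholder path).
   stats socket /var/run/haproxy.sock mode 600 level admin
   # Keep CLI connections alive long enough to outlast the stuck period.
   stats timeout 600s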

sameergn commented

Generated session files for the old processes by running HAProxy in master CLI mode using the -S flag.
During the reload, 1 child process exited quickly but 3 others hung around for about 10 minutes.
Attached is the session info of those 3 child processes, captured every 2 seconds using a script like this:

#!/bin/bash
# PID of the old (pre-reload) worker process to monitor.
child_pid=$1

# Every 2 seconds, ask the master CLI to run "show sess all" on that worker
# ("@!<pid>" addresses a worker by its PID) and record the output per PID.
while true
do
        echo @!$child_pid show sess all
        sleep 2
done | sudo nc -vU /var/run/haproxy_master.sock > ${child_pid}_sessions.txt &

8189_sessions.txt_masked.txt
8188_sessions.txt_masked.txt
8187_sessions.txt_masked.txt

capflam commented Sep 12, 2022

Thanks, I'll take a look.

capflam commented Sep 12, 2022

In your traces, there is a stream blocked on the response analysis. The response was fully received, but something is blocking the data forwarding. Could you please share your http-response rules, from both the frontend and the backend?

sameergn commented Sep 19, 2022

@capflam Thanks for the pointer. We had an SPOE module enabled and the vendor is looking into the issue.
However, without that module enabled, we are still getting some errors when new processes start after a reload operation. Each new process starts with 50+ total errors across all backends:
Server servers_XXX/YYYY is going DOWN for maintenance (No IP for server )
We are using the following config:

resolvers dnsresolvers
  parse-resolv-conf
  resolve_retries       3
  timeout resolve       1s
  timeout retry         1s
  hold other           30s
  hold refused         30s
  hold nx              30s
  hold timeout         30s
  hold valid           10s
  hold obsolete        30s

We use server-template for all backends with resolvers dnsresolvers
This is happening in 2.4.18. It works fine in 2.0.17

capflam commented Sep 19, 2022

So, without the SPOE enabled, there is no session blocked on reload, right? In that case, you should investigate that point. If you need some help, you should share your configuration. In addition, if it is not already enabled, you may try to enable logging on the SPOE. However, I'm surprised it is only related to the reload.

About your resolver issue, you should share your backend configuration. But it may be because several servers get the same IP address and you don't add resolve-opts allow-dup-ip on your server lines.
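For illustration, the option goes directly on the server or server-template line, roughly like this (server name, DNS name, and port are placeholders, not the reporter's actual configuration):

    server-template SRV 1-5 DNS_NAME:PORT check resolvers dnsresolvers resolve-opts allow-dup-ip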

sameergn commented Sep 19, 2022

Here is the backend configuration:

backend servers_SERVER_1
    # Number of consecutive invalid health checks before considering the server as DOWN.
    default-server fall 5

    http-check expect status 200
    option httpchk GET SOME_PATH_1 HTTP/1.1\r\nHost:\ unknownhost

    server-template SERVER_1 1-5 DNS_NAME:PORT check ssl verify required ca-file CA_FILE crt CRT_BUNDLE_FILE resolvers dnsresolvers

backend servers_SERVER_2
    # Number of consecutive invalid health checks before considering the server as DOWN.
    default-server fall 5

    http-check expect status 200
    option httpchk GET SOME_PATH_2 HTTP/1.1\r\nHost:\ unknownhost

    server-template SERVER_2 1-5 DNS_NAME:PORT check ssl verify required ca-file CA_FILE crt CRT_BUNDLE_FILE resolvers dnsresolvers

Please note that DNS_NAME:PORT is the same for all backends and it resolves to 3 IP addresses, one per availability zone in AWS.
In that case, should we use resolve-opts allow-dup-ip on each server-template line?

capflam commented Sep 19, 2022

Well, probably not in this case, because the same port is used. At first glance, there is no reason to have several servers in the same backend with the same IP/PORT; duplicated IPs are useful when several servers share the same IP address with different ports.

So, in this case, you provision 5 servers per backend, but not all slots are filled because your DNS server returns fewer than 5 IP addresses for the corresponding DNS_NAME. That is probably expected; it is the purpose of server-template: to be able to dynamically add or remove servers, here by adding or removing a server address in the DNS response. The log messages you get on reload are therefore expected. If not, you must review your configuration.
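To illustrate (a sketch based on the server-template above, with made-up addresses; the maintenance reason mirrors the "No IP for server" message quoted earlier): with server-template SERVER_1 1-5 and a DNS name resolving to 3 addresses, the five slots end up roughly like this:

# backend servers_SERVER_1 after resolution (illustrative)
#   SERVER_11  10.0.1.10:PORT   UP     (resolved IP, AZ 1)
#   SERVER_12  10.0.2.10:PORT   UP     (resolved IP, AZ 2)
#   SERVER_13  10.0.3.10:PORT   UP     (resolved IP, AZ 3)
#   SERVER_14  (no address)     MAINT  (No IP for server)
#   SERVER_15  (no address)     MAINT  (No IP for server)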

sameergn commented Sep 19, 2022

The reason we are concerned is that this results in errors during performance tests. JMeter is getting several "No HTTP Response: Failed to respond" errors, and those HTTP requests are not seen anywhere in the backend server logs.
There is an AWS Classic Load Balancer in TCP mode between JMeter and the HAProxy instances. These errors are consistently seen during the HAProxy 2.4.18 reload operation but were not seen with 2.0.17.
Is there any difference between the 2.0.17 and 2.4.x configurations that we need to look for?

capflam commented Sep 19, 2022

Sorry, I'm puzzled because you are mixing several unrelated issues; with each reply you come back with a new one. I'll try to sum up.

  • Your blocked sessions on reload are related to the SPOE, maybe because of your agent. You must investigate that point. For now, there is no way to be sure the issue is on the HAProxy side; you must provide more info if you really suspect a bug.

  • You have several log messages on reload related to the DNS resolvers because some slots in your server-templates are not filled. There is no issue here; it is expected.

  • Now, you have an issue during reload with JMeter, with a load balancer between JMeter and HAProxy, and a different behavior between 2.4.18 and 2.0.17. It is probably because idle client connections are closed on soft-stop since 2.4. You can enable the idle-close-on-response option to keep them open (see the sketch after this list). This option was added because some users experienced issues with an ALB in front of HAProxy. But there is no bug here, at least at first glance.
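A minimal sketch of enabling that option (placing it in the defaults section is an assumption; it can also be set in a frontend):

defaults
   # During soft-stop, keep idle client keep-alive connections open and close
   # them only after serving one last response, instead of closing them
   # immediately; this avoids races with an ALB reusing an idle connection.
   option idle-close-on-response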

sameergn commented

Your summary is accurate. The last two bullets are related to the JMeter performance test. We will try idle-close-on-response and share the results.

sameergn commented

With idle-close-on-response, all "No HTTP Response: Failed to respond" errors are gone, but some requests still failed with "No HTTP Response: Socket closed".
We did observe this time that a few old processes stayed alive for a few seconds before exiting.
Interestingly, most (54/65) of the "Socket closed" requests logged by JMeter are marked as successful in the HAProxy access logs; the remaining 15 are not found in the HAProxy access logs.

capflam commented Sep 21, 2022

Indeed, with the idle-close-on-response option, idle HTTP connections on the client side are not closed on soft-stop; the client may send one last request, and these connections are closed with the response. However, the keep-alive timeout still applies, which explains why the old worker stays alive longer. I guess some "Socket closed" errors may be reported if an idle connection is closed by HAProxy (because of the keep-alive timeout) just as JMeter or the ALB tries to reuse it. In that case there is no response; this is expected, and it lets clients know they can safely retry the request. For the errors reported on the JMeter side but not on the HAProxy side, I suspect something is wrong with the ALB. To be sure, you should run your performance test against HAProxy directly.

At this stage, it seems there is no bug in HAProxy, so I suggest you get some help on the HAProxy forum or on the mailing list; some users run HAProxy behind an ALB and may be able to help you. GitHub issues must only be used as a bug tracker, so I'm closing this now. Of course, I may be wrong; in that case, feel free to reopen the issue and provide more info.

The forum is at: https://discourse.haproxy.org/

The mailing list (no need to subscribe) is: haproxy@formilux.org
Subscribe to the list: haproxy+subscribe@formilux.org
Unsubscribe from the list: haproxy+unsubscribe@formilux.org

capflam closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 21, 2022