HAproxy in front of postfix causing warning in postfix log after upgrade from 1.9.8 to 2.0.3 #187

amit777 · 2019-07-24T18:51:47Z

I'm not sure if this is a bug. There seems to be a change in behavior that I'm having trouble tracking down. I get the following warning in my postfix maillog:
"warning: haproxy read: lost connection"

Is there a change in behavior between 1.9.8 and 2.0.3 regarding how TCP checks, and specifically smtpchk works?

haproxy -vvv

HA-Proxy version 2.0.3 2019/07/23 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_LINUX_SPLICE=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED -REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL -LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE2 version : 10.23 2017-02-14
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTX        side=FE|BE     mux=H2
              h2 : mode=HTTP       side=FE        mux=H2
       <default> : mode=HTX        side=FE|BE     mux=H1
       <default> : mode=TCP|HTTP   side=FE|BE     mux=PASS

Available services : none

Available filters :
	[SPOE] spoe
	[COMP] compression
	[CACHE] cache
	[TRACE] trace

haproxy.cfg snippet

frontend ft_smtp
  bind 0.0.0.0:25
  mode tcp
  no option http-server-close
  timeout client 1m
  option tcplog
#  option smtpchk #  moved to backend for 2.0.3
  option dontlognull
  log global
  log 127.0.0.1 local0
  log 127.0.0.1 local1

  default_backend bk_postfix

backend bk_postfix
  mode tcp
  option smtpchk # moved from front end to here
  log global
  timeout server 1m
  timeout connect 10s
  server can1 can1:10024 send-proxy check

The text was updated successfully, but these errors were encountered:

wtarreau · 2019-07-25T16:31:49Z

Hmmm that's an interesting case. It's a side effect of this fix: fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to disable linger.") which was merged between 2.0.2 and 2.0.3. It was also backported into 1.9.9.

We don't have enough state to represent all desired conditions it seems. Or probably we'd need to be able to explicitly mention that certain checks must explicitly shutdown before closing, even if they did not send a message.

The attached patch addresses this for me (based on the network capture), could you please confirm that it's OK with your postfix server as well ?

0001-WIP-BUG-MEDIUM-checks-make-sure-to-close-nicely-in-S.patch.txt

amit777 · 2019-07-25T17:00:52Z

hi, I applied your patch to a 2.0.3 source directory, rebuilt, installed, restarted HAProxy. Still get the warning messages in postfix. Happy to try out other things if you like.

wtarreau · 2019-08-06T14:01:48Z

Hmm indeed, I can reproduce this here as well:

Before the patch:

15:56:02.010034 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:02.010196 read(6, "HELO localhost\r\n", 8192) = 16
15:56:02.010223 write(1, "HELO localhost\r\n", 16HELO localhost
) = 16
15:56:02.010245 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:04.011458 read(6, 0x125f5d0, 8192) = -1 ECONNRESET (Connection reset by peer)
15:56:04.011501 close(6)                = 0

After the patch:

15:56:52.612894 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:52.612913 read(6, "HELO localhost\r\n", 8192) = 16
15:56:52.612928 write(1, "HELO localhost\r\n", 16HELO localhost
) = 16
15:56:52.612941 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:54.614362 read(6, 0x20ab5d0, 8192) = -1 ECONNRESET (Connection reset by peer)
15:56:54.614537 close(6)                = 0

And indeed, we don't consume the pending data so we don't even see if the server responds. I'm digging.

wtarreau · 2019-08-06T14:25:20Z

So it's in the case where the server doesn't respond fast enough that it fails, if it presents the banner immediately it's OK with the patch:

16:20:37.452494 read(6, "HELO localhost\r\n", 8192) = 16
16:20:37.452529 read(0, "250 HELO\n", 8192) = 9
16:20:37.452548 write(1, "HELO localhost\r\n", 16HELO localhost
) = 16
16:20:37.452567 write(6, "250 HELO\n", 9) = 9
16:20:37.452596 select(16, [0 6], NULL, NULL, NULL) = 1 (in [0])
16:20:37.452622 read(0, "", 8192)       = 0
16:20:37.452642 close(0)                = 0
16:20:37.452663 select(16, [6], NULL, NULL, NULL) = 1 (in [6])
16:20:37.452785 read(6, "", 8192)       = 0
16:20:37.452804 close(6)                = 0

Are you sure your postfix doesn't have a delayed banner ? Otherwise it would be useful to take a capture between haproxy and the postfix server.

I'm taking the patch above for 2.0.4 as it's needed anyway. This will eliminate some moving parts when you want to continue the troubleshooting.

…speak In SMTP, MySQL and PgSQL checks, we're supposed to finish with a message to politely quit the server, otherwise some of them will log some errors. This is the case with Postfix as reported in GH issue #187. Since commit fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to disable linger.") we are a bit more aggressive on outgoing connection closure and checks were not prepared for this. This patch makes the 3 checks above disable the linger_risk for these checks so that we close cleanly, with the side effect that it will leave some TIME_WAIT connections behind (hence why it should not be generalized to all checks). It's worth noting that in issue #187 it's mentioned that this patch doesn't seem to be sufficient for Postfix, however based only on local network activity this looks OK, so maybe this will need to be improved later. Given that the patch above was backported to 2.0 and 1.9, this one should as well.

wtarreau · 2019-08-07T06:29:10Z

I'm marking this as fixed as it's in 2.0.4 now. I'd appreciate it if you could try again. I'm fine with tagging it bogus again if it's not enough in your tests.

amit777 · 2019-08-08T05:54:52Z

Hi, sorry I am traveling and will have to look at it this weekend. We use postscreen in front of the main postfix smtp server process. Postscreen can indeed introduce a delay as described in this document: http://www.postfix.org/POSTSCREEN_README.html

amit777 · 2019-08-08T15:04:36Z

Is there a configuration option to be able to support the delayed response on the check? Postscreen must go through it’s checks which can take some time.

wtarreau · 2019-08-08T17:00:35Z

You can use two methods for this, either increase the "inter" argument of the server line, which defines both the check interval and the default check timeout, or use "timeout check" to forcefully increase the check timeout without changing the interval. But given that in your case the delay will be long for each check you'd rather just increase "inter".

Please note that it's very possible that postscreen will reject haproxy's check as it sends immediately without waiting for the banner. If this causes trouble, then I suggest that you switch tcp-checks which allow to define sequences of connect/send/expect. In this case you'd basically just expect a 2xx then send a QUIT and close. But if smtpchk works fine, no need to complicate your setup.

amit777 · 2019-08-08T18:17:28Z

Thanks for that note! I’ll keep an eye out for that postscreen nuance you mentioned, though I would imagine it would have affected us by now.

amit777 · 2019-08-10T04:44:51Z

I installed 2.0.4, updated config to have inter 5000, but the warnings still appear in postfix log, albeit less frequently.. But it seems like every check. I will run the packet trace and look deeper into it soon.

wtarreau · 2019-08-10T04:49:54Z

On Fri, Aug 09, 2019 at 09:44:53PM -0700, amit777 wrote: I installed 2.0.4, updated config to have inter 5000, but the warnings still appear in postfix log, albeit less frequently.. But it seems like every check. I will run the packet trace and look deeper into it soon.

Thank you. Please also take a capture of the version that does not cause these warnings so that we can compare. Willy

…speak In SMTP, MySQL and PgSQL checks, we're supposed to finish with a message to politely quit the server, otherwise some of them will log some errors. This is the case with Postfix as reported in GH issue haproxy#187. Since commit fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to disable linger.") we are a bit more aggressive on outgoing connection closure and checks were not prepared for this. This patch makes the 3 checks above disable the linger_risk for these checks so that we close cleanly, with the side effect that it will leave some TIME_WAIT connections behind (hence why it should not be generalized to all checks). It's worth noting that in issue haproxy#187 it's mentioned that this patch doesn't seem to be sufficient for Postfix, however based only on local network activity this looks OK, so maybe this will need to be improved later. Given that the patch above was backported to 2.0 and 1.9, this one should as well. (cherry picked from commit 5488a62) Signed-off-by: Willy Tarreau <w@1wt.eu>

sukantahazra · 2019-09-29T02:27:44Z

Hi!

I am Amit's (@amit777 ) colleague. I have attached two files.

elk1-haproxy.dat.gz is the tcpdump capture for haproxy version 1.9.10 (working fine)
can1-haproxy.dat.gz is the tcpdump capture for haproxy version 2.0.7 (has postscreen warning messages)

elk1-haproxy.dat.gz
can1-haproxy.dat.gz

I had a look at the packets I noticed that version 2.0.7 sends RST,ACK packets frequently and they correspond with the postscreen warning messages.

The origin of those aborted connects is an AWS Cloudwatch health check which connects to TCP port 25 and then closes the connection without sending any data. AWS Cloudwatch -> port:25 --> haproxy --> port:10024 (postscreen)

Hope this helps.

…speak In SMTP, MySQL and PgSQL checks, we're supposed to finish with a message to politely quit the server, otherwise some of them will log some errors. This is the case with Postfix as reported in GH issue haproxy#187. Since commit fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to disable linger.") we are a bit more aggressive on outgoing connection closure and checks were not prepared for this. This patch makes the 3 checks above disable the linger_risk for these checks so that we close cleanly, with the side effect that it will leave some TIME_WAIT connections behind (hence why it should not be generalized to all checks). It's worth noting that in issue haproxy#187 it's mentioned that this patch doesn't seem to be sufficient for Postfix, however based only on local network activity this looks OK, so maybe this will need to be improved later. Given that the patch above was backported to 2.0 and 1.9, this one should as well. (cherry picked from commit 5488a62) Signed-off-by: Willy Tarreau <w@1wt.eu> (cherry picked from commit 5570678) Signed-off-by: Willy Tarreau <w@1wt.eu>

capflam · 2021-12-03T13:32:08Z

Any news on this issue ?

sukantahazra · 2021-12-03T14:06:49Z

We are currently using haproxy-2.3.6-7851701. I don't see the issue any longer.

capflam · 2021-12-03T14:07:46Z

Thanks. So I'm closing the issue.

wtarreau added status: reviewed This issue was reviewed. A fix is required. and removed status: feedback required The developers are waiting for a reply from the reporter. labels Aug 6, 2019

wtarreau added 2.0 This issue affects the HAProxy 2.0 stable branch. status: feedback required The developers are waiting for a reply from the reporter. and removed status: fixed This issue is a now-fixed bug. labels Aug 10, 2019

wtarreau mentioned this issue Sep 4, 2019

Checker dosen't close the conn and retained many time-wait connection #254

Closed

capflam closed this as completed Dec 3, 2021

capflam removed the status: feedback required The developers are waiting for a reply from the reporter. label Dec 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HAproxy in front of postfix causing warning in postfix log after upgrade from 1.9.8 to 2.0.3 #187

HAproxy in front of postfix causing warning in postfix log after upgrade from 1.9.8 to 2.0.3 #187

amit777 commented Jul 24, 2019

wtarreau commented Jul 25, 2019

amit777 commented Jul 25, 2019

wtarreau commented Aug 6, 2019

wtarreau commented Aug 6, 2019

wtarreau commented Aug 7, 2019

amit777 commented Aug 8, 2019

amit777 commented Aug 8, 2019

wtarreau commented Aug 8, 2019

amit777 commented Aug 8, 2019

amit777 commented Aug 10, 2019

wtarreau commented Aug 10, 2019 via email

sukantahazra commented Sep 29, 2019

capflam commented Dec 3, 2021

sukantahazra commented Dec 3, 2021

capflam commented Dec 3, 2021

HAproxy in front of postfix causing warning in postfix log after upgrade from 1.9.8 to 2.0.3 #187

HAproxy in front of postfix causing warning in postfix log after upgrade from 1.9.8 to 2.0.3 #187

Comments

amit777 commented Jul 24, 2019

wtarreau commented Jul 25, 2019

amit777 commented Jul 25, 2019

wtarreau commented Aug 6, 2019

wtarreau commented Aug 6, 2019

wtarreau commented Aug 7, 2019

amit777 commented Aug 8, 2019

amit777 commented Aug 8, 2019

wtarreau commented Aug 8, 2019

amit777 commented Aug 8, 2019

amit777 commented Aug 10, 2019

wtarreau commented Aug 10, 2019 via email

sukantahazra commented Sep 29, 2019

capflam commented Dec 3, 2021

sukantahazra commented Dec 3, 2021

capflam commented Dec 3, 2021