Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HAproxy in front of postfix causing warning in postfix log after upgrade from 1.9.8 to 2.0.3 #187

Closed
amit777 opened this issue Jul 24, 2019 · 15 comments
Labels
2.0 This issue affects the HAProxy 2.0 stable branch. severity: medium This issue is of MEDIUM severity.

Comments

@amit777
Copy link

amit777 commented Jul 24, 2019

I'm not sure if this is a bug. There seems to be a change in behavior that I'm having trouble tracking down. I get the following warning in my postfix maillog:
"warning: haproxy read: lost connection"

Is there a change in behavior between 1.9.8 and 2.0.3 regarding how TCP checks, and specifically smtpchk works?

haproxy -vvv

HA-Proxy version 2.0.3 2019/07/23 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_LINUX_SPLICE=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED -REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL -LUA +FUTEX +ACCEPT4 -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k-fips  26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE2 version : 10.23 2017-02-14
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTX        side=FE|BE     mux=H2
              h2 : mode=HTTP       side=FE        mux=H2
       <default> : mode=HTX        side=FE|BE     mux=H1
       <default> : mode=TCP|HTTP   side=FE|BE     mux=PASS

Available services : none

Available filters :
	[SPOE] spoe
	[COMP] compression
	[CACHE] cache
	[TRACE] trace

haproxy.cfg snippet

frontend ft_smtp
  bind 0.0.0.0:25
  mode tcp
  no option http-server-close
  timeout client 1m
  option tcplog
#  option smtpchk #  moved to backend for 2.0.3
  option dontlognull
  log global
  log 127.0.0.1 local0
  log 127.0.0.1 local1

  default_backend bk_postfix
backend bk_postfix
  mode tcp
  option smtpchk # moved from front end to here
  log global
  timeout server 1m
  timeout connect 10s
  server can1 can1:10024 send-proxy check
@wtarreau
Copy link
Member

Hmmm that's an interesting case. It's a side effect of this fix: fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to disable linger.") which was merged between 2.0.2 and 2.0.3. It was also backported into 1.9.9.

We don't have enough state to represent all desired conditions it seems. Or probably we'd need to be able to explicitly mention that certain checks must explicitly shutdown before closing, even if they did not send a message.

The attached patch addresses this for me (based on the network capture), could you please confirm that it's OK with your postfix server as well ?

0001-WIP-BUG-MEDIUM-checks-make-sure-to-close-nicely-in-S.patch.txt

@wtarreau wtarreau added 1.9 This issue affects the HAProxy 1.9 stable branch. 2.0 This issue affects the HAProxy 2.0 stable branch. dev This issue affects the HAProxy development branch. severity: medium This issue is of MEDIUM severity. status: feedback required The developers are waiting for a reply from the reporter. labels Jul 25, 2019
@amit777
Copy link
Author

amit777 commented Jul 25, 2019

hi, I applied your patch to a 2.0.3 source directory, rebuilt, installed, restarted HAProxy. Still get the warning messages in postfix. Happy to try out other things if you like.

@wtarreau
Copy link
Member

wtarreau commented Aug 6, 2019

Hmm indeed, I can reproduce this here as well:

Before the patch:

15:56:02.010034 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:02.010196 read(6, "HELO localhost\r\n", 8192) = 16
15:56:02.010223 write(1, "HELO localhost\r\n", 16HELO localhost
) = 16
15:56:02.010245 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:04.011458 read(6, 0x125f5d0, 8192) = -1 ECONNRESET (Connection reset by peer)
15:56:04.011501 close(6)                = 0

After the patch:

15:56:52.612894 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:52.612913 read(6, "HELO localhost\r\n", 8192) = 16
15:56:52.612928 write(1, "HELO localhost\r\n", 16HELO localhost
) = 16
15:56:52.612941 select(16, [0 6], NULL, NULL, NULL) = 1 (in [6])
15:56:54.614362 read(6, 0x20ab5d0, 8192) = -1 ECONNRESET (Connection reset by peer)
15:56:54.614537 close(6)                = 0

And indeed, we don't consume the pending data so we don't even see if the server responds. I'm digging.

@wtarreau wtarreau added status: reviewed This issue was reviewed. A fix is required. and removed status: feedback required The developers are waiting for a reply from the reporter. labels Aug 6, 2019
@wtarreau
Copy link
Member

wtarreau commented Aug 6, 2019

So it's in the case where the server doesn't respond fast enough that it fails, if it presents the banner immediately it's OK with the patch:

16:20:37.452494 read(6, "HELO localhost\r\n", 8192) = 16
16:20:37.452529 read(0, "250 HELO\n", 8192) = 9
16:20:37.452548 write(1, "HELO localhost\r\n", 16HELO localhost
) = 16
16:20:37.452567 write(6, "250 HELO\n", 9) = 9
16:20:37.452596 select(16, [0 6], NULL, NULL, NULL) = 1 (in [0])
16:20:37.452622 read(0, "", 8192)       = 0
16:20:37.452642 close(0)                = 0
16:20:37.452663 select(16, [6], NULL, NULL, NULL) = 1 (in [6])
16:20:37.452785 read(6, "", 8192)       = 0
16:20:37.452804 close(6)                = 0

Are you sure your postfix doesn't have a delayed banner ? Otherwise it would be useful to take a capture between haproxy and the postfix server.

I'm taking the patch above for 2.0.4 as it's needed anyway. This will eliminate some moving parts when you want to continue the troubleshooting.

haproxy-mirror pushed a commit that referenced this issue Aug 6, 2019
…speak

In SMTP, MySQL and PgSQL checks, we're supposed to finish with a message
to politely quit the server, otherwise some of them will log some errors.
This is the case with Postfix as reported in GH issue #187. Since commit
fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to
disable linger.") we are a bit more aggressive on outgoing connection
closure and checks were not prepared for this.

This patch makes the 3 checks above disable the linger_risk for these
checks so that we close cleanly, with the side effect that it will leave
some TIME_WAIT connections behind (hence why it should not be generalized
to all checks). It's worth noting that in issue #187 it's mentioned that
this patch doesn't seem to be sufficient for Postfix, however based only
on local network activity this looks OK, so maybe this will need to be
improved later.

Given that the patch above was backported to 2.0 and 1.9, this one should
as well.
@wtarreau
Copy link
Member

wtarreau commented Aug 7, 2019

I'm marking this as fixed as it's in 2.0.4 now. I'd appreciate it if you could try again. I'm fine with tagging it bogus again if it's not enough in your tests.

@wtarreau wtarreau added status: fixed This issue is a now-fixed bug. and removed 1.9 This issue affects the HAProxy 1.9 stable branch. 2.0 This issue affects the HAProxy 2.0 stable branch. dev This issue affects the HAProxy development branch. status: reviewed This issue was reviewed. A fix is required. labels Aug 7, 2019
@amit777
Copy link
Author

amit777 commented Aug 8, 2019

Hi, sorry I am traveling and will have to look at it this weekend. We use postscreen in front of the main postfix smtp server process. Postscreen can indeed introduce a delay as described in this document: http://www.postfix.org/POSTSCREEN_README.html

@amit777
Copy link
Author

amit777 commented Aug 8, 2019

Is there a configuration option to be able to support the delayed response on the check? Postscreen must go through it’s checks which can take some time.

@wtarreau
Copy link
Member

wtarreau commented Aug 8, 2019

You can use two methods for this, either increase the "inter" argument of the server line, which defines both the check interval and the default check timeout, or use "timeout check" to forcefully increase the check timeout without changing the interval. But given that in your case the delay will be long for each check you'd rather just increase "inter".

Please note that it's very possible that postscreen will reject haproxy's check as it sends immediately without waiting for the banner. If this causes trouble, then I suggest that you switch tcp-checks which allow to define sequences of connect/send/expect. In this case you'd basically just expect a 2xx then send a QUIT and close. But if smtpchk works fine, no need to complicate your setup.

@amit777
Copy link
Author

amit777 commented Aug 8, 2019

Thanks for that note! I’ll keep an eye out for that postscreen nuance you mentioned, though I would imagine it would have affected us by now.

@amit777
Copy link
Author

amit777 commented Aug 10, 2019

I installed 2.0.4, updated config to have inter 5000, but the warnings still appear in postfix log, albeit less frequently.. But it seems like every check. I will run the packet trace and look deeper into it soon.

@wtarreau
Copy link
Member

wtarreau commented Aug 10, 2019 via email

@wtarreau wtarreau added 2.0 This issue affects the HAProxy 2.0 stable branch. status: feedback required The developers are waiting for a reply from the reporter. and removed status: fixed This issue is a now-fixed bug. labels Aug 10, 2019
jkoelker pushed a commit to jkoelker/haproxy that referenced this issue Sep 27, 2019
…speak

In SMTP, MySQL and PgSQL checks, we're supposed to finish with a message
to politely quit the server, otherwise some of them will log some errors.
This is the case with Postfix as reported in GH issue haproxy#187. Since commit
fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to
disable linger.") we are a bit more aggressive on outgoing connection
closure and checks were not prepared for this.

This patch makes the 3 checks above disable the linger_risk for these
checks so that we close cleanly, with the side effect that it will leave
some TIME_WAIT connections behind (hence why it should not be generalized
to all checks). It's worth noting that in issue haproxy#187 it's mentioned that
this patch doesn't seem to be sufficient for Postfix, however based only
on local network activity this looks OK, so maybe this will need to be
improved later.

Given that the patch above was backported to 2.0 and 1.9, this one should
as well.

(cherry picked from commit 5488a62)
Signed-off-by: Willy Tarreau <w@1wt.eu>
@sukantahazra
Copy link

Hi!

I am Amit's (@amit777 ) colleague. I have attached two files.

elk1-haproxy.dat.gz is the tcpdump capture for haproxy version 1.9.10 (working fine)
can1-haproxy.dat.gz is the tcpdump capture for haproxy version 2.0.7 (has postscreen warning messages)

elk1-haproxy.dat.gz
can1-haproxy.dat.gz

I had a look at the packets I noticed that version 2.0.7 sends RST,ACK packets frequently and they correspond with the postscreen warning messages.

image

The origin of those aborted connects is an AWS Cloudwatch health check which connects to TCP port 25 and then closes the connection without sending any data. AWS Cloudwatch -> port:25 --> haproxy --> port:10024 (postscreen)

Hope this helps.

FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Jan 29, 2020
…speak

In SMTP, MySQL and PgSQL checks, we're supposed to finish with a message
to politely quit the server, otherwise some of them will log some errors.
This is the case with Postfix as reported in GH issue haproxy#187. Since commit
fe4abe6 ("BUG/MEDIUM: connections: Don't call shutdown() if we want to
disable linger.") we are a bit more aggressive on outgoing connection
closure and checks were not prepared for this.

This patch makes the 3 checks above disable the linger_risk for these
checks so that we close cleanly, with the side effect that it will leave
some TIME_WAIT connections behind (hence why it should not be generalized
to all checks). It's worth noting that in issue haproxy#187 it's mentioned that
this patch doesn't seem to be sufficient for Postfix, however based only
on local network activity this looks OK, so maybe this will need to be
improved later.

Given that the patch above was backported to 2.0 and 1.9, this one should
as well.

(cherry picked from commit 5488a62)
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit 5570678)
Signed-off-by: Willy Tarreau <w@1wt.eu>
@capflam
Copy link
Member

capflam commented Dec 3, 2021

Any news on this issue ?

@sukantahazra
Copy link

We are currently using haproxy-2.3.6-7851701. I don't see the issue any longer.

@capflam
Copy link
Member

capflam commented Dec 3, 2021

Thanks. So I'm closing the issue.

@capflam capflam closed this as completed Dec 3, 2021
@capflam capflam removed the status: feedback required The developers are waiting for a reply from the reporter. label Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.0 This issue affects the HAProxy 2.0 stable branch. severity: medium This issue is of MEDIUM severity.
Projects
None yet
Development

No branches or pull requests

4 participants