Timeout / Full Load problems with HAproxy 1.8 #588

Closed
addy90 opened this issue Mar 8, 2018 · 24 comments
Labels
bug Production bug upstream Third party issue

Comments

@addy90

addy90 commented Mar 8, 2018

I have been having strange timeout problems since HAProxy 1.8, that is, since OPNsense 18.1.3. With 18.1.2 it worked!

The network uses jumbo frames, and the connections between HAProxy and the backend servers use validated TLS 1.2 sessions, except in one case.
With the previous HAProxy version 1.7, none of these problems occurred!

Now, when I open a website via a frontend, HAProxy sometimes hangs at 100% CPU load for 30 seconds until the timeout breaks the connection and the client reconnects. This seems to happen "always" when I hit Ctrl+F5 for a full reload of the website (apparently because of the amount of data). Different backends (apache, nginx) show the same problem, so it does not depend on the backend server.

Sometimes I see TLS alerts in the packet capture, but I am not sure whether they are the cause or whether it is something else, maybe the large MTU. The thing is, when I call the backend website directly (with jumbo frames), everything works. It also works when I call it via VPN, so MTU conversion is not the problem.

I also get health check timeouts, sometimes Layer 4, Layer 6 or Layer 7, at random times...
Moreover, I sometimes get "Timeout during SSL handshake" errors from HAProxy when the 30-second timeouts expire.

There are no known packet drops within the system, and pings with "don't fragment" set and the corresponding jumbo MTU work in both directions.
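For anyone who wants to repeat the don't-fragment check, here is a minimal sketch. It assumes IPv4 without options, a jumbo MTU of 9000 (the report does not state the actual value) and a placeholder backend address. It only prints the FreeBSD ping command instead of running it.

```sh
#!/bin/sh
# ICMP echo payload that exactly fills one IPv4 packet of the given MTU:
# MTU minus 20 bytes IPv4 header (no options) minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Print (do not run) a FreeBSD ping with the Don't Fragment bit set (-D).
# On Linux the equivalent flag would be "-M do".
df_ping_cmd() {
    mtu=$1
    host=$2
    echo "ping -D -c 3 -s $(icmp_payload_for_mtu "$mtu") $host"
}

# ASSUMPTIONS: 9000 is a typical jumbo MTU, 192.0.2.10 is a placeholder address.
df_ping_cmd 9000 192.0.2.10
```

If the printed command succeeds in both directions but fails with a payload one byte larger, the path MTU matches the configured one.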

I have no idea what changed, but this is a big problem: with the parameters of my setup HAProxy somehow hangs, and I have not found any way to stop this behavior.

By the way: When I try to change the Timeout parameters globally (nothing manually set in servers), no changes in /usr/local/etc/haproxy.conf are made.
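One way to check whether a GUI change reaches the generated file is to list the timeout directives before and after pressing "Apply". This is only a sketch; the helper name `list_timeouts` is made up for illustration, and the path is the one quoted above.

```sh
#!/bin/sh
# Print every "timeout ..." directive of a config file with its line number.
list_timeouts() {
    grep -nE '^[[:space:]]*timeout[[:space:]]' "$1"
}

conf=/usr/local/etc/haproxy.conf   # path quoted in the report
if [ -r "$conf" ]; then
    list_timeouts "$conf"
fi
```

Saving the output to a file before the change and comparing it with the output afterwards shows whether any timeout line was rewritten.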

EDIT: It looks like HTTP backends without TLS have the same problem!
So I get the 30s timeout and 100% CPU load even with plain unencrypted backends / servers!

I have occasionally seen "TCPWindowFull" in the capture while HAProxy is at 100%. It really looks like a bug with jumbo frames or some other performance bottleneck in HAProxy when processing frames. Sometimes HAProxy also just resets (RST) the connection with a large window.
So while HAProxy is at 100% load, it is not able to process the incoming packets, which seems to result in broken connections. The TCP window sometimes grows to nearly 90,000 and HAProxy then blocks. Maybe some deeper problem?

@addy90
Author

addy90 commented Mar 8, 2018

I reverted via opnsense-revert -r 18.1.2 os-haproxy and everything is working as expected again!

@fichtner
Member

fichtner commented Mar 9, 2018

Are you sure? "os-haproxy" is the plugin itself, not HAProxy 1.8 / 1.7, so if it's in the plugin @fraenki would be able to figure it out. :)

@fichtner fichtner added the bug Production bug label Mar 9, 2018
@addy90
Author

addy90 commented Mar 9, 2018

Yes, os-haproxy 2.5 installs haproxy-devel, which seems to have this bug (it's HAProxy 1.8.4 as far as I could see). Reverting haproxy-devel itself is not possible, because there was no haproxy-devel package in 18.1.2, only haproxy-1.7.
So I need to revert the plugin itself to 2.4 from 18.1.2 in order to get from haproxy-devel back to haproxy-1.7.

I just validated this again: I upgraded to os-haproxy 2.5 via the GUI and had to "Apply" the settings again in HAProxy. Then the problems were back, including the high load and no response for up to 30 seconds on sufficiently complex website reloads (as with Ctrl+F5). I also had some Layer 4 health check timeouts again.
Then I reverted to os-haproxy 2.4 via the console, reapplied the settings, and everything works as expected again, with no high load (as there are no more timeouts).

@fichtner
Member

fichtner commented Mar 9, 2018

Oh you are right... FreeBSD still has haproxy and haproxy-devel split. :)

@addy90
Author

addy90 commented Mar 22, 2018

Just for information: 18.1.5 does not solve the problem; I had to revert the os-haproxy package to 18.1.2 again. Is there a way to freeze this package, or do I have to revert it every time I update OPNsense?

@fichtner
Member

Go to System: Firmware: Packages, find "os-haproxy" and click "lock".
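On the console, a comparable lock can probably be set with pkg(8). The sketch below assumes that the GUI lock corresponds to `pkg lock`, which this thread does not confirm. It only prints the commands, one package per call, so nothing is changed until they are run by hand.

```sh
#!/bin/sh
# Print the pkg(8) commands that would lock each named package,
# followed by the command that lists all locked packages.
lock_cmds() {
    for p in "$@"; do
        echo "pkg lock -y $p"
    done
    echo "pkg lock -l"
}

lock_cmds os-haproxy
```

`pkg unlock` reverses the lock once a fixed version is available.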

@addy90
Author

addy90 commented Mar 22, 2018

omg, so easy... :X I just did not look there... thank you!

@fichtner
Member

sure thing :) does upstream / FreeBSD know this is happening for 1.8 ? I don't believe this is OPNsense-specific... further amplified because FreeBSD keeps this in "devel" mode instead of shipping the latest release...

@fichtner
Member

1.8.5 is out now, not sure if it addresses your issue: https://www.mail-archive.com/haproxy@formilux.org/msg29401.html

@addy90
Author

addy90 commented Mar 24, 2018

I am not sure whether this is an OPNsense-specific issue or not. I would have to set up an additional HAProxy to test this and have not had a chance to try it yet. Maybe 1.8.5 helps? I am not sure; several bugs with 100% CPU usage are listed, but mostly related to multithreading (multiprocess?). As far as I could see, even 1.8.4 on OPNsense uses only one process, but I am not sure whether it uses multiple threads.
So we should definitely try 1.8.5, and if it still has problems, I will set up an additional HAProxy on a clean FreeBSD 11.1 to find out more.
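Whether multiple processes or threads are requested can be read from the generated configuration: HAProxy 1.8 uses the global directives `nbproc` and `nbthread` for this. A small sketch follows, with a hypothetical helper name and the config path mentioned earlier in this thread.

```sh
#!/bin/sh
# Show nbproc / nbthread directives of a config file, or say that none are set.
threading_directives() {
    grep -E '^[[:space:]]*(nbproc|nbthread)[[:space:]]' "$1" \
        || echo "neither nbproc nor nbthread is set"
}

conf=/usr/local/etc/haproxy.conf   # path mentioned earlier in this thread
if [ -r "$conf" ]; then
    threading_directives "$conf"
fi
```

As far as I know, a 1.8 instance with neither directive set runs as one process with one thread, so the multithreading bugs would then be less likely candidates.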

@lukrop

lukrop commented Apr 3, 2018

We are experiencing the same symptoms. Reverting to os-haproxy 2.4 as @addy90 suggested fixed the issue. Make sure to restart haproxy/press "Apply" in the GUI.
I'll be locking os-haproxy to 2.4 until the issue is resolved.

@fraenki fraenki added the upstream Third party issue label Apr 5, 2018
@fraenki
Member

fraenki commented Apr 5, 2018

Thanks all for the reports and sorry for the long period of silence!

This is an upstream bug in HAProxy 1.8. There is no fix available yet. It does not occur with all configurations, so the only workaround is to stay on HAProxy 1.7 until a fix is available.

That being said, I encourage everyone to help get this bug fixed by contributing debug information and example configurations (to help reproduce this bug). I'm aware of the following upstream threads regarding this (or similar) issues, feel free to contribute:

https://discourse.haproxy.org/t/haproxy-1-8-4-at-100-cpu-right-after-startup/2218
https://www.mail-archive.com/haproxy@formilux.org/msg29416.html
https://www.mail-archive.com/haproxy@formilux.org/msg29289.html

Please read these threads thoroughly; they contain further details on how to debug this issue.

@ripkens

ripkens commented Apr 8, 2018

I can confirm that running "opnsense-revert -r 18.1.2 os-haproxy" in a shell over SSH makes HAProxy work as expected.

@NunoHiggs

NunoHiggs commented Apr 11, 2018

I just upgraded to 18.1.6 and the issue is even stranger.
I had to do a full reinstall of the firewall, and when I tried to downgrade HAProxy, it was impossible using the above commands:

root@opncluster0101:~ # /usr/local/sbin/haproxy  -v
HA-Proxy version 1.8.5 2018/03/23
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

root@opncluster0101:~ # opnsense-revert -r 18.1.2 os-haproxy
Fetching os-haproxy.txz: ... done
Verifying signature with trusted certificate pkg.opnsense.org.20171219... done
os-haproxy-2.6: already unlocked
Updating OPNsense repository catalogue...
OPNsense repository is up to date.
All repositories are up to date.
pkg-static: os-haproxy has a missing dependency: haproxy
Checking integrity... done (0 conflicting)
The following 1 package(s) will be affected (of 0 checked):

New packages to be INSTALLED:
        os-haproxy: 2.4

Number of packages to be installed: 1
[1/1] Installing os-haproxy-2.4...
Extracting os-haproxy-2.4: 100%
Stopping configd...done
Starting configd.
Keep version OPNsense\HAProxy\HAProxy (2.2.0)
Configuring system logging...done.
Reloading template OPNsense/HAProxy: OK
root@opncluster0101:~ # /usr/local/sbin/haproxy -v
HA-Proxy version 1.8.5 2018/03/23 (same version as before)
Copyright 2000-2018 Willy Tarreau <willy@haproxy.org>

Then I also downgraded the haproxy-devel package:

# opnsense-revert -r 18.1.2 haproxy-devel
Fetching haproxy-devel.txz: ... done
Verifying signature with trusted certificate pkg.opnsense.org.20171219... done
haproxy-devel-1.8.5: already unlocked
Updating OPNsense repository catalogue...
OPNsense repository is up to date.
All repositories are up to date.
Checking integrity... done (0 conflicting)
The following 1 package(s) will be affected (of 0 checked):

New packages to be INSTALLED:
        haproxy-devel: 1.8.3

Number of packages to be installed: 1

The process will require 3 MiB more space.
[1/1] Installing haproxy-devel-1.8.3...
Extracting haproxy-devel-1.8.3: 100%
root@pfsense01:~ # /usr/local/sbin/haproxy -v
HA-Proxy version 1.8.3-205f675 2017/12/30
Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>

In this version of HAProxy, the high CPU load is not showing, even under heavy network load.
Don't forget to lock both packages after that.
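The log above shows that reverting os-haproxy alone left the 1.8.5 binary in place, so it is worth checking the binary version after every revert. A minimal sketch that extracts the version from `haproxy -v` output follows; the helper names are made up for illustration.

```sh
#!/bin/sh
# Read `haproxy -v` output on stdin and print only the version number,
# e.g. "1.8.5" for the line "HA-Proxy version 1.8.5 2018/03/23".
haproxy_version() {
    sed -n 's/^HA-Proxy version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Print only the release series, e.g. "1.8" for "1.8.5".
haproxy_series() {
    haproxy_version | cut -d. -f1,2
}

bin=/usr/local/sbin/haproxy   # path used in the log above
if [ -x "$bin" ]; then
    "$bin" -v | haproxy_series
fi
```

If the series printed after a revert is still the one from before, only the plugin was downgraded and the binary package has to be reverted separately.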

@fraenki
Member

fraenki commented Apr 16, 2018

HAProxy 1.8.7 does not have a fix for this issue. If you're affected by this bug, please do not upgrade OPNsense until a fix becomes available.

@Kali-

Kali- commented Apr 16, 2018

I'm also affected by this bug. As a workaround I have manually replaced the binary with the one from 1.7.10 and removed tune.lua.maxmem from the haproxy.conf template (/usr/local/opnsense/service/templates/OPNsense/HAProxy/haproxy.conf).
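The template edit could be sketched as below. It assumes that tune.lua.maxmem sits on a line of its own in the template, which should be checked first: if the line is wrapped in template conditionals, those need to be adjusted by hand. The sketch writes to a separate file and never touches the original.

```sh
#!/bin/sh
# Print the given file without any line that mentions tune.lua.maxmem.
strip_lua_maxmem() {
    grep -v 'tune\.lua\.maxmem' "$1"
}

tpl=/usr/local/opnsense/service/templates/OPNsense/HAProxy/haproxy.conf
if [ -r "$tpl" ]; then
    # /tmp/haproxy.conf.template.new is an arbitrary scratch file name.
    strip_lua_maxmem "$tpl" > /tmp/haproxy.conf.template.new
    # Review the difference before copying anything back.
    diff "$tpl" /tmp/haproxy.conf.template.new || true
fi
```

A plugin update will most likely restore the original template, so the change has to be repeated afterwards.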

@Kali-

Kali- commented Jul 2, 2018

Has anyone tested HAProxy > 1.8.7?

@fichtner
Member

fichtner commented Jul 2, 2018

HAProxy 1.8.12 is included in OPNsense 18.1.11.

@fraenki
Member

fraenki commented Jul 3, 2018

HAProxy 1.8.12 indeed looks very promising. The announcement (for 1.8.10) specifically mentions that a 100% CPU issue is fixed. I have yet to test this new release myself.
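To check whether a given box is already past that release, a simple numeric version comparison is enough. The sketch below takes 1.8.10 from the announcement mentioned above as the threshold and assumes three-part version numbers with fields below 1000; the function names are hypothetical.

```sh
#!/bin/sh
# Turn a dotted version into one sortable integer, e.g. 1.8.12 -> 1008012.
ver_key() {
    echo "$1" | awk -F. '{ printf "%d%03d%03d\n", $1, $2, $3 }'
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(ver_key "$1")" -ge "$(ver_key "$2")" ]
}

for v in 1.8.7 1.8.12; do
    if version_ge "$v" 1.8.10; then
        echo "$v: at or past 1.8.10"
    else
        echo "$v: older than 1.8.10"
    fi
done
```

Note that this only compares numbers; releases from the 1.7 series also count as "older" even though they were not affected.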

@fichtner
Member

fichtner commented Jul 3, 2018

Updates have been smooth so far; I have heard no complaints about the recent HAProxy version bumps. Very good engineering on their part if true. Like it. :)

@addy90
Author

addy90 commented Jul 4, 2018

I will test the new version over the next few days, and if it does what it claims, I will roll it out in our production environment next week.

Sorry for offtopic, but: I love you guys for helping and keeping us updated! :) Running five instances of OPNsense already in different environments, it more and more looks like this was the right decision! Thank you!

@fichtner
Member

fichtner commented Jul 4, 2018

Be sure to let us know how the testing goes.

And no need to be sorry, thank you. ❤️

@fraenki
Member

fraenki commented Jul 6, 2018

Just upgraded one box to OPNsense 18.1.11 and HAProxy 1.8.12 and the 100% CPU issue is gone.

@addy90
Author

addy90 commented Jul 6, 2018

Me too, upgraded two instances to HAProxy 1.8.12 now and it seems to be working great so far :)

@fraenki fraenki closed this as completed Jul 8, 2018

7 participants