Stale HAProxy configurations remain listening #71
Comments
Sounds a lot like issues we've had in the past. HAProxy, when doing a soft reload, was spawning a new process with the new config and keeping the old process around to finish requests currently in progress. But the old process was also accepting new requests, so the old and new configs were both staying alive. Updating HAProxy to the most recent version fixed that issue for us. On top of that, at least under systemd, haproxy seems to have additional issues reloading properly.
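(For reference, a soft reload is typically invoked along these lines; the paths below are illustrative rather than marathon-lb's exact invocation:)

```bash
# Illustrative HAProxy soft reload (not marathon-lb's exact command).
# -sf tells the new process to signal the old PIDs so they finish their
# in-flight requests and exit; if an old process keeps accepting new
# connections (the bug described above), it never goes away and both
# configs stay live.
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
    -sf $(cat /var/run/haproxy.pid)
```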
We are using the marathon-lb docker container, which is running HAProxy 1.5.8.
We are running HAProxy 1.5.14 right now.
When this occurs, can you check how many instances of HAProxy are running, e.g. with a process listing on the host?
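(The exact command isn't preserved above, but a check along these lines would show any extra processes:)

```bash
# Hypothetical check; the bracketed pattern keeps grep from matching itself.
ps aux | grep '[h]aproxy'
# Or just count the haproxy processes:
pgrep -c haproxy
```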
I will do that @brndnmtthws:
Interesting. You definitely have some extra haproxy instances there. Any chance you can test with the current master code? I made a few changes to how the reloads are handled, and I haven't seen the same behaviour in my recent testing.
I think it will help, and it's on my roadmap, but currently we don't have a nice way of getting the information required for the $PORTS variable in an automated fashion, so it will take a bit of time. I'll try to get it wrapped up next week and report back. Thanks for looking into it.
It would be sufficient to specify some subset of ports (or even just one port, for that matter) to test. At the very least, it wouldn't be any worse than what you're using now.
OK, I was under the impression (only because I didn't investigate too deeply) that it wouldn't work unless I had all the $PORTS configured. That makes it a lot easier to test. Thanks.
The only limitation is that reloads will not be completely 'zero-downtime' unless you supply the ports ahead of time.
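(For anyone trying the same thing, the invocation below is only a sketch: the $PORTS variable name comes from the discussion above, and the Marathon URL is a placeholder; check service/haproxy/run and the README for the exact expected format.)

```bash
# Sketch only: supply a single known port ahead of time so reloads can be
# made (closer to) zero-downtime. Variable name/format assumed from this
# thread; marathon.example.com is a placeholder.
docker run -d --net=host \
  -e PORTS=80 \
  mesosphere/marathon-lb sse --marathon http://marathon.example.com:8080 --group external
```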
I have upgraded the marathon-lb docker container to the 1.1.1 tag in our development environment. I haven't noticed the problem so far, but as it was fairly unpredictable to begin with, it may not come up for some time. I'll roll with this for a while and hope it keeps working. Here is the same output as before, for comparison:
Cool, that looks better. I'm going to close the issue for now, but please reopen it if it comes up again.
We had an issue with this in production last night. I wasn't the one troubleshooting it, so my information is limited, but there were many haproxy instances running on the host:
Never mind, I hadn't upgraded our production environment yet. My mistake.
Well, we are still having the issue even after upgrading in production.
Strange. Are you sure there are no long-lived TCP connections? Are you using anything like websockets?
I know that there are long-running connections, and yes, that is probably the real source of our problem. I'll be doing work today to stop those long-running connections from being proxied through marathon-lb. In general, do you have any suggestions for proxying long-running connections with Mesos/Marathon?
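(One generic mitigation, independent of marathon-lb: bound the relevant HAProxy timeouts so idle long-lived connections can't pin an old process open indefinitely after a reload. The fragment below uses example values, not marathon-lb's generated config.)

```
# Illustrative haproxy.cfg fragment (example values only).
defaults
  timeout client  30s   # idle client connections are dropped after 30s
  timeout server  30s   # idle server connections are dropped after 30s
  timeout tunnel  1h    # upper bound for websocket/tunnelled connections
```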
I am seeing similar issues at QubitProducts/bamboo#200. Have you found a solution for the stale processes? There should not be any long-running connections in our setup.
We upgraded marathon and marathon-lb and removed long-running connections from marathon and marathon-lb. Since then, the number of problems related to stale connections has dropped to almost zero. Once in a while we still get some 502 errors from haproxy; usually this coincides with a flapping service in marathon. I think if you can slow down the rate of reconfigurations you are likely to avoid issues. Good luck!
We are seeing an interesting issue with v1.0.1 of marathon-lb running in SSE mode.
On a reconfiguration, the older process occasionally remains listening, leaving a stale configuration in place and causing the newer process to fail.
I have not been able to reliably reproduce it, but it has caused several issues for us recently. It seems to occur when a service is flapping, or when many deployments are occurring at once. I have a theory that there is a race condition between when the new process has successfully started and when the pidfile is read by the next starting process. (https://github.com/mesosphere/marathon-lb/blob/master/service/haproxy/run#L30)
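(A rough sketch of the suspected race, assuming the run script reloads roughly as below; the config path and pidfile location are illustrative:)

```bash
# Simplified reload step (illustrative paths, not the actual run script):
PIDFILE=/var/run/haproxy.pid
haproxy -f /marathon-lb/haproxy.cfg -p "$PIDFILE" -sf $(cat "$PIDFILE" 2>/dev/null)

# Suspected race: if a second reload reads $PIDFILE before the process
# started by the first reload has rewritten it, -sf is given stale (or no)
# PIDs, so the previous process is never told to finish up and it keeps
# listening with the old configuration.
```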
I'm interested in any ideas anyone might have on how to reliably reproduce this or how to guarantee it doesn't bite us anymore. We are not ready to upgrade to v1.1.0 because we don't currently have an up-front list of ports.