This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Long timeout metrics capture process when process terminates #75

Closed
pecameron opened this issue Mar 23, 2017 · 4 comments

Comments

@pecameron

Description of problem:
After upgrading from OpenShift 3.3 to 3.4, the router metrics exporter pod is no longer working. It fails with what looks like a port conflict, even though nothing appears to be listening on the port.

I don't know whether the exporter sets SO_REUSEPORT on its listening socket (I don't see it in haproxy_exporter.go, and I don't think Go's http.ListenAndServe() does it by default).

See https://bugzilla.redhat.com/show_bug.cgi?id=1426446 for context.
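
To tell a genuine port conflict ("address already in use") apart from a port that nothing is listening on, a minimal Go probe along these lines can be run on the node. This is a sketch; the port 9101 used here is the haproxy_exporter default and is an assumption about this deployment:

```go
// probe.go: try to bind the exporter's metrics port once and report the result.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	ln, err := net.Listen("tcp", ":9101") // assumed default metrics port
	if err != nil {
		// A real conflict surfaces here as "bind: address already in use".
		fmt.Fprintf(os.Stderr, "bind failed: %v\n", err)
		os.Exit(1)
	}
	defer ln.Close()
	fmt.Println("port 9101 was free; nothing else is listening on it")
}
```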

@grobie
Member

grobie commented Mar 23, 2017 via email

@pecameron
Author

pecameron commented Mar 24, 2017 via email

@knobunc

knobunc commented Mar 24, 2017

@grobie unless the program binding to the port sets SO_REUSEPORT, then after it terminates abnormally and is restarted, it will not be able to re-bind to the port until the TIME_WAIT period (several minutes) passes. https://hea-www.harvard.edu/~fine/Tech/addrinuse.html

I think you need to pass SO_REUSEPORT or SO_REUSEADDR.

@pecameron it is possible that SO_REUSEADDR is set by default: https://golang.org/src/net/sockopt_linux.go
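
For illustration, a minimal sketch of setting these options explicitly, assuming a Go version with net.ListenConfig (1.11+, newer than what this thread was built with) and the golang.org/x/sys/unix package. Since Go already applies SO_REUSEADDR to listening sockets, only the SO_REUSEPORT part would actually change behavior:

```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// SO_REUSEADDR: allow re-binding while old sockets sit in TIME_WAIT.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
				if sockErr != nil {
					return
				}
				// SO_REUSEPORT: allow several processes to bind the same port at once.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}

	ln, err := lc.Listen(context.Background(), "tcp", ":9101") // assumed metrics port
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	// Serve whatever handlers the exporter registers on the default mux.
	log.Fatal(http.Serve(ln, nil))
}
```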

@grobie
Copy link
Member

grobie commented Apr 3, 2017

@pecameron I can't comment on your environment; I don't have any experience with OpenShift. The fact that the lsof output shows a running haproxy_exporter process with a PID and a socket in LISTEN state indicates that another process instance was indeed already running.

@knobunc SO_REUSEPORT was added to allow multiple processes to bind to the same port simultaneously. Since haproxy_exporter runs as a single process, it doesn't make sense to set it here. SO_REUSEADDR has always been set by Go on listening sockets; it would indeed be a huge issue otherwise.

Side note: if your server really still uses a TIME_WAIT timeout of "minutes" as you say (2m is the default on Linux IIRC), I highly recommend adjusting tcp_fin_timeout. Fortunately we have moved past the times when routing TCP packets took several tens of seconds. Especially on load balancers where HAProxy acts as a TCP client, I've seen many systems run into port exhaustion otherwise.

I'm not really sure what to do here; I can't reproduce any issues with lingering connections in TIME_WAIT state when restarting haproxy_exporter. The lsof output at the beginning of your bug report looks a lot more suspicious, to be honest. If you have any actual indications or steps showing there is indeed an issue with haproxy_exporter, I'm happy to investigate.
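
For reference, a minimal sketch of that restart scenario, assuming a loopback bind on port 9101 (both assumptions): accept one connection, close the server side first so a socket on the listening port lands in TIME_WAIT, close the listener, and re-bind immediately. With the SO_REUSEADDR that Go sets by default, the second bind is expected to succeed:

```go
package main

import (
	"log"
	"net"
)

func main() {
	const addr = "127.0.0.1:9101"

	ln, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("first bind: %v", err)
	}

	// Open one client connection and keep it reading so the server side
	// performs the active close and therefore ends up in TIME_WAIT.
	done := make(chan struct{})
	go func() {
		defer close(done)
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			return
		}
		buf := make([]byte, 1)
		conn.Read(buf) // returns once the server closes its side
		conn.Close()
	}()

	srvConn, err := ln.Accept()
	if err != nil {
		log.Fatalf("accept: %v", err)
	}
	srvConn.Close() // active close on the listening port -> TIME_WAIT (timing-dependent)
	<-done
	ln.Close() // simulate the exporter terminating

	// Immediate "restart": re-bind the same port while the old connection
	// may still be in TIME_WAIT. With Go's default SO_REUSEADDR this succeeds.
	ln2, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("re-bind failed: %v", err)
	}
	defer ln2.Close()
	log.Println("re-bind succeeded immediately")
}
```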

@grobie grobie closed this as completed Jul 20, 2017