This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Long timeout metrics capture process when process terminates #75

Closed
pecameron opened this issue Mar 23, 2017 · 4 comments

Comments

@pecameron

Description of problem:
After upgrading from OpenShift 3.3 to 3.4, the router metrics exporter pod is no longer working. It fails with what looks like a port conflict, even though nothing appears to be listening on the port.

I don't know whether the exporter sets SO_REUSEPORT on its listening socket (I don't see it in haproxy_exporter.go, and I don't think Go's http.ListenAndServe() does it by default).

See https://bugzilla.redhat.com/show_bug.cgi?id=1426446 for context.
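
To tell a genuine port conflict ("address already in use") apart from a port that nothing is listening on, a minimal Go probe along these lines can be run on the node. This is a sketch; the port 9101 used here is the haproxy_exporter default and is an assumption about this deployment:

```go
// probe.go: try to bind the exporter's metrics port once and report the result.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	ln, err := net.Listen("tcp", ":9101") // assumed default metrics port
	if err != nil {
		// A real conflict surfaces here as "bind: address already in use".
		fmt.Fprintf(os.Stderr, "bind failed: %v\n", err)
		os.Exit(1)
	}
	defer ln.Close()
	fmt.Println("port 9101 was free; nothing else is listening on it")
}
```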

@grobie
Member

grobie commented Mar 23, 2017 via email

@pecameron
Author

pecameron commented Mar 24, 2017 via email

@knobunc

knobunc commented Mar 24, 2017

@grobie unless the program binding to the port sets SO_REUSEPORT, then after it terminates abnormally and is restarted, it will not be able to re-bind to the port until the TIME_WAIT period (several minutes) passes. https://hea-www.harvard.edu/~fine/Tech/addrinuse.html

I think you need to pass SO_REUSEPORT or SO_REUSEADDR.

@pecameron it is possible that SO_REUSEADDR is set by default: https://golang.org/src/net/sockopt_linux.go
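
For illustration, a minimal sketch of setting these options explicitly, assuming a Go version with net.ListenConfig (1.11+, newer than what this thread was built with) and the golang.org/x/sys/unix package. Since Go already applies SO_REUSEADDR to listening sockets, only the SO_REUSEPORT part would actually change behavior:

```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// SO_REUSEADDR: allow re-binding while old sockets sit in TIME_WAIT.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
				if sockErr != nil {
					return
				}
				// SO_REUSEPORT: allow several processes to bind the same port at once.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}

	ln, err := lc.Listen(context.Background(), "tcp", ":9101") // assumed metrics port
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	// Serve whatever handlers the exporter registers on the default mux.
	log.Fatal(http.Serve(ln, nil))
}
```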

@grobie
Copy link
Member

grobie commented Apr 3, 2017

@pecameron I can't comment on your environment; I don't have any experience with OpenShift. The fact that the lsof output shows a running haproxy_exporter process with a PID and a socket in LISTEN state indicates that another process instance was indeed already running.

@knobunc SO_REUSEPORT was added to allow multiple processes to bind to the same port simultaneously. Since haproxy_exporter runs as a single process, it doesn't make sense to set it here. SO_REUSEADDR has always been set by Go on listening sockets; it would indeed be a huge issue otherwise.

Side note: if your server really still uses a TIME_WAIT timeout of "minutes" as you say (2m is the default on Linux IIRC), I highly recommend adjusting tcp_fin_timeout. Fortunately we have moved past the times when routing TCP packets took several tens of seconds. Especially on load balancers where HAProxy acts as a TCP client, I've seen many systems run into port exhaustion otherwise.

I'm not really sure what to do here; I can't reproduce any issues with lingering connections in TIME_WAIT state when restarting haproxy_exporter. The lsof output at the beginning of your bug report looks a lot more suspicious, to be honest. If you have any actual indications or steps showing there is indeed an issue with haproxy_exporter, I'm happy to investigate.
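
For reference, a minimal sketch of that restart scenario, assuming a loopback bind on port 9101 (both assumptions): accept one connection, close the server side first so a socket on the listening port lands in TIME_WAIT, close the listener, and re-bind immediately. With the SO_REUSEADDR that Go sets by default, the second bind is expected to succeed:

```go
package main

import (
	"log"
	"net"
)

func main() {
	const addr = "127.0.0.1:9101"

	ln, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("first bind: %v", err)
	}

	// Open one client connection and keep it reading so the server side
	// performs the active close and therefore ends up in TIME_WAIT.
	done := make(chan struct{})
	go func() {
		defer close(done)
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			return
		}
		buf := make([]byte, 1)
		conn.Read(buf) // returns once the server closes its side
		conn.Close()
	}()

	srvConn, err := ln.Accept()
	if err != nil {
		log.Fatalf("accept: %v", err)
	}
	srvConn.Close() // active close on the listening port -> TIME_WAIT (timing-dependent)
	<-done
	ln.Close() // simulate the exporter terminating

	// Immediate "restart": re-bind the same port while the old connection
	// may still be in TIME_WAIT. With Go's default SO_REUSEADDR this succeeds.
	ln2, err := net.Listen("tcp", addr)
	if err != nil {
		log.Fatalf("re-bind failed: %v", err)
	}
	defer ln2.Close()
	log.Println("re-bind succeeded immediately")
}
```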

@grobie grobie closed this as completed Jul 20, 2017