"too many open files" possible when using spooling? #480

Closed
Dieterbe opened this issue Nov 17, 2021 · 2 comments
Comments

@Dieterbe
Contributor

A customer ran into "too many open files" when using a basic grafanaNet route and a TCP route with spooling.
TODO:

  1. check the code for any obvious file descriptor leaks
  2. add instrumentation - if we don't have any already - to track the number of open files opened by CRNG
@Dieterbe
Contributor Author

Dieterbe commented Nov 18, 2021

Hypothesis

Usage of spooling in TCP routes (or reconnecting connections) leaks file descriptors, i.e. opens many and doesn't close them.

Investigation approach

  • dig in at the destination level, as it owns the TCP connections and spooling
  • don't investigate route, table and other structures, as they are far removed from the TCP connections and spooling
  • focus on "normal" operation: a TCP destination running, connections dropping and reconnecting, spool usage, etc. Not shutdown of the relay.

Learnings

Diagram of where FD's are created and owned

Destination{
    connUpdate: chan *Conn {
        conn: *net.TCPConn  // these conns get created by NewConn(), which is
                            // called in dest.updateConn(), and are owned by dest.relay()
    }
    spool: Spool {
        queue: nsqd.Diskqueue
    }
}

Connections

  • does Conn close its conn? Yes, upon a network error or a call to Close(). (*)
    Conn does not touch its writer (which embeds the conn) when closing the conn, but that should be OK.

  • updateConn does not close the prior connection, because it does not have access to it.
    updateConn is called from...

  1. destination.relay(), once at startup (when no pre-existing conn), and once if the conn has gone down, at which point it should have been closed (see isAlive check) (*)
  2. when updating the destination's address, at which point a pre-existing, active conn may be live -> this was a bug that is now fixed: the previous conn was not being closed (see the sketch below this list). close conn fix + troubleshooting too many open files #481

(*) these mechanisms rely on channel messaging between different goroutines. If goroutines get stuck, processing may not advance properly. This is best diagnosed with goroutine dumps.
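
A minimal sketch of the close-before-replace discipline that the #481 fix restores for case 2 above, assuming a hypothetical, simplified Destination. It collapses the real channel-based handoff to dest.relay() into a direct call; the names are illustrative, not the actual carbon-relay-ng code:

package destsketch

import "net"

// Destination is a simplified, hypothetical stand-in: it owns at most one
// live TCP connection at a time.
type Destination struct {
    addr string
    conn net.Conn
}

// updateAddr dials the new address and, crucially, closes the previous
// connection before replacing it. Skipping that Close() is exactly the kind
// of leak that surfaces as "too many open files" after enough address updates.
func (d *Destination) updateAddr(newAddr string) error {
    newConn, err := net.Dial("tcp", newAddr)
    if err != nil {
        return err // keep the old conn if the new one can't be established
    }
    if d.conn != nil {
        d.conn.Close() // release the old file descriptor
    }
    d.addr = newAddr
    d.conn = newConn
    return nil
}
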

Spool/diskqueue

  • dest.Run creates the spool via NewSpool() (and that is the only place we call NewSpool)
  • dest.Run runs only when route.Run() and AddDest() are called, which are considered "safe" paths.

So then, looking at NewSpool()...
The spool stays tied to the destination until the dest is shut down, which does not happen during "normal operation". Likewise, one diskqueue is tied to one spool for its entire lifetime.
So it comes down to the diskqueue. After some digging, it seems that it correctly opens/closes its file descriptors.
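
For illustration, a minimal sketch of the write-side FD discipline that keeps the open-file count constant in a segment-file queue: at most one segment file open at a time, and every roll-over closes the old file before the next one is opened. This is a hypothetical writer, not the actual nsqd diskqueue code:

package spoolsketch

import (
    "fmt"
    "os"
    "path/filepath"
)

// diskQueueWriter is a hypothetical, simplified write side of a disk-backed
// spool: it keeps at most one segment file open at a time.
type diskQueueWriter struct {
    dir          string
    maxBytes     int64 // roll to a new segment once this many bytes are written
    fileNum      int
    file         *os.File
    writtenBytes int64
}

func (w *diskQueueWriter) writeOne(data []byte) error {
    if w.file == nil {
        name := filepath.Join(w.dir, fmt.Sprintf("spool.%06d.dat", w.fileNum))
        f, err := os.OpenFile(name, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0600)
        if err != nil {
            return err
        }
        w.file = f
    }
    n, err := w.file.Write(data)
    w.writtenBytes += int64(n)
    if err != nil {
        w.Close()
        return err
    }
    // roll over: the old segment's FD is closed before a new one is opened,
    // so the number of open files stays flat no matter how much gets spooled
    if w.writtenBytes >= w.maxBytes {
        w.writtenBytes = 0
        w.fileNum++
        return w.Close()
    }
    return nil
}

// Close releases the currently open segment file, if any.
func (w *diskQueueWriter) Close() error {
    if w.file == nil {
        return nil
    }
    err := w.file.Close()
    w.file = nil
    return err
}
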

How to avoid in the future?

see the diagnostics instructions added in #481
TODO: instrumentation for open FD's?

@Dieterbe
Copy link
Contributor Author

The two ways I see to instrument open fd's are:

  1. polling /proc/pid/fd and counting the number of files in there.
  2. accounting ourselves, any time we open/close a conn or file.

Option 1 is expensive; option 2 is verbose and possibly inaccurate.
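
For completeness, a minimal sketch of what option 1 would look like (Linux-only, counting entries in /proc/self/fd):

package main

import (
    "fmt"
    "os"
)

// openFDCount counts the entries in /proc/self/fd, i.e. the file descriptors
// currently open by this process. Note that reading the directory itself
// briefly opens one extra descriptor.
func openFDCount() (int, error) {
    entries, err := os.ReadDir("/proc/self/fd")
    if err != nil {
        return 0, err
    }
    return len(entries), nil
}

func main() {
    n, err := openFDCount()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("open file descriptors:", n)
}

Polled on an interval and exported as a gauge, this is the "expensive" variant: a directory read per scrape, but no bookkeeping sprinkled through the connection and spool code.
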

I think I will scrap that idea for now.
