caml_ml_open_descriptor_out memory leak and file corrupt at_exit #12300
Comments
You raise interesting points. However, there's one thing I don't understand in your
Well, that is what we had been thinking for the past 3 months, until we realized that the 'oc' points to a socket and that the other end of that socket has closed the connection. All these flushes were raising exceptions and never finishing, so the finalizer always sees unflushed data :)
Meanwhile I thought of another way to work around this from OCaml: 'Unix.dup fd', plus another wrapper that calls 'close_out_noerr' (it cannot use 'close_out', because that would always call flush, which fails). This way the channel "owns" the (now duplicated) file descriptor and we can safely close the channel, which avoids the leak. It would still be good to have a solution for this in the runtime though.
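For illustration, the dup + close_out_noerr workaround described above could look roughly like the sketch below. The helper name 'with_dup_out_channel' is hypothetical and is not the code from the affected daemon. It assumes OCaml 4.08 or later (for 'Fun.protect') and the unix library.

```ocaml
(* Build with the unix library, e.g. ocamlfind ocamlopt -package unix ... *)

(* [with_dup_out_channel] is a hypothetical helper, not part of any library.
   It never closes [fd] itself; only the duplicate is closed. *)
let with_dup_out_channel (fd : Unix.file_descr) (f : out_channel -> 'a) : 'a =
  (* The channel wraps a duplicate, so it owns that descriptor outright. *)
  let oc = Unix.out_channel_of_descr (Unix.dup fd) in
  Fun.protect
    (* close_out_noerr ignores flush/close errors, e.g. when the peer has
       already gone away, and still closes the duplicated descriptor. *)
    ~finally:(fun () -> close_out_noerr oc)
    (fun () -> f oc)
```

A caller that wants to notice write errors should still call 'flush' inside 'f'. The trade-off is one extra file descriptor for the lifetime of the channel.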
Right, I see better now, thanks for all the explanations. I was about to suggest dup + close_out, but your idea of dup + close_out_noerr is even better to handle (as gracefully as possible) the case where the other side closes the network connection early. Concerning the memory leak and FD capture you report, I agree we need to reconsider some of the current design choices.
See also #9786 for an interesting discussion of a somewhat related issue.
My current thinking on this issue is that, currently, channels returned by For this reason, I think Some ideas for future improvements:
Is this problem rooted in the broken duality between FDs and channels? FDs are R/W while channels are explicitly read or write. For sockets (or anything R/W) this forces the use of two channels that now have a complicated relationship with the underlying FD and with each other. Could a future improvement be a new channel type (module) that supports R/W and has a 1:1 relationship with the underlying FD? That could mean fading out the current channels over the long run.
I think this is entirely backwards compatible, and should fix most of the memory leaks that I'm seeing. As for in/out channels, I've attempted to write a wrapper like this in the past (but there are too many ways to "escape" this unless I wrap the entire Unix module with a safer interface), but it will cause the original application to use more file descriptors, which can be a problem if you're already close to the 1024 limit imposed by
If you're interested here is my attempt to write a library with a kind of borrow semantics from a while ago, but it is not quite as easy to use as I'd like (and because there is an escape hatch for getting the raw Unix fd it is also easy to misuse), and trying to adopt it in a large codebase would result in a lot of perhaps unnecessary churn and incompatibility with external libraries:
#9786 is related to this "broken duality", but the present PR is not: it is only about output channels.
The problem with RW channels is the transitions between R and W. Reading after writing is OK: you just have to flush the output buffer so that it can act as an input buffer. But writing after reading means discarding the data that was read ahead. That's wrong if you're working with a socket. I'd much prefer to keep in_channel and out_channel distinct types and suggest using
Self-contained testcase reproduced on OCaml 4.14.1 (but should also reproduce on 4.08.1):
There are 2 bugs here:
(Bear with me until the 'Analysis' section below on why I'm reporting 2 bugs in the same issue: they are caused by the same piece of code)
(You can also run the program without ulimit, increase the 'for' loop limit and watch the process eat gigabytes of memory in 'top')
Analysis
Each 'struct channel' allocated when opening a channel (from a fd) contains a 64k buffer:
I think the leak is caused by this code in the finalizer of the channels:
This is a 64KiB/channel leak, which can happen e.g. if the underlying file descriptor returns an error on flush,
such that flush never succeeds (e.g. it is a socket and the other end has closed the connection).
This can be particularly problematic for a long running daemon using sockets (e.g. a webserver).
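The failure mode described here can be sketched as follows. This is not the testcase from the report, only a minimal illustration under stated assumptions: a Unix-domain socketpair stands in for the network socket, SIGPIPE is ignored so the write error surfaces as an exception, and the function name is hypothetical.

```ocaml
(* Requires the unix library. [flush_fails_after_peer_close] is a
   hypothetical name used only for this sketch. *)
let flush_fails_after_peer_close () : bool =
  (* Without this, writing to a closed socket kills the process with SIGPIPE. *)
  Sys.set_signal Sys.sigpipe Sys.Signal_ignore;
  let a, b = Unix.socketpair Unix.PF_UNIX Unix.SOCK_STREAM 0 in
  Unix.close b;                         (* the other end hangs up *)
  let oc = Unix.out_channel_of_descr a in
  output_string oc "pending";           (* sits in the channel buffer *)
  let failed =
    match flush oc with
    | () -> false
    | exception Sys_error _ -> true     (* typically "Broken pipe" *)
  in
  (* The descriptor is closed by its owner, but [oc] was never closed and
     still holds the unflushed bytes when it becomes unreachable. *)
  Unix.close a;
  failed
```

Calling this in a loop and forcing garbage collection is one way to observe whether the per-channel buffers are reclaimed on a given OCaml version.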
The comment says that the channel is kept alive so that the data may be flushed at exit.
However, by that time the underlying file descriptor may have been closed, for example if the channel was constructed by the Unix module and the 'fd' is not owned by the channel itself.
In fact it could be worse than closed: some unrelated file descriptor may have reused its number by the time the process exits, at which point the runtime will flush one channel's data into some other file descriptor's file (potentially corrupting whatever data was meant to be in that other file).
Thus this is not merely an OOM bug, it is also a correctness bug.
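The descriptor-reuse hazard can be shown in a few lines. The sketch below uses two pipes and an explicit 'flush' as a stand-in for the flush that happens at exit. It relies on the POSIX rule that a new descriptor gets the lowest free number, and on 'Unix.file_descr' being a plain integer on Unix systems. The function name is hypothetical.

```ocaml
(* Requires the unix library. Returns the bytes that arrived on the
   unrelated pipe, or None if the descriptor number was not reused. *)
let stale_flush_lands_elsewhere () : string option =
  let r1, w1 = Unix.pipe () in
  let r2, w2 = Unix.pipe () in
  let oc = Unix.out_channel_of_descr w1 in
  output_string oc "stale";             (* buffered, never flushed *)
  Unix.close w1;                        (* the real owner closes the fd *)
  (* POSIX allocates the lowest free number, so this usually reuses w1's. *)
  let w2' = Unix.dup w2 in
  let result =
    if w2' <> w1 then (Unix.close w2'; None)
    else begin
      flush oc;                         (* stands in for the at_exit flush *)
      let buf = Bytes.create 5 in
      let n = Unix.read r2 buf 0 5 in
      (* Closing the channel closes the reused number, i.e. w2'. *)
      close_out_noerr oc;
      Some (Bytes.sub_string buf 0 n)
    end
  in
  List.iter Unix.close [r1; r2; w2];
  result
```

If the number is reused, the data written through the stale channel is read back from the second pipe, which never had anything written to it on purpose.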
Suggested fix
Remove this 'flush on exit' behaviour for channels created by the Unix module: the channel does not have full control of the file descriptor, which may have been closed. This could cause unexpected behaviour for applications that rely on this (unsafe) behaviour.
Remove this behaviour for channels that are based on Unix sockets (whether the channel fully manages the file descriptor or not): retaining the channel so that it can be flushed on exit may not work.
Change 'caml_ml_flush' to set a flag such that the finalizer may discard the channel if flushing was attempted but failed.
The last option would fix the immediate bug reported here, although it still leaves open the possibility that, if someone forgets to call 'flush', the runtime may write data into the wrong file 'at_exit'.
I'm happy to help with creating a patch for these, but would be happy to hear thoughts on how else this could be fixed.
Background
We've been trying to track a leak in a long running OCaml daemon for the past 3 months that only happened in a production environment, and caused that particular OCaml daemon to end up using gigabytes of memory after a while.
Nothing suspicious as far as OCaml heap usage goes, but using 'jemalloc' we got some statistics from the C side that showed that
caml_ml_open_descriptor_out -> caml_open_descriptor_in -> caml_stat_alloc
was leaking about 800 MiB+ of memory. The caller was this OCaml code:
We looked at this code and concluded that, because we always call flush, the finalizer should always free the memory and the above codepath should not be entered. The breakthrough came this morning when we realized that these are sockets, and that flushing may indeed fail if the other end has already closed the connection. With that information I was able to write the above testcase, which immediately reproduced the problem.
There really isn't much this code could do better: it cannot close the channel because that would close the underlying Unix file descriptor which this function does not own (it got passed as a parameter), and there isn't anything other than 'flush' it could call to 'clear' the pending data buffer AFAIK either.
We could refactor the code to use channels instead of Unix file descriptors, such that this function gets a channel passed in (that it can 'close_out_noerr' on error), but the OCaml runtime leak should still be fixed.
Looking closer at the reason for the OCaml runtime leak (attempting to flush the channel 'at_exit'), that behaviour is unsafe in general: the underlying Unix file descriptor could have been closed and reused by something else, at which point 'at_exit' will "corrupt" this other file by writing some other channel's data into it.