Fixed resource leaks in runtime events system#13419
Conversation
gasche
left a comment
There was a problem hiding this comment.
Approved. Could you maybe move the Changes entry from the "Working version" section to the "5.3.0" section, so that I don't need to tweak it when cherry-picking?
There is a failure of tests/lib-runtime-events/test_create_cursor_failure in the i386 CI, which needs to be investigated before this can be merged.
... testing 'test_create_cursor_failures.ml' with line 6 (bytecode) => failed
@@ -1,6 +1,6 @@
Runtime_events: no ring for current process. Was runtime_events started?
OK
OK
-Runtime_events: could not create cursor for specified path.
-Runtime_events: could not create cursor for specified path.
-Runtime_events: could not create cursor for specified path.
+OK
+OK
+OK
Done.
Curious - cursor creation isn't failing as it should. I'll look into it. |
I set up an i386 environment but couldn't reproduce this. |
| Runtime_events.pause () | ||
|
|
||
| (* workaround for finding the events file even on Windows, where | ||
| [Unix.getpid] doesn't match the one used to open the file *) |
There was a problem hiding this comment.
This is, ahem, horrible. It sounds like something that could be an issue for our users as well. Could we fix this by ensuring that there is a programmatic way, from OCaml, to guess the correct event file for the current program? We could change the runtime-events code to use the same pid as Unix.getpid, or expose a way (maybe a new function in the runtime_events module?) to get whatever other notion of pid was used.
There was a problem hiding this comment.
Note: another option would be to not work on this problem at all on this PR, and restrict the test to not-windows and write the simple solution to get the pid file. (I'm curious to know whether the testsuite failure would go away in this case.)
| let startup_events_files = list_events_files () in | ||
| fun () -> | ||
| let is_new_event_file f = not (List.mem f startup_events_files) in | ||
| List.find is_new_event_file (list_events_files ()) |
There was a problem hiding this comment.
The code here looks fairly optimistic in terms of the expected semantics of the filesystem. It assumes that all .events file from past test runs are visible in the filesystem when this program starts (but, as far as I know, there is no fsync call anywhere in the runtime that would give this guarantee), and then at call-time it assumes that the .events file created by the current process at runtime-events startup time is already visible in the filesystem. (The latter is not the cause of the weird test failure in this PR I guess, otherwise we would see a Not_found exception.)
There was a problem hiding this comment.
Isn't there a guarantee that files created by a process are subsequently visible to that process? Huh.
There was a problem hiding this comment.
Hm, I must misremember those things and your interpretation is plausible. But that would only cover the second issue I mentioned, not the accuracy of the first call to list_event_files that happens at program startup and lists "startup files" from other processes, right?
There was a problem hiding this comment.
I quite agree: Runtime_events should have a function to return the path of the current process's events file (possibly a function on a cursor, returning the file name, as one can readily make a cursor for the current process).
There was a problem hiding this comment.
On reflection, there's no reason for it to be a function on a cursor.
There was a problem hiding this comment.
Incidentally, while I'm looking at resource safety in the runtime events system, I think there is a lack of data hygiene in caml_runtime_events_current_location, which hands out the value of the module-local string pointer current_ring_loc - so one domain might use it after another domain frees it. I'm changing it to always hand out a copy.
There was a problem hiding this comment.
OK, I've tidied that up and added Runtime_events.path to address your points about finding the events file.
|
|
||
| val path : unit -> string | ||
| (** [path ()] returns the path to the runtime events file, if runtime | ||
| events are being collected. *) |
There was a problem hiding this comment.
And otherwise, what? (Answer: it fails with a Failure _ exception, which could be documented.) I would prefer to have the function return a string option.
There was a problem hiding this comment.
I considered that, but thought this was "good enough" for an interface in which other exceptional cases are not documented (e.g. Runtime_events.create_cursor) and other functions (e.g. pause) silently do nothing if circumstances aren't right (e.g. events aren't enabled). The whole API is perhaps a bit shonky. But that's no excuse to add more shonkiness; I'll tidy this up.
There was a problem hiding this comment.
I see; with github reviews we don't really see the diff in context, so I tend to be maximally nitpicky by default, apologies.
|
Thanks for the simplification to the failing test. Unfortunately it looks like this still fails for |
8f3e69d to
c6f83d9
Compare
Can't reproduce. |
|
Ah, there is also a MSVC 64bits build failure in the testsuite: (The wchar_t thing might suggest that |
|
The fact that you can't reproduce does not immediately tell us how to deal with the issue, unfortunately. We cannot just merge the PR as-is and break the i386 CI, so we have to decide between:
Which route is best depends on whether it is a real issue that does not occur under your particular setup (but does on the CI), or whether this is not a real issue and just an artifact of some weird setup on the CI machine. It's tempting to bet on the second option, but I really have no idea myself. This is also not the focus of this PR, so I'm happy to cut our losses and silence this one CI configuration for that one test. |
runtime/runtime_events.c
Outdated
| caml_fatal_error("Couldn't open ring buffer loc: %s", | ||
| ring_loc_u8); | ||
| caml_stat_free(ring_loc_u8); | ||
| char* ring_loc_u8 = caml_stat_strdup_noexc_os(current_ring_loc); |
There was a problem hiding this comment.
The function caml_stat_strdup_noexc_os does a "os" -> "os" copy. Here you would need a "os" -> "utf-8" copy instead (like the original caml_stat_strdup_of_os).
There was a problem hiding this comment.
My mistake. I was tripped up by the similarity between the function names caml_stat_strdup_of_os (which returns a char * and can't raise an exception) and caml_stat_strdup_os (which returns a char_os * and can). I guess from the previous state of the code that I want MBCS for the caml_fatal_error, so I should use caml_stat_strdup_of_os. But my change was intended to avoid the (in fact impossible) exception, surviving a failure of the strdup. So looking at the implementation of caml_stat_strdup_of_os, it does a straight malloc under the hood, then calls WideCharToMultiByte with the un-validated return value. So that still seems unsafe: are we going to get NULL, or a SEGV?
This is getting quite bikesheddy: producing a useful caml_fatal_error message despite the failure of this particular malloc might not be a very high priority. But I can tweak caml_stat_strdup_of_os.
There was a problem hiding this comment.
So looking at the implementation of
caml_stat_strdup_of_os, it does a straightmallocunder the hood, then callsWideCharToMultiBytewith the un-validated return value.
Sorry, but caml_stat_strdup_of_utf16 uses caml_stat_alloc, which (if memory serves) can raise "out of memory" exception; am I missing something?
Lines 967 to 977 in fc98dd1
There was a problem hiding this comment.
Argh, I'm touching the implementation in yacc/wstr.c, rather than the one in runtime/win32.c. PEBKAC. I'm going home.
There was a problem hiding this comment.
OK, I've cleaned this up with a new caml_stat_strdup_noexc_of_os.
|
Regarding disabling the test on
|
I am suspicious of the i386 CI setup, but have wildly exceeded my effort budget for this PR. I'm going to ask @dra27 to poke at it with me when he's back from ICFP. |
…umer.c` Co-authored-by: Nick Barnes <nick@tarides.com>
Test cases for some recent bugs in lib-runtime-events Co-authored-by: Nick Barnes <nick@tarides.com>
|
@dra27 spots that the i386 CI at present runs everything as root, so the events file is readable to the test despite the permission bits all being cleared by |
|
CI fixed in dra27@1434c45 - I don't seem to be able to push to the PR directly. |
|
Oof! Congratulations on the CI fixes. Merging. |
|
@gasche, it looks like you never cherry-picked this PR to 5.3 . Any specific reasons? If there are no specific reasons, I am inclined to do so to give more freeway to the runtime events library. |
Fixed resource leaks in runtime events system (cherry picked from commit 05f7e7a)
|
Cherry-picked to 5.3 as planned initially in 64f6c2e . |
As @eutro is currently unavailable to shepherd #13091 to merge, here I have rebased it on to trunk, with some additional changes to reflect my review comments. The discussion on #13091 mostly applies here equally.
In email @eutro wrote: "In the interest of getting it merged in a timely manner, it's probably best if you do open a new PR, since I'm not sure when I would have gotten around to addressing the review myself. The code looks good to me."