Skip to content

Fixed resource leaks in runtime events system#13419

Merged
gasche merged 3 commits intoocaml:trunkfrom
NickBarnes:pr-13091
Sep 10, 2024
Merged

Fixed resource leaks in runtime events system#13419
gasche merged 3 commits intoocaml:trunkfrom
NickBarnes:pr-13091

Conversation

@NickBarnes
Copy link
Contributor

As @eutro is currently unavailable to shepherd #13091 to merge, here I have rebased it on to trunk, with some additional changes to reflect my review comments. The discussion on #13091 mostly applies here equally.

In email @eutro wrote: "In the interest of getting it merged in a timely manner, it's probably best if you do open a new PR, since I'm not sure when I would have gotten around to addressing the review myself. The code looks good to me."

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Could you maybe move the Changes entry from the "Working version" section to the "5.3.0" section, so that I don't need to tweak it when cherry-picking?

There is a failure of tests/lib-runtime-events/test_create_cursor_failure in the i386 CI, which needs to be investigated before this can be merged.

 ... testing 'test_create_cursor_failures.ml' with line 6 (bytecode) => failed

@@ -1,6 +1,6 @@
 Runtime_events: no ring for current process.          Was runtime_events started?
 OK
 OK
-Runtime_events: could not create cursor for specified path.
-Runtime_events: could not create cursor for specified path.
-Runtime_events: could not create cursor for specified path.
+OK
+OK
+OK

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

@NickBarnes
Copy link
Contributor Author

Approved. Could you maybe move the Changes entry from the "Working version" section to the "5.3.0" section, so that I don't need to tweak it when cherry-picking?

Done.

There is a failure of tests/lib-runtime-events/test_create_cursor_failure in the i386 CI, which needs to be investigated before this can be merged.

Curious - cursor creation isn't failing as it should. I'll look into it.

@NickBarnes
Copy link
Contributor Author

There is a failure of tests/lib-runtime-events/test_create_cursor_failure in the i386 CI, which needs to be investigated before this can be merged.

Curious - cursor creation isn't failing as it should. I'll look into it.

I set up an i386 environment but couldn't reproduce this.

Runtime_events.pause ()

(* workaround for finding the events file even on Windows, where
[Unix.getpid] doesn't match the one used to open the file *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is, ahem, horrible. It sounds like something that could be an issue for our users as well. Could we fix this by ensuring that there is a programmatic way, from OCaml, to guess the correct event file for the current program? We could change the runtime-events code to use the same pid as Unix.getpid, or expose a way (maybe a new function in the runtime_events module?) to get whatever other notion of pid was used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: another option would be to not work on this problem at all on this PR, and restrict the test to not-windows and write the simple solution to get the pid file. (I'm curious to know whether the testsuite failure would go away in this case.)

let startup_events_files = list_events_files () in
fun () ->
let is_new_event_file f = not (List.mem f startup_events_files) in
List.find is_new_event_file (list_events_files ())
Copy link
Member

@gasche gasche Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here looks fairly optimistic in terms of the expected semantics of the filesystem. It assumes that all .events file from past test runs are visible in the filesystem when this program starts (but, as far as I know, there is no fsync call anywhere in the runtime that would give this guarantee), and then at call-time it assumes that the .events file created by the current process at runtime-events startup time is already visible in the filesystem. (The latter is not the cause of the weird test failure in this PR I guess, otherwise we would see a Not_found exception.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a guarantee that files created by a process are subsequently visible to that process? Huh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I must misremember those things and your interpretation is plausible. But that would only cover the second issue I mentioned, not the accuracy of the first call to list_event_files that happens at program startup and lists "startup files" from other processes, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite agree: Runtime_events should have a function to return the path of the current process's events file (possibly a function on a cursor, returning the file name, as one can readily make a cursor for the current process).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection, there's no reason for it to be a function on a cursor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally, while I'm looking at resource safety in the runtime events system, I think there is a lack of data hygiene in caml_runtime_events_current_location, which hands out the value of the module-local string pointer current_ring_loc - so one domain might use it after another domain frees it. I'm changing it to always hand out a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've tidied that up and added Runtime_events.path to address your points about finding the events file.


val path : unit -> string
(** [path ()] returns the path to the runtime events file, if runtime
events are being collected. *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And otherwise, what? (Answer: it fails with a Failure _ exception, which could be documented.) I would prefer to have the function return a string option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, but thought this was "good enough" for an interface in which other exceptional cases are not documented (e.g. Runtime_events.create_cursor) and other functions (e.g. pause) silently do nothing if circumstances aren't right (e.g. events aren't enabled). The whole API is perhaps a bit shonky. But that's no excuse to add more shonkiness; I'll tidy this up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; with github reviews we don't really see the diff in context, so I tend to be maximally nitpicky by default, apologies.

@gasche
Copy link
Member

gasche commented Sep 5, 2024

Thanks for the simplification to the failing test. Unfortunately it looks like this still fails for i386, so we probably need to think about this further.

@NickBarnes NickBarnes force-pushed the pr-13091 branch 2 times, most recently from 8f3e69d to c6f83d9 Compare September 6, 2024 13:43
@NickBarnes
Copy link
Contributor Author

Thanks for the simplification to the failing test. Unfortunately it looks like this still fails for i386, so we probably need to think about this further.

Can't reproduce.

$ docker run -it --platform linux/i386 ocaml-i386
root@b29ea0b89219:/# su - dixon
dixon@b29ea0b89219:~$ cd ocaml
dixon@b29ea0b89219:~/ocaml$ git checkout pr-13091
[...]
dixon@b29ea0b89219:~/ocaml$ make distclean
[...]
dixon@b29ea0b89219:~/ocaml$ ./configure --prefix=`opam var prefix`
[...]
dixon@b29ea0b89219:~/ocaml$ make -j 4
[...]
dixon@b29ea0b89219:~/ocaml$ make install
[...]
dixon@b29ea0b89219:~/ocaml$ make -C testsuite one TEST=tests/lib-runtime-events/test_create_cursor_failures.ml
make: Entering directory '/home/dixon/ocaml/testsuite'
 ... testing 'test_create_cursor_failures.ml' => passed
make[1]: Entering directory '/home/dixon/ocaml/testsuite'
make[1]: Leaving directory '/home/dixon/ocaml/testsuite'
make[1]: Entering directory '/home/dixon/ocaml/testsuite'


Summary:
     1 tests passed
     0 tests skipped
     0 tests failed
     0 tests not started (parent test skipped or failed)
     0 unexpected errors
     1 tests considered
make[1]: Leaving directory '/home/dixon/ocaml/testsuite'
make: Leaving directory '/home/dixon/ocaml/testsuite'
dixon@b29ea0b89219:~/ocaml$ lscpu
Architecture:        i686
CPU op-mode(s):      32-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           0x00
Model:               0
Stepping:            0x0
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
dixon@b29ea0b89219:~/ocaml$

@gasche
Copy link
Member

gasche commented Sep 6, 2024

Ah, there is also a MSVC 64bits build failure in the testsuite:

runtime/runtime_events.c(289): error C2143: syntax error: missing ';' before '}'
runtime/runtime_events.c(282): warning C4133: 'initializing': incompatible types - from 'wchar_t *' to 'char *'

(The wchar_t thing might suggest that char_os should be used somewhere instead of char, I don't know.)

@gasche
Copy link
Member

gasche commented Sep 6, 2024

The fact that you can't reproduce does not immediately tell us how to deal with the issue, unfortunately. We cannot just merge the PR as-is and break the i386 CI, so we have to decide between:

  • heroic attempts to fix the issue without being able to reproduce it easily, or to do CI-golfing to figure out what is going on
  • disabling the test on i386 to make the issue go away, with a comment explaining why we did that

Which route is best depends on whether it is a real issue that does not occur under your particular setup (but does on the CI), or whether this is not a real issue and just an artifact of some weird setup on the CI machine. It's tempting to bet on the second option, but I really have no idea myself. This is also not the focus of this PR, so I'm happy to cut our losses and silence this one CI configuration for that one test.

caml_fatal_error("Couldn't open ring buffer loc: %s",
ring_loc_u8);
caml_stat_free(ring_loc_u8);
char* ring_loc_u8 = caml_stat_strdup_noexc_os(current_ring_loc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function caml_stat_strdup_noexc_os does a "os" -> "os" copy. Here you would need a "os" -> "utf-8" copy instead (like the original caml_stat_strdup_of_os).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake. I was tripped up by the similarity between the function names caml_stat_strdup_of_os (which returns a char * and can't raise an exception) and caml_stat_strdup_os (which returns a char_os * and can). I guess from the previous state of the code that I want MBCS for the caml_fatal_error, so I should use caml_stat_strdup_of_os. But my change was intended to avoid the (in fact impossible) exception, surviving a failure of the strdup. So looking at the implementation of caml_stat_strdup_of_os, it does a straight malloc under the hood, then calls WideCharToMultiByte with the un-validated return value. So that still seems unsafe: are we going to get NULL, or a SEGV?
This is getting quite bikesheddy: producing a useful caml_fatal_error message despite the failure of this particular malloc might not be a very high priority. But I can tweak caml_stat_strdup_of_os.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at the implementation of caml_stat_strdup_of_os, it does a straight malloc under the hood, then calls WideCharToMultiByte with the un-validated return value.

Sorry, but caml_stat_strdup_of_utf16 uses caml_stat_alloc, which (if memory serves) can raise "out of memory" exception; am I missing something?

ocaml/runtime/win32.c

Lines 967 to 977 in fc98dd1

CAMLexport caml_stat_string caml_stat_strdup_of_utf16(const wchar_t *s)
{
caml_stat_string out;
int retcode;
retcode = caml_win32_wide_char_to_multi_byte(s, -1, NULL, 0);
out = caml_stat_alloc(retcode);
caml_win32_wide_char_to_multi_byte(s, -1, out, retcode);
return out;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, I'm touching the implementation in yacc/wstr.c, rather than the one in runtime/win32.c. PEBKAC. I'm going home.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've cleaned this up with a new caml_stat_strdup_noexc_of_os.

@gasche
Copy link
Member

gasche commented Sep 6, 2024

Regarding disabling the test on i386, I have bad news and good news:

  1. ocamltest has a arch_i386 option, but no arch_not_i386 option, so right now you cannot do this
  2. someone implemented negation in ocamltest, and it's juuust waiting for a review from @shindere and @OlivierNicole

@NickBarnes
Copy link
Contributor Author

The fact that you can't reproduce does not immediately tell us how to deal with the issue, unfortunately. We cannot just merge the PR as-is and break the i386 CI, so we have to decide between:

* heroic attempts to fix the issue without being able to reproduce it easily, or to do CI-golfing to figure out what is going on

* disabling the test on i386 to make the issue go away, with a comment explaining why we did that

I am suspicious of the i386 CI setup, but have wildly exceeded my effort budget for this PR. I'm going to ask @dra27 to poke at it with me when he's back from ICFP.

eutro and others added 2 commits September 9, 2024 14:32
…umer.c`

Co-authored-by: Nick Barnes <nick@tarides.com>
Test cases for some recent bugs in lib-runtime-events

Co-authored-by: Nick Barnes <nick@tarides.com>
@NickBarnes
Copy link
Contributor Author

@dra27 spots that the i386 CI at present runs everything as root, so the events file is readable to the test despite the permission bits all being cleared by chmod. So this test, as it stands, will fail when running as root.

@dra27
Copy link
Member

dra27 commented Sep 9, 2024

CI fixed in dra27@1434c45 - I don't seem to be able to push to the PR directly.

@gasche
Copy link
Member

gasche commented Sep 10, 2024

Oof! Congratulations on the CI fixes. Merging.

@gasche gasche merged commit 05f7e7a into ocaml:trunk Sep 10, 2024
@Octachron
Copy link
Member

@gasche, it looks like you never cherry-picked this PR to 5.3 . Any specific reasons? If there are no specific reasons, I am inclined to do so to give more freeway to the runtime events library.

Octachron pushed a commit that referenced this pull request Oct 31, 2024
Fixed resource leaks in runtime events system

(cherry picked from commit 05f7e7a)
@Octachron
Copy link
Member

Cherry-picked to 5.3 as planned initially in 64f6c2e .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants