Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
crash in free() logic due to what appears like pointer error in 4.03.0 & 4.04.0 #7457
Original bug ID: 7457
(1) I have a rather multithreaded, networked system with many C libs glued in. It's been running correctly on ocaml 4.02.3 for many months.
(2) The failure is of a unit-test runner, so some interprocess comms, lots of crypto, but no actual off-machine networking.
(3) This failure is very, very reproducible. It happens with 4.03.0, 4.04.0 (both byte & native). IT DOES NOT happen with 4.02.3.
If you have any suggestions for how I might help narrow down (e.g., if you have a pointer to where I can get access to the source-code repository, so I could do bisection-search in the code-repo, I'm happy to do that). I have the reproduction process almost-automated, so while bisection will be time-consuming, it will be labor-free.
Steps to reproduce
Build and run my unit-test. I realize this is a bit vague, and I'm going to work to narrow this down, but I figured I should report it, since it might be known, and also, since it only shows up with ocaml > 4.02.3, that might point to where the problem arises.
Also, I ran both with and without valgrind, on both 4.02.3 and 4.04.0. On 4.02.3, I get no significant valgrind errors (so it's not as if there's a latent error, but not crashing). With 4.04.0, of course I get a failure, and then I run with valgrind to get better crash info.
Here's some information I got from valgrind at the point where the failure occurs. There are many more messages, but there's no point in my sending these in, I suspect, until I can simplify the test. This output came from a run with a native executable.
.==11710== Invalid read of size 1
Comment author: @xavierleroy
Thanks for the very precise valgrind trace, it was really informative.
Based on this trace, I attached a plausible fix for this issue, see file 001-harden-io-mutex-free.patch. I hope it applies cleanly to 4.03 and 4.04. Feel free to test again.
Plausible explanation: since 4.03, buffered IO channels can remain in the list of opened channels even after they have been "half-closed" by finalization. Those channels will be considered by caml_thread_reinitialize when a Unix.fork() is performed in a multithreaded program. That's where pthread_mutex_destroy() is called on a mutex that has already been destroyed as part of the half-closing.
Also: there may be a logic error in your program whereas it forgets to close (or at least flush) a out_channel before the channel becomes unreachable. This you can check for by running your program with run-time warnings enabled.
Comment author: chetmurthy
Xavier, I tried your patch just now with ocaml version 4.03.0, and my test passes without problem. Obviously with this sort of thing, one must be careful b/c so much depends on determinism, so I'd like to leave this issue open, but I'll revisit in a week, and if it hasn't recurred, I suspect we can close this issue.
Thank you so much for so quickly coming up with a diagnosis and fix! I didn't even get a chance to bisect the ocaml repository!
BTW, if you let me know when this patch gets checked-into Git, I'd be happy to test my stuff against the git repo.