Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash in free() logic due to what appears like pointer error in 4.03.0 & 4.04.0 #7457

vicuna opened this issue Jan 11, 2017 · 7 comments


Copy link

vicuna commented Jan 11, 2017

Original bug ID: 7457
Reporter: chetmurthy
Assigned to: @xavierleroy
Status: resolved (set by @xavierleroy on 2017-01-27T13:14:54Z)
Resolution: fixed
Priority: normal
Severity: crash
Platform: Intel x86-64 (amd64)
OS: Ubuntu 16.04.1 LTS
OS Version: 16.04.1
Version: 4.04.0
Target version: 4.04.1+dev
Fixed in version: 4.04.1+dev
Category: runtime system and C interface
Related to: #7486
Monitored by: @yallop

Bug description

(1) I have a rather multithreaded, networked system with many C libs glued in. It's been running correctly on ocaml 4.02.3 for many months.

(2) The failure is of a unit-test runner, so some interprocess comms, lots of crypto, but no actual off-machine networking.

(3) This failure is very, very reproducible. It happens with 4.03.0, 4.04.0 (both byte & native). IT DOES NOT happen with 4.02.3.

If you have any suggestions for how I might help narrow down (e.g., if you have a pointer to where I can get access to the source-code repository, so I could do bisection-search in the code-repo, I'm happy to do that). I have the reproduction process almost-automated, so while bisection will be time-consuming, it will be labor-free.

Steps to reproduce

Build and run my unit-test. I realize this is a bit vague, and I'm going to work to narrow this down, but I figured I should report it, since it might be known, and also, since it only shows up with ocaml > 4.02.3, that might point to where the problem arises.

Also, I ran both with and without valgrind, on both 4.02.3 and 4.04.0. On 4.02.3, I get no significant valgrind errors (so it's not as if there's a latent error, but not crashing). With 4.04.0, of course I get a failure, and then I run with valgrind to get better crash info.

Additional information

Here's some information I got from valgrind at the point where the failure occurs. There are many more messages, but there's no point in my sending these in, I suspect, until I can simplify the test. This output came from a run with a native executable.

.==11710== Invalid read of size 1
==11710== at 0x527A6B1: pthread_mutex_destroy (pthread_mutex_destroy.c:30)
==11710== by 0xFD42E0: caml_thread_reinitialize (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0x709C945: fork (fork.c:211)
==11710== by 0xFD5C18: unix_fork (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0xF2315A: camlUnix__system_1846 (
==11710== by 0x7E2890: camlTest_util__reset_unit_2674 (
==11710== by 0x7E2921: camlTest_util__setup_unit_dir_2888 (
==11710== by 0x80118B: camlPeer0_test_utils__master_test_inner_69265 (
==11710== by 0x8260CB: camlOUnitRunner__fun_2459 (
==11710== by 0x81B97F: camlOUnitTest__section_ctxt_1435 (
==11710== by 0x825F70: camlOUnitRunner__run_one_test_1382 (
==11710== by 0x826281: camlOUnitRunner__iter_1582 (
==11710== Address 0x7de6110 is 16 bytes inside a block of size 40 free'd
==11710== at 0x4C2EDEB: free (in /usr/lib/valgrind/
==11710== by 0xFE6358: caml_finalize_channel (io.c:398)
==11710== by 0xFE0645: caml_empty_minor_heap (minor_gc.c:381)
==11710== by 0xFE0AB5: caml_gc_dispatch (minor_gc.c:438)
==11710== by 0xFE1C40: caml_alloc_string (alloc.c:100)
==11710== by 0xF3B025: camlBuffer__resize_1232 (
==11710== by 0xF3B15B: camlBuffer__add_substring_1242 (
==11710== by 0xF5A314: camlFormat__format_pp_token_1355 (
==11710== by 0xF5A905: camlFormat__advance_loop_1394 (
==11710== by 0xF5A969: camlFormat__advance_left_1400 (
==11710== by 0x82524D: camlOUnitDiff__fun_1812 (
==11710== by 0xF35C2C: camlSet__iter_1430 (
==11710== Block was alloc'd at
==11710== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/
==11710== by 0xFD3AC2: st_mutex_create (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0xFD3DE7: caml_io_mutex_lock (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0xFE748D: caml_ml_output_bytes (io.c:660)
==11710== by 0xF276FE: camlPervasives__output_1206 (
==11710== by 0x962B51: camlTFramedTransport__fun_2149 (
==11710== by 0x911DCD: camlInvoker__fun_3330 (
==11710== by 0x7BFAB5: camlReplicaDB__thunk_10433 (
==11710== by 0xD67DEC: camlCore_kernel__Exn__protectx_3197 (
==11710== by 0x7C02EB: camlReplicaDB__fun_62944 (
==11710== by 0xD67DEC: camlCore_kernel__Exn__protectx_3197 (
==11710== by 0xD67DEC: camlCore_kernel__Exn__protectx_3197 (

File attachments

Copy link

vicuna commented Jan 11, 2017

Comment author: chetmurthy

I just uploaded a file containing the backtraces and such from a run without valgrind.

Copy link

vicuna commented Jan 11, 2017

Comment author: @yallop

The OCaml sources are available on GitHub:

Copy link

vicuna commented Jan 11, 2017

Comment author: @dra27

Te merge-base of 4.03 and trunk 2449d2f and b12f7430423d0fb115a0ee551c2009e6094f (which should be the commit on trunk where 4.02.x is fully merged into trunk) might be good one-off tests to hone on where bisection is needed.

Copy link

vicuna commented Jan 11, 2017

Comment author: @xavierleroy

Hello Chet,

Thanks for the very precise valgrind trace, it was really informative.

Based on this trace, I attached a plausible fix for this issue, see file 001-harden-io-mutex-free.patch. I hope it applies cleanly to 4.03 and 4.04. Feel free to test again.

Plausible explanation: since 4.03, buffered IO channels can remain in the list of opened channels even after they have been "half-closed" by finalization. Those channels will be considered by caml_thread_reinitialize when a Unix.fork() is performed in a multithreaded program. That's where pthread_mutex_destroy() is called on a mutex that has already been destroyed as part of the half-closing.

Also: there may be a logic error in your program whereas it forgets to close (or at least flush) a out_channel before the channel becomes unreachable. This you can check for by running your program with run-time warnings enabled.

Copy link

vicuna commented Jan 11, 2017

Comment author: chetmurthy

Xavier, I tried your patch just now with ocaml version 4.03.0, and my test passes without problem. Obviously with this sort of thing, one must be careful b/c so much depends on determinism, so I'd like to leave this issue open, but I'll revisit in a week, and if it hasn't recurred, I suspect we can close this issue.

Thank you so much for so quickly coming up with a diagnosis and fix! I didn't even get a chance to bisect the ocaml repository!

BTW, if you let me know when this patch gets checked-into Git, I'd be happy to test my stuff against the git repo.

Copy link

vicuna commented Jan 12, 2017

Comment author: @xavierleroy

I pushed the tentative fix to trunk, commit [trunk 2011591]. Will probably push it to 4.04 bugfix branch later.

Copy link

vicuna commented Jan 27, 2017

Comment author: @xavierleroy

Fix also pushed to 4.04 branch, commit [4.04 b3c107e].

In the absence of new information I'm assuming the fix is good. Reopen if not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet

No branches or pull requests

2 participants