Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on building ECL on macOS 10.5 i386 with modern Clang #569

Open
catap opened this issue Aug 29, 2023 · 32 comments
Open

Crash on building ECL on macOS 10.5 i386 with modern Clang #569

catap opened this issue Aug 29, 2023 · 32 comments

Comments

@catap
Copy link
Contributor

catap commented Aug 29, 2023

I not sure that it is BoehmGC issues, but ECL's maintainer guess that it is. See: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/705

Long story short: an attempt to build ECL on macOS 10.5 i386 fails as:

;;; ECL C Backtrace
;;; 0   ecl_min                             0x00038c9b _ecl_dump_c_backtrace + 47
;;; 1   ecl_min                             0x0002cfc8 ecl_internal_error + 110
;;; 2   ecl_min                             0x0004fa0f deferred_signal_handler + 0
;;; 3   ecl_min                             0x0004f79d sigsegv_handler + 262
;;; 4   libSystem.B.dylib                   0x0025d05b _sigtramp + 43
;;; 5   ???                                 0xffffffff 0x0 + 4294967295
;;; 6   ecl_min                             0x00051db7 _ecl_sethash_pack + 66
;;; 7   ecl_min                             0x00007a97 cl_export2 + 579
;;; 8   ecl_min                             0x000042ef init_all_symbols + 431
;;; 9   ecl_min                             0x00004ba0 cl_boot + 1065
;;; 10  ecl_min                             0x00003a0e main + 33
;;; 11  ecl_min                             0x00003575 start + 53
/bin/sh: line 1: 76282 Abort trap              ECLDIR=`pwd`/ ./ecl_min compile

If I change compiler from Clang-7 (default at MacPorts) to GCC-7 => it works.

I've run some tests and discovers that Clang-5+ leads to this result. But clang-3.7 works fine.

@ivmai
Copy link
Owner

ivmai commented Aug 29, 2023

I've checked the link above: I don't have an immediate advice how to find the root case, it is strange to see a stack trace with tons of GC_inner_start_routine and GC_wait_marker, like it's an infinite recursion.

@catap
Copy link
Contributor Author

catap commented Aug 29, 2023

@ivmai I can't reproduce Sergey's behaviour but one which I shown before is easy to reproduce by switching compilers.

And it seems like a strong lead to the GC because I can find compiler specified code only at GC.

@ivmai
Copy link
Owner

ivmai commented Aug 29, 2023

I can find compiler specified code only at GC.

I don't understand what does it mean

;;; ECL C Backtrace

I don't anything related to bdwgc in the stack trace

@catap
Copy link
Contributor Author

catap commented Aug 30, 2023

Ok, I do have some solid evidence that GC might be the cause. With help of git bisect I was able to find the exact commit which introduced this crash: https://gitlab.com/embeddable-common-lisp/ecl/-/commit/6e5016dcb869bb02ab6ee089d4dc9f266b042bf1

We're interesting in chunk:

@@ -398,6 +399,8 @@ struct ecl_hashtable {          /*  hash table header  */
         _ECL_HDR2(test,weak);
         struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
         cl_object sync_lock;              /*  synchronization lock  */
+        cl_object generic_test;           /*  generic test function */
+        cl_object generic_hash;           /*  generic hashing function */
         cl_index entries;       /*  number of entries  */
         cl_index size;          /*  hash table size  */
         cl_index limit;         /*  hash table threshold (integer value)  */

which patching a structure:

struct ecl_hashtable {          /*  hash table header  */
        _ECL_HDR2(test,weak);
        struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
        cl_object sync_lock;              /*  synchronization lock  */
        cl_index entries;       /*  number of entries  */
        cl_index size;          /*  hash table size  */
        cl_index limit;         /*  hash table threshold (integer value)  */
        cl_object rehash_size;  /*  rehash size  */
        cl_object threshold;    /*  rehash threshold  */
        double factor;          /*  cached value of threshold  */
        cl_object (*get)(cl_object, cl_object, cl_object);
        cl_object (*set)(cl_object, cl_object, cl_object);
        bool (*rem)(cl_object, cl_object);
        /* Unsafe variants are used to store the real accessors when
           the synchronized variant is bound to get/set/rem. */
        cl_object (*get_unsafe)(cl_object, cl_object, cl_object);
        cl_object (*set_unsafe)(cl_object, cl_object, cl_object);
        bool (*rem_unsafe)(cl_object, cl_object);
};

Simple adds two more fields.

This structure is allocated as

    cl_object obj;
    ecl_disable_interrupts_env(the_env);
    obj = (cl_object)GC_MALLOC(type_info[t].size);
    ecl_enable_interrupts_env(the_env);
    obj->d.t = t;
    return obj;

And size is defined that that structure as sizeof(struct ecl_hashtable).

If I hacked code and put 2 * sizeof(struct ecl_hashtable) => it works.

So, let me summarizy. If I use https://gitlab.com/embeddable-common-lisp/ecl/-/commit/aa985f566fdedd45e2c74774d6e81f2442dd3802 as local root it works.

When I apply patch:

--- a/src/h/object.h
+++ b/src/h/object.h
@@ -398,6 +398,8 @@ struct ecl_hashtable {          /*  hash table header  */
         _ECL_HDR2(test,weak);
         struct ecl_hashtable_entry *data; /*  pointer to the hash table  */
         cl_object sync_lock;              /*  synchronization lock  */
+        cl_object generic_test;           /*  generic test function */
+        cl_object generic_hash;           /*  generic hashing function */
         cl_index entries;       /*  number of entries  */
         cl_index size;          /*  hash table size  */
         cl_index limit;         /*  hash table threshold (integer value)  */

=> it crashes, but when I reserve twi time more space via hack:

--- a/src/c/alloc_2.d
+++ b/src/c/alloc_2.d
@@ -860,9 +860,9 @@ init_alloc(void)
   init_tm(t_symbol, "SYMBOL", sizeof(struct ecl_symbol), 5);
   init_tm(t_package, "PACKAGE", sizeof(struct ecl_package), -1); /* 36 */
 #ifdef ECL_THREADS
-  init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 3);
+  init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 3);
 #else
-  init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 4);
+  init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 4);
 #endif
   init_tm(t_array, "ARRAY", sizeof(struct ecl_array), 3);
   init_tm(t_vector, "VECTOR", sizeof(struct ecl_vector), 2);

well.. it works again.

Have I missed something? Thus, when I change compiler, it masks the issue.

@catap
Copy link
Contributor Author

catap commented Sep 3, 2023

The issue is reproduced when I build GC from 8.2.4 release, 75e522c (release-8.2 branch) and 07a6d0e (master branch).

@catap
Copy link
Contributor Author

catap commented Sep 3, 2023

And libatomic-7.6.6 with boehmgc-7.6.8 reproduces the issue.

@catap
Copy link
Contributor Author

catap commented Sep 3, 2023

And libatomic-7.4.4 with boehmgc-7.6.0 reproduces the issue.

@catap
Copy link
Contributor Author

catap commented Sep 3, 2023

From another hand this issue can't be reproduced by GCC-7.5.0 that means that it might be clang-only things.

@catap
Copy link
Contributor Author

catap commented Sep 3, 2023

From another hand sizeof(struct ecl_hashtable) is 72 on both compilers: GCC-7.5.0 and Clang-7.1.0.

@ivmai do you agree that it seems like GC issue?

Do you have any suggestion to dig?

@catap
Copy link
Contributor Author

catap commented Sep 23, 2023

Probably another related issue on macOS: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/718

@catap
Copy link
Contributor Author

catap commented Sep 23, 2023

I was able to reproduce it with some probability on my laptop with macOS 12.

I've applied a patch on 8.2.4:

--- darwin_stop_world.c
+++ darwin_stop_world.c
@@ -639,6 +639,11 @@ GC_INNER void GC_stop_world(void)
             kern_result = thread_suspend(p -> mach_thread);
           } while (kern_result == KERN_ABORTED);
           GC_release_dirty_lock();
+          if ((((p) -> flags & FINISHED) != FINISHED)
+              && kern_result == KERN_TERMINATED)
+            continue;
+          if (kern_result == KERN_TERMINATED)
+            ABORT("thread_suspend failed: it was already termindated");
           if (kern_result != KERN_SUCCESS)
             ABORT("thread_suspend failed");
           if (GC_on_thread_event)

and run tests of cl-njson multiple times via ECL with that boehmgc. Let say 1-out-of-15 attempts fails as:

thread_suspend failed: it was already termindated

@catap
Copy link
Contributor Author

catap commented Sep 23, 2023

Well.. seems that it broke stack or some memory somehow.

A function thread_suspend is looks like:

kern_return_t
thread_suspend(thread_t thread)
{
	kern_return_t result = KERN_SUCCESS;

	if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
		return KERN_INVALID_ARGUMENT;
	}

	thread_mtx_lock(thread);

	if (thread->active) {
		if (thread->user_stop_count++ == 0) {
			thread_hold(thread);
		}
	} else {
		result = KERN_TERMINATED;
	}

	thread_mtx_unlock(thread);

	if (thread != current_thread() && result == KERN_SUCCESS) {
		thread_wait(thread, FALSE);
	}

	return result;
}

from here: https://github.com/apple-open-source/macos/blob/master/xnu/osfmk/kern/thread_act.c#L382-L408

and if I modify mine patch to:

@@ -639,8 +639,13 @@ GC_INNER void GC_stop_world(void)
             kern_result = thread_suspend(p -> mach_thread);
           } while (kern_result == KERN_ABORTED);
           GC_release_dirty_lock();
-          if (kern_result != KERN_SUCCESS)
+          if (kern_result != KERN_SUCCESS) {
+            fprintf(stderr, "KERN_SUCCESS: %d\n", KERN_SUCCESS);
+            fprintf(stderr, "KERN_TERMINATED: %d\n", KERN_TERMINATED);
+            fprintf(stderr, "KERN_INVALID_ARGUMENT: %d\n", KERN_INVALID_ARGUMENT);
+            fprintf(stderr, "kern_result: %d\n", kern_result);
             ABORT("thread_suspend failed");
+          }
           if (GC_on_thread_event)
             GC_on_thread_event(GC_EVENT_THREAD_SUSPENDED,
                                (void *)(word)(p -> mach_thread));

and run it a lot of times (more than hundred) in the loop, it might fails as:

:info:build KERN_SUCCESS: 0
:info:build KERN_TERMINATED: 37
:info:build KERN_INVALID_ARGUMENT: 4
:info:build kern_result: 268435459
:info:build thread_suspend failed

@ivmai
Copy link
Owner

ivmai commented Sep 23, 2023

It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.

Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.

@catap
Copy link
Contributor Author

catap commented Sep 23, 2023

@ivmai i have no idea how.

I run tests via ECL and it fails. Rarely.

@catap
Copy link
Contributor Author

catap commented Sep 23, 2023

@ivmai I've added one clean debug log line:

          if (kern_result != KERN_SUCCESS) {
# ifdef DEBUG_THREADS
            GC_log_printf("thread_suspend(%d) returns %d\n", p->stop_info.mach_thread, kern_result);
# endif
            ABORT("thread_suspend failed");
          }

and rebuild GC with -DDEBUG_THREADS which leads to log:

:info:build Stopping the world from thread 0x303
:info:build thread_suspend(259) returns 4
:info:build thread_suspend failed
:info:build An error occurred during initialization:
:info:build Error code 6 when executing
:info:build (EXT:RUN-PROGRAM "/usr/bin/clang" Stopping the world from thread 0x103
:info:build World stopped from 0x103
:info:build Darwin: Stack for thread 0x103 is [0x7ff7b5a81e80,0x7ff7b5a8c000)
:info:build thread_get_state returns 0
:info:build Darwin: Stack for thread 0x2803 is [0x700003188d08,0x700003188f98)

259 is 0x103 and 4 is KERN_INVALID_ARGUMENT which may occures only on condition:

	if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
		return KERN_INVALID_ARGUMENT;
	}

It defently not a kernel task, and can't be THREAD_NULL because it is used to stop the world after abort.

Something very wrong here.

@catap
Copy link
Contributor Author

catap commented Sep 24, 2023

It looks like the thread is terminated in parallel to GC_stop_world(). This is strange becase the GC lock is held during GC_stop_world and the terminating threads should call GC_thread_exit_proc which also should acquire the lock.

Please try to figure out the death scenario of the the thread for which thread_suspend returns anything other than KERN_SUCCESS.

And I've figured out it.

The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads.

GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.

As result thread_suspend returns KERN_INVALID_ARGUMENT or 268435459 which is (ipc/send) invalid destination port by mach_error_string.

As a naive approach I've tried to move GC_remove_all_threads_but_me() inside fork_child_proc() to the first line, but it doesn't help.

@catap
Copy link
Contributor Author

catap commented Sep 24, 2023

The proof. I've used one commit ahead of 8.2.4: catap@a7e17f9 which adds getpid() to the log.

If may fail on test case with output like:

[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b6401280,0x7ff7b640c000)
Starting mark helper for mark number 566
Starting mark helper for mark number 566
World starting
Resuming thread 0xb03 with state 3
World started
Total stacks size: 56464
;;; Note:
;;;   Invoking external command:
;;;   clang -I. -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/ -DECL_API -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/build/c -DDEBUG_THREADS -fPIC -fno-common -D_THREAD_SAFE -Ddarwin -I/Users/runner/work/cl-duckdb/cl-duckdb/ecl/src/c -c eclinit3Cg6rN.c -o eclinit3Cg6rN.o 
;;; Note:
;;;   Invoking external command:
[14430] Stopping the world from thread 0x303
[14430] thread_suspend(0xb03) returns (ipc/send) invalid destination port
thread_suspend failed
[14298] Stopping the world from thread 0x103
[14298] thread_suspend(0xb03) returns (os/kern) successful
World stopped from 0x103
thread_get_state returns 0
Darwin: Stack for thread 0xb03 is [0x70000aebb068,0x70000aebdf78)
Darwin: Stack for thread 0x103 is [0x7ff7b63f0a00,0x7ff7b640c000)
Starting mark helper for mark number 567
Starting mark helper for mark number 567
World starting
Resuming thread 0xb03 with state 3
World started

and 0x303 it is the first and the only appears of 0x303; thus no update thread id which means GC_remove_all_threads_but_me() hasn't been called or hasn't updated the reference (yet?)

@ivmai
Copy link
Owner

ivmai commented Sep 24, 2023

Was GC_remove_all_threads_but_me called?

@catap
Copy link
Contributor Author

catap commented Sep 24, 2023

@ivmai I've added a log line into GC_remove_all_threads_but_me and can't find it in output => seems that it hadn't been called.

Which is very strange.

@catap
Copy link
Contributor Author

catap commented Sep 24, 2023

Yes, it is happened after fork. I was able to recover a stack trace:

0   libsystem_kernel.dylib        	    0x7ff81e0a7ffe __pthread_kill + 10
1   libsystem_pthread.dylib       	    0x7ff81e0de1ff pthread_kill + 263
2   libsystem_c.dylib             	    0x7ff81e029d24 abort + 123
3   libgc.1.dylib                 	       0x10a5c4bc3 GC_stop_world.cold.1 + 43
4   libgc.1.dylib                 	       0x10a5c340f GC_stop_world + 276
5   libgc.1.dylib                 	       0x10a5b3010 GC_stopped_mark + 70
6   libgc.1.dylib                 	       0x10a5b2ecd GC_try_to_collect_inner + 311
7   libgc.1.dylib                 	       0x10a5b40fe GC_collect_or_expand + 189
8   libgc.1.dylib                 	       0x10a5b4380 GC_allocobj + 265
9   libgc.1.dylib                 	       0x10a5b8baa GC_generic_malloc_inner + 253
10  libgc.1.dylib                 	       0x10a5b8e3a GC_generic_malloc + 83
11  libecl.23.9.9.dylib           	       0x10a9eb821 ecl_alloc_atomic + 32
12  libecl.23.9.9.dylib           	       0x10a99db75 si_make_sequence_output_stream + 333
13  libecl.23.9.9.dylib           	       0x10a9aeb23 si_string_to_octets + 499
14  libecl.23.9.9.dylib           	       0x10a9e35dd si_spawn_subprocess + 577
15  libecl.23.9.9.dylib           	       0x10a96928e si_run_program + 1744

inside ECL the code which leads to an issue looks like:

    child_pid = fork();
    if (child_pid == 0) {
      /* Child */
      int j;
      cl_object p;
      char **argv_ptr = ecl_alloc((ecl_length(argv) + 1) * sizeof(char*));
      for (p = argv, j = 0; p != ECL_NIL; p = ECL_CONS_CDR(p)) {
        cl_object arg = si_string_to_octets(3, ECL_CONS_CAR(p), @':null-terminate', ECL_T);
        argv_ptr[j++] = (char*)arg->base_string.self;
      }
      argv_ptr[j] = NULL;

@catap
Copy link
Contributor Author

catap commented Sep 25, 2023

Status update:

@ivmai
Copy link
Owner

ivmai commented Sep 25, 2023

Okay, more investigation is needed for the original issue.

@catap
Copy link
Contributor Author

catap commented Sep 25, 2023

@ivmai yes, but I'm out of idea how to proceed

@catap
Copy link
Contributor Author

catap commented Sep 25, 2023

@ivmai what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.

For example scala-native uses that code and instead of replicate ECL's changes in scala-native and many other software I feel that it simple should be enabled by default.

@ivmai
Copy link
Owner

ivmai commented Sep 26, 2023

what do you think to enable "handle fork" by default for macOS? Base on behavior of mach_thread_self() I feel that it is common and very difficult to track issue.

Yes, agree, but after implementing #103

As of now: # The incremental mode conflicts with fork handling on Darwin.

@ivmai
Copy link
Owner

ivmai commented Nov 30, 2023

what do you think to enable "handle fork" by default for macOS? Base on behaviour of mach_thread_self() I feel that it is common and very difficult to track issue.

Implemented by commit ba2861e
@catap, please test it

@ivmai
Copy link
Owner

ivmai commented Nov 30, 2023

As I understand the original issue is not resolve, so I'm leaving it open.

@cooljeanius
Copy link

As I understand the original issue is not resolve, so I'm leaving it open.

What remains to be done to fix it?

@catap
Copy link
Contributor Author

catap commented Mar 18, 2024

As I understand the original issue is not resolve, so I'm leaving it open.

What remains to be done to fix it?

It still fails as it was described here: #569 (comment)

@ivmai
Copy link
Owner

ivmai commented Mar 18, 2024

Is there any fix proposed?

@ivmai
Copy link
Owner

ivmai commented Mar 18, 2024

The root cause of issue that mach_thread_self() may return different value after fork(). Different means different that is stored inside GC_threads.
GC_remove_all_threads_but_me() updated that value but seems that it's doing so too late.

This is not fixed, right?

@catap
Copy link
Contributor Author

catap commented Mar 18, 2024

@ivmai let me summarize everything.

Here I've mixed up two issues.

  1. thread related which was fixed as far as I recall.
  2. Something brokes stack -- not fixed.

(2) can be reproduced only by clang on i386 on macOS. I can't reproduce it by gcc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants