Crash on building ECL on macOS 10.5 i386 with modern Clang #569
Comments
I've checked the link above. I don't have any immediate advice on how to find the root cause. It is strange to see a stack trace with tons of GC_inner_start_routine and GC_wait_marker frames, as if it were an infinite recursion. |
@ivmai I can't reproduce Sergey's behaviour, but the one I showed before is easy to reproduce by switching compilers. It seems like a strong lead pointing to the GC, because the GC is the only place where I can find compiler-specific code. |
I don't understand what this means.
I don't see anything related to bdwgc in the stack trace. |
OK, I do have some solid evidence that the GC might be the cause. With help of
We're interested in this chunk:
@@ -398,6 +399,8 @@ struct ecl_hashtable { /* hash table header */
_ECL_HDR2(test,weak);
struct ecl_hashtable_entry *data; /* pointer to the hash table */
cl_object sync_lock; /* synchronization lock */
+ cl_object generic_test; /* generic test function */
+ cl_object generic_hash; /* generic hashing function */
cl_index entries; /* number of entries */
cl_index size; /* hash table size */
cl_index limit; /* hash table threshold (integer value) */
which patches this structure:
struct ecl_hashtable { /* hash table header */
_ECL_HDR2(test,weak);
struct ecl_hashtable_entry *data; /* pointer to the hash table */
cl_object sync_lock; /* synchronization lock */
cl_index entries; /* number of entries */
cl_index size; /* hash table size */
cl_index limit; /* hash table threshold (integer value) */
cl_object rehash_size; /* rehash size */
cl_object threshold; /* rehash threshold */
double factor; /* cached value of threshold */
cl_object (*get)(cl_object, cl_object, cl_object);
cl_object (*set)(cl_object, cl_object, cl_object);
bool (*rem)(cl_object, cl_object);
/* Unsafe variants are used to store the real accessors when
the synchronized variant is bound to get/set/rem. */
cl_object (*get_unsafe)(cl_object, cl_object, cl_object);
cl_object (*set_unsafe)(cl_object, cl_object, cl_object);
bool (*rem_unsafe)(cl_object, cl_object);
};
It simply adds two more fields. This structure is allocated as:
cl_object obj;
ecl_disable_interrupts_env(the_env);
obj = (cl_object)GC_MALLOC(type_info[t].size);
ecl_enable_interrupts_env(the_env);
obj->d.t = t;
return obj;
And if I hack the code and put
So, let me summarize. If I use https://gitlab.com/embeddable-common-lisp/ecl/-/commit/aa985f566fdedd45e2c74774d6e81f2442dd3802 as the local root, it works. When I apply this patch:
--- a/src/h/object.h
+++ b/src/h/object.h
@@ -398,6 +398,8 @@ struct ecl_hashtable { /* hash table header */
_ECL_HDR2(test,weak);
struct ecl_hashtable_entry *data; /* pointer to the hash table */
cl_object sync_lock; /* synchronization lock */
+ cl_object generic_test; /* generic test function */
+ cl_object generic_hash; /* generic hashing function */
cl_index entries; /* number of entries */
cl_index size; /* hash table size */
cl_index limit; /* hash table threshold (integer value) */
=> it crashes, but when I reserve twice as much space via this hack:
--- a/src/c/alloc_2.d
+++ b/src/c/alloc_2.d
@@ -860,9 +860,9 @@ init_alloc(void)
init_tm(t_symbol, "SYMBOL", sizeof(struct ecl_symbol), 5);
init_tm(t_package, "PACKAGE", sizeof(struct ecl_package), -1); /* 36 */
#ifdef ECL_THREADS
- init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 3);
+ init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 3);
#else
- init_tm(t_hashtable, "HASH-TABLE", sizeof(struct ecl_hashtable), 4);
+ init_tm(t_hashtable, "HASH-TABLE", 2 * sizeof(struct ecl_hashtable), 4);
#endif
init_tm(t_array, "ARRAY", sizeof(struct ecl_array), 3);
init_tm(t_vector, "VECTOR", sizeof(struct ecl_vector), 2);
well, it works again. Have I missed something? So changing the compiler only masks the issue. |
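To make the size arithmetic in the comment above concrete, here is a minimal sketch. It is not ECL's real code: `mock_table_old` and `mock_table_new` are hypothetical stand-ins for `struct ecl_hashtable` before and after the patch (most fields omitted), and `size_growth`, `entries_shift`, `fits_new_layout` and `check_layout` are illustrative helpers. It only shows that inserting two pointer fields ahead of `entries` enlarges the object and shifts every later field, and that a doubled allocation is large enough to absorb the difference. It says nothing about which component is actually at fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real ECL types. */
typedef void *mock_object;
typedef size_t mock_index;

/* Layout before the patch (most fields omitted). */
struct mock_table_old {
    void *data;
    mock_object sync_lock;
    mock_index entries;
    mock_index size;
    double factor;
};

/* Layout after the patch: two extra pointer fields before `entries`. */
struct mock_table_new {
    void *data;
    mock_object sync_lock;
    mock_object generic_test;
    mock_object generic_hash;
    mock_index entries;
    mock_index size;
    double factor;
};

/* Extra bytes an allocator must hand out for the patched layout. */
static size_t size_growth(void) {
    return sizeof(struct mock_table_new) - sizeof(struct mock_table_old);
}

/* How far the `entries` field moved because of the two new fields. */
static size_t entries_shift(void) {
    return offsetof(struct mock_table_new, entries)
         - offsetof(struct mock_table_old, entries);
}

/* Returns 1 if a buffer of `allocated` bytes can hold the patched struct. */
static int fits_new_layout(size_t allocated) {
    return allocated >= sizeof(struct mock_table_new);
}

/* Sanity checks on the mock layouts. */
static void check_layout(void) {
    assert(entries_shift() == 2 * sizeof(mock_object));
    assert(!fits_new_layout(sizeof(struct mock_table_old)));
}
```

Calling `check_layout()` confirms that an allocation sized for the old layout is too small for the new one. That is the kind of mismatch the `2 * sizeof(struct ecl_hashtable)` hack would hide, assuming a stale size is involved at all.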
And libatomic-7.6.6 with boehmgc-7.6.8 reproduces the issue. |
And libatomic-7.4.4 with boehmgc-7.6.0 reproduces the issue. |
On the other hand, this issue can't be reproduced with GCC-7.5.0, which means that it might be a Clang-only thing. |
On the other hand, @ivmai, do you agree that it looks like a GC issue? Do you have any suggestions on where to dig? |
Probably another related issue on macOS: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/718 |
I was able to reproduce it, with some probability, on my laptop with macOS 12. I've applied a patch on
--- darwin_stop_world.c
+++ darwin_stop_world.c
@@ -639,6 +639,11 @@ GC_INNER void GC_stop_world(void)
kern_result = thread_suspend(p -> mach_thread);
} while (kern_result == KERN_ABORTED);
GC_release_dirty_lock();
+ if ((((p) -> flags & FINISHED) != FINISHED)
+ && kern_result == KERN_TERMINATED)
+ continue;
+ if (kern_result == KERN_TERMINATED)
+ ABORT("thread_suspend failed: it was already terminated");
if (kern_result != KERN_SUCCESS)
ABORT("thread_suspend failed");
if (GC_on_thread_event)
and run tests of
|
Well, it seems that it broke the stack or some memory somehow. The function is defined as:
kern_return_t
thread_suspend(thread_t thread)
{
kern_return_t result = KERN_SUCCESS;
if (thread == THREAD_NULL || get_threadtask(thread) == kernel_task) {
return KERN_INVALID_ARGUMENT;
}
thread_mtx_lock(thread);
if (thread->active) {
if (thread->user_stop_count++ == 0) {
thread_hold(thread);
}
} else {
result = KERN_TERMINATED;
}
thread_mtx_unlock(thread);
if (thread != current_thread() && result == KERN_SUCCESS) {
thread_wait(thread, FALSE);
}
return result;
}
from here: https://github.com/apple-open-source/macos/blob/master/xnu/osfmk/kern/thread_act.c#L382-L408 and if I modify my patch to:
@@ -639,8 +639,13 @@ GC_INNER void GC_stop_world(void)
kern_result = thread_suspend(p -> mach_thread);
} while (kern_result == KERN_ABORTED);
GC_release_dirty_lock();
- if (kern_result != KERN_SUCCESS)
+ if (kern_result != KERN_SUCCESS) {
+ fprintf(stderr, "KERN_SUCCESS: %d\n", KERN_SUCCESS);
+ fprintf(stderr, "KERN_TERMINATED: %d\n", KERN_TERMINATED);
+ fprintf(stderr, "KERN_INVALID_ARGUMENT: %d\n", KERN_INVALID_ARGUMENT);
+ fprintf(stderr, "kern_result: %d\n", kern_result);
ABORT("thread_suspend failed");
+ }
if (GC_on_thread_event)
GC_on_thread_event(GC_EVENT_THREAD_SUSPENDED,
(void *)(word)(p -> mach_thread));
and run it many times (more than a hundred) in a loop, it may fail as:
|
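The patches above separate three cases for the result of `thread_suspend`: retry while the call is aborted, treat an already terminated thread differently, and abort on anything else. A minimal sketch of that control flow follows. Everything in it is hypothetical: the `MOCK_*` codes are placeholders (the real constants live in `<mach/kern_return.h>`), `suspend_fn` stands in for `thread_suspend`, and `stub_suspend` is a test double. Whether skipping a terminated thread is actually safe inside the collector is exactly the open question in this thread, so this illustrates only the classification.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder result codes; the values are NOT the real Mach constants. */
enum mock_result {
    MOCK_SUCCESS = 0,
    MOCK_ABORTED,
    MOCK_TERMINATED,
    MOCK_INVALID_ARGUMENT
};

enum suspend_outcome { SUSPENDED, THREAD_GONE, HARD_FAILURE };

/* Hypothetical callback standing in for thread_suspend(). */
typedef enum mock_result (*suspend_fn)(int thread_id, void *ctx);

/* Retry while the call reports "aborted", then classify the final result. */
static enum suspend_outcome suspend_with_retry(suspend_fn fn, int thread_id,
                                               void *ctx) {
    enum mock_result r;
    assert(fn != NULL);
    do {
        r = fn(thread_id, ctx);
    } while (r == MOCK_ABORTED);
    if (r == MOCK_SUCCESS)
        return SUSPENDED;
    if (r == MOCK_TERMINATED)
        return THREAD_GONE;   /* target thread no longer exists */
    return HARD_FAILURE;      /* anything else is unexpected */
}

/* Test double: reports "aborted" `aborts_left` times, then `final`. */
struct stub_state {
    int aborts_left;
    enum mock_result final;
    int calls;
};

static enum mock_result stub_suspend(int thread_id, void *ctx) {
    struct stub_state *s = ctx;
    (void)thread_id;
    s->calls++;
    if (s->aborts_left > 0) {
        s->aborts_left--;
        return MOCK_ABORTED;
    }
    return s->final;
}
```

Returning an outcome instead of aborting immediately lets the caller log which thread produced `THREAD_GONE` before deciding what to do, which is the information asked for later in the discussion.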
It looks like the thread is terminated in parallel with GC_stop_world(). This is strange because the GC lock is held during GC_stop_world, and terminating threads should call GC_thread_exit_proc, which should also acquire the lock. Please try to figure out the death scenario of the thread for which thread_suspend returns anything other than KERN_SUCCESS. |
@ivmai I have no idea how. I run the tests via ECL and it fails, but rarely. |
@ivmai I've added one clean debug log line:
if (kern_result != KERN_SUCCESS) {
# ifdef DEBUG_THREADS
GC_log_printf("thread_suspend(%d) returns %d\n", p->stop_info.mach_thread, kern_result);
# endif
ABORT("thread_suspend failed");
}
and rebuilt GC with
It is definitely not a kernel task, and can't be Something is very wrong here. |
And I've figured it out. The root cause of the issue is that
As a result As a naive approach, I've tried to move |
The proof. I've used one commit ahead of It may fail on a test case with output like:
and |
Was GC_remove_all_threads_but_me called? |
@ivmai I've added a log line into Which is very strange. |
Yes, it happened after fork. I was able to recover a stack trace:
Inside ECL, the code which leads to the issue looks like:
|
Status update:
|
Okay, more investigation is needed for the original issue. |
@ivmai yes, but I'm out of ideas on how to proceed. |
@ivmai what do you think about enabling "handle fork" by default for macOS? Based on the behaviour of For example |
Yes, agree, but after implementing #103 As of now: # The incremental mode conflicts with fork handling on Darwin. |
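One possible reading of the fork discussion above: after `fork()`, only the calling thread exists in the child, but the child's copy of a thread table still lists every parent thread, so a later stop-the-world pass would try to suspend threads that are not there. The sketch below illustrates that with a mock registry. `mock_registry`, `keep_only_self` and `entries_seen_by_child` are hypothetical names, integer ids replace real thread handles, and the pruning is called by hand rather than from a fork handler, so this is not how bdwgc implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_THREADS 8

/* Hypothetical registry; a real collector keeps richer per-thread records. */
struct mock_registry {
    int ids[MAX_THREADS];
    size_t count;
};

/* Drop every entry except `self`; returns the new entry count. */
static size_t keep_only_self(struct mock_registry *reg, int self) {
    size_t kept = 0;
    assert(reg != NULL && reg->count <= MAX_THREADS);
    for (size_t i = 0; i < reg->count; i++) {
        if (reg->ids[i] == self)
            reg->ids[kept++] = reg->ids[i];
    }
    reg->count = kept;
    return kept;
}

/* Forks; the child optionally prunes its copy of the registry and reports
   the number of entries through its exit status. Returns -1 on error. */
static int entries_seen_by_child(struct mock_registry *reg, int self,
                                 int prune) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prune)
            keep_only_self(reg, self);
        _exit((int)reg->count);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Without pruning, the child still sees every parent entry. With pruning, it sees only its own, and the parent's registry is untouched because the child works on its own copy of the memory.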
As I understand it, the original issue is not resolved, so I'm leaving it open. |
What remains to be done to fix it? |
It still fails as described here: #569 (comment) |
Is there any fix proposed? |
This is not fixed, right? |
@ivmai let me summarize everything. Here I've mixed up two issues.
(2) can be reproduced only with Clang on i386 macOS. I can't reproduce it with GCC. |
I'm not sure that this is a BoehmGC issue, but ECL's maintainer guesses that it is. See: https://gitlab.com/embeddable-common-lisp/ecl/-/issues/705
Long story short: an attempt to build ECL on macOS 10.5 i386 fails as:
If I change the compiler from Clang-7 (the default in MacPorts) to GCC-7, it works.
I've run some tests and discovered that Clang-5+ leads to this result, but clang-3.7 works fine.