forked from illumos/illumos-gate
qemu coroutines get confused about their thread identity #1260
Opened upstream as https://www.illumos.org/issues/15206
citrus-it added a commit to omniosorg/qemu that referenced this issue on Dec 1, 2022:
The coroutine ucontext backend uses a combination of sigsetjmp/siglongjmp and set/getcontext(). It stores the current coroutine data in thread local storage. When a context is restored into a different thread than the one in which it last ran, the %fsbase is also restored, leading to a thread that thinks it is a different one and, critically for this code, a thread that accesses the TLS of the other. This patch copies %fsbase from the existing thread context to the new one before switching so that it resumes with the right identity. It relies on this being an amd64 binary, and on insider knowledge of the libc implementation. It is not yet understood why this code works on other platforms without this change, or whether there is a better fix. It would be slightly cleaner to use set/getcontext() throughout, but that adds additional system call overhead. For more analysis, see omniosorg/illumos-omnios#1260
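To make the moving parts in that commit message concrete, here is a minimal sketch of a coroutine whose "current" pointer lives in thread local storage. It is not QEMU's code: `Coro`, `coro_new()`, `coro_enter()`, `coro_yield()` and `coro_body()` are hypothetical names, and for simplicity it uses `swapcontext()` for every switch rather than the sigsetjmp/siglongjmp hybrid the backend uses. Everything here runs on a single thread, so it shows the structure only and does not reproduce the bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <ucontext.h>

#define CORO_STACK_SIZE (64 * 1024)

/* Hypothetical coroutine record; not QEMU's Coroutine struct. */
typedef struct Coro {
    ucontext_t ctx;     /* saved state of the coroutine itself */
    ucontext_t caller;  /* saved state of whoever entered it */
    char *stack;        /* private stack for the coroutine */
    int steps;          /* how far the body has run */
    int done;           /* set once the body has finished */
} Coro;

/* Per-thread "which coroutine is running here" pointer (lives in TLS). */
static __thread Coro *current;

/* Switch from the running coroutine back to its caller. */
static void coro_yield(void)
{
    Coro *self = current;
    current = NULL;
    swapcontext(&self->ctx, &self->caller);
}

/* Example body: do one step, yield, do a second step, finish. */
static void coro_body(void)
{
    Coro *self = current;
    self->steps = 1;
    coro_yield();
    self->steps = 2;
    self->done = 1;
    current = NULL;
    /* Returning resumes uc_link, i.e. the caller's saved context. */
}

static Coro *coro_new(void)
{
    Coro *co = calloc(1, sizeof(*co));
    if (co == NULL)
        return NULL;
    co->stack = malloc(CORO_STACK_SIZE);
    if (co->stack == NULL || getcontext(&co->ctx) != 0) {
        free(co->stack);
        free(co);
        return NULL;
    }
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = CORO_STACK_SIZE;
    co->ctx.uc_link = &co->caller;
    makecontext(&co->ctx, coro_body, 0);
    return co;
}

/* Run the coroutine until it yields or finishes. */
static void coro_enter(Coro *co)
{
    assert(current == NULL);   /* no nesting in this sketch */
    current = co;
    swapcontext(&co->caller, &co->ctx);
}

static void coro_free(Coro *co)
{
    free(co->stack);
    free(co);
}
```

Calling `coro_enter()` twice on a fresh coroutine runs the body up to its yield and then to completion. The failure described in this issue only arises when the second entry happens on a different thread than the first, which this sketch deliberately avoids.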
citrus-it added a commit to omniosorg/qemu that referenced this issue on Dec 28, 2022 (same commit message as above).
citrus-it added a commit to omniosorg/qemu that referenced this issue on Dec 29, 2022 (same commit message as above).
citrus-it added a commit to omniosorg/qemu that referenced this issue on Jan 5, 2023 (same commit message as above).
citrus-it added a commit to omniosorg/qemu that referenced this issue on Apr 22, 2023 (same commit message as above).
hadfl pushed a commit to omniosorg/qemu that referenced this issue on Aug 25, 2023 (same commit message as above).
I'm filing this here while I gather information. It may be an upstream illumos bug in which case I'll file it there.
TL;DR: It appears that a thread restarted via `siglongjmp()` is confused about its identity.

This started while trying to get qemu 7 working on OmniOS. When starting an aarch64 guest, the process aborts:
Looking into the `qemu_in_coroutine()` function, it retrieves a pointer to a `Coroutine` struct from thread local storage (TLS) and checks that it indicates that a coroutine is active.

I used dtrace to watch writes and reads to that pointer in TLS, and print out the thread ID alongside the value being set or retrieved. The functions involved are `set_current()` and `get_current()`. The thread ID (dtrace variable `tid`) is shown in the square brackets. This showed:

That's weird. Thread 3 reads the value and then sets a new one, and reads it back. So far so good. But then thread 1 writes the value to its own TLS and reads it back, but the read returns a different value - definitely not the one which was just set! It's actually the value that was previously set on thread 3.
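For contrast, this is the behaviour one would normally expect from a pointer kept in TLS: each thread reads back exactly what it wrote, regardless of what other threads store. The sketch below is an illustration only. Its `set_current()` and `get_current()` borrow the names mentioned above, but the bodies are stand-ins and not QEMU's implementation; `Probe`, `worker()` and `tls_is_per_thread()` are hypothetical helpers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One slot per thread; a stand-in for the coroutine pointer kept in TLS. */
static __thread void *current;

static void set_current(void *p) { current = p; }
static void *get_current(void) { return current; }

/* What a thread wrote to its slot and what it then read back. */
typedef struct Probe {
    void *wrote;
    void *read_back;
} Probe;

static void *worker(void *arg)
{
    Probe *p = arg;
    set_current(p->wrote);           /* write to this thread's own slot */
    p->read_back = get_current();    /* should see the same value again */
    return NULL;
}

/* Returns 1 if both threads read back exactly what they wrote. */
static int tls_is_per_thread(void)
{
    int a, b;                        /* only their addresses are used */
    Probe pa = { &a, NULL };
    Probe pb = { &b, NULL };
    pthread_t t;
    int rc;

    set_current(pa.wrote);           /* main thread writes first */
    rc = pthread_create(&t, NULL, worker, &pb);
    assert(rc == 0);
    pthread_join(t, NULL);           /* worker wrote a different value */
    pa.read_back = get_current();    /* main must still see its own */

    return pa.read_back == pa.wrote && pb.read_back == pb.wrote;
}
```

The dtrace output described above corresponds to this check failing: a thread read back a value that another thread had stored.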
At this point I broke out `fprintf(stderr` and added a few lines to the code. This tells a slightly different story to dtrace. The following is from the same run as the dtrace output above, so the addresses match. Again, the thread ID (this time retrieved via `pthread_self()`) is shown in the square brackets.

This says that it was thread 3 that did the last read, and not thread 1 as shown by dtrace. This is the root cause of the assertion failure - it's apparently getting data from the wrong thread. After blinking a couple of times, I went and explicitly printed the value of `%fs:0` in the routine that is calling `get_current`, specifically:
and saw (different run, different address, sorry):
With the process stopped by `dtrace`, I can take a look via `mdb`:

The thread address printed by the assembly matches thread 3.
The variable in TLS looks right for thread 1 (indicating that a co-routine is in progress) and wrong for thread 3:
I'm looking at thread 1 in the debugger and the `qemu_coroutine_self()` function is the top of the stack:

Thread 3 doesn't seem to be doing much:
Well, look at this.. `%fsbase` is the same for threads 1 and 3 and, based on the output of `::walk ulwp` above, thread 1 is wrong (should be 0xfffffc7fed680940).
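A probe of the kind used in this investigation can be sketched as follows. It assumes, as the debugging above suggests, that on amd64 the word at `%fs:0` holds the address of the calling thread's libc thread structure. `thread_pointer()`, `record_tp()` and `thread_pointers_differ()` are hypothetical names, and on other architectures or platforms the sketch falls back to `pthread_self()` rather than guessing at a register.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Return a value identifying the calling thread. On amd64 Linux and
 * illumos this reads the word at %fs:0 (assumed to be libc's per-thread
 * self pointer); elsewhere it falls back to pthread_self().
 */
static uintptr_t thread_pointer(void)
{
#if defined(__x86_64__) && (defined(__linux__) || defined(__sun))
    uintptr_t self;
    __asm__ volatile ("movq %%fs:0, %0" : "=r" (self));
    return self;
#else
    return (uintptr_t)pthread_self();
#endif
}

/* Thread entry point: store this thread's pointer for the creator to read. */
static void *record_tp(void *out)
{
    *(uintptr_t *)out = thread_pointer();
    return NULL;
}

/*
 * Returns 1 if a second thread sees a different thread pointer than the
 * caller, and the caller's own value is unchanged afterwards.
 */
static int thread_pointers_differ(void)
{
    uintptr_t mine = thread_pointer();
    uintptr_t theirs = 0;
    pthread_t t;
    int rc;

    rc = pthread_create(&t, NULL, record_tp, &theirs);
    assert(rc == 0);
    pthread_join(t, NULL);

    return mine != theirs && mine == thread_pointer();
}
```

In a healthy process two live threads never share this value. The observation above, with threads 1 and 3 reporting the same `%fsbase`, is the case where this check would fail.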