Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug / Concurrency] CPU sampler during multi threaded operation yields error: Context cannot be entered on system threads #3013

Closed
itarato opened this issue Apr 20, 2023 · 18 comments · Fixed by #3385

Comments

@itarato
Copy link
Collaborator

itarato commented Apr 20, 2023

cpusampler is yielding threading errors a various runs. Example when running the thread spec tests:

jt test spec/ruby/optional/capi/thread_spec.rb -- --cpusampler

The output contains a few tens of:

[ruby] SEVERE: could not unblock thread inside blocking call in C extension because the context does not allow multithreading (Context cannot be entered on system threads.)

Another example is when running rails tests (not necessarily related to multithreading).

@itarato itarato changed the title [Concurrency] CPU sampler during multi threaded operation yields error: Context cannot be entered on system threads [Bug / Concurrency] CPU sampler during multi threaded operation yields error: Context cannot be entered on system threads Apr 20, 2023
@eregon
Copy link
Member

eregon commented Apr 21, 2023

Thanks for the report.

So this message comes from

public void interrupt(Thread thread) {
final TruffleContext truffleContext = context.getEnv().getContext();
final boolean alreadyEntered = truffleContext.isEntered();
Object prev = null;
if (!alreadyEntered) {
// We need to enter the context to execute this unblocking action, as it runs on Sulong.
try {
if (context.getOptions().SINGLE_THREADED) {
throw new IllegalStateException("--single-threaded was passed");
}
prev = truffleContext.enter(null);
} catch (IllegalStateException e) { // Multi threaded access denied from Truffle
// Not in a context, so we cannot use TruffleLogger
context.getLogger().severe(
"could not unblock thread inside blocking call in C extension because " +
"the context does not allow multithreading (" + e.getMessage() + ")");
return;
}
}
try {
InteropLibrary.getUncached().execute(function, argument);
} catch (InteropException e) {
throw CompilerDirectives.shouldNotReachHere(e);
} finally {
if (!alreadyEntered) {
truffleContext.leave(null, prev);
}
}
}

We need to execute the unblock function given to rb_thread_call_without_gvl(), so the thread inside the blocking function of rb_thread_call_without_gvl() can get out of it (e.g. out of a blocking syscall) and reach a Truffle safepoint, and then start its operation again.

What changed here is the CPUSampler now uses a system thread vs a embedder thread before, and system threads don't allow entering the context.

I don't know how to fix this, we need to execute the blocking function on some thread and that requires entering the context:

  • We can't execute it on the current system thread due to system threads don't allow entering the context (this issue)
  • We could use a Truffle safepoint/ThreadLocalAction, but to which thread? We can't use the Ruby main thread, maybe that's the one in rb_thread_call_without_gvl(). And we could have recursion problems if we use a ThreadLocalAction inside an Interrupter#interrupt.
  • Obviously we can't use the thread in rb_thread_call_without_gvl(), that's the one stuck until we run the unblock function.
  • I suppose we could create a new thread with Env#createThread, but that seems prohibitively expensive for every Interrupter#interrupt, i.e., every ThreadLocalAction affecting that thread while it's in rb_thread_call_without_gvl().
  • I suppose we could have an extra thread that we keep around for running that, but that is complicated and not working if there is any single-threaded language like JS/Python/etc. OTOH unblocking rb_thread_call_without_gvl() does require at least 2 parallel threads anyway (the thread blocked in it, the thread to run the unblock function), so maybe that thread could be created lazily the first time it is needed and not actually break polyglot. Also that thread will appear e.g. in Ruby's Thread.list which might be unexpected for user code.

We already have the Timer nativeInterruptTimer to send SIGVTALRM to interrupt NFI blocking syscalls. That's an embedder thread because there is no ThreadFactory for java.util.Timer. We don't control what it does so we can't reuse it for this. We also can't reuse the ReferenceProcessingThread because that's optional. My concern is we start to need a lot of threads to workaround Truffle limitations.

@chumer @jchalou What's is the envisioned solution here? Maybe tools should use embedder threads and not system threads to avoid this problem? It is a breaking change in CPUSampler from Ruby's POV.

@tzezula
Copy link
Member

tzezula commented Apr 21, 2023

@eregon Tools must use system threads to correctly work with Truffle isolates. The ordinary Java thread is not able to do a downcall to the host VM.

@eregon
Copy link
Member

eregon commented Apr 21, 2023

@tzezula Why not? By embedder thread I mean new java.lang.Thread().

@tzezula
Copy link
Member

tzezula commented Apr 21, 2023

The problem with java.lang.Thread in Truffle isolate is that it's not attached to host (HotSpot) JavaVM and has no host JNIEnv. When such a thread performs an operation that needs call to host, for example logging, it will fail.
The system threads created in the Truffle isolate are always attached to host in the enterpise implementation of AbstractPolyglotImpl#createThreadScope().

@chumer
Copy link
Member

chumer commented Apr 21, 2023

@tzezula, we handle embedder threads in polyglot isolates. The problem is with threads created by the language that are not system or polyglot threads. They are unsupported now but will only fail if you use polyglot isolates.

No embedder threads are not the solution. Embedder threads are only intended for embedders, as the name suggests.

What changed here is the CPUSampler now uses a system thread vs a embedder thread before, and system threads don't allow entering the context.

Why would you do anything on a CPU sampler thread? Sorry, I cannot yet see why system threads are problematic here. Can you elaborate?

@eregon
Copy link
Member

eregon commented Apr 21, 2023

Why would you do anything on a CPU sampler thread? Sorry, I cannot yet see why system threads are problematic here. > Can you elaborate?

So CPUSampler creates a samplerExecutionService system thread. And it submitThreadLocal. That then calls the various Interrupters on the current thread, including the Interrupter we care about it here to call the unblocking function of a thread in rb_thread_call_without_gvl().

@tzezula
Copy link
Member

tzezula commented Apr 21, 2023

@chumer Right, we handle embedder threads in Truffle isolate. Embedder threads originated in the host VM and are always attached to it. But @eregon uses the term "embedder thread" for ordinary Java thread (neither system nor polyglot) created in the isolate.

@eregon
Copy link
Member

eregon commented Jan 8, 2024

This is still an issue, @chumer @tzezula could you recommend a solution?

@eregon
Copy link
Member

eregon commented Jan 8, 2024

One idea is if unblock functions do not need Sulong and do not need a Truffle context, we could execute the native function (via JNI, not NFI which needs the context) directly on the "system thread of CPUSampler", without needing to enter the Truffle context.
But

/**
* Allows the passed function to run in parallel with other Ruby threads.
*
* What this function does:
*
* 1. Checks (and handles) pending interrupts.
* 2. Releases the GVL. (Others can run here in parallel...)
* 3. Calls the passed function.
* 4. Blocks until it re-acquires the GVL.
* 5. Checks interrupts that happened between 2 to 4.
*
* In case other threads interfaced with this thread using rb_thread_kill()
* etc., the passed UBF is additionally called. See ::rb_unblock_function_t
* for details.
*
* Unlike rb_thread_call_without_gvl2() this function also reacts to signals
* etc.
*
* @param[in] func A function to call without GVL.
* @param[in,out] data1 Passed as-is to `func`.
* @param[in] ubf An UBF to cancel `func`.
* @param[in,out] data2 Passed as-is to `ubf`.
* @return What `func` returned, or 0 in case `ubf` cancelled `func`.
* @warning You cannot use most of Ruby C APIs like calling methods or
* raising exceptions from any of the functions passed to it.
* If that is dead necessary use rb_thread_call_with_gvl() to
* re-acquire the GVL.
* @warning In short, this API is difficult. @ko1 recommends you to use
* other ways if any. We lack experiences to use this API. If
* you find any corner cases etc., please report it to the
* devs.
* @warning Releasing and re-acquiring the GVL are expensive operations.
* For a short-running `func`, it might be faster to just call
* `func` with blocking everything else. Be sure to benchmark
* your code to see if it is actually worth releasing the GVL.
*/
void *rb_thread_call_without_gvl(void *(*func)(void *), void *data1,
rb_unblock_function_t *ubf, void *data2);
and https://github.com/oracle/truffleruby/blob/master/lib/cext/include/ruby/internal/intern/thread.h#L330-L343
do not specify whether the unblock functions of rb_thread_call_without_gvl* can call rb_* or so, which would be a problem as those need to be entered in the context as almost all of them end up running some Sulong or Ruby code.
It might hold though.

self-note: some cleanup at bd/system-thread-cext-interrupter. I can try if that'd work after #3356 is merged

@eregon
Copy link
Member

eregon commented Jan 9, 2024

I chatted with @chumer.
There is no perfect solution right now, so we have to use an extra (polyglot) thread or use JNI and then expect the unblock function will not call back to the Ruby C API (if so we need to test how it fails when that does not hold, it would need to be a clear error and e.g. not a segfault).
Longer-term there might be some kind of async-profiler for Truffle or the functionality to read the guest stacktrace of another thread without needing to run code on that thread, i.e., also working when that thread is in a blocking native call.

@eregon eregon added this to the 24.0.0 Release (March 19, 2024) milestone Jan 9, 2024
@eregon eregon self-assigned this Jan 9, 2024
@eregon
Copy link
Member

eregon commented Jan 9, 2024

if so we need to test how it fails when that does not hold, it would need to be a clear error and e.g. not a segfault.

With asserts on it's rather clear:

$ jt test -e rb_thread_call_without_gvl spec/ruby/optional/capi/thread_spec.rb -- --cpusampler
Exception in thread "SIGHUP handler" java.lang.AssertionError: No polyglot context is entered. A language or context reference must not be used if there is no polyglot context entered.
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotFastThreadLocals.assertValidGet(PolyglotFastThreadLocals.java:481)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotFastThreadLocals$ContextReferenceImpl.get(PolyglotFastThreadLocals.java:513)
	at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.LibFFIContext.get(LibFFIContext.java:359)

but without it's not:

Exception in thread "SIGHUP handler" java.lang.NullPointerException: Cannot invoke "com.oracle.truffle.nfi.backend.libffi.LibFFIContext.executeNative(long, long, byte[], int, int[], Object[], byte[])" because "ctx" is null
	at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.LibFFISignature$CachedSignatureInfo.execute(LibFFISignature.java:322)
	at com.oracle.truffle.truffle_nfi_libffi/com.oracle.truffle.nfi.backend.libffi.FunctionExecuteNode$SignatureExecuteNode.execute(FunctionExecuteNode.java:151)
	at org.graalvm.truffle/com.oracle.truffle.api.impl.DefaultCallTarget.callDirectOrIndirect(DefaultCallTarget.java:85)
	at org.graalvm.truffle/com.oracle.truffle.api.impl.DefaultDirectCallNode.call(DefaultDirectCallNode.java:59)

@eregon
Copy link
Member

eregon commented Jan 9, 2024

I tried creating a new thread in CExtInterrupter#interrupt but that does not work, because the context lock is held while the sampler ThreadLocalAction is running:

"Sampling thread" #34 [90231] daemon prio=5 os_prio=0 cpu=144.69ms elapsed=7.08s tid=0x00005627f2a4ff00 nid=90231 in Object.wait()  [0x00007f8304c2d000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait0(java.base@21.0.1/Native Method)
	- waiting on <0x00000007ffc1b470> (a com.oracle.truffle.polyglot.PolyglotThread)
	at java.lang.Object.wait(java.base@21.0.1/Object.java:366)
	at java.lang.Thread.join(java.base@21.0.1/Thread.java:2078)
	- locked <0x00000007ffc1b470> (a com.oracle.truffle.polyglot.PolyglotThread)
	at java.lang.Thread.join(java.base@21.0.1/Thread.java:2154)
	at org.truffleruby.core.thread.ThreadNodes$CallWithUnblockingFunctionNode$CExtInterrupter.interrupt(org.graalvm.ruby/ThreadNodes.java:722)
	at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.setFastPendingAndInterrupt(org.graalvm.truffle/ThreadLocalHandshake.java:487)
	at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.addHandshakeImpl(org.graalvm.truffle/ThreadLocalHandshake.java:474)
	at com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.addHandshake(org.graalvm.truffle/ThreadLocalHandshake.java:465)
	at com.oracle.truffle.api.impl.ThreadLocalHandshake.addHandshakes(org.graalvm.truffle/ThreadLocalHandshake.java:136)
	at com.oracle.truffle.api.impl.ThreadLocalHandshake.runThreadLocal(org.graalvm.truffle/ThreadLocalHandshake.java:125)
	at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:309)
	- locked <0x0000000418811aa0> (a com.oracle.truffle.polyglot.PolyglotContextImpl)
	at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:236)
	at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:232)
	at com.oracle.truffle.polyglot.EngineAccessor$EngineImpl.submitThreadLocal(org.graalvm.truffle/EngineAccessor.java:1812)
	at com.oracle.truffle.api.instrumentation.TruffleInstrument$Env.submitThreadLocal(org.graalvm.truffle/TruffleInstrument.java:1336)
	at com.oracle.truffle.tools.profiler.SafepointStackSampler.sample(com.oracle.truffle.tools.profiler/SafepointStackSampler.java:101)
	at com.oracle.truffle.tools.profiler.CPUSampler$SamplingTask.run(com.oracle.truffle.tools.profiler/CPUSampler.java:753)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21.0.1/Executors.java:572)
	at java.util.concurrent.FutureTask.runAndReset(java.base@21.0.1/FutureTask.java:358)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@21.0.1/ScheduledThreadPoolExecutor.java:305)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.1/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith(java.base@21.0.1/Thread.java:1596)
	at java.lang.Thread.run(java.base@21.0.1/Thread.java:1583)
	at com.oracle.truffle.polyglot.SystemThread.run(org.graalvm.truffle/SystemThread.java:68)

And the main thread ends up stuck, waiting for that lock:

"main" #1 [90109] prio=5 os_prio=0 cpu=6872.82ms elapsed=7.58s tid=0x00005627f242c340 nid=90109 waiting for monitor entry  [0x00007ffd2eb8e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:248)
	- waiting to lock <0x0000000418811aa0> (a com.oracle.truffle.polyglot.PolyglotContextImpl)
	at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:236)
	at com.oracle.truffle.polyglot.PolyglotThreadLocalActions.submit(org.graalvm.truffle/PolyglotThreadLocalActions.java:232)
	at com.oracle.truffle.polyglot.EngineAccessor$EngineImpl.submitThreadLocal(org.graalvm.truffle/EngineAccessor.java:1812)
	at com.oracle.truffle.api.TruffleLanguage$Env.submitThreadLocalInternal(org.graalvm.truffle/TruffleLanguage.java:3729)
	at com.oracle.truffle.api.TruffleLanguage$Env.submitThreadLocal(org.graalvm.truffle/TruffleLanguage.java:3686)
	at org.truffleruby.language.SafepointManager.pauseAllThreadsAndExecute(org.graalvm.ruby/SafepointManager.java:52)
	at org.truffleruby.language.SafepointManager.pauseRubyThreadAndExecute(org.graalvm.ruby/SafepointManager.java:44)
	at org.truffleruby.core.thread.ThreadNodes$WakeupNode.wakeup(org.graalvm.ruby/ThreadNodes.java:640)

EDIT: it fixes the deadlock to not join() the thread there, so that's fine.
Then it's the same failure as with JNI (wakeup raising ThreadError: killed thread)

@eregon
Copy link
Member

eregon commented Jan 9, 2024

I also tried just calling the unblock function via JNI (which would avoid the overhead of an extra thread and communicating with it for each sampling ThreadLocalAction).
That seems to work for spec/ruby/optional/capi/thread_spec.rb, but the spec also fails as a side-effect with wakeup raising ThreadError: killed thread since the unblock already ran and so the thread already terminated.
Which means CPUSampler has that side effect, unless the blocking function automatically retries after checking interrupts.
Obviously, not great but not sure there is a solution.
Also not clear to me if it's safe to call the unblock function multiple times, but it probably needs to be e.g. if the blocking function retries after checking interrupts (= rb_thread_check_ints()).

It'd be good to look at real-world usages of rb_thread_call_without_gvl() and see how it behaves after being interrupted.
Ideally they would just retry after checking interrupts, but not sure that's what they do.

@eregon
Copy link
Member

eregon commented Jan 10, 2024

Some more thoughts here:

  • Maybe it would make sense to leaveAndEnter() for calling the blocking function. But since that may use rb_thread_call_with_gvl() and leaveAndEnter() does not support nested enter (IIRC) it's not an option. It would also of course not account the blocking functions at all which is not great.
  • We could disable safepoints while in the blocking function. This would again not account it, but then also there would be no need to interrupt it. But this wouldn't work for e.g. Ctrl+C/signal traps which need to run on that thread, so that's not an option either.
  • Maybe we could check if the CPUSampler is active and if so eagerly capture the guest stacktrace for that thread before executing the blocking function. It would have more overhead if there is 0 samples during that blocking function though (i.e., if it runs for less than 10ms).
  • The old CPUSampler with ShadowStack would work well for that case and effectively provide the ability to query another thread stack at any time without any synchronization with it or running any code on it. OTOH it added an overhead for every CallTarget call (in call mode). IMO it would be valuable to add it again as an option, we still have the ShadowStack impl anyway for other tools.
  • Obviously if there was a JVMCI API to get the guest stacktrace of another thread without needing to run code on it that'd be ideal, but it might take a while.

@nirvdrum
Copy link
Collaborator

Ideally, I don't think we'd have to deal with the unblock function at all. I suppose that's what you mean with an analog to async_profiler. I don't want collecting a profile to affect the behavior of running code beyond performance overhead. Interrupting a blocking call that would naturally unblock is not ideal.

With that said, I'd settle for whatever we can do. I'm finding the sampling profiler doesn't work with a real world Rails application. There's a lot of thread calls going on without the GVL in the database driver, OpenSSL, ZLib, and others.

@eregon
Copy link
Member

eregon commented Jan 11, 2024

Ideally, I don't think we'd have to deal with the unblock function at all. I suppose that's what you mean with an analog to async_profiler. I don't want collecting a profile to affect the behavior of running code beyond performance overhead. Interrupting a blocking call that would naturally unblock is not ideal.

Yes, exactly, ideally we wouldn't need to, but currently the only way to get a guest stacktrace of another thread is to run TruffleRuntime#iterateFrames on that thread, which means using TruffleSafepoint and reaching a guest safepoint reasonably quickly.

With that said, I'd settle for whatever we can do. I'm finding the sampling profiler doesn't work with a real world Rails application. There's a lot of thread calls going on without the GVL in the database driver, OpenSSL, ZLib, and others.

Note that many calls to rb_thread_call_without_gvl() use RUBY_UBF_IO as the unblock "function", which is sending SIGVTALRM to interrupt it and we already do this with JNI so that should work fine, except of course it also has the side effect of interrupting that blocking call (for every sample while in it) while it would not be interrupted in CRuby.

Ideally when such a blocking call is interrupted it would check interrupts (rb_thread_check_ints()) and then retry (e.g. until it doesn't get EINTR when using RUBY_UBF_IO).
So for example it should not abort and not retry if the unblock function is called.
But I'm not sure all calls to rb_thread_call_without_gvl() do this correctly.
They should though because CRuby uses the unblock function e.g. to run a Ruby-defined signal handler on the main Ruby thread and that can happen multiple times and should have side effects on the program besides what the signal handler does.

graalvmbot pushed a commit that referenced this issue Jan 12, 2024
* We cannot enter TruffleContext on a system thread, so that leaves us with two options:
  * Calling the unblock function with JNI, and expect the unblock function does not call back to Ruby (not the case for all unblock functions I saw).
  * Calling the unblock function on a new Ruby Thread per context and communicate via a queue, but this seems very high overhead for every guest safepoint.
* Fixes #3013
@eregon
Copy link
Member

eregon commented Jan 12, 2024

Fix in #3385.
I verified that with the fix truffleruby succeeds in capturing samples while a thread is inside rb_thread_call_without_gvl() with a custom unblock function.
The profile for that thread is mostly attributed to rb_thread_check_ints()/rb_thread_call_with_gvl() because that's where the guest safepoint poll happens, which makes sense and is expected although the actual time is mostly spent in the blocking function but that itself can't be polled so that's expected safepoint bias. We still see those are above the rb_thread_call_without_gvl() in the flamegraph so it's rather clear.

@eregon
Copy link
Member

eregon commented Jan 12, 2024

Here is the flamegraph from the new spec for rb_thread_call_without_gvl() with a custom unblock function that I'm talking about above:
flamegraph

And before the fix (on truffleruby-dev native) it looked like this:
flamegraph
i.e. it could not get any sample while a thread was inside rb_thread_call_without_gvl()` with a custom unblock function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants