
Lazy compilation is not thread-safe #5556

Open
llvmbot opened this issue Oct 14, 2009 · 11 comments
@llvmbot llvmbot commented Oct 14, 2009

Bugzilla Link 5184
Version trunk
OS All
Reporter LLVM Bugzilla Contributor
CC @danchr

Extended Description

Lazy compilation is implemented roughly as:

    call @compilationCallback

    define @compilationCallback() {
      ...
      store %new_function_address (%return_address-4)
    }

which turns the "call @compilationCallback" into a "call @real_function". Unfortunately, while that store into the return address is running, another thread could be executing the call. At least on x86, there are no guarantees about what can happen in this case. (Data reads and writes are defined, but not instruction execution.) It's entirely possible that the store won't be seen as atomic and will result in a wild jump.

I see three ways to fix this:

  1. Instead of codegening a direct call instruction, generate something like:
    %target = atomic load @callsite_target_address
    call %target
    and then at the end of compilation atomically store the real function back to @callsite_target_address. This causes every callsite to be an indirect call, which is likely to slow things down a lot, so there should be some way to mark the functions where it's used. On platforms that don't even have atomic pointer writes, this could be implemented by acquiring a lock around every lazily-compilable call.
  2. Assert that lazy compilation is only used in a single-threaded program. This allows us to delete the JIT lock.
  3. Delete lazy compilation. (which would simplify the JIT a fair amount)
@llvmbot llvmbot commented Oct 23, 2009

Besides the "how to update a call instruction" problem, lazy JITting can race with concurrent modifications of any IR in the same LLVMContext unless you're careful to hold the JIT lock around such modifications. This should be documented.

@llvmbot llvmbot commented Oct 23, 2009

> I see three ways to fix this:
>
> 1. Instead of codegening a direct call instruction, generate something like:
>    %target = atomic load @callsite_target_address
>    call %target
>    and then at the end of compilation atomically store the real function back to @callsite_target_address. This causes every callsite to be an indirect call, which is likely to slow things down a lot, so there should be some way to mark the functions where it's used. On platforms that don't even have atomic pointer writes, this could be implemented by acquiring a lock around every lazily-compilable call.

This looks a lot like the way DSOs are handled when compiling with -fPIC. You can probably expect the same kind of overhead. It is interesting to note that when using a DSO that was not compiled with -fPIC (on x86 at least), the resolution is done at load time.

> 2. Assert that lazy compilation is only used in a single-threaded program. This allows us to delete the JIT lock.
> 3. Delete lazy compilation. (which would simplify the JIT a fair amount)

@llvmbot llvmbot commented Oct 24, 2009

> Lazy compilation is implemented roughly as:
>
>     call @compilationCallback
>
>     define @compilationCallback() {
>       ...
>       store %new_function_address (%return_address-4)
>     }
>
> which turns the "call @compilationCallback" into a "call @real_function". Unfortunately, while that store into the return address is running, another thread could be executing the call. At least on x86, there are no guarantees about what can happen in this case. (Data reads and writes are defined, but not instruction execution.) It's entirely possible that the store won't be seen as atomic and will result in a wild jump.

I'm missing some knowledge: the write is a 4-byte write, right? Why is that not atomic on x86?

If the store of a call instruction is not atomic, at least we should be able to store an instruction that does an infinite loop until the update is complete.

The paper from IBM JVM developers called "Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discusses the issue of lazy compilation and multithreading. You can find the presentation of the paper here:
http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf

> I see three ways to fix this:
>
> 1. Instead of codegening a direct call instruction, generate something like:
>    %target = atomic load @callsite_target_address
>    call %target
>    and then at the end of compilation atomically store the real function back to @callsite_target_address. This causes every callsite to be an indirect call, which is likely to slow things down a lot, so there should be some way to mark the functions where it's used. On platforms that don't even have atomic pointer writes, this could be implemented by acquiring a lock around every lazily-compilable call.

The infinite loop instruction is a possible way of avoiding this.

> 2. Assert that lazy compilation is only used in a single-threaded program. This allows us to delete the JIT lock.

I really don't want this :)

> 3. Delete lazy compilation. (which would simplify the JIT a fair amount)

Indeed, it would simplify things a lot, but, again, I really don't want to drop support for lazy compilation. How can Unladen Swallow afford disabling lazy compilation? Don't you have functions that are not compiled yet and only want to compile when they are actually used?

@llvmbot llvmbot commented Oct 24, 2009

> > 3. Delete lazy compilation. (which would simplify the JIT a fair amount)
>
> Indeed, it would simplify things a lot, but, again, I really don't want to drop support for lazy compilation. How can Unladen Swallow afford disabling lazy compilation? Don't you have functions that are not compiled yet and only want to compile when they are actually used?

If I understand it correctly, only hot parts are translated from Python bytecode to LLVM IR.

@llvmbot llvmbot commented Oct 25, 2009

> > Unfortunately, while that store into the return address is running, another thread could be executing the call. At least on x86, there are no guarantees about what can happen in this case. (Data reads and writes are defined, but not instruction execution.) It's entirely possible that the store won't be seen as atomic and will result in a wild jump.
>
> I'm missing some knowledge: the write is a 4-byte write, right? Why is that not atomic on x86?

With threading, the question you have to ask is, "why is that atomic?" Absent guarantees from either the Intel and AMD architecture manuals, or exhaustive testing on lots of different platforms, assuming something's atomic sets you up to get paged at 3 in the morning.

For example, the "Intel® 64 and IA-32 Architectures Software Developer's Manual" (http://www.intel.com/Assets/PDF/manual/253668.pdf) section [8.2.3.1 Assumptions, Terminology, and Notation] says that aligned 1, 2, 4, and 8-byte reads and writes, and arbitrary locked instructions appear to execute as a single memory access, but that "Other instructions may be implemented with multiple memory accesses." It doesn't mention instruction fetches, but that makes me think that they can appear to be implemented with multiple memory accesses too.

The JVMs have done exhaustive testing, so I personally trust copying them, but that does set us up for a maintenance burden as new chips come out.

> If the store of a call instruction is not atomic, at least we should be able to store an instruction that does an infinite loop until the update is complete.

This doesn't necessarily help. Say the instruction looks like

    [call] [target1] [target2] [target3] [target4]

So we call:

    store [br] [self] to {[call] [target1]}
    mfence
    store [self] [newtarget2] [newtarget3] [newtarget4] to {[self] [target2] [target3] [target4]}
    mfence
    store [call] [newtarget1] to {[br] [self]}

Another thread is concurrently executing this code. 3 problems:

  1. If storing the call instruction wasn't atomic in the first place, why do we think storing the [br self] over it would be atomic? Specifically, the executing thread could have fetched the [call] part of the original code before the mfence, stalled for a bit while the updating thread wrote [br self], then fetched [self] [target2] [target3] [target4]: crash
  2. Alternately, maybe writing [br self] is atomic, but the executing thread fetched [call] [target1] before that write, then stalled for a bit, then fetched [newtarget2] [new target3] [new target4] after they're written: crash.
  3. Alternately, since the mfence constrains only the writer and not the reader, maybe the instruction fetch happens out of order and reads [call] [newtarget1] [target2] [target3] [target4]: crash.

> The paper from IBM JVM developers called "Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler" [CGO2006] discusses the issue of lazy compilation and multithreading. You can find the presentation of the paper here:
> http://www.cgo.org/cgo2006/html/progslides/session2_talk3_maier.pdf

Thanks, that's helpful. It says that fetches are atomic within "patching boundaries": "A patching boundary is an address that is aligned on the length of a data cache line on P6 microarchitecture (i.e., 32 bytes on Pentium 3 and 64 bytes on Pentium 4) [9] [10] or 8 bytes on K7 and K8 microarchitecture (determined through empirical experimentation) [1]." Of course, we'd need more testing to check that 8 bytes is still sufficient on K10, Core 2, and Nehalem. The right thing to do is probably to cross-reference our patching code to the right functions in openjdk so it's easy to just imitate them for new processors.

> > 3. Delete lazy compilation. (which would simplify the JIT a fair amount)
>
> Indeed, it would simplify things a lot, but, again, I really don't want to drop support for lazy compilation. How can Unladen Swallow afford disabling lazy compilation? Don't you have functions that are not compiled yet and only want to compile when they are actually used?

Like Rafael said, Unladen Swallow and Rubinius only JIT-compile hot functions, and we don't define "hot" as "called once". When a not-yet-compiled function is called from JITted code, it calls back into the interpreter. If A calls B, and A becomes hot before B, we currently have to recompile A to get a direct call to B; otherwise the call goes through the interpreter. Rubinius tries to find blocks of functions to compile all at once, so it's less vulnerable to a caller becoming hot only slightly before a callee.

Lazy compilation in the current JIT would also be a latency problem: codegen is slow, and we don't want to block the call to the lazily-compiled function while it codegens (the call can run through the interpreter instead). Instead, we want to compile in a background thread and only switch the call from the interpreter to compiled code when compilation is finished.

I think we would use a more generic thread-safe code patching facility if the JIT offered it.

@llvmbot llvmbot commented Oct 26, 2009

OK, so if you're not using lazy compilation, then why bother? ;-) No, seriously: I'd rather leave that bug open, wait until an expert implements it, and still enable lazy compilation in LLVM. Does that sound reasonable?

@llvmbot llvmbot commented Oct 27, 2009

I think we should turn lazy compilation off by default as long as it's not thread-safe. Besides being potentially crashy, it has confused people trying to helgrind their apps.

I don't think the "lazily compile every function" approach is good for performance for anyone, but if vmkit or other clients are using it, I can't unilaterally delete it.

@llvmbot llvmbot commented Oct 27, 2009

> I think we should turn lazy compilation off by default as long as it's not thread-safe.

Fine by me!

@llvmbot llvmbot commented Oct 28, 2009

I got some feedback (thanks!), at least for AMD processors.
According to the AMD programmer's manual (http://support.amd.com/us/Processor_TechDocs/24593.pdf), it is supposed to be safe to patch a naturally aligned address after the call opcode with a store (Section 7.6.1). One could thus pad the code before the call's target address with nops so that it is naturally aligned. Using a lock cmpxchg is not necessarily safe.

@llvmbot llvmbot commented Oct 28, 2009

Thanks Torvald! Specifically, I think you're referring to "Synchronization for cross-modifying code is not required for code that resides within the naturally aligned quadword," which is a really nice guarantee.

I can't find the wording that implies that a lock cmpxchg is not necessarily safe; where do you see that? I don't think we need it, but I'd like to get all the restrictions recorded here if possible.

If anyone else knows the guarantees for powerpc, arm, etc. please link to them here too.

@llvmbot llvmbot commented Nov 9, 2009

Jeffrey, I guess that the natural alignment is the necessary condition, not the cmpxchg. But I don't have a citation for this, sorry.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 3, 2021