Tracing JITs and modern CPUs part 3: A bad case #8
Let's look at a bad case for tracing JITs. The simple function we looked at in #6 worked really well but turns out to be quite fragile. Let us look at how to break it and see if we can learn something in the process.
I write this from my perspective as a humble application programmer who only needs to understand the fundamentals well enough to write efficient programs. I will not be offering any grand insights into compilers on this blog (though perhaps commenters will).
The code was a simple loop to bump a counter one billion times. The code is high-level Lua that even includes a hashtable lookup in the inner loop (looking up the `add` function in the `counter` module table):
```lua
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
for i = 1,n do
   counter.add(c, 1)
end
```
LuaJIT compiled this down to a trace: a linear block of machine code that contains at most one loop and otherwise no internal branches. The loop compiled down to five instructions:
and I was pleasantly surprised when the performance counters confirmed how few instructions each iteration executed.
That paints a very pretty picture. But it is easy to shake this up.
Just a little innocent change
Here is an innocent looking new version of the program that contains an `if`/`else` branch in the loop body:
```lua
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
for i = 1,n do
   if i % 2 == 0 then
      counter.add(c, 1)
   else
      counter.add(c, 10)
   end
end
```
How does this version run?
Oops! Now each iteration takes 15 cycles and executes 36 instructions. That is a 15x slowdown.
The high-level explanation is actually straightforward. The first version runs fast because the loop executes very few instructions: most of the work like table lookups has been "hoisted" to run before entering the loop. The second version runs slowly because it is frequently repeating this setup work.
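The effect of this hoisting can be sketched in plain Lua. This is my own illustration, using a hypothetical stand-in for the `core.counter` module rather than the real Snabb code:

```lua
-- Hypothetical stand-in for the core.counter module used in the post.
local counter = {}
function counter.open(name) return { value = 0 } end
function counter.add(c, n) c.value = c.value + n end

local c = counter.open("test")

-- Manual version of the hoisting the JIT performs automatically:
-- the hashtable lookup of "add" in the counter table happens once,
-- before the loop, instead of on every iteration.
local add = counter.add
for i = 1, 1000 do
   add(c, 1)
end
print(c.value) --> 1000
```

The JIT arrives at the same shape on its own: the lookup runs in the setup code before the loop is entered, which is exactly why re-running that setup code on every other iteration hurts so much in the second version.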
To understand what happens we can read the JIT dumps for the first version and second version and draw pictures of the flow of control. In these diagrams each box is a trace, i.e. a series of machine code instructions that execute from top to bottom. Branches are drawn as arrows and come in two kinds: a loop back into an earlier part of the same trace (at most one is allowed per trace) or an exit to a different trace. The "hot" code that consumes the CPU is highlighted.
Here is a picture of the first version that runs fast:
and below is the full machine code. I won't bother to read every instruction here. My observation is that the proportions match the diagram: quite a lot of instructions on entry and then a small number of instructions in the loop.
So what changes in the second version that causes the inner loop to expand from 5 instructions up to 36? Here is the picture:
Now we have two traces: the original root trace and a new side trace. This is necessary because there is a branch (the `if i % 2 == 0` test) in the loop body, and a trace is not allowed to contain internal branches.
The picture also illustrates the two reasons why we execute so many instructions now. First, the side trace is bigger than the loop in the root trace (i.e. it contains more instructions). Second, when the side trace branches back to the root trace it re-enters at the top instead of taking a short-cut into the inner loop. This means that overall we execute more instructions.
Let us zoom in to a bit more detail: first to look at the inner loop in the root trace, then to look at the side trace, and finally to look at the complete root trace that is running every time the side trace branches back.
Here is the new loop in the root trace (with added comments):
The difference from the original trace is the two new instructions at the start. These test a guard for the trace (that the loop index must be even) and branch to an exit if this condition does not hold. So when the loop index happens to be even the execution will be very similar to the original version, but when the loop index is odd we will exit to the side trace.
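The guard's behaviour can be paraphrased in Lua. This is my own sketch of the control flow, not the compiler's output:

```lua
-- Paraphrase of the new loop head (illustration only, not compiler output).
-- The even case was hot when the trace was recorded, so it was compiled
-- into the loop; the odd case fails the guard and exits to the side trace.
local function loop_head(i)
   if i % 2 ~= 0 then
      return "exit to side trace"  -- guard fails
   end
   return "stay in fast loop"      -- guard holds: the short fast path
end

print(loop_head(2)) --> stay in fast loop
print(loop_head(3)) --> exit to side trace
```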
Here is the code for the side trace.
I have not read this code in detail but here are a couple of observations:
Here finally is the entire root trace, this time including the initial code before the loop that executes when the side trace branches back:
The trace executes an additional 38 instructions before entering the loop. This path will be taken on every second loop iteration, when the exit to the side trace is taken and it branches back to the top. That would seem to account for the rest of the instructions reported by the performance counters.
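One way to get the fast behaviour back, offered here as my own workaround sketch rather than anything from the post, is to remove the branch from the loop body by computing the increment arithmetically:

```lua
-- Branch-free rewrite of the loop body (my own sketch): choose the
-- increment with arithmetic instead of if/else, so the loop body has a
-- single code path with no guard on i % 2 and no side trace.
local function increment(i)
   return 1 + 9 * (i % 2)  -- even i -> 1, odd i -> 10
end

-- In the original program this would be counter.add(c, increment(i)).
local total = 0
for i = 1, 10 do
   total = total + increment(i)
end
print(total) --> 55 (five evens * 1 + five odds * 10)
```

The modulo is still computed every iteration, but there is only one path through the loop, so a single root trace can cover it.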
If I were a compiler expert then this is where I would explain why the code compiles in this way and provide interesting links to all the relevant research. But I am not. So all I can really state are my own personal observations.
I find this all very interesting. The issues are subtle and complex, but most languages and platforms are subtle and complex to optimize when you first start taking them seriously. I am happy to keep geeking out on these ideas.
Thanks to Alex Gall and Andy Wingo for mentioning the issue of side-traces re-entering their parent traces at the root, so that I could recognise it when I saw it too.
Shouldn't the jump from the root trace to the side trace be here, rather than at the LOOP part? The root trace should be patched (changing `jnz 0x0bca0014` to `jnz 0x0bcaf922`, the entry address of the side trace) after the side trace is created.