Here is a simple exercise to connect the theory and practice of tracing JITs and modern Intel microarchitectures. I write a small example program, see how LuaJIT compiles it to a trace, and then see how a Haswell CPU executes it. This follows on from #5 and #3 respectively.
Tracing JIT
The program is trivially simple: it uses the Snabb Switch counter module to create a counter object and then increment it one billion times. Snabb Switch counters are represented as binary files on disk that each contain one 64-bit number (each file is 8 bytes). The reason we allocate counters on the file system is to make them directly available to diagnostic programs that track network packets processed, packets dropped, and so on. The way we actually access them in Lua code is by mapping them into memory with mmap() and then accessing them directly as FFI uint64_t * values. (See the shm module for our cute little API that allocates arbitrary C data types as named shared memory objects.)
Here is the code:
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
for i = 1, n do
   counter.add(c, 1)
end
I run this using snsh (snabb shell, a LuaJIT frontend) with JIT trace dumping enabled:
# ./snabb snsh -jdump script.lua
which outputs a full dump (bytecode, intermediate representation, and x86 machine code) from which we can look at the machine code for the loop that will execute one billion times:
There we see that LuaJIT has compiled the loop body down to five instructions:
Bump counter value in register.
Store counter value to memory.
Bump loop iteration counter.
Check for loop termination.
Branch back to start of loop.
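The actual dump is not reproduced here, but as a purely hypothetical sketch (registers, offsets, and the immediate are invented; the real trace differs), the five steps could look something like:

```
; hypothetical illustration only -- not the real LuaJIT dump
->LOOP:
add rbx, 1           ; bump counter value in register
mov [rcx], rbx       ; store counter value to memory
add eax, 1           ; bump loop iteration counter
cmp eax, 1000000000  ; check for loop termination
jle ->LOOP           ; branch back to start of loop
```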
This seems pretty nice actually: according to the semantics of Lua, the call to counter.add() is a hashtable lookup followed by a function call, but LuaJIT has been able to optimize all of that away and inline the call down to two instructions. (Hat tip to Mike Pall and his very impressive brain.)
So that is what the tracing JIT does!
Haswell CPU
Now what does the Haswell CPU do with this?
First the theory: we can refer to the excellent AnandTech article to see how each Haswell CPU core works:
The CPU takes in a large number of x86 instructions, JITs them all into internal Haswell micro-instructions, figures out their interdependencies, and schedules them for parallel execution across eight independent execution units. (This is a sophisticated piece of technology.)
To connect this with practice we will use the ocperf.py program from pmu-tools to access some CPU performance counters. Performance counters give us visibility into the internal workings of the CPU: a modern Xeon exports a lot of diagnostic information and is very far from a black box. I test with a Xeon E5-2620 v3.
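The exact command is not reproduced above; a representative invocation would look something like the following (the path to pmu-tools and the particular event list are my assumptions, not the original command — ocperf.py also accepts model-specific event names that plain perf does not):

```
# ./pmu-tools/ocperf.py stat -e instructions,cycles,branches,branch-misses,L1-dcache-stores \
      ./snabb snsh script.lua
```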
The loop executed around 5 billion instructions. This makes sense because we counted five instructions in the loop body and we chose an iteration count of one billion.
The loop executed in 1 billion cycles. Holy shit! The CPU is actually executing each iteration of the loop - all five instructions - in a single cycle. I am impressed.
There were a billion branches but the CPU predicted them all correctly.
There were a billion memory stores but the CPU made them all hit the L1 cache.
The Haswell execution units 4 and 6 were used continuously and the CPU spread the rest of the load across execution units 0, 1, 2, 3, and 7. I can see why port 4 would need to be used continuously, because it is the only execution port capable of Store Data, but that is the limit of my interpretation.
Cool stuff!
The end
This is the level of visibility that I want to have into the programs I am working on. I am quite satisfied with this example. Now what I want to do is make it easy for Snabb Switch hackers to get this level of visibility into the practical code that they are working on.
Holy shit! The CPU is actually executing the entire loop - all five instructions - in only one cycle. I am impressed.
Fun fact: the Mill processor is capable of executing over 30 instructions per cycle, each cycle, in general purpose workloads. Sure, it's not a shipping product just yet, but an interesting architecture for the future indeed.