Tracing JITs 4: Zooming in on a simple trace #9
Comments
Have you tried combining hacks 1 and 2? The movsxd instructions in your last trace are kinda useless and different number modes might help.
@andywingo This is interesting... NUMMODE did not make a difference but I was able to eliminate one of those instructions by declaring the array as
The interesting thing is that despite the extra instructions in the original, these two versions seem to have exactly the same performance on a Haswell: they execute in the same number of cycles, they keep the execution units occupied in the same proportions, and they execute the same number of micro-ops. Original version:
New version:
Could the instruction decoder in the Haswell actually be eliminating the unnecessary work while translating into micro-ops, so that the execution units in the backend don't actually do any work for the extra instruction? (Is that what they call macro-fusion in the CPU front-end?) Or could it be that I am rounding the numbers from
@andywingo Tricky... I had the impression from

My confusion comes from the

Generally I would really like to have more control over performance counter monitoring. For example, to separately count events for different loops in the code (Snabb apps). If a loop is executing for around a thousand instructions then it seems like it should be meaningful to track its performance counters separately without too much risk of measurement error. Could look into PAPI or Linux/perf primitives for self-profiling programs. That could really make life easier.

Unlike these toy example programs, most of the programs I want to analyze don't spend 99% of their execution time in one loop, and that makes them less compatible with a simple
I asked JSC people about PAPI a long time ago and they said "why bother, we have a JIT compiler, we can just emit the instructions we need to get the information we care about". Sounds like this is a thing that could be added to LuaJIT.
I'm looking at jevents from pmu-tools now... Could clone it into LuaJIT? Looks tiny.
I actually don't get why LuaJIT is doing all the movsxd's.
Hmmm... Is the kernel perf events business really useful, or do we just need an RDPMC instruction? (I don't think we care much about context switches, interrupts, etc., so a very raw interface might be fine.)
Also should check out Agner on this: http://agner.org/optimize/#testp
@andywingo btw it's now really easy to mix asm with Lua since snabbco/snabb#565.
Wrong issue :-) I meant snabbco/snabb#575. Githubbing via phone...
@andywingo I have been digging a little bit to find a suitable software interface for accessing performance monitoring counters. Currently it looks like Linux

I had hoped that we could simply use the

I don't want to depend on a kernel module, and I also want to be able to track lots of counters at the same time. Tracking performance issues with these counters sometimes seems like a "needle in a haystack" problem. Low performance could be caused by a dcache miss, icache miss, dTLB miss, branch miss, cache coherency protocol conflicts, QPI latency, or a zillion other things. It would be nice to be able to check all of these in parallel in order to generate a hypothesis for what the problem could be. (perf can already do this today, but only for the whole process and not per-app.) The kernel interface seems to improve on this quite nicely:
The main element lacking (?) from the kernel interface seems to be broad support for all the zillions of counters that exist in a given CPU, e.g. Xeon Haswell E5-26xx. This is added by the

The main alternative that may be a shortcut would be to integrate better with standard commands like

That is how it looks based on today's googling, anyway.
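To make the kernel-interface option concrete, here is a hedged LuaJIT FFI sketch of calling `perf_event_open` directly from a self-monitoring process, with no kernel module and no external tool. The struct below is only the 64-byte "version 0" prefix of `struct perf_event_attr`, the syscall number (298) is for x86-64, and the constants are transcribed from the perf_event_open(2) man page; treat all of them as assumptions rather than verified values.

```lua
-- Hedged sketch: count retired instructions for a region of Lua code by
-- opening a hardware perf counter on the current process via raw syscall.
local ffi = require("ffi")

ffi.cdef[[
struct perf_event_attr {       /* truncated to the 64-byte VER0 prefix */
  uint32_t type;               /* PERF_TYPE_HARDWARE = 0 (assumed) */
  uint32_t size;               /* size of this struct prefix: 64 */
  uint64_t config;             /* PERF_COUNT_HW_INSTRUCTIONS = 1 (assumed) */
  uint64_t sample_period;
  uint64_t sample_type;
  uint64_t read_format;
  uint64_t flags;              /* bitfield: bit 5 exclude_kernel, bit 6 exclude_hv */
  uint32_t wakeup_events;
  uint32_t bp_type;
  uint64_t config1;
};
long syscall(long number, ...);
long read(int fd, void *buf, unsigned long count);
int close(int fd);
]]

local attr = ffi.new("struct perf_event_attr")
attr.type, attr.size, attr.config = 0, 64, 1  -- hardware instruction counter
attr.flags = 2^5 + 2^6                        -- count user-space only

-- perf_event_open(attr, pid = 0 (self), cpu = -1 (any), group_fd = -1, flags = 0)
local fd = tonumber(ffi.C.syscall(298, attr, 0, -1, -1, 0))
assert(fd >= 0, "perf_event_open failed (check /proc/sys/kernel/perf_event_paranoid)")

local before, after = ffi.new("uint64_t[1]"), ffi.new("uint64_t[1]")
ffi.C.read(fd, before, 8)
-- ... run the loop under measurement here ...
ffi.C.read(fd, after, 8)
print("instructions retired:", tonumber(after[0] - before[0]))
ffi.C.close(fd)
```

Per-app counting in Snabb could then be a matter of reading the counter at app boundaries, which is the "track loops separately" idea from the earlier comment.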
@lukego, you are a very good candidate to pick up LuaJIT development!
@bratao I will do my bit :).
[I was just walking past]. Two quick observations:
Today I want to do a simple experiment to improve my mental model of how Snabb Switch code is just-in-time compiled into traces by LuaJIT. I am continuing with simple artificial examples like the ones in #6 and #8.
This is the code I want to look at today:
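The original listing did not survive extraction, but based on the description that follows (one billion iterations, incrementing a counter alternately by 1 and 10, with the choice made by a table lookup), a plausible reconstruction looks like this; the variable names are my own guesses:

```lua
-- Hypothetical reconstruction: select the increment with a table lookup
-- (data) rather than an if statement (control).
local amount = { 1, 10 }
local counter = 0
for i = 0, 1e9 - 1 do
   counter = counter + amount[i % 2 + 1]
end
print(counter)
```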
which is functionally equivalent to the code in #8: it loops one billion times and increments a counter alternately by 1 and 10. The difference is that the decision of how much to increment the counter now depends on data (a table lookup) rather than control (an `if` statement). This is an optimization that I have made to help the tracing JIT: I moved the variability from control (branching) into data (table lookup).

Here is how it runs:
Each iteration takes 4 cycles (and executes 15 instructions). This is indeed in between the version from #6 that runs in one cycle (5 instructions) and the version in #8 that runs in 15 cycles (36 instructions).
Here is what I am going to do now:
IR and mcode
Let me present three exhibits: the Intermediate Representation for the loop, the corresponding machine code, and a hand-interleaved combination of the two with some comments.
Here is the Intermediate Representation of the loop:
Here is the machine code that was generated for that IR code:
Here is an interleaved combination of the two where I have added comments explaining my interpretation of what is going on:
I may well have made a mistake in this interpretation. I am also not certain whether the machine code does strictly match the IR code or to what extent it can be merged and shuffled around. I would like to understand this better because I have a fantasy that LuaJIT could automatically generate the interleaved view and that this might make traces easier for me to read.
So what jumps out from this?
Tweak 1: LUAJIT_NUMMODE
First I take the opportunity to try a little bit of voodoo. LuaJIT supports several number modes that can be chosen at compile time. What is a number mode? I don't really know. Mike Pall has commented that on x86_64 there are several options and some may be faster than others depending on the mix of integer and floating point operations.
Just for fun I tried them all. It turned out that compiling LuaJIT with `-DLUAJIT_NUMMODE=2` improved this example significantly:

Now we are down to 3 cycles per iteration (for 12 instructions).
Here is the IR:
Here is the mcode:
Interesting. I am tempted to submit a pull request to Snabb Switch that enables `-DLUAJIT_NUMMODE=2` and see what impact that has on the performance tests that our CI runs. However, I am generally reluctant to apply optimizations that I don't understand reasonably well.

Tweak 2: FFI table
This time I will try a more straightforward change.
The problem I see is that we are doing a bunch of work to check array bounds and convert the table values from floats to ints. Let us try to avoid this by replacing the high-level Lua table with a low-level FFI array of integers.
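The exact declaration used is not preserved here, but a sketch of the change, assuming `ffi.new` with a fixed-size array of 32-bit integers, might look like:

```lua
-- Hedged sketch: replace the high-level Lua table with an FFI array of
-- machine integers, so the loop needs no table bounds/type checks and no
-- double->int conversion. Note that FFI arrays are zero-indexed.
local ffi = require("ffi")
local amount = ffi.new("int32_t[2]", 1, 10)
local counter = 0
for i = 0, 1e9 - 1 do
   counter = counter + amount[i % 2]
end
print(counter)
```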
This actually works pretty well:
Now we are down to two cycles per iteration (for 9 instructions).
Here is the IR:
Here is the mcode:
This experiment feels more satisfying. I was able to identify redundant code, eliminate it in a sensible way, and verify that performance improved.
The end
Morals of this story: