Tracing JITs and modern CPUs: double trouble, or, a problem shared is a problem halved? #5
Lately while hacking Snabb Switch I am spending a lot of time getting familiar with two mysterious technologies: trace-based just-in-time compilers and the latest Intel CPU microarchitectures.
Each one is complex enough to make your head hurt. Is it madness to have to contend with both at the same time? Maybe. However, I am starting to see symmetry and to enjoy thinking about them in combination rather than in isolation.
Tracing just-in-time compilers work by creating chunks of code ("traces") with peculiar characteristics (slightly simplified):
CPUs can execute code blindingly fast while it is "on trace": that is, when you can keep the CPU running on one such block of code for a significant amount of time e.g. 100 nanoseconds. The trace compiler can make a whole new class of optimizations because it knows exactly which instructions will execute and exactly how control will flow.
Code runs slower when it does not stay on-trace. This extremely specialized code generation is less effective when several traces have to be patched together. So there is a major benefit to be had from keeping the trace compiler happy -- and a penalty to be paid when you do something to piss it off.
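As a rough sketch of what a trace looks like (my own toy illustration in C, not code from the post): the compiler emits one guard up front, then specialized straight-line code, and leaves through a "side exit" back to slower generic code when the guard fails.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a trace (hypothetical, for illustration only):
   a dynamically typed "add" that the JIT has only ever observed
   on integers. The trace checks that assumption once (the guard)
   and then runs specialized straight-line code with no dispatch. */
typedef struct { bool is_int; int64_t i; double d; } value;

/* Returns false to model a side exit: control leaves the trace
   and falls back to the slower generic path. */
static bool traced_add(const value *a, const value *b, value *out) {
    if (!a->is_int || !b->is_int)
        return false;            /* guard failed: leave the trace */
    out->is_int = true;
    out->i = a->i + b->i;        /* specialized hot path */
    return true;
}
```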
I want to have a really strong mental model of how code is compiled to traces. I am slowly getting there: I have even caught myself writing C code as if it were going to be trace compiled (which frankly would be very handy). However, this is a long journey, and in the meantime some of the optimization techniques are really surprising.
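The habit looks something like this (a hypothetical sketch, not code from Snabb Switch): keep the inner loop's body free of data-dependent control flow so there is one dominant path, which is exactly the shape a trace compiler wants to record.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "trace-style" C: the hot loop has a single
   straight-line body with no branches, so both a tracing JIT
   and the CPU's branch predictor see one dominant path. */
static uint64_t sum_bytes(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];           /* no data-dependent branches */
    return sum;
}
```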
Consider these optimization tips:
Extreme, right? I mean, what is the point of having an
On the face of it you might think that tracing JITs are an anomaly that will soon disappear, like programming in a straitjacket. Then you would go back to your favourite static compiler or method-based JIT and use all the loops and branches that you damned well please.
Here is the rub: modern CPUs also have a long do-and-don't list for maximizing performance at the machine-code level. This sounds bad, because if you are already stretching your brain to make the JIT happy then the last thing you want is another set of complex rules to follow. However, in practice the demands of the JIT and the CPU seem to be surprisingly well aligned, and thinking about satisfying one actually helps you to satisfy the other.
Here are a few rules from the Intel Optimization Reference Manual for Haswell that seem to be on point:
There is even a hardware trace cache in the CPU that attempts to do some of the same optimizations as a software tracing JIT to improve performance.
So what does it all mean? I don't know for sure yet but I am really enjoying thinking it through.
I like to think that effort spent on making the JIT happy is also making the CPU happy. Then with a happy CPU we can better reap the benefits of mechanical sympathy and achieve seemingly impossible performance for more applications. Sure, a trace compiler takes some effort to please, but it is a lot more helpful and transparent than dealing with the CPU directly.
In any case tracing JITs and modern CPU microarchitectures are both extremely interesting technologies and the study of one does stimulate a lot of interesting ideas about the other.
Just for your intellectual curiosity, I'd suggest looking at a different approach: a new CPU architecture, in particular the Mill CPU architecture. I strongly recommend the talks about the belt, memory, prediction, metadata and pipelining.
They combine a statically scheduled CPU with a 'sufficiently smart compiler' to try to get 'on trace' performance on general-purpose code.
P.S.: I'm not associated with the Mill CPU in any way, just interested.
One of the worst things that can happen in a CPU, performance-wise, is a branch misprediction. When that happens, the pipeline has to be thrown away and refilled from scratch.
As @chuesler said, frequent branch mispredictions are very expensive. That is a very good observation that is valid well outside the field of tracing JITs. Avoiding (hard to predict) branches in inner loops and either making them unnecessary or replacing them with predicated instructions can yield enormous performance improvements.
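A minimal illustration of that replacement (my sketch, not from the comment): instead of branching on the comparison, feed its 0/1 result straight into the arithmetic, which compilers typically lower to a plain add or conditional move rather than a jump.

```c
#include <stddef.h>

/* Branchy version: one hard-to-predict branch per element
   when the data is random. */
static size_t count_below_branchy(const int *a, size_t n, int t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] < t)
            c++;
    return c;
}

/* Branchless version: the comparison result (0 or 1) is added
   directly, so the loop body has no data-dependent branch. */
static size_t count_below(const int *a, size_t n, int t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (a[i] < t);
    return c;
}
```

Both functions compute the same count; on random input the branchless one avoids the misprediction penalty entirely.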
You can easily see this with quicksort: during partitioning, in the theoretically ideal case (picking the median as pivot), each item has a 50-50 chance of going "left" or "right", which is the branch predictor's worst nightmare. Thus, you can make quicksort faster by intentionally choosing a bad pivot. I experimented with this (sorting a random distribution of 0..n-1 so that picking the perfect pivot is free) at https://github.com/lorenzhs/quicksort-pivot-imbalance . However, choosing a bad pivot intentionally is not the conclusion; instead, for certain kinds of data, you can do much better even in comparison-based sorting using (for example) SuperScalarSampleSort.
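The classification trick behind such algorithms can be sketched branch-free like this (my own illustration, not the linked code): write each element unconditionally to both output buffers and advance the cursors by the 0/1 comparison result, so partitioning involves no conditional jump at all.

```c
#include <stddef.h>

/* Branch-free two-way partition sketch: every element is stored
   to both buffers, but only one cursor advances. The comparison
   result drives arithmetic, not a branch, so random data causes
   no mispredictions. `lo` and `hi` must each hold n elements. */
static size_t partition_branchless(const int *in, size_t n, int pivot,
                                   int *lo, int *hi) {
    size_t l = 0, h = 0;
    for (size_t i = 0; i < n; i++) {
        int ge = in[i] >= pivot;   /* 0 or 1, no jump */
        lo[l] = in[i];
        hi[h] = in[i];
        l += 1 - ge;
        h += ge;
    }
    return l;                      /* count of elements < pivot */
}
```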
Indeed, tracing JITs are more intriguing than method JITs (like eBPF's, which I study these days). For tracing JITs I found myself starting off with LuaJIT and going through the DynASM implementation. A relevant link is this small discussion, which also covers loops: http://www.freelists.org/post/luajit/How-does-LuaJITs-trace-compiler-work,1 . The goals of a JIT, whether tracing or method-based, are the same as those of a static compiler, hence they follow the suggested optimizations for the underlying architecture.
@orlp I am happy to focus on Intel CPUs now. It is a really exciting time. They are increasing the number of cores, increasing the power of each core with more execution units and exponentially more SIMD bandwidth, and adding exotic features like hardware transactional memory. Their documentation is excellent too. I can't wait for Skylake CPUs with 512-bit SIMD registers to hit the market later this year.