New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing JITs and modern CPUs: double trouble, or, a problem shared is a problem halved? #5

Open
lukego opened this Issue Jul 24, 2015 · 6 comments

Comments

Projects
None yet
5 participants
@lukego
Owner

lukego commented Jul 24, 2015

Lately while hacking Snabb Switch I am spending a lot of time getting familiar with two mysterious technologies: trace-based just-in-time compilers and the latest Intel CPU microarchitectures.

Each one is complex enough to make your head hurt. Is it madness to have to contend with both at the same time? Maybe. However, I am starting to see symmetry and to enjoy thinking about them both in combination rather than separately in isolation.

Tracing JITs

Tracing just-in-time compilers work by creating chunks of code ("traces") with peculiar characteristics (slightly simplified):

  1. Flat: every single function call inlined.
  2. No branches.
  3. One loop.

CPUs can execute code blindingly fast while it is "on trace": that is, when you can keep the CPU running on one such block of code for a significant amount of time e.g. 100 nanoseconds. The trace compiler can make a whole new class of optimizations because it knows exactly which instructions will execute and exactly how control will flow.

Code runs slower when it does not stay on-trace. This extremely specialized code generation is less effective when several traces have to be patched together. So there is a major benefit to be had from keeping the trace compiler happy -- and a penalty to be paid when you do something to piss it off.

I want to have a really strong mental model of how code is compiled to traces. I am slowly getting there: I have even caught myself writing C code as if it were going to be trace compiled (which frankly would be very handy). However, this is a long journey, and in the meantime some of the optimization techniques are really surprising.

Consider these optimization tips:

  1. Avoid nested loops.
  2. Avoid lots of tiny loops.
  3. Avoid loops with unpredictable iteration counts.
  4. Avoid unpredictable branches.

Extreme, right? I mean, what is the point of having an if statement at all if the code is only allowed to take one of the alternatives? And when did loops, one of the most basic concepts in the history of computing, suddenly become taboo?

On the face of it you might think that Tracing JITs are an anomoly that will soon disappear, like programming in a straight jacket. Then you would go back to your favourite static compiler or method-based JIT and use all the loops and branches that you damned well please.

Intel microarchitecture

Here is the rub: Modern CPUs also have a long do-and-don't list for maximizing performance at the machine code level. This sounds bad because if you are already stretching your brain to make the JIT happy then the last thing you want is another set of complex rules to follow. However, in practice the demands of the JIT and the CPU seem to be surprisingly well aligned, and thinking about satisfying one actually helps you to to satisfy the other.

Here are a few rules from the Intel Optimization Reference Manual for Haswell that seem to be on point:

  1. Arrange code to make basic blocks contiguous and eliminate unnecessary branches.
  2. Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.
  3. Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.

There is even a hardware trace cache in the CPU that attempts to do some of the same optimizations as a software tracing JIT to improve performance.

So what does it all mean? I don't know for sure yet but I am really enjoying thinking it through.

I like to think that effort spent on making the JIT happy is also making the CPU happy. Then with a happy CPU we can better reap the benefits of mechanical sympathy and achieve seemingly impossible performance for more applications. Sure, a trace compiler takes some effort to please, but it is a lot more helpful and transparent than dealing with the CPU directly.

In any case tracing JITs and modern CPU microarchitectures are both extremely interesting technologies and the study of one does stimulate a lot of interesting ideas about the other.

@lukego lukego added the tech label Jul 24, 2015

@orlp

This comment has been minimized.

Show comment
Hide comment
@orlp

orlp Jul 24, 2015

Just for your intellectual curiosity, I'd suggest looking at a different approach: a new CPU architecture. In particular the Mill CPU architecture. I can strongly suggest the talks about the belt, memory, prediction, metadata and pipelining.

They use a hybrid of a statically scheduled CPU with a 'sufficiently smart compiler' to try and get 'on trace' performance on general purpose code.

P.S.: I'm not associated with the Mill CPU in any way, just interested.

orlp commented Jul 24, 2015

Just for your intellectual curiosity, I'd suggest looking at a different approach: a new CPU architecture. In particular the Mill CPU architecture. I can strongly suggest the talks about the belt, memory, prediction, metadata and pipelining.

They use a hybrid of a statically scheduled CPU with a 'sufficiently smart compiler' to try and get 'on trace' performance on general purpose code.

P.S.: I'm not associated with the Mill CPU in any way, just interested.

@chuesler

This comment has been minimized.

Show comment
Hide comment
@chuesler

chuesler Jul 24, 2015

One of the worse things that can happen in a cpu, performance wise, is when branch prediction fails. If that happens, the pipeline has to be thrown away and refilled from scratch.
The requirements for traces basically eliminate problems for branch prediction, so your pipeline stays full for the maximum amount of time. The manual has the same goal, so it's not surprising that the instructions are similar.

chuesler commented Jul 24, 2015

One of the worse things that can happen in a cpu, performance wise, is when branch prediction fails. If that happens, the pipeline has to be thrown away and refilled from scratch.
The requirements for traces basically eliminate problems for branch prediction, so your pipeline stays full for the maximum amount of time. The manual has the same goal, so it's not surprising that the instructions are similar.

@lorenzhs

This comment has been minimized.

Show comment
Hide comment
@lorenzhs

lorenzhs Jul 24, 2015

As @chuesler said, frequent branch mispredictions are very expensive. That is a very good observation that is valid well outside the field of tracing JITs. Avoiding (hard to predict) branches in inner loops and either making them unnecessary or replacing them with predicated instructions can yield enormous performance improvements.

You can easily see this with quicksort: during partitioning, in the theoretically ideal case (picking the median as pivot), each item has a 50-50 chance of going "left" or "right", which is the branch predictor's worst nightmare. Thus, you can make quicksort faster by intentionally choosing a bad pivot. I experimented with this (sorting a random distribution of 0..n-1 so that picking the perfect pivot is free) at https://github.com/lorenzhs/quicksort-pivot-imbalance . However, choosing a bad pivot intentionally is not the conclusion, instead for certain kinds of data, you can do much better even in comparison-based sorting using (for example) SuperScalarSampleSort.

lorenzhs commented Jul 24, 2015

As @chuesler said, frequent branch mispredictions are very expensive. That is a very good observation that is valid well outside the field of tracing JITs. Avoiding (hard to predict) branches in inner loops and either making them unnecessary or replacing them with predicated instructions can yield enormous performance improvements.

You can easily see this with quicksort: during partitioning, in the theoretically ideal case (picking the median as pivot), each item has a 50-50 chance of going "left" or "right", which is the branch predictor's worst nightmare. Thus, you can make quicksort faster by intentionally choosing a bad pivot. I experimented with this (sorting a random distribution of 0..n-1 so that picking the perfect pivot is free) at https://github.com/lorenzhs/quicksort-pivot-imbalance . However, choosing a bad pivot intentionally is not the conclusion, instead for certain kinds of data, you can do much better even in comparison-based sorting using (for example) SuperScalarSampleSort.

@tuxology

This comment has been minimized.

Show comment
Hide comment
@tuxology

tuxology Jul 24, 2015

Indeed tracing JITs are quite intriguing than methods JITs (like eBPF which I study these days). For tracing JITs I found myself starting off with LuaJIT and going through DynASM implementation. A relevant link is this small discussion : http://www.freelists.org/post/luajit/How-does-LuaJITs-trace-compiler-work,1 which also discusses loops. The goals of JIT are the same as those of the compilers - whether tracing or method, hence they follow the suggested optimizations for underlying architectures.

tuxology commented Jul 24, 2015

Indeed tracing JITs are quite intriguing than methods JITs (like eBPF which I study these days). For tracing JITs I found myself starting off with LuaJIT and going through DynASM implementation. A relevant link is this small discussion : http://www.freelists.org/post/luajit/How-does-LuaJITs-trace-compiler-work,1 which also discusses loops. The goals of JIT are the same as those of the compilers - whether tracing or method, hence they follow the suggested optimizations for underlying architectures.

@lukego

This comment has been minimized.

Show comment
Hide comment
@lukego

lukego Jul 26, 2015

Owner

@orlp I am happy to focus on Intel CPUs now. It is a really exciting time. They are increasing the number of cores, increasing the power of each core with more execution units and exponentially more SIMD bandwidth, and adding exotic features like hardware transactional memory. Their documentation is excellent too. I can't wait for Skylake CPUs with 512-bit SIMD registers to hit the market later this year.

Owner

lukego commented Jul 26, 2015

@orlp I am happy to focus on Intel CPUs now. It is a really exciting time. They are increasing the number of cores, increasing the power of each core with more execution units and exponentially more SIMD bandwidth, and adding exotic features like hardware transactional memory. Their documentation is excellent too. I can't wait for Skylake CPUs with 512-bit SIMD registers to hit the market later this year.

@lukego

This comment has been minimized.

Show comment
Hide comment
@lukego

lukego Jul 26, 2015

Owner

@lorenzhs That's a really cool example :).

Owner

lukego commented Jul 26, 2015

@lorenzhs That's a really cool example :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment