Add first proper benchmarks #17

bbrehm · 2023-03-20T16:23:08Z

Some changes in graph accessors and schemagen. Add JMH benchmarks. Unify benchmarks into a single subproject.

The opsPerInvocation annoyed me enough to try to upstream it at openjdk/jmh#97

I think we now can start discussing about whether to try to devirtualize neighbor and edge access.

…ify benchmarks into a single subproject.

mpollmeier

devirtualize neighbor and edge access

can you please elaborate?

bbrehm · 2023-03-21T10:20:51Z

devirtualize neighbor and edge access

can you please elaborate?

Sure! Timings look like

Benchmark                                       (shuffled)  Mode  Cnt    Score    Error  Units
joernBench.JoernGenerated.orderSum                    true  avgt    5   82.620 ±  2.324  ns/op
joernBench.JoernGenerated.orderSum                   false  avgt    5   34.356 ±  1.342  ns/op
joernBench.JoernLegacy.orderSum                       true  avgt    5   58.692 ±  0.553  ns/op
joernBench.JoernLegacy.orderSum                      false  avgt    5   29.085 ±  0.511  ns/op
joernBench.Odb2Generated.orderSumDevirtualized        true  avgt    5   49.812 ±  2.499  ns/op
joernBench.Odb2Generated.orderSumDevirtualized       false  avgt    5    5.504 ±  0.082  ns/op
joernBench.Odb2Generated.orderSumVirtual              true  avgt    5   99.325 ±  2.615  ns/op
joernBench.Odb2Generated.orderSumVirtual             false  avgt    5   10.972 ±  0.449  ns/op

So we can have a 2x speedup from the "devirtualized" optimization. The slowdown for the unshuffled case is presumably due to preventing inlining. The slowdown for the shuffled case is presumably due to branch misses that mess up the pipeline (due to the long reorder buffer, a branch miss can be much more expensive than the 20 cycle sticker price). Some relevant perf stats:

Benchmark                                                               (shuffled)  Mode  Cnt     Score     Error   Units
joernBench.Odb2Generated.orderSumDevirtualized:branch-misses:u                true  avgt          0.008              #/op
joernBench.Odb2Generated.orderSumDevirtualized:branches:u                     true  avgt         13.411              #/op
joernBench.Odb2Generated.orderSumDevirtualized:instructions:u                 true  avgt         78.990              #/op
joernBench.Odb2Generated.orderSumVirtual:branch-misses:u                      true  avgt          0.870              #/op
joernBench.Odb2Generated.orderSumVirtual:branches:u                           true  avgt         34.907              #/op
joernBench.Odb2Generated.orderSumVirtual:instructions:u                       true  avgt        174.542              #/op
joernBench.Odb2Generated.orderSumVirtual:branch-misses:u                     false  avgt          0.004              #/op
joernBench.Odb2Generated.orderSumVirtual:branches:u                          false  avgt         33.225              #/op
joernBench.Odb2Generated.orderSumVirtual:instructions:u                      false  avgt        170.058              #/op

In other words, we see that the devirtualized variant has no branch misses, even in the shuffled case; and due to inlining it can run with only ~80 instructions. In contrast, the virtual variant has a megamorphic call site that prevents inlining and balloons instruction count to ~170; more to the point, it has ~0.9 mis-predicted branches. These are indirect calls (i.e. via branch target buffer, not via branch history table) which is even worse. Unfortunately perfasm output doesn't do what I want for this specific case (I'm probably holding it wrong?).

The benchmark in question looks like

  @Benchmark
  def orderSumVirtual(blackhole: Blackhole): Int = {
    var sumOrder = 0
    for (node <- nodeStart.iterator.asInstanceOf[Iterator[v2.nodes.AstNode]]) {
      sumOrder += node.order
    }
    if (blackhole != null) blackhole.consume(sumOrder)
    sumOrder
  }

  @Benchmark
  def orderSumDevirtualized(blackhole: Blackhole): Int = {
    var sumOrder = 0
    val prop     = nodeStart.head.graph.schema.getPropertyIdByLabel("ORDER")
    for (node <- nodeStart) {
      sumOrder += odb2.Accessors.getNodePropertySingle(node.graph, node.nodeKind, prop, node.seq(), -1)
    }
    if (blackhole != null) blackhole.consume(sumOrder)
    sumOrder
  }

The actual code for .order is similar to what we do in odbv1:

trait AstNode extends StoredNode with AstNodeBase {
  def order: Int
...
}
class Annotation(graph_4762: odb2.Graph, seq_4762: Int) extends StoredNode(graph_4762, 0.toShort, seq_4762) with Expression {
  def order: Int                   = odb2.Accessors.getNodePropertySingle(graph, nodeKind, 41, seq, -1: Int)
...
}

This means that we need to do an invokeinterface in order to grab the property (invokevirtual would be almost as bad).

I'd like to note that the implementation of order is completely independent of the node type. But our class hierarchy contains diamonds where:

Two abstract node types provide the same property
There exist concrete node types that derive from both, leading to a potential conflict.

some changes in graph accessors and schemagen. Add JMH benchmarks. Un…

295bf1b

…ify benchmarks into a single subproject.

bbrehm requested review from ml86 and mpollmeier March 20, 2023 16:25

mpollmeier approved these changes Mar 21, 2023

View reviewed changes

bbrehm merged commit e73cb77 into master Mar 21, 2023
1 check passed

delete-merged-branch bot deleted the bbrehm/moreBench branch March 21, 2023 16:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add first proper benchmarks #17

Add first proper benchmarks #17

bbrehm commented Mar 20, 2023

mpollmeier left a comment

bbrehm commented Mar 21, 2023

Add first proper benchmarks #17

Add first proper benchmarks #17

Conversation

bbrehm commented Mar 20, 2023

mpollmeier left a comment

Choose a reason for hiding this comment

bbrehm commented Mar 21, 2023