j-g-faustus edited this page Sep 14, 2010 · 9 revisions

Numeric performance in Clojure

This is a Clojure implementation of the Computer Language Shootout benchmark n-body.

The benchmark is a number-crunching physics simulation: it calculates the motion of a few planets orbiting the Sun.

I made two versions: one for Clojure 1.1 and one for Clojure 1.2.

Timing them on my laptop with the Java program as the baseline, this benchmark says that

  • the Clojure 1.1 version is
    • ~4x slower than Java/C/Scala
    • an order of magnitude quicker than Python/Ruby
  • the Clojure 1.2 version is more than twice as fast as the 1.1 version, clocking in at
    • ~1.7x (70%) slower than Java
    • It is also more readable and ~20% shorter than the 1.1 version.

Clojure 1.2 is not released yet, so any benchmark is preliminary.

Performance observations

  • Clojure 1.2 is not quicker when running the 1.1 program unchanged
  • The entire 1.2 speedup stems from using the new 1.2 constructs definterface and deftype
  • deftype compiles to low-level Java bytecode, and with primitive mutable fields you can get performance close to Java’s.
    • The remaining overhead relative to Java comes from method calls.
    • In my implementation each planet is a deftype instance, and the calculation involves updating speed and position for all pairs of planets.
    • Unlike Java, all Clojure mutable fields are private, so I cannot update the values for two planets simultaneously without the overhead of a method call.
  • “Fast Clojure” is non-idiomatic in the sense that it gains speed by using mutable values.
    • Whereas Clojure’s primary model is functional programming where everything is immutable.
    • It even comes with a warning flag in the documentation: Do not use if you don’t know the difference between Java synchronized and volatile :)
  • But the same can be said for the Fast Java implementation from the shootout
    • It gains speed by not using any of the “advanced features” of Java.
    • All classes are final, objects are just primitive fields (i.e. no methods) similar to a C struct, and almost all calculations are done in a single loop.
    • That is, speed in an object oriented language is gained by not using object orientation.
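
To make the deftype/definterface approach concrete, here is a minimal sketch in the style of the 1.2 version. The interface, type, and field names are made up for illustration (they are not the benchmark’s actual names), and the metadata syntax is that of released Clojure:

```clojure
;; Illustrative sketch, not the benchmark source: an interface plus
;; a deftype with a mutable primitive field.
(definterface IBody
  (^double getX [])
  (moveX [^double dx]))

(deftype Body [^:unsynchronized-mutable ^double x]
  IBody
  (getX [_] x)
  ;; set! works on the type's own mutable fields, but only from inside
  ;; a method of the type -- hence the method-call overhead discussed above.
  (moveX [_ dx] (set! x (+ x dx))))

(def b (Body. 0.0))
(.moveX b 1.5)
(.getX b)  ; => 1.5
```

Because the mutable field is private to the type, any update from the outside has to go through a method such as `moveX`; that is exactly the call overhead the benchmark cannot eliminate.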

All in all, I’m very happy with the 1.2 numeric performance as of today. I’m also happy with the fact that there are far fewer hoops to jump through to get high-performance numerics than in 1.1.

I think Clojure 1.2 as of today supports the Lisp spirit: Implement quickly and/or prettily to the point where it works, and then optimize only the parts where the performance benefits are big enough to make it worthwhile.

You can’t do that without a profiler; I’m using Java VisualVM.
There are some other alternatives, though I haven’t tried them.

Clojure 1.2 “fast numerics”

There is work in progress on faster numerics for 1.2.

These changes were not part of this test. My understanding is that this work will probably not increase the speed attainable by optimized code like the 1.2 benchmark code here, but it might speed up the 1.1 version or allow 1.1-level optimizations with less developer effort.

If you are interested in the details, check out

Profiling

Clojure 1.1

See the 1.1 call tree profile
(The wiki apparently does not allow me to inline images.)

  • Close to the entire runtime, 99.8%, comes from the advance function.
    • Initialization and calculating the energy of the state is trivial in comparison, so there’s no need to optimize those.
  • intCast(Object) takes 5% of the time.
    • It would be nice to make it go away, but I was not able to do so.
    • I think it’s done internally by the aget function: It converts array indices to int by a cast even when they are numeric literals like ‘0’ and ‘1’, and there is no way to bypass that.
    • I was able to turn intCast(Object) into intCast(int), but the timings actually got worse, so I left it as is.
  • Apart from that, the time is consumed by primitive operations – retrieving and calculating values.
    • This means that there are no obvious optimizations left; we are already as primitive as we can go.
  • Inlining is not always quicker.
    • I tried inlining the advance function into the main function, but timings got worse.
    • I don’t know why, but one can always speculate that it relates to cache contention or CPU registers or something like that :) In any case, it’s beyond the reach of optimization in a JVM language.
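
For reference, the 1.1-style inner loops work directly on primitive double arrays with type hints. A hedged sketch (illustrative, not the benchmark source; written in current metadata syntax, whereas 1.1-era code used the `#^` reader form):

```clojure
;; Illustrative 1.1-style array crunching, not the benchmark source.
;; The ^doubles hints keep aget working on primitive double arrays,
;; but the index arguments may still pass through an int cast
;; internally, which is the intCast overhead seen in the profile.
(defn dist-squared [^doubles p1 ^doubles p2]
  (let [dx (- (aget p1 0) (aget p2 0))
        dy (- (aget p1 1) (aget p2 1))
        dz (- (aget p1 2) (aget p2 2))]
    (+ (* dx dx) (* dy dy) (* dz dz))))

(dist-squared (double-array [0.0 0.0 0.0])
              (double-array [3.0 4.0 0.0]))  ; => 25.0
```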

Clojure 1.2

See the 1.2 call tree profile

  • Again pretty much the entire runtime is taken by advance
    • No point in optimizing anything else
  • Curious method names like v_PLUS__BANG_ are an artifact of Clojure’s name munging:
    • Clojure supports names like v+! while Java doesn’t,
    • so rewrite rules translate the Clojure name into a legal Java name.
    • At the moment you have to use v_PLUS__BANG_ in Clojure definterface and v+! in deftype, Clojure does automatic rewriting in one case but not in the other.
    • I assume this inconsistency will disappear before 1.2 is released.
  • Obvious improvements would be to
    • get rid of the method call overhead for v_PLUS__BANG_ and p_dt_BANG_
    • replace them with primitive operations
    • But this is not easily done as mutable fields are private and cannot be changed from outside an object.
  • Still, this version is >2x faster and ~20% less code than the 1.1 version.
    • On balance, I felt that this was fast enough and not worth cluttering up the code for relatively small further improvements.
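
The munged names seen in the profiler can be reproduced with clojure.core/munge, which maps Clojure’s extra identifier characters onto JVM-legal names:

```clojure
;; clojure.core/munge shows how Clojure names map to JVM-legal names:
(munge "v+!")   ; => "v_PLUS__BANG_"
(munge "p-dt!") ; => "p_dt_BANG_"
```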

JVM version and flags

Timings were done on a MacBook Pro with the default OS X JVM, using the flags “-Xms256M -Xmx256M”; the same settings were used for both Clojure and Java.

The -server flag made little difference, so I left it out. Specifying a larger-than-default amount of memory does make a difference, ~10% in my case.

[pro ~]$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)