Naively benchmarks atomic increment on Java's AtomicLongArray against a custom implementation.
ant run
Currently only works on Linux/amd64.
Modify build.xml if you don't have a JVM at /usr/lib/jvm/java-7-openjdk-amd64.
The custom implementation is consistently 4-6 times faster with 4 threads doing 1 million increments each. The JIT is, I believe, properly warmed up for both. Both implementations are (unsurprisingly) hundreds of times slower in this test than a non-atomic increment on a long[].
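The AtomicLongArray side of that measurement can be sketched roughly as below. This is a minimal illustration, not the harness in this repository: the class name, method name, and the five warm-up rounds are assumptions made for the example, and the custom native implementation is not shown.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch; names and warm-up count are illustrative assumptions.
public class AtomicIncrementSketch {

    // Starts `threads` workers that each perform `perThread` atomic increments
    // on slot 0 of a size-1 array, waits for them, and returns the final value.
    static long runAtomic(int threads, final int perThread) {
        final AtomicLongArray counters = new AtomicLongArray(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        counters.incrementAndGet(0);
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counters.get(0);
    }

    public static void main(String[] args) {
        // Untimed rounds to give the JIT a chance to compile the hot loop.
        for (int i = 0; i < 5; i++) {
            runAtomic(4, 1000000);
        }
        long start = System.nanoTime();
        long total = runAtomic(4, 1000000);
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println("total=" + total + " elapsedMs=" + elapsedMs);
    }
}
```

The returned total doubles as a correctness check: if any increment were lost, it would be lower than threads times perThread.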
Caveat: the benchmark currently only tests arrays of size 1. This probably does not matter.
Repeatedly fetching a thread-local long[] and incrementing it was much faster than either form of atomic increment, but about 30 times slower than incrementing a plain long[] directly.
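For illustration, the thread-local variant might look something like the sketch below. It assumes "thread-local" means java.lang.ThreadLocal with a lookup on every iteration. The class and method names are hypothetical and not taken from this repository.

```java
// Hypothetical sketch; assumes the thread-local array is held in a java.lang.ThreadLocal.
public class ThreadLocalCounterSketch {

    // Each thread lazily gets its own single-element array.
    private static final ThreadLocal<long[]> LOCAL = new ThreadLocal<long[]>() {
        @Override
        protected long[] initialValue() {
            return new long[1];
        }
    };

    // Fetches the thread-local array on every iteration, then increments slot 0.
    static long incrementViaThreadLocal(int n) {
        LOCAL.get()[0] = 0; // reset so repeated calls on one thread start from zero
        for (int i = 0; i < n; i++) {
            LOCAL.get()[0]++;
        }
        return LOCAL.get()[0];
    }

    // Baseline: increments a plain local array with no lookup and no atomicity.
    static long incrementPlain(int n) {
        long[] counters = new long[1];
        for (int i = 0; i < n; i++) {
            counters[0]++;
        }
        return counters[0];
    }
}
```

No synchronization is needed in either method because each array is only ever touched by one thread. The extra cost in the first method comes from the ThreadLocal lookup inside the loop.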
The custom implementation uses amd64's atomic increment instruction, lock; incq (%r).
Java's implementation uses a compare-and-swap loop and has significantly more method call indirection, though the JIT might inline some of that away.
Both implementations range check their indexes.
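The compare-and-swap loop mentioned above has roughly the following shape. This is a sketch written against AtomicLongArray's public get and compareAndSet methods, not a copy of the JDK source. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of an increment built from a compare-and-swap retry loop.
public class CasLoopSketch {

    // Reads the current value, tries to swap in current + 1, and retries
    // if another thread changed the slot in between. Returns the new value.
    static long casIncrement(AtomicLongArray array, int index) {
        while (true) {
            long current = array.get(index);
            long next = current + 1;
            if (array.compareAndSet(index, current, next)) {
                return next;
            }
        }
    }
}
```

Under contention the loop may spin several times per increment, whereas a single locked increment instruction always completes in one attempt.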