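The compiler currently compiles a+b+c+d as a+(b+(c+d)). It should use (a+b)+(c+d) instead: a+b and c+d do not depend on each other, so an out-of-order CPU can execute them in parallel, which shortens the dependency chain from three additions to two.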
More broadly, we should balance trees of associative computation.
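To make the two shapes concrete, here is a minimal sketch at the source level. The function names `sumChain` and `sumBalanced` are hypothetical, and the parentheses only spell out the intended tree: the compiler is free to reassociate integer addition itself, so this illustrates the dependency structure rather than guaranteeing what code is emitted.

```go
package main

import "fmt"

// sumChain spells out the serial shape: each addition needs the result of
// the one before it, so the dependency chain is three additions long.
func sumChain(a, b, c, d uint32) uint32 {
	return a + (b + (c + d))
}

// sumBalanced spells out the balanced shape: a+b and c+d are independent
// of each other, so the dependency chain is two additions long.
func sumBalanced(a, b, c, d uint32) uint32 {
	return (a + b) + (c + d)
}

func main() {
	// Unsigned integer addition wraps around and is associative, so both
	// shapes agree even when an intermediate sum overflows.
	a, b, c, d := uint32(0xffffffff), uint32(2), uint32(3), uint32(4)
	fmt.Println(sumChain(a, b, c, d) == sumBalanced(a, b, c, d)) // prints "true"
}
```

The rewrite is only safe because the operation is associative for the type in question, which is why the example uses uint32.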
Doing this in the compiler with a rule tailored for a single computation type in round 4 of md5block.go yielded a 15% throughput improvement. (See below for details.)
It's not obvious to me whether we can do this with carefully crafted rewrite rules or whether a dedicated pass would be better. But it looks like there may be significant performance wins available on tight mathematical code.
I don't plan to work on this further, but I really hope someone else picks it up.
@seebs the SSA backend has different ops for different data types exactly to avoid this kind of issue. (We have @dr2chase to thank for that.) Relatedly, this kind of optimization is also not appropriate for adding pointers, as intermediate results may be invalid pointers.
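A rough illustration of the pointer concern, written at the source level with `unsafe.Add` (Go 1.17 or later). The helper `elemAt` and the buffer are made up for this example; it is a sketch of why the grouping matters, not of how the compiler represents pointer arithmetic internally.

```go
package main

import (
	"fmt"
	"unsafe"
)

// elemAt is a hypothetical helper that reads buf[i-j] via raw pointer
// arithmetic. The caller must ensure 0 <= i-j < 4.
func elemAt(buf *[4]int32, i, j int) int32 {
	size := int(unsafe.Sizeof(buf[0]))

	// Valid shape: base + (i-j). The integer part is folded first, so the
	// only pointer ever formed points inside buf.
	p := unsafe.Add(unsafe.Pointer(buf), (i-j)*size)

	// Reassociated shape, deliberately not executed: (base + i) - j.
	// With i = 5 the intermediate base+i lies past the end of buf, which
	// is not a valid pointer under the unsafe.Pointer rules.
	return *(*int32)(p)
}

func main() {
	buf := [4]int32{10, 20, 30, 40}
	fmt.Println(elemAt(&buf, 5, 3)) // reads buf[2]
}
```

Both groupings compute the same final address, but only the first one avoids ever holding an out-of-range pointer.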