/ go Public

# math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable#27892

Closed
opened this issue Sep 27, 2018 · 6 comments
Closed

# math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable #27892

opened this issue Sep 27, 2018 · 6 comments
Labels

### odeke-em commented Sep 27, 2018

I was just studying some Mathematics and while examining some bit patterns thought that perhaps Go could use some optimizations if not already there(we don't have them in). The optimization is that we can return 1 ASAP if OnesCount* gets a power of 2. I checked through the bits.go code and didn't see such a condition.
A power of 2 is a value with only 1 set bit due to the incremental left shift of 1 0b1 and hence the reason why to detect a power of 2 we do: `x&(x-1) == 0`

Anyways, running some benchmarks on my Macbook pro show massive improvements for powers of 2 even with a branch to detect if powers of 2 ~29% speed up

### Source code

Throwing in a mix of values

```package main_test

import (
"math/bits"
"testing"
)

var tests = []struct {
v    uint64
want int
}{
{1, 1}, {2, 1}, {3, 2}, {4, 1}, {5, 2}, {9, 2}, {8, 1}, {16, 1},
{32, 1}, {64, 1}, {127, 7}, {256, 1}, {65536, 1},
{4294967296, 1}, {4294967295, 32},
}

var sink int

func BenchmarkCustomOnesCount(b *testing.B) {
runBenchmarks(b, customOnesCountForPowerOf2)
}

func BenchmarkStdlibOnesCount(b *testing.B) {
runBenchmarks(b, bits.OnesCount64)
}

func runBenchmarks(b *testing.B, fn func(uint64) int) {
b.ResetTimer()
for i := 0; i < b.N; i++ {
for _, tt := range tests {
sink = fn(tt.v)
if sink != tt.want {
b.Errorf("%d: got %d want %d", tt.v, sink, tt.want)
}
}
}
if sink == 0 {
b.Fatal("Did not run the benchmark!")
}
}

func customOnesCountForPowerOf2(x uint64) int {
// Simulate absolute branch since powerOf2 check.
if x > 0 && x&(x-1) == 0 {
return 1
}
const m = 1<<64 - 1
x = x>>1&(m0&m) + x&(m0&m)
x = x>>2&(m1&m) + x&(m1&m)
x = (x>>4 + x) & (m2 & m)
x += x >> 8
x += x >> 16
x += x >> 32
return int(x) & (1<<7 - 1)
}

// Inlining these constants code here to show what
// the code for OnesCount64 will look like after the alteration.
const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
const m3 = 0x00ff00ff00ff00ff // etc.
const m4 = 0x0000ffff0000ffff```

### Benchmark results

```goos: darwin
goarch: amd64
BenchmarkCustomOnesCount-8   	30000000	        42.1 ns/op
BenchmarkCustomOnesCount-8   	30000000	        41.3 ns/op
BenchmarkCustomOnesCount-8   	30000000	        41.5 ns/op
BenchmarkCustomOnesCount-8   	30000000	        42.7 ns/op
BenchmarkCustomOnesCount-8   	30000000	        42.8 ns/op
BenchmarkStdlibOnesCount-8   	30000000	        57.9 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        61.2 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        56.6 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        59.6 ns/op
BenchmarkStdlibOnesCount-8   	20000000	        59.5 ns/op```

Giving

```\$ benchstat before after
name               old time/op  new time/op  delta
CustomOnesCount-8  59.0ns ± 4%  42.1ns ± 2%  -28.63%  (p=0.008 n=5+5)```

If this doesn't make sense, please feel free to close it. I just thought that it would useful to mention.

### ALTree commented Sep 27, 2018 • edited

 On most amd64 machines, OnesCount is intrinsified into a POPCNT and it generally take less than a nanosecond ``````\$ cd go/src/math/bits \$ go test -run=XX -bench=BenchmarkOnesCount64 BenchmarkOnesCount64-2 2000000000 0.77 ns/op PASS ok math/bits 1.743s `````` We even have a test for this: https://github.com/golang/go/blob/master/test/codegen/mathbits.go#L111 I think the way you wrote the benchmark defeated the optimization (either that, or you're actually measuring your benchmarking setup).

### ALTree commented Sep 27, 2018

 Obviously the optimization you suggested would still benefit architectures where we don't intrinsify OnesCount, my point is just that we likely wouldn't see any benefit on amd64.

### odeke-em commented Sep 27, 2018

 @ALTree in deed, I had only examined that code once ever :) and was wondering why pop count wasn't being used. Oh yes, I see your point as bits.OnesCount64 is used as an argument which for sure defeats the optimization on machines with POPCNT*

changed the title math/bits: OnesCount* should return 1 ASAP for powers of 2 if the price of branching is acceptable math/bits: OnesCount* if non-instrinsified should return 1 ASAP for powers of 2 if the price of branching is acceptable Sep 27, 2018

### cznic commented Sep 27, 2018

 Also note that the OP's benchmark measures mostly the happy case (10 out of 15 cases), but there are only 64 powers of two out of 2^64 numbers. That means the optimization can make the performance actually get actually worse for a random input because of the added test.

### odeke-em commented Sep 27, 2018

 Yes in deed @cznic, great point I think the common case of non-powers beats the esoteric case of just 64 powers of 2. I found the need for this while working on a fast non-intrinsic square rooting algorithm but since the proposal here is marginally applicable useful in the common case, I'll rescind this issue. Thank you all for the input and sorry for the noise.

locked and limited conversation to collaborators Sep 27, 2019