Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: timediv should use |= instead of += since we are adding bitsets to an originally cleared value #27529

Closed
odeke-em opened this issue Sep 6, 2018 · 13 comments

Comments

Projects
None yet
7 participants
@odeke-em
Copy link
Member

commented Sep 6, 2018

Right now just before bed, I have been studying runtime/os_darwin.go in relation to https://go-review.googlesource.com/c/go/+/133655 to try to write a regression/end-to-end test after that CL is merged.

I noticed that in the code for timediv

go/src/runtime/runtime1.go

Lines 414 to 421 in 22afb35

func timediv(v int64, div int32, rem *int32) int32 {
res := int32(0)
for bit := 30; bit >= 0; bit-- {
if v >= int64(div)<<uint(bit) {
v = v - (int64(div) << uint(bit))
res += 1 << uint(bit)
}
}

we try to reduce the original timeunit by at most ((1<<31)-1) = 0x7FFFFFFF if (v >= conversionUnit<<bits)
where:

  • bits is in the reverse domain of [0, 30]
  • conversionUnit is "div" aka the number of units converting from the original to the target unit e..g for ns to ms conversion, div = 1e6, ns to us, div = 1e3

In particular notice that we are doing a += for values that are essentially bit sets for a value that started at 0

res += 1 << uint(bit)

i.e. we are only doing power of 2 increments/toggling bits.

We could replace that by |= and that code should be faster by a couple of nanoseconds and to justify the change:

  • That code is used is converting from nanotime() ns so at most ((1<<31)-1) * 1e9 ns (since max epoch second = (1<<31)-1) which by any measure will mean that every single invocation of timediv
    with a system nanotime that began later than September 8th 2001 ~7PM, will run through that loop at least once -- from ns to at best s, with smaller units it is even more guaranteed to run since (1e9<<30) (div=1e9, ns to s)

I believe that we could shave off a couple of nanoseconds and take advantage of registers.
My simple & dirty benchmark at https://play.golang.org/p/tWy9l5oH0tG when run shows an 18% decrement in time

name  old time/op  new time/op  delta
-8    37.7ns ± 2%  30.7ns ± 4%  -18.37%  (p=0.008 n=5+5)

timediv is in a very hot path throughout Go and the ecosystem; when we perform any semaphore operations e.g. while locking and unlocking a mutex on many OSes, of which Go's bread and butter for dealing with concurrent accesses are mutex operations and those are also used all around the standard library.

I am filing this issue because the comment for that function says that is is a very special function

go/src/runtime/runtime1.go

Lines 410 to 411 in 22afb35

// This is a very special function, do not use it if you are not sure what you are doing.
// int64 division is lowered into _divv() call on 386, which does not fit into nosplit functions.

and I also wonder if perhaps there was an intention of using +=?

Please feel free to correct me or provide some ideas, my apologies for any mistakes #~4AMChecks. Thank you!

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2018

Looks like |= is faster on amd64 because it replaces mov; shl; add with bts. On 386 we don't use bts, so both versions are about the same speed. I can't think of any reason why |= would be slower than += on any plausible hardware, so the change seems reasonable to me. Thanks.

@bcmills bcmills added this to the Unplanned milestone Sep 6, 2018

@rasky

This comment has been minimized.

Copy link
Member

commented Sep 6, 2018

If timediv is just a 64/32=32 division, it should probably be converted to use math/bits.Div32 once CL123157 goes in. That would speed it up a lot more than just using |= instead of +=, on both 32-bit and 64-bit platforms.

@odeke-em

This comment has been minimized.

Copy link
Member Author

commented Sep 10, 2018

If timediv is just a 64/32=32 division, it should probably be converted to use math/bits.Div32 once CL123157 goes in. That would speed it up a lot more than just using |= instead of +=, on both 32-bit and 64-bit platforms.

Thanks for the chime in @rasky! Interesting and cool that bits.Div* is being added. Perhaps I am misunderstanding how the new API will be used but I think it'll involve a bit of retrofitting given a few restrictions in the doc comments. I just skimmed real quick over it and thought to plugin and play one of the sanity checks in the runtime

if timediv(12345*1000000000+54321, 1000000000, &e) != 12345 || e != 54321 {

but the results are unexpected

package main_test

import (
	"math/bits"
	"testing"
)

func TestDiv(t *testing.T) {
	v := int64(12345*1000000000+54321)
	div := int32(1000000000)

	hi, lo := uint(v>>32), uint(v&0xFFFFFFFF)
	quo, rem := bits.Div(hi, lo, uint(div))
	if g, w := quo, uint(12345); g != w {
		t.Errorf("Quotient:\n\tgot  %d (0x%X)\n\twant %d (0x%X)", g, g, w, w)
	}
	if g, w := rem, uint(54321); g != w {
		t.Errorf("Remainder :\n\tgot  %d (0x%X)\n\twant %d (0x%X)", g, g, w, w)
	}
}

with results

$ go test -run=TestDiv
--- FAIL: TestDiv (0.00s)
    div_test.go:15: Quotient:
        	got  53015942467842 (0x3037BC6B1102)
        	want 12345 (0x3039)
    div_test.go:18: Remainder :
        	got  515390001 (0x1EB83A31)
        	want 54321 (0xD431)
FAIL
exit status 1
FAIL	_/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/runtime-timediv	0.005s

With my point being that: perhaps let's file a separate bug as a reminder to retrofit "math/bits".Div into the runtime once https://golang.org/cl/123157 is landed as you suggested. For now I think we can easily address using |= :)

@gopherbot

This comment has been minimized.

Copy link

commented Sep 10, 2018

Change https://golang.org/cl/134231 mentions this issue: runtime: convert initial timediv quotient increments to bitsets

@cherrymui

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2018

We should not use timediv on 64-bit machines. We should just use the division operator, as os_linux.go does.

@rasky

This comment has been minimized.

Copy link
Member

commented Sep 10, 2018

@odeke-em which arch are you testing that on? Try using Div32 instead of Div to make it behave the same on both 32-bit and 64-bit archs (your code would fail on 64 bit because of the right shift sign extending the 64 bit input).

@cherrymui using bits.Div is a portable way of doing a 64 bit division without two different code paths

@odeke-em

This comment has been minimized.

Copy link
Member Author

commented Sep 10, 2018

@rasky I was on AMD64, Darwin. Thank you for the tip of switching to Div32, that runs correctly and is magnitudes faster than all the others in deed

$ go test -v -run=^$ -bench=As
goos: darwin
goarch: amd64
BenchmarkAsPlus-8   	50000000	        27.8 ns/op
BenchmarkAsOr-8     	50000000	        26.0 ns/op
BenchmarkAsDiv-8    	2000000000	         0.39 ns/op
PASS
ok  	_/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/runtime-timediv	3.568s
@cherrymui

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2018

I'm not sure we want the runtime package to import math/bits, although it is probably a possibility.

And, with the current implementation in CL 123157, it probably doesn't work. math/bits.Div32 is implemented simply with 64-bit division. On 32-bit systems, this lowers to runtime.uint64div. As the comment in timediv says, this may not fit into nosplit stack limit (maybe it does now, I don't know, at least it didn't when timediv was introduced). And not all 32-bit machines have 64-bit-divided-by-32-bit intrinsic. Of course we could probably have a different implementation of Div32 that uses very small stack space.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2018

@odeke-em : That last benchmark result looks like the entire operation was optimized away. It's unlikely that the chip is doing a divide in less than a nanosecond.
(Or maybe it's just really pipelined?)

@cherrymui

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2018

If 64-bit division now fits into nosplit stack limit on all machines, I suggest changing back to using the division operator.

@odeke-em

This comment has been minimized.

Copy link
Member Author

commented Sep 10, 2018

@odeke-em : That last benchmark result looks like the entire operation was optimized away.

@randall77 aha, yeah it looked too good to be true, but alas seems to actually be running alright

package main

import (
        "math/bits"
        "testing"
)

func BenchmarkAsDiv(b *testing.B) {
        v := int64(12345*1000000000 + 54321)
        div := int32(1000000000)

        var quo, rem uint32

        for i := 0; i < b.N; i++ {
                hi, lo := uint32(v>>32), uint32(v&0xFFFFFFFF)
                quo, rem = 0, 0
                quo, rem = bits.Div32(hi, lo, uint32(div))
        }

        if g, w := quo, uint32(12345); g != w {
                b.Fatalf("Got %d\nWant %d", g, w)
        }
        if g, w := rem, uint32(54321); g != w {
                b.Fatalf("Got %d\nWant %d", g, w)
        }

}

still giving

$ go test -run=$^ -bench=As
goos: darwin
goarch: amd64
BenchmarkAsPlus-8   	50000000	        24.5 ns/op
BenchmarkAsOr-8     	100000000	        22.8 ns/op
BenchmarkAsDiv-8    	2000000000	         0.33 ns/op
PASS
ok  	_/Users/emmanuelodeke/Desktop/openSrc/bugs/golang/runtime-timediv	4.252s
$ uname -a
Darwin Emmanuels-MacBook-Pro-2.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
@rasky

This comment has been minimized.

Copy link
Member

commented Sep 10, 2018

I'm not sure we want the runtime package to import math/bits, although it is probably a possibility.

We'll have to figure this out, as there are many other points where runtime could benefit from math/bits operations; there are for instances some multiplication overflow checks that could be simplified and made faster using bits.Mul.

And, with the current implementation in CL 123157, it probably doesn't work.

Right, I was implying that an intrinsification CL would eventually land (like CL129415, but for 32-bit archs).

And not all 32-bit machines have 64-bit-divided-by-32-bit intrinsic.

Oh that's right, I guess ARM doesn't.

@gopherbot gopherbot closed this in 178a609 Sep 12, 2018

@odeke-em

This comment has been minimized.

Copy link
Member Author

commented Sep 12, 2018

@rasky would you mind perhaps filing a reminder issue to examine what we can do around the standard library with math/bits.Div* when the CL lands? I ask because that conversation dominated this issue(which is now closed), but requires more thoughts and investigation, but also the conversations are very valuable and would be nice to keep it going on a fresh issue. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.