As we know, division is much slower than multiplication. The algorithm for dividing a 128-bit value by a 64-bit value in math/bits (Div64), which affects the efficiency of nat division in math/big, requires two divisions. Replacing that division with a multiplication by a precomputed 2/1 (128 bits divided by 64 bits) multiplicative inverse can speed up the divWVW algorithm by a factor of three, and it speeds up nat division as well.
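To make the 2/1 idea concrete, here is a minimal sketch, not the submitted change itself. It assumes 64-bit words, and the names `reciprocal` and `divWithReciprocal` are hypothetical. The inverse is computed once per divisor (this is the only hardware division), and each later 128-by-64 division uses two multiplications plus at most two corrections.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reciprocal returns floor((2^128-1)/d) - 2^64, where d is y shifted left
// until its top bit is set. y must be nonzero. (Hypothetical helper name.)
func reciprocal(y uint64) uint64 {
	d := y << uint(bits.LeadingZeros64(y))
	// (2^128-1) - 2^64*d has high word ^d and low word 2^64-1, and ^d < d
	// because the top bit of d is set, so Div64 cannot overflow here.
	m, _ := bits.Div64(^d, ^uint64(0), d)
	return m
}

// divWithReciprocal divides the 128-bit value x1:x0 by y using the
// precomputed inverse m = reciprocal(y). It requires x1 < y.
// (Hypothetical helper name.)
func divWithReciprocal(x1, x0, y, m uint64) (q, r uint64) {
	// Normalize so that the divisor has its top bit set.
	s := uint(bits.LeadingZeros64(y))
	if s != 0 {
		x1 = x1<<s | x0>>(64-s)
		x0 <<= s
		y <<= s
	}
	// Quotient estimate: high word of (m + 2^64)*x1 + x0.
	// The estimate is too small by at most 2.
	t1, t0 := bits.Mul64(m, x1)
	_, c := bits.Add64(t0, x0, 0)
	t1, _ = bits.Add64(t1, x1, c)
	q = t1
	// Remainder for the estimate: x - y*q, kept as two words r1:r0.
	dq1, dq0 := bits.Mul64(y, q)
	r0, b := bits.Sub64(x0, dq0, 0)
	r1, _ := bits.Sub64(x1, dq1, b)
	// Correct the estimate at most twice.
	if r1 != 0 {
		q++
		r0 -= y
	}
	if r0 >= y {
		q++
		r0 -= y
	}
	return q, r0 >> s
}

func main() {
	y := uint64(1000000007)
	m := reciprocal(y) // computed once, reused for every division by y
	x1, x0 := uint64(123456789), uint64(0xfedcba9876543210)
	q, r := divWithReciprocal(x1, x0, y, m)
	wantQ, wantR := bits.Div64(x1, x0, y)
	fmt.Println(q == wantQ, r == wantR)
}
```

The benefit only shows up when the same divisor is reused, as in a loop that divides a whole vector of words by one word, because the cost of computing the inverse is then paid once.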
I have submitted a code change that uses the 2/1 multiplicative inverse, together with the corresponding benchmark data, in Gerrit. If possible, using a 3/2 (192 bits divided by 128 bits) multiplicative inverse and restructuring the nat division code would give an even larger improvement.