Inefficient integer multiplications #16
An ASIC will be about 4 times more efficient on these two operations, because a and b are 32-bit integers:
32-bit integer multiplications are inefficient on GPUs because GPUs only have a 24-bit-wide data path for multiplication. A 32-bit MUL is 4 times slower than a 24-bit MUL. It's better to use mul24 here.
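For reference, here is a sketch in plain C of what mul24 semantics look like: only the low 24 bits of each operand feed the multiplier, and the low 32 bits of the product come back. This models the behavior of CUDA's `__umul24` / OpenCL's `mul24`; it is a host-side reference, not GPU code.

```c
#include <stdint.h>

/* Reference model of mul24: only the low 24 bits of each operand are
 * multiplied; the low 32 bits of the (up to 48-bit) product are kept. */
static uint32_t mul24_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu));
}
```

The result matches a full 32-bit multiply only when both operands already fit in 24 bits, which is why it can't be dropped into ProgPoW's 32-bit math without extra emulation code.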
Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.
This is an intentional tradeoff for simplicity. Note that for mining, latency doesn't matter; it's all about throughput.
AMD Vega and Polaris have V_MUL_LO_U32 and V_MUL_U32_U24. The 32-bit multiply has 4x the latency of the 24-bit multiply. I haven't been able to find what the throughput difference is.
Nvidia Pascal has XMAD, a 16-bit-input, 32-bit-output multiply-add. A three-instruction sequence is used to do a full 32-bit multiply:
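The shape of that sequence can be sketched in C: split each operand into 16-bit halves and combine three 16x16 multiply-adds. This mirrors the decomposition the compiler emits, not the literal SASS encoding.

```c
#include <stdint.h>

/* A 32-bit multiply built from 16x16->32 multiplies, mirroring the
 * three-XMAD decomposition on Pascal. Only the low 32 bits of the
 * product are kept; the ahi*bhi term would land entirely above bit 31. */
static uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t alo = a & 0xFFFFu, ahi = a >> 16;
    uint32_t blo = b & 0xFFFFu, bhi = b >> 16;

    uint32_t cross = alo * bhi + ahi * blo; /* two cross-term mul-adds */
    return alo * blo + (cross << 16);       /* low term plus shifted merge */
}
```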
There's no full-speed, single-instruction common denominator between the two. Doing either 24-bit or 16-bit multiplies would be full speed on one architecture and require extra emulation code on the other. Using normal 32-bit multiplies is a good compromise that's simple and works reasonably well on both architectures.
This is one of the small optimizations a theoretical ProgPoW ASIC could make to be 10-20% more efficient than existing GPUs. The inner ProgPoW loop expands to about 200 instructions on Pascal; this optimization would save fewer than 10 instructions on average.
Digging into the profiling data behind our Understanding ProgPoW blog post, I just noticed that Turing appears to have a full-speed IMAD instruction. This brings the inner loop down to about 150 instructions, mostly because the *33 from merge() takes a single IMAD instead of 2 XMADs.
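Concretely, the multiply-by-33 case mentioned above is a multiply-add that Turing's IMAD handles in one instruction; an ASIC (or older GPU) could equally hard-wire it as a shift-add. A sketch of the two equivalent forms, assuming the (a * 33) + b case of merge():

```c
#include <stdint.h>

/* The (a * 33) + b case of merge(): one full-speed IMAD on Turing. */
static uint32_t merge_mul33(uint32_t a, uint32_t b)
{
    return (a * 33u) + b;
}

/* Same result without a multiplier: 33 = 32 + 1, so a*33 = (a << 5) + a.
 * This is the kind of hard-wired shortcut an ASIC could take. */
static uint32_t merge_mul33_shift(uint32_t a, uint32_t b)
{
    return (a << 5) + a + b;
}
```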
So a ProgPoW ASIC with this optimization would simply match a GPU you can buy today.