New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient integer multiplications #16

Closed
SChernykh opened this Issue Nov 20, 2018 · 5 comments

Comments

Projects
None yet
3 participants
@SChernykh
Copy link

SChernykh commented Nov 20, 2018

An ASIC will be 4 times more efficient with these two operations because a, b are 32-bit integers:

    case 1: return a * b;
    case 2: return mul_hi(a, b);

32-bit integer multiplications are inefficient on GPUs because GPUs only have 24-bit wide data path for multiplication. 32-bit MUL is 4 times slower than 24-bit MUL. It's better to use mul24 here.

Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.

@pallas1

This comment has been minimized.

Copy link

pallas1 commented Nov 20, 2018

AFAIK, this is only true for very old cuda compute versions, thus irrelevant for mining.

@SChernykh

This comment has been minimized.

Copy link

SChernykh commented Nov 20, 2018

V_MUL_LO_U32, V_MUL_HI_U32 = 16 clock cycles on AMD Vega and RX GPUs
V_MUL_U32_U24 = 4 clock cycles on AMD

@ifdefelse

This comment has been minimized.

Copy link
Owner

ifdefelse commented Nov 20, 2018

This is an intentional tradeoff for simplicity. Note that for mining latency doesn't matter, it's all about throughput.

AMD Vega and Polaris have V_MUL_LO_U32 and V_MUL_U32_U24. The 32-bit multiply is 4x longer latency than the 24-bit multiply. I haven't been able to find what the throughput difference is.

Nvidia Pascal has XMAD, which is 16-bit input, 32-bit output multiply-add. A 3 instruction sequence is used to do a full 32-bit multiply:
https://groups.google.com/d/msg/maxas-discuss/4rovrjSRzKA/_UCuK9U2BAAJ

There's no full-speed, single-instruction common denominator between the two. Doing either 24-bit or 16-bit multiplies would be full speed on one architecture and require extra emulation code on the other architecture. Using normal 32-bit multiplies is a good compromise that's simple and works reasonably on both architectures.

This is one of the small optimizations that a theoretical ProgPoW ASIC could do to be 10-20% more efficient than existing GPUs. The inner ProgPoW loop expands to about 200 instructions on Pascal. This optimization on average would save <10 instructions.

@ifdefelse

This comment has been minimized.

Copy link
Owner

ifdefelse commented Nov 20, 2018

Digging into the profiling data behind our Understanding ProgPoW blog post I just noticed that Turing appears to have a full-speed IMAD instruction. This allows the inner loop to be about 150 instructions, mostly because the *33 from the merge() takes a single IMAD instead of 2 XMADs.

So a ProgPoW ASIC with this optimization would simply match a GPU you can buy today.

hackmod pushed a commit to EthersocialNetwork/ethminer-ProgPOW that referenced this issue Dec 9, 2018

@ifdefelse

This comment has been minimized.

Copy link
Owner

ifdefelse commented Dec 10, 2018

Closing issue since this isn't going to be changed.

@ifdefelse ifdefelse closed this Dec 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment