Cryptonight variant 4 aka CryptonightR #5126

Open: wants to merge 3 commits into master from SChernykh:variant4-pr

@SChernykh (Contributor) commented Feb 4, 2019

This is a proposal for the next Monero PoW algorithm. Please read the original discussion before posting here.

Random integer math modification

Division and square root are replaced with a sequence of random integer instructions:

| OP  | Description | Frequency | Comment |
|-----|-------------|-----------|---------|
| MUL | a*b         | 40.05%    | Many multiplications ensure high latency |
| ADD | a+b+C       | 11.88%    | 3-way addition with random constant |
| SUB | a-b         | 12.21%    | b is always different from a |
| ROR | ror(a,b)    | 7.52%     | Bit rotate right |
| ROL | rol(a,b)    | 5.57%     | Bit rotate left |
| XOR | a^b         | 22.78%    | b is always different from a |

Program size is between 60 and 69 instructions, 63 instructions on average.

There are 9 registers named R0-R8. Registers R0-R3 are variable; registers R4-R8 are constant and can only be used as a source register in an instruction. Registers R4-R8 are initialized with values from the main loop registers on every main loop iteration.

All registers are 32-bit to enable an efficient GPU implementation. It's also possible to make the registers 64-bit; the miners linked below support it.
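
To make the register model concrete, here is a minimal, illustrative C sketch of how a single generated instruction operates on the R0-R8 register file. This is not the reference implementation in variant4_random_math.h; the opcode names and struct layout here are assumptions made purely for illustration.

```c
#include <stdint.h>

/* Illustrative opcode set and instruction layout (not the reference encoding). */
enum v4_op { OP_MUL, OP_ADD, OP_SUB, OP_ROR, OP_ROL, OP_XOR };

struct v4_insn {
    enum v4_op op;
    int        dst;  /* destination register index, 0..3 (only R0-R3 are writable) */
    int        src;  /* source register index, 0..8 (R4-R8 are read-only per iteration) */
    uint32_t   C;    /* random 32-bit constant, used only by ADD */
};

static inline uint32_t rotr32(uint32_t x, uint32_t n) { n &= 31; return n ? (x >> n) | (x << (32 - n)) : x; }
static inline uint32_t rotl32(uint32_t x, uint32_t n) { n &= 31; return n ? (x << n) | (x >> (32 - n)) : x; }

/* Execute one instruction against the 9-register file r[0..8]. */
static void v4_exec_one(uint32_t r[9], const struct v4_insn* i)
{
    const uint32_t a = r[i->dst], b = r[i->src];
    switch (i->op) {
    case OP_MUL: r[i->dst] = a * b;        break; /* low 32 bits of the product */
    case OP_ADD: r[i->dst] = a + b + i->C; break; /* 3-way addition with a random constant */
    case OP_SUB: r[i->dst] = a - b;        break; /* generator guarantees src != dst */
    case OP_ROR: r[i->dst] = rotr32(a, b); break; /* rotate right by the source value */
    case OP_ROL: r[i->dst] = rotl32(a, b); break; /* rotate left by the source value */
    case OP_XOR: r[i->dst] = a ^ b;        break; /* generator guarantees src != dst */
    }
}
```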

The random sequence changes every block. The block height is used as the seed for the random number generator. This allows CPU/GPU miners to precompile optimized code for each block. It also makes it possible to verify the optimized code for all future blocks against the reference implementation, so it can be guaranteed safe to use in Monero daemon/wallet software.

An example of generated random math:

Optimized CPU miner:

Optimized GPU miner:

Pool software:

Design choices

The instruction set is chosen from instructions that are efficient on CPUs/GPUs relative to ASICs: all of them except XOR are complex operations at the logic-circuit level and require O(log N) gate delay. These operations have been studied extensively for decades, and modern CPUs/GPUs already have the best implementations.

SUB and XOR are never executed with identical operands to prevent degradation to zero. ADD is defined as a 3-way operation with a random 32-bit constant to fix the trailing zero bits that tend to accumulate after multiplications.

The code generator ensures that the minimum latency an ASIC needs to execute the random math is at least 2.5 times higher than what was needed for DIV+SQRT in CryptonightV2: the current settings ensure a latency equivalent to a chain of 15 multiplications, while an optimal ASIC implementation of DIV+SQRT has a latency equivalent to a chain of 6 multiplications.

It also accounts for superscalar, out-of-order CPUs which can execute more than one instruction per clock cycle. If an ASIC implements the random math circuit as a simple in-order pipeline, it will be hit with a further slowdown of up to 1.5x.
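
A rough sketch of the latency-accounting idea, reusing the illustrative types from the previous sketch. The per-operation cycle costs come from the generator comments quoted later in this thread (MUL 3 cycles, 3-way ADD and rotations 2 cycles, SUB/XOR 1 cycle); the 45-cycle target and the scheduling rule are assumptions for illustration, not the exact algorithm in v4_random_math_init.

```c
/* Illustrative only: track the cycle at which each register's value becomes
 * available, and keep emitting instructions until the dependency chain is as
 * long as a chain of 15 multiplications (15 * 3 = 45 cycles). */
enum { V4_TARGET_LATENCY = 45 };

static int op_latency(enum v4_op op)
{
    switch (op) {
    case OP_MUL:                            return 3;
    case OP_ADD: case OP_ROR: case OP_ROL:  return 2;
    default:                                return 1;  /* SUB, XOR */
    }
}

/* The destination becomes ready only after both of its inputs are ready,
 * plus the operation's own latency. */
static int ready_after(const int ready[9], const struct v4_insn* i)
{
    const int start = ready[i->dst] > ready[i->src] ? ready[i->dst] : ready[i->src];
    return start + op_latency(i->op);
}
```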

A number of simple checks are implemented to prevent algorithmic optimization of the generated code. The current instruction mix also helps prevent algebraic optimization of the code. My tests show that the generated C++ code compiled with all optimizations on is only 5% faster on average than a direct translation to x86 machine code. This is a synthetic test with just the random math in the loop; the actual Cryptonight loop is still dominated by memory accesses, so this number only estimates the upper bound of possible gains for an ASIC.

Performance on CPU/GPU and ASIC

CryptonightR parameters were chosen to:

  • have the same hashrate as CryptonightV2 on CPU/GPU
  • have slightly lower power consumption on CPU/GPU

Actual numbers (hashrate and power consumption for different CPUs and GPUs) are available here.

An ASIC will have to implement some simple, minimalistic instruction decoder and execution pipeline. While that's not impossible, it's much harder to create an efficient out-of-order pipeline that can track all data dependencies and execute more than one instruction per cycle. It will also have to use a fixed clock cycle length, just like a CPU, so, for example, XOR (a single logic gate) won't be much faster anymore.

ASICs with external memory will have the same performance as they did on CryptonightV2, but they will require much more chip area to implement multiple CPU-like execution pipelines.
ASICs with on-chip memory will get 2.5-3.75 times slower due to the increased math latency and randomness (2.5x from the longer dependency chain, up to 3.75x if the pipeline is a simple in-order design, per the 1.5x factor above), and they will also require more chip area.

@GoodEn0ugh commented Feb 4, 2019

Has this been compared for pros/cons with the claimed "FPGA-proof" CN-GPU algo? I have 0 clue how they compare and lack the technical know-how to compare the two, but figured this would be a good place to discuss them to be sure we get the best of all available PoW algorithms:

fireice-uk/xmr-stak#2186

Hadn't seen the merits/issues of it discussed elsewhere by people who know these things.

@lememine commented Feb 4, 2019

I hope CN-GPU will never be implemented as PoW on Monero, I want to be able to mine on CPU.

@SChernykh (Contributor, Author) commented Feb 4, 2019

@GoodEn0ugh

  • I don't understand why "FPGA-proof" is a thing at all. FPGAs can't run CNv2 and CryptonightR efficiently as well, only ASICs are still efficient enough to be profitable (as it turned out).
  • CN-GPU has no description and design rationale published - only source code, so I can't compare now. What I understood so far is that CN-GPU is not Cryptonight at all - too many parts of the algorithm have changed. It's also very power hungry on GPU and not suitable for CPUs which goes against what's stated in the original Monero whitepaper.
@GoodEn0ugh commented Feb 4, 2019

@GoodEn0ugh

  • I don't understand why "FPGA-proof" is a thing at all. FPGAs can't run CNv2 and CryptonightR efficiently as well, only ASICs are still efficient enough to be profitable (as it turned out).
  • CN-GPU has no description and design rationale published - only source code, so I can't compare now. What I understood so far is that CN-GPU is not Cryptonight at all - too many parts of the algorithm have changed. It's also very power hungry on GPU and not suitable for CPUs which goes against what's stated in the original Monero whitepaper.

That's exactly what I was hoping to hear. I had no idea it was GPU-only, as there is no documentation around it. Thanks 👍

@tevador commented Feb 4, 2019

@SChernykh How are the instruction frequencies calculated? I remember it used to be 3/8 for multiplication and 1/8 for the rest.

Regarding "CN-GPU", it replaces the AES encryption in the initialization loop with keccak and then the main loop is replaced with just a lot of floating point math (single precision multiplication and addition). That's why it's power hungry. It will be most likely compute-bound on CPUs and possibly also on some GPUs.

@SChernykh (Contributor, Author) commented Feb 4, 2019

How are the instruction frequencies calculated? I remember it used to be 3/8 for multiplication and 1/8 for the rest.

Yes, it's like you say initially (except that ROR/ROL are less frequent (1/16 each) in favor of XOR (1/4)):

// MUL = opcodes 0-2
// ADD = opcode 3
// SUB = opcode 4
// ROR/ROL = opcode 5, shift direction is selected randomly
// XOR = opcodes 6-7
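
For readers following along, a small illustrative decoder matching the opcode ranges above (it reuses the sketch types from earlier in this thread; the actual decoding in variant4_random_math.h may differ in details such as how the rotate direction bit is drawn):

```c
/* Map a random 3-bit opcode value (0..7) to an operation, following the ranges above.
 * rotate_right is assumed to be one extra random bit; illustration only. */
static enum v4_op decode_opcode(unsigned opcode, int rotate_right)
{
    if (opcode <= 2) return OP_MUL;                          /* opcodes 0-2: 3/8 before adjustments */
    if (opcode == 3) return OP_ADD;                          /* 1/8 */
    if (opcode == 4) return OP_SUB;                          /* 1/8 */
    if (opcode == 5) return rotate_right ? OP_ROR : OP_ROL;  /* 1/16 each */
    return OP_XOR;                                           /* opcodes 6-7: 1/4 */
}
```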

But it changes during code generation because the code generator adjusts some sequences to avoid possible ASIC optimizations. You can read the comments in variant4_random_math.h starting from line 263:

// Don't do ADD/SUB/XOR with the same register
// Don't do rotation with the same destination twice because it's equal to a single rotation
// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// Don't generate instructions that leave some register unchanged for more than 7 cycles

The instruction frequencies in the table are averages over the first 10,000,000 random programs.

// Generates as many random math operations as possible with given latency and ALU restrictions
static inline int v4_random_math_init(struct V4_Instruction* code, const uint64_t height)
{
// MUL is 3 cycles, 3-way addition and rotations are 2 cycles, SUB/XOR are 1 cycle

@tevador commented Feb 4, 2019

Not sure if it makes a big difference, but the real latency of ROL/ROR on Intel is ~1 cycle (reference). The 2-cycle latency only applies to the flags dependency.

@SChernykh (Author, Contributor) commented Feb 4, 2019

These are worst-case numbers, so they are conservative. I ran a lot of tests before and found that a few random seeds produce slower-than-usual code when it has a lot of rotations. This is why I set rotations to 2 cycles and reduced their frequency.

@tevador commented Feb 4, 2019

// Don't do ADD/SUB/XOR with the same register
// Don't do rotation with the same destination twice because it's equal to a single rotation
// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// Don't generate instructions that leave some register unchanged for more than 7 cycles

These rules make sense since there is just one 'program' per block.

@moneromooo-monero (Contributor) left a comment

Halfway through.

{
check_data(&data_index, 1, data, sizeof(data));

struct V4_InstructionCompact op = ((struct V4_InstructionCompact*)data)[data_index++];

@moneromooo-monero (Contributor) commented Feb 5, 2019

This code seems to be deterministic based on the height, so we can know what the program for height 2e6 will be way in advance. I saw the rationale for the height seed, so a GPU can get precompiled code in advance. However, using the previous block's hash also accomplishes this, while keeping everything unknown until shortly before that time. Would this be better? It's unclear whether knowing all this in advance could be exploited somehow.

@mobilepolice commented Feb 5, 2019

What's the risk of seeding off a previous block's hash in the event of a re-org? I'm trying to think of the ways that could go wrong, but I'm not sure I can come up with anything.

@SChernykh (Author, Contributor) commented Feb 5, 2019

The previous block hash makes pre-compilation impossible because it's unknown until the new block arrives, so GPUs would be halted every time. Knowing programs in advance won't help ASICs much because there are just too many different programs (one for each block). They'll be able to precompile too, but it won't give more than a 5% speedup (see the first post).

Plus, using the block height makes it possible to check the code generator for all future block heights and guarantee that it doesn't crash/freeze etc. and produces working random programs. I think it's better to play it safe here.
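
As an editorial aside, this is roughly what height-based seeding enables: a miner or verifier can generate (and, for a JIT miner, compile) the program for an upcoming height before any block at that height exists. The call below uses the v4_random_math_init signature quoted earlier; the buffer size constant, the assumed return value, and the surrounding loop are illustrative assumptions, not code from this PR.

```c
#include <stdint.h>
#include "variant4_random_math.h"  /* for struct V4_Instruction and v4_random_math_init */

#define MAX_PROGRAM_SIZE 256  /* assumed upper bound for the instruction buffer (illustration) */

/* Illustrative: pre-generate the random programs for the next few heights so an
 * optimized (JIT/precompiled) miner never stalls when a new block arrives. */
static void pregenerate_programs(uint64_t current_height)
{
    for (uint64_t h = current_height; h < current_height + 2; ++h) {
        struct V4_Instruction code[MAX_PROGRAM_SIZE];
        /* assumed to return the number of generated instructions; deterministic per height */
        const int num_insts = v4_random_math_init(code, h);
        /* hand `code` (num_insts instructions) to the JIT / GPU kernel generator for height h */
        (void)num_insts;
    }
}
```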

@vtnerd (Contributor) commented Feb 5, 2019

What about an ASIC strategy in which the design was intended only to work with certain block heights? I'm not yet familiar enough with this proposal to know whether this is viable.

@SChernykh (Author, Contributor) commented Feb 5, 2019

Generated random programs are quite similar and each program has all possible instructions in it, so if ASIC can run one of them, it can run all.

@moneromooo-monero (Contributor) commented Feb 6, 2019

"Previous block hash makes pre-compilation impossible" does not apply if the hash is the one two steps back.

@SChernykh (Author, Contributor) commented Feb 6, 2019

Yes, but it would require a bigger refactoring because it's not available in cn_slow_hash (and the functions calling it) now. Pool software would also require refactoring to support it. The block height is convenient because it's readily available with the existing code, both in monerod and in pool software.

// MUL = opcodes 0-2
// ADD = opcode 3
// SUB = opcode 4
// ROR/ROL = opcode 5, shift direction is selected randomly

@moneromooo-monero (Contributor) commented Feb 5, 2019

Since the shift count is taken modulo the register size in V4_EXEC, ROL and ROR are really the same thing (e.g., rol eax, 28 is the same as ror eax, 4).

@SChernykh (Author, Contributor) commented Feb 5, 2019

Yes, but it still adds a bit more logic to an ASIC. This is also why I only use one opcode for them.

@moneromooo-monero (Contributor) commented Feb 6, 2019

Just a 6 bit sub AFAICT. Something like sal/sar instead would at least change the op a bit. Or bswap also looks to be simple and latency 1. Anyway, you're the expert here so I won't say more.

@vtnerd (Contributor) commented Feb 8, 2019

What additional logic needs to be added for the other rotate? Wouldn't the additional logic only need to be in the prepping stage? Which brings me to the next point - why not drop one of the rotates in the execution switch to compress the logic? Seems like it would really be tough for the compiler to optimize that one.

hash_extra_blake(data, sizeof(data), data);
*data_index = 0;
}
}

@moneromooo-monero (Contributor) commented Feb 5, 2019

I suspect most runs require the same number of calls, as the data needed seems fairly predictable. I kinda expect this code-building part is not really time sensitive though, is it?

@SChernykh (Author, Contributor) commented Feb 5, 2019

The code generator produces the first 10,000,000 random programs in 30 seconds, so it's really fast: 3 microseconds per program on average.

// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// 2xADD(a, b, C) = ADD(a, b*2, C1+C2), same for SUB and rotations
// 2xXOR(a, b) = NOP
if ((opcode != MUL) && ((inst_data[a] & 0xFFFF00) == (opcode << 8) + ((inst_data[b] & 255) << 16)))

@moneromooo-monero (Contributor) commented Feb 5, 2019

You probably also want ADD a,b then SUB a,b and vice versa.
I also don't quite understand this. You seem to be storing only 8 bits of the source register here; is that because you don't care about false positives?

@SChernykh (Author, Contributor) commented Feb 5, 2019

I store "register data revision" (change counter) here, so it can't be more than 256 because programs don't have that many instructions.

@Sonia-Chen commented Feb 5, 2019

[We are ASIC makers (but not interested in secret Monero mining).]
I have a question:

Is it possible to use data from the blockchain itself in the PoW algo, i.e. block data? The problem with PoW, in our view, is that it's isolated from the block data. Including block data would force ASIC makers to make chips that could be more useful later.

@tevador approved these changes Feb 5, 2019

@moneromooo-monero (Contributor) commented Feb 6, 2019

Some chains do that. At least Boolberry. Not sure if you asked "can it be sensibly done", or "please consider doing it" :)

@moneromooo-monero (Contributor) left a comment

The main loop I don't understand yet. I might comment again on it later.

@vtnerd (Contributor) left a comment

So much to think about with this proposal ...

some initial thoughts.

}

// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// 2xADD(a, b, C) = ADD(a, b*2, C1+C2), same for SUB and rotations

@vtnerd (Contributor) commented Feb 8, 2019

Is this why a constant is used in the addition? To prevent an ADD, SUB pair which results in a NOP? Doesn't this still happen in the case where the constant is zero? And even when the constant is non-zero, couldn't such a sequence be optimized further? i.e. ADD(A, B, 10), SUB(A, B) -> ADD(A, 10).

@SChernykh (Author, Contributor) commented Feb 8, 2019

The random constant is used to fix zero bits that accumulate after multiplications. The case where ADD -> SUB can be optimized to a single ADD is quite rare; it's not worth the additional complexity in the code generator. We're talking about reducing the possible 5% speedup from an optimizing compiler even if we fixed all thinkable cases here, not just this one, and 5% is not much to begin with.
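
A quick, self-contained illustration (editorial, not from the PR) of the trailing-zero problem the constant addresses: the low 32 bits of a product have at least as many trailing zero bits as both factors combined, so repeated MULs drain entropy from the low bits, and a 3-way ADD with a random constant puts it back.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint32_t a = 0x12345600u;       /* 9 trailing zero bits */
    const uint32_t b = 0x0000ABC0u;       /* 6 trailing zero bits */
    const uint32_t prod = a * b;          /* low 32 bits: at least 15 trailing zero bits */
    const uint32_t C = 0x9E3779B9u;       /* arbitrary random 32-bit constant */
    const uint32_t fixed = prod + b + C;  /* 3-way ADD restores the low-order bits */

    printf("prod  = %08x\n", (unsigned)prod);   /* low bits are all zero */
    printf("fixed = %08x\n", (unsigned)fixed);  /* low bits look random again */
    return 0;
}
```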

@vtnerd (Contributor) commented Feb 10, 2019

Presumably you meant a 5% speedup in this portion, and not the entire algorithm (which should be dominated by cache/memory accesses)?

I don't quite like this argument: the CPU is pegged (more power), while custom designs might be able to save further power by achieving the same latencies with less silicon. Although any JIT-like approach with LLVM should do the trick here too.

@SChernykh (Author, Contributor) commented Feb 10, 2019

The 5% speedup was in my tests where I had only the random math in the loop and compared an optimizing C++ compiler with a direct translation to x86 code. The actual Cryptonight loop doesn't get any speedup from an optimizing compiler on CPU because it's still dominated by the main memory-hard loop.

Custom designs will of course have the random math as the limiting factor and will have an optimizing compiler to assist them whenever possible.

I don't say in the description that an ASIC is impossible. It's possible and can still be 3-4 times more efficient per watt. But this algorithm is not the final one; it's only for the next 6 months.


@SChernykh (Contributor, Author) commented Feb 8, 2019

@vtnerd I've fixed pointer aliasing issues, can you check that I didn't miss anything?

@xiphon (Contributor) commented Feb 10, 2019

FYI, I tested the code generation routine for all block heights starting from the current height until October 6, 2019 (1768400 ... 1940093).
There were zero cases where the generated code could be optimized to fewer than 60 CPU instructions.

@SChernykh (Contributor, Author) commented Feb 11, 2019

@vtnerd @moneromooo-monero
I've just submitted my final tweak. There are 9 registers now, named R0-R8. Register R8 is used as a replacement source for the case when we'd have an ADD/SUB/XOR instruction with the same register. Why only there?

  • It's much easier to implement in existing miner code: of all 256 possible instruction encodings, these 12 (ADD/SUB/XOR of R0/R1/R2/R3 with itself) weren't used anyway
  • It doesn't require a change to the code generator's binary format.

How it would affect existing ASIC designs (if there are any, which I doubt):

  • Having 9 registers instead of 8 breaks all designs that didn't account for more than 8 registers and use 3 bits for register indexing
  • Register R8 is an additional dependency on the main loop variables. ASIC designs that supported more than 8 registers will still have to be updated to read the new data from the main loop.
  • ASICs will have to pump more data (12.5% more) through their pipeline. They'll be a bit less energy efficient because of this. The difference is tiny, but it still exists.

Effect on CPU/GPU: my tests show absolutely no changes to their performance/power usage.
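
A rough sketch of how the R8 substitution could look at decode time, based on the description above and reusing the illustrative types from earlier in this thread (the actual encoding in the PR may differ):

```c
/* Illustrative: reinterpret the 12 previously unused encodings
 * (ADD/SUB/XOR of R0-R3 with itself) as using R8 as the source register. */
static void apply_r8_substitution(struct v4_insn* i)
{
    const int same_reg = (i->src == i->dst);
    const int is_add_sub_xor = (i->op == OP_ADD || i->op == OP_SUB || i->op == OP_XOR);
    if (same_reg && is_add_sub_xor)
        i->src = 8;  /* R8, refreshed from the main-loop registers every iteration */
}
```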

@moneromooo-monero (Contributor) commented Feb 12, 2019

Looks good here. Waiting for vtnerd's now.


SChernykh force-pushed the SChernykh:variant4-pr branch from d0ff6dd to b92eb0a on Feb 13, 2019

Cryptonight variant 4 aka CryptonightR
It introduces random integer math into the main loop.

SChernykh force-pushed the SChernykh:variant4-pr branch from 214fc8f to f51397b on Feb 14, 2019

SChernykh and others added some commits Feb 14, 2019

Adding cnv4-2 tweaks
Co-Authored-By: Lee Clagett <vtnerd@users.noreply.github.com>
do if (variant >= 4) \
{ \
for (int i = 0; i < 4; ++i) \
V4_REG_LOAD(r + i, (uint8_t*)(state.hs.w + 12) + sizeof(v4_reg) * i); \

@MonadMonAmy commented Feb 15, 2019

I suggest using int8_t* here.

@SChernykh (Author, Contributor) commented Feb 15, 2019

It won't change anything. This pointer is passed to memcpy, which is declared as void* memcpy(void* destination, const void* source, size_t num);, so the pointer type doesn't matter here.

@psychocrypt commented Feb 17, 2019

Is the algorithm for the next fork already final?

@SChernykh (Contributor, Author) commented Feb 17, 2019

Yes, it's already merged into the release-0.13 branch (the one Monero will use for the fork).

@moneromooo-monero (Contributor) commented Feb 17, 2019

If you find a vulnerability, it might not be quite final though :)

@SChernykh (Contributor, Author) commented Feb 17, 2019
