Cryptonight variant 4 aka CryptonightR #5126

SChernykh · 2019-02-04T18:07:23Z

This is a proposal for the next Monero PoW algorithm. Please read the original discussion before posting here.

Random integer math modification

Division and square root are replaced with a sequence of random integer instructions:

| OP | Description | Frequency | Comment |
|---|---|---|---|
| MUL | a*b | 40.05% | Many multiplications ensure high latency |
| ADD | a+b+C | 11.88% | 3-way addition with random constant |
| SUB | a-b | 12.21% | b is always different from a |
| ROR | ror(a,b) | 7.52% | Bit rotate right |
| ROL | rol(a,b) | 5.57% | Bit rotate left |
| XOR | a^b | 22.78% | b is always different from a |

Program size is between 60 and 69 instructions, 63 instructions on average.

There are 9 registers named R0-R8. Registers R0-R3 are variable, registers R4-R8 are constant and can only be used as source register in each instruction. Registers R4-R8 are initialized with values from main loop registers on every main loop iteration.

All registers are 32 bit to enable efficient GPU implementation. It's possible to make registers 64 bit though - it's supported in miners below.
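To make the instruction table and register layout concrete, here is a minimal sketch of an interpreter for the six operations on 32-bit registers. The names (`v4_op`, `v4_insn`, `v4_exec`) and the decoded instruction layout are hypothetical and chosen for illustration only; they are not the encoding used in `variant4_random_math.h`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode names for this sketch only. */
enum v4_op { OP_MUL, OP_ADD, OP_SUB, OP_ROR, OP_ROL, OP_XOR };

/* Hypothetical decoded instruction: dst is one of R0-R3, src is one of R0-R8. */
struct v4_insn {
    enum v4_op op;
    uint8_t dst;
    uint8_t src;
    uint32_t c; /* random constant, used by ADD only */
};

/* Rotate right; the count is reduced modulo 32 so every shift is defined. */
static uint32_t rotr32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Rotate left; the count is reduced modulo 32 so every shift is defined. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Runs `count` instructions over r[0..8]. All arithmetic wraps modulo 2^32. */
static void v4_exec(const struct v4_insn *code, size_t count, uint32_t r[9])
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t *a = &r[code[i].dst];
        uint32_t b = r[code[i].src];
        switch (code[i].op) {
        case OP_MUL: *a *= b; break;
        case OP_ADD: *a += b + code[i].c; break;
        case OP_SUB: *a -= b; break;
        case OP_ROR: *a = rotr32(*a, b); break;
        case OP_ROL: *a = rotl32(*a, b); break;
        case OP_XOR: *a ^= b; break;
        }
    }
}
```

In this sketch the caller is responsible for only using R0-R3 as `dst`, which keeps R4-R8 constant for the duration of the program.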

The random sequence changes every block. Block height is used as the seed for the random number generator. This allows CPU/GPU miners to precompile optimized code for each block. It also makes it possible to verify optimized code for all future blocks against the reference implementation, so it'll be guaranteed safe to use in Monero daemon/wallet software.

An example of generated random math:

https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl

Optimized CPU miner:

xmrig

Optimized GPU miner:

Pool software:

Design choices

Instruction set is chosen from instructions that are efficient on CPUs/GPUs compared to ASIC: all of them except XOR are complex operations at logic circuit level and require O(logN) gate delay. These operations have been studied extensively for decades and modern CPUs/GPUs already have the best implementations.

SUB, XOR are never executed with the same operands to prevent degradation to zero. ADD is defined as a 3-way operation with random 32-bit constant to fix trailing zero bits that tend to accumulate after multiplications.
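The trailing-zero effect can be seen with a small sketch: the product of two values has at least as many trailing zero bits as both operands combined, and adding a constant is one way to break that pattern. The helper names below are hypothetical and the constant in the usage note is an arbitrary odd value, not one taken from the generator.

```c
#include <stdint.h>

/* Counts trailing zero bits of x; returns 32 for x == 0. */
static int trailing_zeros32(uint32_t x)
{
    int n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

/* 3-way ADD as described above: a + b + C modulo 2^32. */
static uint32_t add3(uint32_t a, uint32_t b, uint32_t c)
{
    return a + b + c;
}
```

For example, 12 has 2 trailing zero bits and 40 has 3, so `12u * 40u` (480) has 5. `add3(480u, 16u, 0x9E3779B1u)` is odd again, because the two even operands are combined with an odd constant.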

Code generator ensures that minimal required latency for ASIC to execute random math is at least 2.5 times higher than what was needed for DIV+SQRT in CryptonightV2: current settings ensure latency equivalent to a chain of 15 multiplications while optimal ASIC implementation of DIV+SQRT has latency equivalent to a chain of 6 multiplications.

It also accounts for super-scalar and out-of-order CPUs, which can execute more than 1 instruction per clock cycle. If an ASIC implements the random math circuit as a simple in-order pipeline, it'll be hit with a further slowdown of up to 1.5x.

A number of simple checks are implemented to prevent algorithmic optimizations of the generated code. The current instruction mix also helps to prevent algebraic optimizations of the code. My tests show that generated C++ code compiled with all optimizations on is only 5% faster on average than a direct translation to x86 machine code. This is a synthetic test with just random math in the loop; the actual Cryptonight loop is still dominated by memory access, so this number is only needed to estimate the limits of possible gains for ASICs.

Performance on CPU/GPU and ASIC

CryptonightR parameters were chosen to:

have the same hashrate as CryptonightV2 on CPU/GPU
have a bit smaller power consumption on CPU/GPU

Actual numbers (hashrate and power consumption for different CPUs and GPUs) are available here.

An ASIC will have to implement some simple and minimalistic instruction decoder and execution pipeline. While it's not impossible, it's much harder to create an efficient out-of-order pipeline which can track all data dependencies and do more than 1 instruction per cycle. It will also have to use a fixed clock cycle length, just like a CPU, so for example XOR (a single logic gate) won't be much faster anymore.

ASICs with external memory will have the same performance as they did on CryptonightV2, but they will require much more chip area to implement multiple CPU-like execution pipelines.
ASICs with on-chip memory will get 2.5-3.75 times slower due to increased math latency and randomness, and they will also require more chip area.

sethforprivacy · 2019-02-04T19:07:04Z

Has this been compared for pros/cons with the claimed "FPGA-proof" CN-GPU algo? I have 0 clue how they compare and lack the technical know-how to compare the two, but figured this would be a good place to discuss them to be sure we get the best of all available PoW algorithms:

fireice-uk/xmr-stak#2186

Hadn't seen the merits/issues of it discussed elsewhere by people who know these things.

lememine · 2019-02-04T19:21:27Z

I hope CN-GPU will never be implemented as PoW on Monero, I want to be able to mine on CPU.

SChernykh · 2019-02-04T19:30:36Z

@Gooden0ugh

I don't understand why "FPGA-proof" is a thing at all. FPGAs can't run CNv2 and CryptonightR efficiently either; only ASICs are still efficient enough to be profitable (as it turned out).
CN-GPU has no description and design rationale published - only source code, so I can't compare now. What I understood so far is that CN-GPU is not Cryptonight at all - too many parts of the algorithm have changed. It's also very power hungry on GPU and not suitable for CPUs which goes against what's stated in the original Monero whitepaper.

sethforprivacy · 2019-02-04T20:31:03Z

@Gooden0ugh

I don't understand why "FPGA-proof" is a thing at all. FPGAs can't run CNv2 and CryptonightR efficiently either; only ASICs are still efficient enough to be profitable (as it turned out).

CN-GPU has no description and design rationale published - only source code, so I can't compare now. What I understood so far is that CN-GPU is not Cryptonight at all - too many parts of the algorithm have changed. It's also very power hungry on GPU and not suitable for CPUs which goes against what's stated in the original Monero whitepaper.

That's exactly what I was hoping to hear. I had no idea it was GPU-only, as there is no documentation around it. Thanks 👍

tevador · 2019-02-04T21:06:45Z

@SChernykh How are the instruction frequencies calculated? I remember it used to be 3/8 for multiplication and 1/8 for the rest.

Regarding "CN-GPU", it replaces the AES encryption in the initialization loop with keccak and then the main loop is replaced with just a lot of floating point math (single precision multiplication and addition). That's why it's power hungry. It will be most likely compute-bound on CPUs and possibly also on some GPUs.

SChernykh · 2019-02-04T21:14:19Z

How are the instruction frequencies calculated? I remember it used to be 3/8 for multiplication and 1/8 for the rest.

Yes, initially it is as you say, except that ROR/ROL are less frequent (1/16) in favor of XOR (1/4):

// MUL = opcodes 0-2
// ADD = opcode 3
// SUB = opcode 4
// ROR/ROL = opcode 5, shift direction is selected randomly
// XOR = opcodes 6-7

But it changes during code generation because code generator adjusts some sequences to avoid possible ASIC optimizations. You can read comments in variant4_random_math.h starting from line 263:

// Don't do ADD/SUB/XOR with the same register
// Don't do rotation with the same destination twice because it's equal to a single rotation
// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// Don't generate instructions that leave some register unchanged for more than 7 cycles

The instruction frequencies in the table are averages over the first 10,000,000 random programs.
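The initial opcode mapping quoted in this comment can be sketched as a small decoder. The enum and function names are hypothetical, and `dir_bit` stands in for whatever random bit selects the rotation direction; how the real generator draws that bit is not shown in the quoted comments.

```c
#include <stdint.h>

enum v4_kind { K_MUL, K_ADD, K_SUB, K_ROR, K_ROL, K_XOR };

/* Maps a 3-bit opcode (0-7) to an operation, following the quoted mapping.
 * `dir_bit` is a hypothetical extra random bit for the rotation direction. */
static enum v4_kind decode_opcode(uint8_t opcode, uint8_t dir_bit)
{
    opcode &= 7;
    if (opcode <= 2)
        return K_MUL;                          /* opcodes 0-2: 3/8 */
    if (opcode == 3)
        return K_ADD;                          /* opcode 3: 1/8 */
    if (opcode == 4)
        return K_SUB;                          /* opcode 4: 1/8 */
    if (opcode == 5)
        return (dir_bit & 1) ? K_ROL : K_ROR;  /* opcode 5: 1/8 in total */
    return K_XOR;                              /* opcodes 6-7: 2/8 */
}
```

These are only the starting ratios. As the comment says, the measured frequencies in the table differ because the generator adjusts some sequences during code generation.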

tevador · 2019-02-04T21:49:51Z

src/crypto/variant4_random_math.h

+// Generates as many random math operations as possible with given latency and ALU restrictions
+static inline int v4_random_math_init(struct V4_Instruction* code, const uint64_t height)
+{
+	// MUL is 3 cycles, 3-way addition and rotations are 2 cycles, SUB/XOR are 1 cycle


Not sure if it makes a big difference, but the real latency of ROL/ROR on Intel is ~1 cycle (reference). 2 cycle latency is only for flags dependence.

These are worst case numbers, so they are conservative. I ran a lot of tests before and found that a few random seeds produce slower than usual code when it has a lot of rotations. This is why I set it to 2 cycles for rotations and reduced rotations frequency.

tevador · 2019-02-04T21:55:39Z

// Don't do ADD/SUB/XOR with the same register
// Don't do rotation with the same destination twice because it's equal to a single rotation
// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// Don't generate instructions that leave some register unchanged for more than 7 cycles

These rules make sense since there is just one 'program' per block.

moneromooo-monero

Halfway through.

moneromooo-monero · 2019-02-05T00:24:45Z

src/crypto/variant4_random_math.h

+		{
+			check_data(&data_index, 1, data, sizeof(data));
+
+			struct V4_InstructionCompact op = ((struct V4_InstructionCompact*)data)[data_index++];


This code seems to be deterministic based on the height, so we can know what the program for height 2e6 will be way in advance. I saw the rationale for the height seed, so a GPU can get precompiled code in advance. However, using the previous block's hash also accomplishes this, while keeping everything unknown till shortly before the time. Would this be better? It's unclear whether knowing all this in advance could be exploited somehow.

What's the risk to seeding off of a previous blocks hash in the event of a re-org? I'm trying to think of the ways that can go wrong, but I'm not sure I can come up with anything.

Previous block hash makes pre-compilation impossible because it's unknown until new block arrives, so GPUs will be halted every time. Knowing programs in advance won't help ASICs much because there are just too many different programs (one for each block). They'll be able to precompile too, but it won't give more than 5% speedup (see first post).

Plus, using the block height makes it possible to just check the code generator for all future block heights and guarantee that it doesn't crash/freeze etc. and produces working random programs. I think it's better to play safe here.

What about an ASIC strategy in which the design was intended only to work with certain block heights? I'm not yet familiar enough with this proposal to know whether this is viable.

Generated random programs are quite similar and each program has all possible instructions in it, so if ASIC can run one of them, it can run all.

"Previous block hash makes pre-compilation impossible" does not apply if the hash is the one two steps back.

Yes, but it'll require bigger refactoring because it's not available in cn_slow_hash (and functions calling it) now. Pool software will also require refactoring to support it. Block height is convenient because it's readily available with existing code both in monerod and in pool software.

moneromooo-monero · 2019-02-05T00:29:49Z

src/crypto/variant4_random_math.h

+			// MUL = opcodes 0-2
+			// ADD = opcode 3
+			// SUB = opcode 4
+			// ROR/ROL = opcode 5, shift direction is selected randomly


Since the shift count is the full size of the register in V4_EXEC, ROL and ROR are really the same thing (or, rol eax, 28 is the same as ror eax, 4).

Yes, but it still adds a bit more logic to an ASIC. This is also why I only use one opcode for them.

Just a 6 bit sub AFAICT. Something like sal/sar instead would at least change the op a bit. Or bswap also looks to be simple and latency 1. Anyway, you're the expert here so I won't say more.

What additional logic needs to be added for the other rotate? Wouldn't the additional logic only need to be in the prepping stage? Which brings me to the next point - why not drop one of the rotates in the execution switch to compress the logic? Seems like it would really be tough for the compiler to optimize that one.
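The equivalence raised at the start of this thread (`rol eax, 28` is the same as `ror eax, 4`) can be checked for every rotation count with a short sketch. The helper names are illustrative and not taken from the PR.

```c
#include <stdint.h>

/* Rotate right by n bits, with n reduced modulo 32. */
static uint32_t rotr32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Rotate left by n bits, with n reduced modulo 32. */
static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate left expressed through rotate right: only the count changes. */
static uint32_t rol_via_ror(uint32_t x, uint32_t n)
{
    return rotr32(x, (32 - (n & 31)) & 31);
}
```

The only extra work in `rol_via_ror` is the subtraction on the count, which is the small amount of additional logic discussed above.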

moneromooo-monero · 2019-02-05T00:39:02Z

src/crypto/variant4_random_math.h

+		hash_extra_blake(data, sizeof(data), data);
+		*data_index = 0;
+	}
+}


I suspect most runs require the same number of calls, as the data needed seems fairly predictable. I kinda expect this code building part is not really time sensitive though, is it?

Code generator generates first 10,000,000 random programs in 30 seconds, so it's really fast - 3 microseconds on average.

moneromooo-monero · 2019-02-05T00:40:56Z

src/crypto/variant4_random_math.h

+			// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
+			// 2xADD(a, b, C) = ADD(a, b*2, C1+C2), same for SUB and rotations
+			// 2xXOR(a, b) = NOP
+			if ((opcode != MUL) && ((inst_data[a] & 0xFFFF00) == (opcode << 8) + ((inst_data[b] & 255) << 16)))


You probably also want ADD a,b then SUB a,b and vice versa.
I also don't quite understand this. You seem to be storing only 8 bits of the source register here. Is that because you don't care about false positives?

I store "register data revision" (change counter) here, so it can't be more than 256 because programs don't have that many instructions.
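The comparison in the quoted line hints at how the per-register tracking word is laid out. The sketch below is an inference from that single line and from the reply above, not a copy of the generator: it assumes bits 8-15 hold the opcode of the last instruction written to the register and bits 16-23 hold the low byte (the "revision") of the source register's tracking word. What bits 0-7 contain is not visible in the quoted code, and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: builds bits 8-23 of a tracking word from the opcode
 * and the low byte of the source register's tracking word. */
static uint32_t pack_op_and_src(uint32_t opcode, uint32_t src_word)
{
    return (opcode << 8) + ((src_word & 255u) << 16);
}

/* Mirrors the quoted condition: true when the last write to the destination
 * used the same non-MUL opcode with the same source revision. */
static bool repeats_last_op(uint32_t dst_word, uint32_t opcode,
                            uint32_t src_word, uint32_t mul_opcode)
{
    return opcode != mul_opcode &&
           (dst_word & 0xFFFF00u) == pack_op_and_src(opcode, src_word);
}
```

Because only the low byte of the source word is compared, a false positive would need two different revisions that agree modulo 256, which cannot happen in programs of at most 69 instructions.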

Sonia-Chen · 2019-02-05T11:54:29Z

[We are asicmakers (but not interested in secret Monero mining)]
I have a question:

Is it possible to use data from the blockchain itself in the PoW algo? block data? The problem with PoW in our view is that it's isolated from the block data. Inclusion of block data would force asicmakers to make chips that could be more useful later.

moneromooo-monero · 2019-02-06T12:14:06Z

Some chains do that. At least Boolberry. Not sure if you asked "can it be sensibly done", or "please consider doing it" :)

moneromooo-monero

The main loop I don't understand yet. I might comment again on it later.

vtnerd

So much to think about with this proposal ...

some initial thoughts.

vtnerd · 2019-02-08T02:28:27Z

src/crypto/variant4_random_math.h

+			}
+
+			// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
+			// 2xADD(a, b, C) = ADD(a, b*2, C1+C2), same for SUB and rotations


Is this why a constant is used in addition? To prevent a ADD, SUB case which results in a NOP? Doesn't this happen in the case where the constant is zero? And even when the constant is non-zero, couldn't such a sequence be optimized further? i.e. ADD(A, B, 10), SUB(A, B) -> ADD(A, 10).

Random constant is used to fix zero bits that accumulate after multiplications. The case when add -> sub can be optimized to single add is quite rare, it's not worth additional complexity of the code generator. We're talking about reducing possible 5% speedup from optimizing compiler if we fix all thinkable cases here, not just this one. 5% is not much already.
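The ADD then SUB case from the question can be written out as a small sketch. It assumes the source register `b` is not modified between the two instructions and is a different register from `a`; the function names are illustrative.

```c
#include <stdint.h>

/* ADD(a, b, C) followed by SUB(a, b), computed step by step modulo 2^32. */
static uint32_t add_then_sub(uint32_t a, uint32_t b, uint32_t c)
{
    a = a + b + c; /* ADD a, b, C */
    a = a - b;     /* SUB a, b    */
    return a;
}

/* The single addition an optimizing compiler could emit instead. */
static uint32_t folded(uint32_t a, uint32_t c)
{
    return a + c;
}
```

Both functions agree for all inputs because unsigned arithmetic wraps, so this is one of the rare foldable sequences that the generator does not try to exclude.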

Presumably you meant 5% speedup in this portion, and not the entire algorithm (which should be dominated by cache/memory accesses)?

I don't quite like this argument, the CPU is pegged (more power) while custom designs might be able to save further power by having the same latencies with less silicon. Although any JIT-like approach with LLVM should do the trick here too.

5% speedup was in my tests where I had only random math in the loop and compared optimizing C++ compiler with direct translation to x86 code. The actual Cryptonight loop doesn't get any speedup from optimizing compiler on CPU because it's still dominated by the main memory-hard loop.

Custom designs will of course have random math as the limiting factor and will have an optimizing compiler to assist them whenever possible.

I don't say in the description that an ASIC is impossible. It's possible and can still be 3-4 times more efficient per watt. But this algorithm is not the final code, it's only for the next 6 months.

vtnerd · 2019-02-08T04:19:20Z

src/crypto/variant4_random_math.h

+			// MUL = opcodes 0-2
+			// ADD = opcode 3
+			// SUB = opcode 4
+			// ROR/ROL = opcode 5, shift direction is selected randomly


What additional logic needs to be added for the other rotate? Wouldn't the additional logic only need to be in the prepping stage? Which brings me to the next point - why not drop one of the rotates in the execution switch to compress the logic? Seems like it would really be tough for the compiler to optimize that one.

SChernykh · 2019-02-08T21:27:54Z

@vtnerd I've fixed pointer aliasing issues, can you check that I didn't miss anything?

xiphon · 2019-02-10T19:45:45Z

FYI, I tested the code generation routine for all the block heights starting from the current height till October 6 2019 (1768400 ... 1940093).
Had zero cases when generated code could be optimized to less than 60 CPU instructions.
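A scan over a range of block heights, in the spirit of the test described above, might look like the sketch below. It is not the test that was actually run: the callback type, the function names and the stub generator are all hypothetical, and the stub only exists so the sketch compiles. A real check would wrap the generator from `variant4_random_math.h` instead.

```c
#include <stdint.h>

/* Hypothetical callback: returns the instruction count for a given height. */
typedef int (*program_size_fn)(uint64_t height);

/* Placeholder generator for illustration only; it does NOT reproduce the
 * real code generator. */
static int stub_program_size(uint64_t height)
{
    return 60 + (int)(height % 10);
}

/* Counts heights in [first, last] whose program size is outside
 * [min_size, max_size]. Requires last < UINT64_MAX so the loop terminates. */
static uint64_t count_out_of_range(program_size_fn gen, uint64_t first,
                                   uint64_t last, int min_size, int max_size)
{
    uint64_t bad = 0;
    for (uint64_t h = first; h <= last; ++h) {
        int n = gen(h);
        if (n < min_size || n > max_size)
            ++bad;
    }
    return bad;
}
```

Because the seed is the block height, a loop like this can cover every height up to a chosen date before the fork happens.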

SChernykh · 2019-02-11T17:50:27Z

@vtnerd @moneromooo-monero
I've just submitted my final tweak. There are 9 registers now, named R0-R8. Register R8 is used as a replacement for the case when we have ADD/SUB/XOR instruction with the same register. Why only there?

It's much easier to implement in existing miner code: of all 256 instructions, these 12 (ADD/SUB/XOR R0/R1/R2/R3 with itself) weren't used anyway
It won't require a change of binary format for code generator.

How it would affect existing ASIC designs (if there are any which I doubt):

Having 9 registers instead of 8 breaks all designs that didn't account for more than 8 registers and use 3 bits for register indexing
Register R8 is an additional dependency from the main loop variables. ASIC designs that supported more than 8 registers will still have to be updated to read new data from the main loop.
ASICs will have to pump more data (12.5% more) through their pipeline. They'll be a bit less energy efficient because of this. The difference is tiny but it still exists.
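The register-count arithmetic in this list can be checked with a short sketch; the function names are illustrative.

```c
#include <stdint.h>

/* Smallest number of bits able to index `count` registers (count >= 1). */
static unsigned index_bits(unsigned count)
{
    unsigned bits = 0;
    while ((1u << bits) < count)
        ++bits;
    return bits;
}

/* Relative increase in register data, in percent, when going from
 * old_count to new_count registers of equal width. */
static double extra_data_percent(unsigned old_count, unsigned new_count)
{
    return 100.0 * (double)(new_count - old_count) / (double)old_count;
}
```

`index_bits(8)` is 3 and `index_bits(9)` is 4, which is why a design with 3-bit register indexing cannot address R8, and `extra_data_percent(8, 9)` gives the 12.5% figure.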

Effect on CPU/GPU: my tests show absolutely no changes to their performance/power usage.

moneromooo-monero · 2019-02-12T12:26:51Z

Looks good here. Waiting for vtnerd's now.

It introduces random integer math into the main loop.

Co-Authored-By: Lee Clagett <vtnerd@users.noreply.github.com>

MonadMonAmi · 2019-02-13T15:01:52Z

src/crypto/slow-hash.c

+  do if (variant >= 4) \
+  { \
+    for (int i = 0; i < 4; ++i) \
+      V4_REG_LOAD(r + i, (uint8_t*)(state.hs.w + 12) + sizeof(v4_reg) * i); \


I suggest to use int8_t*

It won't change anything. This pointer is passed to memcpy which is declared as void * memcpy ( void * destination, const void * source, size_t num );, so pointer type doesn't matter here.
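A minimal sketch of why the pointer type is irrelevant here: `load_u32` below is a hypothetical stand-in for a `memcpy`-based register load, not the actual definition of `V4_REG_LOAD`.

```c
#include <stdint.h>
#include <string.h>

/* Loads a 32-bit value from an arbitrary byte address. The parameter is
 * const void *, so uint8_t * and int8_t * arguments convert identically. */
static uint32_t load_u32(const void *src)
{
    uint32_t v;
    memcpy(&v, src, sizeof(v));
    return v;
}
```

Calling `load_u32(buf + 4)` with a `uint8_t *` and `load_u32((int8_t *)buf + 4)` reads the same four bytes, since the byte offset is the same for both element types.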

psychocrypt · 2019-02-17T19:49:22Z

Is the algorithm for the next fork already final?

SChernykh · 2019-02-17T19:51:34Z

Yes, it's already merged into release-0.13 branch (the one Monero will use for the fork).

SChernykh · 2019-02-22T06:56:07Z

Could you give the typical GPU hashrate on CNV4?

Same as CNv2.

If a GPU has no decrease on CNV4 compared with CNV2, then an ASIC can do the same I suppose.

I already told you why it's not true. GPUs are limited by memory bandwidth, not computation. CNv2 ASICs use on-chip SRAM, they're not limited by memory bandwidth/latency, but they're limited by computation. This is why they'll get slower while GPU won't. Are you really a hardware engineer?

Leochains · 2019-02-22T08:10:52Z

Yes, but it's not a very good example because first 2 multiplications can be done in parallel here. Try to synthesize a circuit to calculate x^8 and another one that does x=(((((x^a)*b)^c)*d)^e)*f - these 2 circuits will require 3 consecutive multiplications.

Here are the logic synthesis results for these two circuits:

The clock period is 0.62 ns (a frequency of around 1.623 GHz) with a little slack.
Considering practical P&R (place and route), I suppose the rate for the 3-op steps can reach 1.4 GHz.
The ASIC can use a doubled clock rate for the logic calculation, so the memory access clock rate might be 700 MHz. The maximum of 69 ops can then be done in (69 / 3 / 2) = 12 cycles @ 700 MHz.

SChernykh · 2019-02-22T08:48:32Z

@Leochains Ok, a few questions:

Can you do the same without negative slack, or should we just assume that 3 consecutive multiplications can be done in 0.75 ns?
And how will you handle randomly changing register dependencies and instruction sequences? I suspect that these will make it significantly slower than 0.75 ns.

Take a look at sample program one more time: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl

There are no cases where you can do 3 consecutive multiplications without dependencies from other registers - if you look at dependency chain for each register individually.

So the last question: how fast something like stripped down Amber design could run to execute random math in CryptonightR?

Leochains · 2019-02-22T09:05:53Z

Can you do the same without negative slack, or should we just assume that 3 consecutive multiplications can be done in 0.75 ns?
Yes, 0.75 ns is enough.

And how will you handle randomly changing register dependencies and instruction sequences? I suspect that these will make it significantly slower than 0.75 ns.

Hardware can implement 23 small units, each doing 3 consecutive ops, and use many muxes to switch between them, with each unit followed by a DFF running at a clock rate of around 1.4 GHz (double the memory access frequency).
The actual rate may be a little slower, but I think not by much.

Take a look at sample program one more time: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl

There are no cases where you can do 3 consecutive multiplications without dependencies from other registers - if you look at dependency chain for each register individually.

Latches can be used for the middle state registers to buffer the values or directly unroll those 3 ops.

So the last question: how fast something like stripped down Amber design could run to execute random math in CryptonightR?

I need some time.

SChernykh · 2019-02-22T09:18:18Z

Yes, 0.75 ns is enough.

It means that DIV+SQRT in CNv2 can be done in 1.5 ns (3+2 multiplications) + 2 ns (2 reads from ROM): 3 clock cycles at 800 MHz, or 6 clock cycles at 1600 MHz. Much faster than in your estimations.

Hardware can implement 23 small units for 3 consecutive ops and use many mux to switch, each unit followed with DFF running at a clock rate around 1.4GHz(double of memory access frequency).

Why 23? There are 6 ops, 3 consecutive ops can be any of 216 combinations.

Leochains · 2019-02-22T09:29:35Z

Yes, 0.75 ns is enough.

It means that DIV+SQRT in CNv2 can be done in 1.5 ns (3+2 multiplications) + 2 ns (2 reads from ROM): 3 clock cycles at 800 MHz, or 6 clock cycles at 1600 MHz. Much faster than in your estimations.

Hardware can implement 23 small units for 3 consecutive ops and use many mux to switch, each unit followed with DFF running at a clock rate around 1.4GHz(double of memory access frequency).

Why 23? There are 6 ops, 3 consecutive ops can be any of 216 combinations.

Sorry, my description was misleading. Different ops just need different switches and muxes; 23 means 23 cycles at a 1.4 GHz clock (assuming the maximum of 69 instructions, with 3 instructions done every cycle). The total at 1.4 GHz is 69/3 = 23 cycles, which corresponds to 23/2 ≈ 12 cycles at 700 MHz (the memory access frequency).

SChernykh · 2019-02-22T09:44:14Z

Sorry I got some misunderstanding description. Different ops just need different switches and muxs, 23 means 23 cycles at 1.4GHz clock (considering the max 69 instructions, every cycle do 3 instructions). The total cycles at 1.4GHz are 69/3, which corresponding to 23/2=12 cycles at 700MHz(memory access frequency).

23 cycles at 1.4 GHz clock is ~16.5 ns. Don't forget that the next iteration can't start before random math is calculated - there is a data dependency for that, so this ASIC could do ~18 ns/iteration which is basically the same number as CPU - AMD Ryzen @ 4 GHz does 20 ns/iteration. CNv2 ASICs are much faster.
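The cycle arithmetic in this exchange can be reproduced with a short sketch. The function names are illustrative; the inputs (69 instructions, 3 ops per cycle, 1.4 GHz) are the figures quoted above.

```c
#include <stdint.h>

/* Ceiling division for positive integers. */
static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* Time in nanoseconds taken by `cycles` clock cycles at `ghz` gigahertz. */
static double cycles_to_ns(unsigned cycles, double ghz)
{
    return (double)cycles / ghz;
}
```

`ceil_div(69, 3)` gives the 23 cycles mentioned above, and `cycles_to_ns(23, 1.4)` is about 16.4 ns, in line with the ~16.5 ns estimate.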

psychocrypt · 2019-02-22T20:16:03Z

Is there a test pool available for the new Monero PoW?

SChernykh · 2019-02-22T20:18:03Z

@psychocrypt http://killallasics.moneroworld.com/

Leochains · 2019-02-23T07:24:37Z

23 cycles at 1.4 GHz clock is ~16.5 ns. Don't forget that the next iteration can't start before random math is calculated - there is a data dependency for that

The hardware can use latches instead of DFFs to break the dependency. ASICs can split one cycle into 3 or more parts using different phases.

so this ASIC could do ~18 ns/iteration which is basically the same number as CPU - AMD Ryzen @ 4 GHz does 20 ns/iteration. CNv2 ASICs are much faster.

Don't forget that if with 16MB memory, an ASIC can reach to 8 times of CPU, and if they put 320 chips on one box like the current box, that means one box can get a 320 * 8 times rate than CPU.

So I suggest that the only way to resist ASICs is to enlarge the memory requirement, like ETH or Grin. Otherwise, no matter how the calculation algorithm is modified, an ASIC box can still reach a huge multiple of CPU/GPU hashrate more easily and cheaply.

Or if the CNV4 need to use many times of memory than CNV2, such as 16MB or 32MB, then an ASIC box can not get a huge number times rate than CPU/GPU, that is the second best way for current statement to anti ASIC I suppose.

SChernykh · 2019-02-23T08:20:06Z

The hardware can use latches instead of DFFs to break the dependency. ASICs can split one cycle into 3 or more parts using different phases.

How can they break the dependency if they just don't know the address in scratchpad to read from until calculation is done?

Or if the CNV4 need to use many times of memory than CNV2, such as 16MB or 32MB, then an ASIC box can not get a huge number times rate than CPU/GPU, that is the second best way for current statement to anti ASIC I suppose.

This will slow down CPU and GPU proportionally. CNv4 is a temporary solution, but RandomX will use 4 GB memory.

Don't forget that if with 16MB memory, an ASIC can reach to 8 times of CPU, and if they put 320 chips on one box like the current box, that means one box can get a 320 * 8 times rate than CPU.

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way. Each chip does 400 h/s, 128 kh/s in total. Similar configuration for CNv4 would be (assuming 18 ns per iteration) 320*106 h/s = 34 kh/s.
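The hashrate estimate above follows from the time per main-loop iteration. The sketch below assumes 524288 (2^19) main-loop iterations per hash, which is consistent with the figures quoted in this thread; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 524288 (2^19) main-loop iterations per hash. */
#define CN_ITERATIONS 524288.0

/* Hashes per second for one scratchpad, given nanoseconds per iteration. */
static double hashes_per_second(double ns_per_iteration)
{
    return 1e9 / (CN_ITERATIONS * ns_per_iteration);
}
```

With these assumptions, 18 ns per iteration gives about 106 h/s per chip, and 320 such chips give about 34 kh/s, matching the estimate above.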

tevador · 2019-02-23T09:42:40Z

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way.

It can have multiple scratchpads and still scan linearly within a single chip. Assuming an 8 MB chip, which seems most likely, it will run nonces 0, 1, 2, 3 in the first batch, then 4, 5, 6, 7 etc. This is possible since it's a pipelined design and the whole batch is finished at the same time.

The independent nonce sequences of different chips are used to avoid inter-chip synchronization, which would be problematic especially if the miner has multiple boards that are not connected together.

SChernykh · 2019-02-23T09:46:19Z

@tevador It's possible but not very logical. They separate the nonce ranges of each of the 320 chips by 2^22, so why would they implement such interleaving within each chip? It would make more sense to split each range into 8 parts in this case.

Leochains · 2019-02-23T09:50:25Z

How can they break the dependency if they just don't know the address in scratchpad to read from until calculation is done?

Latches (different phase enable controlled) can do that.

Or if the CNV4 need to use many times of memory than CNV2, such as 16MB or 32MB, then an ASIC box can not get a huge number times rate than CPU/GPU, that is the second best way for current statement to anti ASIC I suppose.

This will slow down CPU and GPU proportionally. CNv4 is a temporary solution, but RandomX will use 4 GB memory.

The current CNV2 accesses memory with a 512-bit width but only calculates with 128 bits each time. At the least, the new algorithm could use 8MB of memory, which might have no influence on CPU/GPU.

Don't forget that if with 16MB memory, an ASIC can reach to 8 times of CPU, and if they put 320 chips on one box like the current box, that means one box can get a 320 * 8 times rate than CPU.

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way. Each chip does 400 h/s, 128 kh/s in total. Similar configuration for CNv4 would be (assuming 18 ns per iteration) 320*106 h/s = 34 kh/s.

Why can't 8 independent scratchpads scan linearly? I think the 320 chips can divide the nonce range evenly, and every scratchpad within one chip can do the same. On the other hand, if you were an ASIC designer, and integrating 18MB of memory is already a production-proven method on LTC, would you integrate only 2MB in one chip, given the expensive packaging, testing, PCB and other costs?

SChernykh · 2019-02-23T09:54:59Z

Latches (different phase enable controlled) can do that.

Can do what? Something theoretically impossible - reading from memory when address is still not known?

Why 8 independent scratchpads can not get the linearly scans? I think the 320 chips can divide a nonce with equal distribution and it's possible for every scratchpads do the same in one chip.

They can, but they don't. We clearly see 320 distinct ranges on the nonce graph. They may be configured as 40x8 or 20x16; I also don't think they are actually 320 separate chips.

Leochains · 2019-02-23T10:00:31Z

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way.

It can have multiple scratchpads and still scan linearly within a single chip. Assuming an 8 MB chip, which seems most likely, it will run nonces 0, 1, 2, 3 in the first batch, then 4, 5, 6, 7 etc. This is possible since it's a pipelined design and the whole batch is finished at the same time.

The independent nonce sequences of different chips are used to avoid inter-chip synchronization, which would be problematic especially if the miner has multiple boards that are not connected together.

Yes, that's right, this is a general way in hardware.

Leochains · 2019-02-23T10:08:55Z

Can do what? Something theoretically impossible - reading from memory when address is still not known?

You can treat latches as DFFs controlled by a gating enable: they are triggered by the enable signal level instead of the clock edge. FPGAs have no such gating because the logic is already fixed. But latches are standard cells on ASICs, generally used together with clock gating cells and similar control logic.

tevador · 2019-02-23T10:30:30Z

They can, but they don't. We clearly see 320 distinct ranges on the nonce graph. They maybe configured as 40x8 or 20x16, I also don't think they are indeed 320 separate chips.

We don't know, but having 320 chips is not far fetched considering Innosilicon A8 has 160 chips. I think they have to use more than 320 scratchpads since 400 H/s per scratchpad would require operating frequency of around 2.4 GHz per the table posted above.

You can treat latches as DFF which controlled by gating enable, they are triggered by enable signal level instead of clock edge. The FPGA have no gating as the logic are already fixed. But LATCH are standard cell on ASIC which generally using together with clock gating cells and the similar logic control.

This still doesn't explain how an ASIC can load from memory before the address is calculated.

SChernykh · 2019-02-23T10:32:35Z

I think they have to use more than 320 scratchpads, since 400 H/s per scratchpad would require an operating frequency of around 2.4 GHz per the table posted above.

The table posted above overestimates the numbers for CNv2. You showed yourself that 3 multiplications can be done in 0.75 ns. DIV requires 3 multiplications and SQRT requires 2 multiplications, so 400 H/s (4.76 ns/iteration) is quite possible.
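For reference, the 4.76 ns/iteration figure can be reproduced with a short calculation. This sketch assumes the standard CryptoNight main loop of 2^19 = 524,288 iterations per hash; the function name is made up for illustration.

```python
MAIN_LOOP_ITERATIONS = 2**19  # 524,288 iterations per hash (assumed CryptoNight parameter)


def ns_per_iteration(hashrate: float, iterations: int = MAIN_LOOP_ITERATIONS) -> float:
    """Time budget of one main-loop iteration, in nanoseconds.

    One scratchpad computes `hashrate` hashes per second, and each hash
    runs `iterations` sequential main-loop iterations.
    """
    return 1e9 / (hashrate * iterations)


budget_ns = ns_per_iteration(400)  # roughly 4.77 ns per iteration at 400 H/s
```

Under that assumption, 400 H/s per scratchpad leaves a little under 5 ns for everything done in one iteration, including the memory accesses.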

tevador · 2019-02-23T10:39:38Z

@SChernykh The Open-CryptoNight-ASIC does ~235 H/s per scratchpad in CNv0 at 800 MHz, so it would require 1.4 GHz to reach 400 H/s. CNv2 must be slower than that due to higher div+sqrt latency.
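The 1.4 GHz estimate follows from scaling the reference design linearly. A minimal sketch, assuming that hashrate per scratchpad is proportional to clock frequency (the function name is hypothetical):

```python
def required_clock_mhz(ref_hashrate: float, ref_clock_mhz: float,
                       target_hashrate: float) -> float:
    """Clock needed for `target_hashrate`, assuming hashrate scales
    linearly with frequency (same cycles per hash as the reference)."""
    return ref_clock_mhz * target_hashrate / ref_hashrate


# Reference point from the comment above: ~235 H/s per scratchpad at 800 MHz.
clock_mhz = required_clock_mhz(235, 800, 400)  # about 1362 MHz, i.e. ~1.4 GHz
```

The linear scaling is itself an assumption: it ignores any change in cycles per iteration between CNv0 and CNv2, which is exactly the point under discussion.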

Leochains · 2019-02-23T10:40:23Z

I think they have to use more than 320 scratchpads, since 400 H/s per scratchpad would require an operating frequency of around 2.4 GHz per the table posted above.

The table posted above overestimates the numbers for CNv2. You showed yourself that 3 multiplications can be done in 0.75 ns. DIV requires 3 multiplications and SQRT requires 2 multiplications, so 400 H/s (4.76 ns/iteration) is quite possible.

But 16 MB of memory would give double the hashrate.

Leochains · 2019-02-23T10:52:45Z

This still doesn't explain how an ASIC can load from memory before the address is calculated.

Data is loaded from memory twice every 12 clock cycles in one round, using a slow clock. Here I am only talking about how 3 instruction ops can be done in one cycle, which has nothing to do with memory access.

SChernykh · 2019-02-23T14:30:31Z

@tevador

The Open-CryptoNight-ASIC does ~235 H/s per scratchpad in CNv0 at 800 MHz, so it would require 1.4 GHz to reach 400 H/s. CNv2 must be slower than that due to higher div+sqrt latency.

The problem is that div+sqrt latency turned out not to be higher in an efficient implementation. It can also fit in 5 ns (4 cycles at 800 MHz).


tevador reviewed Feb 4, 2019

src/crypto/variant4_random_math.h (outdated, resolved)

moneromooo-monero reviewed Feb 5, 2019

bitkis mentioned this pull request Feb 5, 2019

CryptoNight Waltz support for CNv0 graft-project/GraftNetwork#223

Closed

tevador approved these changes Feb 5, 2019

moneromooo-monero reviewed Feb 6, 2019

src/crypto/variant4_random_math.h (outdated, resolved)

src/crypto/variant4_random_math.h (resolved)

src/crypto/variant4_random_math.h (resolved)

src/crypto/variant4_random_math.h (resolved)

vtnerd reviewed Feb 8, 2019

xiphon reviewed Feb 10, 2019

src/crypto/variant4_random_math.h (outdated, resolved)

hyc suggested changes Feb 12, 2019

src/crypto/variant4_random_math.h (outdated, resolved)

SChernykh force-pushed the variant4-pr branch from d0ff6dd to b92eb0a on February 13, 2019 21:25

Cryptonight variant 4 aka CryptonightR

f51397b

It introduces random integer math into the main loop.

SChernykh force-pushed the variant4-pr branch from 214fc8f to f51397b on February 14, 2019 10:30

SChernykh and others added 2 commits February 14, 2019 20:42

Adding cnv4-2 tweaks

9da0892

Co-Authored-By: Lee Clagett <vtnerd@users.noreply.github.com>

Fixed path to int-util.h

f1fb06b

MonadMonAmi approved these changes Feb 15, 2019

Venthos pushed a commit to Venthos/node-cryptonight-hashing that referenced this pull request Feb 24, 2019

Add CryptoNightR (Variant 4) Support

782dc0e

Added support for CryptoNightR (Variant 4) utilizing the code from: monero-project/monero#5126

fluffypony approved these changes Mar 4, 2019

fluffypony merged commit f1fb06b into monero-project:master Mar 4, 2019

fluffypony added a commit that referenced this pull request Mar 4, 2019

Merge pull request #5126

1b4fa00

f1fb06b Fixed path to int-util.h (SChernykh) 9da0892 Adding cnv4-2 tweaks (SChernykh) f51397b Cryptonight variant 4 aka CryptonightR (SChernykh)

rubenfonseca mentioned this pull request Mar 5, 2019

New cryptonightR algorithm for XMR (9th March) oliverw/miningcore#572

Closed

moneromooo-monero mentioned this pull request Mar 5, 2019

Variant 4 Tweak 2 [Merge other Variant4-First] #5139

Closed

valiant1x mentioned this pull request Mar 5, 2019

release v3.1 letheanVPN/blockchain-iz#160

Merged
