Cryptonight variant 4 aka CryptonightR #5126

Merged
merged 3 commits into from
Mar 4, 2019

@SChernykh SChernykh commented Feb 4, 2019

This is a proposal for the next Monero PoW algorithm. Please read original discussion before posting here.

Random integer math modification

Division and square root are replaced with a sequence of random integer instructions:

OP   Description  Frequency  Comment
MUL  a*b          40.05%     Many multiplications ensure high latency
ADD  a+b+C        11.88%     3-way addition with random constant
SUB  a-b          12.21%     b is always different from a
ROR  ror(a,b)     7.52%      Bit rotate right
ROL  rol(a,b)     5.57%      Bit rotate left
XOR  a^b          22.78%     b is always different from a

Program size is between 60 and 69 instructions, 63 instructions on average.

There are 9 registers named R0-R8. Registers R0-R3 are variable, registers R4-R8 are constant and can only be used as source register in each instruction. Registers R4-R8 are initialized with values from main loop registers on every main loop iteration.

All registers are 32 bit to enable efficient GPU implementation. It's possible to make registers 64 bit though - it's supported in miners below.
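A minimal sketch of how one instruction from the table above could be applied to the register file. The type and function names (`v4_insn`, `v4_step`) are hypothetical and are not taken from the PR; only the six operations, the 32-bit width and the R0-R3 / R4-R8 split come from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the six operations follow the table above. */
enum v4_op { OP_MUL, OP_ADD, OP_SUB, OP_ROR, OP_ROL, OP_XOR };

struct v4_insn {
    enum v4_op op;
    uint8_t dst;   /* R0-R3: the only writable registers */
    uint8_t src;   /* R0-R8: any register may be read */
    uint32_t C;    /* random constant, used by ADD only */
};

static uint32_t rotr32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Apply one instruction to the 9-entry register file r[0..8]. */
static void v4_step(uint32_t r[9], struct v4_insn in)
{
    assert(in.dst < 4 && in.src < 9);
    uint32_t a = r[in.dst], b = r[in.src];
    switch (in.op) {
    case OP_MUL: a *= b; break;
    case OP_ADD: a += b + in.C; break;
    case OP_SUB: a -= b; break;
    case OP_ROR: a = rotr32(a, b); break;
    case OP_ROL: a = rotl32(a, b); break;
    case OP_XOR: a ^= b; break;
    }
    r[in.dst] = a;
}
```

All arithmetic wraps modulo 2^32, and the rotation count is taken from the low 5 bits of the source register.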

The random sequence changes every block. Block height is used as the seed for the random number generator. This allows CPU/GPU miners to precompile optimized code for each block. It also makes it possible to verify optimized code for all future blocks against the reference implementation, so it'll be guaranteed safe to use in Monero daemon/wallet software.

An example of generated random math:

Optimized CPU miner:

Optimized GPU miner:

Pool software:

Design choices

Instruction set is chosen from instructions that are efficient on CPUs/GPUs compared to ASIC: all of them except XOR are complex operations at logic circuit level and require O(logN) gate delay. These operations have been studied extensively for decades and modern CPUs/GPUs already have the best implementations.

SUB, XOR are never executed with the same operands to prevent degradation to zero. ADD is defined as a 3-way operation with random 32-bit constant to fix trailing zero bits that tend to accumulate after multiplications.
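The trailing-zero remark can be illustrated with a small sketch. The helper names are hypothetical and the constant is an arbitrary odd value chosen for the example; the point is only that a product has at least as many trailing zero bits as its factors combined, and that adding a random constant restores low-bit entropy.

```c
#include <assert.h>
#include <stdint.h>

/* Number of trailing zero bits in x; x must be nonzero. */
static int trailing_zeros32(uint32_t x)
{
    assert(x != 0);
    int n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

/* 3-way addition as described above: a + b + C with 32-bit wraparound. */
static uint32_t add3(uint32_t a, uint32_t b, uint32_t C)
{
    return a + b + C;
}
```

For instance, 12 has 2 trailing zeros and 40 has 3, so their product 480 has 5; adding an odd constant afterwards makes the lowest bit nonzero again.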

Code generator ensures that minimal required latency for ASIC to execute random math is at least 2.5 times higher than what was needed for DIV+SQRT in CryptonightV2: current settings ensure latency equivalent to a chain of 15 multiplications while optimal ASIC implementation of DIV+SQRT has latency equivalent to a chain of 6 multiplications.

It also accounts for super-scalar and out of order CPUs which can execute more than 1 instruction per clock cycle. If an ASIC implements the random math circuit as a simple in-order pipeline, it'll take a further slowdown of up to 1.5x.
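Taken together, the two latency figures above give the slowdown range stated later in this description for ASICs with on-chip memory (a worked restatement of the numbers in the text, not a new measurement):

```latex
\frac{15\ \text{multiplications}}{6\ \text{multiplications}} = 2.5,
\qquad
2.5 \times 1.5 = 3.75
```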

A number of simple checks are implemented to prevent algorithmic optimizations of the generated code. The current instruction mix also helps to prevent algebraic optimizations of the code. My tests show that generated C++ code compiled with all optimizations on is only 5% faster on average than a direct translation to x86 machine code. This is a synthetic test with just random math in the loop, whereas the actual Cryptonight loop is still dominated by memory access, so this number is only needed to estimate the limits of possible gains for an ASIC.

Performance on CPU/GPU and ASIC

CryptonightR parameters were chosen to:

  • have the same hashrate as CryptonightV2 on CPU/GPU
  • have a bit smaller power consumption on CPU/GPU

Actual numbers (hashrate and power consumption for different CPUs and GPUs) are available here.

An ASIC will have to implement some simple and minimalistic instruction decoder and execution pipeline. While it's not impossible, it's much harder to create an efficient out-of-order pipeline which can track all data dependencies and do more than 1 instruction per cycle. It will also have to use a fixed clock cycle length, just like a CPU, so for example XOR (single logic gate) won't be much faster anymore.

ASICs with external memory will have the same performance as they did on CryptonightV2, but they will require much more chip area to implement multiple CPU-like execution pipelines.
ASICs with on-chip memory will get 2.5-3.75 times slower due to increased math latency and randomness, and they will also require more chip area.

@sethforprivacy
Contributor

Has this been compared for pros/cons with the claimed "FPGA-proof" CN-GPU algo? I have 0 clue how they compare and lack the technical know-how to compare the two, but figured this would be a good place to discuss them to be sure we get the best of all available PoW algorithms:

fireice-uk/xmr-stak#2186

Hadn't seen the merits/issues of it discussed elsewhere by people who know these things.

@lememine

lememine commented Feb 4, 2019

I hope CN-GPU will never be implemented as PoW on Monero, I want to be able to mine on CPU.

@SChernykh
Contributor Author

@Gooden0ugh

  • I don't understand why "FPGA-proof" is a thing at all. FPGAs can't run CNv2 and CryptonightR efficiently either; only ASICs are still efficient enough to be profitable (as it turned out).
  • CN-GPU has no description and design rationale published - only source code, so I can't compare now. What I understood so far is that CN-GPU is not Cryptonight at all - too many parts of the algorithm have changed. It's also very power hungry on GPU and not suitable for CPUs which goes against what's stated in the original Monero whitepaper.

@sethforprivacy
Contributor

That's exactly what I was hoping to hear. I had no idea it was GPU-only, as there is no documentation around it. Thanks 👍

@tevador
Contributor

tevador commented Feb 4, 2019

@SChernykh How are the instruction frequencies calculated? I remember it used to be 3/8 for multiplication and 1/8 for the rest.

Regarding "CN-GPU", it replaces the AES encryption in the initialization loop with keccak and then the main loop is replaced with just a lot of floating point math (single precision multiplication and addition). That's why it's power hungry. It will be most likely compute-bound on CPUs and possibly also on some GPUs.

@SChernykh
Contributor Author

SChernykh commented Feb 4, 2019

How are the instruction frequencies calculated? I remember it used to be 3/8 for multiplication and 1/8 for the rest.

Yes, it is like you say initially (except ROR/ROL are less frequent (1/16) in favor of XOR (1/4)):

// MUL = opcodes 0-2
// ADD = opcode 3
// SUB = opcode 4
// ROR/ROL = opcode 5, shift direction is selected randomly
// XOR = opcodes 6-7

But it changes during code generation because code generator adjusts some sequences to avoid possible ASIC optimizations. You can read comments in variant4_random_math.h starting from line 263:

// Don't do ADD/SUB/XOR with the same register
// Don't do rotation with the same destination twice because it's equal to a single rotation
// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// Don't generate instructions that leave some register unchanged for more than 7 cycles

The instruction frequencies in the table are average from first 10,000,000 random programs.
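The opcode mapping quoted in this comment can be sketched as a small decoder. The function name and the separate `rotate_left` flag are assumptions made for illustration; the real generator draws its random bits differently and then applies the rules listed above, which is why the measured frequencies in the table differ from these initial ones.

```c
#include <assert.h>
#include <stdint.h>

enum v4_op { OP_MUL, OP_ADD, OP_SUB, OP_ROR, OP_ROL, OP_XOR };

/* Map a 3-bit opcode plus one extra random bit to an operation. */
static enum v4_op decode_opcode(uint8_t opcode, int rotate_left)
{
    assert(opcode < 8);
    if (opcode <= 2) return OP_MUL;                         /* opcodes 0-2 */
    if (opcode == 3) return OP_ADD;
    if (opcode == 4) return OP_SUB;
    if (opcode == 5) return rotate_left ? OP_ROL : OP_ROR;  /* direction is random */
    return OP_XOR;                                          /* opcodes 6-7 */
}
```

With uniformly random inputs this gives MUL 3/8, ADD 1/8, SUB 1/8, ROR 1/16, ROL 1/16 and XOR 1/4 before any adjustment by the generator.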

// Generates as many random math operations as possible with given latency and ALU restrictions
static inline int v4_random_math_init(struct V4_Instruction* code, const uint64_t height)
{
// MUL is 3 cycles, 3-way addition and rotations are 2 cycles, SUB/XOR are 1 cycle

Contributor

Not sure if it makes a big difference, but the real latency of ROL/ROR on Intel is ~1 cycle (reference). 2 cycle latency is only for flags dependence.

Contributor Author

These are worst case numbers, so they are conservative. I ran a lot of tests before and found that a few random seeds produce slower than usual code when it has a lot of rotations. This is why I set it to 2 cycles for rotations and reduced rotations frequency.

@tevador
Contributor

tevador commented Feb 4, 2019

// Don't do ADD/SUB/XOR with the same register
// Don't do rotation with the same destination twice because it's equal to a single rotation
// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// Don't generate instructions that leave some register unchanged for more than 7 cycles

These rules make sense since there is just one 'program' per block.

Collaborator

@moneromooo-monero moneromooo-monero left a comment

Halfway through.

{
check_data(&data_index, 1, data, sizeof(data));

struct V4_InstructionCompact op = ((struct V4_InstructionCompact*)data)[data_index++];

Collaborator

This code seems to be deterministic based on the height, so we can know what the program for height 2e6 will be way in advance. I saw the rationale for the height seed, so a GPU can get precompiled code in advance. However, using the previous block's hash also accomplishes this, while keeping everything unknown until shortly before the time. Would this be better? It's unclear whether knowing all this in advance could be exploited somehow.

What's the risk to seeding off of a previous blocks hash in the event of a re-org? I'm trying to think of the ways that can go wrong, but I'm not sure I can come up with anything.

Contributor Author

Previous block hash makes pre-compilation impossible because it's unknown until new block arrives, so GPUs will be halted every time. Knowing programs in advance won't help ASICs much because there are just too many different programs (one for each block). They'll be able to precompile too, but it won't give more than 5% speedup (see first post).

Plus, using the block height makes it possible to just check the code generator for all future block heights and guarantee that it doesn't crash/freeze etc. and produces working random programs. I think it's better to play safe here.

Contributor

What about an ASIC strategy in which the design was intended only to work with certain block heights? I'm not yet familiar enough with this proposal to know whether this is viable.

Contributor Author

Generated random programs are quite similar and each program has all possible instructions in it, so if ASIC can run one of them, it can run all.

Collaborator

"Previous block hash makes pre-compilation impossible" does not apply if the hash is the one two steps back.

Contributor Author

Yes, but it'll require bigger refactoring because it's not available in cn_slow_hash (and functions calling it) now. Pool software will also require refactoring to support it. Block height is convenient because it's readily available with existing code both in monerod and in pool software.

// MUL = opcodes 0-2
// ADD = opcode 3
// SUB = opcode 4
// ROR/ROL = opcode 5, shift direction is selected randomly

Collaborator

Since the shift count is the full size of the register in V4_EXEC, ROL and ROR are really the same thing (or, rol eax, 28 is the same as ror eax, 4).
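The equivalence mentioned in this comment is easy to check in code. This is a standalone sketch with hypothetical helper names, not the V4_EXEC macro from the PR: rotating a 32-bit value left by n is the same as rotating it right by 32 - n.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

static uint32_t rotr32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Asserts that rol(x, n) equals ror(x, 32 - n) for every count n in 1..31. */
static void check_rol_ror_equivalence(uint32_t x)
{
    for (uint32_t n = 1; n < 32; ++n)
        assert(rotl32(x, n) == rotr32(x, 32 - n));
}
```

This is the `rol eax, 28` versus `ror eax, 4` case from the comment, generalized to all counts.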

Contributor Author

Yes, but it still adds a bit more logic to ASIC. This is also why I only use one opcode for them.

Collaborator

Just a 6 bit sub AFAICT. Something like sal/sar instead would at least change the op a bit. Or bswap also looks to be simple and latency 1. Anyway, you're the expert here so I won't say more.

Contributor

What additional logic needs to be added for the other rotate? Wouldn't the additional logic only need to be in the prepping stage? Which brings me to the next point - why not drop one of the rotates in the execution switch to compress the logic? Seems like it would really be tough for the compiler to optimize that one.

hash_extra_blake(data, sizeof(data), data);
*data_index = 0;
}
}

Collaborator

I suspect most runs require the same number of calls, as the data needed seems fairly predictable. I kinda expect this code building part is not really time sensitive though, is it?

Contributor Author

Code generator generates first 10,000,000 random programs in 30 seconds, so it's really fast - 3 microseconds on average.

// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// 2xADD(a, b, C) = ADD(a, b*2, C1+C2), same for SUB and rotations
// 2xXOR(a, b) = NOP
if ((opcode != MUL) && ((inst_data[a] & 0xFFFF00) == (opcode << 8) + ((inst_data[b] & 255) << 16)))

Collaborator

You probably also want ADD a,b then SUB a,b and vice versa.
I also don't quite understand this. You seem to be storing only 8 bits of the source register here, is that because you don't care about false positives?

Contributor Author

I store "register data revision" (change counter) here, so it can't be more than 256 because programs don't have that many instructions.
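A sketch of the bookkeeping this answer describes, under an assumed layout inferred from the check above: the low byte of each register's entry is its change counter, the next byte is the opcode that last wrote it, and the third byte is the source register's counter at that time. The helper names are hypothetical and the actual generator may maintain `inst_data` differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the per-register bookkeeping word:
 *   bits 0-7   revision counter of this register (changes on every write)
 *   bits 8-15  opcode of the last instruction that wrote it
 *   bits 16-23 revision of the source register at the time of that write
 */
static uint32_t record_write(uint32_t revision, uint32_t opcode, uint32_t src_data)
{
    assert(revision < 256 && opcode < 256);
    return revision + (opcode << 8) + ((src_data & 255) << 16);
}

/* Nonzero when `opcode` would be applied to the destination again
 * while the source still holds the same revision as last time. */
static int repeats_last_op(uint32_t dst_data, uint32_t opcode, uint32_t src_data)
{
    return (dst_data & 0xFFFF00) == (opcode << 8) + ((src_data & 255) << 16);
}
```

Because only a counter is compared and not the register contents, 8 bits are enough as long as a program has fewer than 256 instructions.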

@Sonia-Chen

[We are asicmakers (but not interested in secret Monero mining)]
I have a question:

Is it possible to use data from the blockchain itself in the PoW algo? block data? The problem with PoW in our view is that it's isolated from the block data. Inclusion of block data would force asicmakers to make chips that could be more useful later.

@moneromooo-monero
Collaborator

Some chains do that. At least Boolberry. Not sure if you asked "can it be sensibly done", or "please consider doing it" :)

Collaborator

@moneromooo-monero moneromooo-monero left a comment

I don't understand the main loop yet. I might comment again on it later.

Contributor

@vtnerd vtnerd left a comment

So much to think about with this proposal ...

Some initial thoughts.

}

// Don't do the same instruction (except MUL) with the same source value twice because all other cases can be optimized:
// 2xADD(a, b, C) = ADD(a, b*2, C1+C2), same for SUB and rotations

Contributor

Is this why a constant is used in addition? To prevent a ADD, SUB case which results in a NOP? Doesn't this happen in the case where the constant is zero? And even when the constant is non-zero, couldn't such a sequence be optimized further? i.e. ADD(A, B, 10), SUB(A, B) -> ADD(A, 10).

Contributor Author

Random constant is used to fix zero bits that accumulate after multiplications. The case when add -> sub can be optimized to single add is quite rare, it's not worth additional complexity of the code generator. We're talking about reducing possible 5% speedup from optimizing compiler if we fix all thinkable cases here, not just this one. 5% is not much already.

Contributor

@vtnerd vtnerd Feb 10, 2019

Presumably you meant 5% speedup in this portion, and not the entire algorithm (which should be dominated by cache/memory accesses)?

I don't quite like this argument, the CPU is pegged (more power) while custom designs might be able to save further power by having the same latencies with less silicon. Although any JIT-like approach with LLVM should do the trick here too.

Contributor Author

5% speedup was in my tests where I had only random math in the loop and compared optimizing C++ compiler with direct translation to x86 code. The actual Cryptonight loop doesn't get any speedup from optimizing compiler on CPU because it's still dominated by the main memory-hard loop.

Custom designs will of course have random math as the limiting factor and will have an optimizing compiler to assist them whenever possible.

I don't say in the description that an ASIC is impossible. It's possible and can still be 3-4 times more efficient per watt. But this algorithm is not the final code, it's only for the next 6 months.

@SChernykh
Contributor Author

@vtnerd I've fixed pointer aliasing issues, can you check that I didn't miss anything?

@xiphon
Contributor

xiphon commented Feb 10, 2019

FYI, I tested the code generation routine for all the block heights starting from the current height till October 6 2019 (1768400 ... 1940093).
Had zero cases when generated code could be optimized to less than 60 CPU instructions.

@SChernykh
Contributor Author

@vtnerd @moneromooo-monero
I've just submitted my final tweak. There are 9 registers now, named R0-R8. Register R8 is used as a replacement for the case when we have ADD/SUB/XOR instruction with the same register. Why only there?

  • It's much easier to implement in existing miner code: of all 256 instructions, these 12 (ADD/SUB/XOR R0/R1/R2/R3 with itself) weren't used anyway
  • It won't require a change of binary format for code generator.

How it would affect existing ASIC designs (if there are any which I doubt):

  • Having 9 registers instead of 8 breaks all designs that didn't account for more than 8 registers and use 3 bits for register indexing
  • Register R8 is an additional dependency from the main loop variables. ASIC designs that supported more than 8 registers will still have to be updated to read new data from the main loop.
  • ASICs will have to pump more data (12.5% more) through their pipeline. They'll be a bit less energy efficient because of this. The difference is tiny but it still exists.

Effect on CPU/GPU: my tests show absolutely no changes to their performance/power usage.
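The R8 substitution described in this comment can be sketched as follows. The function name is hypothetical, and the assumption that the binary format only encodes source indices 0-7 is inferred from the remark about 3-bit register indexing; only the rule itself (ADD/SUB/XOR of R0-R3 with itself reads R8 instead) comes from the text.

```c
#include <assert.h>
#include <stdint.h>

enum v4_op { OP_MUL, OP_ADD, OP_SUB, OP_ROR, OP_ROL, OP_XOR };

/* Return the source register actually read, given the encoded dst and src. */
static uint8_t fix_source(enum v4_op op, uint8_t dst, uint8_t src)
{
    assert(dst < 4 && src < 8);
    if ((op == OP_ADD || op == OP_SUB || op == OP_XOR) && dst == src)
        return 8;  /* R8 replaces the otherwise unused "same register" encoding */
    return src;
}
```

Three opcodes times four writable registers gives the 12 previously unused encodings mentioned above, and going from 8 to 9 registers is the 12.5% increase in data.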

@moneromooo-monero
Collaborator

Looks good here. Waiting for vtnerd's now.

It introduces random integer math into the main loop.
SChernykh and others added 2 commits February 14, 2019 20:42
Co-Authored-By: Lee Clagett <vtnerd@users.noreply.github.com>
do if (variant >= 4) \
{ \
for (int i = 0; i < 4; ++i) \
V4_REG_LOAD(r + i, (uint8_t*)(state.hs.w + 12) + sizeof(v4_reg) * i); \

I suggest using int8_t*.

Contributor Author

It won't change anything. This pointer is passed to memcpy which is declared as void * memcpy ( void * destination, const void * source, size_t num );, so pointer type doesn't matter here.
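A small sketch of the point being made here, using a hypothetical `load_u32` helper rather than the V4_REG_LOAD macro itself: once the pointer is handed to `memcpy`, it is converted to `const void *`, so the signedness of the byte pointer used for the offset arithmetic has no effect on the bytes copied.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from an arbitrary byte address.
 * memcpy avoids both alignment and pointer-aliasing problems. */
static uint32_t load_u32(const void *src)
{
    assert(src != NULL);
    uint32_t v;
    memcpy(&v, src, sizeof v);
    return v;
}
```

Calling it with `(uint8_t *)buf + 4` or `(int8_t *)buf + 4` reads the same four bytes and returns the same value.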

@psychocrypt

Is the algorithm for the next fork already final?

@SChernykh
Contributor Author

Yes, it's already merged into release-0.13 branch (the one Monero will use for the fork).

@SChernykh
Contributor Author

Could you give the typical GPU hashrate on CNV4?

Same as CNv2.

If a GPU shows no decrease on CNV4 compared to CNV2, then an ASIC can do the same, I suppose.

I already told you why it's not true. GPUs are limited by memory bandwidth, not computation. CNv2 ASICs use on-chip SRAM, they're not limited by memory bandwidth/latency, but they're limited by computation. This is why they'll get slower while GPU won't. Are you really a hardware engineer?

@Leochains

Leochains commented Feb 22, 2019

Yes, but it's not a very good example because first 2 multiplications can be done in parallel here. Try to synthesize a circuit to calculate x^8 and another one that does x=(((((x^a)*b)^c)*d)^e)*f - these 2 circuits will require 3 consecutive multiplications.

Here are the logic synthesis results for these two circuits:
calc_1
calc_2

The clock period is 0.62 ns (a frequency around 1.623 GHz) with a little slack.
Considering practical P&R (place and route), I suppose that the rate for 3 consecutive ops can reach 1.4 GHz.
The ASIC can use a doubled clock rate for logic calculation, so the memory access clock rate might be 700 MHz. The maximum of 69 ops can then be done in (69 / 3 / 2) = 12 cycles @ 700 MHz.

@SChernykh
Contributor Author

SChernykh commented Feb 22, 2019

@Leochains Ok, a few questions:

  • Can you do the same without negative slack, or should we just assume that 3 consecutive multiplications can be done in 0.75 ns?
  • And how will you handle randomly changing register dependencies and instruction sequences? I suspect that these will make it significantly slower than 0.75 ns.

Take a look at sample program one more time: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl

There are no cases where you can do 3 consecutive multiplications without dependencies from other registers - if you look at dependency chain for each register individually.

So the last question: how fast something like stripped down Amber design could run to execute random math in CryptonightR?

@Leochains

Leochains commented Feb 22, 2019

  • Can you do the same without negative slack, or should we just assume that 3 consecutive multiplications can be done in 0.75 ns?
    Yes, 0.75 ns is enough.
  • And how will you handle randomly changing register dependencies and instruction sequences? I suspect that these will make it significantly slower than 0.75 ns.

Hardware can implement 23 small units, each doing 3 consecutive ops, and use many muxes to switch between them; each unit is followed by a DFF running at a clock rate around 1.4 GHz (double the memory access frequency).
The actual rate may be a little slower, but I think not by much.

Take a look at sample program one more time: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl

There are no cases where you can do 3 consecutive multiplications without dependencies from other registers - if you look at dependency chain for each register individually.

Latches can be used for the middle state registers to buffer the values or directly unroll those 3 ops.

So the last question: how fast something like stripped down Amber design could run to execute random math in CryptonightR?

I need some time.

@SChernykh
Contributor Author

Yes, 0.75 ns is enough.

It means that DIV+SQRT in CNv2 can be done in 1.5 ns (3+2 multiplications) + 2 ns (2 reads from ROM): 3 clock cycles at 800 MHz, or 6 clock cycles at 1600 MHz. Much faster than in your estimations.

Hardware can implement 23 small units for 3 consecutive ops and use many mux to switch, each unit followed with DFF running at a clock rate around 1.4GHz(double of memory access frequency).

Why 23? There are 6 ops, 3 consecutive ops can be any of 216 combinations.
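The cycle counts in this estimate follow from rounding the total latency up to whole clock periods. A minimal sketch with a hypothetical helper, using integer picoseconds to avoid floating point; the 1.5 ns and 2 ns inputs are the figures stated in the comment, not independent measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Whole clock cycles needed to cover a latency, rounded up.
 * latency_ps * clock_mhz equals cycles * 1e6. */
static uint64_t cycles_needed(uint64_t latency_ps, uint64_t clock_mhz)
{
    assert(clock_mhz > 0);
    uint64_t scaled = latency_ps * clock_mhz;
    return (scaled + 999999) / 1000000;
}
```

With 1500 ps + 2000 ps = 3500 ps this gives 3 cycles at 800 MHz and 6 cycles at 1600 MHz. The 216 in the last sentence is 6^3, the number of ordered sequences of 3 ops drawn from 6.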

@Leochains

Leochains commented Feb 22, 2019

Sorry, my description was misleading. Different ops just need different switches and muxes; 23 means 23 cycles at a 1.4 GHz clock (considering the maximum of 69 instructions, with 3 instructions done every cycle). The total at 1.4 GHz is 69/3 = 23 cycles, which corresponds to 23/2 ≈ 12 cycles at 700 MHz (the memory access frequency).

@SChernykh
Contributor Author

23 cycles at 1.4 GHz clock is ~16.5 ns. Don't forget that the next iteration can't start before random math is calculated - there is a data dependency for that, so this ASIC could do ~18 ns/iteration which is basically the same number as CPU - AMD Ryzen @ 4 GHz does 20 ns/iteration. CNv2 ASICs are much faster.

@psychocrypt

Is there a testpool for the new Monero PoW available?

@SChernykh
Contributor Author

@Leochains

Leochains commented Feb 23, 2019

23 cycles at 1.4 GHz clock is ~16.5 ns. Don't forget that the next iteration can't start before random math is calculated - there is a data dependency for that

The hardware can use latches instead of DFFs to break the dependency. You can understand it as ASICs splitting one cycle into 3 or more by using different phases.

so this ASIC could do ~18 ns/iteration which is basically the same number as CPU - AMD Ryzen @ 4 GHz does 20 ns/iteration. CNv2 ASICs are much faster.

Don't forget that if with 16MB memory, an ASIC can reach to 8 times of CPU, and if they put 320 chips on one box like the current box, that means one box can get a 320 * 8 times rate than CPU.

So I suggest that the only way to resist ASICs is to enlarge the memory requirement like ETH or Grin. Otherwise, no matter how the calculation algorithm is modified, an ASIC box can still easily and cheaply get a huge multiple of the CPU/GPU hashrate.

Or if the CNV4 need to use many times of memory than CNV2, such as 16MB or 32MB, then an ASIC box can not get a huge number times rate than CPU/GPU, that is the second best way for current statement to anti ASIC I suppose.

@SChernykh
Contributor Author

SChernykh commented Feb 23, 2019

The hardware can use latches instead of DFFs to break the dependency. You can understand it as ASICs splitting one cycle into 3 or more by using different phases.

How can they break the dependency if they just don't know the address in scratchpad to read from until calculation is done?

Or if the CNV4 need to use many times of memory than CNV2, such as 16MB or 32MB, then an ASIC box can not get a huge number times rate than CPU/GPU, that is the second best way for current statement to anti ASIC I suppose.

This will slow down CPU and GPU proportionally. CNv4 is a temporary solution, but RandomX will use 4 GB memory.

Don't forget that if with 16MB memory, an ASIC can reach to 8 times of CPU, and if they put 320 chips on one box like the current box, that means one box can get a 320 * 8 times rate than CPU.

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way. Each chip does 400 h/s, 128 kh/s in total. Similar configuration for CNv4 would be (assuming 18 ns per iteration) 320*106 h/s = 34 kh/s.
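The 106 h/s per chip figure can be reproduced with a short calculation. This sketch assumes the Cryptonight main loop runs 524288 (0x80000) iterations per hash, a value that is not stated in this thread, and ignores all work outside the main loop; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed main-loop iteration count per hash (0x80000); not stated in this thread. */
#define CN_ITERATIONS 524288u

/* Hashes per second for one scratchpad, counting only main-loop time. */
static double hashrate_per_chip(double ns_per_iteration)
{
    assert(ns_per_iteration > 0.0);
    return 1e9 / (CN_ITERATIONS * ns_per_iteration);
}
```

At 18 ns per iteration this gives roughly 106 h/s per chip, so 320 chips land at about 34 kh/s, matching the estimate above.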

@tevador
Contributor

tevador commented Feb 23, 2019

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way.

It can have multiple scratchpads and still scan linearly within a single chip. Assuming an 8 MB chip, which seems most likely, it will run nonces 0, 1, 2, 3 in the first batch, then 4, 5, 6, 7 etc. This is possible since it's a pipelined design and the whole batch is finished at the same time.

The independent nonce sequences of different chips are used to avoid inter-chip synchronization, which would be problematic especially if the miner has multiple boards that are not connected together.
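
The batching described here can be sketched as follows. This is only an illustration under stated assumptions: 4 scratchpads per chip (an 8 MB chip divided into 2 MB CryptoNight scratchpads), and a per-chip nonce range spacing of 2^22, the figure mentioned in the following reply. The names `CHIP_RANGE`, `SCRATCHPADS_PER_CHIP` and `batch_nonces` are hypothetical and do not come from any real miner firmware.

```python
from typing import List

CHIP_RANGE = 1 << 22        # assumed spacing between chip nonce ranges
SCRATCHPADS_PER_CHIP = 4    # assumption: 8 MB chip / 2 MB per scratchpad


def batch_nonces(chip: int, batch: int) -> List[int]:
    """Nonces processed together by one chip in a given pipelined batch."""
    base = chip * CHIP_RANGE + batch * SCRATCHPADS_PER_CHIP
    return [base + lane for lane in range(SCRATCHPADS_PER_CHIP)]


if __name__ == "__main__":
    print(batch_nonces(0, 0))  # first batch on chip 0
    print(batch_nonces(0, 1))  # second batch on chip 0
    print(batch_nonces(1, 0))  # first batch on chip 1, offset by CHIP_RANGE
```

Seen from outside, each chip still walks its own range linearly, so a layout like this would be hard to distinguish from one scratchpad per chip on a nonce graph.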

@SChernykh
Contributor Author

@tevador It's possible but not very logical. They separate the nonce ranges of the 320 chips by 2^22, so why would they implement such interleaving within each chip? It would make more sense to split each range into 8 parts in this case.

@Leochains

Leochains commented Feb 23, 2019

How can they break the dependency if they just don't know the address in scratchpad to read from until calculation is done?

Latches (controlled by enables on different phases) can do that.

Or, if CNv4 used several times more memory than CNv2, such as 16 MB or 32 MB, an ASIC box could not reach a huge multiple of the CPU/GPU hashrate. That is the second-best way to resist ASICs in the current situation, I suppose.

This will slow down CPU and GPU proportionally. CNv4 is a temporary solution, but RandomX will use 4 GB memory.

The current CNv2 accesses memory 512 bits at a time but only computes on 128 bits each time. At least the new algorithm could use 8 MB of memory; that might have no impact on CPU/GPU.

Don't forget that with 16 MB of memory an ASIC can reach 8 times the CPU hashrate, and if they put 320 chips in one box like the current box, that means one box can get 320 * 8 times the hashrate of a CPU.

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way. Each chip does 400 h/s, 128 kh/s in total. Similar configuration for CNv4 would be (assuming 18 ns per iteration) 320*106 h/s = 34 kh/s.

Why can't 8 independent scratchpads scan linearly? I think the 320 chips can divide the nonce range evenly, and the scratchpads within one chip can do the same. On the other hand, if you were an ASIC designer, and integrating 18 MB of memory is already a production-proven approach on LTC, would you integrate only 2 MB in one chip, given the expensive packaging, testing, PCB and other costs?

@SChernykh
Contributor Author

Latches (controlled by enables on different phases) can do that.

Can do what? Something theoretically impossible - reading from memory when address is still not known?

Why can't 8 independent scratchpads scan linearly? I think the 320 chips can divide the nonce range evenly, and the scratchpads within one chip can do the same.

They can, but they don't. We clearly see 320 distinct ranges on the nonce graph. They may be configured as 40x8 or 20x16; I also don't think they are really 320 separate chips.

@Leochains

Current box has 320 chips but it's 320x1, not 320x8, because each of 320 chips scans nonces linearly. If it had 8 independent scratchpads, it wouldn't scan nonces this way.

It can have multiple scratchpads and still scan linearly within a single chip. Assuming an 8 MB chip, which seems most likely, it will run nonces 0, 1, 2, 3 in the first batch, then 4, 5, 6, 7 etc. This is possible since it's a pipelined design and the whole batch is finished at the same time.

The independent nonce sequences of different chips are used to avoid inter-chip synchronization, which would be problematic especially if the miner has multiple boards that are not connected together.

Yes, that's right, this is a common approach in hardware.

@Leochains

Can do what? Something theoretically impossible - reading from memory when address is still not known?

You can treat latches as DFFs controlled by a gating enable: they are triggered by the level of the enable signal instead of a clock edge. FPGAs have no such gating because their logic is already fixed. But latches are standard cells on ASICs and are generally used together with clock-gating cells and similar control logic.

@tevador
Contributor

tevador commented Feb 23, 2019

They can, but they don't. We clearly see 320 distinct ranges on the nonce graph. They may be configured as 40x8 or 20x16; I also don't think they are really 320 separate chips.

We don't know, but having 320 chips is not far-fetched considering Innosilicon A8 has 160 chips. I think they have to use more than 320 scratchpads since 400 H/s per scratchpad would require operating frequency of around 2.4 GHz per the table posted above.

You can treat latches as DFFs controlled by a gating enable: they are triggered by the level of the enable signal instead of a clock edge. FPGAs have no such gating because their logic is already fixed. But latches are standard cells on ASICs and are generally used together with clock-gating cells and similar control logic.

This still doesn't explain how an ASIC can load from memory before the address is calculated.

@SChernykh
Contributor Author

I think they have to use more than 320 scratchpads since 400 H/s per scratchpad would require operating frequency of around 2.4 GHz per the table posted above.

Table posted above overestimates numbers for CNv2. You showed yourself that 3 multiplications can be done in 0.75 ns. DIV requires 3 multiplications, SQRT requires 2 multiplications, so 400 h/s (4.76 ns/iteration) is quite possible.

@tevador
Contributor

tevador commented Feb 23, 2019

@SChernykh The Open-CryptoNight-ASIC does ~235 H/s per scratchpad in CNv0 at 800 MHz, so it would require 1.4 GHz to reach 400 H/s. CNv2 must be slower than that due to higher div+sqrt latency.
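
The 1.4 GHz estimate can be reproduced with a simple proportional model. This sketch assumes hashrate scales linearly with clock frequency, which ignores memory latency and any redesign of the pipeline; `required_mhz` is an illustrative name, not part of any tool mentioned in the thread.

```python
def required_mhz(ref_hashrate: float, ref_mhz: float, target_hashrate: float) -> float:
    """Clock needed for target_hashrate, assuming hashrate is proportional to clock."""
    return ref_mhz * target_hashrate / ref_hashrate


if __name__ == "__main__":
    # Reference point quoted above: ~235 H/s per scratchpad at 800 MHz (CNv0).
    mhz = required_mhz(ref_hashrate=235, ref_mhz=800, target_hashrate=400)
    print(f"{mhz / 1000:.1f} GHz needed for 400 H/s per scratchpad")
```

With the quoted reference point this gives about 1.36 GHz, which rounds to the 1.4 GHz stated above.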

@Leochains

I think they have to use more than 320 scratchpads since 400 H/s per scratchpad would require operating frequency of around 2.4 GHz per the table posted above.

Table posted above overestimates numbers for CNv2. You showed yourself that 3 multiplications can be done in 0.75 ns. DIV requires 3 multiplications, SQRT requires 2 multiplications, so 400 h/s (4.76 ns/iteration) is quite possible.

But 16 MB of memory would double the hashrate.

@Leochains

This still doesn't explain how an ASIC can load from memory before the address is calculated.

Data is loaded from memory twice every 12 clock cycles in one round, using a slow clock. Here I am only talking about how 3 instruction ops can be done in one cycle; it has nothing to do with memory access.

@SChernykh
Contributor Author

@tevador

The Open-CryptoNight-ASIC does ~235 H/s per scratchpad in CNv0 at 800 MHz, so it would require 1.4 GHz to reach 400 H/s. CNv2 must be slower than that due to higher div+sqrt latency.

The problem is that div+sqrt latency turned out not to be higher in an efficient implementation. It can also fit in 5 ns (4 cycles at 800 MHz).

Venthos pushed a commit to Venthos/node-cryptonight-hashing that referenced this pull request Feb 24, 2019
Added support for CryptoNightR (Variant 4) utilizing the code from:
monero-project/monero#5126
Contributor

@fluffypony fluffypony left a comment

Reviewed

@fluffypony fluffypony merged commit f1fb06b into monero-project:master Mar 4, 2019
fluffypony added a commit that referenced this pull request Mar 4, 2019
f1fb06b Fixed path to int-util.h (SChernykh)
9da0892 Adding cnv4-2 tweaks (SChernykh)
f51397b Cryptonight variant 4 aka CryptonightR (SChernykh)