Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea for ASIC resistance #3545

Open
zawy12 opened this issue Apr 3, 2018 · 229 comments

Comments

@zawy12
Copy link

@zawy12 zawy12 commented Apr 3, 2018

If ASICs are going to be a recurring problem, can you change the POW to maybe 5 or 10 different options every block based on the hash value of the previous block? That way the software could be changed in GPUs whereas ASICs would hopefully require different hardware for each POW if they are logically and/or mathematically independent enough. Even better would be if a POW could change a single variable to require a different ASIC, then the previous hash would be used to set the variable for the next block.

@moneromooo-monero

This comment has been minimized.

Copy link
Contributor

@moneromooo-monero moneromooo-monero commented Apr 3, 2018

That's the idea used by myriadcoin IIRC. It just increases the ASIC cost incrementally though. Maybe it's part of a solution though. For clarity for others reading this, what we've done for v7 is about breaking ASICs through unpredictable change, rather than making harder to build ASICs.

zawy12, could you rename the issue to "ideas for ASIC resistance" or so, we'll need to collect ideas anyway for next switch, since we're very unlikely to have a new PoW by then. So we'll use this to get people start thinking about different ways we could go.

@tevador

This comment has been minimized.

Copy link
Contributor

@tevador tevador commented Apr 3, 2018

I started developing a similar concept of an ASIC resistant PoW.

Basically, the idea is to create a virtual instruction set with basic bitwise, arithmetic and floating point operations and some branching (this must be designed carefully to avoid infinite loops). At the start of each new block, the hash of the previous block is expanded into a few hundred random instructions and compiled into machine code (for CPU/GPU). This random program is then applied to the scratchpad (initialized from the current block header). At the end, the scratchpad is hashed and the result is compared to the target.

Therefore, the PoW would change with every block and if the instuction set is broad enough to cover the majority of what current CPUs/GPUs implement in hardware, then any potential ASIC would be basically a CPU/GPU with similar performance.

The specific details can be tuned so the benefits from a custom hardware are negligible - for example, the PoW code should fit into L1I cache, scratchpad fits into L2/L3 cache, 64 bytes of data are processed at a time (whole cache line) etc. A different permutation of opcodes could be chosen for every block to prevent direct hardware decoding.

This approach has some disadvantages, but it's the only one guaranteed to be ASIC resistant.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 3, 2018

That's really interesting. How would you make solvetime be in the correct ballpark for a given network hahrate? That is, how do you come up with a difficulty setting for an unpredictable algorithm?

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 4, 2018

Sounds a lot like what I outlined here https://www.reddit.com/r/Monero/comments/865vbb/asic_resistance/dw2p4b7/

The basic idea is to use randomly generated code. If you simply select from a handful of known algorithms, eventually someone will build an ASIC that just implements the entire set of algorithms. Using randomly generated code, with an RNG with an intractably large set of states, will defeat that possibility.

My approach is to randomly generate javascript. Developing your own instruction set sounds great but such a development will be immature for quite a long time, leaving countless opportunities for fatal flaws and unexpected optimizations. Using something mature like javascript means most optimization opportunities have already been exploited. (And meanwhile, if anyone does manage a new optimization, it becomes useful to the entire computing community.) I will have deployable code demonstrating this approach in a few days.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 4, 2018

Again I can't imagine how you guys are going to be able to adjust a difficulty for randomly generated code that has an unknown solve time. I guess you're right that they could build an Asic for 5 algorithms.

@tevador

This comment has been minimized.

Copy link
Contributor

@tevador tevador commented Apr 4, 2018

@zawy12 The result of the PoW is a standard 256 bit hash of the collapsed scratchpad state, so difficulty targeting is not affected. Basically, just the memory-hard loop of cryptonight would be replaced. Everything else stays the same.

Mining would go like this:

  1. Select a nonce value.
  2. Compute the Keccak state of the block header and initialize a 2 MiB scratchpad (same as cryptonight).
  3. Apply the block-specific algorithm to the scratchpad.
  4. Collapse the modified scratchpad to a 256 bit hash (same as cryptonight).
  5. If the hash meets the difficulty for the current block, the block is solved. Otherwise repeat from step 1).

The random algorithm would be long enough so that the latencies of different instructions average out and step 3) should take approximately the same time for every block.

The only major difference would be the addidional compilation step when a new block is found.

@ghost

This comment has been minimized.

Copy link

@ghost ghost commented Apr 9, 2018

How do you determine the difficulty of the random algorithm though? How do you determine if it even terminates? There must be some rules governing how the algorithm is generated and in that case it may be vulnerable to FPGAs or CGRAs using dynamic partial reconfiguration.

@gazhayes

This comment has been minimized.

Copy link
Contributor

@gazhayes gazhayes commented Apr 9, 2018

Using randomly generated code, with an RNG with an intractably large set of states, will defeat that possibility.

@hyc that sounds pretty cool, what's the source of randomness though, how do you prove this is random to everyone else? Or doesn't it matter for your solution?

If verifiable randomness is needed, Micali's VRFs could maybe help but that would probably require a shift towards a PoS-ish solution.

If randomness isn't verifiably random, couldn't an ASIC just propose their own 'randomness' that they are optimized for solving?

@ghost

This comment has been minimized.

Copy link

@ghost ghost commented Apr 9, 2018

@gazhayes I don't think the source of randomness matters, people solving based on their own randomness will not get the same result as everyone else, and thus would be rejected.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 9, 2018

The source of randomness is the block hashing template and the nonce. These are fed as a seed into the PRNG. Everyone must use the same PRNG and same random program generator, otherwise the results are invalid.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 9, 2018

Proof of concept is here https://github.com/hyc/randprog note that the actual program generator is some pretty bad code. This is nowhere close to useable for real, it literally only illustrates the idea.

Some other differences between mine and @tevador 's proposal - this approach generates an entirely new program for every nonce, his generates a single program for a given block. In my approach, the PRNG and the program generator play a more significant role in the overall PoW. The other approach relies primarily on the difficulty of executing the generated program.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 10, 2018

@hyc It seems like the miner might be able to see higher difficulty in some algorithms and just skip them (go to a new nonce). Trevador said the algorithm would be long enough that it's computational difficulty would average out. I'm skeptical but maybe it's possible by bounding outliers. I can't see how you've address the problem, especially with all functionality of javascript. It seems like ya'll are attempting a form of genetic programming, which has rules on what can be combined so that valid algorithms are created.

How are nodes going to easily validate the work without redoing it all?

Why change the algorithm with each nonce? Why not use bits of the hash of the previous block? All miners would know the algorithm as soon as they start working on a block.

@LordMajestros

This comment has been minimized.

Copy link

@LordMajestros LordMajestros commented Apr 10, 2018

On further consideration I don't think verifying this will be easy. It seems to me that the only way to verify it is to do it all over again.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 10, 2018

Yes, the only way to verify is to perform the same steps. But that's true for Cryptonight too so this is no worse in that respect. (This is not the only approach we're exploring. There are other avenues which allow the verifier to be cheaper than the original proof but I'm not directly involved in that work.)

@zawy12 as I noted, this particular code isn't directly usable. But yes, having all of the functionality of javascript is partly the point. A real solution would ensure some uniformity of difficulty for all of the generated code. E.g., "output will have [30-40] functions, [80-100] variables; functions will have [10-20] statements" etc. - the actual range of parameters will need to be chosen to provide a large enough set of permutations.

Why change on every nonce? Because we already know that companies like Bitmain have automatic tools to turn e.g. C code into e.g. RTL - Someone using FPGAs could easily reconfigure if you only use one algorithm per block. Generating a new program for every nonce eliminates any such advantage. It also eliminates a lot of compiler advantage too - both AOT and JIT compilation trade off compile time for run time on code that executes more than once. Since the generated code will only ever execute once, complex AOT or JIT optimization are pointless.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 10, 2018

@hyc OK, so the validator only has to run the "winning" nonce etc (algorithm) while the miner had to do it for a bunch of different algorithms. Maybe you should use reverse-polish-notation (stack) thinking which is probably how a lot of genetic programming is done (not to be confused with GAs). By this I mean your first 2 variables go into a single math operation which might be limited to +, -, * , and /. Or maybe there is a set of operations that CPUs are better at. You choose variable widths optimized for CPU, equal to something like 64 bits. The output from the first operation is one of two inputs to the next operation, and a 3rd variable is the other. So you have N operations and N+1 variables. To be Turing complete, you have to enable outputs the ability to be inputs to earlier (make it recursive) or later operations, so N+1 is the max number of variables. Cellular automata can be similarly Turing complete. You don't need all of a language which may require a lot more protective logic (or limitations) for dealing with (or preventing) mal-formed operations. Some variables (and therefore operation difficulty on them) are going to blowup in size (and therefore difficulty). So to meet the "near equal" computation difficulty requirement, I would make the variable sizes equal to the max for the declaration and let truncation at the max to prevent overflow end be the norm. Maybe have to make sure the number of each of the possible math operations are always equal, including the recursion count of that operation. But you can't count the number of recursions until you run the simulation, and it's not Turing complete if this is not an option. So it can't be that general, unless you have a rule that the process stops after a certain number of operations, and that stopping point is the output. But then you can't predict if you'll have way too many divisions compared to additions, unless the stopping point includes an assumption about the relative difficulty of the operations which may not be a good idea. With recursion, I'm doubtful you can depend on a law of averaging like trevador requires. This brings me back to the idea of simpler cellular automata where the amount of computation in each step is known, so after the algorithm is created, you know exactly how many steps are required to meet the difficulty requirement. I guess this is effectively the same thing as the 4 math operations with recursion, except that the method of recursion is pre-defined in the algo at each step, instead of being a free-for-all in what you could end up creating.

So maybe there is a way to have a very unpredictable algorithm that CPUs are good at that also can have its computational difficulty fixed. I assume a hash of the output is the thing that is trying to meet the difficulty target. I don't know why it would be needed, but you might be able to increase the number of steps required via a difficulty algorithm instead of lowering the target hash, or do both.

What I've described goes in a completely different direction than I think equihash and cryptonight, if it is not done to require a large amount of RAM for random access. But does a GPU or ASIC haved any advantage over CPU with this? By requiring a lot of step-by-step recursion without much memory, the central processor of any of the architectures would generate a lot of heat. So it's electricity intensive instead of up front equipment cost. The smallest resolution IC's would have an efficiency advantage, which I think puts ASICs at a disadvantage.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 10, 2018

What I've described above could be done most cost-effectively (least electricity per simple computation) on risc architecture like cell phones, especially since they're optimized for electrical efficiency and low ram. FPGAs and ASICs would be at a disadvantage due to lagging the line resolution in GPUs and CPUs. An ARM expert could identify the math operations (including bitwise operations) that ARM excels at. It only needs a way to generate recursive procedures that result in pseudo-random complex patterns like certain cellular automata. But I don't know if Wolfram found a large "generate-able" class of them that reliably gives complex patterns.

I'm saying "cellular automata" but they're just sort of discrete differential equations. So you can think in those terms: just find a class of simple non-linear DE's that has these characteristics (predicable computation difficulty per step, yet non-reducible across their class to a more efficient computation). Both CA and DE's require initial conditions, and if the class of DE's is complex enough the only way to solve is step-by-step recursive procedure that's already been optimized by mathematicians.

@ipbc-dev

This comment has been minimized.

Copy link

@ipbc-dev ipbc-dev commented Apr 10, 2018

Hello, we thought that a recent idea we started building might be of interest to this discussion:
https://github.com/ipbc-dev/TuringsNightmare (Very WIP)

The idea is similar to what tevador suggests, I think.

@SChernykh

This comment has been minimized.

Copy link
Contributor

@SChernykh SChernykh commented Apr 10, 2018

@moneromooo-monero I have a great idea which can be used as a building block for the next small tweak to Cryptonight. It's not a big change like everything discussed above.

So, current Cryptonight is memory-intensive in terms of memory latency, but not bandwidth. Modern CPUs use 64-byte wide cache lines, but Cryptonight only loads/stores 16 bytes at a time, so 75% of available CPU cache bandwidth is wasted. ASICs are optimized for these 16 byte-wide memory accesses, so they always use 100% of whatever memory they have.

The idea is to do something computationally light and safe with the other 48 bytes of the same cache line (which is loaded in L1 cache anyway) on each half-step.

I did a quick test, basically this: _mm_load_si128 -> _mm_shuffle_epi32 -> _mm_store_si128 for the other 3 16-byte chunks in current 64-byte cache line, right after _mm_aesenc_si128 instruction and right after _umul128 instruction. CPU mining performance stayed the same! Well, it went down 0.5-1% in my test, but that's negligible. It's easy to explain why: modern CPUs execute code out of order and have plenty of execution resources, so they can do it in parallel, and these 3 16-byte chunks are already in L1 cache, so no additional delay happens. It's just executed in parallel with AES instruction and MUL instruction.

I haven't tested it on GPU yet, but my estimations show that GPUs use only 15-20% of their memory bandwidth when mining Cryptonight, so they should also handle this change with minimal impact on performance.

ASICs, on the other hand, will have to deal with 4x more bandwith usage, so they will be 4 times less efficient compared to the original Cryptonight ASICs.

This building block (_mm_load_si128 -> _mm_shuffle_epi32 -> _mm_store_si128 for the other 3 16-byte chunks) is easy to add to the next PoW change, together with other changes. And it can't break anything, because it only shuffles DWORDs that are in the scratchpad, it can never cancel anything out to 0 like XOR (if implemented unsafe) or reduce entropy in any other way.

P.S. I just realized that this makes FPGAs 4 times less efficient as well. Not only ASICs.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 11, 2018

@ipbc-dev All that's needed to solve this is an ASIC designed to execute your 255 instruction opcodes. 255 instructions is a trivially small number to implement.

@zawy12 Your suggestion boils down to a finite state machine. It would require only a small number of logic gates to implement each state, and some other amount of memory.

@SChernykh a 4x speedup for CPUs is nice, but we're looking at ASICs with ~300x CPU efficiency. Spitting into the wind.

https://www.reddit.com/r/Monero/comments/8bbhxq/understanding_network_hashrate_prepost_asics_and/

Let's start with a rx 580 8gb rig and compare it to an asics rig. A 6 gpu rig gets 5000 h/s. The asics rig from bitmain gets 220,000 h/s for the same amount of power usage. You would need 44 - 6 gpu rigs to get the same hashrate.

@SChernykh

This comment has been minimized.

Copy link
Contributor

@SChernykh SChernykh commented Apr 11, 2018

@hyc It's not a speedup for CPU, more 4x slowdown of ASICs. I never said it would solve the ASIC problem entirely, but 4x more ROI time for ASICs could actually make them economically useless with 6 month PoW update cycle. Original ASICs had 1 month ROI, and had 3 months until the PoW change. With 4 times less efficient ASICs they will never ROI before the PoW change, or at least it will be questionable.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 11, 2018

@hyc ASIC gains depend exclusively on a particular algorithm that can be optimized where calculations can be skipped. Very simple recursive algorithms that produce pseudo-random output can't be optimized because they already are optimized. The extent to which the algorithm produces non-random output is the extent to which they can be optimized, but optimization beyond a certain point requires human tinkering. Before that point, a CPU can figure out the optimization before sending it off to a fabrication lab, so pseudo-random or even being some distance from random may not be a problem. But the general class of the on-the-fly generate-able algorithms should not have a significant demonstrable general pattern to them that could then be encoded as an optimization in an ASIC.

What I'm saying is the answer. Everyone's been thinking more RAM is the key, and it has a side marketing benefit of potentially not requiring as much electricity. But I'm saying there is a great route here that specifically tries to waste electricity. That provable waste is the underlying value of non-fiat coins like gold (see Szabo's writings, Dai's b-money comments about what he was trying to acheive, and attempts to make currency equal to Joules). Recursive output from simple algorithms can't be optimized. I read Wolfram's "New Kind of Science" and know the basic theoretical limits of computing which is how I know ASICs and FGPA's are not going to help.

I'm not sure your changing the algo "per nonce" method is better than the "per block" method, and it might not be good if a miner can identify harder algorithms and just skip it (lending to an exploit by ASICs), but it may be needed to average out the difficulty miners face in the series of recursive algorithms. I think per block might be better and force us to find a class of algorithms that have the same per-recursive-step computation time. This tightness might also force us to end up with something more similarly random, and more easily provably unfit for ASICs.

Each single bit transition in a computing element like the output of a NAND gate requires at least k*T*ln(2) joules of energy. Our devices are many orders of magnitude above this Landauer limit, largely depending on how small the gates can be. CPUs and GPUs have smaller gates than ASICs and FGPAs, or at least ASICs and FGPAs are not smaller.

Wolfram and others have proven you can't reduce certain very simple recursive algorithms in a way that eliminates the need to recursively go through all the steps. ASICs will have to burn more electricity than cell phones to do the same computations.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 11, 2018

@hyc Elementary cellular automata are not necessarily finite state machines and can be Turing complete like wiki says. I was thinking of automata that potentially use more than just the 2 adjacent cells (the definition of the elementary type) so that it is more likely to be pseudo-random output. One algorithm per nonce might be required because instead of trynig to create an FPGA per block, you would create a lookup table to fill RAM and use that instead of computing. But that can't be done with a change every nonce on a large general of algorithms.

Or maybe it could be done with 1 algorithm per block like this: make it start with 256 bits (cells). A hash of your nonce would be the random number that creates that initial condition. That way you can't build a lookup table for the algorithm. If it's random enough, you won't be able to reduce the logic for programming into a FPGA in the time span of a block.

If a very general class could be created that has the same amount of computation per step, then you would have a fixed number of steps to the answer which would be hashed to get your target. Just as an interesting point, maybe the difficulty could change that number of steps instead of making the target hash lower.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 11, 2018

@zawy12

ASIC gains depend exclusively on a particular algorithm that can be optimized where calculations can be skipped. Very simple recursive algorithms that produce pseudo-random output can't be optimized because they already are optimized.

This is only part of the story. ASICs are also faster because their memory subsystems don't require complex caches and cache coherency protocols that general purpose CPUs require. You're only looking at the theory, but we're dealing with real world practice where other considerations apply. A state machine implemented in an ASIC can certainly be made to iterate through a set of states far faster than a CPU could be programmed to do. It's not just about a single specific algorithm, it's about all the auxiliary support that does (or doesn't) have to be put in place.

When you toss around mention of "Turing completeness" remember that a Turing machine is just a device that reads or writes one memory cell at a time on a very large ("infinite") memory tape. It is trivial to implement in hardware for a given size of memory. We already know that depending on "xxMB is too expensive to implement in ASIC" as a defense is inadequate.

Whenever you focus in on simple state machines, you're reducing to a Turing machine and the only hardness is how much memory it will need. My approach here is specifically focused on making the algorithm more complex, because (a) while all algorithms can eventually be reduced to something executable by a Turing machine, the number of steps needed to perform that reduction will drastically impact its runtime and (b) increasing the algorithm complexity increases difficulty far faster than just increasing memory requirement.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 11, 2018

Again, the primary goal is to not allow more than minimal memory to be helpful. The only thing that can be fed into the processor is mainly the previous output of the processor, i.e. a heavy reliance on unpredictable recursiveness instead of memory. I mentioned Turing completeness the second time because you stated it was a state machine which is not Turing complete. Turing complete is not needed, but something complex enough to have output that can't be easily predicted. We're talking about changing the "state machine" every block or nonce, so the ASIC and FPGA can't be created ahead of time. (to execute an optimization of the state machine) ( i am not talking about a finite state machine anyway)

To say it another way, modeling of complex differential equations has to do things 1 recursive step at a time like this. An ASIC or FPGA could be could be designed for a given model. But if the model is changing all the time, they can't help.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 11, 2018

An ASIC or FPGA could be could be designed for a given model. But if the model is changing all the time, they can't help.

Yes, in this respect we're saying the same thing.

( i am not talking about a finite state machine anyway)

Beside the point. The cellular automaton you're talking about is a trivially small set of logic states. All it does is transition from one state to another based on the current input. My point is we need a much larger set of states to transition between.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 11, 2018

Why are more than 2^256 states per cycle (per recursion) needed? (maybe 1,000,000 loops per hash)

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 11, 2018

All you've described is essentially a compute engine with 4 instructions in its instruction set and a 256 bit program. It is ridiculously trivial to implement.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Apr 11, 2018

Trivial to implement is core to my goal, not a problem. Do you mean it's trivial to optimize?

myhash = ffx32 
unless (myhash < target) {
   nonce++
   j =  hash(previous block hash + nonce)
   for (1 to 100)  { 
       j++
       function = generate_function( hash( j ) )
       for  (k=1 to 100,000) {
           input=output
           output = function( input )
       }
   }
   myhash = hash(output)
}
if ( myhash < target ) { claim my block }

The ending j and k need to be included in the header for validators, and j checked for being a reasonably small multiple of 100 above hash(previous block hash + nonce) Reasonably small would be < 1000x of expected hashes per block needed by the largest miner, assuming that miner has 10x rest of the network).

Where our goal is a class of functions that is reasonably unpredictable in logic, but predictably difficult to optimize and predictably nearly equal in time to compute (for a given "CPU") as all of those in its class.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Apr 11, 2018

How is what you describe in any way different from Cryptonight, which has already been broken?

@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 8, 2018

I'm trying to help the Monero community by publicly pointing out these flaws in the code generation approach. I could have kept this private and made a lot of money building an ASIC for your new PoW. If you want to ignore me, do not complain when I do just that.

@SChernykh

This comment has been minimized.

Copy link
Contributor

@SChernykh SChernykh commented Oct 8, 2018

I just don't understand how an ASIC "without cache" could do compilation fast. Can you explain?

@paulshapiro

This comment has been minimized.

Copy link
Contributor

@paulshapiro paulshapiro commented Oct 8, 2018

@timolson there has been some recent discussion by andytoshi, gmaxwell, and hyc in #monero-research-lab (freenode IRC) that the nonces which correspond to easier mining algos can be picked out and run preferentially. The conclusion was it should be simulated. May I invite you to come join the conversation?

@MoneroCrusher

This comment has been minimized.

Copy link

@MoneroCrusher MoneroCrusher commented Oct 8, 2018

@timolson @SChernykh
While I have no knowledge in hardware or software, from having read about this now for a big part of the year, I also think specialized hardware will always have an edge.
But do you know what is really ASIC resistant? The Monero community spitting out a new algo every 6 months, increasing ASIC cost & complexity in the same step. That's what's truly ASIC resistant about this Oct. 18th, knowing that another unknown one is just around the corner.
@timolson can you tape out an ASIC on a large scale within that 6 month period, break even all the labour, production cost, installation, electricity, disposing the ASICs after next fork etc? If so, tell the community about ASIC logistics/business, so we can take measures and i.e. increase forking frequency to 3 months if necessary or take other steps.

@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 8, 2018

Forking is not significantly different from code generation. An ASIC designer can start with a baseline CN impl like ours and intercept the datapath at various points to implement the tweaks in a simple CPU. The best thing the latest tweak does is change the datapath to memory. That breaks a lot and does require a great deal of rework.

The issue with tweaking the PoW so frequently is that it doesn't get enough review. You need to trust the difficulty of your PoW and if it's changing all the time you can't do that.

I'll jump on IRC.

@MoneroCrusher

This comment has been minimized.

Copy link

@MoneroCrusher MoneroCrusher commented Oct 8, 2018

@timolson Nothing against you, but obviously you have to preserve self-interest and your interest lies in bringing ASICs to Cryptonight, as you've clearly outlined. So you come in here and say ASICs are inevitable, why should we trust you to have a neutral view on this?
Can you say that you could profitably tape out CNv2 on a large scale within 6 months, including all expenses, including the risk of having worthless hardware in 6 months from now?

@SChernykh

This comment has been minimized.

Copy link
Contributor

@SChernykh SChernykh commented Oct 8, 2018

The best thing the latest tweak does is change the datapath to memory. That breaks a lot and does require a great deal of rework.

What exactly do you mean by the latest tweak? There was a big change from me and a small tweak from @vtnerd - can you go through them step by step and describe how it affects ASICs?

@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 8, 2018

== Cannot send to nick/channel: #monero-research-lab

Private channel?

@SChernykh

This comment has been minimized.

Copy link
Contributor

@SChernykh SChernykh commented Oct 8, 2018

@timolson you have to register your nick:
/msg NickServ register password email

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Oct 8, 2018

@paulshapiro Like gmaxwell, I mentioned several times in this issue it's a "no-go" if the "nonces" can be searched to select code that's easier to compile + solve. The answer seemed to be "We've got that covered. The solve times will be about equal." But I didn't see that answer given to gmaxwell.

@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 8, 2018

The answer to the easy nonce problem which was posted in the RandomJS issue does not seem sufficient to me. By using permutations instead of combinations, you are reducing complexity, simplifying the datapaths in an ASIC, and guaranteeing a fixed ratio of execution units.

@zawy12 zawy12 closed this Oct 9, 2018
@zawy12 zawy12 reopened this Oct 9, 2018
@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 9, 2018

@timolson Nothing against you, but obviously you have to preserve self-interest and your interest lies in bringing ASICs to Cryptonight, as you've clearly outlined. So you come in here and say ASICs are inevitable, why should we trust you to have a neutral view on this?

Technical arguments may be weighed on objective merits, so you don't need to trust me. Go ahead and assume it's not a neutral view; that is fine. If a "friendly adversary" makes the PoW better, that's great.

However, in defense of my personal honor, there are plenty of startup projects to choose from. Even if it "had" to be mining and "had" to be an ASIC, we could focus our attention on other coins. I do honestly believe--and all evidence so far supports this--that specialized mining hardware is inevitable. The Monero community is the most passionate about maintaining decentralization and the original core values of cryptocurrency, and it is the strongest test of this inevitability claim. There would be many easier things to try than propose an ASIC for Monero...

@binaryFate

This comment has been minimized.

Copy link
Contributor

@binaryFate binaryFate commented Oct 10, 2018

A potential solution to the easy-nonce problem was discussed on IRC recently, which is to chain several generation&execution, where the next code generated depends on the output from the execution of the previous one.

This forces an attacker to actually execute the first N randomly-generated code to even know the code generated at step N+1.

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Oct 11, 2018

@binaryFate That might work for small N and be needed if you can't prevent some algos from having quick compile+solvetimes. You can only do it for small N because all nodes for all time have to verify the input to the N+1 algo which means they have to compile and execute all N+1 algorithms. I've been wondering if this is really a "proof of compile" POW rather than "proof of algo execution", and needing to do a sequence of N+1 compiles makes me wonder that even more. What percent of time is spent compiling compared to algo execution?

@tevador

This comment has been minimized.

Copy link
Contributor

@tevador tevador commented Oct 11, 2018

It might be a good solution with 2 chained code gen/exec steps. Longer chains would cause the verification time to be too high or require significant reduction of the average complexity of the program.

I've been wondering if this is really a "proof of compile" POW rather than "proof of algo execution"

It's a proof of program execution. There is no compilation going on. We are using an interpreted engine (XS) and even using a compiled engine (such as the V8) doesn't improve peformance much because each program is only executed once.

@SChernykh

This comment has been minimized.

Copy link
Contributor

@SChernykh SChernykh commented Oct 11, 2018

There is no compilation going on. We are using an interpreted engine (XS) and even using a compiled engine (such as the V8) doesn't improve peformance much because each program is only executed once.

Formally you are correct, but compilation does happen in reality. XS engine parses the code, then compiles it to internal byte-code format and then executes it.

@tevador

This comment has been minimized.

Copy link
Contributor

@tevador tevador commented Oct 11, 2018

True. But the compilation step is not required to calculate the PoW. Someone could in theory directly execute the expression tree as produced by the code generator. If the hash of the Javascript source code wasn't part of the PoW, there would be no need to even generate it.

It's one of the open issues being discussed here: tevador/RandomJS#9 (comment)

@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 11, 2018

This point about skipping any compilation or parsing or anything is what I was trying to say about ASIC’s. All these JS engines do far too much work relative to how an ASIC would operate. x86 generation wouldn’t change that.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Oct 11, 2018

You seem to be assuming that 100% of the algorithm will be implemented on the ASIC. That seems unlikely. It seems more likely to me that the code generator will reside on a host CPU, and its output will be fed to an ASIC for accelerated execution. In that case, you still need at least an IR or bytecode to transmit the code to the ASIC.

@timolson

This comment has been minimized.

Copy link

@timolson timolson commented Oct 11, 2018

You seem to be assuming that 100% of the algorithm will be implemented on the ASIC. That seems unlikely. It seems more likely to me that the code generator will reside on a host CPU, and its output will be fed to an ASIC for accelerated execution. In that case, you still need at least an IR or bytecode to transmit the code to the ASIC.

Assuming the easy nonce problem is sufficiently solved, yes, everything on the ASIC. Sending bytecode in or out of the ASIC is an absolute disaster because now you are bandwidth bound on the IO bus. Only the header needs to be sent during an initialization step once per block. 100% of the hashing algo will be implemented on-chip, even if you need to embed an ARM core, which I don’t think is necessary. Personally, I’d experiment with a custom “JS CPU” before resorting to an ARM license.

@cjdelisle

This comment has been minimized.

Copy link

@cjdelisle cjdelisle commented Nov 26, 2018

After looking at RandomJS and Vitalik's previous work on random executable code, I suspect that what gmax said in https://pastebin.com/spug4a8x is in fact correct and that anything random-code like is going to fall for miners pre-sifting to remove most of the code snippets and only retain the ones which they have optimized circuitry for.

However, I would like to suggest selecting a PoW algorithm based on a value beyond the miner's control (such as block height). If you have n radically different functions, each of which takes approximately the same amount of time to complete on "typical hardware" and the PoW for a block is FUNCTIONS[block_height % *n*], ASIC miners will need to either keep n times the circuitry around or will need to dynamically reconfigure, making them not vastly different from an FPGA.

Ideally these algorithms should be radically different in their demands for hardware, one using almost only floating point math, the next one using purely integer math, one using large amounts of memory, the next one using no memory at all, one requiring significant inter-core bus communication, the next being embarrassingly parallel, etc. The objective here is to make sure that at any given time, a largest possible amount of necessary circuitry remains idle.

This should still be quite easy for GPUs and quickly re-programmable FPGAs to mine. With my limited hardware understanding, I understand that the reason why FPGAs were never massively faster than GPUs for Bitcoin is because their flexibility requires them to have a lot of unused circuitry.


I think the algorithm designer wants to ensure that each transistor on an FPGA gets about 1/n of the usage. If it's not balanced then ASICs will skip certain blocks.

@tevador

This comment has been minimized.

Copy link
Contributor

@tevador tevador commented Nov 26, 2018

@cjdelisle

anything random-code like is going to fall for miners pre-sifting to remove most of the code snippets and only retain the ones which they have optimized circuitry for

This problem was discussed in the RandomJS issue linked above and the solution is to chain multiple "code snippets" so that the output of a previous program becomes the seed of the next program.

This way, "filtering" is virtually impossible because the miner doesn't know which functions will be used in the next program until they have executed the first code snippet.

PoW for a block is FUNCTIONS[block_height % 256]

For a double-spend attack, you would need an ASIC that supports less than 30 of the 256 functions.

I don't think it's feasible to design 256 radically different PoW algorithms.

@hyc

This comment has been minimized.

Copy link
Contributor

@hyc hyc commented Nov 26, 2018

I don't think it's feasible to design 256 radically different PoW algorithms.

If it were feasible, it would be easy to implement in hardware. Self-defeating exercise.

@cjdelisle

This comment has been minimized.

Copy link

@cjdelisle cjdelisle commented Nov 26, 2018

@tevador

the solution is to chain multiple "code snippets" so that the output of a previous program becomes the seed of the next program

This is a good idea, it reduces the effectiveness of sifting by 2**number_of_iterations but then the cost is born by the verifiers (I understand at some point someone will need to actually check that the code does make the claimed output).

For a double-spend attack, you would need an ASIC that supports less than 30 of the 256 functions.

That sounds scary, but if you say "you need 30 different ASICs" it sounds less scary. I'm not really convinced by this logic.

@hyc I think you're missing the point, it's meant to be easy to design the ASIC, but making the entire ASIC run at once requires an advancement in ALU design. Imagine for a moment that every other block required heavy integer math or floating point math to mine. Anyone who figures out how to design a chip where 50% of the chip is not idling at any given time has designed a better ALU...

@tevador

This comment has been minimized.

Copy link
Contributor

@tevador tevador commented Nov 26, 2018

This is a good idea, it reduces the effectiveness of sifting by 2**number_of_iterations but then the cost is born by the verifiers (I understand at some point someone will need to actually check that the code does make the claimed output).

The cost of verification increases linearly, but the cost of "filtering nonces" increases more than exponentially.

you need 30 different ASICs

An ASIC could just have a board with 1 chip per PoW function. It would be still orders of magnitude more efficient than a CPU, just the initial investment would be larger, which would only cause more centralization.

I think an ASIC that implements just 32 "easiest" functions would be economically feasible (you could mine only 1/8 of the blocks, but this would be compensated by much higher efficiency).

@zawy12

This comment has been minimized.

Copy link
Author

@zawy12 zawy12 commented Mar 2, 2019

This article delineates security in POW and shows electrical waste is not part of it. I'l describe a "virtual-POW" using VDF's with stake or hard-drive space.

[ edit: moved all the text to my blog ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.