Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full recovery of internal state #14

Closed
Sc00bz opened this issue Mar 27, 2023 · 23 comments
Closed

Full recovery of internal state #14

Sc00bz opened this issue Mar 27, 2023 · 23 comments

Comments

@Sc00bz
Copy link

Sc00bz commented Mar 27, 2023

Blog: https://tobtu.com/blog/2023/3/breaking-xor-shift-prng/

Code: https://github.com/Sc00bz/break-srandom

You should probably update the read me to state it's an insecure PRNG.

@atoponce
Copy link

Solid work Steve!

@josenk
Copy link
Owner

josenk commented Mar 28, 2023

I wouldn't mind double checking this work. How about a Makefile and instructions to run the "attack"?

Bug 1 & 3 is pretty simple fix. Spinlocks are pretty normal stuff in kernel code. It's the best way to to protect code that should be run by only 1 thread. I was not able to reproduce the problem of a 'block forever'. Do you have an alternative solution?

@josenk
Copy link
Owner

josenk commented Mar 28, 2023

BTW: I do highly appreciate the hard work you put into this.

@EL-File4138
Copy link

I wouldn't mind double checking this work. How about a Makefile and instructions to run the "attack"?

Bug 1 & 3 is pretty simple fix. Spinlocks are pretty normal stuff in kernel code. It's the best way to to protect code that should be run by only 1 thread. I was not able to reproduce the problem of a 'block forever'. Do you have an alternative solution?

Clone the repo, and click attack/test.html. If your system installed your own work, It will be detected. Otherwise, a true CSPRNG will not trigger the alarm.
Distinguishable from a CSPRNG means your randomness is not qualified for cryptographical solutions. This is already called a break.

If this is your required instruction and you can reproduce it, please update your README so that downstream developers (especially some custom android kernel) will receive alerts.

@atoponce
Copy link

atoponce commented Mar 28, 2023

It appears that attack/test.html only works in Chrome/Chromium. From my testing, it fails in Firefox and Brave, which means they're probably using the getrandom() system call to get randomness for window.crypto.getRandomValues() rather than reading /dev/urandom directly.

@josenk
Copy link
Owner

josenk commented Mar 28, 2023

@Sc00bz Are you willing to peer review v2.0?

@roycewilliams
Copy link

Unfortunately, the overall effect of @Sc00bz' work isn't merely "here are some bugs you should fix". Rather, it calls into question whether the entire project is safe - or even necessary.

The bar of whether something should be used as a replacement for /dev/urandom is much higher than a couple of bugfixes.

@josenk
Copy link
Owner

josenk commented Mar 29, 2023

LOL, if everyone abandoned and deleted every project that had an exploit, there would be nothing on the web. If you feel it's useless, unstar this repo and move on... Traffic insights of this project says otherwise, so other people since that past 8 years have found some use from it.

I'll assume you know that going from 1.x to 2.x is NOT merely bug fixes... If Sc00bz is unwilling to test/review v2.0, then I'll still try to improve this project without his input.

@Sc00bz
Copy link
Author

Sc00bz commented Mar 29, 2023

Spinlocks are pretty normal stuff in kernel code. It's the best way to to protect code that should be run by only 1 thread. I was not able to reproduce the problem of a 'block forever'. Do you have an alternative solution?

Did you run dd if=/dev/srandom of=/dev/null bs=1M in 17+ times or write a script to do the same? Also this only will work in standard mode because UHS has 32 arrays and only 16 can be locked currently.

You don't need the spin lock. Just lock around these lines because update_sarray*() locks:

srandom/srandom.c

Lines 359 to 364 in e4cb549

memcpy(new_buf + (Block * 512), prngArrays[arraysPosition], 512);
#if ULTRA_HIGH_SPEED_MODE
update_sarray_uhs(arraysPosition);
#else
update_sarray(arraysPosition);
#endif

Or on line 339 unlock and then lock again.

srandom/srandom.c

Lines 336 to 341 in e4cb549

while ((ArraysBusyFlags & 1 << arraysPosition) == (1 << arraysPosition)) {
arraysPosition += 1;
if (arraysPosition >= numberOfRndArrays) {
arraysPosition = 0;
}
}

Also I didn't mention this in the blog because it's a small thing but you need to lock on the work thread. It's possible for seed_PRND_*() to get undone by another thread.

I'm not going to review v2 unless there's crypto. I'd suggest using ChaCha8 (that's 8 vs 20 rounds. It's weaker but should be fine) and do 4x, 8x, or 16x in parallel using SIMD (SSE2, AVX, AVX2, and AVX-512 for x86 and NEON for ARM). Don't bother with endianness or [de]interleaving the instances because it doesn't matter for a PRNG. For seeding/reseeding, I'd suggest doing what /dev/urandom does unless you can grab from that or getrandom(). I did something like this in 2008 with SHA-256 and SSE2.

Optionally you could use VAES instead when it's available, but you'll need benchmarks to see if that's faster. You could use a spin lock so multiple threads can work at the same time but on different states. Since the current spin lock does very little for performance. Also you don't need to randomly select. Just rotate through them all. Also you should not throwaway generated random data. Mark how much in the buffer is used and refill it when there's nothing left. Since a lot of calls are going to be for nonce/key sized data 12-32 bytes.

@atoponce
Copy link

atoponce commented Mar 29, 2023

I'd suggest using ChaCha8 (that's 8 vs 20 rounds. It's weaker but should be fine)

This is a valid suggestion provided by recent cryptanalysis. According to Jean-Philippe Aumasson, ChaCha8 is still cryptographically secure, while ChaCha20 is over-cautious:

5.3 How many rounds?
Our goal is to propose numbers of rounds for which we have strong confidence that the algorithm will never be wounded, let alone broken, using the terminology defined in §4.4. Based on the previous research and our cryptanalysis experience, in particular as related to these algorithm (sic), we propose the following:

  • AES: ...
  • BLAKE2: ...
  • ChaCha: 8 rounds instead of 20 (that is, ChaCha8), yielding a 2.5× speed-up.
  • SHA-3: ...

These suggested numbers of rounds are probably imperfect choices in terms of consistency, yet the security margins of these four families of algorithms are more consistent with these new rounds numbers than with their original ones. Note that AES’ initial security margin was the thinnest of all, not because it’s the older design, but because of the initial not-overconservative choice of rounds vs. best known attack. AES is thus the primitive for which we propose the smallest change.

Because the current Linux kernel RNG is based on ChaCha20 and provides 400+ MBps throughput with 64k chunk sizes (according to my benchmark in #13), and knowing that ChaCha8 will provide a 2.5× speed up, we can expect close to 1+ GBps throughput. While this doesn't reach the throughput of xorshift64, it's cryptographically secure and still provides incredible performance.

@atoponce
Copy link

LOL, if everyone abandoned and deleted every project that had an exploit, there would be nothing on the web. If you feel it's useless, unstar this repo and move on

No one is suggesting you abandon the project (that I've seen so far). Instead we're advising you to either move away from xorshift64 to a cryptographically secure primitive (EG, ChaCha8), or remove the claims in the README.md that it's secure as well as the advice to replace /dev/urandom.

Traffic insights of this project says otherwise, so other people since that past 8 years have found some use from it.

I find value in this project as a fast RNG in kernel space (even though it might be a better fit in user space). I initially stumbled on this project researching the historic OpenBSD /dev/{a,p,s}random devices. I was intrigued when I came here, but was dismayed when I read the README.md followed by reading the source code.

Another RNG Linux kernel module that I stumbled on is https://github.com/Error916/LFSR_module which uses a 128-bit linear feedback shift register, but the performance is terrible. It's also (obviously) insecure, but being as slow as it is reduces my interest in the project.

Anyway, your project is great, we just want the security claims to be accurate.

@josenk
Copy link
Owner

josenk commented Mar 29, 2023

Thanks for all the positive feedback. The chacha8 seems like a good idea. I'll implement it in a dev branch and see how it performs.

@Sc00bz
Copy link
Author

Sc00bz commented Mar 29, 2023

the current Linux kernel RNG is based on ChaCha20

I'm pretty sure they use normal ChaCha20. Which can only use 128 bit SIMD. AVX2 and AVX-512 are 256 bit and 512 bit SIMD respectively. My suggestion would be 2-4x faster on top of the 2.5x. Since most/all current x86 CPUs have at least AVX2. Also the speed up from not shuffling the state in registers because it's registers of 4, 8, or 16 ChaCha states. So it's at least 5x faster with AVX2 and 10x with AVX-512.

Note that's for actual AVX2 and AVX-512. AMD has put out CPUs with those instruction sets but are working on half the SIMD width at a time. It's faster than half width SIMD by a little but not 2x. AMD Zen 2 moved from this to full width AVX2 and AMD Zen 4 added half width AVX-512.

@atoponce
Copy link

So the potential is there to outperform the existing xorshift64 implementation? Win-win!

@josenk
Copy link
Owner

josenk commented Mar 30, 2023

Personally, I don't think ChaCha8 will outperform xorshift... We'll need to wait until it's implemented to see how it all performs.

BTW: I know there's probably not much interest (in this thread in particular), but I just pushed a totally revamped v2beta using updated PRNGS and implementing fixes and recommendations.
I replace SplitMix with wyhash and I replaced an old xorshift128 with xoshiro256**.
I added many new things like random buffer selection (instead of sticking to one for the entire request), random shuffle, random selection of PSRNG and a totally separate PRNG that is "internal to the module only" to make all these random events unexposed. Fixed the locking, lots of cleanup and simplification.
The README is not complete and ChaCha8 is not implemented yet. I decided to use ChaCha8 in standard mode and UHS will use this updated XorShift if it's good enough.
It's about 50% slower then srandom 1.x.

@Sc00bz
Copy link
Author

Sc00bz commented Mar 30, 2023

"ChaCha8" in the second to last sentence caught my eye. So I missed the multiple times where you said you didn't implement it and went looking for it. Anyway I accidentally broke version 2's UHS mode. I'm also not sure how you were able to benchmark it because it should just segfault. Since you are writing to unallocated memory and overwriting a pointer with random. Also this won't compile on 32 bit architectures and you didn't fix an off by one bug—right I never mentioned it because it's a minor bug that didn't really matter until now. I guess it still doesn't really matter. Anyway looking forward to your ChaCha8 implementation.

@josenk
Copy link
Owner

josenk commented Mar 30, 2023

OK, now I'm going to need a bit of help... I want to make sure this is solid. I found an implementation of chacha20 that I was able to put into kernel code. It seems to work, but runs about 80MB/s slower than urandom. (I guess this would be because it's using the misc-device and not optimized).

If anyone here is able to convert the chacha20 to chacha8, it would be appreciated... (and fix the bug, you just pointed out)

https://github.com/josenk/srandom/tree/v2beta

I'm assuming once chacha20 is modified to chacha8, we should get close to the goal speeds.

@josenk
Copy link
Owner

josenk commented Mar 31, 2023

OK, I think I figured it out... I found here a reference; Perform the ChaCha rounds in sets of two, Column round and Diagonal round

So, I just need to simply change line 675 from 10 to 4, to make it from chacha20 to chacha8.

https://github.com/josenk/srandom/blob/v2beta/srandom.c#L675

@Sc00bz
Copy link
Author

Sc00bz commented Apr 1, 2023

I basically rewrote the whole thing. ChaCha8 uses scalar, SSE2, SSSE3, AVX, XOP, AVX2, AVX-512, and NEON in 160 lines. It runs with 16 states of ChaCha and outputs 1 KiB per call. I need to do some testing. I want to make sure XOP and AVX-512 work but I'd need to find something that runs those. Currently XOP is commented out because the GCC documentation is lacking. Also I don't think AMD has made a CPU with XOP in 7 years and Intel never implemented it. Oh right I wanted to do a condition variable but still haven't looked up how that works in a Linux kernel module. I'll probably do a pull request tomorrow.

I have three methods for reseeding:
Each set of states reseed after 4 TiB with get_random_bytes().
Each set of states reseed after 4 TiB - 1 KiB with the next 1 KiB.
Each set of states repeat output after 16 ZiB (2**74 bytes).

@Sc00bz
Copy link
Author

Sc00bz commented Apr 4, 2023

Hey so I ran into an issue with Linux and fixed it. There is still one small thing, apparently a compile feature in GCC is C++ only: __attribute__((target("..."))).

I can add the runtime test for which instructions sets are available but the number of files goes from 1 to 7 because each file needs to be compiled with -msse2, -mssse3, etc. Unless you know of a better way to do this. I think it can still be 1 file if there are defines like __AVX2__ for all of these SSE2, SSSE3, AVX, XOP, AVX2, and AVX-512. But I don't know how to detect those features in a makefile and add the appropriate flags.

I'll try to figure something out tomorrow. Also I wasn't able to test XOP or AVX-512.

@josenk
Copy link
Owner

josenk commented Apr 4, 2023

There are defines for cpu flags...

https://stackoverflow.com/questions/28939652/how-to-detect-sse-sse2-avx-avx2-avx-512-avx-128-fma-kcvi-availability-at-compile

Is this what you're looking for??? Creating compile time option flags for the detected cpu flags? (This list is not complete and is just for an example)

OPTS += $(shell cat /proc/cpuinfo |grep -q "avx " && echo "-mavx")
OPTS += $(shell cat /proc/cpuinfo |grep -q "avx2" && echo "-mavx2")
OPTS += $(shell cat /proc/cpuinfo |grep -q "avx512" && echo "-mavx512")
OPTS += $(shell cat /proc/cpuinfo |grep -q "sse " && echo "-msse")
OPTS += $(shell cat /proc/cpuinfo |grep -q "sse2" && echo "-msse2")
OPTS += $(shell cat /proc/cpuinfo |grep -q "sse3" && echo "-msse3")
OPTS += $(shell cat /proc/cpuinfo |grep -q "sse4" && echo "-msse4")

@josenk
Copy link
Owner

josenk commented Apr 10, 2023

Scoobz How are things working out?

@josenk
Copy link
Owner

josenk commented Jun 12, 2023

2.x has been released, so closing this issue.

@josenk josenk closed this as completed Jun 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants