Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random generator init is very slow #57

Closed
patrikaxelsson opened this issue Sep 21, 2021 · 11 comments
Closed

Random generator init is very slow #57

patrikaxelsson opened this issue Sep 21, 2021 · 11 comments
Assignees
Milestone

Comments

@patrikaxelsson
Copy link

patrikaxelsson commented Sep 21, 2021

With the 68k version of AmiSSL (4.10 tested), the random generator init is very slow, apparently disregarding how fast processor you have. You can however vary the speed of it somewhat by changing the native screenmode - the higher refresh rate, the faster the random generator init runs.

First three runs on an A3000 030@25Mhz, with varying refresh rates from 50-70Hz (PAL, NTSC, Euro72 without VGAOnly). AmiSSL has already been pre-initialized with a run of InitAmiSSLTest, so the first test would not have a distractingly large OpenAmiSSL(). The initialization is taken from the https.c example and the function of interest is RAND_seed():

8.Ram Disk:> sysinfo -p | search vblank 
    19 VBlank Frequency in Hz....................... 50
8.Ram Disk:> pt:InitAmiSSLTest  
OpenLibrary(amisslmaster.library, 4): 0.001344s
InitAmiSSLMaster(16, 0):              0.000067s
OpenAmiSSL():                         0.027808s
InitAmiSSL(0)                         0.000242s
OPENSSL_init_ssl(0020000E, 0):        1.830160s
GenerateRandomSeed(073CACBC, 128):    0.001965s
RAND_seed(073CACBC, 128):             2.269295s
8.Ram Disk:> sysinfo -p | search vblank 
    19 VBlank Frequency in Hz....................... 60
8.Ram Disk:> pt:InitAmiSSLTest  
OpenLibrary(amisslmaster.library, 4): 0.001530s
InitAmiSSLMaster(16, 0):              0.000064s
OpenAmiSSL():                         0.024569s
InitAmiSSL(0)                         0.000238s
OPENSSL_init_ssl(0020000E, 0):        1.812117s
GenerateRandomSeed(073CACBC, 128):    0.001963s
RAND_seed(073CACBC, 128):             1.933553s

If the same thing is run on a comparatively blistering fast UAE in the same manner, the RAND_seed() times are still the same:

 6.Ram Disk:> sysinfo -p | search vblank
    19 VBlank Frequency in Hz....................... 50
6.Ram Disk:> work:InitAmiSSLTest 
OpenLibrary(amisslmaster.library, 4): 0.000001s
InitAmiSSLMaster(16, 0):              0.000000s
OpenAmiSSL():                         0.000507s
InitAmiSSL(0)                         0.000000s
OPENSSL_init_ssl(0020000E, 0):        0.058342s
GenerateRandomSeed(1013863C, 128):    0.000000s
RAND_seed(1013863C, 128):             2.230716s
6.Ram Disk:> sysinfo -p | search vblank
    19 VBlank Frequency in Hz....................... 60
6.Ram Disk:> work:InitAmiSSLTest 
OpenLibrary(amisslmaster.library, 4): 0.000507s
InitAmiSSLMaster(16, 0):              0.000000s
OpenAmiSSL():                         0.001014s
InitAmiSSL(0)                         0.000000s
OPENSSL_init_ssl(0020000E, 0):        0.033483s
GenerateRandomSeed(1013863C, 128):    0.000000s
RAND_seed(1013863C, 128):             1.867976s

Have tested on several different machines with the same result for RAND_seed().

I think the issue is a combination of these two things:

  1. timer.device/UNIT_VBLANK is used for the supposedly 1 microsecond delays, but UNIT_VBLANK gives a minimum delay of 1/refreshrate - https://github.com/jens-maus/amissl/blob/master/openssl/crypto/rand/rand_amiga.c#L121
  2. For unknown reason (not documented), 3/4 of data produced by these delays is discarded, reasonably making this already unreasonably slow process four times slower - https://github.com/jens-maus/amissl/blob/master/openssl/crypto/rand/rand_amiga.c#L188

Imagine using AmiSSL in a command line tool, disregarding how fast amiga you have, it would still have a minimal execution time of ~2s.

AmiSSLTest.zip

@Futaura Futaura self-assigned this Sep 21, 2021
@Futaura
Copy link
Collaborator

Futaura commented Sep 21, 2021

@patrikaxelsson Thanks for taking the time to look at this in so much detail. It's difficult to track this code, as the random stuff has changed through various OpenSSL versions. However, it seems https://github.com/jens-maus/amissl/blob/4.10/openssl/crypto/rand/rand_unix.c#L144 probably explains why @theantony chose to use only 1/4 of the data. I also see that @theantony switched from UNIT_MICROHZ to UNIT_VBLANK to reduce overhead. But, in your example, 128 bytes means 26 iterations and 416 (26 * 16) DoIO() calls.

I guess you took the 128 byte RAND_seed() from the included examples. I need to check, but strictly speaking you probably don't actually need to call this at all. IIRC, it should get seeded on init, but I could be wrong - the random stuff is hard to follow, and it has changed again for OpenSSL 3.0. That said, I probably already did it, but it'll be worth investigating how often OpenSSL reseeds itself automatically, in case it is a bottleneck. The question is how can it be improved - it hasn't really changed since AmiSSL v3.

@patrikaxelsson
Copy link
Author

patrikaxelsson commented Sep 22, 2021

You are correct that AmiSSL will reseed itself even without calling RAND_seed(). Did some experimenting/benchmarking with the AmiGemini source and the total time needed to complete a request will be exactly the same without RAND_seed(), the extra 2.2 seconds (in PAL) is instead spend around SSL_CTX_new() if I am not mistaken.

However, it was more convenient for benchmarking testing to use RAND_init() which causes the exact same thing. I took the init procedure from the include amiga https.c example.

128 calls of 1microsec TR_ADDREQUEST to timer.device/UNIT_VBLANK takes the predicted ~2.56s on all Amigas running PAL, which suggests the number of iterations in AmiSSL should be no more than 2.2/(1.0/50)=110.

Regarding improvements, changing back to UNIT_MICROHZ would be given. C= does document UNIT_VBLANK as the most efficient and stable device for delaying long periods. However I am not sure how that efficiency would be a problem for a low number of iterations (~110), to be blunt it doesn’t sound like the change was really tested as it totally ruined the performance of the entropy generation.

Going by the documentation, UNIT_MICROHZ sounds like the perfect match for this use case:

The microhertz timer also has much more system overhead,
which means accuracy is reduced as the system load increases. It is
primarily useful for short-burst timing for which critical accuracy
      is not required.

Good for short requests (which this surely is at minimum time possible) and potential for variation in accuracy would be good for entropy.

Regarding throwing away 3/4 of the data, I don’t see how not making use of all the data and then just making more measurements really helps entropy.

@Futaura
Copy link
Collaborator

Futaura commented Sep 22, 2021

I think it is more than 128 calls though, as it fills the buffer (20 bytes) 1 byte per TR_ADDREQUEST and only uses 5 bytes from the buffer, as you noticed. I'll stick some debug output in to check this out though.

First, I also want to check your tests with AmiSSL v3, just to make sure I didn't mess something up due to the modifications required by the OpenSSL 1.1.x changes. I know parameters changes from bits to bytes and vice versa at various times. I probably did check it all at the time, but I can't remember!

@Futaura
Copy link
Collaborator

Futaura commented Sep 22, 2021

Ok, so RAND_seed(buf,128) results in rand_pool_bytes_needed(pool, 1 /*entropy_factor*/) returning 32. So, yes, you're right about the iterations - 112 calls to TR_ADDREQUEST. I'm not sure why UNIT_VBLANK was chosen, as it was before my time, and there is not much detail in the history logs. One possible explanation is it adds an extra level of randomness as I think UNIT_MICROHZ uses the EClock, which the routine is already using.

I'll do some more testing to see what I can deduce. I'm currently working on porting OpenSSL 3.0 for AmiSSL v5, but if the random seeding can be improved, I'll certainly hope to get this fixed in another v4 release first (random seeding changed for OpenSSL 3.0 again too).

@theantony
Copy link

I don't remember any context on this change, but found a message from @Futaura in my mail archive from October 10, 2005. Assuming it corresponds to the same code path it may explain why the change was made:

(Initial quote and the last part are from Oliver, the "something wrong" part in the middle is from me)

Milliseconds, I guess. Certainly in the test program I timed it using
clock() and RAND_poll takes 10 seconds on my 060/50MHz. Seems to suck
the cpu power too, since while it is executing, IB's progress bar slows
right down.

Hum, there's something wrong there, it shouldn't be like that :I I'll try
to send you a test program sometimes next week to see what happens if
you get the chance to try it.

I meant to put some debug output in there, and now I have :) Turns out
it's slow because of the UNIT_MICROHZ timer (due to the overhead, adding
a time request of 1 microsecond always takes much longer than that). I
switched it to UNIT_VBLANK and it is much faster.

@Futaura
Copy link
Collaborator

Futaura commented Sep 22, 2021

@theantony Wow - thanks! I obviously completely forgot about that :-) I've now found your replies on this in my THOR archive (predates my switch to Gmail) and have refreshed my memory.

@patrikaxelsson In summary, I built AmiSSL v3 with SAS/C profiling enabled, to find slowdown bottlenecks in AmiSSL v3 (before the 68k version was released), which I was experiencing with IBrowse on my A1200. This highlighted the bn stuff as being a bottleneck for DH connections, which has since been partly addressed recently with some asm optimisations for 68k. The other was that RAND_poll() took 10 seconds to complete with UNIT_MICROHZ and just over 1 second with UNIT_VBLANK, I concluded:

Turns out it's slow because of the UNIT_MICROHZ timer (due to the overhead
adding a time request of 1 microsecond always takes much longer than
that). I switched it to UNIT_VBLANK and it is much faster.

To confirm this, I've just now tried changing the code back to UNIT_MICROHZ and running your test on my A1200 (60Hz vblank). RAND_seed() took 10.5 seconds, versus 1.9 seconds for UNIT_VBLANK. So, as suspected, there was a good reason for the original change and adequate testing, even if it took Andrija to remind me why :-). So, with UNIT_VBLANK having a minimum delay of 0.02 seconds (at 50Hz) and UNIT_MICROHZ having a minimum delay of over 4 times that (almost 0.1 seconds), obviously UNIT_MICROHZ must be avoided.

All that said, I'm still going to re-test this. I just want to make sure the code is behaving as intended (compared to other platforms) and I didn't break a calculation somewhere when porting over OpenSSL changes in this area. I'll also think about how this code can be improved - I mean 10 seconds to 2 seconds was a great speed boost at the time, but perhaps we can do better still.

@patrikaxelsson
Copy link
Author

patrikaxelsson commented Sep 22, 2021

It is not reasonable that UNIT_MICROHZ would have a higher minimal time than UNIT_VBLANK because of overhead. That would mean that the calculations it is doing to handle the delay would need more CPU time than the time of a screen refresh, which imho would be an insane amount for crunching for such task.

Also, such an inefficiency would not align with the C= documentation where they describe that UNIT_MICROHZ is suitable for short-burst timing - actually in case of such inefficiency, its entire existence would be pointless.

Did a test of of the time for 112 1microsecond delays for all the relevant timer.device units.

A3000 030@25Mhz:

8.Ram Disk:> sysinfo -p | search vblank 
    19 VBlank Frequency in Hz....................... 50
8.Ram Disk:> pt:TimerMinDelayTest 
112 TR_ADDREQUEST UNIT_MICROHZ 0.058045s
112 TR_ADDREQUEST UNIT_VBLANK  2.225692s
112 TR_ADDREQUEST UNIT_ECLOCK  0.048567s
8.Ram Disk:> sysinfo -p | search vblank 
    19 VBlank Frequency in Hz....................... 60
8.Ram Disk:> pt:TimerMinDelayTest  
112 TR_ADDREQUEST UNIT_MICROHZ 0.051184s
112 TR_ADDREQUEST UNIT_VBLANK  1.882906s
112 TR_ADDREQUEST UNIT_ECLOCK  0.048717s

UAE (very fast):

6.Ram Disk:> sysinfo -p | search vblank
    19 VBlank Frequency in Hz....................... 50
6.Ram Disk:> work:TimerMinDelayTest 
112 TR_ADDREQUEST UNIT_MICROHZ 0.001522s
112 TR_ADDREQUEST UNIT_VBLANK  2.206646s
112 TR_ADDREQUEST UNIT_ECLOCK  0.842928s
6.Ram Disk:> sysinfo -p | search vblank
    19 VBlank Frequency in Hz....................... 60
6.Ram Disk:> work:TimerMinDelayTest 
112 TR_ADDREQUEST UNIT_MICROHZ 0.001522s
112 TR_ADDREQUEST UNIT_VBLANK  1.862622s
112 TR_ADDREQUEST UNIT_ECLOCK  0.896442s

Something else must be happening. UNIT_MICROHZ does not manage 1microsecond delays, but I did not expect that as just the context switch most likely takes more than 1micro, but at 1/38 of the UNIT_VBLANK min delay on the 030 it is way faster.

Attaching code, with this new test included.
AmiSSLTest_20210922.zip

@Futaura
Copy link
Collaborator

Futaura commented Sep 22, 2021

Damn...

timer.device 50.22 (07/12/2002) [OS4.0/68k dev]
112 TR_ADDREQUEST UNIT_MICROHZ 10.414067s
112 TR_ADDREQUEST UNIT_VBLANK  1.890136s
112 TR_ADDREQUEST UNIT_ECLOCK  10.410630s
timer.device 39.4 (29/07/1992) [OS3.0/OS3.1]
112 TR_ADDREQUEST UNIT_MICROHZ 0.020158s
112 TR_ADDREQUEST UNIT_VBLANK  1.900106s
112 TR_ADDREQUEST UNIT_ECLOCK  0.018455s
timer.device 46.1 (04/08/2019) [OS3.2]
112 TR_ADDREQUEST UNIT_MICROHZ 0.019760s
112 TR_ADDREQUEST UNIT_VBLANK  1.898796s
112 TR_ADDREQUEST UNIT_ECLOCK  0.017195s
timer.device 53.5 (13/11/2020) [OS4.1/PPC]
112 TR_ADDREQUEST UNIT_MICROHZ 0.001988s
112 TR_ADDREQUEST UNIT_VBLANK  0.001858s
112 TR_ADDREQUEST UNIT_ECLOCK  0.001835s

That's on my A1200 BPPC 68060/50MHz (VBLANK 60Hz) and my A1XE for the latter, just for comparison (OS4/PPC has no UNIT_VBLANK - it uses UNIT_MICROHZ instead).

Sorry @theantony - seems I was wrong 15 years ago, and nobody noticed until now. Obviously, something in the timer.device C rewrite was suboptimal on 68k, before development switched exclusively to PPC after 50.22. I never imagined the changes could have made such a big difference. I was rewritten again since then too. I have been using 50.22 on my main A1200 boot since 2002, along with lots of other OS4 bits (I don't want to lose anti-aliased text rendering in IBrowse, for example). I've switched out my timer.device for the 46.1 version now and I really must look at the other components too at some point.

I will switch it back to UNIT_MICROHZ for AmiSSL 4.11. I will still do some further checking on the rand code to make sure everything is in order though.

@Futaura Futaura added this to the future version milestone Sep 22, 2021
@patrikaxelsson
Copy link
Author

Good find! Looking forward to AmiSSL 4.11 :).

Futaura added a commit that referenced this issue Sep 25, 2021
…unreleased 68k timer.device v50 (#57).

- Reworked code to be more readable and legacy entropy generation to correctly use a factor of 4 (#57).
- Switched from SHA-1 to SHA-256.
@Futaura
Copy link
Collaborator

Futaura commented Sep 25, 2021

Ok, so you were right that the RAND_pool_add() call was also not correct (has been since AmiSSL 4.3). As with Andrija's original code from v3, this was supposed to be using an entropy factor of 4, so only 2 bits of entropy from every 8 bits of random data generated. Therefore, it is correct that the code should be generating 128 bytes (opposed to 32). But, because the length argument was incorrect (entropy argument was correct), the code was effectively running at an entropy factor of 1. This has now been fixed and it still generates 128 bytes, but obviously doesn't discard 75% of this any longer, so no speed gain at all in that sense (the switch to UNIT_MICROHZ is enough I think).

I've have reworked the code a little to make the function of it a little clearer, so it makes more sense when reading it. I also took the opportunity to switch from SHA-1 to SHA-256, which did not appear to slow RAND_seed() down at all using your test program on my 68060 (in fact, it was fractionally faster, if anything).

Other than that, I'm happy with the implementation - it is very similar to implementations for other platforms and generic code elsewhere in OpenSSL, so happy to leave the rest as-is as there doesn't seem to be a better way on OS3.

@patrikaxelsson
Copy link
Author

Sounds great! Do you have any possibility to test it on a more modest machine?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants