Need help debugging 05-test_rand.t #24211
Replies: 17 comments 16 replies
-
All wrap.pl does is set up OPENSSL_* environment variables to point at the right places in the build tree so the tests exercise the local build. So depending on the test, you can just set those manually within gdb. Usually what I do is run the tests via the normal framework with V=1 and look for the wrap lines:
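For example, something like this (the test target and the paths in the wrap line are illustrative; your tree will print its own):

```
$ make test TESTS=test_rand V=1
...
../util/wrap.pl ../test/rand_test
```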
Then pick the test you want and load it into gdb, adjusting the paths in the executable accordingly:
Then, at the gdb prompt, enter the appropriate environment variables and arguments.
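Something along these lines (the variable names and paths here are illustrative; copy whatever the wrap lines from the V=1 output actually show for your build):

```
(gdb) set environment OPENSSL_ENGINES = /path/to/openssl/engines
(gdb) set environment OPENSSL_MODULES = /path/to/openssl/providers
(gdb) set args <arguments copied from the wrap line>
(gdb) run
```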
It's a bit verbose, so I usually just dump that into a ~/.gdbinit file
-
@nhorman Thanks, but I have to get in and try to debug why failures are occurring at
-
I'm not sure I follow. The instructions above should apply to drbgtest in the same way as the other tests. Is there a particular bit of output you are trying to get at? I don't know if this helps, but when I'm fighting with threads in gdb, I'll usually use a conditional breakpoint or an ignore directive to stop after a certain number of breakpoint hits, and then set scheduler-locking to step, to avoid further context switches, so that I can single-step through the code without gdb swapping tasks on me.
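Concretely, that gdb recipe looks something like this (the breakpoint location and the hit count are placeholders to adapt to your situation):

```
(gdb) break some_function           # placeholder: wherever you want to stop
(gdb) ignore 1 6                    # skip the first 6 hits of breakpoint 1
(gdb) run
(gdb) set scheduler-locking step    # only the current thread runs while you step
(gdb) step
```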
-
This is what I get on the first iteration. Some interpretation would be appreciated:
-
OK, so starting with the top error: that failure is just an integer comparison of the return value of rand_bytes against the variable expect_success. Looking at the next line down, we see that test_drbg_reseed failed (which is the function that was called on line (1)). We can infer from that that expect_success held the value 1 (as line 392 is right before the call and passes 1 as the expect_success parameter), and so rand_bytes must have returned a value other than 1. The remaining lines can likely be ignored (at least for now), as they appear to be subsequent failures cascading from the failure in rand_bytes. You're in a bit of luck here (maybe): this test isn't multithreaded, it's multiprocess (test_drbg_reseed_in_child() calls fork; the parent process just blocks until the child exits, while the child does all the work). That will make tracing a bit easier, as you won't have to contend with context switches between threads. What you do need to do, however, is inform gdb that you want to debug the child process after the call to fork. The command for that is:
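From context, the gdb setting meant here is presumably:

```
(gdb) set follow-fork-mode child
```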
That will allow you to debug the parent process (which you will be in when you start the program under gdb); when it calls fork, gdb will switch to the child process, which is where the failure is really occurring. As for the root cause, I can't begin to guess with the information here, but if I were debugging this, I would do the above and start single-stepping through the rand_bytes function to see if you can catch a return value that isn't 1.
-
Multithreaded == multiprocess on NonStop in this threading model. On the 7th call to
ends up in this path:
On the second call to
in the following trace:
-
One of our team members pointed out that the system has 6 IPUs available for threads. It is interesting that the failure occurs on the 7th pass through
This always breaks on the first hit at in the parent in
-
In the parent or child?
-
Aside: @nhorman Your help has been invaluable. Thank you.
-
On a side note: all PUT thread-related tests on 3.3 are now breaking on NonStop as of e5b1c72 - independent of this discussion.
-
This seems plausible as a cause. Each logical CPU has 6 IPU cores. The first 6 hits on the DRBG seem to work, but the 7th does not. From what I understand of the threading implementation, the 7th fork may be going into a new process instance, so it might get a fresh DRBG. The implementation of this new threading model is also very hairy and not well understood outside the team that implemented it. 6 seems to be a magic number, so I am trying to arrange to have a couple of cores turned off to see whether that changes the test results. NonStop uses an MPP architecture, not SMP, so anything to do with memory sharing can get really weird: we get modify access within a logical CPU, but if we get forced onto a different CPU, I think the memory becomes read-only (that may not be true now, but it was before this threading model was developed). My test machine appears to have 8 CPUs with 6 cores each - at least it did on Friday.
-
The debugger, which is based on gdb, does not actually cross thread boundaries on this box, so I have to manually invoke gdb on the different thread after I figure out what the fork's pid is, and force it into debug.
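With stock gdb, attaching after the fact usually looks like this (the pid and path are placeholders; your NonStop-based debugger may differ):

```
$ ps -ef | grep rand_test          # find the forked child's pid
$ gdb /path/to/test/binary 12345   # attach to pid 12345
(gdb) continue
```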
-
Three things came back.
-
I think it was pthread_getspecific(). The Thread DEV team has decided to skip this test, but I am concerned about how critical it is. Can we legitimately skip this test without worry?
-
I'm not sure what the fallback lock actually means in this context.
-
@paulidale Would you be able to point me a little deeper on this suggestion? I would like to try it, but I'm not sure which feature detection is relevant. Help please.
-
@rsbeckerca I think what @paulidale is suggesting is that you define BROKEN_CLANG_ATOMICS in your build to artificially force functions like CRYPTO_atomic_[add|load|etc] to use the fallback CRYPTO_THREAD_write_lock-based implementation of those functions. It's probably a good data point to try, but based on the debugging info you've provided, I'm hypothesizing that it won't affect your behavior, as the failure seems to be related to fetching thread-local data. Given that you've said your new thread model hasn't fully implemented thread-local data yet, I think the problem more likely lies there.
-
I'm having problems in 05-test_rand.t in a new threading model. I need to be able to run the code using gdb directly. Is there an easy way to run the test directly without the wrappers?