Need help debugging 05-test_rand.t #24211
Replies: 17 comments 16 replies
-
All wrap.pl does is set up OPENSSL_* environment variables to point at the right places in the build tree so the tests exercise the local build. So depending on the test, you can just set those manually within gdb. Usually what I do is run the tests via the normal framework with V=1 and look for the wrap lines:
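For example, something like this (the test target and the paths in the wrap line are illustrative; your tree will print its own):

```
$ make test TESTS=test_rand V=1
...
../util/wrap.pl ../test/rand_test
```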
Then pick the test you want and load it into gdb, adjusting the paths in the executable accordingly:
Then, at the gdb prompt, enter the appropriate environment variables and arguments.
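Something along these lines (the variable names and paths here are illustrative; copy whatever the wrap lines from the V=1 output actually show for your build):

```
(gdb) set environment OPENSSL_ENGINES = /path/to/openssl/engines
(gdb) set environment OPENSSL_MODULES = /path/to/openssl/providers
(gdb) set args <arguments copied from the wrap line>
(gdb) run
```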
It's a bit verbose, so I usually just dump that into a ~/.gdbinit file
-
@nhorman Thanks, but I have to get in and try to debug why failures are occurring at
-
I'm not sure I follow. The instructions above should apply to drbgtest in the same way as the other tests. Is there a particular bit of output you are trying to get at? I don't know if this helps, but when I'm fighting with threads in gdb, I'll usually use a conditional breakpoint or an ignore directive to stop after a certain number of breakpoint hits, and then set scheduler-locking to step, to avoid further context switches, so that I can single-step through the code without gdb swapping tasks on me.
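Concretely, that gdb recipe looks something like this (the breakpoint location and the hit count are placeholders to adapt to your situation):

```
(gdb) break some_function           # placeholder: wherever you want to stop
(gdb) ignore 1 6                    # skip the first 6 hits of breakpoint 1
(gdb) run
(gdb) set scheduler-locking step    # only the current thread runs while you step
(gdb) step
```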
-
This is what I get on the first iteration. Some interpretation would be appreciated:
-
OK, so starting with the top error: that failure is just an integer comparison of the return value of rand_bytes against the variable expect_success. Looking at the next line down, we see that test_drbg_reseed failed (which is the function that was called on line (1)). We can infer from that that expect_success held the value 1 (as line 392 is right before the call and passes 1 as the expect_success parameter), and so rand_bytes must have returned a value other than 1. The remaining lines can likely be ignored (at least for now), as they appear to be subsequent failures cascading from the failure in rand_bytes. You're in a bit of luck here (maybe): this test isn't multithreaded, it's multiprocess (test_drbg_reseed_in_child() calls fork; the parent process just blocks until the child exits, while the child does all the work). That will make tracing a bit easier, as you won't have to contend with context switches between threads. What you do need to do, however, is inform gdb that you want to debug the child process after the call to fork. The command for that is:
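From context, the gdb setting meant here is presumably:

```
(gdb) set follow-fork-mode child
```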
That will allow you to debug the parent process (which you will be in when you start the program under gdb); when it calls fork, gdb will switch to the child process, which is where the failure is really occurring. As for the root cause, I can't begin to guess with the information here, but if I were debugging this, I would do the above and start single-stepping through the rand_bytes function to see if you can catch a return value that isn't 1.
-
Multithreaded == multiprocess on NonStop in this threading model. On the 7th call to
ends up in this path:
On the second call to
in the following trace:
-
One of our team members pointed out that the system has 6 IPUs available for threads. It is interesting that the failure occurs on the 7th pass through
This always breaks on the first hit at in the parent in
-
In the parent or child?
-
Aside: @nhorman Your help has been invaluable. Thank you.
-
On a side note: all PUT thread-related tests on 3.3 are now breaking on NonStop as of e5b1c72 - independent of this discussion.
-
This seems plausible as a cause. Each logical CPU has 6 IPU cores. The first 6 hits on the DRBG seem to work, but the 7th does not. From what I understand of the threading implementation, the 7th fork may be going into a new process instance, so it might get a fresh DRBG. The implementation of this new threading model is also very hairy and not well understood outside the team that implemented it. 6 seems to be a magic number, so I am trying to arrange to have a couple of cores turned off to see whether that changes the test results. NonStop uses an MPP architecture, not SMP, so anything to do with memory sharing can get really weird: we get modify access within a logical CPU, but if we get forced onto a different CPU, I think the memory becomes read-only (that may not be true now, but it was before this threading model was developed). My test machine appears to have 8 CPUs with 6 cores each - at least it did on Friday.
-
The debugger, which is based on gdb, does not actually cross thread boundaries on this box, so I have to manually invoke gdb on the different thread after I figure out what the fork's pid is, and force it into debug.
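With stock gdb, attaching after the fact usually looks like this (the pid and path are placeholders; your NonStop-based debugger may differ):

```
$ ps -ef | grep rand_test          # find the forked child's pid
$ gdb /path/to/test/binary 12345   # attach to pid 12345
(gdb) continue
```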
-
Three things came back.
-
I think it was pthread_getspecific(). The Thread DEV team has decided to skip this test, but I am concerned about how critical it is. Can we legitimately skip this test without worry?
-
I'm not sure what the fallback lock actually means in this context.
-
@paulidale Would you be able to point me a little deeper on this suggestion? I would like to try it, but I'm not sure which feature detection is relevant. Help please.
-
@rsbeckerca I think what @paulidale is suggesting is that you define BROKEN_CLANG_ATOMICS in your build to artificially force functions like CRYPTO_atomic_[add|load|etc] to use the fallback CRYPTO_THREAD_write_lock-based implementation of those functions. It's probably a good data point to try, but based on the debugging info you've provided, I'm hypothesizing that it won't affect your behavior, as the failure seems to be related to fetching thread-local data. Given that you've said your new thread model hasn't fully implemented thread-local data yet, I think the problem more likely lies there.
-
I'm having problems in 05-test_rand.t in a new threading model. I need to be able to run the code using gdb directly. Is there an easy way to run the test directly without the wrappers?