[3.2.0-beta1] quicapitest hangs for NonStop thread (SPT) builds #22588
Can you run the quicapitest with verbose output and paste the output:
Nothing particularly informative on this. It ran through completely:
So this seems to show quicapitest not hanging, but running to completion. Your initial description, though, says that it hangs. Is it only hanging when run as part of the complete test suite?
When I run the full test suite, it does hang pretty consistently. quic_tserver is more random. The hang is much more likely on ia64 than on x86, potentially because of timing (x86 is 4x faster).
There is a current hang (in progress) of
So if you run the verbose test command above repeatedly, can you induce a hang?
Not on the x86. I will try again on the ia64... stay tuned. Going for a clean build first.
Can you attach a debugger to the process when it's hung and provide a backtrace for the running threads?
The trace shows that the process is waiting on a thread event. Basically select, but in the OS common runtime. Not at all informative. I will post what I can when I get the ia64 build to finish.
Here is a failure trace. It actually did not hang on ia64 for this run - although I had to wait 15-20 minutes. On x86, this ran in under 10 seconds:
So, understanding that it didn't hang strictly speaking, I would still find it educational to see, if you could provide it, the backtrace of the stalled process, just to understand what event/data we are waiting on.
The trace, for what it is worth, is:
The only file error at the time is a TCP socket error.
Killing the hung process (when run in a full suite test, which hangs, instead of a single test, which does not) resulted in:
I had a hang this morning from a clean build:
Same stack trace as above. The hung process is waiting for input, presumably from perl. We have seen similar situations previously when perl did not set up the communication structure correctly.
As of commit 28932ab, the hang is easier to reproduce on both the x86 and ia64 platform variants.
The stack trace you supplied here looks to me (based on the presence of the call to PROCESS_WAIT_) like the perl recipe is just waiting for the subordinate process (quic_multistream_test) to complete. It's the backtrace of that process that would be more telling as to the issue. When the hang occurs, can you find the quic_multistream_test process, attach to it, and backtrace that? Or are you indicating that the quic_multistream_test process has already exited, but somehow the perl script is still waiting on it (i.e. a race in the perl test utils code somewhere)?
The trace is from the quic_multistream_test process, not perl. The trace is just what SPT threading looks like when it goes into a wait state for a thread to wake up based on some input. As I said, not particularly useful, but it is what SPT does on the platform.
Apologies, I'm not used to seeing NonStop traces, and assumed it came from the parent process. That said, I may see the problem. quic_multistream_test uses several (under the covers) pthread_cond/pthread_mutex calls, and appears to have reversed the order of mutex_unlock and cond_signal. I'll have a PR for you to test shortly.
Ready to test when you are. The situation seems to go beyond
I'll post the PR for the multistream test immediately to get you started, and amend it if I find more instances of the issue, or anything similar.
Please check
quic_multistream test was issuing a signal on a condvar after dropping the corresponding mutex, not before, leading to potential race conditions in the reading of the associated data Fixes openssl#22588
My interpretation is that there are two threads - one real (a.k.a.
I should note that if this ultimately turns out to be an SPT defect, we are stuck and may have to back away from that thread model for 3.2. Getting fixes to that thread model on NonStop is going to be like pulling teeth.
Yeah, that was my plan, standby.
OK, the quick-dgram-test_loop branch is updated to instrument file_gets; it should dump out what we're reading. If you can, please test at your convenience, thanks.
Here is what I have. It's long, so attached instead. Also very curious that it did not get through the FIPS initialization.
Thank you! So that's... odd. It looks like we iterated over each line in apps/openssl.cnf until line 390, which got returned, and the next call to fgets blocked, even though there is more data in the file (7 more lines, specifically). I don't see anything particularly different about the next expected line that we're blocking on, save for the fact that it's just a newline (which we've read several of previously in the trace). Just to confirm: the apps/openssl.cnf file you're using is the one at the head of the git tree, without modification, correct? It should end with this sequence of lines:
What I've got is this, which is different (I did not modify it myself). It is a complete file. Looks like the EOF was either not detected or ignored.
Yeah, I would agree, that seems like it was the last line of the file in your environment ( 👍 ), but the next fgets should then have immediately returned NULL, and instead it stalled there ( 👎 ). https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=c02128682 page 1093 describes spt_fgetsx, which mentions nothing about blocking. However, https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=c02492445 on page 346 (table 58) lists general notes about the spt, spt_*x, and spt_*z variants and says this:
Now, I'm not sure why this would block, as the I/O should complete, given that we should be in an EOF condition, but here we are, with a thread blocking in fgets (aliased to spt_fgetsx). I would normally say that we could just try to set O_NONBLOCK here, but this is a regular file and O_NONBLOCK is meaningless. I'm at a bit of a loss for what to do here. Thoughts?
Where is the actual I/O occurring?
I'm not sure this answers your question, but from your stack trace it's the file_gets call in the file BIO, which calls fgets, which in turn on that build maps to spt_fgetsx.
The situation is that
That's unfortunate. I know it's not helpful, but POSIX indicates that a null pointer should be returned on an EOF condition. I'll make the change you describe. In the interim, can you do me a favor and run the following command:
It also occurs to me that, given that your config file is different from mine, if an embedded null character somehow got into your config file, the above behavior would actually be correct and POSIX compliant, in that spt_fgetsx actually read data; it just happened to be a 0 char. I'm not sure how that would occur, but if it were the case, it might be indicative of a condition we should check for generally, not just in the TANDEM case.
I did not find any nulls in the file. No one but me could have been in the file, and I don't generally put nulls in. The man page on platform does not cover the condition that is occurring, so I am concerned that raising this as a defect in SPT will not get anywhere. Sticking a null in this file would be bad, obviously, but I don't think a NUL LF sequence is likely. LF NUL is possible but also unlikely.
Actually, I'm a bit confused by your suggestion as I look at it. If we don't check for the null terminator and always calculate ret = strlen(buf), then in the condition described we will get ret = 0 anyway, as strlen computes the length of the string, excluding the terminating null byte. If buf[0] == '\0', then we should get 0, just as we do currently, no?
You are correct. This makes me very confused, unless buf is shared between threads. I am testing the situation.
This early in the setup, I would be very hard pressed to think any other thread would be created prior to library init in the unit tests, but anything is possible.
When I single step through
A timing issue between what contexts? I'm still working under the assumption that we are single threaded at this point, so I'm struggling to think of what timing modifications would lead to behavior like this.
I think the issue here is that
Another alternative I am planning on trying is to put
That would be an option, yes. An alternative thought: in file_gets we could instead determine EOF by using ftell() and fstat() to find our file position and overall size, detecting EOF that way. It would be nice, though, to understand the conditions under which spt_fgetsx blocks; that's really confusing me.
@rsbeckerca any good fortune on your efforts here?
Nothing positive. I am considering the possibility that we have to use different knobs at different points, like bio_dgram needs
You're the definitive source here, but based on your description, it sounds to me like supporting the SPT thread model is perhaps going to take more ongoing effort to maintain than it's worth. Let me know how you want to proceed.
What are the downsides of not supporting SPT? Are there applications linked against OpenSSL that need it? How reasonable is it to say that you need to migrate to a different threading model to use QUIC on NonStop?
SPT was working correctly until DGRAMs were introduced into OpenSSL. My personal feeling is that SPT may not be worth preserving. I am currently surveying the customer base to find out if anyone is using it.
There might not be a downside. What I have proposed to my team is to keep SPT support for 3.0 and 3.1 as is, which takes us through until 2026. SPT has a low probability of being supported after that - a bit of a gamble. It is probably reasonable to tell customers to use PUT for QUIC or any other DGRAM use. There is also a new threading model coming, so supporting three (when the time comes) does not make much sense.
Anything further needed on this, or can I close it?
No, closing.
When testing NonStop builds for 3.2.0-beta1, quicapitest (and sometimes quic_tserver_test) hangs waiting for initial input. This did not occur with 3.2.0-alpha2, but it may be transient. Stopping the hung processes does not cause the tests to fail, which is odd in itself.
This only applies to pthread builds using the SPT model. PUT is fine, but it is a much faster implementation, so any transient problems may be masked. Unthreaded builds work consistently.
Config dump: