Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.2.0-beta1] quicapitest hangs for NonStop thread (SPT) builds #22588

Closed
rsbeckerca opened this issue Nov 1, 2023 · 169 comments
Closed

[3.2.0-beta1] quicapitest hangs for NonStop thread (SPT) builds #22588

rsbeckerca opened this issue Nov 1, 2023 · 169 comments
Assignees
Labels
branch: master Merge to master branch branch: 3.2 Merge to openssl-3.2 triaged: bug The issue/pr is/fixes a bug

Comments

@rsbeckerca
Copy link
Contributor

When testing NonStop builds for 3.2.0-beta1, quicapitest (and sometimes quic_tserver_test) hangs waiting for initial input. This did not occur with 3.2.0-alpha2 but this may be transient.

Stopping the hung processes to not cause the tests to fail, which is odd itself.

This only applies to pthread builds using the SPT model. PUT is fine, but it is much faster implementation so any transient problems may get masked. Unthreaded builds work consistently fine.

Config dump:

Command line (with current working directory = .):

    perl ./Configure nonstop-nsx_spt_floss --prefix=/usr/local-ssl3.2 --openssldir=/usr/local-ssl3.2/ssl --with-rand-seed=rdcpu threads enable-fips -D_REENTRANT

Perl information:

    perl
    5.30.3 for NSV-D-nonstop_kernel

Enabled features:

    acvp-tests
    apps
    argon2
    aria
    async
    autoalginit
    autoerrinit
    autoload-config
    bf
    blake2
    bulk
    cached-fetch
    camellia
    capieng
    cast
    chacha
    cmac
    cmp
    cms
    comp
    ct
    default-thread-pool
    deprecated
    des
    dgram
    dh
    docs
    dsa
    dso
    dtls
    dynamic-engine
    ec
    ec2m
    ecdh
    ecdsa
    ecx
    egd
    engine
    err
    filenames
    fips
    fips-securitychecks
    gost
    http
    idea
    legacy
    loadereng
    md4
    mdc2
    module
    multiblock
    nextprotoneg
    ocb
    ocsp
    padlockeng
    pic
    pinshared
    poly1305
    posix-io
    psk
    quic
    rc2
    rc4
    rdrand
    rfc3779
    rmd160
    scrypt
    secure-memory
    seed
    shared
    siphash
    siv
    sm2
    sm2-precomp
    sm3
    sm4
    sock
    srp
    srtp
    sse2
    ssl
    ssl-trace
    static-engine
    stdio
    tests
    thread-pool
    threads
    tls
    ts
    ui-console
    whirlpool
    tls1
    tls1-method
    tls1_1
    tls1_1-method
    tls1_2
    tls1_2-method
    tls1_3
    dtls1
    dtls1-method
    dtls1_2
    dtls1_2-method

Disabled features:

    afalgeng            [not-linux]      OPENSSL_NO_AFALGENG
    asan                [default]        OPENSSL_NO_ASAN
    asm                 [no asm_arch]    OPENSSL_NO_ASM
    brotli              [default]        OPENSSL_NO_BROTLI
    brotli-dynamic      [default]        OPENSSL_NO_BROTLI_DYNAMIC
    buildtest-c++       [default]
    winstore            [not-windows]    OPENSSL_NO_WINSTORE
    crypto-mdebug       [default]        OPENSSL_NO_CRYPTO_MDEBUG
    devcryptoeng        [default]        OPENSSL_NO_DEVCRYPTOENG
    ec_nistp_64_gcc_128 [default]        OPENSSL_NO_EC_NISTP_64_GCC_128
    external-tests      [default]        OPENSSL_NO_EXTERNAL_TESTS
    fuzz-afl            [default]        OPENSSL_NO_FUZZ_AFL
    fuzz-libfuzzer      [default]        OPENSSL_NO_FUZZ_LIBFUZZER
    ktls                [default]        OPENSSL_NO_KTLS
    makedepend          [unavailable]
    md2                 [default]        OPENSSL_NO_MD2 (skip crypto/md2)
    msan                [default]        OPENSSL_NO_MSAN
    rc5                 [default]        OPENSSL_NO_RC5 (skip crypto/rc5)
    sctp                [default]        OPENSSL_NO_SCTP
    tfo                 [default]        OPENSSL_NO_TFO
    trace               [default]        OPENSSL_NO_TRACE
    ubsan               [default]        OPENSSL_NO_UBSAN
    unit-test           [default]        OPENSSL_NO_UNIT_TEST
    uplink              [no uplink_arch] OPENSSL_NO_UPLINK
    weak-ssl-ciphers    [default]        OPENSSL_NO_WEAK_SSL_CIPHERS
    zlib                [default]        OPENSSL_NO_ZLIB
    zlib-dynamic        [default]        OPENSSL_NO_ZLIB_DYNAMIC
    zstd                [default]        OPENSSL_NO_ZSTD
    zstd-dynamic        [default]        OPENSSL_NO_ZSTD_DYNAMIC
    ssl3                [default]        OPENSSL_NO_SSL3
    ssl3-method         [default]        OPENSSL_NO_SSL3_METHOD

Config target attributes:

    AR => "ar",
    ARFLAGS => "qc",
    CC => "c99",
    HASHBANGPERL => "/usr/bin/env perl",
    RANLIB => "ranlib",
    RC => "windres",
    bn_ops => "THIRTY_TWO_BIT",
    build_file => "Makefile",
    build_scheme => [ "unified", "unix" ],
    cflags => "-g -O2 -Wextensions -Wnowarn=203,220,272,734,770,1506 -Wbuild_neutral_library -Wverbose -Wtarget=tns/x -Wsystype=oss -Wilp32 -WIEEE_float",
    cppflags => "",
    defines => [ "OPENSSL_BUILDING_OPENSSL", "OPENSSL_VPROC=\$(OPENSSL_VPROC)", "_XOPEN_SOURCE", "_XOPEN_SOURCE_EXTENDED=1", "_TANDEM_SOURCE", "__NSK_OPTIONAL_TYPES__", "B_ENDIAN", "OPENSSL_TANDEM_FLOSS", "_SPT_MODEL_", "_REENTRANT", "_ENABLE_FLOSS_THREADS" ],
    disable => [  ],
    dso_scheme => "DLFCN",
    enable => [ "egd" ],
    ex_libs => "-lrld -lfloss -lspt",
    includes => [ "/usr/local/include" ],
    lflags => "-Wxld=\"-set systype oss\" -Wxld=\"-set floattype ieee_float\"",
    lib_cflags => "",
    lib_cppflags => "",
    lib_defines => [  ],
    module_cflags => "",
    module_cxxflags => undef,
    module_ldflags => "-Wshared",
    multibin => "-spt",
    multilib => "-spt",
    perl => "/usr/bin/perl",
    perl_platform => "Unix",
    shared_argfileflag => "-Wxld_obey=",
    shared_cflag => "",
    shared_defflag => "-Wxld_obey=",
    shared_defines => [  ],
    shared_extension => ".so",
    shared_ldflag => "-Wshared",
    shared_rcflag => "",
    shared_target => "nonstop-shared",
    sys_id => "TANDEM",
    thread_defines => [  ],
    thread_scheme => "(unknown)",
    unistd => "<unistd.h>",

Recorded environment:

    AR =
    BUILDFILE =
    CC =
    CFLAGS =
    CPPFLAGS =
    CROSS_COMPILE =
    CXX =
    CXXFLAGS =
    HASHBANGPERL =
    LDFLAGS =
    LDLIBS =
    OPENSSL_LOCAL_CONFIG_DIR =
    PERL =
    RANLIB =
    RC =
    RCFLAGS =
    WINDRES =
    __CNF_CFLAGS =
    __CNF_CPPDEFINES =
    __CNF_CPPFLAGS =
    __CNF_CPPINCLUDES =
    __CNF_CXXFLAGS =
    __CNF_LDFLAGS =
    __CNF_LDLIBS =

Makevars:

    AR              = ar
    ARFLAGS         = qc
    ASFLAGS         =
    CC              = c99
    CFLAGS          =
    CPPDEFINES      = _REENTRANT
    CPPFLAGS        =
    CPPINCLUDES     =
    CXXFLAGS        =
    HASHBANGPERL    = /usr/bin/env perl
    LDFLAGS         =
    LDLIBS          =
    PERL            = perl
    RANLIB          = ranlib
    RC              = windres
    RCFLAGS         =

NOTE: These variables only represent the configuration view.  The build file
template may have processed these variables further, please have a look at the
build file for more exact data:
    Makefile

build file:

    Makefile

build file templates:

    Configurations/common0.tmpl
    Configurations/unix-Makefile.tmpl
@rsbeckerca rsbeckerca added the issue: bug report The issue was opened to report a bug label Nov 1, 2023
@t8m t8m added branch: master Merge to master branch triaged: bug The issue/pr is/fixes a bug branch: 3.2 Merge to openssl-3.2 and removed issue: bug report The issue was opened to report a bug labels Nov 1, 2023
@mattcaswell
Copy link
Member

Can you run the quicapitest with verbose output and paste the output:

make TESTS=test_quicapi V=1 test

@rsbeckerca
Copy link
Contributor Author

Can you run the quicapitest with verbose output and paste the output:

make TESTS=test_quicapi V=1 test

Nothing particularly informative on this. It ran through completely:

75-test_quicapi.t ..
# The results of this test will end up in test-runs/test_quicapi
1..2
    # Subtest: ../../test/quicapitest
    1..19
        # Subtest: test_quic_write_read
        1..3
        ok 1 - iteration 1
        # SKIP:  @ /home/randall/openssl/test/quicapitest.c:64
        # Blocking tests not supported in this build
        ok 2 - iteration 2 # skipped
        # SKIP:  @ /home/randall/openssl/test/quicapitest.c:64
        # Blocking tests not supported in this build
        ok 3 - iteration 3 # skipped
    ok 1 - test_quic_write_read
    # SKIP:  @ /home/randall/openssl/test/quicapitest.c:218
    # Blocking tests not supported in this build
    ok 2 - test_fin_only_blocking # skipped
    ok 3 - test_ciphersuites
    ok 4 - test_cipher_find
    ok 5 - test_version
    ok 6 - test_ssl_trace
    ok 7 - test_quic_forbidden_apis_ctx
    ok 8 - test_quic_forbidden_apis
    ok 9 - test_quic_forbidden_options
        # Subtest: test_quic_set_fd
        1..3
        ok 4 - iteration 1
        ok 5 - iteration 2
        ok 6 - iteration 3
    ok 10 - test_quic_set_fd
    ok 11 - test_bio_ssl
    ok 12 - test_back_pressure
    ok 13 - test_multiple_dgrams
        # Subtest: test_non_io_retry
        1..2
        ok 7 - iteration 1
        # SKIP:  @ /home/randall/openssl/test/quicapitest.c:1088
        # Blocking tests not supported in this build
        ok 8 - iteration 2 # skipped
    ok 14 - test_non_io_retry
    ok 15 - test_quic_psk
        # Subtest: test_client_auth
        1..3
        ok 9 - iteration 1
        ok 10 - iteration 2
        ok 11 - iteration 3
    ok 16 - test_client_auth
        # Subtest: test_alpn
        1..2
        ok 12 - iteration 1
        ok 13 - iteration 2
    ok 17 - test_alpn
        # Subtest: test_noisy_dgram
        1..2
        ok 14 - iteration 1
        ok 15 - iteration 2
    ok 18 - test_noisy_dgram
    ok 19 - test_get_shutdown
../../util/wrap.pl ../../test/quicapitest default ../../test/default.cnf ../../test/certs ../../test/recipes/75-test_quicapi_data => 0
ok 1 - running quicapitest
    # Subtest: ../../test/quicapitest
    1..19
        # Subtest: test_quic_write_read
        1..3
        ok 1 - iteration 1
        # SKIP:  @ /home/randall/openssl/test/quicapitest.c:64
        # Blocking tests not supported in this build
        ok 2 - iteration 2 # skipped
        # SKIP:  @ /home/randall/openssl/test/quicapitest.c:64
        # Blocking tests not supported in this build
        ok 3 - iteration 3 # skipped
    ok 1 - test_quic_write_read
    # SKIP:  @ /home/randall/openssl/test/quicapitest.c:218
    # Blocking tests not supported in this build
    ok 2 - test_fin_only_blocking # skipped
    ok 3 - test_ciphersuites
    ok 4 - test_cipher_find
    ok 5 - test_version
    ok 6 - test_ssl_trace
    ok 7 - test_quic_forbidden_apis_ctx
    ok 8 - test_quic_forbidden_apis
    ok 9 - test_quic_forbidden_options
        # Subtest: test_quic_set_fd
        1..3
        ok 4 - iteration 1
        ok 5 - iteration 2
        ok 6 - iteration 3
    ok 10 - test_quic_set_fd
    ok 11 - test_bio_ssl
    ok 12 - test_back_pressure
    ok 13 - test_multiple_dgrams
        # Subtest: test_non_io_retry
        1..2
        ok 7 - iteration 1
        # SKIP:  @ /home/randall/openssl/test/quicapitest.c:1088
        # Blocking tests not supported in this build
        ok 8 - iteration 2 # skipped
    ok 14 - test_non_io_retry
    ok 15 - test_quic_psk
        # Subtest: test_client_auth
        1..3
        ok 9 - iteration 1
        ok 10 - iteration 2
        ok 11 - iteration 3
    ok 16 - test_client_auth
        # Subtest: test_alpn
        1..2
        ok 12 - iteration 1
        ok 13 - iteration 2
    ok 17 - test_alpn
        # Subtest: test_noisy_dgram
        1..2
        ok 14 - iteration 1
        ok 15 - iteration 2
    ok 18 - test_noisy_dgram
    ok 19 - test_get_shutdown
../../util/wrap.pl ../../test/quicapitest fips ../../test/fips-and-base.cnf ../../test/certs ../../test/recipes/75-test_quicapi_data => 0
ok 2 - running quicapitest
ok
All tests successful.
Files=1, Tests=2,  6 wallclock secs ( 0.02 usr  0.00 sys +  4.93 cusr  0.01 csys =  4.95 CPU)
Result: PASS

@mattcaswell
Copy link
Member

So, this seems to show quicapitest not hanging - but running to completion. But your initial description says that it hangs. So is it only hanging when run as part of the complete test suite??

@rsbeckerca
Copy link
Contributor Author

So, this seems to show quicapitest not hanging - but running to completion. But your initial description says that it hangs. So is it only hanging when run as part of the complete test suite??

When I run the full test suite, it does hang pretty consistently. quic_tserver is more random. This has a much higher likely hang on ia64 than x86 potentially because of timing (x86 is 4x faster).

@rsbeckerca
Copy link
Contributor Author

There is a current hang (in progress) of quic_multistream_test also. This happens less frequently.

@mattcaswell
Copy link
Member

So if you run the verbose test command above several times repeatedly - can you induce a hang?

@rsbeckerca
Copy link
Contributor Author

So if you run the verbose test command above several times repeatedly - can you induce a hang?

Not on the x86. I will try again on the ia64... stay tuned. Going for a clean build first.

@nhorman
Copy link
Contributor

nhorman commented Nov 1, 2023

Can you attach a debugger to the process when its hung and provide a backtrace for the running threads?

@rsbeckerca
Copy link
Contributor Author

Can you attach a debugger to the process when its hung and provide a backtrace for the running threads?

The trace shows that the process is waiting on an thread event. Basically select but in the OS common runtime. Not at all informative. I will post what I can when I get the ia64 build to finish.

@rsbeckerca
Copy link
Contributor Author

Here is a failure trace. It actually did not hang on ia64 for this run - although I had to wait 15-20 minutes. On x86, this ran in under 10 seconds:

make depend && make _tests
make[1]: Entering directory '/home/ituglib/randall/openssl'
make[1]: Leaving directory '/home/ituglib/randall/openssl'
make[1]: Entering directory '/home/ituglib/randall/openssl'
( SRCTOP=. \
  BLDTOP=. \
  PERL="perl" \
  FIPSKEY="f4556650ac31d35461610bac4ed81b1a181b2d8a43ea2854cbae22ca74560813" \
  EXE_EXT= \
  perl ./test/run_tests.pl test_quicapi )
00-prep_fipsmodule_cnf.t ..
# The results of this test will end up in test-runs/prep_fipsmodule
1..0 # SKIP FIPS module config file only supported in a fips build
skipped: FIPS module config file only supported in a fips build
Files=1, Tests=0,  2 wallclock secs ( 0.04 usr  0.00 sys +  0.77 cusr  0.00 csys =  0.81 CPU)
Result: NOTESTS
75-test_quicapi.t ..
# The results of this test will end up in test-runs/test_quicapi
1..1
    # Subtest: ../../test/quicapitest
    1..19
        # Subtest: test_quic_write_read
        1..3
        ok 1 - iteration 1
        # ERROR: (bool) 'globserverret == true' failed @ /home/ituglib/randall/openssl/test/helpers/quictestlib.c:517
        # false
        # ERROR: (bool) 'qtest_create_quic_connection(qtserv, clientquic) == true' failed @ /home/ituglib/randall/openssl/test/quicapitest.c:81
        # false
        # OPENSSL_TEST_RAND_SEED=1698874757
        not ok 2 - iteration 2
        # ERROR: (bool) 'globserverret == true' failed @ /home/ituglib/randall/openssl/test/helpers/quictestlib.c:517
        # false
        # ERROR: (bool) 'qtest_create_quic_connection(qtserv, clientquic) == true' failed @ /home/ituglib/randall/openssl/test/quicapitest.c:81
        # false
        # OPENSSL_TEST_RAND_SEED=1698874757
        not ok 3 - iteration 3
    # OPENSSL_TEST_RAND_SEED=1698874757
    not ok 1 - test_quic_write_read
    # ERROR: (bool) 'globserverret == true' failed @ /home/ituglib/randall/openssl/test/helpers/quictestlib.c:517
    # false
    # ERROR: (bool) 'qtest_create_quic_connection(qtserv, clientquic) == true' failed @ /home/ituglib/randall/openssl/test/quicapitest.c:229
    # false
    # OPENSSL_TEST_RAND_SEED=1698874757
    not ok 2 - test_fin_only_blocking
    ok 3 - test_ciphersuites
    ok 4 - test_cipher_find
    ok 5 - test_version
    ok 6 - test_ssl_trace
    ok 7 - test_quic_forbidden_apis_ctx
    ok 8 - test_quic_forbidden_apis
    ok 9 - test_quic_forbidden_options
        # Subtest: test_quic_set_fd
        1..3
        ok 4 - iteration 1
        ok 5 - iteration 2
        ok 6 - iteration 3
    ok 10 - test_quic_set_fd
    ok 11 - test_bio_ssl
    ok 12 - test_back_pressure
    ok 13 - test_multiple_dgrams
        # Subtest: test_non_io_retry
        1..2
        ok 7 - iteration 1
        # ERROR: (bool) 'globserverret == true' failed @ /home/ituglib/randall/openssl/test/helpers/quictestlib.c:517
        # false
        # ERROR: (bool) 'qtest_create_quic_connection_ex(qtserv, clientquic, SSL_ERROR_WANT_RETRY_VERIFY) == true' failed @ /home/ituglib/randall/openssl/test/quicapitest.c:1101
        # false
        # OPENSSL_TEST_RAND_SEED=1698874757
        not ok 8 - iteration 2
    # OPENSSL_TEST_RAND_SEED=1698874757
    not ok 14 - test_non_io_retry
    ok 15 - test_quic_psk
        # Subtest: test_client_auth
        1..3
        ok 9 - iteration 1
        ok 10 - iteration 2
        ok 11 - iteration 3
    ok 16 - test_client_auth
        # Subtest: test_alpn
        1..2
        ok 12 - iteration 1
        ok 13 - iteration 2
    ok 17 - test_alpn
        # Subtest: test_noisy_dgram
        1..2
        ok 14 - iteration 1
        ok 15 - iteration 2
    ok 18 - test_noisy_dgram
    ok 19 - test_get_shutdown
../../util/wrap.pl ../../test/quicapitest default ../../test/default.cnf ../../test/certs ../../test/recipes/75-test_quicapi_data => 255
not ok 1 - running quicapitest

#   Failed test 'running quicapitest'
#   at test/recipes/75-test_quicapi.t line 32.
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests

Test Summary Report
-------------------
75-test_quicapi.t (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=1, 176 wallclock secs ( 0.03 usr  0.00 sys + 124.29 cusr  0.01 csys = 124.33 CPU)
Result: FAIL
Makefile:1699: recipe for target 'run_tests' failed
make[1]: *** [run_tests] Error 1
make[1]: Leaving directory '/home/ituglib/randall/openssl'
Makefile:1696: recipe for target 'tests' failed
make: *** [tests] Error 2

@nhorman
Copy link
Contributor

nhorman commented Nov 2, 2023

So, understanding that it didn't hang strictly speaking, I would still find it educational to see, if you could the backtrace of the stalled process, just to understand what event/data we are waiting on

@rsbeckerca
Copy link
Contributor Author

So, understanding that it didn't hang strictly speaking, I would still find it educational to see, if you could the backtrace of the stalled process, just to understand what event/data we are waiting on

The trace, for what it is worth, is:

Procedure call trace:
  AsmBreakDispatchNoRelink + 0x2 (Milli)
  $n_NSK_Event_Wait + 0xCA (Milli)
  Timer_WaitOnEmbeddedTLE_ + 0xDE (SLr)
  PROCESS_WAIT_ + 0x38 (SLr)
  AWAITIOX + 0x59D (SLr)
  FILE_COMPLETE_ + 0x2A7 (SLr)
  $n_EnterPriv + 0x542 (Milli)
  cma__io_available + 0xF6 (DLL xsptdll)
  cma___null_thread + 0x63 (DLL xsptdll)
  cma__thread_base + 0xCE (DLL xsptdll)
  GTHREAD_START_ + 0x9A (SLr)

With the cma_ code being in the SPT library.

The only file error at the time is a TCP socket error EWOULDBLOCK.

@rsbeckerca
Copy link
Contributor Author

Killing the hung process (when run in a full suite test, which hangs, instead of a single test, which does not) resulted in:

70-test_quic_multistream.t ..............
../../util/wrap.pl ../../test/quic_multistream_test ../../test/certs/servercert.pem ../../test/certs/serverkey.pem => 137
not ok 1
70-test_quic_multistream.t .............. 1/? ----------------------------------
#   Failed test at test/recipes/70-test_quic_multistream.t line 19.
70-test_quic_multistream.t .............. Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests

@rsbeckerca
Copy link
Contributor Author

I had a hang this morning from a clean build:

make depend && make _tests
make[1]: Entering directory '/home/ituglib/randall/openssl'
make[1]: Leaving directory '/home/ituglib/randall/openssl'
make[1]: Entering directory '/home/ituglib/randall/openssl'
( SRCTOP=. \
  BLDTOP=. \
  PERL="perl" \
  FIPSKEY="f4556650ac31d35461610bac4ed81b1a181b2d8a43ea2854cbae22ca74560813" \
  EXE_EXT= \
  perl ./test/run_tests.pl test_quicapi )
00-prep_fipsmodule_cnf.t ..
# The results of this test will end up in test-runs/prep_fipsmodule
1..0 # SKIP FIPS module config file only supported in a fips build
skipped: FIPS module config file only supported in a fips build
Files=1, Tests=0,  2 wallclock secs ( 0.04 usr  0.00 sys +  0.77 cusr  0.00 csys =  0.81 CPU)
Result: NOTESTS
75-test_quicapi.t ..
# The results of this test will end up in test-runs/test_quicapi
1..1
    # Subtest: ../../test/quicapitest
    1..19
        # Subtest: test_quic_write_read
        1..3
        ok 1 - iteration 1

Same stack trace as above. Hung process is waiting for input, presumably from perl. We have seen similar situations previously when perl did not set up the communication structure correctly.

@rsbeckerca
Copy link
Contributor Author

As of commit 28932ab, the hang is easier to reproduce on both x86 and ia64 platform variants.

@nhorman
Copy link
Contributor

nhorman commented Nov 3, 2023

The stack trace you supplied here looks to me (based on the presence of the call to PROCESS_WAIT_), looks like the perl receipe is just waiting for the subordinate process (quic_multistream_test) to complete. Its the backtrace of that process that would be more telling as to the issue. When the hang occurs, can you find the quic_multistream_test process, attach to it and backtrace that?

Or are you indicating that the quic_multistream_test process has already exited, but somehow the perl script is still waiting on it (i.e. a race in the perl test utils code somewhere)?

@rsbeckerca
Copy link
Contributor Author

The stack trace you supplied here looks to me (based on the presence of the call to PROCESS_WAIT_), looks like the perl receipe is just waiting for the subordinate process (quic_multistream_test) to complete. Its the backtrace of that process that would be more telling as to the issue. When the hang occurs, can you find the quic_multistream_test process, attach to it and backtrace that?

Or are you indicating that the quic_multistream_test process has already exited, but somehow the perl script is still waiting on it (i.e. a race in the perl test utils code somewhere)?

The trace is from the quic_multistream_test process, not perl. The trace is just what SPT threads looks like when it goes into a wait state for a thread to wake up based on some input. As I said, not particularly useful but it is what SPT does on the platform.

@nhorman
Copy link
Contributor

nhorman commented Nov 3, 2023

apologies, I'm not used to seeing nonstop traces, and assumed it came from the parent process.

That said, I may see the problem. quic_multistream_test uses several (under the covers) pthread_cond/pthread_mutex calls, and appears to have reversed the order of mutex_unlock and cond_signal. I'll have a pr for you shortly to test please

@rsbeckerca
Copy link
Contributor Author

apologies, I'm not used to seeing nonstop traces, and assumed it came from the parent process.

That said, I may see the problem. quic_multistream_test uses several (under the covers) pthread_cond/pthread_mutex calls, and appears to have reversed the order of mutex_unlock and cond_signal. I'll have a pr for you shortly to test please

Ready to test when you are. The situation seems to go beyond quic_multistream_test.

@nhorman
Copy link
Contributor

nhorman commented Nov 3, 2023

I'll post the PR for multistream test immediately to get you started, and ammend if I find more of the same intances of the issue, or anything simmilar

@rsbeckerca
Copy link
Contributor Author

I'll post the PR for multistream test immediately to get you started, and ammend if I find more of the same intances of the issue, or anything simmilar

Please check quic_tserver_test and quicapitest. They hangs occasionally.

nhorman added a commit to nhorman/openssl that referenced this issue Nov 3, 2023
quic_multistream test was issuing a signal on a condvar after dropping
the corresponding mutex, not before, leading to potential race
conditions in the reading of the associated data

Fixes openssl#22588
nhorman added a commit to nhorman/openssl that referenced this issue Nov 3, 2023
quic_multistream test was issuing a signal on a condvar after dropping
the corresponding mutex, not before, leading to potential race
conditions in the reading of the associated data

Fixes openssl#22588
@rsbeckerca
Copy link
Contributor Author

My interpretation is that there are two threads - one real (a.k.a. 1 above.) and one sort-of real (a.k.a. main). When thread 1 yields, it will not go to the main thread in SPT but wait for another thread, which there isn't one, so thread 1 is dispatched. The main thread is not treated as a real thread so should not really be getting control. I am not convinced spt_fgetsx is actually reading data. Can we instrument a display of what is being returned through the EOF?

@rsbeckerca
Copy link
Contributor Author

I should note that if this ultimately turns out to be an SPT defect, we are stuck and may have to back away from that thread model for 3.2. Getting fixes to that thread model on NonStop is going to be like pulling teeth.

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

yeah, that was my plan, standby

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

ok quick-dgram-test_loop branch updated to instrument file_gets, should dump out what we're reading. If you can please test at your convienience, thanks

@rsbeckerca
Copy link
Contributor Author

Here is what I have. It's long, so attached instead. Also very curious that it did not get through the FIPS initialization.
test-bio-dgram.log

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

thank you!

So thats....odd. It looks like we iterated over each line in apps/openssl.cnf, until line 390, which got returned, and the next call to fgets blocked, even though there is more data in the file (7 more lines specifically)

I don't see anything particularly different about the next expected line that we're blocking on, save for the fact that its just a newline (which we've read several of previously in the trace).

Just to confirm the apps/openssl.cnf file you're using is the one at the head of the git tree without modification, correct? It should end with this sequence of lines:

[rr]
# Certificate revocation
cmd = rr
oldcert = $insta::certout # insta.cert.pem

[pkcs12]
certBagAttr = cb_attr

# Uncomment this if you need Java compatible PKCS12 files
[cb_attr]
#jdkTrustedKeyUsage = anyExtendedKeyUsage

@rsbeckerca
Copy link
Contributor Author

thank you!

So thats....odd. It looks like we iterated over each line in apps/openssl.cnf, until line 390, which got returned, and the next call to fgets blocked, even though there is more data in the file (7 more lines specifically)

I don't see anything particularly different about the next expected line that we're blocking on, save for the fact that its just a newline (which we've read several of previously in the trace).

Just to confirm the apps/openssl.cnf file you're using is the one at the head of the git tree without modification, correct? It should end with this sequence of lines:

[rr]
# Certificate revocation
cmd = rr
oldcert = $insta::certout # insta.cert.pem

[pkcs12]
certBagAttr = cb_attr

# Uncomment this if you need Java compatible PKCS12 files
[cb_attr]
#jdkTrustedKeyUsage = anyExtendedKeyUsage

What I've got is this, which is different (I did not modify it myself). It is a complete file. Looks like the EOF was either not detected or ignored.

[cr]
cmd = cr

[kur]
# Certificate update
cmd = kur
oldcert = $insta::certout # insta.cert.pem

[rr]
# Certificate revocation
cmd = rr
oldcert = $insta::certout # insta.cert.pem

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

yeah, I would agree, that seems like it was the last line of the file in your environment ( 👍 ), but the next fgets should have immediately returned null, but it stalled there ( 👎 ).

https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=c02128682 page 1093 describes spt_fgetsx, which mentions nothing about blocking.

However https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=c02492445 on page 346 (table 58) lists general notes about the spt, spt_*x, and spt_*z variants and says this:

spt_*x() Behavior for regular files:
Always process-blocking (all threads in the
process must wait for I/O to complete

Now, I'm not sure why this would block, as the io should complete, given that we should be in an EOF condition, but here we are, with a thread blocking in fgets (aliased to spt_fgetsx)

I would normally say that we could just try to set O_NONBLOCK here, but this is a regular file and O_NONBLOCK is meaningless.

I'm at a bit of a loss for what to do here. Thoughts?

@rsbeckerca
Copy link
Contributor Author

Where is the actual I/O occurring?

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

I'm not sure this answers your question, but from your stack trace its the file_gets call in the file BIO, which calls fgets, which in turn on that build maps to spt_fgetsx

@rsbeckerca
Copy link
Contributor Author

The situation is that spt_fgetsx is returning the address of buf but strlen(buf) is 0 indicating an EOF. The documentation on platform is unclear as to what will happen - return NULL or return buf[0]==NULL. If we check the latter, after the ret < 1 check, it should correctly detect the EOF. This could be isolated for OPENSSL_SYS_TANDEM. This could be easy simply by removing the if (buf[0] != '\0') line and always calculating ret = strlen(buf);, which will terminate the calling loop.

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

thats unfortunate.

I know its not helpful, but POSIX indicates that a null pointer should be returned on an eof condition:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fgets.html

I'll make the change you describe. In the interim, can you do me a favor. Run the following command:

hexdump -C apps/openssl.cnf

It also occurs to me that, given that our config file is different from mine, if an embedded null character someone was in your config file, the above behavior would actually be correct and posix compliant, in that spt_fgetsx actually read data, it just happened to be a 0 char. I'm not sure how that would occur, but if it were the case, it might be indicative of a condition we should check for generally, not just in the TANDEM case.

@rsbeckerca
Copy link
Contributor Author

I did not find any nulls in the file. No one but me could have been in the file and I don't generally put nulls in. The man page on platform does not cover the condition that is occurring, so I am concerned that raising this as a defect in SPT will not get anywhere. However, sticking a null in this file is bad, obviously, but I don't think a NULL LF is likely. LF NULL is possible but also unlikely.

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

actually, I'm a bit confused by your suggestion as I look at it. if we don't check for the null terminator and always calculate ret = strlen(buf), in the condition described we will get ret = 0 anyway, as strlen computes the length of the string s, excluding the terminating null byte. If buf[0] == '\0', then we should get 0, just as we are currently, no?

@rsbeckerca
Copy link
Contributor Author

You are correct. This makes me very confused unless buf is shared between threads. I am testing the situation.

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

This early in the setup, I would be very hard pressed to think any other thread would be created prior to library init in the unit tests, but anything is possible.

@rsbeckerca
Copy link
Contributor Author

When I single step through file_gets, spt_fgetsx does detected the EOF condition correctly (returns 0x0). When running without stepping, I do not see the EOF ever - hangs after the last line of the file. This looks like either a timing problem or a defect in SPT (I suspect the former).

@nhorman
Copy link
Contributor

nhorman commented Nov 14, 2023

A timing issue between what context? Im still working under the assumption that we are single threaded at this point, so I'm struggling to think of what timing modifications would lead to behavior like this

@rsbeckerca
Copy link
Contributor Author

A timing issue between what context? Im still working under the assumption that we are single threaded at this point, so I'm struggling to think of what timing modifications would lead to behavior like this

I think the issue here is that spt_fgetsx is not behaving correctly when in the main program outside a thread. I'm trying different knobs now to get the best behaviour. Worst case, we are going to drop SPT support permanently if we cannot get this to work.

@rsbeckerca
Copy link
Contributor Author

Another alternative I am planning on trying is to put SPT_THREAD_AWARE_NONBLOCK only in the bio_dgram related code instead of globally.

@nhorman
Copy link
Contributor

nhorman commented Nov 15, 2023

That would be an option, yes.

An alternative thought is that, in file_gets we could alternatively determine EOF by using ftell() and fstat to find our file position and overall size, finding EOF that way.

It would be nice though, to understand the conditions under which spt_fgetsx blocks, thats really confusing me

@nhorman
Copy link
Contributor

nhorman commented Nov 17, 2023

@rsbeckerca any good fortune on your efforts here?

@rsbeckerca
Copy link
Contributor Author

@rsbeckerca any good fortune on your efforts here?

Nothing positive. I am considering the possibility that we have to use different knobs at different points, like bio_dgram needs SPT_THREAD_AWARE_NONBLOCK as does the quic test suite to get spt_sendtox and spt_recvfromx to be used, while elsewhere we need spt_fread and spt_fwrite mixed with fread and fwrite. However, there seems to be a known bug in spt_fgets when used from a main program that it does not consistently detect EOF correctly. This appears to have been reported as a defect some years ago by myself and another developer, with no resolution. That will take a lot of McGyvering to get to work.

@nhorman
Copy link
Contributor

nhorman commented Nov 20, 2023

You're the definitive source here, but based on your description, that sounds to me like perhaps supporting the SPT thread model is going to be more ongoing and consistent effort to maintain than its perhaps worth. Let me know how you want to procede

@hlandau
Copy link
Member

hlandau commented Nov 20, 2023

What are the downsides of not supporting SPT? Are there applications linked against OpenSSL that need it? How reasonable is it to say that you need to migrate to a different threading model to use QUIC on NonStop?

@rsbeckerca
Copy link
Contributor Author

You're the definitive source here, but based on your description, that sounds to me like perhaps supporting the SPT thread model is going to be more ongoing and consistent effort to maintain than its perhaps worth. Let me know how you want to procede

SPT was working correctly until DGRAM were introduced into OpenSSL. My personal feeling is that SPT may not be worth preserving. Currently surveying the customer base to find out if anyone is using it.

What are the downsides of not supporting SPT? Are there applications linked against OpenSSL that need it? How reasonable is it to say that you need to migrate to a different threading model to use QUIC on NonStop?

There might not be a downside. What I have proposed to my team is to keep SPT support for 3.0 and 3.1 as is, which takes us through until 2026. SPT has a low probability of being supported after that - a bit of a gamble. It is probably reasonable to tell customers to use PUT for QUIC or any other DGRAM use. There is also a new threading model coming, so supporting three (when the time comes) does not make much sense.

@rsbeckerca
Copy link
Contributor Author

Anything further needed on this or can I close it?

@t8m
Copy link
Member

t8m commented Feb 2, 2024

Anything further needed on this or can I close it?

No, closing.

@t8m t8m closed this as completed Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master Merge to master branch branch: 3.2 Merge to openssl-3.2 triaged: bug The issue/pr is/fixes a bug
Projects
Status: Done
Development

No branches or pull requests

5 participants