
doubts about OpenSSL asynchronous mode #6864

Closed
tokers opened this issue Aug 6, 2018 · 21 comments
@tokers

tokers commented Aug 6, 2018

Hello!

We are trying to use the Intel QAT service with OpenSSL's asynchronous mode in our OpenResty/Nginx service.

The problem is that our Nginx workers exit on signal 11 (core dump) while running inside libluajit (but started via ASYNC_start_job), for example:

[screenshot of the crashed worker's backtrace omitted]

I consulted an Intel QAT staff member in this issue: intel/QAT_Engine#79. His guess was that since both LuaJIT and async-mode OpenSSL use coroutines, the two implementations may conflict:

The trouble is OpenSSL when running asynchronously using the QAT Engine is also using its own co-routines for each connection. These co-routines (referred to as fibres) are allocating their own stack (from the heap) and switching to their own stack when running and also use longjmp to switch out of the co-routine back to the standard stack.
Perhaps there is something bad going on between the two implementations of co-routines. Perhaps something related to both using longjmp and something getting confused or getting corrupted to do with the stack?
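To make the mechanism concrete, here is a rough, self-contained sketch of the fibre pattern he is describing, simplified to use swapcontext() in both directions (the real code also mixes in setjmp/longjmp; the names and sizes below are illustrative only, not the actual QAT/OpenSSL code):

#include <stdlib.h>
#include <ucontext.h>

#define FIBRE_STACK_SIZE 32768          /* same order of magnitude as OpenSSL's STACKSIZE */

static ucontext_t main_ctx, fibre_ctx;

/* Runs on its own heap-allocated stack. */
static void fibre_func(void)
{
    /* ... do the first half of the work ... */
    swapcontext(&fibre_ctx, &main_ctx); /* "pause": jump back to the caller's stack */
    /* ... resumed later on the same private stack; finish the work ... */
}

int main(void)
{
    getcontext(&fibre_ctx);
    fibre_ctx.uc_stack.ss_sp   = malloc(FIBRE_STACK_SIZE); /* stack comes from the heap */
    fibre_ctx.uc_stack.ss_size = FIBRE_STACK_SIZE;
    fibre_ctx.uc_link          = &main_ctx;                /* where to go when the fibre returns */
    makecontext(&fibre_ctx, fibre_func, 0);

    swapcontext(&main_ctx, &fibre_ctx); /* start the fibre on its private stack */
    /* fibre is paused here; the caller can do other work ... */
    swapcontext(&main_ctx, &fibre_ctx); /* resume the fibre where it left off */

    free(fibre_ctx.uc_stack.ss_sp);
    return 0;
}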

Any ideas or suggestions for this?

  • OpenSSL version: 1.1.1-pre2

Best Regards
Alex Zhang

@paulidale
Contributor

As far as I can see LuaJIT doesn't use setjmp and longjmp to implement its co-routines (except as part of the build process).

This doesn't mean there isn't a longjmp problem, but it does mean that there isn't a problem with two such implementations being incompatible.

@tokers
Author

tokers commented Aug 6, 2018

@paulidale
Thanks for the response.

I have tried disabling SSL_MODE_ASYNC and the service works well. Are there any suggestions for troubleshooting this problem? I'm also checking whether it is caused by our own code.
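In case it is relevant, this is roughly how we toggle it (a minimal sketch; set_async is just an illustrative helper, not our actual OpenResty patch):

#include <openssl/ssl.h>

/* Illustrative helper: turn asynchronous processing on or off for a context.
 * SSL_MODE_ASYNC only has an effect when an async-capable engine
 * (the QAT engine, or the dasync test engine) is in use. */
static void set_async(SSL_CTX *ctx, int enable)
{
    if (enable)
        SSL_CTX_set_mode(ctx, SSL_MODE_ASYNC);
    else
        SSL_CTX_clear_mode(ctx, SSL_MODE_ASYNC);
}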

@tokers
Author

tokers commented Aug 6, 2018

I used the dasync engine and captured another backtrace:

#0  0x00007f3fecc09c8d in ?? ()
(gdb) bt
#0  0x00007f3fecc09c8d in ?? ()
#1  0x0000000041974bf8 in ?? ()
#2  0x0000000041974bd8 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x0000000041974c18 in ?? ()
#5  0x0000000041974c78 in ?? ()
#6  0x4198209840e9aae8 in ?? ()
#7  0x00000000000006f1 in ?? ()
#8  0x00000000410fa940 in ?? ()
#9  0x402c206000000004 in ?? ()
#10 0x40b0000000000000 in ?? ()
#11 0x000000000000000e in ?? ()
#12 0x0000000000001000 in ?? ()
#13 0x0000000040e28a78 in ?? ()
#14 0x00000000419820a0 in ?? ()
#15 0x40e9a98000000001 in ?? ()
#16 0x00007f3f40e9a958 in ?? ()
#17 0x000000000041c770 in EVP_DigestUpdate@plt ()
#18 0x00007ffd6cefee80 in ?? ()
#19 0x0000000000000000 in ?? ()

@paulidale
Contributor

Clearly something isn't dealing with the changing stack frames. I'd expect a byte code interpreter to be tolerant in this regard (they rarely consume much stack space). It is possible LuaJIT is using lots of stack during compilation. Perhaps try disabling compilation in LuaJIT -- once it is back to being just an interpreter, the stack use should be small.

@tokers
Author

tokers commented Aug 6, 2018

I have tried disabling the JIT compiler; it reduces the frequency of the core dumps, but the crashes still occur (though the stack information is now clearer).

@tokers
Author

tokers commented Aug 6, 2018

ping @mattcaswell

@mattcaswell
Member

Possibly a stack overflow?

You could try adjusting the stack size defined here:

#define STACKSIZE 32768

@tokers
Author

tokers commented Aug 6, 2018

@mattcaswell
Thanks for the response. I will try enlarging this value. I'm not familiar with common co-routine implementations; can you briefly explain the purpose of this stack (ucontext_t->uc_stack)? :)

@mattcaswell
Member

When you call an async capable function in OpenSSL (with async mode on), we create a new stack for that function to use. This enables us to temporarily return to the top level calling function at any time, and then resume where we left off at some later time.
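In rough outline, the pattern is the sketch below (my_crypto_op, run_async and op_args are made-up names for illustration; this is not a literal copy of OpenSSL's internals). SSL_do_handshake() uses the same mechanism internally when SSL_MODE_ASYNC is set, reporting SSL_ERROR_WANT_ASYNC to the application while the job is paused.

#include <openssl/async.h>

struct op_args { int dummy; };          /* illustrative argument block */

/* The job body: OpenSSL runs this on a freshly created stack. */
static int my_crypto_op(void *arg)
{
    /* ... submit work to the hardware ... */
    ASYNC_pause_job();                  /* yield back to whoever called ASYNC_start_job() */
    /* ... resumed later on the same job stack; collect the result ... */
    return 1;
}

static int run_async(struct op_args *args)
{
    ASYNC_JOB *job = NULL;
    ASYNC_WAIT_CTX *wctx = ASYNC_WAIT_CTX_new();
    int ret = 0;

    if (wctx == NULL)
        return -1;

    for (;;) {
        switch (ASYNC_start_job(&job, wctx, &ret, my_crypto_op,
                                args, sizeof(*args))) {
        case ASYNC_PAUSE:
            /* the job yielded: poll its wait fd(s), then loop to resume it */
            continue;
        case ASYNC_FINISH:
            ASYNC_WAIT_CTX_free(wctx);
            return ret;
        default:                        /* ASYNC_ERR or ASYNC_NO_JOBS */
            ASYNC_WAIT_CTX_free(wctx);
            return -1;
        }
    }
}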

@tokers
Author

tokers commented Aug 6, 2018

@mattcaswell
I have enlarged this macro to 4194304 (4 MB), but the core dump still happens.

@mattcaswell
Member

Well, that probably rules out a stack overflow! Looking at the stack trace it seems to fail somewhere in lua code (in the certificate callback). Strangely I can't match up the line numbers given in the stacktrace with the 1.1.1-pre2 version that you reported using. Anyway, I think you're going to have to turn your attention to lua and see if anyone there can help identify what is causing this.

One other thing is that pre2 is quite old and a lot of bugs have been fixed since then. You might want to consider upgrading.

@tokers
Author

tokers commented Aug 6, 2018

@mattcaswell

One other thing is that pre2 is quite old and a lot of bugs have been fixed since then. You might want to consider upgrading.

Thanks! I will try upgrading my OpenSSL first. Were any of the relevant bug fixes related to the async mode?

By the way, another type of backtrace was captured:

+0xa56b5 -> async_start_func: job->ret = job->func(job->funcargs);
+0x44a9f -> ssl_do_handshake_intern: return s->handshake_func(s);
+0x5d551 -> ossl_statem_accept: return state_machine(s, 1);
+0x5d9c5 -> state_machine: ssret = read_state_machine(s);
+0x5e061 -> read_state_machine: st->read_state_work = post_process_message(s, st->read_state_work);
+0x70b13 -> ossl_statem_server_post_process_message: return tls_post_process_client_hello(s, wst);
+0x72fd7 -> tls_post_process_client_hello: int rv = s->cert->cert_cb(s, s->cert->cert_cb_arg);
ngx_http_lua_ssl_protocol_handler
ngx_http_lua_ssl_protocol_handler_file
ngx_http_lua_cache_loadfile
lua_pcall
lj_vm_hotcall
lj_trace_ins
lj_vm_cpcall
lj_record_ins
lj_snap_add

@mattcaswell
Member

This backtrace also crashes in the certificate callback (in Lua code). Without knowing what that is doing, it's going to be difficult for us to make progress.

@tokers
Author

tokers commented Aug 7, 2018

Unfortunately after the upgrade, the problem still exists.

@tokers
Author

tokers commented Aug 7, 2018

After I disabled the JIT compiler in LuaJIT, our process crashed at another point:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000047f8f1 in ngx_http_request_handler (ev=0x421ee98) at src/http/ngx_http_request.c:2259
#2  0x00000000004515d0 in ngx_event_process_posted (cycle=0x2c1c050, posted=0x9baa40 <ngx_posted_events>) at src/event/ngx_event_posted.c:33
#3  0x000000000044f2ee in ngx_process_events_and_timers (cycle=0x2c1c050) at src/event/ngx_event.c:265
#4  0x000000000045c42b in ngx_worker_process_cycle (cycle=0x2c1c050, data=0x5) at src/os/unix/ngx_process_cycle.c:792
#5  0x0000000000458eda in ngx_spawn_process (cycle=0x2c1c050, proc=0x45c37d <ngx_worker_process_cycle>, data=0x5, name=0x6afd70 "worker process", respawn=-4)
    at src/os/unix/ngx_process.c:202
#6  0x000000000045b570 in ngx_start_worker_processes (cycle=0x2c1c050, n=16, type=-4) at src/os/unix/ngx_process_cycle.c:373
#7  0x000000000045b0cd in ngx_master_process_cycle (cycle=0x2c1c050) at src/os/unix/ngx_process_cycle.c:256
#8  0x000000000041cef0 in main (argc=1, argv=0x7fff66d95838) at src/core/nginx.c:393

One of the event handler pointers was reset to 0x0, which is incredible. I will continue to troubleshoot it and post updates as I make progress.

@tokers
Author

tokers commented Aug 13, 2018

@mattcaswell

After I consulted the LuaJIT community, what they said is:

  1. Lua calls a C function (via Lua C api or FFI, doesn't matter for now).
  2. That C function performs task switching (switches to a different C stack/execution state in the same thread).
  3. Now another C function is executing.
  4. It invokes Lua, or returns to Lua.

LuaJIT is not expecting this kind of reentrancy. Don't do it. Just don't!

For example,

#0  async_fibre_swapcontext (o=0x235ede0, n=0x215c070, r=1) at crypto/async/arch/async_posix.h:38
#1  0x00007f8e7c5cdbae in ASYNC_pause_job () at crypto/async/async.c:273
#2  0x00007f8e7bd44e4b in dummy_pause_job () at engines/e_dasync.c:459
#3  0x00007f8e7bd44eb4 in dasync_sha1_update (ctx=0x236f6e0, data=0x212cd10, count=1236) at engines/e_dasync.c:486
#4  0x00007f8e7c6a290b in EVP_DigestUpdate (ctx=0x236f6e0, data=0x212cd10, count=1236) at crypto/evp/digest.c:148
#5  0x00007f8e7c6a2e2b in EVP_Digest (data=0x212cd10, count=1236, md=0x22e5cd0 "", size=0x0, type=0x7f8e7ca13dc0 <sha1_md>, impl=0x0) at crypto/evp/digest.c:275
#6  0x00007f8e7c5b19bb in ASN1_item_digest (it=0x7f8e7ca24e60 <X509_it>, type=0x7f8e7ca13dc0 <sha1_md>, asn=0x22e5ba0, md=0x22e5cd0 "", len=0x0) at crypto/asn1/a_digest.c:57
#7  0x00007f8e7c75d866 in X509_digest (data=0x22e5ba0, type=0x7f8e7ca13dc0 <sha1_md>, md=0x22e5cd0 "", len=0x0) at crypto/x509/x_all.c:372
#8  0x00007f8e7c776e17 in x509v3_cache_extensions (x=0x22e5ba0) at crypto/x509v3/v3_purp.c:359
#9  0x00007f8e7c77655d in X509_check_purpose (x=0x22e5ba0, id=-1, ca=-1) at crypto/x509v3/v3_purp.c:83
#10 0x00007f8e7c777f52 in X509_get_extension_flags (x=0x22e5ba0) at crypto/x509v3/v3_purp.c:822
#11 0x00007f8e7cab5c4a in ssl_security_cert_sig (s=0x23a6b50, ctx=0x0, x=0x22e5ba0, op=393234) at ssl/t1_lib.c:2258
#12 0x00007f8e7cab5d8a in ssl_security_cert (s=0x23a6b50, ctx=0x0, x=0x22e5ba0, vfy=0, is_ee=0) at ssl/t1_lib.c:2282
#13 0x00007f8e7ca6dd81 in ssl_cert_add0_chain_cert (s=0x23a6b50, ctx=0x0, x=0x22e5ba0) at ssl/ssl_cert.c:295
#14 0x00007f8e7ca69008 in ssl3_ctrl (s=0x23a6b50, cmd=89, larg=0, parg=0x22e5ba0) at ssl/s3_lib.c:3552
#15 0x00007f8e7ca79db1 in SSL_ctrl (s=0x23a6b50, cmd=89, larg=0, parg=0x22e5ba0) at ssl/ssl_lib.c:2239
#16 0x0000000000578574 in ngx_http_lua_ffi_ssl_set_der_certificate (r=0x246ac20, data=0x40931498 "0\202\006\065\060\202\005\035\240\003\002\001\002\002\t", len=5010, err=0x40910bf0) at /disk/ssd2/alex_workflow/marco/deps/lua-nginx-module-0.10.11h/src/ngx_http_lua_ssl_certby.c:688
#17 0x00007f8e7cff7b86 in lj_vm_ffi_call () from /usr/local/marco/luajit/lib/libluajit-5.1.so.2
#18 0x00007f8e7d038848 in lj_ccall_func (L=0x41591ba0, cd=<optimized out>) at lj_ccall.c:1161
#19 0x00007f8e7d04cbb6 in lj_cf_ffi_meta___call (L=<optimized out>) at lib_ffi.c:230
#20 0x00007f8e7cff5be7 in lj_BC_FUNCC () from /usr/local/marco/luajit/lib/libluajit-5.1.so.2
#21 0x0000000000550c13 in ngx_http_lua_run_thread (L=0x41cd0378, r=0x246ac20, ctx=0x22a0f20, nrets=0) at /disk/ssd2/alex_workflow/marco/deps/lua-nginx-module-0.10.11h/src/ngx_http_lua_util.c:1013
#22 0x000000000057825c in ngx_http_lua_ssl_cert_by_chunk (L=0x41cd0378, r=0x246ac20) at /disk/ssd2/alex_workflow/marco/deps/lua-nginx-module-0.10.11h/src/ngx_http_lua_ssl_certby.c:527
#23 0x0000000000577457 in ngx_http_lua_ssl_cert_handler_file (r=0x246ac20, lscf=0x208fa20, L=0x41cd0378) at /disk/ssd2/alex_workflow/marco/deps/lua-nginx-module-0.10.11h/src/ngx_http_lua_ssl_certby.c:57
#24 0x0000000000577c2c in ngx_http_lua_ssl_cert_handler (ssl_conn=0x23a6b50, data=0x0) at /disk/ssd2/alex_workflow/marco/deps/lua-nginx-module-0.10.11h/src/ngx_http_lua_ssl_certby.c:315
#25 0x00007f8e7caab44a in tls_post_process_client_hello (s=0x23a6b50, wst=WORK_MORE_B) at ssl/statem/statem_srvr.c:2179
#26 0x00007f8e7caa8d2f in ossl_statem_server_post_process_message (s=0x23a6b50, wst=WORK_MORE_A) at ssl/statem/statem_srvr.c:1148
#27 0x00007f8e7ca95e52 in read_state_machine (s=0x23a6b50) at ssl/statem/statem.c:660
#28 0x00007f8e7ca957a9 in state_machine (s=0x23a6b50, server=1) at ssl/statem/statem.c:428
#29 0x00007f8e7ca9533b in ossl_statem_accept (s=0x23a6b50) at ssl/statem/statem.c:251
#30 0x00007f8e7ca7c42d in ssl_do_handshake_intern (vargs=0x1ff6b20) at ssl/ssl_lib.c:3467
#31 0x00007f8e7c5cd70f in async_start_func () at crypto/async/async.c:154
#32 0x00007f8e7c1a75d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#33 0x0000000000000000 in ?? ()

Our service sets the cert_cb to a function which enters the LuaJIT VM and runs some Lua code; the Lua VM in turn calls some C functions which call SSL_add1_chain_cert, so the async engine ends up being invoked again.

What I need now is to disable the async engine temporarily (during the cert_cb stage); is there any simple way to achieve this?

@tokers
Author

tokers commented Aug 13, 2018

@mattcaswell
What about postponing the moment at which we set SSL_MODE_ASYNC?

@mejedi
Contributor

mejedi commented Aug 14, 2018

It looks like ASYNC_block_pause will do the trick. It would be great if you could research whether there are other problematic callbacks in OpenResty, fix all of them, and send them a PR.

https://www.openssl.org/docs/man1.1.0/crypto/ASYNC_start_job.html

The ASYNC_block_pause() function will prevent the currently active job from pausing. The block will remain in place until a subsequent call to ASYNC_unblock_pause(). These functions can be nested, e.g. if you call ASYNC_block_pause() twice then you must call ASYNC_unblock_pause() twice in order to re-enable pausing. If these functions are called while there is no currently active job then they have no effect. This functionality can be useful to avoid deadlock scenarios. For example during the execution of an ASYNC_JOB an application acquires a lock. It then calls some cryptographic function which invokes ASYNC_pause_job(). This returns control back to the code that created the ASYNC_JOB. If that code then attempts to acquire the same lock before resuming the original job then a deadlock can occur. By calling ASYNC_block_pause() immediately after acquiring the lock and ASYNC_unblock_pause() immediately before releasing it then this situation cannot occur.
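For instance, a minimal sketch of how the certificate callback could be wrapped (real_cert_cb here stands in for the existing ngx_lua callback; purely illustrative, not a tested patch):

#include <openssl/async.h>
#include <openssl/ssl.h>

/* Assumed to be the existing ngx_lua certificate callback that
 * re-enters OpenSSL (e.g. via SSL_add1_chain_cert) from Lua. */
int real_cert_cb(SSL *ssl, void *arg);

/* Wrapper installed with SSL_CTX_set_cert_cb(): while the callback runs,
 * the active async job is not allowed to pause, so OpenSSL never switches
 * stacks underneath the Lua VM. */
static int cert_cb_noasync(SSL *ssl, void *arg)
{
    int rc;

    ASYNC_block_pause();                /* no-op if there is no active job */
    rc = real_cert_cb(ssl, arg);
    ASYNC_unblock_pause();
    return rc;
}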

@tokers
Author

tokers commented Aug 14, 2018

@mejedi
Thanks for the advice. I have done some extra work to bypass this limitation; I will look into ASYNC_block_pause when I have some spare time :).

@tokers
Author

tokers commented Aug 15, 2018

It seems the problem was fixed after adding ASYNC_block_pause in some of the ngx_lua functions. Thanks a lot!

@chenwei791129

chenwei791129 commented Apr 25, 2021

Possibly a stack overflow?

You could try adjusting the stack size defined here:

#define STACKSIZE 32768

Hi @mattcaswell, adjusting the stack size works for me, thanks for the information.
But I don't know why a regex function would be related to OpenSSL... haha.
I was stress testing with ab, and nginx crashes when calling ngx.re.match(server_name, regex_pattern, "ijo") with LuaJIT mode (lua-nginx-module).

Maybe the default STACKSIZE could be raised to 4194304?
nginx debug info:

(gdb) bt
#0  0x000055e031d3a970 in ?? ()
#1  0x00007fabae63e5d4 in ssl3_finish_mac () from /usr/local/ssl/lib/libssl.so.1.1
#2  0x00007fabae66d7ee in ssl3_do_write () from /usr/local/ssl/lib/libssl.so.1.1
#3  0x00007fabae6648e6 in state_machine () from /usr/local/ssl/lib/libssl.so.1.1
#4  0x00007fabae1c7f3c in async_start_func () from /usr/local/ssl/lib/libcrypto.so.1.1
#5  0x00007fabadd877b0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x0000000000000000 in ?? ()
