Poor Performance of MariaDB in Multi-thread Scenario #853
Comments
I think this is due to the involvement of frequent file-access-related system calls that need to cross the trusted boundary and be served by the untrusted side. Also, as we discussed via email, the unavailability of Unix domain sockets adds extra overhead. Can you please capture an strace log to confirm?
Interesting performance numbers. Thanks for putting in this effort, @qijiax.
I grabbed the perf record on sgx-direct. There is a spinlock there.
@qijiax What build are you using? Could you try current master with
This is a spinlock inside the host kernel. It really doesn't matter (at least for now). BTW, what kernel do you use exactly? Judging from these traces, it's most likely high contention on some lock in LibOS (which translates to PalEvent{Wait,Set} -> host futex).
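For readers unfamiliar with why contended locks show up as futex calls, here is a minimal sketch (illustration only, with hypothetical names; Gramine's actual locks go through PalEventWait/PalEventSet rather than calling futex directly): a typical lock spins briefly and then parks the waiter on a futex, so under high contention most acquisitions end in a host futex syscall.

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Minimal spin-then-sleep lock, for illustration only. Under low contention
     * the CAS in the spin phase succeeds and no syscall is made; under high
     * contention threads fall through to FUTEX_WAIT, which is exactly the
     * host-level futex traffic visible in the traces. */
    typedef struct { atomic_int state; } demo_lock_t; /* 0 = free, 1 = taken */

    static void demo_lock(demo_lock_t* l) {
        for (int i = 0; i < 100; i++) {
            int expected = 0;
            if (atomic_compare_exchange_strong(&l->state, &expected, 1))
                return; /* acquired during the spin phase, no syscall */
        }
        int expected = 0;
        while (!atomic_compare_exchange_strong(&l->state, &expected, 1)) {
            /* Sleep until the holder wakes us; this is the futex syscall. */
            syscall(SYS_futex, &l->state, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        }
    }

    static void demo_unlock(demo_lock_t* l) {
        atomic_store(&l->state, 0);
        syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }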
What spinlock? Why do you keep talking about some spinlock? |
I'm now using the latest master branch, and I'm using the 5.17.0-intel-next+ kernel.
No, they don't really help. All we can see is that there are a lot of host-level futex calls, which we already knew before.
@qijiax The flame graphs show that there is indeed some problem with futexes (it is clearly seen on the last graph with 16 threads). However, these flame graphs do not show from where inside Gramine these futex usages originate. The graphs only show
I would kindly ask you to try this suggestion from @boryspoplawski. This will give us some insight into what part of Gramine calls these futexes.
@qijiax The MariaDB server can run with just a 2G enclave size. Could you reduce the enclave size in the manifest, and also set the threads to 128? I think that could be the cause of the futexes.
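For reference, the suggested manifest changes would look roughly like this (values taken from the comment above; it is assumed that "the threads" refers to sgx.thread_num):

    sgx.enclave_size = "2G"
    sgx.thread_num = 128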
I could repro the perf issue - adding strace and VTune data.
@boryspoplawski @dimakuv Here's the hotspot on
There is an error when I set |
Looks good now. Could you look deeper into the "16 threads" report? In particular, could you unwrap the first 2-3 items, so that we can see the stack traces (= the callers of these funcs)?
Thanks @qijiax, now this gives us a lot of interesting info. I am sure the problem is in our sub-optimal locking during send/recv on TCP/IP sockets. One excerpt from the perf report:
So we have this malloc (Line 687 in f7995b6), which calls the slab allocator of LibOS, which grabs the lock: gramine/common/include/slabmgr.h, Line 325 in 5a39aee. The corresponding LibOS-side code is in gramine/libos/src/libos_malloc.c, Line 23 in 74e74a8.
This is an interesting avenue for optimizations. Our memory allocator is too dumb and uses global locking on every malloc/free, and we have a lot of malloc/free in the send/recv LibOS paths. @boryspoplawski @mkow @kailun-qin What do you think?
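To illustrate the pattern being criticized, a minimal sketch (hypothetical names, not the actual slab-manager code): every allocation and free, from every thread, takes the same global lock, so on the hot send/recv paths the lock itself becomes the bottleneck rather than the allocation work.

    #include <pthread.h>
    #include <stdlib.h>

    /* Globally-locked allocator wrapper, for illustration only. */
    static pthread_mutex_t g_slab_lock = PTHREAD_MUTEX_INITIALIZER;

    static void* slab_alloc(size_t size) {
        pthread_mutex_lock(&g_slab_lock);   /* all threads serialize here */
        void* ptr = malloc(size);           /* stands in for the slab logic */
        pthread_mutex_unlock(&g_slab_lock);
        return ptr;
    }

    static void slab_free(void* ptr) {
        pthread_mutex_lock(&g_slab_lock);
        free(ptr);
        pthread_mutex_unlock(&g_slab_lock);
    }

Per-thread caches (the approach tcmalloc takes, which is consistent with the ~500 TPS gain reported below) serve most allocations without touching any shared lock.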
I tried a couple of things. Further, when I used tcmalloc, that gave an additional boost of ~500 TPS.
Let me add this observation also - the above one is with |
With the suggestion from @svenkata9, I re-ran the benchmark in my environment.
This result matches @svenkata9's observation. The performance scales from 1-16 threads, but another bottleneck shows up when the thread count exceeds 32.
This is an extract from
There is a strong correlation here between the NUMA cores and the performance. @qijiax Could you try the same on your system? I think your system has 56 cores per socket. Perhaps you can try the same pinning there. BTW, I could achieve the same results with
The problem is, as @dimakuv mentioned, that you cannot work around it using any kind of CPU pinning. The most straightforward thing to do is to remove: Line 471 in f7995b6
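For context, the kind of per-call translation being discussed looks roughly like this (a sketch with hypothetical names, not the actual Gramine code): every send/recv allocates a temporary array just to convert the caller's struct iovec into a PAL-side layout and frees it again, which is the malloc/free traffic visible in the profiles.

    #include <stdlib.h>
    #include <sys/uio.h>

    /* Hypothetical PAL-side iovec layout, for illustration only. */
    struct pal_iovec { void* buf; size_t len; };

    /* Stub standing in for the PAL call that performs the actual I/O. */
    static long pal_stream_send_iov(struct pal_iovec* iov, size_t iov_len) {
        long total = 0;
        for (size_t i = 0; i < iov_len; i++)
            total += (long)iov[i].len;
        return total;
    }

    /* One malloc + one free on every send, on every thread, only to convert
     * between two layouts of the same information. */
    static long send_iov(const struct iovec* iov, size_t iov_len) {
        struct pal_iovec* pal_iov = malloc(iov_len * sizeof(*pal_iov));
        if (!pal_iov)
            return -1;
        for (size_t i = 0; i < iov_len; i++) {
            pal_iov[i].buf = iov[i].iov_base;
            pal_iov[i].len = iov[i].iov_len;
        }
        long ret = pal_stream_send_iov(pal_iov, iov_len);
        free(pal_iov);
        return ret;
    }

If the layers share a binary-compatible iovec layout (as discussed further down), both the copy loop and the allocation disappear.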
@qijiax @svenkata9 please check this branch: https://github.com/gramineproject/gramine/tree/borys/test_rm_pal_iov |
In my quick testing, this doesn't make the numbers different for |
It doesn't make any sense, this commit shouldn't worsen things, that shouldn't even be possible... |
Thanks for the patch, @boryspoplawski.
I recorded the perf info of gramine-direct in 32T:
The percentage of |
@qijiax Can you set libos.check_invalid_pointers = false? See https://gramine.readthedocs.io/en/latest/manifest-syntax.html#check-invalid-pointers
Oh, wait, looks like you already have this option set? This cannot be true. Please verify your manifest file again. I don't think you correctly set this option. |
@dimakuv You are right, this param was not set. It was commented out last time for debugging.
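For reference, the option in question is this manifest line (it also appears in the issue description below):

    libos.check_invalid_pointers = false

With the option at its default of true, Gramine validates user-supplied pointers on syscall entry, which adds per-syscall overhead; per the linked documentation, it should only be disabled if the application does not rely on EFAULT error semantics.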
There is still a spin_lock in 64T; the perf record:
The hotspot comes back to
@boryspoplawski Any idea where these malloc/free come from now? |
Ah, yes, I only removed the translation in LibOS, forgot about PAL. @qijiax please try the same branch, I've pushed additional changes |
This improves the performance along with But, without |
I also tried it in my environment, and the performance improves. The TPS is 17217 in 32T, but drops to 16417 in 64T.
Thanks for helping, @boryspoplawski. BTW, will this patch be merged into the master branch?
@qijiax How is it performing with
@svenkata9 I can't reach your performance at 76 threads. I did detailed testing on the current patch.
Note: Gramine runs in one NUMA domain. For gramine-direct, performance scales from 1-8 threads. An interesting observation is that in 32T, gramine-sgx performs better than gramine-direct.
Thank you @qijiax and @svenkata9 for this analysis! It will help us work on the bottlenecks, one by one. I believe we can merge @boryspoplawski's patch into Gramine, we just need to make it better. In principle, I don't see problems with being binary-compatible with the Linux structs where it makes a significant perf difference.
@qijiax I was just made aware that you are using a patched version of Gramine, yet you never mentioned that. Because of that, we cannot reason about the performance, or even the correctness, of Gramine.
What do you mean by a patched version? I used to install Gramine from yum, but the Gramine I'm using now is built and installed from the master branch, with your patch.
@qijiax But haven't you faced an error with Gramine, for which I asked you to apply the workaround through email -
@boryspoplawski Do you have any update on the fd lock? |
No, and unfortunately nothing will happen until after the next release, which is happening soon. Also, this might be a non-trivial amount of work.
This issue is 2.5 years old, I'm closing it. |
Description of the problem
We used sysbench to benchmark and compare the performance of MariaDB on non-Gramine, gramine-direct, and gramine-sgx, using different sysbench thread counts to vary the workload. The TPS statistics show poor performance of MariaDB in Gramine:
With 1 or 2 threads, gramine-sgx shows a 40-50% performance drop compared with non-Gramine. We consider that to be the overhead of Gramine and SGX. However, as the thread count grows, the TPS of Gramine peaks and then drops rapidly.
We tried setting up RPC threads (sgx.rpc_thread_num), but that did not help the max TPS.
Is the bottleneck in Gramine or in the SGX enclave? How can we avoid it?
The perf top record:
gramine-sgx 8 threads:
gramine-sgx 32 threads:
gramine-sgx 64 threads:
Steps to reproduce
System: CentOS 8
Setup steps: refer to the link.
Some settings in the manifest:
loader.insecure__use_cmdline_argv = true
sys.enable_sigterm_injection = true
sgx.nonpie_binary = true
sgx.enclave_size = "64G"
sgx.thread_num = 64
sgx.rpc_thread_num = 64
sgx.require_avx = true
sgx.require_avx512 = true
sgx.require_pkru = true
sgx.require_amx = true
libos.check_invalid_pointers = false
sys.stack.size = "16M"
sgx.preheat_enclave = true
loader.pal_internal_mem_size = "32G"
sgx.file_check_policy = "allow_all_but_log"