
benchmark: part of threads never stopped when run with memtier_benchmark #33

Closed
bjzhjing opened this issue Feb 1, 2023 · 8 comments

Comments

@bjzhjing

bjzhjing commented Feb 1, 2023

We are measuring pelikan performance with memtier_benchmark, but have encountered the following issue.

Problem description

When the performance test is run with memtier_benchmark and --test-time is reached, the output from memtier_benchmark indicates that some threads never stop, so it never gets to print the final statistics.

How to reproduce

  • Configuration:
    The problem happens with the following settings; if memtier_benchmark runs with fewer than 20 threads, there is no such problem.
- memtier_benchmark: run with threads >= 20.
- Pelikan: [worker] threads in config/segcache.toml > 5.
  • Commands to start the test:
> Start pelikan: target/release/pelikan_segcache_rs config/segcache.toml
> Start memtier_benchmark: memtier_benchmark -p 12321 -P memcache_text -d 256 --threads 20 --test-time 10

With some debugging and code analysis, we see that those threads are looping in epoll_pwait(), waiting for events from pelikan. The call stack is: client_group::run() (memtier_benchmark/client.c) -> event_base_dispatch() (libevent/event.c) -> event_base_loop() -> epoll_dispatch() (libevent/epoll.c) -> epoll_pwait2() (kernel syscall).
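As a rough sketch of how we checked this (assuming a Linux host and that memtier_benchmark is the only process with that name, so the pgrep lookup below works; adjust it to your own setup):

# Find the benchmark process and show the kernel wait channel of each thread.
pid=$(pgrep -x memtier_benchmark)
for tid in /proc/$pid/task/*; do
    echo "$tid: $(cat $tid/wchan 2>/dev/null)"
done

# Or attach strace and watch which syscall the leftover threads are stuck in;
# look for repeated epoll_pwait()/epoll_pwait2() calls that never return events.
sudo strace -f -p "$pid"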

Question

  1. Why does pelikan not send back the events, leaving the client threads blocked?
  2. What performance tools does Pelikan use for performance evaluation that do not run into the above issue?
@brayniac
Collaborator

brayniac commented Feb 1, 2023

I'm unable to reproduce this on my laptop. The fact that memtier_benchmark fails to finish its run is odd. Even if the server was not responding for some reason, it would ideally handle that and finish the run. Can you tell us more about the environment you're testing in? It might be helpful to know what OS/Version/Kernel/... and what hardware.

I use rpc-perf for my testing: https://github.com/iopsystems/rpc-perf

@bjzhjing
Author

bjzhjing commented Feb 2, 2023

@brayniac Thanks for your quick answer! Here is my environment info:

hw: Intel Alder Lake
os: Ubuntu 22.10
kernel: 5.19.0-28-generic

The output from the memtier_benchmark side is as follows. One thread never stops until you manually kill it.

Writing results to stdout
[RUN #1] Preparing benchmark client...
[RUN #1] Launching threads now...
[RUN #1 4%,   0 secs] 20 threads:     1189585 ops, 3298300 (avg: 3298300) ops/sec, 158.46MB/sec (avg: 158.46MB/sec),  0.30 (avg:  
[RUN #1 14%,   1 secs] 20 threads:     2361937 ops, 1175611 (avg: 1739413) ops/sec, 56.52MB/sec (avg: 83.60MB/sec),  0.85 (avg:  0
[RUN #1 24%,   2 secs] 20 threads:     3527287 ops, 1165214 (avg: 1495875) ops/sec, 55.94MB/sec (avg: 71.87MB/sec),  0.86 (avg:  0
[RUN #1 34%,   3 secs] 20 threads:     4709950 ops, 1182515 (avg: 1402549) ops/sec, 56.77MB/sec (avg: 67.37MB/sec),  0.84 (avg:  0
[RUN #1 44%,   4 secs] 20 threads:     5877782 ops, 1167694 (avg: 1348655) ops/sec, 56.07MB/sec (avg: 64.78MB/sec),  0.85 (avg:  0
[RUN #1 54%,   5 secs] 20 threads:     7091294 ops, 1213286 (avg: 1323388) ops/sec, 58.25MB/sec (avg: 63.56MB/sec),  0.82 (avg:  0
[RUN #1 64%,   6 secs] 20 threads:     8223382 ops, 1131880 (avg: 1293265) ops/sec, 54.34MB/sec (avg: 62.11MB/sec),  0.88 (avg:  0
[RUN #1 74%,   7 secs] 20 threads:     9387262 ops, 1163663 (avg: 1275649) ops/sec, 55.87MB/sec (avg: 61.26MB/sec),  0.86 (avg:  0
[RUN #1 84%,   8 secs] 20 threads:    10541236 ops, 1153834 (avg: 1261075) ops/sec, 55.41MB/sec (avg: 60.56MB/sec),  0.87 (avg:  0
[RUN #1 94%,   9 secs] 20 threads:    11667240 ops, 1125908 (avg: 1246631) ops/sec, 54.11MB/sec (avg: 59.87MB/sec),  0.89 (avg:  0
[RUN #1 100%,   9 secs] 13 threads:    12839641 ops, 1125908 (avg: 1283973) ops/sec, 54.11MB/sec (avg: 61.66MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284879) ops/sec, 54.11MB/sec (avg: 61.71MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284751) ops/sec, 54.11MB/sec (avg: 61.70MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284622) ops/sec, 54.11MB/sec (avg: 61.69MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284494) ops/sec, 54.11MB/sec (avg: 61.69MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284365) ops/sec, 54.11MB/sec (avg: 61.68MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284237) ops/sec, 54.11MB/sec (avg: 61.67MB/sec),  0.89 (avg:  
[RUN #1 100%,  10 secs]  1 threads:    12854183 ops, 1125908 (avg: 1284109) ops/sec, 54.11MB/sec (avg: 61.67MB/sec),  0.89 (avg:

Please be sure to set config/segcache with 6 worker threads or higher, and run memtier_benchmark with 20 threads or higher.

Additionally, I've tried Ubuntu 20.04 with kernel 5.4, and the problem can be reproduced there as well.

@bjzhjing
Author

bjzhjing commented Feb 2, 2023

One more thing: the performance test was previously run directly on the host, with both pelikan and memtier_benchmark installed on the host. But when I run pelikan/pelikan_segcache_rs in a docker image while memtier_benchmark stays on the host, the issue disappears.

@hderms
Contributor

hderms commented Feb 2, 2023

@bjzhjing does this happen every time?

@brayniac
Collaborator

brayniac commented Feb 2, 2023 via email

@hderms
Contributor

hderms commented Feb 2, 2023

@bjzhjing what specific git SHA are you using? I am going to try to reproduce this as well

@hderms
Contributor

hderms commented Feb 2, 2023

@bjzhjing I am quite certain at this point that @brayniac hit the nail on the head. I dropped my ulimit to 900 (ulimit -n 900) before running both segcache and memtier, and I can reproduce a hang where it fails to quit after the 10 seconds have elapsed. Everything worked properly with the default ulimit on my system (1024). I then set the ulimit back to 1024, restarted both processes, and it succeeded again.

I haven't checked whether the ulimit needs to be too low for both segcache and memtier in order to reproduce the hang, but I can't reproduce it without segcache being given too low a ulimit.

@bjzhjing
Author

bjzhjing commented Feb 3, 2023

@brayniac @hderms Yeah, you're right. It's due to the ulimit value. The default is 1024 on my system; after changing it to 4096, the test runs well. I overlooked this setting because no message indicated it. Thanks a bunch!
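In case it helps anyone else, here is a rough sketch of the workaround (the per-thread connection count below uses memtier_benchmark's default of 50 clients per thread, which is an assumption; check your own flags):

# 20 benchmark threads x ~50 client connections each is roughly 1000 sockets,
# which leaves very little headroom under a 1024 file-descriptor limit on the
# pelikan side, so presumably some connections are never accepted and those
# clients never see a reply.
ulimit -n                  # check the current soft limit (1024 by default here)
ulimit -n 4096             # raise it in the shell that will launch pelikan
target/release/pelikan_segcache_rs config/segcache.toml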

@bjzhjing bjzhjing closed this as completed Feb 3, 2023