Running multiple testcases should not leak file descriptors #98

Closed
quinox opened this issue May 3, 2024 · 4 comments

Comments

@quinox
Contributor

quinox commented May 3, 2024

Observation

Running all testcases in one go crashes on my system:

$ ./flashmq-tests 2>/dev/null
INIT: forkingTestForkingTestServer
RUN: forkingTestForkingTestServer
PASS: forkingTestForkingTestServer

...

INIT: testDowngradeQoSOnSubscribeQos0to0
RUN: testDowngradeQoSOnSubscribeQos0to0
PASS: testDowngradeQoSOnSubscribeQos0to0

INIT: testDowngradeQoSOnSubscribeQos1to0
RUN: testDowngradeQoSOnSubscribeQos1to0
FAIL EXCEPTION: testDowngradeQoSOnSubscribeQos1to0: Too many open files

INIT: testDowngradeQoSOnSubscribeQos1to1
fish: Job 1, './flashmq-tests 2>/dev/null' terminated by signal SIGABRT (Abort)

It always crashes on the same testcase.

The testcase itself runs fine:

$ ./flashmq-tests testDowngradeQoSOnSubscribeQos1to0 2>/dev/null
INIT: testDowngradeQoSOnSubscribeQos1to0
RUN: testDowngradeQoSOnSubscribeQos1to0
PASS: testDowngradeQoSOnSubscribeQos1to0

Tests run: 1. Passed: 1. Failed: 0 (of which 0 exceptions). Total assertions: 16.



TESTS PASSED

If I raise my open file limit using ulimit -Sn 4096, it gets much further but still doesn't make it to the end.

@halfgaar
Owner

halfgaar commented May 3, 2024

Weird, neither my system nor the GitHub builders have that issue. I can even reduce the open file limit to 512.

Can you give some more info about your system, branch, compiler, etc, etc?

@quinox
Contributor Author

quinox commented May 3, 2024

Happy to help. I can provide shell access if that makes it easier for you (I don't mind doing the legwork though).

The details:

  • git revision: upstream master 7e5d0d0eec1312789895ae532a7c4202c5cabe90 (tagged v1.11.0) (clean state)
  • OS: Gentoo with kernel 6.6.21
  • I tried it with 2 compilers:
    • ./run-make-from-ci.sh says: The CXX compiler identification is GNU 12.3.1
    • ./run-make-from-ci.sh --compiler $CXX says: The CXX compiler identification is Clang 17.0.6
  • I use Fish as shell, but it also happens when I use Bash
  • I use a normal user account, not root. This might be important because I do see this error message in the testcase output:

[2024-05-03 16:43:07.442] [ERROR] Setting ulimit nofile failed: 'Operation not permitted'. This means the default is used.

  • Limits:
quinox@gofu ~/p/F/F/buildtests (master)> ulimit --all -S
Maximum size of core files created                              (kB, -c) 0
Maximum size of a process’s data segment                        (kB, -d) unlimited
Control of maximum nice priority                                    (-e) 0
Maximum size of files created by the shell                      (kB, -f) unlimited
Maximum number of pending signals                                   (-i) 128081
Maximum size that may be locked into memory                     (kB, -l) 8192
Maximum resident set size                                       (kB, -m) unlimited
Maximum number of open file descriptors                             (-n) 1024
Maximum bytes in POSIX message queues                           (kB, -q) 800
Maximum realtime scheduling priority                                (-r) 0
Maximum stack size                                              (kB, -s) 8192
Maximum amount of CPU time in seconds                      (seconds, -t) unlimited
Maximum number of processes available to current user               (-u) 128081
Maximum amount of virtual memory available to each process      (kB, -v) unlimited
Maximum contiguous realtime CPU time                                (-y) unlimited

quinox@gofu ~/p/F/F/buildtests (master)> ulimit --all -H
Maximum size of core files created                              (kB, -c) unlimited
Maximum size of a process’s data segment                        (kB, -d) unlimited
Control of maximum nice priority                                    (-e) 0
Maximum size of files created by the shell                      (kB, -f) unlimited
Maximum number of pending signals                                   (-i) 128081
Maximum size that may be locked into memory                     (kB, -l) 8192
Maximum resident set size                                       (kB, -m) unlimited
Maximum number of open file descriptors                             (-n) 4096
Maximum bytes in POSIX message queues                           (kB, -q) 800
Maximum realtime scheduling priority                                (-r) 0
Maximum stack size                                              (kB, -s) unlimited
Maximum amount of CPU time in seconds                      (seconds, -t) unlimited
Maximum number of processes available to current user               (-u) 128081
Maximum amount of virtual memory available to each process      (kB, -v) unlimited
Maximum contiguous realtime CPU time                                (-y) unlimited
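
For context on the 'Operation not permitted' message: on Linux, setrlimit fails with EPERM when an unprivileged process tries to raise its hard limit, whereas raising the soft limit up to the hard limit (4096 here) is always allowed. The sketch below illustrates both cases. The function names are made up for this example and are not FlashMQ's actual code.

```cpp
#include <sys/resource.h>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: try to set both the soft and the hard open-file limit
// to `wanted`. Returns an empty string on success, or the strerror() text.
// For an unprivileged process this fails with "Operation not permitted"
// whenever `wanted` is above the current hard limit.
std::string trySetNofile(rlim_t wanted)
{
    struct rlimit lim{};
    lim.rlim_cur = wanted;
    lim.rlim_max = wanted;
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
        return std::strerror(errno);
    return {};
}

// Hypothetical fallback: raise only the soft limit to the existing hard
// limit, which needs no special privileges.
bool raiseSoftNofileToHard()
{
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return false;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```

With the limits listed above, raiseSoftNofileToHard() would have the same effect as ulimit -Sn 4096, which only postpones the crash while descriptors keep leaking.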


Does it not leak files for you, or does it not crash for you?

I grep for epoll for no special reason other than that it shows the leakage nicely (note that my limit is 1024):

$ strace -fF ./flashmq-tests 2>&1 | grep 'epoll_create.*= [1-9][0-9]*$'
[pid  6338] epoll_create(999)           = 4
[pid  6338] epoll_create(999)           = 5
[pid  6340] epoll_create(999)           = 9
[pid  6340] epoll_create(999)           = 11
[pid  6340] epoll_create(999)           = 13
[pid  6340] epoll_create(999)           = 15
[pid  6340] epoll_create(999)           = 17
[pid  6340] epoll_create(999)           = 19
...
[pid  6338] epoll_create(999)           = 1000
[pid  6338] epoll_create(999)           = 1002
[pid  6338] epoll_create(999)           = 1004
[pid  6338] <... epoll_create resumed>) = 1006
[pid  6338] <... epoll_create resumed>) = 1009
[pid  6338] <... epoll_create resumed>) = 1007
[pid  6338] <... epoll_create resumed>) = 1012
[pid  6338] <... epoll_create resumed>) = 1014
[pid  6338] <... epoll_create resumed>) = 1017
[pid  6338] <... epoll_create resumed>) = 1019
[pid  6338] <... epoll_create resumed>) = 1020
[pid  6338] epoll_create(999)           = 39
[pid  6338] epoll_create(999)           = 40
[pid  6884[2024-05-03 16:35:05.910] [DEBUG] Adding event 'keep-alive check' to the timer with an interval of 5000
d>) = 1023
fish: Process 6334, 'strace' from job 1, 'strace -fF ./flashmq-tests 2>&1…' terminated by signal SIGABRT (Abort)
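
The same growth can be observed from inside the process without strace. As a rough sketch (Linux-only, relying on /proc/self/fd; countOpenFds is a hypothetical name, not part of the FlashMQ test suite), a test runner could count its open descriptors before and after each testcase and flag any difference:

```cpp
#include <dirent.h>
#include <unistd.h>

// Count the file descriptors currently open in this process by listing
// /proc/self/fd. Returns -1 if the directory cannot be opened.
long countOpenFds()
{
    DIR *dir = opendir("/proc/self/fd");
    if (!dir)
        return -1;

    long count = 0;
    while (const dirent *entry = readdir(dir))
    {
        // Skip the "." and ".." entries.
        if (entry->d_name[0] == '.')
            continue;
        ++count;
    }
    closedir(dir);

    // The directory stream held one descriptor of its own while we counted.
    return count - 1;
}
```

If the count after a testcase is higher than before it, that testcase (or the objects it created) left descriptors open.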

@quinox changed the title from "Running multiple testcases should not leaking file descriptions" to "Running multiple testcases should not leak file descriptors" on May 3, 2024
@quinox
Contributor Author

quinox commented May 3, 2024

I captured the state using lsof -nPX in a second window. The biggest capture I made:

  • Open handles: 8364
    • lsof takes time to run; the 172 handles over the 8192 limit are probably handles that had already disappeared before lsof finished (which is why they are of type "unknown" below)
  • By type:
quinox@gofu ~> gawk '{ print $7 }' /tmp/lsof_1714751372.txt  | sort | uniq -c | sort -h -r
   6880 a_inode
    881 0
    190 unknown
    168 FIFO
    144 REG
...
  • Zooming in on the a_inode handles:
quinox@gofu ~> gawk '$7 == "a_inode" { print $11 }' /tmp/lsof_1714751372.txt | sed 's/:.*//' | sort | uniq -c | sort -hr
   3542 eventfd:$num
   3338 eventpoll:$num
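
A similar breakdown can be produced in-process, without lsof. The following is a sketch under the assumption of a Linux /proc filesystem, where eventfd and epoll descriptors typically show up as links to anon_inode:[eventfd] and anon_inode:[eventpoll]. The function name tallyOpenFdsByTarget is hypothetical.

```cpp
#include <dirent.h>
#include <unistd.h>
#include <cstdlib>
#include <map>
#include <string>

// Group this process's open file descriptors by the target of their
// /proc/self/fd symlink, similar to the lsof | sort | uniq -c pipeline.
std::map<std::string, int> tallyOpenFdsByTarget()
{
    std::map<std::string, int> tally;
    DIR *dir = opendir("/proc/self/fd");
    if (!dir)
        return tally;

    const int ownFd = dirfd(dir);
    while (const dirent *entry = readdir(dir))
    {
        // Skip "." and "..", and the descriptor used for this listing.
        if (entry->d_name[0] == '.' || std::atoi(entry->d_name) == ownFd)
            continue;

        char target[256];
        const ssize_t len = readlinkat(ownFd, entry->d_name, target, sizeof(target));
        if (len < 0)
        {
            ++tally["unknown"];
            continue;
        }

        std::string key(target, static_cast<size_t>(len));
        // Keep "anon_inode:[eventfd]" as is, but collapse "socket:[1234]"
        // and "pipe:[5678]" to "socket" and "pipe".
        if (key.rfind("anon_inode:", 0) != 0)
        {
            const auto pos = key.find(":[");
            if (pos != std::string::npos)
                key.erase(pos);
        }
        ++tally[key];
    }
    closedir(dir);
    return tally;
}
```

Regular files are tallied under their full path, so a leak of one specific file also stands out.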

halfgaar added a commit that referenced this issue May 3, 2024
This mostly only impacts the tests, because the MainApp gets
reinstantiated all the time. When the environment limits the amount of
open file descriptors in such a way that even FlashMQ's call to
setrlimit fails, tests would reach the maximum.

#98
@halfgaar
Owner

halfgaar commented May 3, 2024

Thanks, that error from setrlimit made it clear. It's interesting that it doesn't work for you.

Anyway, it was kind of an accident that I never ran into it. The setrlimit is just something FlashMQ does, so it also did so in tests. Some epoll and eventfd file descriptors plainly lacked a close, or even a destructor to call close() in... I fixed it.
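
For readers unfamiliar with this class of bug, one common way to avoid it is to tie each descriptor to an owning object whose destructor calls close(). The sketch below shows that general pattern only. It is not the actual FlashMQ fix, and UniqueFd and Worker are hypothetical names.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <utility>

// Hypothetical helper: owns one file descriptor and closes it when the
// owner is destroyed, so repeated construction cannot leak descriptors.
class UniqueFd
{
    int fd = -1;

public:
    UniqueFd() = default;
    explicit UniqueFd(int rawFd) : fd(rawFd) {}

    // Non-copyable: two owners would close the same descriptor twice.
    UniqueFd(const UniqueFd &) = delete;
    UniqueFd &operator=(const UniqueFd &) = delete;

    UniqueFd(UniqueFd &&other) noexcept : fd(std::exchange(other.fd, -1)) {}
    UniqueFd &operator=(UniqueFd &&other) noexcept
    {
        if (this != &other)
        {
            reset();
            fd = std::exchange(other.fd, -1);
        }
        return *this;
    }

    ~UniqueFd() { reset(); }

    int get() const { return fd; }
    bool valid() const { return fd >= 0; }

    void reset()
    {
        if (fd >= 0)
            close(fd);
        fd = -1;
    }
};

// Stand-in for an object that is created and destroyed once per testcase.
struct Worker
{
    UniqueFd epollFd{epoll_create1(0)};
    UniqueFd wakeupFd{eventfd(0, EFD_NONBLOCK)};
};
```

Because both members release their descriptor on destruction, creating and destroying a Worker thousands of times stays within a 1024-descriptor limit.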

@halfgaar halfgaar closed this as completed May 3, 2024