
Segfault running lwan in a Linux container with the number of cores lower than the number of physical cores #290

Open · diviaki opened this issue Sep 4, 2020 · 14 comments

diviaki commented Sep 4, 2020

I moved an LXC container that had been running for years to a new host, only to see lwan crash in it. I realised that on the old host the number of allowed cores was the same as the number of physical cores, while the new host has many more cores. Clearing the core limit of the container fixed the issue.
However, this is how lwan crashed with the limit in place:

==6791== Use of uninitialised value of size 8
==6791==    at 0x11A8D0: lwan_thread_add_client (in /usr/local/bin/lwan)
==6791==    by 0x1FFEFFF4DF: ???
==6791==    by 0xBFFFFFFFF: ???
==6791==    by 0x1FFEFFF4DF: ???
==6791==    by 0x501B847: ???
==6791==    by 0x3F: ???
==6791==    by 0x1FFEFFE33F: ???
==6791==    by 0x8F9C18F9C18F9C18: ???
==6791==    by 0x111392: lwan_main_loop.cold.28 (in /usr/local/bin/lwan)
==6791==    by 0x3FF: ???
==6791==    by 0xB: ???
==6791== 
==6791== Invalid write of size 4
==6791==    at 0x11A8DA: lwan_thread_add_client (in /usr/local/bin/lwan)
==6791==    by 0x1FFEFFF4DF: ???
==6791==    by 0xBFFFFFFFF: ???
==6791==    by 0x1FFEFFF4DF: ???
==6791==    by 0x501B847: ???
==6791==    by 0x3F: ???
==6791==    by 0x1FFEFFE33F: ???
==6791==    by 0x8F9C18F9C18F9C18: ???
==6791==    by 0x111392: lwan_main_loop.cold.28 (in /usr/local/bin/lwan)
==6791==    by 0x3FF: ???
==6791==    by 0xB: ???
==6791==  Address 0x19089ae0 is not stack'd, malloc'd or (recently) free'd
==6791== 
==6791== 
==6791== Process terminating with default action of signal 11 (SIGSEGV)

This is from a Release build, as trying to compile a Debug build results in

/root/lwan/src/lib/hash.c:163: undefined reference to '__builtin_ia32_crc32si'

and RelWithDebInfo segfaults the compiler at

[ 30%] Building C object src/lib/CMakeFiles/lwan-static.dir/lwan-mod-serve-files.c.o

Sources were just pulled; Debian 10, GCC 8.
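
As an aside, __builtin_ia32_crc32si is a GCC builtin that is only usable when SSE 4.2 is enabled for the code that calls it (e.g. with -msse4.2 or a per-function target attribute); whether that explains the undefined reference in the Debug build here is only a guess. A hypothetical standalone example, not lwan code, with a made-up wrapper name:

#include <stdint.h>
#include <stdio.h>

/* The target attribute enables SSE 4.2 for just this function, which is what
 * makes the builtin available here. */
__attribute__((target("sse4.2")))
static uint32_t crc32_u32(uint32_t crc, uint32_t value)
{
    return __builtin_ia32_crc32si(crc, value);
}

int main(void)
{
    printf("%08x\n", crc32_u32(~0u, 0xdeadbeef));
    return 0;
}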

lpereira (Owner) commented Sep 6, 2020 via email

diviaki (Author) commented Sep 7, 2020

Success! Deleting the build folder before changing the build type allowed building Debug (RelWithDebInfo still segfaults cc).

The config is 2 cores allowed out of 8 (4 physical cores with hyperthreading) on an unprivileged LXC node.

Test runs:

  • release: it works
  • proxy + release: segfault
  • proxy + release + valgrind: segfault (output sent earlier)
  • debug: segfault
  • debug + valgrind : it works
  • proxy + debug: segfault
  • proxy + debug + valgrind : it works

Release build starts with these warnings:

Could not set affinity for thread 0
Could not set affinity for thread 1

Debug build starts with (only relevant lines):

7401 lwan.c:723 lwan_init_with_config() Using 2 threads, maximum 262144 sockets per thread
7401 lwan-thread.c:701 lwan_thread_init() Initializing threads
7404 lwan-thread.c:453 thread_io_loop() Worker thread #1 starting
7401 lwan-thread.c:678 adjust_threads_affinity() Could not set affinity for thread 0
7401 lwan-thread.c:678 adjust_threads_affinity() Could not set affinity for thread 1
7405 lwan-thread.c:453 thread_io_loop() Worker thread #2 starting
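
The "Could not set affinity" warnings above are what adjust_threads_affinity() prints when pinning a worker thread to a CPU fails, presumably because that CPU is outside the set the container is allowed to use. A minimal standalone sketch of that failure mode, not lwan code, assuming CPU 0 is outside the allowed set:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set); /* assumption: CPU 0 is not in the container's allowed set */

    /* pthread_setaffinity_np() returns an error (typically EINVAL) when none
     * of the requested CPUs may be used by this process. */
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err)
        fprintf(stderr, "Could not set affinity: %s\n", strerror(err));

    return 0;
}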

An nginx proxy was used for testing; relevant lines:

                proxy_pass        http://192.168.xxx.xxx:8080;
                proxy_set_header  X-Real-IP  $remote_addr;

lpereira (Owner) commented Sep 7, 2020 via email

diviaki (Author) commented Sep 7, 2020

1727 lwan-thread.c:701 lwan_thread_init() Initializing threads
1727 lwan-thread.c:710 lwan_thread_init() Pending client file descriptor queue has 256 items
1730 lwan-thread.c:453 thread_io_loop() Worker thread #1 starting
1731 lwan-thread.c:453 thread_io_loop() Worker thread #2 starting
=================================================================
==1727==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffd8bac6a50 at pc 0x55ff84890c3b bp 0x7ffd8bac6a10 sp 0x7ffd8bac6a08
READ of size 4 at 0x7ffd8bac6a50 thread T0
    #0 0x55ff84890c3a in siblings_to_schedtbl /root/lwan/src/lib/lwan-thread.c:637
    #1 0x55ff84891138 in topology_to_schedtbl /root/lwan/src/lib/lwan-thread.c:660
    #2 0x55ff84891a12 in lwan_thread_init /root/lwan/src/lib/lwan-thread.c:730
    #3 0x55ff84857daf in lwan_init_with_config /root/lwan/src/lib/lwan.c:728
    #4 0x55ff8485770f in lwan_init /root/lwan/src/lib/lwan.c:658
    #5 0x55ff8484e2c6 in main /root/lwan/src/bin/lwan/main.c:208
    #6 0x7fb6cc6f009a in __libc_start_main ../csu/libc-start.c:308
    #7 0x55ff8484d0a9 in _start (/usr/local/bin/lwan+0x1c0a9)

Address 0x7ffd8bac6a50 is located in stack of thread T0
SUMMARY: AddressSanitizer: dynamic-stack-buffer-overflow /root/lwan/src/lib/lwan-thread.c:637 in siblings_to_schedtbl
Shadow bytes around the buggy address:
  0x100031750cf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100031750d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100031750d10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100031750d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100031750d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x100031750d40: 00 00 00 00 ca ca ca ca 00 cb[cb]cb cb cb cb cb
  0x100031750d50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100031750d60: ca ca ca ca 00 cb cb cb cb cb cb cb 00 00 00 00
  0x100031750d70: ca ca ca ca 00 cb cb cb cb cb cb cb 00 00 00 00
  0x100031750d80: 00 00 00 00 00 00 00 00 00 00 00 00 ca ca ca ca
  0x100031750d90: 00 00 cb cb cb cb cb cb 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1727==ABORTING

diviaki (Author) commented Sep 7, 2020

Let me know if you want to SSH into the thing.

lpereira (Owner) commented Sep 8, 2020

That's very curious! I'll take a look whenever I'm working on Lwan again. I'm pausing work on most of my personal projects for an undetermined time, so while I appreciate your offer to SSH into that machine, I won't be able to do that right now.

In the meantime, you can apply this patch here that should make Lwan work in your environment:

diff --git a/src/lib/lwan-thread.c b/src/lib/lwan-thread.c
index b1ee42da..32db393c 100644
--- a/src/lib/lwan-thread.c
+++ b/src/lib/lwan-thread.c
@@ -712,7 +712,7 @@ void lwan_thread_init(struct lwan *l)
         create_thread(l, &l->thread.threads[i], n_queue_fds);
 
     const unsigned int total_conns = l->thread.max_fd * l->thread.count;
-#ifdef __x86_64__
+#if 0
     static_assert(sizeof(struct lwan_connection) == 32,
                   "Two connections per cache line");
     /*

lpereira (Owner) commented Sep 8, 2020

Actually, I think I know what's going on. You mentioned the computer has 8 cores, but only 2 threads are being spawned... this makes me think that LXC is limiting the number of online CPUs reported by sysconf(), but Lwan is using information from sysfs, which isn't filtered by LXC. Can you try this patch instead and see if it works for you? (Revert the previous patch if you applied it.)

diff --git a/src/lib/lwan-thread.c b/src/lib/lwan-thread.c
index b1ee42da..0e844767 100644
--- a/src/lib/lwan-thread.c
+++ b/src/lib/lwan-thread.c
@@ -617,10 +617,27 @@ static bool read_cpu_topology(struct lwan *l, uint32_t siblings[])
             __builtin_unreachable();
         }
 
-
         fclose(sib);
     }
 
+    /* Some systems may lie about the number of online CPUs (obtainable with
+     * sysconf()), but don't filter out the CPU topology information from
+     * sysfs, which might reference CPU numbers higher than the amount
+     * obtained with sysconf().  */
+    for (unsigned int i = 0; i < l->n_cpus; i++) {
+        if (siblings[i] == 0xbebacafe) {
+            lwan_status_warning("Could not determine sibling for CPU %d", i);
+            return false;
+        }
+
+        if (siblings[i] > l->n_cpus) {
+            lwan_status_warning("CPU topology information says CPU %d exists, "
+                                "but max CPUs is %d. Is Lwan running in a "
+                                "container?", siblings[i], l->n_cpus);
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -651,6 +668,9 @@ topology_to_schedtbl(struct lwan *l, uint32_t schedtbl[], uint32_t n_threads)
 {
     uint32_t *siblings = alloca(l->n_cpus * sizeof(uint32_t));
 
+    for (uint32_t i = 0; i < l->n_cpus; i++)
+        siblings[i] = 0xbebacafe;
+
     if (!read_cpu_topology(l, siblings)) {
         for (uint32_t i = 0; i < n_threads; i++)
             schedtbl[i] = (i / 2) % l->thread.count;
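
For context, the mismatch this patch guards against can be reproduced outside of Lwan: sysconf() reports the online CPU count the container allows, while the sysfs topology files may still mention CPU numbers at or above that count. A standalone sketch, not lwan code (the exact sysfs file Lwan parses may differ):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    long configured = sysconf(_SC_NPROCESSORS_CONF);

    printf("online: %ld, configured: %ld\n", online, configured);

    for (long cpu = 0; cpu < configured; cpu++) {
        char path[128], buf[64];

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list",
                 cpu);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        /* In a container, these entries can list CPU numbers >= the online count. */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%ld siblings: %s", cpu, buf);
        fclose(f);
    }

    return 0;
}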

lpereira (Owner) commented Sep 9, 2020

I pushed this patch, as it doesn't hurt if this condition is never met. Please confirm if this fixes your issue.

diviaki (Author) commented Sep 9, 2020

Thanks for the quick patch. I applied it.
It works!
With 2 allowed cores out of 8, lwan warns "CPU topology information says CPU 4 exists, but max CPUs is 2." and does not say "Could not determine sibling...".

The strange report on the number of cores made me wonder: what happens if I set the count higher than 4, or to an odd number?

  • 1 -> works, "CPU topology information says CPU 4 exists, but max CPUs is 1." and "Could not set affinity for thread 0, 1"
  • 2 -> works, "CPU topology information says CPU 4 exists, but max CPUs is 2." and "Could not set affinity for thread 0, 1"
  • 4 -> works, "CPU topology information says CPU 5 exists, but max CPUs is 4." and "Could not set affinity for thread 0, 1, 2, 3"
  • 5 -> works, "CPU topology information says CPU 6 exists, but max CPUs is 5." and "Could not set affinity for thread 0, 1"
  • 6 -> works, "CPU topology information says CPU 7 exists, but max CPUs is 6."
  • 8 (or unlimited) -> works, no log
  • 7 -> oops, no log, but:

==3066==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7fff673757dc at pc 0x559ddc692e81 bp 0x7fff67375790 sp 0x7fff67375788
READ of size 4 at 0x7fff673757dc thread T0
    #0 0x559ddc692e80 in siblings_to_schedtbl /root/lwan/src/lib/lwan-thread.c:654
    #1 0x559ddc69342d in topology_to_schedtbl /root/lwan/src/lib/lwan-thread.c:680
    #2 0x559ddc693d07 in lwan_thread_init /root/lwan/src/lib/lwan-thread.c:750
    #3 0x559ddc659daf in lwan_init_with_config /root/lwan/src/lib/lwan.c:728
    #4 0x559ddc65970f in lwan_init /root/lwan/src/lib/lwan.c:658
    #5 0x559ddc6502c6 in main /root/lwan/src/bin/lwan/main.c:208
    #6 0x7f2241f5309a in __libc_start_main ../csu/libc-start.c:308
    #7 0x559ddc64f0a9 in _start (/usr/local/bin/lwan+0x1c0a9)

Looks like the container management uses some magic?

(I see this getting muddy, so if you're on a sabbatical or something, feel free to give it a break too. I use LXC for keeping stuff separated, not for limiting resources, and I'm totally fine running it with unlimited cores.)

diviaki (Author) commented Sep 9, 2020

Found some info. I use Proxmox on the virtualisation host, which in turn builds on LXC.
Proxmox is super smart and assigns the requested number of cores based on usage.
For example, right now, with the core limit set to 2, lwan's container got cores 5 and 6.

https://pve.proxmox.com/pve-docs/chapter-pct.html#pct_cpu
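
A quick way to see which cores a container actually got is sched_getaffinity(); a standalone sketch, not lwan code:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) < 0) {
        perror("sched_getaffinity");
        return 1;
    }

    /* With a Proxmox core limit of 2, this can print e.g. "allowed CPUs: 5 6"
     * even though only 2 CPUs are reported as online. */
    printf("online CPUs: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("allowed CPUs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    }
    printf("\n");

    return 0;
}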

lpereira (Owner) commented

Ah, yeah. I figured this could be the case. I do have a sketch of a patch that takes the number of configured (but not online) CPUs into consideration when allocating memory to calculate the affinity mask. I might push it this weekend if you'd like to test it out.

lpereira (Owner) commented

Hey @diviaki -- if you'd like to try, I pushed a patch that should work in your setup regardless of how many cores are allocated by LXC.

diviaki (Author) commented Sep 17, 2020

Still on a 2-out-of-8-cores setup, where cores 5 and 6 are assigned out of 0..7.

git pull; rm -r build; mkdir build; cd build; cmake .. -DCMAKE_BUILD_TYPE=Debug -DSANITIZER=address; make ; make install ; cd .. ; lwan
7348 lwan-thread.c:453 thread_io_loop() Worker thread #1 starting
7349 lwan-thread.c:453 thread_io_loop() Worker thread #2 starting
7345 lwan-thread.c:747 lwan_thread_init() 2 CPUs of 8 are online. Reading topology to pre-schedule clients
7345 lwan-thread.c:703 adjust_threads_affinity() Could not set affinity for thread 0
7345 lwan-thread.c:703 adjust_threads_affinity() Could not set affinity for thread 1
7345 lwan-thread.c:777 lwan_thread_init() Worker threads created and ready to serve
7345 lwan-socket.c:245 lwan_socket_init() Initializing sockets
7345 lwan-socket.c:168 listen_addrinfo() Listening on http://0.0.0.0:8080
7345 lwan.c:844 lwan_main_loop() Ready to serve
=================================================================
==7345==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x617000000b10 at pc 0x555ac9b9ec17 bp 0x7ffcc73d2610 sp 0x7ffcc73d2608
READ of size 8 at 0x617000000b10 thread T0
    #0 0x555ac9b9ec16 in spsc_queue_push /root/lwan/src/lib/queue.c:90
    #1 0x555ac9b91620 in lwan_thread_add_client /root/lwan/src/lib/lwan-thread.c:568
    #2 0x555ac9b5942d in schedule_client /root/lwan/src/lib/lwan.c:774
    #3 0x555ac9b5942d in accept_one /root/lwan/src/lib/lwan.c:807
    #4 0x555ac9b5942d in lwan_main_loop /root/lwan/src/lib/lwan.c:850
    #5 0x555ac9b4f323 in main /root/lwan/src/bin/lwan/main.c:215
    #6 0x7f12dc48b09a in __libc_start_main ../csu/libc-start.c:308
    #7 0x555ac9b4e0a9 in _start (/usr/local/bin/lwan+0x1c0a9)

lpereira (Owner) commented

Thanks for testing. Yeah, that seems related even though it exploded somewhere else. Let's see if I can carve out some time this weekend for this. (I'll try testing it locally, too.)
