Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault during SGX-LKL shutdown #733

Closed
prp opened this issue Aug 6, 2020 · 8 comments · Fixed by #788
Closed

Segfault during SGX-LKL shutdown #733

prp opened this issue Aug 6, 2020 · 8 comments · Fixed by #788
Labels
p1 Medium priority

Comments

@prp
Copy link
Member

prp commented Aug 6, 2020

I observed the following failure during a CI run of kernel-syscalls-mmap-mmap11-(nonrelease)-(run-hw)-(8-ethreads):

[   SGX-LKL  ] Warning: SGXLKL_TRACE_SIGNAL ignored in non-debug mode.
[   SGX-LKL  ] nproc=4 ETHREADS=8 CMDLINE="mem=512m"
[   SGX-LKL  ] HW TLS support: econf->fsgsbase=1
[   SGX-LKL  ] Registering disk 0 (path='sgxlkl-miniroot-fs.img', mnt='/', [RW])
[   SGX-LKL  ] No tap device specified, networking will not be available.
[   SGX-LKL  ] get_signed_libsgxlkl_path... result=/opt/sgx-lkl-nonrelease/bin/../lib/libsgxlkl.so.signed
[   SGX-LKL  ] oe_create_enclave...
[   SGX-LKL  ] Enclave config: {"args":["/ltp/testcases/kernel/syscalls/mmap/mmap11"],"clock_res":[{"resolution":"0000000000000001"},{"resolution":"0000000000000001"},{"resolution":"0000000000000000"},{"resolution":"0000000000000000"},{"resolution":"0000000000000001"},{"resolution":"00000000003d0900"},{"resolution":"00000000003d0900"},{"resolution":"0000000000000001"}],"cwd":"/","env":[],"esleep":"16000","espins":"500","ethreads":"8","exit_status":"full","format_version":"1","fsgsbase":true,"host_import_env":[],"hostname":"lkl","hostnet":false,"image_sizes":{"num_heap_pages":"262144","num_stack_pages":"1024"},"io":{"block":true,"console":true,"network":true},"kernel_cmd":"mem=512m","kernel_verbose":true,"max_user_threads":"256","mmap_files":"shared","mode":"hw_debug","mounts":[],"net_gw4":"10.0.1.254","net_ip4":"10.0.1.1","net_mask4":"24","oe_heap_pagecount":"8192","root":{"key":"","key_id":null,"overlay":false,"readonly":false,"roothash":null,"roothash_offset":"0"},"stacksize":"524288","swiotlb":true,"sysctl":null,"tap_mtu":"0","verbose":true,"wg":{"ip":"10.0.2.1","key":null,"listen_port":"56002","peers":[]}}
result=0 (OE_OK)
[   SGX-LKL  ] sgxlkl_enclave_init(ethread_id=0)
[    0.000000] Linux version 5.3.0+ (AzDevOps@cisca24ba0002V4) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #1 Thu Aug 6 09:09:59 UTC 2020
[    0.000000] memblock address range: 0x7f00207c2000 - 0x7f00407c2000
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 129280
[    0.000000] Kernel command line: mem=512m console=hvc0 virtio_mmio.device=256@0x1000000:1
[    0.000000] Dentry cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory available: 516256k/524288k RAM
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 4096
[    0.000000] lkl: irqs initialized
[    0.000000] clocksource: lkl: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] lkl: time and timers initialized (irq2)
[    0.000559] printk: console [hvc0] enabled
[    0.000559] pid_max: default: 4096 minimum: 301
[    0.000560] Mount-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
[    0.000560] Mountpoint-cache hash table entries: 1024 (order: 1, 8192 bytes, linear)
[    0.002735] devtmpfs: initialized
[    0.018031] random: get_random_bytes called from 0x00007f000057a455 with crng_init=0
[    0.018032] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.018033] futex hash table entries: 16 (order: -4, 384 bytes, linear)
[    0.018584] xor: automatically using best checksumming function   8regs     
[    0.018586] NET: Registered protocol family 16
[    0.021872] clocksource: Switched to clocksource lkl
[    0.022983] NET: Registered protocol family 2
[    0.022985] tcp_listen_portaddr_hash hash table entries: 256 (order: 0, 4096 bytes, linear)
[    0.022985] TCP established hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.022986] TCP bind hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.022986] TCP: Hash tables configured (established 4096 bind 4096)
[    0.023539] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.023539] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.023540] NET: Registered protocol family 1
[    0.023541] virtio-mmio: Registering device virtio-mmio.0 at 0x1000000-0x10000ff, IRQ 1.
[    0.024095] workingset: timestamp_bits=62 max_order=18 bucket_order=0
[    0.025208] NET: Registered protocol family 38
[    0.025208] io scheduler mq-deadline registered
[    0.025759] io scheduler kyber registered
[    0.113409] software IO TLB: mapped [mem 0x7f10322c1000-0x7f10362c1000] (64MB)
[    0.116643] loop: module loaded
[    0.117200] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
[    0.126234] wireguard: WireGuard 0.0.20191219 loaded. See www.wireguard.com for information.
[    0.126235] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[    0.126789] NET: Registered protocol family 10
[    0.127333] Segment Routing with IPv6
[    0.127886] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    0.128946] This architecture does not have kernel memory protection.
[    0.129500] Run /init as init process
[    0.131722] random: crng init done
[[  SGX-LKL ]] wg0 has public key W5/XFYlRAoayhmNTBF1ikKfZZapwu2VaDBW1TXPfkB0=
[    0.133394] virtio_blk virtio1: [vda] 3072000 512-byte logical blocks (1.57 GB/1.46 GiB)
[    0.135517] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: 
mmap11      1  [1;32mTPASS[0m  :  test completed.
[    0.147612] reboot: Restarting system
[   SGX-LKL  ] ethread (5: 0) ethread (6: 0) ethread (3: 0) ethread (2: 0) init (0: 0 exit=0) [[  SGX-LKL ]] FAIL: ethread (4: 0) Exception SIGSEGV (page fault) received before LKL initialisation/after LKL shutdown (lt->tid=-1 [(?)] code=5 addr=0x7f00000ef327 opcode=0x3883 ret=0)
2020-08-06T09:36:05.000000Z [(H)ERROR] tid(0x7f10277fe700) | :OE_ENCLAVE_ABORTING [/agent/_work/1/s/openenclave/host/calls.c:oe_call_enclave_function_by_table_id:91]
ethread (7: 19) [   SGX-LKL  ] FAIL: sgxlkl_ethread_init() failed (id=7 result=19 (OE_ENCLAVE_ABORTING))
TEST_FAILED EXIT CODE: 1

This happens to one of the ethreads, after 6 of the 8 ethreads have already exited the enclave. Note that the signal handler cannot retrieve the tid anymore.

I suspect that the test here is irrelevant, and this is rather a rare race condition/memory corruption during shutdown. It would be good to observe this for a DEBUG build and get a stack trace. It seems different from the other issues that we have seen.

@github-actions github-actions bot added the needs-triage Bug does not yet have a priority assigned label Aug 6, 2020
@SeanTAllen
Copy link
Contributor

SeanTAllen commented Aug 6, 2020

perhaps of interest. the draft PR for setting PROT_NONE when unmapping memory causes a large number of core and ltp tests to fail on shutdown. Might be related: #720

https://dev.azure.com/sgx-lkl/sgx-lkl/_build/results?buildId=1705&view=results

@prp
Copy link
Member Author

prp commented Aug 6, 2020

perhaps of interest. the draft PR for setting PROT_NONE when unmapping memory causes a large number of core and ltp tests to fail on shutdown. Might be related: #720

https://dev.azure.com/sgx-lkl/sgx-lkl/_build/results?buildId=1705&view=results

Interesting. All of the segfaults for #720 happen in SW mode. Do you have a gdb stack trace of the segfaults in #720? It may be easy to fix.

@SeanTAllen
Copy link
Contributor

Here's one:

adjtimex01.c:24: PASS: adjtimex() with mode 0x403f
adjtimex01.c:34: PASS: adjtimex() with mode 0x8001
[    0.145277] reboot: Restarting system
[Thread 0x7fe040f07000 (LWP 1496) exited]

Thread 4 "sgx-lkl-run-oe" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff93aeb700 (LWP 1493)]
0x0000000040018b2b in virtio_process_queue (dev=0x7ffff7ff5008, qidx=1) at host_interface/virtio.c:285
285         while (q->last_avail_idx != q->avail->idx)
(gdb) bt
#0  0x0000000040018b2b in virtio_process_queue (dev=0x7ffff7ff5008, qidx=1) at host_interface/virtio.c:285
#1  0x00000000400169c9 in console_task (arg=0x0) at host_interface/virtio_console.c:326
#2  0x00007ffff72e66db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff700fa3f in clone () from /lib/x86_64-linux-gnu/libc.so.6

I'm checking to see if other tests look similar.

@SeanTAllen
Copy link
Contributor

@prp I checked the stack trace for 8 different failing LTP tests. They are all the same.

@SeanTAllen
Copy link
Contributor

My guess @prp from looking at a little code is that we have this in console_task:

    while (1)
    {
        vio_host_process_enclave_event(_cfg->dev_id, timeout_ms);

        if (vio_host_check_guest_shutdown_evt())
            continue;

        virtio_process_queue(dev, TX_QUEUE_ID);
    }

and I suspect that after the shutdown check, we are starting the shutdown while we are trying to still use the queue. But I don't know this part of the code at all, so that is pure conjecture at this point. I believe you are more familiar with the virtio code and its interactions.

Do you have any insights?

@prp
Copy link
Member Author

prp commented Aug 6, 2020

I think you're right. By the way, this appears a duplicate of #158.

@SeanTAllen
Copy link
Contributor

Here's another that might be related:

https://dev.azure.com/sgx-lkl/ff25f828-9f87-48e4-94f0-7449609f7e8f/_apis/build/builds/1723/logs/148

during 'report/TEST-basic-global_vars_test-(nonrelease)-(run-hw)-(8-ethreads)-junit.xml'

2020-08-06T15:53:10.1702624Z [    0.162465] EXT4-fs (vda): mounted filesystem without journal. Opts: 
2020-08-06T15:53:10.1703541Z fopen failed errno(2)
2020-08-06T15:53:10.1703688Z TEST PASSED
2020-08-06T15:53:10.1704359Z [[  SGX-LKL ]] FAIL: Exception SIGSEGV (page fault) received before LKL initialisation/after LKL shutdown (lt->tid=1 [global_vars_test] code=5 addr=0x7f80001058b1 opcode=0x78b ret=0)
2020-08-06T15:53:10.1705040Z 2020-08-06T15:53:10.000000Z [(H)ERROR] tid(0x7f9c5d435700) | :OE_ENCLAVE_ABORTING [/agent/_work/1/s/openenclave/host/calls.c:oe_call_enclave_function_by_table_id:91]
2020-08-06T15:53:10.1706352Z [   SGX-LKL  ] ethread (6: 19) [   SGX-LKL  ] FAIL: sgxlkl_ethread_init() failed (id=6 result=19 (OE_ENCLAVE_ABORTING))
2020-08-06T15:53:10.1706651Z [    0.185198] reboot: Restarting system
2020-08-06T15:53:10.1707734Z Makefile:30: recipe for target 'run-hw' failed
2020-08-06T15:53:10.1708090Z make: *** [run-hw] Error 1
2020-08-06T15:53:10.1708540Z make run-hw: FAILED WITH EXIT CODE: 2
2020-08-06T15:53:10.1708730Z Test run completed with EXIT CODE 2
2020-08-06T15:53:10.1710290Z -----------stdout-end-------------

@paulcallen paulcallen added this to Needs triage in Issue triage via automation Aug 6, 2020
@paulcallen paulcallen moved this from Needs triage to Proposed p2 in Issue triage Aug 6, 2020
@paulcallen paulcallen moved this from Proposed p2 to Proposed p1 in Issue triage Aug 6, 2020
@paulcallen paulcallen added p1 Medium priority and removed needs-triage Bug does not yet have a priority assigned labels Aug 6, 2020
@paulcallen paulcallen removed this from Proposed p1 in Issue triage Aug 6, 2020
@prp
Copy link
Member Author

prp commented Aug 11, 2020

This may get fixed by PR #770.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1 Medium priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants