Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wakeup_task: Trying to wake up task 1 which is not blocked #898

Closed
egrimley-arm opened this issue Sep 6, 2023 · 6 comments · Fixed by #908
Closed

wakeup_task: Trying to wake up task 1 which is not blocked #898

egrimley-arm opened this issue Sep 6, 2023 · 6 comments · Fixed by #908

Comments

@egrimley-arm
Copy link
Contributor

egrimley-arm commented Sep 6, 2023

I get this with the same testtcp program as in #846, but by running the release build for a much longer time on AArch64. The error is:

about to accept
[0][PANIC] panicked at src/scheduler/task.rs:481:13:
Trying to wake up task 1 which is not blocked
[0][INFO] Number of interrupts
[0][INFO] [0][Reschedule]: 291
[0][INFO] [0][Timer]: 487
[0][INFO] [0][virtio_net]: 4848
[0][INFO] Shutting down system
[0][ERROR] Unsupported exception class: 0x0, PC=0x4068c624

To build and run:

cargo build --target aarch64-unknown-hermit --release \
  -Z build-std=std,core,alloc,panic_abort \
  -Z build-std-features=compiler-builtins-mem

qemu-system-aarch64 \
    -machine virt,gic-version=3 \
    -cpu cortex-a72 -smp 1 -m 512M  \
    -display none -serial stdio \
    -kernel ../rusty-loader/target/aarch64/release/rusty-loader \
    -device guest-loader,addr=0x48000000,initrd=target/aarch64-unknown-hermit/release/testtcp \
    -netdev user,id=u1,hostfwd=tcp::8080-:8080 \
    -device virtio-net-pci,netdev=u1,disable-legacy=on 

To test, from a different terminal:

rm -f log ; date ; while socat -u TCP:localhost:8080 CREATE:out ; do date >> log ; done &

This typically runs for a minute or two and the error happens after 500-1000 connections have been made. I don't have enough data to say it looks like a Poisson distribution, but it looks sort of random.

rusty-hermit is at 5051aeed6c01845962973b0e907b366ed8fc0163
rusty-hermit/kernel is at 5f1ccd6

(For all I know it's possible that the only bug here is the assertion itself: perhaps waking up a task that is not blocked is something that should be expected to happen occasionally, for some reason? I have not seen this error on Intel but something different happens if I run the test on Intel: #899.)

@egrimley-arm
Copy link
Contributor Author

The error can be reproduced much more quickly by replacing let n = 300 with let n = 0 in the test program.

@egrimley-arm
Copy link
Contributor Author

I've only seen this with the release build, but GDB's where doesn't work for me with the release build, so I tried to do a backtrace by hand, which is not hard to do, despite a couple of functions being inlined, but very tedious and easy to make a mistake with, so this might not be correct:

sp         pc
0x5d1812f0 0x4066d7f4 wakeup_task task.rs:483 loop {} borrowed.status != TaskStatus::Blocked
0x5d181380 0x4066eaf4 handle_waiting_tasks task.rs:666
0x5d181420 0x40675494 libhermit_do_irq ???
0x5d1814a0 0x40636264 libhermit_el1_irq start.s:104
0x5d1815c0 0x4068d4c8 GicV3::send_sgi ???
0x5d1815c0 0x4063e620 reschedule scheduler/mod.rs:515
0x5d1815c0 0x4063e620 futex_wait_and_set futex.rs:130
0x5d181e20 0x4064ae14 wait executor/mod.rs:38
0x5d181e20 0x4064ae14 block_on network.rs:305
0x5d181f40 0x40645084 accept ???
0x5d181f80 0x40678aec sys_accept (mangled) ???
0x5d181ff0 0x40647d98 sys_accept (C): code uses msr spsel ...
0x5d282dc0 0x4069f06c sys_common:net::TcpListener::accept
0x5d282e30 0x40698c58 std::net::tcp::TcpListener::accept
0x5d282e80 0x40618140 main

@egrimley-arm
Copy link
Contributor Author

It seems that the nic.poll_common(now)at

nic.poll_common(now);
in handle_waiting_tasks is leading to an invocation of custom_wakeup that wakes up the task before handle_waiting_tasks itself tries to wake it up at:
Self::wakeup_task(node.task.clone());

If I move the initialisation of cursor at

let mut cursor = self.list.cursor_front_mut();
down to just before
while let Some(node) = cursor.current() {
where the variable is first used, then this error that I'm seeing on AArch64 goes away (and instead, a bit later, I see the same error as on Intel: #899).

I don't claim that moving the initialisation of cursor is the right way to fix this bug but this observation might help. (And I still don't know why this only happens on AArch64.)

@mkroening
Copy link
Member

mkroening commented Sep 11, 2023

For future reference, you can enable debug information in release builds. That should make getting back traces easier: Profiles—The Cargo Book—debug

@mkroening
Copy link
Member

I think your analysis is correct. Thanks a lot!

In principle, an external mutation of a mutable reference should not be possible and be caught earlier. I think the underlying issue is #622. If that was fixed, we'd get a more helpful “already borrowed” error.

We'd want a proper rework and fix of #622 in the long term, but for the meantime, moving the cursor creation around gets us forward.

@mkroening
Copy link
Member

This was potentially caused by two issues:

  1. interrupts::without and InterruptMutex did not properly synchronize with its interrupt handler through compiler fences (see fix(interrupts): add compiler fences to enable and disable rust-osdev/x86_64#436). This has definitely caused miscompilations.

    I have extracted and reworked the interrupt disabling code into the interrupts crate.

  2. Our core_scheduler function return mutable references, which is unsound and might cause miscompilations. Necessary changes are underway in Make core_scheduler sound #912, which should also rule out such errors in the future (or at least give clearer and earlier errors).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants