
Long Running Futures Eventually Stall #3

Open
fko-kuptec opened this issue Sep 25, 2024 · 22 comments

Comments

@fko-kuptec

I am currently developing a firmware for the ESP that needs to serve the camera feed as an HTTP stream. For testing purposes, I reduced that further down to serving a simple JPEG animation. I initially used edge-executor::LocalExecutor to run my custom async HTTP server, handling every connection in a separate future spawned onto the same thread-local executor. One stream to one client therefore corresponds to one (possibly never-ending) async task, roughly as sketched below.
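
A rough sketch of that serving pattern (hypothetical names; the actual firmware is private). It assumes the smol ecosystem's async-io crate for the async TCP listener:

use async_io::Async;
use edge_executor::LocalExecutor;
use std::net::{TcpListener, TcpStream};

async fn serve<'a>(ex: &LocalExecutor<'a>) -> std::io::Result<()> {
    let listener = Async::<TcpListener>::bind(([0, 0, 0, 0], 80))?;
    loop {
        let (conn, _addr) = listener.accept().await?;
        // One client stream == one potentially never-ending task,
        // spawned onto the same thread-local executor.
        ex.spawn(stream_mjpeg(conn)).detach();
    }
}

// Hypothetical per-connection handler that pushes JPEG frames forever.
async fn stream_mjpeg(_conn: Async<TcpStream>) {
    // ... write multipart/x-mixed-replace JPEG frames in a loop ...
}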

Unfortunately, I ran into some weird issues: when running multiple streams in parallel, any stream might freeze randomly and never come back to life. Sometimes the whole ESP crashes with an unhandled exception. It does not seem to be the same exception every time, but here is one:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception). 
Debug exception reason: BREAK instr 
Core  0 register dump:
PC      : 0x420171d7  PS      : 0x00060b36  A0      : 0x82010e96  A1      : 0x3fccf060  
0x420171d7 - alloc::task::raw_waker::clone_waker
    at ??:??
A2      : 0x3fcd1154  A3      : 0x3fcd1154  A4      : 0x00000000  A5      : 0x3fccf3a0  
A6      : 0x3fc9ade8  A7      : 0x3fca12dc  A8      : 0xfefefefe  A9      : 0xffffffff  
0x3fc9ade8 - xKernelLock
    at ??:??
0x3fca12dc - pxReadyTasksLists
    at ??:??
A10     : 0xfefefefe  A11     : 0xfefefefe  A12     : 0xfefefefe  A13     : 0x00060b23  
A14     : 0xfffffffe  A15     : 0x0000cdcd  SAR     : 0x00000020  EXCCAUSE: 0x00000001  
EXCVADDR: 0x00000000  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff  


Backtrace: 0x420171d4:0x3fccf060 0x42010e93:0x3fccf080 0x4201800a:0x3fccf9c0 0x4202bb7f:0x3fccf9f0 0x42030838:0x3fccfa10
0x420171d4 - alloc::task::raw_waker::clone_waker
    at ??:??
0x42010e93 - std::sys_common::backtrace::__rust_begin_short_backtrace
    at ??:??
0x4201800a - core::ops::function::FnOnce::call_once{{vtable.shim}}
    at ??:??
0x4202bb7f - std::sys::pal::unix::thread::Thread::new::thread_start
    at ??:??
0x42030838 - pthread_task_func
    at ??:??

At some point I tried using a different executor. After replacing edge-executor with futures-executor::LocalPool, I could not reproduce these issues anymore; the streams seem to keep running without problems. Then I wrote my own version of LocalExecutor from scratch, without any third-party crates, and that also seems to work fine.
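
For reference, a minimal sketch of the LocalPool pattern that worked, using the standard futures crate facade (the spawned body is a placeholder):

use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();

    // Each connection would become its own thread-local task.
    spawner
        .spawn_local(async {
            // ... serve one client ...
        })
        .expect("spawn failed");

    // Drive all spawned tasks on the current thread.
    pool.run();
}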

This makes me believe that edge-executor has a bug... somewhere. I don't really have a clue where. I just found this open issue in async-task about tasks randomly not getting rescheduled. That would fit my observations, at least.

Sorry for not providing sample code. I am developing the firmware as an employee and therefore cannot just share our product's firmware. If you are interested, I can, however, try to build a minimal example when I find the time.

@ivmarkov
Owner

ivmarkov commented Sep 25, 2024

Aww, this sounds frightening :( . I do hope it is not really async-task as you suggest, or else the whole smol-rs ecosystem is in big trouble.

Perhaps it is related to our queues. Btw - unlike async-executor, edge-executor's queues are bounded by default. What this means is that if you are spawning extra tasks during the life cycle of your app, and at some point the number of active tasks becomes larger than the queue size, the app would crash in an unpredictable way - most likely precisely in the waker. (See the sketch below.)
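
A minimal sketch of what that means in code, assuming edge-executor 0.4's async-executor-compatible API where the queue capacity is a const generic (the default of 64 shown here is an assumption):

use edge_executor::LocalExecutor;

fn main() {
    // Bounded by default: at most C tasks can be scheduled at a time.
    let _default: LocalExecutor = LocalExecutor::new();

    // If the app can have more live tasks than that capacity, the
    // bounded wake queue can overflow -- the unpredictable-crash
    // scenario described above. A larger explicit capacity helps:
    let _larger: LocalExecutor<'_, 256> = LocalExecutor::new();
}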

Anything you could share in the way of a minimal reproducible case would be greatly appreciated.

If your employer is OK with it, sharing your own local executor would also be interesting.

@ivmarkov
Owner

ivmarkov commented Sep 25, 2024

Just for testing... what happens if you remove the heapless feature and add the unbounded feature?

@fko-kuptec
Author

I already tested it with the unbounded feature enabled, since I also suspected this to be the root cause. No luck, however. My last test today was with this configuration:

edge-executor = { version = "0.4", features = ["unbounded"] }

I talked to my supervisor and he is okay with providing a minimal reproducible example and also with sharing my own executor implementation. The question is just how to handle the latter part. Maybe, for starters, I push it into a private GitHub repo and give you access? Then we could discuss whether and how to merge it into edge-executor.

For now, I will try to build a minimal example for the bug.

@ivmarkov
Owner

Yes - a private GH repo for the executor would be ideal, many thanks!

@ivmarkov
Owner

Also, the following question: have you tried with async-executor? It is also based on async-task and - moreover - is API-compatible with edge-executor. If async-executor crashes on you as well, chances are there is a serious bug in async-task. Otherwise, there might just be something wrong in edge-executor itself, which should be easier to fix.
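
Since the APIs match, the suggested experiment is (roughly) just an import swap plus a rebuild - a sketch:

// Before:
// use edge_executor::LocalExecutor;
// After -- the rest of the code can stay as-is:
use async_executor::LocalExecutor;

fn smoke_test() {
    let ex = LocalExecutor::new();
    let task = ex.spawn(async { 1 + 1 });
    assert_eq!(futures_lite::future::block_on(ex.run(task)), 2);
}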

@fko-kuptec
Author

Interesting... I cannot reproduce the issue in my minimal example. My best guess would be that this is a compiler bug again. I've seen compiler bugs in my firmware before, with things like:

  • just including tokio as a dependency, without using it, would make the streaming significantly more performant
  • removing logging would make it slow again
  • re-including the logging would make it fast again
  • removing the logging again wouldn't impact performance... 0.o

I don't know what's happening. But maybe it's not a bug in edge-executor or even async-task, but just some weird side effects?

@ivmarkov
Owner

I don't know whether this is even an option, but how about flashing (a subset of) your firmware on a RISCV MCU and trying that out instead? The compiler backend for RISCV is upstreamed in LLVM and I guess it is significantly more stable than xtensa.

We are currently having this miscompilation issue, which seems to be triggered by a very large alignment (padding) in ConcurrentQueue that LLVM xtensa does not like.

Coincidentally, edge-executor uses ArrayQueue by default, which has the very same large cache-padding.
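
For illustration only (not the actual crossbeam source): cache padding of this kind wraps a value in a type with a large alignment - 128 bytes on some targets - so that hot atomics never share a cache line. It is this unusually large alignment that the linked miscompilation issue points at:

use core::sync::atomic::AtomicUsize;

#[repr(align(128))]
struct CachePadded<T>(T);

// e.g. a queue's head and tail indices each get their own cache line:
struct Indices {
    head: CachePadded<AtomicUsize>,
    tail: CachePadded<AtomicUsize>,
}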

Chances that this is the culprit are slim, but trying it out with RISCV is one option to eliminate this hypothesis; another would be to enable heapless and try the queue from heapless instead.

A third option is to try your full firmware with async-executor as I mentioned earlier.

@ivmarkov
Owner

Also: are you using any async drivers from esp-idf-hal? Perhaps you can stop those during your test?

@ivmarkov
Owner

ivmarkov commented Sep 25, 2024

Interesting... I cannot reproduce the issue in my minimal example. My best guess would be that this is a compiler bug again. I've seen compiler bugs in my firmware before, with things like:

  • just including tokio as a dependency, without using it, would make the streaming significantly more performant

I think you might be getting paranoid :-) The linker would erase all traces of tokio if you don't call into it.

  • removing logging would make it slow again
  • re-including the logging would make it fast again
  • removing the logging again wouldn't impact performance... 0.o

As I said - maybe trying it out with RISCV is an option. It definitely sounds weird, but without seeing at least a portion of the code, I'm running out of hypotheses.

@fko-kuptec
Author

A short update for today: I can finally reproduce the issue with my minimal example. It happens with both edge-executor and async-executor, but only if the PSRAM is enabled, which is... interesting. The exception messages make it look like heap corruption, but enabling the heap integrity checks in ESP-IDF just makes for less helpful panic messages.

Anyway, it will take me until tomorrow to give you a clean report and access to my private executor.

@ivmarkov
Owner

ivmarkov commented Sep 25, 2024

Hmmmmm. This might be related. Are you confining the PSRAM to just one of the two CPUs, or are you happily using malloc with PSRAM enabled?
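
(For clarity, the "happily using malloc" case refers to an ESP-IDF configuration roughly like the following sdkconfig excerpt - flag names from ESP-IDF's SPIRAM support:

CONFIG_SPIRAM=y
CONFIG_SPIRAM_USE_MALLOC=y

With CONFIG_SPIRAM_USE_MALLOC set, plain malloc() may return PSRAM-backed memory, so Rust allocations - including task and waker state - can land in PSRAM.)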

@ivmarkov
Owner

ivmarkov commented Sep 26, 2024

@fko-kuptec Is this running on the stock ESP32, or on ESP32S3? Asking because - at least in theory - the ESP32S3 should not have the atomics-in-PSRAM issue I mentioned in my previous comment. (Also - and to my understanding - no RISCV chip should be affected by that.)

@fko-kuptec
Author

I am running the tests on ESP32-S3 :)

@ivmarkov
Owner

ivmarkov commented Sep 26, 2024

Ok, so either the S3 is also affected, or there is something else going on with PSRAM. In any case, your reproducible example, as well as your infallible executor, would help greatly in figuring out what is going on. Especially given that you have confirmed that async-executor is affected as well.

@ivmarkov
Owner

Also, if you have them handy, can you paste here the other crash dumps you have recorded?

@fko-kuptec
Author

Finally, you can find the reproduction of the bug in this repo. The README includes instructions, information on what works and what doesn't, and a few stack traces.

I have also given you private access to the fko-kuptec/emserve repository, containing my LocalExecutor implementation.

@ivmarkov
Owner

Thanks!

One thing I noticed (still shooting in the dark) is that you are running both async-executor and edge-executor on top of the single-future executor of esp_idf_hal::task::block_on. Have you tried using either of the two, but with futures_lite::future::block_on instead, with PSRAM enabled?
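
A sketch of the suggested swap, assuming edge-executor's async-executor-compatible run() entry point:

fn run(ex: &edge_executor::LocalExecutor<'_>) {
    // Before: the single-future executor from esp-idf-hal drives the loop:
    // esp_idf_hal::task::block_on(ex.run(async { /* serve */ }));

    // After: futures-lite provides the outer blocking loop instead:
    futures_lite::future::block_on(ex.run(async {
        // ... spawn and await the connection tasks here ...
    }));
}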

@fko-kuptec
Author

One thing I noticed (still shooting in the dark) is that you are running both async-executor and edge-executor on top of the single-future executor of esp_idf_hal::task::block_on. Have you tried using either of the two, but with futures_lite::future::block_on instead, with PSRAM enabled?

I tried it with futures_lite::future::block_on and edge-executor, and now I am getting these panics:

assert failed: block_split tlsf.c:410 (block_to_ptr(remaining) == align_ptr(block_to_ptr(remaining), ALIGN_SIZE) && "remaining block not aligned properly")


Backtrace: 0x40375cfa:0x3fcc0190 0x4037cded:0x3fcc01b0 0x403846c5:0x3fcc01d0 0x40382a4b:0x3fcc02f0 0x40382427:0x3fcc0310 0x403763a3:0x3fcc0330 0x403763ee:0x3fcc0360 0x403846d5:0x3fcc0380 0x4202f9b4:0x3fcc03a0 0x4202fa5c:0x3fcc03c0 0x4202fa9b:0x3fcc03e0 0x42034a6b:0x3fcc0400 0x42034efa:0x3fcc0430 0x4203e8e0:0x3fcc0480 0x4203f68c:0x3fcc04b0 0x4202f769:0x3fcc04d0
0x40375cfa - panic_abort
    at ??:??
0x4037cded - esp_system_abort
    at ??:??
0x403846c5 - __assert_func
    at ??:??
0x40382a4b - tlsf_malloc
    at ??:??
0x40382427 - multi_heap_malloc_impl
    at ??:??
0x403763a3 - heap_caps_malloc_base
    at ??:??
0x403763ee - heap_caps_malloc_default
    at ??:??
0x403846d5 - pvalloc
    at ??:??
0x4202f9b4 - mem_malloc
    at ??:??
0x4202fa5c - do_memp_malloc_pool$isra$0
    at ??:??
0x4202fa9b - memp_malloc
    at ??:??
0x42034a6b - tcp_create_segment
    at ??:??
0x42034efa - tcp_write
    at ??:??
0x4203e8e0 - lwip_netconn_do_writemore$isra$0
    at ??:??
0x4203f68c - lwip_netconn_do_write
    at ??:??
0x4202f769 - tcpip_thread
    at ??:??
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x40382b4a  PS      : 0x00060133  A0      : 0x8038242a  A1      : 0x3fcabaf0  
0x40382b4a - tlsf_malloc
    at ??:??
A2      : 0x3fccc91c  A3      : 0x3fca1eac  A4      : 0x00060720  A5      : 0x00060123  
A6      : 0x00000001  A7      : 0x3fca1e78  A8      : 0x3fca1e54  A9      : 0xffffffff  
A10     : 0x3fccc914  A11     : 0x00000019  A12     : 0x3fccc9e8  A13     : 0x00000010  
A14     : 0x00000020  A15     : 0x00000005  SAR     : 0x0000001b  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x0000000b  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff  


Backtrace: 0x40382b47:0x3fcabaf0 0x40382427:0x3fcabb10 0x403763a3:0x3fcabb30 0x403763b9:0x3fcabb60 0x40379035:0x3fcabb80 0x4209d24d:0x3fcabba0 0x40385e65:0x3fcabbd0 0x400398b5:0x3fcabc00 0x40383c31:0x3fcabc20 0x4038543d:0x3fcabc70
0x40382b47 - tlsf_malloc
    at ??:??
0x40382427 - multi_heap_malloc_impl
    at ??:??
0x403763a3 - heap_caps_malloc_base
    at ??:??
0x403763b9 - heap_caps_malloc
    at ??:??
0x40379035 - esp_coex_common_malloc_internal_wrapper
    at ??:??
0x4209d24d - esf_buf_alloc_dynamic
    at ??:??
0x40385e65 - esf_buf_alloc
    at ??:??
0x40383c31 - ieee80211_output_process
    at ??:??
0x4038543d - ppTask
    at ??:??

@fko-kuptec
Author

Note, though, that I am also running my own executor on top of the ESP-IDF block_on(), and that seems to work. Therefore, I don't think that's the root cause.

@fko-kuptec
Author

Oh, and regarding RISC-V: unfortunately, we don't have any of the usual ESP32 RISC-V devkits here. The only thing we have is the brand-new P4 evaluation kit, but that's a different kind of beast...

@ivmarkov
Owner

I was able to reproduce your crashes twice or thrice with a stock esp32 + psram (I don't have any S3 with psram).

I'm not able to reproduce your crashes without psram so far, but honestly I did not try hard enough.

Even with psram, reproducing it after the first crash was a bit difficult. I have to wait quite a bit (sometimes 15+ minutes) until it happens.

Neither with nor without psram am I observing any delays or stalls. Might be an S3-specific manifestation of the same problem.

It is true that opening a 3rd, 4th, 6th Chrome window becomes slower and slower (the ESP accepting a 7th connection does not even work for me!), but this happens with or without psram, and I think it is just a matter of how many sockets the poor little ESP can handle simultaneously.

I still suspect the problem is related to atomics not working OK when they are in psram, but I can't prove it yet (I noticed your own executor does not use any atomics). Will continue working on this next week; just keep in mind it might take a bit of time (I will be away on Monday).

@fko-kuptec
Author

Thank you for taking a look :)

I'm not able to reproduce your crashes without psram so far, but honestly I did not try hard enough.

Without PSRAM I couldn't find any crashes either (see the table in the repo's README). The issue only seems to occur when PSRAM is enabled.

Even with psram, reproducing it after the first crash was a bit difficult. I have to wait quite a bit (sometimes 15+ minutes) until it happens.

It's quite random, that's true. For me, however, the stalls or crashes usually start to show up within a minute or two.

It is true that opening a 3rd, 4th, 6th Chrome window becomes slower and slower (the ESP accepting a 7th connection does not even work for me!), but this happens with or without psram, and I think it is just a matter of how many sockets the poor little ESP can handle simultaneously.

Yeah, the poor thing is not really made for that. But becoming slower is not really my concern; it's just that it stalls for me under the described circumstances.

noticed your own executor does not use any atomics

Oh, my executor is using atomics, just the ones from std. So, no extra dependency for that ^^
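
Illustrative only (not the actual emserve code): the kind of std-only atomic a hand-rolled executor might use to flag pending work:

use std::sync::atomic::{AtomicBool, Ordering};

struct WakeFlag(AtomicBool);

impl WakeFlag {
    // Called from the waker: mark that a task wants to be polled again.
    fn wake(&self) {
        self.0.store(true, Ordering::Release);
    }

    // Called from the executor loop: consume the pending-wake flag.
    fn take_wake(&self) -> bool {
        self.0.swap(false, Ordering::Acquire)
    }
}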

Will continue working on this next week; just keep in mind it might take a bit of time (I will be away on Monday).

I am using my own executor in our firmware, which seems to run rock-solid now. So no pressure :)
