Potential memory corruption #832
HW Rev 2:
- Split program into multiple tasks; this makes it easier to spot issues
- Stale tasks should at least result in a log entry
- Include reset reason in human-readable format

All HW revs:
- Prevent memory corruption in the MQTT PubSubClient lib (issue knolleary/pubsubclient#832)
- Switch to an MQTT PubSubClient lib without dynamic memory allocation
- MQTT subscribed-topic buffer max size
- MQTT dynamic JSON to static JSON buffer
- Logging buffer as local short-lived vars
- Config save and reset moved to loop task
- More robust WiFi scan code
- Monitor largest free heap block
- Added build instructions

This release contains many changes, especially for HW revision 2. Please test and report any issues. If this release proves stable I will push it as a stable release.
After discussion with @arjenhiemstra on issue: knolleary/pubsubclient#832
I have been banging my head on this topic for the past two weeks, assuming this part of PubSubClient was OK. Below is the message I received:
A big THANK YOU to @arjenhiemstra on this one; please add his small validation to the trunk of this project!
@genotix great this helped you as well! With this check the device no longer corrupts memory, but it doesn't fix the underlying issue (it can still cause timeouts). With the help of TD-er and a lot of testing the loop code was improved even further. Have a look at: https://github.com/arjenhiemstra/PubSubClientStatic Tested at a rate of about 50.000 MQTT messages per hour and it is rock-steady stable now!
@arjenhiemstra happy to review a PR with those improvements, given that it sounds like it's been thoroughly tested - far more so than I would be able to do.
I will prep a PR ASAP
At a glance, it does look like you've forked a different fork... So it depends on how extensive the changes are. But a PR is the best next step.
I know, I will fork a clean one and only factor in the relevant changes. The other fork adds the possibility to allocate a static memory buffer instead of a heap-allocated one. That helped me solve the issue but is not important for this issue in the end.
@arjenhiemstra that is great to hear! I have been setting up a €12 device (LilyGo T-CALL v1.4) that is able to connect -over GPRS- to AWS. I will continue testing on the single-line code change, and as soon as my 4 extra USB-C cables arrive I will adopt your sources and roll them out to 4 extra devices. For now I have been sending out messages at random intervals; I'm at 2500 messages now.
The memmove call is executed without any sanity checking on the values used, causing random reboots when an MQTT packet with an impossible topic length gets processed. With a high-rate test flow of MQTT messages this would sometimes corrupt memory after a few seconds. Extra sanity checks and calculations for payload_offset and message length have been implemented to better handle messages and prevent this memory corruption. Tested with a flow of about 50.000 messages per hour for multiple days. Went from hundreds of errors per hour to a few per day, the last errors mainly because the MQTT server couldn't keep up with the message flow. It never crashed again.
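As a rough sketch of the kind of sanity check described above (all names here - `moveTopicChecked`, `payloadOffset`, `topicLen`, `bufferSize` - are illustrative, not PubSubClient's actual internals):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative only: reject an impossible topic length before the memmove
// that shifts the topic down one byte to make room for a NUL terminator.
bool moveTopicChecked(uint8_t* buffer, size_t bufferSize,
                      size_t payloadOffset, uint16_t topicLen) {
    // A topic that cannot fit in the buffer can only come from a
    // malformed or out-of-sync packet: refuse to touch memory.
    if (payloadOffset == 0 || payloadOffset + topicLen >= bufferSize) {
        return false;  // let the caller drop the packet / disconnect
    }
    memmove(buffer + payloadOffset - 1, buffer + payloadOffset, topicLen);
    buffer[payloadOffset - 1 + topicLen] = 0;  // NUL-terminate the topic
    return true;
}
```

The point is that the check costs a single comparison per packet, while an unchecked memmove with a corrupt length silently tramples the heap.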
Created PR #835
@arjenhiemstra Arjen, I have tried your PubSubStatic version a few times now, though I'm fairly sure it is giving me issues around the MQTT timer ticks. Your "one-liner" gives me -a lot- more stability. My average payload size is about 300 to 1500 bytes. Feel free to contact me to discuss this topic, and perhaps let me help debug/test this, since I am very, very happy with that one-liner you created!
@genotix maybe best to try the code of PR #835 (can be found here: https://github.com/arjenhiemstra/pubsubclient)
Will check it!
I'm getting FreeRTOS issues.
It seems as if the watchdog is triggered. Backtrace decoded:
I have been fighting with this issue as well in my ESP32 projects (random crashes and hanging code). Finally I boiled it down to getting extra 0xff bytes in the incoming TCP stream from my WiFiClient from time to time. The problem is on the TCP layer or even below, not in PubSubClient. I successfully reproduced this with just a TCP connection, without any functionality on top, but I have no idea yet where the extra bytes are coming from. As a consequence, PubSubClient runs into trouble when decoding such packets, especially when it comes to packet lengths. I identified four points where that happens:
Case 1 is already handled by disconnecting the MQTT connection. I guess this is the best way to deal with a TCP stream that is out of sync; after reconnecting, we are back to normal. I modified the PubSubClient code to handle the other 3 cases similarly by disconnecting. And voilà: lots of disconnects, but no more random crashes or hanging code so far! (Tested by hammering the client, just like Arjen did, and by running it in production code for half a year.) These issues do not happen with correct packets. However, I think this should be addressed anyhow. Bad guys might even exploit this by intentionally sending malformed packets... I prepared PR #843 including some debugging code. Anybody willing to cross-check, maybe @genotix or @TD-er ?
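For illustration, defensive decoding of the MQTT "remaining length" varint in this disconnect-on-malformed spirit might look like the sketch below. This is an assumption-laden sketch, not the code from PR #843; the function name is invented, and the caller is expected to disconnect on a negative return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decode the MQTT "remaining length" field: up to 4 bytes, 7 payload bits
// each, high bit set on all but the last byte. Returns the decoded length,
// or -1 for a malformed/out-of-sync stream so the caller can disconnect.
long decodeRemainingLength(const uint8_t* buf, size_t len) {
    long value = 0;
    long multiplier = 1;
    for (size_t i = 0; i < len; ++i) {
        value += (long)(buf[i] & 0x7F) * multiplier;
        if ((buf[i] & 0x80) == 0) return value;  // last length byte
        multiplier *= 128;
        if (i == 3) break;  // MQTT allows at most 4 length bytes
    }
    return -1;  // ran past 4 bytes or out of data: treat as corrupt
}
```

A stray 0xff in the stream sets the continuation bit forever, so without the 4-byte cap the decoder would happily walk off into garbage - exactly the failure mode described above.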
Sounds great @marcel7234! If we get this stuff stable we should do an online beer session; my treat!
Very nice analysis and great work @marcel7234!! I'll try to test the code as well!
So far, so good, @marcel7234! Will be running it this night, hoping it will still be running tomorrow morning!
Thanks, @genotix! Maybe there are more reasons for crashes than the ones we already found. Let's make it a team effort to fix them!
@marcel7234 great idea!
I am using SSLClient for a TLS 1.2 connection to AWS IoT Core, so this could be related to that; just how to figure it out... Just as an addition: I'm transmitting every 1 to 3 seconds and the payload doesn't differ unless the system had to do any kind of reconnect. My setBufferSize() varies but shouldn't go far over 2 KB. I received a crash after about 2500 messages:
Decoded stack results
@genotix I think it is even more important to note the crash reason as well. Also, a bit more context may be needed to interpret the crash.
@genotix : Your stack looks like SSLClient is deep into encoding a message for sending, then suddenly the FreeRTOS interrupt handler (xPortSysTickHandler) steps in. There is no stack information above that, possibly due to a corrupted heap, so I can only guess.

Do you call any MQTT operations (publish, loop, ...) from inside an interrupt or timer? If so, @arjenhiemstra already pointed out the adverse effect of nested operations, as PubSubClient uses a common buffer variable. The logic of PubSubClient seems to be that nested operations must be avoided by the user. Since you have no control over when an interrupt or timer fires, you should never call a PubSubClient method from there. I ran into the same problem myself some time ago. My fault... The solution was to only use PubSubClient methods from within the main loop of my program. No interrupts, no timers, no threads, no nothing with MQTT.

Then, one must be aware of what PubSubClient::loop() does. I initially thought it was a good idea to call it as often as possible to process any open tasks. I did so from within my callback function that included sending messages. Then: callback() calls loop() calls callback() calls loop() and so on. Stack overflow, me bad... Call PubSubClient::loop() regularly from your main loop, but from nowhere else.

Just some thoughts; maybe you can have a second look at your code with these remarks in mind.
You should make sure that a callback due to an incoming MQTT message can never occur at the same time a publish is underway. Do not use an ISR for the callback and, as @marcel7234 states, try to do everything MQTT-related from the main loop.
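One common way to follow this advice is to defer work out of the callback: the callback only queues a response, and the main loop performs the publish after the client's loop() has returned. The sketch below illustrates the pattern with invented names (`pendingReplies`, `onMqttMessage`, `mainLoopIteration`); it is not PubSubClient's API.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Deferred publishes, filled by the callback, drained by the main loop.
std::queue<std::string> pendingReplies;

// Called by the client library while it is parsing an incoming packet.
void onMqttMessage(const std::string& topic) {
    pendingReplies.push("ack/" + topic);  // defer: do NOT publish from here
}

// One iteration of the main loop: let the client process packets first,
// then drain deferred work, so publish() never runs inside the parser
// that shares the client's buffer.
template <typename Client>
void mainLoopIteration(Client& client) {
    client.loop();  // the only place loop() is called
    while (!pendingReplies.empty()) {
        client.publish(pendingReplies.front());
        pendingReplies.pop();
    }
}
```

Because everything happens on one thread in one loop, a publish and a callback can never overlap, which is exactly the invariant the shared-buffer design requires.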
Ok, I just removed the definition of MQTT_MAX_TRANSFERSIZE (which was set to the buffer size + 20 bytes). I'm at 4500 messages now and still going strong.
Ok, I'll dig up the crash reason as soon as it shows again. |
I believe the SSLClient library is using Tasks (FreeRTOS) to manage the connection. |
Currently at 5000+ messages; going strong. My focus is currently on that anyway, since Marcel's code seems to respond more stably than the original, and also the fix that @arjenhiemstra did (which already improved the connection -A LOT-). Thumbs up and will keep you posted!
Crash at somewhat above 7300 messages. I'll try to find some backtrace once the IDE is responsive again. |
Ok I got something:
Followed by:
Unhandled debug exception. This will usually be followed by a message like: `Debug exception reason: Stack canary watchpoint triggered (task_name)`
I agree with Arjen. Put a static/global counter inside your functions, increment it on entering the function and decrement it on leaving. Output a warning when this counter reaches >= 2. This way you can identify functions that are nested although they should not be. You might be surprised...
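A minimal sketch of that counter idea, using an RAII guard so the decrement can never be forgotten on early returns. All names are invented for illustration; instrument your own MQTT-touching functions the same way.

```cpp
#include <cassert>
#include <cstdio>

static int loopDepth = 0;          // incremented on entry, decremented on exit
static bool nestingDetected = false;

struct LoopGuard {
    LoopGuard() {
        if (++loopDepth >= 2) {    // depth 2 means the function re-entered itself
            nestingDetected = true;
            printf("WARNING: nested call, depth=%d\n", loopDepth);
        }
    }
    ~LoopGuard() { --loopDepth; }  // runs on every exit path
};

// Stand-in for a function that must never be re-entered, e.g. client.loop().
void guardedLoop(bool reenter) {
    LoopGuard g;
    if (reenter) guardedLoop(false);  // simulate an accidental nested call
}
```

On a device you would replace printf with your logger; the one-time boolean also lets you check the result after a long soak test instead of watching the serial console.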
I have moved everything to the main loop (I had several mqtt.loop() calls around an SSLClient.flush(), since that had helped some people using this library). Will soon report back on the results!
Ok, moved to a single mqtt.loop() called from my loop() function.
I've tried the big payload publish and the small one.
That is to be expected, because the kernel panic indicates you have run out of memory. Stack memory, to be precise. That most probably means the task you are running has no more stack memory left. You can find the current task's stack high-water mark with this statement within the task you want to measure:

uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );

Be aware that the Arduino code (loop()) is also running in a task. This task has 8 KB of stack memory by default; if you are using non-dynamically allocated memory for SSL, a big MQTT buffer etc., this could cause you to run out of memory quite fast.
That's the canary again. You should figure out the reason for your stack overflow; that's the highest priority. Functions like uxTaskGetStackHighWaterMark() can be handy to identify a growing stack.
timing @marcel7234 ;-) lol! |
Perfectly aligned, @arjenhiemstra ! :-) |
Okidoki; moving focus to measuring the stack! If I put the check in the loop of the application, I assume it will show the loop stack only. I'm wondering if I can check the running tasks and their stacks without meddling with the libraries used... 🤔
The Arduino ESP32 framework is using FreeRTOS, see: https://github.com/espressif/arduino-esp32/blob/8dc70e0add06e75da5a255ba94fe86df98ae5ef1/cores/esp32/main.cpp#L32
with the task's handle. For the Arduino task this handle is defined here:
If your lib uses tasks and defines a task handle, you could use this handle to specifically measure that task.
Well, the operating system behind your Arduino sketches is FreeRTOS. Hence you are using it. Have a look at the documentation of uxTaskGetStackHighWaterMark():
Brilliant timing again @marcel7234 hahaha |
@arjenhiemstra : Funny! Same ideas, same timing. I like that!
I have been digging into the SSLClient and ran into this article.
Gents, I think it's safe to assume my issues are coming from the SSLClient library, per the above-mentioned article.
A big shoutout to @marcel7234 and @arjenhiemstra 's efforts on this topic though! Thank you, guys! |
Sometimes things just start to move along. The dev of SSLClient just indicated an incoming modification; he has found the issue: OPEnSLab-OSU/SSLClient#30 (comment) Aside from this, I think I might have found an alternative to the SSLClient mentioned above: https://github.com/govorox/SSLClient which uses mbedTLS instead of BearSSL
@genotix : Sounds great! I can't help you with the SSL clients, but I wish you success. There are issues in many libraries, and for sure way more in one's own code; for sure we will meet again somewhere... :-) Have fun!
Some issues have disappeared in the past week, leaving me with progress.
Any fixes for this? I cannot get subscribe to be reliable. I am sending 1 message per second and I have cut the callback down to just the message length, and we are still crashing. Is there any extra debug that can be enabled to see why? I just get very slow messages and then a reconnect.
Implements the one-liner fix suggested by arjenhiemstra for issue knolleary#832 on the main PubSubClient GitHub. This solves the issue where the ESP32 would crash if a Shared Attribute or RPC callback is received while we are uploading telemetry.
I had random crashes that I could pin to the PubSubClient. Further testing made it possible to pin it down to a specific part of the code.
Incoming and outgoing messages use the same buffer. If a publish is in progress while a callback is triggered, an incorrect value can be read from the buffer for the topic length, resulting in an out-of-bounds memmove which will cause heap corruption.
Shouldn't we at least have a simple check here that tl < bufferSize? That should never happen, but it might in the described situation. I.e.:
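A hedged sketch of what such a guard could look like, using the names from the comment (tl = parsed topic length, bufferSize = the client's buffer size); the function wrapper is invented for illustration and is not the library's code:

```cpp
#include <cassert>
#include <cstdint>

// A topic length that cannot fit in the buffer can only come from a corrupt
// or out-of-sync stream; the caller should drop the packet (or disconnect)
// instead of proceeding to the memmove with a bogus length.
inline bool topicLengthPlausible(uint16_t tl, uint16_t bufferSize) {
    return tl < bufferSize;
}
```

A single comparison like this is the "one-liner" discussed throughout the thread: it does not fix why the length is wrong, but it keeps a wrong length from corrupting the heap.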