[OpenHAB MQTT] Crash due to yield panic #1625
Maybe you could test this PR: #1627. I just tested it; it compiles and did not crash my ESP32 node when pressing the test notification. |
I loaded PR 1627 into two nodeMCU boards. Identical configurations, except only one of them will do MQTT publish. I'll run them overnight and check their uptime in the morning. If uptime is wrong then we will know that there was at least one reboot.
|
13 hours later ... The other board was reset several times. Resets occurred about every 2-4 hours. I'm now running with serial log, Debug Dev. Maybe there will be something in the expanded log that helps ID this.
|
It's important to note that although both boards have identical software configurations, the one that is working well is missing the MPU6050 sensor I use to detect machine movement. So later today I will disable the MPU6050 and see if that stops the resets.
|
I captured two panics in the log. Maybe there's a breadcrumb in there that helps to ID what is going on. See entries 11575248 and 899358. |
Is there a possibility the broker is busy or for some other reason not answering? |
That is a good question. Openhab is running correctly. It's installed on a RaspberryPi Model 3+. I studied the log messages and they seem to suggest that the panic occurred while in misc.ino's rulesProcessingFile(). That's because the last log entry before the crash is "ACT :" which is in rulesProcessingFile(). This log output is printed immediately before the PluginCall(). Also, whenever the openhab controller _C005.ino does a MQTTpublish() it prints a "MQTT :" log entry. But we don't see that in the logs. So I don't know if the publish ever occurred. Not sure what these debug crumbs are telling us. Do you have any ideas? For completeness, here's the logs leading up to the panic crash:
But it should have gone more like this:
|
How do you send the publish commands? |
Yes the publish actions are in the ESPEasy rule file. To further debug this I've commented them out and restarted my tests. If the reboot appears then the problem is not caused by MQTT publish. Here's the rule excerpt:
I don't use Touch events (idx>=500), so only the first if statement applies to this issue.
|
It's been running trouble-free for four hours after removing the publish statements from ESPEasy's rule file. So I am back to testing with MQTT publish. I looked at the core and found the panic function. There's no Serial.flush before the exception, so my theory is that we lose the last buffered serial log messages at the panic reboot, which means there's a chance the logs are incomplete before the reboot. Just a theory. I edited _C005.ino and added a couple more addLog statements with Serial.flush on both sides of the MQTTpublish(). Hopefully that helps push out the debug messages before a panic reboot. I've restored the rule file's publish statements and I am now restarting the testing. I also shut down my home control system (openHAB) to see what would happen to the unanswered MQTT; ESPEasy doesn't care, no problems appeared. Lastly, I've increased the Tools->Advanced->Controller Message Interval from 100 ms to 250 ms. I'll collect new logs for a few hours. Maybe the new _C005.ino addLog messages will provide more debug insight.
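For reference, a minimal sketch of what such debug wrappers can look like. The exact code added to _C005.ino is not shown in this thread, and the addLog()/MQTTpublish() declarations below are assumptions standing in for ESPEasy's own functions:

```cpp
#include <Arduino.h>

// Assumed stand-ins for ESPEasy's own functions (signatures not verified here).
void addLog(byte loglevel, const String& line);
bool MQTTpublish(const char* topic, const char* payload, bool retained);

// Hypothetical debug wrapper, not the actual _C005.ino change.
void debugWrappedPublish(const char* topic, const char* payload) {
  addLog(2, F("C005 : Before MQTT publish"));  // 2 = info-level log (assumed constant)
  Serial.flush();   // push buffered serial log out before a possible panic reboot
  MQTTpublish(topic, payload, false);
  addLog(2, F("C005 : After MQTT publish"));
  Serial.flush();   // and again after the publish returns
}
```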
|
This "Controller Message Interval" variable is being used for multiple things and one of them is actually a blocking wait for that period. So that's one thing I would like to change. It looks like it is meant to be used to give the MQTT broker some time to process requests, but I guess using a buffer at the controller side would make more sense. This yield panic crash occurs when a Yesterday I split the rules processing function to make it more readable. |
@TD-er: Thanks for the explanation. I was aware that you were working on the Controller Message Interval. My test run last night was interesting. Only one reboot, whereas I would have expected several. Could have been blind luck, or something about my altered test conditions contributed to stability. But the overnight test's panic reboot provided another clue. Here's the log excerpt:
The "Before MQTT" and "After MQTT" are addLog message wrappers I placed before and after the MQTTpublish() in _C005.ini. I only mention this because you are probably wondering what they are. The interesting part is the Command: publish log message immediately before the panic reboot. It is log message that occurs inside misc.ino's ExecuteCommand(). And it's even more interesting is that you are starting to look at ExecuteCommand. So hopefully this rabbit hole journey is taking us in the right direction.
|
This afternoon I added some commits just to make the commands return a String stating their return status. Hopefully I find enough time this evening to implement it so you can test it. |
Let me know when your changes are complete and you need me to try them out.
|
I did change some things this morning, but there are still some issues with that, which I will fix.... now ;) |
I just pushed a fix to the existing pull request. In short:
Still to do:
|
I have it running now. Everything is working fine. I'll keep an eye out for reboots or other odd behavior and report back tomorrow (or sooner if something goes wrong tonight). FWIW, the updated test branch doesn't seem to have the latest src files. For example, it did not have my recent Nextion.ino. I didn't check any other files for "freshness." Thank you for your efforts to make the code more stable. It is appreciated.
|
After running for about one hour it rebooted. I'll run it again with serial log enabled and capture the exception.
|
Is this one now fixed? |
No problem, I'll test the 20181128 pre-compiled build when it becomes available. I've tested most of the previous builds and all experienced one or more reboots each day. I'm currently running 20181124 on two NodeMCUs; one has had 8 resets and the other 12 resets so far. The web GUI reports the random resets as being caused by the software or hardware watchdog (it varies). But the GUI's reset reason has never specifically stated a panic yield reset. So either this specific reset reason is not being reported in the GUI or the panic yield issue is fixed.
|
It might be that the panic yield is back, and if so, please report that also. |
I can do that if you add reset cause "panic yield" to the web GUI System Info page. That is to say, I don't think it specifically reports it with the current code. I don't want to use serial log for monitoring panic yield. That is because serial log usage introduces more resets and skews the testing.
|
Good point. |
Upcoming build will have a lot more free stack, so let's hope things work out even better now :) |
Looking forward to checking it out.
|
For the last few days I've been testing ESP_Easy_mega-20181207_test_ESP8266_4096.bin on two NodeMCUs. I still experience reboots. I'm using the OpenHab MQTT controller. Recently I reduced the Controller Message Interval (Advanced Settings) to 75 ms (was 100 ms) and so far I've seen fewer reboots. If this truly is an improvement, it suggests that the MQTT controller is involved in my reboot issue. Perhaps there's some code blocking during MQTT communication that hangs the core too long. I've set up my rules so that I get an email when a reboot occurs, and I would like the email to include the reboot reason. I've looked at the system variable list and did not find one for the reboot cause. Is there a hidden Easter egg variable that provides the reboot error code? If not, no worries.
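On the reboot-reason question: this thread doesn't show an ESPEasy system variable for it, but for reference the ESP8266 Arduino core itself exposes the last reset cause via ESP.getResetReason(), which is the kind of value such a variable would have to surface (plain Arduino sketch, not ESPEasy code):

```cpp
#include <Arduino.h>

void setup() {
  Serial.begin(115200);
  // Returns strings such as "Power On", "Exception",
  // "Software Watchdog" or "Hardware Watchdog".
  Serial.print(F("Last reset reason: "));
  Serial.println(ESP.getResetReason());
}

void loop() {}
```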
|
I have continued to test new releases. Still experiencing random reboots every day. I no longer log the reboot messages, so I cannot say if they are related specifically to the "yield panic" exception. I recommend closing this old issue. There are now several open tickets on ESPEasy's random reboot problems. No need for multiple open cases on it.
|
It looks like you're also following other issues, but I would suggest you also test some of the newly added WiFi related experimental settings (Tools => Advanced, at the bottom) |
I previously tried the Tools->Advanced->Force WiFi B/G setting and did not observe any improvement to the reboot issues. Any other settings you think I should try?
|
The Gratuitous ARP. By sending those "answers to questions nobody asked for" you make it less likely the ESP cannot be reached anymore. |
Gratuitous ARP is currently enabled (default setting). Do you mean you want me to disable it? BTW, I read the comments in another issue ticket about the possibility of voltage drop during WiFi startup. So a few minutes ago I reworked one of my NodeMCU's as follows: I will be surprised if these mods help, but it's only 10 minutes work. So worth checking out.
|
Nope, leave it on. I've also seen some strange issues with missing packets as soon as you enable the "eco mode" I added recently. This eco mode is nothing more than calling |
Ok, thanks for the advice. I'll continue to monitor the problem. If any miracles occur I will let you know.
|
Miracles probably deserve a new issue, don't you think? |
Miracles on the reboot issue would deserve a celebration with beverages and party hats. Unfortunately my NodeMCU with the latest code and hardware mods rebooted a few minutes ago. So the miracle didn't happen yet.
|
Hmm too bad :( |
Is this still an issue? |
Reboots have been an ongoing issue. But after updating one of my NodeMCU's with ESP_Easy_mega-20191003_test_core_260_sdk222_alpha_ESP8266_4M1M.b last week things are looking much better. No reboots on this test subject in 6 days! A new record for this device. I'd suggest closing this issue since it's over a year old.
|
Great and maybe you can already order the "beverages and party hats." ;) |
I was about to do that, but I discovered that the new release has broken the Nextion plugin (sends NULLS instead of idx data). So I'm back to the older ESP_Easy_mega-20190830_test_core_260_sdk3_alpha_ESP8266_4M.bin.
|
Can you make an issue for that? Then I can have a look. (a bit more descriptive than this :) ) |
Creating an issue ticket will need to wait for some spare time (the required list of info takes a lot of time to prepare). But here's a head start. I found that the problem is in the rules parsing. I have a rule that publishes the Nextion's IDX data to my Openhab2 broker. When it sends this: My OpenHab log shows this: The problem is averted if I do this: The previous release (ESP_Easy_mega-20190928_test_core_260_sdk222_alpha_ESP8266_4M1M) does not have this problem. The change log on the new release mentions several changes to the rules parser. Seems to me those changes have broken something.
|
I created a new issue about it. |
As reported here: #1574 (comment)
Suggested fix:
Make the callback function just register handling of the MQTT message as a new job in the scheduler. This will result in all calls made during processing of the MQTT message being called from `loop()`, and thus be allowed to call `yield()` or `delay()`.
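A minimal sketch of that approach, assuming a PubSubClient-style callback (the names and the handleMQTT() hook are illustrative, not the actual ESPEasy code):

```cpp
#include <Arduino.h>
#include <queue>

struct MqttJob {
  String topic;
  String payload;
};

std::queue<MqttJob> mqttJobs;   // jobs registered by the callback, processed in loop()

// PubSubClient-style callback: only queue the message, do no real work here,
// because yield()/delay() must not be called from this context.
void mqttCallback(char* topic, byte* payload, unsigned int length) {
  MqttJob job;
  job.topic = topic;
  job.payload.reserve(length);
  for (unsigned int i = 0; i < length; ++i) {
    job.payload += (char)payload[i];
  }
  mqttJobs.push(job);
}

void setup() {}

void loop() {
  // Handle queued MQTT messages from loop(), where yield()/delay() are allowed.
  while (!mqttJobs.empty()) {
    MqttJob job = mqttJobs.front();
    mqttJobs.pop();
    // handleMQTT(job.topic, job.payload);  // illustrative hook: run rules/commands here
  }
}
```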