Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenHAB MQTT] Crash due to yield panic #1625

Closed
TD-er opened this issue Aug 3, 2018 · 98 comments
Closed

[OpenHAB MQTT] Crash due to yield panic #1625

TD-er opened this issue Aug 3, 2018 · 98 comments
Labels
Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired Status: In Progress Is being worked on by the development team Type: Bug Considered a bug

Comments

@TD-er
Copy link
Member

TD-er commented Aug 3, 2018

As reported here: #1574 (comment)

Suggested fix:
Make the callback function just register handling of the MQTT as a new job in the scheduler.
This will result in all calls made during processing of the MQTT message being called from the loop() and thus be allowed to call yield() or delay()

It's due to a panic reboot. Happened after a MQTT publish to my Openhab home automation system. Here's the log:

11514348 : NEXTION075 : code: |s,i16,sOff
11514353 : EVENT: NEXTION#idx=16.00
11514394 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,16
11514408 : Command: publish

Panic C:\Users\Tom\AppData\Local\Arduino15\packages\esp8266\
hardware\esp8266\2.4.1\cores\esp8266\core_esp8266_main.cpp:99 __yield

ctx: cont 
sp: 3fff3e00 end: 3fff4750 offset: 01b0

>>>stack>>>
3fff3fb0:  00000001 00000001 3fff40f4 3fff2419  
3fff3fc0:  00000001 00000001 3fff899c 4025fab8  
3fff3fd0:  00000001 00000001 3fff899c 40241651  
3fff3fe0:  00000c01 00000000 00000000 40100200  
3fff3ff0:  00000000 00000010 00000000 00000000  
3fff4000:  00000000 00000000 3fff8b14 0000000f  
3fff4010:  00000000 3fff8b2c 0000000f 00000000  
3fff4020:  3fff8b44 0000000f 00000000 3fff8b5c  
3fff4030:  0000000f 00000000 3fff8b74 0000000f  
3fff4040:  00000000 00000000 6c627550 00687369  
3fff4050:  00000046 3fff4110 0000000d 402173d7  
3fff4060:  00000600 4c5f1800 00000000 4e2f7901  
3fff4070:  00000000 00000010 00000000 00000000  
3fff4080:  00000000 00000000 00000000 00000000  
3fff4090:  00000000 00000000 50003600 4c5f5a45  
3fff40a0:  646e7561 4e2f7972 49545845 692f4e4f  
3fff40b0:  00007864 00000000 00000000 00000000  
3fff40c0:  00000000 00000000 0000032d 4025ea7b  
3fff40d0:  312c7864 00000036 3fff41dc 4025eaad  
3fff40e0:  3fff41dc 00000000 3fff26ae 00000006  
3fff40f0:  00000001 3fff8afc 0000000f 00000000  
3fff4100:  00000001 00000001 00000001 40241dd5  
3fff4110:  00000c00 00000000 00000000 3fff5300  
3fff4120:  00000000 00000010 00000000 00000000  
3fff4130:  00000000 00000000 3fff8a34 0000000f  
3fff4140:  00000000 3fff8a4c 0000000f 00000000  
3fff4150:  3fff8a7c 0000000f 00000000 3fff8a94  
3fff4160:  0000000f 00000000 3fff8aac 0000000f  
3fff4170:  00000000 00000000 4e5b2c78 49545845  
3fff4180:  69234e4f 0d5d7864 6520200a 6669646e  
3fff4190:  20200a0d 5b206669 5458454e 234e4f49  
3fff41a0:  5d786469 30353d3e 2f202030 6f54202f  
3fff41b0:  20686375 6e657645 40107b10 00000000  
3fff41c0:  000003e8 00000000 3fff87fc 3fff883c  
3fff41d0:  00000000 00000000 00000000 3fff8ac4  
3fff41e0:  0000002f 00000025 3fff89d4 0000002f  
3fff41f0:  00000025 3fff899c 0000002f 00000025  
3fff4200:  3fff87e4 0000000f 00000000 3fff8964  
3fff4210:  0000002f 00000025 3fff891c 0000003f  
3fff4220:  00000025 3fff8a0c 0000001f 0000001b  
3fff4230:  00000001 00000001 00000001 00000001  
3fff4240:  3fff42d0 00000010 3fff4280 00000040  
3fff4250:  00000000 3fff2419 3fff428c 4025ebb0  
3fff4260:  3fff0031 3fff2419 00000001 00afb1f1  
3fff4270:  3fff42d0 3fff2419 00000001 40241f93  
3fff4280:  00000000 00000000 00000000 3fff87cc  
3fff4290:  0000000f 0000000a 3fff879c 0000000f  
3fff42a0:  0000000a 00000030 3fff42d0 4025f058  
3fff42b0:  3fff259e 00000000 00000000 00000005  
3fff42c0:  00000010 00000004 00000000 40242b55  
3fff42d0:  3fff8774 0000001f 00000011 4010020c  
3fff42e0:  3fff23c4 3fff0e58 3fff4310 40100690  
3fff42f0:  00000000 00000000 00000000 00000011  
3fff4300:  3fff1b68 3fff26a8 3fff4610 40244435  
3fff4310:  00000000 00000000 00000000 00000011  
3fff4320:  3fff1b68 3fff0e58 3fff4610 40264db4  
3fff4330:  ae4f0759 3fff0e58 3fff4560 40262844  
3fff4340:  3fff0030 3fff0e58 3fff45a8 00000011  
3fff4350:  3fff1b68 3fff0e58 3fff4610 4024a0c7  
3fff4360:  692c737c 732c3631 0066664f 0083126e  
3fff4370:  3fff5ccc 3fff4670 3fff4478 00000030  
3fff4380:  4021515a 3fff4670 3fff64b4 3fff645c  
3fff4390:  0000072f 0000002f 00000000 00000089  
3fff43a0:  00000000 00000000 0000001f 40104ec9  
3fff43b0:  4000050c 3fff1b00 3fff4470 00000030  
3fff43c0:  40000f83 00000030 00000000 ffffffff  
3fff43d0:  40285942 00000023 3fffbff8 00000e71  
3fff43e0:  000006b0 00000e70 00000001 3fff8a78  
3fff43f0:  000000c2 3fff64dc 00000001 3fff4b5c  
3fff4400:  3fff3640 00000000 00007388 00000030  
3fff4410:  40103e13 3ffef740 50800000 4000050c  
3fff4420:  3fffc278 40101f8c 3fffc200 00000022  
3fff4430:  3fff4440 00000023 3fff84fc 40214afa  
3fff4440:  ffffe660 00000020 00000000 4010053d  
3fff4450:  401004ba 00000023 00000003 3fff4490  
3fff4460:  3fff1b00 3fff4468 3fff4670 40214bbc  
3fff4470:  3fff4470 3fff4470 3fff4478 3fff4478  
3fff4480:  3fff4480 3fff4480 3fff4488 3fff4488  
3fff4490:  3fff4490 3fff4490 3fff4498 3fff4498  
3fff44a0:  3fff44a0 3fff44a0 3fff44a8 3fff44a8  
3fff44b0:  3fff44b0 3fff44b0 3fff44b8 401071ee  
3fff44c0:  3fff44c0 3fff44c0 3fff4610 40210069  
3fff44d0:  3fff44d0 3fff44d0 3fff44d8 40244d60  
3fff44e0:  3fff44e0 3fff44e0 3fff44e8 3fff44e8  
3fff44f0:  3fff44f0 3fff44f0 3fff44f8 3fff44f8  
3fff4500:  00000000 000006c0 000006c0 4010020c  
3fff4510:  3fff3640 000006c0 000006c0 00000046  
3fff4520:  3fff3648 00000000 3fff16d0 40285968  
3fff4530:  00000014 000035d8 000006bb 40207b90  
3fff4540:  00000010 00000010 3fff178c 4028f34f  
3fff4550:  4025d1f4 3fff4610 3fff178c 4025ebb0  
3fff4560:  00000000 00000000 00000000 4020b47c  
3fff4570:  3fff3640 000006bb 000006bb 4010020c  
3fff4580:  00000004 3fff4610 3fff45b0 40100690  
3fff4590:  3fff875c 0000000f 0000000b 3fff8744  
3fff45a0:  0000000f 00000000 3fff8r‚)ŠLþ�
ôŒ
JE0¸
†´ˆˆ9!9¯„ù„!��ÁÎ�ì)ÿ18Áù526 : INIT : Free RAM:21328
527 : INIT : I2C
527 : INIT : SPI not enabled
541 : NEXTION075 : serial pin config RX:13, TX:15, Plugin Enabled
543 : NEXTION075 : Using software serial
599 : INFO : Plugins: 71 [Normal] [Testing] (ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3)
696 : EVENT: System#Wake
745 : WIFI : Set WiFi to STA
@TD-er TD-er added Type: Bug Considered a bug Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired labels Aug 3, 2018
@TD-er TD-er added the Status: In Progress Is being worked on by the development team label Aug 3, 2018
@TD-er
Copy link
Member Author

TD-er commented Aug 3, 2018

Maybe you could test this PR: #1627

I just tested it will compile and did not crash my ESP32 node when pressing the test notification.
But since it is already close to 2am here, I see an opportunity for those still in day-light side of the globe ;)

@TD-er TD-er added Status: Fixed Commit has been made, ready for testing and removed Status: In Progress Is being worked on by the development team labels Aug 3, 2018
@thomastech
Copy link
Contributor

I loaded PR 1627 into two nodeMCU boards. Identical configurations, except only one of them will do MQTT publish.

I'll run them overnight and check their uptime in the morning. If uptime is wrong then we will know that there was at least one reboot.

  • Thomas

@thomastech
Copy link
Contributor

13 hours later ...
The NodeMCU board that didn't do MQTT is still OK. No resets (uptime correct).

The other board was reset several times. Resets occurred about every 2-4 hours. I'm now running with serial log, Debug Dev. Maybe there will something in the expanded log that helps ID this.

  • Thomas

@thomastech
Copy link
Contributor

It's important to note that although both boards have identical software configurations, the one that is working well is missing the the MPU6050 sensor I use to detect machine movement. So later today I will disable the MPU6050 and see if that stops the resets.

  • Thomas

@thomastech
Copy link
Contributor

I captured two panics in the log. Maybe there's a bread crumb in there that helps to ID what is going on. See entries 11575248 and 899358.
panic2.log

@TD-er
Copy link
Member Author

TD-er commented Aug 4, 2018

Is there a possibility the broker is busy or for some other reason not answering?

@thomastech
Copy link
Contributor

thomastech commented Aug 4, 2018

That is a good question. Openhab is running correctly. It's installed on a RaspberryPi Model 3+.

I studied the log messages and they seem to suggest that the panic occurred while in misc.ino's rulesProcessingFile(). That's because the last log entry before the crash is "ACT :" which is in rulesProcessingFile(). This log output is printed immediately before the PluginCall().

Also, whenever the openhab controller _C005.ino does a MQTTpublish() it prints a "MQTT :" log entry. But we don't see that in the logs. So I don't know if the publish ever occurred.

Not sure what these debug crumbs are telling us. Do you have any ideas?

For completeness, here's the logs leading up to the panic crash:

11575136 : EVENT: NEXTION#idx=26.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 110: nextion#idx do
11575183 : Read settings: TaskSettings index: 1
11575189 : Read settings: TaskSettings index: 2
11575192 : Read settings: TaskSettings index: 4
11575196 : DEBUG DEV: Parsed String='  if 26'
11575199 : Read settings: TaskSettings index: 1
11575208 : Read settings: TaskSettings index: 2
11575211 : Read settings: TaskSettings index: 4
11575213 : DEBUG DEV: Parsed String='  if 26>=10 and 26'
RuleDebug: 111: if 26>=10 and 26<=30
11575220 : [if 26>=10 and 26<=30]=true
11575225 : Read settings: TaskSettings index: 1
11575230 : Read settings: TaskSettings index: 2
11575234 : Read settings: TaskSettings index: 4
11575238 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
11575248 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26

Panic C:\Users\Tom\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.1\cores\esp8266\core_esp8266_main.cpp:99 __yield

ctx: cont 
sp: 3fff3e70 end: 3fff47c0 offset: 01b0

>>>stack>>>

But it should have gone more like this:

190496 : EVENT: NEXTION#idx=26.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 110: nextion#idx do
190542 : Read settings: TaskSettings index: 1
190548 : Read settings: TaskSettings index: 2
190551 : Read settings: TaskSettings index: 4
190555 : DEBUG DEV: Parsed String='  if 26'
190558 : Read settings: TaskSettings index: 1
190563 : Read settings: TaskSettings index: 2
190567 : Read settings: TaskSettings index: 4
190571 : DEBUG DEV: Parsed String='  if 26>=10 and 26'
RuleDebug: 111: if 26>=10 and 26<=30
190578 : [if 26>=10 and 26<=30]=true
190585 : Read settings: TaskSettings index: 1
190588 : Read settings: TaskSettings index: 2
190591 : Read settings: TaskSettings index: 4
190595 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
190604 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26
190622 : Command: publish
RuleDebug: 111: endif
190628 : Read settings: TaskSettings index: 1
190631 : Read settings: TaskSettings index: 2
190633 : Read settings: TaskSettings index: 4
190636 : DEBUG DEV: Parsed String='  if 26'
RuleDebug: 111: if 26>=500
190638 : [if 26>=500]=false
190641 : Read settings: TaskSettings index: 1
190645 : Read settings: TaskSettings index: 2
190650 : Read settings: TaskSettings index: 4
190653 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
190664 : Read settings: TaskSettings index: 1
190668 : Read settings: TaskSettings index: 2
190672 : Read settings: TaskSettings index: 4
190676 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/value,0'
RuleDebug: 111: publish /espez_laundry/nextion/value,0
RuleDebug: 111: endif
RuleDebug: 000: endon
RuleDebug: 100: nextion#idx=98 do
RuleDebug: 101: nextion,page7.t_wifi_ssid.txt="ssid: %ssid%"
RuleDebug: 101: nextion,page7.t_ip.txt="ip: %ip%"
RuleDebug: 101: nextion,page7.t_signal.txt="rssi: [rssi#signal]dbm"
RuleDebug: 101: nextion,page7.t_date.txt="date %sysmonth%:%sysday%:%sysyears%"
RuleDebug: 101: nextion,page7.t_time.txt="time %syshour%:%sysmin%:%syssec% "
RuleDebug: 101: nextion,page7.t_uptime.txt="uptime [runtime#days]days"
190729 : EVENT: NEXTION#idx=26.00 Processing time:233 milliSeconds
190735 : EVENT: NEXTION#value=0.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: nextion#idx do
RuleDebug: 101: if [nextion#idx]>=10 and [nextion#idx]<=30
RuleDebug: 101: publish /%sysname%/nextion/idx,[nextion#idx]
RuleDebug: 101: endif
RuleDebug: 101: if [nextion#idx]>=500
RuleDebug: 101: publish /%sysname%/nextion/idx,[nextion#idx]
RuleDebug: 101: publish /%sysname%/nextion/value,[nextion#value]
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: nextion#idx=98 do
RuleDebug: 101: nextion,page7.t_wifi_ssid.txt="ssid: %ssid%"
RuleDebug: 101: nextion,page7.t_ip.txt="ip: %ip%"
RuleDebug: 101: nextion,page7.t_signal.txt="rssi: [rssi#signal]dbm"
RuleDebug: 101: nextion,page7.t_date.txt="date %sysmonth%:%sysday%:%sysyears%"
RuleDebug: 101: nextion,page7.t_time.txt="time %syshour%:%sysmin%:%syssec% "
RuleDebug: 101: nextion,page7.t_uptime.txt="uptime [runtime#days]days"
190859 : EVENT: NEXTION#value=0.00 Processing time:124 milliSeconds
190865 : Read settings: ControllerSettings index: 0
190927 : MQTT : /ESPEZ_Laundry/NEXTION/idx 26
190928 : MQTT : /ESPEZ_Laundry/NEXTION/value 0
190930 : NEXTION075 : Pipe Command Sent: |s,i26,sOff26.00
190950 : Read settings: TaskSettings index: 1
190954 : SW   : State 0.00
190956 : EVENT: WASHER#ac=0.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Aug 4, 2018

How do you send the publish commands?
Are they commands from a rule?

@thomastech
Copy link
Contributor

thomastech commented Aug 4, 2018

Yes the publish actions are in the ESPEasy rule file. To further debug this I've commented them out and restarted my tests. If the reboot appears then the problem is not caused by MQTT publish.

Here's the rule excerpt:

on NEXTION#idx do
  if [NEXTION#idx]>=10 and [NEXTION#idx]<=30
      Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
  endif
  if [NEXTION#idx]>=500  // Touch Events
      Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
      Publish /%sysname%/NEXTION/value,[NEXTION#value]
  endif
endon

I don't use Touch events (idx>=500), so only the first if statement applies to this issue.

  • Thomas

@thomastech
Copy link
Contributor

thomastech commented Aug 5, 2018

It's been running trouble free for four hours after removing the publish statements from ESPEasy's rule file. So I am back to testing with MQTT publish.

I looked at the core and found the panic function. There's no Serial.flush before the exception, so my theory is that we lose the last buffered serial log messages at the panic reboot. So there is a chance the logs are incomplete before the reboot. Just a theory.

I edited _C005.ino and added a couple more addLog statements with Serial.flush on both sides of the MQTTpublish(). So hopefully that helps push out the debug messages before a panic reboot.

I've restore the rule file's publish statements and I am now restarting the testing.

I shut down my home control system (openhab) to see what would happen to the unanswered MQTT. EasyESP doesn't care, no problems appeared.

Lastly, I've increased the Advanced->Tools->Controller Message Interval from 100mS to 250mS.

I'll collect new logs for a few hours. Maybe the new _C005.ino addLog messages will provide more debug insight.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Aug 5, 2018

This "Controller Message Interval" variable is being used for multiple things and one of them is actually a blocking wait for that period.
It does freeze the system for that amount of time.
It is also used in places where another variable would make more sense instead of re-use.

So that's one thing I would like to change. It looks like it is meant to be used to give the MQTT broker some time to process requests, but I guess using a buffer at the controller side would make more sense.

This yield panic crash occurs when a yield or delay (which calls yield) is being used in a function that's being called from somewhere else.
Normally all functions are being called from either setup() or loop(). But callback functions are called from the core, so their scope does not include access to the yield function to run background tasks.

Yesterday I split the rules processing function to make it more readable.
See: TD-er@be7724d
Today I start looking at ExecuteCommand to see if there is something that can be handled asynchronous via the scheduler to make it being called from the loop() again.

@thomastech
Copy link
Contributor

@TD-er: Thanks for the explanation. I was aware that you were working on the Controller Message Interval.

My test run last night was interesting. Only one reboot, whereas I would have expected several. Could have been blind luck, or something about my altered test conditions contributed to stability.

But the overnight test's panic reboot provided another clue. Here's the log excerpt:

13901389 : Before MQTT
13901393 : After MQTT
13901395 : MQTT : /ESPEZ_Laundry/RSSI/signal -79
13905306 : NEXTION075 : code: |s,i26,sOff
13905308 : Read settings: TaskSettings index: 4
13905312 : EVENT: NEXTION#idx=26.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 110: nextion#idx do
13905359 : Read settings: TaskSettings index: 1
13905368 : Read settings: TaskSettings index: 2
13905370 : Read settings: TaskSettings index: 4
13905372 : DEBUG DEV: Parsed String='  if 26'
13905375 : Read settings: TaskSettings index: 1
13905381 : Read settings: TaskSettings index: 2
13905385 : Read settings: TaskSettings index: 4
13905389 : DEBUG DEV: Parsed String='  if 26>=10 and 26'
RuleDebug: 111: if 26>=10 and 26<=30
13905396 : [if 26>=10 and 26<=30]=true
13905401 : Read settings: TaskSettings index: 1
13905406 : Read settings: TaskSettings index: 2
13905410 : Read settings: TaskSettings index: 4
13905414 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
13905424 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26
13905441 : Command: publish

Panic C:\Users\Tom\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.1\cores\esp8266\core_esp8266_main.cpp:99 __yield

ctx: cont 
sp: 3fff3e70 end: 3fff47c0 offset: 01b0

>>>stack>>>

The "Before MQTT" and "After MQTT" are addLog message wrappers I placed before and after the MQTTpublish() in _C005.ini. I only mention this because you are probably wondering what they are.

The interesting part is the Command: publish log message immediately before the panic reboot. It is log message that occurs inside misc.ino's ExecuteCommand().

And it's even more interesting is that you are starting to look at ExecuteCommand. So hopefully this rabbit hole journey is taking us in the right direction.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Aug 5, 2018

This afternoon I added some commit just to make the commands return a String stating their return status.
That has nothing really to do with this issue, but it gave me time to look at all the commands to see if there is something that cannot be done asynchronous.
So I will now try to add an entry to schedule these command executions as a scheduled job.
This will change the scheduled job a bit to keep the extra text line and the events in sync, but that's just a minor change.

Hopefully I find enough time this evening to implement it so you can test it.
This time zone difference between us is nice to get some tests done :)

@thomastech
Copy link
Contributor

Let me know when your changes are complete and need me to try them out.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Aug 6, 2018

I did change some this morning, but there are still some issues with that, which I will fix.... now ;)

@TD-er
Copy link
Member Author

TD-er commented Aug 6, 2018

I just pushed a fix to the existing pull request.
Could you (stress) test it?

In short:

  • All commands will be executed from the scheduler, except the commands sent from the Tools webpage.
  • Commands run from the web page will reply with a more meaningful reply.
  • Rules should be executed faster, since some commands like "publish" will now be started from the scheduler.
  • Commands will now be queued and thus not lost. (this queue may take quite some memory per item)

Still to do:

  • Maybe introduce some artificial delay between commands when sending a lot at once.

@thomastech
Copy link
Contributor

I have it running now. Everything is working fine. I'll keep an eye out for reboots or other odd behavior and report back tomorrow (or sooner if something goes wrong tonight).

FWIW, the updated test branch doesn't seem to have the latest src files. For example, it did not have my recent Nextion.ino. I didn't check any other files for "freshness."

Thank you for your efforts to make the code more stable. It is appreciated.

  • Thomas

@thomastech
Copy link
Contributor

After running for about one hour it rebooted. I'll run it again with serial log enabled and capture the exception.

  • Thomas

@yondom
Copy link

yondom commented Aug 7, 2018 via email

@TD-er
Copy link
Member Author

TD-er commented Nov 28, 2018

Is this one now fixed?
Please test with latest build, which will be made tonight. (20181128)

@thomastech
Copy link
Contributor

No problem, I'll test the 20181128 pre-compiled build when it becomes available. I've tested most of the previous builds and all experienced one or more reboots each day. I'm currently running 20181124 on two nodeMCU's and one has had 8 resets and the other 12 resets so far.

The web GUI reports the random manual resets caused by software or hardware watchdog (varies). But the GUI's reset reason has never specifically stated a panic yield reset. So either this specific reset reason is not being reported in the GUI or the panic yield issue is fixed.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Nov 28, 2018

It might be the panic yield is back and if so, please report that also.
Then I know what to add ;) (very simple fix if it is needed)

@thomastech
Copy link
Contributor

I can do that if you add reset cause "panic yield" to the web GUI System Info page. That is to say, I don't think it specifically reports it with the current code.

I don't want to use serial log for monitoring panic yield. That is because serial log usage introduces more resets and skews the testing.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Nov 28, 2018

Good point.
A crash as result of yield panic is hard to recognize as reboot cause.
At least it differs from the Watchdog resets, but I guess it is just an "exception" in the reboot description.

@TD-er
Copy link
Member Author

TD-er commented Dec 2, 2018

Upcoming build will have a lot more free stack, so let's hope things work out even better now :)

@thomastech
Copy link
Contributor

Looking forward to checking it out.

  • Thomas

@thomastech
Copy link
Contributor

thomastech commented Dec 11, 2018

For the last few days I've been testing ESP_Easy_mega-20181207_test_ESP8266_4096.bin on two NodeMCU's. I still experience reboots.

I'm using the OpenHab MQTT controller. Recently I reduced the Controller Message interval (Advanced Settings) to 75mS (was 100mS) and so far I've seen less reboots. If this truly did cause an improvement then it is suggesting that the MQTT controller is involved in my reboot issue. Perhaps there's some code blocking during MQTT communication that is hanging the core too long.

I've setup my rules so that I get an email when a reboot occurs. I would like to have the email include the reboot reason. I've looked at the system variable list and did not find one for the reboot cause. Is there a hidden Easter egg variable that provides the reboot error code? If not, no worries.

  • Thomas

@thomastech
Copy link
Contributor

I have continued to test new releases. Still experiencing random reboots every day. I no longer log the reboot messages, so I cannot say if they are related specifically to the "yield panic" exception.

I recommend closing this old issue. There are now several open tickets on ESPEasy's random reboot problems. No need for multiple open cases on it.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Mar 11, 2019

It looks like you're also following other issues, but I would suggest you also test some of the newly added WiFi related experimental settings (Tools => Advanced, at the bottom)

@thomastech
Copy link
Contributor

I previously tried the Tools=Advanced->Force WiFi B/G and did not observe any improvement to the reboot issues. Any other settings you think I should try?

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Mar 11, 2019

The Gratuitous ARP.
It does send ARP packets to announce its own MAC/ARP combination to make sure the switch/AP does know what MAC can be reached via what port on the switch.
One of the (many) issues is that the ESP is sometimes missing the ARP broadcasts and a typical time for a switch to start forgetting MAC addresses is 30 seconds.

By sending those "answers to questions nobody asked for" you make it less likely the ESP cannot be reached anymore.

@thomastech
Copy link
Contributor

Gratuitous ARP is currently enabled (default setting). Do you mean you want me to disable it?

BTW, I read the comments in another issue ticket about the possibility of voltage drop during WiFi startup. So a few minutes ago I reworked one of my NodeMCU's as follows:
Added 10uF tantalum to ESP8266MODH module's Vcc Pin-8, short leads.
Added heavy jumper wire from 3.3V linear VReg output pin to Vcc Pin-8 on ESP8266MODH module.
Added 10K Pullup on RST (recommended by a random ESP8266 user).

I will be surprised if these mods help, but it's only 10 minutes work. So worth checking out.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Mar 11, 2019

Nope, leave it on.
And about the voltage drop, it is indeed quite a strong current peak the ESP uses during RF calibration at startup.
The peak can be as high as 500 mA (very short peaks) and the RF calibration is relating its parameters to the seen "3v3" during this calibration. So a dip in voltage can result in incorrect RF calibration.

I've also seen some strange issues with missing packets as soon as you enable the "eco mode" I added recently. This eco mode is nothing more than calling delay() during the idle time of the scheduler.

@thomastech
Copy link
Contributor

Ok, thanks for the advice. I'll continue to monitor the problem. If any miracles occur I will let you know.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Mar 11, 2019

Miracles probably deserve a new issue, don't you think?

@thomastech
Copy link
Contributor

thomastech commented Mar 11, 2019

Miracles on the reboot issue would deserve a celebration with beverages and party hats.

Unfortunately my NodeMCU with the latest code and hardware mods rebooted a few minutes ago. So the miracle didn't happen yet.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Mar 12, 2019

Hmm too bad :(
Well at least that's one other thing we can move down on the list of most likely causes.

@TD-er
Copy link
Member Author

TD-er commented Oct 10, 2019

Is this still an issue?
We also now have the ADC plugin no longer reading the internal ADC value while WiFi connection is being setup.

@thomastech
Copy link
Contributor

Reboots have been an ongoing issue. But after updating one of my NodeMCU's with ESP_Easy_mega-20191003_test_core_260_sdk222_alpha_ESP8266_4M1M.b last week things are looking much better. No reboots on this test subject in 6 days! A new record for this device.

I'd suggest closing this issue since it's over a year old.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Oct 10, 2019

Great and maybe you can already order the "beverages and party hats." ;)

@TD-er TD-er closed this as completed Oct 10, 2019
@thomastech
Copy link
Contributor

thomastech commented Oct 11, 2019

Great and maybe you can already order the "beverages and party hats." ;)

I was about to do that, but I discovered that the new release has broken the Nextion plugin (sends NULLS instead of idx data). So I'm back to the older ESP_Easy_mega-20190830_test_core_260_sdk3_alpha_ESP8266_4M.bin.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Oct 11, 2019

Can you make an issue for that? Then I can have a look. (a bit more descriptive than this :) )

@thomastech
Copy link
Contributor

thomastech commented Oct 11, 2019

Creating an issue ticket will need to wait for some spare time (the required list of info takes a lot of time to prepare).

But here's a head start. I found that the problem is in the rules parsing. I have rule that publishes the Nextion's IDX data to my Openhab2 broker. When it sends this:
Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
It sends it twice, first with idx NULL data, then ~60mS later is sends it again with valid idx data. Just to be clear, this rule should only publish the data once, not twice.

My OpenHab log shows this:
2019-10-11 08:17:23.396 [WARN ] [b.core.events.EventPublisherDelegate] - given new state is NULL, couldn't post update for 'NextionIDX'
2019-10-11 08:17:23.462 [vent.ItemStateChangedEvent] - NextionIDX changed from 17 to 26

The problem is averted if I do this:
let,10,[NEXTION#idx]
Publish /%sysname%/NEXTION/idx,[VAR#10]

The previous release (ESP_Easy_mega-20190928_test_core_260_sdk222_alpha_ESP8266_4M1M) does not have this problem. The change log on the new release mentions several changes to the rules parser. Seems to me those changes have broken something.

  • Thomas

@TD-er
Copy link
Member Author

TD-er commented Oct 11, 2019

I created a new issue about it.
See #2657
As soon as you may have new info. please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired Status: In Progress Is being worked on by the development team Type: Bug Considered a bug
Projects
None yet
Development

No branches or pull requests

4 participants