Version >=2.0.1 is crashing on ESP8266 #174

Closed
Jendem opened this issue Dec 26, 2021 · 71 comments

@Jendem

Jendem commented Dec 26, 2021

Crash after flashing using bin file and compiling latest source.

Relevant firmware information:

  • Version: 2.0.1 and 2.0.2. Version 2.0.0 is ok
  • MQTT: yes
  • AMS reader: Kamstrup module by E.O. February 2020

Same log for both versions:

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 3460, room 16
tail 4
chksum 0xcc
load 0x3fff20b8, len 40, room 4
tail 4
chksum 0xc9
csum 0xc9
v000a86b0
~ld
Sensors: 0

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Exception (28):
epc1=0x4020a268 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3ffffa20 end: 3fffffc0 offset: 0190
3ffffbb0:  3fff26c8 00000001 3fff2aa0 3fff2d1c
3ffffbc0:  3fff26c8 3fff29d8 3fff2918 4020b6cb
3ffffbd0:  effefeef effefeef effefeef effefeef
3ffffbe0:  effefeef effefeef effefeef effefeef
3ffffbf0:  effefeef effefeef effefeef effefeef
3ffffc00:  ef00feef effefeef effefeef effefeef
3ffffc10:  effefeef effefeef effefeef effefeef
3ffffc20:  effefeef effefeef effefeef effefeef
3ffffc30:  effefeef effefeef effefeef effefeef
3ffffc40:  7365feef effe0070 effefeef effefeef
3ffffc50:  effefeef effefeef effefeef effefeef
3ffffc60:  effefeef effefeef effefeef effefeef
3ffffc70:  effefeef effefeef effefeef effefeef
3ffffc80:  effefeef effefeef effefeef effefeef
3ffffc90:  effefeef effefeef effefeef effefeef
3ffffca0:  effefeef effefeef effefeef effefeef
3ffffcb0:  effefeef effefeef effefeef effefeef
3ffffcc0:  6675feef 376f7277 73653434 effe0070
3ffffcd0:  effefeef effefeef effefeef effefeef
3ffffce0:  effefeef effefeef effefeef effefeef
3ffffcf0:  effefeef effefeef effefeef effefeef
3ffffd00:  effefeef effefeef effefeef effefeef
3ffffd10:  effefeef effefeef effefeef effefeef
3ffffd20:  effefeef effefeef effefeef effefeef
3ffffd30:  effefeef effefeef effefeef effefeef
3ffffd40:  effefeef effefeef effefeef effefeef
3ffffd50:  effefeef effefeef effefeef effefeef
3ffffd60:  effefeef effefeef effefeef effefeef
3ffffd70:  effefeef effefeef effefeef effefeef
3ffffd80:  effefeef effefeef effefeef effefeef
3ffffd90:  effefeef effefeef effefeef effefeef
3ffffda0:  effefeef effefeef effefeef effefeef
3ffffdb0:  effefeef effefeef effefeef effefeef
3ffffdc0:  0000feef 7070612f 6163696c 6e6f6974
3ffffdd0:  2e32762d 2e322e30 0000736a 40234690
3ffffde0:  00000000 3fff403c 00000000 40211dd0
3ffffdf0:  40233840 40217584 00000000 3fff2728
3ffffe00:  00000000 3fff3e8c 3fff15d8 3fff2728
3ffffe10:  4021a0f4 00000000 3fff2728 3fffff4c
3ffffe20:  3ffeb6a2 3ffe9429 00000020 401012b4
3ffffe30:  3ffffe60 00000001 00000040 3fffff4c
3ffffe40:  3fffff60 3fff29d8 3fff2910 4020be48
3ffffe50:  00445453 01000000 0000030a 0000003c
3ffffe60:  01680101 6f700168 6e2e6c6f 6f2e7074
3ffffe70:  00006772 00000000 00000000 00000000
3ffffe80:  00000000 00000000 00000000 00000000
3ffffe90:  00000000 00000000 00000000 00000000
3ffffea0:  00000000 feef0000 feefeffe feefeffe
3ffffeb0:  feefeffe feefeffe feefeffe feefeffe
3ffffec0:  feefeffe feefeffe feefeffe feefeffe
3ffffed0:  feefeffe feefeffe feefeffe feefeffe
3ffffee0:  00000000 00000000 00000000 00000000
3ffffef0:  00000000 00000000 00000000 00000000
3fffff00:  00000000 00000000 00000000 00000000
3fffff10:  00000000 00000000 00000000 000003e8
3fffff20:  00445453 01000000 0000030a 0000003c
3fffff30:  feefeffe feefeffe feefeffe feefeffe
3fffff40:  feefeffe feefeffe feefeffe 00545344
3fffff50:  01000000 00000203 00000078 feefeffe
3fffff60:  01000000 00000203 feefeffe feefeffe
3fffff70:  01000000 0000030a feefeffe fe050000
3fffff80:  000001f7 feefeffe 3fff3d44 00000000
3fffff90:  feefeffe feefeffe feefeffe 3fff2d1c
3fffffa0:  3fffdad0 00000000 3fff2d08 40230d98
3fffffb0:  feefeffe feefeffe 3ffe8654 40100781
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------
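
A side note on reading a dump like this: exception 28 on the ESP8266 is LoadProhibited, and excvaddr=0x00000000 suggests a read through a null pointer. The ESP Exception Decoder needs the matching ELF file to turn epc1 into a source line, but the ASCII strings sitting on the stack can be recovered without it. Below is a minimal Python sketch, assuming the dump uses the `address:  word word word word` layout shown above and that words are little-endian; the function name `stack_strings` is made up for illustration.

```python
import re
import struct

# One dump line: an 8-digit hex address, a colon, then 8-digit hex words.
LINE_RE = re.compile(r"[0-9a-f]{8}:\s+((?:[0-9a-f]{8}\s*)+)", re.IGNORECASE)
# Runs of printable ASCII bytes.
PRINTABLE_RE = re.compile(rb"[\x20-\x7e]+")


def stack_strings(dump, min_len=4):
    """Return printable ASCII runs (at least min_len chars) found in a stack dump."""
    raw = bytearray()
    for line in dump.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match is None:
            continue  # skip headers such as "ctx: cont" or ">>>stack>>>"
        for word in match.group(1).split():
            # Each 32-bit word is stored little-endian in memory.
            raw += struct.pack("<I", int(word, 16))
    runs = PRINTABLE_RE.findall(bytes(raw))
    return [run.decode("ascii") for run in runs if len(run) >= min_len]


if __name__ == "__main__":
    sample = (
        "3ffffdc0:  0000feef 7070612f 6163696c 6e6f6974\n"
        "3ffffdd0:  2e32762d 2e322e30 0000736a 40234690\n"
    )
    print(stack_strings(sample))  # prints ['/application-v2.0.2.js']
```

Run on the two lines at 3ffffdc0 and 3ffffdd0, this recovers the string `/application-v2.0.2.js`. That only shows what was on the stack when the crash happened; it does not by itself identify the faulting code.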
@ArnieO
Contributor

ArnieO commented Dec 27, 2021

To be sure I understand: You can go back to v2.0.0 and it does not crash?
I am running an old Kamstrup module (like yours), and do not have this issue.

@Jendem
Author

Jendem commented Dec 27, 2021

Yes, v2.0.0 is ok, the newer ones crash at startup. Maybe related: I cannot set a static IP in v2.0.0. When it restarts, it goes into AP mode and all settings are lost. (I haven't looked into this with serial debugging.)

@NicolaiPetri

NicolaiPetri commented Dec 27, 2021

I successfully upgraded to 2.0.1, but now I have lost access to the device, and a reset doesn't seem to get it back online. So I guess I might have the same crashing issue. I did have a static IP configured. It doesn't look like it connects to my WiFi at all, and it doesn't look like it is in AP mode.

@ArnieO
Contributor

ArnieO commented Dec 27, 2021

Strange... I use static IP, and have not seen any issue with the upgrade. (I skipped 2.0.1, went from 2.0.0 to 2.0.2 but cannot see how that should impact this).
@NicolaiPetri : Do you have what is needed to reflash it by cable (see user manual chapter 3)? I'm afraid that is the only option if the ESP is bricked.

@gskjold
Member

gskjold commented Dec 28, 2021

Unable to reproduce this. Considering how many Kamstrup users we have, I think this must be related to configuration. Erase flash, reflash the latest version, and configure one thing at a time to see when it breaks.

@Jendem
Author

Jendem commented Dec 28, 2021

Everything is working after erasing flash first!

python esptool.py --port "COM8" write_flash --erase-all 0x0 firmware.bin

@ArnieO
Contributor

ArnieO commented Dec 29, 2021

> Everything is working after erasing flash first!
>
> python esptool.py --port "COM8" write_flash --erase-all 0x0 firmware.bin

@Jendem

  • Thank you for the reminder that a flash erase could sometimes be necessary.
    However, I believe this is the proper way to do it (reference):
    esptool.py --chip ESP8266 --port <e.g. COM3> erase_flash
  • Did your problems start after OTA upgrade, or after flashing by wire?

@gskjold
Maybe the flash command examples in the Wiki should be updated by adding the Erasing Flash Before Write option?
It could make the reflashing process more robust.

@Jendem
Author

Jendem commented Dec 29, 2021

  • Yes, if you run the erase operation as a separate command yours is right. But you can combine it, as your reference Erasing Flash Before Write says.
    I added a script found here to inject --erase-all always.
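Import("env")  # PlatformIO/SCons: bring the build environment into scope
old_uploaderflags = env["UPLOADERFLAGS"]
# list.index() raises ValueError instead of returning -1, so test membership first
if "write_flash" in old_uploaderflags:
    index_write_flash = old_uploaderflags.index("write_flash")
    new_uploaderflags = old_uploaderflags[::]  # work on a copy
    # Put --erase-all directly after the write_flash subcommand
    new_uploaderflags.insert(index_write_flash + 1, "--erase-all")
    env.Replace(UPLOADERFLAGS=new_uploaderflags)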
env.VerboseAction("$UPLOADCMD", "Uploading `$SOURCE")

The full command from PlatformIO is
python esptool.py --before default_reset --after hard_reset --chip esp8266 --port "COM8" --baud 115200 write_flash --erase-all 0x0 firmware.bin

  • Never tried OTA, only by serial.
  • Started from version 1.4.1
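
The two approaches discussed above (a separate `erase_flash` followed by a write, versus a single `write_flash --erase-all`) can be sketched side by side. This is only an illustration in Python that builds the argument lists quoted in this thread; the function names are made up, the port and file name are placeholders, and it assumes `esptool.py` is on the PATH.

```python
import subprocess


def erase_then_write(port, firmware, chip="esp8266"):
    """Two separate esptool invocations: erase the whole flash, then write."""
    base = ["esptool.py", "--chip", chip, "--port", port]
    return [
        base + ["erase_flash"],
        base + ["write_flash", "0x0", firmware],
    ]


def erase_and_write(port, firmware, chip="esp8266"):
    """One invocation: write_flash with --erase-all erases before writing."""
    base = ["esptool.py", "--chip", chip, "--port", port]
    return [base + ["write_flash", "--erase-all", "0x0", firmware]]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands only; call run_all(...) to actually flash a device.
    for command in erase_and_write("COM8", "firmware.bin"):
        print(" ".join(command))
```

Either variant wipes the stored configuration, so the device has to be set up again afterwards.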

@ArnieO
Contributor

ArnieO commented Dec 30, 2021

> Never tried OTA, only by serial.

Thank you for that information. Most users will update OTA, so it was important to clarify that your problems were not linked to that.

@bardahlm

I think I have the same issue. My module crashes now and then. It works again for some time if I remove the module from the meter and wait some time before reinserting. Do I have to connect to the serial port to find out why it crashes?

I have a POW-K, using 2.0.2 from Github.

@ArnieO
Contributor

ArnieO commented Dec 30, 2021

> Do I have to connect to the serial port to find out why it crashes?

Maybe difficult to catch it while it happens, but you can activate telnet debugging in menu System/debugging:
image

Then open a command window on PC and use command telnet <IP address>

@bardahlm

My issue seems to be that it drops off the network, as there are hourly data from the period while it was offline. So maybe not a crash but some WiFi issue. Does the ESP have space for logging, or will that burn out the flash?

@ArnieO
Contributor

ArnieO commented Dec 30, 2021

> My issue seems to be that it drops off the network, as there are hourly data from the period while it was offline. So maybe not a crash but some WiFi issue. Does the ESP have space for logging, or will that burn out the flash?

You can see if it has crashed and restarted by the uptime counter. I can see now that I too have a restart issue:
image

It is not a big problem for me, but @gskjold will surely look into this.

The data points for the graphs are calculated from the whole-hour List 3 datagrams when the meter reports accumulated consumption (kWh's) and stored in flash memory. So a reboot will not cause it to lose graph data points unless it happens at that moment when the meter sends List 3.
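
For anyone logging the uptime counter externally, a restart shows up as the counter going backwards. A minimal Python sketch of that check follows, assuming the samples are `(wall_clock_seconds, uptime_seconds)` pairs in chronological order; the function name and the sample layout are assumptions, not something the firmware provides.

```python
def find_restarts(samples):
    """Return (time_seen, last_uptime) for every point where uptime went backwards.

    samples: list of (wall_clock_seconds, uptime_seconds), oldest first.
    """
    restarts = []
    for (_, previous_uptime), (seen_at, uptime) in zip(samples, samples[1:]):
        if uptime < previous_uptime:
            # The counter was reset between these two samples.
            restarts.append((seen_at, previous_uptime))
    return restarts


if __name__ == "__main__":
    log = [(0, 100), (10, 110), (20, 5), (30, 15)]
    print(find_restarts(log))  # prints [(20, 110)]
```

The second value in each result is the last uptime observed before the reset, which is handy for comparing how long each run lasted.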

@bardahlm

My unit was offline the entire night, there are big gaps in my graphs. When I disconnected and reconnected it, it came back online. As the hourly data was recorded one can assume that it was up and running in the period with missing data.

@Jendem
Author

Jendem commented Jan 1, 2022

Mine has also restarted now, I believe it was up for 4 days. Looks like a very short restart, cannot see any gap in my database. (Kamstrup 10 sec interval)

@Jendem
Author

Jendem commented Jan 4, 2022

And now it crashed again, at uptime = 313603 seconds. Logged data:

image

Edit: I'm running 8751b63

@gskjold
Member

gskjold commented Jan 5, 2022

Very interesting. Are you all on Kamstrup?

@ArnieO
Contributor

ArnieO commented Jan 5, 2022

Good observation!
Yes, all (@Jendem, @bardahlm and myself) that have reported the issue here so far are on Kamstrup, using some version of Pow-K.

In addition to upgrading, I recently moved from one Kamstrup meter to another, in parallel with upgrading to v2.x.x. So I cannot say whether it is the upgrade or the move to a different Kamstrup meter that is the reason for this. I never had this issue at my previous location (with earlier firmware versions). So there is a possibility that this could be linked to some issue on individual meters (like Vout dropping out for a short period), causing a restart.

If this is the case, it should be visible on the Vcc reading just before the restart, as the supercap in Pow-K will hold the voltage up for a while (but dropping) even if Vout from the meter has dropped to zero. However, the above logging by @Jendem confirms that Vcc is stable during the restart - so this hypothesis seems incorrect.

I really don't see any other Pow-K HW related phenomena than loss of input voltage that could explain a reboot.

Are there any users on Aidon or Kaifa that have seen this?

Ideas on where to look are welcome!

@gskjold
Member

gskjold commented Jan 5, 2022

Could be newly added data parser in v2.x series firmware. Will have a look when I have time.

@ThomasEdvardsen

I am having severe rebooting issues on Kaifa, running AMS reader 220103.7.

Would like to downgrade to v. 2.0.0, to find if it stabilizes on that version. Do I have to completely erase the entire flash chip and reconfigure to avoid problems with the existing config files?

Screenshot from 2022-01-05 10-59-52

@ThomasEdvardsen

Reflashed with the same version as before (220103.7), but with complete erasing of chip. Configured with the same values, and awaiting uptime logging to see if it helps.

@gskjold
Member

gskjold commented Jan 5, 2022

If it doesn't work, try 220105.2:
esp32.zip
esp8266.zip

@gskjold self-assigned this Jan 5, 2022
@gskjold added the bug (Something isn't working) label Jan 5, 2022
@gskjold added this to the v2.0.3 milestone Jan 5, 2022
@ThomasEdvardsen

It doesn't work, so I am installing 220105.2 now.

@ThomasEdvardsen

Still the same with 220105.2

@gskjold
Member

gskjold commented Jan 8, 2022

Just to recap this thread:

  • Static IP: this has been fixed in the attached firmware.
  • Disconnected over a long time, but still collecting data: I suspect this is related to WiFi auto reconnect. I re-introduced the old reconnect code; maybe that will help.
  • Random reboots: I'm still not able to see what could cause this, especially since it works on 2.0.0. However, I have had a few cases lately where debugging causes reboots. Not sure if you have tried it, but set the level to Error, uncheck telnet and serial debugging, and see if that helps.

esp8266.zip
esp32.zip

@gskjold added this to the v2.0.6 milestone Jan 16, 2022
@gskjold
Member

gskjold commented Jan 22, 2022

I have found a possible problem, attaching new firmware.

EDIT: Sorry, constantly attaching wrong file, adding new one!
esp8266.zip

@erlandp

erlandp commented Jan 23, 2022

I've been using this software with a NodeMCU for about a week now. I've had frequent issues with 2.0.4 and 2.0.5. So far, fix 220122.2 has been running without hiccups.

Edit:
Unfortunately 220122.2 crashed as well, it just took a little longer. I'm now running 2.0.0.

@bardahlm

Would it be possible to include the running version in the MQTT-data? Then it would be easier to see what versions are the most stable.
Another option would be to set up an opt-in usage reporting feature that reports the current version, uptime and possibly some other relevant parameters at intervals to a cloud service?

@gskjold
Member

gskjold commented Jan 28, 2022

Including version in MQTT is not a bad idea, noted.

I've fixed a few problems in the following firmware:
esp32.zip
esp8266.zip

@gskjold modified the milestones: v2.0.6, v2.0.7 Jan 30, 2022
@Jendem
Author

Jendem commented Feb 1, 2022

image
Start: 2022-01-01, end 2022-02-01

Starting test of v2.0.7 today

@gskjold
Member

gskjold commented Feb 1, 2022

Thank you for the detailed monitoring, love it!

@Jendem
Author

Jendem commented Feb 8, 2022

v2.0.7 had no noticeable change in the uptime. Tested v2.0.9, and it is crashing so often that it is not usable...
image
No uptime longer than 15 minutes

@gskjold
Member

gskjold commented Feb 8, 2022

Thanks, 2.0.9 was removed due to instability on ESP8266

@gskjold
Member

gskjold commented Feb 19, 2022

General impressions are that v2.0.10 fixed this?

@Jendem
Author

Jendem commented Feb 19, 2022

I've gone back to v2.0.0 to verify that it, and the hw, actually works. Uptime of 10 days now.

I can move to v2.0.10 today.

Screenshot_20220219_101518_com android chrome

@gskjold
Member

gskjold commented Feb 19, 2022

Thank you, appreciate it!

@ArnieO
Contributor

ArnieO commented Feb 19, 2022

Is this a clue?
image

@gskjold
Member

gskjold commented Feb 19, 2022

Interesting, wonder how it managed to use that much.

@Jendem
Author

Jendem commented Feb 19, 2022

Mine has had 19.4-23.4 kB free for the last 40 days. I noticed that the HAN status has been jumping, but that seems to happen at every restart.

image

@ArnieO
Contributor

ArnieO commented Feb 19, 2022

Mine too is usually around 20 kB; this is the first time I've seen it that low.
In case there is a link: is there a way I could log this to MQTT, and graph the trend vs uptime?

@mikkle
Contributor

mikkle commented Feb 20, 2022

@ArnieO not currently, but see PR #240 - It's only for mqtt Raw format (so as to avoid fiddling with the json templates and introducing too many changes).
You could pull data.json in addition to getting the MQTT feed. I did that, but it may introduce other hiccups/delays best avoided while concentrating on the MQTT scenario.

@mikkle
Contributor

mikkle commented Feb 20, 2022

First of all, apologies for lots going on in the attached graphics.

One interesting finding, when troubleshooting uptime-issues (running mqtt raw) is that we actually get a meter-packet timestamp published to "/meter/dlms/timestamp".
This timestamp can be used to infer the HAN-status (like in web and data.json) by measuring the difference between timestamps (< 15s = ok, <30s = Warning, >= 30s = Danger)
I do that in Grafana with the influxdb query
SELECT difference("value") FROM "Pantry_HEM_Meter_DLMS_Timestamp" WHERE $timeFilter fill(none)
and then set up some coloured thresholds, value-mappings, and the like.
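
The same threshold logic can be sketched outside Grafana. This is a rough Python illustration of the thresholds described above, assuming the timestamps are already available as seconds in chronological order; the function names are made up.

```python
def han_status(delta_seconds):
    """Map the gap between two meter timestamps to a status label."""
    if delta_seconds < 15:
        return "ok"
    if delta_seconds < 30:
        return "warning"
    return "danger"


def classify(timestamps):
    """Return one status per consecutive pair of timestamps (seconds, oldest first)."""
    return [han_status(later - earlier)
            for earlier, later in zip(timestamps, timestamps[1:])]


if __name__ == "__main__":
    # Gaps of 10 s, 20 s and 40 s
    print(classify([0, 10, 30, 70]))  # prints ['ok', 'warning', 'danger']
```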

Next point:
The attached graphics are from my Pow-K running v2.0.10.
It seems to rather regularly die after around 40 hours.

Final point:
Regarding memFree, the data in the attached graphics is pulled with http from data.json (still running pure 2.0.10) along with the regular mqtt telemetry (exactly what I recommend avoiding in the post above ;-)
It doesn't seem that memFree is decreasing regularly down towards a reboot (currently gathering more data), but I do have a steep drop from around 16K free to around 10K free.
This occurred exactly when I went to the web UI and enabled, then disabled debugging.
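
To tell a gradual leak from a one-off step like this, two numbers are enough: the fitted slope of memFree over time and the largest single drop between consecutive samples. A minimal Python sketch, assuming the samples are `(seconds, free_bytes)` pairs collected elsewhere (for example from data.json); the function names and the sample layout are assumptions.

```python
def slope_per_hour(samples):
    """Least-squares slope of free memory over time, in bytes per hour."""
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_m = sum(m for _, m in samples) / count
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must span more than one point in time")
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return covariance / spread * 3600


def largest_drop(samples):
    """Biggest decrease in free memory between two consecutive samples."""
    drops = [before - after for (_, before), (_, after) in zip(samples, samples[1:])]
    return max(drops, default=0)


if __name__ == "__main__":
    # Hourly samples with a single step from 16000 to 10000 bytes
    data = [(0, 16000), (3600, 16000), (7200, 10000), (10800, 10000)]
    print(largest_drop(data))  # prints 6000
```

A large `largest_drop` together with flat segments on either side points to a step caused by a single event, while a steadily negative slope with small drops would point to a leak.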

I'm currently (as of a couple of hours ago) running a custom build based on master, with PR #240 applied.
This build is ahead of v.2.0.10 by the following commits from @gskjold (in addition to PR 240):

I'd like to leave that custom build running for a couple of days, or at least until it crashes again, but if any new findings/patches should turn up, I'm definitely ready to perform any necessary testing.

Edit: I should probably mention that my Pow-K sits in a Kamstrup meter, it's in Denmark, and using encryption keys (Radius)

Another edit: debug-level was set to ERROR, and debug was disabled, since this was the recommendation.
According to @gskjold's latest commits, it seems that the level should be WARNING on ESP8266-based readers, so I'm at WARNING + disabled now, on the custom build.

powk-grafana-old

@gskjold modified the milestones: v2.0.7, v2.0.10 Feb 20, 2022
@ArnieO
Contributor

ArnieO commented Feb 20, 2022

> First of all, apologies for lots going on in the attached graphics.

Not at all! We are all nerds here; the more complex the better! 😁
Thank you for the thorough analysis!

@mikkle
Contributor

mikkle commented Feb 22, 2022

Right, so my local build of (effectively) 2.0.11 has now been running stably on my Pow-K for more than 2 days \o/

Attached some fresh telemetry data, still indicating that a gradual mem-decrease does not seem to be the/an issue.
Regarding HAN-status, I realized that my method of using Meter timestamps is not really that good - I put in another graph showing the timing between uptime updates, and that shows the exact same bad spikes as the timestamps graph, which basically means that these represent lost updates (and thus not necessarily lost HAN-packets internally in the module).

I have several components in my stack that might be the root cause for the lost updates, so I won't blame either the Pow-K or the software for those without further analysis of other moving parts.
Also, it seems I've lost 12 updates in 48 hours, which is an error rate of 0.069%, which is definitely acceptable :)
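
For reference, that percentage works out if the meter sends one update every 10 seconds (the Kamstrup interval mentioned earlier in this thread, taken here as an assumption): 48 hours then gives 17280 expected updates. A small Python sketch of the arithmetic, with a made-up function name:

```python
def lost_update_rate(lost, hours, interval_seconds=10):
    """Percentage of expected updates that went missing."""
    expected = hours * 3600 / interval_seconds
    return 100 * lost / expected


if __name__ == "__main__":
    # 12 lost out of 48 h * 360 updates/h = 17280 expected
    print(round(lost_update_rate(12, 48), 3))  # prints 0.069
```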

In short: 2.0.11 way more stable than 2.0.10 on esp8266 (Pow-K) - for me, at least :)

Cheers!

2 0 11_local_build_telemetry

@Jendem
Author

Jendem commented Feb 22, 2022

I have no errors on v2.0.10 for 3 days 12 hours so far.

@Jendem
Author

Jendem commented Feb 26, 2022

Uptime of 7 days, 9 hours now. Free memory started to change (clear step down) the last 2 days:
image

@gskjold
Member

gskjold commented Mar 12, 2022

Considering this fixed

@gskjold closed this as completed Mar 12, 2022
@Jendem
Author

Jendem commented Mar 12, 2022

20 days uptime. Free memory stable at 16.2 kB. Fixed!
image
