Plug becomes "unavailable" a few times a day #9
Comments
Just a few thoughts for debugging:
I haven't heard other complaints like this really. I can send a couple replacements if you email brian@kaufha.com. |
I'm also seeing this behavior, though the plug always resolves itself (by restarting). It's somewhat inconsistent in frequency - sometimes every 10-15 minutes, sometimes every few hours. I was able to capture logs in the dashboard right before it went unavailable. Looks like there's some kind of memory issue or leak, so after some time the device restarts.
Here's my ESPHome configuration. I'm running the latest version of
Does this seem like an issue with my specific plug, or something to debug on my end with ESPHome? |
Can you add debug sensors in your local yaml and see if you can see free heap memory slowly reducing over time? Could be a memory leak I guess. I haven't seen this reported as a widespread problem so not sure if there is global issue or just you. The example config entry here should work: |
After you add debug, it should show some debug info on boot as well. Let me know what that says. Edit: after it crashes not after you just update the yaml and it reboots to update. |
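For anyone following along, here is a minimal sketch of what such debug sensors could look like, based on the ESPHome `debug` component. The entity names are placeholders chosen for illustration and are not taken from kauf-plug.yaml. The fragmentation and block sensors assume an ESP8266-based plug.

```yaml
# Sketch only: entity names are placeholders, not from kauf-plug.yaml.
# The debug component prints device info at boot when the logger level is DEBUG.
logger:
  level: DEBUG

debug:
  update_interval: 5s

text_sensor:
  - platform: debug
    device:
      name: "Device Info"
    reset_reason:
      name: "Reset Reason"

sensor:
  - platform: debug
    free:
      name: "Heap Free"
    fragmentation:
      name: "Heap Fragmentation"
    block:
      name: "Heap Max Block"
    loop_time:
      name: "Loop Time"
```

If the plug's package already defines `sensor:` or `text_sensor:` lists, these entries should be merged into them rather than replacing them, but that is worth confirming in the compiled config.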
I have a plug running your exact yaml with the debug info added, so I'll let you know what I see. It's on a unit connected to serial logging as well so I may get more info that way. |
Thank you for the quick help on this! Well, this isn't what I expected to see: Looks like free heap memory is not reducing over time. Within ~10 minutes I hit another crash. Here's the boot log:
The fatal exception seems interesting. I'm new to ESP, but it seems like it's trying to write to protected memory: Also, immediately after restarting, it starts hitting the "could not allocate memory" issue. Anecdotally I'm seeing these statements even more often now after adding the debug logging. I don't remember seeing this a day or two ago (upon first boot or a while after). (I'm new to ESPHome, so hoping I'm not just doing something wrong here. I'm wondering why this seems to be increasing in frequency. Is there some kind of device-level storage I need to clear out (because of multiple OTA updates)? Or does this point to something outside of the ESPHome software side?) |
My heap size was super stable for about an hour. Looking into the JSON messages, it seems like something that only happens when the web server on the plug is accessed, so I tried doing that and the plug crashed with out of memory error.
|
Here's where the JSON messages come from: And that gets called from: The weird thing is why this would be randomly happening on your plug. Are you using the HTTP API? |
Ah, that would explain why this is happening more frequently this morning (since I was accessing the web server). I'm not using the HTTP API - just Home Assistant. Strangely, the past two nights the plug restarted multiple times, hours after I stopped using my computer. It's possible I left a browser open that caused it. I can monitor again tonight and see if it reproduces. Maybe the heap allocation is a red herring? I'll keep on the debug sensors, and also add the wifi sensors you mentioned above, in case this is actually network related. Thanks for your help! I'll update again if I learn anything new. |
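To check whether the restarts are network related, WiFi diagnostics along these lines could be added next to the debug sensors. This is a sketch using the stock ESPHome `wifi_signal` and `wifi_info` platforms. The entity names and the update interval are assumptions, not values from this thread.

```yaml
# Sketch only: names and interval are illustrative assumptions.
sensor:
  - platform: wifi_signal      # reports RSSI in dBm
    name: "WiFi Signal"
    update_interval: 60s

text_sensor:
  - platform: wifi_info
    ip_address:
      name: "IP Address"
    bssid:
      name: "Connected BSSID"  # shows which access point the plug joined
```

Graphing the signal sensor alongside uptime in Home Assistant makes it easier to see whether a drop in signal precedes each restart.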
Okay, an update from today: Good news: it didn't crash as early! This is probably because I wasn't continually checking the web interface. It lasted over 24 hours before crashing. However, then it crashed a few times in a row, starting a few hours ago. Every time had the same reason in the "device info" logged to Home Assistant:
I don't see anything interesting in the heap, heap free, heap fragmentation, or wifi stats logged to home assistant. "Loop time" seems to spike to 18000ms right before one of the crashes. Any idea what could be happening? |
From what I'm seeing, this really seems to be an issue with the web server / json. I may just have so many entities and other info being sent it's reaching the limits of the heap. I'm going to have to do some debugging within the web_server and json components to see what's happening in there in more detail. When I ran using a more stock yaml, the plug was super stable with the web server open and refreshing, but it crashed pretty quick when I opened up the web server in two tabs at once and refreshed. Let me know what you see separate from the web server / json stuff. You indicated you were seeing this when you weren't using the web server, so if a cause can be narrowed down that would be helpful. In the meantime you might try using the minimal yaml instead of the normal one. Let me know if you need help replicating the settings. https://github.com/KaufHA/PLF10/blob/main/kauf-plug-minimal.yaml |
Another idea instead of switching to the minimal yaml would be to disable the webserver by setting a substitution Edited to add quotes around true in substitution. |
Thanks for the continued help on this! Since my last comment, I made a few changes:
So far, it seems like one of these changes did the trick! I'm at over 46 hours of uptime so far (no crashes). I'm hopeful this means I'm all set! I don't actually need the webserver enabled, the HA controls are more than sufficient for me. Happy to try to debug further if it's helpful, though. |
Thanks for the info. I'm sure the issue is the web server. Keep me posted if you learn anything else. |
Yeah, I just got the plug in yesterday. It's my first ESPHome device and I am super excited 😁 The device website takes forever to load after a restart, once it does become responsive. I have a static IP set on the device along with a DHCP reservation. I will turn off the webserver and report back in a little bit. It was happening every few hours to every few minutes. The longer the device is on, the shorter the time between crashes. |
Just putting this here in case anyone else runs into the same issue I just did. After adding the disabled web server substitution and attempting to upload it through ESPHome, the upload kept failing with a timeout error. Apparently the plug was disconnecting from the Wi-Fi so often at that point that I couldn't even upload the newly compiled firmware. Unplugging the device and plugging it back in to restart it fixed that issue. I now have the web server disabled and I will report back on whether that fixes the issue with it timing out 👍 |
Well this response was sooner than expected... |
Thanks for the info and any logs you can provide would definitely be helpful. Can you completely remove the webserver and see if that helps? Edited to add what I would recommend you do to remove the webserver component completely: 1: download kauf-plug.yaml into your esphome config directory. 2: edit that downloaded kauf-plug.yaml file to remove the web server as follows (comment it out):
3: edit your device yaml to use that downloaded version as follows:
Here was the original option: In order to do that you need to remove the package from your yaml (these lines):
And then copy and paste the entire kauf-plug-yaml file into your yaml: https://raw.githubusercontent.com/KaufHA/PLF10/main/kauf-plug.yaml And then remove the webserver section (these lines):
|
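As a rough sketch of the local-copy approach in steps 1 to 3 above: the web server block is commented out in the downloaded file, and the device yaml includes that local file as a package. The package key `kauf_plug` is an arbitrary name chosen for this sketch, and the actual contents of the `web_server:` block in kauf-plug.yaml are not reproduced here.

```yaml
# --- In the downloaded kauf-plug.yaml (local copy) ---
# Comment out the whole web server block. Its real contents are
# not shown in this sketch.
# web_server:
#   ...

# --- In the device yaml ---
# Include the local copy instead of the remote package.
# "kauf_plug" is an arbitrary key used only for illustration.
packages:
  kauf_plug: !include kauf-plug.yaml
```

The local file has to sit in the same ESPHome config directory as the device yaml for the relative `!include` path to resolve.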
I can give that a try at some point tomorrow. I noticed that after about an hour the issue seems to have stopped, so I want to give it some time to see if it comes back up. Also, when I added the disabled web server substitution
it threw an error that it was expecting a string but it received a bool so I had to send true as a string.
Is that correct? |
Yea, it needs to be a string. Sorry I forgot that when I typed it in my response earlier. I'll edit it so no one else gets confused. I tried kauf-plug-lite.yaml instead of kauf-plug.yaml (thereby eliminating use of custom components) and I'm basically having the same crashing experience. So, it doesn't appear to be anything wonky I've done in the custom components but rather something in the yaml file. |
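On the string-versus-boolean point, the difference comes down to YAML quoting. The sketch below uses `some_flag` as a hypothetical substitution name purely for illustration. It is not the actual substitution name from kauf-plug.yaml.

```yaml
substitutions:
  # some_flag is a hypothetical name used only for this sketch.
  # Unquoted, YAML parses the value as a boolean, which is what
  # produces the "expected a string" kind of validation error:
  # some_flag: true

  # Quoted, the value is a string and passes validation:
  some_flag: "true"
```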
Okay. After my knee-jerk reaction to seeing the issue continue with the webserver disabled, I stopped and waited. Disabling the webserver and uploading the change to the device does not fix the issue. However, after I noticed it was still happening, I performed a firmware restart (through the button provided as an entity). After that the issue stopped. My device has been up over 12 hours without issue. So it seems like whatever is happening sticks around even after newly compiled firmware is uploaded to the device, and only a restart clears it up. |
I'm also beginning to notice this behavior. With a set of four plugs, none modified except for uploading Firmware version 2.00(u), one appears to have become very unstable (the graph showing current has a bunch of breaks in it), but I didn't notice it until the Christmas tree didn't turn on automatically this morning. My first thought was signal strength, so I like the idea of including signal strength along with the other diagnostic data. When checking the plug over http, I noticed unusual uptimes, though, too -- 1255374s, (unavailable), 9527s, 5139s, which means that at least two of the others restarted on their own very recently. (I set them all up around the same time a couple weeks ago.) No idea why these might have restarted, but I will try to keep an eye on them. Probably coincidences, but here are a couple things I noticed: (1) The one that's been up consistently for two weeks is only used as a power monitor -- it has no automation that turns it off and on. (2) The one that's apparently gotten really flakey is the only one with a MAC address that starts with 98 (as opposed to 3C). I haven't had time to experiment yet, apart from unplugging and replugging the one that is apparently unstable, and that didn't seem to help much. In fact it doesn't appear to be coming back online at all now. In the image, the green line is the flakey one, and the purple one is controlled in a similar way (on/off automations) -- you can see what I mean by the breaks in the line. New to HA but please let me know if there's anything I can do to help figure this out. (I will try the ideas in this thread hopefully next weekend.) |
The plug worked great since my last update, then randomly about 3 days ago it started acting up again. I still have the webserver off. There was an update available today so I went ahead and updated and restarted the firmware. I then managed to catch some of the logs where the device kept becoming unavailable every couple minutes. The device is not throwing any error messages at all, it is just timing out and disconnecting. The logs I have show it timing out about three times and then finally staying on with the mention of some type of boot loop.
Full logs below.
I will say that after updating and restarting the firmware, and after its initial funny business and that message about the boot looking successful and resetting the boot loop counter, the device does appear to be working again. It seems like something happens to trigger a boot loop in the device, and it can't get out of it until a firmware restart, and then it still takes some time to get out of it. |
It has started acting up again. |
Wait... Why do you have the plug set to restart itself every 15 minutes? Line 459 of kauf-plug.yaml |
new to this, but I believe this is the timeout in ESPHome that will cause the device to reboot when there's no client connected (for this amount of time), so if HA is connected to the API, then this timeout doesn't come into play. |
I moved mine right next to the router yesterday and it's been stable since. (Not very useful at this location, but stable nonetheless... I have my fingers crossed.) So I wonder if this is just a really fragile API connection causing this behavior. @birdwing have you done any testing around that? |
@alexfranke I have not done any testing around that. But I don't think that's an issue with this plug itself. I might dig into the Home Assistant documentation or the ESPHome documentation to see if there are any settings related to timeout. I will be moving the plug closer to the Wi-Fi router because its permanent home will be only a few feet away from it. However, it was working fine at its current location for weeks until it just randomly started misbehaving again. Nothing has changed with the network, and the signal strength on the device is the same now as it was during the weeks it was working. |
I may try decreasing the monitoring update interval. If it's an issue with the plug itself, then that should cause it to disconnect more often due to the timeout issue on the plug's API, in which case increasing that interval should fix the problem. |
The reboot timeout is the default behavior of ESPHome. I've thought about disabling it by default as it's confused people more than anything. If you turn on the No HASS switch, the reboot timeout will be disabled. If this timeout were occurring, I believe you would see a message in the logs on the web interface or ESPHome dashboard. I'm going to create some additional yaml files for debug purposes. If anyone has any ideas for test yamls let me know. One's just going to be kauf-plug.yaml with no web server. In another I'm going to implement all config entities as substitutions just to cut the number of entities way down. Has anyone tried kauf-plug-lite.yaml and still had this problem? Has anyone removed the web server completely, not just disabled via substitution? |
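For reference, disabling the API reboot timeout by editing the yaml could look like the sketch below. In stock ESPHome, `reboot_timeout` on the `api` component defaults to 15 minutes and `0s` turns it off. How this interacts with the No HASS switch in the Kauf packages is not covered here, so treat it as an assumption to verify.

```yaml
api:
  # Stock ESPHome default is 15min. 0s disables the automatic reboot
  # that fires when no API client (e.g. Home Assistant) is connected.
  reboot_timeout: 0s

# Note: the wifi component has its own reboot_timeout with the same
# default, which applies when the WiFi connection itself is lost.
```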
To compile a firmware with the web server completely omitted, you can change the packages: line to the following. This will completely remove the web server and captive portal so they aren't even compiled in (the substitution still compiles in the web server, it just disables it from running).
On my end, the only way I'm getting the plugs to crash is opening up multiple web browser tabs at once connected to the same plug. So I still highly suspect web server is causing issues. |
I didn't try the lite version, but I did try minimal. Still had the same issue. |
Good point. Possibly the same here, actually -- it was working fine for weeks in the same location with no symptoms until I noticed it didn't switch on automatically the other day. Maybe I was just lucky for the first few weeks and the automation occurred only while it happened to be connected, but that seems a bit unlikely after seeing how often it was actually disconnected when I looked into it. @birdwing Interestingly, I also can't get it to take an OTA firmware update to 2.02 either -- it just resets the browser connection and does nothing. Also it apparently rebooted itself again 1.5 hours ago (before the firmware update attempt). Weird...
I'm brand new to a lot of these technologies and still fumbling my way through them (haven't gotten an OTA update to work yet with esphome.exe), but I'm going to take another stab at this probably tomorrow. If it stays connected all day in its current location, I'll move it back to the Christmas tree and see if it starts disconnecting again. Then I'll try kauf-plug-lite.yaml first and let you know how it goes. @bkaufx oh -- and happy new year! |
@birdwing Weird that the minimal yaml would have issues. Did you try disabling reboot timeout by turning on the No HASS switch or by editing the yaml? |
I have a new plug model that should be in stock in the next week. If everyone in this thread will email me their address to brian@kaufha.com I'll send everyone one of the new plugs to try out different hardware. |
All options entities can be removed by using the following packages section:
Options can be set using these substitutions:
|
I am using HA though. |
Just sent you an email 👍 |
I switched to the /kauf-plug-no-webserver.yaml file. It seems to happen less often without the webserver installed, but it still happens 2-3 times an hour (a big improvement over every 2 minutes) Before changing the file I did also try setting no hass to true, but that made no difference at all. Also changing the monitoring update interval didn't appear to make a difference either. |
Here's some more data for my symptomatic device: (sorry so wordy -- hoping there's something in here that's helpful)
Next I'll try it with a different load just in case that has something to do with it. The light string I'm currently using draws about 275mA btw. Let me know if there's anything in particular you'd like me to test around before I try to change the yaml -- for example, I could try same load with a different plf10, same load same plf closer to router, etc, but I'm flying blind as to what might be useful to you. |
Some more data -- it appears to become unstable when current is flowing through it. I still need to test it with a different load. Here's the "uptime" graph for the past 24 hours along with a description of what happened. A - (from previous message) plugged the PLF10 into the wall with nothing plugged into it, and it connected right away. Here's the other weird thing: Literally every time, right before the line breaks, there's a decimal value recorded that's less than the previously recorded value -- sometimes just a bit lower, and sometimes quite a bit lower. For example this weird decimal value when ten seconds prior it was a nice integer 24392. |
Could it be the LED light string that's causing the problem? I'm starting to think that it is...
Plus, @birdwing looks like he was also working with Christmas lights... Presumably there's some isolation in the hardware, but knowing if this is even plausible is pretty much above my pay grade, I'm afraid... :/ |
@alexfranke My Christmas lights were old incandescent bulbs. 😉 I had the same issue after moving the plug to my soundbar (I'm using it so Home Assistant knows when it's on, for use with my Broadlink remote). I also don't think LED light strings can cause electrical interference any differently than other electrical devices. The lamp you plugged in, depending on the type of bulb, may still have drawn less power than the LED string, however. Perhaps it has to do with how much power is being drawn. If you plug in something that pulls more power than the LED light string, do you see the same issue? Perhaps there is electrical interference being created when power moves through the plug at or above a certain threshold. |
Do you guys remember having this issue before firmware v2.02? If not, it might be interesting to try reverting back to the stock HLW8012 component by adding this to your config yaml:
|
Most of my testing was with v2.00 and I reconfirmed the behavior with 2.02 starting yesterday. @bkaufx I'm still suspecting some kind of interference caused by this particular string of lights. It was up continuously for about 24 hours (22 with nothing plugged in, and 2 hrs with a 75W incandescent plugged in, testing @birdwing 's theory). I plugged the string lights back into it, and it's now rebooted itself twice in the last 20 minutes -- both times reporting a weird floating point value for uptime right before restarting. |
I got my plug after 2.02 was released so I never had anything older than that. |
Well, so much for that idea... This time it started acting up again with no intervention after working properly 19 hours straight, the last 12 of those hours in the OFF state and with nothing plugged in. For the last three hours it's been restarting over and over again, but at least remaining connected. This is really starting to test my tenacity. :/ I'd love to be able to figure out what's going on here but I'm beginning to suspect that might not ever happen. Now I'm thinking it's two different issues. |
Does this seem to be a problem for only one plug or do all plugs act the same way? |
What are the Watt/voltage history for the plugs? @bkaufx |
No, there's no correlation there. The lights are coming off the tree today so I'll be able to do some more different kinds of testing. Just now I noticed a lot of errors in the log produced by the devices website (json string allocation errors) and the site doesn't render all its UI elements.
...where X is either 8, 16, 32, 64, or (interestingly) 40. So my current theory is admittedly pretty convoluted: When the WiFi signal isn't awesome (e.g. >60) or a channel or plugged-in appliance is too noisy, the ESP8266 is prone to disconnect and scans and/or connects with slower and/or more power-hungry protocol, which may cause brown-outs -- or repeatedly (while scanning or connecting) runs some code with a memory leak that doesn't manifest itself until later when something overflows and causes it to restart. ...more testing later. |
Btw this was actually with a different plug (one that was previously restarting randomly but NOT disconnecting), so that's partly why I'm thinking this is related to signal strength and/or the actual light strings. I thought the part highlighted in green was interesting -- when the plug had both restarted and reconnected at about 3:00 pm yesterday, the voltage readings were a lot noisier than before until the switch was turned off. This symptom didn't repeat itself from 1-2 today with the same amount of current, but by that time the plug had restarted at least once (at about midnight) and disconnected a few times. This is the plug that's currently still throwing the JSON memory allocation errors. I'm waiting to see if that's a sign that it's going to shut itself off or disconnect. |
I have another update on this, and now I'm totally confused. Once I got the lights off the tree last weekend, I tested the symptomatic plug close to the router with the same LED strings to try to isolate network connectivity as a contributing cause, and it seemed a lot more stable. So then I added some diagnostics (memory, signal strength, physical network mode, etc), reflashed the symptomatic plug, and put it back in its original location (other side of the house) with the same (original/symptomatic) light string. The only thing different this time was that the lights weren't on the tree and I had reflashed the device -- and it's only rebooted itself once in the past 40 hours with no sign of becoming "unavailable" during that time. I guess the other difference is that in its symptomatic state, I had originally upgraded to v2.00 using the device's website, and I most recently reflashed it using ESPHome OTA. (I actually wasn't able to reflash it using the website at all - the website would just hang when I tried.) Maybe this was all just a bad firmware update? |
Have to say, thank you for sending the new hardware. I absolutely love the re-design. It's slightly thinner which makes it easier to use on my power strip behind the TV. The new hardware was also way more responsive for some reason. It booted quicker and connected to my network much faster than the older plug. After almost a week not a single issue with it. I will say as well, since moving the old plug to the soundbar rather than the Christmas lights, I have had much fewer issues with it. Haven't had any in the last week. When removing the christmas lights I did also re-flash the device (to change the name of it). And went back to the default kauf-plug.yaml file. I am still using |
Adding BTW, I also set |
I bought a 4-pack of these plugs and adopted them all with ESPHome Dashboard running in Docker on my Raspberry Pi. Home Assistant is running in another Docker container on the same Pi.
After resolving some initial struggles getting used to ESPHome and configuring an appropriate 2.4GHz network at one end of my house for one plug, they have been working very responsively with two exceptions. Two of the plugs will occasionally show up as "Unavailable" in Home Assistant, which means they can't be controlled through Home Assistant (or via automations). Sometimes they will resolve themselves after a few minutes, sometimes they will be "unavailable" until I physically remove them from the outlet and plug them back in, rebooting them. When they are "unavailable" in Home Assistant, they are also not available at their own web server.
Is this an issue that others have experienced?
What steps can I take to debug this issue on my end or potentially resolve it?
Thanks,
Wilson