ZHA Resets Randomly - All Devices Go Offline #107334
Comments
Can you upgrade to 2024.1.1? |
Sure. I'll upgrade and turn debug logging back on just in case. |
Had exactly the same issue after updating yesterday (HA Core + SkyConnect Silicon Labs Multiprotocol); not sure what caused this. I got "NCP entered failed state." and then a crash that an HA restart would fix. It happened once in a few hours. I updated both again to the swiftly arrived minor updates, 2024.1.2 and 2.4.1. Monitoring. |
Having once-daily ZHA issues as well. It stops responding around 10 PM EST. Reloading the integration doesn’t fix it. I’ve gone into settings preparing to restart core every night, but there’s an update, so I'm killing two birds with one stone. Maybe today’s update will prevent tomorrow’s failure (or failure to fail, hopefully)🤞 |
Can you enable ZHA debug logging for an hour before it happens and then disable it once the integration is not working? |
I will test around 8, make sure it’s working and then enable it. |
It usually happens in the middle of the night, but it's really kind of random, so I don't have any way of knowing one hour ahead of time before it happens. I have just updated to the newest minor update and re-enabled debug logging. I'll post logs the next time it happens. |
For me, this has happened around 7 AM EST daily for the last two or three days. I have the HUSBZB-1 by Nortek (HubZ Smart Home Controller), running 2024.1.2 on an RPi4. Edit: Now it does this around 7 AM and 9 AM. |
Do any folks with seemingly reproducible times have any integrations or backups or automations scheduled at the same time? We had a user last week with the Google calendar custom component stuttering the event loop causing stability issues. We run in the event loop… and the radio stacks are latency sensitive… anything else stuttering the loop will cause stuff like this. |
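To illustrate the point about the event loop: a minimal sketch in plain asyncio (not ZHA or zigpy code) of how one blocking call delays everything else scheduled on the same loop. The names `heartbeat`, `blocking_job`, and `measure` are hypothetical, and the heartbeat is only a stand-in for a latency-sensitive task such as a radio watchdog.

```python
import asyncio
import time


async def heartbeat(interval: float, gaps: list, beats: int) -> None:
    """Record how long each sleep really took (stand-in for a latency-sensitive task)."""
    for _ in range(beats):
        start = time.monotonic()
        await asyncio.sleep(interval)
        gaps.append(time.monotonic() - start)


async def blocking_job(delay: float, block_for: float) -> None:
    """Simulate an integration doing synchronous work inside the event loop."""
    await asyncio.sleep(delay)
    time.sleep(block_for)  # blocks every other coroutine on this loop


async def measure(block_for: float) -> float:
    """Return the worst heartbeat gap observed while blocking_job runs."""
    gaps: list = []
    await asyncio.gather(
        heartbeat(0.05, gaps, beats=6),
        blocking_job(0.06, block_for),
    )
    return max(gaps)


if __name__ == "__main__":
    print(f"worst gap without blocking: {asyncio.run(measure(0.0)):.3f}s")
    print(f"worst gap with blocking:    {asyncio.run(measure(0.5)):.3f}s")
```

In the blocking run, a heartbeat that asked to sleep for 0.05 s only wakes up after the blocking call returns, which is the kind of stall that can make a radio stack miss its deadlines.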
I have a Google calendar check at midnight but my issue is observed around 10 PM. I’ll do some more testing to see if it happens earlier. The rest of my automations are in Node-Red. Are they separate from the event loop you’re talking about? NR has many more automations to go through then. |
I have the same problem. Sometimes it happens in one hour, sometimes longer. I have the Conbee3 running.
|
No issue tonight, of course. Maybe the latest update got it, or maybe it's a coincidence. I can upload boring success logs if anybody wants. |
Since I upgraded to 2024.1.2, I haven't seen this issue return either. I haven't gone this long without a ZHA reset in quite some time, so I'm hopeful that something in the most recent release fixed whatever was causing this problem. Strangely, after I upgraded, several of my Leviton in-wall Zigbee switches stopped responding. I wasn't even able to re-pair them without flipping the breaker to power them off and back on. They've been stable since then, though. My Zigbee network also seems a lot more responsive than it was before. I'm keeping my fingers crossed that everything stays this way. |
@Huug4922 (and anyone else commenting) please enable ZHA debug logging and post the full debug log. |
@puddly, sure, but I had to make the file much smaller. It was 125 MB. Everything before 20:00 is removed. |
Looks like I spoke too soon when I said this was working for me. It happened again last night; I woke up this morning and found that all but 1 or 2 of my Leviton in-wall Zigbee switches and several Sonoff plug-in switches had stopped working. I had to shut off power to all of them again in order to get ZHA to re-pair (otherwise it would just get stuck on "Configuring"). Strangely, every time this happens, it's only line-powered devices that drop off the network. I've never had a battery-powered device drop. This time, system logs show a watchdog time-out, not an AppController restart. Unfortunately, I had turned off debug logging as I thought the problem was fixed. I've upgraded to 2024.1.3 and turned debug logging back on. Now it's just a matter of waiting. |
I have been having this issue ever since updating to 2024.1.1, and I have also experienced it on 2024.1.2 and today updated to 2024.1.3. The issue still persists. I started the debug logging during the initializing loop so logs from before it started will be missing, but ZHA restarted several times after starting logging so hopefully there is something useful in here that indicates what is causing it to need to reinitialise. |
we have a fix for the looping coming soon |
I think I should have some good log data for you. Today has been an extremely bad day for my Zigbee network. On multiple occasions, all my Zigbee devices became unavailable. It appears as though each instance corresponds with either a "Watchdog Error" or a "Watchdog Timeout" in the system log. The network is also extremely slow to recover; some devices require multiple attempts to switch them on/off before they start responding again. I'm also now seeing warnings for Zigbee channel utilization, but I have nothing that could be interfering. My WiFi network is not on an overlapping channel, and I don't have neighbors. I did a WiFi scan to be sure; nothing nearby is interfering with my Zigbee network. As far as I can tell, these are the times of interest (all on 1/13): |
I'm having the same issue. Will turn on debugging and share logs. Have also noticed my Bluetooth integration becomes unavailable at the same time. Has anyone successfully moved to Z2M as a solution? |
We have a fix for one of the causes of the excessive reloading. It will be in the next point release. |
I hope it will fix mine too. To be sure, I will add my debug log too. |
Hello! I have the same problem! With sonoff and nas Synology!
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:388
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 21:37:25 (1 occurrences)
Last logged: 21:37:25
[0xDF2A:1:0x0001]: 'async_initialize' stage failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 326, in request
    return await req.result
           ^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 186, in sem_task
    return await task
           ^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 388, in async_initialize
    await self._get_attributes(
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 490, in _get_attributes
    read, _ = await self.cluster.read_attributes(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 524, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 325, in request
    async with asyncio_timeout(timeout):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError |
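For readers unfamiliar with this kind of traceback: it ends in a `TimeoutError` raised while a request was still waiting for a reply. The sketch below reproduces that general pattern with plain asyncio, assuming a reply that never arrives. It is not zigpy's implementation; `request` and `main` are hypothetical names.

```python
import asyncio


async def request(reply: "asyncio.Future", timeout: float):
    """Wait for a reply, giving up after `timeout` seconds (hypothetical helper)."""
    return await asyncio.wait_for(reply, timeout)


async def main() -> str:
    loop = asyncio.get_running_loop()
    reply = loop.create_future()  # nothing ever resolves this: the device stays silent
    try:
        await request(reply, timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"
    return "answered"


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A timeout like this only says that no answer came back in time; it does not by itself say whether the device, the radio, or a stalled event loop was responsible.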
Unfortunately updating HA to 2024.1.5 didn't change anything for me. Using Conbee3 and a VM on my Synology DS920+. See my logs above. |
how often does this happen? |
@Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release. |
And for sonoff? |
And for HUSBZB-1?
|
I am also still having issues with my Sonoff controller. The last few updates have improved the issue; I'm not seeing frequent resets one after another, but I am still seeing resets roughly once per day; each reset usually corresponds with one or more Zigbee devices needing to be either power-cycled or (more often) re-paired to the network. I've already posted logs several times, but I can post more if needed. |
Hello, can you explain how you do it? |
Same here; in the last few days it's always around the same time. The whole Zigbee network goes down for about a minute while the integration reloads, and the HA log fills up with errors in the meantime. While almost everything comes back online, some devices have had to be re-paired (mains-powered routers and a couple of end devices). The coordinator is a Sonoff-P CC2652. |
Just wanted to say the issue is mostly fixed for me since updating to 2024.1.4; it no longer reinitializes the addon over and over again. However, I am still having the occasional restart, which normally comes with “NCP entered failed state. Requesting APP controller restart”. From what I’ve seen, this is due to loss of connection to the Zigbee dongle, which is common with network dongles. However, I am using a SkyConnect plugged in via USB. My server does run in a VMWare Workstation VM which I pass the device through to, so maybe there’s an issue with the passthrough dropping the connection. I’m going to move my install over to bare metal soon and see if it improves. |
@slunat before 2023.12.1 my HUSBZB-1 never lost connection and I had never seen the "NCP entered failed state" message. Something has changed there which is causing issues with all these sticks (I count at least SkyConnect, Nortek HUSBZB-1, Sonoff and ConBee III). I think we have posted logs for all these sticks; I'm not sure how we can help get to the bottom of this bug. |
Yes, I have the same problem! 2023.11.3 worked with no problems! |
@slunat I used to see that message frequently until one or two updates ago. It seems like that error has disappeared, but I'm still seeing "Watchdog failure" and "Watchdog heartbeat timeout: TimeoutError()" on a regular basis (twice today, for instance). Are you seeing the watchdog errors as well? Judging from what I'm reading here, it appears that these ZHA reset issues are fairly widespread. I am also running HA on a VM with USB passthrough; I'm using Proxmox, though, not VMWare. I thought that perhaps the USB passthrough was causing issues, but I'm passing through several other USB devices to different VMs, and they have a much higher data rate than my Zigbee controller. They've been running for months with no interruption. My HA VM also gets USB passthrough for a UPS, and it is rock solid. I don't think USB passthrough is the problem, although I am very curious to see what happens if you decide to switch over to a bare metal installation. I'm reluctant to set up a server just for HA when I have a perfectly good hypervisor already running; the UPS in my server rack is already close to capacity, and more hardware would likely push it over the edge. |
@neoback45 I mentioned a few different things in my comment. What specifically were you wondering how to do? |
For this: I am also still having issues with my Sonoff controller. The last few updates have improved the issue; I'm not seeing frequent resets one after another, but I am still seeing resets roughly once per day; each reset usually corresponds with one or more Zigbee devices needing to be either power-cycled or (more often) re-paired to the network. I've already posted logs several times, but I can post more if needed. |
My ZHA crashed too; all devices are offline. I reported it here (including the log at the moment it broke): They might be related; my network is also slow. |
@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5. Here is my debug logging when reloading the integration. Still using Conbee III. |
I upgraded to 2024.1.6 and ZHA works for the moment... |
I also updated and it works as before. |
I think it is time to think about giving up on ZHA. This has been broken for me since 2023.12.1, so about 2 months, and so far I have not seen much progress on fixing the problem. I have not seen the owners posting any questions either to help make some progress. |
lol talk about dramatic. If you don’t see us working on these tickets then you clearly aren’t paying attention. Anyway, IF you are on a SI stick we may have found something that could help. We are testing a patch now. |
That's encouraging. I have a Sonoff ZBDongle-P. Very happy to help if I can. |
Today I tried again to upgrade to 2024.1.6; the integration still won't load. Hopefully someone can take a look at my debug logging. |
It crashes regularly, almost once a day; it happened in the last hour too. home-assistant_zha_2024-02-05T10-09-14.097Z.log I can leave ZHA debugging enabled if needed and try to record what happens next time. |
I was on a Nortek HUSBZB-1 and it was crashing like 10 times a day. So I bought and migrated over to a SkyConnect. I crash less now, maybe once or twice a day, but it's still crashing. I've attached my logs. |
@r0bb10 Please enable debug logging for the entire duration. @artinbastani In your log file, I also see no crashing. Just devices reporting power usage. |
Strange, all of my devices went offline at 4:12 and again at 4:14. How much time prior to the event would be useful? |
Half an hour before and after would be fine. You can leave it for longer, however, the more context the better. |
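Since several people here had to cut down very large debug logs, here is a rough sketch of trimming a log to a window around the failure. It assumes each entry starts with a timestamp like `2024-01-13 04:12:00.123`; `trim_log`, the sample lines, and the date are made up for illustration, so adjust the format to whatever your log actually contains.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator

# Assumed timestamp prefix at the start of each log entry.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TIMESTAMP_LENGTH = 23


def trim_log(lines: Iterable[str], event: datetime, margin: timedelta) -> Iterator[str]:
    """Yield only the lines within `margin` of `event`.

    Lines without a leading timestamp (e.g. traceback continuations) follow
    the decision made for the last timestamped line.
    """
    keep = False
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            pass  # continuation line: keep the previous decision
        else:
            keep = abs(stamp - event) <= margin
        if keep:
            yield line


if __name__ == "__main__":
    # Made-up sample entries, not real ZHA output.
    sample = [
        "2024-01-13 03:00:00.000 DEBUG too early",
        "2024-01-13 04:00:00.000 DEBUG inside the window",
        "  continuation of the previous entry",
        "2024-01-13 05:00:00.000 DEBUG too late",
    ]
    for line in trim_log(sample, datetime(2024, 1, 13, 4, 12), timedelta(minutes=30)):
        print(line)
```

To use it on a real file, pass an open file object as `lines` and write the yielded lines to a new file.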
ok. see if you find anything in this.... |
Take a look at this; it seems to have happened at 01:11 last night. Look for "Coordinator is disconnected". The ZHA debug logs are too massive to share, so I had to reduce the time window. |
I updated to 2024.2.1 today. It has been stable for 10 hours now. Thanks! |
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. |
The problem
About once per day, every device on my Zigbee network will go offline. ZHA will re-initialize and the devices will eventually come back. After ZHA re-initializes, some devices will have a very delayed response the first time they're switched, and occasionally some will fail to respond altogether.
I'm using a Sonoff Zigbee 3.0 USB dongle.
What version of Home Assistant Core has the issue?
12.4
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
ZHA
Link to integration documentation on our website
No response
Diagnostics information
I have uploaded two logs:
Starting from a fresh reboot on 1/3
Starting at midnight on 1/5 (the most recent occurrence of the problem was the morning of 1/5)
Both are too big to upload to Github, so I'm hosting them on Backblaze.
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Additional information
No response