ZHA Resets Randomly - All Devices Go Offline #107334

cameron686 · 2024-01-06T02:55:32Z

The problem

About once per day, every device on my Zigbee network goes offline. ZHA re-initializes and the devices eventually come back. After ZHA re-initializes, some devices respond with a long delay the first time they're switched, and occasionally some fail to respond altogether.

I'm using a Sonoff Zigbee 3.0 USB dongle.

What version of Home Assistant Core has the issue?

12.4

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA

Link to integration documentation on our website

No response

Diagnostics information

I have uploaded two logs:
Starting from a fresh reboot on 1/3
Starting at midnight on 1/5 (the most recent occurrence of the problem was the morning of 1/5)

Both are too big to upload to Github, so I'm hosting them on Backblaze.

Example YAML snippet

No response

Anything in the logs that might be useful for us?

The issue appears to have occurred multiple times on 1/5 (based on when devices went offline): 0110, 0419, 0427, 0950, and 1806 (while I was downloading the logs).

The problem seems to correspond with "NCP entered failed state. Requesting APP controller restart" in the system log (caused by bellows.ezsp).
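One way to line up outage times with log entries is to scan the log for the messages mentioned in this thread. The following is only a rough sketch in Python: it assumes the usual Home Assistant log line layout (timestamp, level, thread, logger, message), and the marker list and the `find_reset_events` name are illustrative, not part of any HA tooling. The sample lines are copied from a debug log posted later in this thread.

```python
import re

# Substrings reported in this thread as accompanying a ZHA reset.
MARKERS = (
    "NCP entered failed state",
    "Watchdog failure",
    "Connection to the radio has been lost",
)

# Assumed layout: "YYYY-MM-DD HH:MM:SS.mmm LEVEL (Thread) [logger] message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)


def find_reset_events(lines):
    """Return (timestamp, logger, marker) for every log line containing a marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # traceback or other continuation line
        for marker in MARKERS:
            if marker in match["msg"]:
                events.append((match["ts"], match["logger"], marker))
                break
    return events


sample = [
    "2024-01-07 21:34:15.393 DEBUG (MainThread) [zigpy.application] Feeding watchdog",
    "2024-01-07 21:34:17.197 WARNING (MainThread) [zigpy.application] Watchdog failure",
    "Traceback (most recent call last):",
    "2024-01-07 21:34:17.203 DEBUG (MainThread) [zigpy.application] "
    "Connection to the radio has been lost: TimeoutError()",
]

events = find_reset_events(sample)
for ts, logger, marker in events:
    print(ts, logger, marker)
```

To run it against a real file, pass an open file object (for example `open("home-assistant.log", encoding="utf-8")`) instead of `sample`.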

Additional information

No response

puddly · 2024-01-06T04:11:54Z

Can you upgrade to 2024.1.1?

cameron686 · 2024-01-06T04:18:28Z

> Can you upgrade to 2024.1.1?

Sure. I'll upgrade and turn debug logging back on just in case.

serrnovik · 2024-01-06T15:43:12Z

Had exactly the same issue after updating yesterday (HA Core + SkyConnect Silicon Labs Multiprotocol); not sure which one caused it. "NCP entered failed state." and then a crash that an HA restart would fix. It happened once every few hours. I've updated both again to the swiftly arrived minor updates, 2024.1.2 and 2.4.1. Monitoring.

keithcroshaw · 2024-01-07T03:29:20Z

Having once-daily ZHA issues as well. It stops responding around 10 PM EST. Reloading the integration doesn’t fix it. I’ve gone into settings preparing to restart core every night, but there’s an update, so I'm killing two birds with one stone. Maybe today’s update will prevent tomorrow’s failure (or failure to fail, hopefully) 🤞

puddly · 2024-01-07T03:30:31Z

Can you enable ZHA debug logging for an hour before it happens and then disable it once the integration is not working?

keithcroshaw · 2024-01-07T03:41:50Z

I will test around 8, make sure it’s working and then enable it.

cameron686 · 2024-01-07T03:48:29Z

> Can you enable ZHA debug logging for an hour before it happens and then disable it once the integration is not working?

It usually happens in the middle of the night, but it's really kind of random, so I don't have any way of knowing one hour ahead of time before it happens. I have just updated to the newest minor update and re-enabled debug logging. I'll post logs the next time it happens.

brylee123 · 2024-01-07T16:47:16Z

For me, this happens around 7 AM EST daily for the last two or three days. I have the HUSBZB-1 by Nortek (HubZ Smart Home Controller). Running 2024.1.2 on a RPi4

Edit: Now it does this around 7 AM and 9 AM.

dmulcahey · 2024-01-07T17:21:51Z

Do any folks with seemingly reproducible times have any integrations or backups or automations scheduled at the same time? We had a user last week with the Google calendar custom component stuttering the event loop causing stability issues. We run in the event loop… and the radio stacks are latency sensitive… anything else stuttering the loop will cause stuff like this.
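The effect described here can be reproduced outside Home Assistant. The sketch below is plain `asyncio`, not ZHA code: a hypothetical `heartbeat` task expects to run every 20 ms, and a hypothetical `blocking_job` calls `time.sleep()` inside the event loop. While the blocking call runs, nothing else on the loop can make progress, so the heartbeat is late by roughly the duration of the block.

```python
import asyncio
import time


async def heartbeat(interval, gaps, beats):
    """Record the real time between beats; ideally it stays close to `interval`."""
    last = time.monotonic()
    for _ in range(beats):
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def blocking_job(seconds):
    """A badly behaved task: time.sleep() blocks the whole event loop."""
    await asyncio.sleep(0.05)  # let the heartbeat get going first
    time.sleep(seconds)        # no other task can run during this call


async def main():
    gaps = []
    await asyncio.gather(heartbeat(0.02, gaps, 10), blocking_job(0.3))
    return gaps


gaps = asyncio.run(main())
print(f"longest gap between beats: {max(gaps):.3f}s (interval was 0.020s)")
```

A latency-sensitive radio driver sharing the loop with such a task would see the same kind of delay, which is one plausible way a watchdog or command timeout could be triggered.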

keithcroshaw · 2024-01-07T17:54:47Z

I have a Google calendar check at midnight but my issue is observed around 10 PM. I’ll do some more testing to see if it happens earlier. The rest of my automations are in Node-Red. Are they separate from the event loop you’re talking about? NR has many more automations to go through then.

Huug4922 · 2024-01-07T21:40:23Z

I have the same problem. Sometimes it happens in one hour, sometimes longer. I have the Conbee3 running.

2024-01-07 21:34:13.947 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 3, 'value': 31105, 'timestamp': 1704659653.928778, 'min_update_delta': 30.0}) completed
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 4, 'value': 27085, 'timestamp': 1704659653.92881, 'min_update_delta': 30.0})
2024-01-07 21:34:13.948 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 4, 'value': 27085, 'timestamp': 1704659653.92881, 'min_update_delta': 30.0}) completed
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 7, 'value': 397, 'timestamp': 1704659653.928837, 'min_update_delta': 30.0})
2024-01-07 21:34:13.949 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, '\n            INSERT INTO attributes_cache_v12\n            VALUES (:ieee, :endpoint_id, :cluster_id, :attrid, :value, :timestamp)\n                ON CONFLICT (ieee, endpoint_id, cluster, attrid) DO UPDATE\n                SET value=excluded.value, last_updated=excluded.last_updated\n                WHERE\n                    value != excluded.value\n                    OR :timestamp - last_updated > :min_update_delta\n            ', {'ieee': 00:17:88:01:0d:54:17:3b, 'endpoint_id': 11, 'cluster_id': 768, 'attrid': 7, 'value': 397, 'timestamp': 1704659653.928837, 'min_update_delta': 30.0}) completed
2024-01-07 21:34:13.950 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:13.950 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method commit of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:14.452 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x8182040001004c0cda81
2024-01-07 21:34:14.453 WARNING (MainThread) [zigpy_deconz.api] Unknown command received: Command(command_id=<CommandId.undefined_0x81: 129>, seq=130, payload=b'\x04\x00\x01\x00L\x0c\xda\x81')
2024-01-07 21:34:15.393 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-07 21:34:15.394 DEBUG (MainThread) [zigpy_deconz.api] Sending CommandId.device_state{} (seq=9)
2024-01-07 21:34:15.394 DEBUG (MainThread) [zigpy_deconz.uart] Send: 0709000800000000
2024-01-07 21:34:15.504 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x22010049004200287472616e736c6174696f6e4c617965725f696e6974506c6174666f726d29436f6e4265652053746172746564205243415553453a205b30303030303230315d0a0d
2024-01-07 21:34:15.505 WARNING (MainThread) [zigpy_deconz.api] Unknown command received: Command(command_id=<CommandId.undefined_0x22: 34>, seq=1, payload=b'\x00I\x00B\x00(translationLayer_initPlatform)ConBee Started RCAUSE: [00000201]\n\r')
2024-01-07 21:34:15.505 DEBUG (MainThread) [zigpy_deconz.uart] Frame received: 0x0e020007002200
2024-01-07 21:34:15.505 DEBUG (MainThread) [zigpy_deconz.api] Received command CommandId.device_state_changed{'status': <Status.SUCCESS: 0>, 'frame_length': 7, 'device_state': DeviceState(network_state=<NetworkState2.CONNECTED: 2>, device_state=<DeviceStateFlags.APSDE_DATA_REQUEST_FREE_SLOTS_AVAILABLE: 8>), 'reserved': 0} (seq 2)
2024-01-07 21:34:17.197 DEBUG (MainThread) [zigpy_deconz.api] No response to 'CommandId.device_state' command with seq 9
2024-01-07 21:34:17.197 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 586, in _command
    return await fut
           ^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 665, in _watchdog_loop
    await self.watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 647, in watchdog_feed
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/zigbee/application.py", line 91, in _watchdog_feed
    await self._api.get_device_state()
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 898, in get_device_state
    rsp = await self.send_command(CommandId.device_state)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 508, in send_command
    return await self._command(cmd, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy_deconz/api.py", line 585, in _command
    async with asyncio_timeout(COMMAND_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2024-01-07 21:34:17.203 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2024-01-07 21:34:17.204 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2024-01-07 21:34:17.204 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2024-01-07 21:34:17.204 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication
2024-01-07 21:34:17.206 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f10646cb1a0>)
2024-01-07 21:34:17.206 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f10646cb1a0>) completed
2024-01-07 21:34:17.206 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, 'PRAGMA wal_checkpoint;', [])
2024-01-07 21:34:17.214 DEBUG (MainThread) [zigpy_deconz.api] Serial '/dev/serial/by-id/usb-dresden_elektronik_ConBee_III_DE03190147-if00-port0' connection lost unexpectedly: None
2024-01-07 21:34:17.443 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f10647bdb70>, 'PRAGMA wal_checkpoint;', []) completed
2024-01-07 21:34:17.444 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f1062e1b920>)
2024-01-07 21:34:17.445 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.<locals>.<lambda> at 0x7f1062e1b920>) completed
2024-01-07 21:34:17.446 DEBUG (Thread-6) [aiosqlite] executing functools.partial(<built-in method close of sqlite3.Connection object at 0x7f10647bdb70>)
2024-01-07 21:34:17.448 DEBUG (Thread-6) [aiosqlite] operation functools.partial(<built-in method close of sqlite3.Connection object at 0x7f10647bdb70>) completed
2024-01-07 21:34:17.551 WARNING (MainThread) [homeassistant.helpers.dispatcher] Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7f1073751b90>>
2024-01-07 21:34:17.576 DEBUG (MainThread) [homeassistant.components.zha.entity] light.tradfri_kast_light: stopped polling during device removal
2024-01-07 21:34:17.576 DEBUG (MainThread) [homeassistant.components.zha.entity] light.keuken_aanrecht_led: stopped polling during device removal
2024-01-07 21:34:17.577 DEBUG (MainThread) [homeassistant.components.zha.entity] light.muurspot_deur_onder_licht: stopped polling during device removal
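The frame that zigpy_deconz logs as an unknown command at 21:34:15.504 contains readable ASCII. The short Python sketch below decodes the hex copied from that log line. The byte offsets it uses (command ID first, sequence number second, payload after that) are read off the adjacent "Unknown command received" line, not taken from the deCONZ serial protocol documentation, so treat them as an assumption.

```python
# Frame bytes copied from the "Frame received" debug line at 21:34:15.504.
FRAME_HEX = (
    "22010049004200287472616e736c6174696f6e4c617965725f696e6974"
    "506c6174666f726d29436f6e426565205374617274656420524341555345"
    "3a205b30303030303230315d0a0d"
)

frame = bytes.fromhex(FRAME_HEX)
command_id = frame[0]  # logged as undefined_0x22
seq = frame[1]         # logged as seq=1

# Replace non-printable bytes with "." so the embedded text stands out.
text = "".join(chr(b) if 32 <= b < 127 else "." for b in frame[2:])

print(f"command_id=0x{command_id:02x} seq={seq}")
print(text)
```

The embedded text reads "ConBee Started", which suggests the stick itself restarted shortly before the watchdog timed out. That is an interpretation of this one log excerpt, not a confirmed diagnosis.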

keithcroshaw · 2024-01-08T04:17:23Z

No issue tonight, of course. Maybe the latest update fixed it, or it's a coincidence. I can upload boring success logs if anybody wants.

cameron686 · 2024-01-08T18:19:28Z

> No issue tonight, of course. Maybe the latest update fixed it, or it's a coincidence. I can upload boring success logs if anybody wants.

Since I upgraded to 2024.1.2, I haven't seen this issue return either. I haven't gone this long without a ZHA reset in quite some time, so I'm hopeful that something in the most recent release fixed whatever was causing this problem.

Strangely, after I upgraded, several of my Leviton in-wall Zigbee switches stopped responding. I wasn't even able to re-pair them without flipping the breaker to power them off and back on. They've been stable since then, though. My Zigbee network also seems a lot more responsive than it was before. I'm keeping my fingers crossed that everything stays this way.

Huug4922 · 2024-01-09T22:44:11Z

Mine still keeps dropping. Any ideas where to look? See my debug logging above.

Updated to the latest Core, 2024.1.2. Running in a VM on my Synology NAS.

puddly · 2024-01-09T22:54:09Z

@Huug4922 (and anyone else commenting) please enable ZHA debug logging and post the full debug log.

Huug4922 · 2024-01-10T17:35:48Z

@puddly, sure, but I had to make the file much smaller. It was 125 MB. Everything before 20:00 has been removed.

home-assistant_zha_2024-01-07T21-30-12.065Z_shorted.log

cameron686 · 2024-01-13T04:10:07Z

Looks like I spoke too soon when I said this was working for me. It happened again last night; I woke up this morning and found that all but 1 or 2 of my Leviton in-wall Zigbee switches and several Sonoff plug-in switches had stopped working. I had to shut off power to all of them again in order to get ZHA to re-pair (otherwise it would just get stuck on "Configuring").

Strangely, every time this happens, it's only line-powered devices that drop off the network. I've never had a battery-powered device drop.

This time, the system logs show a watchdog time-out, not an AppController restart. Unfortunately, I had turned off debug logging as I thought the problem was fixed. I've upgraded to 2024.1.3 and turned debug logging back on. Now it's just a matter of waiting.
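For readers unfamiliar with what a watchdog time-out means here: judging from the traceback posted earlier in this thread, the watchdog periodically sends a command to the radio and treats a missing reply as a failure. The sketch below imitates that pattern in plain `asyncio`. It is not the zigpy implementation: `asyncio.wait_for` stands in for the `asyncio_timeout` context manager seen in the traceback, the `COMMAND_TIMEOUT` value is made up, and the function names are only borrowed for readability.

```python
import asyncio

COMMAND_TIMEOUT = 0.1  # illustrative value, not the library's real constant


async def send_command(reply):
    """Wait for the reply future; give up after COMMAND_TIMEOUT seconds."""
    return await asyncio.wait_for(reply, timeout=COMMAND_TIMEOUT)


async def watchdog_feed():
    loop = asyncio.get_running_loop()
    reply = loop.create_future()  # never resolved: the radio stays silent
    try:
        await send_command(reply)
    except asyncio.TimeoutError as exc:
        return f"Watchdog failure: {type(exc).__name__}"
    return "ok"


result = asyncio.run(watchdog_feed())
print(result)
```

Either a radio that really stopped answering or an event loop that was too busy to process the answer in time would end in the same `TimeoutError`, so the message alone does not tell the two apart.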

slunat · 2024-01-13T16:00:22Z

I have been having this issue ever since updating to 2024.1.1, and I have also experienced it on 2024.1.2 and today updated to 2024.1.3. The issue still persists. I started the debug logging during the initializing loop so logs from before it started will be missing, but ZHA restarted several times after starting logging so hopefully there is something useful in here that indicates what is causing it to need to reinitialise.

home-assistant_zha_2024-01-13T15-55-37.899Z.log

dmulcahey · 2024-01-13T21:31:43Z

> I have been having this issue ever since updating to 2024.1.1, and I have also experienced it on 2024.1.2 and today updated to 2024.1.3. The issue still persists. I started the debug logging during the initializing loop so logs from before it started will be missing, but ZHA restarted several times after starting logging so hopefully there is something useful in here that indicates what is causing it to need to reinitialise.
>
> home-assistant_zha_2024-01-13T15-55-37.899Z.log

we have a fix for the looping coming soon

cameron686 · 2024-01-14T03:53:44Z

@dmulcahey @puddly

Log Files Are Here

I think I should have some good log data for you. Today has been an extremely bad day for my Zigbee network. On multiple occasions, all my Zigbee devices became unavailable. It appears as though each instance corresponds with either a "Watchdog Error" or a "Watchdog Timeout" in the system log. The network is also extremely slow to recover; some devices require multiple attempts to switch them on/off before they start responding again.

I'm also now seeing warnings for Zigbee channel utilization, but I have nothing that could be interfering. My WiFi network is not on an overlapping channel, and I don't have neighbors. I did a WiFi scan to be sure; nothing nearby is interfering with my Zigbee network.

As far as I can tell, these are the times of interest (all on 1/13):
04:32-04:39 (multiple resets during this period)
05:12-05:15 (multiple resets during this period)
13:44
14:03
14:27
15:43
16:08

stefan814 · 2024-01-16T13:57:58Z

I'm having the same issue. Will turn on debugging and share logs. Have also noticed my Bluetooth integration becomes unavailable at the same time.

Has anyone successfully moved to Z2M as a solution?

dmulcahey · 2024-01-16T14:19:49Z

We have a fix for one of the causes of the excessive reloading. It will be in the next point release.

muller119 · 2024-01-16T20:42:46Z

I hope it will fix mine too. To be sure, I will add my debug log as well.
home-assistant_zha_2024-01-16T19-03-07.325Z.log

stefan814 · 2024-01-17T01:15:05Z

My logs here
Uploading home-assistant_zha_2024-01-17T00-51-47.466Z.log…

neoback45 · 2024-01-25T21:28:58Z

Hello! I have the same problem, with a Sonoff dongle and a Synology NAS!
Is there an update on this problem?
Logs:

Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:388
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 21:37:25 (1 occurrences)
Last logged: 21:37:25

[0xDF2A:1:0x0001]: 'async_initialize' stage failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 326, in request
    return await req.result
           ^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 186, in sem_task
    return await task
           ^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 388, in async_initialize
    await self._get_attributes(
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 490, in _get_attributes
    read, _ = await self.cluster.read_attributes(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 524, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 325, in request
    async with asyncio_timeout(timeout):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError

Huug4922 · 2024-01-25T21:33:13Z

> Hello! I have the same problem! With sonoff and nas Synology! There is a update with this problem? Logs :

Unfortunately updating HA to 2024.1.5 didn't change anything for me. Using Conbee3 and a VM on my Synology DS920+. See my logs above.

neoback45 · 2024-01-25T21:36:52Z

> > Hello! I have the same problem! With sonoff and nas Synology! There is a update with this problem? Logs :
>
> Unfortunately updating HA to 2024.1.5 didn't change anything for me. Using Conbee3 and a VM on my Synology DS920+. See my logs above.

how often does this happen?

puddly · 2024-01-25T22:41:01Z

@Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

neoback45 · 2024-01-26T06:14:39Z

> @Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

And for sonoff?

brylee123 · 2024-01-26T13:10:36Z

And for HUSBZB-1?

cameron686 · 2024-01-27T06:44:28Z

> And for sonoff?

I am also still having issues with my Sonoff controller. The last few updates have improved the situation: I'm no longer seeing frequent resets one after another, but I am still seeing resets roughly once per day. Each reset usually corresponds with one or more Zigbee devices needing to be either power-cycled or (more often) re-paired to the network.

I've already posted logs several times, but I can post more if needed.

neoback45 · 2024-01-27T08:06:25Z

> > And for sonoff?
>
> I am also still having issues with my Sonoff controller. The last few updates have improved the situation: I'm no longer seeing frequent resets one after another, but I am still seeing resets roughly once per day. Each reset usually corresponds with one or more Zigbee devices needing to be either power-cycled or (more often) re-paired to the network.
>
> I've already posted logs several times, but I can post more if needed.

Hello

can you explain how you do it?

r0bb10 · 2024-01-27T10:31:51Z

Same here; over the last few days it has always happened around the same time. The whole Zigbee network goes down for about a minute while the integration reloads, and the HA log fills up with errors in the meantime. While almost everything comes back online, some devices had to be re-paired (mains-powered routers and a couple of end devices).

The coordinator is a Sonoff-P (CC2652).

slunat · 2024-01-27T12:57:09Z

Just wanted to say the issue is mostly fixed for me since updating to 2024.1.4, it no longer reinitializes the addon over and over again.

However, I am still having the occasional restart, which normally comes with the “NCP entered failed state. Requesting APP controller restart” message. From what I’ve seen, this is due to a loss of connection to the Zigbee dongle, which is common with network dongles. However, I am using a SkyConnect plugged in via USB. My server does run in a VMware Workstation VM that I pass the device through to, so maybe the passthrough is dropping the connection. I’m going to move my install over to bare metal soon and see if it improves.

mirceadamian · 2024-01-27T13:07:21Z

@slunat before 2023.12.1 my HUSBZB-1 never lost connection and I had never seen the "NCP entered failed state" message. Something has changed there which is causing issues with all these sticks (I count at least SkyConnect, Nortek HUSBZB-1, Sonoff and ConBee III).
I have even changed the stick from HUSBZB-1 to Sonoff Dongle-P and re-registered all my sensors one by one (I did not migrate the radio) and the situation did not change. So for now I'm back to 2023.11.3, up for 22 days and I have not seen once the message again.
I'm running HA in a docker container on a Synology NAS.

I think we have posted logs for all these sticks; I'm not sure how we can help get to the bottom of this bug.
I have a suspicion that either something is going wrong with the commands sent to the controller (something that causes the controller to become unresponsive), or there is simply too much traffic on the serial port, which hogs the communication and causes the restart.

neoback45 · 2024-01-27T21:58:04Z

> @slunat before 2023.12.1 my HUSBZB-1 never lost connection and I had never seen the "NCP entered failed state" message. Something has changed there which is causing issues with all these sticks (I count at least SkyConnect, Nortek HUSBZB-1, Sonoff and ConBee III).
>
> I have even changed the stick from HUSBZB-1 to Sonoff Dongle-P and re-registered all my sensors one by one (I did not migrate the radio) and the situation did not change. So for now I'm back to 2023.11.3, up for 22 days and I have not seen once the message again.
>
> I'm running HA in a docker container on a Synology NAS.
>
> I think we have posted logs for all these sticks; I'm not sure how we can help get to the bottom of this bug.
>
> I have a suspicion that either something is going wrong with the commands sent to the controller (something that causes the controller to become unresponsive), or there is simply too much traffic on the serial port, which hogs the communication and causes the restart.

Yes, I have the same problem! 2023.11.3 works with no problem!

cameron686 · 2024-01-28T03:19:21Z

> However I am still having the occasional restart, which normally comes with the “NCP entered failed state. Requesting APP controller restart”

@slunat I used to see that message frequently until one or two updates ago. It seems like that error has disappeared, but I'm still seeing "Watchdog failure" and "Watchdog heartbeat timeout: TimeoutError()" on a regular basis (twice today, for instance). Are you seeing the watchdog errors as well? Judging from what I'm reading here, it appears that these ZHA reset issues are fairly widespread.

I am also running HA on a VM with USB passthrough; I'm using Proxmox, though, not VMware. I thought that perhaps the USB passthrough was causing issues, but I'm passing through several other USB devices to different VMs, and they have a much higher data rate than my Zigbee controller. They've been running for months with no interruption. My HA VM also gets USB passthrough for a UPS, and it is rock solid. I don't think USB passthrough is the problem, although I am very curious to see what happens if you decide to switch over to a bare metal installation. I'm reluctant to set up a server just for HA when I have a perfectly good hypervisor already running; the UPS in my server rack is already close to capacity, and more hardware would likely push it over the edge.

cameron686 · 2024-01-28T03:20:42Z

> Hello
>
> can you explain how you do it?

@neoback45 I mentioned a few different things in my comment. What specifically were you wondering how to do?

neoback45 · 2024-01-28T07:46:43Z

> > Hello
> >
> > can you explain how you do it?
>
> @neoback45 I mentioned a few different things in my comment. What specifically were you wondering how to do?

For this :

> I am also still having issues with my Sonoff controller. The last few updates have improved the situation: I'm no longer seeing frequent resets one after another, but I am still seeing resets roughly once per day. Each reset usually corresponds with one or more Zigbee devices needing to be either power-cycled or (more often) re-paired to the network.
>
> I've already posted logs several times, but I can post more if needed.

tjerkw · 2024-01-31T19:48:36Z

My ZHA crashed too, all devices are offline. I reported it here (including the log on the moment it broke):
#107200 (comment)

They might be related, my network is also slow.
I'm using a Home Assistant Yellow with multiprotocol disabled, on HA Core 2024.1.5.

Huug4922 · 2024-01-31T20:46:08Z

> @Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

Here is my debug logging when reloading the integration. Still using Conbee III.
error_log.txt

neoback45 · 2024-01-31T20:57:38Z

> > @Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.
>
> @brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

I upgraded to 2024.1.6 and ZHA works for the moment...

r0bb10 · 2024-01-31T21:03:34Z

> > @Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.
>
> @brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

I also updated, and it works as before.

mirceadamian · 2024-02-03T14:00:31Z

I think it is time to think about giving up on ZHA. This is broken for me since 2023.12.1, so about 2 months and so far I did not see much progress fixing the problem. I did not see the owners posting any questions either to help making some progress.
So probably it is a dying integration. Sad.

dmulcahey · 2024-02-03T14:07:48Z

> I think it is time to think about giving up on ZHA. This is broken for me since 2023.12.1, so about 2 months and so far I did not see much progress fixing the problem. I did not see the owners posting any questions either to help making some progress. So probably it is a dying integration. Sad.

lol talk about dramatic. If you don’t see us working on these tickets then you clearly aren’t paying attention.

Anyway, IF you are on a SI stick we may have found something that could help. We are testing a patch now.

mirceadamian · 2024-02-03T14:10:58Z

That's encouraging. I have a Sonoff ZBDongle-P. Very happy to help if I can.

Huug4922 · 2024-02-03T16:00:54Z

@Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

@brylee123 I just upgraded to 2024.1.6 which completely broke the ZHA integration. It is not loading any more and I had to restore to 2024.1.5.

I also updated, and it works as before.

Today I tried again to upgrade to 2024.1.6, but the integration still won't load. Hopefully someone can take a look at my debug logging.
home-assistant_zha_2024-02-03T15-59-01.903Z.log

r0bb10 · 2024-02-05T10:17:40Z

It crashes regularly, almost once a day. It also happened within the last hour.

home-assistant_zha_2024-02-05T10-09-14.097Z.log

I can leave ZHA debugging enabled if needed and try to record what happens next time.

artinbastani · 2024-02-11T14:29:17Z

I was on a Nortek HUSBZB-1 and it was crashing about 10 times a day, so I bought a SkyConnect and migrated over to it. It crashes less now, maybe once or twice a day, but it is still crashing. I've attached my logs.

zha.log

puddly · 2024-02-11T17:43:14Z

@r0bb10 Please enable debug logging for the entire duration.

@artinbastani In your log file, I also see no crashing. Just devices reporting power usage.

artinbastani · 2024-02-11T18:20:43Z

Strange, all of my devices went offline at 4:12 and again at 4:14. How much time prior to the event would be useful?

puddly · 2024-02-11T18:22:31Z

Half an hour before and after would be fine. You can leave it on for longer, though: the more context, the better.

artinbastani · 2024-02-11T18:39:36Z

ok. see if you find anything in this....

zha2.log

r0bb10 · 2024-02-12T10:25:11Z

@r0bb10 Please enable debug logging for the entire duration.

@artinbastani In your log file, I also see no crashing. Just devices reporting power usage.

Take a look at this. It seems to have happened at 01:11 last night, judging by a search for "Coordinator is disconnected". The ZHA debug logs are too massive to share in full, so I had to reduce the time window.

home-assistant_zha_2024-02-12T09-50-14.907Z.log
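
For anyone else whose debug log is too large to attach, here is a minimal Python sketch of one way to cut it down to the half hour on either side of the event, as requested above. It assumes each log entry starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp (my understanding of the default Home Assistant log format, so adjust it if yours differs). `parse_ts` and `trim_log` are hypothetical helper names, not part of Home Assistant or ZHA.

```python
from datetime import datetime, timedelta

# Assumed timestamp prefix, e.g. "2024-02-12 01:11:03.123"
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LEN = len("2024-02-12 01:11:03.123")


def parse_ts(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    try:
        return datetime.strptime(line[:TS_LEN], TS_FORMAT)
    except ValueError:
        return None


def trim_log(lines, marker, margin=timedelta(minutes=30)):
    """Keep only the lines within `margin` of any line containing `marker`."""
    lines = list(lines)

    # First pass: collect the timestamps of every line that mentions the marker.
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and marker in line:
            hits.append(ts)

    # Second pass: keep lines close to a hit. Lines without a timestamp
    # (tracebacks, wrapped messages) follow the preceding timestamped line.
    kept = []
    keep = False
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            keep = any(abs(ts - hit) <= margin for hit in hits)
        if keep:
            kept.append(line)
    return kept
```

To use it, open the full log, pass the file object and the search string to `trim_log` (for example `trim_log(f, "Coordinator is disconnected")`), and write the returned lines to a new file. If the marker never appears, the result is empty, which is a hint that the string to search for is different in your log.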

Huug4922 · 2024-02-12T22:55:12Z

@Huug4922 There's currently a bug with the Conbee III radio library. It'll be fixed in the upcoming beta or the next point release.

I updated to 2024.2.1 today. It has been stable for 10 hours now. Thanks!

issue-triage-workflows · 2024-05-12T23:05:16Z

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

cameron686 mentioned this issue Jan 6, 2024

ZHA/Sonoff Zigbee dongle restarts causing entire zigbee network to be unavailable for at least several seconds #107298

Closed

issue-triage-workflows bot added the stale label May 12, 2024

issue-triage-workflows bot closed this as not planned May 19, 2024

github-actions bot locked and limited conversation to collaborators Jun 19, 2024
