WebSocket connection seems to die and not reconnect #96

Closed
shlomki opened this issue Mar 21, 2021 · 16 comments

Comments

@shlomki

shlomki commented Mar 21, 2021

Running latest stable version of all components: HA (2021.3.4), Deconz (2.09.03), and Conbee II (26680700).

At some random point in the day, the websocket connection seems to die.
This causes entities coming from Deconz to not get updated in HA (sensors, lights, etc).

Insights:

  1. At some point pydeconz.websocket debug logs are no longer printed to the log, suggesting that either the connection has been dropped or maybe the thread has died. However, I still see log entries for pydeconz.gateway.
  2. I couldn't find any errors in the log.
  3. I’m able to control entities from HA, but their state is not updated in HA. For example, I’d turn on the light from HA, the light would actually turn on but the state would remain OFF in HA, while in Deconz it is ON.
    The last message I see from pydeconz.websocket looks like a normal state update message:

2021-03-21 10:40:29 DEBUG (MainThread) [pydeconz.websocket] {"attr":{"id":"53","lastannounced":"2021-03-15T11:53:13Z","lastseen":"2021-03-21T10:40Z","manufacturername":"LUMI","modelid":"lumi.plug","name":"REDACTED","swversion":"09-04-2018","type":"Smart plug","uniqueid":"REDACTED"},"e":"changed","id":"53","r":"lights","t":"event","uniqueid":"REDACTED"}

  4. After this, there are no more log entries from pydeconz.websocket.

@Kane610
Owner

Kane610 commented Mar 21, 2021

When did this start to happen?
How are you running hass and deconz?
Since you can reproduce this, perhaps you can try different combinations of hass and deconz to pinpoint when this started happening to you?

@shlomki
Author

shlomki commented Mar 21, 2021

> When did this start to happen?

It's hard to pin-point the exact date, but it started happening about 2 months ago.

> How are you running hass and deconz?

I'm running Hassio in docker (using hassio_supervisor on unraid), with deconz installed on a separate server (rpi). Both are connected by ethernet cable directly to the router, so the network is probably not the cause, though I'm not completely ruling it out. Even if it is in fact network dropouts, I'd expect the connection to be re-established automatically.

> Since you can reproduce this, perhaps you can try different combinations of hass and deconz to pinpoint when this started happening to you?

I was thinking of ditching hassio and going strictly HA core as a docker, but I need to sort out a few things before I can make the switch.
I used to run deconz as a docker on the same machine as HA but had USB mapping problems I couldn't resolve, so I'm not sure that would work.
EDIT: I gave it a go anyway and switched deconz to run on docker. Will update soon if the problem returns.

@shlomki
Author

shlomki commented Mar 21, 2021

Ok, this happened much sooner than expected- this is happening when deconz is running on docker as well, running on the same host as HA.
I think that pretty much rules out deconz or network connectivity being the culprit.
I guess the next thing I could try is to run home assistant core on docker as well, instead of hassio.
Any other ideas to try?

@Kane610
Owner

Kane610 commented Mar 25, 2021

> Ok, this happened much sooner than expected- this is happening when deconz is running on docker as well, running on the same host as HA.
> I think that pretty much rules out deconz or network connectivity being the culprit.
> I guess the next thing I could try is to run home assistant core on docker as well, instead of hassio.
> Any other ideas to try?

I'm not sure what it rules out. If this were a widespread issue, more people would have reported it. Do you mean it just happens after a few minutes?
And in the debug logs of hass the websocket messages continue to arrive but no events inside of hass?

I guess going back to earlier versions of hass to verify if it stops breaking is valuable

@shlomki
Author

shlomki commented Mar 26, 2021

> I'm not sure what it rules out. If this were a widespread issue, more people would have reported it.

When I posted about this I found people who were having the same problem and decided to abandon deconz completely over this. So it might just be that they didn't bother to report.

> Do you mean it just happens after a few minutes?

I meant that I expected to have to wait for a day or two, but it happened in a matter of a few hours. Then it started happening once every day or two again.

> And in the debug logs of hass the websocket messages continue to arrive but no events inside of hass?

No, when the problem happens the websocket messages do not arrive anymore.

> I guess going back to earlier versions of hass to verify if it stops breaking is valuable

I just went back to as early as 0.118, I'll report back if this happens again with this version.
Thanks again for your help on this!

@Kane610
Owner

Kane610 commented Mar 26, 2021

The main challenge is pinpointing the issue, since it's not happening to me.

@shlomki
Author

shlomki commented Mar 31, 2021

Ok, so I spent a few days with 0.118 and also 0.117 - both exhibit the same problem.
So I really don't know where to take this from here. If I was at a loss about this problem earlier, now I'm completely clueless.
The only thing I haven't tried is to switch to HA Core on a docker instead of Hassio on a docker, I'll be trying that in the upcoming few days.
Any other ideas/suggestions?

@Kane610
Owner

Kane610 commented Mar 31, 2021

Well what else are you running in your instance? It could be something else affecting the stability of the system.

I made some minor improvements to the retry mechanisms of the websocket, though I don't know how big of an effect they will have. I also improved logging a bit. Both will be part of the 2021.4 beta scheduled to be released today.

It's really basic code being used, so it is unexpected that it just hangs for you.
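
For readers unfamiliar with what a websocket retry mechanism generally looks like, here is a minimal sketch of a reconnect loop with exponential backoff. It is not pydeconz's actual code: `run_with_reconnect`, `connect_and_listen`, and all parameters are hypothetical names chosen for illustration, and the set of exceptions treated as "connection lost" is an assumption.

```python
import asyncio


async def run_with_reconnect(
    connect_and_listen,
    *,
    base_delay=1.0,
    max_delay=60.0,
    max_attempts=None,
    sleep=asyncio.sleep,
):
    """Call connect_and_listen() again whenever it returns or fails.

    connect_and_listen is a hypothetical coroutine function that opens the
    websocket and only returns (or raises) once the connection is gone.
    Returns the number of attempts made (only reachable with max_attempts).
    """
    attempts = 0
    failures = 0  # consecutive failed attempts, drives the backoff
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            await connect_and_listen()
            failures = 0  # session ended without an error: reset the backoff
        except (OSError, asyncio.TimeoutError) as err:
            failures += 1
            print(f"websocket attempt {attempts} failed: {err!r}")
        # Wait base_delay after a clean session or a first failure, then
        # double the wait for each further consecutive failure, up to max_delay.
        delay = min(base_delay * 2 ** max(failures - 1, 0), max_delay)
        await sleep(delay)
    return attempts
```

Note that a loop like this only helps when the connection visibly fails. If the reader task stops without raising or returning, as the symptoms in this thread suggest, nothing triggers the retry, which is why extra logging is useful alongside it.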

@shlomki
Author

shlomki commented Apr 1, 2021

I have some dockers running on the same host, like: mosquitto, unifi-controller, embyServer, deluge, radarr, sonarr, jackett, bazarr. That's it, the rest are the hassio dockers, which I'll either be converting to ha core or just use a proper vm.

Thank you so much for taking the time to add retry mechanisms and more logs, I really appreciate it!
I've joined the beta channel, and will upgrade shortly.

@Kane610
Owner

Kane610 commented Apr 1, 2021

It's been out since last night ;)

I want it as stable as possible. That's why I still refactor it and improve testing.

@shlomki
Author

shlomki commented Apr 2, 2021

So I've switched over to a hassio VM with the latest beta. I even had to delete the deconz integration and set it up again, so I hoped that might do some voodoo wonders.
But alas, the problem is still happening.

These are the last 4 log lines I see in the debug log. Notice the time gap between the first two and the last two, the problem seems to have occurred at that point.
2021-04-02 16:19:14 DEBUG (MainThread) [pydeconz.websocket] {"attr":{"id":"29","lastannounced":null,"lastseen":"2021-04-02T13:19Z","manufacturername":"Heiman","modelid":"TS0003","name":"Bedroom Main Toggle Left","swversion":null,"type":"On/Off light","uniqueid":"ec:1b:bd:ff:fe:2d:a6:4e-02"},"e":"changed","id":"29","r":"lights","t":"event","uniqueid":"ec:1b:bd:ff:fe:2d:a6:4e-02"}

2021-04-02 16:19:19 DEBUG (MainThread) [pydeconz.websocket] {"attr":{"id":"3","lastannounced":null,"lastseen":"2021-04-02T13:19Z","manufacturername":"Heiman","modelid":"TS0002","name":"Living Room Hallway","swversion":null,"type":"On/Off light","uniqueid":"ec:1b:bd:ff:fe:65:ae:88-02"},"e":"changed","id":"3","r":"lights","t":"event","uniqueid":"ec:1b:bd:ff:fe:65:ae:88-02"}

2021-04-02 16:30:00 DEBUG (MainThread) [pydeconz.gateway] Sending "put" "{'on': True}" to "192.168.1.10 /lights/4/state"

2021-04-02 16:30:00 DEBUG (MainThread) [pydeconz.gateway] HTTP request response: [{'success': {'/lights/4/state/on': True}}]
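
The gap described above can also be found mechanically. The following is a rough sketch that scans log lines in the format shown in this comment and reports how long `pydeconz.gateway` kept logging after the last `pydeconz.websocket` entry. The helper names `last_seen` and `websocket_silence` are made up for this example, and the regular expression assumes the exact line prefix seen in the excerpts here.

```python
import re
from datetime import datetime

# Matches the prefix of lines such as:
# 2021-04-02 16:19:19 DEBUG (MainThread) [pydeconz.websocket] {...}
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ \(\S+\) \[([\w.]+)\]"
)


def last_seen(lines):
    """Return {logger_name: datetime of its last entry}.

    Assumes the lines are in chronological order, as in a normal log file.
    """
    seen = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # blank lines, tracebacks, or other formats
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        seen[match.group(2)] = stamp
    return seen


def websocket_silence(lines, ws="pydeconz.websocket", other="pydeconz.gateway"):
    """Seconds between the last websocket entry and the last gateway entry.

    Returns None if either logger never appears. A large positive value
    means the gateway kept logging after the websocket went quiet.
    """
    seen = last_seen(lines)
    if ws not in seen or other not in seen:
        return None
    return (seen[other] - seen[ws]).total_seconds()
```

It could be run against a log file with something like `websocket_silence(open("home-assistant.log", encoding="utf-8"))`, where the file name is an assumption about the local setup.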

@Kane610
Owner

Kane610 commented Apr 2, 2021

And no crashes or anything? Could you try out disabling all other integrations to see if that affects anything?

@shlomki
Author

shlomki commented Apr 2, 2021

No crashes or any sign of a problem, until I notice that a few of my sensor-based automations have stopped running.

Luckily, most of my smart home is based on zigbee, so I was able to disable all other integrations as well as custom_integrations without too much disruption to see what's causing this. Will report back.

@shlomki
Author

shlomki commented Apr 3, 2021

It's possible that I'm starting to celebrate a little too early because less than 24 hours have passed since I've disabled all integrations, but I think there's a good chance that we've finally found the problem!

When I started disabling the integrations one by one, I realized that there's a specific custom integration that I added in early December (along with some other additions), and right around that time this problem started happening. I didn't realize that it might have something to do with deconz, so I didn't think of this earlier.
I looked into its code and then saw that this integration was using the HA event loop and some stability issues were reported by other users, and that it was fixed in February (but I wasn't aware that there was an update).
As of now I still have all integrations disabled except for deconz and so far so good - no issues. I'll wait for a few more days to see if this has actually been resolved.
Fingers crossed!
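
The thread does not spell out how the custom integration interfered with the websocket, so the following is only a general illustration of why code sharing one asyncio event loop can disturb unrelated tasks. In this sketch, `heartbeat` stands in for a websocket reader, `blocking_poll` for a badly behaved coroutine that makes a blocking call on the loop, and `offloaded_poll` for the same work moved to an executor. All names are hypothetical, and the sketch does not claim to reproduce the exact failure reported here.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    """Stand-in for a websocket reader: records a timestamp per iteration."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_poll(duration):
    """Misbehaving coroutine: a blocking call stalls every task on the loop."""
    time.sleep(duration)


async def offloaded_poll(duration):
    """Same work, but run in a worker thread so the loop stays responsive."""
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, duration)


async def longest_gap(poll, duration=0.3):
    """Run poll() next to the heartbeat and return the longest pause between ticks."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)  # let the heartbeat record a few ticks first
    await poll(duration)
    await asyncio.sleep(0.05)  # and a few more afterwards
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))
```

Comparing `asyncio.run(longest_gap(blocking_poll))` with `asyncio.run(longest_gap(offloaded_poll))` shows the heartbeat pausing for at least the full blocking duration in the first case, while in the second it should keep ticking at roughly its normal interval.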

@Kane610
Owner

Kane610 commented Apr 3, 2021

That sounds promising at least! I should copy parts of HASS issue template to make sure that users with issues disable custom integrations before reporting. :)

What integration is it that is problematic?

@shlomki
Author

shlomki commented Apr 4, 2021

So after 2 days with no issues, I'm pretty confident that it was indeed that integration!
This is the one: https://github.com/hllhll/HomeAssistant-ekon-local
It's an integration for a climate HVAC system called ekon. If you look at the latest commit from February, you would see where the HA event loop was mentioned.

I'm really sorry for all of the hassle! I literally tried everything I could think of before opening an issue here, and never thought that other integrations could affect one another. However, I've been talking to many people about this issue, and you're the only one who was able to figure it out :) So even though this wasn't the right place - thank you, thank you, and thank you again. 🙏

@shlomki shlomki closed this as completed Apr 4, 2021