Zeroconf.get_service_info seems broken in 0.26.0 #255

Closed
emontnemery opened this issue May 24, 2020 · 33 comments · Fixed by #258

Comments

@emontnemery
Collaborator

emontnemery commented May 24, 2020

Several Home Assistant users have reported that cast devices can't be discovered correctly after a recent upgrade which bumps zeroconf from 0.25.1 to 0.26.1: home-assistant/core#35922

According to some testing, the problem appeared in 0.26.0, which adds #239

The problem seems to be that Zeroconf.get_service_info doesn't return anything useful in 0.26.0+; from the logs, zeroconf keeps sending queries but either does not receive or does not accept the answers.

Debug logs here:
home-assistant/core#35922 (comment)

For context, this is the code in pychromecast which calls Zeroconf.get_service_info; look for the add_service and service prints in the logs:

    def add_service(self, zconf, typ, name):
        """ Add a service to the collection. """
        service = None
        tries = 0
        _LOGGER.debug("add_service %s, %s", typ, name)
        while service is None and tries < 4:
            try:
                service = zconf.get_service_info(typ, name)
            except IOError:
                # If the zeroconf fails to receive the necessary data we abort
                # adding the service
                break
            tries += 1

        if not service:
            _LOGGER.debug("add_service failed to add %s, %s", typ, name)
            return

@mattsaxon, any idea?

@gjbadros

I haven't looked at this hard, but the mdns-repeater process that is running under the HASS supervisor has crashed for me before, resulting in failed service discovery. Is the setup where the problem occurs relying on mdns-repeater and is it still running? (I had to docker restart that container.)

@hmmbob

hmmbob commented May 25, 2020

Yes. I run the mDNS repeater on the host; I can stop the container with the non-working 0.26.0 setup and start the container with 0.25.1 to get a working setup back, without further touching (or rebooting) the host.

Logs indicate that mDNS traffic is still being reflected as well.

@jstasiak
Collaborator

@emontnemery Is this about the add_service callback not being called at all, or is it still being called but get_service_info() now returns garbage? If it's the former, I suspect this is the explanation: zeroconf receives multiple packets/responses for the cast devices before it's able to dispatch the first service-added callback, so it dispatches a service-updated callback instead (since it already knew about the device(s)).

@jstasiak
Collaborator

Now that I think about it, this should be the case only if a record changes, and I don't think you expect that here.

@mattsaxon
Collaborator

mattsaxon commented May 25, 2020

I've just run some local tests with zeroconf 0.26.1 and a Chromecast and it seems to be working, both from a Python test harness and from my HA (0.110.1), which recognises it fine, including after a restart of HA. Any suggestions on how to reproduce?

@hmmbob

hmmbob commented May 25, 2020

Are you using Docker?
Also, you might need multiple devices: in my case one device did get connected, but all subsequently discovered devices didn't.

You can find logs and some initial thoughts at home-assistant/core#35922 (comment) and later comments.

@mattsaxon
Collaborator

mattsaxon commented May 25, 2020

No, I'm not using Docker.

There are a lot of logs to go through there; can we first try to isolate whether this occurs outside HA, using a simple test harness? The reason I mention this is that, from a quick scan of the release logs, there seemed to be quite a lot of changes to the way zeroconf is used in HA 0.110.

Happy to work on a harness with you; then we can make sure that test case ends up in the base zeroconf test suite so we don't inadvertently reintroduce the issue.

@hmmbob

hmmbob commented May 25, 2020

The question about Docker was because the thread seems to have only Docker users reporting the issue, for now.

To summarize what we did already:

HA 0.109.6 with zeroconf 0.25.1 (default): all works fine
HA 0.110.0 with zeroconf 0.26.0 (default): cast broken
HA 0.110.0 with zeroconf 0.25.1 (requires manual installation): cast works fine again

0.26.1 and 0.26.2 didn't solve the issue either, by the way.

"Cast broken" in this context means that not all devices connect; more specifically, every device after the first one fails.

@mattsaxon
Collaborator

OK, thanks, that's a useful summary. I just reran my test harness, adding a second (mocked) Chromecast device, and this works fine too.

Can you enhance the logging in the code posted at the top of this thread? Please catch all exceptions (rather than just IOError) and log the exception in both cases so we can see what is occurring.
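
For illustration, here is a rough sketch of the kind of change I mean, based on the snippet at the top of this thread (the log message wording is just an example, not a proposal for the final code):

    import logging

    _LOGGER = logging.getLogger(__name__)


    class CastListener:
        """Sketch only -- the real class lives in pychromecast/discovery.py."""

        def add_service(self, zconf, typ, name):
            """ Add a service to the collection, logging whatever goes wrong. """
            service = None
            tries = 0
            _LOGGER.debug("add_service %s, %s", typ, name)
            while service is None and tries < 4:
                try:
                    service = zconf.get_service_info(typ, name)
                except IOError:
                    # If zeroconf fails to receive the necessary data we still
                    # abort adding the service, but now log the traceback too.
                    _LOGGER.exception("add_service IOError for %s, %s", typ, name)
                    break
                except Exception:
                    # Catch-all so any unexpected error shows up in the log.
                    _LOGGER.exception("add_service unexpected error for %s, %s", typ, name)
                    break
                _LOGGER.debug("get_service_info returned: %s", service)
                tries += 1

            if not service:
                _LOGGER.debug("add_service failed to add %s, %s", typ, name)
                return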

@hmmbob

hmmbob commented May 25, 2020

I'll try later today, but it will require some work inside the Docker container, so I'll need some time to set everything up. I'll report back though.

@mattsaxon
Collaborator

One thing to consider: I've realised that I'm also implicitly using Docker for my HA, as I'm using HassOS. My test harness is, however, running on a Windows machine elsewhere on my network. Is it worth trying to reproduce this outside of HA first, as it will save you fiddling with your HA? Also, if we can't reproduce it outside of HA, that will help us with the diagnosis. Your call obviously; I'll support you whichever way you choose to go.

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

The key difference between the 0.25.1 and 0.26.1 logs is that Zeroconf.get_service_info returns None in 0.26.1.

Example from 0.25.1 log:

2020-05-24 16:37:39 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.
^--- This results in a call to `Zeroconf.get_service_info`
2020-05-24 16:37:39 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [homeassistant.components.cast.discovery] Discovered chromecast ChromecastInfo(host='192.168.88.11', port=8009, service='Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.', uuid='32e39183-bc97-abac-51d7-7fce98a766a4', model_name='Chromecast', friendly_name='TV Slaapkamer')
^--- This means `Zeroconf.get_service_info` returned a service. Note in the log that no packets were sent by zeroconf to resolve the service

Example from 0.26.1 log:

2020-05-24 18:50:27 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.
^--- This results in a call to `Zeroconf.get_service_info`
2020-05-24 18:50:27 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [zeroconf] Sending <DNSOutgoing:{multicast=True, flags=0, questions=[question[srv,in,Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.], question[txt,in,Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.], question[a,in,Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.], question[quada,in,Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.]], answers=[], authorities=[], additionals=[]}> (104 bytes) as b'\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00-Chromecast-32e39183bc97abac51d77fce98a766a4-7\x0b_googlecast\x04_tcp\x05local\x00\x00!\x00\x01\xc0\x0c\x00\x10\x00\x01\xc0\x0c\x00\x01\x00\x01\xc0\x0c\x00\x1c\x00\x01'...
^--- Question from zeroconf
2020-05-24 18:50:27 DEBUG (zeroconf-Engine) [zeroconf] Received from '172.22.0.2':5353: b'\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00-Chromecast-32e39183bc97abac51d77fce98a766a4-7\x0b_googlecast\x04_tcp\x05local\x00\x00!\x00\x01\xc0\x0c\x00\x10\x00\x01\xc0\x0c\x00\x01\x00\x01\xc0\x0c\x00\x1c\x00\x01' 
2020-05-24 18:50:27 DEBUG (zeroconf-Engine) [zeroconf] Received from '172.22.0.1':5353: b'\x00\x00\x84\x00\x00\x00\x00\x02\x00\x00\x00\x00-Chromecast-32e39183bc97abac51d77fce98a766a4-7\x0b_googlecast\x04_tcp\x05local\x00\x00\x10\x80\x01\x00\x00\x11\x94\x00\xac#id=32e39183bc97abac51d77fce98a766a4#cd=C69CA6478A7752218282F68F6D6520C6\x03rm=\x05ve=05\rmd=Chromecast\x12ic=/setup/icon.png\x10fn=TV Slaapkamer\tca=200709\x04st=0\x0fbs=FA8FCA9E2B99\x04nf=2\x03rs=\xc0\x0c\x00!\x80\x01\x00\x00\x00x\x00/\x00\x00\x00\x00\x1fI&32e39183-bc97-abac-51d7-7fce98a766a4-1\xc0K' 
2020-05-24 18:50:39 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service failed to add _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.
^--- pychromecast gives up after `Zeroconf.get_service_info` returned `None` 4 times

Another key difference between the 0.25.1 and 0.26.1 logs is that, in the case of 0.25.1, add_service is called multiple times for the same cast device, although with different suffixes in the name:

[pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-7._googlecast._tcp.local.
2020-05-24 16:37:54 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-6._googlecast._tcp.local.
2020-05-24 16:37:54 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-5._googlecast._tcp.local.
2020-05-24 16:38:06 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-4._googlecast._tcp.local.
2020-05-24 16:38:42 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-3._googlecast._tcp.local.
2020-05-24 16:38:54 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Chromecast-32e39183bc97abac51d77fce98a766a4-2._googlecast._tcp.local.

Note: pychromecast relies only on add_service; update_service is not implemented.
@mattsaxon With your changes in #239, is it now necessary to also implement update_service?

Maybe this is just a red herring though; I've seen Chromecasts add suffixes before, and it shouldn't cause Zeroconf.get_service_info to fail as long as there is an add_service call for each new name?

@mattsaxon
Collaborator

mattsaxon commented May 25, 2020

It would be a good test to see what implementing update_service yields; however, if that turns out to be needed it would be a bug which we would rectify, since the only requirement to implement update_service SHOULD be to get updates, such as when the IP address changes.

Given my testing so far, it looks like we are investigating some sort of edge case; not sure what it is yet though.

What you've identified about the multiple calls to add_service looks worthy of investigation. Any idea why there are 7 suffixes? Is that 7 devices in a single house, or 7 services from a single device?

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

It's 7 services from one device, but only one of them should be valid; the remaining ones should be stale records.

IIRC this is a fallback for when the preferred name, without a suffix, is already taken: the Chromecast sends a question for its preferred name and, if it receives an answer, adds a suffix. If that suffix is also taken, the suffix is incremented, and so on (rough sketch below).

Due to the nature of mDNS, the name may have been claimed by the Chromecast itself before it rebooted or reconnected, and still be cached by another client.
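
Roughly, the fallback works like this (a sketch only, not the actual Chromecast code; is_name_taken is a hypothetical stand-in for "send an mDNS question for the name and see whether anyone answers"):

    def pick_unique_name(preferred, is_name_taken):
        """Return the first name of the form preferred[-N] that nobody answers for."""
        if not is_name_taken(preferred):
            return preferred
        suffix = 1
        while is_name_taken("{}-{}".format(preferred, suffix)):
            suffix += 1
        return "{}-{}".format(preferred, suffix)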

If @hmmbob has been "messing around" shortly before logging, that may explain the multiple services, and also why they show up in one log but not another.

In any case, logs from 0.25.1 and 0.26.1 with the changes from #254 should improve the readability.

@emontnemery
Collaborator Author

@hmmbob A bit crazy, but could you try patching pychromecast/discovery.py by adding this to CastListener:

    def update_service(self, zconf, typ, name):
        _LOGGER.debug("update_service %s, %s", typ, name)
        self.add_service(zconf, typ, name)
        _LOGGER.debug("update_service done for %s, %s", typ, name)

@mattsaxon
Collaborator

Looks like a good suggestion.

Given what you've said about the large number of suffixes, I wonder if the issue is down to a large mDNS packet being split, so that when the first packet comes in there is not enough information to service the get_service_info call. If that is the case, on a failure (i.e. receiving None), you could try waiting for, say, 1 second and trying again. At the moment the code retries 4 times without a wait.

@hmmbob

hmmbob commented May 25, 2020

@hmmbob A bit crazy, but could you try patching pychromecast/discovery.py by adding this to CastListener:

    def update_service(self, zconf, typ, name):
        _LOGGER.debug("update_service %s, %s", typ, name)
        self.add_service(zconf, typ, name)
        _LOGGER.debug("update_service done for %s, %s", typ, name)

What testing/code changes do you want me to focus on first? I've only got limited time available tonight, but I do want to help you guys move forward with debugging this...

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

@hmmbob My proposal is to test with both zeroconf 0.25.1 and 0.26.1 with the following changes:

+    def update_service(self, zconf, typ, name):
+        _LOGGER.debug("update_service %s, %s", typ, name)
+        self.add_service(zconf, typ, name)
+        _LOGGER.debug("update_service done for %s, %s", typ, name)
+
     def add_service(self, zconf, typ, name):
         """ Add a service to the collection. """
+        import time
         service = None
         tries = 0
         _LOGGER.debug("add_service %s, %s", typ, name)
         while service is None and tries < 4:
             try:
                 service = zconf.get_service_info(typ, name)
                 _LOGGER.debug("service: %s", service)
             except IOError:
                 # If the zeroconf fails to receive the necessary data we abort
                 # adding the service
                 break
             tries += 1
+            if service is None:
+                time.sleep(1)

@mattsaxon Does this seem reasonable? Is the sleep() in the right place?

@jstasiak
Collaborator

jstasiak commented May 25, 2020

I'll let @mattsaxon speak for himself, but to me it looks like a reasonable direction (with the small exception of patching zeroconf to enable debug logs, see home-assistant/core#35922 (comment)). (crossed out because home-assistant/core#35922 (comment))

@hmmbob

hmmbob commented May 25, 2020

@hmmbob My proposal is to:
Test with both zeroconf 0.25.1 and 0.26.1 with the following changes

First run is done. I started a brand new HA 0.110.0 container (to make sure previous debugging did not interfere) and observed that none of the cast devices showed up. Within the container, I patched the files as quoted above (I seem to be getting handy with vi while doing this, lol) and restarted Home Assistant. None of the devices showed up on startup, but suddenly, around 18:32, most of the devices appeared. I am still missing one Chromecast device (LG-SH8 soundbar) and an AndroidTV (but that is quite normal if that one isn't turned on, which it isn't).

Logs: 110-26.1-patched.log

@hmmbob

hmmbob commented May 25, 2020

And the second run is done too. I again started a brand new 0.110.0 container and observed all cast devices showing as unavailable. I installed zeroconf 0.25.1, patched all files as requested and restarted Home Assistant. All cast devices were online immediately once the HA frontend became available.

Logs: 110-25.1-patched.log

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

Nice! I think there's some real progress here: adding update_service is why the devices show up around 18:32!

From 0.26.1 log:

2020-05-25 18:30:44 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
^--- This results in a call to `Zeroconf.get_service_info`
2020-05-25 18:30:44 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [zeroconf] Sending <DNSOutgoing:{multicast=True, flags=0, questions=[question[srv,in,Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.], question[txt,in,Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.], question[a,in,Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.], question[quada,in,Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.]], answers=[], authorities=[], additionals=[]}> (103 bytes) as [b'\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00,Google-Home-d4063db0253e246c9d66d8f7fda77a6a\x0b_googlecast\x04_tcp\x05local\x00\x00!\x00\x01\xc0\x0c\x00\x10\x00\x01\xc0\x0c\x00\x01\x00\x01\xc0\x0c\x00\x1c\x00\x01']
^--- Question sent by zeroconf
2020-05-25 18:31:00 DEBUG (zeroconf-Engine) [zeroconf] Received from '172.22.0.1':5353: <DNSIncoming:{id=0, flags=33792, n_q=0, n_ans=2, n_auth=0, n_add=0, questions=[], answers=[record[txt,in-unique,Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.]=4500/4499,b'#id=d40'..., record[srv,in-unique,Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.]=120/119,d4063db0-253e-246c-9d66-d8f7fda77a6a.local.:8009]}> (320 bytes) as [b'\x00\x00\x84\x00\x00\x00\x00\x02\x00\x00\x00\x00,Google-Home-d4063db0253e246c9d66d8f7fda77a6a\x0b_googlecast\x04_tcp\x05local\x00\x00\x10\x80\x01\x00\x00\x11\x94\x00\xac#id=d4063db0253e246c9d66d8f7fda77a6a#cd=8F6DC5CC9152D3EA8E06D8C484007970\x03rm=\x05ve=05\x0emd=Google Home\x12ic=/setup/icon.png\x0ffn=Studeerkamer\tca=198660\x04st=0\x0fbs=FA8FCA8E8647\x04nf=1\x03rs=\xc0\x0c\x00!\x80\x01\x00\x00\x00x\x00-\x00\x00\x00\x00\x1fI$d4063db0-253e-246c-9d66-d8f7fda77a6a\xc0J']
^--- Reply received, should be valid?
2020-05-25 18:31:00 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service failed to add _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
^--- Add service fails even though reply has been received

Devices show up around 18:32:

2020-05-25 18:32:21 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] update_service _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.
2020-05-25 18:32:21 DEBUG (zeroconf-ServiceBrowser__googlecast._tcp.local.) [pychromecast.discovery] add_service _googlecast._tcp.local., Google-Home-d4063db0253e246c9d66d8f7fda77a6a._googlecast._tcp.local.

One thing which unfortunately makes the logs inconclusive is that there are multiple zeroconf instances and multiple zeroconf-ServiceBrowser__googlecast._tcp.local. service browsers.

@emontnemery
Collaborator Author

#239 adds the lock _handlers_lock:
https://github.com/jstasiak/python-zeroconf/blob/552a030eb592a0c07feaa7a01ece1464da4b1d0b/zeroconf/__init__.py#L2209

Which is used in the engine thread:
https://github.com/jstasiak/python-zeroconf/blob/552a030eb592a0c07feaa7a01ece1464da4b1d0b/zeroconf/__init__.py#L2484-L2489

And also by the service browser when issuing the state change callbacks:
https://github.com/jstasiak/python-zeroconf/blob/552a030eb592a0c07feaa7a01ece1464da4b1d0b/zeroconf/__init__.py#L1541-L1546

Both pychromecast and Home Assistant call Zeroconf.get_service_info from the service callbacks, which means the lock may be held for several seconds. Could this be the root cause of this issue?
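
To make the suspicion concrete, here is a minimal, self-contained sketch of the interaction I have in mind (hypothetical names and timings, not the actual zeroconf code):

    import threading
    import time

    handlers_lock = threading.Lock()      # stands in for Zeroconf._handlers_lock
    answer_processed = threading.Event()  # "the cache now holds the answer"

    def engine_thread():
        time.sleep(0.1)  # the reply arrives shortly after the question was sent
        with handlers_lock:  # handle_response() needs _handlers_lock
            answer_processed.set()

    def get_service_info(timeout=3.0):
        # Waits for the engine to process the answer; returns None on timeout.
        return "ServiceInfo(...)" if answer_processed.wait(timeout) else None

    def service_browser_thread():
        with handlers_lock:  # lock held while firing add_service/update_service
            # The add_service callback (pychromecast/HA) runs here and calls
            # get_service_info(), which waits for the engine thread -- but the
            # engine is blocked on the very lock we are holding.
            print("get_service_info() ->", get_service_info())  # prints None

    threading.Thread(target=engine_thread).start()
    service_browser_thread()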

@mattsaxon
Collaborator

mattsaxon commented May 25, 2020

I haven't considered what you've written above yet, but let me explain why I added the lock. The intention was to deal with the situation of add_service and then update_service being called in short succession. The expectation was that adding the lock would allow add_service to have all the information, rather than calling add_service early and then calling update_service. Pretty much trying to avoid the problem we may be facing here.

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

OK, so the intention is to prevent the service browsers from rushing ahead and calling add_service too early while looping over for record in msg.answers:?

Maybe the lock can remain, but ~~make a list handlers_to_really_really_call which can be looped over without holding the lock~~ call self._service_state_changed.fire() outside the lock:

             if len(self._handlers_to_call) > 0 and not self.zc.done:
                 with self.zc._handlers_lock:
                     handler = self._handlers_to_call.popitem(False)
-                    self._service_state_changed.fire(
-                        zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
-                    )
+                self._service_state_changed.fire(
+                    zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
+                )

@jstasiak
Collaborator

I think this could be just deindenting the self._service_state_changed.fire() lines?

@emontnemery
Collaborator Author

Yes, that's right, not sure what I was thinking @_@

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

Maybe another round of tests: test with zeroconf 0.26.1 and the changes below.
@jstasiak, @hmmbob does that seem reasonable?

Changes:

  • Patch zeroconf by firing the service state change callback outside _handlers_lock:
             if len(self._handlers_to_call) > 0 and not self.zc.done:
                 with self.zc._handlers_lock:
                     handler = self._handlers_to_call.popitem(False)
-                    self._service_state_changed.fire(
-                        zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
-                    )
+                self._service_state_changed.fire(
+                    zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
+                )
  • Patch pychromecast/discovery.py by adding this to CastListener:
+    def update_service(self, zconf, typ, name):
+        _LOGGER.debug("update_service %s, %s", typ, name)
+        self.add_service(zconf, typ, name)
+        _LOGGER.debug("update_service done for %s, %s", typ, name)
+
     def add_service(self, zconf, typ, name):
         """ Add a service to the collection. """
+        import time
         service = None
         tries = 0
         _LOGGER.debug("add_service %s, %s", typ, name)
         while service is None and tries < 4:
             try:
                 service = zconf.get_service_info(typ, name)
                 _LOGGER.debug("service: %s", service)
             except IOError:
                 # If the zeroconf fails to receive the necessary data we abort
                 # adding the service
                 break
             tries += 1
+            if service is None:
+                time.sleep(1)

@hmmbob

hmmbob commented May 25, 2020

I'll give it a run tomorrow!

Strike that - I still have that container from the previous test so I'll do it now.

edit: @emontnemery @jstasiak First run seems to be successful (I mean: cast items show at startup with patched 26.1). I am verifying the run again.
edit2: 2nd run also successful.

Logs: 110-26.1-patched-2.log

@jstasiak
Collaborator

@emontnemery Yes, +1 from me, although late, as @hmmbob was faster with actual testing. :)

@hmmbob

hmmbob commented May 25, 2020

But now I wonder why I ran into this, along with just a few others, but not the other thousands of HA users...

@jstasiak
Collaborator

Is the percentage of HA users who upgraded to the latest HA and use pychromecast/zeroconf known (not the precise figure, naturally, just an approximation)?

@emontnemery
Collaborator Author

emontnemery commented May 25, 2020

Based on the number of comments in home-assistant/core#35922, I don't think the percentage is too high. A few users also complain in the forums.

@hmmbob is running a "non-recommended" HA setup where the cast devices and Home Assistant are on different networks, with Avahi in reflector mode bridging them; that seems to be the common factor for the others as well.

Maybe it would be worth digging a bit deeper into the logs from @hmmbob to understand why the deadlock / starvation was triggered in this case. Maybe Avahi does some filtering of mDNS packets which somehow changes the timing, causing Zeroconf.get_service_info to always/often deadlock?

jstasiak pushed a commit that referenced this issue May 26, 2020
Closes #255

Background:
#239 adds the lock _handlers_lock:

python-zeroconf/zeroconf/__init__.py

    self._handlers_lock = threading.Lock()  # ensure we process a full message in one go 

Which is used in the engine thread:

     def handle_response(self, msg: DNSIncoming) -> None: 
         """Deal with incoming response packets.  All answers 
         are held in the cache, and listeners are notified.""" 
  
         with self._handlers_lock: 
  

And also by the service browser when issuing the state change callbacks:

 if len(self._handlers_to_call) > 0 and not self.zc.done: 
     with self.zc._handlers_lock: 
         handler = self._handlers_to_call.popitem(False) 
         self._service_state_changed.fire( 
             zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1] 
         ) 

Both pychromecast and Home Assistant call Zeroconf.get_service_info from the service callbacks, which means the lock may be held for several seconds, which will starve the engine thread.
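
The change discussed above amounts to popping the handler while holding the lock but firing the callback only after releasing it; a rough, self-contained sketch of the resulting shape (not the exact code from #258):

    import threading
    from collections import OrderedDict

    handlers_lock = threading.Lock()                    # Zeroconf._handlers_lock
    handlers_to_call = OrderedDict([("name._googlecast._tcp.local.", "Added")])

    def fire(name, state_change):
        # In real life this is the add_service/update_service callback and may
        # block for several seconds inside get_service_info().
        print("callback:", name, state_change)

    if len(handlers_to_call) > 0:
        with handlers_lock:
            handler = handlers_to_call.popitem(last=False)
        # Fired outside the lock, so a slow callback no longer starves the
        # engine thread that needs the same lock in handle_response().
        fire(name=handler[0], state_change=handler[1])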