Zeroconf.get_service_info seems broken in 0.26.0 #255
Comments
I haven't looked at this hard, but the mdns-repeater process that is running under the HASS supervisor has crashed for me before, resulting in failed service discovery. Is the setup where the problem occurs relying on mdns-repeater and is it still running? (I had to docker restart that container.)
Yes. I run the mdns repeater on the host and can stop the container with the not-working 0.26.0 setup and start the container with 0.25.1 to get a working setup back, without further touching (or rebooting) the host. Logs indicate that mDNS traffic is still being reflected as well.
@emontnemery Is this about
Now that I think about it, this should only be the case if a record changes, and I don't think you expect that here.
I've just run some local tests with zeroconf 26.1 and a Chromecast and it seems to be working, both from a Python test harness and from my HA (0.110.1), which recognises it fine, including after a restart of HA. Any suggestions on how to recreate it?
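For reference, a minimal harness along those lines could look like the sketch below (an illustration only, not the exact harness used here; the listener name, the 30-second run time and the log wording are assumptions). It browses for _googlecast._tcp.local. and resolves each discovered service the same way pychromecast does, via Zeroconf.get_service_info:

    import logging
    import time

    from zeroconf import ServiceBrowser, Zeroconf

    logging.basicConfig(level=logging.DEBUG)
    _LOGGER = logging.getLogger(__name__)

    CAST_TYPE = "_googlecast._tcp.local."  # the service type pychromecast browses for


    class CastDebugListener:
        """Log every add/update/remove and try to resolve the service."""

        def add_service(self, zconf, typ, name):
            info = zconf.get_service_info(typ, name)
            _LOGGER.debug("add_service %s -> %s", name, info)

        def update_service(self, zconf, typ, name):
            info = zconf.get_service_info(typ, name)
            _LOGGER.debug("update_service %s -> %s", name, info)

        def remove_service(self, zconf, typ, name):
            _LOGGER.debug("remove_service %s", name)


    zc = Zeroconf()
    browser = ServiceBrowser(zc, CAST_TYPE, CastDebugListener())
    try:
        time.sleep(30)  # let discovery run for a while, then inspect the debug output
    finally:
        zc.close()

If get_service_info starts returning None here with 0.26.x but not with 0.25.1, that would reproduce the issue outside of HA.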
Are you using docker? You can find logs and some initial thoughts at home-assistant/core#35922 (comment) and later comments.
No, I'm not using Docker. There's a lot of logs to go through there; can we first try to isolate whether it occurs when not using HA, with a simple test harness? The reason I mention this is there seemed to be quite a lot of changes to the way zeroconf was being used in HA 0.110, from a quick scan of the release logs. Happy to work on a harness with you, and then we can ensure that test case is in the base zeroconf test suite so we don't inadvertently reintroduce it.
The question about Docker was because the thread seems to only have Docker users reporting the issue, for now. To summarize what we did already:

- HA 109.6 with zeroconf 25.1 (default): all works fine
- 26.1 and 26.2 didn't solve the issue either, btw

"Cast broken" in this context means that not all devices seem to connect, or more specifically: any device after the first one.
OK, thanks, that's a useful summary. Just reran my test harness, adding a second (mocked) Chromecast device, and this works fine too. Can you enhance the logging in the code posted at the top of this thread? Please catch all exceptions (rather than just IOError) and log the exception in both cases so we can see what is occurring.
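A sketch of what that enhanced logging could look like (illustrative only: the helper name resolve_service and the exact log wording are assumptions; the retry loop mirrors the pychromecast code quoted later in this thread):

    import logging

    _LOGGER = logging.getLogger(__name__)


    def resolve_service(zconf, typ, name, max_tries=4):
        """Try to resolve a service, logging every failure while we debug.

        zconf is a zeroconf.Zeroconf instance; typ and name come from the
        add_service/update_service callback.
        """
        service = None
        tries = 0
        while service is None and tries < max_tries:
            try:
                service = zconf.get_service_info(typ, name)
                _LOGGER.debug("service: %s", service)
            except IOError:
                # previously this was swallowed silently; now log what happened
                _LOGGER.exception("get_service_info raised IOError for %s, %s", typ, name)
                break
            except Exception:  # deliberately broad while debugging
                _LOGGER.exception("unexpected error resolving %s, %s", typ, name)
                break
            tries += 1
        return service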
I'll try later today, but it will require some work inside the docker container so I'll need some time to set everything up. I'll report back though.
One thing to consider is that I’ve realised that I’m also implicitly using docker for my HA as I’m using HassOS. My test harness is running on a Windows machine, however, elsewhere on my network. Is it worth us trying to reproduce this outside of HA first, as it will save you fiddling with your HA? Also, if we can’t reproduce it outside of HA, that will help us with the diagnosis. Your call obviously, I’ll support you whichever way you choose to go.
The key difference between the 0.25.1 and 0.26.1 logs is that

Example from 0.25.1 log:
Example from 0.26.1 log:
Another key difference between the 0.25.1 and 0.26.1 logs is that
Note: maybe this is just a red herring though; I've seen chromecasts add suffixes before, and it shouldn't cause
It would be a good test to see what implementing update_service yields; however, this would be a bug which we would rectify. The only requirement for implementing update_service SHOULD be to get updates, such as when the IP address changes. Given my testing so far on this, it looks like we are investigating some sort of edge case, though I'm not sure what it is yet. What you’ve identified with the multiple calls to add_service looks worthy of investigation. Any idea why there are 7 suffixes? Is that 7 devices in a single house? Or is that 7 services from a single device?
It's 7 services from one device, but only one of them should be valid; the remaining ones should be stale records. IIRC, due to the nature of mDNS, the name may have been taken by the Chromecast itself before it rebooted or reconnected etc., but is still cached by another client. If @hmmbob has been "messing around" shortly before logging, that may explain the multiple services, and also why they show up in one log but not another. In any case, logs from 0.25.1 and 0.26.1 with the changes from #254 should improve the readability.
@hmmbob A bit crazy, but could you try patching pychromecast/discovery.py by adding this to
Looks like a good suggestion. Given what you’ve said about the large number of suffixes, I wonder if the issue is down to a large mDNS packet being split, so that when the first packet comes in there is not enough information to provide the data needed to service the get_service_info call. If this is the case, on a failure (i.e. receiving None), you could try waiting for, say, 1 second and then trying again. At the moment the code retries 4 times without a wait.
What testing/code changes do you want me to focus on first? I've just got limited time available tonight, but I do want to help you guys forward with debugging this...
@hmmbob My proposal is to:
+    def update_service(self, zconf, typ, name):
+        _LOGGER.debug("update_service %s, %s", typ, name)
+        self.add_service(zconf, typ, name)
+        _LOGGER.debug("update_service done for %s, %s", typ, name)
+
     def add_service(self, zconf, typ, name):
         """ Add a service to the collection. """
+        import time
         service = None
         tries = 0
         _LOGGER.debug("add_service %s, %s", typ, name)
         while service is None and tries < 4:
             try:
                 service = zconf.get_service_info(typ, name)
                 _LOGGER.debug("service: %s", service)
             except IOError:
                 # If the zeroconf fails to receive the necessary data we abort
                 # adding the service
                 break
             tries += 1
+            if service is None:
+                time.sleep(1)

@mattsaxon Does this seem reasonable? Is the
I'll let @mattsaxon speak for himself, but to me it looks like a reasonable direction.
First run is done. I started a brand new HA 110.0 container (to make sure previous debugging did not interfere). I observed none of the cast devices showing up. Within the container, I patched the files as quoted above (I seem to get handy with

Logs: 110-26.1-patched.log
And the second run is done too. I again started a brand new 110.0 container and observed all cast devices showing as unavailable. I installed zeroconf 25.1, patched all files as requested and restarted Home Assistant. I observed all cast devices to be online immediately once the HA frontend was available.

Logs: 110-25.1-patched.log
Nice! I think there's some real progress here. Adding

From 0.26.1 log:
Devices show up around 18:32:
One thing which unfortunately makes the logs inconclusive is that there are multiple zeroconf instances and multiple zeroconf-ServiceBrowser__googlecast._tcp.local. service browsers.
#239 adds the lock _handlers_lock, which is used in the engine thread when handling incoming responses, and also by the service browser when issuing the state change callbacks. Both pychromecast and Home Assistant call Zeroconf.get_service_info from the service callbacks, which means the lock may be held for several seconds, which will starve the engine thread.
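To make the starvation concrete, here is a toy sketch (not zeroconf's actual code; the thread, lock and event names only mirror the description above). The callback thread holds the lock while waiting for data that only the engine thread can produce, but the engine thread needs the same lock to process incoming packets:

    import threading
    import time

    handlers_lock = threading.Lock()     # stands in for Zeroconf._handlers_lock
    record_received = threading.Event()  # stands in for the answer get_service_info waits for


    def engine_thread():
        """Processes incoming packets; needs the lock for every response (like handle_response)."""
        while not record_received.is_set():
            with handlers_lock:          # blocked while the callback below holds the lock
                record_received.set()
            time.sleep(0.1)


    def service_callback():
        """Like add_service calling get_service_info: waits for data while holding the lock."""
        with handlers_lock:
            got_answer = record_received.wait(timeout=3)  # get_service_info's default wait is ~3 s
            print("resolved:", got_answer)  # prints False: the engine thread was starved


    threading.Thread(target=engine_thread, daemon=True).start()
    service_callback()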
I haven’t considered what you’ve written above yet, but let me explain why I added the lock. The intention was to deal with the situation of calling add_service and then update_service in short succession. The expectation was that adding it would allow add_service to have all the information, rather than calling add_service early and then calling update_service. Pretty much trying to avoid the problem we may be facing here.
OK, so the intention is to prevent the service browser from rushing away and calling the listeners before a complete message has been processed. Maybe the lock can remain, but the state change callbacks could be fired after releasing it:

     if len(self._handlers_to_call) > 0 and not self.zc.done:
         with self.zc._handlers_lock:
             handler = self._handlers_to_call.popitem(False)
-            self._service_state_changed.fire(
-                zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
-            )
+        self._service_state_changed.fire(
+            zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
+        )
I think this could be just deindenting the fire() call?
Yes, that's right, not sure what I was thinking @_@
Maybe another round of tests: test with zeroconf 0.26.1 and the below changes. Changes:
In zeroconf/__init__.py (the ServiceBrowser):

     if len(self._handlers_to_call) > 0 and not self.zc.done:
         with self.zc._handlers_lock:
             handler = self._handlers_to_call.popitem(False)
-            self._service_state_changed.fire(
-                zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
-            )
+        self._service_state_changed.fire(
+            zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
+        )

In pychromecast/discovery.py:

+    def update_service(self, zconf, typ, name):
+        _LOGGER.debug("update_service %s, %s", typ, name)
+        self.add_service(zconf, typ, name)
+        _LOGGER.debug("update_service done for %s, %s", typ, name)
+
     def add_service(self, zconf, typ, name):
         """ Add a service to the collection. """
+        import time
         service = None
         tries = 0
         _LOGGER.debug("add_service %s, %s", typ, name)
         while service is None and tries < 4:
             try:
                 service = zconf.get_service_info(typ, name)
                 _LOGGER.debug("service: %s", service)
             except IOError:
                 # If the zeroconf fails to receive the necessary data we abort
                 # adding the service
                 break
             tries += 1
+            if service is None:
+                time.sleep(1)
I'll give it a run tomorrow! Strike that - I still have that container from the previous test so I'll do it now.

Edit: @emontnemery @jstasiak First run seems to be successful (I mean: cast items show at startup with patched 26.1). I am verifying the run again.

Logs: 110-26.1-patched-2.log
@emontnemery Yes, +1 from me, although late, as @hmmbob was faster with actual testing. :)
But now I wonder why I ran into this, with just a few others - but not the other thousands of HA users...
Is the percentage of HA users who upgraded to the latest HA and use pychromecast/zeroconf known (not the precise figure, naturally, just an approximation)?
Based on the number of comments in home-assistant/core#35922, I don't think the percentage is too high. A few users also complain in the forums. @hmmbob is running a "non-recommended" HA setup where the cast devices and Home Assistant are on different networks, with Avahi in reflector mode helping to bridge; that seems to be the common factor for others as well. Maybe it would be worth digging a bit deeper in the logs from @hmmbob to understand why the deadlock/starvation was triggered in this case. Maybe Avahi does some filtering of mDNS packets which somehow changes the timing, causing the starvation.
Closes #255

Background: #239 adds the lock _handlers_lock:

python-zeroconf/zeroconf/__init__.py:

    self._handlers_lock = threading.Lock()  # ensure we process a full message in one go

Which is used in the engine thread:

    def handle_response(self, msg: DNSIncoming) -> None:
        """Deal with incoming response packets. All answers are held in the cache, and listeners are notified."""
        with self._handlers_lock:

And also by the service browser when issuing the state change callbacks:

    if len(self._handlers_to_call) > 0 and not self.zc.done:
        with self.zc._handlers_lock:
            handler = self._handlers_to_call.popitem(False)
            self._service_state_changed.fire(
                zeroconf=self.zc, service_type=self.type, name=handler[0], state_change=handler[1]
            )

Both pychromecast and Home Assistant call Zeroconf.get_service_info from the service callbacks, which means the lock may be held for several seconds, which will starve the engine thread.
Several Home Assistant users have reported that cast devices can't be discovered correctly after a recent upgrade which bumps zeroconf from 0.25.1 to 0.26.1: home-assistant/core#35922
According to some testing, the problem appeared in 0.26.0, which adds #239
It seems that the problem is that Zeroconf.get_service_info doesn't return anything useful in 0.26.0+; from the logs it seems that Zeroconf keeps sending queries but does not get or does not accept answers. Debug logs here:
home-assistant/core#35922 (comment)
For context, this is the code in pychromecast which calls Zeroconf.get_service_info; look for the add_service and service prints in the logs:

@mattsaxon, any idea?