CoAP Pair-Verify timeout #78

Closed · roysjosh opened this issue Feb 13, 2022 · 9 comments

@roysjosh (Contributor) commented Feb 13, 2022
What do you think about the current CoAP pair-verify timeout of 8s? During startup, HA will try to connect 3x causing a 24s delay to HA startup if any HAP/CoAP device is unreachable. I don't have any battery-powered devices but I feel like 4s should be enough for just about anything to complete pair-verify.
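For reference, a minimal sketch of the arithmetic being discussed, assuming the backend bounds each pair-verify attempt with something like asyncio.wait_for; the names and constants here (do_pair_verify, PAIR_VERIFY_TIMEOUT, SETUP_RETRIES) are placeholders for illustration, not aiohomekit's actual API:

import asyncio

PAIR_VERIFY_TIMEOUT = 4.0  # proposed value; the CoAP backend currently uses 8.0
SETUP_RETRIES = 3          # connection attempts HA makes during startup

async def pair_verify_with_timeout(do_pair_verify) -> bool:
    """Bound a single pair-verify attempt; return True on success."""
    try:
        await asyncio.wait_for(do_pair_verify(), timeout=PAIR_VERIFY_TIMEOUT)
        return True
    except asyncio.TimeoutError:
        # Worst case for an unreachable device:
        # SETUP_RETRIES * PAIR_VERIFY_TIMEOUT, i.e. 3 * 4s = 12s vs 3 * 8s = 24s.
        return False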

@Jc2k (Owner) commented Feb 13, 2022

It's a good question. It's been a while since I've looked at the startup path.

From what I remember (and a very quick look):

  • Right now, in theory, HA only needs to talk to the device if the characteristic cache is stale. Even if it's stale, it can load the stale entities and then create any missing ones when the cache is eventually updated.
  • The only other I/O that happens at startup is setting up subscriptions, and that runs on its own asyncio task. It also sets up a poll using async_track_time_interval (sketched below); that shouldn't do I/O immediately.

So on the "happy" path I wouldn't expect any delay during HA's startup - the connect should happen in the background.

The case where there is no characteristic cache should be quite rare, and in that case I think the delay is reasonable.
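For context, the poll mentioned above is scheduled with async_track_time_interval; a rough sketch of that wiring, where `device`, its async_update method, and the 60-second interval are stand-ins rather than the exact homekit_controller code:

from datetime import timedelta

from homeassistant.helpers.event import async_track_time_interval

def setup_polling(hass, device, interval=timedelta(seconds=60)):
    """Schedule a periodic poll; no I/O happens at scheduling time."""

    async def _async_poll(now):
        # Each poll goes through the connection layer, so a device that
        # dropped off the network is reconnected on a later poll.
        await device.async_update()

    # Returns an unsubscribe callable that HA keeps for teardown.
    return async_track_time_interval(hass, _async_poll, interval)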

@Jc2k (Owner) commented Feb 13, 2022

The IP backend has a reconnect loop that basically tries forever in the background. Other asyncio tasks can then wait for the connection, and the timeout doesn't stop the background task from retrying. It will also consult the zeroconf cache for updated connection details. This mostly works quite well, though there is a bug I recently noticed where the char cache is empty at start up.

The implementation is messy, and that's largely driven by a poorly defined lifetime/ownership model. It was also added before the zeroconf component matured to the level it's at today. You could probably get away with just letting zeroconf announce when the device comes back online.

Also, right now polling is frequent enough that just letting HA wake the connection back up is fine. Though if we reduce polling for the sake of battery devices, that might not be the case.

@roysjosh (Contributor, Author)

I added a traceback print and here's what I'm seeing... it looks like subscribe is triggering a connect. The web interface shows the message at the bottom: "Home Assistant is starting, not everything will be available until it is finished."

File "/home/lambda/code/home-assistant/core/venv/bin/hass", line 33, in <module>
  sys.exit(load_entry_point('homeassistant', 'console_scripts', 'hass')())
File "/home/lambda/code/home-assistant/core/homeassistant/__main__.py", line 191, in main
  exit_code = runner.run(runtime_conf)
File "/home/lambda/code/home-assistant/core/homeassistant/runner.py", line 119, in run
  return loop.run_until_complete(setup_and_run_hass(runtime_config))
File "/usr/lib64/python3.9/asyncio/base_events.py", line 629, in run_until_complete
  self.run_forever()
File "/usr/lib64/python3.9/asyncio/base_events.py", line 596, in run_forever
  self._run_once()
File "/usr/lib64/python3.9/asyncio/base_events.py", line 1890, in _run_once
  handle._run()
File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
  self._context.run(self._callback, *self._args)
File "/home/lambda/code/home-assistant/core/homeassistant/components/homekit_controller/connection.py", line 380, in async_process_entity_map
  await self.pairing.subscribe(self.watchable_characteristics)
File "/home/lambda/code/home-assistant/core/venv/lib64/python3.9/site-packages/aiohomekit/controller/coap/pairing.py", line 107, in subscribe
  await self._ensure_connected()
File "/home/lambda/code/home-assistant/core/venv/lib64/python3.9/site-packages/aiohomekit/controller/coap/pairing.py", line 49, in _ensure_connected
  await self.connection.connect(self.pairing_data)
File "/home/lambda/code/home-assistant/core/venv/lib64/python3.9/site-packages/aiohomekit/controller/coap/connection.py", line 392, in connect
  logger.debug(traceback.format_stack())

It tries to connect 7 (!!! I thought it was 3) times before giving up, and then the web interface finishes loading. If I change the pair-verify timeout to 40s, it takes 7 * 40s for the banner to disappear. I haven't figured out why yet.

2022-02-14 00:03:51 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:04:31 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:05:11 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:05:51 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:06:31 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:07:11 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:07:51 WARNING (MainThread) [aiohomekit.controller.coap.connection] Pair verify timed out
2022-02-14 00:07:51 INFO (MainThread) [homeassistant.bootstrap] Home Assistant initialized in 281.33s

Jc2k added the CoAP label Feb 14, 2022
Jc2k added this to the 1.0.0 milestone Feb 14, 2022
@Jc2k (Owner) commented Feb 14, 2022

Eep. My memory and late-night glances were not enough. So it is triggered by async_process_entity_map here. Could you try editing it like this:


        self.add_entities()

-        if self.watchable_characteristics:
-            await self.pairing.subscribe(self.watchable_characteristics)
-            if not self.pairing.is_connected:
-                return

-        await self.async_update()
+        self.hass.async_create_task(self.async_update())

I think that will stop the I/O if you have an up to date cache.

I also wonder about not having subscribe do I/O unless it is already connected.

We could look to reduce the number of subscriptions. E.g. if entities asking for a subscription didn't trigger I/O right away, we could do a single batch call at the end of add_entities (if connected), or otherwise send them as part of the connection process. That's a bit more work, so it's something I'd probably put off if we can avoid slow starts for most users with the above.
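To make the batching idea concrete, something along these lines could work; SubscriptionBatcher and its method names are invented for illustration, and only pairing.subscribe / pairing.is_connected come from the code discussed above:

class SubscriptionBatcher:
    """Collect subscription requests and send them in one batch call."""

    def __init__(self, pairing):
        self.pairing = pairing
        self.pending: set[tuple[int, int]] = set()  # (aid, iid) pairs

    def request(self, characteristics):
        # Entities call this instead of pairing.subscribe(); no I/O here.
        self.pending.update(characteristics)

    async def flush(self):
        # One batch call at the end of add_entities if we already have a
        # connection; otherwise the connect path replays self.pending later.
        if self.pending and self.pairing.is_connected:
            await self.pairing.subscribe(list(self.pending))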

@Jc2k (Owner) commented Feb 14, 2022

Argh, no: that will mean the backend doesn't know about the subscriptions at all though.

@Jc2k (Owner) commented Feb 14, 2022

So it's already just making one call to subscribe after add_entities, so I don't know what is happening at all 😆

So here is what I was planning for IP:

  • Add IpPairing.async_start and IpPairing.async_stop.
  • In HA, async_start is called as part of Connection.async_setup, and Connection.async_unload should call async_stop.
  • async_start should create a while True loop running in a new task via async_create_task. We should hold a private reference to it on the pairing.
  • async_stop should cancel and await the loop.
  • The loop needs to be resilient, so it should try/except Exception, log the exception, and restart, possibly with a delay.
  • The loop is in charge of pair-verify and re-doing subscriptions, then notifying any waiters via an asyncio.Event() that the connection is up.
  • The loop should check self.description.addresses and self.description.port before starting a new connection in case the IP/port changed. These will be updated automatically by background processes.
  • There will be some mechanism for the background discovery to notify these have changed. That should either use an Event to notify the loop to do something, or just cancel the loop and restart it.
  • _ensure_connected then only ever has to wait on this _is_connected Event, rather than actually trying to reach out to the device.
  • With this model, subscribe should not ever call _ensure_connected. If we are currently connected, it can send subscribe immediately. If not it should not try and connect.
  • This in turn means that we mostly don't do I/O in the foreground at startup.
  • The one remaining case is fetching the entity map when the cache is cold. _ensure_connected can be called and time out after 8s; then I believe we catch the exception and fail the config flow. HA will retry it for us later, so it won't block startup.

We already do bits of this for IP. There is a retry loop. It does check for IP/port changes (though not through the new mechanism, as it's not quite finished). Right now the pairing /does/ have a close, but actually starting up the connection is always on demand. And I think that creates a bit of a weird situation where you expect subscribe to start the connection and hence block.
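A minimal sketch of that lifecycle, assuming the loop body and disconnect detection look roughly like this; _connect_and_subscribe and _wait_for_disconnect are placeholders for the real pair-verify/subscribe/monitoring code:

import asyncio
import logging
from typing import Optional

logger = logging.getLogger(__name__)

class ReconnectingPairing:
    def __init__(self):
        self._connected = asyncio.Event()
        self._task: Optional[asyncio.Task] = None

    async def async_start(self):
        # Called from Connection.async_setup in HA.
        self._task = asyncio.create_task(self._reconnect_loop())

    async def async_stop(self):
        # Called from Connection.async_unload in HA.
        if self._task:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def _reconnect_loop(self):
        while True:
            try:
                # Re-read self.description.addresses/port here in case zeroconf
                # updated them, run pair-verify, then re-do the subscriptions.
                await self._connect_and_subscribe()
                self._connected.set()
                await self._wait_for_disconnect()
            except asyncio.CancelledError:
                raise
            except Exception:
                logger.exception("Connection attempt failed, will retry")
                await asyncio.sleep(10)
            finally:
                self._connected.clear()

    async def _ensure_connected(self, timeout=8.0):
        # Only waits on the event; never reaches out to the device itself.
        await asyncio.wait_for(self._connected.wait(), timeout)

    async def _connect_and_subscribe(self):
        raise NotImplementedError  # backend-specific pair-verify + subscribe

    async def _wait_for_disconnect(self):
        raise NotImplementedError  # return when the session is lost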

@Jc2k (Owner) commented Feb 14, 2022

I think BLE will always be connect on demand, as those things really like to be disconnected. There's like a 10-20s window where you can reuse the existing session. Subscribe there will be "subscribe if connected, otherwise remember for when you connect".

@Jc2k (Owner) commented Feb 14, 2022

I imagine CoAP is somewhat similar to the existing IP backend, but as there is no "connection" in UDP it's mostly s/connection/session/. The "reconnect loop" could be something that sends one of those ping packets you talked about and re-runs pair-verify if the ping fails?
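To sketch that idea: send_ping, pair_verify, and resubscribe below are placeholders for whatever the CoAP connection actually exposes, and the interval is arbitrary:

import asyncio
import logging

logger = logging.getLogger(__name__)

async def session_keepalive(connection, interval=60.0):
    """Ping the session periodically; redo pair-verify when the ping fails."""
    while True:
        await asyncio.sleep(interval)
        try:
            await connection.send_ping()
        except Exception:
            logger.warning("Ping failed, re-establishing CoAP session")
            # There is no UDP connection to tear down, so just re-run
            # pair-verify and replay the subscriptions on the new session.
            await connection.pair_verify()
            await connection.resubscribe()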

@roysjosh (Contributor, Author)

Found the source of the problem. I added the connection locks to the CoAP path because I was seeing overlapping connection attempts... the culprit is add_watchable_characteristics. It calls .subscribe 5x for each bulb, and then async_process_entity_map calls .subscribe once. I think I'm seeing such long timeouts because the first 5 trip the lock, each in turn, and the last one from async_process_entity_map has to wait for them all to finish.
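One way to avoid that pile-up (not necessarily the fix that was applied) is to share a single in-flight connect attempt between concurrent callers instead of letting each one retry behind the lock; a rough sketch with illustrative names:

import asyncio

class SharedConnect:
    """Let concurrent callers await one connect attempt instead of queueing."""

    def __init__(self, do_connect):
        self._do_connect = do_connect  # coroutine function that pair-verifies
        self._lock = asyncio.Lock()
        self._attempt = None

    async def ensure_connected(self):
        async with self._lock:
            if self._attempt is None or self._failed(self._attempt):
                # Only the first caller starts a connect; the subscribe calls
                # from each bulb would all await this same attempt.
                self._attempt = asyncio.create_task(self._do_connect())
            attempt = self._attempt
        # Awaiting outside the lock: a failure propagates to every waiter
        # once, rather than each waiter timing out in turn.
        await attempt

    @staticmethod
    def _failed(task):
        return task.done() and (task.cancelled() or task.exception() is not None)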
