
Add back a timeout for service calls run from scripts #98501

Closed
wants to merge 2 commits into from

Conversation


@allenporter allenporter commented Aug 16, 2023

Proposed change

Add back a timeout when running service calls from scripts. This prevents integration service calls that have no timeout from hanging indefinitely. Instead, they will fail with an explicit timeout error.

The timeout was removed in #94657, where it was previously baked into the service call. The old behavior on timeout was to simply return from the function anyway (!), effectively swallowing the error. Now we explicitly raise an error instead.
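The contrast between the two behaviors can be sketched with a minimal asyncio example (illustrative names only, not the actual Home Assistant implementation):

```python
import asyncio

SERVICE_CALL_TIMEOUT = 10  # seconds; hypothetical constant for illustration


async def slow_service_call() -> None:
    """Stand-in for an integration service handler that hangs."""
    await asyncio.sleep(60)


async def run_service_step_old(timeout: float = SERVICE_CALL_TIMEOUT) -> None:
    # Old behavior: on timeout, return as if the call succeeded,
    # silently swallowing the error.
    try:
        await asyncio.wait_for(slow_service_call(), timeout)
    except asyncio.TimeoutError:
        return  # (!) error ignored


async def run_service_step_new(timeout: float = SERVICE_CALL_TIMEOUT) -> None:
    # New behavior: let TimeoutError propagate so the script step
    # fails with an explicit, visible error.
    await asyncio.wait_for(slow_service_call(), timeout)
```

With the new behavior the script step surfaces the failure instead of quietly continuing.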

This is an example of what a slow service call now looks like in the UI:
[screenshot: the explicit timeout error shown in the UI]

We may also want to update these steps with timeouts:

  • _async_device_step
  • _async_scene_step

There are likely a few other places where we want to add an explicit timeout, which can happen in follow-up PRs (for example, device actions may call a service within the integration). As a result, we may want to push this to a higher level (in the future?), but then we'd need a way to disable it for some actions.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

@allenporter allenporter added this to the 2023.8.3 milestone Aug 16, 2023
@allenporter allenporter requested a review from a team as a code owner August 16, 2023 07:39
@allenporter allenporter marked this pull request as draft August 16, 2023 15:03

@frenck frenck left a comment


I'm not in favor of restoring this behavior. We should look into how we can detect these cases and resolve them better.

@home-assistant

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@home-assistant home-assistant bot marked this pull request as draft August 16, 2023 16:20
@frenck frenck removed this from the 2023.8.3 milestone Aug 16, 2023
@allenporter

I'm not in favor of restoring this behavior. We should look into how we can detect these cases and resolve them better.

Can you elaborate?

As an example, go into the Z-Wave service and add a timeout: is that what you mean? (And repeat for all integrations.) That is, do you see this as an integration bug that shouldn't have a central defense?


frenck commented Aug 16, 2023

As an example, go into the Z-Wave service and add a timeout: is that what you mean? (And repeat for all integrations.)

Maybe? Depends on the case. The question is: should it time out for Z-Wave? Or is there something wrong that it doesn't return to begin with?

That is, do you see this as an integration bug?

Yes, for sure. Just timing out the service will not fix/handle the background task (actually, this will mask it).
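That masking effect can be sketched with plain asyncio (illustrative names, not Home Assistant code): cancelling the timed-out call does not stop work the handler already scheduled in the background.

```python
import asyncio

results: list[str] = []
_background_tasks: set[asyncio.Task] = set()  # strong refs to spawned tasks


async def background_work() -> None:
    await asyncio.sleep(0.2)
    results.append("background work kept running after the timeout")


async def stuck_integration_call() -> None:
    """A handler that fires off background work and then never returns."""
    _background_tasks.add(asyncio.create_task(background_work()))
    await asyncio.sleep(60)  # hangs well past any reasonable timeout


async def main() -> None:
    try:
        await asyncio.wait_for(stuck_integration_call(), timeout=0.05)
    except asyncio.TimeoutError:
        results.append("service call timed out")
    # The script has moved on, but the orphaned task finishes anyway.
    await asyncio.sleep(0.3)


asyncio.run(main())
```

The timeout makes the script responsive again, but the underlying work is still running unsupervised, which is why fixing the integration is the real solution.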

that shouldn't have a central defense.

This is not a central defense IMHO; this is a timeout of 10 sec, which may not be long at all, especially for services that return responses and need longer (for example, a more complex database query).

Solving this could be done in multiple ways; we actually have a debugger built in (allowing pausing and stepping through automations), which we don't use.

From that perspective, we should be able to tell the user the automation is still running and why that is, dump information from it, and even offer a way to cancel the operation from the debugger.

As a final line of defense, we could consider a ridiculously long timeout (but that still won't solve the underlying issues).

@allenporter

Thanks, that's helpful context. I'll proceed with helping users connect with integration owners to fix the root causes (I was planning to do that eventually anyway, as I agree this isn't fixing the root cause; I was assuming a different stance).

In my experience there are two deadline approaches I've seen:
(1) Top-down deadline propagation based on the context of the caller (which knows best how long the caller is expecting to wait).
(2) Deadlines based on the specific work item at the leaf (which knows best about the operation it is performing).

It sounds like we're saying we should go with the latter -- or, in the case of scripts, (1) would be a "very, very long time". I suspect we may have other cases that do (1), like websocket or rest, but I haven't looked closely. (Question I'm considering: do we want a combination, or should we remove all top-down deadlines?)

We may want to update the integration quality scale to speak to our expectations around (2), with integrations setting reasonable timeouts as a detail of how they handle device unavailability.
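A minimal sketch of the two deadline styles (all names hypothetical, not Home Assistant code):

```python
import asyncio
import time

LEAF_TIMEOUT = 0.1  # (2) the leaf operation knows how long it should take


async def device_io() -> str:
    await asyncio.sleep(0.01)  # stand-in for real device I/O
    return "ok"


async def leaf_operation() -> str:
    # (2) Leaf deadline: the operation applies its own timeout.
    return await asyncio.wait_for(device_io(), timeout=LEAF_TIMEOUT)


async def run_with_deadline(deadline: float) -> str:
    # (1) Top-down propagation: the caller computes the remaining budget
    # from one absolute deadline and passes it down at each await point.
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise asyncio.TimeoutError("deadline already exceeded")
    return await asyncio.wait_for(leaf_operation(), timeout=remaining)


result = asyncio.run(run_with_deadline(time.monotonic() + 1.0))
```

When both are in place, the effective timeout is simply the smaller of the propagated budget and the leaf's own limit.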

Anyway, for the specific issue, as you suggest I'll connect the end-user reports with the specific integrations that are hanging (which used to just proceed silently), rather than treating this as an automation-wide issue to fix. Thanks.


jjmerri commented Aug 18, 2023

Are there follow-up actions on this? The old behavior was, in my opinion, a better user experience than a perpetually running script. Now my scripts hang and I have to log in to kill them just so I can try to run them again. Setting the mode to parallel is an option for me, but I don't think that is really a solution.

@allenporter

Are there follow-up actions on this? The old behavior was, in my opinion, a better user experience than a perpetually running script. Now my scripts hang and I have to log in to kill them just so I can try to run them again. Setting the mode to parallel is an option for me, but I don't think that is really a solution.

Yes, there are follow-up actions, as discussed above. We need to, and will, fix the broken integrations.

@allenporter

Let's keep discussion on the issue.
