Add back a timeout for service calls run from scripts #98501
I'm not in favor of restoring this behavior. We should look into how we can detect these cases and resolve them better.
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍
Can you elaborate? As an example, go into the zwave service and add a timeout: Is that what you mean? (And repeat for all integrations.) That is, do you see this as an integration bug that shouldn't have a central defense?
Maybe? Depends on the case. The question is: should it time out for Z-Wave? Or is there something wrong that it doesn't return to begin with?
Yes, for sure. Just timing out the service will not fix/handle the background task (actually, this will mask it).
This is not a central defense IMHO; this is a timeout of 10 seconds, which may not be long at all, especially for services with return responses that need longer (for example, a heavier database query). Solving this could be done in multiple ways. We actually have a debugger built in (allowing pausing and stepping through automations), which we don't use. From that perspective, we should be able to tell the user the automation is still running and why that is, dump information from it, and even offer a way to cancel the operation from the debugger. As a final line of defense, we could consider a ridiculously long timeout (but that still won't solve the underlying issues).
Thanks, that's helpful context. I'll proceed with helping users connect with integration owners to fix the root causes (I was planning to do that eventually anyway, as I agree this isn't fixing the root cause; I was assuming some other stance).

In my experience there are two deadline approaches I've seen: I think it sounds like we're saying we should go with the latter -- or in the case of script, (1) would be "very very very very very long time". I suspect we may have other cases that do (1), like websocket or rest, but I haven't looked closely. (Question I'm considering: do we want a combination, or do we remove all top-down deadlines?) We may want to update the integration quality scale to speak to our expectations around (2) and integrations setting reasonable timeouts as a detail about how they handle device unavailability.

Anyway, for the specific issue, as you suggest, I'll connect the end user reports with the correct integrations that are hanging (which used to just proceed silently) rather than considering this an automation-wide issue to fix. Thanks.
Are there follow-up actions on this? The old behavior was a better user experience, in my opinion, than perpetually running the script. Now my scripts hang and I have to log in to kill them just so I can try to run them again. Setting the mode to parallel is an option for me, but I don't think that is really a solution.
Yes, there are follow-up actions: what we said above. We need to, and will, fix the broken integrations.
Let's keep discussion on the issue.
Proposed change
Add back a timeout when running service calls from scripts. This prevents integration service calls that have no timeout from hanging indefinitely. Instead, they will fail with an explicit timeout error.
The timeout was removed in #94657 where it was previously baked into the service call. The old behavior was: When the timeout is reached, just return from the function anyway (!) which would effectively swallow/ignore the error. Now we explicitly raise an error.
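To make the difference concrete, here is a minimal sketch of the two behaviors using plain `asyncio`. It is not the Home Assistant implementation: `slow_service`, `call_old_style`, `call_new_style`, and the shortened `SERVICE_CALL_LIMIT` value are hypothetical names chosen for illustration, and the assumption that the old code left the call running in the background is inferred from the discussion rather than taken from the source.

```python
from __future__ import annotations

import asyncio

# Hypothetical limit, shortened so the demo finishes quickly.
SERVICE_CALL_LIMIT = 0.05


async def slow_service() -> str:
    """Stand-in for an integration service call that takes too long."""
    await asyncio.sleep(1)
    return "done"


async def call_old_style() -> str | None:
    """Old behavior (assumed): stop waiting at the deadline and return silently."""
    task = asyncio.ensure_future(slow_service())
    done, _pending = await asyncio.wait({task}, timeout=SERVICE_CALL_LIMIT)
    if task in done:
        return task.result()
    # The task is still pending here and the caller gets no error.
    return None


async def call_new_style() -> str:
    """Proposed behavior: cancel the call and raise an explicit timeout error."""
    return await asyncio.wait_for(slow_service(), timeout=SERVICE_CALL_LIMIT)


async def demo_timeout_styles() -> tuple[str | None, str]:
    old = await call_old_style()
    try:
        new = await call_new_style()
    except asyncio.TimeoutError:
        new = "timeout"
    return old, new
```

In the old style the caller cannot tell a slow call from a call that returned nothing, while in the new style the script step fails visibly with a timeout error.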
This is an example of what a slow service call now looks like in the UI:
We may also want to update these steps with timeouts:
There are likely a few other places where we want to add an explicit timeout, which can happen in follow-up PRs (for example, device actions may be calling a service within the integration). As a result, we may want to push this to a higher level in the future; however, we'd then need a way to disable it for some actions.
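One way a higher-level timeout with a per-action opt-out could look is sketched below. This is only an illustration under assumptions: `run_action`, `DEFAULT_ACTION_TIMEOUT`, and the convention that `timeout=None` disables the deadline are all hypothetical and do not correspond to an existing Home Assistant API.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable

# Hypothetical default deadline, shortened so the demo finishes quickly.
DEFAULT_ACTION_TIMEOUT = 0.05


async def run_action(
    action: Callable[[], Awaitable[str]],
    timeout: float | None = DEFAULT_ACTION_TIMEOUT,
) -> str:
    """Run one action under a central deadline; timeout=None opts out."""
    if timeout is None:
        # Actions that are expected to run long skip the deadline entirely.
        return await action()
    return await asyncio.wait_for(action(), timeout=timeout)


async def long_running_action() -> str:
    """Stand-in for an action that legitimately needs more time."""
    await asyncio.sleep(0.2)
    return "finished"


async def demo_opt_out() -> list[str]:
    results = []
    try:
        results.append(await run_action(long_running_action))
    except asyncio.TimeoutError:
        results.append("timed out")
    # The same action succeeds once the central deadline is disabled.
    results.append(await run_action(long_running_action, timeout=None))
    return results
```

The trade-off is the one raised in the review: a central default catches hanging calls everywhere, but every action that legitimately runs long needs a way to opt out.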