
Add back a timeout for service calls run from scripts #98501

Closed
wants to merge 2 commits into from

Conversation


@allenporter allenporter commented Aug 16, 2023

Proposed change

Add back a timeout when running service calls from scripts. This prevents integration service calls that have no timeout from hanging indefinitely. Instead, they will fail with an explicit timeout error.

The timeout was removed in #94657, where it was previously baked into the service call. The old behavior on timeout was to simply return from the function anyway (!), effectively swallowing the error. Now we explicitly raise an error instead.
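The contrast between the two behaviors can be sketched with a minimal asyncio example (illustrative names only, not the actual Home Assistant implementation):

```python
import asyncio

SERVICE_CALL_TIMEOUT = 10  # seconds; hypothetical constant for illustration


async def slow_service_call() -> None:
    """Stand-in for an integration service handler that hangs."""
    await asyncio.sleep(60)


async def run_service_step_old(timeout: float = SERVICE_CALL_TIMEOUT) -> None:
    # Old behavior: on timeout, return as if the call succeeded,
    # silently swallowing the error.
    try:
        await asyncio.wait_for(slow_service_call(), timeout)
    except asyncio.TimeoutError:
        return  # (!) error ignored


async def run_service_step_new(timeout: float = SERVICE_CALL_TIMEOUT) -> None:
    # New behavior: let TimeoutError propagate so the script step
    # fails with an explicit, visible error.
    await asyncio.wait_for(slow_service_call(), timeout)
```

With the new behavior the script step surfaces the failure instead of quietly continuing.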

This is an example of what a slow service call now looks like in the UI:
[screenshot: the explicit timeout error shown in the UI]

We may also want to update these steps with timeouts:

  • _async_device_step
  • _async_scene_step

There are likely a few other places where we want to add an explicit timeout, which can happen in follow-up PRs (for example, device actions may call a service within the integration). As a result, we may want to push this to a higher level (in the future?), but then we'd need a way to disable it for some actions.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

@allenporter allenporter added this to the 2023.8.3 milestone Aug 16, 2023
@allenporter allenporter requested a review from a team as a code owner August 16, 2023 07:39
@allenporter allenporter marked this pull request as draft August 16, 2023 15:03

@frenck frenck left a comment


I'm not in favor of restoring this behavior. We should look into how we can detect these cases and resolve them better.

@home-assistant

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@home-assistant home-assistant bot marked this pull request as draft August 16, 2023 16:20
@frenck frenck removed this from the 2023.8.3 milestone Aug 16, 2023
@allenporter

I'm not in favor of restoring this behavior. We should look into how we can detect these cases and resolve them better.

Can you elaborate?

As an example, go into the Z-Wave service and add a timeout: is that what you mean? (And repeat for all integrations.) That is, do you see this as an integration bug that shouldn't have a central defense?


frenck commented Aug 16, 2023

As an example, go into the Z-Wave service and add a timeout: is that what you mean? (And repeat for all integrations.)

Maybe? Depends on the case. The question is: should it time out for Z-Wave? Or is there something wrong that it doesn't return to begin with?

That is, do you see this as an integration bug?

Yes, for sure. Just timing out the service will not fix/handle the background task (actually, this will mask it).
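That masking effect can be sketched with plain asyncio (illustrative names, not Home Assistant code): cancelling the timed-out call does not stop work the handler already scheduled in the background.

```python
import asyncio

results: list[str] = []
_background_tasks: set[asyncio.Task] = set()  # strong refs to spawned tasks


async def background_work() -> None:
    await asyncio.sleep(0.2)
    results.append("background work kept running after the timeout")


async def stuck_integration_call() -> None:
    """A handler that fires off background work and then never returns."""
    _background_tasks.add(asyncio.create_task(background_work()))
    await asyncio.sleep(60)  # hangs well past any reasonable timeout


async def main() -> None:
    try:
        await asyncio.wait_for(stuck_integration_call(), timeout=0.05)
    except asyncio.TimeoutError:
        results.append("service call timed out")
    # The script has moved on, but the orphaned task finishes anyway.
    await asyncio.sleep(0.3)


asyncio.run(main())
```

The timeout makes the script responsive again, but the underlying work is still running unsupervised, which is why fixing the integration is the real solution.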

that shouldn't have a central defense.

This is not a central defense IMHO; this is a timeout of 10 sec, which may not be long at all, especially for services that return responses and need longer (for example, a more complex database query).

Solving this could be done in multiple ways; we actually have a debugger built in (allowing pausing and stepping through automations), which we don't use.

From that perspective, we should be able to tell the user the automation is still running and why that is, dump information from it, and even offer a way to cancel the operation from the debugger.

As a final line of defense, we could consider a ridiculously long timeout (but that still won't solve the underlying issues).

@allenporter

Thanks, that's helpful context. I'll proceed with helping users connect with integration owners to fix the root causes (I was planning to do that eventually anyway, as I agree this isn't fixing the root cause; I was assuming a different stance).

In my experience there are two deadline approaches I've seen:
(1) Top-down deadline propagation based on the context of the caller (which knows best how long the caller is expecting to wait).
(2) Deadlines based on the specific work item at the leaf (which knows best about the operation it is performing).

It sounds like we're saying we should go with the latter -- or, in the case of scripts, (1) would be a "very, very long time". I suspect we may have other cases that do (1), like websocket or rest, but I haven't looked closely. (Question I'm considering: do we want a combination, or should we remove all top-down deadlines?)

We may want to update the integration quality scale to speak to our expectations around (2), with integrations setting reasonable timeouts as a detail of how they handle device unavailability.
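A minimal sketch of the two deadline styles (all names hypothetical, not Home Assistant code):

```python
import asyncio
import time

LEAF_TIMEOUT = 0.1  # (2) the leaf operation knows how long it should take


async def device_io() -> str:
    await asyncio.sleep(0.01)  # stand-in for real device I/O
    return "ok"


async def leaf_operation() -> str:
    # (2) Leaf deadline: the operation applies its own timeout.
    return await asyncio.wait_for(device_io(), timeout=LEAF_TIMEOUT)


async def run_with_deadline(deadline: float) -> str:
    # (1) Top-down propagation: the caller computes the remaining budget
    # from one absolute deadline and passes it down at each await point.
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise asyncio.TimeoutError("deadline already exceeded")
    return await asyncio.wait_for(leaf_operation(), timeout=remaining)


result = asyncio.run(run_with_deadline(time.monotonic() + 1.0))
```

When both are in place, the effective timeout is simply the smaller of the propagated budget and the leaf's own limit.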

Anyway, for the specific issue, as you suggest I'll connect the end-user reports with the specific integrations that are hanging (which used to just proceed silently), rather than treating this as an automation-wide issue to fix. Thanks.


jjmerri commented Aug 18, 2023

Are there follow-up actions on this? The old behavior was, in my opinion, a better user experience than a perpetually running script. Now my scripts hang and I have to log in to kill them just so I can try to run them again. Setting the mode to parallel is an option for me, but I don't think that is really a solution.

@allenporter

Are there follow-up actions on this? The old behavior was, in my opinion, a better user experience than a perpetually running script. Now my scripts hang and I have to log in to kill them just so I can try to run them again. Setting the mode to parallel is an option for me, but I don't think that is really a solution.

Yes, there are follow-up actions, as discussed above. We need to, and will, fix the broken integrations.

@allenporter

Let's keep discussion on the issue.
