RFC: Add system health component #20436

Merged
balloob merged 14 commits into dev from system-health on Jan 30, 2019

Conversation

@balloob
Member

balloob commented Jan 26, 2019

Description:

Was talking with Tinkerer, and we came to the conclusion that we should prioritize adding this component as it will help with helping (how meta).

The goal is to have a place in the UI that shows information about the machine. This will help people diagnose problems.

Some RFC about this implementation:

  • Right now, info is just a callback function (not awaitable) that returns some static info.
  • The idea is to add a diagnostics view that allows components to actually run some tests (for example: is the network connected, can I reach the internet, is HA Cloud reachable). I am not sure whether this should be part of the same info registration method or a new one.
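
As a rough illustration of the first point, here is a minimal, self-contained sketch of a registry of synchronous info callbacks. It does not use Home Assistant's actual API: `register_info`, `collect_info`, and `_INFO_CALLBACKS` are hypothetical names, and the example values come from the UI mock-up below.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a component domain to its info callback.
InfoCallback = Callable[[], Dict[str, str]]
_INFO_CALLBACKS: Dict[str, InfoCallback] = {}


def register_info(domain: str, info_callback: InfoCallback) -> None:
    """Register a plain (non-awaitable) callback that returns static info."""
    _INFO_CALLBACKS[domain] = info_callback


def collect_info() -> Dict[str, Dict[str, str]]:
    """Call every registered callback and group the results by domain."""
    return {domain: callback() for domain, callback in _INFO_CALLBACKS.items()}


# Illustrative values only, taken from the UI mock-up in this description.
register_info("system", lambda: {"os": "HassOS", "version": "2.6"})
register_info("lovelace", lambda: {"storage": "YAML"})

info = collect_info()
```

Because the callbacks are synchronous, anything they return has to be cheap to compute; tests that touch the network would need a different (awaitable) registration path.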

UI will look something like this:

  • System
    • OS: HassOS
    • Version: 2.6
  • Lovelace
    • Storage: YAML

Related issue (if applicable): fixes home-assistant/architecture#114

Pull request in home-assistant.io with documentation (if applicable): TODO home-assistant/home-assistant.io#<home-assistant.io PR number goes here>

Example entry for configuration.yaml (if applicable):

system_health:

Checklist:

  • The code change is tested and works locally.
  • Local tests pass with tox. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.

If the code does not interact with devices:

  • Tests have been added to verify that the new code works.
@arsaboo

Contributor

arsaboo commented Jan 26, 2019

We should also include (custom) addons/components configured for diagnosis.

@DubhAd

Contributor

DubhAd commented Jan 26, 2019

> We should also include (custom) addons/components configured for diagnosis.

Yes please. Despite the warnings in the logs, lots of people overlook custom components (or heck, forget they've installed them).

@balloob

Member Author

balloob commented Jan 26, 2019

Components will be added in a future PR

@pvizeli

Member

pvizeli commented Jan 27, 2019

I think the actions need to be a separate part, so you can see all available checks and trigger a test to see if there is a problem. For example, Hue registers a health check function. We can click on it to trigger the registered function, which checks whether the hub is connected (or other things) and returns one of three states (ok, warning, error) with a short text output. We could also run these checks every 6 hours and 1 hour after startup.

So the health component has two parts: static system information (which is usually very static) and a health check function per domain.

@balloob

Member Author

balloob commented Jan 28, 2019

@pvizeli so what about info that we could run as an action but could also retrieve statically? For example, we will know whether we're connected to HA Cloud.

One other thing I hope other components will add to this view is last interaction: when was the last interaction with Hue, when was the last error, etc.

balloob added some commits Jan 28, 2019

@balloob

Member Author

balloob commented Jan 29, 2019

What am I doing… this needs to be WS commands instead of HTTP views🤦‍♂️
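
To make the difference concrete, here is a simplified sketch of command-style dispatch: each message carries an `id` and a `type`, and a handler is looked up by `type`. The `system_health/info` command name appears in a log later in this thread; `register_command` and `handle_message` are hypothetical helpers and the response shape is a simplified assumption, not the actual websocket API code.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Any]
_COMMANDS: Dict[str, Handler] = {}  # hypothetical command table


def register_command(command_type: str, handler: Handler) -> None:
    """Associate a command type such as 'system_health/info' with a handler."""
    _COMMANDS[command_type] = handler


def handle_message(msg: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch one incoming message and build the matching response."""
    handler = _COMMANDS.get(msg["type"])
    if handler is None:
        return {
            "id": msg["id"],
            "type": "result",
            "success": False,
            "error": {"code": "unknown_command", "message": "Unknown command."},
        }
    return {"id": msg["id"], "type": "result", "success": True, "result": handler(msg)}


# Illustrative handler returning static info.
register_command("system_health/info", lambda msg: {"lovelace": {"storage": "YAML"}})

ok = handle_message({"id": 1, "type": "system_health/info"})
unknown = handle_message({"id": 2, "type": "does_not/exist"})
```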

balloob added some commits Jan 29, 2019

@pvizeli

Member

pvizeli commented Jan 29, 2019

The problem is that this code works nicely only if you have no issue. You now call every callback each time you fetch data for the frontend.

For some components, you need to check data against an external API or hardware. If there is an issue, you run into the default timeout. Also, some system APIs, like Docker, eat a lot of resources and can block other processes.

I would prefer that we run the health check periodically, e.g. every hour, and that the frontend sees the cached data of the last health check, with an option to trigger a new health check knowing that it can take up to 30-60 seconds until you see the result. That also allows us to slow down the checks and not trigger them all at once.

With this mechanism, we can later add things like triggers based on background checks or a history of when the system had issues.
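
A minimal sketch of that caching idea, under stated assumptions: reads only ever return the last stored result, while a background scheduler (not shown) or a user-triggered action calls `refresh()`. `HealthCache`, `CHECK_INTERVAL`, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional, Tuple

CHECK_INTERVAL = 3600.0  # seconds; "every hour" as suggested above


class HealthCache:
    """Keep the last health check result so that reads never run the check."""

    def __init__(
        self, check: Callable[[], str], clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._check = check
        self._clock = clock
        self._last: Optional[Tuple[float, str]] = None

    def latest(self) -> Optional[Tuple[float, str]]:
        """Return (timestamp, result) of the last run, or None if never run."""
        return self._last

    def is_due(self) -> bool:
        """Tell a background scheduler whether the interval has elapsed."""
        if self._last is None:
            return True
        return self._clock() - self._last[0] >= CHECK_INTERVAL

    def refresh(self) -> str:
        """Run the (possibly slow) check now and store its result."""
        result = self._check()
        self._last = (self._clock(), result)
        return result


calls = []


def slow_check() -> str:
    calls.append(1)  # count how often the expensive check really runs
    return "ok"


now = [0.0]  # fake clock so the example does not have to wait an hour
cache = HealthCache(slow_check, clock=lambda: now[0])
cache.refresh()
first = cache.latest()
now[0] = 10.0
due_early = cache.is_due()
now[0] = 3600.0
due_later = cache.is_due()
```

Reading `latest()` any number of times leaves `calls` at one entry, which is the property the comment above asks for: the frontend never triggers the expensive check just by loading the page.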

@balloob

Member Author

balloob commented Jan 29, 2019

@pvizeli The info command is for static info: firmware version, last interaction, connected to cloud, Lovelace storage mode. A future PR will add a diagnostics command that diagnoses things on demand when the user clicks the button.

@balloob

Member Author

balloob commented Jan 29, 2019

Very confused: I can't get the tests to pass on CI, but they pass locally. The mock is not getting applied.

balloob added some commits Jan 29, 2019

@pvizeli

Member

pvizeli commented Jan 30, 2019

If I need to grab this data from a device or, in the case of Hass.io, from the Supervisor, the call runs into the API timeout if there is a problem. But you are right: for an integration with a running connection, like the cloud, it works perfectly.

That ends up as: if you see the health data, your system works as it should; otherwise, you have an issue.

@balloob

Member Author

balloob commented Jan 30, 2019

Well, how can we do it otherwise? We give each component up to 5 seconds to get the data.
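
A sketch of how a per-component time limit could work, assuming awaitable info providers and using the standard library's `asyncio.wait_for` (the merged code may use a different timeout helper). `gather_info`, `_info_with_timeout`, and the error payload are hypothetical; the demo uses a much shorter timeout so it finishes quickly.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

INFO_TIMEOUT = 5  # seconds per component, as stated above

InfoProvider = Callable[[], Awaitable[Dict[str, Any]]]


async def _info_with_timeout(get_info: InfoProvider, timeout: float) -> Dict[str, Any]:
    """Fetch one component's info, replacing slow or failing calls with an error."""
    try:
        return await asyncio.wait_for(get_info(), timeout)
    except asyncio.TimeoutError:
        return {"error": "Fetching info timed out"}
    except Exception as err:  # one broken component must not break the others
        return {"error": str(err)}


async def gather_info(
    providers: Dict[str, InfoProvider], timeout: float = INFO_TIMEOUT
) -> Dict[str, Dict[str, Any]]:
    """Query all components concurrently, each with its own time limit."""
    domains = list(providers)
    results = await asyncio.gather(
        *(_info_with_timeout(providers[domain], timeout) for domain in domains)
    )
    return dict(zip(domains, results))


async def fast() -> Dict[str, Any]:
    return {"storage": "YAML"}


async def stuck() -> Dict[str, Any]:
    await asyncio.sleep(1)  # simulates a device that does not answer in time
    return {"never": "returned"}


demo = asyncio.run(gather_info({"lovelace": fast, "hue": stuck}, timeout=0.05))
```

Running the providers concurrently means the whole request is bounded by roughly one timeout rather than one timeout per component.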

@balloob

Member Author

balloob commented Jan 30, 2019

I want this to be part of the beta, so I will merge it. We can discuss and change things later, as it's an internal implementation.

@balloob balloob merged commit cb07ea0 into dev Jan 30, 2019

6 checks passed

  • Hound: No violations found. Woof!
  • WIP: ready for review
  • cla-bot: Everyone involved has signed the CLA
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • continuous-integration/travis-ci/push: The Travis CI build passed
  • coverage/coveralls: Coverage decreased (-0.09%) to 93.028%

@delete-merged-branch delete-merged-branch bot deleted the system-health branch Jan 30, 2019

@Brianckramer


Brianckramer commented Jan 31, 2019

https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=147781&start=50#p972790

For those running on a Pi, this might be a good check to run and report, as undervoltage can lead to SD card problems and throttling.

@balloob balloob referenced this pull request Feb 6, 2019

Merged

0.87.0 #20794

@vlad36N


vlad36N commented Feb 7, 2019

Added system_health: in configuration.yaml
Home Assistant 0.87.0
After restart:
System Health error | System Health component is not loaded. Add 'system_health:' to configuration.yaml

In log:
Log Details (ERROR)

Thu Feb 07 2019 13:33:49 GMT-0500 (Eastern Standard Time)
Received invalid command: system_health/info

@fabaff

Member

fabaff commented Feb 12, 2019

@vlad36N, please open an issue.

@home-assistant home-assistant locked as resolved and limited conversation to collaborators Feb 12, 2019
