Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Template editor hangs Home Assistant #40621

Closed
to4ko opened this issue Sep 26, 2020 · 34 comments · Fixed by #40624, #40272, #40647 or #40667
Closed

Template editor hangs Home Assistant #40621

to4ko opened this issue Sep 26, 2020 · 34 comments · Fixed by #40624, #40272, #40647 or #40667

Comments

@to4ko
Copy link

to4ko commented Sep 26, 2020

The problem

testing templates in editor in 115.0-115.3 very tricky -making mistake in template you are getting hanging system, only container restart via portainer helps. below you can find log at the moment of testing. I think test environment should not lead to system reboot at any case

Environment

  • Home Assistant Core release with the issue: 115.3 (same on 115.0 - 115.2)
  • Last working Home Assistant Core release (if known): 144.4
  • Operating environment (OS/Container/Supervised/Core): Ubuntu Supervised (same on deb10 supervised)
  • Integration causing this issue: ??
  • Link to integration documentation on our website:

Problem-relevant configuration.yaml

evaluating this kind of templates leads to system hangs (even not able to save config files from IDE addon)


      - service: notify.telegram
        data_template:
          message: >
            {% set time_limit = as_timestamp(now())-4*60*60 %}
            {% set missing = states | selectattr('attributes.last_seen', 'defined') | selectattr('attributes.linkquality', 'defined')| selectattr('attributes.last_seen','<', time_limit*1000 ) | map(attribute='attributes.device.friendlyName') | unique | list  | join(', ') %}
            {% set missing_count = states | selectattr('attributes.last_seen', 'defined') | selectattr('attributes.linkquality', 'defined')| selectattr('attributes.last_seen','<', time_limit*1000 ) | map(attribute='attributes.device.friendlyName') | unique | list | count %}
            {% set z2m_uptime = ((states('sensor.zigbee2mqtt_bridge_uptime')|int) / 3600)|round(1) %}
            {% set z2m2_uptime = ((states('sensor.zigbee2mqtt2_bridge_uptime')|int) / 3600)|round(1) %}

            {% if missing_count >0 %}{{"\U0001f4f6"}} 4 часа (аптайм z2m: {{z2m_uptime}}ч, z2m2: {{z2m2_uptime}}ч) нет ответа от: {{ missing }}
            {%else%}{{"\U00002705"}}Все устройства доступны!{%endif%}

Traceback/Error logs

image

Additional information

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

The system is likely hanging because the template is listening for all state changed events when we can't successfully evaluate a template as optimistically waiting for the template evaluation to be successful. Normally this isn't an issue but it seems like there is something on your instance that is generating more state changed events than the template engine can keep up with which leads to it being overwhelmed.

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

I added functionality in #40624 to trap all the jinja2 errors which will prevent listening for all states on error. Regardless, the system should have been able to keep up with that so I suspect you have something else going on as well that might be brought to light with a py-spy recording.

@Mariusthvdb

This comment has been minimized.

@to4ko
Copy link
Author

to4ko commented Sep 26, 2020

chief, this templates were working just fine on all my instances up to 114.4. And only in 115 it cause the system to be non-responsive at all (cross tracking from other instances showing that it's offline). Only diff from "general" system that i'm having up to 2500 entities in my system.

this how it looks like from portainer - and it's not dropping down when i do delete the template and press the Reset to Demo ...

it might be hat my templates not the best ones but they were working just fine even on RPI3b+ and this is happening on Xeon system as well as on i3 or celeron...my RPI simple not able to restart container - only power cycle helps...

image

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

chief, this templates were working just fine on all my instances up to 114.4. And only in 115 it cause the system to be non-responsive at all (cross tracking from other instances showing that it's offline).

In 0.114.4 we were not capable of tracking all state changed events from a template so you would not get any listeners when using states | ....

Only diff from "general" system that i'm having up to 2500 entities in my system.

How many state changed events / second does the system fire?

@to4ko
Copy link
Author

to4ko commented Sep 26, 2020

chief, this templates were working just fine on all my instances up to 114.4. And only in 115 it cause the system to be non-responsive at all (cross tracking from other instances showing that it's offline).

In 0.114.4 we were not capable of tracking all state changed events from a template so you would not get any listeners when using states | ....

But it were giving correct answer all the time without any problems. and adding antity_id we were able to set "update interval"

Only diff from "general" system that i'm having up to 2500 entities in my system.

How many state changed events / second does the system fire?

according to my postgre stats - like 15-20 event per seconds.

image

additionally once system starts evaluation template shown at the top, connection to addons got lost in couple of minutes - i was installing py-spy in terminal addon and get connection closed message. no any reaction on menu\buttons\triggers whatever. only container restart.

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

I'm able to replicate the issue with:

{% for var in range(10000) -%}
  {% for var in range(10000) -%}
    {{ var }}
  {%- endfor %}.
{%- endfor %}

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

I can't replicate the issue with the template provided in the example using a system with 1200 states on the busiest system I have. The number of state changes events per second is much much lower than the reference system:

sqlite> select date(time_fired), count(*)/86400.0 from events group by date(time_fired);
2020-09-25|0.662002314814815
2020-09-26|0.821550925925926

@to4ko
Copy link
Author

to4ko commented Sep 26, 2020

I can't replicate the issue with the template provided in the example using a system with 1200 states on the busiest system I have. The number of state changes events per second is much much lower than the reference system:

sqlite> select date(time_fired), count(*)/86400.0 from events group by date(time_fired);
2020-09-25|0.662002314814815
2020-09-26|0.821550925925926

so what can be done?

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

I can't replicate the issue with the template provided in the example using a system with 1200 states on the busiest system I have. The number of state changes events per second is much much lower than the reference system:

sqlite> select date(time_fired), count(*)/86400.0 from events group by date(time_fired);
2020-09-25|0.662002314814815
2020-09-26|0.821550925925926

so what can be done?

At 16.18 state changes per second that is ~1.4 million state changed events fired per day.

That's an impressive amount of state changes.

That means the template needs to check 3.5 billion states per day (in this case its actually 7 billion since its iterated twice).

Listening and iterating all states is never going to be viable on a system with those numbers no matter how much we optimize templates.

The only solution I can come up with is to revert the change that added listeners for all states but that isn't going over very well as other users rely on this.

@bdraco bdraco added the help-wanted Wanna help? Jump in! label Sep 26, 2020
@to4ko
Copy link
Author

to4ko commented Sep 26, 2020

I can't replicate the issue with the template provided in the example using a system with 1200 states on the busiest system I have. The number of state changes events per second is much much lower than the reference system:

sqlite> select date(time_fired), count(*)/86400.0 from events group by date(time_fired);
2020-09-25|0.662002314814815
2020-09-26|0.821550925925926

so what can be done?

At 16.18 state changes per second that is ~1.4 million state changed events fired per day.

That's an impressive amount of state changes.

That means the template needs to check 3.5 billion states per day (in this case its actually 7 billion since its iterated twice).

Listening and iterating all states is never going to be viable on a system with those numbers no matter how much we optimize templates.

The only solution I can come up with is to revert the change that added listeners for all states but that isn't going over very well as other users rely on this.

maybe optionally add entity_id back - if it's specified use it (114 style), if not - than new 115 style

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

maybe optionally add entity_id back - if it's specified use it (114 style), if not - than new 115 style

The template tracking is used in bayesian sensors, template entities, automation triggers, universal media players, and the websocket api (the case you are reporting). This would handle the template entities case but it doesn't solve the issue you are reporting.

@Spirituss

This comment has been minimized.

@tdejneka

This comment has been minimized.

@Spirituss

This comment has been minimized.

@to4ko
Copy link
Author

to4ko commented Sep 26, 2020

maybe optionally add entity_id back - if it's specified use it (114 style), if not - than new 115 style

The template tracking is used in bayesian sensors, template entities, automation triggers, universal media players, and the websocket api (the case you are reporting). This would handle the template entities case but it doesn't solve the issue you are reporting.

it will. my system will not hang on template testing. as of now - i need to restart container each time when i have an error. plus - to be honest - do we really need this "feature"? it was working just fine in the past. and not it's enchanted meaning broken.

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

chief, this templates were working just fine on all my instances up to 114.4. And only in 115 it cause the system to be non-responsive at all (cross tracking from other instances showing that it's offline).

In 0.114.4 we were not capable of tracking all state changed events from a template so you would not get any listeners when using states | ....

Only diff from "general" system that i'm having up to 2500 entities in my system.

How many state changed events / second does the system fire?

I use similar templates and they were correctly working until update to 0.115. It sounds very strange that "your instance that is generating more state changed events than the template engine can keep up with which leads to it being overwhelmed" while all previous HA versions were really capable to work with such templates. It is definitely because of non-optimized template engine code in 0.115 which is still not ready for release.

I also propose to add two features to templates engine:

  1. it should have something like "template calculation timeout" inside and force stop any further calculation (with corresponding error/warning message in log) in case the engine is not capable to complete necessary updates online. It will significantly increase system stability in any case. The system behaviour when any entity can cause its hang up is not correct architectural solution.

jinja doesn't have a timeout capability. Since we have to render templates in async we don't have an easy way of aborting the execution without a signal which may not be safe to do.
pallets/jinja#161

  1. HA documentation tells that any standard integration should work with some common parameters such as 'scan_interval' (home-assistant.io/docs/configuration/platform_options). At the moment HA doesn't allow to use scan_interval in template sensor. For some 'heavy' sensors like above mentioned this option could also help to solve the problem. In case 'scan_interval' is not specified the template engine will update template by default rules.

If we implemented that we would end up with multiple template implementations in template entities. This issue isn't about template entities though its about the template editor hanging in developer tools.

@to4ko
Copy link
Author

to4ko commented Sep 26, 2020

If we implemented that we would end up with multiple template implementations in template entities. This issue isn't about template entities though its about the template editor hanging in developer tools.

but it will became about template templates in general once a user will create this kind of template. dev tool is just on the top of all that

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

I'm come back to this once #40272 is reviewed as it will give more flexibility on how we can handle this without breaking state counting.

@bdraco bdraco linked a pull request Sep 26, 2020 that will close this issue
21 tasks
@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

Assuming it solves all the cases, disabling the all states listener by default and implementing a flag allow_all_states_listener to turn it on seems a reasonable path forward after we can separate counting from iterating in #40272 as it means we don't end up with two implementations in template entities.

@tdejneka

This comment has been minimized.

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

disabling the all states listener by default and implementing a flag allow_all_states_listener to turn it on seems a reasonable path forward

Instead of creating a new option, called allow_all_states_listener, which serves a single purpose, why not simply restore entity_id and allow it to accept states as a valid value? It would allow for more flexibility than a new, one-trick pony.

Restoring entity_id would require the user to know about the flag in advance instead of trying to figure it out after they already have a performance issue. It also doesn't solve this issue.

@Spirituss

This comment has been minimized.

@Spirituss

This comment has been minimized.

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

Since allow_all_states_listener doesn't seem to be desired, I'm going to hold off moving forward until we can come up with a solution that accommodates the use cases without implementing two systems and solves the issue submitted here.

@tdejneka

This comment has been minimized.

@balloob
Copy link
Member

balloob commented Sep 26, 2020

Please do not spam this issue with discussion about entity_id being brought back. I've hid these comments. Next occurrence will result in a ban. This is the issue tracker, not for feature requests. The only thing you guys are achieving is that you are harassing someone who is actually trying to solve the problems. That's not something we accept in this community.

@bdraco
Copy link
Member

bdraco commented Sep 26, 2020

Within the scope of the dev tools we need some type of safety net since users need to be able to safely experiment that covers:

  1. Very expensive templates

Example: massive iteration

{% for var in range(10000) -%}
  {% for var in range(10000) -%}
    {{ var }}
  {%- endfor %}
{%- endfor %}
  1. Templates that re-fire too frequently

Example: iterate all states on busy systems (with the reference point being 1 million state change events per day or 3.5 billion iterates)

{{ states | list }}
  1. We end up with an all listener unexpectedly when we have a template error

Example: any invalid template.

@bdraco bdraco reopened this Sep 26, 2020
@Mariusthvdb
Copy link
Contributor

As posted here https://community.home-assistant.io/t/heads-up-upcoming-breaking-change-in-the-template-integration/223715/372?u=mariusthvdb


    {% set ns = namespace(domains=[]) %}

      {% for d in states|groupby('domain') %}

      {% set ns.domains = ns.domains + [d[0]] %}

      {% endfor %}

      {% set list = ns.domains|join('\n') %}

      {{list if list|count < 255 else

        list|replace('input','inp')|truncate(255,true)}}

115 is a bit too anxious to evaluate on state changes... no issues with these templates up to 114.

just to add to the thread here, for reference, thought it might be useful

and yes, only 'ha core restart' will do. The template testing for unavailable in that same thread, made even that impossible forcing to ha host reboot. Won't copy the thread any further

Why was this hidden? It was on the exact issue of this issue tracker my contribution was aimed. A template killing the instance in dev template tools

Please unhide?

@amelchio
Copy link
Contributor

For 2 the problem is the cumulative expense of execution where a rate limit probably makes sense.

Penny had some thoughts on this while working on the new template engine: home-assistant/architecture#206 – maybe it's better to continue the point 2 discussion there.

@bdraco
Copy link
Member

bdraco commented Sep 27, 2020

For 2 the problem is the cumulative expense of execution where a rate limit probably makes sense.

Penny had some thoughts on this while working on the new template engine: home-assistant/architecture#206 – maybe it's better to continue the point 2 discussion there.

Fantastic 👍 🥇 💯 🚀

This is exactly the help I was looking for 😄 😄 😄

@bdraco bdraco removed the help-wanted Wanna help? Jump in! label Sep 27, 2020
@bdraco bdraco linked a pull request Sep 27, 2020 that will close this issue
21 tasks
@Petro31
Copy link
Contributor

Petro31 commented Sep 28, 2020

@bdraco This may be related, I've noticed that listeners are set to all states when using states.domain.object_id with an object_id that is not present on the system (i.e. a typo). Is this intended or a bug?

@bdraco bdraco reopened this Sep 28, 2020
@bdraco
Copy link
Member

bdraco commented Sep 28, 2020

@bdraco This may be related, I've noticed that listeners are set to all states when using states.domain.object_id with an object_id that is not present on the system (i.e. a typo). Is this intended or a bug?

@Petro31
That is intended because we don't know if the entity is about to be added. If it really is a typo it doesn't cost much since its storing a dict key/value that which has near O(1) performance and since we will never get a state changed event for an entity that doesn't exist its not really an issue.

@Petro31
Copy link
Contributor

Petro31 commented Sep 28, 2020

@bdraco This may be related, I've noticed that listeners are set to all states when using states.domain.object_id with an object_id that is not present on the system (i.e. a typo). Is this intended or a bug?

@Petro31
That is intended because we don't know if the entity is about to be added. If it really is a typo it doesn't cost much since its storing a dict key/value that which has near O(1) performance and since we will never get a state changed event for an entity that doesn't exist its not really an issue.

Great, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment