Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

statistics on 2024.5.0 causing increased CPU usage and high temps #116650

Closed
myromeo opened this issue May 2, 2024 · 45 comments · Fixed by #116743
Closed

statistics on 2024.5.0 causing increased CPU usage and high temps #116650

myromeo opened this issue May 2, 2024 · 45 comments · Fixed by #116743

Comments

@myromeo
Copy link

myromeo commented May 2, 2024

The problem

Updated to 2024.5.0 and noted an increase in CPU usage (around 5% higher) resulting in CPU temps increasing by 30%+. Rolled back to 2024.4.4 and usage and temps immediately dropped.

What version of Home Assistant Core has the issue?

core-2024.5.0

What was the last working version of Home Assistant Core?

core-2024.4.4

What type of installation are you running?

Home Assistant OS

Integration causing the issue

No response

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Can’t see any errors and everything is running.

Additional information

X64, intel i5-4590T on Lenovo m93p tiny.

@bdraco
Copy link
Member

bdraco commented May 2, 2024

@nijel
Copy link
Contributor

nijel commented May 3, 2024

I observe the same symptoms, callgrind is here:

callgrind.out.1714720380542446.txt

If I read it correctly, it seems to be caused by statistics platform sensors in my case.

@home-assistant
Copy link

home-assistant bot commented May 3, 2024

Hey there @ThomDietrich, mind taking a look at this issue as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of statistics can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign statistics Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


statistics documentation
statistics source
(message by IssueLinks)

@bdraco
Copy link
Member

bdraco commented May 3, 2024

If I read it correctly, it seems to be caused by statistics platform sensors in my case.

Spot on
72 million calls to total_seconds

stats
Screenshot 2024-05-03 at 2 26 47 AM

@bdraco
Copy link
Member

bdraco commented May 3, 2024

#110438 will probably fix it, but I never did get time to get back to iterate on that PR

@nijel
Copy link
Contributor

nijel commented May 3, 2024

Any clue why this appeared with the 2024.5? The statistics code doesn't seem to be changed recently.

@bdraco
Copy link
Member

bdraco commented May 3, 2024

Its been a problem for a while, and seems to come up every time the timings change a bit, see #110433

I'm not exactly sure what triggers the issue as I've never been able to fully replicate it to the degree you have in your profile.

@bdraco
Copy link
Member

bdraco commented May 3, 2024

_initialize_from_database could be refactored to only look at the last state to set the attributes instead of doing the calculations inside of _add_state_to_queue for every previous state and throwing it away.

That would probably help as well

@nijel
Copy link
Contributor

nijel commented May 3, 2024

In my case, it seems to be caused by 30 minute average_linear statistics for a frequently updated sensor. I should probably look for a better way to achieve the same.

PS: After changing a bit calculation formula for the originating sensor, I'm down to 12M calls of total_seconds and my system no longer runs fans on full power. Still, this is the most CPU intensive operation my Home Assistant does.

@ayavilevich
Copy link

I am having an issue with statistics since upgrade to 2024.5.0 too.

Here is the CPU graph:
image

It starts normal after a restart but then after a few hours gradually becomes CPU over loaded.

I ran the profiler. High CPU places are:

statistics.py:154(_sum)
~:0(<method 'as_integer_ratio' of 'float' objects>)
statistics.py:287(_exact_ratio)

image

I use statistics a lot but didn't have an issue before.
Going to rollback now.

@bdraco
Copy link
Member

bdraco commented May 3, 2024

I think the performance problems could be solved, by breaking out the code into a new object, and then breaking that out into a package, it could also be Cythonized to do all the calculations in native code.

Another alternative would be to rewrite the expensive operations in numpy

@myromeo
Copy link
Author

myromeo commented May 3, 2024

callgrind.out.1714751198628950.txt

As requested callgrind file. How do I make sense of this file as I have no idea what I am looking at!

@ayavilevich
Copy link

ayavilevich commented May 3, 2024

callgrind.out.1714751198628950.txt

As requested callgrind file. How do I make sense of this file as I have no idea what I am looking at!

See https://www.home-assistant.io/integrations/profiler

For example, you can use SnakeViz to render it and see what functions consume most of the CPU.

Correction, SnakeViz is for the .cprof. There are other software for the callgrind.

@ThomDietrich
Copy link
Contributor

Hey guys, hey @bdraco,
this sounds horrible. Any idea what caused that issue? My first guess would be that this might be related to the introduction of state_reported events, however the statistics component doesn't use that one yet. I'm thinking there might have been another change that lead to this.

@bdraco
Copy link
Member

bdraco commented May 3, 2024

@ayavilevich

callgrind.out.1714751198628950.txt

As requested callgrind file. How do I make sense of this file as I have no idea what I am looking at!

Screenshot 2024-05-03 at 3 39 21 PM

Looks like this case of many sqlite queries.

Can you turn on info (maybe even debug) logging for sqlalchemy as well?

@myromeo
Copy link
Author

myromeo commented May 3, 2024

how do I do that?

@bdraco
Copy link
Member

bdraco commented May 3, 2024

service: logger.set_level
data:
  sqlalchemy: info
Screenshot 2024-05-03 at 3 45 42 PM

@bdraco
Copy link
Member

bdraco commented May 3, 2024

Hey guys, hey @bdraco, this sounds horrible. Any idea what caused that issue? My first guess would be that this might be related to the introduction of state_reported events, however the statistics component doesn't use that one yet. I'm thinking there might have been another change that lead to this.

I'm not really sure whats going on at this point as it seems all the profiles have a lot of database access going on, but AFAICT _initialize_from_database is only ever called once per entity.

@bdraco
Copy link
Member

bdraco commented May 3, 2024

The only thing I can think is the database access is unrelated and because of the busy system otherwise as _initialize_from_database doesn't appear in the profile

@bdraco
Copy link
Member

bdraco commented May 3, 2024

Maybe there is some type of race in _next_to_purge_timestamp and its scheduling the next update too soon so its updating over and over again

self.hass, _scheduled_update, timestamp

@bdraco bdraco changed the title 2024.5.0 causing increased CPU usage and high temps statistics on 2024.5.0 causing increased CPU usage and high temps May 3, 2024
@bdraco
Copy link
Member

bdraco commented May 3, 2024

service: logger.set_level
data:
  sqlalchemy: info
  homeassistant.components.statistics: debug

It would be good to get debug logs for statistics as well so we can tell if its the scheduling update firing too often

@myromeo
Copy link
Author

myromeo commented May 3, 2024

@bdraco I've set that, what next?

@ayavilevich
Copy link

I am not sure if that is relevant, but since you mentioned database activity, I would like to add that I am using MariaDB as the database for the recorder.
Is that common to others who are having the issue?

@bdraco
Copy link
Member

bdraco commented May 3, 2024

@bdraco I've set that, what next?

Please post the log (remove anything sensitive)

@dannytsang
Copy link

home-assistant.log

@bdraco
Copy link
Member

bdraco commented May 3, 2024

Maybe I'm reading this wrong and the time zones aren't aligned, but it looks like it trying to schedule updates in the past which than execute right away

2024-05-03 23:37:20.850 DEBUG (MainThread) [homeassistant.components.statistics.sensor] sensor.home_electricity_power_standard_deviation_over_an_hour: scheduling update at 2024-05-03 23:36:44.809911+00:00

@bdraco
Copy link
Member

bdraco commented May 3, 2024

It looks like its scheduling an update to make sure old states are purged but it doesn't actually purge the state so the next update gets scheduled in the past because its expected it had purged the state, and than it tries to do it again and again

@bdraco
Copy link
Member

bdraco commented May 4, 2024

I've not had any luck replicating the problem.

Can you post your config for statistics?

@bdraco
Copy link
Member

bdraco commented May 4, 2024

#116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

@to4ko
Copy link

to4ko commented May 4, 2024

#116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

applied the fix above...its kinda better, but still way not even close to pre-2024.5.x versions

fix applied this morning (red line)
image

this is how i was before May - massive cpu heating started this Wednesday May 1.
image

@dannytsang
Copy link

I've not had any luck replicating the problem.

Can you post your config for statistics?

Here are all mine. I use packages so they are not all in a single file (if that makes any difference):

sensor:
  - platform: statistics
    name: "Home Electricity Power Average Over An Hour"
    unique_id: 97f8c377-acd2-4d92-a2f2-6bbf5ca0b110
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Average Over A Day"
    unique_id: 74b261eb-6feb-473d-87b4-8eca961be99d
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Average Over A Week"
    unique_id: 47f78be2-bea5-499c-926b-50b48f92f202
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Average Over A Month"
    unique_id: c0b7e702-5468-4cb1-8221-8067cf44c3e1
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      days: 30
  - platform: statistics
    name: "Home Electricity Power Median Over An Hour"
    unique_id: b9256d7e-504c-43e2-ac7d-f77ed5b69aaa
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Median Over A Day"
    unique_id: 82ff2fd7-2694-4171-b0ec-3cafc52205cf
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Median Over A Week"
    unique_id: fe221135-0a10-414b-8954-137ea83aa73f
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Median Over A Month"
    unique_id: 5891e92f-7bbc-4c51-941e-aa005fba93f9
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      days: 30
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over An Hour"
    unique_id: 3b115d2e-f765-4e47-b53d-03ff61ca6ada
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Day"
    unique_id: 3a0d219c-2dc3-4ef5-8964-23e739768499
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Week"
    unique_id: c5a650f5-7d46-4918-a484-485831c7a922
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Month"
    unique_id: 15d61ca0-bf3f-4959-a526-bb4c5b9f851b
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      days: 30
  # Boiler Flow
  - platform: statistics
    name: "Flow Temperature Delta Last 24 Hours"
    unique_id: 96bc7959-a867-4d5b-b733-188bcc25c92c
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Flow Temperature Highest Delta"
    unique_id: aaac8076-2bfa-4f8d-825f-2603051f7eb9
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: value_max
    sampling_size: 43200
    max_age:
      days: 30
  - platform: statistics
    name: "Flow Temperature Lowest Delta"
    unique_id: e4febfc7-2d8b-443d-8aae-b9d5d86e6449
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: value_min
    sampling_size: 43200
    max_age:
      days: 30
  - platform: statistics
    name: "Flow Temperature Delta Difference"
    unique_id: 99a1f9d4-b8bb-456d-840a-e3ad0c997215
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: distance_absolute
    sampling_size: 43200
    max_age:
      days: 30

@to4ko
Copy link

to4ko commented May 4, 2024

@bdraco looks like your fix is working

image

more detailed last 48hrs - cpu temp got normalized around 60degC and CPU load normalized (little build up and drop at 12^36 to be invetigated deeper)
image

@bdraco
Copy link
Member

bdraco commented May 4, 2024

There's still a lot to be optimized in this integration, but to be honest, that's very hard to do without access to a system that's experiencing the issue that I could do live debugging on and rapidly iterate as otherwise I'm optimizing for theoretical use cases instead of practical ones

@ayavilevich
Copy link

@to4ko @dannytsang @myromeo Are you using the default sqlite database for recorder or do you have something custom like MariaDB or mysql?

@myromeo
Copy link
Author

myromeo commented May 4, 2024

@ayavilevich i use the default SQLite.

@dannytsang
Copy link

@ayavilevich MySQL

@ayavilevich
Copy link

I've not had any luck replicating the problem.

Can you post your config for statistics?

Attaching my "platform: statistics" entries. Is "platform: history_stats" relevant as well?
What about automatic statistics that are generated for the metrics? At "http://server/developer-tools/statistics"
Please note that it takes it about 12 hours to ramp up to 100% CPU utilization. As if there is a "CPU leak". So it may be difficult to see this right away.

sensor:
  - platform: statistics
    name: "Pumps Current Mean 10m"
    entity_id: sensor.pumps_current
    state_characteristic: mean
    max_age:
      minutes: 10
  - platform: statistics
    name: "Pumps Current Mean 1d"
    entity_id: sensor.pumps_current
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Clouds mean 8h"
    entity_id: sensor.openweathermap_cloud_coverage
    state_characteristic: mean
    max_age:
      hours: 8
  - platform: statistics
    name: "Wind mean"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind mean 15m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 15
  - platform: statistics
    name: "Wind mean 10m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 10
  - platform: statistics
    name: "Wind mean 30m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 30
  - platform: statistics
    name: "Wind mean 30s"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      seconds: 30
  - platform: statistics
    name: "Wind median"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: median
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind min"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: value_min
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind max"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: value_max
    max_age:
      minutes: 5
  - platform: statistics
    name: "Clouds mean 8h"
    entity_id: sensor.openweathermap_cloud_coverage
    state_characteristic: mean
    max_age:
      hours: 8

@myromeo
Copy link
Author

myromeo commented May 4, 2024

Here are my statistics sensors too, fwiw.

`# Calculate 90 day average CGM

  • platform: statistics
    name: 'Glucose Avg 90d'
    unique_id: "glucose_90d_avg"
    entity_id: sensor.glucose_value
    state_characteristic: mean
    precision: 3
    max_age:
    days: 90

Electricity demand stats - 24 hour period

  • platform: statistics
    name: "Elec Average kWh"
    entity_id: sensor.smart_electricity_meter_whole_home_power
    state_characteristic: mean
    precision: 3
    max_age:
    hours: 24
  • platform: statistics
    name: "Elec Min kWh"
    entity_id: sensor.smart_electricity_meter_whole_home_power
    state_characteristic: value_min
    precision: 3
    max_age:
    hours: 24
  • platform: statistics
    name: "Elec Max kWh"
    entity_id: sensor.smart_electricity_meter_whole_home_power
    state_characteristic: value_max
    precision: 3
    max_age:
    hours: 24

House Temperature (Living Room) stats - 24 hour period

  • platform: statistics
    name: "livingroom_temp_ave"
    entity_id: sensor.livingroom_temp
    state_characteristic: mean
    precision: 3
    max_age:
    hours: 24
  • platform: statistics
    name: "livingroom_temp_min"
    entity_id: sensor.livingroom_temp
    state_characteristic: value_min
    precision: 3
    max_age:
    hours: 24
  • platform: statistics
    name: "livingroom_temp_max"
    entity_id: sensor.livingroom_temp
    state_characteristic: value_max
    precision: 3
    max_age:
    hours: 24
    `

@to4ko
Copy link

to4ko commented May 4, 2024

@to4ko @dannytsang @myromeo Are you using the default sqlite database for recorder or do you have something custom like MariaDB or mysql?

default SQLite

@dannytsang
Copy link

#116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

This has definitely improved my setup. Thanks @bdraco
image

@churchofnoise
Copy link

#116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

Seems to work here too, thank you!

@myromeo
Copy link
Author

myromeo commented May 5, 2024

Also looking good here, nice work.

@bdraco
Copy link
Member

bdraco commented May 5, 2024

Be sure to delete the custom version before updating to 2024.5.2

@to4ko
Copy link

to4ko commented May 6, 2024

Be sure to delete the custom version before updating to 2024.5.2

Should statistics folder be deleted totally or just statistics.py? Please advise for dumb-and-dumber:)

@dannytsang
Copy link

Folder.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants