MemoryError and crash of python and homeassistant docker container when using kef integration #36403

logan893 · 2020-06-03T08:00:09Z

The problem

The python3 process running homeassistant died.
The “homeassistant” container crashed.

Looking at the home-assistant log (/mnt/data/supervisor/homeassistant/home-assistant.log from HassOS) the final entry is at midnight local time (15:00 UTC) and this is reflected in the last-modified timestamp of the file.

2020-06-03 00:00:18 WARNING (zeroconf-Engine-1848242208) [zeroconf] Exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zeroconf/__init__.py", line 1292, in handle_read
MemoryError

Even though the MemoryError is written here, and the process doesn’t fully die until an hour later, the high memory consumption and CPU usage spikes about an hour prior.

At around 13:55:11 UTC is where memory and CPU utilization begins to climb.

With some debug logging enabled, I also see that even “DEBUG” log entries from homeassistant.core cease. Final DEBUG entry is at 22:55:01 local time (13:55:01 UTC) just prior to . The debug entries are only regular sensor update information, “Bus:Handling <Event state_changed[L]> …”.

Over the following hour, for “python3 -m homeassistant --config /config” the CPU is pegged at 100% and memory utilization climbs from a typical mere 335 MB VIRT (200 MB RES) to 1169 MB VIRT (~700 MB RES) in just 5 minutes (13:59:52 UTC). CPU utilization hangs back a bit (possibly because the SWAP isn’t being hammered.)

It kicks up again at 14:05:27 UTC, with 100% CPU and gradually climbing memory usage. It plateaus again at 14:06:33 UTC with 1384 MB VIRT (740 MB RES).

This remains stable until 14:19:12 UTC, going up further to 1652 MB VIRT by 14:20:34 UTC.
Rinse and repeat at 14:37:15 UTC, climbing to 1987 MB VIRT by 14:39:02 UTC.
15:00:02 UTC we climb again, to 2029 MB VIRT by 15:00:20 UTC. This is around the time that MemoryError happens.
The python3 process lives for another hour at the same memory utilization and approximately 15% CPU utilization on average. Then it dies and goes away.

https://community.home-assistant.io/t/memory-exhausted-by-python3-process-process-container-crash/201011/2

Environment

Home Assistant Core release with the issue: I've had similar memory exhaustion issues with home assistant since first installing. 0.109, and several 0.110.x. Lack of information in logs makes it impossible to determine the cause. Most recently 0.110.4, with HassOS 3.13.
Last working Home Assistant Core release (if known): N/A
Operating environment (Home Assistant/Supervised/Docker/venv): Raspberry Pi 3, HassOS based installation from recommended 32-bit image
Integration causing this issue: N/A
Link to integration documentation on our website: N/A

Problem-relevant `configuration.yaml`

First happened with default configuration.yaml, only loaded my Google Home units (4x Home Mini, 1x Nest Mini, 1x Chromecast, 1x Chromecast Audio, 1x JBL speaker) and Philips Hue bridge with one light.
Since then I've added more, but with the lack of logs I cannot say if it's related to any specific configuration or the base Home Assistant core system.

Traceback/Error logs

DEBUG was active on homeassistant.core and it stops producing any logs about an hour prior to MemoryError being output into the home-assistant.log. See the detailed flow of events above.
The memory utilization of python starts to grow at the same time as the homeassistant.core DEBUG logging stops.

MemoryError

Additional information

I've tried to collect as many logs as possible but still cannot see what is triggering this runaway memory usage which results in a crash.

The text was updated successfully, but these errors were encountered:

basnijholt · 2020-06-03T12:17:53Z

There is also #35840, which is likely the same problem.

To summarize (mostly @logan893's findings) :

GitHub issues:

Possibly related topics:

@logan893 posted Memory exhausted by python3 process - process/container crash
@basnijholt posted Whole Home Assistant + OS periodically becomes irresponsive
@Guyanthalas posted HA dies after several days
@almostserious posted Python3 high CPU Usage

Less clear but still possibly related:

@jimford posted "Update Available" - No thank you very much!

bdraco · 2020-06-03T12:57:10Z

Zeroconf may not be the cause, it just may be the first thing that runs out of memory because its constantly processing network traffic.

bdraco · 2020-06-03T12:57:57Z

It would be helpful to get a dump with py-spy as soon as cpu usage starts to uptick https://github.com/benfred/py-spy

logan893 · 2020-06-03T14:36:42Z

Zeroconf may not be the cause, it just may be the first thing that runs out of memory because its constantly processing network traffic.

@bdraco Yes, I completely agree. Memory utilization spikes one hour prior to the MemoryError entry, and the zeroconf related exception may just be the proverbial straw that broke the 32-bit camel's back. Hard to tell from the logs since they were basically silent.

It would be helpful to get a dump with py-spy as soon as cpu usage starts to uptick https://github.com/benfred/py-spy

CPU utilization seems to be directly linked with active growing memory utilization. I would like to be able to see this myself, and I gave py-spy a go. Unfortunately I have not been successful in my attempts to run py-spy in the docker container of my RPi3 with HassOS.

If you know how to achieve this, please let me know.

https://community.home-assistant.io/t/py-spy-in-home-assistant-running-on-rpi-3-with-hassos-based-image/201612

bdraco · 2020-06-03T15:25:14Z

You'll probably have to compile from sources as there are no binaries for the RPi3

logan893 · 2020-06-03T15:36:42Z

You'll probably have to compile from sources as there are no binaries for the RPi3

py-spy? There are armv7 binaries. But it seems I cannot run it properly without adding additional capabilities to the docker container.

logan893 · 2020-06-03T15:37:36Z

Some more curious details.

I recently added some virtual sensors to see what can be captured during these memory exhaustion events. Turns out, not much.

My sensor for host OS memory usage?
Memory used %: final entry reads 38.3% at 22:53:51 -- 70 seconds before memory starts to climb
Memory free: final entry reads 597.1 MB at 22:54:53 -- 18 seconds before

Ping Google DNS?
Set to fire off 4 ping every 60 seconds.
Keeps recording data regularly up until 22:54:41 -- 30 seconds before

Ping my local router?
Set to fire off 5 ping every 30 seconds.
Keeps recording data regularly up until 22:55:01 -- 10 seconds before

During the hour of steady memory growth and no activity, the homeassistant.core (MainThread) debug messages have ceased. The only messages shown in the log still are from pychromecast.socket_client.(Thread-XX, multiple entries), and it only says "Heartbeat timeout, resetting connection" followed by "Connection reestablished!", for all my Google Cast devices in unison. These show up 4 times, with about 15-25 minute interval.

These are all the logs I got out of the system last time it happened.

bdraco · 2020-06-03T15:40:03Z

The arm7 binaries are linked again /lib/ld-linux-armhf.so.3

# ldd ./py-spy 
	/lib/ld-linux-armhf.so.3 (0xb6f24000)
	libc.so.6 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
Error loading shared library ld-linux-armhf.so.3: No such file or directory (needed by ./py-spy)
	libdl.so.2 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
	librt.so.1 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
	libpthread.so.0 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xb6c34000)
	libm.so.6 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
Error relocating ./py-spy: __register_atfork: symbol not found
Error relocating ./py-spy: __snprintf_chk: symbol not found

logan893 · 2020-06-03T15:59:35Z

I see, thank you that is good to know. Then there are not one but two issues preventing me from running py-spy. Even if I were to compile it from source, the docker container still lacks the required capability.

bdraco · 2020-06-03T16:02:58Z

The default docker container support strace

apk add strace

bash-5.0# strace ls
execve("/bin/ls", ["ls"], 0xbe82abb0 /* 22 vars */) = 0
set_tls(0xb6f1f5d0)                     = 0
set_tid_address(0xb6f20184)             = 4256
mprotect(0x58c000, 8192, PROT_READ)     = 0
getuid32()                              = 0
...

bdraco · 2020-06-03T16:15:02Z

https://developers.home-assistant.io/docs/operating-system/debugging/
ssh on 22222
login
docker exec -it homeassistant /bin/bash
apk add cargo
cargo install py-spy

That should work in theory

logan893 · 2020-06-04T01:39:55Z

Thanks @bdraco I will give that a try right away.

logan893 · 2020-06-04T03:22:47Z

Unfortunately the py-spy compile fails with HassOS 3.13.

"dashmap" compile fails with error: use of unstable library feature 'map_get_key_value'

The problem is due to the slightly older version of "rust" available. Version 1.39 is installed alongside cargo from the Alpine 3.11 repo, and this function map_get_key_value is made stable from rust version 1.40.

To force install rust 1.43 apk from Alpine 3.12 would require upgrade of dependencies that would possibly break other things. I haven't found any other way around this.

Perhaps HassOS 4.x will use Alpine 3.12 where rust 1.43 is available, but I don't know how to verify that, nor do I feel comfortable upgrading at this time due to the issues raised for broken installs on Raspberry Pi 3 after attempts to upgrade.

basnijholt · 2020-06-04T08:05:10Z

I've tried to install py-spy via cargo on HassOS 4.8 which resulted in the same compilation error.

However, the binaries work for me.

logan893 · 2020-06-04T08:42:18Z

Is it on a Raspberry Pi 3, and which binaries are you using?

basnijholt · 2020-06-04T08:46:42Z

No, I am using a NUC and the py-spy-v0.3.3-x86_64-unknown-linux-musl.tar.gz file.

basnijholt · 2020-06-09T08:24:03Z

I've had py-spy recording while everything froze up.
This is the output, where I see that aiokef (used in the kef integration and my own creation 😅) mostly occupied the event-loop. I've fixed this in #36574.

Since I've made that change, I have not seen the problem anymore. I've also updated to core==0.111.0b4.

@logan893, do you use the kef integration by any chance?

logan893 · 2020-06-09T09:09:22Z

Yes, I am using the aiokef integration, that is very interesting finding!

What triggers the issue you fixed?
Would this fault potentially cause runaway memory utilisation and starvation of the homeassistant core process, as in my recent experience?

basnijholt · 2020-06-09T09:15:09Z

I am not sure why the memory error or high CPU usage happens.

Before aiokef did a very cheap check 2 times per second to see if the connection with the speaker should be closed. When it was actually closed, the check continued. Now it schedules a task to close the connection at a specific time and extends that time if a new operation is performed.

Again, I am not sure why this would actually be a problem, but since this resulted in many calls I decided to fix it (which seems to work).

logan893 · 2020-06-09T09:33:40Z

I see, thank you. Maybe it's at least part of the issue, if not necessarily the full story, then. It's unfortunate that I cannot run py-spy.

I'll continue to log with debug blasting on full.

Last time it happened was just after an upgrade, but I had only debug logs active on homeassistant core. Maybe I'll give 0.110.5 a shot will full debug and see what happens.

Do you see that the aiokef logs to debug during this disconnect loop?

basnijholt · 2020-06-09T11:10:56Z

I guess what could have happened is that each time the connection was slow/dropped momentarily, homeassistant created a new KefMediaPlayer(MediaPlayerEntity) instance. Maybe from the core team can tell if this is the case?

Each time that happens a new while True loop was started.

I get 70% CPU usage by running this code (which is kind of what happened in aiokef) in a Jupyter notebook:

import asyncio
import random
import time

import nest_asyncio

loop = asyncio.get_event_loop()
nest_asyncio.apply(loop)

async def f():
    while True:
        time.time()
        await asyncio.sleep(random.random())

# 10000 devices running
futs = [loop.create_task(f()) for _ in range(10_000)]

probot-home-assistant · 2020-06-09T13:41:47Z

kef documentation
kef source
_{^{(message by IssueLinks)}}

balloob · 2020-06-09T17:44:34Z

alright, so fixed in beta and we can close this issue?

balloob · 2020-06-09T17:45:17Z

btw @basnijholt not sure what you mean Home Assistant created a new KefMediaPlayer(MediaPlayerEntity) instance. We only call your platform once, after that it's all your code :)

basnijholt · 2020-06-09T17:49:01Z

Then I really cannot explain why my change fixed the issue.

Because then there is just one loop that does (my hypothesis was that KefMediaPlayer was restarted many times somehow):

    async def _disconnect_if_passive(self) -> None:
        """Disconnect socket after _KEEP_ALIVE seconds of not using it."""
        while True:
            with contextlib.suppress(Exception):
                time_is_up = time.time() - self._last_time_stamp > _KEEP_ALIVE
                if time_is_up:
                    await self._disconnect()
                await asyncio.sleep(0.5)

Therefore, I do not know whether the issue is actually fixed because I do not know for sure whether it's actually caused by aiokef (and I made many changes in my setup). Maybe @logan893 can confirm by turning off the KEF integration and seeing whether the problem is fixed for him too?

balloob · 2020-06-09T18:12:22Z

Why is there no break after calling disconnect? The job of that loop is done

(and why disconnect at all?)

basnijholt · 2020-06-09T18:29:34Z

That was the old implementation by the way. I just don’t understand how this could lead to the errors.

The rationale behind it code is that every time after communicating with the device, that the connection is broken off after two seconds. I do this because when there is a connection, one cannot use the KEF app for example (also reconnecting is cheap).

aukedejong · 2020-06-10T05:46:36Z

I subscribe to this issue, because last week I had the memory error also, for the first time. A few days after I updated Home Assistant. (Running 0.110.3 on Docker on Pi 4)

Just wanted to add that I don't use the aiokef integration. So maybe it's not related.

stale · 2020-09-08T16:26:07Z

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue now has been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

basnijholt · 2020-09-08T16:53:42Z

Oh, this issue should be closed because I fixed it months ago.

@aukedejong, if you are seeing a memory error without even using the kef integration, then you are seeing a different issue.

logan893 changed the title ~~MemoryError and crash of python homeassistant docker container~~ MemoryError and crash of python and homeassistant docker container Jun 3, 2020

basnijholt mentioned this issue Jun 3, 2020

Zeroconf causes high CPU Usage with an additional python3 process #35840

Closed

bdraco changed the title ~~MemoryError and crash of python and homeassistant docker container~~ MemoryError and crash of python and homeassistant docker container when using kef integration Jun 9, 2020

bdraco added the integration: kef label Jun 9, 2020

probot-home-assistant bot assigned basnijholt Jun 9, 2020

stale bot added the stale label Sep 8, 2020

stale bot removed the stale label Sep 8, 2020

bdraco closed this as completed Sep 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MemoryError and crash of python and homeassistant docker container when using kef integration #36403

MemoryError and crash of python and homeassistant docker container when using kef integration #36403

logan893 commented Jun 3, 2020 •

edited

basnijholt commented Jun 3, 2020 •

edited

bdraco commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 3, 2020

logan893 commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 3, 2020

bdraco commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 4, 2020

logan893 commented Jun 4, 2020

basnijholt commented Jun 4, 2020

logan893 commented Jun 4, 2020

basnijholt commented Jun 4, 2020

basnijholt commented Jun 9, 2020 •

edited

logan893 commented Jun 9, 2020

basnijholt commented Jun 9, 2020

logan893 commented Jun 9, 2020

basnijholt commented Jun 9, 2020

probot-home-assistant bot commented Jun 9, 2020

balloob commented Jun 9, 2020

balloob commented Jun 9, 2020

basnijholt commented Jun 9, 2020 •

edited

balloob commented Jun 9, 2020

basnijholt commented Jun 9, 2020 •

edited

aukedejong commented Jun 10, 2020

stale bot commented Sep 8, 2020

basnijholt commented Sep 8, 2020

MemoryError and crash of python and homeassistant docker container when using kef integration #36403

MemoryError and crash of python and homeassistant docker container when using kef integration #36403

Comments

logan893 commented Jun 3, 2020 • edited

The problem

Environment

Problem-relevant configuration.yaml

Traceback/Error logs

Additional information

basnijholt commented Jun 3, 2020 • edited

bdraco commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 3, 2020

logan893 commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 3, 2020

bdraco commented Jun 3, 2020

bdraco commented Jun 3, 2020

logan893 commented Jun 4, 2020

logan893 commented Jun 4, 2020

basnijholt commented Jun 4, 2020

logan893 commented Jun 4, 2020

basnijholt commented Jun 4, 2020

basnijholt commented Jun 9, 2020 • edited

logan893 commented Jun 9, 2020

basnijholt commented Jun 9, 2020

logan893 commented Jun 9, 2020

basnijholt commented Jun 9, 2020

probot-home-assistant bot commented Jun 9, 2020

balloob commented Jun 9, 2020

balloob commented Jun 9, 2020

basnijholt commented Jun 9, 2020 • edited

balloob commented Jun 9, 2020

basnijholt commented Jun 9, 2020 • edited

aukedejong commented Jun 10, 2020

stale bot commented Sep 8, 2020

basnijholt commented Sep 8, 2020

logan893 commented Jun 3, 2020 •

edited

Problem-relevant `configuration.yaml`

basnijholt commented Jun 3, 2020 •

edited

basnijholt commented Jun 9, 2020 •

edited

basnijholt commented Jun 9, 2020 •

edited

basnijholt commented Jun 9, 2020 •

edited