Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MemoryError and crash of python and homeassistant docker container when using kef integration #36403

Closed
logan893 opened this issue Jun 3, 2020 · 30 comments
Assignees

Comments

@logan893
Copy link
Contributor

logan893 commented Jun 3, 2020

The problem

The python3 process running homeassistant died.
The “homeassistant” container crashed.

Looking at the home-assistant log (/mnt/data/supervisor/homeassistant/home-assistant.log from HassOS) the final entry is at midnight local time (15:00 UTC) and this is reflected in the last-modified timestamp of the file.

2020-06-03 00:00:18 WARNING (zeroconf-Engine-1848242208) [zeroconf] Exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/zeroconf/__init__.py", line 1292, in handle_read
MemoryError

Even though the MemoryError is written here, and the process doesn’t fully die until an hour later, the high memory consumption and CPU usage spikes about an hour prior.

At around 13:55:11 UTC is where memory and CPU utilization begins to climb.

With some debug logging enabled, I also see that even “DEBUG” log entries from homeassistant.core cease. Final DEBUG entry is at 22:55:01 local time (13:55:01 UTC) just prior to . The debug entries are only regular sensor update information, “Bus:Handling <Event state_changed[L]> …”.

Over the following hour, for “python3 -m homeassistant --config /config” the CPU is pegged at 100% and memory utilization climbs from a typical mere 335 MB VIRT (200 MB RES) to 1169 MB VIRT (~700 MB RES) in just 5 minutes (13:59:52 UTC). CPU utilization hangs back a bit (possibly because the SWAP isn’t being hammered.)

It kicks up again at 14:05:27 UTC, with 100% CPU and gradually climbing memory usage. It plateaus again at 14:06:33 UTC with 1384 MB VIRT (740 MB RES).

This remains stable until 14:19:12 UTC, going up further to 1652 MB VIRT by 14:20:34 UTC.
Rinse and repeat at 14:37:15 UTC, climbing to 1987 MB VIRT by 14:39:02 UTC.
15:00:02 UTC we climb again, to 2029 MB VIRT by 15:00:20 UTC. This is around the time that MemoryError happens.
The python3 process lives for another hour at the same memory utilization and approximately 15% CPU utilization on average. Then it dies and goes away.

https://community.home-assistant.io/t/memory-exhausted-by-python3-process-process-container-crash/201011/2

Environment

  • Home Assistant Core release with the issue: I've had similar memory exhaustion issues with home assistant since first installing. 0.109, and several 0.110.x. Lack of information in logs makes it impossible to determine the cause. Most recently 0.110.4, with HassOS 3.13.
  • Last working Home Assistant Core release (if known): N/A
  • Operating environment (Home Assistant/Supervised/Docker/venv): Raspberry Pi 3, HassOS based installation from recommended 32-bit image
  • Integration causing this issue: N/A
  • Link to integration documentation on our website: N/A

Problem-relevant configuration.yaml

First happened with default configuration.yaml, only loaded my Google Home units (4x Home Mini, 1x Nest Mini, 1x Chromecast, 1x Chromecast Audio, 1x JBL speaker) and Philips Hue bridge with one light.
Since then I've added more, but with the lack of logs I cannot say if it's related to any specific configuration or the base Home Assistant core system.

Traceback/Error logs

DEBUG was active on homeassistant.core and it stops producing any logs about an hour prior to MemoryError being output into the home-assistant.log. See the detailed flow of events above.
The memory utilization of python starts to grow at the same time as the homeassistant.core DEBUG logging stops.

MemoryError

Additional information

I've tried to collect as many logs as possible but still cannot see what is triggering this runaway memory usage which results in a crash.

@logan893 logan893 changed the title MemoryError and crash of python homeassistant docker container MemoryError and crash of python and homeassistant docker container Jun 3, 2020
@basnijholt
Copy link
Contributor

basnijholt commented Jun 3, 2020

@bdraco
Copy link
Member

bdraco commented Jun 3, 2020

Zeroconf may not be the cause, it just may be the first thing that runs out of memory because its constantly processing network traffic.

@bdraco
Copy link
Member

bdraco commented Jun 3, 2020

It would be helpful to get a dump with py-spy as soon as cpu usage starts to uptick https://github.com/benfred/py-spy

@logan893
Copy link
Contributor Author

logan893 commented Jun 3, 2020

Zeroconf may not be the cause, it just may be the first thing that runs out of memory because its constantly processing network traffic.

@bdraco Yes, I completely agree. Memory utilization spikes one hour prior to the MemoryError entry, and the zeroconf related exception may just be the proverbial straw that broke the 32-bit camel's back. Hard to tell from the logs since they were basically silent.

It would be helpful to get a dump with py-spy as soon as cpu usage starts to uptick https://github.com/benfred/py-spy

CPU utilization seems to be directly linked with active growing memory utilization. I would like to be able to see this myself, and I gave py-spy a go. Unfortunately I have not been successful in my attempts to run py-spy in the docker container of my RPi3 with HassOS.

If you know how to achieve this, please let me know.

https://community.home-assistant.io/t/py-spy-in-home-assistant-running-on-rpi-3-with-hassos-based-image/201612

@bdraco
Copy link
Member

bdraco commented Jun 3, 2020

You'll probably have to compile from sources as there are no binaries for the RPi3

@logan893
Copy link
Contributor Author

logan893 commented Jun 3, 2020

You'll probably have to compile from sources as there are no binaries for the RPi3

py-spy? There are armv7 binaries. But it seems I cannot run it properly without adding additional capabilities to the docker container.

@logan893
Copy link
Contributor Author

logan893 commented Jun 3, 2020

Some more curious details.

I recently added some virtual sensors to see what can be captured during these memory exhaustion events. Turns out, not much.

My sensor for host OS memory usage?
Memory used %: final entry reads 38.3% at 22:53:51 -- 70 seconds before memory starts to climb
Memory free: final entry reads 597.1 MB at 22:54:53 -- 18 seconds before

Ping Google DNS?
Set to fire off 4 ping every 60 seconds.
Keeps recording data regularly up until 22:54:41 -- 30 seconds before

Ping my local router?
Set to fire off 5 ping every 30 seconds.
Keeps recording data regularly up until 22:55:01 -- 10 seconds before

During the hour of steady memory growth and no activity, the homeassistant.core (MainThread) debug messages have ceased. The only messages shown in the log still are from pychromecast.socket_client.(Thread-XX, multiple entries), and it only says "Heartbeat timeout, resetting connection" followed by "Connection reestablished!", for all my Google Cast devices in unison. These show up 4 times, with about 15-25 minute interval.

These are all the logs I got out of the system last time it happened.

@bdraco
Copy link
Member

bdraco commented Jun 3, 2020

The arm7 binaries are linked again /lib/ld-linux-armhf.so.3

# ldd ./py-spy 
	/lib/ld-linux-armhf.so.3 (0xb6f24000)
	libc.so.6 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
Error loading shared library ld-linux-armhf.so.3: No such file or directory (needed by ./py-spy)
	libdl.so.2 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
	librt.so.1 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
	libpthread.so.0 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xb6c34000)
	libm.so.6 => /lib/ld-linux-armhf.so.3 (0xb6f24000)
Error relocating ./py-spy: __register_atfork: symbol not found
Error relocating ./py-spy: __snprintf_chk: symbol not found

@logan893
Copy link
Contributor Author

logan893 commented Jun 3, 2020

I see, thank you that is good to know. Then there are not one but two issues preventing me from running py-spy. Even if I were to compile it from source, the docker container still lacks the required capability.

@bdraco
Copy link
Member

bdraco commented Jun 3, 2020

The default docker container support strace

apk add strace

bash-5.0# strace ls
execve("/bin/ls", ["ls"], 0xbe82abb0 /* 22 vars */) = 0
set_tls(0xb6f1f5d0)                     = 0
set_tid_address(0xb6f20184)             = 4256
mprotect(0x58c000, 8192, PROT_READ)     = 0
getuid32()                              = 0
...

@bdraco
Copy link
Member

bdraco commented Jun 3, 2020

https://developers.home-assistant.io/docs/operating-system/debugging/
ssh on 22222
login
docker exec -it homeassistant /bin/bash
apk add cargo
cargo install py-spy

That should work in theory

@logan893
Copy link
Contributor Author

logan893 commented Jun 4, 2020

Thanks @bdraco I will give that a try right away.

@logan893
Copy link
Contributor Author

logan893 commented Jun 4, 2020

Unfortunately the py-spy compile fails with HassOS 3.13.

"dashmap" compile fails with error: use of unstable library feature 'map_get_key_value'

The problem is due to the slightly older version of "rust" available. Version 1.39 is installed alongside cargo from the Alpine 3.11 repo, and this function map_get_key_value is made stable from rust version 1.40.

To force install rust 1.43 apk from Alpine 3.12 would require upgrade of dependencies that would possibly break other things. I haven't found any other way around this.

Perhaps HassOS 4.x will use Alpine 3.12 where rust 1.43 is available, but I don't know how to verify that, nor do I feel comfortable upgrading at this time due to the issues raised for broken installs on Raspberry Pi 3 after attempts to upgrade.

@basnijholt
Copy link
Contributor

I've tried to install py-spy via cargo on HassOS 4.8 which resulted in the same compilation error.

However, the binaries work for me.

@logan893
Copy link
Contributor Author

logan893 commented Jun 4, 2020

Is it on a Raspberry Pi 3, and which binaries are you using?

@basnijholt
Copy link
Contributor

No, I am using a NUC and the py-spy-v0.3.3-x86_64-unknown-linux-musl.tar.gz file.

@basnijholt
Copy link
Contributor

basnijholt commented Jun 9, 2020

I've had py-spy recording while everything froze up.
This is the output, where I see that aiokef (used in the kef integration and my own creation 😅) mostly occupied the event-loop. I've fixed this in #36574.

Since I've made that change, I have not seen the problem anymore. I've also updated to core==0.111.0b4.

@logan893, do you use the kef integration by any chance?

@logan893
Copy link
Contributor Author

logan893 commented Jun 9, 2020

Yes, I am using the aiokef integration, that is very interesting finding!

What triggers the issue you fixed?
Would this fault potentially cause runaway memory utilisation and starvation of the homeassistant core process, as in my recent experience?

@basnijholt
Copy link
Contributor

I am not sure why the memory error or high CPU usage happens.

Before aiokef did a very cheap check 2 times per second to see if the connection with the speaker should be closed. When it was actually closed, the check continued. Now it schedules a task to close the connection at a specific time and extends that time if a new operation is performed.

Again, I am not sure why this would actually be a problem, but since this resulted in many calls I decided to fix it (which seems to work).

@logan893
Copy link
Contributor Author

logan893 commented Jun 9, 2020

I see, thank you. Maybe it's at least part of the issue, if not necessarily the full story, then. It's unfortunate that I cannot run py-spy.

I'll continue to log with debug blasting on full.

Last time it happened was just after an upgrade, but I had only debug logs active on homeassistant core. Maybe I'll give 0.110.5 a shot will full debug and see what happens.

Do you see that the aiokef logs to debug during this disconnect loop?

@basnijholt
Copy link
Contributor

I guess what could have happened is that each time the connection was slow/dropped momentarily, homeassistant created a new KefMediaPlayer(MediaPlayerEntity) instance. Maybe from the core team can tell if this is the case?

Each time that happens a new while True loop was started.

I get 70% CPU usage by running this code (which is kind of what happened in aiokef) in a Jupyter notebook:

import asyncio
import random
import time

import nest_asyncio

loop = asyncio.get_event_loop()
nest_asyncio.apply(loop)

async def f():
    while True:
        time.time()
        await asyncio.sleep(random.random())

# 10000 devices running
futs = [loop.create_task(f()) for _ in range(10_000)]

@bdraco bdraco changed the title MemoryError and crash of python and homeassistant docker container MemoryError and crash of python and homeassistant docker container when using kef integration Jun 9, 2020
@probot-home-assistant
Copy link

kef documentation
kef source
(message by IssueLinks)

@balloob
Copy link
Member

balloob commented Jun 9, 2020

alright, so fixed in beta and we can close this issue?

@balloob
Copy link
Member

balloob commented Jun 9, 2020

btw @basnijholt not sure what you mean Home Assistant created a new KefMediaPlayer(MediaPlayerEntity) instance. We only call your platform once, after that it's all your code :)

@basnijholt
Copy link
Contributor

basnijholt commented Jun 9, 2020

Then I really cannot explain why my change fixed the issue.

Because then there is just one loop that does (my hypothesis was that KefMediaPlayer was restarted many times somehow):

    async def _disconnect_if_passive(self) -> None:
        """Disconnect socket after _KEEP_ALIVE seconds of not using it."""
        while True:
            with contextlib.suppress(Exception):
                time_is_up = time.time() - self._last_time_stamp > _KEEP_ALIVE
                if time_is_up:
                    await self._disconnect()
                await asyncio.sleep(0.5)

Therefore, I do not know whether the issue is actually fixed because I do not know for sure whether it's actually caused by aiokef (and I made many changes in my setup). Maybe @logan893 can confirm by turning off the KEF integration and seeing whether the problem is fixed for him too?

@balloob
Copy link
Member

balloob commented Jun 9, 2020

Why is there no break after calling disconnect? The job of that loop is done

(and why disconnect at all?)

@basnijholt
Copy link
Contributor

basnijholt commented Jun 9, 2020

That was the old implementation by the way. I just don’t understand how this could lead to the errors.

The rationale behind it code is that every time after communicating with the device, that the connection is broken off after two seconds. I do this because when there is a connection, one cannot use the KEF app for example (also reconnecting is cheap).

@aukedejong
Copy link

I subscribe to this issue, because last week I had the memory error also, for the first time. A few days after I updated Home Assistant. (Running 0.110.3 on Docker on Pi 4)

Just wanted to add that I don't use the aiokef integration. So maybe it's not related.

@stale
Copy link

stale bot commented Sep 8, 2020

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue now has been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 8, 2020
@basnijholt
Copy link
Contributor

Oh, this issue should be closed because I fixed it months ago.

@aukedejong, if you are seeing a memory error without even using the kef integration, then you are seeing a different issue.

@stale stale bot removed the stale label Sep 8, 2020
@bdraco bdraco closed this as completed Sep 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants