MemoryError and crash of python and homeassistant docker container when using kef integration #36403
Comments
There is also #35840, which is likely the same problem. To summarize (mostly @logan893's findings): GitHub issues:
Possibly related topics:
Less clear but still possibly related: |
Zeroconf may not be the cause; it may just be the first thing that runs out of memory because it's constantly processing network traffic.
It would be helpful to get a dump with |
@bdraco Yes, I completely agree. Memory utilization spikes one hour prior to the MemoryError entry, and the zeroconf related exception may just be the proverbial straw that broke the 32-bit camel's back. Hard to tell from the logs since they were basically silent.
CPU utilization seems to be directly linked with active growing memory utilization. I would like to be able to see this myself, and I gave py-spy a go. Unfortunately I have not been successful in my attempts to run py-spy in the docker container of my RPi3 with HassOS. If you know how to achieve this, please let me know. |
You'll probably have to compile from sources as there are no binaries for the RPi3 |
py-spy? There are armv7 binaries. But it seems I cannot run it properly without adding additional capabilities to the docker container. |
Some more curious details. I recently added some virtual sensors to see what can be captured during these memory exhaustion events. Turns out, not much. My sensor for host OS memory usage? Ping Google DNS? Ping my local router? During the hour of steady memory growth and no activity, the homeassistant.core (MainThread) debug messages have ceased. The only messages still shown in the log are from pychromecast.socket_client (Thread-XX, multiple entries), and they only say "Heartbeat timeout, resetting connection" followed by "Connection reestablished!", for all my Google Cast devices in unison. These show up 4 times, at intervals of about 15-25 minutes. These are all the logs I got out of the system the last time it happened.
The arm7 binaries are linked again
I see, thank you that is good to know. Then there are not one but two issues preventing me from running py-spy. Even if I were to compile it from source, the docker container still lacks the required capability. |
The default docker container supports strace
https://developers.home-assistant.io/docs/operating-system/debugging/ That should work in theory |
Thanks @bdraco I will give that a try right away. |
Unfortunately the py-spy compile fails on HassOS 3.13. The "dashmap" crate fails to compile with error: use of unstable library feature 'map_get_key_value'. The problem is the slightly older version of Rust available: version 1.39 is installed alongside cargo from the Alpine 3.11 repo, and map_get_key_value was only stabilized in Rust 1.40. Force-installing the rust 1.43 apk from Alpine 3.12 would require upgrading dependencies, which could break other things. I haven't found any other way around this. Perhaps HassOS 4.x will use Alpine 3.12, where rust 1.43 is available, but I don't know how to verify that, nor do I feel comfortable upgrading at this time given the reports of broken installs on Raspberry Pi 3 after upgrade attempts.
I've tried to install py-spy via cargo on HassOS 4.8 which resulted in the same compilation error. However, the binaries work for me. |
Is it on a Raspberry Pi 3, and which binaries are you using? |
No, I am using a NUC and the py-spy-v0.3.3-x86_64-unknown-linux-musl.tar.gz file. |
I've had py-spy recording while everything froze up. Since I've made that change, I have not seen the problem anymore. I've also updated to @logan893, do you use the |
Yes, I am using the aiokef integration, that is very interesting finding! What triggers the issue you fixed? |
I am not sure why the memory error or high CPU usage happens. Before Again, I am not sure why this would actually be a problem, but since this resulted in many calls I decided to fix it (which seems to work). |
I see, thank you. Maybe it's at least part of the issue, if not necessarily the full story, then. It's unfortunate that I cannot run py-spy. I'll continue to log with debug blasting on full. Last time it happened was just after an upgrade, but I had debug logs active only on homeassistant core. Maybe I'll give 0.110.5 a shot with full debug and see what happens. Do you see aiokef logging at debug level during this disconnect loop?
I guess what could have happened is that each time the connection was slow/dropped momentarily, Each time that happens a new I get 70% CPU usage by running this code (which is kind of what happened in

    import asyncio
    import random
    import time

    import nest_asyncio

    loop = asyncio.get_event_loop()
    nest_asyncio.apply(loop)


    async def f():
        while True:
            time.time()
            await asyncio.sleep(random.random())


    # 10000 devices running
    futs = [loop.create_task(f()) for _ in range(10_000)]
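The hypothesis discussed here (a new polling loop being started on every reconnect while the old one keeps running) can be checked by counting live tasks. The following is only a minimal sketch under that assumption; `poll_forever` and `main` are hypothetical names and none of this is aiokef or Home Assistant code.

```python
import asyncio


async def poll_forever():
    """Hypothetical stand-in for a per-connection keep-alive loop."""
    while True:
        await asyncio.sleep(0.01)


async def main():
    counts = []
    leaked = []
    for _ in range(3):
        # Simulate a "reconnect" that starts a new loop
        # without cancelling the previous one.
        leaked.append(asyncio.ensure_future(poll_forever()))
        await asyncio.sleep(0)
        # all_tasks() returns every unfinished task on the running loop.
        counts.append(len(asyncio.all_tasks()))
    # Clean up so the demo exits without pending-task warnings.
    for task in leaked:
        task.cancel()
    await asyncio.gather(*leaked, return_exceptions=True)
    return counts


counts = asyncio.run(main())
print(counts)
```

If the task count reported by `asyncio.all_tasks()` keeps growing across reconnects in a real setup, that would support the leak hypothesis; a flat count would argue against it.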
kef documentation |
alright, so fixed in beta and we can close this issue? |
btw @basnijholt not sure what you mean Home Assistant created a new |
Then I really cannot explain why my change fixed the issue. Because then there is just one loop that does (my hypothesis was that

    async def _disconnect_if_passive(self) -> None:
        """Disconnect socket after _KEEP_ALIVE seconds of not using it."""
        while True:
            with contextlib.suppress(Exception):
                time_is_up = time.time() - self._last_time_stamp > _KEEP_ALIVE
                if time_is_up:
                    await self._disconnect()
                await asyncio.sleep(0.5)

Therefore, I do not know whether the issue is actually fixed because I do not know for sure whether it's actually caused by
Why is there no break after calling disconnect? The job of that loop is done (and why disconnect at all?) |
That was the old implementation, by the way. I just don’t understand how this could lead to the errors. The rationale behind the code is that the connection is closed two seconds after each communication with the device. I do this because while a connection is open, one cannot use the KEF app, for example (also, reconnecting is cheap).
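For illustration, the two points raised here (stopping the loop once the disconnect is done, while still dropping the connection shortly after the last use) could be combined roughly as follows. This is a sketch only, not the fix that went into aiokef: `IdleConnection`, `send`, and `KEEP_ALIVE` are hypothetical, and the connect and disconnect steps are simulated with a flag.

```python
import asyncio
import time

KEEP_ALIVE = 0.2  # seconds of inactivity before disconnecting (hypothetical value)


class IdleConnection:
    """Hypothetical stand-in for a speaker connection; not the aiokef API."""

    def __init__(self) -> None:
        self.connected = False
        self.disconnects = 0
        self._last_used = 0.0
        self._watchdog = None

    async def send(self, message: str) -> None:
        self.connected = True  # pretend to (re)connect and send
        self._last_used = time.monotonic()
        # Start at most one watchdog per connection.
        if self._watchdog is None or self._watchdog.done():
            self._watchdog = asyncio.ensure_future(self._disconnect_when_idle())

    async def _disconnect_when_idle(self) -> None:
        while True:
            remaining = KEEP_ALIVE - (time.monotonic() - self._last_used)
            if remaining <= 0:
                self.connected = False  # pretend to close the socket
                self.disconnects += 1
                return  # job done; the next send() starts a new watchdog
            # Sleep only as long as needed instead of polling every 0.5 s.
            await asyncio.sleep(remaining)


async def demo() -> IdleConnection:
    conn = IdleConnection()
    for _ in range(3):
        await conn.send("ping")
        await asyncio.sleep(0.05)
    await asyncio.sleep(KEEP_ALIVE * 2)
    return conn


conn = asyncio.run(demo())
print(conn.connected, conn.disconnects)
```

With this shape there is no task running while the connection is idle, and a burst of messages shares a single watchdog rather than starting one per message.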
I subscribe to this issue, because last week I had the memory error also, for the first time. A few days after I updated Home Assistant. (Running 0.110.3 on Docker on Pi 4) Just wanted to add that I don't use the aiokef integration. So maybe it's not related. |
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. |
Oh, this issue should be closed because I fixed it months ago. @aukedejong, if you are seeing a memory error without even using the |
The problem
The python3 process running homeassistant died.
The “homeassistant” container crashed.
Looking at the home-assistant log (/mnt/data/supervisor/homeassistant/home-assistant.log from HassOS) the final entry is at midnight local time (15:00 UTC) and this is reflected in the last-modified timestamp of the file.
Even though the MemoryError is written here, and the process doesn’t fully die until an hour later, memory consumption and CPU usage spike about an hour before it.
Memory and CPU utilization begin to climb at around 13:55:11 UTC.
With some debug logging enabled, I also see that even “DEBUG” log entries from homeassistant.core cease. Final DEBUG entry is at 22:55:01 local time (13:55:01 UTC) just prior to . The debug entries are only regular sensor update information, “Bus:Handling <Event state_changed[L]> …”.
Over the following hour, for “python3 -m homeassistant --config /config” the CPU is pegged at 100% and memory utilization climbs from a typical mere 335 MB VIRT (200 MB RES) to 1169 MB VIRT (~700 MB RES) in just 5 minutes (13:59:52 UTC). CPU utilization hangs back a bit (possibly because the SWAP isn’t being hammered.)
It kicks up again at 14:05:27 UTC, with 100% CPU and gradually climbing memory usage. It plateaus again at 14:06:33 UTC with 1384 MB VIRT (740 MB RES).
This remains stable until 14:19:12 UTC, going up further to 1652 MB VIRT by 14:20:34 UTC.
Rinse and repeat at 14:37:15 UTC, climbing to 1987 MB VIRT by 14:39:02 UTC.
At 15:00:02 UTC it climbs again, to 2029 MB VIRT by 15:00:20 UTC. This is around the time that the MemoryError happens.
The python3 process lives for another hour at the same memory utilization and approximately 15% CPU utilization on average. Then it dies and goes away.
https://community.home-assistant.io/t/memory-exhausted-by-python3-process-process-container-crash/201011/2
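VIRT/RES figures like the ones in this timeline can also be collected automatically. The sketch below assumes a Linux host, where `/proc/<pid>/status` exposes `VmSize` and `VmRSS` in kB; `parse_vm_kb` is a hypothetical helper, and the sample text is made up to match the 1169 MB VIRT / 700 MB RES reading above.

```python
import re


def parse_vm_kb(status_text):
    """Return VmSize and VmRSS (in kB) found in /proc/<pid>/status text."""
    fields = {}
    for name in ("VmSize", "VmRSS"):
        match = re.search(rf"^{name}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
        if match:
            fields[name] = int(match.group(1))
    return fields


# Made-up sample in the /proc/<pid>/status layout.
sample = "Name:\tpython3\nVmSize:\t 1197056 kB\nVmRSS:\t  716800 kB\n"
usage = parse_vm_kb(sample)
print({name: kb // 1024 for name, kb in usage.items()})  # values in MB
```

On a real system the text would come from reading `/proc/<pid>/status` for the python3 process at a fixed interval and appending the parsed values, with a timestamp, to a file outside the container.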
Environment
Problem-relevant
configuration.yaml
First happened with default configuration.yaml, only loaded my Google Home units (4x Home Mini, 1x Nest Mini, 1x Chromecast, 1x Chromecast Audio, 1x JBL speaker) and Philips Hue bridge with one light.
Since then I've added more, but with the lack of logs I cannot say if it's related to any specific configuration or the base Home Assistant core system.
Traceback/Error logs
DEBUG was active on homeassistant.core and it stops producing any logs about an hour prior to MemoryError being output into the home-assistant.log. See the detailed flow of events above.
The memory utilization of python starts to grow at the same time as the homeassistant.core DEBUG logging stops.
Additional information
I've tried to collect as many logs as possible but still cannot see what is triggering this runaway memory usage which results in a crash.
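When logs stay silent, comparing `tracemalloc` snapshots from Python's standard library is one way to see which source lines account for memory growth. This is a generic sketch, not something that was run against Home Assistant in this thread; `top_growth` is a hypothetical helper and the list of byte strings only simulates a leak.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocated size grew the most."""
    stats = after.compare_to(before, "lineno")
    return [stat for stat in stats if stat.size_diff > 0][:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulated leak: roughly 2 MB of byte strings kept alive in a list.
leak = [bytes(1024) for _ in range(2000)]

after = tracemalloc.take_snapshot()
growth = top_growth(before, after)
for stat in growth:
    print(stat)
tracemalloc.stop()
```

Tracing adds overhead of its own in both CPU and memory, which matters on a device that is already close to its memory limit, so it is probably best enabled only while reproducing the problem.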