Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Homeassistant container memory leak - Out of Memory #93713

Closed
litinoveweedle opened this issue May 28, 2023 · 41 comments · Fixed by #96981
Closed

Homeassistant container memory leak - Out of Memory #93713

litinoveweedle opened this issue May 28, 2023 · 41 comments · Fixed by #96981

Comments

@litinoveweedle
Copy link

litinoveweedle commented May 28, 2023

Read before posting

#95386

The problem

HomeAssiastant container is leaking memory at about 1GB/day. Tried to disable not strictly necessary integrations, but could not find culprit. I am not exactly sure, when problem appeared first time, and if it is caused by new HA version or by new integration. I did disable OnVif integration, which was mentioned in separate OoM issue, but HA container kept growing anyway.

Profiler PROFILER.START_LOG_OBJECT_SOURCES log attached. I saw many messages like '"Failed to serialize <class 'list'>"'. but I am not sure, if it is real issue or not.

Could be related to #93556 or #92026

What version of Home Assistant Core has the issue?

core-2023.5.4

What was the last working version of Home Assistant Core?

core-2023-4.X?

What type of installation are you running?

Home Assistant Supervised

Integration causing the issue

unknown

Link to integration documentation on our website

No response

Diagnostics information

home-assistant.log.zip

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

@litinoveweedle
Copy link
Author

litinoveweedle commented May 28, 2023

After cca 18hrs my memory consumpion looks like:

CONTAINERS 20 sorted by memory_percent
Engine	Pod	Name				Status	Uptime		CPU%	MEM	/MAX	IOR/s	IOW/s	RX/s	TX/s	Command
docker	-	addon_a0d7b954_motioneye	running	yesterday	21.4	1.92G	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	homeassistant			running	18 hours	27.1	1.74G	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_core_whisper		running	yesterday	0	908M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_a0d7b954_nodered		running	yesterday	4.3	262M	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_4a36bbd1_signal		running	yesterday	0.1	213M	7.61G	0b	0b	248b	288b	/run.sh
docker	-	hassio_supervisor		running	yesterday	1.7	207M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_a0d7b954_adguard		running	yesterday	0	147M	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_core_piper		running	yesterday	0	111M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_a0d7b954_glances		running	yesterday	0.9	65.3M	7.61G	0b	0b	NaNb	NaNb	/run.sh
docker	-	addon_a0d7b954_logviewer	running	yesterday	0	32.1M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_a0d7b954_sqlite-web	running	yesterday	0	31.8M	7.61G	0b	0b	0b	0b	/init
docker	-	hassio_dns			running	yesterday	0.2	23.7M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_core_configurator		running	yesterday	0	22.5M	7.61G	0b	0b	0b	0b	/init
docker	-	hassio_audio			running	yesterday	0	21.9M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_core_ssh			running	yesterday	0	20.7M	7.61G	0b	0b	0b	0b	/init
docker	-	hassio_observer			running	yesterday	0	17.6M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_45c828c3_snapcast-server	running	yesterday	1.4	17.1M	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_a0d7b954_sonweb		running	yesterday	0	13.2M	7.61G	0b	0b	0b	0b	/init
docker	-	hassio_cli			running	yesterday	0	2.47M	7.61G	0b	0b	0b	0b	/init
docker	-	hassio_multicast		running	yesterday	0	1.05M	7.61G	0b	0b	NaNb	NaNb	/init

@litinoveweedle
Copy link
Author

After cca 30hrs memory consumpion looks like:

CONTAINERS 20 sorted by memory_percent
Engine	Pod	Name				Status	Uptime	CPU%	MEM	/MAX	IOR/s	IOW/s	RX/s	TX/s	Command
docker	-	homeassistant			running	yester	1.7	2.55G	7.61G	44Kb	0b	NaNb	NaNb	/init
docker	-	addon_a0d7b954_motioneye	running	2 days	12.2	1.44G	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_core_whisper		running	2 days	0	818M	7.61G	0b	0b	1Kb	0b	/init
docker	-	addon_a0d7b954_nodered		running	2 days	2.4	270M	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_4a36bbd1_signal		running	2 days	0.2	215M	7.61G	0b	0b	1Kb	240b	/run.sh
docker	-	hassio_supervisor		running	2 days	0.5	210M	7.61G	0b	0b	0b	0b	/init
docker	-	addon_a0d7b954_adguard		running	2 days	0	142M	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_core_piper		running	2 days	0	98.7M	7.61G	0b	0b	1Kb	0b	/init
docker	-	addon_a0d7b954_glances		running	2 days	1.2	51.1M	7.61G	0b	0b	NaNb	NaNb	/run.sh
docker	-	addon_a0d7b954_sqlite-web	running	2 days	0	30.3M	7.61G	0b	0b	1Kb	0b	/init
docker	-	addon_a0d7b954_logviewer	running	2 days	0	27.7M	7.61G	0b	0b	1Kb	0b	/init
docker	-	addon_core_configurator		running	2 days	0	23.2M	7.61G	0b	0b	1Kb	0b	/init
docker	-	hassio_dns			running	2 days	0.1	22.5M	7.61G	0b	0b	2Kb	808b	/init
docker	-	addon_core_ssh			running	2 days	0	19.7M	7.61G	0b	0b	1Kb	0b	/init
docker	-	hassio_audio			running	2 days	0	19.2M	7.61G	0b	0b	1Kb	0b	/init
docker	-	addon_45c828c3_snapcast-server	running	2 days	1.4	14.0M	7.61G	0b	0b	NaNb	NaNb	/init
docker	-	addon_a0d7b954_sonweb		running	2 days	0	13.0M	7.61G	0b	0b	1Kb	0b	/init
docker	-	hassio_observer			running	2 days	0	12.4M	7.61G	0b	0b	1Kb	0b	/init
docker	-	hassio_cli			running	2 days	0	1.96M	7.61G	0b	0b	1Kb	0b	/init
docker	-	hassio_multicast		running	2 days	0	1.04M	7.61G	0b	0b	NaNb	NaNb	/init

@bdraco
Copy link
Member

bdraco commented May 31, 2023

I'd start with start_log_objects as START_LOG_OBJECT_SOURCES is a lot more useful when we already know what type of object we are looking for

@litinoveweedle
Copy link
Author

OK, thank you. I did as you suggested, the log is attached. Thank you for any help.
home-assistant.zip

@bdraco
Copy link
Member

bdraco commented Jun 2, 2023

Something is creating a massive amount of tasks
2023-05-31 13:41:50.794 CRITICAL (SyncWorker_4) [homeassistant.components.profiler] Memory Growth: [('coroutine', 202609, 339), ('frame', 202592, 339), ('builtin_function_or_method', 110601, 169), ('ReferenceType', 80809, 104), ('method', 56223, 90), ('Task', 50646, 84), ('Future', 50673, 84), ('FutureIter', 50627, 84), ('Context', 103933, 15)]

@bdraco
Copy link
Member

bdraco commented Jun 2, 2023

Thats good thats its asyncio since it means a callgrind might show it.

Can you generate one with the profiler.start service?

@litinoveweedle
Copy link
Author

Thank you for help, please find it attached - for 180s, if needed I can produce longer.
callgrind.zip

@bdraco
Copy link
Member

bdraco commented Jun 2, 2023

Its not immediately obvious where the task leak is coming from which means its likely a "slow" leak. They are always easier to find when they run the system out of memory quickly as you can see the abnormal growth quickly.

The next step would be to disable custom components below and see if the problem goes away.

  • cz_energy_spot_prices
  • smartir
  • floureon
  • solax_modbus
  • programmable_thermostat
  • signalmessenger
  • variable_history

If that turns out to be the cause we would than need to dig through the code of the one causing it to look for the leak

@litinoveweedle
Copy link
Author

litinoveweedle commented Jun 2, 2023

Its not immediately obvious where the task leak is coming from which means its likely a "slow" leak.

It is not that slow leak... I am running HA on server with 16GB RAM and HA was able to grow from ~300MB to 4GB in ~ two days. Should I keep running profiler for longer?

Does you reply means, that there was nothing obvious in callgrind? I am doing what you wrote anyway (test and try integrations), but I need at leas half of the integrations for base function of my house hold, so I can't simply switch them off. Therefore it would be really nice, if there is something in the logs at least to narrow my search. Thank you.

@bdraco
Copy link
Member

bdraco commented Jun 2, 2023

It likely it is in the profile but I can’t find it because it’s so small.

It’s slow in the sense that in the time of the profile it only grows a little. When I say fast leak I mean something that will crash in a few minutes. These are easy to see where the source is because they show up in the profile by sorting

You could try running the profile for 30m to see if it’s large enough to correlate the source.

keep in mind the profile has all the calls so it’s looking for a needle in a haystack. Anything we can do to make the leak faster or bigger will make it easier to find

@litinoveweedle
Copy link
Author

I did run profiler for 1800s but strangely the size of the files is similar. Anyway I hope it would be helpful. Thank you.

callgrind.zip

@bdraco
Copy link
Member

bdraco commented Jun 3, 2023

create_task

Here is what is calling create_task

@bdraco
Copy link
Member

bdraco commented Jun 3, 2023

what_is_messagemanager

I don't know what messagemanager is but its creating tasks (not saying its leaking), but I can't find it in the HA code so it must be a custom component or lib

@litinoveweedle
Copy link
Author

Hello,
thank you very much for support. It is strange as this brings nothing also>

root@smart:/usr/share/hassio/homeassistant/custom_components# grep -Ri messagemanager *
root@smart:/usr/share/hassio/homeassistant/custom_components# grep -Ri message_manager *

So messagemanager method seems not to be in the in the custom integrations either. I did even grep in the Homeassistant and Hassio root directories, but nothing was found. Are you sure about the name? (i.e. messagemanager)? I did not found it in the diagrams you attached either.

BTW: is there any HOWTO one can follow to get similar output from profiler? It would be helpful for user to try to identify faulty code themself.

@litinoveweedle
Copy link
Author

@bdraco Can you please check "messagemanager" method/class name you mentioned? Maybe I am doing it wrong, but I was not able to find it in the custom integration either.

@sknsean
Copy link
Contributor

sknsean commented Jun 7, 2023

Just to chime in here, I'm not using any of these integrations:

* cz_energy_spot_prices

* smartir

* floureon

* solax_modbus

* programmable_thermostat

* signalmessenger

* variable_history

I'm also experiencing a memory leak, that results in the home-assistant container gets killed (by the OOM) every 5-6 days. Quite annoying :)

One of the pictures is showing when the container got killed, and the other shows the free memory keeps getting lower.

IMG_1265
IMG_1262

@litinoveweedle
Copy link
Author

litinoveweedle commented Jun 13, 2023

@bdraco I am really sorry, but I have politely asked you several times during past weeks simple question regarding your original findings - i.e. where did you exactly found "messagemanager" in the trace and why I can't grep anything in HA source code including custom integrations with such name. I need such information to continue with my troubleshooting.

Do I assume correctly from your silence, that I should not expect an answer at all?

image

@bdraco
Copy link
Member

bdraco commented Jun 13, 2023

Day job has me traveling quite a bit. I’m about 150 GitHub notifications behind

@backcountrymountains
Copy link

So, I just put in an issue with HAOS because I'm also experiencing OOM reboots. My memory allocation graph is even steeper and my machine reboots more often. #2595.

Is this more likely to be an HA than an HAOS issue? I'm going to try to downgrade to HAOS 9.5 and see if I still have leaks.

@VinceRMP
Copy link

I see the same on my instance. It seemed to start after updating HAOS or 2023.6 beta (coming from a dev build in the early stage of 2023.5, so it could be also related to a 2023.5 specific version).
What I have noticed is that my memory increase is caused by my tablet which shows the dashboard 24/7. When I disable the tablet the memory stops increasing, but doesn’t decrease. When I open the display on the tablet the memory starts to increase again.
Is this something you use aswell?

@litinoveweedle
Copy link
Author

I did observed memory leak on 2023.5. There was question why my graph of memory usage is not as steep as for others. Well I do run HA on server and I have currently 64GB of RAM, so it takes longer to consume it. (I extended my memory as intermediate workaround as my original 16GB RAM was exhausted in just 2-3 days.)

@bdraco
Copy link
Member

bdraco commented Jun 15, 2023

If you have a memory leak please post a start log objects (see above) and a callgrind. Me too comments are not helpful as they don’t add any more information that gets us closer to tracking down a leak.

@VinceRMP
Copy link

callgrind_start_log_objects.zip

See the attached file. If you need more or longer logs let me know.

@bdraco
Copy link
Member

bdraco commented Jun 15, 2023

@VinceRMP Can you run the start log objects for 30 minutes? The way it works is it keep looking at whats in memory and sees what grows over time. Without a longer run it won't show what is leaking as it doesn't have enough time to figure out what is normal memory churn and what is not going away.

@VinceRMP
Copy link

At the moment I’m not able to do that.
I will do it tomorrow and post the full log.

@backcountrymountains
Copy link

So, I've noticed a correlation with a CPU spike and the incrementally increasing memory use:
image

Is there a way to monitor what causes the CPU spike to maybe get insight into what's using and not releasing memory?

@VinceRMP
Copy link

I can reproduce the start of the memory leak and stop it aswell after some testing.
I'm not sure if this is the same issue as described here and I don't want to hijack this issue.
So if I need to create a seperate one please let me know.

image
At the first drop in memory I did a restart of Home Assistant, but didn't refresh the webapp running on my tablet.
This causes the memory to increase as you can see. Around 08:00 I restarted the webapp on the tablet.
The memory stays the same, so no memory leak visible.
Around 09:00 I did another restart of Home Assistant without restarting the webapp on the tablet and there is the memory leak again.

This is also reported in the logs (I somehow didn't notice this before):
[139928754006096] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":22632,"type":"result","success":true,"result":null} [139928813456528] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":20399,"type":"result","success":true,"result":null} [139928532840656] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":20538,"type":"result","success":true,"result":null} [139928688859152] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":70314,"type":"result","success":true,"result":null} [139928776142096] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":30737,"type":"result","success":true,"result":null}

This is the log which now has run for around 30 minutes:
home-assistant.log

@bdraco
Copy link
Member

bdraco commented Jun 16, 2023

@VinceRMP that definitely looks like a different problem as there are no obvious task leaks in the log. Please continue in a new issue.

@VinceRMP
Copy link

Thanks for looking into it! I will create an issue in the frontend section.

@bdraco
Copy link
Member

bdraco commented Jun 16, 2023

I think it’s still should go to core as we need to figure out why it isn’t releasing memory when the client disconnects

@litinoveweedle
Copy link
Author

@bdraco how about original issue I logged. Did you got time to find out how I should identify "messagemanager" you mentioned?

@bdraco
Copy link
Member

bdraco commented Jun 16, 2023

BTW: is there any HOWTO one can follow to get similar output from profiler? It would be helpful for user to try to identify faulty code themself.

https://www.home-assistant.io/integrations/profiler/

Additionally, the profiler will generate a callgrind.out file that can be viewed with:

KCachegrind or QCachegrind
Gprof2dot

@bdraco
Copy link
Member

bdraco commented Jun 16, 2023

@bdraco how about original issue I logged. Did you got time to find out how I should identify "messagemanager" you mentioned?

I wasn't able to find it. I don't think its going to help anyways

@bdraco
Copy link
Member

bdraco commented Jun 16, 2023

@litinoveweedle I think you might have some luck figuring out the source if you do dump_log_objects on Task

@bdraco
Copy link
Member

bdraco commented Jun 16, 2023

service: profiler.dump_log_objects
data:
  type: Task

You'll get something like

2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1215' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1216' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1222' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1223' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1229' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:512> wait_for=<Task pending name='Task-1230' coro=<RequestHandler._handle_request() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:433> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1230' coro=<RequestHandler._handle_request() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:433> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[Task.task_wakeup()]>
2023-06-16 09:17:13.872 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1231' coro=<WebSocketHandler._writer() running at /Users/bdraco/home-assistant/homeassistant/components/websocket_api/http.py:109> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.872 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1208' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>

Keep in mind you have 50000 in your dump above so it may run out of memory trying to list them all so you might want to do the dump before too many memory has leaked but not too soon that there aren't enough to identify.

@litinoveweedle
Copy link
Author

@litinoveweedle I think you might have some luck figuring out the source if you do dump_log_objects on Task

Sorry I don't understand. (On Task part). I will be back home on Sunday, I will try to see how this could be done.

@litinoveweedle
Copy link
Author

Ok, thank you for explanation, I will try to do that after HA restart

@NicoIIT
Copy link

NicoIIT commented Jun 16, 2023

Not sure it can help or even if it is relevant, but investigating myself an unrelated memory leak I found out a potential interesting library for your case: aiocoap which defines a messagemanager and functions called fill_or_recognize_remote as it appears in the callgrind graphs.

This pyhton library is used by another library: pytradfri, itself used by the IKEA Tradfri HA Component: https://github.com/home-assistant/core/blob/dev/homeassistant/components/tradfri/manifest.json

@litinoveweedle if you use this component it may worse deactivating it to check if it could be the culprit.

@bdraco
Copy link
Member

bdraco commented Jul 21, 2023

Pretty sure I just stumbled on the source of this leak #96981

@litinoveweedle
Copy link
Author

Thank you for your effort and not forgetting this issue. I did upgraded to 2023.7.3, which contains your fix yesterday and I can confirm that so far there is no memory leak anymore!

As additional info, three days ago I discovered, that my official IKEA integration has new 'unknown' and 'unconfigured' device. I was able to remove this device via integration configuration GUI and it slowed down a leak significantly (but not completely). As to the origin of the this mysterious device - I remember I had some issue with IKEA battery powered blinds (Fyrtur), as after recharging their batteries some were not detected and I had to pair them to the IKEA gateway again. This was about in the same time when I discovered the leak. I was going to report this finding, but you were faster with the fix. ;-)

Thank you again.

@bdraco
Copy link
Member

bdraco commented Jul 23, 2023

Thanks for confirming

@github-actions github-actions bot locked and limited conversation to collaborators Aug 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants