Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Massive Memory Leak when running on Intel NUC #29786

Closed
JokerVP opened this issue Dec 9, 2019 · 39 comments · Fixed by #30155
Closed

Massive Memory Leak when running on Intel NUC #29786

JokerVP opened this issue Dec 9, 2019 · 39 comments · Fixed by #30155

Comments

@JokerVP
Copy link

JokerVP commented Dec 9, 2019

Home Assistant release with the issue:
0.102.0

Last working Home Assistant release (if known):
0.101.3

Operating environment (Hass.io/Docker/Windows/etc.):
Hassi.io in Docker on an Intel NUC with Ubuntu 18.04.3 LTS
Intel(R) Core(TM) i3-8109U CPU @ 3.00GHz
8GB RAM

Integration:
All the custom components have been removed. The issue still exists without custom components

Description of problem:
After a restart of Home-Assistant, everything works fine for about an hour. The RAM consumption of the Home-Assistant docker image is constant. After that, the consumption increases exponentially in steps. After another one or two hours, all the memory of the system is consumed up. Since the Linux tries to do memory swapping, the HDD LED is always on, CPU load is at 100%, and the system gets completely unresponsive. After about an hour, the Home-Assistant restarts and the system is stable again. The whole issue repeats.

I was able to take the following picture of the RAM consumption of the Home-Assistant container by using Portainer:
Memory 0 102 3 (no custom)

While the system is doing the memory swapping, I was able to connect to the host system over SSH with a lot of patience (it is extremely slow). The "top" command showed the following CPU and memory usage:

top - 09:51:56 up 14:55,  2 users,  load average: 116.85, 135.05, 88.71
Tasks: 378 total,   2 running, 325 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.7 us, 31.0 sy,  0.0 ni,  0.0 id, 43.8 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem :  8035168 total,   113200 free,  7783184 used,   138784 buff/cache
KiB Swap:  2097148 total,        0 free,  2097148 used.    25556 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9516 root      20   0 1332188   8892      0 S 100.6  0.1  27:35.67 grafana-server
   59 root      20   0       0      0      0 R  99.7  0.0 274:51.01 kswapd0
 1715 root      20   0 2386572  18528      0 S   8.0  0.2  23:34.93 dockerd
16466 root      20   0 8252480 6.866g      0 S   2.7 89.6  16:36.29 python3
 6437 root      20   0  385560   3264    156 D   1.2  0.0   0:23.55 php7.2
 3425 root      20   0  118424   8924      0 S   0.9  0.1   9:56.59 python3
26847 root      20   0  449932   5320   2176 D   0.9  0.1   8:38.62 Xorg
 6465 root      20   0  109104    768      0 S   0.6  0.0   1:54.53 containerd-shim

The PID 16466 belongs to the main Home-Assistant container.

Additional information:
I have tried it with several versions. The issue is introduced with version 0.102.0. Updating to 0.102.3 does not work. The problem still exists in the 0.103.0b0.

I am not a software engineer. Therefore, any help on further narrowing the issue down is highly appreciated.

@Martinvdm
Copy link

I think you should add some logging from the home assistant log file to narrow down the issue.

@Adminiuga
Copy link
Contributor

What integrations you enabled and configured? Start disabling those one by one to narrow the scope.

@JokerVP
Copy link
Author

JokerVP commented Dec 10, 2019

Thank you for the inputs. I cannot see anything in the log file. Nevertheless, please find an example here:

home-assistant (0.102.3 test 2).log

My Home-Assistant configuration is pretty big. I am using a lot of integrations. Since it always takes more than an hour until the error appears, it is quite a time-consuming task. I have already removed all the custom components and a couple of them which seem to be obvious when looking at the change list in version 0.102.0. Since the error still appeared, I removed all the configurations. After running it for more than two hours, the error did not appear.

I am now using basically a binary search approach for narrowing down the integration that causes the issue. I have deleted the lower part of my configuration. Since the error still appears, I know now, that it is in the upper part. That is the current state. I will try to do several more iterations until I maybe know which integration is causing the issue. It might take days...

If anyone has a better idea of how you can see which integration is consuming the memory, any help is very much appreciated!

@nickrout
Copy link

@nickrout
Copy link

mem_top looks useful - pretty simple to use https://pypi.org/project/mem_top/

@Adminiuga
Copy link
Contributor

Do you have icmp ping device tracker enabled? I think I saw someone complaining about me leak with it.

@JokerVP
Copy link
Author

JokerVP commented Dec 14, 2019

In order to make it short: the issue seems to be resolved with the new 0.103.0 version even though my previous tests with version 0.103.0b0 were not successful.

In the last couple of days, I did a lot of tests (they take several hours each). My previous approach narrowing it down with a binary search algorithm did not work out. I removed the lower part of my integrations. The memory leak was still present. Therefore, I removed the upper half of my integrations as a verification. Since also this test failed, I assume there must be some further dependencies. Since I knew that without integrations in the config file, the leak did not appear, I started adding only a few components at a time. Before I did the test with the latest Home-Assistant version, I had it basically down to 3 integrations. With these 3 integrations enabled (cloud, netatmo, and ifttt), the leak already appeared.

Since it seems to work now, I will stop my investigations. However, If it is helpful for the developers, I could try to nail it down to a single integration. I do not know whether that is worth it since there seem to be some dependencies. Please let me know. Otherwise, I will close the issue in a few days.

Thanks again for everybody for helping!

@nickrout: I did not see any differences in the mem_top at the start and after the leak had started. I uploaded you the console output I got after running the system for more than an hour:
mem_top leaked 1GB.txt

@Adminiuga: I do not use icmp ping. I am using only nmap_tracker and google_maps as device trackers. Nevertheless, I also removed these integrations but I did not see any changes.

@JokerVP
Copy link
Author

JokerVP commented Dec 14, 2019

Unfortunately, I seem that the issue is actually not resolved in 0.103.0. The whole system was running stable for more than two days. Today, I did the first restart after the update. About two hours later, the system crashed again since it was running out of memory.
I have now rolled back to version 0.101.3. I will do further tests in the next days.

@silfa718
Copy link

HA was crashing all day today and couldn't figure out why until I saw this:
Screenshot_20191215-230038.

I had installed .103 Friday morning and everything was fine until this morning. Probably a reboot late Friday triggered it to start.

I rolled back to .102.3.

@JokerVP
Copy link
Author

JokerVP commented Dec 16, 2019

@silfa718, very interesting. For me 0.102.3 does not work, I have to roll back to 0.101.3 in order to resolve the issue. Can you share more information about your system? Are you using an x86 system? Are you on Hassio in Docker?
Probably the most interesting point is that you wrote that it started after a reboot.

@silfa718
Copy link

silfa718 commented Dec 16, 2019

@JokerVP No luck, I woke up this morning and it's happening again. But I don't believe it was doing for me before .103.
My system sounds exactly like yours x86_64, Hassio, Docker, Ubuntu 18, i3 with 8 gigs Ram.

I know I did so an update to Ubuntu before I updated to .103, which I didn't do in two months. Originally I thought that might have be my issue.

Just rolled back to .102.2 which is what I updated to .103 from. I'll report back how it goes.

Still happening on .102.2, I will give .101.3 a try.

@silfa718
Copy link

12 hours on .101.3
Screenshot_20191216-223037

@JokerVP
Copy link
Author

JokerVP commented Dec 17, 2019

@silfa718 Can you confirm that the issue does not appear anymore on version 0.101.3? This would be completely in line with what I am experiencing.

Yesterday, I did again a test in which I removed all the components in the configuration file. I wrote a couple of days ago that with this empty configuration the memory leak does not appear. Unfortunately, after retesting it, I still experienced the leak. Therefore, I did yesterday tests with removing also all integrations which are configured in the UI. I did not see the leak after two hours. I did this test twice. Currently, I am running a test in which I disable one UI integration after another one. Maybe you can help me there. Do you use any of these integrations?

  • Google Cast
  • MQTT broker
  • ESP Home
  • deCONZ
  • HEOS

If so, It would make sense to start removing the common one first.

@silfa718
Copy link

silfa718 commented Dec 17, 2019

@JokerVP It doesn't crash on .101.3. Strange though, I didn't experience a crash on .102.2 before I updated to .103, but reverting back to .102.2 crashes.

From your list I also use:
Google Cast
MQTT broker
ESP Home
deConz

I also have these integrations:
HACS
IQVIA
Ad Guard
Zones
IFTTT
Lifx
Webhook
Axis
Mobile App
Sonos
Wemo
Google Hangouts
Unifi

@JokerVP
Copy link
Author

JokerVP commented Dec 17, 2019

Finally, I think I am narrowing the issue down. After disabling the Google Cast integration did not work, I disabled deConz. Today, I did a total of 3 tests over several hours. It all seems to prove that disabling deConz is actually resolving the memory leak issue. I run the system with all other integrations enabled. It runs for hours without any issue.

@silfa718: It would be very helpful if you could also disable deConz on your system and check whether this resolves the issue.

I am currently running deConz in version 3.8. In the meantime, there is already version 3.9 available. I will do tomorrow tests with the latest version.

I also checked the changes between Home-Assistant version 0.101.3 and 0.102.0. There were actually some pull requests form Kane610.

@silfa718
Copy link

silfa718 commented Dec 17, 2019

@JokerVP installed .103, removed deConz integration, set the deConz addon to not load, rebooted.
I'll report back in two hours.

Edit: Two hours and no crash and no memory spike.

Edit 2: How did you get back to deConz 3.8?

@JokerVP
Copy link
Author

JokerVP commented Dec 18, 2019

I did the test again with the latest version of the deConz add-on 3.9 with the same results. The memory still leaks with the add-on version 3.9 under Home-Assistant V0.103.0.

@silfa718: Thank you for confirming that disabling the deConz integration also helps on your site. I do not know how to roll back to 3.8, but I do not think that makes any sense right now. I do not see any differences between 3.8 and 3.9.

I think we are now pretty sure that the changes in the deConz integration that have been introduced in Home-Assistant V0.102.0 are causing these issues. @Kane610: can have a look into the issue? Please let me know if I can help you doing further investigations on the issue. Thank you in advance.

@Kane610
Copy link
Member

Kane610 commented Dec 18, 2019

I don't know what to look for, the changes for 102 are small in themselves. Im running everything on similar setups as you except for not running hassio and I don't experience any issue like what you describe.

What happens if you only enable deconz?

@silfa718
Copy link

Yesterday I noticed that supervisor 193 was available when enabling beta and gave that a try on .103, but the same crash happened just like 192.

@Kane610 I am running the same hardware and hassio like @JokerVP

@JokerVP
Copy link
Author

JokerVP commented Dec 18, 2019

@Kane610: Since I never did a test in which I disabled everything, I just did it this afternoon. I removed everything from the config file that is not required to get Home-Assistant running:
configuration.txt
I removed all the integrations, except for deConz in the UI. After about one hour of running the system, the memory started to increase:
Only deConz
I took this graph from Portainer which I run outside of Hassio on the same Docker environment. The whole thing is now perfectly reproducible. The memory always starts to increase after about one hour, independent of how much memory is used by other integrations. The memory increases always in steps. The time between the steps is almost precisely 5 minutes. Every step is about double the size of the previous one. This means it increases exponentially. If I remove the deConz integration while the memory has been increased already, the memory will stay at the same level until I reboot the system.

@Kane610
Copy link
Member

Kane610 commented Dec 18, 2019

Can you enable full debug logs for HASS to see if there is something to identify

@agners
Copy link
Member

agners commented Dec 18, 2019

@Kane610 not an expert, but looking at the changes using

git log -p 0.101.3...0.102.0 -- ./homeassistant/components/deconz/*.py

It seems that the battery stuff is the biggest change. Without having deeper knowledge it what is going on, it seems that creating the trackers in homeassistant/components/deconz/sensor.py uses a different if conditions where trackers get created and removed.. Could it be that this is continuously creating new DeconzSensorStateTrackers?

FWIW, I use deconz too and haven't seen the issue, so it probably depends on what devices are floating in the Zigbee network...

@JokerVP
Copy link
Author

JokerVP commented Dec 18, 2019

I have set the logger in the configuration.yaml to "info". Here again the log from a test in which I did a couple of days ago. It was a test in which not all other integrations were disabled. However, I did not see anything in the log.

home-assistant.0.102.3.test.2.log

The only battery-operated devices in my ZigBee network are a couple of Philips Hue Motion Sensors and Philips Hue Dimmer Switches.

@Kane610
Copy link
Member

Kane610 commented Dec 18, 2019

Set up hass with what is minimal to reproduce enable debug for everything and look through the logs for suspicious behaviour.

The Statetracker will only create a new tracker when a signal for a new, I see a small improvement could be made to make sure to not create duplicates if it signals a new device is created. If this is the root cause of your issue it would show in the logs

@JokerVP
Copy link
Author

JokerVP commented Dec 19, 2019

@Kane610: Thank you for clarifying how to use the logging part. I did now some tests with setting the logging to "debug". Please find here a log file which was copied after the memory leak manifested:

home-assistant.log

I noticed is that there is a direct correlation between the memory steps and the log file. The memory increases in steps every five minutes. The log file repeats itself every five minutes. It seems that I get new values from the battery devices every five minutes.

I tried to capture a single memory usage step. The shortest sampling rate that was still working was 3s:

Detailed Step3

The memory starts increasing at 14:25:03. Looking at the attached log file this seem to happen when the following debug message is printed:

2019-12-19 14:25:02 DEBUG (MainThread) [pydeconz.websocket] Websocket data: {"config":{"battery":0,"group":"29306","on":true,"reachable":true},"e":"changed","id":"2","r":"sensors","t":"event","uniqueid":"00:17:88:01:10:39:29:b9-02-fc00"}

2019-12-19 14:25:04 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event timer_out_of_sync[L]: seconds=3.756566993019078>

I checked it for several memory-increas-steps. It always correlates with this debug message. Does this information help? Please let me know if I can do further investigations that could help you.

@Kane610
Copy link
Member

Kane610 commented Dec 19, 2019

It does indeed look like the battery value 0 would create multiple sensors trackers, I will evaluate to see what happens

@silfa718
Copy link

Tried again with the 103.3 and 4.1 addon update....crashed again.

@ottelo9
Copy link

ottelo9 commented Dec 22, 2019

I've the same problem.
https://community.home-assistant.io/t/0-103-watchdog-found-a-problem-with-home-assistant-docker-periodically-memory-leak/157006

My system hangs every 100min. I have lot of zigbee devices connected to deconz.

EDIT:
I moved back to Deconz 3.8 (via restore snapshot).

EDIT2:
Same problem with Deconz 3.8

@agners
Copy link
Member

agners commented Dec 22, 2019

@Kane610

I still think it must be this block of code:

            if sensor.battery:
                new_battery = DeconzBattery(sensor, gateway)
                if new_battery.unique_id not in batteries:
                    batteries.add(new_battery.unique_id)
                    entities.append(new_battery)
                    battery_handler.remove_tracker(sensor)
            else:
                battery_handler.create_tracker(sensor)

Shouldn't that be indented, e.g.?

...
            if sensor.battery:
                new_battery = DeconzBattery(sensor, gateway)
                if new_battery.unique_id not in batteries:
                    batteries.add(new_battery.unique_id)
                    entities.append(new_battery)
                    battery_handler.remove_tracker(sensor)
                else:
                    battery_handler.create_tracker(sensor)

@agners
Copy link
Member

agners commented Dec 22, 2019

However, what I don't quite understand is why the memory leak is exponential, that looks like there is a list being copied which shouldn't be.

@Kane610
Copy link
Member

Kane610 commented Dec 22, 2019

@agners no, the tracker is used to keep a check of devices not currently reporting battery state, which is common on a newly restarted system. The problem is that a battery value of '0' would evaluate if sensor.battery: as false. Which if a sensor reports battery state of 0 would for each tracker create a new tracker exponentially increasing the amounts of sensors. Lets see if it improves your situation with next release

@ottelo9
Copy link

ottelo9 commented Dec 23, 2019

Can anyone test Release 0.103.4?

@JokerVP
Copy link
Author

JokerVP commented Dec 23, 2019

I am currently testing it with 0.103.4. It runs now for about 4.5 hours without an issue.

@Kane610: Thank you for all the effort. I was not expecting to get the fix so early. Great Christmas gift!

@Kane610
Copy link
Member

Kane610 commented Dec 23, 2019

Awesome! Lets hope its completely fixed then. (I would have fixed it sooner but was busy with other stuff)

@silfa718
Copy link

silfa718 commented Dec 23, 2019

I am currently testing it with 0.103.4. It runs now for about 4.5 hours without an issue.

@Kane610: Thank you for all the effort. I was not expecting to get the fix so early. Great Christmas gift!
@JokerVP
What version of the add-on are you running

@JokerVP
Copy link
Author

JokerVP commented Dec 23, 2019

I am still running version 3.9. I will probably update tomorrow.

@ottelo9
Copy link

ottelo9 commented Dec 23, 2019

I'm now testing 0.103.4 with Deconz 4.1.
@Kane610 Thanks for your help!!!!!!

EDIT
Until now, no problems!

@silfa718
Copy link

24 hours on .103.4 and deconz 3.9 and no issues.

@Pirol62
Copy link

Pirol62 commented Jan 22, 2020

Sorry, but after passing 50% Swapuse I think, the problem is still relevant.
I'm running on 103.4 and deConz 5.0.
Swap has increased during the last 7 days from 40 to 51%
I will open a new issue, because having such a large swap when memory use ist at 35% in imo not ok anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants