New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Massive Memory Leak when running on Intel NUC #29786
Comments
I think you should add some logging from the home assistant log file to narrow down the issue. |
What integrations you enabled and configured? Start disabling those one by one to narrow the scope. |
Thank you for the inputs. I cannot see anything in the log file. Nevertheless, please find an example here: home-assistant (0.102.3 test 2).log My Home-Assistant configuration is pretty big. I am using a lot of integrations. Since it always takes more than an hour until the error appears, it is quite a time-consuming task. I have already removed all the custom components and a couple of them which seem to be obvious when looking at the change list in version 0.102.0. Since the error still appeared, I removed all the configurations. After running it for more than two hours, the error did not appear. I am now using basically a binary search approach for narrowing down the integration that causes the issue. I have deleted the lower part of my configuration. Since the error still appears, I know now, that it is in the upper part. That is the current state. I will try to do several more iterations until I maybe know which integration is causing the issue. It might take days... If anyone has a better idea of how you can see which integration is consuming the memory, any help is very much appreciated! |
There is some stuff here https://medium.com/zendesk-engineering/hunting-for-memory-leaks-in-python-applications-6824d0518774 |
mem_top looks useful - pretty simple to use https://pypi.org/project/mem_top/ |
Do you have icmp ping device tracker enabled? I think I saw someone complaining about me leak with it. |
In order to make it short: the issue seems to be resolved with the new 0.103.0 version even though my previous tests with version 0.103.0b0 were not successful. In the last couple of days, I did a lot of tests (they take several hours each). My previous approach narrowing it down with a binary search algorithm did not work out. I removed the lower part of my integrations. The memory leak was still present. Therefore, I removed the upper half of my integrations as a verification. Since also this test failed, I assume there must be some further dependencies. Since I knew that without integrations in the config file, the leak did not appear, I started adding only a few components at a time. Before I did the test with the latest Home-Assistant version, I had it basically down to 3 integrations. With these 3 integrations enabled (cloud, netatmo, and ifttt), the leak already appeared. Since it seems to work now, I will stop my investigations. However, If it is helpful for the developers, I could try to nail it down to a single integration. I do not know whether that is worth it since there seem to be some dependencies. Please let me know. Otherwise, I will close the issue in a few days. Thanks again for everybody for helping! @nickrout: I did not see any differences in the mem_top at the start and after the leak had started. I uploaded you the console output I got after running the system for more than an hour: @Adminiuga: I do not use icmp ping. I am using only nmap_tracker and google_maps as device trackers. Nevertheless, I also removed these integrations but I did not see any changes. |
Unfortunately, I seem that the issue is actually not resolved in 0.103.0. The whole system was running stable for more than two days. Today, I did the first restart after the update. About two hours later, the system crashed again since it was running out of memory. |
@silfa718, very interesting. For me 0.102.3 does not work, I have to roll back to 0.101.3 in order to resolve the issue. Can you share more information about your system? Are you using an x86 system? Are you on Hassio in Docker? |
@JokerVP No luck, I woke up this morning and it's happening again. But I don't believe it was doing for me before .103. I know I did so an update to Ubuntu before I updated to .103, which I didn't do in two months. Originally I thought that might have be my issue. Just rolled back to .102.2 which is what I updated to .103 from. I'll report back how it goes. Still happening on .102.2, I will give .101.3 a try. |
@silfa718 Can you confirm that the issue does not appear anymore on version 0.101.3? This would be completely in line with what I am experiencing. Yesterday, I did again a test in which I removed all the components in the configuration file. I wrote a couple of days ago that with this empty configuration the memory leak does not appear. Unfortunately, after retesting it, I still experienced the leak. Therefore, I did yesterday tests with removing also all integrations which are configured in the UI. I did not see the leak after two hours. I did this test twice. Currently, I am running a test in which I disable one UI integration after another one. Maybe you can help me there. Do you use any of these integrations?
If so, It would make sense to start removing the common one first. |
@JokerVP It doesn't crash on .101.3. Strange though, I didn't experience a crash on .102.2 before I updated to .103, but reverting back to .102.2 crashes. From your list I also use: I also have these integrations: |
Finally, I think I am narrowing the issue down. After disabling the Google Cast integration did not work, I disabled deConz. Today, I did a total of 3 tests over several hours. It all seems to prove that disabling deConz is actually resolving the memory leak issue. I run the system with all other integrations enabled. It runs for hours without any issue. @silfa718: It would be very helpful if you could also disable deConz on your system and check whether this resolves the issue. I am currently running deConz in version 3.8. In the meantime, there is already version 3.9 available. I will do tomorrow tests with the latest version. I also checked the changes between Home-Assistant version 0.101.3 and 0.102.0. There were actually some pull requests form Kane610. |
@JokerVP installed .103, removed deConz integration, set the deConz addon to not load, rebooted. Edit: Two hours and no crash and no memory spike. Edit 2: How did you get back to deConz 3.8? |
I did the test again with the latest version of the deConz add-on 3.9 with the same results. The memory still leaks with the add-on version 3.9 under Home-Assistant V0.103.0. @silfa718: Thank you for confirming that disabling the deConz integration also helps on your site. I do not know how to roll back to 3.8, but I do not think that makes any sense right now. I do not see any differences between 3.8 and 3.9. I think we are now pretty sure that the changes in the deConz integration that have been introduced in Home-Assistant V0.102.0 are causing these issues. @Kane610: can have a look into the issue? Please let me know if I can help you doing further investigations on the issue. Thank you in advance. |
I don't know what to look for, the changes for 102 are small in themselves. Im running everything on similar setups as you except for not running hassio and I don't experience any issue like what you describe. What happens if you only enable deconz? |
@Kane610: Since I never did a test in which I disabled everything, I just did it this afternoon. I removed everything from the config file that is not required to get Home-Assistant running: |
Can you enable full debug logs for HASS to see if there is something to identify |
@Kane610 not an expert, but looking at the changes using
It seems that the battery stuff is the biggest change. Without having deeper knowledge it what is going on, it seems that creating the trackers in FWIW, I use deconz too and haven't seen the issue, so it probably depends on what devices are floating in the Zigbee network... |
I have set the logger in the configuration.yaml to "info". Here again the log from a test in which I did a couple of days ago. It was a test in which not all other integrations were disabled. However, I did not see anything in the log. home-assistant.0.102.3.test.2.log The only battery-operated devices in my ZigBee network are a couple of Philips Hue Motion Sensors and Philips Hue Dimmer Switches. |
Set up hass with what is minimal to reproduce enable debug for everything and look through the logs for suspicious behaviour. The Statetracker will only create a new tracker when a signal for a new, I see a small improvement could be made to make sure to not create duplicates if it signals a new device is created. If this is the root cause of your issue it would show in the logs |
@Kane610: Thank you for clarifying how to use the logging part. I did now some tests with setting the logging to "debug". Please find here a log file which was copied after the memory leak manifested: I noticed is that there is a direct correlation between the memory steps and the log file. The memory increases in steps every five minutes. The log file repeats itself every five minutes. It seems that I get new values from the battery devices every five minutes. I tried to capture a single memory usage step. The shortest sampling rate that was still working was 3s: The memory starts increasing at 14:25:03. Looking at the attached log file this seem to happen when the following debug message is printed:
I checked it for several memory-increas-steps. It always correlates with this debug message. Does this information help? Please let me know if I can do further investigations that could help you. |
It does indeed look like the battery value 0 would create multiple sensors trackers, I will evaluate to see what happens |
Tried again with the 103.3 and 4.1 addon update....crashed again. |
I've the same problem. My system hangs every 100min. I have lot of zigbee devices connected to deconz. EDIT: EDIT2: |
I still think it must be this block of code:
Shouldn't that be indented, e.g.?
|
However, what I don't quite understand is why the memory leak is exponential, that looks like there is a list being copied which shouldn't be. |
@agners no, the tracker is used to keep a check of devices not currently reporting battery state, which is common on a newly restarted system. The problem is that a battery value of '0' would evaluate if sensor.battery: as false. Which if a sensor reports battery state of 0 would for each tracker create a new tracker exponentially increasing the amounts of sensors. Lets see if it improves your situation with next release |
Can anyone test Release 0.103.4? |
I am currently testing it with 0.103.4. It runs now for about 4.5 hours without an issue. @Kane610: Thank you for all the effort. I was not expecting to get the fix so early. Great Christmas gift! |
Awesome! Lets hope its completely fixed then. (I would have fixed it sooner but was busy with other stuff) |
I am still running version 3.9. I will probably update tomorrow. |
I'm now testing 0.103.4 with Deconz 4.1. EDIT |
24 hours on .103.4 and deconz 3.9 and no issues. |
Sorry, but after passing 50% Swapuse I think, the problem is still relevant. |
Home Assistant release with the issue:
0.102.0
Last working Home Assistant release (if known):
0.101.3
Operating environment (Hass.io/Docker/Windows/etc.):
Hassi.io in Docker on an Intel NUC with Ubuntu 18.04.3 LTS
Intel(R) Core(TM) i3-8109U CPU @ 3.00GHz
8GB RAM
Integration:
All the custom components have been removed. The issue still exists without custom components
Description of problem:
After a restart of Home-Assistant, everything works fine for about an hour. The RAM consumption of the Home-Assistant docker image is constant. After that, the consumption increases exponentially in steps. After another one or two hours, all the memory of the system is consumed up. Since the Linux tries to do memory swapping, the HDD LED is always on, CPU load is at 100%, and the system gets completely unresponsive. After about an hour, the Home-Assistant restarts and the system is stable again. The whole issue repeats.
I was able to take the following picture of the RAM consumption of the Home-Assistant container by using Portainer:
While the system is doing the memory swapping, I was able to connect to the host system over SSH with a lot of patience (it is extremely slow). The "top" command showed the following CPU and memory usage:
The PID 16466 belongs to the main Home-Assistant container.
Additional information:
I have tried it with several versions. The issue is introduced with version 0.102.0. Updating to 0.102.3 does not work. The problem still exists in the 0.103.0b0.
I am not a software engineer. Therefore, any help on further narrowing the issue down is highly appreciated.
The text was updated successfully, but these errors were encountered: