Skip to content
This repository has been archived by the owner on May 17, 2021. It is now read-only.

OutOfMemory issues #1961

Closed
petrklus opened this issue Jan 11, 2015 · 64 comments
Closed

OutOfMemory issues #1961

petrklus opened this issue Jan 11, 2015 · 64 comments
Labels
Milestone

Comments

@petrklus
Copy link
Contributor

Hello,

I am experiencing occasional lockups, which are caused by system running out of Heap space. The symptoms are that OH starts using 100% of CPU (single thread), then event processing slows down until it does not continue anymore.

Please see my logs - the "fun" starts in the openhab.log around 3am
logs.zip | uploaded via ZenHub

Explanation for the logs:

  • the zwave.log contains zwave binding information separately
  • the console.log is a copy&paste of the osgi console terminal, those are things that are probably not in the logs (my logback settings does not show any of the openhab/events onto STDOUT, so my assumption is that what actually makes it there is not in the logs)
@petrklus
Copy link
Contributor Author

I would like to add that I was seeing the same issue before upgrade (running 1.5.0)

@digitaldan
Copy link
Contributor

Hi Petr, can you provide a heap dump by executing the following command:

jmap -dump:format=b,file=heap.bin pid

Where "pid" is the process ID of the Java process? I don't know how long it
takes to lock up, but maybe let it run for 24 hours then take a heap dump
(and post here), and then again when it does lock up. The jmap exe should
be located in the JDK bin folder if its not on your path.

On Sun, Jan 11, 2015 at 5:15 AM, Petr Klus notifications@github.com wrote:

I would like to add that I was seeing the same issue before upgrade
(running 1.5.0)


Reply to this email directly or view it on GitHub
#1961 (comment).

@digitaldan
Copy link
Contributor

One last thing, if you're worried the heap dump may contain sensitive
data you could email it to me, dan at digitaldan dot com

On Sun, Jan 11, 2015 at 8:22 AM, dan cunningham dan@digitaldan.com wrote:

Hi Petr, can you provide a heap dump by executing the following command:

jmap -dump:format=b,file=heap.bin pid

Where "pid" is the process ID of the Java process? I don't know how long
it takes to lock up, but maybe let it run for 24 hours then take a heap
dump (and post here), and then again when it does lock up. The jmap exe
should be located in the JDK bin folder if its not on your path.

On Sun, Jan 11, 2015 at 5:15 AM, Petr Klus notifications@github.com
wrote:

I would like to add that I was seeing the same issue before upgrade
(running 1.5.0)


Reply to this email directly or view it on GitHub
#1961 (comment).

@teichsta
Copy link
Member

Hi,
could you provide some more details like OS, JDK, installed Add-Ons?
Thanks, Thomas E.-E.

@teichsta teichsta added the bug label Jan 11, 2015
@teichsta teichsta added this to the 1.6.2 milestone Jan 11, 2015
@petrklus
Copy link
Contributor Author

Here are the contents of my addon directory:
image

Will run the dump as soon as it happens again - I would like to clarify, it does not seem to be happening every time I restart - only if it does, it manifests itself within 24 hours.

@digitaldan
Copy link
Contributor

@petrklus , as @teichsta mentioned above can you post your OS and JDK?

@petrklus
Copy link
Contributor Author

Sorry for the delay!

OS: Ubuntu 14.04 LTS
Java info:
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) Client VM (build 24.60-b09, mixed mode)
CPU: ARMv7 Processor rev 0 (v7l)

On Mon, Jan 12, 2015 at 11:53 PM, Dan notifications@github.com wrote:

@petrklus https://github.com/petrklus , as @teichsta
https://github.com/teichsta mentioned above can you post your OS and
JDK?


Reply to this email directly or view it on GitHub
#1961 (comment).

@zacofany
Copy link

Yes, I'm experiencing this too with OH 1.6.1. I tried to reinstall and reconfigure OH and java (tried both 1.7 and 1.8 version) many times on my Banana PI (tried Bananian, Raspbian and Lubuntu) and I was always getting out of Heap Space after few hours. Last weekend I discovered that in my case this is strictly dependent on using Habdroid. Since I stopped to use habdroid on my (3 different) devices it looks like OH remains very stable. It looks like that it doesn't matter on which hardware Habdroid is running. Once it is connected to OH it makes that openhab starts to generate some errors and at the very end we get out of Heap Space. Also I see that Habdroid version available in Google Play is pretty old. Now the only OH control I use is direct connection via http.
If I can help you somehow to fix this issue please let me know what I should provide to you.

@petrklus
Copy link
Contributor Author

@zacofany True, I do experience the issue more often when using Habdroid - will fire a few instances to see what happens! Which version of Java are you using?

@teichsta How does my OS/JDK look to you, is there anything that is known to cause problems?

@teichsta
Copy link
Member

It looks like that it doesn't matter on which hardware Habdroid is running.

is it really HABDroid or does this happen when using GreenT or iOS as well? If yes, than it could be related to the Atmosphere issue #765 which also came along with an OOM.

How does my OS/JDK look to you, is there anything that is known to cause problems?

looks good, unfortunately not. How much RAM does your System provide?

@zacofany
Copy link

Hi. Here are some details regarding my configuration:
mainboard - Banana PI
system - Bananian 15.01 r01

root@galactica ~ # inxi -Fx
System: Host: galactica Kernel: 3.4.104+ armv7l (32 bit gcc: 4.8.2) Console: tty 0
Distro: Debian GNU/Linux 7
Machine: No /sys/class/dmi; using dmidecode: dmidecode is not installed.
CPU: Dual core ARMv7 Processor rev 4 (v7l) (-MCP-) (ARM)bmips: 4008
clock speeds: max: 1008 MHz 1: 1008 MHz 2: 1008 MHz
Graphics: Card: Failed to Detect Video Card!
Display Server: N/A driver: N/A
tty size: 91x49 Advanced Data: N/A for root out of X
Audio: Card sunxi-CODEC driver: sunxi-CODEC Sound: ALSA v: 1.0.25
Network: Card: Failed to Detect Network Card!
Drives: HDD Total Size: 1000.2GB (21.5% used)
ID-1: /dev/sda model: ST1000LM014 size: 1000.2GB
Partition: ID-1: / size: 15G used: 3.1G (23%) fs: ext4 dev: /dev/root
RAID: No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors: None detected - is lm-sensors installed and configured?
Info: Processes: 73 Uptime: 4 days Memory: 196.6/970.5MB
Init: SysVinit runlevel: 2 Gcc sys: N/A
Client: Shell (zsh 4.3.17) inxi: 2.2.16

root@galactica ~ # java -version
java version "1.8.0_06"
Java(TM) SE Runtime Environment (build 1.8.0_06-b23)
Java HotSpot(TM) Client VM (build 25.6-b23, mixed mode)

addons installed:
root@galactica ~ # ls /usr/share/openhab/addons
org.openhab.binding.exec-1.6.1.jar org.openhab.binding.weather-1.6.1.jar
org.openhab.binding.http-1.6.1.jar org.openhab.binding.wol-1.6.1.jar
org.openhab.binding.modbus-1.6.1.jar org.openhab.io.dropbox-1.6.1.jar
org.openhab.binding.ntp-1.6.1.jar org.openhab.io.transport.serial-1.6.1.jar
org.openhab.binding.onkyo-1.6.1.jar org.openhab.persistence.rrd4j-1.6.1.jar
org.openhab.binding.serial-1.6.1.jar README
org.openhab.binding.tcp-1.6.1.jar

I will try to provide some logs soon.

@zacofany
Copy link

OK here is some more info. Openhab.log attached.
@teichsta I dont't know if this is the same issue as Atmosphere issue #765 .
Also out of heap space errors starts only when I start Habdroids. It never happened when I used just http UI.
In terms of greent I believe that it has also nothing to this issue but I have the other issue which you can see in my log from:
2015-01-14 20:52:03.136 [ERROR] [o.i.r.i.f.SendPageUpdateFilter] - null (greent started)
to
2015-01-14 20:52:29.382 [ERROR] [o.i.r.i.f.SendPageUpdateFilter] - null (greent stopped)

I started Habdroids on my 3 devices and the first ERR message is on:
2015-01-14 20:53:31.343 [ERROR] [.r.i.filter.PollingDelayFilter] - null

openhab_outofheap.log | uploaded via ZenHub

top.screen | uploaded via ZenHub

@digitaldan
Copy link
Contributor

@zacofany, was this your whole log file? Did your system startup at 2015-01-14 20:23:59.795 and then throw the OOM error an hour later at 2015-01-14 21:31:49.625 ? Thats a very aggressive memory issue. Can you try getting a heap dump as mentioned in a previous comment?

@zacofany
Copy link

@digitaldan Yes, unfortunately that's the whole log file. I will try to provide you new log and heap dump as soon as possible. However this would be the first time for me to do heap dump. At what point should I launch command
jmap -F -dump:format=b,file=heap.bin pid ? How long does it take?

@petrklus
Copy link
Contributor Author

@digitaldan Happened again! here are my logs + heap dump:

http://cnktmp.s3.amazonaws.com/petr/20150116_logs_with_heap.zip

The trick was indeed to let HABdroid open on a few of my devices and that seems to speed up the manifestation of the problem.

@zacofany
Copy link

Here is new log together with heap dump.

openhab.log | uploaded via ZenHub

heap.zip | uploaded via ZenHub

@digitaldan
Copy link
Contributor

Thanks all, I have enough to go on, there is definitely a leak in the broadcaster class, I'm going back through the binding to see what I can do. If I need more logging I will let you know.

@zacofany
Copy link

Thanks for the info. If you need anything more please let us know. I can even provide you remote access to my system if it is necessary.

@petrklus
Copy link
Contributor Author

Which binding is the leak in? Just curious..

On Mon, Jan 19, 2015 at 3:22 PM, Dan notifications@github.com wrote:

Thanks all, I have enough to go on, there is definitely a leak in the
broadcaster class, I'm going back through the binding to see what I can do.
If I need more logging I will let you know.


Reply to this email directly or view it on GitHub
#1961 (comment).

@digitaldan
Copy link
Contributor

This would be the io.rest addon

On Mon, Jan 19, 2015 at 6:46 AM, Petr Klus notifications@github.com wrote:

Which binding is the leak in? Just curious..

On Mon, Jan 19, 2015 at 3:22 PM, Dan notifications@github.com wrote:

Thanks all, I have enough to go on, there is definitely a leak in the
broadcaster class, I'm going back through the binding to see what I can
do.
If I need more logging I will let you know.


Reply to this email directly or view it on GitHub
#1961 (comment).


Reply to this email directly or view it on GitHub
#1961 (comment).

@petrklus
Copy link
Contributor Author

@digitaldan Is it somehow specific to a mode of communication? (WebSocket, etc.) The reason I am asking is that I am having almost constantly open the standard web interface (not GreenT) on multiple machines.. and it seems to be just fine!

Keep us updated please on the progress, would like to have my tablets running again! Thank you very much for looking into this..

@digitaldan
Copy link
Contributor

What version of the android client is everyone on? Could you all please post a version number?

@petrklus
Copy link
Contributor Author

HABDroid version: 1.5.1.1 (from Play store)

On Mon, Jan 19, 2015 at 6:15 PM, Dan notifications@github.com wrote:

What version of the android client is everyone on? Could you all please
post a version number?


Reply to this email directly or view it on GitHub
#1961 (comment).

@digitaldan
Copy link
Contributor

@zacofany and @petrklus could you start OH using the debug script (start_debug.sh)? I only need an hour or so worth of those logs while you have android clients up. I'm hoping they show a abnormal amount of http traffic.

@zacofany
Copy link

@digitaldan I used cloudbees repository to install openhab. I need to download OH package to run start_debug.sh. Or is there any other solution for running openhab service in debug mode?

@petrklus
Copy link
Contributor Author

@digitaldan on it, just started with DEBUG. I think the other factor in my case are sensors that are pushing approx. 1 update / second via REST interface.

Do you need it to freeze again or is a snapshot after couple of hours fine?

@teichsta
Copy link
Member

I used cloudbees repository to install openhab. I need to download OH package to run start_debug.sh.

you'll find the start_debug.sh script within the distribution-runtime archive.

Best, Thomas E.-E.

@zacofany
Copy link

@petrklus
Copy link
Contributor Author

@digitaldan I meant that the other devices are being pulled by OH, using the HTTP binding mainly, with some subsequent conversion in javascript.

Would it be helpful to send you a copy of all my items, configurations and transports?

@petrklus
Copy link
Contributor Author

@digitaldan Is there any more information I could provide? Would it be helpful to get a heap dump of OH running for a few days without the android clients? Is there anything further I could help with?

@digitaldan
Copy link
Contributor

Thanks, but I have enough to go on. I have setup a simple demo environment
with rules that update items every second, so far I have had the OOM issue
happen once, but not every time, strange. I can see the problem, for some
reason the BroadcasterCache starts to allow lots of items and pagebeans to
backup for active sessions (uuid). I had thought this was fixed with a
small filter I put in place to ensure the cache only contained one item at
most, the current version of the sitemaps (or piece of it), but obviously
thats not the case. So the short answer it I'm still looking ;-)

On Thu, Jan 29, 2015 at 1:38 AM, Petr Klus notifications@github.com wrote:

@digitaldan https://github.com/digitaldan Is there any more information
I could provide? Would it be helpful to get a heap dump of OH running for a
few days without the android clients? Is there anything further I could
help with?


Reply to this email directly or view it on GitHub
#1961 (comment).

@petrklus
Copy link
Contributor Author

Thanks for the update! Let me know if you find anything that could help..
also, you've mentioned that it could be two separate memory leaks - is this
still the case? (just curious)

On Thu, Jan 29, 2015 at 3:31 PM, Dan notifications@github.com wrote:

Thanks, but I have enough to go on. I have setup a simple demo environment
with rules that update items every second, so far I have had the OOM issue
happen once, but not every time, strange. I can see the problem, for some
reason the BroadcasterCache starts to allow lots of items and pagebeans to
backup for active sessions (uuid). I had thought this was fixed with a
small filter I put in place to ensure the cache only contained one item at
most, the current version of the sitemaps (or piece of it), but obviously
thats not the case. So the short answer it I'm still looking ;-)

On Thu, Jan 29, 2015 at 1:38 AM, Petr Klus notifications@github.com
wrote:

@digitaldan https://github.com/digitaldan Is there any more
information
I could provide? Would it be helpful to get a heap dump of OH running
for a
few days without the android clients? Is there anything further I could
help with?


Reply to this email directly or view it on GitHub
#1961 (comment).


Reply to this email directly or view it on GitHub
#1961 (comment).

@teichsta teichsta removed this from the 1.6.2 milestone Jan 29, 2015
@petrklus
Copy link
Contributor Author

petrklus commented Feb 5, 2015

@digitaldan Hello! Just checking in again.. without Android clients running, my system is still fine after more than a week. I occasionally launch my phone app to switch a light on/off but I do not keep it open.

Unfortunately, one of my use cases is having an android tablet mounted as a control for headless Volumio music player.. which is not possible at the moment. Is there any other way I could help out with?

@digitaldan
Copy link
Contributor

I have been on the road all week and have not found time to focus on it.
The good news is that I can reproduce it all the time , I have an idea for
a fix, I'm hoping to play with it this weekend when I'm back.

Sent from my mobile.

On Feb 5, 2015, at 3:31 AM, Petr Klus notifications@github.com wrote:

@digitaldan https://github.com/digitaldan Hello! Just checking in again..
without Android clients running, my system is still fine after more than a
week. I occasionally launch my phone app to switch a light on/off but I do
not keep it open.

Unfortunately, one of my use cases is having an android tablet mounted as a
control for headless Volumio music player.. which is not possible at the
moment. Is there any other way I could help out with?


Reply to this email directly or view it on GitHub
#1961 (comment).

@teichsta
Copy link
Member

teichsta commented Feb 5, 2015

Thanks @digitaldan appreciate your help!

@digitaldan
Copy link
Contributor

I see whats happening here, there was a change made to the android client, openhab/openhab-android@44de4f4 . The comment says "Changed polling cycle to alternate long polling/regular requests. Improves working with current buggy Atmosphere in openHAB" . This is effectively causing a new atmosphere session to be created every other request. Every session starts to build up a cache of changed objects which only gets cleaned up after our default timeout (i think), on many systems I don't think this is fast enough and it runs out of memory. Let me open a ticket up with @belovictor and see if we can work something out.

@petrklus
Copy link
Contributor Author

petrklus commented Feb 8, 2015

that makes sense! However, just a philosophical point here - is this that
leaves OH susceptible to an easy DOS attack? Or does the functionality need
to work like this?

Also, having no idea how Atmosphere works but is there any way to define
some kind of a FIFO fixed maximum size cache?

Thank you VERY much for looking into this!

On Sun, Feb 8, 2015 at 5:41 PM, Dan notifications@github.com wrote:

I see whats happening here, there was a change made to the android client,
openhab/openhab-android@44de4f4
openhab/openhab-android@44de4f4
. The comment says "Changed polling cycle to alternate long polling/regular
requests. Improves working with current buggy Atmosphere in openHAB" . This
is effectively causing a new atmosphere session to be created every other
request. Every session starts to build up a cache of changed objects which
only gets cleaned up after our default timeout (i think), on many systems I
don't think this is fast enough and it runs out of memory. Let me open a
ticket up with @belovictor https://github.com/belovictor and see if we
can work something out.


Reply to this email directly or view it on GitHub
#1961 (comment).

@digitaldan
Copy link
Contributor

Both are good points, to the first one, if you are using username/password
authentication then no, I don't think this is an issue (if you don't use
auth to the outside world, you have bigger issues then this ;-) To your
second point, yes, but its not working as it should ;-). Atmosphere is a
complicated framework, and our implementation is complicated on top of
that. Its has not been easy to make things work how we want it to.

On Sun, Feb 8, 2015 at 8:52 AM, Petr Klus notifications@github.com wrote:

that makes sense! However, just a philosophical point here - is this that
leaves OH susceptible to an easy DOS attack? Or does the functionality need
to work like this?

Also, having no idea how Atmosphere works but is there any way to define
some kind of a FIFO fixed maximum size cache?

Thank you VERY much for looking into this!

On Sun, Feb 8, 2015 at 5:41 PM, Dan notifications@github.com wrote:

I see whats happening here, there was a change made to the android
client,
openhab/openhab-android@44de4f4
<
openhab/openhab-android@44de4f4

. The comment says "Changed polling cycle to alternate long
polling/regular
requests. Improves working with current buggy Atmosphere in openHAB" .
This
is effectively causing a new atmosphere session to be created every other
request. Every session starts to build up a cache of changed objects
which
only gets cleaned up after our default timeout (i think), on many
systems I
don't think this is fast enough and it runs out of memory. Let me open a
ticket up with @belovictor https://github.com/belovictor and see if we
can work something out.


Reply to this email directly or view it on GitHub
#1961 (comment).


Reply to this email directly or view it on GitHub
#1961 (comment).

@teichsta teichsta changed the title OutOfMemory issues (1.6.1) OutOfMemory issues Feb 18, 2015
@sja
Copy link
Contributor

sja commented Feb 20, 2015

Hi!
New fresh air from my side to this issue, after I investigated in #765:
By default, Atmosphere 2.0.x has two unlimited (!) threadpools for message processing. In their speed optimization wiki page on github the author suggests to limit this. I made a branch for this. Maybe that will help a bit.

As @petrklus mentioned, he has no problems on updating items from his sensors, which is a POST request. But that means many update events on the event bus, which are getting buffered in the UUIDBroadcasterCache for every connection. That problem will increase, if a sitemap contains a lot of those items because HABdroid will subscribe with long-polling to all changes of this sitemap page. If HABdroid won't use the UUID correctly as a workaround for #765, the Broadcaster will explode. :-) Does that make sense, @digitaldan ?

@digitaldan Indeed, we have a very complicated setup on top of atmosphere! With many try/catches and logging stuff on every request. I hope, my PR will remove a bit of this ovehead.

@digitaldan
Copy link
Contributor

@sja your timing is once again great, I am literally sitting here in front of a few terminals watching threads grow out of control and beating my head against the table ;-) Since I have my chatty test system up, I will go ahead and try this out!

@digitaldan
Copy link
Contributor

Still testing this, there still seems to be an ever growing amount of threads being produced, I'm on the road for a few days but I will continue to play with this. thanks again @sja !

@sja
Copy link
Contributor

sja commented Feb 23, 2015

@digitaldan Just to be sure: You've seen my commit to limit the thread pool? This is not part of the PR for Atmosphere 2.2.

@digitaldan
Copy link
Contributor

Yes, I combined the two for testing.

Sent from my mobile.

On Feb 23, 2015, at 3:29 AM, Sebastian Janzen notifications@github.com
wrote:

@digitaldan https://github.com/digitaldan Just to be sure: You've seen my
commit
https://github.com/innoq/openhab/commit/ec1a47b02233b809a3c6d22809a7c64d141dfe26
to limit the thread pool? This is not part of the PR for Atmosphere 2.2.


Reply to this email directly or view it on GitHub
#1961 (comment).

@digitaldan
Copy link
Contributor

So I think I have found the issue(s), we have a class that delays the delivery of messages to long polling clients by a few hundred milliseconds in order to group changes into a single message. This was creating a new thread pool every time it was called, under light traffic this was not much of an issue, but on a chatty system it would hit some thread contention/locking point and the thread creation would grow out of control very quickly.

The second fix is in the Android client, i'll ping @belovictor and see if he he can get that merged.

The third fix was to manage the Atmosphere cache to specifically hold only a single PageBean for sitemaps requests.

@sja I left out your commit to limit the thread pool on this PR, i was trying to keep my changes related to solving the specific issue listed above. I think it would be great for you to do a PR on that separately as it can only help our situation.

@sja
Copy link
Contributor

sja commented Mar 1, 2015

Awesome, Dan! That sounds very reasonable.
I'll do a PR tomorrow.

Von meinem Telefon gesendet

Am 01.03.2015 um 13:45 schrieb Dan notifications@github.com:

So I think I have found the issue(s), we have a class that delays the delivery of messages to long polling clients by a few hundred milliseconds in order to group changes into a single message. This was creating a new thread pool every time it was called, under light traffic this was not much of an issue, but on a chatty system it would hit some thread contention/locking point and the thread creation would grow out of control very quickly.

The second fix is in the Android client, i'll ping @belovictor and see if he he can get that merged.

The third fix was to manage the Atmosphere cache to specifically hold only a single PageBean for sitemaps requests.

@sja I left out your commit to limit the thread pool on this PR, i was trying to keep my changes related to solving the specific issue listed above. I think it would be great for you to do a PR on that separately as it can only help our situation.


Reply to this email directly or view it on GitHub.

@petrklus
Copy link
Contributor Author

Is this still an issue with the latest build or was this resolved with #2213?

@petrklus
Copy link
Contributor Author

... any updates?

@sja
Copy link
Contributor

sja commented Jun 30, 2015

No updates from my side. Is there still an issue with this?

@digitaldan
Copy link
Contributor

I have not heard any feedback one way or another.

@zacofany
Copy link

Hi, I will check if this issue still exist in my case and confirm till the end of the week. Cheers.

@zacofany
Copy link

zacofany commented Jul 7, 2015

Hi. I've checked it with the current build in my environment and it seems that it works flawlessly right now. Thanks!

@teichsta teichsta added this to the 1.7.0 milestone Jul 7, 2015
@teichsta
Copy link
Member

teichsta commented Jul 7, 2015

it seems that #2213 solved this issue …

@teichsta teichsta closed this as completed Jul 7, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants