Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too much cpu used when Hazelcast is idle V 3.6.2 #7943

Closed
NiranjanBS opened this issue Apr 12, 2016 · 54 comments

Comments

Projects
None yet
8 participants
@NiranjanBS
Copy link

commented Apr 12, 2016

As requested by @noctarius creating new ticket.
Am using 3.6.2 new release still taking high CUP usage.
Verified in java 1.7.0_75 and 1.8.0_77.
Tomcat 7.0.68 (java 1.7),Tomcat 8.0.30 (java 1.8).
OS Windows 7.
Use case - We are using apache shiro as security framework and hazelcast is beeing used to store user session in hazelcast map. We are observing high cpu when no user has logged in for around 3 hours.

hazelcast.txt

untitled1
untitled

@noctarius

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2016

What type of serialization do you use? Are there any complex objects involved? I guess you utilize Hazelcast Session Replication or just Shiro?

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 12, 2016

I am using java serialization, its not complex object its a simple object having few fields about user details.
Ya we have hazelcast server running as web application and our application is interacting with hazelcast through hazelcast client.

@noctarius

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2016

Any chance to share some testcode that reproduces the issue? If you can't share it privately you could also send it via mail.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 12, 2016

Ok I will send through mail.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 12, 2016

Hi @noctarius I have sent mail to you please verify and let me know if you need any more help.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 15, 2016

@noctarius
I am still facing same issue.
untitled

It has become a block for us, any config changes I need to do??
if you suggest any stable version also fine for me.

@jerrinot jerrinot added this to the 3.7 milestone Apr 15, 2016

@pveentjer

This comment has been minimized.

Copy link
Member

commented Apr 15, 2016

@noctarius please switch to 3.7 and add the following:

-Dhazelcast.performance.metric.level=info
-Dhazelcast.performance.monitor.enabled=true
-Dhazelcast.performance.monitor.invocation.sample.period.seconds=30
-Dhazelcast.performance.monitor.pending.invocations.period.seconds=30
-Dhazelcast.performance.monitor.slowoperations.period.seconds=30

This should give detailed information of what is happening; especially it should give some insights in operation executions.

@NiranjanBS I'm not that familiar with the tool you are using. It doesn't need to mean that Hazelcast has high cpu usage; it can also be that the whole system isn't doing much, and the it seems that threads that do a little bit of stuff, are consuming most of the cpu time; but this is all very relative.

Imagine a room of super lazy programmers, that sit 1 minute a week behind there keyboard and the rest of the time they are playing footbal. If one of these programmers sits behind his keyboard 2 minutes, he seems to be overloaded compared to his colleagues, but effectively he is only busy a fraction of the time.

If you look at one of the graphs, you see that the 'acceptor thread' is doing a lot of work just like the regular io threads. But it is very unlikely that the acceptor thread is doing work all the time; so my guess is that the system is quite idle. But perhaps I'm wrong. Can you check with e.g. htop and see if your cores are really busy?

4d409130-00e1-11e6-8248-55c51acfedb3

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 15, 2016

@pveentjer @noctarius
As I have mentioned in my first comment I am testing in Windows 7 OS and all the screen shots I had uploaded are from JConsole and jvisualvm which are provided by java itself.

Today I tested by running only cache server there was no client connected and nothing else was running but still after 2 hours cpu is taking 100%. And once it goes to 100% it will never come down.

100percentcpu

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 15, 2016

@NiranjanBS thanks for the update, the CPU usage comes from Tomcat process so it looks like there's something wrong on the Tomcat/Shiro/Hazelcast-client side (if I remember correctly you mentioned you run Hazelcast in client-server mode, is this accurate?). Can you please provide more details on your use case so we can setup a lab to reproduce the issue?

  • Which version of Shiro do you use?
  • Do you use Hazelcast as cache for Shiro? or just for Tomcat web sessions?
  • Could you please provide the relevant fragments of your Tomcat/Shiro configuration?
@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 15, 2016

@vbekiaris

We have one dedicated web application in which we initialize the cache and load the userdetails and shiro session string, we use hazelcast clients to interact with hazelcast server.

In the above scenario we had just started the cache application without starting any other clients, basically there was no client interaction, only on startup data was loaded to the map through mapstore.

Shiro Version: 1.2.3
We use hazelcast map to store the shiro session
I had shared the code in mail with @noctarius if you need more details let me know I can share.
I am attaching the configuration files here.

hazelcast-client.xml.txt
hazelcast.xml.txt
shiro.txt

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 16, 2016

@NiranjanBS it would be great if you could also email me the code.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

@vbekiaris
Can you please share your email.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

@vbekiaris @noctarius @pveentjer

I tested by creating simple cache server with out connecting any client, still I am facing same issue.
I just creating a USER_DETAILS map with default configuration for hazelcast instance and just loaded 100 user objects. After 2 hours again CPU is burning to 100%.

We were supposed to go live next week, but hazelcast has become big block for us.

Versions are as follows.
Hazelcast - 3.6.2.
Java - 1.8.0_77.
Tomcat 8.0.30.
OS Windows 7.

I have attacked the complete source code here.

cacheserver.zip

@pveentjer

This comment has been minimized.

Copy link
Member

commented Apr 18, 2016

@NiranjanBS would it be possible to switch to 3.7-SNAPSHOT and add the following parameters:

-Dhazelcast.performance.metric.level=info
-Dhazelcast.performance.monitor.enabled=true
-Dhazelcast.performance.monitor.invocation.sample.period.seconds=30
-Dhazelcast.performance.monitor.pending.invocations.period.seconds=30
-Dhazelcast.performance.monitor.slowoperations.period.seconds=30

And rerun your test? This should give us a lot more detailed information what is happening. What confuses me is why only the io threads are doing something. The other thing which confuses me is why the acceptor thread is doing so much work according to your screenshots. Once the connections are established; it should be dormant. Dumb question; do you make new HZ client instance all the time?

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

@pveentjer We have already added the mentioned parameters (3.6.2) while running the test. PFA jvm option for your reference.
But i am not able to see more information in the tomcat & application log files. please let me know where to check.

jvm_details

@pveentjer

This comment has been minimized.

Copy link
Member

commented Apr 18, 2016

These properties only work with 3.7. I have improved the performance logging system in 3.7 to expose a lot more info.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

@pveentjer
Ok I will test in 3.7 too.

do you make new HZ client instance all the time?

  • No, I am creating only one client instance for entire application, my client manager is a singleton.
    And The code I have shared above has only server there is no client interaction at all.
@pveentjer

This comment has been minimized.

Copy link
Member

commented Apr 18, 2016

@NiranjanBS no; creating new clients would be disastrous for your performance :)

I'm looking forward to the log file. What I find strange is that according to the following graph:
https://cloud.githubusercontent.com/assets/9413835/14461697/4d409130-00e1-11e6-8248-55c51acfedb3.png

is the following:
1: your client is doing stuff since the ClientInSelector/ClientOutSelector are the io threads of the client and appearently they are spending quite some time on the cpu. In 3.7 they will have a more sensible name btw
2: why is the acceptor thread doing so much work. Because normally it should not; first bet would be that connections are continuously made.

Anyhow... the performance log files will provide some insights in what is happening. Especially on the networking level.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

@pveentjer

I have started testing on 3.7 by setting -Dhazelcast property to tomcat.
where can I find log files which you are expecting, because I am not seeing any extra log in tomcat as well as application log.

@pveentjer

This comment has been minimized.

Copy link
Member

commented Apr 18, 2016

you should find these new files in the working directory of your application. See 'user.dir' from System properties. Files should be called performance-xxx.log and performance-client-xxx.log

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

Ok It is generating performance-xxx.log files. Give me some time, once CPU hits 100% I will share them here.

@pveentjer

This comment has been minimized.

Copy link
Member

commented Apr 18, 2016

keep it running for at least 10 minutes after you get the problems.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

ya sure.

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 18, 2016

I have been running the sample code on Windows 7, Tomcat 8.0_30, JDK 1.8.0_77 with Hazelcast 3.6.2 for more than an hour and I don't see any CPU usage on tomcat process. Perhaps there is something else in your environment causing the additional CPU usage? Let's wait for the 3.7 performance logs and see if we find anything interesting there.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 18, 2016

@pveentjer @vbekiaris @noctarius

With 3.7-SNAPSHOT also I am facing same issue. Please find the performance-xxx.log file and screenshot attached below.

Even heap memory is increasing continually you can see in screen shot.

cpu_usage

performance-127.0.0.1#5701-1460978886777-000.zip

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 19, 2016

Ya sure I will check.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 19, 2016

@vbekiaris
My test is still going on, usually after 2 hours CPU use to hit 100% from (0% - 2%), but now it is hitting 25% from (0% - 2%). I will test for long duration and get back.
Thanks for you support.

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 19, 2016

Great, I think that if you fire up a visualvm/mission control and check cpu usage, the acceptor thread must be the one causing the excess usage while in/out threads should now be at the bottom end. Can you please confirm? Also, once you conclude your test, can you share the performance log?
Thanks

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 20, 2016

Hi @vbekiaris , yes you are right now acceptor thread is the one which is taking more CPU.
But you can see in screenshot memory is growing rapidly. I have attached the performance-xxx.log files and screenshot. I am not able to upload zip file here.

cpuusage

@NiranjanBS NiranjanBS closed this Apr 20, 2016

@NiranjanBS NiranjanBS reopened this Apr 20, 2016

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 21, 2016

@vbekiaris @pveentjer @noctarius
I have not stoped the test, its still running I am still facing issue with Memory and CUP. Any updates please.
untitled

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 21, 2016

Hi @NiranjanBS , we are working on the final fix for this one, which will address excess object allocation (this was introduced as a side-effect of the temporary workaround you are now testing) and also introduce the workaround in acceptor thread as well.

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 22, 2016

Hi @NiranjanBS , would you be so kind as to run your test again with the attached Hazelcast binary? hazelcast-3.7-SNAPSHOT.jar.zip. Don't forget to add (or just keep if you already have it) -Dhazelcast.io.selectorWorkaround=true to your JVM startup options, this is required for the workaround to kick in.
This is built from https://github.com/vbekiaris/hazelcast/tree/master-nio-workaround and should fix Acceptor thread CPU usage and excess object allocation.
It would be great if you could also attach your performance logs at the end of the test run.

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 22, 2016

ya sure we will check.

@bwzhang2011

This comment has been minimized.

Copy link

commented Apr 27, 2016

@NiranjanBS, any update with such issue ?

@NiranjanBS

This comment has been minimized.

Copy link
Author

commented Apr 27, 2016

Hi, I am running test from 4-5 days, there is no high CPU usage, but still am facing Memory issue as you can see in attachment.

@vbekiaris We have done workaround for both server and client selector threads or only for server, because I have seen client selector threads also using high CPU.

https://cloud.githubusercontent.com/assets/9413835/14551483/68765360-02f1-11e6-9ae2-cbf598824036.png

27apr2016

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 27, 2016

Hi @NiranjanBS , a couple of questions:

  • Can you compare the heap memory usage you see in your application now against running without the workaround enabled (remove -Dhazelcast.io.selectorWorkaround=true from your JVM options), in order to see how the workaround affects memory usage vs no workaround?
    Comparing current to the previous workaround version, the sawtooth is now sparser (implying less allocation) and seems to reach maximum heap size less frequently than once per day, while the previous version reached max heap size twice per day.
  • Can you also share your performance.log files?

The workaround is currently only implemented on the server side, so if you see client side IO threads with high CPU usage, we'll need to apply the workaround there as well.

@bwzhang2011

This comment has been minimized.

Copy link

commented Apr 28, 2016

Dear all, as there's a period of time before hazelcast3.7 released, @vbekiaris, @pveentjer, does next stable version 3.6.3 solve such problem ?

@bwzhang2011

This comment has been minimized.

Copy link

commented May 5, 2016

any update with such issue ?

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented May 11, 2016

@NiranjanBS, @bwzhang2011 the fix is now merged in master and a backport will be available in 3.x branch shortly. In order to enable the workaround, you need to supply system property -Dhazelcast.io.selectorMode=selectwithfix on JVM startup (note it's different from the property used in intermediate versions posted above).

@NiranjanBS thanks for your help in locating and fixing this. It would be great if you could use 3.7-SNAPSHOT to run a test with following system properties:

-Dhazelcast.io.selectorMode=selectwithfix
-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30

and attach here your performance.log after the test. I guess a 3.7-SNAPSHOT build with the fix included should be available tomorrow on Hazelcast maven snapshot repository.

@SomannaDC

This comment has been minimized.

Copy link

commented Oct 6, 2016

@vbekiaris @pveentjer @noctarius @bwzhang2011 @NiranjanBS -- Is the issue resolved, we are facing similar issue with 3.6.2, is there a later version of Hazelcast with the fix for this issue?

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Oct 6, 2016

Hi @SomannaDC , yes, it is fixed in 3.6.3 (by #8154) and 3.7 (by #7998).

If you believe you are hit by the same JVM/network stack bug, then you should start hazelcast adding the system property hazelcast.io.selectorMode=selectwithfix to your JVM startup.

If after upgrading & testing with this system property enabled you still see high cpu usage while idle, then please create a new issue for further investigation.

@linkedz

This comment has been minimized.

Copy link

commented Apr 30, 2018

Hi

I've got the same problem, when very simple project (from the comment of related issue) after 2 hrs (+/- few minutes) got 100% CPU usage
#7943 (comment)

Env: Windows 10, jdk build 1.8.0_171-b11, tried also with jdk v71, v161
also different versions of hazelcast: 3.6.2, 3.8.8, 3.9.4, 3.10.
Problem always is reproducing with 100% chance after 2 hrs of idle.

In debug there is next info, if compare data before and after CPU overhead:
sun.nio.ch.WindowsSelectorImpl.SubSelector#poll0
before - readFds array is fulled with 0
after - readFds[0] == 1, value of readFds[1] == sun.nio.ch.WindowsSelectorImpl#wakeupSourceFd
also method poll0 is executing instantly, without any timeout and thread is in endless loop without block or wait.

Any help on the problem? Or any advise what to debug else or try to change?

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented Apr 30, 2018

you should start hazelcast adding the system property hazelcast.io.selectorMode=selectwithfix to your JVM startup.

@linkedz did you try this advice in the comment above? This property enables the selector issue workaround.

@linkedz

This comment has been minimized.

Copy link

commented Apr 30, 2018

Just now tried and it helped to fix the problem, thanks!

@vbekiaris

This comment has been minimized.

Copy link
Contributor

commented May 2, 2018

nice, thanks for the heads up @linkedz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.