
Faster Map Statistics #5905

Closed
wildnez opened this issue Aug 10, 2015 · 13 comments

@wildnez (Contributor) commented Aug 10, 2015

Management Center is unstable when showing the cluster view. I started a 3-member cluster (each member on a different machine), connected it to a Management Center instance running on another machine, and started loading the cluster with data (only IMap.set calls).

The aim was to load 30 million entries, amounting to ~22GB in total.

About 15 million entries into the loading process, Management Center becomes unstable: 2 out of 3 nodes disappear for a few seconds, and a red warning sign flashes at the top of the screen that says "Instance Connection Warning: xxxxxxxxx". Please see attached screenshot MC12.

Clicking on Map shows an entirely blank statistics table; I stayed on this screen for half a minute but nothing changed. See attached screenshot MC13.

After a few more minutes, all three nodes came back on the Home page, see MC14.
But clicking on the Map (CreditCardCache) shows only 2 nodes, see MC15.

A few seconds later, another member disappears from the Map screen and the red warning starts flashing again, see MC16. After a few more seconds, everything disappears, see MC17.

From this point onwards, Management Center stops showing anything, even on Home, see MC18.

Note that all this while, the members remained fully connected to each other: no disconnections, a highly stable network (everything normal in the logs). The machine running Management Center was also on the same network as the member machines. Importantly, Management Center never recovered from this situation.
Members were running with a 10GB heap, and only 1 major GC was seen on node 1 during the complete load of 30 million entries.

I have also attached some of the Health Monitor logs from all the members, covering the period when members kept dropping out of and reappearing in Management Center.

Reproduction is very easy: just start a 3-node cluster and load 30 million entries of 1KB each, as in the sketch below.
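A minimal reproduction sketch along those lines, assuming the Hazelcast 3.5 Java API; the map name matches the CreditCardCache mentioned above, but the key/value layout is illustrative:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class LoadReproduction {
    public static void main(String[] args) {
        // Join the running cluster (network configuration omitted for brevity).
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, byte[]> map = hz.getMap("CreditCardCache");

        byte[] value = new byte[1024]; // ~1KB per entry
        for (long i = 0; i < 30_000_000L; i++) {
            // set() instead of put(), since the old value is not needed
            map.set(i, value);
        }
    }
}
```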

@wildnez (Contributor, Author) commented Aug 11, 2015

Two more issues with Management Center:

I started only 6 server nodes, with the partition count set to 4099 (a minimal startup sketch follows the list below).

  1. Connecting Management Center, "Cluster Health" shows "Backup Synchronization in Progress" with a WARNING symbol. Note that up to this point the servers were idle and no client was connected, yet the message still appeared.
  2. Almost all of the 6 server nodes are running with more than 18% heap occupancy; I'm not sure whether this is a Management Center issue or a server-side issue.
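For reference, a hedged sketch of starting a member with that partition count; `hazelcast.partition.count` is the standard Hazelcast property, and the rest is a plain default config:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;

public class MemberWithCustomPartitions {
    public static void main(String[] args) {
        Config config = new Config();
        // Raise the partition count from the default 271 to 4099,
        // matching the scenario described above.
        config.setProperty("hazelcast.partition.count", "4099");
        Hazelcast.newHazelcastInstance(config);
    }
}
```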
@emrahkocaman (Contributor) commented Aug 11, 2015

Hi @wildnez,

For the first part, this is a known issue (reported before in #4895), but it was never tested with this large an entry count.

This is not about Management Center; it is about how we collect map statistics. Currently, some map stats (like cost and last access time) are calculated by traversing the map entries, so when the entry count is too high, the calculation time exceeds the time interval reserved for the Management Center state-sending thread. The sketch below illustrates the shape of such a traversal.
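To illustrate, a simplified, self-contained sketch of that traversal-based approach; `EntryMeta` and the stat fields are hypothetical stand-ins, not the actual Hazelcast internals:

```java
import java.util.Map;

// Hypothetical per-entry metadata, standing in for Hazelcast's internal record.
class EntryMeta {
    final long cost;           // in-memory cost of the entry, in bytes
    final long lastAccessTime; // epoch millis of the last read

    EntryMeta(long cost, long lastAccessTime) {
        this.cost = cost;
        this.lastAccessTime = lastAccessTime;
    }
}

class TraversalStats {
    // An O(n) pass over every entry. Cheap for small maps, but with tens of
    // millions of entries this loop alone can outlast the interval reserved
    // for the state-sending thread, which is the failure mode described above.
    static long[] computeStats(Map<Long, EntryMeta> store) {
        long totalCost = 0;
        long lastAccess = 0;
        for (EntryMeta e : store.values()) {
            totalCost += e.cost;
            lastAccess = Math.max(lastAccess, e.lastAccessTime);
        }
        return new long[] { totalCost, lastAccess };
    }
}
```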

@bilalyasar ran some tests and saw that we supported around 10 million entries before the fix (emrahkocaman@c6d3195).
Now we support around 27 million entries; beyond that point the calculation time exceeds the time interval again and causes Management Center connectivity problems.

The map stats calculation mechanism needs to be changed to fix this issue completely. I think that requires a PRD and can't be done within the scope of this issue, so for now we can say that ~27 million entries is the upper limit.

For the second part, @bilalyasar is trying to reproduce your case; we will update here once we succeed.

@pveentjer (Member) commented Aug 11, 2015

Traversing that much content potentially causes a performance hit on the system, since one or more cores will be busy. I don't know which thread this is done on, but if it is done on a partition thread, that thread can be blocked for a long time, which can cause throughput/latency issues.

@emrahkocaman (Contributor) commented Aug 11, 2015

@pveentjer This thread is part of ManagementCenterService; it's not related to any partition thread, but it certainly has the potential to cause some CPU load.

@pveentjer (Member) commented Aug 11, 2015

Perhaps we should consider a design where the complexity does not depend on the number of items in a map; anything that only works up to 30M items limits Hazelcast. A sketch of such a design follows.
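One way to realize that idea is to maintain the aggregates on every mutation so that reading the stats is O(1) regardless of map size. A minimal sketch; class and method names are illustrative, not actual Hazelcast code:

```java
import java.util.concurrent.atomic.AtomicLong;

class IncrementalMapStats {
    private final AtomicLong totalCost = new AtomicLong();
    private final AtomicLong lastAccessTime = new AtomicLong();

    // Called from the map's mutation path, so each update is O(1).
    void onPut(long entryCost)    { totalCost.addAndGet(entryCost); }
    void onRemove(long entryCost) { totalCost.addAndGet(-entryCost); }

    void onAccess(long nowMillis) {
        // Track the maximum observed access time without traversing entries.
        lastAccessTime.accumulateAndGet(nowMillis, Math::max);
    }

    // Reading the stats no longer depends on the entry count.
    long totalCost()      { return totalCost.get(); }
    long lastAccessTime() { return lastAccessTime.get(); }
}
```

The trade-off is a small constant cost on every map operation instead of a large periodic scan.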

@bilalyasar (Collaborator) commented Aug 12, 2015

@wildnez, regarding the case "Only started 6 server nodes with partition count set to 4099":

I reproduced your case and discussed it with @emrahkocaman.
Cluster Health is updated every 1 minute (because that query is not cheap).
So Management Center asks when it first starts, and because the partition count is 4099 it shows "Backup Synchronization in Progress". But after 1 minute, Management Center asks again, and this time it shows "0 waiting migration(s)". Can you try waiting 1 minute?

@wildnez (Contributor, Author) commented Aug 12, 2015

Hi guys,

Thanks for your feedback.

@emrahkocaman I'll create a PRD for this issue, as having any such limitation is not good.

@bilalyasar OK, I'll try your suggestion. Do you have any explanation for issue #2, the high heap usage?

@bilalyasar (Collaborator) commented Aug 12, 2015

@wildnez, if you mean the heap usage percentage, it shows 1 percent for me. Can you check again after 1 minute? As you described, I didn't put in any data; did you? I have just 6 nodes with the partition count set to 4099.

@wildnez (Contributor, Author) commented Aug 12, 2015

@bilalyasar this can actually be seen live at http://10.212.1.113:8080/mancenter-3.5/main.jsp. There is no data; the servers have only been started.

@wildnez (Contributor, Author) commented Aug 12, 2015

Actually, in a currently running test, I saw Management Center becoming unstable at just 11 million entries. See the attached screenshot: only 5 nodes show up out of 6.

(screenshot: mc-unstable-early)

@bilalyasar (Collaborator) commented Aug 13, 2015

When we try Xmx140G and Xms140G, Management Center shows 20G used heap. After disabling Management Center, we still see 20G used heap, so it seems this issue is not related to Management Center, at least.

I also tried with another application, an empty Java application. When I set Xmx140G and Xms140G, it shows 7G used heap.

@mesutcelik changed the title from "Unstable Management Centre in 3.5" to "Faster Map Statistics" Aug 18, 2015

@mesutcelik mesutcelik modified the milestones: Backlog, 3.5.2 Aug 18, 2015

emrahkocaman added a commit to emrahkocaman/hazelcast that referenced this issue Aug 18, 2015

emrahkocaman added a commit to emrahkocaman/hazelcast that referenced this issue Aug 20, 2015

@pveentjer (Member) commented May 7, 2016

@ahmetmircik AFAIK this issue is fixed. Can you verify and close this ticket?

@bilalyasar (Collaborator) commented Jun 29, 2016

This is already fixed, so closing this one.

@bilalyasar closed this Jun 29, 2016
