
modify counting active workers #769

Merged

merged 1 commit into trinodb:master from update_counting_active_workers on Jun 26, 2019

Conversation

PennyAndWang (Contributor)

Hi, this commit is for issue #641. In my situation it is perhaps not real coordinator HA, just a trick for restarting the coordinator, but I think the code is more appropriate for my situation. Thanks!

cla-bot bot commented May 14, 2019

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: wangpei6.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email email@example.com`
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

PennyAndWang force-pushed the update_counting_active_workers branch from 0ecf8a1 to b565e46 on May 14, 2019 08:27
@@ -57,9 +58,10 @@ public ClusterStats getClusterStats()
        long blockedQueries = 0;
        long queuedQueries = 0;

        long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).size();
Member commented on the line above:

what's the rationale?

PennyAndWang (Contributor, Author) replied:

I need to restart the coordinator without the users noticing, so I start a second coordinator in the same cluster to switch traffic over, and the second coordinator is shut down after the restart. When the second coordinator started, I found that the number of active workers was incorrect, so I submitted this commit. Thanks!

Member replied:

Such a scenario, with two coordinators registered to the same discovery server, is not supported by Presto at the moment, so worker and coordinator behavior is undefined. In particular, currently running queries will fail when a coordinator is killed.

The correct way to do coordinator turnover is to have some kind of load balancer or name resolution for the coordinator that routes to the currently active coordinator.

PennyAndWang (Contributor, Author) replied:

[image: diagram of the coordinator restart procedure]
The picture shows the details of our restart. I need to restart the coordinator without the users noticing, that is, running queries cannot be affected, so I came up with the solution in the picture. Maybe there is a better solution that I don't know about; I would really appreciate it if you could tell me. Thanks!
By the way, I have done this more than 20 times with no ill effects.

dain (Member) left a comment:

Thanks for the PR. The existing code is definitely wrong. I suggested some more improvements.

PennyAndWang force-pushed the update_counting_active_workers branch 2 times, most recently from b16a1e6 to 004fc14 on May 29, 2019 12:27
trinodb deleted 8 comments from cla-bot bot on Jun 1, 2019
PennyAndWang (Contributor, Author) commented:

My PR builds failed, but from the logs the errors seem to have nothing to do with my PR. Sorry, I don't know what to do; I need help.

@dain dain self-requested a review June 11, 2019 22:19
-long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).size();
-if (!isIncludeCoordinator) {
-    activeNodes -= 1;
-}
+long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).stream()
dain (Member) commented:

If we change this to:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();

It will ignore coordinators in this count unless isIncludeCoordinator is set, so if you have multiple coordinators and are including them, you get the right count.

PennyAndWang (Contributor, Author) commented Jun 12, 2019:

@dain, hi, thanks very much for your reply. I have been thinking about this issue for a long time.

Your proposed code:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();

I think this code has a problem. For example, suppose there are two workers in a cluster, plus a coordinator1 (node-scheduler.include-coordinator=true) and a coordinator2 (node-scheduler.include-coordinator=false) at the same time.

According to your code, coordinator1 counts 4 active workers and coordinator2 counts 2.

There is also something wrong with the code I submitted, which is:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> !node.isCoordinator())
                .count();

if (isIncludeCoordinator) {
    activeWorkers += 1;
}
                

In the same cluster, according to my code, coordinator1 counts 3 workers and coordinator2 counts 2.

It seems that neither of these two solutions is perfect.
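To make the two counts concrete, here is a minimal, self-contained sketch (plain Java with a hypothetical Node record, not the real Presto node API) that runs both strategies against the 2-worker, 2-coordinator cluster described above:

import java.util.List;

public class WorkerCountSketch
{
    // Hypothetical stand-in for Presto's node abstraction
    record Node(String name, boolean isCoordinator) {}

    // Strategy 1 ("add all coordinators"): count every node when the local
    // coordinator has include-coordinator enabled, otherwise only workers
    static long countAllCoordinators(List<Node> active, boolean isIncludeCoordinator)
    {
        return active.stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();
    }

    // Strategy 2 ("add one for self"): count only workers, then add one
    // for the local coordinator if include-coordinator is enabled
    static long addOneForSelf(List<Node> active, boolean isIncludeCoordinator)
    {
        long workers = active.stream()
                .filter(node -> !node.isCoordinator())
                .count();
        return isIncludeCoordinator ? workers + 1 : workers;
    }

    public static void main(String[] args)
    {
        List<Node> active = List.of(
                new Node("worker1", false),
                new Node("worker2", false),
                new Node("coordinator1", true),
                new Node("coordinator2", true));

        // Seen from coordinator1 (include-coordinator=true): the strategies disagree
        System.out.println(countAllCoordinators(active, true));  // 4
        System.out.println(addOneForSelf(active, true));         // 3

        // Seen from coordinator2 (include-coordinator=false): both report 2
        System.out.println(countAllCoordinators(active, false)); // 2
        System.out.println(addOneForSelf(active, false));        // 2
    }
}

Neither strategy reports the same number from both coordinators, which is exactly the inconsistency described above.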

In addition, the value of the attribute `node-scheduler.include-coordinator` in each coordinator cannot be obtained from the discovery service.

So I think there are two choices at present.

One is to modify the discovery service code so that the value of the attribute `node-scheduler.include-coordinator` in each coordinator is known; then we can get the exact number of workers.

The other is to assume that each coordinator can only see itself and the workers, and cannot see other coordinators. The code I submitted fits this situation.

The above is my opinion; please feel free to review.
Thanks again!

dain (Member) commented Jun 12, 2019:

Summary

I'm going to try to summarize. We have two possible solutions; both start by counting the workers (non-coordinators) and then modify the count depending on the "include coordinators" configuration. The two modifications are:

  1. Add one if the local coordinator doing the count is configured to include coordinators.
  2. Add all coordinators if the local coordinator doing the count is configured to include coordinators.

Both give the correct answer if you only have a single coordinator. The first one gives the wrong answer when multiple coordinators have "include coordinators" enabled. The second one gives the wrong answer anytime the coordinators have different configurations.

Does the above summary look correct?

Thoughts

My thoughts on this are based on following observations:

  1. Multiple coordinators is not a supported configuration for Presto today, and there are known problems with running multiple coordinators (e.g., the memory connector gives wrong results).
  2. Most people run a single shared config for workers, because if you don't, things can break.

Based on 1, either of the solutions is good enough, but based on 2, I would expect most people to use the same configuration for all coordinators, which gives me a slight preference for the "add all coordinators" proposal.

The long-term answer is that we'll need to adjust the announcements when adding support for multiple coordinators (#391).
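For reference, a sketch of the "add all coordinators" direction, which lines up with the commit message's new ClusterStats.activeCoordinators property; the names (Node, nodeManager, isIncludeCoordinator) follow the snippets above and are assumptions, not the exact merged source:

// Assumption: Node and nodeManager as in the snippets above; not the exact merged code.
long activeCoordinators = nodeManager.getNodes(NodeState.ACTIVE).stream()
        .filter(Node::isCoordinator)
        .count();

// "Add all coordinators": coordinators count as workers only when
// include-coordinator is enabled on the local coordinator
long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
        .filter(node -> isIncludeCoordinator || !node.isCoordinator())
        .count();

Reporting the coordinator count separately lets the UI show something sensible even when the coordinators' include-coordinator settings disagree.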

dain (Member) commented Jun 12, 2019:

@PennyAndWang, BTW we will need an individual CLA (https://github.com/prestosql/cla) from you to accept this contribution.

PennyAndWang (Contributor, Author) commented:

@dain, I agree with your summary and your thoughts. BTW, I missed the reply from Martin about my presto-cla file; really sorry about this. I re-sent my presto-cla file to Martin yesterday and I am waiting for his check.

martint (Member) commented Jun 14, 2019:

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jun 14, 2019
cla-bot bot commented Jun 14, 2019:

The cla-bot has been summoned, and re-checked this pull request!

dain (Member) left a comment:

Looks good. Can you squash this to one commit and rebase on master?

… and add ClusterStats.activeCoordinators property
PennyAndWang force-pushed the update_counting_active_workers branch from b640980 to e0d1f56 on June 25, 2019 02:41
PennyAndWang (Contributor, Author) commented:

> Looks good. Can you squash this to one commit and rebase on master?

@dain, done! The build errored again; the errors have nothing to do with my PR. I don't know what to do, sorry.

@dain dain merged commit 07ac8db into trinodb:master Jun 26, 2019
dain (Member) commented Jun 26, 2019:

The error is unrelated.

Merged. Thanks!
