
modify counting active workers #769

Open · wants to merge 4 commits into master from PennyAndWang:update_counting_active_workers

Conversation

4 participants
@PennyAndWang commented May 14, 2019

Hi, this commit is for issue #641. In my situation it may not be real coordinator HA, just a trick for restarting the coordinator, but I think the code is more appropriate for my situation. Thanks!

@cla-bot commented May 14, 2019

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: wangpei6.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email email@example.com`
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

@PennyAndWang PennyAndWang force-pushed the PennyAndWang:update_counting_active_workers branch from 0ecf8a1 to b565e46 May 14, 2019

@@ -57,9 +58,10 @@ public ClusterStats getClusterStats()
long blockedQueries = 0;
long queuedQueries = 0;

long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).size();

@sopel39 (Member) commented May 14, 2019

what's the rationale?

@PennyAndWang (Author) commented May 14, 2019

I need to restart the coordinator without the user's awareness, so I start a second coordinator in the same cluster to switch traffic over, and the second coordinator is shut down after the restart. When the second coordinator started, I found that the number of active workers was incorrect, so I submitted this commit. Thanks!

@sopel39 (Member) commented May 14, 2019

Such a scenario ("two coordinators registered to the same discovery server") is not supported by Presto ATM, so worker and coordinator behavior is undefined. Currently running queries will definitely fail when the coordinator is killed.

The correct way to do coordinator turnover is to have some kind of load balancer or name resolution for the coordinator that routes to the currently active coordinator.

@PennyAndWang (Author) commented May 15, 2019

[Image: diagram showing the details of the restart procedure]
The picture shows the details of our restart. I need to restart the coordinator without the user's awareness, that is, running queries cannot be affected, so I came up with the solution shown in the picture. Maybe there is a better solution that I don't know about; I would really appreciate it if you could tell me. Thanks!
By the way, I have done this more than 20 times with no ill effects.

@dain (Member) left a comment

Thanks for the PR. The existing code is definitely wrong. I suggested some more improvements.

@PennyAndWang PennyAndWang force-pushed the PennyAndWang:update_counting_active_workers branch from b25f9e3 to b16a1e6 May 29, 2019

@PennyAndWang PennyAndWang force-pushed the PennyAndWang:update_counting_active_workers branch from b16a1e6 to 004fc14 May 29, 2019

PennyAndWang added some commits May 29, 2019

@prestosql deleted several comments from cla-bot Jun 1, 2019

@PennyAndWang (Author) commented Jun 3, 2019

My PR builds failed, but from the logs the errors seem to have nothing to do with my PR. Sorry, I don't know what to do; I need help.

@dain dain self-requested a review Jun 11, 2019

-        long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).size();
-        if (!isIncludeCoordinator) {
-            activeNodes -= 1;
+        long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).stream()

@dain (Member) commented Jun 11, 2019

If we change this to:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();

It will ignore coordinators in this count unless isIncludeCoordinator is set, so if you have multiple coordinators and are including them, you get the right count.

@PennyAndWang (Author) commented Jun 12, 2019

Hi @dain, thanks very much for your reply.
I have been thinking about this issue for a long time.

Your proposed code:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();

I think the code has some problems. For example: suppose there are two workers in a cluster, a coordinator1 (`node-scheduler.include-coordinator=true`),
and a coordinator2 (`node-scheduler.include-coordinator=false`) at the same time.

According to your code,
coordinator1 obtains 4 workers and coordinator2 obtains 2 workers.

There's also something wrong with my submitted code, which is:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> !node.isCoordinator())
                .count();

if (isIncludeCoordinator) {
    activeWorkers += 1;
}

In the same cluster, according to my code,
coordinator1 gets 3 workers and coordinator2 gets 2 workers.

It seems that neither of these two solutions is perfect.
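
To make the discrepancy concrete, here is a minimal, self-contained sketch of the two counting strategies. It is hypothetical: nodes are modeled as plain booleans (true = coordinator) instead of Presto's actual node manager API, and the class and method names are made up for illustration. It reproduces the counts from the example above:

import java.util.List;

public class WorkerCountDemo
{
    // dain's proposed code: count every node, but skip coordinators
    // unless the local isIncludeCoordinator flag is set.
    static long countIncludingAllCoordinators(List<Boolean> nodes, boolean isIncludeCoordinator)
    {
        return nodes.stream()
                .filter(isCoordinator -> isIncludeCoordinator || !isCoordinator)
                .count();
    }

    // My submitted code: count only the workers, then add one
    // for the local coordinator if the flag is set.
    static long countAddingLocalCoordinator(List<Boolean> nodes, boolean isIncludeCoordinator)
    {
        long activeWorkers = nodes.stream()
                .filter(isCoordinator -> !isCoordinator)
                .count();
        return isIncludeCoordinator ? activeWorkers + 1 : activeWorkers;
    }

    public static void main(String[] args)
    {
        // The cluster from the example: two workers and two coordinators.
        List<Boolean> cluster = List.of(false, false, true, true);

        // coordinator1 runs with node-scheduler.include-coordinator=true,
        // coordinator2 with node-scheduler.include-coordinator=false.
        System.out.println(countIncludingAllCoordinators(cluster, true));  // 4 (coordinator1's view)
        System.out.println(countIncludingAllCoordinators(cluster, false)); // 2 (coordinator2's view)
        System.out.println(countAddingLocalCoordinator(cluster, true));    // 3 (coordinator1's view)
        System.out.println(countAddingLocalCoordinator(cluster, false));   // 2 (coordinator2's view)
    }
}

Neither strategy yields a consistent cluster-wide number once the two coordinators are configured differently, which is exactly the problem described above.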

In addition, the value of the attribute `node-scheduler.include-coordinator` of each coordinator cannot be obtained from the discovery service.

So, I think there are two choices at present.

One is to modify the discovery service code so that the value of the attribute `node-scheduler.include-coordinator` of each coordinator is known; that way we can get the exact number of workers.

The other is to assume that each coordinator can only see itself and the workers, and cannot see other coordinators. The code I submitted fits this situation.

The above is my opinion. Please feel free to review.
Thanks again!

@dain (Member) commented Jun 12, 2019

Summary

I'm going to try to summarize. We have two possible solutions that both start by counting the workers (non-coordinators), and then modify the count depending on the "include coordinators" configuration. The two different modifications are:

  1. Add one if the local coordinator doing the count is configured to include coordinators.
  2. Add all coordinators if the local coordinator doing the count is configured to include coordinators.

Both give the correct answer if you only have a single coordinator. The first one gives the wrong answer when multiple coordinators have "include coordinators" enabled. The second one gives the wrong answer anytime the coordinators have different configurations.

Does the above summary look correct?

Thoughts

My thoughts on this are based on following observations:

  1. Multiple coordinators is not a supported configuration for Presto today, and there are known problems with running multiple coordinators (e.g., the memory connector gives wrong results).
  2. Most people run a single shared config for workers, because if you don't, things can break.

Based on 1, either of the solutions is good enough, but based on 2, I would expect most people to use the same configuration for all coordinators, which gives me a slight preference for the "add all coordinators" proposal.

The long-term answer is that we'll need to adjust the announcements when adding support for multiple coordinators (#391).

@dain (Member) commented Jun 12, 2019

@PennyAndWang, BTW we will need an individual CLA (https://github.com/prestosql/cla) from you to accept this contribution.

@PennyAndWang (Author) commented Jun 14, 2019

@dain, I agree with your summary and your thoughts. BTW, I missed the reply from Martin about my presto-cla file, really sorry about this. I re-sent my presto-cla file to Martin yesterday and I am waiting for his check.

@martint (Member) commented Jun 14, 2019

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jun 14, 2019

@cla-bot commented Jun 14, 2019

The cla-bot has been summoned, and re-checked this pull request!
