
modify counting active workers #769

Merged

merged 1 commit into trinodb:master from update_counting_active_workers on Jun 26, 2019

Conversation

PennyAndWang (Contributor)

Hi, this commit is for issue #641. In my situation it is perhaps not real coordinator HA, just a trick for restarting the coordinator, but I think the code is more appropriate for my situation. Thanks!

cla-bot bot commented May 14, 2019

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: wangpei6.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check whether your git client is configured with an email to sign commits: `git config --list | grep email`
  2. If not, set it up using `git config --global user.email email@example.com`
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

PennyAndWang force-pushed the update_counting_active_workers branch from 0ecf8a1 to b565e46 on May 14, 2019 08:27
@@ -57,9 +58,10 @@ public ClusterStats getClusterStats()
        long blockedQueries = 0;
        long queuedQueries = 0;

        long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).size();
Member commented on the line above:

what's the rationale?

PennyAndWang (Contributor, Author) replied:

I need to restart the coordinator without the users noticing, so I start a second coordinator in the same cluster to switch traffic over, and the second coordinator is shut down after the restart. When the second coordinator started, I found that the number of active workers was incorrect, so I submitted this commit. Thanks!

Member replied:

Such a scenario, with two coordinators registered to the same discovery server, is not supported by Presto at the moment, so worker and coordinator behavior is undefined. In particular, currently running queries will fail when a coordinator is killed.

The correct way to do coordinator turnover is to have some kind of load balancer or name resolution for the coordinator that routes to the currently active coordinator.

PennyAndWang (Contributor, Author) replied:

[image: diagram of the coordinator restart procedure]
The picture shows the details of our restart. I need to restart the coordinator without the users noticing, that is, running queries cannot be affected, so I came up with the solution in the picture. Maybe there is a better solution that I don't know about; I would really appreciate it if you could tell me. Thanks!
By the way, I have done this more than 20 times with no ill effects.

dain (Member) left a comment:

Thanks for the PR. The existing code is definitely wrong. I suggested some more improvements.

PennyAndWang force-pushed the update_counting_active_workers branch 2 times, most recently from b16a1e6 to 004fc14 on May 29, 2019 12:27
trinodb deleted 8 comments from cla-bot bot on Jun 1, 2019
PennyAndWang (Contributor, Author) commented:

My PR builds failed, but from the logs the errors seem to have nothing to do with my PR. Sorry, I don't know what to do; I need help.

@dain dain self-requested a review June 11, 2019 22:19
-long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).size();
-if (!isIncludeCoordinator) {
-    activeNodes -= 1;
-}
+long activeNodes = nodeManager.getNodes(NodeState.ACTIVE).stream()
dain (Member) commented:

If we change this to:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();

It will ignore coordinators in this count unless isIncludeCoordinator is set, so if you have multiple coordinators and are including them, you get the right count.

PennyAndWang (Contributor, Author) commented Jun 12, 2019:

@dain, hi, thanks very much for your reply. I have been thinking about this issue for a long time.

Your proposed code:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();

I think this code has a problem. For example, suppose there are two workers in a cluster, plus a coordinator1 (node-scheduler.include-coordinator=true) and a coordinator2 (node-scheduler.include-coordinator=false) at the same time.

According to your code, coordinator1 counts 4 active workers and coordinator2 counts 2.

There is also something wrong with the code I submitted, which is:

long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
                .filter(node -> !node.isCoordinator())
                .count();

if (isIncludeCoordinator) {
    activeWorkers += 1;
}
                

In the same cluster, according to my code, coordinator1 counts 3 workers and coordinator2 counts 2.

It seems that neither of these two solutions is perfect.
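To make the two counts concrete, here is a minimal, self-contained sketch (plain Java with a hypothetical Node record, not the real Presto node API) that runs both strategies against the 2-worker, 2-coordinator cluster described above:

import java.util.List;

public class WorkerCountSketch
{
    // Hypothetical stand-in for Presto's node abstraction
    record Node(String name, boolean isCoordinator) {}

    // Strategy 1 ("add all coordinators"): count every node when the local
    // coordinator has include-coordinator enabled, otherwise only workers
    static long countAllCoordinators(List<Node> active, boolean isIncludeCoordinator)
    {
        return active.stream()
                .filter(node -> isIncludeCoordinator || !node.isCoordinator())
                .count();
    }

    // Strategy 2 ("add one for self"): count only workers, then add one
    // for the local coordinator if include-coordinator is enabled
    static long addOneForSelf(List<Node> active, boolean isIncludeCoordinator)
    {
        long workers = active.stream()
                .filter(node -> !node.isCoordinator())
                .count();
        return isIncludeCoordinator ? workers + 1 : workers;
    }

    public static void main(String[] args)
    {
        List<Node> active = List.of(
                new Node("worker1", false),
                new Node("worker2", false),
                new Node("coordinator1", true),
                new Node("coordinator2", true));

        // Seen from coordinator1 (include-coordinator=true): the strategies disagree
        System.out.println(countAllCoordinators(active, true));  // 4
        System.out.println(addOneForSelf(active, true));         // 3

        // Seen from coordinator2 (include-coordinator=false): both report 2
        System.out.println(countAllCoordinators(active, false)); // 2
        System.out.println(addOneForSelf(active, false));        // 2
    }
}

Neither strategy reports the same number from both coordinators, which is exactly the inconsistency described above.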

In addition, the value of the attribute `node-scheduler.include-coordinator` in each coordinator cannot be obtained from the discovery service.

So I think there are two choices at present.

One is to modify the discovery service code so that the value of the attribute `node-scheduler.include-coordinator` in each coordinator is known; then we can get the exact number of workers.

The other is to assume that each coordinator can only see itself and the workers, and cannot see other coordinators. The code I submitted fits this situation.

The above is my opinion; please feel free to review.
Thanks again!

dain (Member) commented Jun 12, 2019:

Summary

I'm going to try to summarize. We have two possible solutions; both start by counting the workers (non-coordinators) and then modify the count depending on the "include coordinators" configuration. The two modifications are:

  1. Add one if the local coordinator doing the count is configured to include coordinators.
  2. Add all coordinators if the local coordinator doing the count is configured to include coordinators.

Both give the correct answer if you only have a single coordinator. The first one gives the wrong answer when multiple coordinators have "include coordinators" enabled. The second one gives the wrong answer anytime the coordinators have different configurations.

Does the above summary look correct?

Thoughts

My thoughts on this are based on following observations:

  1. Multiple coordinators is not a supported configuration for Presto today, and there are known problems with running multiple coordinators (e.g., the memory connector gives wrong results).
  2. Most people run a single shared config for workers, because if you don't, things can break.

Based on 1, either of the solutions is good enough, but based on 2, I would expect most people to use the same configuration for all coordinators, which gives me a slight preference for the "add all coordinators" proposal.

The long-term answer is that we'll need to adjust the announcements when adding support for multiple coordinators (#391).
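For reference, a sketch of the "add all coordinators" direction, which lines up with the commit message's new ClusterStats.activeCoordinators property; the names (Node, nodeManager, isIncludeCoordinator) follow the snippets above and are assumptions, not the exact merged source:

// Assumption: Node and nodeManager as in the snippets above; not the exact merged code.
long activeCoordinators = nodeManager.getNodes(NodeState.ACTIVE).stream()
        .filter(Node::isCoordinator)
        .count();

// "Add all coordinators": coordinators count as workers only when
// include-coordinator is enabled on the local coordinator
long activeWorkers = nodeManager.getNodes(NodeState.ACTIVE).stream()
        .filter(node -> isIncludeCoordinator || !node.isCoordinator())
        .count();

Reporting the coordinator count separately lets the UI show something sensible even when the coordinators' include-coordinator settings disagree.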

dain (Member) commented Jun 12, 2019:

@PennyAndWang, BTW we will need an individual CLA (https://github.com/prestosql/cla) from you to accept this contribution.

PennyAndWang (Contributor, Author) commented:

@dain, I agree with your summary and your thoughts. BTW, I missed the reply from Martin about my presto-cla file; really sorry about this. I re-sent my presto-cla file to Martin yesterday and I am waiting for his check.

martint (Member) commented Jun 14, 2019:

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jun 14, 2019
cla-bot bot commented Jun 14, 2019:

The cla-bot has been summoned, and re-checked this pull request!

dain (Member) left a comment:

Looks good. Can you squash this to one commit and rebase on master?

… and add ClusterStats.activeCoordinators property
PennyAndWang force-pushed the update_counting_active_workers branch from b640980 to e0d1f56 on June 25, 2019 02:41
PennyAndWang (Contributor, Author) commented:

> Looks good. Can you squash this to one commit and rebase on master?

@dain, done! The build errored again; the errors have nothing to do with my PR. I don't know what to do, sorry.

@dain dain merged commit 07ac8db into trinodb:master Jun 26, 2019
dain (Member) commented Jun 26, 2019:

The error is unrelated.

Merged. Thanks!
