Introduce new config to randomly pick the xDS server host #972

Merged
merged 4 commits into from
Jan 30, 2024


PapaCharlie
Member

@PapaCharlie PapaCharlie commented Jan 23, 2024

Instead of using the default "pick_first" policy, which always picks the first IP address returned by DNS, "round_robin" will pick a random starting point in the list of IPs. This will fix the very lopsided load seen on the xDS servers.

Here is the current traffic distribution when using pick_first:
[image: traffic distribution across the xDS servers with pick_first]
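
To make the skew concrete, here is a minimal simulation sketch. It does not use gRPC at all: it only models many clients that each receive the same ordered address list and then connect to one server. The class name, method names, and the placeholder addresses are all hypothetical. In grpc-java the default policy can be set with `ManagedChannelBuilder.defaultLoadBalancingPolicy("round_robin")`; whether this PR uses exactly that call is not shown in this conversation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative simulation only (no gRPC involved). Each simulated client
// receives the same ordered address list and connects to exactly one server.
class StartingPointSimulation {

    // pick_first-like behavior: every client takes the head of the list.
    static int[] simulatePickFirst(List<String> servers, int clients) {
        int[] load = new int[servers.size()];
        for (int i = 0; i < clients; i++) {
            load[0]++;
        }
        return load;
    }

    // Random starting point, which is how the PR describes round_robin.
    static int[] simulateRandomStart(List<String> servers, int clients, Random random) {
        int[] load = new int[servers.size()];
        for (int i = 0; i < clients; i++) {
            load[random.nextInt(servers.size())]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder addresses, not real observer IPs.
        List<String> servers = Arrays.asList("IP1", "IP2", "IP3");
        int clients = 3000;
        System.out.println("pick_first:   "
                + Arrays.toString(simulatePickFirst(servers, clients)));
        System.out.println("random start: "
                + Arrays.toString(simulateRandomStart(servers, clients, new Random())));
    }
}
```

Under these assumptions the first strategy puts every client on the first server, while the second spreads clients roughly evenly. Real traffic will also depend on how long each connection lives and how often clients restart.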

Contributor

@shivamgupta1 shivamgupta1 left a comment

Please share local testing results.


@logstashbugreporter logstashbugreporter left a comment

Can you add your ingraph link showing how skewed the load is right now, i.e., why it deserves a fix now instead of waiting for a more generic solution?
Please also add background information on the root cause of the issue, i.e., why we need client-side round-robin given that DNS returns a random shuffle each time.

@shivamgupta1
Contributor

@PapaCharlie
Member Author

@logstashbugreporter updated the PR description

Contributor

@bohhyang bohhyang left a comment

lgtm.
Only that I couldn't think of a good way to test locally, since the client can't get info about the sub-channel picked in the grpc/xds channel.

@shivamgupta1
Contributor

shivamgupta1 commented Jan 27, 2024

> Only that I couldn't think of a good way to test locally, since the client can't get info about the sub-channel picked in the grpc/xds channel.

Should be pretty simple to test locally. The observer host a client connects to shows up in the lsof output for the client process. getaddrinfo results can be hardcoded in the /etc/hosts file: https://jameshfisher.com/2018/02/03/what-does-getaddrinfo-do/
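
As a quick sanity check before running the real client, something like the sketch below can print what the JVM resolver returns for a hostname after editing /etc/hosts. It assumes the client resolves names through the JDK resolver; the class and method names are hypothetical, and the order gRPC ends up using may differ from what is printed here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: list every address the JVM resolver reports for a host.
class ResolveAll {

    // Returns the textual addresses for the host, in the order the resolver gives them.
    static List<String> resolve(String host) throws UnknownHostException {
        List<String> result = new ArrayList<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            result.add(address.getHostAddress());
        }
        return result;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pass the hostname that was added to /etc/hosts; defaults to localhost.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + resolve(host));
    }
}
```

Run it with the observer hostname as the only argument to see whether all hardcoded entries are visible to the JVM.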

@shivamgupta1
Contributor

@PapaCharlie I suggest testing this locally and updating the PR with the results.

@PapaCharlie
Member Author

PapaCharlie commented Jan 27, 2024 via email

@PapaCharlie
Member Author

Tested this locally by adding the following to my /etc/hosts:

IP1 main.indis-registry-observer.ei-ltx1.atd.disco.linkedin.com
IP2 main.indis-registry-observer.ei-ltx1.atd.disco.linkedin.com
IP3 main.indis-registry-observer.ei-ltx1.atd.disco.linkedin.com

Without the round_robin config, it will always pick IP1 when starting up. I tested this by repeatedly redeploying the application and checking the logs on the observer. After applying the change, it picks a random IP every time it starts!

Contributor

@shivamgupta1 shivamgupta1 left a comment

Thanks for testing it locally too.


@whutwhu whutwhu left a comment

lgtm!


@logstashbugreporter logstashbugreporter left a comment

LGTM.
Just to make sure I understand this: without setting round_robin as the default LB policy, NettyChannelBuilder will use the first address in the sorted list?

@PapaCharlie
Member Author

PapaCharlie commented Jan 30, 2024

@logstashbugreporter yeah exactly. The sorting is based on IPv6 proximity, so it picks whichever observer it thinks is closest. The problem is that some observers seem to be "closer" than the rest, which is why we see this very skewed distribution. The round_robin policy ignores proximity and picks a random IP from the set of returned IPs.
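
For anyone curious why proximity sorting produces a stable winner, here is a simplified sketch. It assumes "proximity" refers to the longest-matching-prefix rule from RFC 6724 destination address selection, which this thread does not state explicitly, and it implements only that one rule rather than the full algorithm. All addresses come from the 2001:db8::/32 documentation range, and the class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of longest-matching-prefix ordering (one rule only).
class PrefixProximity {

    // Number of leading bits shared by two addresses.
    static int commonPrefixBits(byte[] a, byte[] b) {
        int bits = 0;
        for (int i = 0; i < a.length && i < b.length; i++) {
            int diff = (a[i] ^ b[i]) & 0xFF;
            if (diff == 0) {
                bits += 8;
            } else {
                // Count the matching high-order bits inside the first differing byte.
                bits += Integer.numberOfLeadingZeros(diff) - 24;
                break;
            }
        }
        return bits;
    }

    // Orders destinations so the longest shared prefix with the source comes first.
    static List<String> sortByProximity(String source, List<String> destinations)
            throws UnknownHostException {
        byte[] src = InetAddress.getByName(source).getAddress();
        Map<String, Integer> shared = new HashMap<>();
        for (String d : destinations) {
            shared.put(d, commonPrefixBits(src, InetAddress.getByName(d).getAddress()));
        }
        List<String> sorted = new ArrayList<>(destinations);
        sorted.sort(Comparator.comparing((String d) -> shared.get(d)).reversed());
        return sorted;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Hypothetical client and observer addresses (documentation prefix).
        String client = "2001:db8:1:1::100";
        List<String> observers = List.of(
                "2001:db8:2:1::10", "2001:db8:1:1::10", "2001:db8:1:2::10");
        System.out.println(sortByProximity(client, observers));
    }
}
```

Every client in the same subnet computes the same ordering, so a policy that always takes the head of the list sends all of them to the same observer.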

@PapaCharlie PapaCharlie merged commit 46d142b into master Jan 30, 2024
2 checks passed
@PapaCharlie PapaCharlie deleted the pc/ipv6 branch January 30, 2024 23:40

5 participants