Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random instance selection #136

Merged
merged 18 commits into from
Aug 22, 2022
Merged

Conversation

drdrsh
Copy link
Collaborator

@drdrsh drdrsh commented Aug 21, 2022

Round-robin as an instance selection mechanism can cause load imbalance between replicas when some instances are excluded from the pool (due to banning or query routing decisions)

If a system has 4 replicas, clients will go through these replicas in order (albeit with different starting point), this means, replica 3 will always be attempted after replica 2.

This will become problematic when we ban a replica (or otherwise exclude it from the pool) because this will lead all clients that were using the excluded replica to try the next one which will resolve to the same instance for ALL clients. If we ban replicas 2 and 3 then all traffic from these replicas will end up on replica 4. This effect should be transient but in my test, it seems to be persistent.

Randomizing all the way seems to yield better results, the traffic goes back to being balanced more quickly.

This PR changes instance selection to be random.

@drdrsh drdrsh changed the title Mostafa random failover Random instance selection Aug 21, 2022
shard: usize, // shard number
role: Option<Role>, // primary or replica
process_id: i32, // client id
mut round_robin: usize, // round robin offset
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After randomization, this factor won't be needed. It is only used by the validate method to go over all instances.

To simplify the logic for this method, I refactored validate to checkout connection directly from the underlying pool using shard_index and instance_index directly.

src/main.rs Outdated
@@ -340,3 +340,376 @@ fn format_duration(duration: &chrono::Duration) -> String {

format!("{}d {}:{}:{}", days, hours, minutes, seconds)
}

#[cfg(test)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an integration test but I didn't want to do a lot of refactoring to make integration tests work.
The reason the integration test requires refactoring is that we generate configs on-the-fly and as such we need access to Config structs. We generate these configs on-the-fly using string manipulation but for complex integration tests we will end up with a lot of mess.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not write it in ruby?

@drdrsh drdrsh marked this pull request as ready for review August 21, 2022 14:14
@@ -43,7 +43,7 @@ jobs:
command: "cargo build"
- run:
name: "Tests"
command: "cargo test && bash .circleci/run_tests.sh && .circleci/generate_coverage.sh"
command: "cargo build && bash .circleci/run_tests.sh && cargo test && .circleci/generate_coverage.sh"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rust integration test expects databases to be set up so we run it after the bash script runs.

Cargo.toml Outdated
@@ -33,3 +31,7 @@ tokio-rustls = "0.23"
rustls-pemfile = "1"
hyper = { version = "0.14", features = ["full"] }
phf = { version = "0.10", features = ["macros"] }

[dev-dependencies]
postgres = "0.19.3"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for making queries against our proxies in the integration test.

@drdrsh drdrsh requested a review from levkk August 21, 2022 16:01
src/pool.rs Outdated
// Don't attempt to connect to banned servers.
if self.is_banned(address, shard, role) {
continue;
}

allowed_attempts -= 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all the replicas are down, won't this cause the client to be stuck in an infinite loop in this function? I believe that's why I put allowed_attempts -= 1 line above the is_banned check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think you are right.

@@ -63,7 +63,7 @@ pub struct Address {
pub shard: usize,
pub database: String,
pub role: Role,
pub replica_number: usize,
pub instance_index: usize,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address could point to a primary so using the term replica is misleading

@drdrsh drdrsh requested a review from levkk August 21, 2022 23:04
@levkk levkk merged commit 5f5b5e2 into postgresml:main Aug 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants