-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Random instance selection #136
Conversation
shard: usize, // shard number | ||
role: Option<Role>, // primary or replica | ||
process_id: i32, // client id | ||
mut round_robin: usize, // round robin offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After randomization, this factor won't be needed. It is only used by the validate
method to go over all instances.
To simplify the logic for this method, I refactored validate
to checkout connection directly from the underlying pool using shard_index and instance_index directly.
src/main.rs
Outdated
@@ -340,3 +340,376 @@ fn format_duration(duration: &chrono::Duration) -> String { | |||
|
|||
format!("{}d {}:{}:{}", days, hours, minutes, seconds) | |||
} | |||
|
|||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an integration test but I didn't want to do a lot of refactoring to make integration tests work.
The reason the integration test requires refactoring is that we generate configs on-the-fly and as such we need access to Config
structs. We generate these configs on-the-fly using string manipulation but for complex integration tests we will end up with a lot of mess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not write it in ruby?
.circleci/config.yml
Outdated
@@ -43,7 +43,7 @@ jobs: | |||
command: "cargo build" | |||
- run: | |||
name: "Tests" | |||
command: "cargo test && bash .circleci/run_tests.sh && .circleci/generate_coverage.sh" | |||
command: "cargo build && bash .circleci/run_tests.sh && cargo test && .circleci/generate_coverage.sh" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rust integration test expects databases to be set up so we run it after the bash script runs.
Cargo.toml
Outdated
@@ -33,3 +31,7 @@ tokio-rustls = "0.23" | |||
rustls-pemfile = "1" | |||
hyper = { version = "0.14", features = ["full"] } | |||
phf = { version = "0.10", features = ["macros"] } | |||
|
|||
[dev-dependencies] | |||
postgres = "0.19.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed for making queries against our proxies in the integration test.
src/pool.rs
Outdated
// Don't attempt to connect to banned servers. | ||
if self.is_banned(address, shard, role) { | ||
continue; | ||
} | ||
|
||
allowed_attempts -= 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all the replicas are down, won't this cause the client to be stuck in an infinite loop in this function? I believe that's why I put allowed_attempts -= 1
line above the is_banned
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I think you are right.
@@ -63,7 +63,7 @@ pub struct Address { | |||
pub shard: usize, | |||
pub database: String, | |||
pub role: Role, | |||
pub replica_number: usize, | |||
pub instance_index: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address
could point to a primary so using the term replica
is misleading
Round-robin as an instance selection mechanism can cause load imbalance between replicas when some instances are excluded from the pool (due to banning or query routing decisions)
If a system has 4 replicas, clients will go through these replicas in order (albeit with different starting point), this means, replica 3 will always be attempted after replica 2.
This will become problematic when we ban a replica (or otherwise exclude it from the pool) because this will lead all clients that were using the excluded replica to try the next one which will resolve to the same instance for ALL clients. If we ban replicas 2 and 3 then all traffic from these replicas will end up on replica 4. This effect should be transient but in my test, it seems to be persistent.
Randomizing all the way seems to yield better results, the traffic goes back to being balanced more quickly.
This PR changes instance selection to be random.