firehose connection loadbalance #4083

Merged
mangas merged 1 commit into master from 3879-firehose-conn-lb on Oct 21, 2022
Conversation

@mangas (Contributor) commented on Oct 20, 2022

Fixes #3879

@leoyvens marked this pull request as ready for review on October 20, 2022 at 13:12
-    volumes:
-      - ./data/postgres:/var/lib/postgresql/data
+    # volumes:
+    #   - ./data/postgres:/var/lib/postgresql/data
leoyvens (Collaborator):

Why this change to docker-compose?

@@ -75,8 +73,7 @@ impl FirehoseEndpoint {
         // Timeout on each request, so the timeout to establish each 'Blocks' stream.
         .timeout(Duration::from_secs(120));

-        // Load balancing on the same endpoint is useful because it creates a connection pool.
-        let channel = Channel::balance_list(iter::repeat(endpoint).take(conn_pool_size as usize));
+        let channel = Channel::balance_list(vec![endpoint].into_iter());
leoyvens (Collaborator):

We can make this let channel = endpoint.connect_lazy(); as it used to be.
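
For context, a minimal sketch of the two channel-construction strategies being discussed, assuming tonic's transport API (Channel::balance_list and Endpoint::connect_lazy); the make_channel helper and its signature are illustrative, not code from this PR:

    use std::iter;
    use tonic::transport::{Channel, Endpoint};

    // Hypothetical helper contrasting the two approaches.
    fn make_channel(endpoint: Endpoint, conn_pool_size: usize) -> Channel {
        if conn_pool_size > 1 {
            // balance_list over N clones of one endpoint opens N HTTP/2
            // connections and spreads requests across them, i.e. it acts
            // as a connection pool for a single upstream.
            Channel::balance_list(iter::repeat(endpoint).take(conn_pool_size))
        } else {
            // A single connection, established lazily on first use.
            // balance_list over exactly one endpoint adds nothing over
            // this, which is what the review comment above points out.
            endpoint.connect_lazy()
        }
    }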

@@ -267,10 +264,12 @@ impl FirehoseEndpoints {
         self.0.len()
     }

+    // selects the FirehoseEndpoint with the lest amount of references, which will help with spliting
leoyvens (Collaborator):

Suggested change:
-    // selects the FirehoseEndpoint with the lest amount of references, which will help with spliting
+    // selects the FirehoseEndpoint with the least amount of references, which will help with splitting

-        self.0.iter().choose(&mut rng)
+        self.0
+            .iter()
+            .min_by(|x, y| Arc::strong_count(x).cmp(&Arc::strong_count(y)))
leoyvens (Collaborator):

min_by_key is nicer.
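
A minimal sketch of the least-referenced selection using min_by_key, assuming endpoints are held as Arc values and every active consumer keeps a clone (the generic least_loaded helper is illustrative):

    use std::sync::Arc;

    // Pick the endpoint with the fewest Arc clones in circulation.
    // Arc::strong_count approximates how many streams currently hold
    // the endpoint, so taking the minimum naively spreads new streams
    // across the whole list.
    fn least_loaded<T>(endpoints: &[Arc<T>]) -> Option<&Arc<T>> {
        endpoints.iter().min_by_key(|e| Arc::strong_count(e))
    }

Compared to the min_by form above, min_by_key drops the duplicated strong_count call and the manual cmp.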

@@ -267,10 +264,12 @@ impl FirehoseEndpoints {
         self.0.len()
     }

+    // selects the FirehoseEndpoint with the lest amount of references, which will help with spliting
+    // the load naively across the entire list.
     pub fn random(&self) -> Option<&Arc<FirehoseEndpoint>> {
leoyvens (Collaborator):

One aspect that concerns me is a silent hang if we ever hit the 100-streams-per-connection limit. Perhaps a simple way to make the error explicit would be to add a

const SUBGRAPHS_PER_CONN: usize = 100;

and return an error if the ref count reaches this number. Or, even better, auto-scale based on it.

mangas (Contributor, author):

I'll address this in a separate PR, as it could benefit from some metrics to detect a stall.

        .min_by_key(|x| Arc::strong_count(x))
        .ok_or(anyhow!("no available firehose endpoints"))?;
    if Arc::strong_count(endpoint) > SUBGRAPHS_PER_CONN {
        return Err(anyhow!(
            "all connections saturated with {} connections, increase the firehose conn_pool_size",
            SUBGRAPHS_PER_CONN
        ));
    }
leoyvens (Collaborator):

Nice work with the actionable error message!
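
For reference, a self-contained sketch of how the pieces above fit together. FirehoseEndpoint is reduced to a placeholder type and the surrounding struct is assumed; only the selection logic mirrors the diff:

    use std::sync::Arc;
    use anyhow::anyhow;

    struct FirehoseEndpoint; // placeholder for the real type

    // Limit discussed above: common HTTP/2 server settings cap
    // concurrent streams per connection at 100.
    const SUBGRAPHS_PER_CONN: usize = 100;

    struct FirehoseEndpoints(Vec<Arc<FirehoseEndpoint>>);

    impl FirehoseEndpoints {
        fn random(&self) -> anyhow::Result<&Arc<FirehoseEndpoint>> {
            let endpoint = self
                .0
                .iter()
                .min_by_key(|x| Arc::strong_count(x))
                .ok_or(anyhow!("no available firehose endpoints"))?;
            // Every live subgraph stream holds a clone of the Arc, so
            // the strong count doubles as a per-endpoint stream counter.
            if Arc::strong_count(endpoint) > SUBGRAPHS_PER_CONN {
                return Err(anyhow!(
                    "all connections saturated with {} connections, increase the firehose conn_pool_size",
                    SUBGRAPHS_PER_CONN
                ));
            }
            Ok(endpoint)
        }
    }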

@leoyvens (Collaborator) commented:

Oh, and let's bump the default to 20.

@mangas requested a review from leoyvens on October 21, 2022 at 11:50
@mangas merged commit 8559a1e into master on Oct 21, 2022
@mangas deleted the 3879-firehose-conn-lb branch on October 21, 2022 at 18:22

Successfully merging this pull request may close these issues:

Firehose connections hang when more than 100 subgraphs are deployed (#3879)