
Serverset: Lazily connect to a limited number of servers #8165

Merged

Conversation

@illicitonion
Contributor

commented Aug 12, 2019

This allows for a connection limit, and doesn't make any connections
until they're needed.

This may slightly slow down the first few requests, but means that we
won't proactively connect to a large number of servers.

We also now disconnect from a server when we see it fail (rather than
merely ceasing to use it).

This makes the order of server connections slightly less predictable:
previously we strictly round-robined, whereas now we may skip some
servers, or use one twice in a row, as connections are opened and
closed.
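
For illustration, here is a minimal, synchronous sketch of the idea (the real implementation returns futures and also tracks server health; the type and field names below are illustrative rather than taken verbatim from the code):

    // Illustrative sketch only: a bounded, lazily populated connection pool
    // that round-robins independently over servers and open connections.
    struct Connection {
      server_index: usize,
    }

    struct Serverset {
      servers: Vec<String>,         // fixed set of known servers
      connections: Vec<Connection>, // opened lazily, at most `connection_limit`
      connection_limit: usize,
      next_server: usize,           // cursor over `servers`
      next_connection: usize,       // cursor over `connections`
    }

    impl Serverset {
      fn next(&mut self) -> Option<usize> {
        // If we are under the limit, lazily connect to the next server that
        // doesn't already have a connection.
        if self.connections.len() < self.connection_limit {
          for _ in 0..self.servers.len() {
            let i = self.next_server % self.servers.len();
            self.next_server = self.next_server.wrapping_add(1);
            if !self.connections.iter().any(|c| c.server_index == i) {
              self.connections.push(Connection { server_index: i });
              return Some(i);
            }
          }
        }
        // Otherwise, reuse an already-open connection, round-robin.
        if self.connections.is_empty() {
          return None;
        }
        let c = self.next_connection % self.connections.len();
        self.next_connection = self.next_connection.wrapping_add(1);
        Some(self.connections[c].server_index)
      }
    }

The `Option` return stands in for the future-returning API; the `None` case corresponds to "nothing usable yet", which the real code handles by waiting and rescheduling rather than failing.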

@illicitonion

Contributor Author

commented Aug 12, 2019

Depends on #8164

@pierrechevalier83
Contributor

left a comment

Left a few comments. Overall, the change is good: it brings useful functionality by reducing the number of active connections. Some of my comments are really due to the structure of the pre-existing code, which makes the changeset harder to read. Feel free to ignore some of the comments if you feel it isn't really the responsibility of this changeset to address them. In other words, be as much or as little of a boyscout as you see fit 😉

src/rust/engine/serverset/src/lib.rs (resolved)
src/rust/engine/serverset/src/lib.rs (resolved)
}

fn is_connected(&self, server_index: usize) -> bool {
for connection in &self.connections {

@pierrechevalier83

pierrechevalier83 Aug 12, 2019

Contributor

Nitpick: this is a "raw for loop" reimplementation of any.
I would find:

    self
      .connections
      .iter()
      .any(|connection| connection.server_index == server_index)

more elegant.

@illicitonion

illicitonion Aug 12, 2019

Author Contributor

Done
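
(For reference, the resulting method presumably reads roughly like this, combining the signature from the diff context with the suggested body:)

    fn is_connected(&self, server_index: usize) -> bool {
      self
        .connections
        .iter()
        .any(|connection| connection.server_index == server_index)
    }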

@@ -220,65 +279,83 @@ impl<T: Clone + Send + Sync + 'static> Serverset<T> {
pub fn next(&self) -> BoxFuture<(T, HealthReportToken), String> {
let now = Instant::now();
let mut inner = self.inner.lock();
let server_count = inner.connections.len();

let existing_connections = inner.connections.len();

@pierrechevalier83

pierrechevalier83 Aug 12, 2019

Contributor

Style:
I would probably move these two lines into their own well-named function (in Inner) to reduce the cognitive overhead of parsing this function.
For instance, if inner.has_available_servers_to_connect_to() { ... } (better name suggestions welcome 😄 )

@illicitonion

illicitonion Aug 12, 2019

Author Contributor

Done
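
(The diff context further down shows that the helper ended up named `can_make_more_connections`. A standalone sketch of what such a method might look like; the `connection_limit` field name and the simplified types are assumptions, not the actual code:)

    // Hypothetical, simplified version of the helper discussed above.
    struct Inner {
      connections: Vec<usize>, // stand-in for the real connection type
      connection_limit: usize,
    }

    impl Inner {
      fn can_make_more_connections(&self) -> bool {
        self.connections.len() < self.connection_limit
      }
    }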

inner.next = inner.next.wrapping_add(1);
let server = &inner.connections[i];
for _ in 0..inner.servers.len() {
let i = inner.next_server % inner.servers.len();

@pierrechevalier83

pierrechevalier83 Aug 12, 2019

Contributor

This would read better if the wrapping iterator logic were implemented on a substruct of inner:
inner.servers.iter().for_each(|server| ... )

@illicitonion

illicitonion Aug 12, 2019

Author Contributor

As above
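
(Illustrative only, and not what was actually committed: one way to package the wrapping round-robin scan that the comment suggests would be a small helper returning an iterator of indices:)

    // Yields `len` indices starting at `start % len`, wrapping around once.
    fn round_robin_indices(start: usize, len: usize) -> impl Iterator<Item = usize> {
      (0..len).map(move |offset| start.wrapping_add(offset) % len)
    }

    // Hypothetical usage:
    // for i in round_robin_indices(inner.next_server, inner.servers.len()) { ... }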

// A healthy server! Use it!
return futures::future::ok((server.server.clone(), HealthReportToken { index: i }))
.to_boxed();
} // else cannot happen, or we would already have connected and returned.
}
// Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.

@pierrechevalier83

pierrechevalier83 Aug 12, 2019

Contributor

I get that this is not new in this changeset, so don't feel you have to act on it, but with comments like this one the onus is on the reader to verify all of this logic; avoiding early returns and having else branches to match the if branches would make this an actual no-brainer.

@illicitonion

illicitonion Aug 12, 2019

Author Contributor

Rephrased this in a more iterator-based way, but it still has an unwrap, because the ordering of the checks we're doing makes it hard to encode the required information cleanly into the type system.

We want to say:

  1. Maybe connect (iterate to find a connectable)
  2. Else maybe reuse
  3. Else iterate to find what will next be connectable

If we were happy to unconditionally do the find-the-next-connectable-time work in the maybe-connect stage, even if we may end up reusing a connection, we could phrase this as:

  1. Iterate over all servers, finding the next one we should try to connect to
  2. If we can make a connection, and the server is healthy, connect to it
  3. If nothing is connectable, and we have an open connection, re-use it
  4. Else reschedule for when something will be connectable

But because we don't want to do the iteration and time calculations in step 1 if step 2 is going to fail and we're going to use step 3, we can't just write this as one cute expression.
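
For concreteness, a simplified, synchronous sketch of the first ordering (maybe connect, else maybe reuse, else find the earliest reconnect time); the health/backoff representation and names such as `unhealthy_until` are illustrative, and the real code returns a future and reschedules itself rather than returning an `Instant`:

    use std::time::Instant;

    // Stand-ins for the real types; field and variant names are illustrative.
    struct Server {
      unhealthy_until: Option<Instant>, // `None` means healthy right now
      connected: bool,
    }

    enum Next {
      Connect(usize),   // open a new connection to this server index
      Reuse(usize),     // reuse an existing connection
      RetryAt(Instant), // nothing usable yet; try again at this time
    }

    fn next(servers: &[Server], open_connections: usize, limit: usize, now: Instant) -> Next {
      // 1. Maybe connect: look for a healthy server without a connection.
      if open_connections < limit {
        if let Some(i) = servers
          .iter()
          .position(|s| !s.connected && s.unhealthy_until.map_or(true, |t| t <= now))
        {
          return Next::Connect(i);
        }
      }
      // 2. Else maybe reuse an existing connection (the real code round-robins;
      //    index 0 here is just a placeholder).
      if open_connections > 0 {
        return Next::Reuse(0);
      }
      // 3. Else find when something will next be connectable. The unwrap is
      //    safe (given a non-zero limit): if no server had an `unhealthy_until`,
      //    step 1 would already have connected and returned.
      let earliest_future = servers
        .iter()
        .filter_map(|s| s.unhealthy_until)
        .min()
        .unwrap();
      Next::RetryAt(earliest_future)
    }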

@illicitonion illicitonion force-pushed the twitter:dwagnerhall/serverset-connections/wip2 branch from a3701da to d226b75 Aug 12, 2019

@pierrechevalier83
Contributor

left a comment

Thanks for the refactoring. Looks good to me 😃

}
// Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
let instant = earliest_future.unwrap();
// Unwrap is safe because _some_ server must have an unhealthy_info or we would've already

@pierrechevalier83

pierrechevalier83 Aug 12, 2019

Contributor

Nice refactoring 😄 Reads much better 😄

@illicitonion illicitonion requested a review from stuhood Aug 12, 2019

@stuhood
Member

left a comment

Thanks, looks good. Just nits.

It might be good to describe how the independent round-robining of the servers and connections works in a docstring somewhere.
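
(The kind of docstring meant might read roughly as follows; the wording is a sketch, not the text that was actually added:)

    //! A `Serverset` round-robins over two independent lists:
    //!
    //! * `servers`: the fixed set of known backends, scanned via a
    //!   `next_server` cursor when deciding which server to connect to next.
    //! * `connections`: the currently open connections (bounded by the
    //!   configured limit), scanned via a `next_connection` cursor when
    //!   reusing an existing connection.
    //!
    //! Connections are opened lazily and dropped when a server is seen to
    //! fail, so the overall order of server use is only approximately
    //! round-robin.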


struct Inner<T: Clone> {
// Order is immutable through the lifetime of a Serverset, but contents of the Backend may change.

@stuhood

stuhood Aug 12, 2019

Member

One thing that eases my mind is that this vec is always the same length, afaict. Would be good to mention that here, and indicate that that is why it is ok to hold indexes into this vec.

@illicitonion

illicitonion Aug 14, 2019

Author Contributor

Done

pub(crate) next: usize,
pub(crate) next_server: usize,

// Points into the servers list; may change over time.

@stuhood

stuhood Aug 12, 2019

Member

And vice-versa: the length of this vec might change, but we don't hold indexes into it (just a modded value).

@illicitonion

illicitonion Aug 14, 2019

Author Contributor

Done
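
(Taken together, the two comments presumably end up documenting invariants along these lines; the exact wording and the simplified types below are a sketch, not the committed code:)

    struct Backend;
    struct Connection;

    struct Inner {
      // Fixed length for the lifetime of the Serverset (only the contents of
      // each Backend may change), so it is safe to hold indexes into this vec.
      servers: Vec<Backend>,

      // Length may change as connections are opened and dropped, so we never
      // hold indexes into it; we only keep a cursor and reduce it modulo the
      // current length.
      connections: Vec<Connection>,

      next_server: usize,
      next_connection: usize,
    }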

if inner.can_make_more_connections() {
// Find a healthy server without a connection, connect to it.
if let Some(ret) = inner.connect() {
inner.next_connection = inner.next_connection.wrapping_add(1);

@stuhood

stuhood Aug 12, 2019

Member

It's not clear why we increment next_connection here... the connection will be added at the end of the list, but we don't know where the index is pointing. If this is necessary, I would add a comment; otherwise, maybe remove it to keep things clear?

@illicitonion

illicitonion Aug 14, 2019

Author Contributor

Done

@stuhood

Member

commented Aug 14, 2019

Hm. There appears to be a reproducible issue with test remoting here, which causes the v2-remote-unit tests to fail. I've cleared cache on it, and retried a few times. The shard ends up "terminated", but much too quickly (in about 2 minutes):

/home/travis/.travis/functions: line 524:  5266 Terminated              travis_jigger "${!}" "${timeout}" "${cmd[@]}"

I'll increase debug output a bit, and see whether I can remove the wait.

@stuhood

Member

commented Aug 15, 2019

The exposed error is:

Throw(Failed to execute process: Failed after 6 retries; last failure: Error attempting to upload digest Digest(Fingerprint<20e24a64653fc16455f152ef768932622d89ea6c0ef18a8e15c6f0ae7d25a30d>, 146741653): RpcFinished(Some(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded"), status_proto_bytes: None })))

Since we're not seeing it on master, I do expect that it is related. Will take a look today.

EDIT: I investigated this a bit, and AFAICT, the "deadline" we have set is 60 seconds, which makes hitting it 5 times fairly unlikely. But also, this particular failure is not reproducible. The truncation appears to be randomish, so I'm now suspecting either a panic or some other terminal error (sigsegv?) that is preventing us from getting a useful exit status due to the ci.py wrapper.
EDIT2: BUT, I was able to repro this error locally once with:

./build-support/bin/ci.py --bootstrap
./build-support/bin/ci.py --python-tests-v2 --remote-execution-enabled

EDIT3: But only able to repro it once.

@stuhood

Member

commented Aug 16, 2019

So after the above investigation, I'm back to thinking that the process is terminating abnormally, because I was able to get a "successful" (except for tests failing due to #8067) run of the unit test shard locally using the ci.py commands above.

@Eric-Arellano

Contributor

commented Aug 16, 2019

The exposed error is:

Throw(Failed to execute process: Failed after 6 retries; last failure: Error attempting to upload digest Digest(Fingerprint<20e24a64653fc16455f152ef768932622d89ea6c0ef18a8e15c6f0ae7d25a30d>, 146741653): RpcFinished(Some(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded"), status_proto_bytes: None })))

This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like util:strutil, but fail for anything that requires an upload, with this exact error. Specifically, this is the command I've been trying to run:

./pants --pants-config-files=pants.remote.ini --no-v1 --v2 --remote-oauth-bearer-token-path=<(gcloud auth application-default print-access-token | perl -p -e 'chomp if eof') --process-execution-speculation-strategy=none --remote-ca-certs-path=/snap/google-cloud-sdk/91/lib/third_party/grpc/_cython/_credentials/roots.pem  test tests/python/pants_test/backend/jvm/tasks/jvm_compile:jvm_compile

I'd feel much less crazy if this ends up being related. Solving this will be a huge help in unblocking me on remoting integration tests as well.

@stuhood

Member

commented Aug 20, 2019

This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like util:strutil, but fail for anything that requires an upload, with this exact error.

@Eric-Arellano: Hmmm. While I repro'd this once locally, I also had a (mostly) successful result locally. So I suspect that that portion of the error might just be spurious.

The truncation aspect of #8165 (comment) is the thing I need to investigate tomorrow.

illicitonion added 3 commits Aug 9, 2019
Serverset: Lazily connect to a limited number of servers

@stuhood stuhood force-pushed the twitter:dwagnerhall/serverset-connections/wip2 branch 2 times, most recently from e846f05 to 38e012c Aug 20, 2019

@stuhood

Member

commented Aug 20, 2019

Ok, progress. It looks like the process running the tests is segfaulting.

EDIT: Dangit: didn't get enough sleep. Signal 11 is sigsegv, not sighup.

@stuhood stuhood force-pushed the twitter:dwagnerhall/serverset-connections/wip2 branch from 160ad91 to bbd0c39 Aug 21, 2019

@stuhood

Member

commented Aug 21, 2019

bbd0c39 ended up being the reason for the segfault. Will figure that one out later.

@stuhood stuhood merged commit 70cb3c3 into pantsbuild:master Aug 21, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

@stuhood stuhood deleted the twitter:dwagnerhall/serverset-connections/wip2 branch Aug 21, 2019

@stuhood stuhood added this to the 1.19.x milestone Aug 21, 2019

stuhood added a commit that referenced this pull request Aug 22, 2019
Serverset: Lazily connect to a limited number of servers (#8165)