New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rust/engine] add timeout for grpc #5632

Merged
merged 15 commits into from May 23, 2018

Conversation

Projects
None yet
4 participants
@baroquebobcat
Copy link
Contributor

baroquebobcat commented Mar 26, 2018

#5504

This is an initial movement towards adding a timeout. It still needs more work.

This patch: On responses from the server, instead of always continuing, we'll now timeout if the incomplete responses take more than the timeout. Currently in the patch, the timeout is 4 seconds, which is quite unrealistic, but it makes it easier to test, given that the timeout isn't injected yet.

Before this can land, it needs to inject the timeout duration into the client rather than having a hardcoded value. It also might make sense to investigate other future framework mechanisms for handling timing out. This patch's construction assumes that the server will be quick in its responses about incomplete operations.

Problem

(explain the context of the problem and why you're making this change. include
references to all relevant github issues.
)

Solution

(describe the modifications you've made.)

Result

(describe how your changes affect the end-user behavior of the system. this section is
optional, and should generally be summarized in the title of the pull request.
)

[rust/engine] add timeout for grpc
#5504

This is an initial movement towards adding a timeout. It still needs more work.

This patch: On responses from the server, instead of always continuing, we'll now timeout if the incomplete responses take more than the timeout. Currently in the patch, the timeout is 4 seconds, which is quite unrealistic, but it makes it easier to test, given that the timeout isn't injected yet.

Before this can land, it needs to inject the timeout duration into the client rather than having a hardcoded value. It also might make sense to investigate other future framework mechanisms for handling timing out. This patch's construction assumes that the server will be quick in its responses about incomplete operations.
@illicitonion

This comment has been minimized.

Copy link
Contributor

illicitonion commented Mar 26, 2018

I think I'd be pretty much happy with this as-is with two modifications, both to ExecuteProcessRequest:

  1. Add a timeout: std::time::Duration. Also make run_command_locally return a Future, and honour the timeout.
  2. Add a description: String field to ExecuteProcessRequest which isn't included in Eq, Hash &c, and is just used in error logging.
@baroquebobcat

This comment has been minimized.

Copy link
Contributor

baroquebobcat commented Apr 25, 2018

I've got some work towards that, but I think I'm going to wait for #5714 to land, then follow the patterns that review uses and use the library that uses.

@stuhood stuhood force-pushed the pantsbuild:master branch from b6bb42d to 9e2fdb5 May 11, 2018

@baroquebobcat baroquebobcat changed the title [wip][rust/engine] add timeout for grpc [rust/engine] add timeout for grpc May 18, 2018

@baroquebobcat baroquebobcat requested review from illicitonion , stuhood and ity and removed request for illicitonion May 18, 2018

@baroquebobcat

This comment has been minimized.

Copy link
Contributor

baroquebobcat commented May 18, 2018

I'm not 100% happy with it, but I think I'm at the point where I need further review to make progress.

@illicitonion
Copy link
Contributor

illicitonion left a comment

Looks great! A few clean-ups and this should be good to go!

@@ -30,7 +31,7 @@ pub mod remote;
///
/// A process to be executed.
///
#[derive(Clone, Debug, Eq, Hash, PartialEq)]
#[derive(Clone, Debug)]

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

A slightly cleaner / more reliable way of achieving this (which doesn't require people to update the manual trait implementations if new fields are added) is to define a wrapper type which no-ops these commands, and to continue to auto-derive the traits.

Tuple-structs are zero-overhead in rust (they're compiled out), so this is free:

struct IgnoreForHashEq<T>(T);

impl <T> PartialEq for IgnoreForHashEq<T> {
    fn eq(&self, _: &IgnoreForHashEq<T>) -> bool {
        true
    }   
}

impl <T> Eq for IgnoreForHashEq<T> {}

impl <T> Hash for IgnoreForHashEq<T> {
    fn hash<H: Hasher>(&self, _state: &mut H) {}
}

impl <T: Debug> Debug for IgnoreForHashEq<T> {
    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
        write!(f, "{:?}", self.0)
    }
}

impl <T: Display> Display for IgnoreForHashEq<T> {
    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl <T: Clone> Clone for IgnoreForHashEq<T> {
    fn clone(&self) -> IgnoreForHashEq<T> {
        IgnoreForHashEq(self.0.clone())
    }
}

impl <T: Copy> Copy for IgnoreForHashEq<T> {}

You can then just have description be a IgnoreForHashEq<String> rather than a String, and have everything Just Work.

(We may also want to implement Deref so that you don't need to call description.0 when you want a description, too)

I'd be in favour of putting this into a separate crate in src/rust/engine and using it here. I can believe we may even want to publish that crate for other people to re-use.

h/t to @passy for the suggestion :)

This comment has been minimized.

@passy

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

Oh, that's fun. I'd considered looking at a library that does this sort of thing, but it seemed like implementing eq et al would be easier. This approach is much cleaner. Where should the type live? It seems like something we might want to reuse.

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

Maybe a crate in src/rust/engine/ignore_derive or similar?

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

I'll give that a go

This comment has been minimized.

@stuhood

stuhood May 18, 2018

Member

Ehh, this feels like overkill to do in this PR. Can we defer it?

I'm also not sure about nuking equality of fields, such that later when you do something like res1.description == res2.description it succeeds incorrectly and you never notice it.

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

hm. How about I drop the separate impl of Eq et al for now, and we can revisit if it is an issue?

@@ -103,45 +103,54 @@ impl CommandRunner {
let execution_client2 = execution_client.clone();
let operations_client = self.operations_client.clone();

let req_timeout = req.timeout;
let req_description = req.description.clone();

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

I believe that if you do this assignment after the call to make_execute_request, you won't need the clone (because nothing uses req after that, so you can just take the value from it)

.and_then(move |(execute_request, operation)| {
let start_time = SystemTime::now();
let timeout: Duration = req_timeout;

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

This variable doesn't look necessary?

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

I think it's vestigial from when we had it hardcoded. I'll xx it.

.and_then(move |(execute_request, operation)| {
let start_time = SystemTime::now();
let timeout: Duration = req_timeout;
let req_description = req_description.clone();

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

This line should be deletable - req_description gets moved into the closure below, and then gets cloned per loop iteration already.

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

Hm. rustc was unhappy with me removing it before, but I'll try again.

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

works! Thanks. I need to better model in my head how moves interact with various things.

future::err(format!(
"Exceeded time out of {:?} with {:?} for operation {}, {}",
timeout, elapsed, operation_name, req_description
)).to_boxed() as BoxFuture<_, _>

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

I don't think the as BoxFuture<_, _> is needed any more, since #5754 (and throughout)

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

nice

// take the grpc result and cancel the op if too much time has passed.
match grpc_result {
Ok(operation) => {
match start_time.elapsed() {

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

I would possibly phrase this match as a try_future!:

let elapsed = try_future!(start_time.elapsed().map_err(|err| format!("Something weird happened with time: {:?}", err)));
if elapsed > timeout {
  ...
}

the match right now means that there's a lot more scope to consider for a very simple thing

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

yeah, a fluent monadic representation of the execution here might be easier to follow. Thanks.

fn make_incomplete_operation(operation_name: &str) -> bazel_protos::operations::Operation {
fn make_incomplete_operation(
operation_name: &str,
) -> (bazel_protos::operations::Operation, Option<Duration>) {

This comment has been minimized.

@illicitonion

illicitonion May 18, 2018

Contributor

The Option<Duration> on this function (and the other hard-coded-to-None returning functions) probably deserves a comment explaining that it's so that callers can just the output of these calls compatibly in a Vec with calls to make_delayed_incomplete_operation

This comment has been minimized.

@baroquebobcat

baroquebobcat May 18, 2018

Contributor

Makes sense to me.

@illicitonion
Copy link
Contributor

illicitonion left a comment

Looks great, thanks!

It's probably worth noting in a comment somewhere that this is the timeout which is applied to the whole operation, so includes any remote queueing and remote set-up time which may be spent, rather than just being the time the process is allowed to execute (which I can believe we'll also support separately at some point :))


class ExecuteProcessRequest(datatype([
('argv', tuple),
('env', tuple),
('input_files', DirectoryDigest),
('output_files', tuple)
('output_files', tuple),
('timeout', Exactly(float, int)),

This comment has been minimized.

@illicitonion

illicitonion May 22, 2018

Contributor

s/timeout/timeout_seconds/

req_timeout, elapsed, operation_name, req_description
)).to_boxed()
} else {
// maybe the delay here should be the min of remaining time and the backoff period

This comment has been minimized.

@illicitonion

illicitonion May 22, 2018

Contributor

That sounds like a good plan :)

@@ -496,11 +497,20 @@ impl ExecuteProcess {
.map(|path: String| PathBuf::from(path))
.collect();

let timeout_str = externs::project_str(&value, "timeout");

This comment has been minimized.

@illicitonion

illicitonion May 22, 2018

Contributor

It's slightly sad that this goes through a float -> string -> float round-trip, but I'm pretty sure I don't care enough to do the whole round-tripping floats across the ffi boundary piece :)

@@ -127,6 +127,8 @@ fn main() {
env,
input_files,
output_files: BTreeSet::new(),
timeout: Duration::new(15 * 60, 0),

This comment has been minimized.

@illicitonion

illicitonion May 22, 2018

Contributor

Could be nice to make this a configurable flag, but happy to defer that too :)

@stuhood
Copy link
Member

stuhood left a comment

Looks good... I only have nits which should be resolved in followups.

)

@classmethod
def create_with_empty_snapshot(cls, argv, env, output_files):
return cls.create_from_snapshot(argv, env, EMPTY_SNAPSHOT, output_files)
def create_with_empty_snapshot(cls, argv, env, output_files, timeout_seconds=_default_timeout_seconds, description='process'):

This comment has been minimized.

@stuhood

stuhood May 23, 2018

Member

I believe that this could be merged with the previous method by giving the snapshot a default argument of snapshot=DEFAULT_SNAPSHOT.

('output_files', tuple),
# NB: timeout_seconds covers the whole remote operation including queuing and setup.
('timeout_seconds', Exactly(float, int)),
('description', SubclassesOf(*six.string_types)),

This comment has been minimized.

@stuhood

stuhood May 23, 2018

Member

Nit: Encouraging folks to set names for processes, and discouraging them from setting specific timeouts seems like a good idea. So swapping the order of these arguments would be good.

Also, argv[0] would probably make a better default description than "process".

@baroquebobcat baroquebobcat merged commit f5a46cb into pantsbuild:master May 23, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

stuhood added a commit to twitter/pants that referenced this pull request May 24, 2018

[rust/engine] add timeout \ description for grpc process requests (pa…
…ntsbuild#5632)

Fixes pantsbuild#5504

On responses from the server, instead of always continuing, we'll now timeout if the incomplete responses take more than the timeout. The default timeout in the patch is 15 minutes.

It does this by introducing a timeout to ExecuteProcessRequest that counts from when the grpc request goes out and fails with an error result if the timeout is exceeded while checking the status of the response.

It also adds a description field and includes it in error messages so that errors can have a human readable description of the request.

It does not add a timeout implementation to local executions, pantsbuild#5833 is the issue for that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment