Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add remote workunits for Zipkin trace #7897

Merged

Conversation

@cattibrie
Copy link
Contributor

commented Jun 19, 2019

Problem

As the purpose of the remoting is to make pants runs faster (for compile) it is very important to understand its performance. One of the nice ways is Zipkin tracing.

Solution

The result of the remote execution contains timings of different stages of the remote process that are used to create workunits to Zipkin trace.

The last part of this PR is to add a test(unitest and/or integration test).

@cattibrie cattibrie force-pushed the cattibrie:etyurina/zipkin_add_remote_spans branch 3 times, most recently from c44a0ce to 5ef5b75 Jun 19, 2019

@illicitonion illicitonion requested review from illicitonion, patliu85, stuhood and blorente and removed request for illicitonion and patliu85 Jun 19, 2019

@illicitonion
Copy link
Contributor

left a comment

Looks good :) Some minor things, and we'll need some tests that you're working on :)

@@ -211,7 +212,7 @@ impl super::CommandRunner for CommandRunner {
///
/// Runs a command on this machine in the passed working directory.
///
fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> {
fn run(&self, req: ExecuteProcessRequest, _workunit_store: Arc<WorkUnitStore>) -> BoxFuture<FallibleExecuteProcessResult, String> {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jun 19, 2019

Contributor

Maybe add a TODO here to populate workunits for local process executions. We could reasonably have workunits for each of:
materialize input files
execution
ingest output files

(and I have a not-so-secret plan that we'll start also populating workunits for each store action (at least the remote ones) for things like "download file from remote" :))

fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
fn add_workunit(&self, workunit: WorkUnit);
}

pub struct SafeWorkUnitStore {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jun 19, 2019

Contributor

What's safe about this, to influence it's name?

If Context itself isn't going to implement the WorkUnitStore trait, then we could get rid of the trait, and just have a struct named WorkUnitStore that we use everywhere?

This comment has been minimized.

Copy link
@cattibrie

cattibrie Jun 27, 2019

Author Contributor

We decided to call it safe because it is a Mutex of a value(Vec). So we wanted to say that it is safe to work with it from several threads.
What could be a better name in this case?

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

Just WorkUnitStore - Rust guarantees that you'll always be able to work with things properly across threads (and specifically the Send and Sync marker traits are used to denote these kinds of safety) :)

fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
fn add_workunit(&self, workunit: WorkUnit);
}

pub struct SafeWorkUnitStore {
pub workunits: Mutex<Vec<WorkUnit>>,

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jun 19, 2019

Contributor

I don't think this needs to be pub?

fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
fn add_workunit(&self, workunit: WorkUnit);
}

pub struct SafeWorkUnitStore {
pub workunits: Mutex<Vec<WorkUnit>>,

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jun 19, 2019

Contributor

It may be worth having this field be an Arc<Mutex<Vec<WorkUnit>>> so that all of the callers don't need to wrap it in an Arc. Or it may not :)

@@ -736,6 +753,30 @@ impl CommandRunner {
}
}

fn maybe_add_workunit(result_cached: &bool, name: &str, start_time: &Timespec, end_time: &Timespec, parent_id: Option<String>, workunit_store: &Arc<WorkUnitStore>) {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jun 19, 2019

Contributor

It's kind of weird to take a &bool - clippy will probably suggest that you just take a bool.

The reason for this is that a bool is just one small amount of memory to copy, and is probably actually cheaper to just copy than using references, and any time you use a reference you make the compiler's job slightly harder.

}

fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
// Returning value is formed by representing duration as a hole number of seconds (u64) plus

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jun 19, 2019

Contributor

Probably remove this comment - it's just describing what the code says and not adding much value on its own :)

This comment has been minimized.

Copy link
@blorente

blorente Jun 20, 2019

Contributor

I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.

@blorente
Copy link
Contributor

left a comment

Awesome work!

}

fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
// Returning value is formed by representing duration as a hole number of seconds (u64) plus

This comment has been minimized.

Copy link
@blorente

blorente Jun 20, 2019

Contributor

I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.

start_timestamp,
end_timestamp,
span_id,
parent_id: None,

This comment has been minimized.

Copy link
@blorente

blorente Jun 20, 2019

Contributor

Maybe add a TODO (ideally with an issue number) to have this not be None soon (excited!)

This comment has been minimized.

Copy link
@illicitonion

@cattibrie cattibrie force-pushed the cattibrie:etyurina/zipkin_add_remote_spans branch from 5ef5b75 to 29097d4 Jun 27, 2019

@cattibrie cattibrie changed the title WIP: Add remote workunits for Zipkin trace Add remote workunits for Zipkin trace Jun 27, 2019

@cattibrie cattibrie force-pushed the cattibrie:etyurina/zipkin_add_remote_spans branch from 29097d4 to 1cb27b3 Jun 28, 2019

@illicitonion
Copy link
Contributor

left a comment

Looks great! Apart from one question about how we're asserting equality, all really minor comments :)

}

fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
// Reverting time from duration to f64 decrease precision.

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

Probably worth linking to rust-lang/rust#54361 saying that there's an unstable standard library feature we'd like to use, but we're copying their implementation until it's stabilised.

@@ -2489,6 +2529,70 @@ mod tests {
)
}

#[test]
fn check_that_remote_workunits_are_in_workunit_store() {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

We tend not to use check_that or similar prefixes in our test names, because that what all tests do, so it's typically redundant. Maybe call this stores_workunits or remote_workunits_are_stored?

StderrType::Raw(testdata_empty.string()),
0,
)
.op

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

This .op.unwrap().unwrap() points to a slightly wrong abstraction here. make_successful_operation_with_metadata returns a MockOperation, which maybe wraps an Operation. It looks like you want a function that just makes an Operation.

I think this should be pretty easy to do; just make make_successful_operation_with_maybe_metadata return an Operation, and do the MockOperation wrapping in make_successful_operation instead.

let mut runtime = tokio::runtime::Runtime::new().unwrap();

let workunit_store_2 = workunit_store.clone();
runtime.block_on(futures::future::ok(()).and_then(move |()| command_runner.extract_execute_response(

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

Slightly more clear than the futures::future::ok(()).and_then(move |()| could be: futures::future::lazy(move ||

let workunits_arc = workunit_store.get_workunits();
let workunits = workunits_arc.lock();
let scheduling_workunit = &workunits[0];
assert_workunit_params(scheduling_workunit, "scheduling", 0.0, 1.0, None);

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

Does the order of WorkUnits in the store actually matter? If not, we may not want to assert on it...

And in general, it's nice to be able to assert equality on two WorkUnits rather than have to go through field by field... We may want to (or it may be confusing and overkill...) do something like:

let got_workunits: HashSet<SpanIdIgnoringWorkUnits> = workunit_store.get_workunits().lock().iter().cloned().map(WorkUnitIgnoringSpanId::from_workunit).collect()

let want_workunits = hashset! {
  WorkUnitIgnoringSpanId { ... },
  WorkUnitIgnoringSpanId { ... },
  WorkUnitIgnoringSpanId { ... },
  WorkUnitIgnoringSpanId { ... },
};

assert_eq!(want_workunits, got_workunits)
let mut metadata = ExecutedActionMetadata::new();
metadata.set_queued_timestamp(timestamp_only_secs(0));
metadata.set_worker_start_timestamp(timestamp_only_secs(1));
metadata.set_worker_completed_timestamp(timestamp_only_secs(8));

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

Let's move this line to the end, so that the order of code reflects the order of timestamps

NodeKey::Snapshot(n) => n.run(context).map(NodeResult::from).to_boxed(),
NodeKey::Task(n) => n.run(context).map(NodeResult::from).to_boxed(),
}
futures::future::ok(()).and_then(|()| {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

Again, maybe futures::future::lazy(|| (but not big deal either way)

self.workunits.lock().push(workunit);
}

pub fn len(&self) -> usize {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

len seems like a kind of weird operation to have on a WorkUnitStore, and it looks like it's only used in a test. Maybe in the test instead do store.get_workunits().lock().len()?

// only if '--reporting-zipkin-trace-v2' is set
if !result_cached {
let workunit = WorkUnit {
name: String::from(name),

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 1, 2019

Contributor

We probably want to use slightly more verbose names, something like "remote execution worker input fetching" and similar

@@ -126,7 +128,7 @@ impl super::CommandRunner for CommandRunner {
///
/// TODO: Request jdk_home be created if set.
///
fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> {
fn run(&self, req: ExecuteProcessRequest, workunit_store: WorkUnitStore) -> BoxFuture<FallibleExecuteProcessResult, String> {

This comment has been minimized.

Copy link
@stuhood

stuhood Jul 2, 2019

Member

@illicitonion : Should this be using thread/task locals like logging does? AFAIK, most tracing implementations do.

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 2, 2019

Contributor

I'm personally a big fan of explicit over implicit, but it could if we wanted to...

This comment has been minimized.

Copy link
@stuhood

stuhood Jul 2, 2019

Member

Well, tracing is a lot like logging. Doing the equivalent thing there and propagating our logger explicitly throughout our callstack to all the places we might want to log would probably be too much.

Maybe this is not similar to a logger, and so that doesn't make sense here... unknown.

This comment has been minimized.

Copy link
@cattibrie

cattibrie Jul 5, 2019

Author Contributor

As I use task locals in not many places in this PR I would prefer to leave it as it is.
But this is a useful suggestion and it would be great to reconsider it when adding right parent_id to all other v2 Nodes.

@cattibrie cattibrie force-pushed the cattibrie:etyurina/zipkin_add_remote_spans branch from 1cb27b3 to edb8c80 Jul 5, 2019

@illicitonion
Copy link
Contributor

left a comment

Looks great! A couple of tiny comments, otherwise about ready to merge :)

Can you look at the first two shards on travis? There are a couple of trivial clippy things, and a reformatting needed. All the other shards look like unrelated flakes

format!("{:16.x}", random_u64)
}

pub fn got_workunits(workunit_store: WorkUnitStore) -> HashSet<WorkUnit> {

This comment has been minimized.

Copy link
@illicitonion

illicitonion Jul 8, 2019

Contributor

Let's rename this to something like: workunits_without_span_id or workunits_with_constant_span_id

Also, this only needs a reference to a WorkUnitStore, so let's take a &WorkUnitStore instead of a WorkUnitStore.

There's also a cute little construction where you could if you wanted (I don't think it's necessarily better here, but it's pretty cute and good to know about), write:

workunit_store.get_workunits().lock().iter()
  .map(|workunit| WorkUnit { span_id: String::from("ignore"), **workunit })
  .collect()

which means "Make me a new WorkUnit with all the fields set the same as workunit except span_id"

This comment has been minimized.

Copy link
@cattibrie

cattibrie Jul 9, 2019

Author Contributor

When I changed the code to

 workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()

I got the next error:

error[E0308]: mismatched types
  --> workunit_store/src/lib.rs:76:109
   |
76 |   workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()
   |                                                                                                             ^^^^^^^^ expected struct `WorkUnit`, found &WorkUnit
   |

I cannot also do:

 workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..*workunit }).collect()

Because cannot move out of borrowed content

This comment has been minimized.

Copy link
@cattibrie

cattibrie Jul 9, 2019

Author Contributor

WorkUnit { span_id: String::from("ignore"), ..workunit.clone() } works
So anyway need to clone the workunit

start_timestamp,
end_timestamp,
span_id,
parent_id: None,

This comment has been minimized.

Copy link
@illicitonion

@cattibrie cattibrie force-pushed the cattibrie:etyurina/zipkin_add_remote_spans branch from edb8c80 to b54225f Jul 9, 2019

Add remote workunits for Zipkin trace
### Problem
As the purpose of the remoting is to make pants runs faster (for compile) it is very important to understand its performance. One of the nice ways is Zipkin tracing.

### Solution
The result of the remote execution contains timings of different stages of the remote process that are used to create workunits to Zipkin trace.

The last part of this PR is to add a test(unitest and/or integration test).

@illicitonion illicitonion force-pushed the cattibrie:etyurina/zipkin_add_remote_spans branch from b54225f to ac1f978 Jul 10, 2019

@illicitonion

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2019

I just pushed a rebase because I introduced a bunch of conflicts in two other PRs; will merge when green! :)

@illicitonion illicitonion merged commit 05188ed into pantsbuild:master Jul 10, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.