Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing pt. 1 #1925

Merged
merged 72 commits into from Feb 12, 2020
Merged

Tracing pt. 1 #1925

merged 72 commits into from Feb 12, 2020

Conversation

@maackle
Copy link
Member

maackle commented Nov 28, 2019

PR summary

Note to self: see tracing-1-michael for local crate paths for local dev on this

Adding basic tracing to core, using newly added updates to holochain-tracing.

  • Sprinkles #[autotrace] macro in many places to build up full thread-local stack traces
  • Liberally uses SpanWrap and EncodedSpanWrap to send Spans with data across thread and process boundaries, respectively.
    • action channel now uses SpanWrap
    • WireMessage lib3h communication uses EncodedSpanWrap
  • Adds a Tracer and a long-running Reporting thread to both holochain and cli crates, which reports spans to a locally running Jaeger instance.
    • The conductor reporting thread dynamically creates a Reporter for each instance, and stores them all in a big RwLock<HashMap<,>>, similar to the signal receivers

testing/benchmarking notes

How to run tests locally:

Before running a test, run a Jaeger server (put this in a script)

docker run -d --rm --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14267:14267 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.16
  • Make sure the test you want includes a "tracing" config in tryorama. See app_spec/test/config.js for an example
  • Use the --tracing-name flag for sim2h_server, e.g. hc-sim2h-server --tracing-name sim2h
  • Run the test, and go to http://localhost:16686 to see the Jaeger UI and search for traces

Final exam / TryCP integration notes:

Run all-in-one on final exam:

docker run -d --rm --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14267:14267 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.16

Run jaeger-agent on each trycp machine

HOST=final-exam.holo.host

docker run \
  --rm \
  -p5775:5775/udp \
  -p6831:6831/udp \
  -p6832:6832/udp \
  -p5778:5778/tcp \
  jaegertracing/jaeger-agent:1.16 \
  --reporter.tchannel.host-port "$HOST:14267"

followups

  • Configurable sampling strategy
  • More fine-tuned traces -- current ones are noisy and incomplete, in particular the discontinuity in queued_pending_validations, which is documented with TODOs in the code
  • More use of tags
  • Send traces over the network!

changelog

  • if this is a code change that effects some consumer (e.g. zome developers) of holochain core, then it has been added to our between-release changelog with the format
- summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)

documentation

@jamesray1

This comment has been minimized.

Copy link
Contributor

jamesray1 commented Nov 29, 2019

Good to see this! Once complete, this may possibly help to close or actually close #395?

@maackle

This comment has been minimized.

Copy link
Member Author

maackle commented Nov 29, 2019

@jamesray1 maybe so, but that issue was more like a note-to-self than a unit of work, so I closed it anyway


// TODO: TRACING: this is where we would include a Span, so that we can resume
// the trace when the workflow gets popped (see instance.rs), but we can't do that
// until we stop cloning the State, because Spans are not Cloneable.

This comment has been minimized.

Copy link
@lucksus

lucksus Dec 2, 2019

Member

I'm not sure if we ever will stop cloning the state. As a matter of fact, we recently added more State cloning (instead of tossing around lock guards which lead to deadlocks) and have made that cloning cheaper since the State is comprised almost only of Arcs and im collections..

Could spans live in Arcs inside the state?

// TODO: I changed a clone() to a take() because I couldn't see how this would work otherwise. Was it right??
if let Some(action_wrapper) = unprocessed_action.take().or_else(|| rx_action.recv_timeout(Duration::from_secs(1)).ok()) {
Comment on lines 205 to 206

This comment has been minimized.

Copy link
@lucksus

lucksus Dec 2, 2019

Member

Hm, I guess that is right.

Copy link
Member

lucksus left a comment

Looks good! I would like to see how this affects performance though..

@maackle

This comment has been minimized.

Copy link
Member Author

maackle commented Dec 2, 2019

Currently waiting on the ability to publish a new release of holochain_tracing (and a brand new one for holochain_tracing_macros) to crates.io, so that holochain-rust and lib3h crates can be properly updated to use the new tracing versions

if let Some(PendingValidationWithTimeout { pending: front, .. }) =
new_store.queued_holding_workflows.front()
{
if front == pending {
let _ = new_store.queued_holding_workflows.pop_front();
} else {
// The first item in the queue could be a delayed one which will result
// in the holding thread seeing another item as the next one.
// The holding thread will still try to pop that next item, so we need
// this else case where we just remove an item from some position inside the queue:
new_store
.queued_holding_workflows
.retain(|PendingValidationWithTimeout { pending: item, .. }| item != pending);
}
} else {
if let None = new_store.remove_holding_workflow(pending) {
error!("Got Action::PopNextHoldingWorkflow on an empty holding queue!");
}
Comment on lines -189 to 195

This comment has been minimized.

Copy link
@maackle

maackle Dec 3, 2019

Author Member

I replaced this big chunk of code with a simpler method on DhtStore. It seemed unnecessary to have two separate cases for whether the item is at the front of the queue or not, either way we just want to remove it. Can someone make sure I did this replacement right though? @lucksus?

This comment has been minimized.

Copy link
@lucksus

lucksus Dec 4, 2019

Member

Looks good!

maackle added 3 commits Jan 30, 2020
Copy link
Contributor

freesig left a comment

There's is a lot to review here. I can't see any issues but we probably want at least 3 eyes on it as it touches so much

crates/common/src/lib.rs Outdated Show resolved Hide resolved
@maackle maackle marked this pull request as ready for review Feb 1, 2020
@maackle

This comment has been minimized.

Copy link
Member Author

maackle commented Feb 3, 2020

I just realized one shortcoming of this branch: Reporters for each Instance are only created when instantiating from Config, but not when adding Instances from the admin interface. This means e.g. Holoscape wouldn't be able to do tracing until you restart your conductor (though Holoscape doesn't have a way to specify that tracing should be added to the config yet anyway).

crates/sim2h/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Member

lucksus left a comment

Just wanna make sure we don't forget to remove the .only before merging.

app_spec/test/files/test.js Outdated Show resolved Hide resolved
@@ -557,6 +586,8 @@ impl Conductor {
/// Starts all instances
pub fn start_all_instances(&mut self) -> Result<(), HolochainInstanceError> {
notify("Start all instances".to_string());
// TODO: how to stop?

This comment has been minimized.

Copy link
@lucksus

lucksus Feb 4, 2020

Member

Killswitch channel?

// To do this, we'd need to store a Span in each item of the DhtStore::queued_holding_workflows.
// However, Span is not Clone, and the entire DhtStore needs to be Cloned.
// So, the span has to be cut short here until we stop cloning the state.

This comment has been minimized.

Copy link
@lucksus

lucksus Feb 4, 2020

Member

Again, what about putting Arc wrapped spans into the state? We are doing this with the state slices already to not clone the whole state every time - just the root reference.

This comment has been minimized.

Copy link
@maackle

maackle Feb 4, 2020

Author Member

@lucksus The problem is that we rely on the Span being dropped to mark the end of the span and to actually send it for reporting. If we allow it inside an Arc, we lose control over when it goes out of scope if we don't keep track of every clone. If all clones don't drop at the same time, we get an inaccurate span. We could solve this with explicit coordination, but that's just more complexity. Unless you see a simple way that I'm not seeing?

This comment has been minimized.

Copy link
@lucksus

lucksus Feb 12, 2020

Member

I see. Well, that explicit coordination could just be an Arc<Option<..>>, right?
(Or Arc<Mutex<Option<..>>>)

};
Ok(Box::new(Sim1hWorker::new(h, backend_config)?) as Box<dyn NetWorker>)
P2pBackendKind::SIM1H => Box::new(move |_h| {
unimplemented!()

This comment has been minimized.

Copy link
@lucksus

lucksus Feb 4, 2020

Member

I know we removed the sim1h test suite. Are we (did we?) officially and fully deprecate sim1h?

This comment has been minimized.

Copy link
@maackle

maackle Feb 4, 2020

Author Member

Let's discuss that today, and what that should look like

This comment has been minimized.

Copy link
@maackle

maackle Feb 4, 2020

Author Member

Oops, forgot to bring it up today. Tomorrow!

maackle and others added 3 commits Feb 4, 2020
Co-Authored-By: Nicolas Luck <nicolas@lucksus.eu>
@@ -57,6 +58,8 @@ pub fn reduce_init(state: &mut NetworkState, root_state: &State, action_wrapper:
p2p_config,
Some(Address::from(network_settings.agent_id.clone())),
Some(root_state.conductor_api.clone()),
// TODO: This probably shouldn't be none but I can't figure out how to get the tracer
None,

This comment has been minimized.

Copy link
@freesig

freesig Feb 11, 2020

Contributor

I couldn't figure out how to get a tracer here @maackle It's needed to pass to the Sim2hWorker

This comment has been minimized.

Copy link
@maackle

maackle Feb 11, 2020

Author Member

Well I thought we could get it in through the ActionWrapper. But then we would have to make Tracer Serialize 🙈

@zippy zippy requested review from lucksus, neonphog and freesig Feb 11, 2020
freesig added 2 commits Feb 11, 2020
@freesig freesig mentioned this pull request Feb 12, 2020
2 of 5 tasks complete
@freesig freesig merged commit a5c5cae into develop Feb 12, 2020
6 checks passed
6 checks passed
ci/circleci: app-spec-tests-sim2h Your tests passed on CircleCI!
Details
ci/circleci: build Your tests passed on CircleCI!
Details
ci/circleci: cli-tests Your tests passed on CircleCI!
Details
ci/circleci: fmt Your tests passed on CircleCI!
Details
ci/circleci: stress-tests-sim2h Your tests passed on CircleCI!
Details
ci/circleci: wasm-conductor-tests Your tests passed on CircleCI!
Details
@neonphog neonphog deleted the tracing-1 branch Mar 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.