manual: update runtime tracing chapter for custom events #12335

TheLortex · 2023-06-28T08:49:03Z

This PR extends the runtime tracing section to account for custom events introduced in #11474

It should be included in the 5.1 release

manual/src/cmds/runtime-tracing.etex

gasche · 2023-06-28T09:13:08Z

manual/src/cmds/runtime-tracing.etex

+the same set of registered events. The event name is used by the consumer to
+identify producer's events, with a fallback mechanism for unknown events of
+built-in types. Unknown events for user-defined types are currently ignored,
+even if the type is defined both in the producer and consumer side.


You explain "runtime probes" and "custom probes", but in fact most of the explanation of "custom probes" is about what "custom events" are. My feeling is that "events" and "probes" are two different concepts that could be explained separately: events have a semantic and a payload, and probes, actually I am not sure what they are (the Runtime_events module documentation does not contain the word "probe"), but they seem to be related to both emission and reception of events -- or at least both are discussed here.

You're right. Basically I see probes as "the piece of code that emit events". There are probes in the runtime that emit runtime events. Using the custom events library, users can set up additional probes in their code base that emit custom events.

I'll come up with better wording.

Note (not directly related to the PR): to me the word "probe" suggests a tool to observe some information, so it is more associated to a "receiver" than to a "producer" in my mind. The usage of the word "probe" in a similar context that I know of are the probes in the Linux kernel ( https://docs.kernel.org/trace/kprobes.html ), and their meaning there is fairly different from "the piece of code that emit events" (the probe is more like the part that registers a listener for a given event, the "probed instructrion" is in charge of emitting the event).

The other precedent for the use of the term 'probe' is the dtrace, which supports both dynamic and static probe points. https://illumos.org/books/dtrace/chp-usdt.html (certificate expired, but content still valid)

I see yes, my context for the definition of "probe" is:

https://github.com/janestreet/ocaml-probes: "Some examples of what a probe handler can do: write data into a ring buffer or similar for transmission to a another tool that can analyze or visualize it.", a 'probe handler' being "If a probe is enabled, whenever the execution reaches it, the corresponding handler is evaluated."

the current usage of probe in the manual: "The full set of events emitted by probes [...]" "The events transport part of the system ingests events emitted by the probes
and makes them available to consumers."

Both making me think of them as event producers.

(cc @sadiqj if you have opinions on how "probes" should be defined, and how they relate to the documentation the Runtime_events module itself.)

If you're into electronics, a probe is part of the measurement device (oscilloscope, etc). The part of the printed circuit board where you can attach a probe is called a test point, or point de piquage in French. What about "tap" as a source of events? Or just "event source" for additional clarity.

I don't think I have a strong opinion either way on terminology. I used to refer to the CAML_EV macros as probes because there wasn't an explicit way of actually attaching to those, if runtime_events was enabled all probes produced events that went in to the ringbuffer - you didn't really attach to individual ones directly.

In any case I think that the discussion in this paragraph mixes too many things: (1) what is the information carried by those custom events, (2) how to emit them, (3) how to listen for them. I think that it would be natural to discuss at least (1) strictly before (2), (3), which could be discussed either together or separately.

(I think it would also be nice to understand what terminology we want to use, but this is orthogonal. I do find it weird that currently the terminology in the manual chapter is distinct from the terminology in the .mli documentation.)

manual/src/cmds/runtime-tracing.etex

gasche · 2023-06-28T09:20:30Z

manual/src/cmds/runtime-tracing.etex

+    match Runtime_events.User.tag event, value with
+    | CustomSpan, Runtime_events.Type.Begin -> Printf.printf "> count begin\n"
+    | CustomSpan, End -> Printf.printf "< count end\n"
+    | _ -> ()


If I understand correctly, there is no enforced relation between the name of the even (CustomSpan, CustomInt) and the payload it carries: the producer could emit both span and int events for CustomSpan, and the customer could receive them. The fact that nothing prevents us to have CustomSpan as a tag for non-span events may or may not be an API defect depending on how we think of it.

Questions:

Would it be interesting to enforce that an event declared with a span payload is only ever emitted with a span payload? This could be done for example by having separate extensible variants for int-events and span-events, or by making the extensible tag a GADT indexed over its intended payload.

Or, on the countrary, is it intentional that a given event tag can be used with several payload types? In this case, should we consider rewriting this example to have a single Custom event tag, and use it for both span and int payloads?

That's a good question. Tags are currently used to provide additional semantics to a type, so in that sense they should be an extensible GADT to make sure they are properly used.

The question I'm wondering about is that custom types can also be used to refine semantics. For example if I write a web server and want to log response latencies, should I use the built-in int type so that it can be consumed by generic consumers, or use a custom int type used specifically for latencies. I see that there is a hierarchy of semantics: my library web request latencies < web request latencies < latencies < number. Using types and tags we're basically exposing two semantics for the int value: a broad one (type) and a precise one (tag).
As a consequence users might not have a use for tags if they come up with sufficiently precise custom types, and in that case it is a burden to extend the GADT with a tag for each custom type.

(I don't really understand the discussion on custom types, but I will try to have another go on my next read of the proposed chapter.)

@TheLortex’s point is not very clear to me either, but if I’m not mistaken this a discussion of the (already merged) API. It doesn’t mean that it isn’t important, but it seems to me that it shouldn’t be blocking for the present PR.

Well I think we should have respected the guideline of not merging significant new changes without updating the manual at the same time, so that we would be blocking both the API and the documentation PR. People would be more motivated to have a good API and a good documentation that way. Now instead we are in a situation where we want to merge documentation that we don't really understand for a feature that already shipped with a worse API.

My apologies for taking so long to come back to this PR.

The type of an event is what data it contains. We have three default ones, unit, span begin/end and int which correspond to an instantaneous event, the beginning of an event, the end of an event and finally the measurement of something at a point in time. There's also custom types, for which you can provide an encoder/decoder. The intention behind having the default ones is that generic consumers (e.g an opentelemetry or logging library) can do something sensible with the ones they know about.

A good example of a custom type might be a Histogram type. You don't want to register a new type for each Histogram you track in your program, you only do it once and then use that type to create a User.t with a name and tag. An example here might be 200_response_times or something similar. You could imagine having a Histogram for each response code timing.

One complexity is that runtime events is designed so you can consume in- or out-of-process. This is why there's a name and a tag. The name is carried with the event onwards but the tag is not. Having the two separate is a bit of a pain, there may be something that could be done there.

The design was essentially that the most common use cases were simple (units, spans and ints) to get most of the benefits but custom types could give you a lot more freedom but at the cost of a bit more complicated usage.

Eio's use of custom events for tracing is a good example of how the API is currently used: https://github.com/ocaml-multicore/eio/pull/635/files

People would be more motivated to have a good API and a good documentation that way.

There's a danger that if we insist on getting things perfect the first time, it won't bring forth good contributions. It's a balancing act and with runtime events I recall part of the argument of moving it to otherlibs was the expectation that it wasn't a mature design.

There's a danger that if we insist on getting things perfect the first time, it won't bring forth good contributions. It's a balancing act and with runtime events I recall part of the argument of moving it to otherlibs was the expectation that it wasn't a mature design.

In my experience, the act of writing documentation often reveals improvements to the API -- it is a way to test a design, orthogonal to its use to write code. (I would say that usage examples are good at highlighting weak spots and missing features, while documentation is good at suggesting opportunities to simplify concepts or improve consistency.) I didn't mean to suggest that the code should be perfect in an absolute sense, but that it should have benefited from this first round of improvement before it reaches users of an upstream release.

TheLortex · 2023-07-05T12:04:19Z

Thank you for your reviews. I have updated the PR address requested changes.

gasche · 2023-07-07T08:36:35Z

manual/src/cmds/runtime-tracing.etex

+
+An extensible variant \texttt{tag} provides a way to attach additional data to
+events. The tag is not serialized, so it is only available for \textit{known}
+events.


I don't understand what the last sentence means.

When the consumer receives an event that it doesn't know about, only the name and payload content can be retrieved, not the tag.

I also struggled for a while to understand this paragraph. I was filled with confusion with questions like: what information can a tag carry? What purpose do they serve if they are not serialized? If they are not serialized, how can the consumer retrieve them? Etc.

The light appeared when I read the .mli:

(** The type for a user event tag. Tags are used to discriminate between user events of the same type *)

So, in summary, I wonder if an alternative formulation of this paragraph would help eliminate the incomprehensions that I’ve had:

The function \texttt{Runtime_events.User.register}, in addition to the event name and type, also takes an extensible variant \texttt{tag}. Due to the way serialization works, subscribing to user events is per-type (i.e., you will receive all registered events of a given type) and one must use tags to discriminate between events of the same type. To establish the link between the name \texttt{name} and the tag \texttt{tag}, one must call \texttt{Runtime_events.User.register name tag typ}—in the producer, but also in the consumer, if the consumer needs to be able to discriminate on the tag.

or something like that. (I hope this is correct.)

In the ring buffer events are just an index in to a name string table and a bunch of bytes. The Runtime_event.Type is how we turn those bytes back in to some data we can do something with.

A tag is essentially a way of you associating an OCaml type with the name, so you don't have to use the string name everywhere. If the producer and consumer both know about a custom event custom type, they can register the same tag. That's not necessary though, if you register a callback for a custom type but don't register tags, you should still get the events - they'll just identified by a name and not have a tag. For the common use case of taking these events and recording them in to a log file or database somewhere, that's sufficient.

Edit: wading through the source again, this is not true. If a User Custom event isn't registered then we drop them even if there's an active callback for that type. That's because ultimately all we know is the name and data - we don't have a way of identifying the type of the event (beyond the built in ones). Even if we did have an identifier for each custom types, how would we even refer to it?

gasche · 2023-07-07T08:37:46Z

manual/src/cmds/runtime-tracing.etex

+\item event is registered: payload and tag are available.
+\item event is not registered and has a built-in event type (unit, int, span):
+    only the payload is available.
+\item event is not registered and has a custom event type: event is dropped.


Why is there a difference in behavior depending on the type of the payload? Wouldn't it make sense to drop all unregistered events?

Why is there a difference in behavior depending on the type of the payload?

Due to how the event transport protocol is designed, the consumer can decode event payloads for built-in types only by using the event type ID field in the header. It is not currently possible for custom types, as the consumer has first to identify the event in order to obtain the decoder function.

Wouldn't it make sense to drop all unregistered events?

There is value in supporting unregistered events of built-in types. runtime_events_tools's olly trace command was extended to integrate both GC events and unregistered events of the span type in a single trace (tarides/runtime_events_tools@5865899)

To me the other way around makes more sense: lift the limitation of unregistered events of custom types being dropped.

There is value in supporting unregistered events of built-in types. [We added support in olly trace.]

Can you spell this out for me, what is it useful for in practice?

Bonus question: Is it possible for an event to have several pieces of information as the payload, some custom and some built-in?

what is it useful for in practice?

For example, I was trying to understand performance issues in a multicore eio program. Using olly and runtime events I figured out that on major GC, the domains had a hard time synchronizing (one domain was waiting up to 200ms) but I couldn't figure out why. By adding custom events of the span types in eio I was able to see that time was spent in a system call that was not marked as a blocking section.

olly trace doesn't need to know about eio for that purpose. In fact, any library using the span event could appear in the trace. To me, obtaining events from many sources in a single trace is very useful.

Bonus question: Is it possible for an event to have several pieces of information as the payload, some custom and some built-in?

Currently, that's not possible. One would have to create multiple events for that purpose.

(and if I remember correctly, this is the offending C binding: ocaml-multicore/ocaml-uring#86. Very hard to spot this bug without this tracing.)

Thanks for the nice example! This is along the lines of: some "specialized listeners" are only looking for specific events they know about, but some other "generic listeners" are trying to improve observability by reporting on all events, possibly in a degraded way.

Two thoughts on this:

It could be nice to suggest this situation in the manual (something like "Some generic listeners are used to visualize the trace of OCaml programs, and they require no modification / extra registration logic to show your events [of standard payload].")

This example suggests that maybe dropping events with custom payload, instead of a best-effort approach of showing the event but nothing about the payload, is not the best approach. At least, in the current state of things, readers of the manual could be encouraged to systematically pair their custom-payload events with approximations of those with standard types that can be displayed, so that they can observe something with generic trace tools.

To me, the paragraph that’s been added below the one discussed here is a good implementation of your suggestion 1.

manual/src/cmds/runtime-tracing.etex

OlivierNicole

I have reviewed the proposed additions. I agree with most of @gasche’s comments. Nonetheless, it seems to me that the PR is a few minor tweaks away from addressing all of the reviewers’ concerns.

manual/src/cmds/runtime-tracing.etex

OlivierNicole · 2023-11-15T17:07:09Z

manual/src/cmds/runtime-tracing.etex

+Unknown events for user-defined types are currently ignored, even if the type is
+defined both in the producer and consumer side.


Does “unknown” mean “unregistered” here? If so, I think it would be clearer to stick to the same word.

OlivierNicole · 2023-11-15T17:37:16Z

manual/src/cmds/runtime-tracing.etex

+
+An extensible variant \texttt{tag} provides a way to attach additional data to
+events. The tag is not serialized, so it is only available for \textit{known}
+events.


I also struggled for a while to understand this paragraph. I was filled with confusion with questions like: what information can a tag carry? What purpose do they serve if they are not serialized? If they are not serialized, how can the consumer retrieve them? Etc.

The light appeared when I read the .mli:

(** The type for a user event tag. Tags are used to discriminate between user events of the same type *)

So, in summary, I wonder if an alternative formulation of this paragraph would help eliminate the incomprehensions that I’ve had:

The function \texttt{Runtime_events.User.register}, in addition to the event name and type, also takes an extensible variant \texttt{tag}. Due to the way serialization works, subscribing to user events is per-type (i.e., you will receive all registered events of a given type) and one must use tags to discriminate between events of the same type. To establish the link between the name \texttt{name} and the tag \texttt{tag}, one must call \texttt{Runtime_events.User.register name tag typ}—in the producer, but also in the consumer, if the consumer needs to be able to discriminate on the tag.

or something like that. (I hope this is correct.)

manual/src/cmds/runtime-tracing.etex

OlivierNicole · 2023-11-17T17:20:08Z

manual/src/cmds/runtime-tracing.etex

+\item event is registered: payload and tag are available.
+\item event is not registered and has a built-in event type (unit, int, span):
+    only the payload is available.
+\item event is not registered and has a custom event type: event is dropped.


To me, the paragraph that’s been added below the one discussed here is a good implementation of your suggestion 1.

OlivierNicole · 2023-11-17T17:28:04Z

manual/src/cmds/runtime-tracing.etex

+    match Runtime_events.User.tag event, value with
+    | CustomSpan, Runtime_events.Type.Begin -> Printf.printf "> count begin\n"
+    | CustomSpan, End -> Printf.printf "< count end\n"
+    | _ -> ()


@TheLortex’s point is not very clear to me either, but if I’m not mistaken this a discussion of the (already merged) API. It doesn’t mean that it isn’t important, but it seems to me that it shouldn’t be blocking for the present PR.

Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>

sadiqj · 2023-11-19T17:29:38Z

So I think one of the sources of confusion with the API at the moment is that the manual mixes the types of Runtime Events and the tags/names. Is it worth us separating these out, starting with just unit, span and int to simplify things and then pushing the custom types to later on in a more advanced usage section? (This mirrors how we'd hope people would use this, the majority of users should use the built-in types because that means generic consumers work)

gasche · 2023-12-10T09:18:05Z

I am unhappy with the state of this PR. Documentation should be written at the same time as the feature, not as an afterthought, otherwise there are few incentives to do a good job in a timely manner and we end up with undocumented software.

I would like to get to a state where there is documentation of runtime events including custom events that people (other than the implementor of the feature) can understand -- hopefully with a couple improvements to the implementation itself if they are suggested by the documentation process.

If we cannot get to this point in a reasonable timeframe, I will propose to remove custom events from 5.2, to ensure that there are proper incentives in place to have a documented feature.

talex5 · 2023-12-10T17:50:57Z

I would like to get to a state where there is documentation of runtime events including custom events that people (other than the implementor of the feature) can understand [...] I will propose to remove custom events

I wasn't aware of this documentation, but was able to figure things out easily enough by reading https://v2.ocaml.org/api/Runtime_events.html and the olly source code. It would be very inconvenient for Eio if custom events were removed now.

My understanding is that an event written to the ring is just a time-stamp, a name (a string) and a payload (bytes). The name is registered first so that it can be referred to using an index into a table to save space. It's not completely clear to me that the Runtime_events module needs to convert between OCaml types and bytes itself, since the user of the library can do it themselves easily enough. That would also allow generic tools to record all information in the ring, without having to understand it.

But it would make sense to me to merge this documentation about the current API first and then decide whether to change it in a separate PR.

talex5 · 2023-12-11T09:35:10Z

One thing that it would be good to clarify in the docs is thread-safety. For example, does the decoder get a copy of the ring buffer or the buffer itself? If it is the buffer itself, what happens if a new item overwrites it while the decoder is running? Likewise, what happens if two sys-threads write to the buffer at the same time (or a GC event is produced by the encoder running)?

OlivierNicole · 2023-12-11T12:51:39Z

So I think one of the sources of confusion with the API at the moment is that the manual mixes the types of Runtime Events and the tags/names. Is it worth us separating these out, starting with just unit, span and int to simplify things and then pushing the custom types to later on in a more advanced usage section? (This mirrors how we'd hope people would use this, the majority of users should use the built-in types because that means generic consumers work)

@sadiqj Personally, yes, I think it would have helped me understand better the first time. Along with that, I think that a brief description of what is actually serialized, as sketched by you and @talex5, would help understand why events need to be registered in this way.

OlivierNicole · 2023-12-15T16:55:44Z

In private discussions with both @gasche and @sadiqj it was agreed that I would take over the remaining work, notably the need for clarification advocated by @gasche. I don’t know how that works concretely, however: I don’t think I can push commits onto this PR.

manual: update runtime tracing chapter for custom events (ex #12335)

manual: update runtime tracing chapter for custom events (ex #12335) (cherry picked from commit 2433ab3)

gasche · 2023-12-22T23:00:07Z

This was subsumed by #12840 which was just merged. Thanks a lot everyone!

manual: update runtime tracing chapter for custom events (ex #12335) (cherry picked from commit 2433ab3)

TheLortex changed the title ~~manual: add entry for custom events~~ manual: update runtime tracing chapter for custom events Jun 28, 2023

gasche reviewed Jun 28, 2023

View reviewed changes

TheLortex requested a review from gasche July 7, 2023 08:19

gasche reviewed Jul 7, 2023

View reviewed changes

kayceesrk added the documentation label Jul 12, 2023

TheLortex added 3 commits July 13, 2023 13:52

manual entry for custom events

f3e39b9

adapt to review

a0a83ab

specialized / generic listeners

718a021

TheLortex force-pushed the custom-events-manual branch from 9c85323 to 718a021 Compare July 13, 2023 13:19

OlivierNicole reviewed Nov 17, 2023

View reviewed changes

sadiqj and others added 2 commits November 19, 2023 16:30

Update manual/src/cmds/runtime-tracing.etex

d584c7d

Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>

Update manual/src/cmds/runtime-tracing.etex

6285540

Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>

OlivierNicole mentioned this pull request Dec 18, 2023

manual: update runtime tracing chapter for custom events (ex #12335) #12840

Merged

dra27 added a commit that referenced this pull request Dec 22, 2023

Merge pull request #12840 from OlivierNicole/custom-events-manual

2433ab3

manual: update runtime tracing chapter for custom events (ex #12335)

dra27 added a commit that referenced this pull request Dec 22, 2023

Merge pull request #12840 from OlivierNicole/custom-events-manual

877bd14

manual: update runtime tracing chapter for custom events (ex #12335) (cherry picked from commit 2433ab3)

gasche closed this Dec 22, 2023

dra27 added a commit that referenced this pull request Dec 23, 2023

Merge pull request #12840 from OlivierNicole/custom-events-manual

b2e7c4b

manual: update runtime tracing chapter for custom events (ex #12335) (cherry picked from commit 2433ab3)

		Unknown events for user-defined types are currently ignored, even if the type is
		defined both in the producer and consumer side.

manual: update runtime tracing chapter for custom events #12335

manual: update runtime tracing chapter for custom events #12335

Conversation

TheLortex commented Jun 28, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TheLortex Jun 28, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gasche Jun 29, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TheLortex commented Jul 5, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sadiqj Nov 19, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gasche Jul 8, 2023 • edited

Choose a reason for hiding this comment

OlivierNicole Nov 17, 2023 • edited

Choose a reason for hiding this comment

OlivierNicole left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OlivierNicole Nov 17, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sadiqj commented Nov 19, 2023

gasche commented Dec 10, 2023

talex5 commented Dec 10, 2023

talex5 commented Dec 11, 2023

OlivierNicole commented Dec 11, 2023

OlivierNicole commented Dec 15, 2023

gasche commented Dec 22, 2023

TheLortex Jun 28, 2023 •

edited

gasche Jun 29, 2023 •

edited

sadiqj Nov 19, 2023 •

edited

gasche Jul 8, 2023 •

edited

OlivierNicole Nov 17, 2023 •

edited

OlivierNicole Nov 17, 2023 •

edited