Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manual: update runtime tracing chapter for custom events #12335

Closed
wants to merge 5 commits into from

Conversation

TheLortex
Copy link
Contributor

This PR extends the runtime tracing section to account for custom events introduced in #11474

It should be included in the 5.1 release

@TheLortex TheLortex changed the title manual: add entry for custom events manual: update runtime tracing chapter for custom events Jun 28, 2023
manual/src/cmds/runtime-tracing.etex Outdated Show resolved Hide resolved
the same set of registered events. The event name is used by the consumer to
identify producer's events, with a fallback mechanism for unknown events of
built-in types. Unknown events for user-defined types are currently ignored,
even if the type is defined both in the producer and consumer side.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You explain "runtime probes" and "custom probes", but in fact most of the explanation of "custom probes" is about what "custom events" are. My feeling is that "events" and "probes" are two different concepts that could be explained separately: events have a semantic and a payload, and probes, actually I am not sure what they are (the Runtime_events module documentation does not contain the word "probe"), but they seem to be related to both emission and reception of events -- or at least both are discussed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Basically I see probes as "the piece of code that emit events". There are probes in the runtime that emit runtime events. Using the custom events library, users can set up additional probes in their code base that emit custom events.

I'll come up with better wording.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (not directly related to the PR): to me the word "probe" suggests a tool to observe some information, so it is more associated to a "receiver" than to a "producer" in my mind. The usage of the word "probe" in a similar context that I know of are the probes in the Linux kernel ( https://docs.kernel.org/trace/kprobes.html ), and their meaning there is fairly different from "the piece of code that emit events" (the probe is more like the part that registers a listener for a given event, the "probed instructrion" is in charge of emitting the event).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other precedent for the use of the term 'probe' is the dtrace, which supports both dynamic and static probe points. https://illumos.org/books/dtrace/chp-usdt.html (certificate expired, but content still valid)

Copy link
Contributor Author

@TheLortex TheLortex Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see yes, my context for the definition of "probe" is:

  • https://github.com/janestreet/ocaml-probes: "Some examples of what a probe handler can do: write data into a ring buffer or similar for transmission to a another tool that can analyze or visualize it.", a 'probe handler' being "If a probe is enabled, whenever the execution reaches it, the corresponding handler is evaluated."
  • the current usage of probe in the manual: "The full set of events emitted by probes [...]" "The events transport part of the system ingests events emitted by the probes
    and makes them available to consumers."

Both making me think of them as event producers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(cc @sadiqj if you have opinions on how "probes" should be defined, and how they relate to the documentation the Runtime_events module itself.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're into electronics, a probe is part of the measurement device (oscilloscope, etc). The part of the printed circuit board where you can attach a probe is called a test point, or point de piquage in French. What about "tap" as a source of events? Or just "event source" for additional clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have a strong opinion either way on terminology. I used to refer to the CAML_EV macros as probes because there wasn't an explicit way of actually attaching to those, if runtime_events was enabled all probes produced events that went in to the ringbuffer - you didn't really attach to individual ones directly.

Copy link
Member

@gasche gasche Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case I think that the discussion in this paragraph mixes too many things: (1) what is the information carried by those custom events, (2) how to emit them, (3) how to listen for them. I think that it would be natural to discuss at least (1) strictly before (2), (3), which could be discussed either together or separately.

(I think it would also be nice to understand what terminology we want to use, but this is orthogonal. I do find it weird that currently the terminology in the manual chapter is distinct from the terminology in the .mli documentation.)

manual/src/cmds/runtime-tracing.etex Outdated Show resolved Hide resolved
manual/src/cmds/runtime-tracing.etex Outdated Show resolved Hide resolved
manual/src/cmds/runtime-tracing.etex Show resolved Hide resolved
match Runtime_events.User.tag event, value with
| CustomSpan, Runtime_events.Type.Begin -> Printf.printf "> count begin\n"
| CustomSpan, End -> Printf.printf "< count end\n"
| _ -> ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, there is no enforced relation between the name of the even (CustomSpan, CustomInt) and the payload it carries: the producer could emit both span and int events for CustomSpan, and the customer could receive them. The fact that nothing prevents us to have CustomSpan as a tag for non-span events may or may not be an API defect depending on how we think of it.

Questions:

  • Would it be interesting to enforce that an event declared with a span payload is only ever emitted with a span payload? This could be done for example by having separate extensible variants for int-events and span-events, or by making the extensible tag a GADT indexed over its intended payload.
  • Or, on the countrary, is it intentional that a given event tag can be used with several payload types? In this case, should we consider rewriting this example to have a single Custom event tag, and use it for both span and int payloads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. Tags are currently used to provide additional semantics to a type, so in that sense they should be an extensible GADT to make sure they are properly used.

The question I'm wondering about is that custom types can also be used to refine semantics. For example if I write a web server and want to log response latencies, should I use the built-in int type so that it can be consumed by generic consumers, or use a custom int type used specifically for latencies. I see that there is a hierarchy of semantics: my library web request latencies < web request latencies < latencies < number. Using types and tags we're basically exposing two semantics for the int value: a broad one (type) and a precise one (tag).
As a consequence users might not have a use for tags if they come up with sufficiently precise custom types, and in that case it is a burden to extend the GADT with a tag for each custom type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I don't really understand the discussion on custom types, but I will try to have another go on my next read of the proposed chapter.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheLortex’s point is not very clear to me either, but if I’m not mistaken this a discussion of the (already merged) API. It doesn’t mean that it isn’t important, but it seems to me that it shouldn’t be blocking for the present PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I think we should have respected the guideline of not merging significant new changes without updating the manual at the same time, so that we would be blocking both the API and the documentation PR. People would be more motivated to have a good API and a good documentation that way. Now instead we are in a situation where we want to merge documentation that we don't really understand for a feature that already shipped with a worse API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies for taking so long to come back to this PR.

The type of an event is what data it contains. We have three default ones, unit, span begin/end and int which correspond to an instantaneous event, the beginning of an event, the end of an event and finally the measurement of something at a point in time. There's also custom types, for which you can provide an encoder/decoder. The intention behind having the default ones is that generic consumers (e.g an opentelemetry or logging library) can do something sensible with the ones they know about.

A good example of a custom type might be a Histogram type. You don't want to register a new type for each Histogram you track in your program, you only do it once and then use that type to create a User.t with a name and tag. An example here might be 200_response_times or something similar. You could imagine having a Histogram for each response code timing.

One complexity is that runtime events is designed so you can consume in- or out-of-process. This is why there's a name and a tag. The name is carried with the event onwards but the tag is not. Having the two separate is a bit of a pain, there may be something that could be done there.

The design was essentially that the most common use cases were simple (units, spans and ints) to get most of the benefits but custom types could give you a lot more freedom but at the cost of a bit more complicated usage.

Eio's use of custom events for tracing is a good example of how the API is currently used: https://github.com/ocaml-multicore/eio/pull/635/files

People would be more motivated to have a good API and a good documentation that way.

There's a danger that if we insist on getting things perfect the first time, it won't bring forth good contributions. It's a balancing act and with runtime events I recall part of the argument of moving it to otherlibs was the expectation that it wasn't a mature design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a danger that if we insist on getting things perfect the first time, it won't bring forth good contributions. It's a balancing act and with runtime events I recall part of the argument of moving it to otherlibs was the expectation that it wasn't a mature design.

In my experience, the act of writing documentation often reveals improvements to the API -- it is a way to test a design, orthogonal to its use to write code. (I would say that usage examples are good at highlighting weak spots and missing features, while documentation is good at suggesting opportunities to simplify concepts or improve consistency.) I didn't mean to suggest that the code should be perfect in an absolute sense, but that it should have benefited from this first round of improvement before it reaches users of an upstream release.

@TheLortex
Copy link
Contributor Author

Thank you for your reviews. I have updated the PR address requested changes.

@TheLortex TheLortex requested a review from gasche July 7, 2023 08:19

An extensible variant \texttt{tag} provides a way to attach additional data to
events. The tag is not serialized, so it is only available for \textit{known}
events.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what the last sentence means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the consumer receives an event that it doesn't know about, only the name and payload content can be retrieved, not the tag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also struggled for a while to understand this paragraph. I was filled with confusion with questions like: what information can a tag carry? What purpose do they serve if they are not serialized? If they are not serialized, how can the consumer retrieve them? Etc.

The light appeared when I read the .mli:

  (** The type for a user event tag. Tags are used to discriminate between
      user events of the same type *)

So, in summary, I wonder if an alternative formulation of this paragraph would help eliminate the incomprehensions that I’ve had:

The function \texttt{Runtime_events.User.register}, in addition to the event
name and type, also takes an extensible variant \texttt{tag}. Due to the way
serialization works, subscribing to user events is per-type (i.e.,  you will
receive all registered events of a given type) and one must use tags to
discriminate between events of the same type. To establish the link between
the name \texttt{name} and the tag \texttt{tag}, one must call
\texttt{Runtime_events.User.register name tag typ}—in the producer, but
also in the consumer, if the consumer needs to be able to discriminate on
the tag.

or something like that. (I hope this is correct.)

Copy link
Contributor

@sadiqj sadiqj Nov 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ring buffer events are just an index in to a name string table and a bunch of bytes. The Runtime_event.Type is how we turn those bytes back in to some data we can do something with.

A tag is essentially a way of you associating an OCaml type with the name, so you don't have to use the string name everywhere. If the producer and consumer both know about a custom event custom type, they can register the same tag. That's not necessary though, if you register a callback for a custom type but don't register tags, you should still get the events - they'll just identified by a name and not have a tag. For the common use case of taking these events and recording them in to a log file or database somewhere, that's sufficient.

Edit: wading through the source again, this is not true. If a User Custom event isn't registered then we drop them even if there's an active callback for that type. That's because ultimately all we know is the name and data - we don't have a way of identifying the type of the event (beyond the built in ones). Even if we did have an identifier for each custom types, how would we even refer to it?

\item event is registered: payload and tag are available.
\item event is not registered and has a built-in event type (unit, int, span):
only the payload is available.
\item event is not registered and has a custom event type: event is dropped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a difference in behavior depending on the type of the payload? Wouldn't it make sense to drop all unregistered events?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a difference in behavior depending on the type of the payload?

Due to how the event transport protocol is designed, the consumer can decode event payloads for built-in types only by using the event type ID field in the header. It is not currently possible for custom types, as the consumer has first to identify the event in order to obtain the decoder function.

Wouldn't it make sense to drop all unregistered events?

There is value in supporting unregistered events of built-in types. runtime_events_tools's olly trace command was extended to integrate both GC events and unregistered events of the span type in a single trace (tarides/runtime_events_tools@5865899)

To me the other way around makes more sense: lift the limitation of unregistered events of custom types being dropped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is value in supporting unregistered events of built-in types. [We added support in olly trace.]

Can you spell this out for me, what is it useful for in practice?

Bonus question: Is it possible for an event to have several pieces of information as the payload, some custom and some built-in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is it useful for in practice?

For example, I was trying to understand performance issues in a multicore eio program. Using olly and runtime events I figured out that on major GC, the domains had a hard time synchronizing (one domain was waiting up to 200ms) but I couldn't figure out why. By adding custom events of the span types in eio I was able to see that time was spent in a system call that was not marked as a blocking section.

screenshot_2023-02-15_at_15-20-23_eeww trace__1_mb__-_perfetto_ui

olly trace doesn't need to know about eio for that purpose. In fact, any library using the span event could appear in the trace. To me, obtaining events from many sources in a single trace is very useful.

Bonus question: Is it possible for an event to have several pieces of information as the payload, some custom and some built-in?

Currently, that's not possible. One would have to create multiple events for that purpose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and if I remember correctly, this is the offending C binding: ocaml-multicore/ocaml-uring#86. Very hard to spot this bug without this tracing.)

Copy link
Member

@gasche gasche Jul 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice example! This is along the lines of: some "specialized listeners" are only looking for specific events they know about, but some other "generic listeners" are trying to improve observability by reporting on all events, possibly in a degraded way.

Two thoughts on this:

  1. It could be nice to suggest this situation in the manual (something like "Some generic listeners are used to visualize the trace of OCaml programs, and they require no modification / extra registration logic to show your events [of standard payload].")
  2. This example suggests that maybe dropping events with custom payload, instead of a best-effort approach of showing the event but nothing about the payload, is not the best approach. At least, in the current state of things, readers of the manual could be encouraged to systematically pair their custom-payload events with approximations of those with standard types that can be displayed, so that they can observe something with generic trace tools.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the paragraph that’s been added below the one discussed here is a good implementation of your suggestion 1.

manual/src/cmds/runtime-tracing.etex Outdated Show resolved Hide resolved
Copy link
Contributor

@OlivierNicole OlivierNicole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the proposed additions. I agree with most of @gasche’s comments. Nonetheless, it seems to me that the PR is a few minor tweaks away from addressing all of the reviewers’ concerns.

manual/src/cmds/runtime-tracing.etex Outdated Show resolved Hide resolved
Comment on lines +85 to +86
Unknown events for user-defined types are currently ignored, even if the type is
defined both in the producer and consumer side.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does “unknown” mean “unregistered” here? If so, I think it would be clearer to stick to the same word.


An extensible variant \texttt{tag} provides a way to attach additional data to
events. The tag is not serialized, so it is only available for \textit{known}
events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also struggled for a while to understand this paragraph. I was filled with confusion with questions like: what information can a tag carry? What purpose do they serve if they are not serialized? If they are not serialized, how can the consumer retrieve them? Etc.

The light appeared when I read the .mli:

  (** The type for a user event tag. Tags are used to discriminate between
      user events of the same type *)

So, in summary, I wonder if an alternative formulation of this paragraph would help eliminate the incomprehensions that I’ve had:

The function \texttt{Runtime_events.User.register}, in addition to the event
name and type, also takes an extensible variant \texttt{tag}. Due to the way
serialization works, subscribing to user events is per-type (i.e.,  you will
receive all registered events of a given type) and one must use tags to
discriminate between events of the same type. To establish the link between
the name \texttt{name} and the tag \texttt{tag}, one must call
\texttt{Runtime_events.User.register name tag typ}—in the producer, but
also in the consumer, if the consumer needs to be able to discriminate on
the tag.

or something like that. (I hope this is correct.)

manual/src/cmds/runtime-tracing.etex Outdated Show resolved Hide resolved
\item event is registered: payload and tag are available.
\item event is not registered and has a built-in event type (unit, int, span):
only the payload is available.
\item event is not registered and has a custom event type: event is dropped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the paragraph that’s been added below the one discussed here is a good implementation of your suggestion 1.

match Runtime_events.User.tag event, value with
| CustomSpan, Runtime_events.Type.Begin -> Printf.printf "> count begin\n"
| CustomSpan, End -> Printf.printf "< count end\n"
| _ -> ()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheLortex’s point is not very clear to me either, but if I’m not mistaken this a discussion of the (already merged) API. It doesn’t mean that it isn’t important, but it seems to me that it shouldn’t be blocking for the present PR.

sadiqj and others added 2 commits November 19, 2023 16:30
Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>
Co-authored-by: Olivier Nicole <olivier.github.com@chnik.fr>
@sadiqj
Copy link
Contributor

sadiqj commented Nov 19, 2023

So I think one of the sources of confusion with the API at the moment is that the manual mixes the types of Runtime Events and the tags/names. Is it worth us separating these out, starting with just unit, span and int to simplify things and then pushing the custom types to later on in a more advanced usage section? (This mirrors how we'd hope people would use this, the majority of users should use the built-in types because that means generic consumers work)

@gasche
Copy link
Member

gasche commented Dec 10, 2023

I am unhappy with the state of this PR. Documentation should be written at the same time as the feature, not as an afterthought, otherwise there are few incentives to do a good job in a timely manner and we end up with undocumented software.

I would like to get to a state where there is documentation of runtime events including custom events that people (other than the implementor of the feature) can understand -- hopefully with a couple improvements to the implementation itself if they are suggested by the documentation process.

If we cannot get to this point in a reasonable timeframe, I will propose to remove custom events from 5.2, to ensure that there are proper incentives in place to have a documented feature.

@talex5
Copy link
Contributor

talex5 commented Dec 10, 2023

I would like to get to a state where there is documentation of runtime events including custom events that people (other than the implementor of the feature) can understand [...] I will propose to remove custom events

I wasn't aware of this documentation, but was able to figure things out easily enough by reading https://v2.ocaml.org/api/Runtime_events.html and the olly source code. It would be very inconvenient for Eio if custom events were removed now.

My understanding is that an event written to the ring is just a time-stamp, a name (a string) and a payload (bytes). The name is registered first so that it can be referred to using an index into a table to save space. It's not completely clear to me that the Runtime_events module needs to convert between OCaml types and bytes itself, since the user of the library can do it themselves easily enough. That would also allow generic tools to record all information in the ring, without having to understand it.

But it would make sense to me to merge this documentation about the current API first and then decide whether to change it in a separate PR.

@talex5
Copy link
Contributor

talex5 commented Dec 11, 2023

One thing that it would be good to clarify in the docs is thread-safety. For example, does the decoder get a copy of the ring buffer or the buffer itself? If it is the buffer itself, what happens if a new item overwrites it while the decoder is running? Likewise, what happens if two sys-threads write to the buffer at the same time (or a GC event is produced by the encoder running)?

@OlivierNicole
Copy link
Contributor

So I think one of the sources of confusion with the API at the moment is that the manual mixes the types of Runtime Events and the tags/names. Is it worth us separating these out, starting with just unit, span and int to simplify things and then pushing the custom types to later on in a more advanced usage section? (This mirrors how we'd hope people would use this, the majority of users should use the built-in types because that means generic consumers work)

@sadiqj Personally, yes, I think it would have helped me understand better the first time. Along with that, I think that a brief description of what is actually serialized, as sketched by you and @talex5, would help understand why events need to be registered in this way.

@OlivierNicole
Copy link
Contributor

In private discussions with both @gasche and @sadiqj it was agreed that I would take over the remaining work, notably the need for clarification advocated by @gasche. I don’t know how that works concretely, however: I don’t think I can push commits onto this PR.

dra27 added a commit that referenced this pull request Dec 22, 2023
 manual: update runtime tracing chapter for custom events (ex #12335)
dra27 added a commit that referenced this pull request Dec 22, 2023
 manual: update runtime tracing chapter for custom events (ex #12335)

(cherry picked from commit 2433ab3)
@gasche
Copy link
Member

gasche commented Dec 22, 2023

This was subsumed by #12840 which was just merged. Thanks a lot everyone!

@gasche gasche closed this Dec 22, 2023
dra27 added a commit that referenced this pull request Dec 23, 2023
 manual: update runtime tracing chapter for custom events (ex #12335)

(cherry picked from commit 2433ab3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants