
[RFC] Eventlog tracing system #9082

Merged
merged 2 commits into from
Apr 30, 2020

Conversation

abbysmal
Contributor

@abbysmal abbysmal commented Oct 30, 2019

[RFC] Eventlog tracing system

OCaml currently ships with an instrumented version of the runtime which collects counters and timings for GC events and outputs logs in a custom text format.
Eventlog is a proposed replacement framework, which preserves the existing metrics and adds support for GC trace events in a low-overhead implementation based on a standardized binary trace format (CTF).

Based on an initial design used in OCaml Multicore (inspired by GHC's ThreadScope), this proposal includes an implementation providing:

  • The eventlog tracing facility, built into the regular OCaml runtime (rather than in a separate runtime that must be linked explicitly, as the instrumented runtime is).
  • A first trace metadata definition including a set of metrics and counters preserving the feature level of the instrumented runtime.

We also provide a script to convert such traces to the catapult event format.
A catapult trace takes the form of a JSON file (see the format description)
that can then be read using Google Chrome's own trace viewer, chrome://tracing.
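For reference, a catapult document is just JSON with a `traceEvents` array; the sketch below (event names, timestamps, and ids are invented for illustration) shows the minimal shape the converter script has to produce:

```python
import json

# Minimal catapult ("Trace Event Format") document. "ts" is in
# microseconds; "ph" is the event phase: a "B"egin/"E"nd pair brackets
# one duration. All values here are illustrative, not from a real trace.
trace = {
    "traceEvents": [
        {"name": "major_gc", "ph": "B", "ts": 1000, "pid": 1, "tid": 1},
        {"name": "major_gc", "ph": "E", "ts": 1450, "pid": 1, "tid": 1},
    ]
}

print(json.dumps(trace))
```

Loading a file with this shape into chrome://tracing renders each begin/end pair as one duration bar on the timeline.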

Screenshot from chrome://tracing

A zoomed-in view on a particular major collection event

Duration stats on major collections during the programs lifetime

Overview of the trace viewer window on the zarith sandmark benchmark

Overview of the trace viewer window on the js_of_ocaml sandmark benchmark

The statistics gathered, coupled with the available tooling (chrome://tracing),
enable core OCaml and application developers to profile runtime activity: this profile can then be correlated with an application profile to get a full overview of an application's lifetime.
An overview of the garbage collector's timeline also proved very useful to the
Multicore OCaml developers, enabling them to visualize contention points, the relationships between GC phases, and collection event timings.

The implementation is further discussed in the Implementation section.

The Tooling section provides samples and screenshots of the tooling available with our implementation.

We then discuss the future of our developments on this project in the last section.

Implementation

Our eventlog port to trunk remains in spirit the same as the one found in OCaml Multicore.
This PR provides our initial implementation of the eventlog framework.

Overview

The event tracing system adds functions to feed an event buffer that is flushed
to disk once full, and at exit.

In our reference implementation, event tracing functions take an event id
defined in caml/eventlog.h, plus counter values where appropriate.
An entry is added to the buffer, and if the buffer is full, a flush is triggered.

Events are flushed in a single CTF stream.
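The buffer-then-flush logic can be sketched as follows (a toy Python model of the C implementation; class name, capacity, and event names are invented):

```python
FLUSHED = []  # stands in for the CTF stream written to disk

class EventBuffer:
    """Toy model of the runtime's event buffer: entries accumulate in
    memory and the whole batch is flushed when the buffer fills up
    (the real runtime also flushes at exit)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.events = []

    def post(self, event_id, counter=None):
        self.events.append((event_id, counter))
        if len(self.events) >= self.capacity:
            self.flush()

    def flush(self):
        FLUSHED.extend(self.events)  # serialize the batch in one go
        self.events.clear()

buf = EventBuffer(capacity=2)
buf.post("EV_MINOR")
buf.post("EV_MAJOR")  # reaches capacity: triggers a flush
print(len(FLUSHED))   # 2
```

Batching like this is what keeps the per-event cost low: the expensive write happens once per buffer, not once per event.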

The initial target was to collect enough information to match the insights
provided by the instrumented runtime.
As such, basic tooling was provided in the spirit of the helpers found
in the source distribution for use with the instrumented runtime (ocaml-instr-graph and ocaml-instr-report).
The distribution of minor/major collections over the
lifetime of the program, metrics on allocations, and timings of the various
phases within GC cycles can easily be visualized with the available tooling.

Format

The main difference from the OCaml Multicore version is that instead of emitting traces in the aforementioned
Catapult format, our current prototype uses the Common Trace Format (CTF).

We decided to work with CTF for the following reasons:

  • Performance: as a binary format, better serialization performance can easily be achieved.
  • Streaming: a CTF trace is composed of one or more streams holding the event payloads.
    This approach maps well onto OCaml Multicore (where each domain could stream its own set of events).
    The notion of a stream is also independent of the transport mechanism chosen by the implementor,
    so a tracer can communicate over the network with very low overhead
    (as a MirageOS program would do, for example).
  • Ecosystem: the CTF ecosystem comes with various visualization tools and a
    C reference implementation bundled with a simple to use Python library.
  • Preparing the ground for a streamlined tracing experience within the OCaml ecosystem:
    CTF is a straightforward and generic format. Several users reported using it
    successfully within their own projects.
    We could attempt to provide a unified tracing user experience for the OCaml system.

Our initial implementation aims to stay simple; as such, the runtime outputs these traces to files.
However, we could evolve the tracing infrastructure in the future to provide application-level control over trace distribution, or a pub-sub system available from the runtime itself.

OCaml CTF metadata

In CTF, a metadata stream is required in order to allow a CTF codec to parse
the various binary streams composing a trace.
This metadata stream is itself written using CTF's own TSDL description format.

An annotated version can be found at this address: https://github.com/Engil/gc-tracing/blob/master/metadata

Some questions are still open in our implementation:

A complete CTF trace should contain the metadata stream as well.
In its current form (and implementation), its definition is not fully static: the byte_order
metadata field depends on which platform the runtime has been executed on.
It has not been decided yet which approach should be taken to distribute the metadata file.
Proposed solutions are to have the runtime generate metadata files, or distribute them as a part of the
compiler installation.
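If the runtime were to generate the metadata file itself, the platform-dependent part is small: essentially only the byte order needs filling in at startup. A sketch of that one dynamic line (the surrounding TSDL text is elided, and the field values are illustrative):

```python
import sys

# Sketch: produce the platform-dependent byte_order line of a CTF/TSDL
# metadata file at runtime. "le"/"be" are the TSDL spellings for
# little- and big-endian.
byte_order = "le" if sys.byteorder == "little" else "be"
tsdl_fragment = f"byte_order = {byte_order};"
print(tsdl_fragment)
```

Shipping a static metadata file with the compiler installation would instead require either two variants or fixing the trace byte order regardless of platform.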

Control plane

Our implementation provides only simple controls over eventlog's behavior.

Any program built with the OCaml compiler on this branch can be traced by executing it with OCAML_EVENTLOG_ENABLED=1.

This will output the trace to a caml-eventlog-pid file in the current working directory. The user can also specify an alternative name for the file using the OCAML_EVENTLOG_FILENAME environment variable.

Two primitives are exposed by the runtime to pause and resume tracing: Gc.eventlog_pause and Gc.eventlog_resume.

Performance measurements

Sandmark was employed to measure the performance impact of eventlog on the runtime. The report can be found here.

The associated Jupyter notebook can be found as well for further analysis in this repository. A Docker image can be built for simpler setup.

Early numbers showed a mean slowdown of 1% (comparing an actively traced program against the current trunk) and a mean slowdown of 0.01% (comparing a non-actively-traced program against the current trunk).

Tooling

The instrumented runtime ships with two scripts for displaying and graphing various data points from the GC's lifetime.

We provide a set of scripts that can be used as a base to run data analysis on our implementation's traces.

ctf_to_catapult.py acts as a converter from our trace format to Chrome's catapult event format. Once converted, such traces can be loaded into Chrome's chrome://tracing.

We also provide an example Jupyter notebook (rendered PDF here).
This notebook showcases basic interactions with our implementation's trace format using the babeltrace library.

We made a few examples accessible in the gc-tracing repository. The JSON files can be loaded into Chrome and are extracted from the related traces found in the various ctf subdirectories.

The scripts are prototypes that reuse the existing ecosystem for CTF trace analysis, mainly babeltrace, the CTF reference implementation.
An OCaml binding to the babeltrace library is planned and would enable trace processing and analysis from OCaml programs.

Eventlog future timeline

Deprecating the instrumented runtime

We aim to merge the tracing runtime with a feature level at least matching the instrumented runtime's.
We believe the tracing runtime can have a sufficiently low overhead so that it
can sit in the regular runtime without overly impacting performances in programs
that do not use the feature.

Removing it would also speed up OCaml's compilation, by not having to build an extra runtime.
A more straightforward approach to statistics gathering, sitting in the main runtime, would also simplify maintenance:
while working on porting the eventlog framework to trunk we found a long-standing
inconsistency in the instrumented runtime's output (see GPR #9004).

The inclusion of the eventlog framework also adds some duplicated noise to the codebase, as it traces the same code paths as the instrumented runtime.

As such we will submit a second PR to deprecate the instrumented runtime from the OCaml distribution.

Future: user-based events, user-controlled streaming mechanisms

We plan to extend the eventlog framework further by adding support for user-defined events and extra mechanisms for trace extraction.

User defined events would allow developers to implement their own set of events from the application level. Such events could then be easily correlated with the runtime's own set of events.

We also had discussions about opening new extraction channels for eventlog's streams. This would prove useful for production systems or MirageOS applications (extracting traces via the network, a serial port, etc.).

The timeline for these features is undefined as discussions about their respective designs are still ongoing.

@gasche
Member

gasche commented Oct 30, 2019

This looks very nice, thanks! (cc @damiendoligez, @jhjourdan)

I think it could be nice to push the removal of the instrumented runtime instrumentation (I don't really care about cleaning up the build system part, etc.; the commit could remove just the runtime changes) as an extra commit. This makes it easy, looking at the diff, to get an idea of the maintainability impact of this change if we consider going the "full route" that you propose. (But it remains possible to review what is in the current PR, without the CAML_INSTR removal, by doing a diff without that new commit.)

runtime/unix.c Outdated
(int64_t)t.tv_sec * (int64_t)1000000000 +
(int64_t)t.tv_usec * (int64_t)1000;
#else
# error "No timesource available"
Contributor

On Windows, I think you can use the functions QueryPerformanceCounter + QueryPerformanceFrequency.

Contributor

Also please don't fallback on gettimeofday it can go back in time which can only mislead. On macOS use mach_absolute_time, clock_gettime is not supported. Here's code for this that supports all major platforms, IIRC the windows path has been tested at some point.

Contributor Author

Thank you very much, I will as well test the Windows codepath.
@kayceesrk this function was imported from OCaml multicore, should I backport it later?

Contributor Author

I reused the code for OSX and Windows, and it seems to work fine after testing it on both platforms.
One thing I'm still unsure about is how to handle failure, and how to handle other platforms:
for now it returns 0 on "unsupported" platforms and on failure (initialization of state on win32 and macOS).
I'm thinking the right way to do it would be to put all tracing code behind macros and be explicit that we require a proper monotonic clock for it to work.

Contributor

For now it returns 0 on "unsupported"

For that one you could at least emit a compilation warning.

I'm thinking that the right way to do it would be to have every tracing code behind macros and be explicit we request a proper monotonic clock for it to work.

Not sure I understand what you mean by that and how it solves the problem. It seems to me that returning 0 on failure is ok. No program correctness depends on that and it should make it quite clear when perusing the resulting data set you get that something went wrong.

Member

@gasche gasche left a comment

The instrumentation code looks good.

(In terms of potential performance regression, I would look at the minor-gc instrumentation. It looks like everything is on a cold path, but I haven't reviewed this part carefully.)

I think that minor improvements could be made to the API. In particular:

  • It would be nice if tracing code (in "user" modules) never had to check for caml_eventlog_enabled directly (this requires adding API support for the idea of counting things during instrumentation, to support caml_instr_alloc_jump; unfortunately the natural name "counter" is used for something else).

  • There should be an easy way to tell which functions are no-ops when tracing is not enabled (or is paused), and which are not.

For example, one option would be to guarantee that all caml_ev_* functions are no-ops, and name caml_eventlog_* the few that are not: caml_eventlog_{init,resume,pause,disable}. Remark: Using "init" rather than "setup" makes more sense when called from caml_gc_init.

I also included some questions on the choice of integer sizes, which look a bit small to me. How easy will it be to extend the wire format in the future if we want to support tracking more kinds of runtime events?

alloc array compare eventlog extern floats gc_ctrl hash intern \
interp ints io lexing md5 meta memprof obj parsing signals str sys \
callback weak finalise stacks dynlink backtrace_byt backtrace \
spacetime_byt afl bigarray
Member

(Maybe we should have one name per line to avoid this kind of noisy diffs?)

Member

I agree with @gasche. Or at least when adding things, just cut the overlong lines but do not reformat the block.

runtime/major_gc.c Outdated (resolved)
@@ -200,6 +203,8 @@ static header_t *nf_allocate (mlsize_t wo_sz)
}
prev = cur;
cur = Next_small (prev);
if (caml_eventlog_enabled)
++ caml_instr_alloc_jump;
Member

Maybe there could be another macro that avoids those naked checks to caml_eventlog_enabled in traced code?

runtime/eventlog.c Outdated (resolved)
runtime/eventlog.c Outdated (resolved)
ocaml_eventlog_filename, eventlog_startup_pid);
} else {
filename = malloc(64);
sprintf(filename, "caml-eventlog-%d", eventlog_startup_pid);
Member

see comment above on sprintf

runtime/freelist.c Outdated (resolved)
runtime/caml/eventlog.h Outdated (resolved)
uint8_t phase; /* for GC events */
uint8_t counter_kind; /* misc counter name */
uint8_t alloc_bucket; /* for alloc counters */
uint32_t count; /* for misc counters */
Member

I wonder if the size restrictions here are future-proof. Wouldn't it be safer to use int16 for phase, counter_kind and alloc_bucket? For "misc counters", can't we have above-32bit values so wouldn't int64 be safer? (Eg. memory usage counters for large programs.) You already use 128bits of header metadata, so it's not clear that int8 vs. int16 is so important there.

Member

@gasche gasche Oct 30, 2019

phase and counter_kind here refer not just to the top-level notions of phase/kind (mark, sweep, clean, etc.), but also name finer subdivisions; there is in fact a hierarchy of moments with a tree structure. It seems fairly conceivable that we would end up with more than 2^8 "counter kinds" if we extend this to support more runtime tracing.

Contributor Author

I think counters should indeed be pushed to 64 bits, but I also think the packet header could be shortened further and made more efficient (so the payload size topic is not quite irrelevant).
For instance, the pid field can be removed and put into the stream header itself, since for now we do not plan to have multiple processes/domains emitting to the same stream. (Rather, I believe it makes more sense to have a separate stream per process/domain.)

As for the phase and counter kind, I think you are right.
I'm wondering if this is the right categorization model, however; surely an ever-growing enumeration is not advisable? Maybe things could be sorted in a better way and split into several enumerations.
I added the TSDL declaration to the repository a few days ago, to make discussions on the schema a bit simpler. :)

Member

From a high-level perspective, a natural categorization I think would be hierarchical, with "intervals" as nodes in the tree and "events" as child-less leaves. But it's not clear to me whether the event format supports this, and given that the killer feature is the availability of nice visualization tools, we don't want to go for something non-standard. Flattening all interval and event names into basically a single enum, as you did for GC instrumentation, may be the way to go to fit the common format -- and then we can indeed expect an explosion of the enumeration size.

static FILE* output;
static uintnat eventlog_startup_timestamp = 0;
static uint32_t eventlog_startup_pid = 0;
static uintnat eventlog_paused = 0;
Contributor

Should this be part of the domain state structure? Eventually, I would like each domain to be able to start and stop logging independently.

Contributor Author

I think it's a good idea.
Is it too much state to add there? I don't think we'll require much more than this, assuming we stay with a one-stream-per-domain design.

Contributor Author

This was done in d4f25e7

runtime/eventlog.c Outdated (resolved)
runtime/eventlog.c Outdated (resolved)
struct event ev = eb->events[i];
ev.header.pid = eventlog_startup_pid;

fwrite(&ev.header, sizeof(struct ctf_event_header), 1, out);
Member

fwrite returns the number of items actually written, so you need to check the return value of all of these invocations to ensure that it is equal to the value you intended to write and that it is also not zero (EOF).
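The shape of the check being asked for is the standard short-write check. A Python analogue (file handling only, not the actual eventlog code; the helper name is invented):

```python
import tempfile

def checked_write(out, data: bytes) -> bool:
    """Write data, reporting failure instead of silently dropping it.
    Mirrors checking fwrite's return value in C: a short write or an
    I/O error means the caller must react (e.g. disable tracing)."""
    try:
        written = out.write(data)
    except OSError:
        return False
    return written == len(data)

with tempfile.TemporaryFile() as f:
    ok = checked_write(f, b"\x00\x01\x02\x03")
    print(ok)
```

In the runtime, a `false` result here would translate into disabling eventlog (and ideally logging a message), as discussed below in this thread.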

Contributor Author

I went ahead and checked for fwrite failures, and I'm not entirely satisfied with the solution.
I think retry mechanisms on partial writes would add too much complexity (in the current form of the implementation), and in the event of a failure eventlog should simply be disabled.
The current code structure doesn't help with finer-grained error handling; not relying on fwrite's buffering would make life easier (versus handling our own buffer and write logic).
Would at least logging a message when eventlog is forcibly disabled make this acceptable?


if (!caml_eventlog_enabled) return;
if (eventlog_paused) return;
if (!evbuf) setup_evbuf();
Member

This is where you assume that setup_evbuf always succeeds in initialising evbuf to non-NULL.

Contributor Author

It is now handled: failure to allocate the event buffer results in a caml_fatal_error call.

Contributor

I would like to avoid calling caml_fatal_error (even if only theoretically) wherever it is possible to raise an exception instead. You are allocating at the first occurring event, where you do not know whether you can raise. I see two solutions to avoid this:

  • you could call setup_evbuf at the moment eventlog is enabled. For now it is in caml_gc_init so it is still necessary to call caml_fatal_error, but I find it more sensible to fail there than during execution. But if later you wanted to allow enabling eventlog at runtime, you could raise then (out of the C stub).
  • if you really want to delay its allocation, you could delay the call to setup_evbuf to the first time eventlog is resumed, whenever it starts paused. Then it is safe to raise at that point.

Contributor Author

I don't think delaying allocation is an interesting property to keep, and it makes state a bit too fuzzy to track.
I will go with the first idea: whether eventlog starts enabled or paused, allocating only if it was explicitly enabled seems fair enough.

@sadiqj
Contributor

sadiqj commented Oct 31, 2019

With regards to the Control Plane, right now the environment variable OCAML_EVENTLOG_ENABLED enables or disables the tracing and there are functions to pause and resume the tracing.

I think it's worth considering a production use-case where we'd want tracing available if we needed to diagnose a potential issue with a running application but don't want it started initially.

Might it make sense to have three states for OCAML_EVENTLOG_ENABLED: off, enabled and enabled but paused?

The last one would require Gc.eventlog_resume to start.
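The suggested tri-state could be encoded in the variable's value; a sketch of the parsing (the accepted values here are hypothetical, not what the PR implements):

```python
import os

def eventlog_mode(environ=os.environ):
    """Hypothetical tri-state parse of OCAML_EVENTLOG_ENABLED:
    unset or "0" -> tracing off entirely;
    "1"          -> tracing active from startup;
    "p"          -> tracing infrastructure initialized but paused,
                    waiting for Gc.eventlog_resume."""
    v = environ.get("OCAML_EVENTLOG_ENABLED", "")
    if v == "1":
        return "enabled"
    if v == "p":
        return "paused"
    return "off"

print(eventlog_mode({"OCAML_EVENTLOG_ENABLED": "p"}))  # paused
```

The "paused" state matches the production use-case above: buffers and output are set up at startup, but no events are recorded until the application opts in.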

Contributor

@gadmm gadmm left a comment

I only checked for asynchronous exceptions. Overall, the scoped instruments are a nice use of the new exception-safety guarantees in the runtime, but a few cases remain to handle.

runtime/gc_ctrl.c Outdated (resolved)
runtime/gc_ctrl.c Outdated (resolved)
runtime/gc_ctrl.c Outdated (resolved)
runtime/gc_ctrl.c Outdated (resolved)
runtime/minor_gc.c Outdated (resolved)
@abbysmal
Contributor Author

abbysmal commented Nov 4, 2019

Related to @sadiqj comment about the control plane I think something needs to be addressed regarding custom path for traces (see https://github.com/ocaml/ocaml/pull/9082/files#diff-7e12f3a136d56d50b3f4e2beafb30368R75):

This feature was added as a part of our tests using sandmark, to simplify trace extraction from a large run.
I kept it as part of the PR, but I am uneasy with the idea of having (potentially, if we go the main-runtime route with no separate variant) OCaml programs able to overwrite any file they have access to with trace content. This sounds like a security issue to me. (edit: but I actually changed it to suffix the filename, so I guess it is not as harmful.)
Is there any particular way we should go about it?
My guess is either we disable the feature, or as a start we only open new files, and maybe add some form of path sanitization?
It's unclear to me how useful a custom filename for the output is, given that custom extraction mechanisms are part of the eventlog roadmap in the long run.
Another approach would be to make eventlog disabled unless the program requests it: this way there would be no accidental tampering with existing OCaml programs.
Maybe I'm overthinking this detail; the security model is a bit unclear to me, but avoiding the addition of unwanted "features" in compiled programs seems sensible enough.

Any suggestions?

{
char_os *filename = NULL;
char_os *eventlog_filename;
value tmp;
Member

I think this should be registered as root of the GC -- which probably requires using some CAMLfoo macro on the function declaration as well.

fwrite(&header, sizeof(struct ctf_stream_header), 1, output);
fflush(output);
} else {
caml_eventlog_enabled = 0;
Member

I think the API would be slightly nicer if this function returned a boolean (0 when it failed to setup the event log, 1 if the output is correctly initialized), and the caller was in charge of disabling event-logging in the failure case.

runtime/startup_aux.c Outdated (resolved)
runtime/caml/eventlog.h Outdated (resolved)
void caml_ev_alloc(uintnat size);
void caml_ev_alloc_flush(void);
void caml_ev_flush(void);
void caml_ev_disable(void);
Contributor

This header should document the locking requirements for using this API. Since I don't see any explicit locking in the implementation, my guess is that this API must only be called while holding the runtime lock (that is, do not call caml_ev_foo from within a blocking section).

Contributor Author

I added a mention of the locking requirements here: 0f07748

runtime/freelist.c Outdated (resolved)
@@ -416,6 +417,8 @@ static void mark_slice (intnat work)
INSTR (slice_fields += end - start;)
INSTR (if (size > end)
CAML_INSTR_INT ("major/mark/slice/remain", size - end);)
if (size > end)
caml_ev_counter (EV_C_MAJOR_MARK_SLICE_REMAIN, size - end);
Contributor

I do not understand this code. (I realise it's adapted from the CAML_INSTR_INT logic above, but I don't understand that either).

This event is only posted in the size > end case. However, the code just below asserts that this case never occurs (either end < size or end == size).

Contributor Author

It sounds to me like the logic is ok; maybe I'm missing something?
size > end and end < size sound equivalent in these two locations. I think the counter could be moved behind the first case to make it more palatable.

runtime/major_gc.c Outdated (resolved)
@stedolan
Contributor

Generally, this looks great. CTF looks much nicer than either catapult or threadscope.

@gasche
Member

gasche commented Nov 13, 2019

We discussed this at our OCaml developer meeting today. There is support for the idea of replacing the existing GC instrumentation with the current proposal (so: you should implement my suggestion of getting rid of Damien's code), but we feel that the "runtime code" (the supporting library) needs a thorough review by someone familiar with the C runtime style. (@stedolan may or may not volunteer himself; otherwise we need to find someone else.)

There was a discussion on whether the runtime overhead is really reasonable, or whether we do want the IFDEFs again. More analysis is needed there.

Note: there was no consensus on the idea of using this approach for arbitrary / userland OCaml program tracing, just on the GC-instrumentation part. We will need to come back to that issue later, separately.

@abbysmal
Contributor Author

@gasche Thank you very much for the follow up.
I pushed the instrumented runtime removal, as well as many tweaks and adding some missing data points that were collected in the instrumented runtime.
I also changed a few of the naked checks you mentioned through the use of a macro. (defined in eventlog.h)
I have a few more issues to address that were highlighted by the previous reviews, I hope to get it to a decent shape in the coming days, sorry for the commit "spam".
(also the instrumented runtime removal isn't complete yet, it's missing the right logic to handle clock_gettime in the configure step, I'll also take care of this soon)

@abbysmal
Contributor Author

* It would be nice if tracing code (in "user" modules) never had to check for `caml_eventlog_enabled` directly (this requires adding API support for the idea of counting things during instrumentation, to support `caml_instr_alloc_jump`; unfortunately the natural name "counter" is used for something else).

I attempted a broader approach which hides the eventlog state from the user through a CAML_EVENTLOG macro. One good point is that it allows reusing the current ev_count API without adding extra logic inside it, while exposing a means to perform side computations (only when eventlog is on, as is done in major_gc for the slice_pointers and slice_fields counters) without expecting the user to have full knowledge of the internals of the tracing system.

I think the alternative approach of exposing a better counter API would require reworking the evbuf logic and adding a segmentation of the contained events. To enable this, "updating" counters would be stored as values that may be updated by a caller until they reach their final value; this way the bookkeeping is done by eventlog. The alloc_jump instrumentation code, for instance, would require a hypothetical API allowing one to: (1) update a counter value (inside a lookup table/array, or hidden as a static variable), and (2) "flush" it to the event queue (and reset the counter).
This way the bookkeeping can be done internally, at the cost of an extra layer of buffering for some counters and maybe a confusing API, such as the distinction between these updating counters and the regular, one-off counters.

Do you think this is reasonable @gasche ? The evbuf logic needs rework (write failure scenarios are hard to handle right now), so this part may move in the future.
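The guard-macro idea can be illustrated with a Python analogue (all names here are illustrative; in C this would be a `CAML_EVENTLOG(...)` macro expanding to a flag check):

```python
# Analogue of a CAML_EVENTLOG-style guard: the traced side computation
# is only evaluated when tracing is enabled, so instrumented code never
# tests the flag directly.
caml_eventlog_enabled = False
alloc_jump = 0

def ev_guarded(thunk):
    """Run a side computation only when tracing is on."""
    if caml_eventlog_enabled:
        thunk()

def count_jump():
    global alloc_jump
    alloc_jump += 1

ev_guarded(count_jump)
print(alloc_jump)  # 0: tracing is disabled, so the counter never ran
```

The point of the pattern is that the cost when tracing is off is a single branch, and the flag check lives in one place rather than being scattered through the GC code.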

@abbysmal
Contributor Author

Hi,

Thank you for your feedback, I tried to address the issues that were brought up during review; let me know if there are more improvements to be made.
I have two points I still need to address: clock_gettime/autoconf tweaks and the distribution of the metadata file.
I believe these points should be addressed once a decision has been made on where the eventlog framework should sit in the OCaml distribution.

I will update the performance report attached to this PR tomorrow.
It may act as a baseline for investigating the performance profile of these changes.
I will work from this profile to look for potential regressions, and would be glad to hear suggestions about particular areas of concern within the runtime I should especially look at.

@abbysmal
Contributor Author

abbysmal commented Jan 7, 2020

Hello,

Sorry for the time elapsed since the last update on this PR.

After meeting with the team at Cambium in early December (with @Octachron @damiendoligez @shindere and @fpottier) and in light of some numbers gathered on the performance impact of this PR in some real-world programs (alt-ergo, js_of_ocaml), it was suggested that we incubate this feature in a separate runtime.
Performance improvements could be made further down the road, and the performance profile of the eventlog facilities is likely to change as more Multicore OCaml features land. If added, application-level tracing may also introduce changes that heavily impact the performance profile of eventlog.
As such, we found it reasonable to incubate the eventlog code in its own runtime, in a similar fashion to the instrumented runtime.

I landed a few commits recently isolating the code in its own runtime. I took the liberty of suffixing the runtime with e (debug being d and instrumented being i). The i could have been reused, but I thought different features meant a different user-facing API; if anyone feels differently about it, please tell me.
I also added the installation of the CTF metadata file in the lib/ocaml directory, with some sed to adjust the endianness so that it matches the current platform.
I am unsure whether my approach is the preferable one, so let me know if anything feels off with this.

The only change missing for the eventlog runtime to be complete is adjusting the configure step to search for clock_gettime (as the instrumented runtime was doing); since the macOS and Windows platforms use a different set of functions, I will need to revisit it.
Likely the simplest approach would be to leave this outside of autoconf checks and have the runtime default to a dummy caml_time_counter function returning 0, associated with a preprocessor warning. Eventlog would then build nicely on every platform, but would generate barely usable traces (counters stuck at zero would not introduce bad behaviors, however).

@shindere
Contributor

shindere commented Feb 17, 2020 via email

tools/Makefile Outdated
# Eventlog metadata file

eventlog_metadata: $(EVENTLOG_METADATA)
cat eventlog_metadata.in | sed "s/@endianness/$(EVENTLOG_ENDIANNESS)/g" > eventlog_metadata
Contributor


Remove unnecessary use of cat. You can use sed "s/@endianness/$(EVENTLOG_ENDIANNESS)/g" eventlog_metadata.in > eventlog_metadata instead?

@abbysmal
Contributor Author

abbysmal commented Apr 3, 2020

Hello,

I spent a bit of time cleaning up the PR and rebasing it on trunk.

I also exchanged with @shindere and @dra27 to get the eventlog runtime ready (thank you very much!).
Time was spent integrating the runtime properly into the build steps, as well as adding a simple test case to the testsuite to verify basic operation of the eventlog runtime.
We have one outstanding issue running the test under Windows, which we will attempt to fix next week; otherwise I think the PR is ready for more reviews. (The CIs seem to only have issues with the aforementioned problem and the Changes check.)

I think @gasche specifically wanted another serious pass at reviewing the runtime code, although last time it wasn't yet a separate runtime.
As this new runtime is meant to replace the instrumented runtime currently in trunk (as discussed in December), maybe @damiendoligez wants to take a look at it and check that no piece that should be removed is left behind?

Some more details remain to be discussed (mostly configure-time decision making, and naming), but I think this can wait until the runtime part has been validated and considered sane enough.

@shindere
Contributor

shindere commented Apr 6, 2020 via email

@kayceesrk
Contributor

I would prefer keeping the name instrumented_runtime, as it is more meaningful than eventlog_runtime; instrumented tells me that I should expect some overhead due to instrumentation, while eventlog doesn't offer any such clue. Also, I suspect that there are not many users of the instrumented runtime, so anyone caught unaware may only be pleasantly surprised by the additional information that eventlog offers.

@shindere
Contributor

shindere commented Apr 6, 2020 via email

@abbysmal
Contributor Author

abbysmal commented Apr 6, 2020

I think it is fine to keep the instrumented runtime name then, since that is the most popular opinion.
My thinking was that since the generated format was different, it was better to change the name accordingly, but the feature itself is perhaps not widespread enough to warrant a new word in the build terminology. :)
I will revert the name to "instrumented".

@shindere
Contributor

shindere commented Apr 6, 2020 via email

@abbysmal abbysmal force-pushed the eventlog_ctf branch 2 times, most recently from 4d379a7 to 7eef707 Compare April 7, 2020 06:55
@abbysmal
Contributor Author

abbysmal commented Apr 7, 2020

Hello,

Great, thanks! Please don't hesitate to leave a note here when you're done, so that we know we can start reviewing.

As discussed yesterday, I reverted the name of the runtime to "instrumented". Changes were squashed for a proper history.
I also pushed a first draft of the Changes file.

Have a great day.

@dra27
Member

dra27 commented Apr 30, 2020

I'm working on this at the moment - sorry, I thought this had been through precheck.

The filename isn't known, so I'm not sure how check output actions could be used?

@shindere
Contributor

Well, it feels to me that one reason why the hook is complicated is that
it cannot fully predict what the name of the eventlog file will be,
because the pid of the program gets added to the filename specified in
the OCAML_EVENTLOG_FILE variable.

I'm sorry I didn't catch it earlier, but to me there is something
wrong with the design here.

More precisely, the user has no control over the .PID.eventlog part, and that
feels wrong. I'd prefer to be in control of the whole filename rather than
only its prefix.

In addition, given that this suffix is enforced, I find the OCAML_EVENTLOG_FILE
name misleading, because it suggests that you have control over
the whole filename.

We can wait to see what other developers think, but I really think
something should be clarified / improved here.

If you didn't add the suffix, you would not even need snprintf_os.
It feels like it would simplify both the code in the runtime and the code in
the test, and give the user more control.

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@dra27
Member

dra27 commented Apr 30, 2020

Test hopefully fixed in 42b8c20 and 4a54886

@abbysmal
Contributor Author

Thank you @dra27 !

See my previous comment. I think it's bad that we can't know the name of the file. I believe this should be modified to be under the user's full control.

While I agree that it is indeed not optimal not to know the file name, and that this may lead to a few hacks to be able to properly test it, I don't think we should hand full control over to the user.

For now the instrumented runtime indeed only outputs one trace file. However, the Common Trace Format was chosen partly because it allows defining traces from multiple streams (and possibly multiple files or data points). When Multicore domains are available, they could each output their own trace file, which can then be demuxed by the CTF codec.
I think this versatility is an interesting feature, and I don't see an easy, non-hacky way to give the user full control over the names of domain-generated trace files.

I think the current behavior is also a pretty common one among software generating log files (embedding timestamps, hashes or pids in filenames), and it avoids inadvertent overwriting, or fights over a trace filename.

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@dra27
Member

dra27 commented Apr 30, 2020

@shindere - indeed, I'm not disputing that discussion can and should continue, just fixing trunk/4.11 CI in the soonest possible time.

@dra27
Member

dra27 commented Apr 30, 2020

The behaviour of the log file feels perfectly natural to me (and believe me, I didn't spend several hours fighting with Cygwin to get the test to detect the file without considering whether changing the filename might be easier!).

I agree that perhaps the name of the environment variable could be altered?

@abbysmal
Contributor Author

Indeed, the name I picked was perhaps not the wisest option and was misleading; I'm sorry about this.
Maybe it could be replaced by OCAML_EVENTLOG_PREFIX?

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@dra27
Member

dra27 commented Apr 30, 2020

The problem is that to me it sounds more like the [ocaml] bit of messages, for example! I guess it could be OCAML_EVENTLOG_FILE_PREFIX, but for a feature like this I'm not sure that it matters: a user isn't going to set this randomly; it'll be because they've read how to use the instrumented runtime and set things up for it (i.e. it's not a "natural" part of the system to come and explore without first finding out how it works).

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@dra27
Member

dra27 commented Apr 30, 2020

_ocamltest directories are not always cleared on Windows; that's why I added the rm of previous files. The test you propose is not simpler, it is different. At the moment the test, as portably as possible (given the near impossibility of determining the PID of a native Windows program executed from a Cygwin shell script), ensures that exactly one event log file is produced by the instrumented runtime whose filename includes something that looks like a PID, which is what the previous version (which could rely on $pid being correct) did.

Given that it now works, I think further alterations to the test should be about improving what it tests, not polishing how it tests...

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@dra27
Member

dra27 commented Apr 30, 2020

Normally they're deleted, yes; but not, for example, when you're working on the test and have a terminal in the directory, or other (log) files open. There was certainly an instance I saw of the previous files still being there, so given that the test checks exactly for the presence of a new file, it defensively ensures that they are not there in the first place.

@dra27
Member

dra27 commented Apr 30, 2020

(Amusingly, this is an example of why ocamltest should probably use a randomly-named directory and update a symbolic link so that _ocamltest always points to the last executed test, but that's a tweak I haven't got around to doing... you can't ever guarantee deleting a directory on Windows.)

@shindere
Contributor

shindere commented Apr 30, 2020 via email

@gasche
Member

gasche commented May 18, 2020

Reading back the original post of this PR (to review the #9541 documentation PR, on which more eyes are welcome), I notice that it suggested a "mean overhead of 1%". However, later during the review, people estimated that the overhead would be too high, and decided to move the instrumentation to a separate runtime.

Does this mean that the overhead is more than 1%? I would like to understand this better:

  • What is our current estimate of the mean overhead of running with the instrumented runtime, when instrumentation is active and when it is paused?
  • If the original prediction was inaccurate, do we know what caused the inaccuracy? Was it a problem in the data analysis, were the benchmarks not predictive enough, or is it a problem with computing the mean overhead rather than some other metric where orthogonal microbenchmarks do not skew the results?

@abbysmal
Contributor Author

@gasche

Thank you very much for your review.
To answer your questions: indeed, the overhead can be higher than 1%.
However, the overhead varies greatly depending on the allocation profile of the program.
Using sandmark, trunk was compared to an eventlog branch with tracing both off and on.

To note, a few trace points were added to the runtime during this PR, some in a particularly hot loop within the major GC code.
Profiling some problematic programs indeed shows some regression in the execution time of a few GC phases, and this also correlates with the addition of trace points during the development process. (I didn't do a proper bisect, though; I just eye-balled a local branch to assess when the problematic counters were added, to try to rebuild the timeline.)

The methodology applied to check for regressions during further development (after opening this PR) was to check against well-known "worst offenders". However, with the addition of further counters (imported from what the instrumented runtime traced), the performance profile of eventlog changed, and with it, the results.
I reported these numbers during my meeting with the Cambium team in December.

As for the actual numbers, I have a few profiles extracted from various sandmark runs.
Numbers that are reliably high (thus excluding overly short-running, less GC-intense programs, and programs that were hand-checked and thought to be victims of profiling "noise") can be found in a few test cases including real-world OCaml programs (js_of_ocaml, for instance, exhibits a 5.7% slowdown with tracing off compared to its execution on the vanilla runtime).
It is, however, hard to give an accurate estimate of the numbers a typical user would see, as they depend on how the runtime executes.
I think an accurate representation of the problem is that the overhead may be as high as 5.7% (in the js_of_ocaml benchmark in sandmark, compiling frama-c) with tracing off.

As eventlog is further developed (with additions to what is traced), more work will be done on understanding and improving its performance profile; this was the main motivation behind pushing it as a separate runtime.

I am sorry not to have made more visible updates on the performance issues encountered along the way, outside of the December meeting.
Should I update the claims in the first post?

@gasche
Member

gasche commented Jun 2, 2020

@Engil proposed documentation, as promised, as a separate PR: #9541. For now I have been the only one doing the work of reading the proposal and giving feedback. It would be nice if other people who pushed for the implementation part of the feature (@dra27?) could also help review the documentation.

@shindere
Contributor

shindere commented Jun 2, 2020 via email

@gasche
Member

gasche commented Jun 2, 2020

Just to clarify: it is ready to review now.

@shindere
Contributor

shindere commented Jun 2, 2020 via email

@abbysmal
Contributor Author

abbysmal commented Jun 3, 2020

Gabriel Scherer (2020/06/02 08:38 -0700):
Just to clarify: it is ready to review now.
Well, my opinion is that it would be better to review it once the code that goes with the documentation has been included, too.

I just pushed the changes onto the documentation branch.

@shindere
Contributor

shindere commented Jun 4, 2020 via email
