Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patch to use frame pointer to profile code with Linux perf tools #5721

Closed
vicuna opened this Issue Aug 6, 2012 · 20 comments

Comments

Projects
None yet
2 participants
@vicuna
Copy link
Collaborator

vicuna commented Aug 6, 2012

Original bug ID: 5721
Reporter: @lefessan
Assigned to: @lefessan
Status: closed (set by @xavierleroy on 2015-12-11T18:23:49Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 4.00.0
Fixed in version: 4.01.0+dev
Category: back end (clambda to assembly)
Related to: #5723
Monitored by: @bobzhang meurer @rixed @diml @ygrek Camarade_Tux @jberdine @hcarty smimram @yakobowski

Bug description

Here is a patch to allow profiling amd64 OCaml native code with modern Linux performance tools ("perf record/report"), based on kernel support of processor counters.

Since the kernel is unable to use DWARF information stored in the program, stack unwinding is currently only based on C-style frames with frame pointers (%rbp saved on entry, and restored on leave), and this is not going to change in the near future (Linus becomes angry as soon as "dwarf" is pronounced somewhere).

This patch modifies the amd64 code generator to implement C-style frames, with the frame pointer in %rbp. This behavior becomes the default, unless a new option '-fomit-frame-pointer' is provided.

The implementation is based on decreasing by one the number of available integer registers, and moving %rbp two positions later (after %r12 and %13), as the last available integer register, available only if -fomit-frame-pointer is provided. %rbp is also saved on entry, and restored on leave from every function, and replaced by %r13 as an argument register.

On the "standard" compcert short benchmark (rm -f lib/Integers.vo; time make lib/Integers.vo), the performance penalty is less than 5%.

File attachments

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @ygrek

and this is not going to change in the near future

There have been recent attempts at dwarf unwinding kernel-recorded stack in userspace, e.g. http://thread.gmane.org/gmane.linux.kernel/1331332 (previous attempt at https://groups.google.com/forum/#!topic/linux.kernel/MDZE766vq7E)

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @lefessan

Yes, the "dwarf unwind in user space" patch has been around for a few months, but it is only able to unwind the portion of the stack saved by the kernel in the perf event. It is expensive (you need to copy all the data instead of just the return addresses, for every event) and you only extract the top of the stack, which might not be enough in some cases. Anyway, it might take a long time before having this available in a stable linux kernel in a standard distribution.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @xavierleroy

I'm very much interested in good performance monitoring tools for Linux. Any pointer to some documentation I could read to learn about this new "perf" tool? And why it needs the kernel to walk the call stack?

However: if Linus gets angry when "dwarf" is mentioned, I get angry when "frame pointer" is mentioned. The %ebp/%rbp business in the x86 code generation conventions is a terrible waste, not just of time, but even more of stack space. (It can double the stack usage of some recursive functions, for heaven's sake!)

So, there is no way that frame pointers can be maintained by default. I'm open to discussions, however, on whether to maintain it if some options are given (-g ? some new option?).

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @lefessan

There is a wiki with some information on how to use the tools:

https://perf.wiki.kernel.org/index.php/Main_Page

but most of the documentation is in the tools themselves ("perf help report" à la git), or in the code (in the tools/perf/ subdirectory of recent linux kernel sources).

For the space taken by %rbp on the stack, it is true that it can double the frame size, but at the same time, having a program doing a stack overflow because of that looks more like a bug from the program than from the code generator.

Anyway, use of frame pointers could be triggered as in C by a -fno-omit-frame-pointer option instead of being the default, and of course by default with -g (but use of -g should not be mandatory, because it also might disable other optimizations). Maybe having two distinguished modes for profiling, like "-p-hard" and "-p-soft" for hardware and software profiling modes...

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @lefessan

On a side note: it would be useful to be able to use an environment variable, like OCAML_PROFILING=1, to trigger a mode for generating code suitable for profiling, without modifying the build system files. And another option for debugging code...

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @alainfrisch

Will it be possible to link modules compile with and without "no-omit-frame-pointer"? (My understanding is that this is purely local to the function and does not change the calling convention, right?)

OCAML_PROFILING=1

What about a more general "OCAMLFLAGS"? The compilers would simply parse this variable and synthesize extra arguments at the beginning of the command-line.
(And it would then be nice if all switches came with a negated form.)

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 6, 2012

Comment author: @lefessan

Will it be possible to link modules compile with and without "no-omit-frame-pointer"? (My understanding is that this is purely local to the function and does not change the calling convention, right?)

Yes, but since my patch modifies the calling convention, all the files have to be compiled with the patch applied, you can then link together files compiled with and without the option.

OCAMLFLAGS

Does such an option really make sense ? How should an option in OCAMLFLAGS be interpreted by different tools like ocamlc, ocamlopt, ocamldep, etc. ? It looks better for me to have specific environment variables for specific purposes (ocamldep knows that it can disregard OCAML_PROFILING, but it doesn't know what to do with a '-g' present in OCAMLFLAGS).

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 7, 2012

Comment author: @alainfrisch

How should an option in OCAMLFLAGS be interpreted by different tools like ocamlc, ocamlopt, ocamldep, etc. ?

Indeed, we should probably have OCAMLCFLAGS, OCAMLOPTFLAGS, OCAMLDEPFLAGS, etc.

If we create an environment variable per command-line option, we will get a lot of them, and the user needs to know all their names. Off the top of my head, I can imagine situations where it could be useful to support the following options:

-ffast-math
-absname
-annot
-bin-annot
-cc
-compact
-g
-I
-inline
-no-app-funct
-noassert
-nodynlink
-nostdlib
-pp
-ppx
-principal
-S
-strict-sequence
-thread
-unsafe
-w
-warn-error

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 8, 2012

Comment author: @lefessan

The discussion is diverging. I created another bug report here of the environment variable feature:

#5723

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 13, 2012

Comment author: @lefessan

I uploaded a new version (some offsets in caml_raise_exn were wrong, so backtraces were not correctly printed)

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 13, 2012

Comment author: meurer

I'm currently working on adding iOS support to the ARM backend, and the iOS ABI uses r7 as frame pointer. Right now I'm just using r7 as regular caller-save register, making debugging OCaml code on iPhone/iPad impossible. But if there is some interest to support frame pointers (and debugging/performance monitoring techniques using frame pointers), we may want to add some generic support stuff and add only minimal backend specific glue for amd64/i386 (Linux) and arm (iOS).

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Sep 23, 2012

Comment author: @yakobowski

Profiling programs with perf is far superior to using gprof, both in terms of precision and ease of use. I would be very glad to see this patch integrated in the trunk.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 15, 2012

Comment author: @damiendoligez

I agree with Xavier that this should not be the compiler's default. Also, I intensely dislike "-fno-omit-frame-pointers" as a command-line switch. Since this is a profiling option, it shoud be "-p".

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Nov 19, 2012

Comment author: @lefessan

Actually, since the calling conventions are changed by that patch, it should probably not be a compilation option, but a ./configure option, so that you don't mix libraries linked with and without profiling.

I would propose "./configure -asm-keep-frame-pointer". The option would only work under amd64, for now.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Dec 17, 2012

Comment author: @diml

I observed a bug with this path. I attached a file which reproduces it. The problem is that the compiler chooses to store the upper-bound of the for-loop into the r10 register but doesn't save it on the stack, and this register maybe be destroyed by functions called inside the loop, in this case by caml_modify (not itself, but by another function it may end up calling: realloc).

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Jan 14, 2013

Comment author: @diml

Fixed: register numbers were not updated in Proc.destroyed_at_c_call. I attached a patch.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Apr 15, 2013

Comment author: @diml

What is the status of this feature? We have been using this patch for development everyday for a few month now at Jane Street and it would be great if it could be integrated upstream. Having it as a configure option is fine.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Apr 15, 2013

Comment author: @lefessan

There was an agreement in the core team, that it could be integrated with the switch in the configure to activate it. So, it is just a matter of finding the time to merge it in trunk.

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Jun 11, 2013

Comment author: @lefessan

Integrated in trunk for version 4.01

@vicuna

This comment has been minimized.

Copy link
Collaborator Author

vicuna commented Aug 13, 2013

Comment author: nicolas_boulay

Does you need 2 compiler to have the profiling option and an other one to do binary release ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.