mcount: Allow full-dynamic tracing to instrument unsupported functions on x86_64 (w/ capstone) #870

Open
wants to merge 3 commits into
base: master

AnsBal
Contributor

@AnsBal AnsBal commented Sep 13, 2019

Some functions can't be instrumented by full-dynamic tracing for safety reasons: a function that jumps to its prologue after the prologue has been modified may cause undefined behavior. To be able to instrument these functions, one can embed an illegal instruction (for example "int3") at the head of every instruction that has been moved. This way, if a thread branches to the function prologue, it will step on the illegal instruction and a handler will be called by the kernel to redirect the thread to the original instruction. Afterwards, the thread will jump back to the function and resume its execution normally.

The process for instrumenting unsupported functions is similar to the full-dynamic tracing process, with some differences:

  1. Create a constraint based on the position of the "int3" bytes in the offset of the call instruction.
  2. Store the instructions located at the prologue of the function, if patchable.
  3. Find and allocate a free address that respects the constraint created previously.
  4. Store the position of each "int3" and the address of the original instruction.
  5. Write the relative-address call instruction into the prologue of the function so that it calls the trampoline.
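
The constraint in step 1 can be sketched as a mask over the 32-bit call offset. This is a minimal model, not the code from this PR: it assumes the patched prologue starts with a 5-byte `call rel32` (opcode `0xE8` followed by a little-endian offset), and the names `int3_constraint`, `satisfies`, and `call_target` are hypothetical.

```python
CALL_LEN = 5   # assumed: E8 opcode + 4-byte little-endian rel32
INT3 = 0xCC


def int3_constraint(insn_lengths):
    """Return (mask, value) that the rel32 offset must match.

    insn_lengths lists the sizes of the original prologue instructions.
    Every original instruction that starts at byte 1..4 of the patched
    region lands inside the rel32 field, so that byte must be 0xCC.
    """
    mask = value = 0
    pos = 0
    for length in insn_lengths:
        if pos >= CALL_LEN:
            break
        if pos >= 1:
            shift = 8 * (pos - 1)      # byte index inside rel32
            mask |= 0xFF << shift
            value |= INT3 << shift
        pos += length
    return mask, value


def satisfies(rel32, mask, value):
    """True if a candidate offset keeps 0xCC at every constrained byte."""
    return (rel32 & mask) == value


def call_target(probe, rel32):
    """Address reached by `call rel32` placed at `probe`.

    The offset is sign-extended and is relative to the end of the call.
    """
    signed = rel32 - (1 << 32) if rel32 & 0x80000000 else rel32
    return probe + CALL_LEN + signed


# Example prologue with instruction sizes 1, 3 and 4 bytes:
# instructions start at bytes 0, 1 and 4 of the patched region.
mask, value = int3_constraint([1, 3, 4])
print(hex(mask), hex(value))
```

Any trampoline address whose offset passes `satisfies` is a candidate for step 3. The more bytes the mask pins, the fewer candidate addresses remain.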

The execution flow is similar to full-dynamic tracing, except when a thread branches to a prologue and steps on an "int3":

  1. The trap handler is called.
  2. The address from where the trap handler was raised is computed.
  3. If the trap was raised from an address related to our tracepoint, the thread is redirected to the original instruction. Otherwise, the original handler (set by the user) is called and the execution is resumed.
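
The three steps above can be modeled as a table lookup. This is only a sketch of the dispatch logic under stated assumptions, not the handler from this PR: `resolve_trap`, `redirects`, and `previous_handler` are hypothetical names, and the model relies on the x86 behavior that the saved instruction pointer after an "int3" trap points one byte past the "int3".

```python
def resolve_trap(saved_rip, redirects, previous_handler):
    """Return the address where the thread should resume.

    saved_rip        -- instruction pointer reported with the trap
    redirects        -- dict: address of an embedded int3 -> address of
                        the relocated original instruction
    previous_handler -- callable standing in for the user's own handler
    """
    trap_addr = saved_rip - 1            # int3 is a single byte
    original = redirects.get(trap_addr)
    if original is not None:
        return original                  # trap belongs to a tracepoint
    previous_handler(trap_addr)          # not ours: defer to the user
    return saved_rip                     # and resume after the int3


# Usage with made-up addresses:
redirects = {0x401001: 0x7F0000001000}
print(hex(resolve_trap(0x401002, redirects, print)))
```

In a real handler the return value would be written back into the saved register context, so that the thread continues at that address once the handler returns.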

Some drawbacks and limitations:

  1. "int3" is slower than a jump/call, since it has to go through the kernel, which dispatches the trap handler.
  2. Embedding "int3" in the offset reduces the range in which a free address can be found. The worst case is having four one-byte instructions in the probe location. In this case the offset of the call/jmp must be 0xCCCCCCCC, and if the absolute address (probe location + 0xCCCCCCCC) is already mapped, the instrumentation will fail. A solution to this problem is to combine different illegal instructions instead of using only "int3".
  3. Since the trap handler must be set by uftrace for this technique to work, it may interfere with gdb, which relies on "int3" to insert breakpoints. This means that gdb can't be used to debug uftrace while tracing a program with this technique.

Full dynamic tracing is only enabled when libcapstone is available. No need
to do the test when the lib is missing.

Signed-off-by: Anas Balboul <anasbalbo@gmail.com>
The added function branches twice to its prologue. It could be used to test
dynamic_full tracing of unsupported functions.

Signed-off-by: Anas Balboul <anasbalbo@gmail.com>
@honggyukim
Collaborator

Hi @AnsBal, thanks very much for your work. Since this PR contains many changes, it may take some time to be reviewed by @namhyung.

We cannot work on this project full time, so please understand the delay. Thanks a lot!

@AnsBal
Contributor Author

AnsBal commented Sep 16, 2019

Hello @honggyukim. Thank you for your reply. I understand that it may take some time.
I’m looking forward to the review by @namhyung.

@namhyung
Owner

@AnsBal an interesting approach. So IIUC it changes the offset of the call instruction to have 0xcc (INT3) at the position of the first byte of each (original) instruction to catch jumps to the prologue, right? I was thinking of changing call-sites instead of the call in the prologue. That'd make it much easier to handle memory regions for the trampoline.

@AnsBal
Contributor Author

AnsBal commented Sep 16, 2019

Hi @namhyung, thank you for the reply!

@AnsBal an interesting approach. So IIUC it changes the offset of the call instruction to have 0xcc (INT3) at the position of the first byte of each (original) instruction to catch jumps to the prologue, right?

Yes, that's right.

I was thinking of changing call-sites instead of the call in the prologue. That'd make it much easier to handle memory regions for the trampoline.

I thought about it too. But the thing with patching the call-site is that you can't patch all of them. Some are too small (insn size) to be replaced by an alternative instruction that can reach the original instruction. Besides that, it is sometimes difficult to find the destination of a call-site (in the case of indirect branches). This technique covers all these cases and, in theory, has a higher success rate.

Another alternative could be using "int3" at the function entry just like kprobes/uprobes/gdb do. But it adds the overhead of dispatching the trap handler every time the function is called.

@namhyung
Owner

Yeah, I agree that patching call-sites cannot catch indirect jumps. But it'd be possible to handle direct jumps only. I don't understand what you said about the size. I think we can patch the first byte to INT3 and skip the rest.

The downside I see in your approach is that it will spread trampolines for each function based on the instruction pattern. While some of them might be shared, this will increase the number of mmaps, and it can cause real uses of mmap in the target process to fail later. That's why I tried to find a trampoline location in the same text mapping.

Another alternative could be using "int3" at the function entry just like kprobes/uprobes/gdb do. But it adds the overhead of dispatching the trap handler every time the function is called.

Yes, this is the safe and slow approach. Maybe we can use it only for indirect jump cases (assuming it's rare).

@AnsBal
Contributor Author

AnsBal commented Sep 25, 2019

Yeah, I agree that patching call-sites cannot catch indirect jumps. But it'd be possible to handle direct jumps only. I don't understand what you said about the size.

What I tried to say is that some call-sites are too small to be patched by an instruction that can reach the original one. If we assume that the size of function will, most of the time, be large enough for the compiler or the programmer to use a short jump (or other short branch instructions), then we won't be able to patch them because of their size.
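
The size problem can be illustrated with the standard x86 encodings: a short `jmp rel8` is 2 bytes and reaches -128..+127 bytes from its end, while a near `jmp rel32` needs 5 bytes. The sketch below is illustrative only; `reaches_rel8` and `can_replace_in_place` are hypothetical helpers, not uftrace functions.

```python
JMP_REL8_LEN = 2    # EB cb
JMP_REL32_LEN = 5   # E9 cd


def reaches_rel8(site, target):
    """True if a 2-byte short jump at `site` can reach `target`."""
    disp = target - (site + JMP_REL8_LEN)   # relative to the next insn
    return -128 <= disp <= 127


def can_replace_in_place(site_len, needed=JMP_REL32_LEN):
    """True if the existing branch is at least as long as its replacement.

    A shorter call-site cannot hold the longer instruction without
    overwriting the instructions that follow it.
    """
    return site_len >= needed


# A short jump back to a nearby prologue cannot be widened in place:
print(can_replace_in_place(JMP_REL8_LEN))
```

A 2-byte site has room for a single-byte "int3" but not for a 5-byte jump to a distant trampoline.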

The downside I see in your approach is that it will spread trampolines for each function based on the instruction pattern. While some of them might be shared, this will increase the number of mmaps, and it can cause real uses of mmap in the target process to fail later. That's why I tried to find a trampoline location in the same text mapping.

Indeed, it consumes the mapping address space of the target process. I'm wondering whether we are really consuming too much, since we use it only for the small part of the functions that we failed to patch.

Edit:
In nginx, 19 out of 1195 function patches failed because of a branch to the prologue.
In uftrace, 42 out of 686 function patches failed because of a branch to the prologue.

In the worst case we need one page for each function: 42 * 4096 ≈ 172 kB for uftrace and 19 * 4096 ≈ 77 kB for nginx. Knowing that the size of the user-space virtual memory is 128 TB, the worst case is not that bad.
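
The worst-case estimate can be reproduced with a few lines. The failure counts come from the numbers above; the 4096-byte page size and the helper name `worst_case_bytes` are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def worst_case_bytes(failed_functions, page_size=PAGE_SIZE):
    """Memory needed if every failed function gets its own page."""
    return failed_functions * page_size


for name, failed in (("uftrace", 42), ("nginx", 19)):
    size = worst_case_bytes(failed)
    print(f"{name}: {failed} pages, {size} bytes")
```

This only counts bytes. Each page may also be a separate mapping, which is a different resource from total memory.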

Yes, this is the safe and slow approach. Maybe we can use it only for indirect jump cases (assuming it's rare).

It could be used for indirect jump cases as well as for the cases where we can't patch a call-site, because the optimization of patching the call-site is still worth trying first.

@namhyung
Owner

If we assume that the size of function will, most of the time, be large enough for the compiler or the programmer to use a short jump (or other short branch instructions), then we won't be able to patch them because of their size.

Oh, I was thinking of adding INT3 there.

In nginx, 19 out of 1195 function patches failed because of a branch to the prologue.
In uftrace, 42 out of 686 function patches failed because of a branch to the prologue.

In the worst case we need one page for each function: 42 * 4096 ≈ 172 kB for uftrace and 19 * 4096 ≈ 77 kB for nginx. Knowing that the size of the user-space virtual memory is 128 TB, the worst case is not that bad.

Thanks for the numbers. It's good to see that there are not many. Did you try to patch all functions in the libraries as well? In general, we cannot predict how many there will be for each binary and for possible compiler changes. Also, there's a limit on the number of mappings (/proc/sys/vm/max_map_count) and the default is 65530.
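
To put the number of extra mappings in context, the limit can be read from procfs at run time. This is a hedged sketch: `read_max_map_count` and `mapping_budget_used` are hypothetical helpers, the file only exists on Linux, and the model assumes one extra mapping per trampoline page.

```python
from pathlib import Path


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the per-process mapping limit, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def mapping_budget_used(extra_mappings, limit):
    """Fraction of the mapping limit taken by the extra mappings."""
    return extra_mappings / limit


limit = read_max_map_count()
if limit is not None:
    # 42 is the uftrace failure count quoted earlier in the thread.
    print(f"42 extra mappings use {mapping_budget_used(42, limit):.4%} of the limit")
```

The fraction ignores mappings the target process already holds, so it is a lower bound on the pressure the trampolines add.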

@AnsBal
Contributor Author

AnsBal commented Sep 28, 2019

If we assume that the size of function will, most of the time, be large enough for the compiler or the programmer to use a short jump (or other short branch instructions), then we won't be able to patch them because of their size.

Oh, I was thinking of adding INT3 there.

It may be an option, but I think that 'int3' in the call-site may significantly degrade performance if the branch to the prologue is a loop rather than a single branch.

Thanks for the numbers. It's good to see that there are not many. Did you try to patch all functions in the libraries as well?

Only functions in the static object have been patched. How can I patch dynamic objects with dynamic tracing?

In general, we cannot predict how many there will be for each binary and for possible compiler changes. Also, there's a limit on the number of mappings (/proc/sys/vm/max_map_count) and the default is 65530.

Oh, the default limit is too small. It's true that patching the entry with an 'int3' directly doesn't need to map a memory area for each patched function, but it still degrades performance. I think of it as a time-space trade-off.

@namhyung
Owner

It may be an option, but I think that 'int3' in the call-site may significantly degrade performance if the branch to the prologue is a loop rather than a single branch.

I don't follow. I think the effect is the same, since your approach also needs to hit INT3 anyway. I'm saying we can use it only for jumps to the prologue.

Only functions in the static object have been patched. How can I patch dynamic objects with dynamic tracing?

Are you talking about the DSOs? The commit 1147486 added library support. As it uses prefix matching, you may add -P .@lib to enable it for every library (not tested though).

@honggyukim honggyukim added the dorsal from dorsal group, https://www.dorsal.polymtl.ca/en label Sep 1, 2021