mcount: Allow full-dynamic tracing to instrument unsupported functions on x86_64 (w/ capstone) #870

AnsBal · 2019-09-13T19:40:04Z

Some functions can't be instrumented by full-dynamic tracing for safety reasons: a function that jumps back into its own prologue may exhibit undefined behavior once that prologue has been modified. To instrument these functions, one can embed an illegal instruction (for example "int3") at the start of every instruction that has been moved. This way, if a thread branches into the function prologue, it hits the illegal instruction and the kernel invokes a handler that redirects the thread to the original instruction. Afterwards, the thread jumps back to the function and resumes its execution normally.

Instrumenting unsupported functions is similar to the regular full-dynamic tracing process, with a few differences:

1. Create a constraint based on the positions of the "int3" bytes within the offset of the call instruction.
2. Store the instructions located at the prologue of the function, if patchable.
3. Find and allocate a free address that respects the constraint created previously.
4. Store the position of each "int3" and the address of the corresponding original instruction.
5. Patch the prologue of the function with a relative-address call instruction that calls the trampoline.
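
The first and third items in the list above can be illustrated with a minimal sketch. It assumes the patch is a 5-byte `call rel32` (opcode 0xE8 followed by a little-endian 32-bit offset); the names `rel32_constraint` and `encode_call` are hypothetical and are not taken from the uftrace source. The sketch only shows how instruction boundaries in the overwritten prologue turn into fixed 0xCC bytes in the call offset.

```python
import struct

CALL_SIZE = 5   # assumption: patch is E8 + rel32
INT3 = 0xCC


def rel32_constraint(insn_lengths):
    """Return (mask, value) that the raw rel32 bits must satisfy.

    insn_lengths: lengths of the original prologue instructions, in order.
    Every instruction that starts at patch offset 1..4 (inside the rel32
    field) must begin with an int3 byte.
    """
    mask = value = 0
    start = 0
    for length in insn_lengths:
        if 1 <= start < CALL_SIZE:
            shift = 8 * (start - 1)        # rel32 is stored little-endian
            mask |= 0xFF << shift
            value |= INT3 << shift
        start += length
    return mask, value


def encode_call(probe, target):
    """Encode `call target` for a call instruction placed at `probe`."""
    rel = target - (probe + CALL_SIZE)     # relative to the next instruction
    return b"\xe8" + struct.pack("<i", rel)


def satisfies(probe, target, mask, value):
    """Check whether a candidate trampoline address respects the constraint."""
    rel_bits = (target - (probe + CALL_SIZE)) & 0xFFFFFFFF
    return (rel_bits & mask) == value
```

For a prologue made of instructions of lengths 1, 3 and 4 bytes (for example `push %rbp; mov %rsp,%rbp; sub $0x10,%rsp`), the second and third instructions start at patch offsets 1 and 4, so only the lowest and highest bytes of the offset are pinned to 0xCC and the two middle bytes remain free when searching for a trampoline address.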

The execution flow is similar to regular full-dynamic tracing, except when a thread branches into a prologue and hits an "int3":

1. The trap handler is called.
2. The address at which the trap was raised is computed.
3. If the trap was raised from an address related to our tracepoint, the thread is redirected to the original instruction. Otherwise, the original handler (set by the user) is called and execution resumes.
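
The dispatch logic described above can be modeled in a few lines. This is a simulation, not the handler from this PR: a real implementation would be a SIGTRAP handler that rewrites the saved instruction pointer in the signal context. The names `make_trap_dispatcher` and `redirects` are hypothetical. The one architectural fact relied on is that, after an "int3" executes, the saved instruction pointer points one byte past the 0xCC opcode.

```python
def make_trap_dispatcher(redirects, previous_handler=None):
    """Build a function that decides where a trapped thread should resume.

    redirects: dict mapping the address of each embedded int3 byte to the
               address of the relocated copy of the original instruction.
    previous_handler: the handler installed by the user, if any.
    """
    def on_trap(saved_rip):
        int3_addr = saved_rip - 1          # int3 is a one-byte instruction
        target = redirects.get(int3_addr)
        if target is not None:
            return target                  # our tracepoint: redirect the thread
        if previous_handler is not None:
            previous_handler(saved_rip)    # not ours: defer to the user's handler
        return saved_rip                   # resume where the trap left off
    return on_trap
```

Keeping the lookup keyed by the exact "int3" address means that breakpoints unrelated to the tracepoint fall through to the previously installed handler unchanged.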

Some drawbacks and limitations:

"int3" is slower than a jump/call since it has to go through the kernel, which then dispatches to the trap handler.
Embedding "int3" bytes in the offset narrows the range in which a free address can be found. The worst case is having four one-byte instructions in the probe location. In that case the offset of the call/jmp must be 0xCCCCCCCC, and if the resulting absolute address (probe location + 0xCCCCCCCC) is already mapped, the instrumentation fails. One solution to this problem is to combine different illegal instructions instead of using only "int3".
Since uftrace must install the trap handler for this technique to work, it may interfere with gdb, which relies on "int3" to insert breakpoints. This means that gdb can't be used to debug uftrace while it traces a program with this technique.
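
To make the worst case concrete, here is a small worked example. It assumes a 5-byte `call rel32` patch; on x86_64 the 32-bit offset is sign-extended and is relative to the end of the call instruction, so 0xCCCCCCCC acts as a negative displacement. The helper name `call_target` and the two probe addresses are illustrative assumptions, not values from the PR.

```python
CALL_SIZE = 5  # assumption: E8 + rel32


def call_target(probe, rel32_bits):
    """Absolute address reached by a `call rel32` located at `probe`."""
    # Sign-extend the raw 32-bit offset, as the CPU does.
    rel = rel32_bits - (1 << 32) if rel32_bits & 0x80000000 else rel32_bits
    return probe + CALL_SIZE + rel


# Illustrative probe locations (assumed, not measured):
low_probe = 0x400000            # a low, non-PIE style text address
high_probe = 0x555555554000     # a high, PIE style text address

low_target = call_target(low_probe, 0xCCCCCCCC)
high_target = call_target(high_probe, 0xCCCCCCCC)
```

With a fully constrained offset there is exactly one candidate address per probe. For the low probe the candidate is negative, that is, outside the address space, so the patch cannot succeed at all; for the high probe it only succeeds if that single address happens to be unmapped.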

honggyukim · 2019-09-16T00:41:59Z

Hi @AnsBal, thanks very much for your work. Since this PR contains many changes, it may take some time to be reviewed by @namhyung.

We cannot work on this project full time, so please understand the delay. Thanks a lot!

AnsBal · 2019-09-16T15:40:07Z

Hello @honggyukim. Thank you for your reply. I understand that it may take some time.
I'm looking forward to the review by @namhyung.

namhyung · 2019-09-16T16:04:38Z

@AnsBal an interesting approach. So IIUC it changes the offset of the call instruction to have 0xcc (INT3) at the first byte of each (original) instruction, to catch jumps to the prologue, right? I was thinking of changing call-sites instead of the call in the prologue. That would make it much easier to handle memory regions for the trampoline.

AnsBal · 2019-09-16T17:53:50Z

Hi @namhyung, thank you for the reply!

> @AnsBal an interesting approach. So IIUC it changes the offset of the call instruction to have 0xcc (INT3) at the first byte of each (original) instruction, to catch jumps to the prologue, right?

Yes, that's right.

> I was thinking of changing call-sites instead of the call in the prologue. That would make it much easier to handle memory regions for the trampoline.

I thought about it too. But the problem with patching call-sites is that you can't patch all of them. Some are too small (in instruction size) to be replaced by an alternative instruction that can reach the original instruction. Besides, it is sometimes difficult to find the destination of a call-site (in the case of indirect branches). This technique covers all of these cases and, "in theory", has a higher success rate.

Another alternative could be using "int3" at the function entry just like kprobes/uprobes/gdb do. But it adds the overhead of dispatching the trap handler every time the function is called.

namhyung · 2019-09-19T15:22:46Z

Yeah, I agree that patching call-sites cannot catch indirect jumps. But it'd be possible to handle direct jumps only. I don't understand what you said about the size. I think we can patch the first byte to INT3 and skip the rest.

The downside I see in your approach is that it will spread trampolines for each function based on the instruction pattern. While some of them might be shared, this will increase the number of mmaps, and it can cause real mmap calls in the target process to be rejected later. That's why I tried to find a trampoline location in the same text mapping.

> Another alternative could be using "int3" at the function entry just like kprobes/uprobes/gdb do. But it adds the overhead of dispatching the trap handler every time the function is called.

Yes, this is the safe and slow approach. Maybe we can use it only for indirect jump cases (assuming it's rare).

AnsBal · 2019-09-25T02:04:18Z

> Yeah, I agree that patching call-sites cannot catch indirect jumps. But it'd be possible to handle direct jumps only. I don't understand what you said about the size.

What I tried to say is that some call-sites are too small to be patched with an instruction that can reach the original one. If we assume that, most of the time, the compiler or the programmer uses a short jump (or another short branch instruction) to branch to the prologue, then we won't be able to patch those call-sites because of their size.

> The downside I see in your approach is that it will spread trampolines for each function based on the instruction pattern. While some of them might be shared, this will increase the number of mmaps, and it can cause real mmap calls in the target process to be rejected later. That's why I tried to find a trampoline location in the same text mapping.

Indeed, it consumes mapping address space in the target process. I'm wondering whether we are consuming too much, since we use it only for the small share of functions that we failed to patch.

Edit:
In nginx, 19 out of 1195 function patches failed because of a branch to the prologue.
In uftrace, 42 out of 686 function patches failed because of a branch to the prologue.

In the worst case we need one page for each function: 42 * 4096 bytes ≈ 172 kB for uftrace and 19 * 4096 bytes ≈ 77 kB for nginx. Knowing that the size of the user-space virtual memory is 128 TB, the worst case is not that bad.
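
The worst-case figures above can be checked with a short calculation. This is only a sketch: it assumes 4096-byte pages and one dedicated page per unpatchable function, and it uses the failure counts quoted in this comment; the name `worst_case_bytes` is hypothetical.

```python
PAGE_SIZE = 4096                 # assumption: 4 KiB pages
USER_VA_BYTES = 1 << 47          # 128 TiB of user-space virtual memory on x86_64

# Functions that failed to patch because of a branch to the prologue.
failed = {"nginx": 19, "uftrace": 42}


def worst_case_bytes(n_functions, page_size=PAGE_SIZE):
    """Memory needed if every unpatchable function gets its own page."""
    return n_functions * page_size


usage = {name: worst_case_bytes(n) for name, n in failed.items()}
share = {name: size / USER_VA_BYTES for name, size in usage.items()}
```

Both totals are a negligible share of the address space, which supports the point that memory size is not the limiting factor here.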

> Yes, this is the safe and slow approach. Maybe we can use it only for indirect jump cases (assuming it's rare).

It could be used for indirect jump cases as well as for cases where we can't patch a call-site, because patching the call-site is an optimization worth trying first.

namhyung · 2019-09-25T12:10:01Z

> If we assume that, most of the time, the compiler or the programmer uses a short jump (or another short branch instruction) to branch to the prologue, then we won't be able to patch those call-sites because of their size.

Oh, I was thinking of adding INT3 there.

> In nginx, 19 out of 1195 function patches failed because of a branch to the prologue.
> In uftrace, 42 out of 686 function patches failed because of a branch to the prologue.
>
> In the worst case we need one page for each function: 42 * 4096 bytes ≈ 172 kB for uftrace and 19 * 4096 bytes ≈ 77 kB for nginx. Knowing that the size of the user-space virtual memory is 128 TB, the worst case is not that bad.

Thanks for the numbers. It's good to see that there aren't many. Did you try to patch all functions in the libraries as well? In general, we cannot predict how many there will be for each binary or after possible compiler changes. Also, there's a limit on the number of mappings (/proc/sys/vm/max_map_count), and the default is 65530.
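
As a rough way to reason about the mapping limit mentioned above, the sketch below counts the entries in a `/proc/<pid>/maps` style listing and subtracts the extra trampoline mappings from a given limit. The function names and the sample listing are hypothetical; on a real system the text would come from `/proc/<pid>/maps` and the limit from `/proc/sys/vm/max_map_count`.

```python
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/demo
0060a000-0060b000 r--p 0000a000 08:01 1234 /usr/bin/demo
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""


def count_mappings(maps_text):
    """Count mappings: one non-empty line per mapping."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def remaining_mappings(maps_text, max_map_count, extra_mappings):
    """Mappings still available after adding `extra_mappings` trampolines."""
    return max_map_count - count_mappings(maps_text) - extra_mappings
```

Unlike the memory-size estimate, this budget shrinks by one for every function that needs its own trampoline mapping, so it depends directly on how many functions branch to their prologue.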

AnsBal · 2019-09-28T21:18:31Z

> If we assume that, most of the time, the compiler or the programmer uses a short jump (or another short branch instruction) to branch to the prologue, then we won't be able to patch those call-sites because of their size.
>
> Oh, I was thinking of adding INT3 there.

It may be an option, but I think that an 'int3' in the call-site may significantly degrade performance if it's a loop instead of a single branch to the prologue.

> Thanks for the numbers. It's good to see that there aren't many. Did you try to patch all functions in the libraries as well?

Only functions in static objects have been patched. How can I patch dynamic objects with dynamic tracing?

> In general, we cannot predict how many there will be for each binary or after possible compiler changes. Also, there's a limit on the number of mappings (/proc/sys/vm/max_map_count), and the default is 65530.

Oh, the default limit is too small. It's true that patching the entry with an 'int3' directly doesn't need to map a memory area for each patched function, but it still degrades performance. I think of it as a time-space trade-off.

namhyung · 2019-09-29T23:51:36Z

> It may be an option, but I think that an 'int3' in the call-site may significantly degrade performance if it's a loop instead of a single branch to the prologue.

I don't follow. I think the effect is the same, since your approach also needs to hit INT3 anyway. I'm saying we can use it only for jumps to the prologue.

> Only functions in static objects have been patched. How can I patch dynamic objects with dynamic tracing?

Are you talking about the DSOs? The commit 1147486 added library support. As it uses prefix matching, you may add -P .@lib to enable it for every library (not tested though).

AnsBal added 3 commits September 12, 2019 14:13

test: Skipping dynamic_full test when capstone is missing.

b7a7057

Full dynamic tracing is only enabled when libcapstone is available. No need to do the test when the lib is missing. Signed-off-by: Anas Balboul <anasbalbo@gmail.com>

test: Add a function that jumps to it prologue in the dynamic_full test.

06dde18

The added function branch twice to its prologue. It could be used to test dynamic_full tracing of unsupported functions. Signed-off-by: Anas Balboul <anasbalbo@gmail.com>

honggyukim added the dynamic label Oct 3, 2019

honggyukim added the dorsal from dorsal group, https://www.dorsal.polymtl.ca/en label Sep 1, 2021

clementguidi mentioned this pull request May 10, 2023

dynamic: x86_64: Runtime dynamic instrumentation #1698

Draft
