Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Ruby 3.2.0 structs #61

Closed
dalehamel opened this issue Feb 21, 2023 · 16 comments
Closed

[feat] Ruby 3.2.0 structs #61

dalehamel opened this issue Feb 21, 2023 · 16 comments

Comments

@dalehamel
Copy link

dalehamel commented Feb 21, 2023

Howdy @javierhonduco ! Have been kicking the tires on this and would love to try out some of the features in production.

Unfortunately, as we are perennially on the newest ruby, it is difficult to keep up with the struct offsets / version compatibility concerns.

I notice that it looks like this project pulls the struct info from rbspy.

Perhaps if you could add some documentation / a general explanation on your process for how to bump the struct offsets, I or someone on my team could take a crack at bumping the ruby structs? It would be very instructive to give it a go either way.

@dalehamel
Copy link
Author

d859c58 looks like a good exemplar, I'll see if i can come up with something by muddling my way through

@dalehamel
Copy link
Author

@manuelfelipe and I both took a crack at this and came up with identical patches based on copy/pasta and not really knowing what we are doing, here is his 4056905 but mine looked like it as well.

The error that we get is:

./rbperf record --pid 1049635 --verbose-libbpf-logging --verbose-bpf-logging cpu
...
...
... 
libbpf: map 'rbperf_b.rodata': created successfully, fd=13
libbpf: map 'rbperf_b.data': created successfully, fd=14
libbpf: sec 'perf_event': found 1 CO-RE relocations
libbpf: CO-RE relocating [71] struct task_struct: found target candidate [356] struct task_struct in [vmlinux]
libbpf: prog 'on_event': relo #0: <byte_off> [71] struct task_struct.start_time (0:111 @ offset 2880)
libbpf: prog 'on_event': relo #0: matching candidate #0 <byte_off> [356] struct task_struct.start_time (0:86 @ offset 1600)
libbpf: prog 'on_event': relo #0: patched insn #37 (ALU/ALU64) imm 2880 -> 1600
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: System error, errno: 7', src/rbperf.rs:211:71
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The line in question appears to be https://github.com/javierhonduco/rbperf/blob/main/src/rbperf.rs#L210 so clearly we have done something wrong

@dalehamel
Copy link
Author

dalehamel commented Feb 21, 2023

I think that https://github.com/javierhonduco/rbperf/blob/main/src/rbperf.rs#L134 was somehow failing, because i reverted afc7ee7 and deleted the older ruby versions from the list:

        // Set the Ruby versions config
        let ruby_version_configs_raw = vec![
            ruby_3_1_2, ruby_3_1_3, ruby_3_2_0, ruby_3_2_1,
        ];

and it allowed me to get past this error. Errno 7 i think is E2BIG so maybe there is somehow an overflow?

Now i have a new one, probably due to how we poorly copied the ruby structs as it says the error should be impossible (lol):

./rbperf record -d 10  --pid 368959 syscall exit_getpid
pid: 368959
statically linked
ruby main thread address: 0x55d860f45bd0
process base address: 0x55d860915000
ruby version: "3.2.0"

thread 'main' panicked at 'Frame id is zero, this should never happen', src/rbperf.rs:418:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@manuelfelipe
Copy link
Contributor

I think that https://github.com/javierhonduco/rbperf/blob/main/src/rbperf.rs#L134 was somehow failing, because i reverted afc7ee7 and deleted the older ruby versions from the list:

Found what was the issue with the first listed panic. manuelfelipe@48f05ec addressed it in my fork. Looks like there is a static limit set for the numbers entries allowed in the bpf map used for ruby version matching.

Now, I'm at the sage stage you are @dalehamel and have the same panic after the record session is done.

@javierhonduco
Copy link
Owner

javierhonduco commented Feb 22, 2023

Hi!!

The documentation is really lacking – sorry about this. Thanks for bringing this up, will prioritise the documentation on adding support for a new Ruby version.

Found what was the issue with the first listed panic. manuelfelipe@48f05ec addressed it in my fork. Looks like there is a static limit set for the numbers entries allowed in the bpf map used for ruby version matching.

That fix is totally correct!

@javierhonduco
Copy link
Owner

Seems that in Ruby 3.2.0 the internal structures that back Strings have changed, see Variable Width Allocation: Strings, so the stack unwinder in BPF will have to be adapted. Without these changes, the stacks will be dropped if we can detect that they don't look right, or will be garbled otherwise.

@javierhonduco
Copy link
Owner

@manuelfelipe took the liberty to create a PR with your changes + what was needed to read the correct string (#64). The tests pass, but please do let me know if things aren't working properly and I'll be happy to take a look

@manuelfelipe
Copy link
Contributor

Thanks @javierhonduco, just gave it a try in a random worker around and now failing with:

RUST_LOG=debug RUST_BACKTRACE=1 rbperf record -d 10 --pid 99908 syscall exit_getpid
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] verbose_bpf_logging set to false
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] use_ringbuf set to false
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] open prog: walk_ruby_stack has 10541 intructions
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] open prog: on_event has 217 intructions
[2023-02-22T13:50:20Z DEBUG rbperf::binary] Checking symbol ruby_version in binary /proc/99908/exe
[2023-02-22T13:50:20Z DEBUG rbperf::process] Binary "/proc/99908/exe"
[2023-02-22T13:50:20Z DEBUG rbperf::binary] Checking symbol ruby_current_vm in binary /proc/99908/exe
pid: 99908
statically linked
ruby main thread address: 0x562f94cccbd0
process base address: 0x562f9469c000
ruby version: "3.2.0"

[2023-02-22T13:50:20Z INFO  rbperf::rbperf] Adding config for version starting with RubyVersion { major_version: 3, minor_version: 2, patch_version: 0 } at index 9
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] profiling started
[2023-02-22T13:50:20Z DEBUG rbperf::events] syscall with id 207 found in /sys/kernel/debug/tracing/events/syscalls/sys_exit_getpid/id
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T13:50:20Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T13:50:30Z DEBUG rbperf::rbperf] incomplete stack
... # snip snip snip
2023-02-22T13:50:30Z DEBUG rbperf::rbperf] incomplete stack
[2023-02-22T13:50:30Z DEBUG rbperf::rbperf] incomplete stack
thread 'main' panicked at 'frame_idx should exist', src/rbperf.rs:430:34
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_display
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:135:5
   3: core::panicking::panic_str
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:119:5
   4: core::option::expect_failed
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:1879:5
   5: rbperf::rbperf::Rbperf::start
   6: rbperf::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

tho on random occasions, it does not panic and I'm able to get some output from it. eg:

RUST_LOG=debug RUST_BACKTRACE=1 rbperf record -d 10 --pid 99908 syscall exit_getpid
[2023-02-22T13:48:48Z DEBUG rbperf::rbperf] verbose_bpf_logging set to false
[2023-02-22T13:48:48Z DEBUG rbperf::rbperf] use_ringbuf set to false
[2023-02-22T13:48:49Z DEBUG rbperf::rbperf] open prog: walk_ruby_stack has 10541 intructions
[2023-02-22T13:48:49Z DEBUG rbperf::rbperf] open prog: on_event has 217 intructions
[2023-02-22T13:48:49Z DEBUG rbperf::binary] Checking symbol ruby_version in binary /proc/99908/exe
[2023-02-22T13:48:49Z DEBUG rbperf::process] Binary "/proc/99908/exe"
[2023-02-22T13:48:49Z DEBUG rbperf::binary] Checking symbol ruby_current_vm in binary /proc/99908/exe
pid: 99908
statically linked
ruby main thread address: 0x562f94cccbd0
process base address: 0x562f9469c000
ruby version: "3.2.0"

[2023-02-22T13:48:49Z INFO  rbperf::rbperf] Adding config for version starting with RubyVersion { major_version: 3, minor_version: 2, patch_version: 0 } at index 9
[2023-02-22T13:48:49Z DEBUG rbperf::rbperf] profiling started
[2023-02-22T13:48:49Z DEBUG rbperf::events] syscall with id 207 found in /sys/kernel/debug/tracing/events/syscalls/sys_exit_getpid/id
[2023-02-22T13:48:49Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T13:48:49Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T13:48:59Z DEBUG rbperf::rbperf] incomplete stack
... # snip snip snip
[2023-02-22T13:48:59Z DEBUG rbperf::rbperf] incomplete stack
Got 3074 samples and 1209 errors
Flamegraph written to: rbperf_flame_02222023_13h48m59s.svg

@javierhonduco
Copy link
Owner

javierhonduco commented Feb 22, 2023

Thanks for the logs, this is super valuable to understand what's going on! Will look into the panic.

About the incomplete stacks, I'm assuming you are tracing a large Rails application, right? They tend to have incredibly deep stacks, and the current value is rather low. For some reason, even though I've done testing in mid to large Rails applications I never increased the total number of frames that can be captured.

Right now, up to 150 frames can be unwound. Just opened #65 to increase this to 750 frames. Let me know if this doesn't fix the truncated (incomplete) stacks you are seeing.

@manuelfelipe
Copy link
Contributor

Nice, thanks for looking into the incomplete stacks. Yeah, I was trying this out in worker from a really large Rails app 🙈. After #65 I no longer see the incomplete stack msgs showing up.

 RUST_LOG=debug RUST_BACKTRACE=1 rbperf record -d 10 --pid 261778 syscall exit_getpid
[2023-02-22T14:35:33Z DEBUG rbperf::rbperf] verbose_bpf_logging set to false
[2023-02-22T14:35:33Z DEBUG rbperf::rbperf] use_ringbuf set to false
[2023-02-22T14:35:33Z DEBUG rbperf::rbperf] open prog: on_event has 217 intructions
[2023-02-22T14:35:33Z DEBUG rbperf::rbperf] open prog: walk_ruby_stack has 10541 intructions
[2023-02-22T14:35:33Z DEBUG rbperf::binary] Checking symbol ruby_version in binary /proc/261778/exe
[2023-02-22T14:35:33Z DEBUG rbperf::process] Binary "/proc/261778/exe"
[2023-02-22T14:35:33Z DEBUG rbperf::binary] Checking symbol ruby_current_vm in binary /proc/261778/exe
pid: 261778
statically linked
ruby main thread address: 0x55e8bfea1bd0
process base address: 0x55e8bf871000
ruby version: "3.2.0"

[2023-02-22T14:35:34Z INFO  rbperf::rbperf] Adding config for version starting with RubyVersion { major_version: 3, minor_version: 2, patch_version: 0 } at index 9
[2023-02-22T14:35:34Z DEBUG rbperf::rbperf] profiling started
[2023-02-22T14:35:34Z DEBUG rbperf::events] syscall with id 207 found in /sys/kernel/debug/tracing/events/syscalls/sys_exit_getpid/id
[2023-02-22T14:35:34Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T14:35:34Z DEBUG rbperf::rbperf] program type Tracepoint
thread 'main' panicked at 'frame_idx should exist', src/rbperf.rs:430:34
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_display
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:135:5
   3: core::panicking::panic_str
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:119:5
   4: core::option::expect_failed
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:1879:5
   5: rbperf::rbperf::Rbperf::start
   6: rbperf::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@javierhonduco
Copy link
Owner

Awesome! Thanks for giving it another try. I've just cut another PR to fix the frame id 0 which mattered in the previous design but it's now broken.

Glad you are mentioning this last error. I've just cut #67 because when I was going through the code I realised that the error handling can definitely be improved. These samples should be ignored, this is not a fatal error by any means

@javierhonduco
Copy link
Owner

javierhonduco commented Feb 22, 2023

(this might take me a couple of hours/days until I can get to it, though)

edit: #68 should remediate this problem

@manuelfelipe
Copy link
Contributor

nice. Thanks again @javierhonduco! definitely much better now. I'm no longer seeing the panics and getting more consistent results ❤️

the one thing I notice, is that n ocations, the output is full of erorr like:

root@gke-core-exp-dpv2-us-ea1c-vi3-core-2-bf3aa8f4-h5w9:/usr/bin# RUST_LOG=debug RUST_BACKTRACE=1 rbperf record -d 10 --pid 2832793 syscall enter_getpid
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] verbose_bpf_logging set to false
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] use_ringbuf set to false
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] open prog: walk_ruby_stack has 10541 intructions
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] open prog: on_event has 217 intructions
[2023-02-22T16:54:41Z DEBUG rbperf::binary] Checking symbol ruby_version in binary /proc/2832793/exe
[2023-02-22T16:54:41Z DEBUG rbperf::process] Binary "/proc/2832793/exe"
[2023-02-22T16:54:41Z DEBUG rbperf::binary] Checking symbol ruby_current_vm in binary /proc/2832793/exe
pid: 2832793
statically linked
ruby main thread address: 0x562fa1a88b90
process base address: 0x562fa1458000
ruby version: "3.2.1"

[2023-02-22T16:54:41Z INFO  rbperf::rbperf] Adding config for version starting with RubyVersion { major_version: 3, minor_version: 2, patch_version: 1 } at index 10
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] profiling started
[2023-02-22T16:54:41Z DEBUG rbperf::events] syscall with id 206 found in /sys/kernel/debug/tracing/events/syscalls/sys_enter_getpid/id
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T16:54:41Z DEBUG rbperf::rbperf] program type Tracepoint
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
# ... snip snip 
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z ERROR rbperf::rbperf] mismatched expected=289 and received=264 frame count
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:51Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
# ... snip snip
[2023-02-22T16:54:54Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:54Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:54Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:54Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:54Z ERROR rbperf::rbperf] mismatched expected=359 and received=300 frame count
[2023-02-22T16:54:54Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry
[2023-02-22T16:54:54Z DEBUG rbperf::rbperf] Reading from id_to_stack did not found an entry



Got 4726 samples and 162587 errors
Flamegraph written to: rbperf_flame_02222023_16h54m55s.svg
root@gke-core-exp-dpv2-us-ea1c-vi3-core-2-bf3aa8f4-h5w9:/usr/bin#
root@gke-core-exp-dpv2-us-ea1c-vi3-core-2-bf3aa8f4-h5w9:/usr/bin#
root@gke-core-exp-dpv2-us-ea1c-vi3-core-2-bf3aa8f4-h5w9:/usr/bin#
root@gke-core-exp-dpv2-us-ea1c-vi3-core-2-bf3aa8f4-h5w9:/usr/bin#

@javierhonduco
Copy link
Owner

Happy to help! Fixing this problem properly will take a little while, but there's a stop-gap you can perhaps try meanwhile.

The underlying issue is that some of the stack frames in the id_to_stack map are resulting in hash collisions. When they happen, the newest stack is overwriting the older one with the same hash:

bpf_map_update_elem(&stack_to_id, frame, &random, BPF_ANY);
bpf_map_update_elem(&id_to_stack, &random, frame, BPF_ANY);

(note the BPF_ANY).

Then we try to access the overwritten stack that doesn't exist anymore. This is the error we see in userspace.

There are a couple of things that have to be done to fix this problem properly. The WIP I had some months back that I didn't publish because I didn't make much progress was to have a worker thread that will poll from the events, read the Ruby stack off the BPF maps and produce the Ruby stack we want. By doing this concurrently with the BPF program producing samples, it should heavily reduce the chance of collisions.

To unblock you, a possibility will be to bump these two maps where we store the stacks (from 10240 to some bigger value):

https://github.com/javierhonduco/rbperf/blob/main/src/bpf/rbperf.bpf.c#L36-L48

The other issue that the code has (will put a fix some time this week) is to stop relying on the random number generator. Not only this burns more cycles, but it can also produce the same values that we already have in the maps, making this problem worse.

Hope this makes sense

@manuelfelipe
Copy link
Contributor

thanks for the clarifications. Making more sense now. For now, will give it a try as you are suggesting with a different value for the ruby frame map.

@javierhonduco
Copy link
Owner

Closing this issue as Ruby 3.2.x support has already landed in #64. Let's keep track of the other bugs in their own issues. Thanks for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants