
*: Add more CPU profiler metrics #2041

Merged
merged 6 commits into from
Sep 21, 2023

@kakkoyun (Member) commented Sep 21, 2023

Why?

We want to track more metrics for the failure modes of the BPF code.

This PR also reorganizes some of the code so that semantically distinct components are separated.

What?

Refactor the CPU profiler and the BPF-related code to use new packages and improve logging, error handling, and metrics.

Test Plan

  • make test/profiler
  • CI
Relevant metrics


# HELP parca_agent_native_unwinder_error_total There was an error while unwinding the stack.
# TYPE parca_agent_native_unwinder_error_total counter
parca_agent_native_unwinder_error_total{reason="catchall"} 0
parca_agent_native_unwinder_error_total{reason="frame_pointer_action"} 1
parca_agent_native_unwinder_error_total{reason="jit_mixed_mode_disabled"} 0
parca_agent_native_unwinder_error_total{reason="jit_unupdated_mapping"} 0
parca_agent_native_unwinder_error_total{reason="pc_not_covered"} 17
parca_agent_native_unwinder_error_total{reason="pc_not_covered_jit"} 0
parca_agent_native_unwinder_error_total{reason="should_never_happen"} 0
parca_agent_native_unwinder_error_total{reason="truncated"} 0
parca_agent_native_unwinder_error_total{reason="unsupported_cfa_register"} 0
parca_agent_native_unwinder_error_total{reason="unsupported_expression"} 0
# HELP parca_agent_native_unwinder_samples_total Total samples.
# TYPE parca_agent_native_unwinder_samples_total counter
parca_agent_native_unwinder_samples_total{unwinder="dwarf"} 42
# HELP parca_agent_native_unwinder_success_total Samples that unwound successfully reaching the bottom frame.
# TYPE parca_agent_native_unwinder_success_total counter
parca_agent_native_unwinder_success_total{unwinder="dwarf"} 23
parca_agent_native_unwinder_success_total{unwinder="dwarf_reach_bottom"} 40
parca_agent_native_unwinder_success_total{unwinder="dwarf_to_jit"} 0
parca_agent_native_unwinder_success_total{unwinder="jit_frame"} 0
parca_agent_native_unwinder_success_total{unwinder="jit_reach_bottom"} 0
parca_agent_native_unwinder_success_total{unwinder="jit_to_dwarf"} 0
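These are cumulative counters, so the raw totals mainly show that each failure mode is being recorded; comparing them is more useful over an interval. A minimal sketch, assuming two scrapes of the same series: `increase` treats a drop in value as a counter reset (similar in spirit to PromQL's `increase()`, without its extrapolation), and `ratio` guards against empty intervals. The scrape values and the pairing of `pc_not_covered` errors with dwarf samples are hypothetical; whether that ratio is meaningful depends on how the agent increments these counters, which this page does not spell out.

```go
package main

import "fmt"

// increase returns how much a cumulative counter grew between two scrapes.
// A value lower than before is treated as a counter reset (e.g. an agent
// restart), in which case the current value is the best available estimate.
func increase(prev, cur float64) float64 {
	if cur < prev {
		return cur
	}
	return cur - prev
}

// ratio returns num/den, or 0 when den is zero, so idle intervals do not
// produce NaN or Inf.
func ratio(num, den float64) float64 {
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	// Hypothetical values from two scrapes of
	// parca_agent_native_unwinder_error_total{reason="pc_not_covered"}
	// and parca_agent_native_unwinder_samples_total{unwinder="dwarf"}.
	errs := increase(17, 29)     // 12
	samples := increase(42, 102) // 60
	fmt.Printf("pc_not_covered per dwarf sample: %.2f\n", ratio(errs, samples))
	// prints "pc_not_covered per dwarf sample: 0.20"
}
```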


# HELP parca_agent_profiler_bpf_maps_clean_errors_total Number of errors cleaning BPF maps
# TYPE parca_agent_profiler_bpf_maps_clean_errors_total counter
parca_agent_profiler_bpf_maps_clean_errors_total{map="dwarf_stack_traces",type="cpu"} 0
parca_agent_profiler_bpf_maps_clean_errors_total{map="process_info",type="cpu"} 0
parca_agent_profiler_bpf_maps_clean_errors_total{map="stack_counts",type="cpu"} 0
parca_agent_profiler_bpf_maps_clean_errors_total{map="stack_traces",type="cpu"} 0
parca_agent_profiler_bpf_maps_clean_errors_total{map="unwind_info_chunks",type="cpu"} 0

# HELP parca_agent_profiler_bpf_maps_refresh_proc_info_errors_total Number of errors refreshing process info
# TYPE parca_agent_profiler_bpf_maps_refresh_proc_info_errors_total counter
parca_agent_profiler_bpf_maps_refresh_proc_info_errors_total{error="hash",type="cpu"} 0
parca_agent_profiler_bpf_maps_refresh_proc_info_errors_total{error="unwind_table_add",type="cpu"} 0

# HELP parca_agent_profiler_events_lost_total Total number of profile events lost.
# TYPE parca_agent_profiler_events_lost_total counter
parca_agent_profiler_events_lost_total{type="cpu"} 0

# HELP parca_agent_profiler_events_received_total Total number of profile events received.
# TYPE parca_agent_profiler_events_received_total counter
parca_agent_profiler_events_received_total{event="empty",type="cpu"} 0
parca_agent_profiler_events_received_total{event="process_mappings",type="cpu"} 2887
parca_agent_profiler_events_received_total{event="refresh_proc_info",type="cpu"} 16
parca_agent_profiler_events_received_total{event="unwind_info",type="cpu"} 829

# HELP parca_agent_profiler_frame_drop_total Number of addresses dropped from the profile.
# TYPE parca_agent_profiler_frame_drop_total counter
parca_agent_profiler_frame_drop_total{reason="mapping_nil",type="cpu"} 28

# HELP parca_agent_profiler_map_read_attempts_total Number of attempts to read from the BPF maps.
# TYPE parca_agent_profiler_map_read_attempts_total counter
parca_agent_profiler_map_read_attempts_total{action="dwarf_unwind",stack="user",status="error",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="dwarf_unwind",stack="user",status="failed",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="dwarf_unwind",stack="user",status="missing",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="dwarf_unwind",stack="user",status="success",type="cpu"} 23
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="kernel",status="error",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="kernel",status="failed",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="kernel",status="missing",type="cpu"} 2580
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="kernel",status="success",type="cpu"} 290
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="user",status="error",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="user",status="failed",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="user",status="missing",type="cpu"} 0
parca_agent_profiler_map_read_attempts_total{action="kernel_unwind",stack="user",status="success",type="cpu"} 2847

# HELP parca_agent_profiler_profiles_drop_total Number of profiles dropped from the profile (one profile represents 1 process in a profiling duration).
# TYPE parca_agent_profiler_profiles_drop_total counter
parca_agent_profiler_profiles_drop_total{reason="process_info",type="cpu"} 24

# HELP parca_agent_profiler_stack_drop_total Total number of stacks dropped from the profile.
# TYPE parca_agent_profiler_stack_drop_total counter
parca_agent_profiler_stack_drop_total{reason="iterator",type="cpu"} 0
parca_agent_profiler_stack_drop_total{reason="read_kernel_stack",type="cpu"} 2580
parca_agent_profiler_stack_drop_total{reason="read_stack_count",type="cpu"} 0
parca_agent_profiler_stack_drop_total{reason="read_stack_count_zero",type="cpu"} 0
parca_agent_profiler_stack_drop_total{reason="read_stack_key",type="cpu"} 0
parca_agent_profiler_stack_drop_total{reason="read_user_stack_with_dwarf",type="cpu"} 0
parca_agent_profiler_stack_drop_total{reason="read_user_stack_with_frame_pointer",type="cpu"} 0

# HELP parca_agent_profiler_unwind_table_add_errors_total Total number of errors adding entries to the unwind table.
# TYPE parca_agent_profiler_unwind_table_add_errors_total counter
parca_agent_profiler_unwind_table_add_errors_total{error="need_more_rounds",type="cpu"} 0
parca_agent_profiler_unwind_table_add_errors_total{error="other",type="cpu"} 0
parca_agent_profiler_unwind_table_add_errors_total{error="procfs_race",type="cpu"} 446
parca_agent_profiler_unwind_table_add_errors_total{error="too_many_mappings",type="cpu"} 0

# HELP parca_agent_profiler_unwind_table_persist_errors_total Total number of errors persisting the unwind table.
# TYPE parca_agent_profiler_unwind_table_persist_errors_total counter
parca_agent_profiler_unwind_table_persist_errors_total{error="need_more_rounds",type="cpu"} 0
parca_agent_profiler_unwind_table_persist_errors_total{error="other",type="cpu"} 0

@kakkoyun requested a review from a team as a code owner September 21, 2023 15:12
@javierhonduco (Contributor) left a comment

Looks good! Could you provide the metrics in the test plan? Something like curling the metrics endpoint and showing that the counters we care about are there.

@kakkoyun (Member, Author)


Added them to the description.

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
@kakkoyun merged commit e575e45 into main on Sep 21, 2023
22 checks passed
@kakkoyun deleted the add_more_cpu_prof_metrics branch September 21, 2023 19:35
@kakkoyun mentioned this pull request Sep 21, 2023