@gma1k gma1k commented Dec 2, 2025

Add Advanced Observability Features

This PR adds new observability features to podtrace: stack trace capture for slow operations, lock contention tracking, TCP retransmission and network error tracking, extended syscall tracing, and database query tracing.

New Features

Stack Trace Capture for Slow Operations

  • Captures user-space stack traces for operations exceeding performance thresholds
  • Automatically captures stacks for slow I/O, DNS, network, lock contention, and database operations
  • Resolves stack frames to symbols using addr2line for human-readable diagnostics
  • Helps pinpoint exact code paths causing performance bottlenecks
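
To illustrate the symbol resolution step, here is a minimal Go sketch of shelling out to addr2line and parsing its output. It is not podtrace's actual code: the `Frame` type and the function names `addr2lineArgs`, `parseAddr2line`, and `resolve` are hypothetical. It relies on the documented behavior that `addr2line -f` prints the function name on one line and `file:line` on the next for each address. Runtime addresses from a position-independent binary would first need to be translated to file-relative offsets, which the sketch leaves out.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

// Frame is a hypothetical resolved stack frame (not podtrace's actual type).
type Frame struct {
	Function string
	Location string // "file:line", or "??:0" when addr2line cannot resolve it
}

// addr2lineArgs builds the arguments for: addr2line -f -C -e <binary> <addr>...
// -f prints function names, -C demangles C++ names, -e selects the binary.
func addr2lineArgs(binary string, addrs []uint64) []string {
	args := []string{"-f", "-C", "-e", binary}
	for _, a := range addrs {
		args = append(args, fmt.Sprintf("0x%x", a))
	}
	return args
}

// parseAddr2line pairs up output lines: function name, then file:line.
func parseAddr2line(out string) []Frame {
	var frames []Frame
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fn := sc.Text()
		if !sc.Scan() {
			break // odd number of lines: drop the incomplete pair
		}
		frames = append(frames, Frame{Function: fn, Location: sc.Text()})
	}
	return frames
}

// resolve runs addr2line (assumed to be on PATH) and returns resolved frames.
func resolve(binary string, addrs []uint64) ([]Frame, error) {
	out, err := exec.Command("addr2line", addr2lineArgs(binary, addrs)...).Output()
	if err != nil {
		return nil, err
	}
	return parseAddr2line(string(out)), nil
}

func main() {
	// Parse canned output so the example runs without addr2line installed.
	frames := parseAddr2line("handle_request\n/src/server.c:42\n??\n??:0\n")
	for i, f := range frames {
		fmt.Printf("#%d %s at %s\n", i, f.Function, f.Location)
	}
	_ = resolve // a real caller would pass a binary path and captured addresses
}
```

Passing all addresses of one stack in a single addr2line invocation avoids spawning one process per frame.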

Lock Contention & Synchronization Tracking

  • Tracks futex-based synchronization via do_futex kprobe
  • Monitors pthread mutex operations via uprobes
  • Measures lock acquisition time and identifies contention hotspots
  • Helps surface potential deadlocks and thread synchronization bottlenecks

TCP Retransmission & Network Error Tracking

  • Monitors TCP retransmissions via tcp_retransmit_skb tracepoint
  • Tracks network device errors and packet drops via net_dev_xmit tracepoint
  • Provides network quality diagnostics and congestion detection
  • Helps identify packet loss and network reliability issues

Extended Syscall Tracing

  • Process Execution: Tracks execve via do_execveat_common kprobe
  • Process Creation: Monitors fork/clone via sched_process_fork tracepoint
  • File Operations: Tracks open/openat via do_sys_openat2 kprobe
  • File Descriptor Management: Monitors close operations (when available)
  • FD Leak Detection: Identifies potential file descriptor leaks by comparing opens vs closes
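
The opens-versus-closes comparison can be sketched in Go as follows. This is an illustration under assumptions, not the PR's implementation: the `fdCounts` type, the `suspectedLeaks` function, and the threshold parameter are hypothetical. A large gap between opens and closes is only a hint, since long-lived descriptors are legitimate and close events may be unavailable on some systems.

```go
package main

import (
	"fmt"
	"sort"
)

// fdCounts holds hypothetical per-process totals of open and close events.
type fdCounts struct {
	Opens  uint64
	Closes uint64
}

// suspectedLeaks returns the PIDs whose opens exceed closes by more than threshold.
// The Opens > Closes check avoids unsigned underflow when closes outnumber opens
// (for example, when descriptors were opened before tracing started).
func suspectedLeaks(counts map[uint32]fdCounts, threshold uint64) []uint32 {
	var pids []uint32
	for pid, c := range counts {
		if c.Opens > c.Closes && c.Opens-c.Closes > threshold {
			pids = append(pids, pid)
		}
	}
	sort.Slice(pids, func(i, j int) bool { return pids[i] < pids[j] })
	return pids
}

func main() {
	counts := map[uint32]fdCounts{
		100: {Opens: 500, Closes: 498},
		200: {Opens: 900, Closes: 120},
	}
	fmt.Println(suspectedLeaks(counts, 100)) // prints [200]
}
```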

Database Query Tracing

  • Tracks PostgreSQL queries via PQexec uprobe (libpq)
  • Tracks MySQL queries via mysql_real_query uprobe (libmysqlclient)
  • Measures query execution time and extracts query patterns
  • Sanitizes queries for security by recording only the pattern (the first token), not the full query text
  • Supports multiple database drivers with automatic library detection
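
A minimal Go sketch of first-token sanitization is shown below. It assumes the pattern is simply the leading keyword of the query; the function name `queryPattern` and the upper-casing are illustrative choices, not taken from the PR, and the real extraction happens wherever podtrace reads the query string.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// queryPattern keeps only the leading run of letters of a query and drops
// everything after it, so table names and literal values are never retained.
// Queries that start with a comment or punctuation yield an empty pattern.
func queryPattern(query string) string {
	q := strings.TrimSpace(query)
	end := strings.IndexFunc(q, func(r rune) bool { return !unicode.IsLetter(r) })
	if end == -1 {
		end = len(q) // the whole query is a single word
	}
	return strings.ToUpper(q[:end])
}

func main() {
	fmt.Println(queryPattern("SELECT * FROM users WHERE email = 'a@example.com'"))
	fmt.Println(queryPattern("  insert into cards values ('4111-1111')"))
}
```

The trade-off is coarse grouping: all SELECT statements collapse into one pattern, which protects sensitive values at the cost of per-table detail.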

Technical Improvements

BPF Stack Overflow Fix

  • Replaced stack-allocated struct event with per-CPU array map (event_buf)
  • Replaced stack-allocated stack traces with per-CPU array map (stack_buf)
  • Reduced MAX_STACK_DEPTH from 64 to 32 to optimize memory usage
  • All BPF programs now comply with 512-byte stack limit

New BPF Maps

  • stack_traces: Stores completed stack traces
  • event_buf: Per-CPU temporary event storage
  • stack_buf: Per-CPU temporary stack trace storage
  • lock_targets: Stores lock identifiers for contention tracking
  • db_queries: Stores sanitized database query patterns
  • syscall_paths: Stores file paths for syscall operations

New Event Types

  • EVENT_LOCK_CONTENTION: Lock contention events
  • EVENT_TCP_RETRANS: TCP retransmission events
  • EVENT_NET_DEV_ERROR: Network device error events
  • EVENT_DB_QUERY: Database query events
  • EVENT_EXEC: Process execution events
  • EVENT_FORK: Process/thread creation events
  • EVENT_OPEN: File open events
  • EVENT_CLOSE: File close events

Diagnostic Enhancements

New Report Sections

  • Process and Syscall Activity: Exec/fork/open/close statistics, FD leak detection, top opened files
  • Stack Traces for Slow Operations: Grouped stack traces with symbol resolution
  • Lock Contention Analysis: Lock wait times and hotspot identification
  • Network Reliability: TCP retransmission and network error statistics
  • Database Query Performance: Query pattern analysis and execution latency

Enhanced Filtering

  • Added proc filter option for process lifecycle events
  • Updated --filter flag to support: dns, net, fs, cpu, proc
  • Filter combinations supported (e.g., --filter net,proc)
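
For illustration, a comma-separated filter value such as `net,proc` could be parsed and validated along these lines. This is a sketch under assumptions: the function name `parseFilter`, the tolerance for whitespace and mixed case, and the error message are hypothetical, and only the five category names come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// validFilters lists the categories named in this PR.
var validFilters = map[string]bool{
	"dns": true, "net": true, "fs": true, "cpu": true, "proc": true,
}

// parseFilter splits a comma-separated filter value and rejects unknown names.
// Empty segments (e.g. a trailing comma) are ignored.
func parseFilter(value string) (map[string]bool, error) {
	selected := map[string]bool{}
	for _, part := range strings.Split(value, ",") {
		name := strings.ToLower(strings.TrimSpace(part))
		if name == "" {
			continue
		}
		if !validFilters[name] {
			return nil, fmt.Errorf("unknown filter %q", name)
		}
		selected[name] = true
	}
	return selected, nil
}

func main() {
	selected, err := parseFilter("net,proc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(selected["net"], selected["proc"], selected["dns"])
}
```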

Breaking Changes

None - all changes are backward compatible.

…yscall tracing, network reliability, and database query monitoring
@gma1k gma1k merged commit ad1a612 into main Dec 2, 2025
1 check passed
@gma1k gma1k deleted the fe-feat branch December 2, 2025 13:40