gma1k/podtrace: eBPF-driven diagnostic tool for Kubernetes applications 🐝

A lightweight yet powerful eBPF-driven diagnostic tool for Kubernetes applications. Podtrace delivers full-stack observability from kernel events to application-layer behavior, all activated on demand, with no prior configuration or instrumentation. With a single command, it uncovers insights across the entire lifecycle of a pod, including network flows, TCP/UDP performance, file system activity, memory behavior, latency patterns, system calls, and high-level application events such as HTTP, DNS, and database queries.

Overview

Podtrace attaches eBPF programs directly to the container, allowing it to observe real behavior as it happens at runtime. It automatically correlates low-level kernel activity with high-level application operations, surfacing clear, human-readable diagnostic events that reveal what the pod is experiencing internally.

Instead of assembling data from multiple systems or modifying application code, Podtrace provides deep operational visibility in one place, enabling you to understand:

Why a service is slow
Where latency originates
How network and I/O resources are being used
Which operations block or fail
How requests flow through the application
What happens inside the pod during incidents

By combining system-level details, application-layer insights, and real-time event correlation, Podtrace acts as a single on-demand observability lens. This makes it uniquely effective for debugging, performance analysis, and production incident response in Kubernetes environments, especially when time, context, or access is limited.

Documentation

Podtrace documentation is available in the docs/ directory.

Four usage patterns

Podtrace ships one binary that supports four usage patterns. Pick the one that fits your workflow:

1. Standalone CLI binary

Best for ad-hoc, interactive debugging from a workstation or a privileged debug pod. Install via signed tarball:

curl -fsSL https://github.com/gma1k/podtrace/releases/latest/download/podtrace_linux_amd64.tar.gz \
  | sudo tar xz -C /usr/local/bin podtrace

(/usr/local/bin needs sudo. To install without sudo, extract to a user-writable directory on $PATH instead, e.g. mkdir -p ~/.local/bin && tar xz -C ~/.local/bin podtrace.)

Or via krew:

kubectl krew install podtrace
kubectl podtrace -n production my-pod

Then:

# Realtime trace
podtrace -n production my-pod

# Bounded diagnose with a JSON report
podtrace -n production my-pod --diagnose 30s --export json > report.json

By default the CLI spawns a privileged pod on the target pod's node and runs eBPF there. This works on Talos, EKS, GKE, AKS, OpenShift, and any other cluster where your workstation is not the kubelet host. On kind, minikube, and docker-desktop the workstation is the kubelet host, so add --local to skip the spawn and load eBPF on the workstation directly (faster, and no privileged pod is needed):

podtrace --local -n production my-pod
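
As a rough illustration of the rule above, the choice of --local can be scripted from the current kube context name. This is a sketch, not part of podtrace: local_flag_for_context is a hypothetical helper, and it assumes the default context names kind-<cluster>, minikube, and docker-desktop. Renamed contexts or custom minikube profiles will not match.

```shell
# Hypothetical helper (not shipped with podtrace): print "--local" when the
# kube context name looks like a single-host development cluster, otherwise
# print nothing so the default spawn path is used.
local_flag_for_context() {
  case "$1" in
    # ASSUMPTION: default context names created by kind, minikube, docker-desktop.
    kind-*|minikube|docker-desktop) printf '%s\n' "--local" ;;
    *) : ;;
  esac
}

# Example use (left unquoted on purpose so an empty result adds no argument):
#   ctx="$(kubectl config current-context)"
#   podtrace $(local_flag_for_context "$ctx") -n production my-pod
```

When the name-based guess is wrong, passing or omitting --local by hand remains the reliable option.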

For other platforms (linux/arm64, darwin/amd64, darwin/arm64) and cosign-verifiable installs, see docs/installation.md#install-the-cli. The CLI architecture (when to use --local, RBAC needed for the spawn path, etc.) is documented in docs/cli-architecture.md. Full CLI reference: docs/usage.md.

2. Continuous tracing via the `PodTrace` CR

Best for long-running observability: have the operator watch a selector cluster-wide and stream events through an agent DaemonSet.

apiVersion: podtrace.io/v1alpha1
kind: ExporterConfig
metadata: { name: prod-otlp, namespace: my-app }
spec:
  type: otlp
  otlp: { endpoint: otel-collector.observability:4318, protocol: http, insecure: true }
---
apiVersion: podtrace.io/v1alpha1
kind: PodTrace
metadata: { name: watch-api, namespace: my-app }
spec:
  selector: { matchLabels: { app: api } }
  filters: [dns, net]
  exporterRef: { name: prod-otlp }

kubectl apply -f trace.yaml
kubectl get podtraces.podtrace.io watch-api -n my-app -o yaml

Or skip the YAML and let the CLI author the PodTrace for you — target a whole application by label, across every namespace, and it keeps tracing through pod restarts and rollouts until you delete it:

# Trace an application everywhere, continuously:
podtrace watch --app api --all-namespaces --exporter prod-otlp

# Or target with any label selector (--label), scoped to one namespace:
podtrace watch --label app=api,tier=web -n my-app --name api-web --exporter prod-otlp

# Render the manifest instead of applying it:
podtrace watch --app api --all-namespaces --print-only

The same --app/--label/--all-namespaces targeting also works on the plain podtrace command for ephemeral tracing (stream to your terminal, no CR):

podtrace --app api -n my-app --diagnose 30s --filter dns,net

Rule of thumb: podtrace <targeting> = look now (terminal); podtrace watch <targeting> = record continuously (exporter).

Full reference: docs/crd-podtrace.md.

3. Bounded diagnose via the `PodTraceSession` CR

Best for repeatable, GitOps-driven diagnose runs that produce a shareable report artifact. Equivalent to the CLI's --diagnose mode but operator-managed and multi-tenant.

apiVersion: podtrace.io/v1alpha1
kind: PodTraceSession
metadata: { name: diag-api, namespace: my-app }
spec:
  selector: { matchLabels: { app: api } }
  duration: 30s
  filters: [dns, net]
  exporterRef: { name: prod-otlp }
  reportRef:
    configMap: { name: api-diag-report }

kubectl apply -f session.yaml
kubectl get podtracesession diag-api -n my-app -w   # wait for Completed
kubectl get cm api-diag-report -n my-app -o jsonpath='{.data.report\.txt}' | less
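
For scripted runs (CI jobs, GitOps hooks) the interactive -w watch can be swapped for a polling loop. The sketch below is an assumption-laden example: wait_for_session is a hypothetical helper, and it assumes the session phase is exposed at .status.state with the value Completed. Check docs/crd-podtracesession.md for the actual status fields before relying on it.

```shell
# Hypothetical helper: poll a PodTraceSession until it reports Completed.
# Usage: wait_for_session <name> <namespace> [timeout-seconds] [poll-interval-seconds]
wait_for_session() {
  local name="$1" ns="$2" timeout="${3:-120}" interval="${4:-2}"
  local waited=0 state=""
  while [ "$waited" -lt "$timeout" ]; do
    # ASSUMPTION: the phase lives at .status.state (not confirmed by this README).
    state="$(kubectl get podtracesession "$name" -n "$ns" \
      -o jsonpath='{.status.state}' 2>/dev/null || true)"
    if [ "$state" = "Completed" ]; then
      echo "session $name completed after ~${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s (last state: ${state:-unknown})" >&2
  return 1
}

# Example:
#   wait_for_session diag-api my-app 180 && \
#     kubectl get cm api-diag-report -n my-app -o jsonpath='{.data.report\.txt}'
```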

This path runs the same eBPF stack the CLI uses, but as a per-node privileged Job. Results land in three parallel channels:

status.summary — aggregated event counts
status.jobs[].eventCount — per-node breakdown
reportRef.configMap (or .secret) — full human-readable report
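
The per-node counts in status.jobs[].eventCount can be totalled with kubectl's JSONPath output, which prints multiple matches separated by spaces. A minimal sketch follows; session_event_total is a hypothetical helper name, and it assumes eventCount is a plain integer on every job entry.

```shell
# Hypothetical helper: sum status.jobs[].eventCount across all per-node jobs.
# Usage: session_event_total <name> <namespace>
session_event_total() {
  kubectl get podtracesession "$1" -n "$2" \
    -o jsonpath='{.status.jobs[*].eventCount}' |
    tr -s ' ' '\n' |                              # one count per line
    awk 'NF { total += $1 } END { print total + 0 }'   # prints 0 when there are no jobs
}

# Example:
#   session_event_total diag-api my-app
```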

Full reference: docs/crd-podtracesession.md.

4. Recurring diagnose via the `PodTraceSchedule` CR

Best for nightly diagnose sweeps and on-call probes where you want the last N runs ready to inspect on demand. The schedule fires a fresh PodTraceSession on every cron tick, owns each child via owner references, and prunes history per the configured limits.

apiVersion: podtrace.io/v1alpha1
kind: PodTraceSchedule
metadata: { name: nightly-diagnose, namespace: my-app }
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulSessionsHistoryLimit: 7
  sessionTemplate:
    spec:
      selector: { matchLabels: { app: api } }
      duration: 5m
      filters: [dns, net]
      exporterRef: { name: prod-otlp }
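
The schedule value reads as a standard five-field cron expression (minute, hour, day of month, month, day of week), so "0 2 * * *" fires at 02:00 every day. The sketch below only splits an expression into labelled fields as a reading aid. describe_cron is a hypothetical helper, and whether PodTraceSchedule accepts anything beyond plain five-field syntax is not stated here.

```shell
# Hypothetical helper: label the five fields of a cron expression.
# The function body runs in a subshell so that "set -f" does not leak to the caller.
describe_cron() (
  set -f            # disable globbing so "*" fields are not expanded to file names
  set -- $1         # split the expression on whitespace into $1..$n
  if [ "$#" -ne 5 ]; then
    echo "expected 5 fields, got $#" >&2
    exit 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
)

# Example:
#   describe_cron "0 2 * * *"
#   minute=0 hour=2 day-of-month=* month=* day-of-week=*
```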

Manual one-off trigger from the CLI:

kubectl podtrace schedule trigger nightly-diagnose -n my-app

Full reference: docs/crd-podtraceschedule.md.

Install the operator

The fastest path is the one-shot quickstart manifest — operator + CRDs + a sample nginx workload + PodTraceSession that reaches state: Completed and writes a report to a ConfigMap. Single kubectl apply, no Helm, no clone, no build toolchain:

kubectl apply -f https://github.com/gma1k/podtrace/releases/latest/download/quickstart.yaml

# Watch the demo session reach Completed (~45-60s end-to-end)
kubectl get podtracesession demo-trace -n podtrace-demo -w

# Read the report
kubectl get cm nginx-trace-report -n podtrace-demo \
  -o jsonpath='{.data.report\.txt}'

# Tear down the demo (operator + sample workload + CRDs)
kubectl delete ns podtrace-system podtrace-demo
kubectl delete crd -l app.kubernetes.io/name=podtrace

On OpenShift or any OLM-managed cluster, podtrace is also available via the OperatorHub.io community catalog:

# OpenShift Console: Operators → OperatorHub → search "podtrace" → Install
# Or apply the Subscription manifest directly:
kubectl apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata: { name: podtrace, namespace: operators }
spec:
  channel: stable
  name: podtrace
  source: operatorhubio-catalog
  sourceNamespace: olm
EOF

For production Helm-managed deployments — custom values, validating webhook, multi-tenant agent config — install via the published OCI chart in GHCR:

helm install podtrace oci://ghcr.io/gma1k/charts/podtrace \
  --namespace podtrace-system --create-namespace

Verify the image was built by this repository:

cosign verify ghcr.io/gma1k/podtrace:latest \
  --certificate-identity-regexp 'https://github.com/gma1k/podtrace/.+' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

For prerequisites, supported kernels, and per-distro notes see docs/installation.md and docs/compatibility.md. For chart values and operator architecture see docs/operator.md.

Building from source (for contributors or air-gapped clusters) is covered below under Building.

Coming from the CLI

If you're already a CLI user and want a translation table from the CLI to CRs, see docs/migration.md.

Features

Network Tracing

TCP Connection Monitoring: Tracks TCP IPv4/IPv6 connection latency and errors
TCP RTT Analysis: Detects RTT spikes and retry patterns
TCP State Tracking: Monitors TCP connection state transitions (SYN, ESTABLISHED, FIN, etc.)
TCP Retransmission Tracking: Detects TCP retransmissions for network quality diagnostics
Network Device Errors: Monitors network interface errors and packet drops
UDP Network Tracing: Tracks UDP send/receive operations with latency and bandwidth metrics
I/O Bandwidth Tracking: Monitors bytes transferred for TCP/UDP send/receive operations

File System Monitoring

File Operations: Tracks read, write, and fsync operations with latency analysis
File Path Tracking: Captures full file paths
I/O Bandwidth: Monitors bytes transferred for file read/write operations
Throughput Analysis: Calculates average throughput and peak transfer rates

Memory & System Events

Page Fault Tracking: Monitors page faults with error code analysis
OOM Kill Detection: Tracks out-of-memory kills with memory usage details

Application Layer

HTTP Tracing: HTTP request/response tracking via uprobes
DNS Tracking: Monitors DNS lookups with latency and error tracking
Database Query Tracing: Tracks PostgreSQL and MySQL query execution with pattern extraction and latency analysis
TLS/SSL Handshake Tracking: Tracks TLS handshake latency, errors, and failures
Connection Pool Monitoring: Tracks connection pool usage, monitors pool exhaustion, and tracks connection reuse patterns
Redis Tracing: Captures hiredis redisCommand / redisCommandArgv calls with command name and latency (no application changes required)
Memcached Tracing: Captures libmemcached get, set, and delete operations with key and value size
FastCGI / PHP-FPM Tracing: Tracks FastCGI request URI, method, and end-to-end latency via unix-socket kprobes (BTF-only)
gRPC Method Tracing: Extracts gRPC method paths from HTTP/2 HEADERS frames via a second kprobe on tcp_sendmsg (BTF-only)
Kafka Tracing: Tracks librdkafka rd_kafka_produce and rd_kafka_consumer_poll with topic name, payload size, and latency
USDT Auto-Detection: Scans ELF binaries for .note.stapsdt sections and reports available userspace tracepoints
Critical Path Reconstruction: Automatically correlates per-request latency segments by PID and emits a breakdown on HTTP/FastCGI/gRPC response boundaries
PII Redaction: Applies configurable regex rules to scrub passwords, Bearer tokens, email addresses, and credit card numbers from event fields before dispatch
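
To make the PII Redaction item concrete, here is a small sed-based sketch of regex scrubbing applied to event text before it leaves the node. The three patterns are illustrative assumptions, not Podtrace's built-in rules or its configuration format, and redact is a hypothetical helper name.

```shell
# Hypothetical helper: scrub a few common secrets from lines read on stdin.
# The patterns are deliberately simple examples and will miss edge cases.
redact() {
  sed -E \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#(password=)[^&[:space:]]+#\1[REDACTED]#g' \
    -e 's#[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}#[REDACTED_EMAIL]#g'
}

# Example:
#   printf '%s\n' 'Authorization: Bearer abc.def-123' | redact
#   Authorization: Bearer [REDACTED]
```

Keeping the key (Bearer, password=) and replacing only the value preserves enough context to debug a request without exposing the secret.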

System Monitoring

CPU/Scheduling Tracking: Monitors thread blocking and CPU scheduling events
CPU Usage per Process: Shows CPU consumption by process
Process Activity Analysis: Shows which processes are generating events
Stack Traces for Slow Operations: Captures user-space stack traces for slow I/O, DNS, CPU blocks, memory faults, and other operations exceeding thresholds
Lock Contention Tracking: Monitors futex and pthread mutex waits with timing and hot lock identification
Syscall Tracing: Tracks process lifecycle via execve, fork/clone, open/openat, and close syscalls with file descriptor leak detection
Network Reliability: Monitors TCP retransmissions and network device errors for network quality diagnostics
Database Query Tracing: Tracks PostgreSQL and MySQL query execution patterns and latency
Resource Limit Monitoring: Monitors resource usage against configured limits
Error Correlation with Root Cause Analysis: Correlates errors with operations and Kubernetes context

Multi-Pod Tracing

Dynamic Multi-Pod Targeting: Trace multiple pods in one run using explicit pod lists, label selectors, or namespace-wide selection
Cross-Namespace Support: Trace pods across namespaces with --namespaces and selector-based targeting
Live Target Updates: Automatically refresh target pod/cgroup filters when pods are added, updated, or deleted

Distributed Tracing

Trace Context Extraction: Automatically extracts trace context from HTTP/HTTP2 headers and gRPC metadata
Event Correlation: Groups events by trace ID to build complete request flows across services
Request Flow Graphs: Builds directed graphs showing service interactions with latency and error metrics
Multiple Exporters: Supports OpenTelemetry (OTLP), Jaeger, Splunk HEC, Datadog, and Zipkin
Sampling Support: Configurable sampling rates to control export volume
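
As an illustration of trace context extraction, the sketch below pulls the trace ID out of a W3C Trace Context traceparent header value (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). That Podtrace reads this particular header format is an assumption; the README only says trace context is extracted from HTTP/HTTP2 headers and gRPC metadata. trace_id_from_traceparent is a hypothetical helper.

```shell
# Hypothetical helper: print the trace ID from a traceparent value, or nothing
# when the value does not match the expected layout.
# Layout: <2 hex version>-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
trace_id_from_traceparent() {
  printf '%s\n' "$1" |
    sed -nE 's/^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/\1/p'
}

# Example:
#   trace_id_from_traceparent 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#   4bf92f3577b34da6a3ce929d0e0e4736
```

Events that carry the same extracted ID can then be grouped into a single request flow.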

Performance Profiling

pprof & perf Integration: Discovers and fetches heap, goroutine, and CPU profiles from target pod pprof HTTP endpoints
On-demand Profiling Triggers: Activate profiling via the /profile/start management endpoint or automatically when slow events exceed configurable thresholds
CPU/Memory Profiling Correlation: Ties BPF SchedSwitch stack traces to slow events, surfacing the exact goroutine and CPU stacks active during high-latency periods
BPF ktime ↔ Wall-clock Alignment: Derives a monotonic offset so kernel timestamps map accurately to wall time for precise correlation
Profiling Section in Reports: Correlation results are appended to both diagnose-mode and normal-mode reports
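
The ktime-to-wall-clock alignment comes down to one subtraction and one addition: sample the wall clock and the monotonic clock at (nearly) the same instant, keep the difference as an offset, and add that offset to each kernel timestamp. The sketch below shows only the arithmetic, with nanosecond values passed in as arguments; both function names are hypothetical and this is not Podtrace's implementation.

```shell
# offset = wall-clock ns minus monotonic ns, both sampled together.
# Usage: clock_offset_ns <wall_ns> <monotonic_ns>
clock_offset_ns() {
  echo $(( $1 - $2 ))
}

# Wall-clock ns for an event = its kernel monotonic timestamp plus the offset.
# Usage: ktime_to_wall_ns <event_ktime_ns> <offset_ns>
ktime_to_wall_ns() {
  echo $(( $1 + $2 ))
}

# Example with made-up sample values:
#   offset="$(clock_offset_ns 1700000000000000000 5000000000)"
#   ktime_to_wall_ns 5250000000 "$offset"
#   1700000000250000000   (250 ms after the sampled wall-clock instant)
```

The offset drifts whenever the wall clock is stepped or slewed, so a long-running tracer would need to resample it periodically.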

Diagnostics

Diagnose Mode: Collects events for a specified duration and generates a comprehensive summary report

Alerting

Real-time Alerts: Sends immediate notifications when fatal, critical, or warning-level issues are detected
Multiple Channels: Supports webhooks, Slack, and Splunk HEC for alert delivery
Smart Deduplication: Prevents alert storms with configurable deduplication windows
Rate Limiting: Configurable rate limits to prevent overwhelming notification systems
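
A deduplication window can be pictured as "send an alert only if the same key was not sent within the last N seconds". The sketch below applies that rule to a stream of lines; the input format (epoch seconds, then an alert key) and the dedup_alerts name are hypothetical, and Podtrace's own keying and window semantics may differ.

```shell
# Hypothetical helper: suppress repeats of the same alert key inside a window.
# stdin: "<epoch-seconds> <alert-key>" per line, in time order
# $1:    window length in seconds
dedup_alerts() {
  awk -v window="$1" '
    # Emit when the key is new, or when the last emitted alert for this key
    # is at least <window> seconds old; remember the emit time either way.
    !($2 in last) || $1 - last[$2] >= window { print; last[$2] = $1 }
  '
}

# Example (60-second window): the alert at t=110 is dropped, t=170 is sent again.
#   printf '%s\n' '100 dns-timeout' '110 dns-timeout' '105 oom-kill' '170 dns-timeout' | dedup_alerts 60
#   100 dns-timeout
#   105 oom-kill
#   170 dns-timeout
```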

Prerequisites

Linux kernel 5.8+ with BTF support
Go 1.26+ (or Go 1.21+ with GOTOOLCHAIN=auto, which downloads the required toolchain automatically)
Kubernetes cluster access

Building

# Install dependencies
make deps

# Build eBPF program and Go binary
make build

# Build and set capabilities
make build-setup

Usage

Basic Usage

# Trace a pod in real-time
./bin/podtrace -n production my-pod

# Run in diagnostic mode
./bin/podtrace -n production my-pod --diagnose 20s

Diagnose Report

The diagnose mode generates a comprehensive report including:

Summary Statistics: Total events, events per second, collection period
DNS Statistics: DNS lookup latency, errors, top targets
TCP Statistics: RTT analysis, spike detection, send/receive operations, bandwidth metrics (total bytes, average bytes, peak bytes, throughput)
UDP Statistics: Send/receive operations, latency analysis, bandwidth metrics, error tracking
Connection Statistics: IPv4/IPv6 connection latency, failures, error breakdown, top targets
TCP Connection State Tracking: State transition analysis, state distribution, connection lifecycle monitoring
File System Statistics: Read, write, and fsync operation latency, slow operations, bandwidth metrics (total bytes, average bytes, throughput)
HTTP Statistics: Request/response counts, latency analysis, bandwidth metrics, top requested URLs
Memory Statistics: Page fault counts and error codes, OOM kill tracking with memory usage details
CPU Statistics: Thread blocking times and scheduling events
CPU Usage by Process: CPU percentage per process
Process Activity: Top active processes by event count
Activity Timeline: Event distribution over time
Activity Bursts: Detection of burst periods
Connection Patterns: Analysis of connection behavior
Network I/O Patterns: Send/receive ratios and throughput analysis
Process and Syscall Activity: Process execution, fork/clone, file operations, and file descriptor leak detection
Stack Traces for Slow Operations: User-space stack traces for operations exceeding thresholds with symbol resolution
Lock Contention Analysis: Futex and pthread mutex wait times and hot lock identification
Network Reliability: TCP retransmission tracking and network device error monitoring
Database Query Performance: Query pattern analysis and execution latency (PostgreSQL, MySQL)
Connection Pool Statistics: Connection pool usage, acquire/release rates, reuse patterns, and exhaustion events
Potential Issues: Automatic detection of high error rates and performance problems
Resource Limit Monitoring: Resource usage compared against configured limits
Error Correlation with Root Cause Analysis: Correlates errors with operations and Kubernetes context

Running without sudo

After building, set capabilities to run without sudo:

sudo ./scripts/setup-capabilities.sh

License

Podtrace is dual-licensed:

Go code is licensed under the Apache License 2.0.
eBPF programs under bpf/ are licensed under GPL-2.0 (declared via SPDX-License-Identifier: GPL-2.0 headers). A GPL-compatible license declaration is required for BPF programs to call GPL-only kernel helpers; the kernel checks it when the program is loaded.
