diff --git a/.gitignore b/.gitignore index 131bbe1..6761c7c 100644 --- a/.gitignore +++ b/.gitignore @@ -77,3 +77,4 @@ tmp/ temp/ cache/ .tmp/ +docs/plans/ diff --git a/CLAUDE.md b/CLAUDE.md index 1e18049..7fc5a60 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -47,3 +47,10 @@ Use the research note for implementation context: - Read `README.md` first for the project summary. - Read `docs/brainstorms/java-profiler-requirements.md` before making product decisions. - Read `docs/research/coroot-node-agent-java-agent.md` when reasoning about Coroot or async-profiler behavior. + +## Design System + +Always read `DESIGN.md` before making any visual or UI decisions. +All font choices, colors, spacing, layout density, and aesthetic direction are defined there. +Do not deviate without explicit user approval. +In QA or review mode, flag UI code that does not match `DESIGN.md`. diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 0000000..2e14252 --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,188 @@ +# Design System - Java Profiler + +## Product Context + +- **What this is:** A Java-focused production profiling workbench for Kubernetes services. It helps engineers inspect real async-profiler evidence from a selected Java process during production performance incidents. +- **Who it's for:** Senior Java developers, SREs, platform engineers, and incident responders handling P0/P1 CPU, allocation, lock, and JVM performance problems. +- **Space/industry:** Kubernetes Java performance profiling and incident diagnostics. +- **Project type:** Data-dense operational web app, not a marketing site or general observability dashboard. + +## Aesthetic Direction + +- **Direction:** Industrial / Utilitarian / Forensic. +- **Decoration level:** Minimal. +- **Mood:** Serious, precise, and evidence-first. The UI should feel like a production incident workbench, not a dashboard built for screenshots. +- **Reference products:** Datadog Continuous Profiler, Dynatrace CPU profiling, Grafana Pyroscope. 
These are references for profiler interaction patterns only; this project must keep its Java/Kubernetes scope and avoid becoming a general observability suite. + +## Core Design Principles + +1. **Evidence before decoration.** Freshness, drop rate, sampling frequency, Pod/JVM scope, and CPU quota baseline must be visible near the data they qualify. +2. **Java semantics stay visible.** Use JVM, HotSpot, JIT, allocation, lock, stack frame, and method-signature language directly. Do not flatten the product into a generic multi-language profiler. +3. **MVP stays narrow.** The first UI should optimize one expert path: choose a Java Pod, inspect CPU profile, find top methods, select a frame, copy/share the evidence. +4. **Numbers must carry units.** Avoid raw sample counts in primary UI. Convert samples into time, cores, percentages with explicit baseline, or rates. +5. **Noise is optional.** Native/system frames must be hideable. Expert users should be able to return visual focus to Java application frames quickly. +6. **Collaboration is part of incident response.** Share, copy stack, and permalink actions are core workbench actions, not polish. + +## MVP Screen Scope + +The initial UI should be a **single Java Pod CPU profile view**. + +### Include in MVP + +- Top context bar with Namespace, Service, Pod, time window, and evidence health. +- Explicit evidence health: freshness lag, drop rate, sampling frequency, and collection status. +- CPU profile type as the primary view. +- Flame Graph and Top Methods views, with a combined "Both" mode. +- Self CPU and Total CPU with time/cores conversion. +- Selected Frame detail drawer with FQCN, method signature, line number when available, Self/Total, baseline, JIT status when known, and stack path. +- Hide Native/System Frames toggle. +- Search by class, method, and line number. +- Copy Stack, Share/Permalink, Focus, Back, and Reset actions. +- Light and dark mode compatibility. 
+ +### Defer from MVP + +- A/B Comparison as a primary panel. +- Wall Clock, GC, and I/O detail views. +- JVM event timeline correlation. +- Multi-Pod service rollup. +- Release/version comparison. +- AI-generated interpretation blocks. +- Code viewer integration beyond copy/permalink actions. + +Deferred features may appear as disabled navigation items or roadmap notes only when doing so does not distract from the core CPU profile workflow. + +## Future Evidence Views + +These features are part of the product direction, but should not expand the first UI implementation unless the active plan explicitly includes them. + +- **A/B Comparison:** Compare equivalent evidence across two contexts, such as normal Pod versus anomalous Pod, baseline time window versus incident window, or release A versus release B. The first comparison mode should work without release metadata by comparing two time windows. +- **Wall Clock:** Analyze runnable plus blocked time for Java services where request latency is not explained by CPU. +- **GC:** Show GC pause, allocation pressure, and JVM event evidence correlated to the selected profile window. +- **I/O:** Show network and disk blocking time when supported by the collected evidence. +- **Service Rollup:** Show Pod variance and outlier detection, then let users drill down to a single Java Pod. + +## Typography + +- **UI / Body:** IBM Plex Sans. It is readable, neutral, technical, and works well in dense operational interfaces. +- **Data / Tables:** IBM Plex Mono with tabular numbers. Use for CPU cores, percentages, timestamps, Pod names, sample rates, and profile IDs. +- **Code / Stack Frames:** JetBrains Mono. Use for method names, stack paths, FQCNs, and snippets. +- **Loading strategy:** Prefer self-hosted fonts in production. Google Fonts or Bunny Fonts are acceptable for prototypes only. + +## Type Scale + +- **Page title:** 20px / 28px, 600. +- **Panel title:** 14px / 20px, 600. +- **Body:** 13px / 20px, 400. 
+- **Table body:** 12px / 18px, 400. +- **Labels / headers:** 11px / 16px, 700, uppercase only for compact labels. +- **Stack frames:** 11px / 16px, JetBrains Mono. +- **Metric values:** 18-20px / 24px, IBM Plex Mono, 600. + +Do not scale font size with viewport width. Keep letter spacing at `0` except compact uppercase labels, where `0.04em` is acceptable. + +## Color + +- **Approach:** Restrained semantic palette. Color exists to communicate evidence type, state, and severity. +- **Background:** `#F7F8FA` +- **Surface:** `#FFFFFF` +- **Surface muted:** `#EEF1F4` +- **Border:** `#D5DAE1` +- **Strong border:** `#B6BEC8` +- **Text:** `#172026` +- **Muted text:** `#5D6975` +- **Primary:** `#0F766E` +- **CPU:** `#C2410C` +- **Wall Clock:** `#0F766E` +- **Allocation:** `#2563EB` +- **GC:** `#7C2D12` +- **I/O:** `#6D5D00` +- **Lock:** `#854D0E` +- **Success:** `#15803D` +- **Warning:** `#B45309` +- **Error:** `#B42318` +- **Info:** `#2563EB` + +## Color Usage Rules + +- Use strong semantic colors for labels, icons, active borders, legends, and selected states. +- Use low-saturation or translucent variants for large filled areas such as flame graph frames. +- Flame graph fill examples: + - CPU: `rgba(194, 65, 12, 0.12-0.28)` + - Wall Clock: `rgba(15, 118, 110, 0.12-0.24)` + - Allocation: `rgba(37, 99, 235, 0.10-0.22)` + - GC: `rgba(124, 45, 18, 0.10-0.22)` + - I/O: `rgba(109, 93, 0, 0.10-0.22)` +- Do not use gradients, decorative color blobs, or large saturated background bands. +- Validate contrast for every foreground/background pair used in table rows, flame frames, tooltips, badges, and controls. + +## Dark Mode + +Dark mode should be redesigned, not merely inverted. + +- **Background:** `#101214` +- **Surface:** `#171B1F` +- **Surface muted:** `#20262B` +- **Border:** `#2B3238` +- **Strong border:** `#3B454D` +- **Text:** `#E7EAEE` +- **Muted text:** `#A4ACB5` + +Reduce large-area saturation in dark mode. Keep selected states legible without glowing effects. 
+ +## Spacing + +- **Base unit:** 4px. +- **Density:** Compact. +- **Scale:** 2px, 4px, 8px, 12px, 16px, 24px, 32px, 48px, 64px. +- **Table rows:** 34px default. +- **Toolbar controls:** 32px height. +- **Tabs:** 26-28px height. +- **Panel padding:** 10-14px. +- **Page padding:** 14-16px. + +Dense does not mean cramped. Preserve enough row height for scanning long method names and numeric columns without vertical jitter. + +## Layout + +- **Approach:** Grid-disciplined workbench. +- **Primary shell:** top context bar, left evidence/scope navigation, central profile workspace, right selected-frame detail drawer. +- **MVP layout:** three columns on desktop: left navigation around 180-200px, central workspace fluid, detail drawer around 300-340px. +- **Responsive behavior:** collapse the detail drawer below the main workspace on medium screens, and stack all regions on narrow screens. +- **Max content width:** None for the workbench. Use full available width because flame graphs and tables need horizontal space. +- **Border radius:** 4px for small controls, 6px for buttons/chips, 8px maximum for panels and cards, `9999px` only for compact badges. + +Avoid nested cards and decorative card mosaics. Use panels only for actual work surfaces: flame graph, table, selected frame, event detail, or controls. + +## Interaction + +- **Search:** `Cmd/Ctrl + K` opens class/method search. +- **Flame graph traversal:** Arrow keys move between frames, `Enter` focuses, `Esc` backs out or resets. +- **Table sorting:** Column headers must sort, especially Self CPU and Total CPU. +- **Tooltips:** Hovering a method/frame should show rich detail: FQCN, method signature, line number, Self/Total, converted time/cores, baseline, JIT/inlining state when known, and copy/focus actions. +- **Share:** Permalinks should preserve namespace, service, Pod, time range, profile type, search term, hide-native state, and focused frame when possible. 
+- **Copy Stack:** Must produce a complete stack path suitable for incident notes, Jira, Slack, or ticket systems. + +## Motion + +- **Approach:** Minimal-functional. +- **Durations:** 80-100ms for micro feedback, 150-200ms for drawer/tab transitions, 240ms maximum for larger layout changes. +- **Easing:** ease-out for entering, ease-in for exiting, ease-in-out for movement. +- **Avoid:** Decorative motion, looping effects, animated backgrounds, or refresh animations that make data appear unstable. + +## Accessibility + +- Maintain keyboard parity for search, focus, back, reset, and table sorting. +- Keep focus rings visible and consistent. +- Do not encode state by color alone. Pair color with text, icon, or badge labels. +- Use zebra striping or subtle row separators for dense tables. +- Keep method names accessible via tooltip or expandable detail when truncated. + +## Decisions Log + +| Date | Decision | Rationale | +|------|----------|-----------| +| 2026-05-17 | Adopt Industrial / Utilitarian / Forensic direction | Production profiling is an incident-response workflow; precision and trust matter more than decorative appeal. | +| 2026-05-17 | Scope MVP to single Java Pod CPU profile | This preserves the core expert workflow while avoiding premature complexity from A/B diff, event correlation, and multi-Pod aggregation. | +| 2026-05-17 | Use IBM Plex Sans, IBM Plex Mono, and JetBrains Mono | The product is data-heavy and Java-code-heavy; these fonts support dense scanning and developer trust. | +| 2026-05-17 | Use low-saturation flame graph fills | Large saturated profiling blocks reduce readability and can break contrast; strong colors are reserved for labels, legends, and selected states. | diff --git a/README.md b/README.md index fa729cc..1c445e9 100644 --- a/README.md +++ b/README.md @@ -1,146 +1,86 @@ # java-profiler -Java Profiler is a focused performance profiling system for Java services running on Kubernetes. 
It helps service owners and platform engineers answer one question with real data: where is this Java service spending CPU time, allocating memory, or waiting on locks? +Java performance profiling for Kubernetes services. Find where a HotSpot JVM is spending CPU, allocating memory, waiting on locks, pausing for GC, or blocking on Java I/O, using real async-profiler/JFR-derived data and a service-focused UI. -The project is intentionally narrower than a full observability platform. It provides node-local profiling, bounded profile storage, and a service-centric UI for profile analysis, target status, and ingestion health. +[Docs](https://koolay.github.io/java-profiler/) · [中文文档](https://koolay.github.io/java-profiler/zh/) · [Quickstart](https://koolay.github.io/java-profiler/getting-started/quickstart) · [Analyze a service](https://koolay.github.io/java-profiler/operations/performance-analysis-user-manual) · [Contributing](https://koolay.github.io/java-profiler/contributing/development) -## What It Analyzes +## Why java-profiler -- CPU hotspots: high-cost Java methods, self time, total time, and full sampled stack context. -- Allocation hotspots: methods and call paths that create allocation pressure. -- Lock delay: synchronized or monitor-related paths that block under contention. -- Thread evidence: thread snapshots that help cross-check CPU, lock, sleep, and blocked states. -- Deadlock evidence: deadlock events when the target JVM reports them. -- Profiling health: whether a JVM was accepted, disabled, unsupported, failed attach, conflicted with another profiler, or produced rejected/dropped ingestion data. +Most observability stacks tell you that a Java service is slow. `java-profiler` is for the next question: which Java stack is responsible? -## How It Works +- **Kubernetes-native opt-in**: enable profiling with annotations or labels. No application code changes. 
+- **Real JVM profile data**: CPU, Wall Clock, allocation, lock-delay, Java I/O wait, and GC evidence come from async-profiler/JFR-derived collection. +- **Expert Java workbench**: Top Table, Flame Graph, selected-frame details, native-frame filtering, target status, deadlocks, and ingestion health in one workflow. +- **Ownable storage**: profile data lands in ClickHouse with retention bounded to 7 days or less. +- **Focused scope**: no required Pyroscope, Parca, or Grafana backend. +- **Built for proof**: real acceptance requires non-empty CPU, Wall Clock, Java I/O wait, GC, allocation, lock, ClickHouse, ingestion, and browser UI evidence. -The first version assumes: +## Quickstart -- Java workloads run in Kubernetes. -- Profiling is opt-in through Kubernetes annotations or labels. -- A DaemonSet collector runs node-local and discovers Java processes through node/pod metadata. -- HotSpot-compatible JVMs are supported first. -- async-profiler collects CPU, allocation, and lock profiles. -- The backend stores profile data in ClickHouse with retention bounded to 7 days or less. -- The Web UI provides a compact service-diagnosis workflow for status, CPU, memory allocation, locks, deadlocks, and ingestion. +Enable temporary profiling on a workload pod template: -## Current State - -The repository has moved beyond documentation-only planning. The implementation is split across: - -```text -cmd/ - backend/ - collector/ -backend/ - internal/ -collector/ - internal/ -contracts/ - profiling/ -java-helper/ - thread-diagnostics/ -examples/ - jdk17-http-demo/ -web/ - src/ -deploy/ - helm/ -docs/ - architecture/ - brainstorms/ - operations/ - research/ -``` - -Release delivery is automated from `vX.Y.Z` tag pushes. The workflow publishes backend, collector, and web images to GHCR, emits SBOM/provenance attestations, packages the Helm chart, and creates the matching GitHub Release with image digests. 
- -## Quick Verification - -Run local checks before changing profiling, ingestion, backend APIs, or UI behavior: - -```bash -go test ./... -javac --release 11 java-helper/thread-diagnostics/src/main/java/com/ebpfjava/threads/*.java -cd examples/jdk17-http-demo && mvn test -cd web && npm ci && npm test && npm run build -``` - -Optional local ClickHouse-compatible smoke check using chDB: - -```bash -scripts/verify-chdb-local.sh +```yaml +metadata: + annotations: + java-profiler.io/profile-mode: temporary + java-profiler.io/profile-duration: 15m ``` -The script skips cleanly when `libchdb` is not installed. Use `CHDB_REQUIRED=1` to make missing chDB fail automation. - -## Real Kubernetes Acceptance +Open the Web UI, select the namespace, service, and time range, then start with: -Real acceptance is required for changes that affect collector profiling, ingestion, ClickHouse storage, backend query APIs, Kubernetes deployment, the JDK17 demo service, or the profile UI. +- `status` to confirm the JVM was accepted. +- `cpu` to find expensive Java methods. +- `wall` when latency is not explained by CPU alone. +- `io` to isolate Java-owned socket or file blocking paths. +- `gc` to correlate JVM pause evidence with allocation pressure. +- `memory` to inspect allocation pressure. +- `locks` and `deadlocks` to investigate contention. +- `ingestion` to confirm profile batches were accepted. -Use a real cluster: +See the [Quickstart](docs/getting-started/quickstart.md) and [Performance Analysis Manual](docs/operations/performance-analysis-user-manual.md). 
-```bash -export KUBECONFIG=$HOME/backup/localk8s.yaml -``` - -Build backend, collector, and web images from the current workspace: +## What it analyzes -```bash -export BACKEND_IMAGE=java-profiler-backend:qa-$(date +%Y%m%d%H%M%S) -export COLLECTOR_IMAGE=java-profiler-collector:qa-$(date +%Y%m%d%H%M%S) -export WEB_IMAGE=java-profiler-web:qa-$(date +%Y%m%d%H%M%S) - -bash scripts/build-real-acceptance-images.sh -``` +- CPU hotspots: high-cost Java methods, self time, total time, and sampled stack context. +- Wall Clock latency: Java stack time spent runnable, blocked, waiting, sleeping, or doing I/O. +- Java I/O wait: socket or file blocking paths when JVM/JFR evidence preserves Java ownership. +- GC pauses: JVM GC event evidence correlated with allocation profiles and the incident window. +- Allocation hotspots: methods and call paths creating allocation pressure. +- Lock delay: synchronized or monitor paths that block under contention. +- Thread evidence: snapshots for CPU, lock, sleep, blocked, and waiting states. +- Deadlock evidence: deadlock cycles reported by the target JVM. +- Profiling health: accepted, disabled, unsupported, attach failure, profiler conflict, rejected upload, or dropped ingestion data. 
-Run strict real profiling acceptance against the JDK17 demo target: +## How it works -```bash -scripts/real-acceptance.sh \ - --service jdk17-http-demo \ - --configure-profiler \ - --require-full-profiling \ - --high-volume \ - --artifact-dir /tmp/java-profiler-real-acceptance-$(date +%Y%m%d%H%M%S) +```text +Kubernetes metadata + | + v +Node-local collector DaemonSet + | + v +async-profiler/JFR + thread diagnostics + | + v +Backend API -> ClickHouse + | + v +Service diagnosis UI ``` -Passing real acceptance means proving all of the following from the current Kubernetes run window: - -- target status has an accepted Java target -- CPU profile is non-empty -- allocation profile is non-empty -- lock-delay profile is non-empty -- ClickHouse has profile sample and stack rows -- ingestion has accepted profile batches without unexplained rejected/dropped/truncated data -- profile TTL remains bounded to 7 days or less -- Browser UI acceptance passes against real backend data -- target workload restart count does not increase - -See `docs/operations/real-profiling-acceptance-standard.md` for the full standard. - -## Documentation - -- Online documentation: https://koolay.github.io/java-profiler/ -- `docs/brainstorms/java-profiler-requirements.md`: product requirements, actors, flows, acceptance examples, and scope boundaries. -- `docs/architecture/java-profiler-architecture.md`: collector, backend, ClickHouse, query, and UI architecture. -- `docs/architecture/performance-ingestion-architecture-review.md`: ingestion hardening, batch limits, OOM risk, and ClickHouse query pressure. -- `docs/research/coroot-node-agent-java-agent.md`: research notes on Coroot's Java agent and async-profiler behavior. -- `docs/operations/performance-analysis-user-manual.md`: service owner workflow for CPU, allocation, lock, deadlock, status, and ingestion analysis. 
-- `docs/operations/java-profiling-runbook.md`: operator workflow for enabling profiling, reading statuses, retention, and troubleshooting. -- `docs/operations/deployment-operations-admin-manual.md`: deployment, operations, security, storage, upgrade, and platform troubleshooting. -- `docs/operations/e2e-automation-test-guide.md`: real E2E and browser automation guide. -- `docs/operations/real-profiling-acceptance-standard.md`: mandatory real Kubernetes acceptance standard. -- `docs/index.md`: documentation site homepage. +The first version targets Java services running on Kubernetes, HotSpot-compatible JVMs first. Profiling is controlled through Kubernetes metadata, collected node-locally, stored in ClickHouse, and exposed through a compact UI for service owners and platform engineers. -## Real UI Evidence +## Screenshots -These screenshots are generated from a real Kubernetes acceptance environment, not from mocked UI state. They show the expected diagnosis workflow and give maintainers a visual regression anchor for the core surfaces. +These screenshots come from a real Kubernetes acceptance environment, not mocked UI state. 
![CPU profile analysis showing real DemoHttpService hotspots](docs/assets/screenshots/real-cpu-analysis.png) - [Target status evidence](docs/assets/screenshots/real-target-status.png) +- [Wall Clock latency evidence](docs/assets/screenshots/real-wall-clock.png) +- [Java I/O wait evidence](docs/assets/screenshots/real-io-wait.png) +- [GC pause and allocation correlation](docs/assets/screenshots/real-gc-pauses.png) - [Deadlock diagnosis surface](docs/assets/screenshots/real-deadlocks.png) - [Ingestion health evidence](docs/assets/screenshots/real-ingestion-health.png) @@ -153,20 +93,41 @@ export REAL_ACCEPTANCE_SERVICE=jdk17-http-demo node scripts/capture-doc-screenshots.mjs ``` -## Scope Boundaries +## Develop -The first version does not include: +Run local checks before changing profiling, ingestion, backend APIs, or UI behavior: -- Pyroscope, Parca, Grafana, or another required profile backend. -- Non-Java profiling. -- OpenJ9 support. -- Distributed ClickHouse. -- Heap dump analysis or retained-heap dominator analysis. -- General-purpose tracing, log analysis, service maps, dashboarding, or alerting. -- Prometheus metrics storage or dashboard replacement. +```bash +go test ./... +javac --release 11 java-helper/thread-diagnostics/src/main/java/com/ebpfjava/threads/*.java +cd examples/jdk17-http-demo && mvn test +cd ../../web && npm ci && npm test && npm run build +``` -Metrics may be exposed by collector/backend exporters, but Prometheus-series systems own metric storage, dashboards, alerting, and retention. +Build the docs site: + +```bash +cd docs +npm install +npm run docs:build +``` -## Working Rule +For changes touching collector profiling, ingestion, ClickHouse storage, backend query APIs, deployment, the demo service, or profile UI, run real Kubernetes acceptance. See [Contributing](docs/contributing/development.md) and the [Real Profiling Acceptance Standard](docs/operations/real-profiling-acceptance-standard.md). 
-Keep implementation and documentation aligned with `docs/brainstorms/java-profiler-requirements.md`. If a change affects scope, retention, collection behavior, storage, or user workflow, update the relevant operation manual and acceptance standard in the same change. +## Documentation + +- [Online docs](https://koolay.github.io/java-profiler/) +- [中文文档](https://koolay.github.io/java-profiler/zh/) +- [Quickstart](docs/getting-started/quickstart.md) +- [Analyze a Java service](docs/operations/performance-analysis-user-manual.md) +- [Enable profiling](docs/operations/java-profiling-runbook.md) +- [Deploy and operate the platform](docs/operations/deployment-operations-admin-manual.md) +- [Development setup](docs/contributing/development.md) +- [Localization policy](docs/contributing/localization.md) +- [Architecture](docs/architecture/java-profiler-architecture.md) + +## Scope + +The first version does not include non-Java profiling, OpenJ9 support, heap dump analysis, distributed ClickHouse, tracing, log analysis, service maps, dashboarding, alerting, or Prometheus metric storage. + +Metrics may be exposed by collector/backend exporters, but Prometheus-series systems own metric storage, dashboards, alerting, and retention. 
diff --git a/backend/internal/app/ingest_jvm_event_batch.go b/backend/internal/app/ingest_jvm_event_batch.go new file mode 100644 index 0000000..5b64c44 --- /dev/null +++ b/backend/internal/app/ingest_jvm_event_batch.go @@ -0,0 +1,84 @@ +package app + +import ( + "context" + "crypto/sha256" + "encoding/hex" + "encoding/json" + "time" + + "github.com/koolay/java-profiler/backend/internal/clickhouse" + "github.com/koolay/java-profiler/domain" +) + +type JVMEventBatchRequest struct { + BatchID string `json:"batch_id"` + CollectorID string `json:"collector_id"` + ReceivedAt time.Time `json:"received_at"` + Events []clickhouse.JVMEvent `json:"events"` +} + +type JVMEventIngestor struct { + Events JVMEventStore + Ingestion IngestionStore +} + +type JVMEventStore interface { + InsertJVMEvents(context.Context, []clickhouse.JVMEvent) error + QueryJVMEvents(context.Context, clickhouse.JVMEventQuery) ([]clickhouse.JVMEvent, error) +} + +func (i JVMEventIngestor) Ingest(ctx context.Context, req JVMEventBatchRequest) (IngestResult, error) { + if req.BatchID == "" || req.CollectorID == "" { + return IngestResult{Status: clickhouse.IngestionRejected, Message: "batch_id and collector_id are required"}, nil + } + events := append([]clickhouse.JVMEvent(nil), req.Events...) 
+ for index, event := range events { + if event.BatchID != "" && event.BatchID != req.BatchID { + return IngestResult{Status: clickhouse.IngestionRejected, Message: "JVM event batch_id conflicts with envelope batch_id"}, nil + } + if event.EventID == "" || event.EventType == "" || event.Target.Key() == "" || event.EventAt.IsZero() { + return IngestResult{Status: clickhouse.IngestionRejected, Message: "invalid JVM event"}, nil + } + events[index].BatchID = req.BatchID + } + batch := clickhouse.IngestionBatch{ + BatchID: req.BatchID, + CollectorID: req.CollectorID, + BatchType: domain.BatchTypeJVMEvent, + ReceivedAt: firstNonZero(req.ReceivedAt, time.Now().UTC()), + Status: clickhouse.IngestionClaimed, + PayloadHash: jvmEventHash(events), + } + status, err := i.Ingestion.Record(ctx, batch) + if err != nil { + return IngestResult{}, err + } + if status == clickhouse.IngestionRejected { + batch.Status = clickhouse.IngestionRejected + batch.Message = "batch id reused with different payload" + _, _ = i.Ingestion.Record(ctx, batch) + return IngestResult{Status: clickhouse.IngestionRejected, Message: batch.Message}, nil + } + if status == clickhouse.IngestionDuplicate { + return IngestResult{Status: clickhouse.IngestionDuplicate, Message: "duplicate batch ignored"}, nil + } + if err := i.Events.InsertJVMEvents(ctx, events); err != nil { + return IngestResult{Status: clickhouse.IngestionRetryable, Retryable: true, Message: err.Error()}, nil + } + batch.Status = clickhouse.IngestionAccepted + status, err = i.Ingestion.Record(ctx, batch) + if err != nil { + return IngestResult{}, err + } + if status == clickhouse.IngestionRejected { + return IngestResult{Status: clickhouse.IngestionRejected, Message: "batch id reused with different payload"}, nil + } + return IngestResult{Status: clickhouse.IngestionAccepted, Message: "accepted"}, nil +} + +func jvmEventHash(events []clickhouse.JVMEvent) string { + data, _ := json.Marshal(events) + sum := sha256.Sum256(data) + return 
hex.EncodeToString(sum[:]) +} diff --git a/backend/internal/app/ingest_profile_batch.go b/backend/internal/app/ingest_profile_batch.go index 382bca8..fa8f78b 100644 --- a/backend/internal/app/ingest_profile_batch.go +++ b/backend/internal/app/ingest_profile_batch.go @@ -41,6 +41,7 @@ type ProfileQueryStore interface { QuerySamples(context.Context, clickhouse.ProfileQuery) ([]clickhouse.ProfileSample, error) QueryFlamegraphSamples(context.Context, clickhouse.ProfileQuery) ([]clickhouse.FlamegraphSample, error) QueryTopStackSamples(context.Context, clickhouse.ProfileQuery) ([]clickhouse.TopStackSample, error) + QueryProfileTargetSummary(context.Context, clickhouse.ProfileQuery) ([]clickhouse.ProfileTargetSummary, error) } type IngestionStore interface { diff --git a/backend/internal/app/ingest_profile_batch_test.go b/backend/internal/app/ingest_profile_batch_test.go index 8585a6f..18816bc 100644 --- a/backend/internal/app/ingest_profile_batch_test.go +++ b/backend/internal/app/ingest_profile_batch_test.go @@ -102,6 +102,7 @@ type appProfileQueryStore interface { QuerySamples(context.Context, clickhouse.ProfileQuery) ([]clickhouse.ProfileSample, error) QueryFlamegraphSamples(context.Context, clickhouse.ProfileQuery) ([]clickhouse.FlamegraphSample, error) QueryTopStackSamples(context.Context, clickhouse.ProfileQuery) ([]clickhouse.TopStackSample, error) + QueryProfileTargetSummary(context.Context, clickhouse.ProfileQuery) ([]clickhouse.ProfileTargetSummary, error) } func TestProfileBatchIngestorRejectsSameBatchDifferentPayload(t *testing.T) { diff --git a/backend/internal/app/query_flamegraph.go b/backend/internal/app/query_flamegraph.go index c7936b8..21d573b 100644 --- a/backend/internal/app/query_flamegraph.go +++ b/backend/internal/app/query_flamegraph.go @@ -50,6 +50,7 @@ func (q FlamegraphQuerier) Query(ctx context.Context, query FlamegraphQuery) (ba } buildStarted := time.Now() result := backenddomain.BuildFlamegraph(flamegraphSamples, query.NodeLimit) + result 
= backenddomain.ApplyProfileSemantics(result, query.ProfileType, domain.TimeWindow{StartedAt: query.Start, EndsAt: query.End}) recordMetric(q.Metrics, "java_profiler_query_flamegraph_build_seconds_total", time.Since(buildStarted).Seconds()) recordMetric(q.Metrics, "java_profiler_query_flamegraph_scanned_samples_total", float64(result.Metadata.ScannedSamples)) recordMetric(q.Metrics, "java_profiler_query_flamegraph_omitted_nodes_total", float64(result.Metadata.OmittedNodes)) diff --git a/backend/internal/app/query_jvm_events.go b/backend/internal/app/query_jvm_events.go new file mode 100644 index 0000000..dea8cae --- /dev/null +++ b/backend/internal/app/query_jvm_events.go @@ -0,0 +1,36 @@ +package app + +import ( + "context" + + "github.com/koolay/java-profiler/backend/internal/clickhouse" + "github.com/koolay/java-profiler/domain" +) + +type JVMEventQuery struct { + Namespace string + Service string + Pod string + EventType string + Start domain.TimeWindow + Limit int +} + +type JVMEventEvidence struct { + Events []clickhouse.JVMEvent `json:"events"` + Partial bool `json:"partial"` +} + +func QueryJVMEvents(ctx context.Context, repo JVMEventStore, q clickhouse.JVMEventQuery) (JVMEventEvidence, error) { + limit := boundedQueryLimit(q.Limit, 500, 5000) + q.Limit = limit + 1 + events, err := repo.QueryJVMEvents(ctx, q) + if err != nil { + return JVMEventEvidence{}, err + } + partial := len(events) > limit + if partial { + events = events[:limit] + } + return JVMEventEvidence{Events: events, Partial: partial}, nil +} diff --git a/backend/internal/app/query_service_summary.go b/backend/internal/app/query_service_summary.go index 465e50f..c8703ba 100644 --- a/backend/internal/app/query_service_summary.go +++ b/backend/internal/app/query_service_summary.go @@ -1,10 +1,43 @@ package app -import "github.com/koolay/java-profiler/domain" +import ( + "context" + "sort" + "time" -type ServiceSummary struct { - Namespace string `json:"namespace"` - Service string 
`json:"service"` - AvailableProfiles []domain.ProfileType `json:"available_profiles"` - StatusCounts map[string]int `json:"status_counts"` + "github.com/koolay/java-profiler/backend/internal/clickhouse" + "github.com/koolay/java-profiler/backend/internal/metrics" +) + +type ServiceProfileSummary struct { + Targets []clickhouse.ProfileTargetSummary `json:"targets"` + Partial bool `json:"partial"` +} + +func QueryServiceProfileSummary(ctx context.Context, repo ProfileQueryStore, q clickhouse.ProfileQuery, exporter *metrics.Exporter) (ServiceProfileSummary, error) { + started := time.Now() + targets, err := repo.QueryProfileTargetSummary(ctx, q) + if err != nil { + return ServiceProfileSummary{}, err + } + recordMetric(exporter, "java_profiler_query_service_summary_fetch_seconds_total", time.Since(started).Seconds()) + sort.Slice(targets, func(i, j int) bool { + if targets[i].TotalValue != targets[j].TotalValue { + return targets[i].TotalValue > targets[j].TotalValue + } + if targets[i].Pod != targets[j].Pod { + return targets[i].Pod < targets[j].Pod + } + return targets[i].ProcessID < targets[j].ProcessID + }) + limit := boundedQueryLimit(q.Limit, 100, 500) + partial := len(targets) > limit + if partial { + targets = targets[:limit] + } + recordMetric(exporter, "java_profiler_query_service_summary_targets_total", float64(len(targets))) + if partial { + recordMetric(exporter, "java_profiler_query_service_summary_partial_total", 1) + } + return ServiceProfileSummary{Targets: targets, Partial: partial}, nil } diff --git a/backend/internal/app/query_service_summary_test.go b/backend/internal/app/query_service_summary_test.go new file mode 100644 index 0000000..c21edd8 --- /dev/null +++ b/backend/internal/app/query_service_summary_test.go @@ -0,0 +1,38 @@ +package app + +import ( + "context" + "testing" + "time" + + "github.com/koolay/java-profiler/backend/internal/clickhouse" + "github.com/koolay/java-profiler/domain" +) + +func 
TestQueryServiceProfileSummaryRanksPodJVMTargets(t *testing.T) { + repo := clickhouse.NewProfileRepository() + now := time.Unix(100, 0).UTC() + if err := repo.InsertProfileBatch(context.Background(), "batch", []clickhouse.ProfileSample{ + {Target: domain.TargetIdentity{Namespace: "prod", Service: "checkout", Pod: "pod-a", Container: "app", ProcessID: 11, JVMStartTime: now}, ProfileType: domain.ProfileTypeCPU, StartedAt: now, EndedAt: now.Add(time.Second), Value: uint64(2 * time.Second)}, + {Target: domain.TargetIdentity{Namespace: "prod", Service: "checkout", Pod: "pod-b", Container: "app", ProcessID: 12, JVMStartTime: now}, ProfileType: domain.ProfileTypeCPU, StartedAt: now, EndedAt: now.Add(time.Second), Value: uint64(8 * time.Second)}, + }); err != nil { + t.Fatal(err) + } + + got, err := QueryServiceProfileSummary(context.Background(), repo, clickhouse.ProfileQuery{ + Namespace: "prod", + Service: "checkout", + ProfileType: domain.ProfileTypeCPU, + Start: now, + End: now.Add(10 * time.Second), + }, nil) + if err != nil { + t.Fatal(err) + } + if len(got.Targets) != 2 || got.Targets[0].Pod != "pod-b" { + t.Fatalf("expected pod-b to rank first, got %+v", got.Targets) + } + if got.Targets[0].DisplayValue != "8.00 s · 0.80 cores" || got.Targets[0].PercentOfTotal != "80.0%" { + t.Fatalf("unexpected target semantics: %+v", got.Targets[0]) + } +} diff --git a/backend/internal/app/query_top_stacks.go b/backend/internal/app/query_top_stacks.go index 640302c..9280642 100644 --- a/backend/internal/app/query_top_stacks.go +++ b/backend/internal/app/query_top_stacks.go @@ -13,13 +13,16 @@ import ( ) type TopStackRow struct { - Symbol string `json:"symbol"` - Location string `json:"location"` - ProfileType string `json:"profile_type"` - Self uint64 `json:"self"` - Total uint64 `json:"total"` - SelfPercent string `json:"self_percent"` - TotalPercent string `json:"total_percent"` + Symbol string `json:"symbol"` + Location string `json:"location"` + ProfileType string 
`json:"profile_type"` + Self uint64 `json:"self"` + Total uint64 `json:"total"` + SelfDisplay string `json:"self_display"` + TotalDisplay string `json:"total_display"` + SelfPercent string `json:"self_percent"` + TotalPercent string `json:"total_percent"` + Semantics domain.ProfileValueSemantics `json:"semantics"` } var topStackExcludedMethods = map[string]struct{}{ @@ -38,7 +41,7 @@ func QueryTopStacks(ctx context.Context, repo ProfileQueryStore, q clickhouse.Pr } recordMetric(exporter, "java_profiler_query_top_stacks_fetch_seconds_total", time.Since(fetchStarted).Seconds()) rankStarted := time.Now() - rows, stats := buildTopStacks(samples) + rows, stats := buildTopStacks(samples, domain.TimeWindow{StartedAt: q.Start, EndsAt: q.End}) recordMetric(exporter, "java_profiler_query_top_stacks_rank_seconds_total", time.Since(rankStarted).Seconds()) recordMetric(exporter, "java_profiler_query_top_stacks_samples_total", float64(stats.samples)) recordMetric(exporter, "java_profiler_query_top_stacks_frames_total", float64(stats.frames)) @@ -47,7 +50,7 @@ func QueryTopStacks(ctx context.Context, repo ProfileQueryStore, q clickhouse.Pr } func rankTopStacks(samples []clickhouse.TopStackSample) []TopStackRow { - rows, _ := buildTopStacks(samples) + rows, _ := buildTopStacks(samples, domain.TimeWindow{}) return rows } @@ -56,7 +59,7 @@ type topStackStats struct { frames int } -func buildTopStacks(samples []clickhouse.TopStackSample) ([]TopStackRow, topStackStats) { +func buildTopStacks(samples []clickhouse.TopStackSample, window domain.TimeWindow) ([]TopStackRow, topStackStats) { type contribution struct { location string profileType domain.ProfileType @@ -96,14 +99,18 @@ func buildTopStacks(samples []clickhouse.TopStackSample) ([]TopStackRow, topStac rows := make([]TopStackRow, 0, len(byLocation)) for location, contribution := range byLocation { + semantics := contribution.profileType.Semantics(window) rows = append(rows, TopStackRow{ Symbol: classifier.symbol(location), 
Location: contribution.location, ProfileType: string(contribution.profileType), Self: contribution.self, Total: contribution.total, + SelfDisplay: domain.FormatProfileValue(contribution.profileType, contribution.self, window), + TotalDisplay: domain.FormatProfileValue(contribution.profileType, contribution.total, window), SelfPercent: percent(contribution.self, totalSamples), TotalPercent: percent(contribution.total, totalSamples), + Semantics: semantics, }) } sort.Slice(rows, func(i, j int) bool { diff --git a/backend/internal/app/query_top_stacks_test.go b/backend/internal/app/query_top_stacks_test.go index 1ef12ba..ab07cc9 100644 --- a/backend/internal/app/query_top_stacks_test.go +++ b/backend/internal/app/query_top_stacks_test.go @@ -26,6 +26,9 @@ func TestTopStacksSeparatesSelfAndTotal(t *testing.T) { if handle.TotalPercent != "100.0%" || burn.SelfPercent != "80.0%" { t.Fatalf("unexpected percents: handle total=%s burn self=%s", handle.TotalPercent, burn.SelfPercent) } + if handle.Semantics.ValueUnit != "nanoseconds" || handle.SelfDisplay != "0 ns" || burn.SelfDisplay != "8 ns" { + t.Fatalf("unexpected semantics/display: handle=%+v burn=%+v", handle, burn) + } } func TestTopStacksKeepsJavaRowsWhenRuntimeFramesExist(t *testing.T) { diff --git a/backend/internal/clickhouse/001_initial_profile_schema.sql b/backend/internal/clickhouse/001_initial_profile_schema.sql index 0dab16f..c052c40 100644 --- a/backend/internal/clickhouse/001_initial_profile_schema.sql +++ b/backend/internal/clickhouse/001_initial_profile_schema.sql @@ -96,6 +96,33 @@ PARTITION BY toDate(event_at) ORDER BY (cluster, namespace, service, pod, process_id, event_at, cycle_id) TTL expires_at DELETE; +CREATE TABLE IF NOT EXISTS java_profiler_jvm_events +( + event_id String, + batch_id String, + cluster LowCardinality(String), + namespace LowCardinality(String), + service LowCardinality(String), + pod String, + container String, + process_id UInt32, + jvm_start_time DateTime64(9, 'UTC'), + 
event_type LowCardinality(String), + event_at DateTime64(9, 'UTC'), + duration_ns UInt64, + collector String, + action String, + cause String, + message String, + stack_frames Array(String), + created_at DateTime64(9, 'UTC') DEFAULT now64(9), + expires_at DateTime DEFAULT toDateTime(created_at) + INTERVAL 7 DAY +) +ENGINE = MergeTree +PARTITION BY toDate(event_at) +ORDER BY (cluster, namespace, service, pod, process_id, event_type, event_at, event_id) +TTL expires_at DELETE; + CREATE TABLE IF NOT EXISTS java_profiler_target_status ( batch_id String, diff --git a/backend/internal/clickhouse/profile_repository.go b/backend/internal/clickhouse/profile_repository.go index f06afa7..04d6bcf 100644 --- a/backend/internal/clickhouse/profile_repository.go +++ b/backend/internal/clickhouse/profile_repository.go @@ -3,6 +3,7 @@ package clickhouse import ( "context" "errors" + "fmt" "sync" "time" @@ -36,6 +37,31 @@ type TopStackSample struct { Value uint64 } +type ProfileTargetSummary struct { + Namespace string `json:"namespace"` + Service string `json:"service"` + Pod string `json:"pod"` + Container string `json:"container"` + ProcessID int `json:"process_id"` + JVMStartTime time.Time `json:"jvm_start_time"` + ProfileType domain.ProfileType `json:"profile_type"` + TotalValue uint64 `json:"total_value"` + DisplayValue string `json:"display_value"` + SampleCount int `json:"sample_count"` + PercentOfTotal string `json:"percent_of_total"` + WindowSemantics domain.ProfileValueSemantics `json:"semantics"` +} + +type JVMEventQuery struct { + Namespace string + Service string + Pod string + EventType string + Start time.Time + End time.Time + Limit int +} + type ProfileRepository struct { mu sync.RWMutex batches map[string]struct{} @@ -161,3 +187,97 @@ func (r *ProfileRepository) QueryTopStackSamples(_ context.Context, q ProfileQue } return out, nil } + +func (r *ProfileRepository) QueryProfileTargetSummary(_ context.Context, q ProfileQuery) ([]ProfileTargetSummary, error) { + 
r.mu.RLock() + defer r.mu.RUnlock() + type aggregate struct { + sample ProfileSample + total uint64 + count int + } + byTarget := map[string]aggregate{} + var grandTotal uint64 + for _, sample := range r.samples { + if !profileSampleMatches(sample, q) { + continue + } + key := sample.Target.Key() + "|" + sample.ProfileType.String() + current := byTarget[key] + current.sample = sample + current.total += sample.Value + current.count++ + byTarget[key] = current + grandTotal += sample.Value + } + out := make([]ProfileTargetSummary, 0, len(byTarget)) + window := domain.TimeWindow{StartedAt: q.Start, EndsAt: q.End} + for _, item := range byTarget { + out = append(out, ProfileTargetSummary{ + Namespace: item.sample.Target.Namespace, + Service: item.sample.Target.Service, + Pod: item.sample.Target.Pod, + Container: item.sample.Target.Container, + ProcessID: item.sample.Target.ProcessID, + JVMStartTime: item.sample.Target.JVMStartTime, + ProfileType: item.sample.ProfileType, + TotalValue: item.total, + DisplayValue: domain.FormatProfileValue(item.sample.ProfileType, item.total, window), + SampleCount: item.count, + PercentOfTotal: percentOfTotal(item.total, grandTotal), + WindowSemantics: item.sample.ProfileType.Semantics(window), + }) + } + return out, nil +} + +func profileSampleMatches(sample ProfileSample, q ProfileQuery) bool { + if q.Namespace != "" && sample.Target.Namespace != q.Namespace { + return false + } + if q.Service != "" && sample.Target.Service != q.Service { + return false + } + if q.Pod != "" && sample.Target.Pod != q.Pod { + return false + } + if q.ProfileType != "" && sample.ProfileType != q.ProfileType { + return false + } + if !q.Start.IsZero() && sample.EndedAt.Before(q.Start) { + return false + } + if !q.End.IsZero() && sample.StartedAt.After(q.End) { + return false + } + return true +} + +func percentOfTotal(value, total uint64) string { + if total == 0 { + return "0.0%" + } + return fmt.Sprintf("%.1f%%", float64(value)/float64(total)*100) +} + 
+func jvmEventMatches(event JVMEvent, q JVMEventQuery) bool { + if q.Namespace != "" && event.Target.Namespace != q.Namespace { + return false + } + if q.Service != "" && event.Target.Service != q.Service { + return false + } + if q.Pod != "" && event.Target.Pod != q.Pod { + return false + } + if q.EventType != "" && event.EventType != q.EventType { + return false + } + if !q.Start.IsZero() && event.EventAt.Before(q.Start) { + return false + } + if !q.End.IsZero() && event.EventAt.After(q.End) { + return false + } + return true +} diff --git a/backend/internal/clickhouse/sql_repository.go b/backend/internal/clickhouse/sql_repository.go index 565a38b..d6ca25d 100644 --- a/backend/internal/clickhouse/sql_repository.go +++ b/backend/internal/clickhouse/sql_repository.go @@ -331,6 +331,56 @@ func (r *SQLRepository) QueryTopStackSamples(ctx context.Context, q ProfileQuery return out, rows.Err() } +func (r *SQLRepository) QueryProfileTargetSummary(ctx context.Context, q ProfileQuery) ([]ProfileTargetSummary, error) { + limit := q.Limit + if limit <= 0 { + limit = 5000 + } + query := ` + SELECT namespace, service, pod, container, process_id, jvm_start_time, profile_type, sum(sample_value) AS total_value, count() AS sample_count + FROM java_profiler_profile_samples + PREWHERE (? = '' OR namespace = ?) AND (? = '' OR service = ?) AND (? = '' OR pod = ?) AND (? = '' OR profile_type = ?) + AND (? = 1 OR ended_at >= ?) AND (? = 1 OR started_at <= ?) 
+ GROUP BY namespace, service, pod, container, process_id, jvm_start_time, profile_type + ORDER BY total_value DESC + LIMIT ?` + rows, err := r.db.QueryContext(ctx, query, + q.Namespace, q.Namespace, + q.Service, q.Service, + q.Pod, q.Pod, + q.ProfileType.String(), q.ProfileType.String(), + zeroTimeFlag(q.Start), q.Start, + zeroTimeFlag(q.End), q.End, + limit, + ) + if err != nil { + return nil, err + } + defer rows.Close() + var out []ProfileTargetSummary + var grandTotal uint64 + window := domain.TimeWindow{StartedAt: q.Start, EndsAt: q.End} + for rows.Next() { + var summary ProfileTargetSummary + var profileType string + if err := rows.Scan(&summary.Namespace, &summary.Service, &summary.Pod, &summary.Container, &summary.ProcessID, &summary.JVMStartTime, &profileType, &summary.TotalValue, &summary.SampleCount); err != nil { + return nil, err + } + summary.ProfileType = profileTypeFromString(profileType) + summary.DisplayValue = domain.FormatProfileValue(summary.ProfileType, summary.TotalValue, window) + summary.WindowSemantics = summary.ProfileType.Semantics(window) + grandTotal += summary.TotalValue + out = append(out, summary) + } + if err := rows.Err(); err != nil { + return nil, err + } + for i := range out { + out[i].PercentOfTotal = percentOfTotal(out[i].TotalValue, grandTotal) + } + return out, nil +} + func (r *SQLRepository) InsertSnapshots(ctx context.Context, snapshots []ThreadSnapshot, deadlocks []DeadlockEvent) error { for _, snapshot := range snapshots { daemon := uint8(0) @@ -390,6 +440,90 @@ func (r *SQLRepository) InsertSnapshots(ctx context.Context, snapshots []ThreadS return nil } +func (r *SQLRepository) InsertJVMEvents(ctx context.Context, events []JVMEvent) error { + if len(events) == 0 { + return nil + } + rows := make([][]any, 0, len(events)) + for _, event := range events { + rows = append(rows, []any{ + event.EventID, + event.BatchID, + event.Target.Cluster, + event.Target.Namespace, + event.Target.Service, + event.Target.Pod, + 
event.Target.Container, + event.Target.ProcessID, + event.Target.JVMStartTime, + event.EventType, + event.EventAt, + event.DurationNS, + event.Collector, + event.Action, + event.Cause, + event.Message, + event.StackFrames, + }) + } + return execMultiRowInsert(ctx, r.db, ` + INSERT INTO java_profiler_jvm_events + (event_id, batch_id, cluster, namespace, service, pod, container, process_id, jvm_start_time, event_type, event_at, duration_ns, collector, action, cause, message, stack_frames)`, rows) +} + +func (r *SQLRepository) QueryJVMEvents(ctx context.Context, q JVMEventQuery) ([]JVMEvent, error) { + limit := q.Limit + if limit <= 0 { + limit = 1000 + } + rows, err := r.db.QueryContext(ctx, ` + SELECT event_id, batch_id, cluster, namespace, service, pod, container, process_id, jvm_start_time, event_type, event_at, duration_ns, collector, action, cause, message, stack_frames + FROM java_profiler_jvm_events + PREWHERE (? = '' OR namespace = ?) AND (? = '' OR service = ?) AND (? = '' OR pod = ?) AND (? = '' OR event_type = ?) + AND (? = 1 OR event_at >= ?) AND (? = 1 OR event_at <= ?) 
+ ORDER BY event_at DESC + LIMIT ?`, + q.Namespace, q.Namespace, + q.Service, q.Service, + q.Pod, q.Pod, + q.EventType, q.EventType, + zeroTimeFlag(q.Start), q.Start, + zeroTimeFlag(q.End), q.End, + limit, + ) + if err != nil { + return nil, err + } + defer rows.Close() + var out []JVMEvent + for rows.Next() { + var event JVMEvent + if err := rows.Scan( + &event.EventID, + &event.BatchID, + &event.Target.Cluster, + &event.Target.Namespace, + &event.Target.Service, + &event.Target.Pod, + &event.Target.Container, + &event.Target.ProcessID, + &event.Target.JVMStartTime, + &event.EventType, + &event.EventAt, + &event.DurationNS, + &event.Collector, + &event.Action, + &event.Cause, + &event.Message, + &event.StackFrames, + ); err != nil { + return nil, err + } + out = append(out, event) + } + return out, rows.Err() +} + const ( defaultThreadSnapshotQueryLimit = 1000 defaultDeadlockQueryLimit = 500 diff --git a/backend/internal/clickhouse/thread_repository.go b/backend/internal/clickhouse/thread_repository.go index ac4ef9c..b4d8d44 100644 --- a/backend/internal/clickhouse/thread_repository.go +++ b/backend/internal/clickhouse/thread_repository.go @@ -9,11 +9,40 @@ import ( type ThreadSnapshot = profiling.ThreadSnapshot type DeadlockEvent = profiling.DeadlockEvent +type JVMEvent = profiling.JVMEvent type ThreadRepository struct { mu sync.RWMutex snapshots []ThreadSnapshot deadlocks []DeadlockEvent + jvmEvents []JVMEvent +} + +func (r *ThreadRepository) InsertJVMEvents(_ context.Context, events []JVMEvent) error { + r.mu.Lock() + defer r.mu.Unlock() + r.jvmEvents = append(r.jvmEvents, events...) 
+ return nil +} + +func (r *ThreadRepository) QueryJVMEvents(_ context.Context, q JVMEventQuery) ([]JVMEvent, error) { + r.mu.RLock() + defer r.mu.RUnlock() + limit := q.Limit + if limit <= 0 { + limit = 1000 + } + out := make([]JVMEvent, 0) + for _, event := range r.jvmEvents { + if !jvmEventMatches(event, q) { + continue + } + out = append(out, event) + if len(out) >= limit { + break + } + } + return out, nil } func NewThreadRepository() *ThreadRepository { return &ThreadRepository{} } diff --git a/backend/internal/domain/alias.go b/backend/internal/domain/alias.go index adb9f5d..48ec4b7 100644 --- a/backend/internal/domain/alias.go +++ b/backend/internal/domain/alias.go @@ -3,15 +3,16 @@ package domain import root "github.com/koolay/java-profiler/domain" type ( - ProfileType = root.ProfileType - EnablementMode = root.EnablementMode - TargetDesiredState = root.TargetDesiredState - StatusReason = root.StatusReason - BatchType = root.BatchType - TargetIdentity = root.TargetIdentity - TimeWindow = root.TimeWindow - RetentionPolicy = root.RetentionPolicy - Confidence = root.Confidence + ProfileType = root.ProfileType + EnablementMode = root.EnablementMode + TargetDesiredState = root.TargetDesiredState + StatusReason = root.StatusReason + BatchType = root.BatchType + TargetIdentity = root.TargetIdentity + TimeWindow = root.TimeWindow + RetentionPolicy = root.RetentionPolicy + Confidence = root.Confidence + ProfileValueSemantics = root.ProfileValueSemantics ) const ( @@ -20,6 +21,8 @@ const ( ProfileTypeAllocObjects = root.ProfileTypeAllocObjects ProfileTypeLockContention = root.ProfileTypeLockContention ProfileTypeLockDelay = root.ProfileTypeLockDelay + ProfileTypeWallClock = root.ProfileTypeWallClock + ProfileTypeIOWait = root.ProfileTypeIOWait EnablementDisabled = root.EnablementDisabled EnablementContinuous = root.EnablementContinuous diff --git a/backend/internal/domain/flamegraph_builder.go b/backend/internal/domain/flamegraph_builder.go index 6afb51a..e370a0a 
100644 --- a/backend/internal/domain/flamegraph_builder.go +++ b/backend/internal/domain/flamegraph_builder.go @@ -2,13 +2,16 @@ package domain import ( "sort" + + rootdomain "github.com/koolay/java-profiler/domain" ) type FlamegraphNode struct { - Name string `json:"name"` - Value uint64 `json:"value"` - Children []FlamegraphNode `json:"children,omitempty"` - childIndex map[string]int `json:"-"` + Name string `json:"name"` + Value uint64 `json:"value"` + DisplayValue string `json:"display_value,omitempty"` + Children []FlamegraphNode `json:"children,omitempty"` + childIndex map[string]int `json:"-"` } type FlamegraphMetadata struct { @@ -19,8 +22,9 @@ type FlamegraphMetadata struct { } type FlamegraphResult struct { - Root FlamegraphNode `json:"root"` - Metadata FlamegraphMetadata `json:"metadata"` + Root FlamegraphNode `json:"root"` + Metadata FlamegraphMetadata `json:"metadata"` + Semantics rootdomain.ProfileValueSemantics `json:"semantics"` } type FlamegraphSample struct { @@ -66,6 +70,19 @@ func BuildFlamegraph(samples []FlamegraphSample, nodeLimit int) FlamegraphResult return FlamegraphResult{Root: root, Metadata: metadata} } +func ApplyProfileSemantics(result FlamegraphResult, profileType rootdomain.ProfileType, window rootdomain.TimeWindow) FlamegraphResult { + result.Semantics = profileType.Semantics(window) + applyDisplayValues(&result.Root, profileType, window) + return result +} + +func applyDisplayValues(node *FlamegraphNode, profileType rootdomain.ProfileType, window rootdomain.TimeWindow) { + node.DisplayValue = rootdomain.FormatProfileValue(profileType, node.Value, window) + for i := range node.Children { + applyDisplayValues(&node.Children[i], profileType, window) + } +} + func sortNode(children []FlamegraphNode) { sort.Slice(children, func(i, j int) bool { return children[i].Value > children[j].Value }) for i := range children { diff --git a/backend/internal/domain/flamegraph_builder_test.go b/backend/internal/domain/flamegraph_builder_test.go 
index c53506b..bae29de 100644 --- a/backend/internal/domain/flamegraph_builder_test.go +++ b/backend/internal/domain/flamegraph_builder_test.go @@ -2,6 +2,9 @@ package domain import ( "testing" + "time" + + rootdomain "github.com/koolay/java-profiler/domain" ) func TestBuildFlamegraphAggregatesStacks(t *testing.T) { @@ -14,6 +17,18 @@ func TestBuildFlamegraphAggregatesStacks(t *testing.T) { } } +func TestApplyProfileSemanticsAddsDisplayValues(t *testing.T) { + got := BuildFlamegraph([]FlamegraphSample{{Frames: []string{"A"}, Value: uint64(2 * time.Second)}}, 10) + got = ApplyProfileSemantics(got, rootdomain.ProfileTypeCPU, rootdomain.TimeWindow{StartedAt: time.Unix(0, 0), EndsAt: time.Unix(10, 0)}) + + if got.Semantics.ValueUnit != "nanoseconds" || got.Root.DisplayValue != "2.00 s · 0.20 cores" { + t.Fatalf("unexpected semantic flamegraph: %+v", got) + } + if got.Root.Children[0].DisplayValue != "2.00 s · 0.20 cores" { + t.Fatalf("child display value = %q", got.Root.Children[0].DisplayValue) + } +} + func TestBuildFlamegraphMarksPartial(t *testing.T) { got := BuildFlamegraph([]FlamegraphSample{{Frames: []string{"A", "B", "C"}, Value: 1}}, 2) if !got.Metadata.Partial || got.Metadata.OmittedNodes == 0 { diff --git a/backend/internal/domain/types_test.go b/backend/internal/domain/types_test.go index 2147192..0b13503 100644 --- a/backend/internal/domain/types_test.go +++ b/backend/internal/domain/types_test.go @@ -6,8 +6,8 @@ import ( ) func TestAllProfileTypesAreStableAndValid(t *testing.T) { - if len(AllProfileTypes) != 5 { - t.Fatalf("expected 5 profile types, got %d", len(AllProfileTypes)) + if len(AllProfileTypes) != 7 { + t.Fatalf("expected 7 profile types, got %d", len(AllProfileTypes)) } for _, pt := range AllProfileTypes { if !pt.IsValid() { @@ -18,8 +18,10 @@ func TestAllProfileTypesAreStableAndValid(t *testing.T) { "java_allocation_bytes", "java_allocation_objects", "java_cpu_nanoseconds", + "java_io_wait_nanoseconds", "java_lock_contention_count", 
"java_lock_delay_nanoseconds", + "java_wall_clock_nanoseconds", } got := StableProfileTypeNames() for i := range want { diff --git a/backend/internal/httpapi/ingest_handlers.go b/backend/internal/httpapi/ingest_handlers.go index 62a8fc1..b19cf21 100644 --- a/backend/internal/httpapi/ingest_handlers.go +++ b/backend/internal/httpapi/ingest_handlers.go @@ -14,10 +14,56 @@ import ( type IngestHandlers struct { Profiles app.ProfileBatchIngestor ThreadSnapshots app.ThreadSnapshotIngestor + JVMEvents app.JVMEventIngestor TargetStatuses app.TargetStatusIngestor Metrics *metrics.Exporter } +func (h IngestHandlers) JVMEventBatch(w http.ResponseWriter, r *http.Request) { + started := time.Now() + if r.Method != http.MethodPost { + http.Error(w, "method not allowed", http.StatusMethodNotAllowed) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_requests_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_errors_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_duration_seconds_total", time.Since(started).Seconds()) + return + } + defer r.Body.Close() + var req app.JVMEventBatchRequest + if err := json.NewDecoder(http.MaxBytesReader(w, r.Body, 8<<20)).Decode(&req); err != nil { + http.Error(w, "invalid json", http.StatusBadRequest) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_requests_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_errors_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_duration_seconds_total", time.Since(started).Seconds()) + return + } + result, err := h.JVMEvents.Ingest(r.Context(), req) + if err != nil { + http.Error(w, err.Error(), http.StatusServiceUnavailable) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_requests_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_errors_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_duration_seconds_total", 
time.Since(started).Seconds()) + return + } + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_requests_total", 1) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_duration_seconds_total", time.Since(started).Seconds()) + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_rows_total", float64(len(req.Events))) + status := http.StatusAccepted + if result.Status == clickhouse.IngestionRejected { + status = http.StatusBadRequest + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_rejected_total", 1) + } + if result.Status == clickhouse.IngestionRetryable { + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_retryable_total", 1) + } + if result.Status == clickhouse.IngestionAccepted { + recordMetric(h.Metrics, "java_profiler_http_ingest_jvm_event_accepted_total", 1) + } + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(status) + _ = json.NewEncoder(w).Encode(result) +} + func (h IngestHandlers) ProfileBatch(w http.ResponseWriter, r *http.Request) { started := time.Now() if r.Method != http.MethodPost { diff --git a/backend/internal/httpapi/query_handlers.go b/backend/internal/httpapi/query_handlers.go index 51d8445..07771ac 100644 --- a/backend/internal/httpapi/query_handlers.go +++ b/backend/internal/httpapi/query_handlers.go @@ -15,11 +15,31 @@ import ( type QueryHandlers struct { Profiles app.ProfileQueryStore Threads app.ThreadStore + JVMEvents app.JVMEventStore Statuses app.TargetStatusQueryStore IngestionStore app.IngestionQueryStore Metrics *metrics.Exporter } +func (h QueryHandlers) JVMEventsEvidence(w http.ResponseWriter, r *http.Request) { + result, err := h.observe("java_profiler_http_query_jvm_events", func() (any, error) { + return app.QueryJVMEvents(r.Context(), h.JVMEvents, clickhouse.JVMEventQuery{ + Namespace: r.URL.Query().Get("namespace"), + Service: r.URL.Query().Get("service"), + Pod: r.URL.Query().Get("pod"), + EventType: r.URL.Query().Get("event_type"), + Start: 
parseQueryTime(r.URL.Query().Get("start")), + End: parseQueryTime(r.URL.Query().Get("end")), + Limit: parseQueryLimit(r, 500, 5000), + }) + }) + if err != nil { + http.Error(w, err.Error(), http.StatusServiceUnavailable) + return + } + writeJSON(w, result) +} + func (h QueryHandlers) Flamegraph(w http.ResponseWriter, r *http.Request) { result, err := h.observe("java_profiler_http_query_flamegraph", func() (any, error) { return (app.FlamegraphQuerier{Profiles: h.Profiles, Metrics: h.Metrics}).Query(r.Context(), app.FlamegraphQuery{ @@ -51,6 +71,17 @@ func (h QueryHandlers) TopStacks(w http.ResponseWriter, r *http.Request) { writeJSON(w, result) } +func (h QueryHandlers) ServiceSummary(w http.ResponseWriter, r *http.Request) { + result, err := h.observe("java_profiler_http_query_service_summary", func() (any, error) { + return app.QueryServiceProfileSummary(r.Context(), h.Profiles, profileQueryFromRequest(r, 5000), h.Metrics) + }) + if err != nil { + http.Error(w, err.Error(), http.StatusServiceUnavailable) + return + } + writeJSON(w, result) +} + func (h QueryHandlers) ThreadDiagnosis(w http.ResponseWriter, r *http.Request) { result, err := h.observe("java_profiler_http_query_thread_diagnosis", func() (any, error) { return app.QueryThreadDiagnosis( @@ -151,6 +182,11 @@ func (h QueryHandlers) observe(metricPrefix string, fn func() (any, error)) (any recordMetric(h.Metrics, metricPrefix+"_rows_total", float64(len(v.Batches))) case []app.TopStackRow: recordMetric(h.Metrics, metricPrefix+"_rows_total", float64(len(v))) + case app.ServiceProfileSummary: + recordMetric(h.Metrics, metricPrefix+"_rows_total", float64(len(v.Targets))) + if v.Partial { + recordMetric(h.Metrics, metricPrefix+"_partial_total", 1) + } } return result, err } diff --git a/backend/internal/httpapi/query_handlers_test.go b/backend/internal/httpapi/query_handlers_test.go index e7a87ea..1bfa87b 100644 --- a/backend/internal/httpapi/query_handlers_test.go +++ 
b/backend/internal/httpapi/query_handlers_test.go @@ -105,8 +105,15 @@ func TestTopStacksRouteReturnsSelfAndTotalRows(t *testing.T) { Symbol string `json:"symbol"` Self uint64 `json:"self"` Total uint64 `json:"total"` + SelfDisplay string `json:"self_display"` + TotalDisplay string `json:"total_display"` SelfPercent string `json:"self_percent"` TotalPercent string `json:"total_percent"` + Semantics struct { + ValueUnit string `json:"value_unit"` + DisplayUnit string `json:"display_unit"` + PercentBasis string `json:"percent_basis"` + } `json:"semantics"` } if err := json.Unmarshal(rec.Body.Bytes(), &rows); err != nil { t.Fatal(err) @@ -117,6 +124,9 @@ func TestTopStacksRouteReturnsSelfAndTotalRows(t *testing.T) { if rows[0].Self != 0 || rows[0].Total != 10 || rows[0].TotalPercent != "100.0%" { t.Fatalf("unexpected handleWork row: %#v", rows[0]) } + if rows[0].SelfDisplay == "" || rows[0].TotalDisplay == "" || rows[0].Semantics.ValueUnit != "nanoseconds" { + t.Fatalf("missing semantic display contract: %#v", rows[0]) + } snapshot := exporter.Snapshot() if !strings.Contains(snapshot, "java_profiler_http_query_top_stacks_requests_total 1") { t.Fatalf("missing top stacks request metric: %s", snapshot) @@ -125,3 +135,111 @@ func TestTopStacksRouteReturnsSelfAndTotalRows(t *testing.T) { t.Fatalf("missing top stacks row metric: %s", snapshot) } } + +func TestServiceSummaryRouteReturnsPodJVMContributions(t *testing.T) { + server, err := NewServer(ServerConfig{AllowInMemory: true, Auth: AuthConfig{CollectorToken: "collector", UIToken: "ui"}}, metrics.NewExporter()) + if err != nil { + t.Fatal(err) + } + now := time.Unix(200, 0).UTC() + payload := app.ProfileBatchRequest{ + BatchID: "batch-summary", + CollectorID: "collector-a", + ReceivedAt: now, + Samples: []profiling.ProfileSample{ + {Target: domain.TargetIdentity{Namespace: "prod", Service: "checkout", Pod: "checkout-1", ProcessID: 1, JVMStartTime: now}, ProfileType: domain.ProfileTypeCPU, StartedAt: now, EndedAt: 
now.Add(time.Second), StackID: "a", Frames: []string{"Demo.hot"}, Value: uint64(3 * time.Second)}, + {Target: domain.TargetIdentity{Namespace: "prod", Service: "checkout", Pod: "checkout-2", ProcessID: 2, JVMStartTime: now}, ProfileType: domain.ProfileTypeCPU, StartedAt: now, EndedAt: now.Add(time.Second), StackID: "b", Frames: []string{"Demo.hot"}, Value: uint64(7 * time.Second)}, + }, + } + body, err := json.Marshal(payload) + if err != nil { + t.Fatal(err) + } + ingestReq := httptest.NewRequest(http.MethodPost, "/api/collector/v1/profile-batches", bytes.NewReader(body)) + ingestReq.Header.Set("Content-Type", "application/json") + ingestReq.Header.Set("Authorization", "Bearer collector") + ingestRec := httptest.NewRecorder() + server.ServeHTTP(ingestRec, ingestReq) + if ingestRec.Code != http.StatusAccepted { + t.Fatalf("ingest status = %d body=%s", ingestRec.Code, ingestRec.Body.String()) + } + + req := httptest.NewRequest(http.MethodGet, "/api/ui/v1/service-summary?namespace=prod&service=checkout&profile_type=java_cpu_nanoseconds&start="+now.Format(time.RFC3339)+"&end="+now.Add(10*time.Second).Format(time.RFC3339), nil) + req.Header.Set("Authorization", "Bearer ui") + rec := httptest.NewRecorder() + server.ServeHTTP(rec, req) + if rec.Code != http.StatusOK { + t.Fatalf("summary status = %d body=%s", rec.Code, rec.Body.String()) + } + var response struct { + Targets []struct { + Pod string `json:"pod"` + DisplayValue string `json:"display_value"` + PercentOfTotal string `json:"percent_of_total"` + } `json:"targets"` + } + if err := json.Unmarshal(rec.Body.Bytes(), &response); err != nil { + t.Fatal(err) + } + if len(response.Targets) != 2 || response.Targets[0].Pod != "checkout-2" || response.Targets[0].PercentOfTotal != "70.0%" { + t.Fatalf("unexpected summary response: %+v", response) + } +} + +func TestJVMEventsRouteReturnsGCPauseEvidence(t *testing.T) { + server, err := NewServer(ServerConfig{AllowInMemory: true, Auth: AuthConfig{CollectorToken: "collector", 
UIToken: "ui"}}, metrics.NewExporter()) + if err != nil { + t.Fatal(err) + } + now := time.Unix(300, 0).UTC() + payload := app.JVMEventBatchRequest{ + BatchID: "batch-gc", + CollectorID: "collector-a", + ReceivedAt: now, + Events: []profiling.JVMEvent{{ + EventID: "gc-1", + Target: domain.TargetIdentity{Namespace: "prod", Service: "checkout", Pod: "checkout-1", ProcessID: 1, JVMStartTime: now}, + EventType: "gc_pause", + EventAt: now.Add(time.Second), + DurationNS: uint64(42 * time.Millisecond), + Collector: "G1", + Action: "end of minor GC", + Cause: "Allocation Failure", + }}, + } + body, err := json.Marshal(payload) + if err != nil { + t.Fatal(err) + } + ingestReq := httptest.NewRequest(http.MethodPost, "/api/collector/v1/jvm-event-batches", bytes.NewReader(body)) + ingestReq.Header.Set("Content-Type", "application/json") + ingestReq.Header.Set("Authorization", "Bearer collector") + ingestRec := httptest.NewRecorder() + server.ServeHTTP(ingestRec, ingestReq) + if ingestRec.Code != http.StatusAccepted { + t.Fatalf("JVM event ingest status = %d body=%s", ingestRec.Code, ingestRec.Body.String()) + } + + req := httptest.NewRequest(http.MethodGet, "/api/ui/v1/jvm-events?namespace=prod&service=checkout&pod=checkout-1&event_type=gc_pause", nil) + req.Header.Set("Authorization", "Bearer ui") + rec := httptest.NewRecorder() + server.ServeHTTP(rec, req) + if rec.Code != http.StatusOK { + t.Fatalf("JVM events status = %d body=%s", rec.Code, rec.Body.String()) + } + var response struct { + Events []struct { + EventID string `json:"event_id"` + EventType string `json:"event_type"` + DurationNS uint64 `json:"duration_ns"` + Collector string `json:"collector"` + } `json:"events"` + Partial bool `json:"partial"` + } + if err := json.Unmarshal(rec.Body.Bytes(), &response); err != nil { + t.Fatal(err) + } + if len(response.Events) != 1 || response.Events[0].EventID != "gc-1" || response.Events[0].DurationNS != uint64(42*time.Millisecond) || response.Events[0].Collector != "G1" { 
+ t.Fatalf("unexpected JVM event response: %+v", response) + } +} diff --git a/backend/internal/httpapi/server.go b/backend/internal/httpapi/server.go index 74d33de..1c1f9bb 100644 --- a/backend/internal/httpapi/server.go +++ b/backend/internal/httpapi/server.go @@ -26,6 +26,7 @@ func NewServer(cfg ServerConfig, exporter *metrics.Exporter) (http.Handler, erro var statuses app.TargetStatusQueryStore var statusIngest app.TargetStatusStore var threads app.ThreadStore + var jvmEvents app.JVMEventStore if cfg.ClickHouseDSN != "" { sqlRepo, err := clickhouse.OpenSQLRepository(cfg.ClickHouseDSN) if err != nil { @@ -44,6 +45,7 @@ func NewServer(cfg ServerConfig, exporter *metrics.Exporter) (http.Handler, erro statuses = sqlRepo statusIngest = sqlRepo threads = sqlRepo + jvmEvents = sqlRepo } else if cfg.AllowInMemory { profiles = clickhouse.NewProfileRepository() ingestionRepo := clickhouse.NewIngestionRepository() @@ -52,25 +54,31 @@ func NewServer(cfg ServerConfig, exporter *metrics.Exporter) (http.Handler, erro statusRepo := clickhouse.NewStatusRepository() statuses = statusRepo statusIngest = statusRepo - threads = clickhouse.NewThreadRepository() + threadsRepo := clickhouse.NewThreadRepository() + threads = threadsRepo + jvmEvents = threadsRepo } else { return nil, fmt.Errorf("JAVA_PROFILER_CLICKHOUSE_DSN is required unless in-memory mode is explicitly enabled") } handlers := IngestHandlers{ Profiles: app.ProfileBatchIngestor{Profiles: profiles, Ingestion: ingestion}, ThreadSnapshots: app.ThreadSnapshotIngestor{Threads: threads, Ingestion: ingestion}, + JVMEvents: app.JVMEventIngestor{Events: jvmEvents, Ingestion: ingestion}, TargetStatuses: app.TargetStatusIngestor{Statuses: statusIngest, Ingestion: ingestion}, Metrics: exporter, } - queryHandlers := QueryHandlers{Profiles: profiles, Threads: threads, Statuses: statuses, IngestionStore: ingestionQuery, Metrics: exporter} + queryHandlers := QueryHandlers{Profiles: profiles, Threads: threads, JVMEvents: jvmEvents, 
Statuses: statuses, IngestionStore: ingestionQuery, Metrics: exporter} mux := http.NewServeMux() mux.Handle("/api/collector/v1/profile-batches", RequireCollectorAuth(cfg.Auth, http.HandlerFunc(handlers.ProfileBatch))) mux.Handle("/api/collector/v1/thread-snapshot-batches", RequireCollectorAuth(cfg.Auth, http.HandlerFunc(handlers.ThreadSnapshotBatch))) + mux.Handle("/api/collector/v1/jvm-event-batches", RequireCollectorAuth(cfg.Auth, http.HandlerFunc(handlers.JVMEventBatch))) mux.Handle("/api/collector/v1/target-status-batches", RequireCollectorAuth(cfg.Auth, http.HandlerFunc(handlers.TargetStatusBatch))) mux.Handle("/api/ui/v1/flamegraph", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.Flamegraph))) mux.Handle("/api/ui/v1/top-stacks", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.TopStacks))) + mux.Handle("/api/ui/v1/service-summary", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.ServiceSummary))) mux.Handle("/api/ui/v1/thread-diagnosis", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.ThreadDiagnosis))) mux.Handle("/api/ui/v1/deadlocks", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.Deadlocks))) + mux.Handle("/api/ui/v1/jvm-events", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.JVMEventsEvidence))) mux.Handle("/api/ui/v1/target-status", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.TargetStatus))) mux.Handle("/api/ui/v1/ingestion", RequireUIAuth(cfg.Auth, http.HandlerFunc(queryHandlers.Ingestion))) mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) { diff --git a/collector/internal/jfr/normalizer.go b/collector/internal/jfr/normalizer.go index f92e460..94e5975 100644 --- a/collector/internal/jfr/normalizer.go +++ b/collector/internal/jfr/normalizer.go @@ -3,6 +3,7 @@ package jfr import ( "crypto/sha1" "encoding/hex" + "fmt" "strings" "time" @@ -10,10 +11,15 @@ import ( "github.com/koolay/java-profiler/domain" ) -const DefaultCPUExecutionSampleValueNS uint64 = 10_000_000 +const ( + 
DefaultCPUExecutionSampleValueNS uint64 = 10_000_000 + DefaultWallClockSampleValueNS uint64 = 10_000_000 + DefaultIOWaitWallSampleValueNS uint64 = 10_000_000 +) type NormalizedWindow struct { Samples []profiling.ProfileSample + JVMEvents []profiling.JVMEvent RawSampleCount int } @@ -27,16 +33,26 @@ func NormalizeWindow(batchID string, target domain.TargetIdentity, events []Even func NormalizeWindowWithStats(batchID string, target domain.TargetIdentity, events []Event, startedAt, endedAt time.Time) NormalizedWindow { var samples []profiling.ProfileSample + var jvmEvents []profiling.JVMEvent rawSampleCount := 0 for _, event := range events { + if jvmEvent, ok := normalizeJVMEvent(batchID, target, event, startedAt, endedAt); ok { + jvmEvents = append(jvmEvents, jvmEvent) + continue + } profileType, ok := profileTypeForEvent(event.Type) if !ok { continue } rawSampleCount++ value := event.Value - if profileType == domain.ProfileTypeCPU { + switch profileType { + case domain.ProfileTypeCPU: value = event.Value * DefaultCPUExecutionSampleValueNS + case domain.ProfileTypeWallClock: + value = event.Value * DefaultWallClockSampleValueNS + case domain.ProfileTypeIOWait: + value = event.Value * DefaultIOWaitWallSampleValueNS } frames := boundedFrames(event.Frames, 256) samples = append(samples, profiling.ProfileSample{ @@ -51,7 +67,39 @@ func NormalizeWindowWithStats(batchID string, target domain.TargetIdentity, even Truncated: len(event.Frames) > len(frames), }) } - return NormalizedWindow{Samples: AggregateSamples(samples), RawSampleCount: rawSampleCount} + return NormalizedWindow{Samples: AggregateSamples(samples), JVMEvents: jvmEvents, RawSampleCount: rawSampleCount} +} + +func normalizeJVMEvent(batchID string, target domain.TargetIdentity, event Event, startedAt, endedAt time.Time) (profiling.JVMEvent, bool) { + switch event.Type { + case "gc_pause": + default: + return profiling.JVMEvent{}, false + } + eventAt := endedAt + if eventAt.IsZero() { + eventAt = startedAt + } 
+ if eventAt.IsZero() { + eventAt = time.Now().UTC() + } + eventID := event.Labels["event_id"] + if eventID == "" { + eventID = stackID(append(event.Frames, fmt.Sprintf("%s:%d:%s", event.Type, event.Value, eventAt.UTC().Format(time.RFC3339Nano)))) + } + return profiling.JVMEvent{ + EventID: eventID, + BatchID: batchID, + Target: target, + EventType: event.Type, + EventAt: eventAt, + DurationNS: event.Value, + Collector: event.Labels["collector"], + Action: event.Labels["action"], + Cause: event.Labels["cause"], + Message: event.Labels["message"], + StackFrames: boundedFrames(event.Frames, 256), + }, true } func profileTypeForEvent(event string) (domain.ProfileType, bool) { @@ -66,6 +114,10 @@ func profileTypeForEvent(event string) (domain.ProfileType, bool) { return domain.ProfileTypeLockContention, true case "lock_delay": return domain.ProfileTypeLockDelay, true + case "wall_clock": + return domain.ProfileTypeWallClock, true + case "io_wait": + return domain.ProfileTypeIOWait, true default: return "", false } diff --git a/collector/internal/jfr/normalizer_test.go b/collector/internal/jfr/normalizer_test.go index 1fc1a2a..776df2b 100644 --- a/collector/internal/jfr/normalizer_test.go +++ b/collector/internal/jfr/normalizer_test.go @@ -17,9 +17,11 @@ func TestNormalizeMapsRequiredProfileTypes(t *testing.T) { {Type: "alloc_objects", Value: 3, Frames: []string{"C"}}, {Type: "monitor_enter", Value: 4, Frames: []string{"D"}}, {Type: "lock_delay", Value: 5, Frames: []string{"E"}}, + {Type: "wall_clock", Value: 6, Frames: []string{"F"}}, + {Type: "io_wait", Value: 7, Frames: []string{"G"}}, }, startedAt, endedAt) - if len(samples) != 5 { - t.Fatalf("expected five samples, got %+v", samples) + if len(samples) != 7 { + t.Fatalf("expected seven samples, got %+v", samples) } for _, sample := range samples { if !sample.ProfileType.IsValid() || sample.StackID == "" || !sample.StartedAt.Equal(startedAt) || !sample.EndedAt.Equal(endedAt) { @@ -36,8 +38,10 @@ func 
TestNormalizeMapsRequiredProfileTypes(t *testing.T) { if byType[domain.ProfileTypeAllocBytes] != 2 || byType[domain.ProfileTypeAllocObjects] != 3 || byType[domain.ProfileTypeLockContention] != 4 || - byType[domain.ProfileTypeLockDelay] != 5 { - t.Fatalf("non-CPU sample values should remain unchanged: %+v", samples) + byType[domain.ProfileTypeLockDelay] != 5 || + byType[domain.ProfileTypeWallClock] != 6*DefaultWallClockSampleValueNS || + byType[domain.ProfileTypeIOWait] != 7*DefaultIOWaitWallSampleValueNS { + t.Fatalf("sample values should be normalized by profile semantics: %+v", samples) } } @@ -77,3 +81,27 @@ func TestNormalizeWindowWithStatsPreservesRawCountBeforeAggregation(t *testing.T t.Fatalf("expected one aggregated sample, got %+v", result.Samples) } } + +func TestNormalizeWindowExtractsGCJVMEvents(t *testing.T) { + target := domain.TargetIdentity{Namespace: "prod", Service: "checkout", Pod: "checkout-1", ProcessID: 1, JVMStartTime: time.Unix(1, 0)} + startedAt := time.Unix(100, 0) + endedAt := time.Unix(160, 0) + + result := NormalizeWindowWithStats("batch-1", target, []Event{{ + Type: "gc_pause", + Value: 42_000_000, + Frames: []string{"jdk.G1.collect"}, + Labels: map[string]string{"collector": "G1", "action": "end of minor GC", "cause": "Allocation Failure"}, + }}, startedAt, endedAt) + + if len(result.Samples) != 0 { + t.Fatalf("GC events must not be profile samples: %+v", result.Samples) + } + if result.RawSampleCount != 0 || len(result.JVMEvents) != 1 { + t.Fatalf("expected zero raw samples and one JVM event, got raw=%d events=%+v", result.RawSampleCount, result.JVMEvents) + } + event := result.JVMEvents[0] + if event.EventType != "gc_pause" || event.DurationNS != 42_000_000 || event.Collector != "G1" || event.Target.Pod != "checkout-1" { + t.Fatalf("unexpected GC event: %+v", event) + } +} diff --git a/collector/internal/jfr/parser.go b/collector/internal/jfr/parser.go index 96932f7..e1829c1 100644 --- a/collector/internal/jfr/parser.go +++ 
b/collector/internal/jfr/parser.go @@ -18,6 +18,7 @@ type Event struct { Type string Value uint64 Frames []string + Labels map[string]string } type Parser struct{} @@ -57,11 +58,27 @@ func parseFixtureEvents(r io.Reader) ([]Event, error) { if err != nil { return nil, err } - events = append(events, Event{Type: parts[0], Value: value, Frames: strings.Split(parts[2], ";")}) + event := Event{Type: parts[0], Value: value, Frames: strings.Split(parts[2], ";")} + if len(parts) > 3 { + event.Labels = parseFixtureLabels(parts[3]) + } + events = append(events, event) } return events, scanner.Err() } +func parseFixtureLabels(raw string) map[string]string { + labels := map[string]string{} + for _, pair := range strings.Split(raw, ",") { + key, value, ok := strings.Cut(pair, "=") + if !ok { + continue + } + labels[strings.TrimSpace(key)] = strings.TrimSpace(value) + } + return labels +} + func parseAsyncProfilerJFR(data []byte) ([]Event, error) { p := grafanaparser.NewParser(data, grafanaparser.Options{SymbolProcessor: grafanaparser.ProcessSymbols}) var events []Event @@ -76,6 +93,19 @@ func parseAsyncProfilerJFR(data []byte) ([]Event, error) { switch typ { case p.TypeMap.T_EXECUTION_SAMPLE: events = append(events, Event{Type: "execution_sample", Value: 1, Frames: stackFrames(p, p.ExecutionSample.StackTrace)}) + case p.TypeMap.T_WALL_CLOCK_SAMPLE: + frames := stackFrames(p, p.WallClockSample.StackTrace) + value := uint64(p.WallClockSample.Samples) + if value == 0 { + value = 1 + } + events = append(events, Event{Type: "wall_clock", Value: value, Frames: frames}) + if isJavaIOStack(frames) { + events = append(events, Event{Type: "io_wait", Value: value, Frames: frames}) + } + if isJVMGCStack(frames) { + events = append(events, Event{Type: "gc_pause", Value: value * DefaultWallClockSampleValueNS, Frames: frames, Labels: map[string]string{"collector": "JVM", "action": "wall-clock GC activity", "cause": "GC thread sample"}}) + } case p.TypeMap.T_ALLOC_IN_NEW_TLAB: frames := 
stackFrames(p, p.ObjectAllocationInNewTLAB.StackTrace) events = append(events, Event{Type: "alloc_objects", Value: 1, Frames: frames}, Event{Type: "alloc_bytes", Value: p.ObjectAllocationInNewTLAB.TlabSize, Frames: frames}) @@ -95,6 +125,40 @@ func parseAsyncProfilerJFR(data []byte) ([]Event, error) { return events, nil } +func isJavaIOStack(frames []string) bool { + for _, frame := range frames { + normalized := strings.ReplaceAll(frame, "/", ".") + if strings.Contains(normalized, "java.io.") || + strings.Contains(normalized, "java.net.") || + strings.Contains(normalized, "java.nio.") || + strings.Contains(normalized, "sun.nio.ch.") || + strings.Contains(normalized, "jdk.internal.net.http.") || + strings.Contains(normalized, "io.netty.channel.") || + strings.Contains(normalized, "okhttp3.") || + strings.Contains(normalized, "org.apache.http.") { + return true + } + } + return false +} + +func isJVMGCStack(frames []string) bool { + for _, frame := range frames { + normalized := strings.ToLower(strings.ReplaceAll(frame, "/", ".")) + if strings.Contains(normalized, ".gc.") || + strings.Contains(normalized, "java.lang.system.gc") || + strings.Contains(normalized, "java.lang.runtime.gc") || + strings.Contains(normalized, "garbagecollect") || + strings.Contains(normalized, "g1") || + strings.Contains(normalized, "shenandoah") || + strings.Contains(normalized, "zgc") || + strings.Contains(normalized, "vm_gc") { + return true + } + } + return false +} + func stackFrames(p *grafanaparser.Parser, ref types.StackTraceRef) []string { stack := p.GetStacktrace(ref) if stack == nil { diff --git a/collector/internal/jfr/parser_test.go b/collector/internal/jfr/parser_test.go index d02752f..ca813e1 100644 --- a/collector/internal/jfr/parser_test.go +++ b/collector/internal/jfr/parser_test.go @@ -15,6 +15,27 @@ func TestParserParsesFixtureEvents(t *testing.T) { } } +func TestJavaIOStackClassifierUsesJavaOwnershipFrames(t *testing.T) { + if 
!isJavaIOStack([]string{"com.example.Client.call", "sun/nio/ch/SocketDispatcher.read0"}) { + t.Fatalf("expected Java socket stack to be classified as I/O wait") + } + if isJavaIOStack([]string{"com.example.CpuBurn.loop", "java.lang.Thread.run"}) { + t.Fatalf("CPU-only stack should not be classified as I/O wait") + } +} + +func TestJVMGCStackClassifierUsesJVMOwnershipFrames(t *testing.T) { + if !isJVMGCStack([]string{"jdk/internal/vm/G1CollectedHeap.doCollection"}) { + t.Fatalf("expected JVM GC stack to be classified as GC evidence") + } + if !isJVMGCStack([]string{"java/lang/System.gc", "com/example/DemoHttpService.createGcPressure"}) { + t.Fatalf("expected explicit System.gc stack to be classified as GC evidence") + } + if isJVMGCStack([]string{"com.example.Checkout.handle"}) { + t.Fatalf("application stack should not be classified as GC evidence") + } +} + func TestParserRejectsInvalidJFRMagic(t *testing.T) { _, err := (Parser{}).Parse(strings.NewReader("FLR\x00bad")) if err == nil { diff --git a/collector/internal/pipeline/profile_batcher.go b/collector/internal/pipeline/profile_batcher.go index c3df5ed..7ae5499 100644 --- a/collector/internal/pipeline/profile_batcher.go +++ b/collector/internal/pipeline/profile_batcher.go @@ -35,6 +35,22 @@ type ThreadSnapshotBatchPayload struct { Deadlocks []profiling.DeadlockEvent `json:"Deadlocks"` } +type JVMEventBatchPayload struct { + BatchID string `json:"batch_id"` + CollectorID string `json:"collector_id"` + ReceivedAt time.Time `json:"received_at"` + Events []profiling.JVMEvent `json:"events"` +} + +func BuildJVMEventBatch(batchID, collectorID string, events []profiling.JVMEvent) (Batch, error) { + payload := JVMEventBatchPayload{BatchID: batchID, CollectorID: collectorID, ReceivedAt: time.Now().UTC(), Events: events} + data, err := json.Marshal(payload) + if err != nil { + return Batch{}, err + } + return Batch{ID: batchID, Type: "jvm_event", Bytes: len(data), CreatedAt: payload.ReceivedAt, Payload: data}, nil +} + 
func BuildThreadSnapshotBatch(batchID, collectorID string, snapshots []profiling.ThreadSnapshot, deadlocks []profiling.DeadlockEvent) (Batch, error) { payload := ThreadSnapshotBatchPayload{BatchID: batchID, CollectorID: collectorID, ReceivedAt: time.Now().UTC(), Snapshots: snapshots, Deadlocks: deadlocks} data, err := json.Marshal(payload) @@ -93,3 +109,10 @@ func ThreadSnapshotURL(profileURL string) string { } return strings.TrimRight(profileURL, "/") + "/thread-snapshot-batches" } + +func JVMEventURL(profileURL string) string { + if strings.Contains(profileURL, "/profile-batches") { + return strings.Replace(profileURL, "/profile-batches", "/jvm-event-batches", 1) + } + return strings.TrimRight(profileURL, "/") + "/jvm-event-batches" +} diff --git a/collector/internal/pipeline/profile_batcher_test.go b/collector/internal/pipeline/profile_batcher_test.go index a050405..f37159e 100644 --- a/collector/internal/pipeline/profile_batcher_test.go +++ b/collector/internal/pipeline/profile_batcher_test.go @@ -52,3 +52,22 @@ func TestBuildProfileBatchIncludesMetadata(t *testing.T) { t.Fatalf("metadata should mark the batch as truncated") } } + +func TestBuildJVMEventBatchUsesStableJSONKeys(t *testing.T) { + batch, err := BuildJVMEventBatch("batch-gc", "collector-a", []profiling.JVMEvent{{EventID: "gc-1", EventType: "gc_pause"}}) + if err != nil { + t.Fatal(err) + } + if batch.Type != "jvm_event" { + t.Fatalf("batch type = %q", batch.Type) + } + var wire map[string]json.RawMessage + if err := json.Unmarshal(batch.Payload, &wire); err != nil { + t.Fatal(err) + } + for _, key := range []string{"batch_id", "collector_id", "received_at", "events"} { + if _, ok := wire[key]; !ok { + t.Fatalf("JVM event payload missing wire key %q: %s", key, string(batch.Payload)) + } + } +} diff --git a/collector/internal/profiler/async_profiler.go b/collector/internal/profiler/async_profiler.go index 2b66ca9..a454c47 100644 --- a/collector/internal/profiler/async_profiler.go +++ 
b/collector/internal/profiler/async_profiler.go @@ -33,6 +33,7 @@ type Config struct { LibraryPath string TargetTmpDir string AllocationAndLockJFR bool + DisableWallClockJFR bool Now func() time.Time } @@ -46,6 +47,7 @@ type Runner struct { type CollectionResult struct { Samples []profiling.ProfileSample + JVMEvents []profiling.JVMEvent RawSampleCount int } @@ -70,6 +72,7 @@ type cachedLibrary struct { const ( allocationSampleInterval = "8m" lockSampleThreshold = "10us" + wallSampleInterval = "10ms" ) func NewRunner(cfg Config, attach AttachController) *Runner { @@ -131,6 +134,7 @@ func (r *Runner) Collect(ctx context.Context, batchID string, target domain.Targ normalized := jfr.NormalizeWindowWithStats(batchID, target, events, prior.startedAt, now) result := CollectionResult{ Samples: normalized.Samples, + JVMEvents: normalized.JVMEvents, RawSampleCount: normalized.RawSampleCount, } if err := r.start(ctx, key, target.ProcessID, nsPID, now); err != nil { @@ -239,6 +243,9 @@ func (r *Runner) start(ctx context.Context, key string, pid int, nsPID int, star if r.cfg.AllocationAndLockJFR { args = append(args[:5], append([]string{"--alloc", allocationSampleInterval, "--lock", lockSampleThreshold}, args[5:]...)...) } + if !r.cfg.DisableWallClockJFR { + args = append(args[:5], append([]string{"--wall", wallSampleInterval}, args[5:]...)...) 
+ } args = append(args, strconv.Itoa(pid)) if _, err := r.exec.Run(ctx, r.cfg.AsprofPath, args...); err != nil { _ = RemoveSessionMarker(r.cfg.ProcRoot, pid) @@ -251,6 +258,9 @@ func (r *Runner) start(ctx context.Context, key string, pid int, nsPID int, star if r.cfg.AllocationAndLockJFR { args += ",alloc=" + allocationSampleInterval + ",lock=" + lockSampleThreshold } + if !r.cfg.DisableWallClockJFR { + args += ",wall=" + wallSampleInterval + } if err := r.attach.LoadNativeAgent(ctx, pid, r.targetLibraryPath(), args); err != nil { _ = RemoveSessionMarker(r.cfg.ProcRoot, pid) return err diff --git a/collector/internal/profiler/async_profiler_test.go b/collector/internal/profiler/async_profiler_test.go index 138182d..3ae7a57 100644 --- a/collector/internal/profiler/async_profiler_test.go +++ b/collector/internal/profiler/async_profiler_test.go @@ -75,7 +75,7 @@ func TestRunnerUsesOfficialAsprofCLIWhenConfigured(t *testing.T) { t.Fatalf("expected initial start asprof command, got %+v", exec.commands) } startArgs := strings.Join(exec.commands[0].args, " ") - for _, want := range []string{"start -e itimer", "-f /tmp/java-profiler/ap_7.jfr", "--libpath /tmp/java-profiler/libasyncProfiler.so"} { + for _, want := range []string{"start -e itimer", "--wall 10ms", "-f /tmp/java-profiler/ap_7.jfr", "--libpath /tmp/java-profiler/libasyncProfiler.so"} { if !strings.Contains(startArgs, want) { t.Fatalf("expected start command to contain %q, got %s", want, startArgs) } @@ -126,6 +126,43 @@ func TestRunnerCanOptIntoAllocationAndLockProfiling(t *testing.T) { } } +func TestRunnerCanDisableWallClockProfiling(t *testing.T) { + root := t.TempDir() + procRoot := filepath.Join(root, "proc") + targetRoot := filepath.Join(procRoot, "42", "root") + if err := os.MkdirAll(filepath.Join(targetRoot, "tmp"), 0o755); err != nil { + t.Fatal(err) + } + if err := os.MkdirAll(filepath.Join(procRoot, "42"), 0o755); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(filepath.Join(procRoot, "42", 
"status"), []byte("Name:\tjava\nNSpid:\t4242\t7\n"), 0o644); err != nil { + t.Fatal(err) + } + libPath := filepath.Join(root, "libasyncProfiler.so") + if err := os.WriteFile(libPath, []byte("native-profiler"), 0o644); err != nil { + t.Fatal(err) + } + runner := NewRunner(Config{ + ProcRoot: procRoot, + AsprofPath: "/assets/asprof", + LibraryPath: libPath, + TargetTmpDir: "/tmp/java-profiler", + DisableWallClockJFR: true, + }, &recordingAttachController{}) + exec := &recordingExecutor{} + runner.exec = exec + target := domain.TargetIdentity{Cluster: "c", Namespace: "prod", Service: "jdk17-http-demo", Pod: "jdk17-http-demo-1", ProcessID: 42, JVMStartTime: time.Unix(1, 0)} + + if _, err := runner.Collect(context.Background(), "batch-1", target); err != nil { + t.Fatal(err) + } + startArgs := strings.Join(exec.commands[0].args, " ") + if strings.Contains(startArgs, "--wall") { + t.Fatalf("expected wall clock to be disabled, got %s", startArgs) + } +} + func TestRunnerUsesHotSpotAttachStopsParsesAndRestartsAsyncProfilerJFR(t *testing.T) { root := t.TempDir() procRoot := filepath.Join(root, "proc") @@ -163,7 +200,7 @@ func TestRunnerUsesHotSpotAttachStopsParsesAndRestartsAsyncProfilerJFR(t *testin t.Fatalf("expected initial start attach command, got %+v", attach.commands) } start := attach.commands[0] - if start.pid != 42 || start.agentPath != "/tmp/java-profiler/libasyncProfiler.so" || !strings.HasPrefix(start.args, "start,file=/tmp/java-profiler/ap_7.jfr,jfr,event=itimer,interval=10ms") || strings.Contains(start.args, "alloc=") || strings.Contains(start.args, "lock=") { + if start.pid != 42 || start.agentPath != "/tmp/java-profiler/libasyncProfiler.so" || !strings.HasPrefix(start.args, "start,file=/tmp/java-profiler/ap_7.jfr,jfr,event=itimer,interval=10ms") || !strings.Contains(start.args, "wall=10ms") || strings.Contains(start.args, "alloc=") || strings.Contains(start.args, "lock=") { t.Fatalf("expected Coroot-style native agent start command, got %+v", start) } if _, 
err := os.Stat(filepath.Join(targetRoot, "tmp", "java-profiler", "libasyncProfiler.so")); err != nil { diff --git a/collector/runtime/runtime.go b/collector/runtime/runtime.go index c4dead2..3002fe8 100644 --- a/collector/runtime/runtime.go +++ b/collector/runtime/runtime.go @@ -262,7 +262,7 @@ func (r *Runtime) ScanOnce(ctx context.Context) error { return err } batchID := fmt.Sprintf("%s-profile-%d", r.collectorID, started.UnixNano()) - samples, rawSampleCount, profileErr := r.collectProfiles(ctx, batchID, acceptedTargets) + samples, jvmEvents, rawSampleCount, profileErr := r.collectProfiles(ctx, batchID, acceptedTargets) if profileErr != nil { r.exporter.Inc("java_profiler_collector_profiler_failures") } @@ -272,11 +272,33 @@ func (r *Runtime) ScanOnce(ctx context.Context) error { log.Printf("profile batch upload failed: batch=%s samples=%d: %v", batchID, len(samples), err) return err } + if len(jvmEvents) > 0 { + eventBatchID := fmt.Sprintf("%s-jvm-events-%d", r.collectorID, started.UnixNano()) + if err := r.uploadJVMEvents(ctx, eventBatchID, jvmEvents); err != nil { + r.exporter.Inc("java_profiler_collector_upload_failures") + r.exporter.Inc("java_profiler_collector_upload_retryable") + log.Printf("JVM event batch upload failed: batch=%s events=%d: %v", eventBatchID, len(jvmEvents), err) + return err + } + } r.exporter.Inc("java_profiler_collector_upload_success") } return nil } +func (r *Runtime) uploadJVMEvents(ctx context.Context, batchID string, events []profiling.JVMEvent) error { + for index := range events { + events[index].BatchID = batchID + } + payload, err := pipeline.BuildJVMEventBatch(batchID, r.collectorID, events) + if err != nil { + return err + } + client := r.backend + client.URL = pipeline.JVMEventURL(r.backend.URL) + return client.Upload(ctx, payload) +} + func (r *Runtime) uploadProfileSamples(ctx context.Context, batchID string, samples []profiling.ProfileSample, rawSampleCount int) error { boundedSamples, metadata := 
pipeline.BoundProfileSamples(samples, rawSampleCount, r.profileLimits) maxPerBatch := r.profileLimits.MaxSamplesPerBatch @@ -335,9 +357,9 @@ func chunkProfileSamples(samples []profiling.ProfileSample, maxPerBatch int) [][ return chunks } -func (r *Runtime) collectProfiles(ctx context.Context, batchID string, targets []domain.TargetIdentity) ([]profiling.ProfileSample, int, error) { +func (r *Runtime) collectProfiles(ctx context.Context, batchID string, targets []domain.TargetIdentity) ([]profiling.ProfileSample, []profiling.JVMEvent, int, error) { if len(targets) == 0 { - return nil, 0, nil + return nil, nil, 0, nil } limit := maxConcurrentProfiles if len(targets) < limit { @@ -345,6 +367,7 @@ func (r *Runtime) collectProfiles(ctx context.Context, batchID string, targets [ } type profileResult struct { samples []profiling.ProfileSample + events []profiling.JVMEvent raw int err error } @@ -363,11 +386,12 @@ func (r *Runtime) collectProfiles(ctx context.Context, batchID string, targets [ results[i].err = err return } - results[i] = profileResult{samples: result.Samples, raw: result.RawSampleCount} + results[i] = profileResult{samples: result.Samples, events: result.JVMEvents, raw: result.RawSampleCount} }(index, target) } wg.Wait() out := make([]profiling.ProfileSample, 0) + events := make([]profiling.JVMEvent, 0) rawSampleCount := 0 var firstErr error for _, result := range results { @@ -378,9 +402,10 @@ func (r *Runtime) collectProfiles(ctx context.Context, batchID string, targets [ continue } out = append(out, result.samples...) + events = append(events, result.events...) 
rawSampleCount += result.raw } - return out, rawSampleCount, firstErr + return out, events, rawSampleCount, firstErr } func (r *Runtime) targetAllowed(target domain.TargetIdentity) bool { diff --git a/collector/runtime/runtime_test.go b/collector/runtime/runtime_test.go index 2d3bf75..656505b 100644 --- a/collector/runtime/runtime_test.go +++ b/collector/runtime/runtime_test.go @@ -314,10 +314,11 @@ func TestCollectProfilesUsesLimitedConcurrency(t *testing.T) { done := make(chan struct{}) var samples []profiling.ProfileSample + var events []profiling.JVMEvent var rawCount int var err error go func() { - samples, rawCount, err = rt.collectProfiles(context.Background(), "batch-1", targets) + samples, events, rawCount, err = rt.collectProfiles(context.Background(), "batch-1", targets) close(done) }() @@ -348,6 +349,9 @@ func TestCollectProfilesUsesLimitedConcurrency(t *testing.T) { if len(samples) != len(targets) { t.Fatalf("samples = %d, want %d", len(samples), len(targets)) } + if len(events) != 0 { + t.Fatalf("events = %d, want 0", len(events)) + } if got := rt.profiler.(*blockingProfileCollector).maxActive; got != maxConcurrentProfiles { t.Fatalf("max active = %d, want %d", got, maxConcurrentProfiles) } diff --git a/contracts/profiling/payloads.md b/contracts/profiling/payloads.md index 44e9332..168465f 100644 --- a/contracts/profiling/payloads.md +++ b/contracts/profiling/payloads.md @@ -42,6 +42,8 @@ Stable v1 profile types: - `java_allocation_objects` - `java_lock_contention_count` - `java_lock_delay_nanoseconds` +- `java_wall_clock_nanoseconds` +- `java_io_wait_nanoseconds` ## Batch types @@ -49,6 +51,7 @@ Stable batch types: - `profile` - `thread_snapshot` +- `jvm_event` - `target_status` - `collector_heartbeat` - `ingestion` @@ -64,6 +67,13 @@ Profile batches are sent to `/api/collector/v1/profile-batches`. - `ReceivedAt`: collector-side batch creation time - `Samples`: profile samples +JVM event batches are sent to `/api/collector/v1/jvm-event-batches`. 
+ +- `batch_id`: unique collector-generated batch id +- `collector_id`: stable collector instance id +- `received_at`: collector-side batch creation time +- `events`: JVM-scoped events such as `gc_pause`, each with `event_id`, `batch_id`, `target`, `event_type`, `event_at`, and `duration_ns`, plus optional `collector`, `action`, `cause`, `message`, and `stack_frames` + Target status batches are sent to `/api/collector/v1/target-status-batches`. - `BatchID`: unique collector-generated batch id diff --git a/contracts/profiling/types.go b/contracts/profiling/types.go index e36aa74..9f887c2 100644 --- a/contracts/profiling/types.go +++ b/contracts/profiling/types.go @@ -56,3 +56,17 @@ type DeadlockEvent struct { Locks []string `json:"locks"` BlockingFrames []string `json:"blocking_frames"` } + +type JVMEvent struct { + EventID string `json:"event_id"` + BatchID string `json:"batch_id"` + Target domain.TargetIdentity `json:"target"` + EventType string `json:"event_type"` + EventAt time.Time `json:"event_at"` + DurationNS uint64 `json:"duration_ns"` + Collector string `json:"collector,omitempty"` + Action string `json:"action,omitempty"` + Cause string `json:"cause,omitempty"` + Message string `json:"message,omitempty"` + StackFrames []string `json:"stack_frames,omitempty"` +} diff --git a/docs/.vitepress/config.ts b/docs/.vitepress/config.ts index 7a4b5c5..ee26b05 100644 --- a/docs/.vitepress/config.ts +++ b/docs/.vitepress/config.ts @@ -1,5 +1,103 @@ import { defineConfig } from 'vitepress' +const enNav = [ + { text: 'Get Started', link: '/getting-started/quickstart' }, + { text: 'Users', link: '/operations/performance-analysis-user-manual' }, + { text: 'Operators', link: '/operations/deployment-operations-admin-manual' }, + { text: 'Contributors', link: '/contributing/development' }, + { text: 'Architecture', link: '/architecture/java-profiler-architecture' }, + { text: 'Reference', link: '/reference/profiling-contracts' } +] + +const zhNav = [ + { text: '快速开始', link: 
'/zh/getting-started/quickstart' }, + { text: '用户', link: '/zh/operations/performance-analysis-user-manual' }, + { text: '运维', link: '/operations/deployment-operations-admin-manual' }, + { text: '贡献者', link: '/zh/contributing/development' }, + { text: '架构', link: '/architecture/java-profiler-architecture' }, + { text: '参考', link: '/zh/reference/profiling-contracts' } +] + +const enSidebar = [ + { + text: 'Start Here', + items: [ + { text: 'Overview', link: '/' }, + { text: 'Quickstart', link: '/getting-started/quickstart' } + ] + }, + { + text: 'Users', + items: [ + { text: 'Analyze Performance', link: '/operations/performance-analysis-user-manual' }, + { text: 'Enable Profiling', link: '/operations/java-profiling-runbook' } + ] + }, + { + text: 'Operators', + items: [ + { text: 'Deployment Manual', link: '/operations/deployment-operations-admin-manual' }, + { text: 'Real Profiling Acceptance', link: '/operations/real-profiling-acceptance-standard' }, + { text: 'E2E Automation Guide', link: '/operations/e2e-automation-test-guide' } + ] + }, + { + text: 'Contributors', + items: [ + { text: 'Development Setup', link: '/contributing/development' }, + { text: 'Localization', link: '/contributing/localization' }, + { text: 'System Architecture', link: '/architecture/java-profiler-architecture' }, + { text: 'Ingestion Architecture', link: '/architecture/performance-ingestion-architecture-review' } + ] + }, + { + text: 'Reference', + items: [ + { text: 'Profiling Contracts', link: '/reference/profiling-contracts' } + ] + } +] + +const zhSidebar = [ + { + text: '开始', + items: [ + { text: '概览', link: '/zh/' }, + { text: '快速开始', link: '/zh/getting-started/quickstart' } + ] + }, + { + text: '用户', + items: [ + { text: '性能分析', link: '/zh/operations/performance-analysis-user-manual' }, + { text: '启用 Profiling', link: '/operations/java-profiling-runbook' } + ] + }, + { + text: '运维', + items: [ + { text: '部署运维手册', link: '/operations/deployment-operations-admin-manual' }, + { text: 
'真实 Profiling 验收', link: '/operations/real-profiling-acceptance-standard' }, + { text: 'E2E 自动化指南', link: '/operations/e2e-automation-test-guide' } + ] + }, + { + text: '贡献者', + items: [ + { text: '开发设置', link: '/zh/contributing/development' }, + { text: '本地化策略', link: '/zh/contributing/localization' }, + { text: '系统架构', link: '/architecture/java-profiler-architecture' }, + { text: 'Ingestion 架构', link: '/architecture/performance-ingestion-architecture-review' } + ] + }, + { + text: '参考', + items: [ + { text: 'Profiling 合同', link: '/zh/reference/profiling-contracts' } + ] + } +] + export default defineConfig({ title: 'Java Profiler', description: 'Java performance profiling for Kubernetes with async-profiler and ClickHouse', @@ -14,66 +112,51 @@ export default defineConfig({ dark: 'github-dark' } }, + locales: { + root: { + label: 'English', + lang: 'en-US', + title: 'Java Profiler', + description: 'Java performance profiling for Kubernetes with async-profiler and ClickHouse', + themeConfig: { + nav: enNav, + sidebar: enSidebar, + editLink: { + pattern: 'https://github.com/koolay/java-profiler/edit/main/docs/:path', + text: 'Edit this page on GitHub' + }, + footer: { + message: 'Java services on Kubernetes. HotSpot first. async-profiler first.', + copyright: 'Released from the java-profiler repository.' 
+ } + } + }, + zh: { + label: '简体中文', + lang: 'zh-CN', + title: 'Java Profiler', + description: '面向 Kubernetes Java 服务的真实性能 Profiling 文档', + link: '/zh/', + themeConfig: { + nav: zhNav, + sidebar: zhSidebar, + editLink: { + pattern: 'https://github.com/koolay/java-profiler/edit/main/docs/:path', + text: '在 GitHub 上编辑本页' + }, + footer: { + message: '面向 Kubernetes Java 服务。HotSpot 优先。async-profiler 优先。', + copyright: '来自 java-profiler 仓库。' + } + } + } + }, themeConfig: { search: { provider: 'local' }, - nav: [ - { text: 'Guide', link: '/' }, - { text: 'Operations', link: '/operations/java-profiling-runbook' }, - { text: 'Architecture', link: '/architecture/java-profiler-architecture' }, - { text: 'Research', link: '/research/coroot-node-agent-java-agent' } - ], - sidebar: [ - { - text: 'Start Here', - items: [ - { text: 'Overview', link: '/' }, - { text: 'Requirements', link: '/brainstorms/java-profiler-requirements' } - ] - }, - { - text: 'Architecture', - items: [ - { text: 'System Architecture', link: '/architecture/java-profiler-architecture' }, - { text: 'Ingestion Review', link: '/architecture/performance-ingestion-architecture-review' } - ] - }, - { - text: 'Operations', - items: [ - { text: 'Deployment Manual', link: '/operations/deployment-operations-admin-manual' }, - { text: 'E2E Automation Guide', link: '/operations/e2e-automation-test-guide' }, - { text: 'Profiling Runbook', link: '/operations/java-profiling-runbook' }, - { text: 'Performance Analysis Manual', link: '/operations/performance-analysis-user-manual' }, - { text: 'Real Profiling Acceptance', link: '/operations/real-profiling-acceptance-standard' } - ] - }, - { - text: 'Research', - items: [ - { text: 'Coroot Node Agent Java Agent', link: '/research/coroot-node-agent-java-agent' }, - { text: 'chDB Go', link: '/research/chdb-go' }, - { text: 'Pyroscope UI Study', link: '/research/pyroscope-profile-ui-study' } - ] - }, - { - text: 'Reference', - items: [ - { text: 'Profiling Contracts', link: 
'/reference/profiling-contracts' } - ] - } - ], socialLinks: [ { icon: 'github', link: 'https://github.com/koolay/java-profiler' } - ], - editLink: { - pattern: 'https://github.com/koolay/java-profiler/edit/main/docs/:path', - text: 'Edit this page on GitHub' - }, - footer: { - message: 'Java services on Kubernetes. HotSpot first. async-profiler first.', - copyright: 'Released from the java-profiler repository.' - } + ] } }) diff --git a/docs/assets/screenshots/real-cpu-analysis.png b/docs/assets/screenshots/real-cpu-analysis.png index be68333..e1a184a 100644 Binary files a/docs/assets/screenshots/real-cpu-analysis.png and b/docs/assets/screenshots/real-cpu-analysis.png differ diff --git a/docs/assets/screenshots/real-deadlocks.png b/docs/assets/screenshots/real-deadlocks.png index 3507251..42e9a6e 100644 Binary files a/docs/assets/screenshots/real-deadlocks.png and b/docs/assets/screenshots/real-deadlocks.png differ diff --git a/docs/assets/screenshots/real-gc-pauses.png b/docs/assets/screenshots/real-gc-pauses.png new file mode 100644 index 0000000..d012685 Binary files /dev/null and b/docs/assets/screenshots/real-gc-pauses.png differ diff --git a/docs/assets/screenshots/real-ingestion-health.png b/docs/assets/screenshots/real-ingestion-health.png index 839c223..90d2829 100644 Binary files a/docs/assets/screenshots/real-ingestion-health.png and b/docs/assets/screenshots/real-ingestion-health.png differ diff --git a/docs/assets/screenshots/real-io-wait.png b/docs/assets/screenshots/real-io-wait.png new file mode 100644 index 0000000..eb5b80e Binary files /dev/null and b/docs/assets/screenshots/real-io-wait.png differ diff --git a/docs/assets/screenshots/real-target-status.png b/docs/assets/screenshots/real-target-status.png index c063903..7c4e556 100644 Binary files a/docs/assets/screenshots/real-target-status.png and b/docs/assets/screenshots/real-target-status.png differ diff --git a/docs/assets/screenshots/real-wall-clock.png 
b/docs/assets/screenshots/real-wall-clock.png new file mode 100644 index 0000000..0feb9dc Binary files /dev/null and b/docs/assets/screenshots/real-wall-clock.png differ diff --git a/docs/contributing/development.md b/docs/contributing/development.md new file mode 100644 index 0000000..b3fde4f --- /dev/null +++ b/docs/contributing/development.md @@ -0,0 +1,59 @@ +# Contributing + +This project needs contributors to preserve one rule: profiling changes must prove real profile data, not just a healthy UI. + +## Development setup + +Run commands from the repository root. + +```bash +go test ./... +javac --release 11 java-helper/thread-diagnostics/src/main/java/com/ebpfjava/threads/*.java +cd examples/jdk17-http-demo && mvn test +cd ../../web && npm ci && npm test && npm run build +``` + +## Docs site + +```bash +cd docs +npm install +npm run docs:dev +``` + +Build before publishing docs changes: + +```bash +cd docs +npm run docs:build +``` + +The docs site is bilingual. English is the source language, and Chinese covers the core user and contributor paths. See [Localization](./localization.md) before adding or moving public docs pages. + +## Real acceptance + +Use real Kubernetes acceptance for changes touching collector profiling, ingestion, ClickHouse storage, backend query APIs, deployment, the demo service, or the profile UI. + +```bash +export KUBECONFIG=$HOME/backup/localk8s.yaml + +scripts/real-acceptance.sh \ + --service jdk17-http-demo \ + --configure-profiler \ + --require-full-profiling \ + --high-volume \ + --artifact-dir /tmp/java-profiler-real-acceptance-$(date +%Y%m%d%H%M%S) +``` + +Passing means the run produced accepted target status, non-empty CPU/allocation/lock profiles, ClickHouse rows, ingestion evidence, bounded retention, browser UI evidence, and no target workload restart increase. + +## Screenshot evidence + +Docs screenshots should come from a real UI connected to a real backend. 
+ +```bash +export REAL_ACCEPTANCE_BASE_URL=http://127.0.0.1:18081 +export REAL_ACCEPTANCE_NAMESPACE=java-profiler-qa +export REAL_ACCEPTANCE_SERVICE=jdk17-http-demo +node scripts/capture-doc-screenshots.mjs +``` diff --git a/docs/contributing/localization.md b/docs/contributing/localization.md new file mode 100644 index 0000000..764aaa2 --- /dev/null +++ b/docs/contributing/localization.md @@ -0,0 +1,50 @@ +# Localization + +English is the source language for `java-profiler` documentation. Chinese pages are localized copies for the paths most users and contributors need first. + +## Required bilingual pages + +Keep these pages available in both English and Chinese: + +| English | Chinese | +| --- | --- | +| `/` | `/zh/` | +| `/getting-started/quickstart` | `/zh/getting-started/quickstart` | +| `/operations/performance-analysis-user-manual` | `/zh/operations/performance-analysis-user-manual` | +| `/contributing/development` | `/zh/contributing/development` | +| `/reference/profiling-contracts` | `/zh/reference/profiling-contracts` | + +## English-only pages + +Keep implementation-heavy or low-traffic material English-only unless there is a clear user need: + +- Architecture details. +- Ingestion architecture review. +- E2E automation details. +- Real profiling acceptance standard. +- Research notes. +- Brainstorms and project-history material. + +Do not put research, brainstorms, or plans in the public navigation. + +## Translation workflow + +When changing a required bilingual page: + +1. Update the English page first. +2. Update the matching Chinese page in the same change. +3. Keep the same route shape under `/zh/`. +4. Reuse screenshots unless the image itself contains language-specific UI text that makes the page confusing. +5. Run the docs build before publishing. + +```bash +cd docs +npm run docs:build +``` + +## Style + +- Prefer clear technical Chinese over literal translation. 
+- Keep product names, API keys, annotation names, profile types, and file paths unchanged. +- Use English terms when they are the terms users see in the UI, for example `Top Table`, `Flame Graph`, `Self CPU`, and `Total CPU`. +- Avoid adding new product claims only to the Chinese page. If the claim matters, add it to English first. diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md new file mode 100644 index 0000000..299656d --- /dev/null +++ b/docs/getting-started/quickstart.md @@ -0,0 +1,47 @@ +# Quickstart + +Use this path when you already have `java-profiler` deployed and want to profile one Java service. + +## 1. Enable profiling + +Add profiling metadata to the target workload pod template: + +```yaml +metadata: + annotations: + java-profiler.io/profile-mode: temporary + java-profiler.io/profile-duration: 15m +``` + +Use `temporary` for incident work. Use `continuous` only for services approved for ongoing collection. + +## 2. Open the service + +In the Web UI, set: + +- `Namespace`: the Kubernetes namespace. +- `Service`: the service or workload name. +- `Range`: the time window that includes the profiling run. + +Start with [Target status](../operations/java-profiling-runbook.md#validate-an-existing-workload) if the UI looks empty. It explains whether the JVM was accepted, disabled, unsupported, expired, or failed to attach. + +## 3. Analyze the profile + +Open [CPU profile analysis](../operations/performance-analysis-user-manual.md) first when investigating high CPU. + +Use: + +- Top Table to find the most expensive Java methods. +- Flame Graph to see full sampled stack context. +- Selected frame details to compare Self CPU and Total CPU. +- Search and Focus to isolate the stack path that matters. + +For latency that is not explained by CPU, switch to Wall Clock. For socket or file blocking, switch to I/O wait. For pause-time or allocation-pressure incidents, switch to GC pauses and allocation correlation. 
For contention, switch to lock diagnosis. + +## 4. Check ingestion health + +Before trusting a missing profile, check [Ingestion health](../operations/performance-analysis-user-manual.md#ingestion-health). A useful diagnosis needs accepted profile batches for the selected service and time range. + +## 5. Turn profiling off + +Temporary profiling expires automatically. For continuous profiling, remove or disable the metadata when the service no longer needs ongoing collection. diff --git a/docs/index.md b/docs/index.md index 60ec721..2990b84 100644 --- a/docs/index.md +++ b/docs/index.md @@ -3,36 +3,42 @@ layout: home hero: name: Java Profiler - text: Kubernetes-native Java profiling documentation - tagline: Operate, validate, and evolve a HotSpot-first profiler built around async-profiler and ClickHouse. + text: Find the Java stack behind Kubernetes performance problems + tagline: "A focused profiler for HotSpot services on Kubernetes: opt-in collection, async-profiler/JFR-derived evidence, ClickHouse storage, and a UI built for Java incident diagnosis." actions: - theme: brand - text: Start with Requirements - link: /brainstorms/java-profiler-requirements + text: Quickstart + link: /getting-started/quickstart - theme: alt - text: Run the Profiler - link: /operations/java-profiling-runbook + text: Analyze a Service + link: /operations/performance-analysis-user-manual features: - - title: Product Boundary - details: Java services on Kubernetes, opt-in through annotations or labels, node-local collection, and bounded profile retention. - - title: Real Acceptance - details: Acceptance requires non-empty CPU, allocation, and lock profile data from a real Kubernetes workload. - - title: Operations First - details: Deployment, profiling, performance analysis, and E2E automation guides are grouped for operators and implementers. 
+ - title: Production-safe by default + details: Profiling is opt-in through Kubernetes metadata, collected node-locally, and retained for 7 days or less. + - title: Real Java evidence + details: CPU, Wall Clock, Java I/O wait, GC, allocation, lock delay, thread, deadlock, status, and ingestion evidence stay tied to one service and time range. + - title: Own the profiling stack + details: No required Pyroscope, Parca, or Grafana backend. async-profiler data lands in ClickHouse and a self-owned UI. --- -## Start here +## For service owners -- [Requirements](./brainstorms/java-profiler-requirements.md) defines the product scope, actors, retention policy, and success criteria. -- [Architecture](./architecture/java-profiler-architecture.md) explains the collector, backend, ClickHouse store, and web UI. -- [Java Profiling Runbook](./operations/java-profiling-runbook.md) shows how to enable, disable, and validate profiling. -- [Real Profiling Acceptance Standard](./operations/real-profiling-acceptance-standard.md) defines the evidence required before profiling changes are complete. +- [Quickstart](./getting-started/quickstart.md): enable profiling and read your first service profile. +- [Performance Analysis Manual](./operations/performance-analysis-user-manual.md): read CPU, Wall Clock, Java I/O wait, GC, allocation, lock, deadlock, target status, and ingestion evidence. +- [Java Profiling Runbook](./operations/java-profiling-runbook.md): enable temporary or continuous profiling for a Kubernetes workload. -## Main sections +## For platform operators -- [Operations](./operations/java-profiling-runbook.md): deployment, profiling, performance analysis, E2E testing, and acceptance. -- [Research](./research/coroot-node-agent-java-agent.md): upstream references and technology studies. +- [Deployment and Operations](./operations/deployment-operations-admin-manual.md): install, secure, operate, upgrade, and troubleshoot the profiler. 
+- [Real Profiling Acceptance](./operations/real-profiling-acceptance-standard.md): prove CPU, Wall Clock, Java I/O wait, GC, allocation, lock, ClickHouse, UI, and ingestion behavior before shipping changes. + +## For contributors + +- [Contributing](./contributing/development.md): run local checks, build docs, and execute real acceptance. +- [Architecture](./architecture/java-profiler-architecture.md): understand the collector, backend, ClickHouse store, contracts, and web UI. +- [E2E Automation Guide](./operations/e2e-automation-test-guide.md): run browser and real Kubernetes acceptance flows. +- [Profiling Contracts](./reference/profiling-contracts.md): inspect stable payload and configuration contracts. ## Local preview diff --git a/docs/operations/performance-analysis-user-manual.md b/docs/operations/performance-analysis-user-manual.md index e4c012e..937e06b 100644 --- a/docs/operations/performance-analysis-user-manual.md +++ b/docs/operations/performance-analysis-user-manual.md @@ -1,802 +1,84 @@ -# Java 服务性能分析用户手册 +# Performance Analysis Manual -本文面向 Java 服务负责人、值班响应人员和应用开发人员,说明如何使用 `java-profiler` 分析 Kubernetes 中 Java 服务的 CPU、内存分配、锁等待、死锁和线程问题。部署、权限、升级、ClickHouse、Web 代理和平台故障处理见 [部署运维管理员手册](./deployment-operations-admin-manual.md)。如果问题涉及部署、权限、ClickHouse、Web 代理、token 或 collector DaemonSet,请转交平台管理员并引用管理员手册。 +Use this workflow when a Java service is already being profiled and you need to find the stack responsible for CPU, Wall Clock latency, Java I/O wait, GC pauses, allocation, lock, deadlock, or ingestion problems. -## 真实工作流截图 +The screenshots below come from a real Kubernetes acceptance environment, not mocked UI state. -这些截图来自真实 Kubernetes acceptance 环境,不是 mock UI 状态。保留它们是为了让读者快速理解核心诊断路径,也让维护者有一组可回归对比的 UI 证据。 +## Start with target status -重新生成截图时,先把真实 Web UI port-forward 到本机,然后从仓库根目录运行: +Target status tells you whether the selected JVM was accepted, rejected, disabled, unsupported, expired, or failed to attach. 
Check this before assuming a service has no performance data. -```bash -export REAL_ACCEPTANCE_BASE_URL=http://127.0.0.1:18081 -export REAL_ACCEPTANCE_NAMESPACE=java-profiler-qa -export REAL_ACCEPTANCE_SERVICE=jdk17-http-demo -node scripts/capture-doc-screenshots.mjs -``` +![Real target status evidence](../assets/screenshots/real-target-status.png) -### Target status +## Analyze CPU profiles -先确认目标 JVM 是否被接受、拒绝或因为 metadata/JVM/runtime 条件而没有数据。 +Use the CPU view when a service has high CPU or latency that may be caused by expensive Java code. -![真实 target status 证据](../assets/screenshots/real-target-status.png) +- Top Table shows the most expensive Java symbols. +- Self CPU is time spent directly in that frame. +- Total CPU includes callees under that frame. +- Flame Graph shows sampled stack context. +- Search highlights matching frames without hiding the rest of the stack. +- Focus narrows the graph to the selected stack path. -### CPU profile analysis +![Real CPU profile analysis](../assets/screenshots/real-cpu-analysis.png) -CPU 视图把 Top Table、Flame Graph、选中 frame 详情、Self/Total CPU 语义和 Java frame 分类放在同一个真实服务诊断流里。 +## Analyze Wall Clock latency -![真实 CPU profile analysis](../assets/screenshots/real-cpu-analysis.png) +Use Wall Clock when latency is high but CPU does not explain the full incident. It shows Java stack time that may be runnable, blocked, waiting, sleeping, or doing I/O. -### Deadlock diagnosis +- Compare Wall Clock with CPU before declaring a method CPU-bound. +- Use the same Top Table and Flame Graph controls as CPU. +- Treat Wall Clock as latency evidence, not as CPU utilization. -死锁视图用于确认选定服务和时间范围内是否有 cycle 证据;真实运行也可能呈现经过验证的空状态。 +![Real Wall Clock latency evidence](../assets/screenshots/real-wall-clock.png) -![真实 deadlock diagnosis surface](../assets/screenshots/real-deadlocks.png) +## Analyze Java I/O wait -### Ingestion health +Use I/O wait when a service is blocked on sockets, files, or Java-owned network clients. 
The view only claims Java ownership when the collected stack preserves JVM/JFR evidence for the blocking path. -Ingestion health 用 accepted/rejected/dropped payload 证据闭环 collector 到 backend 的上传和入库路径。 +- Start with top Java I/O symbols. +- Drill into Flame Graph context to separate business code from runtime or native frames. +- Cross-check remote dependency latency outside this profiler when the stack points to RPC or storage clients. -![真实 ingestion health 证据](../assets/screenshots/real-ingestion-health.png) +![Real Java I/O wait evidence](../assets/screenshots/real-io-wait.png) -## 你能用它回答什么 +## Correlate GC pauses and allocation pressure -`java-profiler` 第一版本聚焦 HotSpot 兼容 Java 服务: +Use GC pauses when request latency or throughput changes line up with JVM pause activity. The GC view shows JVM GC event evidence beside allocation profile context for the same service, Pod, and time range. -- 哪些 Java 调用栈消耗 CPU。 -- 哪些调用栈分配对象或分配字节最多。 -- 哪些锁路径产生等待或竞争。 -- 哪些线程处于死锁、阻塞、等待或忙碌状态。 -- 目标为什么没有数据:未启用、不兼容、冲突、attach 失败、上传失败或存储失败。 +- Confirm that GC events exist for the selected time window. +- Use allocation correlation to find Java paths creating pressure. +- Do not treat allocation profile size as retained heap ownership. -它不是通用观测平台。日志、分布式追踪、Prometheus 指标趋势、告警和服务拓扑仍然从现有观测系统进入。`java-profiler` 提供的是同一服务、同一时间范围内的 Java 栈证据。 +![Real GC pause and allocation correlation](../assets/screenshots/real-gc-pauses.png) -## 使用前确认 +## Analyze allocation pressure -使用前确认: +Use allocation profiles when heap usage, allocation rate, or GC pressure rises. Start with the largest allocation symbols, then inspect the stack paths that lead to those allocations. 
-- 平台管理员已经部署 collector、backend、Web UI 和 ClickHouse。 -- 你有权限访问目标 namespace 或 service 的 profiling 数据。 -- 目标服务运行在 Kubernetes 中。 -- 目标 JVM 是 HotSpot 兼容实现。 -- 目标服务已通过 `java-profiler.io/*` metadata 显式启用 profiling,或你有权限临时启用。 +## Analyze lock contention -如果 UI 没有数据,先看 `status`,不要直接判断“没有性能问题”。 +Use lock diagnosis when request latency or thread state suggests blocking. Lock-delay profiles point to synchronized or monitor paths where threads spend time waiting. -## 启用 profiling +## Check deadlocks -Profiling 默认关闭。你需要在目标 Pod 或 workload Pod template 上添加 annotation 或 label。annotation 与 label 使用相同键名;如果两者都存在,以 annotation 为准。 +Deadlock diagnosis shows cycles reported by the target JVM. A real run may show cycle evidence or a verified empty state for the selected time range. -只有在你拥有该 workload 的变更权限且已获得生产采集批准时,才应直接修改 profiling metadata。否则请通过平台管理员或既有 GitOps/变更流程申请临时 profiling;申请信息见本手册“无法自行启用 profiling 时”。 +![Real deadlock diagnosis surface](../assets/screenshots/real-deadlocks.png) -### 选择模式 +## Check ingestion health -| 模式 | 适用情境 | 建议 | -| --- | --- | --- | -| `temporary` | 线上事件、临时复现、单次排查 | 默认优先选择,必须设置 `profile-duration`。 | -| `continuous` | 核心服务、长期高流量服务、需要 7 天内随时回溯 | 需要服务 owner 同意长期采集和访问控制。 | -| `profile-disabled` / `disabled` | 事件结束、止血、明确不采集 | 优先使用 `profile-disabled: "true"` 止血;合同也允许 `profile-mode: disabled`。 | +Ingestion health closes the loop. It shows whether profile batches were accepted, rejected, dropped, or truncated by the backend ingestion path. 
-### 模式选择决策表 +![Real ingestion health evidence](../assets/screenshots/real-ingestion-health.png) -| 情况 | 推荐模式 | 原因 | -| --- | --- | --- | -| 正在处理线上事件 | `temporary` 10 到 15 分钟 | 控制开销,同时捕获现场。 | -| 核心服务长期高流量 | `continuous` | 需要 7 天内可回溯栈证据。 | -| 只有单个 Pod 异常 | 对该 Pod 使用 `temporary` | 避免混入其他副本。 | -| 刚上线新版本,需要观察风险 | `temporary` 30 分钟,或经批准使用 `continuous` | 取决于风险窗口和服务重要性。 | -| 服务包含敏感业务路径 | 优先 `temporary`,共享证据时脱敏 | 降低长期暴露面。 | -| 不确定是否可采 | `temporary` 5 分钟 smoke test | 先验证 status 和 ingestion。 | +## When the UI is empty -### 临时 profiling +Check these in order: -适合排查一次事件。临时模式到期后自动停止。 - -```yaml -metadata: - annotations: - java-profiler.io/profile-mode: temporary - java-profiler.io/profile-disabled: "false" - java-profiler.io/profile-duration: 10m - java-profiler.io/startup-delay: 0s - java-profiler.io/snapshot-interval: 10s -``` - -当前临时窗口按目标 Pod/JVM 的生命周期判断。对已经运行很久的 Pod 直接添加 `10m` temporary metadata,可能立即显示 `temporary_expired`;更稳妥的做法是在 workload Pod template 上添加 metadata 后滚动重启,或按管理员确认的方式重开目标 Pod。重复验收时可以添加一次性 annotation,例如 `java-profiler.io/acceptance-run: "20260516225619"`,强制 Deployment 滚动出新的窗口。 - -临时模式可以短时间提高线程快照频率,但不要长期运行高频快照。 - -### 真实 workload smoke test - -对线上或准线上 workload 第一次启用真实 async-profiler attach 时,先做短窗口 smoke test。平台管理员应把 collector/backend/UI 指向单个 namespace/service,保存 UI 截图、Playwright 视频、ClickHouse 计数、collector/backend 日志,以及目标 Pod 的前后 restart count。若目标服务在窗口内出现新重启,应先停止 profiling 并排查 attach 安全性,不要继续扩大采集范围。 - -### 持续 profiling - -适合核心服务保留最近 7 天内的栈证据。 - -```yaml -metadata: - annotations: - java-profiler.io/profile-mode: continuous - java-profiler.io/profile-disabled: "false" - java-profiler.io/startup-delay: 30s - java-profiler.io/snapshot-interval: 5m -``` - -持续 profiling 不是无限保存。所有采集数据仍然不超过 7 天 retention。 - -### 停止 profiling - -```yaml -metadata: - annotations: - java-profiler.io/profile-disabled: "true" -``` - -事件结束后应移除临时 annotation,或保留显式禁用作为止血控制。后续重新启用时,必须删除旧的 `profile-disabled: "true"`,或在 Pod template 上显式设置 `profile-disabled: "false"`;只改 `profile-mode` 不会覆盖禁用标记。 - -## 
控制字段 - -| 字段 | 示例 | 说明 | -| --- | --- | --- | -| `java-profiler.io/profile-mode` | `temporary`, `continuous`, `disabled` | 开启临时、持续或禁用 profiling。 | -| `java-profiler.io/profile-disabled` | `"true"` | 强制禁用,优先级高于 `profile-mode`。truthy 值包括 `1`, `true`, `yes`, `enabled`, `on`。 | -| `java-profiler.io/profile-duration` | `10m`, `1h` | 临时 profiling 持续时间。临时模式必填。 | -| `java-profiler.io/startup-delay` | `0s`, `30s` | 新发现 JVM 启动后等待多久再开始 profiling。 | -| `java-profiler.io/snapshot-interval` | `10s`, `5m` | 线程快照间隔。临时排障可短时间缩短。 | - -时间字段使用 Go duration 格式,例如 `30s`、`10m`、`1h`。 - -控制字段和 status reason 的稳定合同由平台维护,见 [profiling contracts](../reference/profiling-contracts.md);本手册只解释服务 owner 如何使用这些字段。 - -## 无法自行启用 profiling 时 - -如果你没有权限修改 workload metadata,或目标服务属于敏感生产范围,请向平台管理员提交一次性申请: - -```text -Namespace: -Service: -Pod / workload: -Incident window: -Requested mode and duration: -Reason: -Urgency: -Owner / approver: -``` - -管理员应通过受控变更路径添加 `java-profiler.io/*` metadata。不要通过临时手工 patch 绕过团队的生产变更规则。 - -## 按症状选择入口 - -紧急排查时先按症状进入对应视图: - -| 症状 | 先看 | 再看 | -| --- | --- | --- | -| CPU 高 | `status` | `cpu` | -| GC 压力高或 allocation rate 高 | `status` | `memory` | -| 请求卡住或线程池耗尽 | `status` | `locks`、`threads` | -| 疑似死锁 | `status` | `deadlocks` | -| UI 没数据 | `status` | `ingestion` | -| 只有某个 Pod 异常 | `status` | 缩小到 Pod、container 或 JVM | -| rollout 后行为变化 | `status` | 按 Pod 和 JVM start time 分开查询 | - -## 常见误判速查 - -| 不要这样做 | 正确做法 | -| --- | --- | -| 空 flamegraph = 没有热点 | 先确认 status、ingestion 和时间范围。 | -| accepted = 已经有 profile 数据 | accepted 只说明目标可采,非空 profile 才说明数据链路可用。 | -| allocation 高 = retained heap 高 | allocation 只说明分配来源。 | -| RUNNABLE = 正在消耗 CPU | RUNNABLE 是线程状态,需要结合 CPU profile。 | -| 选中 Top Table 行 = 直接过滤成单行图 | 默认不是。选中应高亮匹配帧并显示详情;搜索是单独动作。 | -| 多副本 service 直接混看 | 先确认是否只有某个 Pod 异常。 | -| 发布窗口混合新旧 Pod | 用 Pod、PID、JVM start time 分开看。 | - -## 5 分钟应急流程 - -1. 记录 namespace、service、Pod 和问题时间范围。 -2. 打开 UI,选择相同 namespace、service 和时间范围。 -3. 先打开 `status`。 -4. 
如果没有权限启用 profiling,把 namespace、service、Pod 和时间范围发给服务 owner 或平台管理员。 -5. 如果 UI 无权访问目标 namespace,不要申请全集群权限;申请对应 namespace 或 service 的最小访问权限。 -6. 如果目标不是 accepted,按 reason 处理;需要权限、attach 或平台问题时联系管理员。 -7. 如果目标 accepted,按症状进入 `cpu`、`memory`、`locks`、`deadlocks` 或线程证据;accepted 只说明控制面允许采集,不等于已经有 profile 数据。 -8. 如果视图为空,打开 `ingestion` 判断上传、存储或 retention 问题,并确认对应时间窗口有 accepted profile batch。 -9. 记录 top stack、thread evidence、target status 和 ingestion health。 -10. 事件结束后停止 temporary profiling 或确认 continuous 是否仍需要保留。 - -## UI 使用顺序 - -进入 Java Profiling UI 后: - -1. 选择 namespace。 -2. 选择 service。 -3. 选择时间范围。 -4. 先打开 `status`。 -5. 如果同一服务下有多个 Pod 或 JVM,先用 `status` 表中的 Pod、PID、Seen 和 Reason 缩小到具体目标。 -6. 状态正常后,再看 `cpu`、`memory`、`locks`、`deadlocks` 或线程证据。 -7. 如果数据为空,打开 `ingestion` 判断上传和存储是否正常。 - -所有诊断视图共享同一组选择器。发生 rollout、重启或扩缩容时,要核对 Pod、PID 和 JVM start time,避免把旧实例、新实例或无关副本混在一起解释。 - -## 页面区域解读 - -`Service diagnosis` 是服务级诊断页面。它把同一个 namespace、service、Pod 和时间范围下的证据放在一起,便于从“目标是否可采”切换到 CPU、内存分配、锁、死锁和 ingestion 链路。 - -页面左侧竖栏提供主视图快捷入口,分别进入 `status`、`cpu`、`memory`、`locks`、`deadlocks` 和 `ingestion`。它不是搜索框,也不是筛选器;真正的查询条件在页面顶部的上下文条里。 - -顶部上下文条显示当前查询条件: - -- `Namespace`:当前命名空间。截图示例是 `java-profiler-qa`。 -- `Service`:当前服务。截图示例是 `jdk17-http-demo`。 -- `Range`:当前分析时间范围。截图示例是 `Last 1h`。 -- `UTC`:当前时间显示时区。跨地区协作时,记录事件时间时要同时写明时区;本界面固定以 UTC 呈现时间戳。 - -上下文条下方的说明强调:Prometheus 仍然负责指标趋势图;本 UI 只展示 profiles、线程证据、目标状态和 ingestion health。也就是说,先用现有监控确认“CPU 高、GC 高、延迟高”这类症状,再回到本页面找 Java 栈证据。 - -顶部标签页含义: - -| 标签 | 用途 | 什么时候打开 | -| --- | --- | --- | -| `Memory` | 查看 allocation bytes、allocation objects 和 top allocating stacks。 | GC 压力、allocation rate 或对象创建异常。 | -| `Cpu` | 查看 CPU profile flamegraph。 | CPU 高、线程忙、业务路径耗时异常。 | -| `Locks` | 查看 lock wait 或 contention profile。 | 请求卡住、线程 BLOCKED、锁竞争。 | -| `Deadlocks` | 查看 JVM 结构化死锁事件。 | 疑似死锁或线程永久互等。 | -| `Status` | 查看每个目标 JVM 是否可采、为什么不可采、用户下一步动作。 | 所有排查都先打开。 | -| `Ingestion` | 查看 collector 上传、backend 接受、ClickHouse 写入和丢弃/拒绝状态。 | profile 或线程证据为空、数据不完整或怀疑链路问题。 | - -`Status` 标签页中的 `Target 
status` 表格逐行表示一个目标 JVM 或一次目标状态记录。截图中的关键列按如下方式阅读: - -| 列 | 含义 | 解读方式 | -| --- | --- | --- | -| `Pod` | 目标 Pod 名。长名称会截断,排查时应结合完整 Pod 名或 hover/title 信息记录。 | -| `PID` | 目标 JVM 进程 ID。PID 可能复用,发布或重启窗口内不要只靠 PID 判断身份。 | -| `Seen` | backend 看到该状态距当前查询时间的时间差。越新越能代表当前状态。 | -| `State` | collector 对目标的采集状态,例如 `temporary`、`disabled`、`unsupported`。 | -| `Reason` | 状态原因代码,例如 `accepted`、`disabled_by_metadata`、`unsupported_jvm`。它决定下一步动作。 | -| `Message` | 面向用户的简短状态说明。它帮助确认 reason 是否符合预期。 | -| `User action` | 推荐处理动作。优先按这一列处理,再进入 profile 视图。 | - -对截图中的三类状态,可以这样解读: - -- `State=temporary` 且 `Reason=accepted`:该 HotSpot 兼容 JVM 的临时 profiling 已生效,可以进入 `Cpu`、`Memory`、`Locks` 或线程证据查看同一目标和时间范围内的数据。 -- `State=disabled` 且 `Reason=disabled_by_metadata`:profiling 被 metadata 禁用,或没有启用所需 metadata。需要添加 profiling metadata,或确认显式禁用是预期行为。 -- `State=unsupported` 且 `Reason=unsupported_jvm`:collector 判断该 JVM 不适合当前 v1 HotSpot profiling。先确认它是否为业务 JVM、是否 HotSpot 兼容、是否选错 container 或 PID。 - -同一服务同时出现 `temporary`、`disabled` 和 `unsupported` 并不矛盾。多副本、重启、sidecar、多 Java 进程或不同 Pod metadata 都可能让同一 service 下有多种状态。排查时先缩小到异常 Pod 和 JVM,再解释 profile 数据。 - -## 理解目标状态 - -`status` 是第一入口。常见 reason: - -| reason | 含义 | 你的动作 | -| --- | --- | --- | -| `accepted` | 目标可被采集。 | 进入 CPU、memory、locks 或线程证据。 | -| `disabled_by_metadata` | 没有启用 metadata,或被显式禁用。 | 添加 profiling metadata,或确认禁用符合预期。 | -| `temporary_expired` | 临时窗口已过期。 | 如事件仍在发生,重新开启临时 profiling。 | -| `invalid_duration` | duration 配置错误。 | 修正为 `10m`、`1h`、`30s` 这类格式。 | -| `unsupported_jvm` | JVM 不兼容 HotSpot,或不是目标业务 JVM。 | 确认 JVM 类型、Pod 内容器和 PID。 | -| `profiler_conflict` | 已有其他 profiler 占用目标 JVM,或前一次运行留下 async-profiler 状态。 | 停止冲突工具;如是验收残留,在管理员确认后滚动目标 Pod 再重试。 | -| `attach_failed` | collector 无法 attach 到 JVM。 | 联系平台管理员检查权限、容器安全策略或 JVM 参数。 | -| `upload_retryable` | 上传暂时失败,可恢复。 | 查看 `ingestion`,必要时联系管理员。 | -| `upload_dropped` | collector 已丢弃部分批次。 | 该时间窗口证据不完整,联系管理员。 | -| `storage_rejected` | backend 拒绝数据。 | 查看 `ingestion`,联系管理员排查合同或存储问题。 | - -空 flamegraph 不等于没有热点。只有在 target 
status、ingestion 和时间范围都正确时,空结果才可解释为该窗口没有匹配样本。 - -## 分析视图 - -不同 profile 视图是否有非空数据,取决于当前平台版本、目标 JVM 支持情况、采集窗口和 ingestion 状态;先用 `status` 和 `ingestion` 确认可用性。 - -### CPU - -`cpu` flamegraph 展示时间窗口内被采样到的 Java 调用栈。 - -如果页面同时显示 Top Table 和 flame graph,Top Table 里的 `Symbol`、`Self`、`Total` 应该一起读: - -- `Self` 和 `Total` 都要可见,并且都可以排序。 -- `Self` 说明这个函数自己消耗了多少 CPU。 -- `Total` 说明这个函数加上它的子调用一共消耗了多少 CPU。 -- 排查瓶颈时,通常先看 `Total` 找到最重的业务入口,再结合 `Self` 判断是不是函数本身在烧 CPU。 - -交互方式要保留完整 stack context: - -- 选中表格行时,应高亮完整 flame graph 中的匹配帧,并显示选中帧详情。 -- 选中行为不应把主图替换成单行过滤结果。 -- 搜索是显式动作;只有用户主动搜索时,才对非匹配帧做高亮或淡化。 -- 点击火焰图块进入 `focus` 后,当前块会变成新的根,子树按新的根重新缩放。 -- `Back` 用于回到上一级 focus,`Reset` 用于回到完整上下文。 -- 选中帧详情应包含符号、样本、类别,以及 `Self` / `Total` 的解读。 - -阅读方式: - -- 宽度越大,代表累计 CPU 样本越多。 -- 优先看业务方法、序列化、正则、JSON、加密、压缩、数据库客户端、缓存路径。 -- CPU profile 是采样证据,不会还原每一次方法调用。 - -### Memory - -`memory` 视图用于 allocation 分析: - -- allocation bytes:哪些路径分配字节最多。 -- allocation objects:哪些路径创建对象数量最多。 - -不要把 allocation profile 解读成 retained heap。它不能回答“谁持有对象”“引用根是什么”“哪个对象泄漏”。这类问题需要 heap dump 或 retained-heap 分析。 - -### Locks - -`locks` 视图用于看 lock wait time 或 lock contention count。结合线程快照一起看: - -- lock flamegraph 解释时间范围内的锁成本。 -- 线程快照解释采样时刻的线程状态。 -- 两者指向同一路径时,结论更强。 - -### Deadlocks - -`deadlocks` 展示 JVM 结构化线程数据派生出的死锁事件: - -- 涉及线程。 -- 等待的锁。 -- 锁 owner。 -- 阻塞栈帧。 - -没有 deadlock event 不代表没有慢请求,可能是普通锁等待、IO 等待或线程池耗尽。 - -### Threads - -忙线程和慢线程证据可能来自: - -- JVM per-thread CPU time。 -- RUNNABLE 快照。 -- CPU profile 热栈关联。 - -RUNNABLE 是线程状态,不是 CPU 百分比。UI 若标记为 sampled 或 profile-only,应按采样证据解释。 - -### Ingestion - -当 profile 或线程证据为空时,看 `ingestion` 只为回答三个问题: - -- collector 有没有上传数据。 -- backend 有没有接受数据。 -- 是否出现 retryable、dropped 或 rejected,需要联系管理员。 - -如果出现 dropped,该时间窗口证据不完整。如果出现 rejected,不要反复重试或继续解释业务问题,先联系管理员排查上传合同或存储问题。 - -## 使用案例 - -### 案例 1:首次接入一个 Java 服务 - -适用情境:服务负责人希望让服务具备性能排查能力。 - -步骤: - -1. 确认服务是 Kubernetes 中的 HotSpot 兼容 Java 服务。 -2. 先用 temporary 模式接入 15 分钟。 -3. 在 UI 选择 namespace、service 和最近 15 到 30 分钟。 -4. 打开 `status`,确认目标是 `accepted` 或 `temporary`。 -5. 
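Check whether `cpu`, `memory`, and `locks` have data.
-6. After the incident ends, remove the temporary annotation or explicitly disable profiling.
-
-Expected evidence:
-
-- `status` shows Pod, PID, Seen, State, and Reason.
-- `ingestion` shows that a target status or profile batch was accepted.
-
-Common misreadings:
-
-- Only the Deployment metadata was changed, not the Pod template.
-- With no load on the service the profile may be sparse, but status should still explain the collection state.
-
-### Case 2: Continuous profiling for a core service
-
-When to use: a core, high-traffic service needs stack evidence retained for the most recent 7 days.
-
-Steps:
-
-1. Agree on long-term collection and access control with the team.
-2. Enable `continuous` on the Pod template.
-3. After the rollout, open `status` and confirm the state of every Pod.
-4. Check `ingestion` regularly and confirm there are no sustained retries or dropped batches.
-5. When an incident occurs, select the incident time range and analyze directly.
-
-Expected evidence:
-
-- Each Pod of a multi-replica service shows its own status.
-- Profiles from the most recent 7 days can be queried.
-
-Common misreadings:
-
-- continuous does not mean unlimited retention.
-- A profile is sampled evidence, not a complete call log.
-
-### Case 3: CPU increase
-
-When to use: monitoring shows a CPU peak for a service.
-
-Steps:
-
-1. Record the namespace, service, Pod, and peak time from monitoring.
-2. If profiling is not enabled, turn on temporary profiling for 10 to 15 minutes.
-3. Open `status` and confirm that the target is accepted.
-4. Open `cpu` and select the peak time window.
-5. Look at the widest business stacks and framework stacks.
-6. Record the top stack, Pod/JVM, time range, and conclusion.
-
-Expected evidence:
-
-- The CPU flamegraph has non-root stacks.
-- The top stack explains where the CPU samples come from.
-
-Common misreadings:
-
-- Profiling enabled only after the incident has passed cannot reconstruct what happened.
-- Selecting the wrong time range shows normal load.
-
-### Case 4: GC pressure or rising allocation rate
-
-When to use: GC count, GC time, or allocation rate is abnormal.
-
-Steps:
-
-1. Use monitoring to determine the time range.
-2. Open `memory`.
-3. Look at allocation bytes first.
-4. If you suspect a storm of small objects, then look at allocation objects.
-5. Locate paths such as collection copying, string concatenation, JSON serialization, regular expressions, and cache deserialization.
-
-Expected evidence:
-
-- The allocation flamegraph explains where allocations come from.
-- bytes and objects may point to different hotspots.
-
-Common misreadings:
-
-- High allocation does not mean high retained heap.
-- This system cannot directly locate reference roots.
-
-### Case 5: Lock contention slowing requests
-
-When to use: request latency rises and threads appear as BLOCKED, WAITING, or TIMED_WAITING.
-
-Steps:
-
-1. Turn on temporary profiling.
-2. Temporarily set `snapshot-interval` to `10s`.
-3. Open `locks` and look at lock delay or contention.
-4. Review the slow-thread evidence.
-5. Focus on monitor, park, connection pool, cache lock, logging lock, class loading, or singleton initialization paths.
-
-Expected evidence:
-
-- The lock flamegraph concentrates on a small number of paths.
-- Slow-thread stacks agree with the lock hotspots.
-
-Common misreadings:
-
-- A single thread snapshot only describes the state at the sampling instant.
-- RUNNABLE does not necessarily mean the thread is consuming CPU.
-
-### Case 6: Suspected deadlock
-
-When to use: requests hang permanently, a thread pool does not release threads, or a JVM tool reports a deadlock.
-
-Steps:
-
-1. Open `deadlocks`.
-2. Select the time range in which the problem occurred.
-3. Review the deadlock cycle, threads, lock owner, and blocking frame.
-4. If there are no events, check whether `status` and thread snapshots were collected successfully.
-5. 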
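If the problem is still occurring, raise the thread snapshot frequency for a short period and observe again.
-
-Expected evidence:
-
-- The deadlock cycle clearly shows the threads and locks waiting on each other.
-
-Common misreadings:
-
-- No deadlock does not mean no blocking.
-- Deadlocks are not kept once they pass retention.
-
-### Case 7: Thread pool exhaustion or busy threads
-
-When to use: queues back up and requests time out, but CPU is not necessarily high.
-
-Steps:
-
-1. Review busy threads and slow threads.
-2. When per-thread CPU time is available, prefer that precise per-thread CPU evidence.
-3. When per-thread CPU time is not available, correlate RUNNABLE snapshots with the CPU flamegraph.
-4. Judge using thread names, thread pool configuration, queue length, and downstream dependency metrics together.
-
-Expected evidence:
-
-- busy threads point to threads with high CPU or frequent RUNNABLE state.
-- slow threads point to blocking, waiting, or lock contention paths.
-
-Common misreadings:
-
-- Thread pool exhaustion may be caused by slow downstream IO; a profile can only provide Java stack evidence.
-
-### Case 8: The UI shows no data
-
-When to use: the CPU, memory, locks, or deadlocks page is empty.
-
-Steps:
-
-1. Check the namespace, service, Pod, JVM, and time range.
-2. Open `status`.
-3. If it is `disabled_by_metadata` and you have permission and approval, add the profiling metadata through a controlled change path; otherwise request it as described in "When you cannot enable profiling yourself".
-4. If it is `temporary_expired`, start a new temporary window.
-5. If it is `unsupported_jvm`, `profiler_conflict`, or `attach_failed`, handle it according to the reason.
-6. Open `ingestion` and check for retry, dropped, and rejected.
-7. Confirm whether the data is older than the 7-day retention.
-
-Expected evidence:
-
-- The empty state can be explained by status or ingestion.
-
-Common misreadings:
-
-- An empty page is not a diagnostic conclusion.
-
-### Case 9: Temporary window expired
-
-When to use: `status` shows `temporary_expired`.
-
-Steps:
-
-1. Check the Seen time and the Pod.
-2. Confirm whether `profile-duration` has already expired.
-3. If the incident is still ongoing, set a new temporary window.
-4. If the incident has ended, remove the annotation.
-
-Expected evidence:
-
-- Within the new window the target shows `temporary` or `accepted`.
-
-Common misreadings:
-
-- The same service may show an old Pod as expired and a new Pod as accepted at the same time.
-
-### Case 10: Troubleshooting a multi-replica service
-
-When to use: a Deployment or StatefulSet has multiple Pods.
-
-Steps:
-
-1. Query `status` by service first.
-2. If monitoring points to a specific Pod, narrow down to that Pod.
-3. If all Pods are abnormal, use the service-level profile to find common hotspots.
-4. If only one Pod is abnormal, compare its stacks with those of the other Pods.
-5. During a rollout, look at old Pods and new Pods separately.
-
-Expected evidence:
-
-- Each Pod has its own target status.
-- Seen, PID, and JVM start time help distinguish instances.
-
-Common misreadings:
-
-- Viewing replicas mixed together can hide a single-instance anomaly.
-
-### Case 11: Multi-container Pod or sidecar
-
-When to use: the Pod contains a business container, a sidecar, a helper, or multiple Java processes.
-
-Steps:
-
-1. In `status`, check the container, PID, and JVM start time.
-2. Confirm that the target Java process belongs to the business container.
-3. If filtering is supported, narrow down to the target container or JVM.
-4. 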
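Do not interpret a sidecar profile as a business service hotspot.
-
-Expected evidence:
-
-- target identity distinguishes Pod, container, PID, and JVM start time.
-
-Common misreadings:
-
-- The same Pod does not mean the same JVM.
-- PIDs can be reused, so they must be combined with JVM start time.
-
-### Case 12: Pod restart or rollout window
-
-When to use: the service was restarting, scaling, or rolling out when the problem occurred.
-
-Steps:
-
-1. Check the Pod, PID, JVM start time, and Seen time.
-2. Align the profile time range with the specific Pod lifecycle.
-3. Query before and after the rollout separately.
-4. Interpret different JVM start times under the same PID separately.
-
-Expected evidence:
-
-- The old and new JVMs have different JVM start times.
-
-Common misreadings:
-
-- Profiles within a rollout window may mix old and new code paths.
-
-### Case 13: Unsupported JVM
-
-When to use: the target shows `unsupported_jvm`.
-
-Steps:
-
-1. Confirm the JVM distribution and version.
-2. Confirm whether it is HotSpot-compatible.
-3. Check whether it is a wrapper, a helper, or a non-business Java process.
-4. If the same Pod has multiple Java processes, distinguish them by PID.
-
-Expected evidence:
-
-- If the business JVM is HotSpot-compatible, it should have its own status.
-
-Common misreadings:
-
-- One unsupported Java process does not mean the whole Pod cannot be analyzed.
-
-### Case 14: Profiler conflict
-
-When to use: the target shows `profiler_conflict`.
-
-Steps:
-
-1. Confirm whether someone is running async-profiler manually.
-2. Check whether another profiler or diagnostic agent is present.
-3. Stop the conflicting tool, or decide to skip this JVM.
-4. Wait for the collector's next scan.
-
-Expected evidence:
-
-- Once the conflict is resolved, the status changes to accepted or the corresponding enabled state.
-
-Common misreadings:
-
-- Do not run multiple profilers against the same JVM at the same time.
-
-### Case 15: Attach failure
-
-When to use: the target shows `attach_failed`.
-
-Steps:
-
-1. Record the Pod, container, PID, JVM start time, and message.
-2. Ask a platform administrator to check collector permissions, the target container's security policy, and JVM attach arguments.
-3. After the fix, trigger a new scan or wait for the next one.
-
-Expected evidence:
-
-- After permissions or JVM arguments are fixed, the status changes to accepted or another explicit state.
-
-Common misreadings:
-
-- An attach failure does not mean "no performance problem"; it means collection failed.
-
-### Case 16: Confirming the real profile data path
-
-When to use: you need to confirm that the system can not only discover targets but also complete a performance analysis.
-
-Steps:
-
-1. Enable profiling for the JDK17 demo or for a HotSpot Java service with steady CPU, allocation, and lock contention load.
-2. Confirm that `status` is accepted within the current run window.
-3. Wait until at least one profile batch is accepted by the backend.
-4. Open `cpu` and confirm that both the Top Table and the flamegraph have non-root stacks.
-5. Open `memory` and `locks` and confirm that allocation and lock-delay both have non-empty stacks.
-6. In `cpu`, run a search, select a frame, view its details, and use Focus, Back, and Reset.
-7. Open `ingestion` and confirm that the profile batch is accepted and that there is no unexplained rejected/dropped/truncated evidence.
-8. Change the time range and confirm that query results change with the time window.
-9. 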
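If this is an acceptance run rather than routine troubleshooting, also save the ClickHouse sample/stack row counts, the TTL, browser screenshots, and a before/after comparison of the target Pod restart count.
-
-Expected evidence:
-
-- Real method stacks and sample values appear in the UI.
-- `ingestion` shows that the corresponding profile batch was accepted.
-- Only when CPU, allocation, and lock-delay are all non-empty is the core profile path proven complete.
-
-Common misreadings:
-
-- An accepted target status alone only proves that the control plane works; it does not prove that the profile data path is fully usable.
-- Thread snapshots and deadlock events are supplementary evidence; if this run is not investigating threads or deadlocks, their absence should be recorded as a gap rather than used to reject the CPU/memory/lock profile acceptance.
-
-## Security and overhead recommendations
-
-- By default, enable profiling only for Java services that clearly need it.
-- For incident troubleshooting, prefer temporary mode and set an explicit duration.
-- Use high-frequency thread snapshots only for short time windows.
-- Stack data may contain class names, method names, package names, and business paths; handle it as sensitive production data.
-- Do not paste full flamegraphs, thread dumps, stack payloads, profile payloads, thread snapshots, or tokens into public issues, chat groups, or screenshots.
-- When sharing externally, keep only the necessary method names; redact package names, service names, namespaces, Pod names, and business paths according to your organization's security policy.
-- Profile data is not log data, but it can still expose business flows, internal class names, API paths, or customer-related processing logic.
-- Tokens, cookies, internal domains, and internal backend/storage credentials or connection details must not appear in screenshots, PRs, issues, or chat logs.
-- When a profiler conflict is found, do not run multiple async-profiler tools against the same JVM at the same time.
-- Collection failures should not be ignored; read `status` and `ingestion` first.
-
-## Data retention
-
-All collected data is retained for no more than 7 days:
-
-- profile samples and stacks: 7 days.
-- thread snapshots and thread stacks: 7 days.
-- deadlock events: 7 days.
-- target status and ingestion health: 7 days.
-- optional raw artifacts: off by default; if enabled, at most 24 hours.
-
-Once retention has passed, the only options are to collect again or to use other historical evidence. Prometheus metric trends may still exist, but this system does not keep Java stack evidence long term.
-
-### Historical incidents beyond retention
-
-If the incident is more than 7 days old:
-
-1. Do not interpret an empty result as no problem.
-2. Use Prometheus, logs, change records, and business event records to reconstruct the timeline.
-3. If the problem can still be reproduced, enable temporary profiling again.
-4. 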
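If the problem cannot be reproduced, record the missing profile evidence in the postmortem conclusions.
-
-## Limits of interpretation
-
-| Question | Can this system answer it? | Notes |
-| --- | --- | --- |
-| Which Java code consumes CPU | Yes | Use the CPU flamegraph. |
-| Which Java code allocates the most | Yes | Use allocation bytes or allocation objects. |
-| Which lock paths cause waiting | Yes | Use the lock profile together with slow-thread evidence. |
-| Whether a Java deadlock exists | Yes | Use deadlock events. |
-| Which object retains heap memory | No | Requires a heap dump or retained-heap analysis. |
-| Fully reconstructing historical thread execution | No | Thread snapshots are evidence from sampling instants. |
-| JVM metric trends and alerting | Outside this system | Use the existing Prometheus dashboards. |
-| Profiling non-Java services | Not supported in the first version | HotSpot-compatible Java services only. |
-
-## Glossary
-
-| Term | Meaning |
-| --- | --- |
-| target | A JVM process that can be discovered and collected. |
-| target identity | The identity fields used to distinguish targets: cluster, namespace, service, pod, container, PID, and JVM start time. |
-| status | The collector's judgment of a target's current collection state, for example accepted, disabled, unsupported, attach_failed. |
-| reason | The reason behind a status; it determines the user's next action. |
-| accepted | The collector considers the target collectable; this only proves the control plane works and does not guarantee a non-empty profile yet. |
-| non-empty profile | CPU, allocation, or lock samples have been written and can be queried; this is the evidence that the profiling data path works. |
-| ingestion | The process in which the collector uploads data to the backend and the backend writes it to ClickHouse. |
-| profile batch | One uploaded batch of CPU, allocation, or lock profile data. |
-| target status batch | One batch of target statuses uploaded by the collector. |
-| flamegraph | A call stack graph in which width represents cumulative sample value. |
-| JVM start time | The time the JVM started, used to tell apart different JVMs after PID reuse. |
-| retention | Maximum data retention time. In this system, profile, thread, deadlock, status, and ingestion data is kept for no more than 7 days. |
-
-## What to provide when asking a platform administrator for help
-
-If you need help from a platform administrator, provide all of this information at once to avoid back-and-forth:
-
-```text
-Namespace:
-Service:
-Pod:
-Container / JVM if known:
-Time range:
-UI view: status / cpu / memory / locks / deadlocks / ingestion
-Status reason:
-Ingestion state:
-What you expected:
-Screenshot or sanitized evidence: redact tokens/cookies/DSN/internal domains/customer identifiers; prefer stack summaries over raw payloads.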
-```
-
-## Incident record template
-
-```text
-Service:
-Namespace:
-Pod/JVM:
-Incident window:
-Profiling mode: temporary / continuous
-Profile views used: cpu / memory / locks / deadlocks / status / ingestion
-Top stack:
-Thread evidence:
-Target status:
-Ingestion health:
-Conclusion:
-Follow-up:
-Profiling stopped at:
-Redaction check: no tokens/cookies/DSN/internal domains/customer identifiers/raw payloads.
-```
+1. Target status: is the JVM accepted?
+2. Service and namespace: do they match the workload metadata?
+3. Time range: does it include the profiling window?
+4. Ingestion health: were profile batches accepted?
+5. Runbook: follow [Enable Profiling](./java-profiling-runbook.md) to verify metadata and collector behavior.
diff --git a/docs/operations/real-profiling-acceptance-standard.md b/docs/operations/real-profiling-acceptance-standard.md
index 3afdf15..11fdc8a 100644
--- a/docs/operations/real-profiling-acceptance-standard.md
+++ b/docs/operations/real-profiling-acceptance-standard.md
@@ -42,9 +42,13 @@ Every full acceptance run must collect and verify all of the following from the
 - target status contains at least one `accepted` row for the Java target
 - CPU flamegraph has a non-zero root value
+- Wall Clock flamegraph has a non-zero root value for the Java target
+- I/O wait evidence has a non-zero root value when the demo I/O workload is enabled
 - allocation flamegraph has a non-zero root value
 - lock-delay flamegraph has a non-zero root value
+- GC pause evidence returns at least one JVM `gc_pause` event when the demo GC workload is enabled
 - ClickHouse contains profile samples and profile stacks for the target
+- ClickHouse contains JVM event rows for GC evidence when GC acceptance is in scope
 - backend ingestion UI API returns successfully
 - profile sample TTL remains bounded to 7 days
 - target workload restart count does not increase during acceptance
@@ -56,7 +60,10 @@ Thread snapshots and deadlock events are useful evidence, but they are optional
 The demo 
workload must be driven during the async-profiler window, not before or after it.
 - CPU load must execute while profiling is active.
+- Wall Clock load must execute while profiling is active and must include blocked or waiting Java threads.
+- I/O load must execute while profiling is active through Java socket/file/nio paths; node-level non-Java I/O is not valid evidence.
 - Allocation load must execute while profiling is active.
+- GC load must create allocation pressure sufficient to produce at least one JVM GC pause event in the selected window.
 - Lock profiling must create real contention with concurrent lock requests. A single `synchronized` request is not enough because it can complete without blocking another Java thread.
 - If a previous run loaded `libasyncProfiler.so` into the target JVM and the collector was restarted, restart the demo pod before a fresh strict run to avoid stale profiler-conflict state.
@@ -67,6 +74,9 @@ The profile UI must be validated with real backend data, not mocked data, and mu
 - service/namespace defaults and filters select the Java demo target
 - Status view shows target evidence
 - CPU view supports Top Table, Flame Graph, and Both modes
+- Wall Clock view returns profile evidence without replacing CPU evidence
+- I/O view returns Java-owned blocking evidence or a clear no-evidence state
+- GC view returns JVM event evidence and allocation correlation for the same target/time filters
 - Top Table ranks application Java symbols with Self and Total CPU semantics
 - Flame Graph shows full sampled stack context, not Java source call order
 - Search changes flamegraph highlighting/dimming
@@ -84,9 +94,11 @@ Use `scripts/real-acceptance.sh --require-full-profiling` for strict acceptance.
 - wait for target status rows instead of assuming they exist immediately after rollout or table truncation
 - explicitly clear stale disable metadata with `java-profiler.io/profile-disabled: "false"` when configuring a target
 - force a fresh workload rollout for acceptance by adding run-specific metadata such as `java-profiler.io/acceptance-run`
-- drive CPU, allocation, and concurrent lock load for the full profiling wait window
+- drive CPU, Wall Clock, I/O, allocation, GC, and concurrent lock load for the full profiling wait window
 - keep no-Service synthetic workload runs alive until the full profiling wait window has elapsed
-- fail when CPU, allocation, or lock-delay profile data is empty
+- fail when CPU, Wall Clock, allocation, or lock-delay profile data is empty
+- fail when required GC event evidence is empty after the demo GC workload ran
+- fail when required Java I/O wait evidence is empty after the demo I/O workload ran
 - support `--high-volume` for ingestion hardening changes; this mode must extend the profiling window, increase CPU/allocation/lock load parallelism, and verify bounded profile batch metadata
 - fail high-volume acceptance when profile batches are rejected, when a profile batch exceeds the collector batch limit, or when ClickHouse restarts/OOMKills during the run
 - verify collector/backend profile payload compatibility in tests; profile batch JSON must match the backend contract
@@ -108,7 +120,9 @@ git diff --check
 Treat these as acceptance blockers:
 - no accepted target status for the current run window
-- CPU, allocation, or lock-delay flamegraph root value is zero
+- CPU, Wall Clock, allocation, or lock-delay flamegraph root value is zero
+- GC pause event evidence is empty when GC acceptance is in scope
+- Java I/O evidence is empty when I/O acceptance is in scope
 - backend rejects profile payloads because batch size is too large
 - backend rejects profile payloads because required fields such as `batch_id` or `collector_id` are 
missing; this means collector/backend payload versions or JSON tags do not match
 - target status is `disabled_by_metadata` because a stale truthy `profile-disabled` annotation was left on the Pod template
diff --git a/docs/plans/2026-05-17-001-feat-expert-java-profiling-plan.md b/docs/plans/2026-05-17-001-feat-expert-java-profiling-plan.md
new file mode 100644
index 0000000..6fdeb58
--- /dev/null
+++ b/docs/plans/2026-05-17-001-feat-expert-java-profiling-plan.md
@@ -0,0 +1,814 @@
+---
+title: "feat: Expert Java Profiling Incident Workflow"
+type: feat
+status: completed
+date: 2026-05-17
+origin: docs/brainstorms/java-profiler-requirements.md
+---
+
+# feat: Expert Java Profiling Incident Workflow
+
+## Summary
+
+This plan upgrades the Java/Kubernetes profiler from a basic profile viewer into an expert incident-analysis workflow for Java services. It keeps the product Java-only and HotSpot-first while adding explicit profile semantics, Pod-level variance/drill-down, UI noise controls, source-copy ergonomics, and new Java Wall Clock, GC, and I/O evidence derived from JVM/async-profiler/JFR-capable data rather than general multi-language system profiling.
+
+---
+
+## Problem Frame
+
+During P0/P1 Java production incidents, responders need profile data that is physically interpretable, instance-specific, and low-noise. The current UI can show CPU, allocation, lock, status, and ingestion evidence, but it still exposes large raw values and ambiguous percentages, can dilute a single bad Pod inside service-level aggregation, and does not yet provide Java Wall Clock, GC, or I/O blocking views that experts expect when CPU is not the bottleneck.
+
+---
+
+## Requirements
+
+- R1. Preserve the product boundary: this is a Java program profiler for Kubernetes workloads, not a general multi-language eBPF observability platform.
+- R2. Replace raw profile values in the UI with profile-type-aware human units and explicit percentage baselines.
+- R3. 
Show enough baseline context for CPU, Wall Clock, allocation, lock, GC, and I/O values that a responder can decide whether a finding is material.
+- R4. Support Pod/JVM-level drill-down and service-level variance so single-instance skew is not hidden by aggregation.
+- R5. Add Java Wall Clock profiling for blocked, sleeping, waiting, or I/O-heavy Java execution paths.
+- R6. Add Java GC evidence that can be correlated with allocation profiles and incident time ranges.
+- R7. Add Java I/O evidence for socket/file blocking paths where JVM/JFR-capable events can identify Java ownership.
+- R8. Provide UI controls to hide native/system/runtime frames while preserving the ability to inspect them when needed.
+- R9. Improve Top Table expert workflows: sortable Self/Total columns, full symbol/method visibility, and copyable stack or frame context.
+- R10. Clearly disclose current support limits for async context stitching, JIT/inlining recovery, virtual threads, and unsupported JVM/runtime cases.
+- R11. Keep all collected data subject to the existing bounded retention rule: no profile, JVM event, thread, status, ingestion, or raw artifact data retained for more than 7 days.
+- R12. Extend real Kubernetes acceptance so completion requires non-empty Java Wall Clock, GC, and I/O evidence in addition to existing CPU, allocation, lock, ClickHouse, UI, and no-restart checks.
+- R13. Keep the first expert UI release focused on a single selected Java Pod CPU profile workflow; A/B comparison, service rollup, Wall Clock, GC, and I/O detail panes may land behind later evidence-view phases rather than in the first MVP screen.
+
+**Origin actors:** Java Service Owner, Incident Responder, Platform Operator.
+**Origin flows:** opt-in profiling, incident temporary profiling, service and Pod analysis, retention cleanup, thread/profile diagnosis.
+**Origin acceptance examples:** service profile rendering, Pod drill-down for allocation spikes, retention cleanup, memory/GC correlation, slow-thread and busy-thread analysis.
+
+---
+
+## Scope Boundaries
+
+- The implementation must remain Java/HotSpot-compatible JVM profiling only. It must not introduce generic process profiling for Go, Node.js, Python, C/C++, or arbitrary container processes.
+- The UI may expose links or time-range context for Prometheus dashboards, but it must not duplicate Prometheus as a metrics store, dashboard builder, alerting surface, tracing UI, topology UI, or log viewer.
+- Pyroscope, Parca, Grafana, or other profile backends must not become required runtime dependencies.
+- Async context stitching and JIT/inlining recovery are included as explicit capability disclosure and design groundwork; production-grade cross-thread request reconstruction is not required before the Java Wall Clock, GC, and I/O evidence lands.
+- Retained heap ownership remains out of scope unless a heap dump or future retained-heap feature is separately designed. Allocation profiles must not be presented as retained heap.
+
+### Deferred to Follow-Up Work
+
+- Full async context stitching across WebFlux, Netty, `CompletableFuture`, and virtual-thread handoff: separate design and implementation after the current JVM event model can expose honest capability limits.
+- Source repository integration or IDE deep links: defer beyond copyable stack/frame context unless a concrete source-indexing design is approved.
+- Non-HotSpot JVM support: remains a future compatibility effort; unsupported JVMs must continue to be skipped with explicit status.
+
+---
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `collector/internal/jfr/normalizer.go` already converts CPU execution samples to nanoseconds with `DefaultCPUExecutionSampleValueNS`, while allocation and lock values use profile-specific units.
+- `collector/internal/jfr/parser.go`, `collector/internal/jfr/aggregate.go`, and `collector/internal/pipeline/profile_batcher.go` are the collector-side extension points for new Java/JFR-derived event types.
+- `backend/internal/app/query_flamegraph.go` and `backend/internal/app/query_top_stacks.go` query and rank profile samples but currently return raw `uint64` values and string percentages.
+- `backend/internal/clickhouse/profile_repository.go` already filters by namespace, service, Pod, profile type, and time range; this pattern should be extended rather than replaced.
+- `backend/internal/domain/flamegraph_builder.go` is the right layer for tree metadata such as unit/baseline information and native-frame filtering flags.
+- `web/src/api/types.ts`, `web/src/features/cpu/cpu-view.tsx`, `web/src/features/memory/memory-view.tsx`, `web/src/features/locks/locks-view.tsx`, and `web/src/visualization/flamegraph.tsx` are the UI/API surfaces for profile semantics, top tables, frame classification, focus/search/reset, and profile-type-specific views.
+- `docs/operations/real-profiling-acceptance-standard.md` is mandatory for changes touching collector profiling, ingestion, ClickHouse storage, backend query APIs, Kubernetes deployment, demo workload, or profile UI.
+
+### Institutional Learnings
+
+- Keep collection node-local, service-centric, bounded, and aligned with the documented architecture.
+- Preserve the boundary between profiling evidence and metric dashboards: metrics stay in Prometheus-series systems; this product shows Java profiles, thread evidence, target status, and ingestion health.
+- Real acceptance must prove non-empty profile data and real UI workflows, not only healthy pods.
+
+### External References
+
+- async-profiler officially supports CPU, allocation, lock, Wall Clock, multiple-event JFR output, native/JVM frames, and HotSpot-specific stack walking: https://github.com/async-profiler/async-profiler
+- async-profiler profiling modes document Wall Clock sampling, Java lock profiling, and multi-event JFR output: https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilingModes.md
+- async-profiler options document intervals, wall options, stack depth, and stack walking features such as `comptask`, `vtable`, and `pcaddr`: https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilerOptions.md
+- Oracle JDK 21 `jdk.jfr` documentation describes JFR events, timestamps, durations, settings, and typed metadata such as timespan, data amount, frequency, and percentage: https://docs.oracle.com/en/java/javase/21/docs/api/jdk.jfr/jdk/jfr/package-summary.html
+
+---
+
+## Key Technical Decisions
+
+- Keep new evidence Java-scoped: Wall Clock, GC, and I/O must be represented as Java/JVM target evidence tied to namespace/service/Pod/container/JVM identity, not as node-wide eBPF or OS-wide telemetry.
+- Add semantics at the API boundary, not only in React formatting: backend responses should carry value unit, display unit, baseline description, and percentage basis so CLI/UI/docs all explain the same numbers.
+- Treat CPU sample values as nanoseconds first: the collector already normalizes CPU events to time, so the UI should present CPU time and average cores over the selected window rather than raw integer values.
+- Keep native/runtime frames available but filterable: experts need a default low-noise Java ownership view while still being able to inspect JVM/native evidence when the issue is JNI, runtime, or kernel-adjacent.
+- Use a two-track Java evidence model: profile samples for stack-bearing flamegraph/top-table evidence, JVM events for timestamped metadata-heavy evidence such as GC pauses or I/O events without usable stacks. Both tracks share target identity, time range, retention, ingestion health, and capability metadata.
+- Make capability gaps visible in-product: async context stitching, JIT/inlining recovery, virtual-thread behavior, and missing event support should appear as capability notes or target status details instead of hidden limitations.
+
+---
+
+## Open Questions
+
+### Resolved During Planning
+
+- Should Wall Clock, GC, and I/O be active current scope or future roadmap? Resolved: active current scope, limited to Java/JVM evidence.
+- Should the plan target all languages because eBPF is involved? Resolved: no; the plan explicitly targets Java programs only.
+
+### Deferred to Implementation
+
+- Exact JFR event names and parser mappings for GC and I/O: verify against the parser library and generated async-profiler/JFR payloads during implementation.
+- Exact Wall Clock interval and overhead defaults: choose conservative production defaults after validating data volume and acceptance workload behavior.
+- Exact CPU quota/baseline source: prefer Kubernetes resource limits and cgroup data where available, with a clear fallback to selected-window total profile time when quota is unavailable.
+- Whether JIT/inlining metadata can be extracted from current JFR parser output: disclose limitations if the current parser cannot preserve the required metadata without a dependency upgrade.
+
+---
+
+## High-Level Technical Design
+
+> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.*
+
+### Service Diagnosis Workspace Hierarchy
+
+The UI is an app workspace for incident response, not a dashboard card grid. 
`DESIGN.md` is now the source of truth for visual direction. The first MVP screen should answer, in order: what single Java Pod/JVM target is selected, whether CPU profile evidence is trustworthy, which Java methods dominate, and how to copy/share the selected frame context.
+
+The broader product still needs service skew, Wall Clock, GC, I/O, and A/B comparison, but these should be added as follow-on evidence views after the single-Pod CPU workflow is stable. The initial implementation must not force all evidence dimensions into the first screen.
+
+```text
+MVP JAVA POD CPU WORKSPACE
+
+1. Context Bar
+   namespace / service / selected Pod-JVM / time range / evidence freshness
+
+2. Evidence Health Strip
+   collection status | freshness lag | drop rate | sample frequency | CPU quota baseline
+
+3. Evidence Navigation
+   MVP: CPU
+   Later profiles: Wall Clock | Allocation | I/O | Locks
+   Later JVM events: GC | Deadlocks
+
+4. Primary Evidence Pane
+   CPU Top Table + Flamegraph + Both mode
+
+5. Detail Drawer
+   selected Java frame | FQCN/method signature | units/baseline | stack path | copy/share/focus actions
+```
+
+The evidence health strip earns its pixels by preventing false confidence. It should show compact status and collection semantics, not long tutorial copy. Service skew, A/B comparison, event correlation, and AI interpretation blocks are intentionally out of the first MVP screen.
+
+### App UI Rules
+
+- Do not turn evidence health into a dashboard-card mosaic. Use one compact strip with collection status, freshness, drop rate, sample frequency, and CPU quota baseline in the MVP.
+- Cards are allowed for repeated rows/items, modals, and detail drawer sections; they should not be used as decorative page-section wrappers.
+- The primary evidence pane must get the most vertical space. Flamegraphs and top tables outrank explanatory copy in the MVP; event timelines belong to later JVM event phases.
+- Use compact icon buttons with accessible labels/tooltips for secondary actions such as copy, focus, reset, and help.
+- Tutorial or interpretation guidance belongs behind help affordances or in the detail drawer, not as persistent text above the evidence.
+- Avoid profile-type rainbow UI. Use restrained colors with one clear accent for selection/warning and semantic status colors only where they earn their pixels.
+- Truncated text must expose the full value on hover and keyboard focus. Long class names, Pod names, JVM IDs, and frame paths are normal data, not edge cases.
+
+### Lightweight UI Vocabulary
+
+Project-wide `DESIGN.md` now exists and must be read before implementing UI. This plan keeps a small implementation vocabulary aligned with that design source of truth.
+
+| UI element | Purpose | Rules |
+|------------|---------|-------|
+| View shell | Shared Java profile frame | One context bar, one evidence health strip, one primary evidence pane, one selected-frame drawer |
+| Evidence health strip | Trust and collection orientation | Compact inline status groups; no decorative cards; show freshness, drop rate, sample frequency, and CPU quota baseline |
+| Evidence navigation | Switch between evidence types | MVP exposes CPU first; later entries may be disabled or hidden until their backend evidence is real |
+| Evidence pane | Main work surface | MVP profile evidence uses Top Table / Flamegraph / Both; event timelines wait for JVM event phases |
+| Detail drawer | Selected frame context | Preserve target/time context; show full identifiers; include copy/share/focus actions and capability notes |
+| Status badge | State at a glance | Short text plus semantic color; must have accessible label and not rely on color alone |
+| Warning line | Partial, stale, unsupported, or truncated evidence | One concise line near affected evidence with link/action to Status or Ingestion when relevant |
+| Icon button | Secondary actions | Use lucide icons where available; 
44px touch target; tooltip and accessible label required |
+| Tooltip / popover | Full identifiers and short explanations | Trigger on hover and keyboard focus; never be the only place critical error state appears |
+
+### Responsive and Accessibility Requirements
+
+The profiler workspace is desktop-first because flamegraphs and top tables need room, but it must degrade intentionally on narrower screens and remain keyboard/screen-reader usable.
+
+**Responsive layout**
+- Desktop, `>= 1200px`: context bar and triage strip stay at the top; evidence navigation sits below; profile views can use split Top Table / Flamegraph layout; detail drawer opens on the right.
+- Tablet, `768px-1199px`: evidence pane may stack Top Table above Flamegraph; detail drawer collapses below the selected evidence; triage strip wraps into two compact rows without becoming cards.
+- Mobile, `< 768px`: context remains visible but compact; evidence navigation becomes horizontally scrollable grouped tabs; Top Table or event summary appears before deep flamegraph/timeline detail; flamegraph uses fit-to-width or horizontal scroll with readable row height.
+
+**Accessibility**
+- Evidence tabs, mode switches, flamegraph frames, table rows, focus/reset/back controls, and copy/help buttons must be keyboard reachable.
+- Flamegraph frame buttons need accessible names that include frame name, self value, total value, percent basis, and depth.
+- Tooltips/popovers must open on keyboard focus as well as hover.
+- Status badges and warnings must not rely on color alone; include text or accessible labels.
+- Copy success/failure must be announced through an `aria-live` region and provide selectable fallback text on clipboard failure.
+- Interactive controls need at least 44px touch targets or equivalent hit area.
+- Text truncation must preserve full values for assistive technology and keyboard users.
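
The accessible-name rule for flamegraph frame buttons could be sketched as below. This is a minimal illustration, not the project's implementation: `FrameSummary`, `formatDurationNs`, and `frameAccessibleName` are hypothetical names, the real types in `web/src/api/types.ts` may differ, and the sketch assumes CPU values arrive in nanoseconds with the selected window's root total as the percent basis.

```typescript
// Hypothetical frame shape for illustration; not the project's actual API type.
interface FrameSummary {
  name: string;        // fully qualified frame name
  selfNs: number;      // self CPU time in nanoseconds (assumed unit)
  totalNs: number;     // self plus callees, in nanoseconds
  rootTotalNs: number; // root total of the selected window, used as percent basis
  depth: number;       // stack depth, root = 0
}

// Render a nanosecond value in the largest unit that keeps it readable.
function formatDurationNs(ns: number): string {
  if (ns >= 1e9) return `${(ns / 1e9).toFixed(2)} s`;
  if (ns >= 1e6) return `${(ns / 1e6).toFixed(2)} ms`;
  if (ns >= 1e3) return `${(ns / 1e3).toFixed(2)} µs`;
  return `${ns} ns`;
}

// Build one string carrying frame name, self value, total value,
// percent basis, and depth, suitable for an aria-label.
function frameAccessibleName(frame: FrameSummary): string {
  // Guard against an empty window so the label never shows NaN.
  const percent =
    frame.rootTotalNs > 0 ? (frame.totalNs / frame.rootTotalNs) * 100 : 0;
  return [
    frame.name,
    `self ${formatDurationNs(frame.selfNs)}`,
    `total ${formatDurationNs(frame.totalNs)}`,
    `${percent.toFixed(1)}% of selected window CPU time`,
    `depth ${frame.depth}`,
  ].join(", ");
}
```

The result would typically be passed as the `aria-label` of each frame button, so screen-reader users get the same units and percent basis that sighted users read from the tooltip.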
+
+### UI Interaction State Matrix
+
+Each empty or partial state must explain what the user can conclude and what to do next. Avoid generic "No data" text because it hides the difference between unsupported capability, empty time range, ingestion failure, filtered-out Pod, and healthy-but-quiet evidence.
+
+| Feature | Loading | Empty | Error | Success | Partial |
+|---------|---------|-------|-------|---------|---------|
+| Context / triage strip | Stable skeleton with current filters retained | "No Java target selected or discovered" with Status as primary next action | Query failure with retry and Ingestion link | Freshness, target status, ingestion status, skew, dominant evidence type | Stale or mixed evidence window labeled with affected evidence types |
+| Pod/JVM drill-down | Row skeletons sized like final rows | Single target state explains no variance to compare | Status/summary unavailable with retry | Per-Pod/JVM contribution and skew cue | Some Pods/JVMs missing status or profile evidence, shown as incomplete not zero |
+| Profile evidence pane | Bounded graph/table skeleton | No samples for selected target/time/profile type, with time-range and ingestion hints | Query failed with retry and copied diagnostic context | Top Table plus Flamegraph/Both with semantic units | Truncated samples/nodes shown in metadata and warning line |
+| JVM event pane | Timeline/table skeleton | No events in range, distinct from unsupported capability | Query failed with retry and Ingestion link | Summary stats plus event timeline/table and correlation panel | Event cap reached or metadata incomplete, with visible omitted count |
+| Capability notes | Compact "checking support" state | Not applicable for this evidence type | Unknown support state with status reason | Supported or supported-with-limits explanation | Partial support lists missing fields or separate evidence windows |
+| Copy actions | Busy feedback on the icon/button | Disabled with reason when no 
frame/event selected | Clipboard failure shows fallback selectable text | Copied confirmation includes what was copied | Copy includes partial/truncated warning when evidence is incomplete | + +### Incident Responder Journey + +The UI should be designed around symptom-to-evidence flow. A responder often arrives from an alert or Prometheus chart and needs a fast path to trustworthy Java ownership evidence. + +| Step | User does | User feels | UI must do | +|------|-----------|------------|------------| +| 1 | Opens service diagnosis during an incident | Time-boxed and suspicious of stale data | Confirm selected Java target, time range, freshness, target status, and ingestion state without requiring a tab switch | +| 2 | Checks whether the issue is service-wide or one instance | Looking for skew before detail | Show service-level Pod/JVM variance before the flamegraph consumes attention | +| 3 | Chooses evidence by symptom | Overloaded by possible causes | Triage strip points toward CPU, Wall Clock, GC, I/O, allocation, or lock evidence using compact status and dominance cues | +| 4 | Drills into a hot frame or event | Focused, needs context preserved | Keep namespace/service/Pod/JVM/time context visible while the detail drawer shows frame/event metadata | +| 5 | Copies evidence to incident chat, Jira, or IDE | Under pressure, needs a complete packet | Copy one complete payload with target identity, time range, evidence type, units, percent basis, frame/event details, and partial warnings | +| 6 | Encounters unsupported or partial evidence | Suspicious of the tool | Explain the limitation and the next best action, such as changing time range, checking ingestion, or using CPU/allocation/lock evidence | + +```text +DELIVERY DEPENDENCY SHAPE + +MVP Java Pod CPU workbench + U1 Profile semantics + U6 Frame filtering/capability notes for CPU + U7 Top table/copy workflows for CPU + | + v +Foundation for broader evidence + U8 Evidence storage/query/retention foundation + U2 
Pod/JVM drill-down and service variance + | + v +First new evidence type + U3 Java Wall Clock + | + v +Additional Java JVM evidence + U4 GC events + U5 I/O events + | + v +Expanded expert incident UI + Wall/GC/I/O panes + service rollup + time-window A/B comparison + | + v +Acceptance and docs + U9 Real Java acceptance + docs +``` + +```mermaid +flowchart LR + java[Java Pod / HotSpot JVM] + collector[Node Collector DaemonSet] + jfr[JFR / async-profiler output] + normalize[Java event normalizers] + backend[Backend API] + ch[(ClickHouse)] + ui[Java Profile UI] + prom[Existing Prometheus dashboards] + + collector -->|attach only eligible Java JVMs| java + java --> jfr + jfr -->|CPU, allocation, lock, Wall Clock, I/O| normalize + jfr -->|GC events / pauses| normalize + normalize -->|typed Java evidence with units and target identity| backend + backend --> ch + ui -->|profile, JVM event, status, ingestion queries| backend + ui -. time-range links only .-> prom +``` + +The core rule is that every value crossing the backend/UI boundary must carry its physical meaning. A frame value is not just a number; it is CPU time, Wall Clock time, allocation bytes, object count, lock delay, lock count, GC pause duration, I/O duration, or I/O bytes/count, with a baseline that explains what the percentage means. 
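To make the rule concrete, the sketch below shows one way a value could carry its unit and meaning across the boundary. It is an illustration under assumptions: `SemanticValue`, `ValueUnit`, and the formatter names are hypothetical and do not come from the existing contracts. The only fixed arithmetic is that average cores equals CPU nanoseconds divided by the selected window in nanoseconds.

```typescript
// Hypothetical unit kinds covering the values listed above.
type ValueUnit = "nanoseconds" | "bytes" | "count";

interface SemanticValue {
  raw: number;             // kept untouched for sorting and machine use
  unit: ValueUnit;
  meaning: string;         // e.g. "CPU time", "allocation bytes", "lock count"
  percentageBasis: string; // what 100% refers to
}

function formatDuration(ns: number): string {
  if (ns >= 1e9) return `${(ns / 1e9).toFixed(2)} s`;
  if (ns >= 1e6) return `${(ns / 1e6).toFixed(1)} ms`;
  if (ns >= 1e3) return `${(ns / 1e3).toFixed(1)} us`;
  return `${ns} ns`;
}

function formatBytes(bytes: number): string {
  const units = ["B", "KiB", "MiB", "GiB", "TiB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i++;
  }
  return i === 0 ? `${value} B` : `${value.toFixed(1)} ${units[i]}`;
}

// Human display uses typed formatting; the raw number is never shown bare.
function formatSemanticValue(v: SemanticValue): string {
  switch (v.unit) {
    case "nanoseconds":
      return `${v.meaning}: ${formatDuration(v.raw)}`;
    case "bytes":
      return `${v.meaning}: ${formatBytes(v.raw)}`;
    case "count":
      return `${v.meaning}: ${v.raw}`;
  }
}

// Average cores over the selected window; undefined when the baseline is unknown.
function averageCores(cpuNs: number, windowNs: number): number | undefined {
  return windowNs > 0 ? cpuNs / windowNs : undefined;
}
```

For example, 30 seconds of CPU time inside a 60-second window gives `averageCores(30e9, 60e9)`, which is 0.5 cores. Returning `undefined` when the window is unknown leaves room for a fallback basis instead of a misleading number.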
+ +### Java Evidence Data Model + +```text +JavaEvidence + shared identity: namespace, service, pod, container, process/JVM, node, time range + shared controls: retention <= 7 days, ingestion health, query limits, capability status + + ├── ProfileSample track + │ shape: stack frames + numeric value + profile type + │ examples: CPU, allocation, lock delay/count, Wall Clock, stack-bearing I/O + │ query: flamegraph, top table, frame inspector + │ + └── JVMEvent track + shape: timestamp/duration + event metadata + optional stack + examples: GC pause, GC cause/action, socket/file I/O without usable stack + query: event timeline, summary stats, correlation panels +``` + +Do not force GC pauses into `ProfileSample` just to reuse flamegraph code. Do not force Wall Clock out of `ProfileSample` if it has stack-bearing samples. The shared abstraction is identity, semantics, retention, health, and capability reporting, not one universal row shape. + +### Java Evidence Collection Mode Matrix + +| Evidence | Preferred collection mode | Must not break | Capability behavior | +|----------|---------------------------|----------------|---------------------| +| CPU | Existing async-profiler primary session using CPU sampling | Existing CPU flamegraph/top table and acceptance data | Required for strict profiling on supported HotSpot JVMs | +| Allocation bytes/objects | Existing async-profiler JFR allocation options in the CPU session when supported | CPU session continuity and upload cadence | Required for strict profiling when allocation support is enabled | +| Lock count/delay | Existing async-profiler JFR lock options in the CPU session when supported | CPU/allocation data and lock-delay semantics | Required for strict profiling when lock support is enabled | +| Wall Clock | Prefer an explicit Java evidence mode that does not replace CPU sampling; use a separate bounded phase/session if async-profiler cannot emit CPU and Wall Clock in one usable JFR stream | CPU/allocation/lock 
evidence in the same incident window | Capability-detected; unsupported/partial state is visible per target | +| GC | JVM/JFR event evidence keyed by the same target/time identity, not a stack-only profile pretending to be allocation ownership | Allocation semantics and retained-heap boundary | Capability-detected; absence of GC events is distinguishable from parser failure | +| I/O | JVM/JFR-visible Java socket/file I/O evidence; stack profile when frames are available, event table when only event metadata is available | Java-only scope and CPU/Wall Clock profile correctness | Capability-detected; no node-wide non-Java I/O fallback | + +Collector implementation should treat this matrix as the contract. Adding Wall Clock must not simply replace `itimer` CPU collection. If the underlying profiler cannot emit the desired evidence together, the collector should use explicit bounded phases and annotate the resulting evidence window so the UI does not imply perfect simultaneity. + +### Production Budget Gates + +These budgets are planning targets and acceptance gates. Implementation may choose stricter defaults, but it should not silently exceed these without updating the requirements and acceptance standard. + +| Budget | Gate | +|--------|------| +| Target safety | Strict real acceptance must show no target workload restart increase and no new sustained error state after profiling. | +| Collection window | Any extra Wall Clock phase must be bounded by the temporary profiling window and reported as a distinct evidence window when it is not simultaneous with CPU. | +| Collector output | Profile/event batching must enforce per-target sample/event caps, record dropped/truncated counts, and keep local buffering bounded under high-volume load. | +| Backend ingestion | New evidence types must expose accepted/rejected/dropped/truncated counters and batch size metadata through ingestion health and exporter metrics. 
| +| ClickHouse query | Flamegraph, top table, JVM event timeline, and service variance queries must have explicit row/node limits, partial metadata, and query-duration metrics. | +| UI rendering | Large result sets must render from bounded/partial backend responses; the browser must not need to render unbounded frame/event lists to stay responsive. | + +Real acceptance should fail strict mode when budget evidence is missing, not only when pods crash. A passing run needs positive proof that the profiler produced useful Java evidence while staying inside bounded collection, ingestion, query, and UI behavior. + +--- + +## Implementation Units + +### U1. Profile Semantics Contract and Unit Formatting + +**Goal:** Make every profile response explain value units, display units, and percentage basis so responders no longer see ambiguous raw numbers. + +**Requirements:** R2, R3, R9, R11 + +**Dependencies:** None + +**Files:** +- Modify: `backend/internal/domain/types.go` +- Modify: `backend/internal/app/query_flamegraph.go` +- Modify: `backend/internal/app/query_top_stacks.go` +- Modify: `backend/internal/httpapi/query_handlers.go` +- Modify: `web/src/api/types.ts` +- Modify: `web/src/visualization/flamegraph.tsx` +- Modify: `web/src/features/cpu/cpu-view.tsx` +- Modify: `web/src/features/memory/memory-view.tsx` +- Modify: `web/src/features/locks/locks-view.tsx` +- Test: `backend/internal/app/query_top_stacks_test.go` +- Test: `backend/internal/domain/flamegraph_builder_test.go` +- Test: `backend/internal/httpapi/query_handlers_test.go` +- Test: `web/src/visualization/flamegraph.test.tsx` +- Test: `web/src/features/cpu/cpu-view.test.tsx` + +**Approach:** +- Add response metadata describing `value_unit`, `display_unit`, `sample_basis`, `percentage_basis`, selected time range, and baseline availability. +- Convert CPU and Wall Clock nanoseconds into readable durations and average cores over the selected window when the baseline window is known. 
+- Preserve raw values for precise sorting and machine use, but make human display use typed formatting. +- Replace ambiguous labels such as `Total CPU` on generic flamegraph frames with profile-type-aware labels. + +**Execution note:** Add backend contract tests before changing UI rendering so semantic regressions fail outside the browser. + +**Patterns to follow:** +- Existing `PartialMetadata` response pattern in `web/src/api/types.ts`. +- Existing `TopStackRow` and `FlamegraphResult` builders in `backend/internal/app` and `backend/internal/domain`. + +**Test scenarios:** +- Happy path: CPU top-stack rows with nanosecond values return raw values plus formatted duration, average cores, and percent basis for the selected window. +- Happy path: allocation bytes, allocation objects, lock contention count, and lock delay each render their correct unit labels and do not reuse CPU wording. +- Edge case: missing selected-window baseline still returns raw values and a visible “percentage of returned profile samples” basis rather than implying Pod CPU quota. +- Edge case: zero-value roots and partial results keep existing empty/partial behavior while still returning semantic metadata. +- Integration: `/query/top-stacks` and `/query/flamegraph` responses remain backward-compatible enough for current UI fields while adding semantic metadata. + +**Verification:** +- Backend tests prove profile-type-specific units and percentage basis are returned. +- Web tests prove the UI no longer displays large raw CPU integers as the primary label. + +### U2. Pod/JVM Drill-Down and Service Variance + +**Goal:** Prevent service-level aggregation from hiding one bad Pod or JVM by exposing Pod/JVM filters, variance summaries, and skew cues. 
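A skew cue of this kind could be derived from per-Pod totals. The sketch below is an assumption-laden illustration: `PodContribution`, `SkewCue`, `computeSkewCue`, and the 0.5 default dominance threshold are all hypothetical, and the plan does not define what share counts as "dominating".

```typescript
// Hypothetical per-Pod total for one profile type and time range.
interface PodContribution {
  pod: string;
  value: number; // raw value in the profile type's unit
}

interface SkewCue {
  dominantPod: string | null; // null when no Pod reaches the threshold
  share: number;              // largest Pod's share of the service total, 0-1
}

// Flags a Pod as dominant when its share of the service total reaches the
// threshold. A single Pod is never flagged because there is nothing to compare.
function computeSkewCue(pods: PodContribution[], threshold = 0.5): SkewCue {
  const total = pods.reduce((sum, p) => sum + p.value, 0);
  if (pods.length === 0 || total <= 0) {
    return { dominantPod: null, share: 0 };
  }
  const top = pods.reduce((best, p) => (p.value > best.value ? p : best));
  const share = top.value / total;
  const dominant = pods.length >= 2 && share >= threshold;
  return { dominantPod: dominant ? top.pod : null, share };
}
```

Pods with missing evidence should be excluded or marked incomplete before this calculation rather than passed in as zero, otherwise an incomplete Pod would inflate the apparent skew of the others.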
+ +**Requirements:** R4, R11, R12 + +**Dependencies:** U1 + +**Files:** +- Modify: `backend/internal/app/query_service_summary.go` +- Modify: `backend/internal/app/query_status.go` +- Modify: `backend/internal/clickhouse/profile_repository.go` +- Modify: `backend/internal/httpapi/query_handlers.go` +- Modify: `web/src/routes/service-overview.tsx` +- Modify: `web/src/features/status/target-status-view.tsx` +- Modify: `web/src/features/cpu/cpu-view.tsx` +- Modify: `web/src/features/memory/memory-view.tsx` +- Modify: `web/src/features/locks/locks-view.tsx` +- Test: `backend/internal/app/query_status_test.go` +- Test: `backend/internal/app/query_top_stacks_test.go` +- Test: `backend/internal/httpapi/query_handlers_test.go` +- Test: `web/src/features/status/target-status-view.test.tsx` +- Test: `web/src/app.test.tsx` + +**Approach:** +- Extend service summary responses with per-Pod and per-JVM contribution totals for selected profile types and time ranges. +- Add UI drill-down controls for namespace, service, Pod, container, and JVM/process identity without removing the service-centric default. +- Show a compact variance/skew summary for service views, highlighting when one Pod dominates a resource dimension. +- Preserve release-window safety by displaying Pod and JVM identity context alongside PID, since PID alone can be reused. + +**Patterns to follow:** +- Existing target status table already surfaces Pod and process identity. +- Existing repository query filters already include namespace, service, Pod, profile type, and time range. + +**Test scenarios:** +- Happy path: a service with three Pods shows each Pod contribution and flags the Pod that dominates CPU or Wall Clock time. +- Happy path: selecting one Pod updates CPU, memory, lock, Wall Clock, GC, I/O, status, and ingestion queries to the same target context. +- Edge case: service-level view with one Pod behaves the same as today but still shows the selected identity. 
+- Edge case: two JVMs in one Pod are distinguishable by container/process/JVM metadata when status data is available. +- Integration: backend query handlers pass Pod filters through to ClickHouse-backed repositories and in-memory repositories consistently. + +**Verification:** +- A responder can identify whether a service-level hotspot is broad or isolated to one Pod before opening a flamegraph. + +### U3. Java Wall Clock Profiling + +**Goal:** Add Java Wall Clock evidence for blocked, sleeping, waiting, or remote-call-heavy execution where CPU profiles are insufficient. + +**Requirements:** R1, R3, R5, R10, R11, R12 + +**Dependencies:** U1, U2, U8 + +**Files:** +- Modify: `domain/profile_type.go` +- Modify: `collector/internal/profiler/async_profiler.go` +- Modify: `collector/internal/jfr/parser.go` +- Modify: `collector/internal/jfr/normalizer.go` +- Modify: `collector/internal/jfr/aggregate.go` +- Modify: `collector/internal/pipeline/profile_batcher.go` +- Modify: `backend/internal/app/query_flamegraph.go` +- Modify: `backend/internal/app/query_top_stacks.go` +- Modify: `web/src/features/cpu/cpu-view.tsx` +- Create: `web/src/features/wall-clock/wall-clock-view.tsx` +- Create: `web/src/features/wall-clock/wall-clock-view.test.tsx` +- Test: `collector/internal/profiler/async_profiler_test.go` +- Test: `collector/internal/jfr/parser_test.go` +- Test: `collector/internal/jfr/normalizer_test.go` +- Test: `collector/internal/jfr/aggregate_test.go` +- Test: `backend/internal/clickhouse/profile_repository_test.go` +- Test: `backend/internal/app/query_top_stacks_test.go` + +**Approach:** +- Extend the contracts and storage/query shape established by U8 rather than redefining shared schema in this unit. +- Add a Java Wall Clock profile type using async-profiler/JFR-capable collection for HotSpot JVMs. 
+- Follow the collection mode matrix: Wall Clock must not replace existing CPU sampling; if it cannot share a usable JFR stream with CPU/allocation/lock, collect it in an explicit bounded phase and mark the evidence window accordingly. +- Keep Wall Clock scoped to Java targets selected through existing Kubernetes opt-in policy. +- Store Wall Clock stack samples with nanosecond duration semantics and query them through the existing flamegraph/top-stack path. +- Add a dedicated UI view or tab for Wall Clock so CPU and blocked-time evidence are not conflated. +- Include capability notes explaining that Wall Clock shows thread-state-inclusive sampled time and does not automatically stitch async request context. + +**Patterns to follow:** +- Existing CPU profile type normalization and upload path. +- Existing `cpu-view` Top Table / Flame Graph / Both mode interaction. + +**Test scenarios:** +- Happy path: parsed Wall Clock events normalize to the new profile type with nanosecond values and Java target identity. +- Happy path: enabling Wall Clock does not remove or degrade CPU/allocation/lock samples for the same Java target in strict profiling mode. +- Edge case: if Wall Clock must run in a separate bounded phase, backend metadata and UI copy describe the evidence window instead of implying exact simultaneity with CPU samples. +- Happy path: backend queries return a non-empty Wall Clock flamegraph and top table for a selected Pod/time range. +- Edge case: if Wall Clock collection is unsupported by the runtime or profiler build, target status or profile metadata explains the unsupported capability without marking unrelated CPU/allocation/lock data failed. +- Error path: Wall Clock parser failures are reported through ingestion health with rejected/truncated context rather than silently dropping the whole batch. +- Integration: UI renders Wall Clock with duration/average-thread-time semantics and can search, select, focus, Back, and Reset like CPU. 
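The evidence-window scenarios above depend on knowing how the Wall Clock window relates to the CPU window. The following is a minimal sketch of that comparison, assuming each window is available as start and end timestamps. `EvidenceWindowSpan`, `WindowRelation`, `relateWindows`, and the note wording are hypothetical and not part of the existing metadata contract.

```typescript
// Hypothetical window shape: epoch milliseconds, end exclusive.
interface EvidenceWindowSpan {
  startMs: number;
  endMs: number;
}

type WindowRelation = "simultaneous" | "overlapping" | "separate-phase";

// Classifies how the Wall Clock window relates to the CPU window.
function relateWindows(cpu: EvidenceWindowSpan, wall: EvidenceWindowSpan): WindowRelation {
  if (cpu.startMs === wall.startMs && cpu.endMs === wall.endMs) {
    return "simultaneous";
  }
  const overlaps = wall.startMs < cpu.endMs && cpu.startMs < wall.endMs;
  return overlaps ? "overlapping" : "separate-phase";
}

// UI copy that avoids implying simultaneity when the windows differ.
function windowNote(relation: WindowRelation): string {
  switch (relation) {
    case "simultaneous":
      return "Wall Clock and CPU evidence cover the same window.";
    case "overlapping":
      return "Wall Clock evidence only partly overlaps the CPU window.";
    case "separate-phase":
      return "Wall Clock evidence was collected in a separate phase from CPU.";
  }
}
```

Deriving the relation from the recorded windows, rather than from a collector flag alone, keeps the note correct even if a bounded phase starts late or is cut short.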
+ +**Verification:** +- Real acceptance can prove non-empty Java Wall Clock data for the demo workload without increasing target workload restarts. + +### U4. Java GC Event Evidence and Allocation Correlation + +**Goal:** Add a GC view that correlates Java allocation pressure with GC pause/event evidence for the same service, Pod, JVM, and time range. + +**Requirements:** R1, R3, R6, R10, R11, R12 + +**Dependencies:** U1, U2, U8 + +**Files:** +- Modify: `collector/internal/jfr/parser.go` +- Modify: `collector/internal/jfr/normalizer.go` +- Modify: `collector/internal/pipeline/profile_batcher.go` +- Modify: `backend/internal/app/query_service_summary.go` +- Modify: `backend/internal/httpapi/query_handlers.go` +- Create: `backend/internal/app/query_gc_events.go` +- Create: `backend/internal/app/query_gc_events_test.go` +- Create: `web/src/features/gc/gc-view.tsx` +- Create: `web/src/features/gc/gc-view.test.tsx` +- Modify: `web/src/features/memory/memory-view.tsx` +- Test: `collector/internal/jfr/parser_test.go` +- Test: `collector/internal/jfr/normalizer_test.go` +- Test: `backend/internal/clickhouse/profile_repository_test.go` +- Test: `backend/internal/httpapi/query_handlers_test.go` + +**Approach:** +- Extend the JVM event track established by U8 rather than adding a separate GC-specific persistence path. +- Model GC evidence as Java/JVM events with timestamps, duration, collector/action/cause when available, and target identity. +- Store GC events under the same 7-day bounded retention policy as profile data. +- Correlate GC pauses with allocation bytes/object profiles in the UI using shared time range and target identity. +- Keep retained-heap claims out of scope; the UI should explain that GC/allocation correlation is not retained-object ownership. + +**Patterns to follow:** +- Existing thread diagnosis and deadlock event query patterns for non-flamegraph JVM evidence. +- Existing memory view wording that avoids presenting allocation as retained heap. 
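Because GC evidence is modeled as timestamped events with durations, the summary statistics for a selected range reduce to a simple fold over those events. The sketch below assumes a hypothetical `GcPauseEvent` shape and a hypothetical `summarizeGcPauses` helper. The real event fields depend on what the JFR parser can extract.

```typescript
// Hypothetical event shape; only timestamp and duration are needed here.
interface GcPauseEvent {
  startMs: number;    // event timestamp, epoch milliseconds
  durationNs: number; // pause duration in nanoseconds
  cause?: string;     // present only when the JVM reports it
}

interface GcPauseSummary {
  count: number;
  totalPauseNs: number;
  maxPauseNs: number;
}

// Summarizes pauses whose start falls inside [rangeStartMs, rangeEndMs).
function summarizeGcPauses(
  events: GcPauseEvent[],
  rangeStartMs: number,
  rangeEndMs: number,
): GcPauseSummary {
  const summary: GcPauseSummary = { count: 0, totalPauseNs: 0, maxPauseNs: 0 };
  for (const e of events) {
    if (e.startMs < rangeStartMs || e.startMs >= rangeEndMs) continue;
    summary.count += 1;
    summary.totalPauseNs += e.durationNs;
    summary.maxPauseNs = Math.max(summary.maxPauseNs, e.durationNs);
  }
  return summary;
}
```

A summary with `count` equal to 0 is a valid result rather than a failure, so the caller still needs separate capability and ingestion signals to tell "no GC events in range" apart from "GC events unsupported" or "parser failed".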
+ +**Test scenarios:** +- Happy path: collector parses GC pause events with duration and target identity and backend returns them for a selected time range. +- Happy path: UI shows GC pause count, total pause time, max pause, and nearest allocation hotspots for the same Pod/JVM context. +- Edge case: no GC events but non-empty allocation data shows a clear “no GC event evidence in this range” state, not an error. +- Error path: malformed or unsupported GC events are counted in ingestion health without corrupting other profile samples. +- Integration: selecting a Pod in the service context filters both allocation and GC evidence consistently. + +**Verification:** +- A responder can confirm whether allocation pressure and GC pauses coincide for the affected Java Pod and time window. + +### U5. Java I/O Blocking Evidence + +**Goal:** Add Java I/O evidence for socket/file blocking paths so responders can distinguish CPU bottlenecks from remote RPC, network, or disk waits. + +**Requirements:** R1, R3, R7, R10, R11, R12 + +**Dependencies:** U1, U2, U3, U8 + +**Files:** +- Modify: `domain/profile_type.go` +- Modify: `collector/internal/jfr/parser.go` +- Modify: `collector/internal/jfr/normalizer.go` +- Modify: `collector/internal/jfr/aggregate.go` +- Modify: `backend/internal/app/query_flamegraph.go` +- Modify: `backend/internal/app/query_top_stacks.go` +- Create: `web/src/features/io/io-view.tsx` +- Create: `web/src/features/io/io-view.test.tsx` +- Test: `collector/internal/jfr/parser_test.go` +- Test: `collector/internal/jfr/normalizer_test.go` +- Test: `backend/internal/app/query_top_stacks_test.go` +- Test: `web/src/visualization/flamegraph.test.tsx` + +**Approach:** +- Extend the U8 profile-sample or JVM-event track depending on whether the I/O evidence has usable Java stack frames. +- Prefer JVM/JFR-visible Java socket/file I/O evidence that preserves Java stack ownership and target identity. 
+- If exact I/O stack samples are represented differently from existing profile samples, keep the API explicit about event type, duration, byte count, and operation where available. +- Do not collect arbitrary node-level I/O for non-Java processes. +- In the UI, separate I/O wait evidence from CPU and Wall Clock while allowing cross-navigation between related frames/time ranges. + +**Patterns to follow:** +- Existing lock-delay profile semantics for duration-based non-CPU waits. +- Existing flamegraph and top-table views when stack samples are available. + +**Test scenarios:** +- Happy path: Java socket/file I/O events normalize with target identity, operation metadata where available, and duration or byte units. +- Happy path: backend returns I/O top stacks sorted by total wait duration for a selected Java Pod. +- Edge case: environments without I/O event support show a capability limitation state, not empty success. +- Error path: unsupported or partial I/O events are surfaced in ingestion health and query metadata. +- Integration: I/O view respects service, Pod, JVM, and time-range filters shared with Wall Clock and CPU. + +**Verification:** +- A responder can tell whether a slow Java service profile is dominated by CPU, Wall Clock waiting, or Java I/O blocking evidence. + +### U6. Frame Classification, Native/System Toggle, and JVM Capability Notes + +**Goal:** Reduce flamegraph noise while preserving expert access to runtime/native frames and support-limit explanations. 
+ +**Requirements:** R8, R10 + +**Dependencies:** U1 + +**Files:** +- Modify: `backend/internal/domain/flamegraph_builder.go` +- Modify: `backend/internal/app/query_flamegraph.go` +- Modify: `backend/internal/httpapi/query_handlers.go` +- Modify: `web/src/visualization/flamegraph.tsx` +- Modify: `web/src/styles.css` +- Test: `backend/internal/domain/flamegraph_builder_test.go` +- Test: `backend/internal/httpapi/query_handlers_test.go` +- Test: `web/src/visualization/flamegraph.test.tsx` +- Test: `web/tests/real-acceptance.spec.ts` + +**Approach:** +- Keep classifying frames as application Java, JVM/runtime, or native/system, but add an explicit toggle to hide runtime/native frames. +- When hiding frames, preserve aggregate values by folding or summarizing hidden frames rather than making percentages lie. +- Add capability notes for async context stitching, JIT/inlining recovery, virtual threads, and native frame visibility based on available data. +- Move persistent instructional text into compact help affordances so the flamegraph gets more vertical space. + +**Patterns to follow:** +- Existing `classifyFrame` and legend behavior in `web/src/visualization/flamegraph.tsx`. +- Existing real acceptance expectations for search, focus, Back, Reset, selected-frame inspector, and frame categories. + +**Test scenarios:** +- Happy path: toggling hidden native/system frames removes noisy frames visually while totals and selected-frame details remain coherent. +- Happy path: native/system frames remain inspectable when the toggle is disabled. +- Edge case: a stack containing only runtime/native frames shows a clear folded/filtered state instead of an empty graph. +- Integration: search, focus, Back, Reset, and selected-frame inspector work in both full and filtered modes. + +**Verification:** +- Real UI acceptance captures both filtered and unfiltered flamegraph states from real backend data. + +### U7. 
Top Table Expert Ergonomics and Copy Workflows + +**Goal:** Make the Top Table usable under incident pressure by supporting sorting, full method visibility, and copyable context. + +**Requirements:** R2, R8, R9 + +**Dependencies:** U1, U6 + +**Files:** +- Modify: `web/src/features/cpu/cpu-view.tsx` +- Modify: `web/src/features/memory/memory-view.tsx` +- Modify: `web/src/features/locks/locks-view.tsx` +- Modify: `web/src/features/wall-clock/wall-clock-view.tsx` +- Modify: `web/src/features/io/io-view.tsx` +- Modify: `web/src/visualization/flamegraph.tsx` +- Modify: `web/src/styles.css` +- Test: `web/src/features/cpu/cpu-view.test.tsx` +- Test: `web/src/features/hot-code-view.test.tsx` +- Test: `web/src/visualization/flamegraph.test.tsx` +- Test: `web/tests/profiling-flow.spec.ts` + +**Approach:** +- Add sortable table headers for Self and Total values, preserving raw numeric sort even when display values are formatted. +- Add full symbol/method/location tooltip or expanded cell content so overloaded methods and long FQCNs are distinguishable. +- Add copy actions for selected frame, full stack path, and top-table row context using profile type, value semantics, target identity, and time range. +- Keep controls compact and icon-oriented where possible so they do not compete with the flamegraph. + +**Patterns to follow:** +- Existing CPU Top Table and flamegraph selected-frame inspector behavior. +- Existing `lucide-react` dependency for icon buttons. + +**Test scenarios:** +- Happy path: clicking Self sorts descending by raw self value, then toggles or preserves deterministic order as designed. +- Happy path: clicking Total sorts by raw total value and does not sort formatted strings lexicographically. +- Happy path: hovering or focusing a truncated symbol exposes the full class/method/location string. +- Happy path: copy selected stack includes namespace, service, Pod/JVM when selected, profile type, time range, frame path, self, total, units, and percentage basis. 
+- Edge case: clipboard failure falls back to selectable text or visible error state without breaking the table. +- Integration: selecting a Top Table row highlights matching flamegraph frames rather than destructively filtering the graph. + +**Verification:** +- A responder can sort to the highest self-time leaf, distinguish exact methods, and copy evidence for an incident ticket without leaving the UI. + +### U8. Backend Storage, Retention, and Query Health for New Java Evidence + +**Goal:** Establish the shared Java evidence storage, query, retention, partial-result, and ingestion-health foundation that Wall Clock, GC, and I/O implementations plug into. + +**Requirements:** R3, R5, R6, R7, R11, R12 + +**Dependencies:** U1, U2 + +**Files:** +- Modify: `contracts/profiling/types.go` +- Modify: `contracts/profiling/payloads.md` +- Modify: `domain/profile_type.go` +- Modify: `backend/internal/clickhouse/001_initial_profile_schema.sql` +- Modify: `backend/internal/clickhouse/schema.go` +- Modify: `backend/internal/clickhouse/profile_repository.go` +- Modify: `backend/internal/clickhouse/retention_repository.go` +- Modify: `backend/internal/app/query_ingestion_health.go` +- Modify: `backend/internal/metrics/exporter.go` +- Test: `backend/internal/clickhouse/profile_repository_test.go` +- Test: `backend/internal/clickhouse/retention_repository_test.go` +- Test: `backend/internal/app/query_ingestion_health_test.go` +- Test: `backend/internal/httpapi/query_handlers_test.go` +- Test: `backend/internal/metrics/exporter_test.go` +- Test: `tools/chdb-smoke` schema smoke coverage for profile-sample and JVM-event tables + +**Approach:** +- Extend ClickHouse schema or add bounded tables using the two-track evidence model: stack-bearing evidence remains profile samples; timestamped metadata-heavy evidence uses JVM event rows. +- Keep both tracks queryable through shared namespace/service/Pod/JVM/time/capability filters. 
+- Land storage/query/retention contracts before individual evidence types rely on them; U3/U4/U5 should add their concrete event mappings into this foundation rather than inventing separate persistence paths. +- Keep query limits and partial metadata for new evidence so high-cardinality stacks or event bursts do not overload ClickHouse. +- Add ingestion/retention/exporter metrics for new evidence types so operators can see accepted, rejected, dropped, truncated, and expired data. +- Ensure TTL remains 7 days or less for every new table/artifact. +- Enforce the production budget gates with per-target caps, partial metadata, query-duration metrics, and strict acceptance evidence. + +**Patterns to follow:** +- Existing profile schema, retention repository, ingestion health counters, and query limit tests. +- Existing real acceptance failure rules for dropped, truncated, split, or rejected batches. + +**Test scenarios:** +- Happy path: profile-sample evidence and JVM-event evidence are inserted, queried by namespace/service/Pod/JVM/evidence type/time range, and expired by retention policy. +- Happy path: stack-bearing Wall Clock or I/O evidence can use flamegraph/top-table queries, while GC events use event summary/timeline queries. +- Happy path: HTTP query handlers expose JVM-event query contracts with semantic metadata and target filters, not only profile-sample contracts. +- Integration: chdb smoke coverage creates the dual-track schema, verifies TTL <= 7 days, inserts representative profile-sample and JVM-event rows, and queries required target/time/unit fields. +- Edge case: high-volume Java evidence batches return partial metadata rather than unbounded result sets. +- Error path: rejected Java evidence payloads increment ingestion health and exporter metrics with meaningful reasons. +- Integration: retention tests prove no new collected data type can exceed 7 days. 
+- Integration: high-volume acceptance proves dropped/truncated counters, query limits, and partial metadata are present when budgets are hit. + +**Verification:** +- ClickHouse smoke tests and backend unit tests cover new schema/query/retention paths. + +### U9. Documentation and Real Java Acceptance + +**Goal:** Update docs and acceptance so the expanded Java-only profiler is validated with real data and clear user guidance. + +**Requirements:** R1, R10, R11, R12 + +**Dependencies:** U1, U2, U3, U4, U5, U6, U7, U8 + +**Files:** +- Modify: `docs/brainstorms/java-profiler-requirements.md` +- Modify: `docs/architecture/java-profiler-architecture.md` +- Modify: `docs/operations/performance-analysis-user-manual.md` +- Modify: `docs/operations/java-profiling-runbook.md` +- Modify: `docs/operations/real-profiling-acceptance-standard.md` +- Modify: `docs/operations/e2e-automation-test-guide.md` +- Modify: `README.md` +- Modify: `AGENTS.md` +- Modify: `CLAUDE.md` +- Modify: `scripts/real-acceptance.sh` +- Modify: `web/tests/real-acceptance.spec.ts` +- Test: `web/tests/real-acceptance.spec.ts` + +**Approach:** +- Update requirements and architecture at the same time because the active scope now includes Wall Clock, GC, and I/O. +- Make every doc say Java-only where scope could otherwise be mistaken for all-language profiling. +- Extend acceptance to require non-empty Wall Clock, GC, and I/O evidence from Java workloads when strict full profiling is requested. +- Add UI acceptance for unit/baseline explanations, Pod variance, native/system frame hiding, sortable top tables, copy workflows, and capability notes. + +**Patterns to follow:** +- Existing real profiling acceptance standard and real UI evidence sections. +- Existing user manual warning style for retained heap and Prometheus boundaries. 
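The strict-mode requirement for non-empty evidence can be expressed as a small check over per-type observations. This is a sketch under assumptions, not the logic in `scripts/real-acceptance.sh` or `web/tests/real-acceptance.spec.ts`: `EvidenceObservation`, `REQUIRED_KINDS`, and `strictEvidenceFailures` are hypothetical names, and "only historical" is approximated as "newest sample is older than the start of the acceptance run".

```typescript
type EvidenceKind = "cpu" | "allocation" | "lock" | "wall_clock" | "gc" | "io";

// Hypothetical observation gathered by the acceptance run for one evidence type.
interface EvidenceObservation {
  kind: EvidenceKind;
  rowCount: number;       // rows or events returned by the backend query
  newestSampleMs: number; // timestamp of the newest row, epoch milliseconds
}

const REQUIRED_KINDS: EvidenceKind[] = ["cpu", "allocation", "lock", "wall_clock", "gc", "io"];

// Returns one message per required evidence type that is missing, empty,
// or older than the current run. An empty array means the check passed.
function strictEvidenceFailures(
  observations: EvidenceObservation[],
  runStartMs: number,
): string[] {
  const failures: string[] = [];
  for (const kind of REQUIRED_KINDS) {
    const obs = observations.find((o) => o.kind === kind);
    if (!obs || obs.rowCount === 0) {
      failures.push(`${kind}: no evidence rows`);
    } else if (obs.newestSampleMs < runStartMs) {
      failures.push(`${kind}: only historical evidence`);
    }
  }
  return failures;
}
```

A capability-limited target would need an explicit exemption path on top of this check, so an unsupported evidence type is reported as a limitation rather than counted as a strict failure.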
+ +**Test scenarios:** +- Happy path: strict real acceptance deploys current workspace images, enables Java profiling, and proves non-empty CPU, allocation, lock-delay, Wall Clock, GC, and I/O evidence. +- Happy path: browser acceptance validates Top Table, Flame Graph, Both, Self/Total semantics, search highlighting, selected-frame details, focus, Back, Reset, native-frame toggle, Pod drill-down, and ingestion evidence. +- Edge case: unsupported runtime capability is visible as a Java target capability limitation and does not falsely fail unrelated profile types. +- Error path: acceptance fails when new evidence is empty, stale, mocked, or only historical. +- Integration: no target workload restart increase is accepted for the expanded profiling run. + +**Verification:** +- The real acceptance artifact set includes profile rows, GC/I/O evidence, UI screenshots, ingestion evidence, retention proof, and workload restart comparison. + +--- + +## System-Wide Impact + +- **Interaction graph:** Collector profiling options, JFR parsing, batch upload, backend ingestion, ClickHouse schema/query paths, HTTP query handlers, React profile views, and real acceptance all change together. +- **Error propagation:** Unsupported Wall Clock/GC/I/O capability must be distinguishable from attach failure, parser failure, ingestion rejection, empty query result, and filtered-out UI state. +- **State lifecycle risks:** New event tables and profile types must obey the same 7-day-or-less retention policy and must not leave raw artifacts behind after temporary profiling windows expire. +- **API surface parity:** Top stacks, flamegraphs, service summary, target status, ingestion health, and UI routes must use the same target identity and value semantics. +- **Integration coverage:** Unit tests alone will not prove this feature; strict real Kubernetes acceptance is required because collector/backend/UI behavior depends on real HotSpot/JFR/async-profiler data. 
+- **Unchanged invariants:** Kubernetes opt-in control, DaemonSet node-local collection, ClickHouse as primary profile query store, no required Pyroscope/Parca/Grafana backend, and Java-only scope remain unchanged. + +--- + +## Risks & Dependencies + +| Risk | Mitigation | +|------|------------| +| Expanded Wall Clock/GC/I/O collection increases overhead or data volume. | Use conservative defaults, query limits, partial metadata, ingestion health counters, and strict real acceptance with no workload restart increase. | +| New evidence queries overload ClickHouse or the browser during incidents. | Enforce production budget gates for batch size, row/node limits, query-duration metrics, partial metadata, and bounded UI rendering. | +| GC and I/O event availability differs across JDK versions or parser capabilities. | Treat support as capability-detected Java evidence; show unsupported/partial states explicitly and keep CPU/allocation/lock unaffected. | +| Percentages remain misleading if CPU quota or selected-window baseline is unavailable. | Return explicit `percentage_basis` metadata and fallback wording instead of implying Pod quota. | +| Native/system frame hiding could falsify flamegraph totals. | Fold/summarize hidden frames while preserving aggregate values and keep full-frame mode available. | +| New UI views drift from existing search/focus/reset behavior. | Reuse the existing flamegraph component and extend real browser acceptance for every profile mode. | +| Async context stitching and JIT/inlining expectations exceed available data. | Make support limits visible in capability notes and defer full cross-thread/inlining reconstruction to follow-up work. | + +--- + +## Phased Delivery + +- Phase 1: MVP Java Pod CPU workbench. Land U1 plus the CPU portions of U6 and U7 so the existing CPU evidence becomes interpretable, low-noise, sortable, searchable, copyable, and shareable for one selected Java Pod/JVM. +- Phase 2: Shared evidence foundation. 
Land U8 and the target-identity portions of U2 so later evidence types have a consistent storage/query/retention contract without forcing service rollup into the first UI. +- Phase 3: Java Wall Clock. Land U3 on top of the shared foundation because it is closest to the current async-profiler profile pipeline. +- Phase 4: Java GC and I/O. Land U4 and U5 as concrete JVM/JFR evidence types that plug into the shared foundation instead of creating separate persistence paths. +- Phase 5: Service rollup and comparison. Complete the variance portions of U2 and add time-window A/B comparison only after single-target evidence and units are proven. +- Phase 6: Documentation and strict acceptance. Land U9 and rerun real Kubernetes acceptance against images built from the current workspace. + +--- + +## Documentation / Operational Notes + +- Documentation must consistently state that Wall Clock, GC, and I/O evidence is Java/JVM-scoped and tied to selected HotSpot-compatible JVM targets. +- Runbooks should explain when to use CPU, Wall Clock, allocation, GC, I/O, locks, deadlocks, status, and ingestion views during an incident. +- UI help should explain percent baselines, average cores, duration units, allocation units, lock wait units, GC pause semantics, and I/O wait/byte semantics. +- Operators need new ingestion and retention metrics for the added evidence types so failures are visible in the existing Prometheus stack without adding a new dashboard backend dependency. +- Runbooks must document the budget gates and how to interpret dropped, truncated, partial, and unsupported evidence during an incident. + +--- + +## Success Metrics + +- Expert users no longer need to infer whether large profile values are nanoseconds, sample counts, bytes, counts, or percentages. +- Service-level views make single-Pod skew visible before a responder opens a flamegraph. +- Java Wall Clock, GC, and I/O evidence can be queried and rendered for real HotSpot demo workloads in strict acceptance. 
+- Strict acceptance proves production budget gates: no target restart increase, bounded batches, visible drop/truncation counters, explicit query limits, and partial metadata when limits are reached. +- Flamegraph and Top Table interactions remain fast, searchable, sortable, copyable, and usable under both full-frame and hidden-native modes. +- The project still satisfies Java-only scope, ClickHouse retention, no required external profile backend, and real acceptance no-restart requirements. + +--- + +## Sources & References + +- **Origin document:** [docs/brainstorms/java-profiler-requirements.md](../brainstorms/java-profiler-requirements.md) +- **Architecture:** [docs/architecture/java-profiler-architecture.md](../architecture/java-profiler-architecture.md) +- **Real acceptance:** [docs/operations/real-profiling-acceptance-standard.md](../operations/real-profiling-acceptance-standard.md) +- **User manual:** [docs/operations/performance-analysis-user-manual.md](../operations/performance-analysis-user-manual.md) +- **Runbook:** [docs/operations/java-profiling-runbook.md](../operations/java-profiling-runbook.md) +- **Coroot / async-profiler research:** [docs/research/coroot-node-agent-java-agent.md](../research/coroot-node-agent-java-agent.md) +- **async-profiler README:** https://github.com/async-profiler/async-profiler +- **async-profiler profiling modes:** https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilingModes.md +- **async-profiler options:** https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilerOptions.md +- **Oracle JDK 21 JFR API:** https://docs.oracle.com/en/java/javase/21/docs/api/jdk.jfr/jdk/jfr/package-summary.html +--- + +## GSTACK REVIEW REPORT + +| Review | Trigger | Why | Runs | Status | Findings | +|--------|---------|-----|------|--------|----------| +| CEO Review | `/plan-ceo-review` | Scope & strategy | 0 | — | — | +| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — | +| Eng Review 
| `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR | 6 issues found, 0 critical gaps; accepted fixes: dependency restructuring, collection mode matrix, dual-track evidence model, shared file ownership, U8 test tightening, production budget gates | +| Design Review | `/plan-design-review` | UI/UX gaps | 1 | CLEAR | score: 5/10 -> 9/10, 7 decisions added: workspace hierarchy, state matrix, responder journey, app UI rules, UI vocabulary, responsive/a11y, grouped evidence nav | +| DX Review | `/plan-devex-review` | Developer experience gaps | 0 | — | — | + +- **UNRESOLVED:** 0 +- **VERDICT:** ENG + DESIGN CLEARED — ready for implementation planning/execution. Run visual QA with `/design-review` after implementation because mockup generation was unavailable during plan review. diff --git a/docs/public/llms.txt b/docs/public/llms.txt new file mode 100644 index 0000000..c8202e0 --- /dev/null +++ b/docs/public/llms.txt @@ -0,0 +1,14 @@ +# Java Profiler Docs + +- [Overview](https://koolay.github.io/java-profiler/): Kubernetes-native Java profiling with opt-in collection, async-profiler/JFR-derived evidence, ClickHouse storage, and a Java incident diagnosis UI. +- [Quickstart](https://koolay.github.io/java-profiler/getting-started/quickstart): Enable temporary profiling, open a service, analyze CPU, Wall Clock, Java I/O wait, GC, allocation, lock evidence, and check ingestion health. +- [Performance Analysis Manual](https://koolay.github.io/java-profiler/operations/performance-analysis-user-manual): User workflow for CPU, Wall Clock, Java I/O wait, GC, allocation, lock, deadlock, target status, and ingestion analysis. +- [Java Profiling Runbook](https://koolay.github.io/java-profiler/operations/java-profiling-runbook): Operator workflow for enabling, disabling, validating, and troubleshooting profiling. 
+- [Deployment Manual](https://koolay.github.io/java-profiler/operations/deployment-operations-admin-manual): Deployment, security, storage, upgrade, and platform troubleshooting. +- [Real Profiling Acceptance](https://koolay.github.io/java-profiler/operations/real-profiling-acceptance-standard): Required evidence for profiling, ingestion, ClickHouse, backend API, deployment, demo service, and UI changes. +- [Contributing](https://koolay.github.io/java-profiler/contributing/development): Development setup, docs build, real acceptance, and screenshot evidence workflow. +- [Localization](https://koolay.github.io/java-profiler/contributing/localization): Bilingual documentation policy. English is the source language; Chinese covers core user and contributor paths. +- [中文首页](https://koolay.github.io/java-profiler/zh/): 中文文档入口。 +- [中文快速开始](https://koolay.github.io/java-profiler/zh/getting-started/quickstart): 中文 Quickstart。 +- [Architecture](https://koolay.github.io/java-profiler/architecture/java-profiler-architecture): Collector, backend, ClickHouse, contracts, and web UI architecture. +- [Profiling Contracts](https://koolay.github.io/java-profiler/reference/profiling-contracts): Stable payload and configuration contracts. diff --git a/docs/zh/contributing/development.md b/docs/zh/contributing/development.md new file mode 100644 index 0000000..f8186c5 --- /dev/null +++ b/docs/zh/contributing/development.md @@ -0,0 +1,59 @@ +# 贡献者指南 + +这个项目有一条底线:profiling 相关改动必须证明真实 profile 数据存在,不能只证明 UI 不报错。 + +## 开发环境检查 + +从仓库根目录运行: + +```bash +go test ./... 
+javac --release 11 java-helper/thread-diagnostics/src/main/java/com/ebpfjava/threads/*.java +cd examples/jdk17-http-demo && mvn test +cd ../../web && npm ci && npm test && npm run build +``` + +## 文档站 + +```bash +cd docs +npm install +npm run docs:dev +``` + +发布文档前先构建: + +```bash +cd docs +npm run docs:build +``` + +文档站支持中英文。英文是 source language,中文覆盖核心用户路径和贡献者路径。新增或移动公开文档前,先看 [本地化策略](./localization.md)。 + +## 真实验收 + +如果改动影响 collector profiling、ingestion、ClickHouse 存储、backend query API、部署、demo service 或 profile UI,需要跑真实 Kubernetes 验收。 + +```bash +export KUBECONFIG=$HOME/backup/localk8s.yaml + +scripts/real-acceptance.sh \ + --service jdk17-http-demo \ + --configure-profiler \ + --require-full-profiling \ + --high-volume \ + --artifact-dir /tmp/java-profiler-real-acceptance-$(date +%Y%m%d%H%M%S) +``` + +通过意味着当前运行窗口里有 accepted target status、非空 CPU/allocation/lock profiles、ClickHouse rows、ingestion evidence、受限 retention、浏览器 UI 证据,并且目标 workload restart count 没有增加。 + +## 截图证据 + +文档截图必须来自连接真实 backend 的真实 UI。 + +```bash +export REAL_ACCEPTANCE_BASE_URL=http://127.0.0.1:18081 +export REAL_ACCEPTANCE_NAMESPACE=java-profiler-qa +export REAL_ACCEPTANCE_SERVICE=jdk17-http-demo +node scripts/capture-doc-screenshots.mjs +``` diff --git a/docs/zh/contributing/localization.md b/docs/zh/contributing/localization.md new file mode 100644 index 0000000..4d9abcd --- /dev/null +++ b/docs/zh/contributing/localization.md @@ -0,0 +1,50 @@ +# 本地化策略 + +`java-profiler` 文档以英文为 source language。中文文档覆盖最重要的用户路径和贡献者路径,不追求把所有内部材料都翻译一遍。 + +## 必须双语的页面 + +这些页面需要同时提供英文和中文: + +| English | 中文 | +| --- | --- | +| `/` | `/zh/` | +| `/getting-started/quickstart` | `/zh/getting-started/quickstart` | +| `/operations/performance-analysis-user-manual` | `/zh/operations/performance-analysis-user-manual` | +| `/contributing/development` | `/zh/contributing/development` | +| `/reference/profiling-contracts` | `/zh/reference/profiling-contracts` | + +## 只保留英文的页面 + +实现细节重、访问频率低、或偏开发过程的材料默认只保留英文: + +- 架构细节。 +- 
Ingestion 架构 review。 +- E2E 自动化细节。 +- 真实 profiling 验收标准。 +- Research notes。 +- Brainstorms 和项目历史材料。 + +Research、brainstorms、plans 不进入公开导航。 + +## 翻译流程 + +修改必须双语的页面时: + +1. 先改英文页面。 +2. 同一个变更里同步中文页面。 +3. 中文路径保持 `/zh/` 下的同构 route。 +4. 截图默认复用,除非图片里的语言会让用户误解。 +5. 发布前构建文档站。 + +```bash +cd docs +npm run docs:build +``` + +## 风格 + +- 使用清楚的技术中文,不做逐句硬翻。 +- 产品名、API key、annotation、profile type、文件路径保持原样。 +- UI 里显示的术语保留英文,例如 `Top Table`、`Flame Graph`、`Self CPU`、`Total CPU`。 +- 不要只在中文页添加新的产品承诺。如果这个说法重要,先加到英文页。 diff --git a/docs/zh/getting-started/quickstart.md b/docs/zh/getting-started/quickstart.md new file mode 100644 index 0000000..243ad3d --- /dev/null +++ b/docs/zh/getting-started/quickstart.md @@ -0,0 +1,47 @@ +# 快速开始 + +当 `java-profiler` 已经部署好,并且你想分析一个 Java 服务时,走这条路径。 + +## 1. 启用 profiling + +在目标 workload 的 pod template 上添加 metadata: + +```yaml +metadata: + annotations: + java-profiler.io/profile-mode: temporary + java-profiler.io/profile-duration: 15m +``` + +线上事件优先用 `temporary`。只有经过批准需要长期采集的核心服务,才使用 `continuous`。 + +## 2. 打开服务 + +在 Web UI 中设置: + +- `Namespace`:Kubernetes namespace。 +- `Service`:服务或 workload 名称。 +- `Range`:包含 profiling 运行时间的窗口。 + +如果 UI 没有数据,先看 [Target status](../../operations/java-profiling-runbook.md#validate-an-existing-workload)。它会说明 JVM 是 accepted、disabled、unsupported、expired,还是 attach failed。 + +## 3. 分析 profile + +排查高 CPU 时,先打开 [性能分析用户手册](../operations/performance-analysis-user-manual.md) 对应的 CPU workflow。 + +使用: + +- Top Table 找最贵的 Java 方法。 +- Flame Graph 看完整 sampled stack context。 +- Selected frame details 对比 Self CPU 和 Total CPU。 +- Search 和 Focus 隔离真正重要的调用路径。 + +CPU 解释不了的延迟看 Wall Clock。Socket 或文件阻塞看 I/O wait。暂停时间或 allocation pressure 看 GC pauses 和 allocation correlation。锁竞争看 lock diagnosis。 + +## 4. 检查 ingestion health + +在相信“没有 profile”之前,先检查 [Ingestion health](../operations/performance-analysis-user-manual.md#ingestion-health)。有用的诊断需要看到当前服务和时间范围内的 accepted profile batches。 + +## 5. 
关闭 profiling + +临时 profiling 会自动过期。持续 profiling 不再需要时,移除或禁用 metadata。 diff --git a/docs/zh/index.md b/docs/zh/index.md new file mode 100644 index 0000000..34a5208 --- /dev/null +++ b/docs/zh/index.md @@ -0,0 +1,40 @@ +--- +layout: home + +hero: + name: Java Profiler + text: 找到 Kubernetes Java 性能问题背后的调用栈 + tagline: "一个聚焦 HotSpot 服务的 profiler:Kubernetes metadata opt-in、async-profiler/JFR 真实证据、ClickHouse 存储,以及面向 Java 事故排障的 UI。" + actions: + - theme: brand + text: 快速开始 + link: /zh/getting-started/quickstart + - theme: alt + text: 分析服务 + link: /zh/operations/performance-analysis-user-manual + +features: + - title: 默认适合生产排查 + details: 通过 Kubernetes annotation 或 label 显式启用,节点本地采集,默认保留 7 天或更短。 + - title: 真实 Java 证据 + details: CPU、Wall Clock、Java I/O wait、GC、allocation、lock delay、线程、死锁、target status 和 ingestion evidence 都绑定到同一个服务和时间范围。 + - title: 自己掌控 Profiling 栈 + details: 不强制依赖 Pyroscope、Parca 或 Grafana 后端。async-profiler 数据写入 ClickHouse,并由自有 UI 查询。 +--- + +## 面向服务负责人 + +- [快速开始](./getting-started/quickstart.md):启用 profiling 并读懂第一个服务 profile。 +- [性能分析用户手册](./operations/performance-analysis-user-manual.md):分析 CPU、Wall Clock、Java I/O wait、GC、allocation、lock、deadlock、target status 和 ingestion evidence。 +- [Java Profiling Runbook](../operations/java-profiling-runbook.md):为 Kubernetes workload 启用临时或持续 profiling。 + +## 面向平台运维 + +- [部署运维手册](../operations/deployment-operations-admin-manual.md):安装、安全、存储、升级和故障处理。 +- [真实 Profiling 验收](../operations/real-profiling-acceptance-standard.md):在发布前证明 CPU、Wall Clock、Java I/O wait、GC、allocation、lock、ClickHouse、UI 和 ingestion 行为。 + +## 面向贡献者 + +- [开发设置](./contributing/development.md):运行本地检查、构建文档、执行真实验收。 +- [系统架构](../architecture/java-profiler-architecture.md):理解 collector、backend、ClickHouse、contracts 和 Web UI。 +- [Profiling 合同](./reference/profiling-contracts.md):查看稳定 payload 和配置合同。 diff --git a/docs/zh/operations/performance-analysis-user-manual.md b/docs/zh/operations/performance-analysis-user-manual.md new file mode 100644 index 
0000000..a39802d --- /dev/null +++ b/docs/zh/operations/performance-analysis-user-manual.md @@ -0,0 +1,826 @@ +# Java 服务性能分析用户手册 + +本文面向 Java 服务负责人、值班响应人员和应用开发人员,说明如何使用 `java-profiler` 分析 Kubernetes 中 Java 服务的 CPU、Wall Clock 延迟、Java I/O wait、GC pauses、内存分配、锁等待、死锁和线程问题。部署、权限、升级、ClickHouse、Web 代理和平台故障处理见 [部署运维管理员手册](../../operations/deployment-operations-admin-manual.md)。如果问题涉及部署、权限、ClickHouse、Web 代理、token 或 collector DaemonSet,请转交平台管理员并引用管理员手册。 + +## 真实工作流截图 + +这些截图来自真实 Kubernetes acceptance 环境,不是 mock UI 状态。保留它们是为了让读者快速理解核心诊断路径,也让维护者有一组可回归对比的 UI 证据。 + +重新生成截图时,先把真实 Web UI port-forward 到本机,然后从仓库根目录运行: + +```bash +export REAL_ACCEPTANCE_BASE_URL=http://127.0.0.1:18081 +export REAL_ACCEPTANCE_NAMESPACE=java-profiler-qa +export REAL_ACCEPTANCE_SERVICE=jdk17-http-demo +node scripts/capture-doc-screenshots.mjs +``` + +### Target status + +先确认目标 JVM 是否被接受、拒绝或因为 metadata/JVM/runtime 条件而没有数据。 + +![真实 target status 证据](../../assets/screenshots/real-target-status.png) + +### CPU profile analysis + +CPU 视图把 Top Table、Flame Graph、选中 frame 详情、Self/Total CPU 语义和 Java frame 分类放在同一个真实服务诊断流里。 + +![真实 CPU profile analysis](../../assets/screenshots/real-cpu-analysis.png) + +### Wall Clock latency + +Wall Clock 视图用于分析 CPU 无法解释的延迟。它展示 Java 栈在 runnable、blocked、waiting、sleeping 或 I/O 路径上的墙上时间。 + +![真实 Wall Clock latency 证据](../../assets/screenshots/real-wall-clock.png) + +### Java I/O wait + +I/O wait 视图用于定位 socket、文件或 Java 客户端阻塞路径。只有采集到的 JVM/JFR 栈能保留 Java ownership 时,才把它呈现为 Java I/O wait evidence。 + +![真实 Java I/O wait 证据](../../assets/screenshots/real-io-wait.png) + +### GC pauses + +GC 视图把 JVM GC event evidence 与同一时间窗口内的 allocation profile 放在一起,帮助判断暂停是否与对象分配压力相关。 + +![真实 GC pause 和 allocation correlation](../../assets/screenshots/real-gc-pauses.png) + +### Deadlock diagnosis + +死锁视图用于确认选定服务和时间范围内是否有 cycle 证据;真实运行也可能呈现经过验证的空状态。 + +![真实 deadlock diagnosis surface](../../assets/screenshots/real-deadlocks.png) + +### Ingestion health + +Ingestion health 用 accepted/rejected/dropped payload 
证据闭环 collector 到 backend 的上传和入库路径。 + +![真实 ingestion health 证据](../../assets/screenshots/real-ingestion-health.png) + +## 你能用它回答什么 + +`java-profiler` 第一版本聚焦 HotSpot 兼容 Java 服务: + +- 哪些 Java 调用栈消耗 CPU。 +- 哪些调用栈出现 Wall Clock 延迟。 +- 哪些 Java socket 或文件路径产生 I/O wait。 +- 哪些 GC pause 与 allocation pressure 相关。 +- 哪些调用栈分配对象或分配字节最多。 +- 哪些锁路径产生等待或竞争。 +- 哪些线程处于死锁、阻塞、等待或忙碌状态。 +- 目标为什么没有数据:未启用、不兼容、冲突、attach 失败、上传失败或存储失败。 + +它不是通用观测平台。日志、分布式追踪、Prometheus 指标趋势、告警和服务拓扑仍然从现有观测系统进入。`java-profiler` 提供的是同一服务、同一时间范围内的 Java 栈证据。 + +## 使用前确认 + +使用前确认: + +- 平台管理员已经部署 collector、backend、Web UI 和 ClickHouse。 +- 你有权限访问目标 namespace 或 service 的 profiling 数据。 +- 目标服务运行在 Kubernetes 中。 +- 目标 JVM 是 HotSpot 兼容实现。 +- 目标服务已通过 `java-profiler.io/*` metadata 显式启用 profiling,或你有权限临时启用。 + +如果 UI 没有数据,先看 `status`,不要直接判断“没有性能问题”。 + +## 启用 profiling + +Profiling 默认关闭。你需要在目标 Pod 或 workload Pod template 上添加 annotation 或 label。annotation 与 label 使用相同键名;如果两者都存在,以 annotation 为准。 + +只有在你拥有该 workload 的变更权限且已获得生产采集批准时,才应直接修改 profiling metadata。否则请通过平台管理员或既有 GitOps/变更流程申请临时 profiling;申请信息见本手册“无法自行启用 profiling 时”。 + +### 选择模式 + +| 模式 | 适用情境 | 建议 | +| --- | --- | --- | +| `temporary` | 线上事件、临时复现、单次排查 | 默认优先选择,必须设置 `profile-duration`。 | +| `continuous` | 核心服务、长期高流量服务、需要 7 天内随时回溯 | 需要服务 owner 同意长期采集和访问控制。 | +| `profile-disabled` / `disabled` | 事件结束、止血、明确不采集 | 优先使用 `profile-disabled: "true"` 止血;合同也允许 `profile-mode: disabled`。 | + +### 模式选择决策表 + +| 情况 | 推荐模式 | 原因 | +| --- | --- | --- | +| 正在处理线上事件 | `temporary` 10 到 15 分钟 | 控制开销,同时捕获现场。 | +| 核心服务长期高流量 | `continuous` | 需要 7 天内可回溯栈证据。 | +| 只有单个 Pod 异常 | 对该 Pod 使用 `temporary` | 避免混入其他副本。 | +| 刚上线新版本,需要观察风险 | `temporary` 30 分钟,或经批准使用 `continuous` | 取决于风险窗口和服务重要性。 | +| 服务包含敏感业务路径 | 优先 `temporary`,共享证据时脱敏 | 降低长期暴露面。 | +| 不确定是否可采 | `temporary` 5 分钟 smoke test | 先验证 status 和 ingestion。 | + +### 临时 profiling + +适合排查一次事件。临时模式到期后自动停止。 + +```yaml +metadata: + annotations: + java-profiler.io/profile-mode: temporary + java-profiler.io/profile-disabled: "false" + 
java-profiler.io/profile-duration: 10m + java-profiler.io/startup-delay: 0s + java-profiler.io/snapshot-interval: 10s +``` + +当前临时窗口按目标 Pod/JVM 的生命周期判断。对已经运行很久的 Pod 直接添加 `10m` temporary metadata,可能立即显示 `temporary_expired`;更稳妥的做法是在 workload Pod template 上添加 metadata 后滚动重启,或按管理员确认的方式重开目标 Pod。重复验收时可以添加一次性 annotation,例如 `java-profiler.io/acceptance-run: "20260516225619"`,强制 Deployment 滚动出新的窗口。 + +临时模式可以短时间提高线程快照频率,但不要长期运行高频快照。 + +### 真实 workload smoke test + +对线上或准线上 workload 第一次启用真实 async-profiler attach 时,先做短窗口 smoke test。平台管理员应把 collector/backend/UI 指向单个 namespace/service,保存 UI 截图、Playwright 视频、ClickHouse 计数、collector/backend 日志,以及目标 Pod 的前后 restart count。若目标服务在窗口内出现新重启,应先停止 profiling 并排查 attach 安全性,不要继续扩大采集范围。 + +### 持续 profiling + +适合核心服务保留最近 7 天内的栈证据。 + +```yaml +metadata: + annotations: + java-profiler.io/profile-mode: continuous + java-profiler.io/profile-disabled: "false" + java-profiler.io/startup-delay: 30s + java-profiler.io/snapshot-interval: 5m +``` + +持续 profiling 不是无限保存。所有采集数据仍然不超过 7 天 retention。 + +### 停止 profiling + +```yaml +metadata: + annotations: + java-profiler.io/profile-disabled: "true" +``` + +事件结束后应移除临时 annotation,或保留显式禁用作为止血控制。后续重新启用时,必须删除旧的 `profile-disabled: "true"`,或在 Pod template 上显式设置 `profile-disabled: "false"`;只改 `profile-mode` 不会覆盖禁用标记。 + +## 控制字段 + +| 字段 | 示例 | 说明 | +| --- | --- | --- | +| `java-profiler.io/profile-mode` | `temporary`, `continuous`, `disabled` | 开启临时、持续或禁用 profiling。 | +| `java-profiler.io/profile-disabled` | `"true"` | 强制禁用,优先级高于 `profile-mode`。truthy 值包括 `1`, `true`, `yes`, `enabled`, `on`。 | +| `java-profiler.io/profile-duration` | `10m`, `1h` | 临时 profiling 持续时间。临时模式必填。 | +| `java-profiler.io/startup-delay` | `0s`, `30s` | 新发现 JVM 启动后等待多久再开始 profiling。 | +| `java-profiler.io/snapshot-interval` | `10s`, `5m` | 线程快照间隔。临时排障可短时间缩短。 | + +时间字段使用 Go duration 格式,例如 `30s`、`10m`、`1h`。 + +控制字段和 status reason 的稳定合同由平台维护,见 [profiling contracts](../reference/profiling-contracts.md);本手册只解释服务 owner 如何使用这些字段。 + +## 无法自行启用 
profiling 时 + +如果你没有权限修改 workload metadata,或目标服务属于敏感生产范围,请向平台管理员提交一次性申请: + +```text +Namespace: +Service: +Pod / workload: +Incident window: +Requested mode and duration: +Reason: +Urgency: +Owner / approver: +``` + +管理员应通过受控变更路径添加 `java-profiler.io/*` metadata。不要通过临时手工 patch 绕过团队的生产变更规则。 + +## 按症状选择入口 + +紧急排查时先按症状进入对应视图: + +| 症状 | 先看 | 再看 | +| --- | --- | --- | +| CPU 高 | `status` | `cpu` | +| GC 压力高或 allocation rate 高 | `status` | `memory` | +| 请求卡住或线程池耗尽 | `status` | `locks`、`threads` | +| 疑似死锁 | `status` | `deadlocks` | +| UI 没数据 | `status` | `ingestion` | +| 只有某个 Pod 异常 | `status` | 缩小到 Pod、container 或 JVM | +| rollout 后行为变化 | `status` | 按 Pod 和 JVM start time 分开查询 | + +## 常见误判速查 + +| 不要这样做 | 正确做法 | +| --- | --- | +| 空 flamegraph = 没有热点 | 先确认 status、ingestion 和时间范围。 | +| accepted = 已经有 profile 数据 | accepted 只说明目标可采,非空 profile 才说明数据链路可用。 | +| allocation 高 = retained heap 高 | allocation 只说明分配来源。 | +| RUNNABLE = 正在消耗 CPU | RUNNABLE 是线程状态,需要结合 CPU profile。 | +| 选中 Top Table 行 = 直接过滤成单行图 | 默认不是。选中应高亮匹配帧并显示详情;搜索是单独动作。 | +| 多副本 service 直接混看 | 先确认是否只有某个 Pod 异常。 | +| 发布窗口混合新旧 Pod | 用 Pod、PID、JVM start time 分开看。 | + +## 5 分钟应急流程 + +1. 记录 namespace、service、Pod 和问题时间范围。 +2. 打开 UI,选择相同 namespace、service 和时间范围。 +3. 先打开 `status`。 +4. 如果没有权限启用 profiling,把 namespace、service、Pod 和时间范围发给服务 owner 或平台管理员。 +5. 如果 UI 无权访问目标 namespace,不要申请全集群权限;申请对应 namespace 或 service 的最小访问权限。 +6. 如果目标不是 accepted,按 reason 处理;需要权限、attach 或平台问题时联系管理员。 +7. 如果目标 accepted,按症状进入 `cpu`、`memory`、`locks`、`deadlocks` 或线程证据;accepted 只说明控制面允许采集,不等于已经有 profile 数据。 +8. 如果视图为空,打开 `ingestion` 判断上传、存储或 retention 问题,并确认对应时间窗口有 accepted profile batch。 +9. 记录 top stack、thread evidence、target status 和 ingestion health。 +10. 事件结束后停止 temporary profiling 或确认 continuous 是否仍需要保留。 + +## UI 使用顺序 + +进入 Java Profiling UI 后: + +1. 选择 namespace。 +2. 选择 service。 +3. 选择时间范围。 +4. 先打开 `status`。 +5. 如果同一服务下有多个 Pod 或 JVM,先用 `status` 表中的 Pod、PID、Seen 和 Reason 缩小到具体目标。 +6. 状态正常后,再看 `cpu`、`memory`、`locks`、`deadlocks` 或线程证据。 +7. 
如果数据为空,打开 `ingestion` 判断上传和存储是否正常。 + +所有诊断视图共享同一组选择器。发生 rollout、重启或扩缩容时,要核对 Pod、PID 和 JVM start time,避免把旧实例、新实例或无关副本混在一起解释。 + +## 页面区域解读 + +`Service diagnosis` 是服务级诊断页面。它把同一个 namespace、service、Pod 和时间范围下的证据放在一起,便于从“目标是否可采”切换到 CPU、内存分配、锁、死锁和 ingestion 链路。 + +页面左侧竖栏提供主视图快捷入口,分别进入 `status`、`cpu`、`wall`、`io`、`gc`、`memory`、`locks`、`deadlocks` 和 `ingestion`。它不是搜索框,也不是筛选器;真正的查询条件在页面顶部的上下文条里。 + +顶部上下文条显示当前查询条件: + +- `Namespace`:当前命名空间。截图示例是 `java-profiler-qa`。 +- `Service`:当前服务。截图示例是 `jdk17-http-demo`。 +- `Range`:当前分析时间范围。截图示例是 `Last 1h`。 +- `UTC`:当前时间显示时区。跨地区协作时,记录事件时间时要同时写明时区;本界面固定以 UTC 呈现时间戳。 + +上下文条下方的说明强调:Prometheus 仍然负责指标趋势图;本 UI 只展示 profiles、线程证据、目标状态和 ingestion health。也就是说,先用现有监控确认“CPU 高、GC 高、延迟高”这类症状,再回到本页面找 Java 栈证据。 + +顶部标签页含义: + +| 标签 | 用途 | 什么时候打开 | +| --- | --- | --- | +| `Cpu` | 查看 CPU profile flamegraph。 | CPU 高、线程忙、业务路径耗时异常。 | +| `Wall Clock` | 查看 Java stack wall time。 | 延迟高但 CPU 不高、线程 waiting/sleeping/blocking。 | +| `I/O` | 查看 Java socket/file blocking paths。 | RPC、文件、网络或存储调用阻塞。 | +| `GC` | 查看 GC pause event,并关联 allocation profile。 | GC 时间、暂停或 allocation pressure 异常。 | +| `Memory` | 查看 allocation bytes、allocation objects 和 top allocating stacks。 | GC 压力、allocation rate 或对象创建异常。 | +| `Locks` | 查看 lock wait 或 contention profile。 | 请求卡住、线程 BLOCKED、锁竞争。 | +| `Deadlocks` | 查看 JVM 结构化死锁事件。 | 疑似死锁或线程永久互等。 | +| `Status` | 查看每个目标 JVM 是否可采、为什么不可采、用户下一步动作。 | 所有排查都先打开。 | +| `Ingestion` | 查看 collector 上传、backend 接受、ClickHouse 写入和丢弃/拒绝状态。 | profile 或线程证据为空、数据不完整或怀疑链路问题。 | + +`Status` 标签页中的 `Target status` 表格逐行表示一个目标 JVM 或一次目标状态记录。截图中的关键列按如下方式阅读: + +| 列 | 含义与解读方式 | +| --- | --- | +| `Pod` | 目标 Pod 名。长名称会截断,排查时应结合完整 Pod 名或 hover/title 信息记录。 | +| `PID` | 目标 JVM 进程 ID。PID 可能复用,发布或重启窗口内不要只靠 PID 判断身份。 | +| `Seen` | backend 看到该状态距当前查询时间的时间差。越新越能代表当前状态。 | +| `State` | collector 对目标的采集状态,例如 `temporary`、`disabled`、`unsupported`。 | +| `Reason` | 状态原因代码,例如 `accepted`、`disabled_by_metadata`、`unsupported_jvm`。它决定下一步动作。 | +| `Message` | 面向用户的简短状态说明。它帮助确认 reason 是否符合预期。 | +| 
`User action` | 推荐处理动作。优先按这一列处理,再进入 profile 视图。 | + +对截图中的三类状态,可以这样解读: + +- `State=temporary` 且 `Reason=accepted`:该 HotSpot 兼容 JVM 的临时 profiling 已生效,可以进入 `Cpu`、`Memory`、`Locks` 或线程证据查看同一目标和时间范围内的数据。 +- `State=disabled` 且 `Reason=disabled_by_metadata`:profiling 被 metadata 禁用,或没有启用所需 metadata。需要添加 profiling metadata,或确认显式禁用是预期行为。 +- `State=unsupported` 且 `Reason=unsupported_jvm`:collector 判断该 JVM 不适合当前 v1 HotSpot profiling。先确认它是否为业务 JVM、是否 HotSpot 兼容、是否选错 container 或 PID。 + +同一服务同时出现 `temporary`、`disabled` 和 `unsupported` 并不矛盾。多副本、重启、sidecar、多 Java 进程或不同 Pod metadata 都可能让同一 service 下有多种状态。排查时先缩小到异常 Pod 和 JVM,再解释 profile 数据。 + +## 理解目标状态 + +`status` 是第一入口。常见 reason: + +| reason | 含义 | 你的动作 | +| --- | --- | --- | +| `accepted` | 目标可被采集。 | 进入 CPU、memory、locks 或线程证据。 | +| `disabled_by_metadata` | 没有启用 metadata,或被显式禁用。 | 添加 profiling metadata,或确认禁用符合预期。 | +| `temporary_expired` | 临时窗口已过期。 | 如事件仍在发生,重新开启临时 profiling。 | +| `invalid_duration` | duration 配置错误。 | 修正为 `10m`、`1h`、`30s` 这类格式。 | +| `unsupported_jvm` | JVM 不兼容 HotSpot,或不是目标业务 JVM。 | 确认 JVM 类型、Pod 内容器和 PID。 | +| `profiler_conflict` | 已有其他 profiler 占用目标 JVM,或前一次运行留下 async-profiler 状态。 | 停止冲突工具;如是验收残留,在管理员确认后滚动目标 Pod 再重试。 | +| `attach_failed` | collector 无法 attach 到 JVM。 | 联系平台管理员检查权限、容器安全策略或 JVM 参数。 | +| `upload_retryable` | 上传暂时失败,可恢复。 | 查看 `ingestion`,必要时联系管理员。 | +| `upload_dropped` | collector 已丢弃部分批次。 | 该时间窗口证据不完整,联系管理员。 | +| `storage_rejected` | backend 拒绝数据。 | 查看 `ingestion`,联系管理员排查合同或存储问题。 | + +空 flamegraph 不等于没有热点。只有在 target status、ingestion 和时间范围都正确时,空结果才可解释为该窗口没有匹配样本。 + +## 分析视图 + +不同 profile 视图是否有非空数据,取决于当前平台版本、目标 JVM 支持情况、采集窗口和 ingestion 状态;先用 `status` 和 `ingestion` 确认可用性。 + +### CPU + +`cpu` flamegraph 展示时间窗口内被采样到的 Java 调用栈。 + +如果页面同时显示 Top Table 和 flame graph,Top Table 里的 `Symbol`、`Self`、`Total` 应该一起读: + +- `Self` 和 `Total` 都要可见,并且都可以排序。 +- `Self` 说明这个函数自己消耗了多少 CPU。 +- `Total` 说明这个函数加上它的子调用一共消耗了多少 CPU。 +- 排查瓶颈时,通常先看 `Total` 找到最重的业务入口,再结合 `Self` 判断是不是函数本身在烧 CPU。 + +交互方式要保留完整 stack context: + +- 
选中表格行时,应高亮完整 flame graph 中的匹配帧,并显示选中帧详情。 +- 选中行为不应把主图替换成单行过滤结果。 +- 搜索是显式动作;只有用户主动搜索时,才高亮匹配帧并淡化非匹配帧。 +- 点击火焰图块进入 `focus` 后,当前块会变成新的根,子树按新的根重新缩放。 +- `Back` 用于回到上一级 focus,`Reset` 用于回到完整上下文。 +- 选中帧详情应包含符号、样本、类别,以及 `Self` / `Total` 的解读。 + +阅读方式: + +- 宽度越大,代表累计 CPU 样本越多。 +- 优先看业务方法、序列化、正则、JSON、加密、压缩、数据库客户端、缓存路径。 +- CPU profile 是采样证据,不会还原每一次方法调用。 + +### Memory + +`memory` 视图用于 allocation 分析: + +- allocation bytes:哪些路径分配字节最多。 +- allocation objects:哪些路径创建对象数量最多。 + +不要把 allocation profile 解读成 retained heap。它不能回答“谁持有对象”“引用根是什么”“哪个对象泄漏”。这类问题需要 heap dump 或 retained-heap 分析。 + +### Locks + +`locks` 视图用于看 lock wait time 或 lock contention count。结合线程快照一起看: + +- lock flamegraph 解释时间范围内的锁成本。 +- 线程快照解释采样时刻的线程状态。 +- 两者指向同一路径时,结论更强。 + +### Deadlocks + +`deadlocks` 展示 JVM 结构化线程数据派生出的死锁事件: + +- 涉及线程。 +- 等待的锁。 +- 锁 owner。 +- 阻塞栈帧。 + +没有 deadlock event 不代表没有慢请求,可能是普通锁等待、IO 等待或线程池耗尽。 + +### Threads + +忙线程和慢线程证据可能来自: + +- JVM per-thread CPU time。 +- RUNNABLE 快照。 +- CPU profile 热栈关联。 + +RUNNABLE 是线程状态,不是 CPU 百分比。UI 若标记为 sampled 或 profile-only,应按采样证据解释。 + +### Ingestion + +当 profile 或线程证据为空时,看 `ingestion` 只为回答三个问题: + +- collector 有没有上传数据。 +- backend 有没有接受数据。 +- 是否出现 retryable、dropped 或 rejected,需要联系管理员。 + +如果出现 dropped,该时间窗口证据不完整。如果出现 rejected,不要反复重试或继续解释业务问题,先联系管理员排查上传合同或存储问题。 + +## 使用案例 + +### 案例 1:首次接入一个 Java 服务 + +适用情境:服务负责人希望让服务具备性能排查能力。 + +步骤: + +1. 确认服务是 Kubernetes 中的 HotSpot 兼容 Java 服务。 +2. 先用 temporary 模式接入 15 分钟。 +3. 在 UI 选择 namespace、service 和最近 15 到 30 分钟。 +4. 
定期查看 `ingestion`,确认没有持续 retry 或 dropped batch。 +5. 遇到事件时直接选择事件时间范围分析。 + +预期证据: + +- 多副本服务的多个 Pod 分别显示状态。 +- 最近 7 天内可查询 profile。 + +容易误判: + +- continuous 不代表无限保存。 +- profile 是采样证据,不是完整调用日志。 + +### 案例 3:CPU 升高 + +适用情境:监控显示某服务 CPU 峰值。 + +步骤: + +1. 从监控记录 namespace、service、Pod 和峰值时间。 +2. 如未启用 profiling,开启 temporary 10 到 15 分钟。 +3. 打开 `status`,确认目标 accepted。 +4. 打开 `cpu`,选择峰值时间窗口。 +5. 查看最宽业务栈和框架栈。 +6. 记录 top stack、Pod/JVM、时间范围和结论。 + +预期证据: + +- CPU flamegraph 有非 root 栈。 +- top stack 能解释 CPU 样本来源。 + +容易误判: + +- 事件已过去才启用 profiling,无法还原过去现场。 +- 选错时间范围会看到正常负载。 + +### 案例 4:GC 压力或 allocation rate 升高 + +适用情境:GC 次数、GC 时间或 allocation rate 异常。 + +步骤: + +1. 用监控确定时间范围。 +2. 打开 `memory`。 +3. 先看 allocation bytes。 +4. 如果怀疑小对象风暴,再看 allocation objects。 +5. 定位集合复制、字符串拼接、JSON 序列化、正则、缓存反序列化等路径。 + +预期证据: + +- allocation flamegraph 能说明分配来源。 +- bytes 和 objects 可能指向不同热点。 + +容易误判: + +- allocation 高不等于 retained heap 高。 +- 本系统不能直接定位引用根。 + +### 案例 5:锁竞争导致请求慢 + +适用情境:请求延迟升高,线程出现 BLOCKED、WAITING 或 TIMED_WAITING。 + +步骤: + +1. temporary 开启 profiling。 +2. 将 `snapshot-interval` 临时设为 `10s`。 +3. 打开 `locks` 看 lock delay 或 contention。 +4. 查看 slow-thread 证据。 +5. 关注 monitor、park、连接池、缓存锁、日志锁、类加载或单例初始化路径。 + +预期证据: + +- lock flamegraph 集中在少数路径。 +- slow-thread 栈与 lock 热点一致。 + +容易误判: + +- 单次线程快照只能说明采样时刻状态。 +- RUNNABLE 不一定代表正在消耗 CPU。 + +### 案例 6:疑似死锁 + +适用情境:请求永久卡住、线程池不释放或 JVM 工具提示 deadlock。 + +步骤: + +1. 打开 `deadlocks`。 +2. 选择问题发生时间范围。 +3. 查看 deadlock cycle、线程、锁 owner 和 blocking frame。 +4. 如果没有事件,检查 `status` 和线程快照是否采集成功。 +5. 如问题仍在发生,提高短时间线程快照频率后再观察。 + +预期证据: + +- deadlock cycle 清楚显示互相等待的线程和锁。 + +容易误判: + +- 没有 deadlock 不代表没有阻塞。 +- 死锁超过 retention 后不会保留。 + +### 案例 7:线程池耗尽或忙线程 + +适用情境:队列堆积、请求超时,但 CPU 不一定高。 + +步骤: + +1. 查看 busy threads 和 slow threads。 +2. 有 per-thread CPU time 时优先看精确线程 CPU 证据。 +3. 没有 per-thread CPU time 时,把 RUNNABLE 快照和 CPU flamegraph 关联。 +4. 
结合线程名、线程池配置、队列长度和下游依赖指标判断。 + +预期证据: + +- busy threads 指向高 CPU 或频繁 RUNNABLE 的线程。 +- slow threads 指向阻塞、等待或锁竞争路径。 + +容易误判: + +- 线程池耗尽可能由下游 IO 慢引起,profile 只能提供 Java 栈证据。 + +### 案例 8:UI 没有数据 + +适用情境:CPU、memory、locks 或 deadlocks 页面为空。 + +步骤: + +1. 检查 namespace、service、Pod、JVM 和时间范围。 +2. 打开 `status`。 +3. 如果是 `disabled_by_metadata`,在有权限且获批时通过受控变更路径添加 profiling metadata;否则按“无法自行启用 profiling 时”申请。 +4. 如果是 `temporary_expired`,重新开启临时窗口。 +5. 如果是 `unsupported_jvm`、`profiler_conflict` 或 `attach_failed`,按 reason 处理。 +6. 打开 `ingestion`,检查 retry、dropped、rejected。 +7. 确认数据是否超过 7 天 retention。 + +预期证据: + +- 空状态能被 status 或 ingestion 解释。 + +容易误判: + +- 空页面不是诊断结论。 + +### 案例 9:临时窗口过期 + +适用情境:`status` 显示 `temporary_expired`。 + +步骤: + +1. 查看 Seen 时间和 Pod。 +2. 确认 `profile-duration` 是否已经过期。 +3. 如果事件仍在发生,重新设置新的 temporary 窗口。 +4. 如果事件已结束,移除 annotation。 + +预期证据: + +- 新窗口内目标显示为 `temporary` 或 `accepted`。 + +容易误判: + +- 同一 service 可能同时显示旧 Pod expired 和新 Pod accepted。 + +### 案例 10:多副本服务排查 + +适用情境:Deployment 或 StatefulSet 有多个 Pod。 + +步骤: + +1. 先按 service 查询 `status`。 +2. 如果监控指向某个 Pod,缩小到该 Pod。 +3. 如果所有 Pod 异常,按 service 级 profile 查共同热点。 +4. 如果只有单 Pod 异常,对比该 Pod 与其他 Pod 的栈。 +5. rollout 期间分开看旧 Pod 和新 Pod。 + +预期证据: + +- 每个 Pod 有独立 target status。 +- Seen、PID 和 JVM start time 帮助区分实例。 + +容易误判: + +- 多副本混看可能掩盖单实例异常。 + +### 案例 11:多容器 Pod 或 sidecar + +适用情境:Pod 中有业务容器、sidecar、helper 或多个 Java 进程。 + +步骤: + +1. 在 `status` 中查看 container、PID 和 JVM start time。 +2. 确认目标 Java 进程属于业务容器。 +3. 如果支持过滤,缩小到目标 container 或 JVM。 +4. 不要把 sidecar profile 解释为业务服务热点。 + +预期证据: + +- target identity 能区分 Pod、container、PID 和 JVM start time。 + +容易误判: + +- Pod 相同不代表 JVM 相同。 +- PID 可复用,必须结合 JVM start time。 + +### 案例 12:Pod 重启或发布窗口 + +适用情境:问题发生时服务正在重启、扩缩容或发布。 + +步骤: + +1. 查看 Pod、PID、JVM start time 和 Seen 时间。 +2. 将 profile 时间范围对齐到具体 Pod 生命周期。 +3. 发布前后分别查询。 +4. 对同一 PID 的不同 JVM start time 分开解释。 + +预期证据: + +- 新旧 JVM 具有不同 JVM start time。 + +容易误判: + +- 发布窗口内 profile 可能混合新旧代码路径。 + +### 案例 13:unsupported JVM + +适用情境:目标显示 `unsupported_jvm`。 + +步骤: + +1. 
确认 JVM 分发版和版本。 +2. 确认是否为 HotSpot 兼容。 +3. 检查是否为 wrapper、helper 或非业务 Java 进程。 +4. 如果同一 Pod 有多个 Java 进程,按 PID 区分。 + +预期证据: + +- 业务 JVM 若 HotSpot 兼容,应有自己的状态。 + +容易误判: + +- 一个 Java 进程 unsupported 不代表整个 Pod 都不能分析。 + +### 案例 14:profiler 冲突 + +适用情境:目标显示 `profiler_conflict`。 + +步骤: + +1. 确认是否有人手动运行 async-profiler。 +2. 检查是否存在其他 profiler 或诊断 agent。 +3. 停止冲突工具,或决定跳过该 JVM。 +4. 等待 collector 下一次扫描。 + +预期证据: + +- 冲突解除后状态变为 accepted 或对应启用状态。 + +容易误判: + +- 不要同时对同一 JVM 运行多个 profiler。 + +### 案例 15:attach 失败 + +适用情境:目标显示 `attach_failed`。 + +步骤: + +1. 记录 Pod、container、PID、JVM start time 和 message。 +2. 联系平台管理员检查 collector 权限、目标容器安全策略和 JVM attach 参数。 +3. 修复后重新触发扫描或等待下一次扫描。 + +预期证据: + +- 权限或 JVM 参数修复后状态变为 accepted 或其他明确状态。 + +容易误判: + +- attach 失败不是“没有性能问题”,而是采集失败。 + +### 案例 16:真实 profile 数据链路确认 + +适用情境:你需要确认系统不仅能发现目标,还能完成性能分析。 + +步骤: + +1. 对 JDK17 demo 或有稳定 CPU、allocation、lock contention 负载的 HotSpot Java 服务启用 profiling。 +2. 确认 `status` 在当前运行窗口内为 accepted。 +3. 等待至少一个 profile batch 被 backend accepted。 +4. 打开 `cpu`,确认 Top Table 和 flamegraph 都有非 root 栈。 +5. 打开 `memory` 和 `locks`,确认 allocation 与 lock-delay 都有非空栈。 +6. 在 `cpu` 里执行搜索、选中 frame、查看详情、Focus、Back、Reset。 +7. 打开 `ingestion`,确认 profile batch accepted,且没有 unexplained rejected/dropped/truncated 证据。 +8. 改变时间范围,确认查询结果随时间窗口变化。 +9. 
如果这是验收而非日常排障,还要保存 ClickHouse sample/stack 行数、TTL、浏览器截图和目标 Pod restart count 前后对比。 + +预期证据: + +- UI 中出现真实方法栈和样本值。 +- `ingestion` 显示对应 profile batch 被接受。 +- CPU、allocation、lock-delay 都非空,才能证明核心 profile 链路完整。 + +容易误判: + +- 只有 target status accepted 只能证明控制面可用,不能证明 profile 数据链路完整可用。 +- thread snapshot 和 deadlock event 是补充证据;如果本次不是排查线程或死锁,它们为空应记录为 gap,而不是否定 CPU/memory/lock profile 验收。 + +## 安全和开销建议 + +- 默认只对明确需要的 Java 服务启用。 +- 事件排查优先使用 temporary,并设置明确 duration。 +- 高频线程快照只用于短时间窗口。 +- 栈数据可能包含类名、方法名、包名和业务路径,应按生产敏感数据处理。 +- 不要把完整 flamegraph、thread dump、stack payload、profile payload、thread snapshot 或 token 粘贴到公开 issue、聊天群或截图中。 +- 对外共享时,只保留必要方法名;按组织安全策略脱敏包名、服务名、namespace、Pod 名和业务路径。 +- profile 数据不是日志,但仍可能暴露业务流程、内部类名、接口路径或客户相关处理逻辑。 +- token、cookie、内部域名、内部 backend/storage 凭据或连接信息不得出现在截图、PR、issue 或聊天记录中。 +- 发现 profiler 冲突时,不要同时运行多个 async-profiler 工具采同一个 JVM。 +- 采集失败不应被忽略;先读 `status` 和 `ingestion`。 + +## 数据保留 + +所有采集数据保留不超过 7 天: + +- profile samples 和 stacks:7 天。 +- thread snapshots 和 thread stacks:7 天。 +- deadlock events:7 天。 +- target status 和 ingestion health:7 天。 +- 可选 raw artifact:默认关闭;如启用,最长 24 小时。 + +超过 retention 后,只能重新采集或使用其他历史证据。Prometheus 指标趋势可能仍存在,但 Java 栈证据不会由本系统长期保存。 + +### 历史事件超过 retention + +如果事件已经超过 7 天: + +1. 不要把空结果解释为没有问题。 +2. 使用 Prometheus、日志、变更记录和业务事件记录回溯时间线。 +3. 如果问题仍可复现,重新启用 temporary profiling。 +4. 
如果问题不可复现,把缺失 profile 证据记录到复盘结论中。 + +## 结果解读边界 + +| 问题 | 本系统能否回答 | 说明 | +| --- | --- | --- | +| 哪段 Java 代码消耗 CPU | 能 | 使用 CPU flamegraph。 | +| 哪段 Java 代码分配最多 | 能 | 使用 allocation bytes 或 allocation objects。 | +| 哪些锁路径产生等待 | 能 | 使用 lock profile 与 slow-thread 证据。 | +| 是否存在 Java 死锁 | 能 | 使用 deadlock event。 | +| 哪个对象保留了堆内存 | 不能 | 需要 heap dump 或 retained-heap 分析。 | +| 完整还原历史线程执行过程 | 不能 | 线程快照是采样时刻证据。 | +| JVM 指标趋势和告警 | 不在本系统内 | 使用现有 Prometheus 面板。 | +| 非 Java 服务 profiling | 第一版本不支持 | 仅 HotSpot 兼容 Java 服务。 | + +## 术语表 + +| 术语 | 含义 | +| --- | --- | +| target | 一个可被发现和采集的 JVM 进程。 | +| target identity | 用于区分目标的身份字段,包括 cluster、namespace、service、pod、container、PID 和 JVM start time。 | +| status | collector 对目标当前采集状态的判断,例如 accepted、disabled、unsupported、attach_failed。 | +| reason | status 的原因,决定用户下一步动作。 | +| accepted | collector 认为目标可采集;它只证明控制面可用,不保证已经有非空 profile。 | +| non-empty profile | 已经有 CPU、allocation 或 lock 样本写入并可查询;这是性能分析数据链路可用的证据。 | +| ingestion | collector 上传数据到 backend,并由 backend 写入 ClickHouse 的过程。 | +| profile batch | 一次 CPU、allocation 或 lock profile 上传批次。 | +| target status batch | collector 上传的一批目标状态。 | +| flamegraph | 用宽度表示样本累计值的调用栈图。 | +| JVM start time | JVM 启动时间,用来区分 PID 复用后的不同 JVM。 | +| retention | 数据最大保留时间。本系统 profile、thread、deadlock、status 和 ingestion 数据不超过 7 天。 | + +## 向平台管理员求助时提供什么 + +如果需要平台管理员协助,请一次性提供这些信息,避免来回确认: + +```text +Namespace: +Service: +Pod: +Container / JVM if known: +Time range: +UI view: status / cpu / memory / locks / deadlocks / ingestion +Status reason: +Ingestion state: +What you expected: +Screenshot or sanitized evidence: redact tokens/cookies/DSN/internal domains/customer identifiers; prefer stack summaries over raw payloads. 
+``` + +## 事件记录模板 + +```text +Service: +Namespace: +Pod/JVM: +Incident window: +Profiling mode: temporary / continuous +Profile views used: cpu / memory / locks / deadlocks / status / ingestion +Top stack: +Thread evidence: +Target status: +Ingestion health: +Conclusion: +Follow-up: +Profiling stopped at: +Redaction check: no tokens/cookies/DSN/internal domains/customer identifiers/raw payloads. +``` diff --git a/docs/zh/reference/profiling-contracts.md b/docs/zh/reference/profiling-contracts.md new file mode 100644 index 0000000..d8963c9 --- /dev/null +++ b/docs/zh/reference/profiling-contracts.md @@ -0,0 +1,11 @@ +# Profiling 合同 + +稳定的 profiling payload 和配置合同在源码目录 `contracts/profiling` 下。 + +修改 collector payload、backend ingestion 或 UI 解释逻辑时,以这些文件为准: + +- [`configuration.md`](https://github.com/koolay/java-profiler/blob/main/contracts/profiling/configuration.md) +- [`payloads.md`](https://github.com/koolay/java-profiler/blob/main/contracts/profiling/payloads.md) +- [`types.go`](https://github.com/koolay/java-profiler/blob/main/contracts/profiling/types.go) + +如果合同改动影响 scope、retention、collection、storage 或用户可见行为,需要同步更新 requirements、operations guides 和 real profiling acceptance standard。 diff --git a/domain/types.go b/domain/types.go index 250b9cd..f1b8e98 100644 --- a/domain/types.go +++ b/domain/types.go @@ -15,27 +15,101 @@ const ( ProfileTypeAllocObjects ProfileType = "java_allocation_objects" ProfileTypeLockContention ProfileType = "java_lock_contention_count" ProfileTypeLockDelay ProfileType = "java_lock_delay_nanoseconds" + ProfileTypeWallClock ProfileType = "java_wall_clock_nanoseconds" + ProfileTypeIOWait ProfileType = "java_io_wait_nanoseconds" ) +type ProfileValueSemantics struct { + ValueUnit string `json:"value_unit"` + DisplayUnit string `json:"display_unit"` + PercentBasis string `json:"percent_basis"` + BaselineDescription string `json:"baseline_description"` + WindowSeconds float64 `json:"window_seconds,omitempty"` +} + var AllProfileTypes = 
[]ProfileType{ ProfileTypeCPU, ProfileTypeAllocBytes, ProfileTypeAllocObjects, ProfileTypeLockContention, ProfileTypeLockDelay, + ProfileTypeWallClock, + ProfileTypeIOWait, } func (p ProfileType) String() string { return string(p) } func (p ProfileType) IsValid() bool { switch p { - case ProfileTypeCPU, ProfileTypeAllocBytes, ProfileTypeAllocObjects, ProfileTypeLockContention, ProfileTypeLockDelay: + case ProfileTypeCPU, ProfileTypeAllocBytes, ProfileTypeAllocObjects, ProfileTypeLockContention, ProfileTypeLockDelay, ProfileTypeWallClock, ProfileTypeIOWait: return true default: return false } } +func (p ProfileType) Semantics(window TimeWindow) ProfileValueSemantics { + semantics := ProfileValueSemantics{ + PercentBasis: "returned_profile_value", + BaselineDescription: "Percentage is relative to the returned profile samples for the selected filters.", + } + if duration := window.Duration(); duration > 0 { + semantics.WindowSeconds = duration.Seconds() + semantics.BaselineDescription = "Percentage is relative to returned profile samples; average cores use the selected time window." + } + switch p { + case ProfileTypeCPU: + semantics.ValueUnit = "nanoseconds" + semantics.DisplayUnit = "duration_and_average_cores" + case ProfileTypeWallClock: + semantics.ValueUnit = "nanoseconds" + semantics.DisplayUnit = "duration" + semantics.BaselineDescription = "Percentage is relative to returned Wall Clock samples for the selected Java target and time range." + case ProfileTypeIOWait: + semantics.ValueUnit = "nanoseconds" + semantics.DisplayUnit = "duration" + semantics.BaselineDescription = "Percentage is relative to returned Java I/O wait samples for the selected target and time range." 
+ case ProfileTypeAllocBytes: + semantics.ValueUnit = "bytes" + semantics.DisplayUnit = "bytes" + case ProfileTypeAllocObjects: + semantics.ValueUnit = "objects" + semantics.DisplayUnit = "count" + case ProfileTypeLockContention: + semantics.ValueUnit = "events" + semantics.DisplayUnit = "count" + case ProfileTypeLockDelay: + semantics.ValueUnit = "nanoseconds" + semantics.DisplayUnit = "duration" + default: + semantics.ValueUnit = "raw" + semantics.DisplayUnit = "raw" + } + return semantics +} + +func FormatProfileValue(profileType ProfileType, value uint64, window TimeWindow) string { + switch profileType { + case ProfileTypeCPU: + duration := formatDurationNanos(value) + if windowDuration := window.Duration(); windowDuration > 0 && value > 0 { + cores := float64(value) / float64(windowDuration.Nanoseconds()) + if cores >= 0.01 { + return fmt.Sprintf("%s · %.2f cores", duration, cores) + } + } + return duration + case ProfileTypeLockDelay, ProfileTypeWallClock, ProfileTypeIOWait: + return formatDurationNanos(value) + case ProfileTypeAllocBytes: + return formatBytes(value) + case ProfileTypeAllocObjects, ProfileTypeLockContention: + return fmt.Sprintf("%d", value) + default: + return fmt.Sprintf("%d", value) + } +} + type EnablementMode string const ( @@ -95,6 +169,7 @@ type BatchType string const ( BatchTypeProfile BatchType = "profile" BatchTypeThreadSnapshot BatchType = "thread_snapshot" + BatchTypeJVMEvent BatchType = "jvm_event" BatchTypeTargetStatus BatchType = "target_status" BatchTypeCollectorBeat BatchType = "collector_heartbeat" BatchTypeIngestion BatchType = "ingestion" @@ -104,7 +179,7 @@ const ( func (b BatchType) IsValid() bool { switch b { - case BatchTypeProfile, BatchTypeThreadSnapshot, BatchTypeTargetStatus, BatchTypeCollectorBeat, BatchTypeIngestion, BatchTypeRetention, BatchTypeArtifactIndex: + case BatchTypeProfile, BatchTypeThreadSnapshot, BatchTypeJVMEvent, BatchTypeTargetStatus, BatchTypeCollectorBeat, BatchTypeIngestion, 
BatchTypeRetention, BatchTypeArtifactIndex: return true default: return false @@ -222,3 +297,32 @@ func StableProfileTypeNames() []string { sort.Strings(out) return out } + +func formatDurationNanos(value uint64) string { + switch { + case value >= uint64(time.Minute): + return fmt.Sprintf("%.1f min", float64(value)/float64(time.Minute)) + case value >= uint64(time.Second): + return fmt.Sprintf("%.2f s", float64(value)/float64(time.Second)) + case value >= uint64(time.Millisecond): + return fmt.Sprintf("%.1f ms", float64(value)/float64(time.Millisecond)) + case value >= uint64(time.Microsecond): + return fmt.Sprintf("%.1f us", float64(value)/float64(time.Microsecond)) + default: + return fmt.Sprintf("%d ns", value) + } +} + +func formatBytes(value uint64) string { + const unit = 1024 + switch { + case value >= unit*unit*unit: + return fmt.Sprintf("%.1f GiB", float64(value)/(unit*unit*unit)) + case value >= unit*unit: + return fmt.Sprintf("%.1f MiB", float64(value)/(unit*unit)) + case value >= unit: + return fmt.Sprintf("%.1f KiB", float64(value)/unit) + default: + return fmt.Sprintf("%d B", value) + } +} diff --git a/domain/types_test.go b/domain/types_test.go new file mode 100644 index 0000000..280f6ed --- /dev/null +++ b/domain/types_test.go @@ -0,0 +1,33 @@ +package domain + +import ( + "testing" + "time" +) + +func TestProfileValueSemanticsAndFormatting(t *testing.T) { + window := TimeWindow{StartedAt: time.Unix(0, 0), EndsAt: time.Unix(10, 0)} + + semantics := ProfileTypeCPU.Semantics(window) + if semantics.ValueUnit != "nanoseconds" || semantics.DisplayUnit != "duration_and_average_cores" { + t.Fatalf("unexpected CPU semantics: %+v", semantics) + } + if semantics.WindowSeconds != 10 { + t.Fatalf("window seconds = %v, want 10", semantics.WindowSeconds) + } + if got := FormatProfileValue(ProfileTypeCPU, uint64(2*time.Second), window); got != "2.00 s · 0.20 cores" { + t.Fatalf("CPU display = %q", got) + } + if got := FormatProfileValue(ProfileTypeAllocBytes, 
2*1024*1024, window); got != "2.0 MiB" { + t.Fatalf("allocation display = %q", got) + } + if got := FormatProfileValue(ProfileTypeLockDelay, uint64(15*time.Millisecond), window); got != "15.0 ms" { + t.Fatalf("lock delay display = %q", got) + } + if got := FormatProfileValue(ProfileTypeWallClock, uint64(3*time.Second), window); got != "3.00 s" { + t.Fatalf("wall display = %q", got) + } + if got := FormatProfileValue(ProfileTypeIOWait, uint64(250*time.Millisecond), window); got != "250.0 ms" { + t.Fatalf("io wait display = %q", got) + } +} diff --git a/examples/jdk17-http-demo/Dockerfile b/examples/jdk17-http-demo/Dockerfile index f35ec8c..d77c1d7 100644 --- a/examples/jdk17-http-demo/Dockerfile +++ b/examples/jdk17-http-demo/Dockerfile @@ -1,5 +1,5 @@ -ARG BUILDER_IMAGE=ghcr.io/koolay/josepdcs/kubectl-prof:1.5.0-jvm-alpine -ARG RUNTIME_IMAGE=ghcr.io/koolay/josepdcs/kubectl-prof:1.5.0-jvm-alpine +ARG BUILDER_IMAGE=docker.m.daocloud.io/eclipse-temurin:21-jdk +ARG RUNTIME_IMAGE=docker.m.daocloud.io/eclipse-temurin:21-jdk FROM ${BUILDER_IMAGE} AS build WORKDIR /src diff --git a/examples/jdk17-http-demo/src/main/java/com/ebpfjava/examples/httpdemo/DemoHttpService.java b/examples/jdk17-http-demo/src/main/java/com/ebpfjava/examples/httpdemo/DemoHttpService.java index db536cb..56edf54 100644 --- a/examples/jdk17-http-demo/src/main/java/com/ebpfjava/examples/httpdemo/DemoHttpService.java +++ b/examples/jdk17-http-demo/src/main/java/com/ebpfjava/examples/httpdemo/DemoHttpService.java @@ -8,6 +8,8 @@ import java.net.HttpURLConnection; import java.net.InetSocketAddress; import java.net.URLDecoder; +import java.nio.file.Files; +import java.nio.file.Path; import java.nio.charset.StandardCharsets; import java.time.Duration; import java.time.Instant; @@ -19,6 +21,7 @@ import java.util.concurrent.Executors; import java.util.concurrent.ThreadFactory; import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.locks.LockSupport; public final class DemoHttpService { 
private static final Object LOCK_MONITOR = new Object(); @@ -92,12 +95,15 @@ private static void handleWork(HttpExchange exchange) throws IOException { switch (mode) { case "cpu" -> operations = burnCpu(durationMs); case "alloc" -> operations = allocateObjects(durationMs); + case "gc" -> operations = createGcPressure(durationMs); + case "io" -> operations = exerciseFileIo(durationMs); + case "wall" -> operations = waitWallClock(durationMs); case "lock" -> operations = contendLock(durationMs); default -> { writeJson( exchange, HttpURLConnection.HTTP_BAD_REQUEST, - "{\"error\":\"mode must be one of cpu, alloc, lock\"}"); + "{\"error\":\"mode must be one of cpu, alloc, gc, io, wall, lock\"}"); return; } } @@ -176,7 +182,7 @@ private static void handleIndex(HttpExchange exchange) throws IOException { HttpURLConnection.HTTP_OK, "{" + "\"service\":\"jdk17-http-demo\"," - + "\"endpoints\":[\"/health\",\"/work?mode=cpu|alloc|lock&durationMs=1000\"," + + "\"endpoints\":[\"/health\",\"/work?mode=cpu|alloc|gc|io|wall|lock&durationMs=1000\"," + "\"/threads?durationMs=5000\"]" + "}"); } @@ -205,6 +211,50 @@ private static long allocateObjects(int durationMs) { return operations; } + private static long createGcPressure(int durationMs) { + long deadline = System.nanoTime() + Duration.ofMillis(durationMs).toNanos(); + long operations = 0; + int slot = 0; + byte[][] retained = new byte[64][]; + while (System.nanoTime() < deadline) { + for (int i = 0; i < retained.length; i++) { + retained[slot] = new byte[256 * 1024]; + slot = (slot + 1) % retained.length; + operations++; + } + System.gc(); + LockSupport.parkNanos(Duration.ofMillis(10).toNanos()); + } + return operations; + } + + private static long exerciseFileIo(int durationMs) throws IOException { + long deadline = System.nanoTime() + Duration.ofMillis(durationMs).toNanos(); + long operations = 0; + Path file = Files.createTempFile("java-profiler-demo-", ".bin"); + try { + byte[] payload = new byte[64 * 1024]; + while 
(System.nanoTime() < deadline) { + Files.write(file, payload); + byte[] readBack = Files.readAllBytes(file); + operations += readBack.length; + } + } finally { + Files.deleteIfExists(file); + } + return operations; + } + + private static long waitWallClock(int durationMs) { + long deadline = System.nanoTime() + Duration.ofMillis(durationMs).toNanos(); + long operations = 0; + while (System.nanoTime() < deadline) { + sleepQuietly(25); + operations++; + } + return operations; + } + private static long contendLock(int durationMs) { long deadline = System.nanoTime() + Duration.ofMillis(durationMs).toNanos(); long operations = 0; diff --git a/examples/jdk17-http-demo/src/test/java/com/ebpfjava/examples/httpdemo/DemoHttpServiceTest.java b/examples/jdk17-http-demo/src/test/java/com/ebpfjava/examples/httpdemo/DemoHttpServiceTest.java index 4e6220e..ec89000 100644 --- a/examples/jdk17-http-demo/src/test/java/com/ebpfjava/examples/httpdemo/DemoHttpServiceTest.java +++ b/examples/jdk17-http-demo/src/test/java/com/ebpfjava/examples/httpdemo/DemoHttpServiceTest.java @@ -52,8 +52,8 @@ void healthEndpointReturnsServiceIdentity() throws Exception { } @Test - void workEndpointCanExerciseCpuAllocationAndLockPaths() throws Exception { - for (String mode : new String[] {"cpu", "alloc", "lock"}) { + void workEndpointCanExerciseProfilerEvidencePaths() throws Exception { + for (String mode : new String[] {"cpu", "alloc", "gc", "io", "wall", "lock"}) { HttpResponse response = get("/work?mode=" + mode + "&durationMs=20"); assertEquals(200, response.statusCode(), mode); diff --git a/scripts/deploy-jdk17-demo.sh b/scripts/deploy-jdk17-demo.sh index adf34d8..c0683bc 100755 --- a/scripts/deploy-jdk17-demo.sh +++ b/scripts/deploy-jdk17-demo.sh @@ -18,7 +18,7 @@ Options: --source-image IMAGE JDK image for --source-mode. Default: docker.m.daocloud.io/eclipse-temurin:21-jdk. --kind-load Load the image into a kind cluster after building. --kind-cluster NAME kind cluster name. Default: kind. 
- --run-load Port-forward the service and call CPU/alloc/lock/thread endpoints. + --run-load Port-forward the service and call CPU/alloc/gc/io/wall/lock/thread endpoints. --duration-ms N Load duration per endpoint. Default: 30000. --local-port PORT Local port for port-forward. Default: 18080. --artifact-dir DIR Write deployment evidence. Default: /tmp/java-profiler-jdk17-demo-. @@ -120,6 +120,7 @@ kubectl create namespace "$namespace" --dry-run=client -o yaml | kubectl apply - if [[ "$source_mode" == "true" ]]; then log "## Apply Source ConfigMap" + source_hash="$(shasum -a 256 examples/jdk17-http-demo/src/main/java/com/ebpfjava/examples/httpdemo/DemoHttpService.java | awk '{print $1}')" kubectl -n "$namespace" create configmap jdk17-http-demo-source \ --from-file=DemoHttpService.java=examples/jdk17-http-demo/src/main/java/com/ebpfjava/examples/httpdemo/DemoHttpService.java \ --dry-run=client -o yaml | kubectl apply -f - | tee "$artifact_dir/source-configmap.log" @@ -170,6 +171,7 @@ if [[ "$source_mode" == "true" ]]; then {"op":"add","path":"/spec/template/spec/containers/0/command","value":["/bin/sh","-lc"]}, {"op":"add","path":"/spec/template/spec/containers/0/args","value":["mkdir -p /tmp/classes && javac --release 17 -d /tmp/classes /src/DemoHttpService.java && exec java -XX:+UseContainerSupport -cp /tmp/classes com.ebpfjava.examples.httpdemo.DemoHttpService"]} ]' | tee "$artifact_dir/demo-source-patch.log" + kubectl -n "$namespace" patch deploy/jdk17-http-demo --type=merge -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"java-profiler.io/source-hash\":\"${source_hash}\"}}}}}" | tee "$artifact_dir/demo-source-hash.log" else kubectl -n "$namespace" set image deploy/jdk17-http-demo "app=$image" | tee "$artifact_dir/demo-set-image.log" fi @@ -188,6 +190,9 @@ if [[ "$run_load" == "true" ]]; then curl -fsS "http://127.0.0.1:${local_port}/health" | tee "$artifact_dir/health.json" curl -fsS 
"http://127.0.0.1:${local_port}/work?mode=cpu&durationMs=${duration_ms}" | tee "$artifact_dir/work-cpu.json" curl -fsS "http://127.0.0.1:${local_port}/work?mode=alloc&durationMs=${duration_ms}" | tee "$artifact_dir/work-alloc.json" + curl -fsS "http://127.0.0.1:${local_port}/work?mode=gc&durationMs=${duration_ms}" | tee "$artifact_dir/work-gc.json" + curl -fsS "http://127.0.0.1:${local_port}/work?mode=io&durationMs=${duration_ms}" | tee "$artifact_dir/work-io.json" + curl -fsS "http://127.0.0.1:${local_port}/work?mode=wall&durationMs=${duration_ms}" | tee "$artifact_dir/work-wall.json" curl -fsS "http://127.0.0.1:${local_port}/work?mode=lock&durationMs=${duration_ms}" | tee "$artifact_dir/work-lock.json" curl -fsS "http://127.0.0.1:${local_port}/threads?durationMs=${duration_ms}" | tee "$artifact_dir/threads.json" diff --git a/scripts/real-acceptance.sh b/scripts/real-acceptance.sh index 35ba398..2dd1b9b 100755 --- a/scripts/real-acceptance.sh +++ b/scripts/real-acceptance.sh @@ -113,6 +113,10 @@ optional_gap() { log "GAP: $*" } +json_root_value() { + jq --stream -r 'select(.[0] == ["root", "value"]) | .[1]' "$1" | head -n 1 +} + require_cmd() { command -v "$1" >/dev/null 2>&1 || fail "missing required command: $1" } @@ -154,7 +158,7 @@ drive_workload_load() { local iteration=0 while [[ "$(date +%s)" -lt "$deadline" ]]; do iteration=$((iteration + 1)) - for mode in cpu alloc; do + for mode in cpu alloc gc io wall; do load_pids=() for _ in $(seq 1 "$cpu_alloc_parallel"); do curl -fsS "http://127.0.0.1:${local_port}/work?mode=${mode}&durationMs=3000" >>"$artifact_dir/workload-${mode}-load.log" 2>&1 & @@ -581,6 +585,19 @@ if [[ "$configure_profiler" == "true" && "$install" != "true" ]]; then --set "profiling.targetNamespace=${namespace}" \ --set "profiling.targetService=${service_name}" pass "profiler target filters configured for ${namespace}/${service_name}" + if kubectl -n "$namespace" get "deploy/$service_name" >/dev/null 2>&1; then + kubectl -n "$namespace" 
annotate "deploy/$service_name" \ + "java-profiler.io/acceptance-run=${acceptance_started_at}" \ + "java-profiler.io/profile-disabled=false" \ + "java-profiler.io/profile-mode=temporary" \ + "java-profiler.io/profile-duration=1h" \ + "java-profiler.io/startup-delay=0s" \ + "java-profiler.io/snapshot-interval=10s" \ + --overwrite >/dev/null + kubectl -n "$namespace" rollout restart "deploy/$service_name" >/dev/null + kubectl -n "$namespace" rollout status "deploy/$service_name" --timeout=180s + pass "target workload restarted after profiler configuration to avoid stale async-profiler conflict" + fi fi log "## Cluster State" @@ -610,9 +627,13 @@ pass "all required workloads are rolled out" # success before the backend/web processes have opened their ports. sleep "${JAVA_PROFILER_ACCEPTANCE_SETTLE_SECONDS:-10}" -kubectl -n "$profiler_namespace" port-forward --address 127.0.0.1 "svc/$release-web" "${web_port}:80" >"$artifact_dir/port-forward-web.log" 2>&1 & +web_pod="$(kubectl -n "$profiler_namespace" get pod -l app.kubernetes.io/name=java-profiler-web -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | awk '{print $1}')" +backend_pod="$(kubectl -n "$profiler_namespace" get pod -l app.kubernetes.io/name=java-profiler-backend -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | awk '{print $1}')" +[[ -n "$web_pod" ]] || fail "web pod not found for port-forward" +[[ -n "$backend_pod" ]] || fail "backend pod not found for port-forward" +kubectl -n "$profiler_namespace" port-forward --address 127.0.0.1 "pod/$web_pod" "${web_port}:80" >"$artifact_dir/port-forward-web.log" 2>&1 & cleanup_pids+=("$!") -kubectl -n "$profiler_namespace" port-forward --address 127.0.0.1 "svc/$release-backend" "${backend_port}:8080" >"$artifact_dir/port-forward-backend.log" 2>&1 & +kubectl -n "$profiler_namespace" port-forward --address 127.0.0.1 "pod/$backend_pod" "${backend_port}:8080" >"$artifact_dir/port-forward-backend.log" 2>&1 & cleanup_pids+=("$!") 
collector_pod="$(kubectl -n "$profiler_namespace" get pod -l app.kubernetes.io/name=java-profiler-collector -o jsonpath='{.items[0].metadata.name}')" kubectl -n "$profiler_namespace" port-forward --address 127.0.0.1 "pod/$collector_pod" "${collector_port}:9090" >"$artifact_dir/port-forward-collector.log" 2>&1 & @@ -625,8 +646,8 @@ pass "port-forwards are ready" web_status_len="0" for _ in $(seq 1 90); do - curl -sS "http://127.0.0.1:${web_port}/api/ui/v1/target-status" >"$artifact_dir/web-target-status.json" - web_status_len="$(jq 'length' "$artifact_dir/web-target-status.json")" + curl -sS "http://127.0.0.1:${web_port}/api/ui/v1/target-status" >"$artifact_dir/web-target-status.json" || true + web_status_len="$(jq 'length' "$artifact_dir/web-target-status.json" 2>/dev/null || echo 0)" if [[ "$web_status_len" -gt 0 ]]; then break fi @@ -678,8 +699,11 @@ else fi curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/flamegraph?${query}&profile_type=java_cpu_nanoseconds" >"$artifact_dir/backend-flamegraph-cpu.json" +curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/flamegraph?${query}&profile_type=java_wall_clock_nanoseconds" >"$artifact_dir/backend-flamegraph-wall.json" +curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/flamegraph?${query}&profile_type=java_io_wait_nanoseconds" >"$artifact_dir/backend-flamegraph-io-wait.json" curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/flamegraph?${query}&profile_type=java_allocation_bytes" >"$artifact_dir/backend-flamegraph-alloc-bytes.json" curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/flamegraph?${query}&profile_type=java_lock_delay_nanoseconds" >"$artifact_dir/backend-flamegraph-lock-delay.json" +curl -sS -H "Authorization: Bearer ${ui_token}" 
"http://127.0.0.1:${backend_port}/api/ui/v1/jvm-events?${query}&event_type=gc_pause" >"$artifact_dir/backend-jvm-events-gc.json" curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/thread-diagnosis?${query}" >"$artifact_dir/backend-thread-diagnosis.json" curl -sS -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/deadlocks?${query}" >"$artifact_dir/backend-deadlocks.json" ingestion_code="$(curl -sS -o "$artifact_dir/backend-ingestion.txt" -w '%{http_code}' -H "Authorization: Bearer ${ui_token}" "http://127.0.0.1:${backend_port}/api/ui/v1/ingestion")" @@ -688,7 +712,7 @@ curl -sS "http://127.0.0.1:${collector_port}/metrics" >"$artifact_dir/collector- curl -sS "http://127.0.0.1:${backend_port}/metrics" >"$artifact_dir/backend-metrics.txt" kubectl -n "$profiler_namespace" exec deploy/clickhouse -- clickhouse-client --query \ - "WITH parseDateTime64BestEffort('${start}', 9, 'UTC') AS run_start SELECT 'target_status' t, count() FROM java_profiler.java_profiler_target_status WHERE namespace='${namespace}' AND service='${service_name}' AND status_at >= run_start UNION ALL SELECT 'ingestion_batches', count() FROM java_profiler.java_profiler_ingestion_batches WHERE received_at >= run_start UNION ALL SELECT 'profile_samples', count() FROM java_profiler.java_profiler_profile_samples WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'profile_stacks', count() FROM java_profiler.java_profiler_profile_stacks WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'thread_snapshots', count() FROM java_profiler.java_profiler_thread_snapshots WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'deadlock_events', count() FROM java_profiler.java_profiler_deadlock_events WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start 
UNION ALL SELECT 'artifact_index', count() FROM java_profiler.java_profiler_artifact_index WHERE created_at >= run_start" \ + "WITH parseDateTime64BestEffort('${start}', 9, 'UTC') AS run_start SELECT 'target_status' t, count() FROM java_profiler.java_profiler_target_status WHERE namespace='${namespace}' AND service='${service_name}' AND status_at >= run_start UNION ALL SELECT 'ingestion_batches', count() FROM java_profiler.java_profiler_ingestion_batches WHERE received_at >= run_start UNION ALL SELECT 'profile_samples', count() FROM java_profiler.java_profiler_profile_samples WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'profile_stacks', count() FROM java_profiler.java_profiler_profile_stacks WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'jvm_events', count() FROM java_profiler.java_profiler_jvm_events WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'thread_snapshots', count() FROM java_profiler.java_profiler_thread_snapshots WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'deadlock_events', count() FROM java_profiler.java_profiler_deadlock_events WHERE namespace='${namespace}' AND service='${service_name}' AND created_at >= run_start UNION ALL SELECT 'artifact_index', count() FROM java_profiler.java_profiler_artifact_index WHERE created_at >= run_start" \ >"$artifact_dir/clickhouse-counts.tsv" kubectl -n "$profiler_namespace" exec deploy/clickhouse -- clickhouse-client --query \ @@ -712,9 +736,13 @@ thread_snapshots="$(awk '$1=="thread_snapshots"{print $2}' "$artifact_dir/clickh deadlock_events="$(awk '$1=="deadlock_events"{print $2}' "$artifact_dir/clickhouse-counts.tsv")" target_status="$(awk '$1=="target_status"{print $2}' "$artifact_dir/clickhouse-counts.tsv")" ingestion_batches="$(awk '$1=="ingestion_batches"{print $2}' 
"$artifact_dir/clickhouse-counts.tsv")" -flamegraph_value="$(jq -r '.root.value // 0' "$artifact_dir/backend-flamegraph-cpu.json")" -alloc_flamegraph_value="$(jq -r '.root.value // 0' "$artifact_dir/backend-flamegraph-alloc-bytes.json")" -lock_flamegraph_value="$(jq -r '.root.value // 0' "$artifact_dir/backend-flamegraph-lock-delay.json")" +flamegraph_value="$(json_root_value "$artifact_dir/backend-flamegraph-cpu.json")" +alloc_flamegraph_value="$(json_root_value "$artifact_dir/backend-flamegraph-alloc-bytes.json")" +lock_flamegraph_value="$(json_root_value "$artifact_dir/backend-flamegraph-lock-delay.json")" +wall_flamegraph_value="$(json_root_value "$artifact_dir/backend-flamegraph-wall.json")" +io_flamegraph_value="$(json_root_value "$artifact_dir/backend-flamegraph-io-wait.json")" +gc_event_count="$(jq -r '.events | length' "$artifact_dir/backend-jvm-events-gc.json")" +jvm_events="$(awk '$1=="jvm_events"{print $2}' "$artifact_dir/clickhouse-counts.tsv")" read -r dropped_sample_count dropped_stack_count max_truncated max_batch_sample_count accepted_profile_batches rejected_profile_batches <"$artifact_dir/clickhouse-profile-ingestion-metadata.tsv" [[ "${target_status:-0}" -gt 0 ]] || fail "ClickHouse target_status is empty" @@ -727,6 +755,24 @@ else gap "non-empty CPU profile path is not proven: profile_samples=${profile_samples:-0}, profile_stacks=${profile_stacks:-0}, flamegraph.root.value=${flamegraph_value:-0}" fi +if [[ "${wall_flamegraph_value:-0}" -gt 0 ]]; then + pass "non-empty Wall Clock stack path is working" +else + gap "non-empty Wall Clock stack path is not proven: wall flamegraph.root.value=${wall_flamegraph_value:-0}" +fi + +if [[ "${io_flamegraph_value:-0}" -gt 0 ]]; then + pass "non-empty Java I/O wait stack path is working" +else + gap "non-empty Java I/O wait stack path is not proven: io flamegraph.root.value=${io_flamegraph_value:-0}" +fi + +if [[ "${gc_event_count:-0}" -gt 0 && "${jvm_events:-0}" -gt 0 ]]; then + pass "non-empty JVM GC event 
path is working" +else + gap "non-empty JVM GC event path is not proven: jvm_events=${jvm_events:-0}, gc_events=${gc_event_count:-0}" +fi + if [[ "${alloc_flamegraph_value:-0}" -gt 0 ]]; then pass "non-empty allocation stack path is working" else diff --git a/web/index.html b/web/index.html index 098861f..71dbfe6 100644 --- a/web/index.html +++ b/web/index.html @@ -4,6 +4,9 @@ Java Profiler + + +
diff --git a/web/src/api/client.ts b/web/src/api/client.ts index c0d858c..142a48f 100644 --- a/web/src/api/client.ts +++ b/web/src/api/client.ts @@ -1,4 +1,4 @@ -import type { DeadlockEvent, FlamegraphResponse, IngestionHealth, TargetStatus, ThreadDiagnosis, TopStackRow } from "./types"; +import type { DeadlockEvent, FlamegraphResponse, IngestionHealth, JVMEventEvidence, TargetStatus, ThreadDiagnosis, TopStackRow } from "./types"; const apiBase = import.meta.env.VITE_API_BASE ?? ""; @@ -33,3 +33,7 @@ export function getTargetStatus(params: URLSearchParams) { export function getIngestionHealth() { return getJSON<IngestionHealth>("/api/ui/v1/ingestion"); } + +export function getJVMEvents(params: URLSearchParams) { + return getJSON<JVMEventEvidence>(`/api/ui/v1/jvm-events?${params}`); +} diff --git a/web/src/api/types.ts b/web/src/api/types.ts index 8c90ce9..a8b2f7c 100644 --- a/web/src/api/types.ts +++ b/web/src/api/types.ts @@ -3,14 +3,25 @@ export type ProfileType = | "java_allocation_bytes" | "java_allocation_objects" | "java_lock_contention_count" - | "java_lock_delay_nanoseconds"; + | "java_lock_delay_nanoseconds" + | "java_wall_clock_nanoseconds" + | "java_io_wait_nanoseconds"; export type FlamegraphNode = { name: string; value: number; + display_value?: string; children?: FlamegraphNode[]; }; +export type ProfileValueSemantics = { + value_unit: string; + display_unit: string; + percent_basis: string; + baseline_description: string; + window_seconds?: number; +}; + export type PartialMetadata = { partial: boolean; reasons?: string[]; @@ -21,6 +32,7 @@ export type PartialMetadata = { export type FlamegraphResponse = { root: FlamegraphNode; metadata: PartialMetadata; + semantics?: ProfileValueSemantics; }; export type TopStackRow = { @@ -29,8 +41,11 @@ export type TopStackRow = { profile_type: ProfileType | string; self: number; total: number; + self_display?: string; + total_display?: string; self_percent: string; total_percent: string; + semantics?: ProfileValueSemantics; }; export type
TargetStatusReason = @@ -88,3 +103,22 @@ export type IngestionHealth = { }>; partial: boolean; }; + +export type JVMEvent = { + event_id: string; + batch_id?: string; + target?: { namespace?: string; service?: string; pod?: string; container?: string; process_id?: number }; + event_type: string; + event_at: string; + duration_ns: number; + collector?: string; + action?: string; + cause?: string; + message?: string; + stack_frames?: string[]; +}; + +export type JVMEventEvidence = { + events: JVMEvent[]; + partial: boolean; +}; diff --git a/web/src/app.test.tsx b/web/src/app.test.tsx index 36f8858..52d6fd7 100644 --- a/web/src/app.test.tsx +++ b/web/src/app.test.tsx @@ -1,4 +1,4 @@ -import { fireEvent, render, screen } from "@testing-library/react"; +import { render, screen } from "@testing-library/react"; import { beforeEach, test, vi } from "vitest"; import { App } from "./app"; @@ -10,14 +10,8 @@ beforeEach(() => { vi.clearAllMocks(); }); -test("left navigation changes the active diagnosis view", () => { +test("renders the Java profiler workbench with CPU as the initial view", () => { render(<App />); expect(screen.getByText("active-view:cpu")).toBeInTheDocument(); - - fireEvent.click(screen.getByLabelText("Service status")); - expect(screen.getByText("active-view:status")).toBeInTheDocument(); - - fireEvent.click(screen.getByLabelText("Allocation profiles")); - expect(screen.getByText("active-view:memory")).toBeInTheDocument(); }); diff --git a/web/src/app.tsx b/web/src/app.tsx index 674ac87..a1e3201 100644 --- a/web/src/app.tsx +++ b/web/src/app.tsx @@ -1,53 +1,14 @@ import { useState } from "react"; -import { Activity, AlertTriangle, Cpu, Database, Flame, LockKeyhole } from "lucide-react"; import { ServiceOverview } from "./routes/service-overview"; -import type { ReactNode } from "react"; -export type DiagnosisView = "memory" | "cpu" | "locks" | "deadlocks" | "status" | "ingestion"; - -const navigationItems: Array<{ - view: DiagnosisView; - label: string; - icon:
ReactNode; -}> = [ - { view: "status", label: "Service status", icon: }, - { view: "cpu", label: "CPU profiles", icon: }, - { view: "memory", label: "Allocation profiles", icon: }, - { view: "locks", label: "Lock diagnosis", icon: }, - { view: "deadlocks", label: "Deadlock diagnosis", icon: }, - { view: "ingestion", label: "Ingestion health", icon: }, -]; +export type DiagnosisView = "memory" | "cpu" | "wall" | "io" | "gc" | "locks" | "deadlocks" | "status" | "ingestion"; export function App() { const [activeView, setActiveView] = useState<DiagnosisView>("cpu"); return (
- -
-
-
-

Kubernetes Java profiling

-

Service diagnosis

-
-
- -
+
); } diff --git a/web/src/features/cpu/cpu-view.tsx b/web/src/features/cpu/cpu-view.tsx index c91931b..a6ccafc 100644 --- a/web/src/features/cpu/cpu-view.tsx +++ b/web/src/features/cpu/cpu-view.tsx @@ -9,11 +9,19 @@ export function CpuView({ params }: { params: URLSearchParams }) { const { data: topRows, error: topRowsError } = useAPI(() => getTopStacks(params), [params.toString()], []); const root = data?.root ?? fallback.root; const usableTopRows = !topRowsError && topRows && topRows.length > 0 ? topRows : undefined; + const profileWindow = getProfileWindow(params); return (
{error &&

Backend unavailable: {error}

} {topRowsError &&

Top table unavailable: {topRowsError}

} - +
); } + +function getProfileWindow(params: URLSearchParams) { + const start = Date.parse(params.get("start") ?? ""); + const end = Date.parse(params.get("end") ?? ""); + if (!Number.isFinite(start) || !Number.isFinite(end) || end <= start) return undefined; + return { start: new Date(start), end: new Date(end), durationMs: end - start }; +} diff --git a/web/src/features/cpu/hot-code-view.test.tsx b/web/src/features/cpu/hot-code-view.test.tsx index a4d26cb..6ca6505 100644 --- a/web/src/features/cpu/hot-code-view.test.tsx +++ b/web/src/features/cpu/hot-code-view.test.tsx @@ -74,7 +74,8 @@ test("renders top table and flame graph in both mode", () => { render(); const analysis = screen.getByLabelText("CPU profile analysis"); - expect(within(analysis).getByRole("heading", { name: "CPU profile" })).toBeInTheDocument(); + expect(within(analysis).getByRole("heading", { name: "Single Pod CPU profile" })).toBeInTheDocument(); + expect(screen.getByLabelText("CPU profile units")).toHaveTextContent("CPU time"); expect(screen.getByRole("region", { name: "Top table" })).toBeInTheDocument(); expect(screen.getByRole("region", { name: "Flamegraph" })).toBeInTheDocument(); expect(screen.getByRole("columnheader", { name: "Self CPU" })).toBeInTheDocument(); @@ -87,7 +88,7 @@ test("renders top table and flame graph in both mode", () => { expect(screen.getByPlaceholderText("Search frame")).toHaveValue(""); expect(screen.getByRole("button", { name: /DemoHttpService\.handleWork:93/ })).toHaveClass("flame-row-match"); expect(screen.getByRole("button", { name: /so\.6/ })).not.toHaveClass("flame-row-dimmed"); - expect(screen.getByText(/High total, low self: start from DemoHttpService\.handleWork/)).toBeInTheDocument(); + expect(screen.getByLabelText("Selected Java frame")).toHaveTextContent("DemoHttpService.handleWork"); }); test("renders backend top rows with self and total CPU values", () => { @@ -119,8 +120,8 @@ test("renders backend top rows with self and total CPU values", () => { ); 
const handleRow = screen.getByRole("row", { name: /DemoHttpService\.handleWork/ }); - expect(handleRow).toHaveTextContent("0 0.0%"); - expect(handleRow).toHaveTextContent("10 100.0%"); + expect(handleRow).toHaveTextContent("0 ns 0.0%"); + expect(handleRow).toHaveTextContent("10 ns 100.0%"); }); test("selecting a backend top row highlights matching flame graph frames", () => { @@ -213,14 +214,13 @@ test("reset clears search and selected top table row state", () => { const topTable = screen.getByRole("region", { name: "Top table" }); fireEvent.click(within(topTable).getByRole("button", { name: /DemoHttpService\.handleWork/ })); - expect(screen.getByText(/High total, low self: start from DemoHttpService\.handleWork/)).toBeInTheDocument(); + expect(screen.getByLabelText("Selected Java frame")).toHaveTextContent("DemoHttpService.handleWork"); expect(within(topTable).getByRole("row", { name: /DemoHttpService\.handleWork/ })).toHaveClass("active"); fireEvent.change(screen.getByLabelText("Search flamegraph frames"), { target: { value: "handlework" } }); fireEvent.click(screen.getByRole("button", { name: "Reset" })); expect(screen.getByLabelText("Search flamegraph frames")).toHaveValue(""); - expect(screen.queryByText(/High total, low self: start from DemoHttpService\.handleWork/)).not.toBeInTheDocument(); expect(within(topTable).getByRole("row", { name: /DemoHttpService\.handleWork/ })).not.toHaveClass("active"); }); diff --git a/web/src/features/cpu/hot-code-view.tsx b/web/src/features/cpu/hot-code-view.tsx index 7876076..6437e36 100644 --- a/web/src/features/cpu/hot-code-view.tsx +++ b/web/src/features/cpu/hot-code-view.tsx @@ -6,6 +6,13 @@ type Props = { root: FlamegraphNode; metadata?: PartialMetadata; topRows?: TopStackRow[]; + profileWindow?: { start: Date; end: Date; durationMs: number }; + profileType?: string; + title?: string; + description?: string; + valueLabel?: string; + selfColumnLabel?: string; + totalColumnLabel?: string; }; type HotFrame = { @@ -16,6 
+23,8 @@ type HotFrame = { line?: number; self: number; total: number; + selfDisplay?: string; + totalDisplay?: string; selfPercent: number | string; totalPercent: number | string; }; @@ -23,19 +32,19 @@ type HotFrame = { type ViewMode = "top-table" | "flame-graph" | "both"; type SortKey = "total" | "self" | "symbol"; -export function HotCodeView({ root, metadata, topRows }: Props) { +export function HotCodeView({ root, metadata, topRows, profileWindow, profileType = "java_cpu_nanoseconds", title = "Single Pod CPU profile", description = "Top table ranks Java methods by CPU time. Values are rendered from nanoseconds into incident-readable time and average cores.", valueLabel = "CPU time", selfColumnLabel = "Self CPU", totalColumnLabel = "Total CPU" }: Props) { const fallbackFrames = useMemo(() => collectHotJavaFrames(root), [root]); const hotFrames = useMemo(() => (topRows && topRows.length > 0 ? topRows.map(topRowToHotFrame) : fallbackFrames), [fallbackFrames, topRows]); const [selectedName, setSelectedName] = useState(); const [searchQuery, setSearchQuery] = useState(""); const [viewMode, setViewMode] = useState("both"); - const [sortKey, setSortKey] = useState("total"); + const [sortKey, setSortKey] = useState("self"); const visibleFrames = useMemo(() => filterHotFrames(hotFrames, searchQuery), [hotFrames, searchQuery]); const sortedFrames = useMemo(() => sortHotFrames(visibleFrames, sortKey), [visibleFrames, sortKey]); const selected = selectedName ? hotFrames.find((frame) => frame.name === selectedName) : undefined; const fallbackSelected = sortedFrames[0]; const highlightQuery = selected ? selected.fullSymbol || selected.name || selected.symbol : ""; - const insight = selected ? describeHotFrame(selected) : undefined; + const valueFormatter = (value: number) => formatProfileValue(profileType, value, profileWindow); if (hotFrames.length === 0) { return ( @@ -53,8 +62,8 @@ export function HotCodeView({ root, metadata, topRows }: Props) {
-

CPU profile

-

Top table ranks Java symbols by self and total CPU samples. The flame graph shows sampled stack context, not source call order.

+

{title}

+

{description}

@@ -62,22 +71,38 @@ export function HotCodeView({ root, metadata, topRows }: Props) {
+
+
+ Value unit + {valueLabel} +
+
+ Percent basis + Returned CPU profile +
+
+ Window + {profileWindow ? formatWindow(profileWindow) : "selected range"} +
+
- {viewMode !== "flame-graph" && } + {viewMode !== "flame-graph" && } {viewMode !== "top-table" && (
setSelectedName(undefined)} + formatValue={valueFormatter} + valueLabel={valueLabel} />
)}
+
); } @@ -197,6 +222,9 @@ function TopTable({ sortKey, onSort, onSelect, + valueFormatter, + selfColumnLabel, + totalColumnLabel, }: { frames: HotFrame[]; selected?: HotFrame; @@ -204,6 +232,9 @@ function TopTable({ sortKey: SortKey; onSort: (sortKey: SortKey) => void; onSelect: (name: string) => void; + valueFormatter: (value: number) => string; + selfColumnLabel: string; + totalColumnLabel: string; }) { return (
@@ -211,8 +242,8 @@ function TopTable({ - - + + @@ -224,8 +255,8 @@ function TopTable({ {frame.line ? `${frame.className}:${frame.line}` : frame.fullSymbol} - {formatSamples(frame.self)} {formatPercent(frame.selfPercent)} - {formatSamples(frame.total)} {formatPercent(frame.totalPercent)} + {frame.selfDisplay ?? valueFormatter(frame.self)} {formatPercent(frame.selfPercent)} + {frame.totalDisplay ?? valueFormatter(frame.total)} {formatPercent(frame.totalPercent)} ))} @@ -234,10 +265,6 @@ function TopTable({ ); } -function formatSamples(value: number) { - return value.toLocaleString(); -} - function formatPercent(value: number | string) { return typeof value === "string" ? value : `${value.toFixed(1)}%`; } @@ -254,18 +281,80 @@ function topRowToHotFrame(row: TopStackRow): HotFrame { line: parsed?.line, self: row.self, total: row.total, + selfDisplay: row.self_display, + totalDisplay: row.total_display, selfPercent: row.self_percent, totalPercent: row.total_percent, }; } -function describeHotFrame(frame: HotFrame) { - if (frame.total <= 0) return undefined; - if (frame.self > 0 && frame.self / frame.total >= 0.5) { - return `High self CPU: inspect ${frame.symbol}'s own work first.`; - } - if (frame.self === 0 || frame.self / frame.total < 0.5) { - return `High total, low self: start from ${frame.symbol}, then inspect highlighted callees and runtime/native frames.`; +function SelectedFramePanel({ frame, valueFormatter, selfColumnLabel, totalColumnLabel }: { frame?: HotFrame; valueFormatter: (value: number) => string; selfColumnLabel: string; totalColumnLabel: string }) { + if (!frame) return null; + const copyText = `${frame.fullSymbol}${frame.line ? 
`:${frame.line}` : ""}`; + const copyFrame = () => { + if (navigator.clipboard) { + void navigator.clipboard.writeText(copyText); + } + }; + const copyPermalink = () => { + if (navigator.clipboard) { + const url = new URL(window.location.href); + url.searchParams.set("frame", copyText); + void navigator.clipboard.writeText(url.toString()); + } + }; + return ( + + ); +} + +function formatProfileValue(profileType: string, value: number, profileWindow?: { durationMs: number }) { + if (profileType !== "java_cpu_nanoseconds") { + return formatDuration(value); } - return `Mixed CPU cost: inspect both ${frame.symbol} and its callees.`; + if (value <= 0) return "0 ns"; + const duration = formatDuration(value); + if (!profileWindow || profileWindow.durationMs <= 0) return duration; + const cores = value / (profileWindow.durationMs * 1_000_000); + if (cores < 0.01) return duration; + return `${duration} · ${formatCores(cores)}`; +} + +function formatDuration(ns: number) { + if (ns >= 60_000_000_000) return `${(ns / 60_000_000_000).toFixed(1)} min`; + if (ns >= 1_000_000_000) return `${(ns / 1_000_000_000).toFixed(2)} s`; + if (ns >= 1_000_000) return `${(ns / 1_000_000).toFixed(1)} ms`; + if (ns >= 1_000) return `${(ns / 1_000).toFixed(1)} us`; + return `${ns.toLocaleString()} ns`; +} + +function formatCores(cores: number) { + return `${cores >= 1 ? 
cores.toFixed(2) : cores.toFixed(3)} cores`; +} + +function formatWindow(profileWindow: { start: Date; end: Date }) { + return `${profileWindow.start.toISOString().slice(11, 16)}-${profileWindow.end.toISOString().slice(11, 16)} UTC`; } diff --git a/web/src/features/gc/gc-view.tsx b/web/src/features/gc/gc-view.tsx new file mode 100644 index 0000000..b61f77f --- /dev/null +++ b/web/src/features/gc/gc-view.tsx @@ -0,0 +1,63 @@ +import { getFlamegraph, getJVMEvents } from "../../api/client"; +import { useAPI } from "../../api/use-api"; +import type { FlamegraphResponse, JVMEventEvidence } from "../../api/types"; +import { HotCodeView } from "../cpu/hot-code-view"; + +export function GCView({ params }: { params: URLSearchParams }) { + const gcParams = new URLSearchParams(params); + gcParams.set("event_type", "gc_pause"); + const allocationParams = new URLSearchParams(params); + allocationParams.set("profile_type", "java_allocation_bytes"); + const fallbackEvents: JVMEventEvidence = { events: [], partial: false }; + const fallbackProfile: FlamegraphResponse = { root: { name: params.get("service") ?? "service", value: 0, children: [] }, metadata: { partial: false } }; + const { data: events, error: eventsError } = useAPI(() => getJVMEvents(gcParams), [gcParams.toString()], fallbackEvents); + const { data: allocation, error: allocationError } = useAPI(() => getFlamegraph(allocationParams), [allocationParams.toString()], fallbackProfile); + + return ( +
+
+
+

GC pauses

+

GC evidence is JVM-scoped and filtered by the same namespace, service, Pod, and time range as allocation samples.

+
+
+ {eventsError &&

GC event evidence unavailable: {eventsError}

} + {allocationError &&

Allocation correlation unavailable: {allocationError}

} +
+ {(events?.events ?? []).length === 0 ? ( +

No GC pause event evidence in this range.

+ ) : ( + (events?.events ?? []).map((event) => ( +
+
+ {formatDuration(event.duration_ns)} + {event.collector || "JVM GC"} · {event.action || event.event_type} +
+
+ {new Date(event.event_at).toLocaleString()} + {event.cause || "cause unavailable"} +
+
+ )) + )} +
+ +
+ ); +} + +function formatDuration(ns: number) { + if (ns >= 1_000_000_000) return `${(ns / 1_000_000_000).toFixed(2)} s`; + if (ns >= 1_000_000) return `${(ns / 1_000_000).toFixed(1)} ms`; + if (ns >= 1_000) return `${(ns / 1_000).toFixed(1)} us`; + return `${ns} ns`; +} diff --git a/web/src/features/io/io-view.tsx b/web/src/features/io/io-view.tsx new file mode 100644 index 0000000..667c64a --- /dev/null +++ b/web/src/features/io/io-view.tsx @@ -0,0 +1,35 @@ +import { getFlamegraph, getTopStacks } from "../../api/client"; +import type { FlamegraphResponse } from "../../api/types"; +import { useAPI } from "../../api/use-api"; +import { HotCodeView } from "../cpu/hot-code-view"; + +export function IOView({ params }: { params: URLSearchParams }) { + const fallback: FlamegraphResponse = { root: { name: params.get("service") ?? "service", value: 0, children: [] }, metadata: { partial: false } }; + const { data, error } = useAPI(() => getFlamegraph(params), [params.toString()], fallback); + const { data: topRows, error: topRowsError } = useAPI(() => getTopStacks(params), [params.toString()], []); + return ( +
+ {error &&

Backend unavailable: {error}

} + {topRowsError &&

Top table unavailable: {topRowsError}

} + 0 ? topRows : undefined} + profileWindow={profileWindow(params)} + profileType="java_io_wait_nanoseconds" + title="Single Pod I/O wait profile" + description="I/O wait evidence highlights Java socket or file blocking paths when backend evidence can preserve stack ownership." + valueLabel="I/O wait" + selfColumnLabel="Self I/O" + totalColumnLabel="Total I/O" + /> +
+ ); +} + +function profileWindow(params: URLSearchParams) { + const start = Date.parse(params.get("start") ?? ""); + const end = Date.parse(params.get("end") ?? ""); + if (!Number.isFinite(start) || !Number.isFinite(end) || end <= start) return undefined; + return { start: new Date(start), end: new Date(end), durationMs: end - start }; +} diff --git a/web/src/features/wall-clock/wall-clock-view.tsx b/web/src/features/wall-clock/wall-clock-view.tsx new file mode 100644 index 0000000..caf91bb --- /dev/null +++ b/web/src/features/wall-clock/wall-clock-view.tsx @@ -0,0 +1,35 @@ +import { getFlamegraph, getTopStacks } from "../../api/client"; +import type { FlamegraphResponse } from "../../api/types"; +import { useAPI } from "../../api/use-api"; +import { HotCodeView } from "../cpu/hot-code-view"; + +export function WallClockView({ params }: { params: URLSearchParams }) { + const fallback: FlamegraphResponse = { root: { name: params.get("service") ?? "service", value: 0, children: [] }, metadata: { partial: false } }; + const { data, error } = useAPI(() => getFlamegraph(params), [params.toString()], fallback); + const { data: topRows, error: topRowsError } = useAPI(() => getTopStacks(params), [params.toString()], []); + return ( +
+ {error &&

Backend unavailable: {error}

} + {topRowsError &&

Top table unavailable: {topRowsError}

} + 0 ? topRows : undefined} + profileWindow={profileWindow(params)} + profileType="java_wall_clock_nanoseconds" + title="Single Pod Wall Clock profile" + description="Wall Clock shows Java stack time that may be runnable, blocked, waiting, or sleeping. It helps separate latency from pure CPU burn." + valueLabel="Wall time" + selfColumnLabel="Self Wall" + totalColumnLabel="Total Wall" + /> +
+ ); +} + +function profileWindow(params: URLSearchParams) { + const start = Date.parse(params.get("start") ?? ""); + const end = Date.parse(params.get("end") ?? ""); + if (!Number.isFinite(start) || !Number.isFinite(end) || end <= start) return undefined; + return { start: new Date(start), end: new Date(end), durationMs: end - start }; +} diff --git a/web/src/routes/service-overview.tsx b/web/src/routes/service-overview.tsx index 01e335e..a19caf4 100644 --- a/web/src/routes/service-overview.tsx +++ b/web/src/routes/service-overview.tsx @@ -1,6 +1,11 @@ import { useMemo, useState } from "react"; +import { Activity, AlertTriangle, Copy, Cpu, Database, Flame, LockKeyhole, Share2 } from "lucide-react"; +import type { ReactNode } from "react"; import type { ProfileType } from "../api/types"; import { CpuView } from "../features/cpu/cpu-view"; +import { WallClockView } from "../features/wall-clock/wall-clock-view"; +import { IOView } from "../features/io/io-view"; +import { GCView } from "../features/gc/gc-view"; import { MemoryView } from "../features/memory/memory-view"; import { LocksView } from "../features/locks/locks-view"; import { DeadlocksView } from "../features/deadlocks/deadlocks-view"; @@ -8,9 +13,34 @@ import { TargetStatusView } from "../features/status/target-status-view"; import { IngestionHealthView } from "../features/ingestion/ingestion-health-view"; import type { DiagnosisView } from "../app"; -const tabs = ["memory", "cpu", "locks", "deadlocks", "status", "ingestion"] as const; +const tabs = ["memory", "cpu", "wall", "io", "gc", "locks", "deadlocks", "status", "ingestion"] as const; type Tab = (typeof tabs)[number]; +const navigationGroups: Array<{ + label: string; + items: Array<{ view: DiagnosisView; label: string; shortLabel: string; detail: string; icon: ReactNode }>; +}> = [ + { + label: "Profiles", + items: [ + { view: "cpu", label: "CPU profiles", shortLabel: "CPU", detail: "MVP", icon: }, + { view: "wall", label: "Wall Clock profiles", 
shortLabel: "Wall Clock", detail: "phase", icon: }, + { view: "io", label: "I/O wait profiles", shortLabel: "I/O", detail: "phase", icon: }, + { view: "gc", label: "GC pauses", shortLabel: "GC", detail: "events", icon: }, + { view: "memory", label: "Allocation profiles", shortLabel: "Allocation", detail: "later", icon: }, + { view: "locks", label: "Lock diagnosis", shortLabel: "Locks", detail: "later", icon: }, + ], + }, + { + label: "Health", + items: [ + { view: "status", label: "Service status", shortLabel: "Status", detail: "targets", icon: }, + { view: "ingestion", label: "Ingestion health", shortLabel: "Ingestion", detail: "batches", icon: }, + { view: "deadlocks", label: "Deadlock diagnosis", shortLabel: "Deadlocks", detail: "events", icon: }, + ], + }, +]; + type ServiceOverviewProps = { activeView: DiagnosisView; onViewChange: (view: DiagnosisView) => void; @@ -19,7 +49,9 @@ type ServiceOverviewProps = { export function ServiceOverview({ activeView, onViewChange }: ServiceOverviewProps) { const [namespace, setNamespace] = useState("java-profiler-qa"); const [service, setService] = useState("jdk17-http-demo"); + const [pod, setPod] = useState(""); const [rangeMinutes, setRangeMinutes] = useState(60); + const [copyStatus, setCopyStatus] = useState(""); const params = useMemo(() => { const end = new Date(); const start = new Date(end.getTime() - rangeMinutes * 60_000); @@ -30,13 +62,42 @@ export function ServiceOverview({ activeView, onViewChange }: ServiceOverviewPro start: start.toISOString(), end: end.toISOString(), }); + if (pod.trim()) { + value.set("pod", pod.trim()); + } return value; - }, [namespace, service, activeView, rangeMinutes]); + }, [namespace, service, pod, activeView, rangeMinutes]); + const copyContext = async () => { + const context = [ + `view=${activeView}`, + `namespace=${namespace}`, + `service=${service}`, + `pod=${pod.trim() || ""}`, + `range=${rangeMinutes}m`, + `profile_type=${profileTypeFor(activeView)}`, + ].join("\n"); + await 
navigator.clipboard?.writeText(context); + setCopyStatus("Context copied"); + }; + const shareView = async () => { + const url = new URL(window.location.href); + url.searchParams.set("view", activeView); + for (const [key, value] of params.entries()) url.searchParams.set(key, value); + await navigator.clipboard?.writeText(url.toString()); + setCopyStatus("Permalink copied"); + }; return ( -
-
-
+
+
+
+
JVM
+
+ Java Profiler + MVP incident view +
+
+
+ -
- Timezone - UTC - All timestamps are rendered in UTC. -
-

Profiles, thread evidence, target state, and ingestion health stay here. Metric trend charts remain in Prometheus.

-
-
-
-
+ {copyStatus} + + + + +
+
+
+ Collection + CPU profile +
+
+ Target scope + {pod.trim() ? "Single Pod" : "Service query"} +
+
+ Sample rate + 99Hz target +
+
+ Baseline + Pod quota when available +
{activeView === "memory" && } {activeView === "cpu" && } + {activeView === "wall" && } + {activeView === "io" && } + {activeView === "gc" && } {activeView === "locks" && } {activeView === "deadlocks" && } {activeView === "status" && } @@ -87,6 +191,8 @@ export function ServiceOverview({ activeView, onViewChange }: ServiceOverviewPro function profileTypeFor(tab: Tab): ProfileType { if (tab === "cpu") return "java_cpu_nanoseconds"; + if (tab === "wall") return "java_wall_clock_nanoseconds"; + if (tab === "io") return "java_io_wait_nanoseconds"; if (tab === "locks") return "java_lock_delay_nanoseconds"; return "java_allocation_bytes"; } diff --git a/web/src/styles.css b/web/src/styles.css index 412c0bb..987a35f 100644 --- a/web/src/styles.css +++ b/web/src/styles.css @@ -1,28 +1,56 @@ :root { - color: #1b2525; - background: #e7ebe7; - font-family: "Aptos", "Segoe UI", sans-serif; - --surface: #f7f8f4; - --ink: #172120; - --muted: #63706d; - --line: #c5cec8; - --accent: #0c7d7d; - --risk: #b7352d; - --warn: #a86705; + color: #172026; + background: #f7f8fa; + font-family: "IBM Plex Sans", system-ui, sans-serif; + --surface: #ffffff; + --surface-muted: #eef1f4; + --ink: #172026; + --muted: #5d6975; + --line: #d5dae1; + --line-strong: #b6bec8; + --accent: #0f766e; + --cpu: #c2410c; + --cpu-soft: rgba(194,65,12,.14); + --risk: #b42318; + --warn: #b45309; + --success: #15803d; } * { box-sizing: border-box; } +.sr-only { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border: 0; } body { margin: 0; min-width: 320px; min-height: 100vh; } button, input, select { font: inherit; } button { cursor: pointer; } button:focus-visible, input:focus-visible { outline: 2px solid var(--accent); outline-offset: 2px; } -.app-shell { display: grid; grid-template-columns: 56px 1fr; min-height: 100vh; } -.rail { background: #162221; color: white; display: flex; flex-direction: column; align-items: center; 
gap: 12px; padding: 16px 8px; } -.brand { width: 34px; height: 34px; display: grid; place-items: center; background: #c8f3e5; color: #14211f; font-weight: 800; } +.app-shell { min-height: 100vh; } +.workbench-shell { min-height: 100vh; display: grid; grid-template-columns: 188px minmax(680px, 1fr); grid-template-rows: auto 1fr; } +.workbench-topbar { grid-column: 1 / -1; min-height: 56px; display: grid; grid-template-columns: 226px minmax(0, 1fr) auto; gap: 12px; align-items: center; padding: 10px 14px; border-bottom: 1px solid var(--line); background: var(--surface); position: sticky; top: 0; z-index: 5; } +.workbench-brand { display: flex; align-items: center; gap: 10px; min-width: 0; } +.brand-mark { width: 28px; height: 28px; display: grid; place-items: center; border: 1px solid var(--line-strong); border-radius: 6px; background: var(--surface-muted); color: var(--accent); font-family: "JetBrains Mono", monospace; font-size: 12px; font-weight: 700; } +.workbench-brand strong { display: block; font-size: 14px; line-height: 1.05; } +.workbench-brand span { display: block; margin-top: 3px; color: var(--muted); font-size: 11px; } +.workbench-actions { display: flex; align-items: center; gap: 8px; } +.workbench-actions button { height: 32px; display: inline-flex; align-items: center; justify-content: center; gap: 6px; border: 1px solid var(--line); border-radius: 6px; background: var(--surface); color: var(--ink); padding: 0 10px; font-size: 12px; white-space: nowrap; } +.workbench-actions .primary-action { border-color: var(--accent); background: var(--accent); color: #fff; } +.workbench-side { border-right: 1px solid var(--line); background: var(--surface); padding: 12px; overflow: auto; } +.nav-group { margin-bottom: 12px; } +.side-title { color: var(--muted); font-size: 11px; font-weight: 700; letter-spacing: .04em; text-transform: uppercase; margin: 6px 8px 8px; } +.workbench-nav-item { width: 100%; height: 36px; border: 1px solid transparent; border-radius: 
6px; background: transparent; color: var(--ink); display: flex; align-items: center; justify-content: space-between; gap: 8px; padding: 0 9px; margin-bottom: 4px; font-size: 13px; } +.workbench-nav-item.active { border-color: color-mix(in srgb, var(--accent) 26%, var(--line)); background: rgba(15,118,110,.12); color: var(--accent); font-weight: 700; } +.nav-label { display: inline-flex; align-items: center; gap: 7px; min-width: 0; } +.nav-count { color: var(--muted); font-family: "IBM Plex Mono", ui-monospace, monospace; font-size: 11px; } +.scope-card { border: 1px solid var(--line); border-radius: 8px; background: var(--surface-muted); padding: 10px; margin-top: 12px; } +.scope-row { display: flex; justify-content: space-between; gap: 8px; padding: 5px 0; border-bottom: 1px solid var(--line); font-size: 12px; } +.scope-row span { color: var(--muted); } +.scope-row strong { font-size: 12px; } +.scope-card p { margin: 9px 0 0; color: var(--muted); font-size: 12px; line-height: 1.45; } +.evidence-main { min-width: 0; overflow: auto; padding: 14px; } +.rail { background: #101214; color: white; display: flex; flex-direction: column; align-items: center; gap: 12px; padding: 16px 8px; } +.brand { width: 34px; height: 34px; display: grid; place-items: center; background: var(--surface-muted); color: var(--accent); font-weight: 800; border: 1px solid var(--line-strong); border-radius: 6px; font-family: "JetBrains Mono", monospace; } .rail button { width: 36px; height: 36px; border: 0; background: transparent; color: #dbe6df; border-radius: 8px; display: grid; place-items: center; } -.rail button:hover { background: #243635; } -.rail button.active { background: #c8f3e5; color: #14211f; } +.rail button:hover { background: #20262b; } +.rail button.active { background: #e7eaee; color: #101214; } .workspace { padding: 14px; } .topbar { display: flex; justify-content: space-between; gap: 12px; align-items: center; margin-bottom: 12px; } .topbar-copy { display: grid; gap: 2px; } 
@@ -31,22 +59,29 @@ h1, h2, h3, p { margin-top: 0; } .topbar h1 { font-size: 26px; margin-bottom: 0; } .context-panel input, .context-panel select, .flamegraph-tools input { border: 0; background: transparent; color: var(--ink); width: 100%; } .service-layout { display: grid; gap: 12px; } -.context-strip, .diagnosis-panel { background: var(--surface); border: 1px solid var(--line); padding: 11px; } +.context-strip, .diagnosis-panel { background: var(--surface); border: 1px solid var(--line); padding: 11px; border-radius: 8px; } .context-strip { display: grid; gap: 8px; } -.context-fields { display: grid; grid-template-columns: minmax(0, 1.15fr) minmax(0, 1.15fr) minmax(220px, 0.95fr) minmax(190px, 0.95fr); gap: 10px; align-items: stretch; } +.context-fields { display: grid; grid-template-columns: minmax(0, 1.1fr) minmax(0, 1.1fr) minmax(220px, 1fr) minmax(160px, .72fr); gap: 10px; align-items: stretch; } +.context-fields-topbar { display: grid; grid-template-columns: minmax(170px, 1fr) minmax(170px, 1fr) minmax(210px, 1fr) minmax(140px, .65fr); gap: 8px; min-width: 0; overflow: auto; scrollbar-width: thin; } .context-field, .context-chip { display: grid; gap: 4px; } .context-fields span, .context-chip span { color: var(--muted); text-transform: uppercase; font-size: 11px; letter-spacing: .06em; } -.context-fields input, .context-fields select, .flamegraph-tools input { border: 1px solid var(--line); background: #fff; color: var(--ink); width: 100%; min-height: 44px; padding: 10px 11px; } +.context-fields input, .context-fields select, .flamegraph-tools input { border: 1px solid var(--line); background: #fff; color: var(--ink); width: 100%; min-height: 32px; padding: 7px 9px; border-radius: 6px; } .context-range select { appearance: none; padding-right: 38px; background-image: linear-gradient(45deg, transparent 50%, var(--muted) 50%), linear-gradient(135deg, var(--muted) 50%, transparent 50%); background-position: calc(100% - 18px) calc(50% + 1px), calc(100% - 12px) 
calc(50% + 1px); background-size: 6px 6px, 6px 6px; background-repeat: no-repeat; } .context-chip { background: linear-gradient(180deg, #eef2ec 0%, #e7ece7 100%); border: 1px solid var(--line); padding: 10px 11px; align-content: center; box-shadow: inset 0 1px 0 rgba(255,255,255,.65); } .context-timezone { grid-template-rows: auto auto auto; align-content: center; } .context-chip strong { font-size: 14px; line-height: 1.1; } .context-timezone small { color: var(--muted); font-size: 12px; line-height: 1.35; } .scope-note { color: var(--muted); line-height: 1.4; font-size: 13px; } +.evidence-health-strip, .cpu-unit-strip { display: grid; grid-template-columns: repeat(4, minmax(150px, 1fr)); gap: 8px; margin-bottom: 12px; } +.cpu-unit-strip { grid-template-columns: repeat(3, minmax(150px, 1fr)); margin: 0 0 8px; } +.health-chip, .cpu-unit-strip div { border: 1px solid var(--line); border-radius: 8px; background: var(--surface-muted); padding: 8px 10px; min-height: 54px; display: grid; gap: 4px; align-content: center; } +.health-chip span, .cpu-unit-strip span { color: var(--muted); text-transform: uppercase; font-size: 11px; letter-spacing: .04em; } +.health-chip strong, .cpu-unit-strip strong { font-size: 13px; line-height: 1.2; } +.health-chip-ok { border-color: color-mix(in srgb, var(--success) 34%, var(--line)); background: rgba(21,128,61,.12); color: var(--success); } .muted { color: var(--muted); } .tab-row { display: flex; justify-content: space-between; gap: 12px; align-items: center; margin-bottom: 8px; } .tabs { display: flex; gap: 6px; flex-wrap: wrap; } -.tabs button { border: 1px solid var(--line); background: transparent; color: var(--ink); padding: 6px 9px; text-transform: capitalize; } +.tabs button { border: 1px solid var(--line); background: transparent; color: var(--ink); padding: 6px 9px; text-transform: capitalize; border-radius: 6px; } .tabs button.active { background: var(--ink); color: white; border-color: var(--ink); } .diagnosis-content { 
min-width: 0; } .profile-analysis-wide { display: grid; gap: 8px; } @@ -54,11 +89,16 @@ h1, h2, h3, p { margin-top: 0; } .flamegraph { min-width: 0; } .flamegraph-tools { display: flex; gap: 6px; margin-bottom: 7px; align-items: center; } .flamegraph-tools input { flex: 1 1 auto; min-width: 160px; } -.flamegraph-tools button { border: 1px solid var(--line); background: #fff; padding: 6px 9px; } +.flamegraph-tools button { border: 1px solid var(--line); background: #fff; padding: 6px 9px; border-radius: 6px; } +.flamegraph-tools button.active { border-color: var(--accent); background: rgba(15,118,110,.12); color: var(--accent); font-weight: 700; } .flamegraph-tools button:disabled { color: var(--muted); background: #edf1ed; cursor: default; } .flamegraph-mode { color: var(--muted); font-size: 12px; line-height: 1.35; margin: 0 0 7px; } .warning { color: var(--warn); background: #fff4d7; padding: 6px 9px; border-left: 3px solid var(--warn); } -.profile-analysis { min-width: 0; } +.profile-analysis { min-width: 0; display: grid; grid-template-columns: minmax(0, 1fr) 320px; gap: 10px; align-items: start; } +.profile-analysis > .profile-toolbar, +.profile-analysis > .cpu-unit-strip, +.profile-analysis > .profile-grid, +.profile-analysis > .profile-stack { grid-column: 1; } .profile-toolbar { display: flex; justify-content: space-between; gap: 12px; align-items: start; border-bottom: 1px solid var(--line); padding-bottom: 7px; margin-bottom: 8px; } .profile-toolbar-compact { align-items: center; } .profile-toolbar-tight { border-bottom: 0; padding-bottom: 0; margin-bottom: 0; } @@ -66,32 +106,33 @@ h1, h2, h3, p { margin-top: 0; } .profile-toolbar h2 { font-size: 17px; margin-bottom: 0; } .profile-toolbar p { color: var(--muted); margin: 0; line-height: 1.35; font-size: 12px; max-width: 72ch; } .profile-insight { margin: 0 0 7px; border-left: 3px solid var(--accent); background: #eef5f1; color: #253532; padding: 6px 9px; line-height: 1.35; font-size: 12px; } 
-.profile-view-toggle { display: inline-flex; border: 1px solid var(--line); background: #fff; } +.profile-view-toggle { display: inline-flex; border: 1px solid var(--line); background: var(--surface-muted); border-radius: 7px; padding: 3px; } .profile-view-toggle button { border: 0; border-right: 1px solid var(--line); background: #fff; padding: 6px 9px; white-space: nowrap; } .profile-view-toggle button:last-child { border-right: 0; } -.profile-view-toggle button.active { background: var(--ink); color: #fff; } +.profile-view-toggle button.active { background: var(--accent); color: #fff; } .profile-grid { display: grid; grid-template-columns: minmax(220px, .46fr) minmax(0, 1.54fr); gap: 10px; align-items: start; } .profile-grid-wide { align-items: start; } .profile-stack { display: grid; gap: 12px; } .profile-flamegraph { min-width: 0; } -.top-table-wrap { border: 1px solid var(--line); background: #fff; overflow: auto; max-height: 520px; } +.top-table-wrap { border: 1px solid var(--line); background: #fff; overflow: auto; max-height: 520px; border-radius: 8px; } .top-table { min-width: 0; table-layout: fixed; } .top-table th:nth-child(1), .top-table td:nth-child(1) { width: 50%; } .top-table th:nth-child(2), .top-table td:nth-child(2), .top-table th:nth-child(3), .top-table td:nth-child(3) { width: 25%; text-align: right; font-variant-numeric: tabular-nums; } .top-table th button { border: 0; background: transparent; color: var(--muted); padding: 0; font: inherit; text-transform: uppercase; } .top-table th button.active { color: var(--ink); } +.top-table tbody tr:nth-child(even) td { background: color-mix(in srgb, var(--surface-muted) 54%, transparent); } .top-table tr.active { background: #eef5f1; box-shadow: inset 3px 0 0 var(--accent); } .top-table td { vertical-align: middle; } .top-table td button { width: 100%; border: 0; background: transparent; color: var(--ink); padding: 0; text-align: left; } -.top-table td button span { display: block; font-family: 
ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; font-size: 13px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; } +.top-table td button span { display: block; font-family: "JetBrains Mono", ui-monospace, monospace; font-size: 12px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; } .top-table td button small, .top-table td small { display: block; color: var(--muted); margin-top: 3px; line-height: 1.25; } -.flamegraph-stack { position: relative; min-width: 0; max-height: 560px; overflow: auto; border: 1px solid var(--line); background: #edf2ed; } +.flamegraph-stack { position: relative; min-width: 0; max-height: 560px; overflow: auto; border: 1px solid var(--line); background: var(--surface-muted); border-radius: 8px; } .flamegraph-legend { display: flex; flex-wrap: wrap; gap: 10px; margin: 0 0 6px; color: var(--muted); font-size: 11px; } .flamegraph-legend span { display: inline-flex; align-items: center; gap: 5px; } .flamegraph-legend i { width: 12px; height: 8px; border: 1px solid rgba(8,42,40,.25); display: inline-block; } -.legend-application, .flame-row-application { background: #c9efe7; } -.legend-runtime, .flame-row-runtime { background: #e1eaff; } -.legend-native, .flame-row-native { background: #f4ead1; color: #44504c; } +.legend-application, .flame-row-application { background: var(--cpu-soft); } +.legend-runtime, .flame-row-runtime { background: rgba(37,99,235,.12); } +.legend-native, .flame-row-native { background: rgba(109,93,0,.14); color: #44504c; } .focus-breadcrumb { display: flex; align-items: center; flex-wrap: wrap; gap: 5px; margin: 0 0 6px; color: var(--muted); font-size: 11px; } .focus-breadcrumb span { text-transform: uppercase; font-weight: 700; } .focus-breadcrumb code { padding: 3px 5px; border: 1px solid var(--line); background: #fff; color: var(--ink); font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; } @@ -116,16 +157,35 @@ h1, h2, h3, p { margin-top: 0; } 
.flame-row-selected { box-shadow: inset 0 0 0 2px #102220; z-index: 2; } .flame-row-tiny { padding-inline: 0; } .flame-row-tiny .flame-frame { opacity: 0; } -.flame-frame { min-width: 0; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; text-align: left; font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; font-size: 12px; } +.flame-frame { min-width: 0; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; text-align: left; font-family: "JetBrains Mono", ui-monospace, monospace; font-size: 11px; } .flame-value { flex: 0 0 auto; font-size: 11px; font-variant-numeric: tabular-nums; color: #364744; } .flamegraph-detail { margin-top: 8px; border: 1px solid var(--line); background: #fff; padding: 8px; display: grid; grid-template-columns: minmax(0, 1fr); gap: 8px; align-items: start; } .flamegraph-detail span, .flamegraph-detail dt { display: block; color: var(--muted); font-size: 12px; text-transform: uppercase; } -.flamegraph-detail code { display: block; margin-top: 5px; color: #102220; font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; font-size: 13px; line-height: 1.45; overflow-wrap: anywhere; } +.flamegraph-detail code { display: block; margin-top: 5px; color: #102220; font-family: "JetBrains Mono", ui-monospace, monospace; font-size: 12px; line-height: 1.45; overflow-wrap: anywhere; } +.selected-frame-panel { grid-column: 2; grid-row: 1 / span 3; position: sticky; top: 76px; border: 1px solid var(--line); border-radius: 8px; background: #fff; padding: 10px; display: grid; grid-template-columns: minmax(0, 1fr); gap: 10px; align-items: start; } +.selected-frame-panel > div:first-child { min-width: 0; } +.selected-frame-panel span, .selected-frame-panel dt { color: var(--muted); font-size: 11px; text-transform: uppercase; letter-spacing: .04em; } +.selected-frame-panel strong { display: block; margin-top: 4px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap; font-family: "JetBrains Mono", 
ui-monospace, monospace; font-size: 12px; } +.selected-frame-panel dl { display: flex; flex-wrap: wrap; gap: 8px; margin: 0; } +.selected-frame-panel dl div { min-width: 96px; background: var(--surface-muted); padding: 7px 8px; border-radius: 6px; } +.selected-frame-panel dd { margin: 2px 0 0; font-weight: 700; font-variant-numeric: tabular-nums; } +.selected-frame-panel dd span { color: var(--muted); margin-left: 3px; text-transform: none; letter-spacing: 0; } +.selected-frame-actions { display: flex; gap: 8px; flex-wrap: wrap; } +.selected-frame-panel button { border: 1px solid var(--accent); background: var(--accent); color: #fff; padding: 7px 10px; border-radius: 6px; } +.selected-frame-panel button + button { background: #fff; color: var(--accent); } .flamegraph-detail dl { display: flex; gap: 8px; flex-wrap: wrap; margin: 0; } .flamegraph-detail dl div { min-width: 80px; background: #eef3ef; padding: 7px 8px; } .flamegraph-detail dd { margin: 2px 0 0; font-weight: 800; font-variant-numeric: tabular-nums; } .flamegraph-empty { border: 1px dashed var(--line); background: #fff; color: var(--muted); padding: 16px; line-height: 1.45; } .event-card { border-left: 4px solid var(--risk); background: #fff; padding: 12px; margin-bottom: 10px; } +.gc-evidence { display: grid; gap: 12px; } +.gc-event-list { border: 1px solid var(--line); border-radius: 8px; background: #fff; overflow: hidden; } +.gc-event-row { display: grid; grid-template-columns: minmax(180px, .7fr) minmax(0, 1fr); gap: 12px; padding: 10px 12px; border-bottom: 1px solid var(--line); } +.gc-event-row:last-child { border-bottom: 0; } +.gc-event-row:nth-child(even) { background: var(--surface-muted); } +.gc-event-row div { min-width: 0; display: grid; gap: 4px; } +.gc-event-row strong { font-variant-numeric: tabular-nums; } +.gc-event-row span { color: var(--muted); overflow: hidden; text-overflow: ellipsis; white-space: nowrap; } pre { overflow: auto; background: #172120; color: #dff4ed; padding: 12px; } 
table { width: 100%; border-collapse: collapse; } th, td { text-align: left; border-bottom: 1px solid var(--line); padding: 10px 8px; } @@ -158,12 +218,23 @@ th { color: var(--muted); font-size: 12px; text-transform: uppercase; } @media (max-width: 820px) { .app-shell { grid-template-columns: 1fr; } + .workbench-shell { grid-template-columns: 1fr; } + .workbench-topbar { grid-template-columns: 1fr; position: static; } + .workbench-side { border-right: 0; border-bottom: 1px solid var(--line); } + .evidence-main { padding: 12px; } .rail { flex-direction: row; justify-content: center; } - .topbar, .service-layout, .context-fields, .profile-grid, .view-grid { grid-template-columns: 1fr; display: grid; } + .topbar, .service-layout, .context-fields, .context-fields-topbar, .profile-grid, .view-grid { grid-template-columns: 1fr; display: grid; } .workspace { padding: 16px; } - .profile-toolbar { grid-template-columns: 1fr; display: grid; } + .profile-analysis, .profile-toolbar { grid-template-columns: 1fr; display: grid; } + .profile-analysis > .profile-toolbar, + .profile-analysis > .cpu-unit-strip, + .profile-analysis > .profile-grid, + .profile-analysis > .profile-stack, + .selected-frame-panel { grid-column: 1; grid-row: auto; position: static; } .flamegraph-detail { grid-template-columns: 1fr; } .flamegraph-tooltip { grid-template-columns: 1fr; } .flamegraph-tooltip button { grid-column: 1; grid-row: auto; justify-self: start; align-self: auto; } .context-fields { grid-template-columns: 1fr; } + .evidence-health-strip, .cpu-unit-strip { grid-template-columns: 1fr; } + .selected-frame-panel { grid-template-columns: 1fr; } } diff --git a/web/src/visualization/flamegraph.test.tsx b/web/src/visualization/flamegraph.test.tsx index e1f0c5e..294c6cc 100644 --- a/web/src/visualization/flamegraph.test.tsx +++ b/web/src/visualization/flamegraph.test.tsx @@ -37,7 +37,7 @@ test("inspects, searches, and zooms nested frames while keeping long labels read const inspector = 
screen.getByRole("status"); expect(within(inspector).getByText("Native/system")).toBeInTheDocument(); expect(within(inspector).getByText("libjvm.so.VeryLongNativeFrameNameThatWillNeedEllipsis")).toBeInTheDocument(); - expect(within(inspector).getByText("Total CPU")).toBeInTheDocument(); + expect(within(inspector).getByText("Total Samples")).toBeInTheDocument(); expect(within(inspector).getByText("Self CPU")).toBeInTheDocument(); fireEvent.click(nativeFrame); @@ -194,6 +194,28 @@ test("keeps flamegraph context when no frames match the search", () => { expect(screen.getByText(/Search highlights matching frames/)).toBeInTheDocument(); }); +test("hides native and runtime frames without removing application frames", () => { + render( + , + ); + + fireEvent.click(screen.getByRole("button", { name: "Hide Native" })); + + expect(screen.queryByRole("button", { name: /libc\.so/ })).not.toBeInTheDocument(); + expect(screen.queryByRole("button", { name: /Thread\.run/ })).not.toBeInTheDocument(); + expect(screen.getByRole("button", { name: /Checkout\.handle/ })).toBeInTheDocument(); +}); + test("shows a custom empty state when there are no samples", () => { render(); diff --git a/web/src/visualization/flamegraph.tsx b/web/src/visualization/flamegraph.tsx index 9ef34ab..5a17d7d 100644 --- a/web/src/visualization/flamegraph.tsx +++ b/web/src/visualization/flamegraph.tsx @@ -10,6 +10,8 @@ type Props = { searchQuery?: string; onSearchQueryChange?: (query: string) => void; onReset?: () => void; + formatValue?: (value: number) => string; + valueLabel?: string; }; type Frame = FlamegraphNode & { @@ -27,12 +29,13 @@ type Frame = FlamegraphNode & { type FrameCategory = "application" | "runtime" | "native"; -export function Flamegraph({ root, metadata, emptyMessage = "No profile samples returned for this service and time range.", highlightQuery = "", insight, searchQuery, onSearchQueryChange, onReset }: Props) { +export function Flamegraph({ root, metadata, emptyMessage = "No profile 
samples returned for this service and time range.", highlightQuery = "", insight, searchQuery, onSearchQueryChange, onReset, formatValue = defaultFormatValue, valueLabel = "Samples" }: Props) { const [internalQuery, setInternalQuery] = useState(""); const [zoomPath, setZoomPath] = useState("root"); const [selectedPath, setSelectedPath] = useState("root"); const [hoveredPath, setHoveredPath] = useState(); const [zoomHistory, setZoomHistory] = useState([]); + const [hideSystemFrames, setHideSystemFrames] = useState(false); const query = searchQuery ?? internalQuery; const setQuery = (nextQuery: string) => { if (searchQuery === undefined) { @@ -41,13 +44,14 @@ export function Flamegraph({ root, metadata, emptyMessage = "No profile samples onSearchQueryChange?.(nextQuery); }; const frames = useMemo(() => layout(root, zoomPath, query, highlightQuery), [root, highlightQuery, query, zoomPath]); - const depth = Math.max(0, ...frames.map((frame) => frame.depth)); + const visibleFrames = useMemo(() => (hideSystemFrames ? frames.filter((frame) => frame.path === zoomPath || frame.category === "application") : frames), [frames, hideSystemFrames, zoomPath]); + const depth = Math.max(0, ...visibleFrames.map((frame) => frame.depth)); const rowHeight = 32; const queryActive = query.trim().length > 0; const highlightActive = highlightQuery.trim().length > 0; const zoomed = zoomPath !== "root"; - const selectedFrame = (highlightActive && selectedPath === "root" ? frames.find((frame) => frame.matched) : frames.find((frame) => frame.path === selectedPath)) ?? frames[0]; - const inspectedFrame = frames.find((frame) => frame.path === hoveredPath) ?? selectedFrame; + const selectedFrame = (highlightActive && selectedPath === "root" ? visibleFrames.find((frame) => frame.matched) : visibleFrames.find((frame) => frame.path === selectedPath)) ?? visibleFrames[0]; + const inspectedFrame = visibleFrames.find((frame) => frame.path === hoveredPath) ?? selectedFrame; const zoomTrail = zoomed ? 
["root", ...zoomHistory.slice(1), zoomPath].map((path) => findByPath(root, path)?.name ?? "root") : []; const resetZoom = () => { setQuery(""); @@ -72,13 +76,14 @@ export function Flamegraph({ root, metadata, emptyMessage = "No profile samples return history.slice(0, -1); }); }; - const hasSamples = (root.value > 0 || (root.children?.length ?? 0) > 0) && frames.length > 0; + const hasSamples = (root.value > 0 || (root.children?.length ?? 0) > 0) && visibleFrames.length > 0; return (
setQuery(event.target.value)} /> +

{zoomed @@ -103,11 +108,11 @@ export function Flamegraph({ root, metadata, emptyMessage = "No profile samples )} {hasSamples && inspectedFrame && selectedFrame && ( - zoomToPath(selectedFrame.path)} /> + zoomToPath(selectedFrame.path)} formatValue={formatValue} valueLabel={valueLabel} /> )} {hasSamples ? (

- {frames.map((frame) => ( + {visibleFrames.map((frame) => ( ))}
@@ -139,12 +144,12 @@ export function Flamegraph({ root, metadata, emptyMessage = "No profile samples
- Samples
- {selectedFrame.value.toLocaleString()}
+ {valueLabel}
+ {displayFrameValue(selectedFrame, formatValue)}
  Self
- {selectedFrame.self.toLocaleString()}
+ {formatValue(selectedFrame.self)}
Total CPU
@@ -207,7 +212,15 @@ function normalizeFrameSearch(value: string) { return value.trim().replaceAll("/", ".").toLowerCase(); } -function FrameInspector({ frame, onFocus }: { frame: Frame; onFocus: () => void }) { +function defaultFormatValue(value: number) { + return value.toLocaleString(); +} + +function displayFrameValue(frame: Frame, formatValue: (value: number) => string) { + return frame.display_value ?? formatValue(frame.value); +} + +function FrameInspector({ frame, onFocus, formatValue, valueLabel }: { frame: Frame; onFocus: () => void; formatValue: (value: number) => string; valueLabel: string }) { return (
@@ -216,12 +229,12 @@ function FrameInspector({ frame, onFocus }: { frame: Frame; onFocus: () => void
- Total CPU
- {frame.value.toLocaleString()} {frame.totalPercent.toFixed(1)}%
+ Total {valueLabel}
+ {displayFrameValue(frame, formatValue)} {frame.totalPercent.toFixed(1)}%
  Self CPU
- {frame.self.toLocaleString()} {frame.selfPercent.toFixed(1)}%
+ {formatValue(frame.self)} {frame.selfPercent.toFixed(1)}%
Depth
diff --git a/web/tests/real-acceptance.spec.ts b/web/tests/real-acceptance.spec.ts index 8b89c93..e18da8a 100644 --- a/web/tests/real-acceptance.spec.ts +++ b/web/tests/real-acceptance.spec.ts @@ -12,19 +12,21 @@ const requireDeadlockEvidence = process.env.REAL_ACCEPTANCE_REQUIRE_DEADLOCK === test.skip(!enabled, "Set REAL_ACCEPTANCE=1 to run against a real deployed cluster UI."); test.use({ video: "on", screenshot: "only-on-failure" }); -test("real cluster service diagnosis flow exposes status, profile, deadlock, and ingestion surfaces", async ({ page }) => { +test("real cluster Java profiling workbench exposes status, CPU, Wall Clock, I/O, GC, deadlock, and ingestion surfaces", async ({ page }) => { const consoleMessages: string[] = []; page.on("console", (message) => consoleMessages.push(`[${message.type()}] ${message.text()}`)); page.on("pageerror", (error) => consoleMessages.push(`[pageerror] ${error.message}`)); await page.goto(baseURL, { waitUntil: "networkidle" }); - await expect(page.getByRole("heading", { name: "Service diagnosis" })).toBeVisible(); + await expect(page.getByText("Java Profiler")).toBeVisible(); + await expect(page.getByRole("button", { name: "CPU profiles", exact: true })).toBeVisible(); + await expect(page.getByRole("button", { name: "GC pauses", exact: true })).toBeVisible(); await page.getByRole("textbox", { name: "Namespace", exact: true }).fill(namespace); await page.getByRole("textbox", { name: "Service", exact: true }).fill(service); await page.getByLabel("Range").selectOption("60"); - await page.getByRole("button", { name: "status", exact: true }).click(); + await page.getByRole("button", { name: "Service status", exact: true }).click(); await expect(page.getByRole("heading", { name: "Target status" })).toBeVisible(); const statusCell = page.getByRole("cell", { name: /accepted|unsupported_jvm|temporary_expired|disabled_by_metadata/ }).first(); const hasFilteredJavaStatus = await statusCell @@ -48,12 +50,12 @@ test("real cluster 
service diagnosis flow exposes status, profile, deadlock, and } await page.screenshot({ path: `${artifactDir}/ui-01-status.png`, fullPage: true }); - await page.getByRole("button", { name: "cpu", exact: true }).click(); + await page.getByRole("button", { name: "CPU profiles", exact: true }).click(); const analysis = page.getByRole("region", { name: "CPU profile analysis" }); - await expect(analysis.getByRole("heading", { name: "CPU profile" })).toBeVisible(); - await expect(page.getByRole("columnheader", { name: "Symbol" })).toBeVisible(); - await expect(page.getByRole("columnheader", { name: "Self CPU" })).toBeVisible(); - await expect(page.getByRole("columnheader", { name: "Total CPU" })).toBeVisible(); + await expect(analysis.getByRole("heading", { name: "Single Pod CPU profile" })).toBeVisible(); + await expect(page.getByRole("button", { name: "Symbol" })).toBeVisible(); + await expect(page.getByRole("button", { name: "Self CPU" })).toBeVisible(); + await expect(page.getByRole("button", { name: "Total CPU" })).toBeVisible(); const topTable = page.getByRole("region", { name: "Top table" }); await expect(topTable.getByRole("button", { name: /DemoHttpService\.handleWork/ }).first()).toBeVisible(); const firstDataRow = topTable.locator("tbody tr").first(); @@ -63,18 +65,24 @@ test("real cluster service diagnosis flow exposes status, profile, deadlock, and await expect(page.getByPlaceholder("Search frame")).toHaveValue(""); await expect(page.getByRole("button", { name: /^root\s+\d/ })).toBeVisible(); await expect(page.getByText(/Full sampled stack context/)).toBeVisible(); - await expect(page.getByText(/start from DemoHttpService|Start by inspecting this method|inspect both DemoHttpService/)).toBeVisible(); const legend = page.getByLabel("Frame categories"); await expect(legend.getByText("Application Java")).toBeVisible(); await expect(legend.getByText("JVM/runtime")).toBeVisible(); await expect(legend.getByText("Native/system")).toBeVisible(); - await 
expect(page.getByRole("button", { name: /so\.6/ }).first()).not.toHaveClass(/flame-row-dimmed/); + const nativeFrame = page.getByRole("button", { name: /so\.6|libjvm|pthread|\[vdso\]/i }).first(); + const hasNativeFrame = await nativeFrame + .waitFor({ state: "visible", timeout: 2_000 }) + .then(() => true) + .catch(() => false); + if (hasNativeFrame) { + await expect(nativeFrame).not.toHaveClass(/flame-row-dimmed/); + } const demoFrame = page.getByRole("button", { name: /DemoHttpService\.(burnCpu|handleWork)/ }).first(); await expect(demoFrame).toBeVisible(); await demoFrame.click(); const selectedFrame = page.getByRole("region", { name: "Selected flamegraph frame" }); await expect(selectedFrame).toContainText(/DemoHttpService\.(burnCpu|handleWork)/); - await expect(selectedFrame).toContainText(/Samples/); + await expect(selectedFrame).toContainText(/CPU time/); await expect(selectedFrame).toContainText(/Total CPU/); await expect(selectedFrame).toContainText(/Self CPU/); const inspector = page.getByRole("status"); @@ -84,7 +92,9 @@ test("real cluster service diagnosis flow exposes status, profile, deadlock, and await page.getByPlaceholder("Search frame").fill("burnCpu"); await expect(page.getByText(/Search highlights matching frames/)).toBeVisible(); await expect(page.getByRole("button", { name: /^root\s+\d/ })).toBeVisible(); - await expect(page.getByRole("button", { name: /so\.6/ }).first()).toHaveClass(/flame-row-dimmed/); + if (hasNativeFrame) { + await expect(nativeFrame).toHaveClass(/flame-row-dimmed/); + } await page.getByRole("button", { name: "Focus selected" }).click(); await expect(page.getByText(/Focused stack context/)).toBeVisible(); await expect(page.getByRole("navigation", { name: "Focused flamegraph path" })).toContainText("Focused"); @@ -100,7 +110,22 @@ test("real cluster service diagnosis flow exposes status, profile, deadlock, and await expect(page.getByRole("region", { name: "Flamegraph", exact: true })).toBeVisible(); await page.screenshot({ 
path: `${artifactDir}/ui-02-cpu.png`, fullPage: true }); - await page.getByRole("button", { name: "deadlocks", exact: true }).click(); + await page.getByRole("button", { name: "Wall Clock profiles", exact: true }).click(); + await expect(page.getByRole("heading", { name: "Single Pod Wall Clock profile" })).toBeVisible(); + await expect(page.getByRole("region", { name: "Top table" })).toContainText(/DemoHttpService/); + await page.screenshot({ path: `${artifactDir}/ui-03-wall-clock.png`, fullPage: true }); + + await page.getByRole("button", { name: "I/O wait profiles", exact: true }).click(); + await expect(page.getByRole("heading", { name: "Single Pod I/O wait profile" })).toBeVisible(); + await expect(page.getByRole("region", { name: "Top table" })).toContainText(/DemoHttpService/); + await page.screenshot({ path: `${artifactDir}/ui-04-io.png`, fullPage: true }); + + await page.getByRole("button", { name: "GC pauses", exact: true }).click(); + await expect(page.getByRole("heading", { name: "GC pauses" })).toBeVisible(); + await expect(page.getByText(/JVM GC|gc_pause|Allocation correlation/).first()).toBeVisible(); + await page.screenshot({ path: `${artifactDir}/ui-05-gc.png`, fullPage: true }); + + await page.getByRole("button", { name: "Deadlock diagnosis", exact: true }).click(); await expect(page.getByRole("heading", { name: "Deadlock cycles" })).toBeVisible(); if (requireDeadlockEvidence) { await expect(page.getByText("No deadlock cycles returned for this service and time range.")).toBeHidden(); @@ -116,14 +141,14 @@ test("real cluster service diagnosis flow exposes status, profile, deadlock, and await expect(emptyDeadlockState).toBeVisible(); } } - await page.screenshot({ path: `${artifactDir}/ui-03-deadlocks.png`, fullPage: true }); + await page.screenshot({ path: `${artifactDir}/ui-06-deadlocks.png`, fullPage: true }); - await page.getByRole("button", { name: "ingestion", exact: true }).click(); + await page.getByRole("button", { name: "Ingestion health", 
exact: true }).click(); const ingestion = page.getByRole("region", { name: "Ingestion health" }); await expect(ingestion).toBeVisible(); await expect(ingestion.getByText("Loading ingestion evidence.")).toBeHidden(); await expect(ingestion.getByText(/accepted x [1-9]\d*/i).first()).toBeVisible(); - await page.screenshot({ path: `${artifactDir}/ui-04-ingestion.png`, fullPage: true }); + await page.screenshot({ path: `${artifactDir}/ui-07-ingestion.png`, fullPage: true }); await test.info().attach("browser-console", { body: consoleMessages.join("\n") || "no browser console messages",