feat: fix response parsing and llm request tracking by mayankpande88 · Pull Request #154 · nudgebee/node-agent

mayankpande88 · 2025-09-23T19:07:31Z

Fix for clickhouse metrics
Fix for actual destination mapping in labels
Fix for dns resolution in traces , instead of showing public IPs it should show dns
Rebase with upstream
Fix for metric collection failing with duplicate metrics was collected before with the same name and label values
fix for garbage data , issue is we are only reading first 16 bytes of request and response , they might contain chunked data or encoded data , first 16 bytes works fine for initial headers but other data may be partial.

* update version Signed-off-by: Vanshikav123 <vanshikav928@gmail.com> * addressing review Signed-off-by: Vanshikav123 <vanshikav928@gmail.com> * addressing review Signed-off-by: Vanshikav123 <vanshikav928@gmail.com> --------- Signed-off-by: Vanshikav123 <vanshikav928@gmail.com>

…iqueness metrics: enhance the `instance` label uniqueness to avoid collisions

containers: add support for ZFS filesystem

add FoundationDB application type

add support for Oracle Cloud metadata

add support for disabling log monitoring via container ENV variable

…usage

cgroup v1: use hierarchical total_* counters for accurate RSS/memory usage

FoundationDB support

…ACES=disable

eBPF traces improvements

grpc: get grpc-status from payload if possible

add `container_nodejs_event_loop_blocked_time_seconds_total`

read python stats directly from the ebpf map

add per-container Pressure Stall Information (PSI) metrics

mayankpande88 · 2025-10-04T12:23:47Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces several significant improvements and new features, including LLM request tracking, pressure stall information (PSI) metrics, and numerous optimizations for performance and memory usage. Key changes involve refactoring L7 metric collection to reduce cardinality, using a minimal pod struct to lower memory footprint, and improving DNS resolution in traces. While the overall changes are excellent, I've identified a few areas for improvement, such as potential regressions in L7 tracing for certain syscalls, code duplication, and some remaining debug artifacts that should be cleaned up.

- Re-enable iovec processing in l7.c for better network data handling - Add new is_amqp_frame() function for improved RabbitMQ AMQP frame detection - Simplify PostgreSQL query validation logic by removing unnecessary null-termination check - Optimize sendmmsg handling to process single message instead of loop 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…ion as ClickHouse - Enhanced ClickHouse detection with AMQP pattern rejection to prevent false positives - Added comprehensive AMQP method frame detection for better RabbitMQ coverage - Improved AMQP frame validation with proper frame type checking - Added protocol-specific validation to reduce cross-protocol misdetection This addresses the issue where RabbitMQ traffic on port 5672 was being incorrectly detected as ClickHouse protocol, generating wrong metrics. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…ements - Reduce .NET telemetry memory usage by 90% (10MB -> 1MB buffer) - Enhance HTTP path normalization with better regex patterns for cardinality reduction - Implement port-based protocol detection for RabbitMQ (5672) and ClickHouse (9000/8123) - Add destination port tracking in eBPF connection struct - Fix import issues in l7.go (add strconv, remove unused flags) These changes address high memory usage causing OOMKills by reducing Prometheus label cardinality and improving protocol detection accuracy. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Removed verbose debug logs that were generating excessive noise: - "Skipping collection for container - called too recently" - Collection timing information for every container check This reduces log volume significantly while maintaining functional behavior. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

This commit introduces tools to identify and monitor Prometheus metric cardinality that has been causing OOM issues in node-agent pods reaching 1Gi memory limits. Changes: - Add cardinality_analysis.go: periodic monitoring and analysis of metric cardinality - Add debug_metrics.go: debug endpoints and memory protection middleware - Modified main.go: integrated cardinality monitoring and debug endpoints Features: - /debug/metrics endpoint for real-time cardinality analysis - /debug/memory endpoint for runtime memory statistics - Automatic cardinality analysis every 5min when memory > 300MB - Memory circuit breaker: rejects requests when memory > 800MB - Detailed logging of high-cardinality metrics (>100 series) and critical metrics (>1000 series) - Top 10 metrics reporting with estimated memory usage per metric type This addresses the root cause analysis for OOM pods that were consuming 336MB+ in Prometheus metrics, with 94% memory usage from prometheus.MakeLabelPairs and prometheus.(*wrappingMetric).Write functions. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Fixed type conversion and import issues: - Added missing dto import for prometheus client model - Fixed uint64 to int64 type conversions for memory calculations - Corrected LabelPair type reference to use dto.LabelPair These fixes ensure the cardinality analysis code compiles correctly in the container build environment. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…uce cardinality - Keep path+method labels for HTTP request counters (container_http_requests_total) - Remove path+method labels from HTTP latency histograms (container_http_requests_duration_seconds_total) - This prevents cardinality explosion where each unique HTTP path creates multiple histogram buckets - Reduces memory consumption and prevents OOMKilled pods due to Prometheus metric explosion 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

…e cardinality - Add groupHttpStatus function to group HTTP status codes into 2xx, 3xx, 4xx, 5xx categories - Apply status grouping to histogram metrics while keeping detailed status codes for counters - Combined with previous path/method removal, this dramatically reduces histogram cardinality - Counters retain full granularity (path, method, detailed status) for detailed request tracking - Histograms now only have: base_labels + grouped_status (5 categories) vs thousands of combinations 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Release c.lock immediately after updating collection counters and time check - Prevents 1,337-5,045 seconds of lock contention during metrics collection - L7 metrics collection now happens without holding container lock - Should reduce metrics endpoint response time from 2+ minutes to seconds - Fixes vmagent timeout issues and prevents OOM kills 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Run gofmt on cardinality_analysis.go, debug_metrics.go, and main.go - Ensures consistent code formatting across the codebase 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

mayankpande88 · 2025-10-11T08:20:06Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a substantial set of features and fixes, primarily focused on enhancing LLM request tracing, improving DNS resolution for traces, and addressing performance issues related to metric cardinality and data parsing. The introduction of MinimalPod and an LRU cache for connection stats are excellent optimizations for memory management. My review has identified a critical race condition in the metric collection logic that needs to be addressed to ensure stability. Additionally, there are opportunities to improve the accuracy of a memory metric calculation and to refactor some duplicated code for better maintainability. Overall, these are valuable improvements, and addressing the feedback will further strengthen the agent's reliability.

…erhead - Add detailed lock timing measurements to Container.Collect() and Registry operations - Log warnings when lock acquisition takes >100ms or locks held >500ms - Remove debug_metrics.go entirely to eliminate overhead - Simplify metrics endpoint without middleware processing - Focus on identifying root cause of API unresponsiveness 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Remove unused cardinality_analysis.go file - Add 30-second timeout protection to metrics endpoint - Reduce collection throttling from 5s to 1s for faster response - Simplify lock contention logging to reduce verbosity - Remove excessive debug logging that could impact performance These changes focus on making the API more responsive and eliminating unnecessary processing overhead that could cause unresponsiveness. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

- Add detailed phase-by-phase timing measurements - Log lock acquisition time (>100ms) - Track basic metrics, network listen, connection stats phases - Monitor L7 stats collection (often the slowest part) - Log total collection time with severity levels: - Error: >2s, Warning: >1s, Info: >500ms - Add COLLECT_SKIP logs for duplicate collection prevention - Use klog verbosity levels for granular control This will help identify whether performance issues are from: 1. Lock contention (already logged) 2. Connection stats processing 3. L7 metrics collection 4. Process/application metrics 5. Overall collection pipeline 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Replaced mutex-based throttling in Container.Collect() with lock-free atomic operations to eliminate the final source of lock contention. Key changes: - Changed collectCallCount from int to int64 for atomic operations - Changed lastCollectTime from time.Time to int64 (Unix nanoseconds) - Used atomic.AddInt64(), atomic.LoadInt64(), atomic.StoreInt64() - Eliminated lock acquisition for throttling logic entirely Benefits: - Removes 622ms lock contention observed in pod jbzcc - Maintains exact same throttling behavior (1 second minimum interval) - No data loss risk (failure mode is "collect too often" not "lose data") - Race conditions are minor and don't affect correctness 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

Vperiodt and others added 30 commits June 16, 2025 11:44

fix cgroup handling for Incus containers (#224)

bcbb471

metrics: enhance the instance label uniqueness to avoid collisions

85208c9

Merge pull request #228 from coroot/metrics-enhance-instance-label-un…

0f55c4a

…iqueness metrics: enhance the `instance` label uniqueness to avoid collisions

containers: add support for ZFS filesystem (coroot/coroot#371)

36d7f25

Merge pull request #229 from coroot/containers-zfs

6b15ef1

containers: add support for ZFS filesystem

Update netDeviceFilterRe to allow enP4p65s0 and enP2p33s0 (#232)

6d9279e

Add support for monitoring eMMC drives (#233)

f7a2da8

add FoundationDB application type

a0f10d4

Merge pull request #235 from coroot/foundationdb_app_type

eb43dc0

add FoundationDB application type

add support for Oracle Cloud metadata

b349fee

Merge pull request #236 from coroot/oracle_cloud_support

7937399

add support for Oracle Cloud metadata

add support for disabling log monitoring via container ENV variable

0099b5f

Merge pull request #237 from coroot/disable_log_monitoring

ee83bcd

add support for disabling log monitoring via container ENV variable

cgroup v1: use hierarchical total_* counters for accurate RSS/memory …

d29d9ed

…usage

Merge pull request #238 from coroot/cgroup_v1_mem

94fcaf7

cgroup v1: use hierarchical total_* counters for accurate RSS/memory usage

FoundationDB support

4fe3e12

Merge pull request #239 from coroot/fdb_support

73630f0

FoundationDB support

ebpf traces: allow disabling traces per container with COROOT_EBPF_TR…

3888932

…ACES=disable

ebpf traces: add sampling support

c7d5d32

Merge pull request #240 from coroot/ebpf_traces_improvements

c71a416

eBPF traces improvements

grpc: get grpc-status from payload if possible

da8b5a4

Merge pull request #243 from coroot/grpc_status

2e9d8aa

grpc: get grpc-status from payload if possible

add container_nodejs_event_loop_blocked_time_seconds_total

aa169a6

Merge pull request #245 from coroot/nodejs_event_loop_monitoring

aa81b30

add `container_nodejs_event_loop_blocked_time_seconds_total`

read python stats directly from the ebpf map

611166e

Merge pull request #246 from coroot/python_ebpf_map

d7c8349

read python stats directly from the ebpf map

add per-container Pressure Stall Information (PSI) metrics

3a7cbbd

Merge pull request #247 from coroot/psi_metrics

f331454

add per-container Pressure Stall Information (PSI) metrics

feat: fix response parsing and llm request tracking

1f394c0

gemini-code-assist Bot reviewed Oct 4, 2025

View reviewed changes

Comment thread cgroup/memory.go

Comment thread ebpftracer/ebpf/l7/l7.c

Comment thread Dockerfile

Comment thread common/ip_resolver.go

Comment thread common/net.go

Comment thread containers/container.go Outdated

blue4209211 previously approved these changes Oct 4, 2025

View reviewed changes

mayankpande88 dismissed blue4209211’s stale review via 9461bb8 October 4, 2025 14:09

mayankpande88 and others added 7 commits October 4, 2025 21:28

mayankpande88 force-pushed the llm-handler branch 2 times, most recently from f2d4996 to fd7de5d Compare October 10, 2025 06:45

mayankpande88 and others added 2 commits October 11, 2025 12:24

style: fix Go formatting issues

e8d25e3

- Run gofmt on cardinality_analysis.go, debug_metrics.go, and main.go - Ensures consistent code formatting across the codebase 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

gemini-code-assist Bot reviewed Oct 11, 2025

View reviewed changes

Comment thread containers/container.go Outdated

Comment thread cgroup/memory.go

Comment thread Dockerfile

Comment thread cgroup/cgroup.go

Comment thread common/ip_resolver.go

blue4209211 previously approved these changes Oct 11, 2025

View reviewed changes

mayankpande88 dismissed blue4209211’s stale review via 9d4f275 October 12, 2025 10:55

mayankpande88 and others added 5 commits October 12, 2025 16:27

fix: remove unused encoding/json import

41c6c3d

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

style: fix formatting with gofmt

44de62e

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

blue4209211 approved these changes Oct 13, 2025

View reviewed changes

hsundar-git approved these changes Oct 13, 2025

View reviewed changes

mayankpande88 merged commit 0fd5884 into main Oct 13, 2025
2 checks passed

mayankpande88 deleted the llm-handler branch October 13, 2025 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: fix response parsing and llm request tracking#154

feat: fix response parsing and llm request tracking#154
mayankpande88 merged 175 commits into
mainfrom
llm-handler

mayankpande88 commented Sep 23, 2025 •

edited by blue4209211

Loading

Uh oh!

mayankpande88 commented Oct 4, 2025

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mayankpande88 commented Oct 11, 2025

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

mayankpande88 commented Sep 23, 2025 • edited by blue4209211 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mayankpande88 commented Oct 4, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mayankpande88 commented Oct 11, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

mayankpande88 commented Sep 23, 2025 •

edited by blue4209211

Loading