Skip to content

feat: fix response parsing and llm request tracking#154

Merged
mayankpande88 merged 175 commits into
mainfrom
llm-handler
Oct 13, 2025
Merged

feat: fix response parsing and llm request tracking#154
mayankpande88 merged 175 commits into
mainfrom
llm-handler

Conversation

@mayankpande88
Copy link
Copy Markdown
Contributor

@mayankpande88 mayankpande88 commented Sep 23, 2025

Fix for clickhouse metrics
Fix for actual destination mapping in labels
Fix for dns resolution in traces , instead of showing public IPs it should show dns
Rebase with upstream
Fix for metric collection failing with duplicate metrics was collected before with the same name and label values
fix for garbage data , issue is we are only reading first 16 bytes of request and response , they might contain chunked data or encoded data , first 16 bytes works fine for initial headers but other data may be partial.

Vperiodt and others added 30 commits June 16, 2025 11:44
* update version

Signed-off-by: Vanshikav123 <vanshikav928@gmail.com>

* addressing review

Signed-off-by: Vanshikav123 <vanshikav928@gmail.com>

* addressing review

Signed-off-by: Vanshikav123 <vanshikav928@gmail.com>

---------

Signed-off-by: Vanshikav123 <vanshikav928@gmail.com>
…iqueness

metrics: enhance the `instance` label uniqueness to avoid collisions
containers: add support for ZFS filesystem
add FoundationDB application type
add support for Oracle Cloud metadata
add support for disabling log monitoring via container ENV variable
cgroup v1: use hierarchical total_* counters for accurate RSS/memory usage
grpc: get grpc-status from payload if possible
add `container_nodejs_event_loop_blocked_time_seconds_total`
read python stats directly from the ebpf map
add per-container Pressure Stall Information (PSI) metrics
@mayankpande88
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several significant improvements and new features, including LLM request tracking, pressure stall information (PSI) metrics, and numerous optimizations for performance and memory usage. Key changes involve refactoring L7 metric collection to reduce cardinality, using a minimal pod struct to lower memory footprint, and improving DNS resolution in traces. While the overall changes are excellent, I've identified a few areas for improvement, such as potential regressions in L7 tracing for certain syscalls, code duplication, and some remaining debug artifacts that should be cleaned up.

Comment thread cgroup/memory.go
Comment thread ebpftracer/ebpf/l7/l7.c
Comment thread Dockerfile
Comment thread common/ip_resolver.go
Comment thread common/net.go
Comment thread containers/container.go Outdated
blue4209211
blue4209211 previously approved these changes Oct 4, 2025
- Re-enable iovec processing in l7.c for better network data handling
- Add new is_amqp_frame() function for improved RabbitMQ AMQP frame detection
- Simplify PostgreSQL query validation logic by removing unnecessary null-termination check
- Optimize sendmmsg handling to process single message instead of loop

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
mayankpande88 and others added 7 commits October 4, 2025 21:28
…ion as ClickHouse

- Enhanced ClickHouse detection with AMQP pattern rejection to prevent false positives
- Added comprehensive AMQP method frame detection for better RabbitMQ coverage
- Improved AMQP frame validation with proper frame type checking
- Added protocol-specific validation to reduce cross-protocol misdetection

This addresses the issue where RabbitMQ traffic on port 5672 was being incorrectly
detected as ClickHouse protocol, generating wrong metrics.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ements

- Reduce .NET telemetry memory usage by 90% (10MB -> 1MB buffer)
- Enhance HTTP path normalization with better regex patterns for cardinality reduction
- Implement port-based protocol detection for RabbitMQ (5672) and ClickHouse (9000/8123)
- Add destination port tracking in eBPF connection struct
- Fix import issues in l7.go (add strconv, remove unused flags)

These changes address high memory usage causing OOMKills by reducing Prometheus
label cardinality and improving protocol detection accuracy.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Removed verbose debug logs that were generating excessive noise:
- "Skipping collection for container - called too recently"
- Collection timing information for every container check

This reduces log volume significantly while maintaining functional behavior.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit introduces tools to identify and monitor Prometheus metric cardinality
that has been causing OOM issues in node-agent pods reaching 1Gi memory limits.

Changes:
- Add cardinality_analysis.go: periodic monitoring and analysis of metric cardinality
- Add debug_metrics.go: debug endpoints and memory protection middleware
- Modified main.go: integrated cardinality monitoring and debug endpoints

Features:
- /debug/metrics endpoint for real-time cardinality analysis
- /debug/memory endpoint for runtime memory statistics
- Automatic cardinality analysis every 5min when memory > 300MB
- Memory circuit breaker: rejects requests when memory > 800MB
- Detailed logging of high-cardinality metrics (>100 series) and critical metrics (>1000 series)
- Top 10 metrics reporting with estimated memory usage per metric type

This addresses the root cause analysis for OOM pods that were consuming 336MB+
in Prometheus metrics, with 94% memory usage from prometheus.MakeLabelPairs
and prometheus.(*wrappingMetric).Write functions.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed type conversion and import issues:
- Added missing dto import for prometheus client model
- Fixed uint64 to int64 type conversions for memory calculations
- Corrected LabelPair type reference to use dto.LabelPair

These fixes ensure the cardinality analysis code compiles correctly
in the container build environment.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…uce cardinality

- Keep path+method labels for HTTP request counters (container_http_requests_total)
- Remove path+method labels from HTTP latency histograms (container_http_requests_duration_seconds_total)
- This prevents cardinality explosion where each unique HTTP path creates multiple histogram buckets
- Reduces memory consumption and prevents OOMKilled pods due to Prometheus metric explosion

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…e cardinality

- Add groupHttpStatus function to group HTTP status codes into 2xx, 3xx, 4xx, 5xx categories
- Apply status grouping to histogram metrics while keeping detailed status codes for counters
- Combined with previous path/method removal, this dramatically reduces histogram cardinality
- Counters retain full granularity (path, method, detailed status) for detailed request tracking
- Histograms now only have: base_labels + grouped_status (5 categories) vs thousands of combinations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mayankpande88 mayankpande88 force-pushed the llm-handler branch 2 times, most recently from f2d4996 to fd7de5d Compare October 10, 2025 06:45
mayankpande88 and others added 2 commits October 11, 2025 12:24
- Release c.lock immediately after updating collection counters and time check
- Prevents 1,337-5,045 seconds of lock contention during metrics collection
- L7 metrics collection now happens without holding container lock
- Should reduce metrics endpoint response time from 2+ minutes to seconds
- Fixes vmagent timeout issues and prevents OOM kills

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Run gofmt on cardinality_analysis.go, debug_metrics.go, and main.go
- Ensures consistent code formatting across the codebase

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mayankpande88
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a substantial set of features and fixes, primarily focused on enhancing LLM request tracing, improving DNS resolution for traces, and addressing performance issues related to metric cardinality and data parsing. The introduction of MinimalPod and an LRU cache for connection stats are excellent optimizations for memory management. My review has identified a critical race condition in the metric collection logic that needs to be addressed to ensure stability. Additionally, there are opportunities to improve the accuracy of a memory metric calculation and to refactor some duplicated code for better maintainability. Overall, these are valuable improvements, and addressing the feedback will further strengthen the agent's reliability.

Comment thread containers/container.go Outdated
Comment thread cgroup/memory.go
Comment thread Dockerfile
Comment thread cgroup/cgroup.go
Comment thread common/ip_resolver.go
blue4209211
blue4209211 previously approved these changes Oct 11, 2025
…erhead

- Add detailed lock timing measurements to Container.Collect() and Registry operations
- Log warnings when lock acquisition takes >100ms or locks held >500ms
- Remove debug_metrics.go entirely to eliminate overhead
- Simplify metrics endpoint without middleware processing
- Focus on identifying root cause of API unresponsiveness

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
mayankpande88 and others added 5 commits October 12, 2025 16:27
- Remove unused cardinality_analysis.go file
- Add 30-second timeout protection to metrics endpoint
- Reduce collection throttling from 5s to 1s for faster response
- Simplify lock contention logging to reduce verbosity
- Remove excessive debug logging that could impact performance

These changes focus on making the API more responsive and eliminating
unnecessary processing overhead that could cause unresponsiveness.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add detailed phase-by-phase timing measurements
- Log lock acquisition time (>100ms)
- Track basic metrics, network listen, connection stats phases
- Monitor L7 stats collection (often the slowest part)
- Log total collection time with severity levels:
  - Error: >2s, Warning: >1s, Info: >500ms
- Add COLLECT_SKIP logs for duplicate collection prevention
- Use klog verbosity levels for granular control

This will help identify whether performance issues are from:
1. Lock contention (already logged)
2. Connection stats processing
3. L7 metrics collection
4. Process/application metrics
5. Overall collection pipeline

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replaced mutex-based throttling in Container.Collect() with lock-free
atomic operations to eliminate the final source of lock contention.

Key changes:
- Changed collectCallCount from int to int64 for atomic operations
- Changed lastCollectTime from time.Time to int64 (Unix nanoseconds)
- Used atomic.AddInt64(), atomic.LoadInt64(), atomic.StoreInt64()
- Eliminated lock acquisition for throttling logic entirely

Benefits:
- Removes 622ms lock contention observed in pod jbzcc
- Maintains exact same throttling behavior (1 second minimum interval)
- No data loss risk (failure mode is "collect too often" not "lose data")
- Race conditions are minor and don't affect correctness

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mayankpande88 mayankpande88 merged commit 0fd5884 into main Oct 13, 2025
2 checks passed
@mayankpande88 mayankpande88 deleted the llm-handler branch October 13, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants