chore: rebase with upstream by mayankpande88 · Pull Request #137 · nudgebee/node-agent

mayankpande88 · 2025-06-03T15:49:13Z

No description provided.

remove usage of Prometheus WAL

add GPU metrics

prometheus: avoid retaining compression buffer between iterations

gemini-code-assist

Hello @mayankpande88, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary of this pull request. While the title suggests a simple rebase with upstream, the changes included are quite substantial, primarily introducing NVIDIA GPU monitoring capabilities and a new Prometheus remote write agent for sending metrics.

The core intent appears to be enhancing the node agent's observability by adding detailed GPU metrics and providing a mechanism to push collected metrics to a remote Prometheus endpoint, complementing the existing HTTP server exposition. The rebase likely incorporated these features from another branch.

Highlights

NVIDIA GPU Monitoring: Introduces a new gpu package and collector (gpu/gpu.go) that uses the go-nvml library to gather node-level GPU metrics (info, memory, temperature, power, utilization) and process-level GPU utilization samples. These metrics are exposed via Prometheus.
Process-level GPU Usage Aggregation: Adds logic within the containers package (containers/container.go, containers/process.go, containers/registry.go) to receive process-level GPU utilization samples from the new GPU collector and aggregate them per container over a defined window (gpuStatsWindow). New container-level metrics (container_resources_gpu_usage_percent, container_resources_gpu_memory_usage_percent) are introduced.
Prometheus Remote Write Agent: Adds a new prom/remote_writer.go file implementing a Prometheus remote write agent. This agent scrapes metrics from the registry, serializes them into Prometheus remote write format, compresses them with Snappy, and sends them to a configured endpoint (flags.MetricsEndpoint). It includes spooling to disk (flags.WalDir, flags.MaxSpoolSize) if the remote endpoint is unreachable.
Ollama Application Type Detection: Adds detection for the 'ollama' application type based on process command lines in containers/app.go.
Dependency and Build Updates: Adds the github.com/NVIDIA/go-nvml dependency and updates github.com/grafana/pyroscope/ebpf. A linker flag (-extldflags='-Wl,-z,lazy') is added to the Dockerfile build process.
WAL Spool Size Configuration: Introduces a new flag max-spool-size (flags/flags.go) to configure the maximum size of the on-disk spool used by the remote write agent.

Changelog

Dockerfile
- Added -extldflags='-Wl,-z,lazy' to the go build command (line 9).
containers/app.go
- Added detection for 'ollama' application type based on command line suffix (line 123).
containers/container.go
- Added gpuStatsWindow constant (line 40).
- Added gpuStats map field to Container struct (line 150).
- Initialized gpuStats map in NewContainer (line 212).
- Added logic in Collect to reset, aggregate, and collect container-level GPU usage metrics (lines 416-434).
containers/metrics.go
- Defined new Prometheus metrics GpuUsagePercent and GpuMemoryUsagePercent (lines 55-56, 109-110).
containers/process.go
- Defined GpuUsage struct and Reset method (lines 17-24).
- Added gpuUsageSamples field to Process struct (line 44).
- Added addGpuUsageSample function to add and prune GPU usage samples (lines 113-116).
- Added getGPUUsage function to calculate aggregated GPU usage from samples (lines 118-134).
- Added removeOldGpuUsageSamples function to remove old samples (lines 137-147).
containers/registry.go
- Imported the gpu package (line 17).
- Added gpuProcessUsageSampleChan field to Registry struct (line 74).
- Modified NewRegistry to accept gpuProcessUsageSampleChan (line 77).
- Initialized gpuProcessUsageSampleChan in NewRegistry (line 129).
- Added case in handleEvents to process GPU process usage samples (lines 222-227).
flags/flags.go
- Added MaxSpoolSize flag for the WAL directory spool size (line 51).
go.mod
- Added github.com/NVIDIA/go-nvml dependency (line 8).
- Added github.com/golang/snappy dependency (line 19).
- Updated github.com/grafana/pyroscope/ebpf from v0.4.8 to v0.4.9 (line 21).
- Reorganized some indirect dependencies.
go.sum
- Added checksum entries for github.com/NVIDIA/go-nvml (lines 97-98).
- Updated checksums for github.com/golang-jwt/jwt/v5 (lines 496-497).
- Updated checksums for github.com/grafana/pyroscope/ebpf (lines 509-510).
- Reorganized/updated checksums for various indirect dependencies.
gpu/gpu.go
- New file containing the implementation for the NVIDIA GPU collector, including metric definitions, NVML initialization, device discovery, process utilization polling, and metric collection.
main.go
- Imported the gpu package (line 17).
- Initialized and registered the gpu.NewCollector() (lines 189-195).
- Passed the gpuCollector.ProcessUsageSampleCh to containers.NewRegistry (line 206).
- Updated the call to prom.StartAgent to pass the registry (line 215).
prom/remote_writer.go
- New file containing the implementation for the Prometheus remote write agent, including scraping, serialization, compression, sending, and spooling logic.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	`@gemini-code-assist`	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Green chips hum and glow,
Metrics flow, a steady stream,
Usage climbs and falls.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces significant changes, primarily adding support for NVIDIA GPU monitoring and implementing a Prometheus remote write agent with spooling capabilities. It also includes minor updates like recognizing the 'ollama' application type and adjusting Dockerfile build flags. The addition of GPU monitoring is a valuable feature, and the remote write agent provides robustness for metric collection. However, the implementation of the GPU metric aggregation and the remote write agent's spool management could be improved for correctness and efficiency. Several areas related to metric calculation and data handling require attention.

Summary of Findings

GPU Metric Aggregation: The aggregation logic for process-level GPU usage at the container level seems incorrect, potentially leading to misleading metrics. The method used to calculate the average GPU/Memory usage over the window in containers/process.go also appears incorrect.
GPU Sample Handling: The lastTs handling in the GPU poller might lead to missed or duplicated samples. Additionally, samples with 0% utilization are filtered out, which could affect the accuracy of average calculations.
Prometheus Remote Write Implementation: The manual implementation of Prometheus WriteRequest protobuf conversion is complex and duplicates functionality available in the standard client library. The spool management logic for truncation could be inefficient and potentially incomplete in bringing the spool size below the limit.
Channel Buffering: The blocking send on the GPU sample channel could cause the poller to slow down or miss samples if the consumer cannot keep up.
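Dockerfile Build Flags: The added -extldflags='-Wl,-z,lazy' asks the external linker for lazy symbol binding. This is a linker setting rather than an optimization, and it is likely tied to the new go-nvml dependency, whose NVML symbols are resolved at run time.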
Spool Directory Permissions: The spool directory is created with 0750 permissions. (Not commented on directly due to review settings).
Spool Monitoring: There are no explicit metrics for monitoring the spool size or send failures. (Not commented on directly due to review settings).

Merge Readiness

This pull request introduces significant new functionality (GPU monitoring, remote write spooling) but contains several high and medium severity issues related to metric correctness, data handling, and implementation efficiency. Specifically, the GPU metric aggregation and calculation logic appears flawed, and the manual protobuf conversion in the remote writer is a maintainability concern. I recommend addressing these issues before merging. I am unable to approve the pull request; please ensure other reviewers approve this code before merging.

def and others added 8 commits May 27, 2025 16:45

remove usage of Prometheus WAL

5181a8e

Merge pull request #213 from coroot/remove_prometheus_wal

d1c8216

remove usage of Prometheus WAL

add GPU metrics

65b2bf6

Merge pull request #216 from coroot/gpu_metrics

7a43425

add GPU metrics

metrics: log successful exports

a632481

prometheus: avoid retaining compression buffer between iterations

564d0ad

Merge pull request #217 from coroot/fix_mem_leak

6d71571

prometheus: avoid retaining compression buffer between iterations

Merge remote-tracking branch 'upstream/main' into rebase-23-5

e094f1c

gemini-code-assist Bot reviewed Jun 3, 2025

gemini-code-assist Bot suggested changes Jun 3, 2025

Comment thread containers/container.go

Comment thread containers/process.go

Comment thread gpu/gpu.go

Comment thread prom/remote_writer.go

Comment thread Dockerfile

Comment thread gpu/gpu.go

Comment thread gpu/gpu.go

Comment thread prom/remote_writer.go

Comment thread prom/remote_writer.go

Merge remote-tracking branch 'origin/main' into rebase-23-5

bd59e93

mayankpande88 force-pushed the rebase-23-5 branch 2 times, most recently from b801a5a to 95b6baf Compare June 4, 2025 05:15

blue4209211 previously approved these changes Jun 4, 2025

mayankpande88 dismissed blue4209211’s stale review via 1f209ca June 4, 2025 09:26

mayankpande88 force-pushed the rebase-23-5 branch from 95b6baf to 1f209ca Compare June 4, 2025 09:26

fix: fix for docker build

5f6b5bd

mayankpande88 force-pushed the rebase-23-5 branch from 1f209ca to 5f6b5bd Compare June 4, 2025 09:55

fix: fix for build

89d6941

blue4209211 approved these changes Jun 9, 2025

mayankpande88 merged commit 21192f5 into main Jun 9, 2025
2 checks passed

mayankpande88 deleted the rebase-23-5 branch June 9, 2025 11:48

mayankpande88 merged 11 commits into main from rebase-23-5


4 participants
