
chore: rebase with upstream#137

Merged
mayankpande88 merged 11 commits into main from rebase-23-5 on Jun 9, 2025
Conversation

@mayankpande88
Contributor

No description provided.


@gemini-code-assist gemini-code-assist Bot left a comment


Hello @mayankpande88, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary of this pull request. While the title suggests a simple rebase with upstream, the changes included are quite substantial, primarily introducing NVIDIA GPU monitoring capabilities and a new Prometheus remote write agent for sending metrics.

The core intent appears to be enhancing the node agent's observability by adding detailed GPU metrics and providing a mechanism to push collected metrics to a remote Prometheus endpoint, complementing the existing HTTP server exposition. The rebase likely incorporated these features from another branch.

Highlights

  • NVIDIA GPU Monitoring: Introduces a new gpu package and collector (gpu/gpu.go) that uses the go-nvml library to gather node-level GPU metrics (info, memory, temperature, power, utilization) and process-level GPU utilization samples. These metrics are exposed via Prometheus.
  • Process-level GPU Usage Aggregation: Adds logic within the containers package (containers/container.go, containers/process.go, containers/registry.go) to receive process-level GPU utilization samples from the new GPU collector and aggregate them per container over a defined window (gpuStatsWindow). New container-level metrics (container_resources_gpu_usage_percent, container_resources_gpu_memory_usage_percent) are introduced.
  • Prometheus Remote Write Agent: Adds a new prom/remote_writer.go file implementing a Prometheus remote write agent. This agent scrapes metrics from the registry, serializes them into Prometheus remote write format, compresses them with Snappy, and sends them to a configured endpoint (flags.MetricsEndpoint). It includes spooling to disk (flags.WalDir, flags.MaxSpoolSize) if the remote endpoint is unreachable.
  • Ollama Application Type Detection: Adds detection for the 'ollama' application type based on process command lines in containers/app.go.
  • Dependency and Build Updates: Adds the github.com/NVIDIA/go-nvml dependency and updates github.com/grafana/pyroscope/ebpf. A linker flag (-extldflags='-Wl,-z,lazy') is added to the Dockerfile build process.
  • WAL Spool Size Configuration: Introduces a new flag max-spool-size (flags/flags.go) to configure the maximum size of the on-disk spool used by the remote write agent.

Changelog

  • Dockerfile
    • Added -extldflags='-Wl,-z,lazy' to the go build command (line 9).
  • containers/app.go
    • Added detection for 'ollama' application type based on command line suffix (line 123).
  • containers/container.go
    • Added gpuStatsWindow constant (line 40).
    • Added gpuStats map field to Container struct (line 150).
    • Initialized gpuStats map in NewContainer (line 212).
    • Added logic in Collect to reset, aggregate, and collect container-level GPU usage metrics (lines 416-434).
  • containers/metrics.go
    • Defined new Prometheus metrics GpuUsagePercent and GpuMemoryUsagePercent (lines 55-56, 109-110).
  • containers/process.go
    • Defined GpuUsage struct and Reset method (lines 17-24).
    • Added gpuUsageSamples field to Process struct (line 44).
    • Added addGpuUsageSample function to add and prune GPU usage samples (lines 113-116).
    • Added getGPUUsage function to calculate aggregated GPU usage from samples (lines 118-134).
    • Added removeOldGpuUsageSamples function to remove old samples (lines 137-147).
  • containers/registry.go
    • Imported the gpu package (line 17).
    • Added gpuProcessUsageSampleChan field to Registry struct (line 74).
    • Modified NewRegistry to accept gpuProcessUsageSampleChan (line 77).
    • Initialized gpuProcessUsageSampleChan in NewRegistry (line 129).
    • Added case in handleEvents to process GPU process usage samples (lines 222-227).
  • flags/flags.go
    • Added MaxSpoolSize flag for the WAL directory spool size (line 51).
  • go.mod
    • Added github.com/NVIDIA/go-nvml dependency (line 8).
    • Added github.com/golang/snappy dependency (line 19).
    • Updated github.com/grafana/pyroscope/ebpf from v0.4.8 to v0.4.9 (line 21).
    • Reorganized some indirect dependencies.
  • go.sum
    • Added checksum entries for github.com/NVIDIA/go-nvml (lines 97-98).
    • Updated checksums for github.com/golang-jwt/jwt/v5 (lines 496-497).
    • Updated checksums for github.com/grafana/pyroscope/ebpf (lines 509-510).
    • Reorganized/updated checksums for various indirect dependencies.
  • gpu/gpu.go
    • New file containing the implementation for the NVIDIA GPU collector, including metric definitions, NVML initialization, device discovery, process utilization polling, and metric collection.
  • main.go
    • Imported the gpu package (line 17).
    • Initialized and registered the gpu.NewCollector() (lines 189-195).
    • Passed the gpuCollector.ProcessUsageSampleCh to containers.NewRegistry (line 206).
    • Updated the call to prom.StartAgent to pass the registry (line 215).
  • prom/remote_writer.go
    • New file containing the implementation for the Prometheus remote write agent, including scraping, serialization, compression, sending, and spooling logic.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Green chips hum and glow,
Metrics flow, a steady stream,
Usage climbs and falls.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant changes, primarily adding support for NVIDIA GPU monitoring and implementing a Prometheus remote write agent with spooling capabilities. It also includes minor updates like recognizing the 'ollama' application type and adjusting Dockerfile build flags. The addition of GPU monitoring is a valuable feature, and the remote write agent provides robustness for metric collection. However, the implementation of the GPU metric aggregation and the remote write agent's spool management could be improved for correctness and efficiency. Several areas related to metric calculation and data handling require attention.

Summary of Findings

  • GPU Metric Aggregation: The aggregation logic for process-level GPU usage at the container level seems incorrect, potentially leading to misleading metrics. The method used to calculate the average GPU/Memory usage over the window in containers/process.go also appears incorrect.
  • GPU Sample Handling: The lastTs handling in the GPU poller might lead to missed or duplicated samples. Additionally, samples with 0% utilization are filtered out, which could affect the accuracy of average calculations.
  • Prometheus Remote Write Implementation: The manual implementation of Prometheus WriteRequest protobuf conversion is complex and duplicates functionality available in the standard client library. The spool management logic for truncation could be inefficient and potentially incomplete in bringing the spool size below the limit.
  • Channel Buffering: The blocking send on the GPU sample channel could cause the poller to slow down or miss samples if the consumer cannot keep up.
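  • Dockerfile Build Flags: The added -extldflags='-Wl,-z,lazy' asks the external linker for lazy symbol binding. This is not a general Go build optimization; it is most likely needed by the cgo-based go-nvml dependency introduced in this pull request.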
  • Spool Directory Permissions: The spool directory is created with 0750 permissions. (Not commented on directly due to review settings).
  • Spool Monitoring: There are no explicit metrics for monitoring the spool size or send failures. (Not commented on directly due to review settings).

Merge Readiness

This pull request introduces significant new functionality (GPU monitoring, remote write spooling) but contains several high and medium severity issues related to metric correctness, data handling, and implementation efficiency. Specifically, the GPU metric aggregation and calculation logic appears flawed, and the manual protobuf conversion in the remote writer is a maintainability concern. I recommend addressing these issues before merging. I am unable to approve the pull request; please ensure other reviewers approve this code before merging.

Comment thread containers/container.go
Comment thread containers/process.go
Comment thread gpu/gpu.go
Comment thread prom/remote_writer.go
Comment thread Dockerfile
Comment thread gpu/gpu.go
Comment thread gpu/gpu.go
Comment thread prom/remote_writer.go
Comment thread prom/remote_writer.go
mayankpande88 force-pushed the rebase-23-5 branch 2 times, most recently from b801a5a to 95b6baf on June 4, 2025 05:15
blue4209211 previously approved these changes Jun 4, 2025
mayankpande88 merged commit 21192f5 into main Jun 9, 2025
2 checks passed
mayankpande88 deleted the rebase-23-5 branch June 9, 2025 11:48
4 participants