chore: rebase with upstream#137
Conversation
remove usage of Prometheus WAL
add GPU metrics
prometheus: avoid retaining compression buffer between iterations
There was a problem hiding this comment.
Hello @mayankpande88, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary of this pull request. While the title suggests a simple rebase with upstream, the changes included are quite substantial, primarily introducing NVIDIA GPU monitoring capabilities and a new Prometheus remote write agent for sending metrics.
The core intent appears to be enhancing the node agent's observability by adding detailed GPU metrics and providing a mechanism to push collected metrics to a remote Prometheus endpoint, complementing the existing HTTP server exposition. The rebase likely incorporated these features from another branch.
Highlights
- NVIDIA GPU Monitoring: Introduces a new
gpupackage and collector (gpu/gpu.go) that uses thego-nvmllibrary to gather node-level GPU metrics (info, memory, temperature, power, utilization) and process-level GPU utilization samples. These metrics are exposed via Prometheus. - Process-level GPU Usage Aggregation: Adds logic within the
containerspackage (containers/container.go,containers/process.go,containers/registry.go) to receive process-level GPU utilization samples from the new GPU collector and aggregate them per container over a defined window (gpuStatsWindow). New container-level metrics (container_resources_gpu_usage_percent,container_resources_gpu_memory_usage_percent) are introduced. - Prometheus Remote Write Agent: Adds a new
prom/remote_writer.gofile implementing a Prometheus remote write agent. This agent scrapes metrics from the registry, serializes them into Prometheus remote write format, compresses them with Snappy, and sends them to a configured endpoint (flags.MetricsEndpoint). It includes spooling to disk (flags.WalDir,flags.MaxSpoolSize) if the remote endpoint is unreachable. - Ollama Application Type Detection: Adds detection for the 'ollama' application type based on process command lines in
containers/app.go. - Dependency and Build Updates: Adds the
github.com/NVIDIA/go-nvmldependency and updatesgithub.com/grafana/pyroscope/ebpf. A linker flag (-extldflags='-Wl,-z,lazy') is added to the Dockerfile build process. - WAL Spool Size Configuration: Introduces a new flag
max-spool-size(flags/flags.go) to configure the maximum size of the on-disk spool used by the remote write agent.
Changelog
Click here to see the changelog
- Dockerfile
- Added
-extldflags='-Wl,-z,lazy'to the go build command (line 9).
- Added
- containers/app.go
- Added detection for 'ollama' application type based on command line suffix (line 123).
- containers/container.go
- Added
gpuStatsWindowconstant (line 40). - Added
gpuStatsmap field toContainerstruct (line 150). - Initialized
gpuStatsmap inNewContainer(line 212). - Added logic in
Collectto reset, aggregate, and collect container-level GPU usage metrics (lines 416-434).
- Added
- containers/metrics.go
- Defined new Prometheus metrics
GpuUsagePercentandGpuMemoryUsagePercent(lines 55-56, 109-110).
- Defined new Prometheus metrics
- containers/process.go
- Defined
GpuUsagestruct andResetmethod (lines 17-24). - Added
gpuUsageSamplesfield toProcessstruct (line 44). - Added
addGpuUsageSamplefunction to add and prune GPU usage samples (lines 113-116). - Added
getGPUUsagefunction to calculate aggregated GPU usage from samples (lines 118-134). - Added
removeOldGpuUsageSamplesfunction to remove old samples (lines 137-147).
- Defined
- containers/registry.go
- Imported the
gpupackage (line 17). - Added
gpuProcessUsageSampleChanfield toRegistrystruct (line 74). - Modified
NewRegistryto acceptgpuProcessUsageSampleChan(line 77). - Initialized
gpuProcessUsageSampleChaninNewRegistry(line 129). - Added case in
handleEventsto process GPU process usage samples (lines 222-227).
- Imported the
- flags/flags.go
- Added
MaxSpoolSizeflag for the WAL directory spool size (line 51).
- Added
- go.mod
- Added
github.com/NVIDIA/go-nvmldependency (line 8). - Added
github.com/golang/snappydependency (line 19). - Updated
github.com/grafana/pyroscope/ebpffrom v0.4.8 to v0.4.9 (line 21). - Reorganized some indirect dependencies.
- Added
- go.sum
- Added checksum entries for
github.com/NVIDIA/go-nvml(lines 97-98). - Updated checksums for
github.com/golang-jwt/jwt/v5(lines 496-497). - Updated checksums for
github.com/grafana/pyroscope/ebpf(lines 509-510). - Reorganized/updated checksums for various indirect dependencies.
- Added checksum entries for
- gpu/gpu.go
- New file containing the implementation for the NVIDIA GPU collector, including metric definitions, NVML initialization, device discovery, process utilization polling, and metric collection.
- main.go
- Imported the
gpupackage (line 17). - Initialized and registered the
gpu.NewCollector()(lines 189-195). - Passed the
gpuCollector.ProcessUsageSampleChtocontainers.NewRegistry(line 206). - Updated the call to
prom.StartAgentto pass the registry (line 215).
- Imported the
- prom/remote_writer.go
- New file containing the implementation for the Prometheus remote write agent, including scraping, serialization, compression, sending, and spooling logic.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Green chips hum and glow,
Metrics flow, a steady stream,
Usage climbs and falls.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request introduces significant changes, primarily adding support for NVIDIA GPU monitoring and implementing a Prometheus remote write agent with spooling capabilities. It also includes minor updates like recognizing the 'ollama' application type and adjusting Dockerfile build flags. The addition of GPU monitoring is a valuable feature, and the remote write agent provides robustness for metric collection. However, the implementation of the GPU metric aggregation and the remote write agent's spool management could be improved for correctness and efficiency. Several areas related to metric calculation and data handling require attention.
Summary of Findings
- GPU Metric Aggregation: The aggregation logic for process-level GPU usage at the container level seems incorrect, potentially leading to misleading metrics. The method used to calculate the average GPU/Memory usage over the window in
containers/process.goalso appears incorrect. - GPU Sample Handling: The
lastTshandling in the GPU poller might lead to missed or duplicated samples. Additionally, samples with 0% utilization are filtered out, which could affect the accuracy of average calculations. - Prometheus Remote Write Implementation: The manual implementation of Prometheus WriteRequest protobuf conversion is complex and duplicates functionality available in the standard client library. The spool management logic for truncation could be inefficient and potentially incomplete in bringing the spool size below the limit.
- Channel Buffering: The blocking send on the GPU sample channel could cause the poller to slow down or miss samples if the consumer cannot keep up.
- Dockerfile Build Flags: The addition of
-extldflags='-Wl,-z,lazy'is a standard optimization for Go builds. - Spool Directory Permissions: The spool directory is created with
0750permissions. (Not commented on directly due to review settings). - Spool Monitoring: There are no explicit metrics for monitoring the spool size or send failures. (Not commented on directly due to review settings).
Merge Readiness
This pull request introduces significant new functionality (GPU monitoring, remote write spooling) but contains several high and medium severity issues related to metric correctness, data handling, and implementation efficiency. Specifically, the GPU metric aggregation and calculation logic appears flawed, and the manual protobuf conversion in the remote writer is a maintainability concern. I recommend addressing these issues before merging. I am unable to approve the pull request; please ensure other reviewers approve this code before merging.
b801a5a to
95b6baf
Compare
95b6baf to
1f209ca
Compare
1f209ca to
5f6b5bd
Compare
No description provided.