NVIDIA-DCGM integration docs
RamanaReddy8801 committed Oct 30, 2023
1 parent bd8bf3a commit cc6b4c5
Showing 3 changed files with 135 additions and 0 deletions.
@@ -0,0 +1,133 @@
---
title: NVIDIA DCGM integration
tags:
- NVIDIA integration
- DCGM integration
- New Relic integrations
metaDescription: Use New Relic infrastructure agent to get a dashboard with DCGM metrics.
---
import infrastructureNVIDIADCGMDashboard from 'images/infrastructure_screenshot-full_nvidia-dcgm-dashboard.webp'

Our NVIDIA DCGM integration helps you monitor the status of your GPUs. This integration uses our infrastructure agent and the Prometheus remote write integration, which is integrated with NVIDIA's SMI utility. It gives you a pre-built dashboard with key DCGM metrics, including GPU utilization, XID error counts, clock and performance states, temperature, and power usage.

<img
title="NVIDIA DCGM dashboard"
alt="NVIDIA DCGM dashboard"
src={infrastructureNVIDIADCGMDashboard}
/>

<figcaption>
After you set up our NVIDIA DCGM integration, we give you a dashboard for your DCGM metrics.
</figcaption>

## Install the infrastructure agent [#infra]

To get data into New Relic, install our infrastructure agent. Our infrastructure agent collects and ingests data so you can keep track of your DCGM performance.

You can install the infrastructure agent two different ways:

* Our [guided install](https://one.newrelic.com/nr1-core?state=4f81feab-35f7-e97e-9903-52510f8542bd) is a CLI tool that inspects your system and installs the infrastructure agent alongside the application monitoring agent that best works for your system. To learn more about how our guided install works, check out our [Guided install overview](/docs/infrastructure/host-integrations/installation/new-relic-guided-install-overview).
* If you'd rather install our infrastructure agent manually, you can follow a tutorial for manual installation on [Linux](/docs/infrastructure/install-infrastructure-agent/linux-installation/install-infrastructure-monitoring-agent-linux) or [Windows](/docs/infrastructure/install-infrastructure-agent/windows-installation/install-infrastructure-monitoring-agent-windows/).

## Configure DCGM-Exporter
1. Clone the repository
```
git clone https://github.com/NVIDIA/dcgm-exporter
```
2. Change to the `dcgm-exporter` directory in the cloned repository:

```
cd dcgm-exporter
```
3. Build and install the necessary binaries with the following commands:
```
make binary
```

```
sudo make install
```
4. Start `dcgm-exporter` in the background:
```
dcgm-exporter &
```
5. Check the DCGM metrics that the exporter exposes:
```
curl localhost:9400/metrics
```
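
To confirm that the exporter is returning DCGM samples, you can inspect the text that `curl localhost:9400/metrics` prints. The following is a minimal Python sketch, not part of the integration itself, that parses Prometheus-style text output and keeps only the `DCGM_*` samples. The function name `parse_dcgm_metrics`, the sample text, and its label names are illustrative assumptions; your exporter's actual output depends on your GPUs and exporter configuration.

```python
import re

# One sample line looks like: NAME{label="value",...} VALUE
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_dcgm_metrics(text):
    """Return (name, labels, value) tuples for every DCGM_* sample in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None or not match.group("name").startswith("DCGM_"):
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical exporter output, used only to illustrate the format.
SAMPLE_TEXT = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="host-a"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="host-a"} 43
process_cpu_seconds_total 1.5
"""

if __name__ == "__main__":
    for name, labels, value in parse_dcgm_metrics(SAMPLE_TEXT):
        print(name, "gpu", labels.get("gpu"), "=", value)
```

If the function returns an empty list for your real output, the exporter is probably running without access to DCGM or the GPUs.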

## NVIDIA-DCGM configuration on Prometheus

Prometheus is an open-source monitoring and alerting tool. To install it on a Linux system and configure it to scrape DCGM metrics, follow these steps.

1. Visit the [Prometheus download page](https://prometheus.io/download/) to find the latest release.
2. Select the appropriate version for your operating system and architecture. For Linux, you'll likely choose the linux-amd64 version. Copy the download link for the tarball (.tar.gz file).
3. After you download Prometheus, extract the tarball:
```shell
tar -xvzf <filename.tar.gz>
```
4. Go to the extracted Prometheus folder:
```shell
cd <download folder>
```
5. Open your `prometheus.yml` file and add the following scrape job. If the file already has a `scrape_configs` section, add only the job entry under it.
```yml
scrape_configs:
  - job_name: NVIDIA
    static_configs:
      - targets: ['localhost:9400']
```
6. Start Prometheus with the following command:
```shell
./prometheus --config.file=prometheus.yml
```
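
Once Prometheus is running, you may want to confirm that it can reach the exporter. The sketch below reads the response of the Prometheus HTTP API endpoint `/api/v1/targets` and lists the targets of one scrape job that aren't reporting `up`. It assumes Prometheus listens on its default port 9090; the helper names and the example payload are hypothetical, and the job name you pass should match the `job_name` in your own `prometheus.yml`.

```python
import json
import urllib.request


def unhealthy_targets(payload, job):
    """Return (scrapeUrl, lastError) for active targets of `job` whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in targets
        if target.get("labels", {}).get("job") == job and target.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch target status from a running Prometheus server (assumed address)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


# Hypothetical, trimmed API response used only for illustration.
EXAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "NVIDIA", "instance": "localhost:9400"},
                "scrapeUrl": "http://localhost:9400/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

if __name__ == "__main__":
    # Replace EXAMPLE_PAYLOAD with fetch_targets() to query a live server.
    for url, error in unhealthy_targets(EXAMPLE_PAYLOAD, "NVIDIA"):
        print(f"{url} is not up: {error}")
```

An empty result for your job name suggests that every configured exporter target is being scraped successfully.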

## Configure Prometheus remote write for NVIDIA GPUs

Using the Prometheus Remote Write integration, you can point your Prometheus servers at New Relic to store and visualize your data. You can explore other ways to integrate Prometheus data [in our docs](https://docs.newrelic.com/docs/integrations/prometheus-integrations/get-started/send-prometheus-metric-data-new-relic/).
1. Generate your Prometheus `remote_write` URL:
   * Add a name to identify your data source (for example, `NVIDIA`).
2. Click **Save**. You should see a code snippet with the `remote_write` configuration, similar to the following:

```yml
remote_write:
  - url: https://metric-api.newrelic.com/prometheus/v1/write?prometheus_server=NVIDIA
    bearer_token: <YOUR_LICENSE_KEY>
```
3. Add this `remote_write` snippet to your main Prometheus configuration file (`prometheus.yml`).
4. Reload or restart your Prometheus server.
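
If you manage several Prometheus servers, it can help to generate the `remote_write` block rather than edit it by hand. The following sketch assembles the snippet from a data source name and a license key, using the endpoint shown in the example above. The function name `remote_write_snippet` and the placeholder key are assumptions for illustration; always prefer the snippet that New Relic generates for your account.

```python
from urllib.parse import urlencode

# Endpoint taken from the example snippet in this guide.
BASE_URL = "https://metric-api.newrelic.com/prometheus/v1/write"


def remote_write_snippet(server_name, license_key):
    """Build a remote_write YAML fragment for prometheus.yml."""
    # urlencode escapes spaces and other characters that are unsafe in a URL.
    query = urlencode({"prometheus_server": server_name})
    return (
        "remote_write:\n"
        f"  - url: {BASE_URL}?{query}\n"
        f"    bearer_token: {license_key}\n"
    )


if __name__ == "__main__":
    # "YOUR_LICENSE_KEY" is a placeholder, not a working credential.
    print(remote_write_snippet("NVIDIA", "YOUR_LICENSE_KEY"))
```

Keep the real key out of version control, for example by filling it in at deploy time.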

## Restart the New Relic infrastructure agent
Before you can start reading your data, use the instructions in our [infrastructure agent docs](/docs/infrastructure/install-infrastructure-agent/manage-your-agent/start-stop-restart-infrastructure-agent/) to restart your infrastructure agent.

```shell
sudo systemctl restart newrelic-infra.service
```
## Monitor your application

You can use our pre-built dashboard template named `nvidia-dcgm` to monitor your GPU metrics.

1. Go to **[one.newrelic.com](https://one.newrelic.com/)** and click on **+ Add data**.
2. Click on the **Dashboards** tab.
3. In the search box, type `nvidia-dcgm`.
4. When you see our pre-built dashboard, click on it to install it in your account.

After you complete the steps above, the dashboard should display metrics.

To get metrics and alerts, you can also install the quickstart: go to our [NVIDIA DCGM quickstart page](https://newrelic.com/instant-observability/nvidia-dcgm) and click **Install now**.

Here's an example query:

**Example:** View the latest GPU temperature reading over time

```sql
SELECT latest(DCGM_FI_DEV_GPU_TEMP) FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_GPU_TEMP' TIMESERIES
```
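
The same query pattern should work for other DCGM metrics. As a rough sketch, the Python helper below builds that NRQL string for any metric name, so you can paste the result into the query builder. The helper name is hypothetical, and the extra metric names in the usage example (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_POWER_USAGE`) are assumed to be enabled in your exporter's counter configuration.

```python
import re

# Accept only plain metric identifiers, so quotes can't break out of the NRQL string.
METRIC_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def latest_timeseries_query(metric):
    """Return an NRQL query that charts the latest value of one metric over time."""
    if not METRIC_NAME.match(metric):
        raise ValueError(f"not a valid metric name: {metric!r}")
    return (
        f"SELECT latest({metric}) FROM Metric "
        f"WHERE metricName LIKE '{metric}' TIMESERIES"
    )


if __name__ == "__main__":
    # Metric names other than DCGM_FI_DEV_GPU_TEMP are assumptions; adjust to your setup.
    for name in ("DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_POWER_USAGE"):
        print(latest_timeseries_query(name))
```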

## What's next?

To learn more about building NRQL queries and generating dashboards, check out these docs:

* [Introduction to the query builder](/docs/query-your-data/explore-query-data/query-builder/introduction-query-builder) to create basic and advanced queries.
* [Introduction to dashboards](/docs/query-your-data/explore-query-data/dashboards/introduction-dashboards) to customize your dashboard and carry out different actions.
* [Manage your dashboard](/docs/query-your-data/explore-query-data/dashboards/manage-your-dashboard) to adjust your <InlinePopover type="dashboards" /> display mode, or to add more content to your dashboard.
Binary file not shown.
2 changes: 2 additions & 0 deletions src/nav/infrastructure.yml
@@ -218,6 +218,8 @@ pages:
path: /docs/infrastructure/host-integrations/host-integrations-list/nginx/nginx-config
- title: NVIDIA GPU integration
path: /docs/infrastructure/host-integrations/host-integrations-list/nvidia-gpu-integration
- title: NVIDIA DCGM integration
path: /docs/infrastructure/host-integrations/host-integrations-list/nvidia-dcgm-integration
- title: Openstack Controller integration
path: /docs/infrastructure/host-integrations/host-integrations-list/openstack-controller-integration
- title: Oracle Database integration