Errors in nv-hostengine log #141

Open
itzsimpl opened this issue Dec 20, 2023 · 7 comments
Comments

@itzsimpl

We use dcgm-exporter as described in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#connecting-to-an-existing-dcgm-agent. The nv-hostengine is version 3.1.8, and the dcgm-exporter container is nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04.

We use a custom metrics file with the following metrics:

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
#DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# PCIe,,
DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data including both protocol headers and payload transmitted over PCIe bus (in B/s).
DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data including both protocol headers and payload received over PCIe bus (in B/s).

# NVLink,,
DCGM_FI_PROF_NVLINK_TX_BYTES,                    gauge, The rate of data not including protocol headers transmitted over NVLink (in B/s).
DCGM_FI_PROF_NVLINK_RX_BYTES,                    gauge, The rate of data not including protocol headers received over NVLink (in B/s).

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
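
For anyone trying to bisect which metric triggers an error, here is a minimal Python sketch that reads a metrics file in the `FIELD, type, help` layout shown above and lists the enabled `DCGM_FI_PROF_*` fields. The names `Counter`, `parse_counters` and `SAMPLE` are made up for illustration. This is not dcgm-exporter's own parser, and the whitespace stripping (for example the space before the comma in `DCGM_FI_DEV_DEC_UTIL ,`) is a choice of this sketch only.

```python
from typing import List, NamedTuple


class Counter(NamedTuple):
    field: str       # DCGM field name, e.g. DCGM_FI_DEV_GPU_UTIL
    prom_type: str   # Prometheus metric type: gauge or counter
    help_text: str   # free-form help message


def parse_counters(text: str) -> List[Counter]:
    """Parse 'FIELD, type, help' lines, skipping blanks and '#' comments."""
    counters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only: the help text may contain commas.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns: {raw!r}")
        counters.append(Counter(*parts))
    return counters


# A few lines copied from the metrics file above.
SAMPLE = """\
# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
#DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
"""

counters = parse_counters(SAMPLE)
prof_fields = [c.field for c in counters if c.field.startswith("DCGM_FI_PROF_")]
print(prof_fields)
```

Commenting out the listed fields one group at a time and restarting the exporter is one way to see which group the errors follow.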

On a DGX-H100 system with DGXOS 6 installed and the latest FW updates, I have noticed the following errors in the nv-hostengine logs:

2023-12-20 06:44:42.219 ERROR [8826:13917] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
2023-12-20 06:44:42.219 ERROR [8826:13917] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]

Any ideas what these are?
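
As a starting point for reading the two lines above, here is a small Python sketch that pulls the numeric status out of each message and maps it to a symbolic name. The two lookup tables are partial and transcribed from memory of `nvml.h` and `dcgm_structs.h`, so treat them as an assumption and check them against the headers shipped with your driver and DCGM version. The helper name `explain` is made up for illustration.

```python
import re

# Partial tables (assumption: values recalled from nvml.h / dcgm_structs.h,
# verify against the headers of the installed driver and DCGM release).
NVML_STATUS = {3: "NVML_ERROR_NOT_SUPPORTED", 10: "NVML_ERROR_TIMEOUT"}
DCGM_STATUS = {-8: "DCGM_ST_NVML_ERROR"}

NVML_RE = re.compile(r"Got nvml st (\d+) from (\w+)\(\)")
DCGM_RE = re.compile(r"Got unexpected return (-?\d+) from ([\w.]+)")


def explain(line):
    """Return 'caller: STATUS_NAME (code)' for a recognised line, else None."""
    m = NVML_RE.search(line)
    if m:
        code = int(m.group(1))
        return f"{m.group(2)}: {NVML_STATUS.get(code, 'unknown NVML status')} ({code})"
    m = DCGM_RE.search(line)
    if m:
        code = int(m.group(1))
        return f"{m.group(2)}: {DCGM_STATUS.get(code, 'unknown DCGM status')} ({code})"
    return None


# The two messages from the log above (source-location suffix omitted).
line_nvml = "2023-12-20 06:44:42.219 ERROR [8826:13917] Got nvml st 10 from nvmlGpmSampleGet()."
line_dcgm = "2023-12-20 06:44:42.219 ERROR [8826:13917] Got unexpected return -8 from m_gpmManager.GetLatestSample"

print(explain(line_nvml))
print(explain(line_dcgm))
```

If the tables are right, the first line would be an NVML timeout while fetching a GPM sample, and the second would be DCGM passing that NVML failure up. That describes the symptom only, not the cause.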

In addition, if we enable DCGM_FI_DEV_XID_ERRORS then the logs get filled quite quickly by the following ERROR:

2023-12-19 13:58:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.954 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.955 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.956 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.956 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.956 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.954 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.956 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.957 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.957 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.957 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.958 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.955 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.956 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.956 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.957 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.951 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.952 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.954 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 14:00:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
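
To get an overview of a log like this without reading every line, here is a rough Python sketch that counts `GetLatestSample returned ...` messages per field ID and status text. The regular expression is derived only from the line layout visible above, and `summarize` is a made-up helper name. Field 230 is the `DCGM_FI_DEV_XID_ERRORS` field being enabled here.

```python
import re
from collections import Counter

# Pattern inferred from the log lines above; other DCGM versions may differ.
LINE_RE = re.compile(
    r"GetLatestSample returned (?P<status>.+?) for entityId (?P<entity>\d+) "
    r"groupId (?P<group>\d+) fieldId (?P<field>\d+)"
)


def summarize(log_text):
    """Count matching log lines keyed by (fieldId, status text)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            counts[(int(m["field"]), m["status"])] += 1
    return counts


# Three lines copied from the log above (source-location suffix omitted).
SAMPLE_LOG = """\
2023-12-19 13:58:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230
2023-12-19 13:58:17.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230
2023-12-19 13:58:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230
"""

summary = summarize(SAMPLE_LOG)
for (field_id, status), n in sorted(summary.items()):
    print(f"fieldId {field_id}: {status!r} x{n}")
```

Run against the full log, this makes it easy to confirm that every entry concerns the same field and that the messages repeat for each of the eight GPUs every 30 seconds.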
@nikkon-dev
Collaborator

@itzsimpl,

Can you please check the dmesg messages and confirm if you are using the GSP driver?

@itzsimpl
Author

@nikkon-dev

The installed driver is 535.129.03. Following https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/gsp.html,

# nvidia-smi -q | grep -i gsp
    GSP Firmware Version                  : 535.129.03

but

# cat /proc/driver/nvidia/gpus/0000\:1b\:00.0/information 
Model:           NVIDIA H100 80GB HBM3
IRQ:             18
GPU UUID:        GPU-875f3ca0-9de4-e78c-9cea-5140b030b627
Video BIOS:      96.00.89.00.01
Bus Type:        PCIe
DMA Size:        52 bits
DMA Mask:        0xfffffffffffff
Bus Location:    0000:1b:00.0
Device Minor:    0
GPU Firmware:    535.129.03
GPU Excluded:    No

I don't see GSP mentioned in dmesg.

Could you provide more details on what to look for in dmesg?
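
In case it helps compare several GPUs or nodes, here is a minimal Python sketch that turns the `information` output shown above into a dictionary. The function name `parse_gpu_information` is made up for illustration. The sketch only extracts the fields and makes no claim about what a given `GPU Firmware` value implies for GSP.

```python
def parse_gpu_information(text):
    """Parse 'Key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


# Excerpt of the output pasted above.
SAMPLE_INFO = """\
Model:           NVIDIA H100 80GB HBM3
IRQ:             18
Bus Type:        PCIe
Bus Location:    0000:1b:00.0
Device Minor:    0
GPU Firmware:    535.129.03
GPU Excluded:    No
"""

info = parse_gpu_information(SAMPLE_INFO)
print(info["Model"], info["GPU Firmware"])
```

Splitting on the first colon only keeps values such as `0000:1b:00.0` intact. On a real system the same function could be applied to each file matching `/proc/driver/nvidia/gpus/*/information`.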

@jfolz

jfolz commented Jan 19, 2024

We see the same errors on DGX-H100. Same nv-hostengine, driver, (GSP) firmware, etc.

ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
ERROR [597577:597597] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]

Not sure if this is relevant, but here are the metrics collected by dcgm-exporter.

DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.

@nikkon-dev please let us know if you need additional information, and if this is a DCGM or dcgm-exporter issue.

@itzsimpl
Author

itzsimpl commented Feb 1, 2024

@nikkon-dev Any news on this?

Upgraded to dcgm 3.3.3 and dcgm-exporter 3.3.3-3.2.0

# dcgmi -v
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Hostengine build info:
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Still seeing the errors, but also found:

2024-02-01 13:34:36.214 ERROR [8878:8879] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@֟�O^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 13:34:36.215 ERROR [8878:8879] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 13:34:36.215 ERROR [8878:8879] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]

The system has the latest DGXOS 6.1, the latest FW 1.1.3, and all Ubuntu package upgrades applied; the driver is 535.154.05.

@pintohutch

FWIW I'm seeing these same messages using libraries from the nvcr.io/nvidia/cloud-native/dcgm:3.3.3-1-ubuntu22.04 container.

time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR

For the GPU firmware:

cat /proc/driver/nvidia/gpus/0000:00:03.0/information
Model: 		 NVIDIA L4
IRQ:   		 11
GPU UUID: 	 GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS: 	 95.04.29.00.07
Bus Type: 	 PCI
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:00:03.0
Device Minor: 	 0
GPU Firmware: 	 535.129.03
GPU Excluded:	 No

@pintohutch

Though I want to point out, I'm deploying dcgm-exporter along with the DCGM libraries in "embedded mode" and I can see it exposing 0-value metrics for Field 230 (DCGM_FI_DEV_XID_ERRORS):

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0

So I dunno if dcgm-exporter is gracefully handling/defaulting here in the case of errors, or if it's reporting an incorrect metric because of real errors from DCGM.
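
One way to tell a defaulted zero from a real reading is to watch the exporter output over time and flag any non-zero sample. Here is a minimal Python sketch of that check, assuming the text exposition layout shown above. `parse_sample` and `nonzero_xid` are made-up helper names, the label sets are trimmed, and the last sample line carries a hypothetical non-zero value that does not come from this thread.

```python
import re

# name{labels} value, as in the exposition lines above
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric name, labels dict, float value), or None for other lines."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        return None
    return m["name"], dict(LABEL_RE.findall(m["labels"])), float(m["value"])


def nonzero_xid(text):
    """List (gpu label, value) for every non-zero DCGM_FI_DEV_XID_ERRORS sample."""
    hits = []
    for line in text.splitlines():
        parsed = parse_sample(line)
        if parsed and parsed[0] == "DCGM_FI_DEV_XID_ERRORS" and parsed[2] != 0:
            hits.append((parsed[1].get("gpu"), parsed[2]))
    return hits


SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",device="nvidia0",modelName="NVIDIA L4"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",device="nvidia1",modelName="NVIDIA L4"} 79
"""
# The value 79 on the last line is hypothetical, added only to exercise the check.

print(nonzero_xid(SAMPLE_METRICS))
```

If the metric stays at 0 even while the kernel log reports an XID event, that would suggest the exporter is defaulting rather than reading a real value.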

@PrakChandra

I am also facing the same issue where the logs are in error state.

When I change the tag of the dcgm-exporter image to latest (nvcr.io/nvidia/k8s/dcgm-exporter:latest), I see the following logs:

time="2024-05-24T08:48:21Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T08:48:21Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T08:48:21Z" level=info msg="Collecting DCP Metrics"
time="2024-05-24T08:48:21Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-05-24T08:48:21Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-24T08:48:21Z" level=info msg="Pipeline starting"
time="2024-05-24T08:48:21Z" level=info msg="Starting webserver"

However, I am not getting the metrics in Grafana, and the nv-hostengine logs do not look good:

2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2


The .csv file is as follows

root@dcgm-exporter-28jbg:/etc/dcgm-exporter# cat default-counters.csv
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

5 participants