Description
Describe the bug
I am running a training script and selecting the GPU to use with --device. However, the process allocates memory on every GPU in the machine, even though I request only one.
Steps to reproduce
Run the training with:
python train.py --task Isaac-Cartpole-v0 --num_envs 32 --headless --device cuda:1
Then check nvidia-smi.
Before starting the training, nvidia-smi reports the following. Because this is a shared machine, I cannot show a baseline in which all GPUs are idle:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:21:00.0 Off |                    0 |
| N/A   46C    P0            114W /  350W |   18748MiB /  46068MiB |     36%      Default |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:C1:00.0 Off |                    0 |
| N/A   30C    P8             32W /  350W |       2MiB /  46068MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:E1:00.0 Off |                    0 |
| N/A   45C    P0            143W /  350W |   43829MiB /  46068MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|=========================================================================================|
|    0   N/A  N/A   1757715    C+G   ...b/_isaac_sim/kit/python/bin/python3      18707MiB |
|    2   N/A  N/A   2288867      C   python                                      43820MiB |
+-----------------------------------------------------------------------------------------+
However, after launching the command:
python train.py --task Isaac-Cartpole-v0 --num_envs 32 --headless --device cuda:1
I observe that the training process (PID 1949442) also allocates a small amount of memory on GPUs 0 and 2 (569 MiB and 461 MiB), even though only GPU 1 was requested:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:21:00.0 Off |                    0 |
| N/A   46C    P0            118W /  350W |   19347MiB /  46068MiB |     63%      Default |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:C1:00.0 Off |                    0 |
| N/A   32C    P0             82W /  350W |    2688MiB /  46068MiB |      8%      Default |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:E1:00.0 Off |                    0 |
| N/A   45C    P0            106W /  350W |   44317MiB /  46068MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|=========================================================================================|
|    0   N/A  N/A   1757715    C+G   ...b/_isaac_sim/kit/python/bin/python3      18707MiB |
|    0   N/A  N/A   1949442    C+G   ...b/_isaac_sim/kit/python/bin/python3        569MiB |
|    1   N/A  N/A   1949442    C+G   ...b/_isaac_sim/kit/python/bin/python3       2653MiB |
|    2   N/A  N/A   1949442    C+G   ...b/_isaac_sim/kit/python/bin/python3        461MiB |
|    2   N/A  N/A   2288867      C   python                                      43820MiB |
+-----------------------------------------------------------------------------------------+
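The per-process rows above can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming the "Processes" table layout printed by this nvidia-smi version (550.107.02) and process names without spaces; `gpus_used_by_pid` is a hypothetical helper, not part of Isaac Lab or nvidia-smi:

```python
import re
from collections import defaultdict

# Matches one row of the nvidia-smi "Processes" table, for example:
# |    1   N/A  N/A   1949442    C+G   ...python3       2653MiB |
# Assumption: the process name column contains no spaces.
ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>\S+)\s+(?P<mem>\d+)MiB\s+\|$"
)


def gpus_used_by_pid(smi_text: str) -> dict[int, dict[int, int]]:
    """Return {pid: {gpu_index: memory_MiB}} parsed from nvidia-smi text."""
    usage = defaultdict(dict)
    for line in smi_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            usage[int(match["pid"])][int(match["gpu"])] = int(match["mem"])
    return dict(usage)


# Process rows copied from the nvidia-smi output above.
sample = """\
|    0   N/A  N/A   1757715    C+G   ...b/_isaac_sim/kit/python/bin/python3      18707MiB |
|    0   N/A  N/A   1949442    C+G   ...b/_isaac_sim/kit/python/bin/python3        569MiB |
|    1   N/A  N/A   1949442    C+G   ...b/_isaac_sim/kit/python/bin/python3       2653MiB |
|    2   N/A  N/A   1949442    C+G   ...b/_isaac_sim/kit/python/bin/python3        461MiB |
|    2   N/A  N/A   2288867      C   python                                      43820MiB |
"""

print(gpus_used_by_pid(sample)[1949442])  # {0: 569, 1: 2653, 2: 461}
```

For the training process (PID 1949442) this reports memory on all three GPUs, which matches the behavior described in this report.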
Output Log
This is the output of the script when I run it (showing that cuda:1 is used):
python train.py --task Isaac-Cartpole-v0 --num_envs 32 --headless --device cuda:1
[INFO][AppLauncher]: Loading experience file: /workspace/isaaclab/apps/isaaclab.python.headless.kit
Loading user config located at: '/isaac-sim/kit/data/Kit/Isaac-Sim/4.5/user.config.json'
[Info] [carb] Logging to file: /isaac-sim/kit/logs/Kit/Isaac-Sim/4.5/kit_20250325_160554.log
2025-03-25 16:05:54 [0ms] [Warning] [omni.kit.app.plugin] No crash reporter present, dumps uploading isn't available.
2025-03-25 16:05:55 [576ms] [Warning] [omni.usd_config.extension] Enable omni.materialx.libs extension to use MaterialX
2025-03-25 16:05:55 [851ms] [Warning] [omni.datastore] OmniHub is inaccessible
2025-03-25 16:05:55 [1,168ms] [Warning] [omni.isaac.dynamic_control] omni.isaac.dynamic_control is deprecated as of Isaac Sim 4.5. No action is needed from end-users.
|---------------------------------------------------------------------------------------------|
| Driver Version: 550.107.02    | Graphics API: Vulkan
|=============================================================================================|
| GPU | Name                             | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                                  |        |     |            | Device-ID | UUID       |
|     |                                  |        |     |            | Bus-ID    |            |
|---------------------------------------------------------------------------------------------|
| 0   | NVIDIA L40S                      | Yes: 0 |     | 46068   MB | 10de      | 0          |
|     |                                  |        |     |            | 26b9      | e0ed5aea.. |
|     |                                  |        |     |            | 21        |            |
|---------------------------------------------------------------------------------------------|
| 1   | NVIDIA L40S                      | Yes: 1 |     | 46068   MB | 10de      | 0          |
|     |                                  |        |     |            | 26b9      | 254755a4.. |
|     |                                  |        |     |            | c1        |            |
|---------------------------------------------------------------------------------------------|
| 2   | NVIDIA L40S                      | Yes: 2 |     | 46068   MB | 10de      | 0          |
|     |                                  |        |     |            | 26b9      | 4cf82512.. |
|     |                                  |        |     |            | e1        |            |
|=============================================================================================|
| OS: 22.04.5 LTS (Jammy Jellyfish) ubuntu, Version: 22.04.5, Kernel: 5.15.0-121-generic
| XServer Vendor: The X.Org Foundation, XServer Version: 12101004 (1.21.1.4)
| Processor: AMD EPYC 9334 32-Core Processor
| Cores: 64 | Logical Cores: 128
|---------------------------------------------------------------------------------------------|
| Total Memory (MB): 1031756 | Free Memory: 978786
| Total Page/Swap (MB): 16383 | Free Page/Swap: 0
|---------------------------------------------------------------------------------------------|
2025-03-25 16:06:01 [6,468ms] [Warning] [gpu.foundation.plugin] ECC is enabled on physical device 0
2025-03-25 16:06:01 [6,468ms] [Warning] [gpu.foundation.plugin] ECC is enabled on physical device 1
2025-03-25 16:06:01 [6,468ms] [Warning] [gpu.foundation.plugin] ECC is enabled on physical device 2
2025-03-25 16:06:01 [6,469ms] [Warning] [gpu.foundation.plugin] IOMMU is enabled.
2025-03-25 16:06:01 [6,469ms] [Warning] [gpu.foundation.plugin] Detected IOMMU is enabled. Running CUDA peer-to-peer bandwidth and latency validation.
Cuda failure ../../../source/plugins/carb.cudainterop/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu:681: 'peer access is already enabled'
2025-03-25 16:06:01 [6,500ms] [Warning] [omni.kvdb.plugin] Disabling key-value database because another kit process is locking it
[INFO]: Parsing configuration from: isaaclab_tasks.manager_based.classic.cartpole.cartpole_env_cfg:CartpoleEnvCfg
[INFO]: Parsing configuration from: isaaclab_tasks.manager_based.classic.cartpole.agents.rsl_rl_ppo_cfg:CartpolePPORunnerCfg
[INFO] Logging experiment in directory: /workspace/isaaclab/scripts/reinforcement_learning/rsl_rl/logs/rsl_rl/cartpole
Exact experiment name requested from command line: 2025-03-25_16-06-01
Setting seed: 42
[INFO]: Base environment:
        Environment device    : cuda:1
        Environment seed      : 42
        Physics step-size     : 0.008333333333333333
        Rendering step-size   : 0.016666666666666666
        Environment step-size : 0.016666666666666666
[INFO]: Time taken for scene creation : 1.724378 seconds
[INFO]: Scene manager:  <class InteractiveScene>
        Number of environments: 32
        Environment spacing   : 4.0
        Source prim name      : /World/envs/env_0
        Global prim paths     : []
        Replicate physics     : True
[INFO]: Starting the simulation. This may take a few seconds. Please wait...
2025-03-25 16:06:04 [9,510ms] [Warning] [isaaclab.actuators.actuator_pd] The <ImplicitActuatorCfg> object has a value for 'effort_limit'. This parameter will be removed in the future. To set the effort limit, please use 'effort_limit_sim' instead.
2025-03-25 16:06:04 [9,510ms] [Warning] [isaaclab.actuators.actuator_pd] The <ImplicitActuatorCfg> object has a value for 'velocity_limit'. Previously, although this value was specified, it was not getting used by implicit actuators. Since this parameter affects the simulation behavior, we continue to not use it. This parameter will be removed in the future. To set the velocity limit, please use 'velocity_limit_sim' instead.
2025-03-25 16:06:04 [9,512ms] [Warning] [isaaclab.actuators.actuator_pd] The <ImplicitActuatorCfg> object has a value for 'effort_limit'. This parameter will be removed in the future. To set the effort limit, please use 'effort_limit_sim' instead.
2025-03-25 16:06:04 [9,512ms] [Warning] [isaaclab.actuators.actuator_pd] The <ImplicitActuatorCfg> object has a value for 'velocity_limit'. Previously, although this value was specified, it was not getting used by implicit actuators. Since this parameter affects the simulation behavior, we continue to not use it. This parameter will be removed in the future. To set the velocity limit, please use 'velocity_limit_sim' instead.
[INFO]: Time taken for simulation start : 0.610869 seconds
System Info
Please describe the characteristics of your environment:
- Commit: https://github.com/isaac-sim/IsaacLab/tree/v2.0.2
- Isaac Sim Version: Isaac Sim 4.5
- OS: Ubuntu 20.04
- GPU: NVIDIA L40S
- CUDA: 12.4
- GPU Driver: 550.107.02
 
Additional Context
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have verified that the issue is not caused by running Isaac Sim itself, but is related to this repository
 
Acceptance Criteria
- Ensure that training runs only on the specified GPU and does not allocate memory or resources on other devices.
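One way this criterion could be expressed as a check is sketched below. It assumes that nvidia-smi GPU indices match the CUDA device indices used by --device, and that per-GPU memory for the training process has already been collected (for example from nvidia-smi). `offending_gpus` is a hypothetical helper, not an Isaac Lab API:

```python
def offending_gpus(requested_device: str, usage_mib: dict[int, int]) -> list[int]:
    """Return GPU indices, other than the requested one, where the process holds memory."""
    # "cuda:1" -> 1. A bare "cuda" is treated as index 0 here (an assumption).
    _, _, index = requested_device.partition(":")
    requested = int(index) if index else 0
    return sorted(
        gpu for gpu, mem in usage_mib.items() if gpu != requested and mem > 0
    )


# Memory (MiB) held by PID 1949442 on each GPU, as reported in this issue.
observed = {0: 569, 1: 2653, 2: 461}

print(offending_gpus("cuda:1", observed))  # [0, 2]
```

The criterion would be met when the returned list is empty for the training process.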