CUDA error: no kernel image is available for execution on the device [BUG] #5724

Closed
MiyazonoKaori opened this issue Jul 4, 2024 · 5 comments
Labels: bug, training

Comments

MiyazonoKaori commented Jul 4, 2024

[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   pld_enabled .................. False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   pld_params ................... False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   prescale_gradients ........... False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   scheduler_name ............... None
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   scheduler_params ............. None
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   seq_parallel_communication_data_type  torch.float32
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   sparse_attention ............. None
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   sparse_gradients_enabled ..... False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   steps_per_print .............. inf
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   timers_config ................ enabled=True synchronized=True
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   train_batch_size ............. 80
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   train_micro_batch_size_per_gpu  1
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   use_data_before_expert_parallel_  False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   use_node_local_storage ....... False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   wall_clock_breakdown ......... False
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   weight_quantization_config ... None
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   world_size ................... 8
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   zero_allow_untested_optimizer  True
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   zero_enabled ................. True
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   zero_force_ds_cpu_optimizer .. True
[2024-07-04 11:42:43,784] [INFO] [config.py:1004:print]   zero_optimization_stage ...... 3
[2024-07-04 11:42:43,784] [INFO] [config.py:990:print_user_config]   json = {
    "train_batch_size": 80, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 10, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "bf16": {
        "enabled": true
    }, 
    "fp16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
Training started
user:21411:21881 [7] NCCL INFO Using non-device net plugin version 0
user:21404:21874 [0] NCCL INFO Using non-device net plugin version 0
user:21410:21879 [6] NCCL INFO Using non-device net plugin version 0
user:21409:21878 [5] NCCL INFO Using non-device net plugin version 0
user:21410:21879 [6] NCCL INFO Using network IB
user:21405:21880 [1] NCCL INFO Using non-device net plugin version 0
user:21408:21875 [4] NCCL INFO Using non-device net plugin version 0
user:21411:21881 [7] NCCL INFO Using network IB
user:21404:21874 [0] NCCL INFO Using network IB
user:21407:21877 [3] NCCL INFO Using non-device net plugin version 0
user:21406:21876 [2] NCCL INFO Using non-device net plugin version 0
user:21409:21878 [5] NCCL INFO Using network IB
user:21405:21880 [1] NCCL INFO Using network IB
user:21408:21875 [4] NCCL INFO Using network IB
user:21407:21877 [3] NCCL INFO Using network IB
user:21406:21876 [2] NCCL INFO Using network IB
user:21404:21874 [0] NCCL INFO bootstrapSplit: comm 0x7f52c40a14a0 parent 0xb5ea260 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
user:21411:21881 [7] NCCL INFO bootstrapSplit: comm 0x7f8e700a1920 parent 0xc534c40 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
user:21410:21879 [6] NCCL INFO bootstrapSplit: comm 0x7fa4740a14d0 parent 0xc423a50 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
user:21404:21874 [0] NCCL INFO comm 0x7f52c40a14a0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x1fdb679bccd41959 - Init START
user:21408:21875 [4] NCCL INFO bootstrapSplit: comm 0x7f9fc80a1460 parent 0xbe430d0 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
user:21407:21877 [3] NCCL INFO bootstrapSplit: comm 0x7f3f640a1460 parent 0xc68c080 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
user:21411:21881 [7] NCCL INFO comm 0x7f8e700a1920 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e4000 commId 0x1fdb679bccd41959 - Init START
user:21409:21878 [5] NCCL INFO bootstrapSplit: comm 0x7f7b800a14a0 parent 0xc6707c0 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
user:21405:21880 [1] NCCL INFO bootstrapSplit: comm 0x7f36480a14a0 parent 0xc76dd70 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
user:21406:21876 [2] NCCL INFO bootstrapSplit: comm 0x7fe1d40a12a0 parent 0xc731630 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
user:21410:21879 [6] NCCL INFO comm 0x7fa4740a14d0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bf000 commId 0x1fdb679bccd41959 - Init START
user:21408:21875 [4] NCCL INFO comm 0x7f9fc80a1460 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x1fdb679bccd41959 - Init START
user:21407:21877 [3] NCCL INFO comm 0x7f3f640a1460 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 66000 commId 0x1fdb679bccd41959 - Init START
user:21409:21878 [5] NCCL INFO comm 0x7f7b800a14a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0x1fdb679bccd41959 - Init START
user:21405:21880 [1] NCCL INFO comm 0x7f36480a14a0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2d000 commId 0x1fdb679bccd41959 - Init START
user:21406:21876 [2] NCCL INFO comm 0x7fe1d40a12a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3f000 commId 0x1fdb679bccd41959 - Init START
user:21404:21874 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:21404:21874 [0] NCCL INFO NVLS multicast support is available on dev 0
user:21410:21879 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
user:21410:21879 [6] NCCL INFO NVLS multicast support is available on dev 6
user:21408:21875 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
user:21408:21875 [4] NCCL INFO NVLS multicast support is available on dev 4
user:21407:21877 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
user:21409:21878 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:21409:21878 [5] NCCL INFO NVLS multicast support is available on dev 5
user:21405:21880 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
user:21405:21880 [1] NCCL INFO NVLS multicast support is available on dev 1
user:21411:21881 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:21411:21881 [7] NCCL INFO NVLS multicast support is available on dev 7
user:21406:21876 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:21407:21877 [3] NCCL INFO NVLS multicast support is available on dev 3
user:21406:21876 [2] NCCL INFO NVLS multicast support is available on dev 2
user:21406:21876 [2] NCCL INFO comm 0x7fe1d40a12a0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
user:21405:21880 [1] NCCL INFO comm 0x7f36480a14a0 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
user:21404:21874 [0] NCCL INFO comm 0x7f52c40a14a0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
user:21406:21876 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
user:21411:21881 [7] NCCL INFO comm 0x7f8e700a1920 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
user:21410:21879 [6] NCCL INFO comm 0x7fa4740a14d0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
user:21405:21880 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
user:21406:21876 [2] NCCL INFO P2P Chunksize set to 524288
user:21407:21877 [3] NCCL INFO comm 0x7f3f640a1460 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
user:21409:21878 [5] NCCL INFO comm 0x7f7b800a14a0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
user:21408:21875 [4] NCCL INFO comm 0x7f9fc80a1460 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
user:21404:21874 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
user:21405:21880 [1] NCCL INFO P2P Chunksize set to 524288
user:21411:21881 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
user:21410:21879 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
user:21404:21874 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
user:21411:21881 [7] NCCL INFO P2P Chunksize set to 524288
user:21404:21874 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
user:21409:21878 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
user:21410:21879 [6] NCCL INFO P2P Chunksize set to 524288
user:21408:21875 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
user:21407:21877 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
user:21404:21874 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
user:21407:21877 [3] NCCL INFO P2P Chunksize set to 524288
user:21409:21878 [5] NCCL INFO P2P Chunksize set to 524288
user:21408:21875 [4] NCCL INFO P2P Chunksize set to 524288
user:21404:21874 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
user:21404:21874 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
user:21404:21874 [0] NCCL INFO P2P Chunksize set to 524288
user:21404:21874 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21404:21874 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Connected all rings
user:21410:21879 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Connected all rings
user:21410:21879 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21410:21879 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 16/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 17/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 18/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Connected all rings
user:21407:21877 [3] NCCL INFO Connected all rings
user:21409:21878 [5] NCCL INFO Channel 19/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 20/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 21/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Connected all rings
user:21409:21878 [5] NCCL INFO Channel 22/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21409:21878 [5] NCCL INFO Channel 23/0 : 5[5] -> 4[4] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Connected all rings
user:21404:21874 [0] NCCL INFO Connected all rings
user:21411:21881 [7] NCCL INFO Connected all rings
user:21411:21881 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 16/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 17/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 18/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 19/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 20/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 21/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 22/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Channel 23/0 : 7[7] -> 6[6] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 16/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 17/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 18/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 19/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 20/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 21/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 22/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21408:21875 [4] NCCL INFO Channel 23/0 : 4[4] -> 3[3] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21405:21880 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 16/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21406:21876 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 17/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 18/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 19/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21407:21877 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/CUMEM
user:21411:21881 [7] NCCL INFO Connected all trees
user:21411:21881 [7] NCCL INFO NVLS comm 0x7f8e700a1920 headRank 7 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21410:21879 [6] NCCL INFO Connected all trees
user:21410:21879 [6] NCCL INFO NVLS comm 0x7fa4740a14d0 headRank 6 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21404:21874 [0] NCCL INFO Connected all trees
user:21409:21878 [5] NCCL INFO Connected all trees
user:21405:21880 [1] NCCL INFO Connected all trees
user:21404:21874 [0] NCCL INFO NVLS comm 0x7f52c40a14a0 headRank 0 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21405:21880 [1] NCCL INFO NVLS comm 0x7f36480a14a0 headRank 1 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21409:21878 [5] NCCL INFO NVLS comm 0x7f7b800a14a0 headRank 5 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21406:21876 [2] NCCL INFO Connected all trees
user:21408:21875 [4] NCCL INFO Connected all trees
user:21407:21877 [3] NCCL INFO Connected all trees
user:21408:21875 [4] NCCL INFO NVLS comm 0x7f9fc80a1460 headRank 4 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21407:21877 [3] NCCL INFO NVLS comm 0x7f3f640a1460 headRank 3 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21406:21876 [2] NCCL INFO NVLS comm 0x7fe1d40a12a0 headRank 2 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
user:21405:21880 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21405:21880 [1] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21406:21876 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21406:21876 [2] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21407:21877 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21407:21877 [3] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21408:21875 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21408:21875 [4] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21409:21878 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21409:21878 [5] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21410:21879 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21410:21879 [6] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21411:21881 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21411:21881 [7] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21404:21874 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:21404:21874 [0] NCCL INFO 24 coll channels, 0 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
user:21407:21877 [3] NCCL INFO comm 0x7f3f640a1460 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 66000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21411:21881 [7] NCCL INFO comm 0x7f8e700a1920 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e4000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21409:21878 [5] NCCL INFO comm 0x7f7b800a14a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21405:21880 [1] NCCL INFO comm 0x7f36480a14a0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2d000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21406:21876 [2] NCCL INFO comm 0x7fe1d40a12a0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3f000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21410:21879 [6] NCCL INFO comm 0x7fa4740a14d0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId bf000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21408:21875 [4] NCCL INFO comm 0x7f9fc80a1460 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9b000 commId 0x1fdb679bccd41959 - Init COMPLETE
user:21404:21874 [0] NCCL INFO comm 0x7f52c40a14a0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 19000 commId 0x1fdb679bccd41959 - Init COMPLETE
2024-07-04 11:42:54 - epoch: 0, current step: 1, total step: 70000
2024-07-04 11:42:58 - epoch: 0, current step: 2, total step: 70000
2024-07-04 11:43:01 - epoch: 0, current step: 3, total step: 70000
2024-07-04 11:43:04 - epoch: 0, current step: 4, total step: 70000
2024-07-04 11:43:08 - epoch: 0, current step: 5, total step: 70000
2024-07-04 11:43:11 - epoch: 0, current step: 6, total step: 70000
2024-07-04 11:43:14 - epoch: 0, current step: 7, total step: 70000
2024-07-04 11:43:18 - epoch: 0, current step: 8, total step: 70000
2024-07-04 11:43:21 - epoch: 0, current step: 9, total step: 70000
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/fineturn/train.py", line 750, in <module>
[rank7]:     app.run(main)
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 308, in run
[rank7]:     _run_main(main, args)
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[rank7]:     sys.exit(main(argv))
[rank7]:              ^^^^^^^^^^
[rank7]:   File "/home/fineturn/train.py", line 746, in main
[rank7]:     trainer.train()
[rank7]:   File "/home/fineturn/trainer.py", line 312, in train
[rank7]:     losses = self.train_step(batch)
[rank7]:              ^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/home/fineturn/trainer.py", line 274, in train_step
[rank7]:     self.accelerator.backward(total_loss)
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2117, in backward
[rank7]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank7]:     self.engine.step()
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2161, in step
[rank7]:     self._take_model_step(lr_kwargs)
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2067, in _take_model_step
[rank7]:     self.optimizer.step()
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank7]:     ret_val = func(*args, **kwargs)
[rank7]:               ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2050, in step
[rank7]:     self._optimizer_step(sub_group_id)
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 947, in _optimizer_step
[rank7]:     self.optimizer.step()
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank7]:     return wrapped(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank7]:     out = func(*args, **kwargs)
[rank7]:           ^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
[rank7]:     multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
[rank7]:   File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
[rank7]:     return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank7]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank7]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank7]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
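
The traceback ends inside DeepSpeed's fused Adam op (`multi_tensor_adam`). This CUDA error generally means the compiled binary contains no code for the GPU's compute capability, which is 9.0 on H100. The following is a minimal sketch, not part of the original report, for comparing a device's capability with a list of build targets. `arch_covers_device` is a hypothetical helper that expects `sm_XY` / `compute_XY` names. `torch.cuda.get_arch_list()` describes only PyTorch's own build, so for a DeepSpeed op the targets would have to be taken from that op's build (for example the `TORCH_CUDA_ARCH_LIST` in effect at compile time, rewritten as `sm_XY` names).

```python
def arch_covers_device(capability, arch_list):
    """Return True if a build targeting arch_list can plausibly run on the device.

    capability: (major, minor) tuple, e.g. (9, 0) for H100.
    arch_list:  names such as "sm_80" (binary code) or "compute_80" (PTX).
    """
    major, minor = capability
    device = major * 10 + minor
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        if not num.isdigit():
            # Skip variants such as "sm_90a" that this sketch does not model.
            continue
        target = int(num)
        # Binary code runs on the same major version with an equal or higher minor.
        if kind == "sm" and target // 10 == major and target <= device:
            return True
        # PTX can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and target <= device:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        archs = torch.cuda.get_arch_list()  # PyTorch's build targets only
        for idx in range(torch.cuda.device_count()):
            cap = torch.cuda.get_device_capability(idx)
            print(idx, cap, arch_covers_device(cap, archs))
```

If every device is covered by PyTorch's own list, attention shifts to how the DeepSpeed extension itself was compiled.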


Ubuntu 20.04
8 NVIDIA H100 GPUs
nvcc --version: cuda_12.1
nvidia-smi CUDA version: 12.2

torch==2.3.0
deepspeed==0.14.3
accelerate==0.30.1

gradient_accumulation_steps = 10

Reinstalling DeepSpeed with any of the following commands gives the same error:
DS_BUILD_FUSED_ADAM=1 DS_BUILD_OPS=1 python -m pip install deepspeed==0.14.3
DS_BUILD_FUSED_ADAM=1 python -m pip install deepspeed==0.14.3
DS_BUILD_OPS=1 python -m pip install deepspeed==0.14.3
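
None of the three commands above states which GPU architectures the ops are compiled for. DeepSpeed's installation documentation describes the `TORCH_CUDA_ARCH_LIST` environment variable for that purpose. The sketch below assumes `9.0` is the right target for H100 and only composes the command rather than running it. `build_install_command` is a hypothetical helper, and whether this resolves the error is not confirmed anywhere in this thread.

```python
import sys


def build_install_command(version, arch_list="9.0",
                          build_flags=("DS_BUILD_FUSED_ADAM",)):
    """Compose (env, argv) for a pre-built DeepSpeed install; nothing is executed."""
    # Each DS_BUILD_* flag asks DeepSpeed to compile that op at install time.
    env = {flag: "1" for flag in build_flags}
    # Assumption: compute capability 9.0 (H100) is the intended target.
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    argv = [sys.executable, "-m", "pip", "install",
            "--no-cache-dir", f"deepspeed=={version}"]
    return env, argv


if __name__ == "__main__":
    env, argv = build_install_command("0.14.3")
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, " ".join(argv))
```

To actually run it, the pair can be passed to `subprocess.run(argv, env={**os.environ, **env}, check=True)`.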

MiyazonoKaori added the bug and training labels on Jul 4, 2024
@getao

getao commented Sep 4, 2024

Hi, did you solve the problem? I encountered the same issue. Could you please provide a solution to this?

@MiyazonoKaori
Author

> Hi, did you solve the problem? I encountered the same issue. Could you please provide a solution to this?

Reinstalling DeepSpeed fixed the issue:

python -m pip uninstall deepspeed
python -m pip cache remove deepspeed
DS_BUILD_CPU_ADAM=1 python -m pip install deepspeed==0.14.4
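
One place the pip commands above do not touch is PyTorch's JIT extension cache, where ops compiled at runtime are stored. The sketch below, which is an addition and not something the thread establishes as the cause, lists cached builds of an op so they can be inspected or deleted by hand. It assumes the Linux default location (`~/.cache/torch_extensions`, or `TORCH_EXTENSIONS_DIR` when set) and that the op directory is named `fused_adam`. Both helper functions are hypothetical.

```python
import os
from pathlib import Path


def extensions_root(env=None):
    """Return the directory where PyTorch is assumed to cache JIT-built extensions."""
    env = os.environ if env is None else env
    override = env.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    cache_home = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache_home) / "torch_extensions"


def find_cached_builds(root, op_name="fused_adam"):
    """List directories named op_name directly under root or one level below it."""
    root = Path(root)
    if not root.is_dir():
        return []
    hits = [p for p in root.glob(op_name) if p.is_dir()]
    # The default layout adds a per-Python/CUDA subfolder such as py311_cu121.
    hits += [p for p in root.glob(f"*/{op_name}") if p.is_dir()]
    return sorted(hits)


if __name__ == "__main__":
    for path in find_cached_builds(extensions_root()):
        print(path)
```

The script only prints paths. Removing a listed directory forces the op to be rebuilt the next time it is loaded.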

@getao

getao commented Sep 4, 2024

Thank you. So, the key is setting DS_BUILD_CPU_ADAM=1?

@MiyazonoKaori
Author

> Thank you. So, the key is setting DS_BUILD_CPU_ADAM=1?

Both steps: remove the pip cache and set DS_BUILD_CPU_ADAM=1 when reinstalling.

@getao

getao commented Sep 17, 2024

> DS_BUILD_CPU_ADAM=1 python -m pip install deepspeed==0.14.4

Thank you. I tried this but it still doesn't work.
