When running step 3 with the following script, training hangs. Is there any way to fix this?
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=facebook/opt-2.7b
CRITIC_MODEL_PATH=facebook/opt-350m
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output_opt6.7_lora
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 4 \
   --per_device_training_batch_size 4 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --actor_dropout 0.0 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_lora_dim 128 \
   --output_dir $OUTPUT \
    &> $OUTPUT/training_2.7b_lora_my.log
[2024-01-02 07:29:30,646] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:32,889] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-02 07:29:32,944] [INFO] [runner.py:571:main] cmd = /opt/conda/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path facebook/opt-2.7b --critic_model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_generation_batch_size 4 --per_device_training_batch_size 4 --generation_batches 1 --ppo_epochs 1 --max_answer_seq_len 256 --max_prompt_seq_len 256 --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 --actor_weight_decay 0.1 --critic_weight_decay 0.1 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --actor_gradient_checkpointing --actor_dropout 0.0 --num_warmup_steps 100 --deepspeed --seed 1234 --enable_hybrid_engine --actor_zero_stage 3 --critic_zero_stage 3 --actor_lora_dim 128 --output_dir ./output_opt6.7_lora
[2024-01-02 07:29:34,584] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:36,227] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.14.3
[2024-01-02 07:29:36,227] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-01-02 07:29:36,228] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-01-02 07:29:36,228] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-01-02 07:29:36,228] [INFO] [launch.py:163:main] dist_world_size=8
[2024-01-02 07:29:36,228] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-01-02 07:29:39,012] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,061] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,069] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,094] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,146] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,166] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,226] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,234] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-01-02 07:29:41,543] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-01-02 07:29:41,979] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,005] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,088] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,165] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,180] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,184] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,184] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-02 07:29:42,187] [INFO] [comm.py:637:init_distributed] cdb=None
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] ofi_init:1304 NCCL WARN NET/OFI Only EFA provider is supported
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] ofi_init:1355 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/IB : No device found.
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]veth-app1-2:169.255.254.2<0>
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Using network Socket
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Connected all rings
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Connected all trees
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO comm 0x55ebcc06e050 rank 0 nranks 8 cudaDev 0 busId 101c0 - Init COMPLETE
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
************************[start] Initializing Actor Model [start] *************************
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
[2024-01-02 07:30:41,488] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 517, num_elems = 2.78B
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
[2024-01-02 07:40:06,091] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/opt/conda/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 143.40 GB
Deleting the cached PyTorch extension builds may help:
rm -rf /root/.cache/torch_extensions/py39_cu117
For what it's worth, after I switched to another instance it no longer hung. FYI.
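For anyone landing here with the same symptom: the training log above stops right after the "Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root..." lines, which is the point where DeepSpeed ops get JIT-built. Below is a minimal sketch of the cleanup as a reusable function. It assumes (this is a guess, not something confirmed in this thread) that the hang comes from a stale file named lock left in an op's build directory by an interrupted build, which other ranks then wait on. clear_torch_ext_cache is a hypothetical helper name, not a DeepSpeed or PyTorch command, and the default path is simply the one from the log.

```shell
#!/bin/bash
# Hypothetical helper (not part of DeepSpeed or PyTorch): list leftover "lock"
# files in a PyTorch JIT extension cache, then delete the whole cache so the
# ops are rebuilt from scratch on the next launch.
clear_torch_ext_cache() {
    # Default is the directory printed in the log above; pass another path
    # if your extensions root differs.
    local cache_dir="${1:-/root/.cache/torch_extensions/py39_cu117}"

    if [ ! -d "$cache_dir" ]; then
        echo "no cache directory at $cache_dir"
        return 0
    fi

    # Assumption: a file named "lock" inside an op's build directory while no
    # build is running is left over from an interrupted compile.
    find "$cache_dir" -type f -name lock -print

    rm -rf -- "$cache_dir"
    echo "removed $cache_dir"
}
```

Call clear_torch_ext_cache (optionally with a path) only when no training processes are running, then relaunch the step 3 script. The first run afterwards will take longer because the ops are compiled again.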
torch: 1.13.1
CUDA: 11.7
GPU: 8 x A100
ds_report: see the output above