
DeepSpeed gets stuck when training #12418

Closed
SamsTheGreatest opened this issue Jun 29, 2021 · 33 comments

@SamsTheGreatest

Environment info

  • transformers version: 4.8.1
  • Platform: Linux-4.15.0-140-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: single gpu

Who can help

@stas00

Information

Trying to replicate this, I am using a 125M GPT Neo model and fine-tuning it with the Trainer. The training arguments include a DeepSpeed option. The Trainer gets stuck with:

[2021-06-29 14:29:44,747] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.4.1, git-hash=unknown, git-branch=unknown
[2021-06-29 14:29:44,757] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1

ds_report gives:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/torch']
torch version .................... 1.9.0
torch cuda version ............... 11.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.4.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.1

Is there a way to debug this?

To Replicate

I modified the original code slightly to remove the errors:

training_args = tr.TrainingArguments(output_dir=save_dir, num_train_epochs=5, logging_steps=300, save_steps=300,
                                     per_device_train_batch_size=1, per_device_eval_batch_size=1, warmup_steps=50,
                                     learning_rate=0.001, adam_epsilon=1e-06, fp16=True,
                                     weight_decay=0.01, logging_dir=f'{save_dir}/logs', deepspeed='./ds_config.json')

and ds_config.json is now:

{
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "zero_optimization": {
    "stage": 3,
    "cpu_offload": true,
    "cpu_offload_params" : true,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.001,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-6
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 50
    }
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print":1
}
@stas00
Contributor

stas00 commented Jun 29, 2021

I added your changes to the original and I am not able to reproduce the hanging with "EleutherAI/gpt-neo-2.7B" as it is in the original.

I'm on transformers master, but I don't think it makes any difference.

If you want me to try anything else please fork https://github.com/dredwardhyde/gpt-neo-fine-tuning-example/, apply whatever changes you need and share the link to your fork.

To debug the hanging, do:

pip install py-spy
sudo py-spy dump --pid pid_of_the_hanging_process

and share the backtraces.

Unrelated, if you could make a PR to https://github.com/dredwardhyde/gpt-neo-fine-tuning-example/ with the new ds_config.json it'd help others.

@SamsTheGreatest
Author

Thanks @stas00,

I installed transformers with pip.

Created a simple example and packed everything into a repo along with all the requirements. Attaching the link to the repo here: https://github.com/SamsTheGreatest/gpt-neo-with-deepspeed.git. I have put other relevant info in the README. Hopefully, it will help to shine some light on this.

Unfortunately, I don't have sudo access. Maybe there is another way to backtrace it? If I could interrupt the kernel in Jupyter, it would show me some traceback; however, in this case, once I start the Trainer, I can't even interrupt the kernel anymore.

@stas00
Contributor

stas00 commented Jun 29, 2021

That's a wonderful way to do it, @SamsTheGreatest - thank you!

OK, so I ran your fork and it's running just fine, i.e. it started training - I didn't wait for it to finish.

wrt debugging:

  1. try py-spy w/o sudo - if your system has ptrace_scope set to 0:

cat /proc/sys/kernel/yama/ptrace_scope

you don't need sudo to attach to the process.

  2. if it's >0, then use faulthandler

add this to your code:

import faulthandler
faulthandler.dump_traceback_later(20, repeat=True)

and when you run it, it will dump the backtrace of each thread every 20 sec.

(I haven't tried it in the notebook, but it should probably work just fine)
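
For instance, a minimal sketch of how this might look in the cell that starts training (trainer here is the Trainer instance from your repro, and the 20-second interval is arbitrary):

import faulthandler

# dump the Python backtrace of every thread to stderr every 20 seconds,
# repeating until cancelled - handy for spotting where the run is stuck
faulthandler.dump_traceback_later(20, repeat=True)

trainer.train()  # the call that appears to hang

# stop the periodic dumps once training returns
faulthandler.cancel_dump_traceback_later()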

@SamsTheGreatest
Author

Thanks @stas00, that's very detailed!

cat /proc/sys/kernel/yama/ptrace_scope yields 1, so I'll do it with faulthandler.

Accidentally found out that when removing the DeepSpeed option from the trainer, it still gets stuck. Removing

# os.environ['MASTER_ADDR'] = 'localhost'
# os.environ['MASTER_PORT'] = '9994'
# os.environ['RANK'] = "0"
# os.environ['LOCAL_RANK'] = "0"
# os.environ['WORLD_SIZE'] = "1"

starts training as expected again. I also tried letting the settings be discovered via mpi4py, as you wrote in the original post, but it says mpi4py needs to be installed (can't install it as I need sudo... again). Could it all be due to the fact that I'm not running things on my own machine directly but on a kubeflow notebook server?

I have dumped the traceback files from all 3 experiments into the same repo. FP16 is on during all of them. "No settings" means that os.environ is commented out. I have also labeled the start of training with \n\nNow training\n\n.

Thanks again

@stas00
Contributor

stas00 commented Jun 30, 2021

You don't need sudo to install mpi4py - this is just pip install mpi4py

Perhaps you're the first one to run deepspeed on kubeflow; looking at the traces, it seems like it has some distributed issues there.

Thank you for making the traces. It seems to be stuck at:

 File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1080 in broadcast

It might be something specific to their Jupyter setup? If I understand correctly, kubeflow is notebook-only, right?

Can you run deepspeed from the command line? e.g. as in this example? https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb

All the os.environ code is there because we are emulating a distributed launcher in the notebook (instead of running torch.distributed.launch or the deepspeed launcher).

Also try a different port?

A different address? Perhaps 127.0.0.1 or find its IP address?

It's very possible that the distributed network gets stuck because of either of these 2 as it can't network.

Deepspeed requires a fully distributed setup even with just one gpu, since it wasn't really designed with that kind of situation in mind (but perhaps it could be).
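
If it helps to isolate things, here is a minimal, deepspeed-free sketch of the same emulated setup (localhost/9994 are the values from your repro and may need adjusting):

import os
import torch.distributed as dist

# emulate the distributed launcher for a single process, as in the notebook
os.environ["MASTER_ADDR"] = "localhost"  # or try 127.0.0.1 / the pod's IP
os.environ["MASTER_PORT"] = "9994"       # or another free port
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# if this call never returns, the hang is in the distributed networking itself,
# not in deepspeed or the HF Trainer
dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("nccl init ok")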

@SamsTheGreatest
Author

SamsTheGreatest commented Jul 5, 2021

Hi @stas00,

Sorry for the long wait. Tried other IPs, but they all yield permission errors and such. The correct IP seems to be localhost or the IP of the Kubernetes pod. These are the only options I have tried that don't yield errors; however, the script still hangs at the same spot.

The notebook you referenced hangs at the same spot, unfortunately.

Downloading: 5.40kB [00:00, 3.13MB/s]
Using amp fp16 backend
[2021-07-05 08:20:38,917] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-07-05 08:20:43,129] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
^CKilling subprocess 4452
Main process received SIGINT, exiting
Traceback (most recent call last):
  File "/home/jovyan/anaconda3/envs/esemala/bin/deepspeed", line 6, in <module>
    main()
  File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/launcher/runner.py", line 362, in main
    result.wait()
  File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
(esemala) tf-docker ~/transformers >

(Had to keyboard-interrupt it)

I have installed transformers and deepspeed as suggested in the notebook.

PS: quick suggestion: in the last cell, when running the example, one might consider changing rm -r output_dir to rm -rf output_dir so that we don't get an error if the directory does not exist.

Could we investigate this a little further? Maybe there is something wrong with the mismatch between the installed cuda and cuda-toolkit? nvcc -V yields 10.1, however the latest pytorch is installed for 11.1. Following this tutorial now, instead of installing the OPs for Deepspeed just in time, I tried DS_BUILD_OPS=1 pip install ., however it says

Exception: Installed CUDA version 10.1 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.

@stas00
Contributor

stas00 commented Jul 5, 2021

So the issue in this one is in launching a pytorch subprocess here.

Is there a way I could have direct access to the same environment?

PS: quick suggestion: in the last cell, when running the example, one might consider changing rm -r output_dir to rm -rf output_dir so that we don't get an error if the directory does not exist.

That's a great suggestion, @SamsTheGreatest - done!

Exception: Installed CUDA version 10.1 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.

You need to install pytorch built with cuda 10 for that. As of this writing this is done with:

pip install torch torchvision torchaudio

Normally you'd find the right command here: https://pytorch.org/get-started/locally/

DS will handle a minor version mismatch no problem.

@SamsTheGreatest
Author

SamsTheGreatest commented Jul 7, 2021

@stas00, unfortunately I am not authorized to do that... but I can provide you with the exact docker image I am using. Here is a link: https://github.com/kubeflow/kubeflow/tree/v1.2.0/components/tensorflow-notebook-image

I tried installing torch for 10.1; the process still hangs at

File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1080 in broadcast

just as before.

Now, I had to rebuild the docker container as the sudo password wasn't set. I am now root, so I installed CUDA 11.1.1 for Linux. All versions are now matching and I managed to build all OPs for deepspeed except async_io (I assume I don't need it atm) using DS_BUILD_OPS=1 pip install .. So now ds_report shows that all OPs are installed and all cuda versions are matching.

Still hangs at the same spot...

Reading through some issues, could it be that it's due to the nccl usage? Is there a trivial way to set the backend to gloo within the notebook I shared with you, @stas00?

@stas00
Contributor

stas00 commented Jul 7, 2021

I'm not succeeding at building that Docker image. If I use build_image.sh it hangs; if I try docker build ., it fails with some deps missing. Do you have a ready docker image I could pull?

Since kubeflow is run in a docker image most likely the issue has something to do with its setup/configuration.

Reading through some issues, could it be that it's due to the nccl usage? Is there a trivial way to set the backend to gloo within the notebook I shared with you, @stas00?

It's very possible. I haven't run into this myself, so I trust your research.

gloo doesn't provide the same functionality as nccl, but it looks like the Deepspeed docs say it should work.

OK, what if you do deepspeed.init_distributed("gloo") here, instead of deepspeed.init_distributed()?


I found this issue microsoft/DeepSpeed#1030 where a user was able to use the gloo backend with Deepspeed.
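
As a quick standalone sanity check (outside the Trainer), something along these lines should tell you whether the gloo backend itself initializes in your environment - a sketch, reusing the emulated launcher variables from earlier:

import os
import deepspeed

# emulate the single-process launcher, as earlier in the thread
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "9994")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# if this returns promptly, gloo is usable here and the hang is nccl-specific
deepspeed.init_distributed("gloo")
print("gloo init ok")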

@SamsTheGreatest
Author

@stas00, I consulted internally again and tried using "gloo" as you specified. Colleagues said they could not manage to run nccl on kubeflow either. Basically, I cloned the transformers repo and changed the training_args as you specified.

Changed the model for the trainer like so too:

trainer = tr.Trainer(model=model.requires_grad_(False), 
                    args=training_args, ..... 

Now, with gloo the code runs a little further!

[2021-07-08 15:28:56,767] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3+c9fee82, git-hash=c9fee82, git-branch=master
[2021-07-08 15:28:56,775] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-07-08 15:28:56,891] [INFO] [engine.py:177:__init__] DeepSpeed Flops Profiler Enabled: False
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-ccb66750b859> in <module>
     10 # Start training process!
     11 
---> 12 trainer.train()
     13 trainer.save_model(save_dir)
     14 tokenizer.save_pretrained(save_dir+'/tokenizer/')

~/anaconda3/envs/esemala/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1122         if args.deepspeed:
   1123             deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
-> 1124                 self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
   1125             )
   1126             self.model = deepspeed_engine.module

~/anaconda3/envs/esemala/lib/python3.7/site-packages/transformers/deepspeed.py in deepspeed_init(trainer, num_training_steps, resume_from_checkpoint)
    369         config_params=config,
    370         optimizer=optimizer,
--> 371         lr_scheduler=lr_scheduler,
    372     )
    373 

~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params)
    134                                  collate_fn=collate_fn,
    135                                  config=config,
--> 136                                  config_params=config_params)
    137     else:
    138         assert mpu is None, "mpu must be None with pipeline parallelism"

~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params, dont_change_device)
    189         self.lr_scheduler = None
    190         if model_parameters or optimizer:
--> 191             self._configure_optimizer(optimizer, model_parameters)
    192             self._configure_lr_scheduler(lr_scheduler)
    193             self._report_progress(0)

~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    701                 logger.info('Using client Optimizer as basic optimizer')
    702         else:
--> 703             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    704             if self.global_rank == 0:
    705                 logger.info(

~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    772                     optimizer = DeepSpeedCPUAdam(model_parameters,
    773                                                  **optimizer_parameters,
--> 774                                                  adamw_mode=effective_adam_w_mode)
    775                 else:
    776                     from deepspeed.ops.adam import FusedAdam

~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
     72                             bias_correction=bias_correction,
     73                             amsgrad=amsgrad)
---> 74         super(DeepSpeedCPUAdam, self).__init__(model_params, default_args)
     75 
     76         self.opt_id = DeepSpeedCPUAdam.optimizer_id

~/anaconda3/envs/esemala/lib/python3.7/site-packages/torch/optim/optimizer.py in __init__(self, params, defaults)
     47         param_groups = list(params)
     48         if len(param_groups) == 0:
---> 49             raise ValueError("optimizer got an empty parameter list")
     50         if not isinstance(param_groups[0], dict):
     51             param_groups = [{'params': param_groups}]

ValueError: optimizer got an empty parameter list

Trying to battle this ValueError now - is it because AdamW was used and now it's DeepSpeedCPUAdam? Shall I be concerned that the CPU is being used?

We are using multi-node with a single GPU in each cluster, so these issues could be arising from such an architecture, but I'm not sure.

I will respond on your request for the Docker image a little later once I get it sorted out.

Thanks again

@SamsTheGreatest
Author

Now, concerning the Docker image. We used the same docker image as the one I shared, but at the end used USER root instead of jovyan.

We also used these commands for this. Sorry I didn't share this earlier; I was not the one involved with the images...

python build_image.py --tf_version=1.15.2 --platform=gpu tf_notebook

pip install --upgrade pip

python3 -m pip install -r tensorflow-notebook-image/requirements.txt

If it helps, I will try building the image and pushing it to docker hub myself, with all necessary requirements (on top of what I gave you, I just installed the necessary version of torch compatible with cuda 10.1, huggingface transformers and deepspeed). But I would likely need some time for this... till next week or so.

@stas00
Contributor

stas00 commented Jul 8, 2021

@SamsTheGreatest, glad to see you made some progress!

Not sure why you needed to turn gradients off - that surely won't work as the optimizer now has no params to optimize, which is probably the reason why you had that most recent failure.


As we are progressing with the diagnosis of the OP, it's becoming clear now that this issue has little to do with transformers (other than having a hardcoded nccl backend) and we should probably try to sort it out on the DeepSpeed Issues side of things. Once sorted out, we can then adjust the HF Trainer to do the right thing as deepspeed needs it.

Could you please open a new issue at https://github.com/microsoft/DeepSpeed/issues and I suppose the topic should be something along the lines of: using deepspeed in env where nccl doesn't work

And then specific sub-issues:

  1. make deepspeed work on kubeflow - nccl-backend hangs - your OP report
  2. make deepspeed work with the 'gloo' backend - your last gloo-specific report DeepSpeed gets stuck when training #12418 (comment)

or perhaps these should be 2 separate issues? I trust your judgment.

And from there let's see what the Deepspeed developers need, i.e. whether they will want the image or they already know what to do.

@SamsTheGreatest
Author

Thanks, @stas00! Yes, it seems reasonable; I will reply to this shortly in a little more detail. Also, I discovered one more thing. Remember I mentioned this:

Accidentally found out that when removing DeepSpeed option from trainer, it still gets stuck.

When trying the same but also changing nccl to gloo in training_args.py, everything gets unstuck as well!

torch.distributed.init_process_group(backend="gloo")
device = torch.device("cuda", self.local_rank)
self._n_gpu = 1

Could we conclude that for some reason nccl doesn't work with the current hardware setup? Could there be a particular reason for that?

@stas00
Contributor

stas00 commented Jul 9, 2021

Great to know that this is not deepspeed specific then - thank you for the experiments, @SamsTheGreatest

I'd say make a short repro script like:

echo 'import torch; torch.distributed.init_process_group(backend="nccl")' > run
python -m torch.distributed.launch --nproc_per_node=2 run

and if it hangs file an issue at pytorch? Hopefully someone on their team has dealt with kubeflow.

It probably has to do with how it builds Docker with regards to pytorch and cuda tools, or the interface to the gpu cards.

For example, what happens if you install the normal pytorch on that kubeflow instance after it was built? That would test whether the issue is with how the pre-built pytorch was created while building the kubeflow image.

@SamsTheGreatest
Author

@stas00

Not sure why you needed to turn gradients off - that surely won't work as the optimizer now has no params to optimize, which is probably the reason why you had that most recent failure.

yes, turning off gradients doesn't make any sense. I was attempting to battle the issue with using the 'gloo' backend that you referred to... not sure how to fix it microsoft/DeepSpeed#1030

@stas00
Contributor

stas00 commented Jul 9, 2021

Also, have a look at the notes on when to use which backend here: https://pytorch.org/docs/stable/distributed.html

Scroll down to "Which backend to use?"

Do any of these ring a bell?


And also these may aid debugging the NCCL issues:

    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=ALL

Finally, you can attach to a hanging process with strace (or start it under strace) and see where it is hanging at the libc level.
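
In the notebook, these can be set from Python before anything distributed is initialized, e.g.:

import os

# verbose NCCL logging; must be set before the first distributed init
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"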

@stas00
Contributor

stas00 commented Jul 9, 2021

@stas00

Not sure why you needed to turn gradients off - that surely won't work as the optimizer now has no params to optimize, which is probably the reason why you had that most recent failure.

yes, turning off gradients doesn't make any sense. I was attempting to battle the issue with using the 'gloo' backend that you referred to... not sure how to fix it microsoft/DeepSpeed#1030

Open a new issue there?

@jeffra
Contributor

jeffra commented Jul 15, 2021

@SamsTheGreatest, trying to get caught up on this thread, but are you able to run NCCL without deepspeed? Even if we can get the gloo backend working, I suspect the performance would not be ideal. Can you try a simple all-reduce test in your environment using NCCL?

We often run this gist on our systems to test basic NCCL functionality: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36

@stas00
Contributor

stas00 commented Jul 15, 2021

Can you try a simple all-reduce test in your environment using NCCL?

So a simple test could be something like:

# test.py
import torch.distributed as dist
import argparse
import torch
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

dist.init_process_group("nccl")
dist.all_reduce(torch.ones(1).to(device), op=dist.ReduceOp.SUM)
# to run
python -m torch.distributed.launch --nproc_per_node=2 test.py

adjust the number of gpus above - probably just 1 in your case. You have only 1 gpu, correct?

Edit: I see you reported earlier 1 gpu per node,

We are using multi-node with single GPU in each cluster, so those issue could be arising from such architecture, but I'm not sure.

so then you need to adapt the above to include --nnodes= as well.
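
For example, a 2-node, 1-gpu-per-node launch could look roughly like this (the address, port and ranks are placeholders; run the command once per node with the matching --node_rank):

# on node 0 (replace 10.0.0.1 with the IP of the rank-0 node)
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=1 \
    --master_addr=10.0.0.1 --master_port=9994 test.py
# on node 1
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=1 \
    --master_addr=10.0.0.1 --master_port=9994 test.py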

@github-actions

github-actions bot commented Aug 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@dseddah

dseddah commented Sep 23, 2021

Hi,
I'm having the same issue when trying to reproduce the Academic-Budget-Bert code.
I've run the provided test.py code and encountered the same behavior:

# test.py
import torch.distributed as dist
import argparse
import torch
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

dist.init_process_group("nccl")
dist.all_reduce(torch.ones(1).to(device), op=dist.ReduceOp.SUM)

^CTraceback (most recent call last):
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/site-packages/torch/distributed/launch.py", line 253, in main
    process.wait()
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/subprocess.py", line 1917, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/ROCQ/alpage/seddah/src/miniconda3/envs/budgetBERT/lib/python3.9/subprocess.py", line 1875, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

So, if anyone has a workaround, that would be great.

Best,
Djamé

@kangqiyue

I think you could try this solution:
rm -rf ~/.cache/torch_extensions/

ref: #12715

@fahadh4ilyas

Is this already solved? I also have this problem when training inside a pod.

@Joshuaclymer

Creating a new pod has solved this issue for me a couple of times.

@loveunk

loveunk commented Dec 29, 2023

try export NCCL_P2P_DISABLE=1, it works for me.

@stas00
Contributor

stas00 commented Dec 29, 2023

It may work but at what cost?

The NCCL_P2P_DISABLE variable disables the peer to peer (P2P) transport, which uses CUDA direct access between GPUs, using NVLink or PCI.

You will lose a lot of performance.

@loveunk

loveunk commented Dec 30, 2023

It may work but at what cost?

The NCCL_P2P_DISABLE variable disables the peer to peer (P2P) transport, which uses CUDA direct access between GPUs, using NVLink or PCI.

You will lose a lot of performance.

I am using ZeRO stage 2 for training on a single host with multiple GPUs, and the performance scale-up is OK for me.
In my case, the training task is GPU-computation-bound, not GPU-communication-bound.

@stas00
Contributor

stas00 commented Dec 30, 2023

If you don't care for your training to finish faster then your approach definitely works.

It's not about whether it's comms-bound or gpu-bound, it's about the wasted time on comms. Please see the diagram at https://github.com/stas00/ml-engineering/tree/master/network#single-node-training to have a better understanding that comms aren't instant.

I was just flagging to future readers that this is not the right solution for many users. Instead they need to figure out what's wrong with their network setup and enjoy the fast P2P comms and faster training time.

@loveunk

loveunk commented Dec 30, 2023

If you don't care for your training to finish faster then your approach definitely works.

It's not about whether it's comms-bound or gpu-bound, it's about the wasted time on comms. Please see the diagram at https://github.com/stas00/ml-engineering/tree/master/network#single-node-training to have a better understanding that comms aren't instant.

I was just flagging to future readers that this is not the right solution for many users. Instead they need to figure out what's wrong with their network setup and enjoy the fast P2P comms and faster training time.

Agree with your points. Thank you for sharing.

@sunwhw

sunwhw commented Feb 21, 2024

torch_extensions

Running on a server that is shared by many users, will this deletion affect other users?

@wentinghome

try export NCCL_P2P_DISABLE=1, it works for me.

Hey, my pod has 8 GPUs; multi-GPU training works on 4 GPUs, but it gets stuck / training does not start on 8 GPUs.
I tried export NCCL_P2P_DISABLE=1, and now neither the 4-GPU nor the 8-GPU setup works anymore. Wondering if there are any suggestions? Thank you.

@chanangad

Facing the same issue.

@amyeroberts
Collaborator

@wentinghome @chanangad Could you open a new issue, including details about the running environment, the error (when it stopped working), and a reproducible code snippet? This helps us track possibly new issues and when they are resolved.
