DeepSpeed gets stuck when training #12418
I added your changes to the original and I am not able to reproduce the hanging with "EleutherAI/gpt-neo-2.7B" as it is in the original. I'm on transformers master, but I don't think it makes any difference. If you want me to try anything else please fork https://github.com/dredwardhyde/gpt-neo-fine-tuning-example/, apply whatever changes you need and share the link to your fork. To debug hanging do:
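For example, something along these lines with py-spy (the exact snippet isn't preserved here; `<PID>` is a placeholder for the hanging process id):

```bash
pip install py-spy
# dump the current stack of every thread in the hanging process
# (may require root or the SYS_PTRACE capability)
py-spy dump --pid <PID>
```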
and share the backtraces. Unrelated, if you could make a PR to https://github.com/dredwardhyde/gpt-neo-fine-tuning-example/ with the new ds_config.json it'd help others. |
Thanks @stas00, I installed it. Created a simple example and packed everything into a repo along with all the requirements. Attaching the link to the repo here: https://github.com/SamsTheGreatest/gpt-neo-with-deepspeed.git. I have put the other relevant info in the README. Hopefully, it will help to shine some light on this. Unfortunately, I don't have sudo access. Maybe there is another way to get a backtrace? If I could interrupt the kernel in Jupyter, it would show me some traceback; however, in this case, when I start the |
That's a wonderful way to do it, @SamsTheGreatest - thank you! OK, so I ran your fork and it's running just fine, i.e. it started training - I didn't wait for it to finish. wrt debugging:
you don't need
add this to your code:
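Most likely a faulthandler-based snippet along these lines (reconstructed here, since the original block isn't preserved):

```python
import faulthandler

# dump the traceback of every thread to stderr every 20 seconds
faulthandler.dump_traceback_later(20, repeat=True)
```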
and when you run it, it will dump the bt for each thread every 20 sec. (I haven't tried it in the notebook, but it should probably work just fine) |
Thanks @stas00, that's very detailed!
Accidentally found out that when removing the DeepSpeed option from the trainer, it still gets stuck. Removing
starts training as expected again. I also tried letting the settings be discovered via
I have dumped the traceback files from all 3 experiments into the same repo. Thanks again |
You don't need
Perhaps you're the first one to run deepspeed on kubeflow; looking at the traces, it seems like it has some distributed issues there. Thank you for making the traces. It seems to be stuck at:
It might be something specific to their jupyter setup? If I understand correctly, kubeflow is notebook only, right? Can you run deepspeed from the command line? e.g. as in this example? https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb All the
Also try a different port? A different address? Perhaps
It's very possible that the distributed network gets stuck because of either of these 2, as it can't network. Deepspeed requires a fully distributed setup even with just one gpu, since it wasn't really designed with that kind of situation in mind (but perhaps it could). |
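For reference, the rendezvous address/port used by torch.distributed can be overridden via environment variables before launching (illustrative values, assuming the default env:// init method):

```bash
export MASTER_ADDR=127.0.0.1   # or the pod's own IP
export MASTER_PORT=29501       # any free port
```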
Hi @stas00, sorry for the long wait. Tried other IPs, but all yield Permission errors and such... The correct IP seems to be localhost or the IP of the Kubernetes Pod. These are the only options I have tried that don't yield errors; however, the script still hangs at the same spot. The notebook you referenced hangs at the same spot, unfortunately.
Downloading: 5.40kB [00:00, 3.13MB/s]
Using amp fp16 backend
[2021-07-05 08:20:38,917] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-07-05 08:20:43,129] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
^CKilling subprocess 4452
Main process received SIGINT, exiting
Traceback (most recent call last):
File "/home/jovyan/anaconda3/envs/esemala/bin/deepspeed", line 6, in <module>
main()
File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/launcher/runner.py", line 362, in main
result.wait()
File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File "/home/jovyan/anaconda3/envs/esemala/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
(esemala) tf-docker ~/transformers >
(Had to keyboard-interrupt it.) I have installed transformers and deepspeed as suggested in the notebook. PS, a quick suggestion: in the last cell, when running the example, one might consider changing
Could we investigate this a little further? Maybe there is something wrong with a mismatch between the installed cuda and cuda-toolkit?
Exception: Installed CUDA version 10.1 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version. |
So the issue in this one is in launching a pytorch subprocess here. Is there a way I could have direct access to the same environment?
That's a great suggestion, @SamsTheGreatest - done!
You need to install pytorch built with cuda 10 for that. As of this writing this is done with:
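For example, something like this (the version pins are illustrative; the page below always has the current command):

```bash
pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
```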
Normally you'd find the right command here: https://pytorch.org/get-started/locally/ DS will handle a minor version mismatch, no problem. |
@stas00, unfortunately, I am not authorized to do that... but I can provide you with the exact docker image I am using. Here is a link: https://github.com/kubeflow/kubeflow/tree/v1.2.0/components/tensorflow-notebook-image I tried installing torch for 10.1; the process still hangs at
just as before. Now, I had to rebuild the docker container as
Still hangs at the same spot... Reading through some issues, could it be that it's due to the |
I'm not succeeding at building that Docker image. If I use
Since kubeflow is run in a docker image, most likely the issue has something to do with its setup/configuration.
It's very possible. I haven't run into this myself, so I trust your research. gloo doesn't provide the same functionality as nccl, but it looks like the Deepspeed docs say it should work. OK, what if you do:
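Presumably something along the lines of forcing the gloo backend when DeepSpeed sets up the distributed group, e.g. (a sketch, not the exact snippet from the thread):

```python
import deepspeed

# initialize the distributed backend with gloo instead of the default nccl
deepspeed.init_distributed(dist_backend="gloo")
```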
I found this issue microsoft/DeepSpeed#1030 where a user was able to use the gloo backend with Deepspeed. |
@stas00 consulted internally again and tried using "gloo" as you specified. Colleagues said they could not manage to run
Changed model for trainer like so too:
trainer = tr.Trainer(model=model.requires_grad_(False),
                     args=training_args, .....
Now, with
[2021-07-08 15:28:56,767] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3+c9fee82, git-hash=c9fee82, git-branch=master
[2021-07-08 15:28:56,775] [INFO] [utils.py:13:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-07-08 15:28:56,891] [INFO] [engine.py:177:__init__] DeepSpeed Flops Profiler Enabled: False
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-ccb66750b859> in <module>
10 # Start training process!
11
---> 12 trainer.train()
13 trainer.save_model(save_dir)
14 tokenizer.save_pretrained(save_dir+'/tokenizer/')
~/anaconda3/envs/esemala/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1122 if args.deepspeed:
1123 deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
-> 1124 self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
1125 )
1126 self.model = deepspeed_engine.module
~/anaconda3/envs/esemala/lib/python3.7/site-packages/transformers/deepspeed.py in deepspeed_init(trainer, num_training_steps, resume_from_checkpoint)
369 config_params=config,
370 optimizer=optimizer,
--> 371 lr_scheduler=lr_scheduler,
372 )
373
~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params)
134 collate_fn=collate_fn,
135 config=config,
--> 136 config_params=config_params)
137 else:
138 assert mpu is None, "mpu must be None with pipeline parallelism"
~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params, dont_change_device)
189 self.lr_scheduler = None
190 if model_parameters or optimizer:
--> 191 self._configure_optimizer(optimizer, model_parameters)
192 self._configure_lr_scheduler(lr_scheduler)
193 self._report_progress(0)
~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
701 logger.info('Using client Optimizer as basic optimizer')
702 else:
--> 703 basic_optimizer = self._configure_basic_optimizer(model_parameters)
704 if self.global_rank == 0:
705 logger.info(
~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
772 optimizer = DeepSpeedCPUAdam(model_parameters,
773 **optimizer_parameters,
--> 774 adamw_mode=effective_adam_w_mode)
775 else:
776 from deepspeed.ops.adam import FusedAdam
~/anaconda3/envs/esemala/lib/python3.7/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
72 bias_correction=bias_correction,
73 amsgrad=amsgrad)
---> 74 super(DeepSpeedCPUAdam, self).__init__(model_params, default_args)
75
76 self.opt_id = DeepSpeedCPUAdam.optimizer_id
~/anaconda3/envs/esemala/lib/python3.7/site-packages/torch/optim/optimizer.py in __init__(self, params, defaults)
47 param_groups = list(params)
48 if len(param_groups) == 0:
---> 49 raise ValueError("optimizer got an empty parameter list")
50 if not isinstance(param_groups[0], dict):
51 param_groups = [{'params': param_groups}]
ValueError: optimizer got an empty parameter list
Trying to battle this value error now; is it because
We are using multi-node with a single GPU in each cluster, so the issue could be arising from such an architecture, but I'm not sure. I will respond to your request for the Docker image a little later, once I get it sorted out. Thanks again |
Now, concerning the Docker image: we used the same docker image as the one I shared, but at the end also used these commands. Sorry I didn't share this earlier; I was not the one involved with the images...
python build_image.py --tf_version=1.15.2 --platform=gpu tf_notebook
pip install --upgrade pip
python3 -m pip install -r tensorflow-notebook-image/requirements.txt
If it helps, I will try building the image and pushing it to docker hub myself, with all the necessary requirements (on top of what I gave you, I just installed the necessary version of torch, compatible with cuda 10.1, huggingface transformers and deepspeed). But I would likely need some time for this... till next week or so |
@SamsTheGreatest, glad to see you made some progress! Not sure why you needed to turn gradients off - that surely won't work, as the optimizer now has no params to optimize, which is probably the reason for that most recent failure. As we are progressing with the diagnosis of the OP, it's becoming clear that this issue has little to do with
Could you please open a new issue at https://github.com/microsoft/DeepSpeed/issues? I suppose the topic should be something along the lines of: using deepspeed in an env where nccl doesn't work. And then the specific sub-issues:
or perhaps these should be 2 separate issues? I trust your judgment. And from there let's see what the Deepspeed developers need, i.e. whether they will want the image or they already know what to do. |
Thanks, @stas00! Yes it seems reasonable, I will reply shortly to this in a little more detail. Also, discovered one more thing. Remember I mentioned this,
When trying the same but also changing
torch.distributed.init_process_group(backend="gloo")
device = torch.device("cuda", self.local_rank)
self._n_gpu = 1
Could we conclude that for some reason |
Great to know that this is not deepspeed specific then - thank you for the experiments, @SamsTheGreatest. I'd say make a short repro script like:
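A minimal sketch of such a repro (assuming a single process, a single GPU, and the default env:// rendezvous):

```python
import os
import torch.distributed as dist

# if this hangs, the problem is in torch.distributed / the environment,
# not in DeepSpeed or transformers
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("init_process_group succeeded")
dist.barrier()
print("barrier succeeded")
```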
and if it hangs, file an issue with pytorch? Hopefully someone on their team has dealt with kubeflow. It probably has to do with how it builds Docker with regards to pytorch and cuda tools, or the interface to the gpu cards. For example, what happens if you install the normal pytorch on that kubeflow instance after it was built? That would test whether the issue is with how the pre-built pytorch was created while building the kubeflow image. |
yes, turning off gradients doesn't make any sense. I was attempting to battle the issue with using the 'gloo' backend that you referred to... not sure how to fix it. microsoft/DeepSpeed#1030 |
Also, have a look at the when-to-use-which-backend notes here: https://pytorch.org/docs/stable/distributed.html Scroll down to "Which backend to use?" Do any of these ring a bell? And also, these may aid debugging the NCCL issues:
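For instance, NCCL's own debug logging can be turned on with environment variables like these (typical values):

```bash
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```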
Finally, you can attach to a hanging process with |
Open a new issue there? |
@SamsTheGreatest trying to get caught up on this thread but are you able to run NCCL without deepspeed? Even if we can get the gloo backend working I suspect the performance would not be ideal. Can you try a simple all-reduce test in your environment using NCCL? We often run this gist on our systems to test basic NCCL functionality: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36 |
So a simple test could be something like:
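For example, assuming the gist above is saved locally as all_reduce_bench.py (the filename is just a placeholder), it could be launched with:

```bash
python -m torch.distributed.launch --nproc_per_node=1 all_reduce_bench.py
```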
adjust the number of gpus above - probably just 1 in your case. You have only 1 gpu, correct? Edit: I see you reported earlier 1 gpu per node,
so then you need to adapt the above to include the |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Hi,
So, if anyone has a workaround, that would be great. Best, |
I think you could try this solution: ref: #12715 |
Is this already solved? I also have this problem when training inside a pod. |
Creating a new pod has solved this issue for me a couple of times. |
try |
It may work but at what cost?
You will lose on performance greatly. |
I am using Zero stage 2 for training on a single host with multi-GPUs, the performance scaleup is ok for me. |
If you don't care about your training finishing faster, then your approach definitely works. It's not about whether it's comms-bound or gpu-bound, it's about the wasted time on comms. Please see the diagram at https://github.com/stas00/ml-engineering/tree/master/network#single-node-training to have a better understanding that comms aren't instant. I was just flagging to future readers that this is not the right solution for many users. Instead they need to figure out what's wrong with their network setup and enjoy the fast P2P comms and faster training time. |
Agree with your points. Thank you for sharing. |
Running on a server which is shared by many users, will this deletion affect other users? |
Hey, my pod has 8 GPUs; multi-GPU training works on 4 GPUs, but it gets stuck / training does not start on 8 GPUs. |
Facing the same issue. |
@wentinghome @chanangad Could you open a new issue, including details about the running environment, the error (when it stopped working) and a reproducible code snippet? This helps us track possibly new issues and when they are resolved. |
Environment info
transformers version: 4.8.1
Who can help
@stas00
Information
Trying to replicate this, I am using a 125M GPT Neo model and fine-tuning it using the Trainer. Training arguments include a DeepSpeed option. The Trainer gets stuck with:
ds_report gives:
Is there a way to debug this?
To Replicate
I modified the original code slightly to remove the errors:
and ds_config.json is now:
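(The exact file isn't preserved here; purely as an illustration, a ZeRO stage-2 config with CPU offload from that DeepSpeed era, which would be consistent with the DeepSpeedCPUAdam seen in the traceback above, looks roughly like this:)

```json
{
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 5e-5 }
  },
  "train_batch_size": 1
}
```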