My code seems to hang when skip_remainder_batch=False. #182

Open
Fragile-azalea opened this issue Aug 9, 2022 · 7 comments

Fragile-azalea commented Aug 9, 2022

Describe the bug
Hi, Authors. My code seems to hang when skip_remainder_batch=False.

To Reproduce
Steps to reproduce the behavior:

git clone https://github.com/microsoft/tutel --branch main
python3 -m pip uninstall tutel -y
python3 ./tutel/setup.py

cd ./tutel/tutel/examples/fairseq_moe
git clone https://github.com/facebookresearch/fairseq --branch main
cd fairseq/ && git checkout b5e7b250913120409b872a940fbafec4d43c7b13
# This patch is an example to train Fairseq MoE transformers.
# Note that the current patch only works for `legacy_ddp` backend, and `--checkpoint-activations` must be disabled.
git apply ../fairseq_patch.diff
python3 -m pip install omegaconf==2.0.5 hydra-core==1.0.7
python3 -m pip install --no-deps --editable .

# fix the bug in https://github.com/facebookresearch/fairseq/blob/main/fairseq/tasks/translation.py#L441-L442
# get the dataset by following https://github.com/facebookresearch/fairseq/tree/main/examples/translation


CUDA_VISIBLE_DEVICES=0,1 MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 fairseq-train fairseq/data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp --max-update 100000

Logs

2022-08-09 10:51:01 | INFO | fairseq.utils | rank   0: capabilities =  7.5  ; total memory = 10.761 GB ; name = NVIDIA GeForce RTX 2080 Ti
2022-08-09 10:51:01 | INFO | fairseq.utils | rank   1: capabilities =  7.5  ; total memory = 10.761 GB ; name = NVIDIA GeForce RTX 2080 Ti
2022-08-09 10:51:01 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2022-08-09 10:51:01 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2022-08-09 10:51:01 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = None
2022-08-09 10:51:01 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
2022-08-09 10:51:01 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
2022-08-09 10:51:01 | INFO | fairseq.trainer | loading train data for epoch 1
2022-08-09 10:51:01 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: fairseq/data-bin/iwslt14.tokenized.de-en/train.de-en.de
2022-08-09 10:51:01 | INFO | fairseq.data.data_utils | loaded 160,239 examples from: fairseq/data-bin/iwslt14.tokenized.de-en/train.de-en.en
2022-08-09 10:51:01 | INFO | fairseq.tasks.translation | fairseq/data-bin/iwslt14.tokenized.de-en train de-en 160239 examples
2022-08-09 10:51:01 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16 or --amp
2022-08-09 10:51:01 | INFO | fairseq.data.iterators | grouped total_num_itrs = 551
epoch 001:   0%|          | 0/551 [00:00<?, ?it/s]
2022-08-09 10:51:01 | INFO | fairseq.trainer | begin training epoch 1
2022-08-09 10:51:01 | INFO | fairseq_cli.train | Start iterating over samples
/home/xinglinpan/tutel/tutel/examples/fairseq_moe/fairseq/fairseq/utils.py:374: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
/home/xinglinpan/tutel/tutel/examples/fairseq_moe/fairseq/fairseq/utils.py:374: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  warnings.warn(
epoch 001: 100%|▉| 550/551 [02:12<00:00,  4.54it/s, loss=9.244, nll_loss=8.59, ppl=385.3, wps=3146
2022-08-09 10:53:14 | INFO | fairseq_cli.train | begin validation on "valid" subset
2022-08-09 10:53:19 | INFO | fairseq.tasks.translation | example hypothesis: they don't don't don't don't don't don't don't't't't
2022-08-09 10:53:19 | INFO | fairseq.tasks.translation | example reference: they're just not moving.

The likely problem is that not all devices receive data in the last iteration over the valid subset, so the all-to-all on the ranks that do receive data keeps waiting for the other processes. With SKIP_MOE=1, this phenomenon does not occur.
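
To make this concrete, here is a minimal, self-contained sketch of the failure mode in plain torch.distributed (not the tutel/fairseq code; run with torchrun --nproc_per_node=2): the rank that issues one more all-to-all than its peer blocks forever.

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    # Uneven work across ranks, like the last validation batch in the report:
    # rank 0 gets one more step than rank 1.
    steps = 3 if rank == 0 else 2
    for step in range(steps):
        inp = torch.ones(2, device="cuda")
        out = torch.empty(2, device="cuda")
        dist.all_to_all_single(out, inp)  # collective: every rank must participate in each call
        print(f"rank {rank} finished step {step}")
    # Rank 0's third all_to_all_single never returns, because rank 1 has already
    # left the loop, so the whole job appears to hang.

if __name__ == "__main__":
    main()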

@Fragile-azalea (Author)

{"max_len_a": 1.2, "max_len_b": 10} means that max_len will differ in different GPUs(please see https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335-L576).
so the alltoall of the largest length is always pending other processes.
One solution is that using {"max_len": 20} instead.
But I don't really understand the effect of this change on the BLEU score.
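
For concreteness, a small worked example of that budget (this assumes the generator's decoding budget is roughly max_len_a * src_len + max_len_b, as in the linked code; exact clamping details may differ):

# Illustration of why {"max_len_a": 1.2, "max_len_b": 10} gives each GPU a
# different decoding budget: the budget depends on the local source lengths.
def decode_budget(src_len, max_len_a=1.2, max_len_b=10):
    return int(max_len_a * src_len + max_len_b)

print(decode_budget(20))  # 34 decoding steps if the longest local source has 20 tokens
print(decode_budget(40))  # 58 decoding steps if the longest local source has 40 tokens
# With a fixed budget such as {"max_len": 20}, every rank runs the same
# number of decoding steps, so the per-step MoE all-to-all calls stay matched.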

ghostplant added the duplicate label (This issue or pull request already exists) on Aug 9, 2022
@ghostplant (Contributor) commented Aug 9, 2022

Thanks for the information. This is a duplicate of #173.

We'll update the fairseq patch to add inequivalent_tokens=True, which was recently added to tutel but is not yet in the fairseq patch. You may apply it yourself as a temporary workaround.
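
For context, a rough sketch in plain torch.distributed (not tutel's actual implementation) of what supporting inequivalent token counts involves: every rank still has to enter the collective, but the ranks first exchange their send counts so the all-to-all can use uneven split sizes. It assumes an already initialized process group (e.g. via torchrun) and CUDA tensors.

import torch
import torch.distributed as dist

def uneven_all_to_all(chunks):
    """chunks[j] holds the local tokens destined for rank j; counts may differ per rank."""
    device = chunks[0].device
    send_counts = torch.tensor([c.shape[0] for c in chunks], device=device)
    recv_counts = torch.empty_like(send_counts)
    # Step 1: exchange how many rows each rank will send to each other rank.
    dist.all_to_all_single(recv_counts, send_counts)
    # Step 2: the actual token exchange, with per-rank split sizes.
    out = torch.empty(int(recv_counts.sum()), chunks[0].shape[1], device=device)
    dist.all_to_all_single(out, torch.cat(chunks),
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return out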

@Fragile-azalea (Author)

If I understand correctly, I respectfully disagree with this view. In #173, different devices have inequivalent numbers of tokens. In this case, however, some devices perform no forward pass at all because of the difference in max_len (https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335).

@ghostplant (Contributor)

That's interesting. If it's true that one GPU performs 5 forwards and another GPU performs 6 forwards, does traditional data parallel even work? I think the application itself has to do something special to avoid that, right?

@Fragile-azalea (Author)

The number of forward passes may differ across GPUs during validation. In some codebases, validation forwards even happen only on GPU 0.

In this code, two sentences with different lengths (e.g. "How are you" and "Thank you") are distributed to two GPUs. The BLEU hypotheses are generated word by word, so "How" and "Thank" are decoded in parallel, and "are" and "you" are decoded in parallel too. However, there is no word on the second GPU to decode in parallel with the final "you" of "How are you", so that step is issued by only one GPU, as the sketch below illustrates.
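
A tiny illustration of that pairing, using the two sentences above (the real generator batches many sentences per GPU, but the mismatch is the same):

# Per-step autoregressive decoding with an MoE layer in the decoder means one
# all-to-all per decoding step, so the number of collectives per rank equals
# that rank's number of decoding steps.
budgets = {"gpu0": len("How are you".split()),   # 3 decoding steps
           "gpu1": len("Thank you".split())}     # 2 decoding steps

for step in range(max(budgets.values())):
    participants = [gpu for gpu, budget in budgets.items() if step < budget]
    print(f"step {step}: all-to-all issued by {participants}")
# step 2 is issued only by gpu0, so its all-to-all waits forever for gpu1.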

@ghostplant (Contributor)

OK, this root cause makes sense. The MoE layer within one process cannot know whether the application intends the other processes to forward the MoE layer together with it. Even if it knew that some processes were suspended and would not join the validation step, part of the expert parameters stored on those processes would be inaccessible to the validation forward.

So it seems there is no solution other than a change on the application side, which is to run evaluation in every process (a rough sketch of that idea follows below). Can you provide the code lines that perform the related validation procedure?
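
For illustration, a minimal sketch of that application-side change; make_dummy_batch and the calling convention are hypothetical, and the only point is that every rank performs the same number of MoE forwards during validation:

import torch
import torch.distributed as dist

def run_validation(model, local_batches, make_dummy_batch):
    # Agree on the maximum number of validation steps across all ranks.
    n_steps = torch.tensor([len(local_batches)], device="cuda")
    dist.all_reduce(n_steps, op=dist.ReduceOp.MAX)
    for step in range(int(n_steps.item())):
        if step < len(local_batches):
            model(local_batches[step])        # real validation forward
        else:
            model(make_dummy_batch())         # padding forward so the MoE all-to-all stays matched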

ghostplant added the application patch label and removed the duplicate label on Aug 9, 2022
@Fragile-azalea (Author)

Yes, I pretty much agree that running evaluation in every process can fix this hang. In this code, we remove some BLEU arguments (i.e. max_len_a and max_len_b) to fix it, and retrain our transformer with:

CUDA_VISIBLE_DEVICES=0,1 MOE=3 L_AUX_WT=0.01 SKIP_EXPERT=1 fairseq-train fairseq/data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 10e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --eval-bleu \
    --eval-bleu-args '{"beam": 5}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --ddp-backend legacy_ddp --max-update 100000

The code seems to work well, and removing these arguments appears to have little effect on the BLEU score, but I am not completely sure about that.
