Error when using multiple GPUs #45

Open
eomiso opened this issue May 17, 2022 · 1 comment
eomiso commented May 17, 2022

Thank you for the awesome and interesting research and project. I was wondering if anyone has encountered the following error when using multiple GPUs. I have four Titan V GPUs, and to use them I've set the local rank to -1, but it seems a problem occurs during the forward pass.

export DATA_DIR=datasets/paranmt_filtered

source style-venv/bin/activate

BASE_DIR=style_paraphrase

python -m torch.distributed.launch --nproc_per_node=1 $BASE_DIR/run_lm_finetuning.py \
    --output_dir=$BASE_DIR/saved_models/test_paraphrase \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-large \
    --data_dir=$DATA_DIR \
    --do_train \
    --save_steps 500 \
    --logging_steps 20 \
    --save_total_limit -1 \
    --evaluate_during_training \
    --num_train_epochs 3 \
    --gradient_accumulation_steps 2 \
    --per_gpu_train_batch_size 5 \
    --per_gpu_eval_batch_size 5 \
    --job_id paraphraser_test \
    --learning_rate 5e-5 \
    --prefix_input_type original \
    --global_dense_feature_list none \
    --specific_style_train -1 \
    --optimizer adam \
    --fp16 \
    --fp16_opt_level "O3" \
    --overwrite_output_dir \
    --local_rank -1
Traceback (most recent call last):
  File "style_paraphrase/run_lm_finetuning.py", line 505, in <module>
    main()
  File "style_paraphrase/run_lm_finetuning.py", line 422, in main
    global_step, tr_loss = train(args, gpt2_model, train_dataset, tokenizer)
  File "style_paraphrase/run_lm_finetuning.py", line 228, in train
    loss = gpt2_model(batch)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/workspace/style-transformer/style-transfer-paraphrase/style_paraphrase/utils.py", line 87, in forward
    labels=labels
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1059, in forward
    return_dict=return_dict,
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 832, in forward
    inputs_embeds = self.wte(input_ids)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/root/miniconda3/envs/style-venv/lib/python3.7/site-packages/torch/nn/functional.py", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403
martiansideofthemoon (Owner) commented May 29, 2022

Hi @eomiso,
Did you try --nproc_per_node=4? Also, I don't think you have to set local_rank explicitly; I think PyTorch does it for you.
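
For reference, a minimal sketch of that suggestion applied to the command above, assuming all four Titan V GPUs are visible to the job. torch.distributed.launch spawns one process per GPU and passes a --local_rank argument to each process itself, so the explicit --local_rank -1 flag is dropped; the flags shown are copied from the original command, and any omitted ones would stay as they were.

export DATA_DIR=datasets/paranmt_filtered
source style-venv/bin/activate
BASE_DIR=style_paraphrase

# One worker process per GPU; torch.distributed.launch injects --local_rank for each.
python -m torch.distributed.launch --nproc_per_node=4 $BASE_DIR/run_lm_finetuning.py \
    --output_dir=$BASE_DIR/saved_models/test_paraphrase \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-large \
    --data_dir=$DATA_DIR \
    --do_train \
    --per_gpu_train_batch_size 5 \
    --per_gpu_eval_batch_size 5 \
    --fp16 \
    --fp16_opt_level "O3" \
    --overwrite_output_dir
# ...remaining flags unchanged from the original command; --local_rank -1 removed.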
