
Multi GPU support #2

Closed
MXueguang opened this issue Apr 15, 2021 · 4 comments

Comments

@MXueguang

Hi,
Does the current version of encoder training with GC support multiple GPUs?
I tried to run the training on the NQ dataset by following the instructions in README.md, but on a machine with 2 GPUs it seems to run slower than on a single GPU.
i.e. on a single GPU one step takes about 4 sec, but with two GPUs one step takes about 24 sec.

@luyug
Owner

luyug commented Apr 16, 2021

We do have some local patches for multiple cards, but even the current TOT should not have overhead this big.

You can probably run a profiler to see what is bottlenecking it.

We can also help investigate the problem if you provide more information.
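For instance, something along these lines (a rough sketch, not a command from this repo; the step-runner function and the output file name are placeholders) would show where the time is going:

    import cProfile
    import pstats

    def run_two_steps():
        # Placeholder: load the data, then run two optimizer steps of the trainer.
        pass

    # Profile the placeholder call and dump the stats to a file.
    cProfile.run("run_two_steps()", "train_profile.prof")
    # Print the head of the profile, sorted by total time per function.
    pstats.Stats("train_profile.prof").sort_stats("tottime").print_stats(20)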

@MXueguang
Author

Hi @luyug, thank you for your help.
I loaded the data and then ran two steps to see the time.

This is the head of the profile when using two GPUs (two 2080 Ti, 11 GB):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   25.258    8.419   25.258    8.419 decoder.py:343(raw_decode)
       82    9.238    0.113    9.238    0.113 :0(run_backward)
  1156049    9.164    0.000   15.516    0.000 module.py:774(__setattr__)
2898444/367168    6.542    0.000    7.415    0.000 module.py:1215(named_modules)
     2114    4.500    0.002    4.500    0.002 :0(acquire)
3265133/3265132    3.398    0.000    3.397    0.000 :0(get)
      883    3.255    0.004    5.865    0.007 :0(read)
      310    3.243    0.010    3.243    0.010 :0(normal_)
       65    2.610    0.040    2.610    0.040 :0(utf_8_decode)
  2471319    2.546    0.000    2.548    0.000 :0(isinstance)
     2092    2.455    0.001    2.457    0.001 :0(to)
      504    2.263    0.004    2.263    0.004 :0(_scatter)
      168    1.831    0.011   29.583    0.176 replicate.py:78(replicate)
   145338    1.607    0.000   12.096    0.000 module.py:1376(_replicate_for_data_parallel)
      187    1.333    0.007    1.333    0.007 :0(_cuda_isDriverSufficient)
   955800    1.077    0.000    1.077    0.000 :0(items)
   136794    1.036    0.000    8.640    0.000 module.py:1048(_named_members)

vs. the profile when running on a single GPU:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3   25.166    8.389   25.166    8.389 decoder.py:343(raw_decode)
       82    3.695    0.045    3.695    0.045 :0(run_backward)
      857    3.231    0.004    5.836    0.007 :0(read)
      310    3.229    0.010    3.229    0.010 :0(normal_)
       74    2.605    0.035    2.605    0.035 :0(utf_8_decode)
     1838    2.414    0.001    2.415    0.001 :0(to)
      172    1.585    0.009    1.585    0.009 :0(_cuda_isDriverSufficient)
      292    0.869    0.003    0.869    0.003 :0(uniform_)
    15362    0.724    0.000    0.724    0.000 :0(matmul)
36480/160    0.507    0.000    3.585    0.022 module.py:710(_call_impl)
      398    0.387    0.001    0.387    0.001 :0(copy_)
      412    0.387    0.001    0.387    0.001 :0(_set_from_file)
    11680    0.272    0.000    1.047    0.000 functional.py:1655(linear)
        1    0.235    0.235   31.222   31.222 __init__.py:274(load)
    27907    0.199    0.000    0.350    0.000 module.py:774(__setattr__)

**************** CONFIGURATION **************** 
adam_betas                     -->   (0.9, 0.999)
adam_eps                       -->   1e-08
batch_size                     -->   128
checkpoint_file_name           -->   dpr_biencoder
ctx_chunk_size                 -->   8
dev_batch_size                 -->   16
dev_file                       -->   data/retriever/nq-dev.json
device                         -->   cuda
distributed_world_size         -->   1
do_lower_case                  -->   True
dropout                        -->   0.1
encoder_model_type             -->   hf_bert
eval_per_epoch                 -->   1
fix_ctx_encoder                -->   False
fp16                           -->   True
fp16_opt_level                 -->   O1
global_loss_buf_sz             -->   2097152
grad_cache                     -->   True
gradient_accumulation_steps    -->   1
hard_negatives                 -->   1
learning_rate                  -->   2e-05
local_rank                     -->   -1
log_batch_step                 -->   100
max_grad_norm                  -->   2.0
model_file                     -->   None
n_gpu                          -->   1
no_cuda                        -->   False
num_train_epochs               -->   40.0
other_negatives                -->   0
output_dir                     -->   model
pretrained_file                -->   None
pretrained_model_cfg           -->   bert-base-uncased
projection_dim                 -->   0
q_chunk_size                   -->   16
seed                           -->   12345
sequence_length                -->   256
shuffle_positive_ctx           -->   False
train_file                     -->   data/retriever/nq-train.json
train_files_upsample_rates     -->   None
train_rolling_loss_step        -->   100
val_av_rank_bsz                -->   128
val_av_rank_hard_neg           -->   30
val_av_rank_max_qs             -->   1000
val_av_rank_other_neg          -->   30
val_av_rank_start_epoch        -->   30
warmup_steps                   -->   1237
weight_decay                   -->   0.0

@luyug
Owner

luyug commented Apr 16, 2021

A few things

  • This is the Python profiler, I suppose? Can you run the PyTorch profiler? That gives CUDA kernel time as well (a rough sketch follows after this list).
  • How did you launch the script? Are you using DDP? The code was adjusted assuming DDP, since DP is in general discouraged by PyTorch. I am not sure what will happen if you use DP.
  • Try turning off AMP. You will definitely see slower single-card training, but maybe we can learn more from the multi/single-card time ratio.
  • Based on the numbers you have here, it seems backward is consuming quite a lot more time. (We can know better with a more detailed profile.) I do have a local patch that reduces the number of gradient syncs in backward, but again, a 6x slowdown is definitely not expected. I have no idea what that __setattr__ is doing.
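Something like the following (a rough sketch with a stand-in model and batch, not the actual training code) would surface the CUDA kernel times for the first point:

    import torch
    from torch import nn
    from torch.profiler import profile, ProfilerActivity

    # Stand-in model and batch; the real objects come from the training script.
    model = nn.Linear(256, 256).cuda()
    batch = torch.randn(128, 256, device="cuda")

    # Record both CPU ops and CUDA kernels for one forward/backward step.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        loss = model(batch).sum()   # forward
        loss.backward()             # backward, where most of the extra time shows up

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))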

@MXueguang
Author

Ah, I launched with DP. Running with DDP works!
Thanks for your help!
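For reference, a minimal sketch (not the repo's code) of the difference between the two modes: the replicate.py / _replicate_for_data_parallel / __setattr__ entries in the two-GPU profile above come from DataParallel re-replicating the module on every forward pass, which DDP avoids.

    import torch
    from torch import nn
    import torch.distributed as dist

    # Toy stand-in for the bi-encoder.
    model = nn.Linear(256, 256).cuda()

    # DataParallel: one process drives both GPUs and re-replicates the module
    # on every forward pass, which is where the replicate/__setattr__ time goes.
    dp_model = nn.DataParallel(model)

    # DistributedDataParallel: one process per GPU (e.g. launched via
    # `python -m torch.distributed.launch --nproc_per_node=2 ...`, which supplies
    # the local rank); the module is wrapped once and gradients are synchronized
    # with all-reduce during backward.
    if dist.is_available() and dist.is_initialized():
        local_rank = dist.get_rank() % torch.cuda.device_count()
        ddp_model = nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank
        )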
