Multi-GPU training vs Distributed training #53

Closed
llidev opened this issue Nov 24, 2018 · 2 comments

Comments

@llidev
Contributor

llidev commented Nov 24, 2018

Hi,

I have a question about Multi-GPU vs Distributed training, probably unrelated to BERT itself.

I have a 4-GPU server, and was trying to run run_classifier.py in two ways:

(a) run single-node distributed training with 4 processes and a minibatch of 32 per process
(b) run multi-GPU training with a minibatch of 128, keeping all other hyperparameters the same

Intuitively, I believe (a) and (b) should yield roughly the same accuracy and training time. Below are my observations:

  1. (a) runs ~20% faster than (b).
  2. (b) yields a final evaluation accuracy ~4% higher than (a).

The first observation seems reasonable, since I guess the loss.mean() is done on the CPU, which may be slower than using NCCL directly. However, I don't quite understand the second observation. Could you give any hint or reference about the possible cause?
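For reference, here is my rough mental model of the two setups as a simplified sketch (the DummyModel, tensor shapes, and local_rank are placeholders, not the actual run_classifier.py code):

    import torch
    import torch.nn as nn

    # Dummy stand-in for the BERT classifier: the forward pass returns a scalar loss.
    class DummyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(768, 2)

        def forward(self, x):
            return self.linear(x).pow(2).mean()  # pretend this is the classification loss

    model = DummyModel().cuda()
    batch = torch.randn(128, 768).cuda()

    # (b) Multi-GPU, one process: DataParallel splits the 128-sample batch across the
    # 4 GPUs, each replica returns its own loss, and the per-replica losses are
    # gathered back onto GPU 0, hence the loss.mean() in run_classifier.py.
    loss_b = nn.DataParallel(model)(batch).mean()

    # (a) Distributed: 4 processes with one GPU and a 32-sample batch each; gradients
    # are all-reduced with NCCL inside backward(), so no explicit loss averaging is
    # needed. (Requires init_process_group("nccl") and one process per GPU, e.g. via
    # torch.distributed.launch.)
    # loss_a = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])(small_batch)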

Thanks!

@thomwolf
Member

Hi,

Thanks for the feedback, it's indeed always interesting to compare the various possible ways to train the model.

The most likely cause for (2) is that MRPC is a small dataset and the model shows high variance in the results depending on, for example, the initialization of the weights (see the original BERT repo on that as well). The distributed and multi-GPU setups probably do not use the random generators in exactly the same order, which leads to different initializations.

You can get an intuition for that by training with different seeds; you will easily see a 10% variation in the final accuracy...

If you can, a better way to compare the results would thus be to run something like 10 different seeds for each training condition and compare the mean and standard deviation of the results.
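For example, something along these lines (just a sketch; train_and_eval is a hypothetical helper standing in for a full train + eval run of run_classifier.py with a given seed):

    import statistics

    def train_and_eval(seed: int) -> float:
        """Placeholder: run one full training + evaluation cycle of run_classifier.py
        with the given seed (in whichever setup you are comparing) and return the
        final evaluation accuracy."""
        raise NotImplementedError

    accuracies = [train_and_eval(seed) for seed in range(10)]
    print("mean accuracy:", statistics.mean(accuracies))
    print("std deviation:", statistics.stdev(accuracies))

Running that once per training condition (multi-GPU and distributed) lets you compare the two means relative to the standard deviations.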

@llidev
Contributor Author

llidev commented Nov 27, 2018

Thanks for your feedback!

After some investigation, it looks like t_total is not set properly for distributed training in BertAdam. The t_total used by each distributed worker should be the global value divided by the worker count.

I have included the following fix in my PR #58

    # In distributed mode each worker only takes num_train_steps / world_size
    # optimizer steps, so the schedule length passed to BertAdam is scaled down.
    t_total = num_train_steps
    if args.local_rank != -1:
        t_total = t_total // torch.distributed.get_world_size()
    optimizer = BertAdam(optimizer_grouped_parameters,
                         lr=args.learning_rate,
                         warmup=args.warmup_proportion,
                         t_total=t_total)
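For context, my understanding of why the division is needed (with made-up numbers, assuming the usual DistributedSampler setup): each worker only sees 1/world_size of the batches, so it takes far fewer optimizer steps than the global num_train_steps, while BertAdam drives its warmup and decay schedule from t_total:

    # Illustrative numbers only, not taken from an actual run.
    num_train_examples = 3668    # roughly the size of the MRPC training set
    train_batch_size   = 32      # per-worker batch size
    num_train_epochs   = 3
    world_size         = 4

    # Step count if a single process saw every batch:
    num_train_steps = num_train_examples // train_batch_size * num_train_epochs  # 342

    # With a DistributedSampler each worker only processes 1/world_size of the
    # batches, so it only takes ~85 optimizer steps; passing the global 342 as
    # t_total would make the warmup/decay schedule roughly 4x too long.
    t_total = num_train_steps // world_size  # 85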
