
About GPU and processing time in pretraining of LayoutLMv3 #917

Closed
kash203 opened this issue Nov 11, 2022 · 4 comments

kash203 commented Nov 11, 2022

Describe

Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLMv3

Question

I would like to estimate how long pretraining takes, so I would like to know which GPUs you used and how long pretraining took.

In my environment, it looks like pretraining the base model will take about 2-3 months on 4 x A100 (80GB) ...
I would also like to check whether this estimate is reasonable.

Condition:

  • Input length to the transformer is about 711 tokens (514 word tokens + 197 image tokens)
  • Batch size per GPU: about 50
  • Gradient accumulation: 10 => effective batch size 50 x 4 x 10 = 2,000 (see the sketch after this list)
  • Using AMP
  • Target training steps: 500,000
  • GPU utilization is almost 100% according to nvidia-smi.
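
For reference, here is a minimal sketch (not the actual pretraining code; the model and batch below are placeholders) of how the AMP + gradient accumulation setup above combines into the effective batch size of 2,000:

```python
import torch

# Placeholder numbers matching the setup above: per-GPU batch 50, accumulation 10,
# so one optimizer step sees 50 x 4 GPUs x 10 = 2,000 samples in total.
accum_steps = 10
model = torch.nn.Linear(768, 768).cuda()                 # placeholder, stands in for LayoutLMv3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(accum_steps * 3):                      # placeholder loop over batches
    batch = torch.randn(50, 768, device="cuda")          # stands in for a real document batch
    with torch.cuda.amp.autocast():                      # mixed-precision forward pass
        loss = model(batch).pow(2).mean() / accum_steps  # divide loss across accumulation steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                    # optimizer step every 10 micro-batches
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```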

HYPJUDY commented Nov 16, 2022

Below are the details of our training setting using V100 (32G) GPUs:

| Model | Data Size | Steps | Number of GPUs | Batch Size per GPU | Gradient Accumulation | Time |
|---|---|---|---|---|---|---|
| LayoutLMv3-base | 11 million | 500,000 | 4 x 8 | 8 | 8 | 468.5 h |
| LayoutLMv3-large | 11 million | 500,000 | 8 x 8 | 4 | 8 | 805 h |

I recommend using fewer training steps if you want to reduce the training time with the available hardware. For example, 150,000 steps should be enough to achieve only slightly worse results on most tasks.
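
For a rough sense of the time saving, a simple proportional estimate (back-of-the-envelope arithmetic only, assuming the time per step stays constant):

```python
# Scale the measured base-model time linearly with the number of steps.
base_time_h = 468.5                            # 500,000 steps of LayoutLMv3-base (table above)
short_steps, full_steps = 150_000, 500_000
print(base_time_h * short_steps / full_steps)  # ~140.6 hours
```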


kash203 commented Nov 17, 2022

Thank you for your very helpful answer!
My time estimate seems plausible then.
In particular, being able to shorten the number of steps is helpful for me.

I'll try that number of steps.
I think I should keep the number of warm-up steps the same (base: 24,000; large: 50,000 warm-up steps).

kash203 closed this as completed Nov 17, 2022

HYPJUDY commented Nov 18, 2022

I am glad the answer was helpful to you!
To my surprise, you can reach a batch size of 50 with an A100 (80G), much larger than the 8 I used with the V100 (32G).

There are two ways to use fewer training steps. I use the same warm-up ratio as in the paper for both options (a short calculation follows the list below).

  1. The total number of training steps is 150,000. LayoutLMv3's ablation models adopt this setup (see the caption of Table 3 in the paper). Note that we use a larger learning rate at the same time.
  2. The total number of training steps is 500,000, and you can stop the job at 150,000 steps.
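
As a concrete illustration of keeping the warm-up ratio when shortening the schedule (my own calculation from the warm-up numbers quoted above, not an official config):

```python
# The base config above implies a warm-up ratio of 24,000 / 500,000 = 4.8%.
warmup_ratio = 24_000 / 500_000                 # 0.048 for LayoutLMv3-base
total_steps = 150_000                           # shortened schedule (option 1)
warmup_steps = int(total_steps * warmup_ratio)
print(warmup_steps)                             # 7,200 warm-up steps; option 2 keeps the original 24,000
```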


kash203 commented Nov 22, 2022

Thanks for the additional advice!
The batch size difference puzzles me too, since the models have almost the same number of parameters.

I see that I was wrong about the warm-up: it's a ratio, not an absolute number of steps.
I wonder whether you used lr = 3e-4 when the number of steps is 150,000.
I thought you were using a scheduler like transformers.get_constant_schedule_with_warmup, but maybe you are using transformers.get_linear_schedule_with_warmup?
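
For reference, the two schedulers I am comparing would be set up roughly like this (illustrative only; the optimizer and the warm-up/step counts below are placeholders, not the authors' confirmed config):

```python
import torch
from transformers import (
    get_constant_schedule_with_warmup,
    get_linear_schedule_with_warmup,
)

# Placeholder optimizer just to construct the schedulers.
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3e-4)
warmup_steps, total_steps = 24_000, 500_000

# Constant learning rate after the warm-up phase:
constant_sched = get_constant_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps
)

# Linear decay to zero after the warm-up phase:
linear_sched = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
```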
