
About GPU and processing time in pretraining of LayoutLMv3 #917

Closed
kash203 opened this issue Nov 11, 2022 · 4 comments

kash203 commented Nov 11, 2022

Describe

Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLMv3

Question

I would like to estimate how long pretraining takes, so I would like to know which GPUs you used and how long pretraining took.

In my environment, it looks like pretraining the base model will take about 2-3 months on 4 x A100 (80GB) ...
I would also like to check whether this estimate is reasonable.

Condition:

  • Input length to the transformer is about 711 tokens (514 word tokens + 197 image tokens)
  • Batch size per GPU: about 50
  • Gradient accumulation: 10 => effective batch size 50 x 4 x 10 = 2,000 (see the sketch after this list)
  • Using AMP
  • Target training steps: 500,000
  • GPU utilization is almost 100% according to nvidia-smi.
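
For reference, here is a minimal sketch (not the actual pretraining code; the model and batch below are placeholders) of how the AMP + gradient accumulation setup above combines into the effective batch size of 2,000:

```python
import torch

# Placeholder numbers matching the setup above: per-GPU batch 50, accumulation 10,
# so one optimizer step sees 50 x 4 GPUs x 10 = 2,000 samples in total.
accum_steps = 10
model = torch.nn.Linear(768, 768).cuda()                 # placeholder, stands in for LayoutLMv3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(accum_steps * 3):                      # placeholder loop over batches
    batch = torch.randn(50, 768, device="cuda")          # stands in for a real document batch
    with torch.cuda.amp.autocast():                      # mixed-precision forward pass
        loss = model(batch).pow(2).mean() / accum_steps  # divide loss across accumulation steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                    # optimizer step every 10 micro-batches
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```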

HYPJUDY commented Nov 16, 2022

Below are the details of our training setting using V100 (32G) GPUs:

| Model | Data Size | Steps | Number of GPUs | Batch Size per GPU | Gradient Accumulation | Time |
|---|---|---|---|---|---|---|
| LayoutLMv3-base | 11 million | 500,000 | 4 x 8 | 8 | 8 | 468.5 h |
| LayoutLMv3-large | 11 million | 500,000 | 8 x 8 | 4 | 8 | 805 h |

I recommend using fewer training steps if you want to reduce the training time with the available hardware. For example, 150,000 steps should be enough to achieve only slightly worse results on most tasks.
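
For a rough sense of the time saving, a simple proportional estimate (back-of-the-envelope arithmetic only, assuming the time per step stays constant):

```python
# Scale the measured base-model time linearly with the number of steps.
base_time_h = 468.5                            # 500,000 steps of LayoutLMv3-base (table above)
short_steps, full_steps = 150_000, 500_000
print(base_time_h * short_steps / full_steps)  # ~140.6 hours
```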


kash203 commented Nov 17, 2022

Thank you for your very helpful answer!
My time estimate seems plausible then.
In particular, being able to shorten the number of steps is helpful for me.

I'll try that number of steps.
I think I should keep the number of warm-up steps the same (base: 24,000; large: 50,000 warm-up steps).

kash203 closed this as completed Nov 17, 2022

HYPJUDY commented Nov 18, 2022

I am glad the answer was helpful to you!
To my surprise, you can reach a batch size of 50 with an A100 (80G), much larger than the 8 I used with the V100 (32G).

There are two ways to use fewer training steps. I use the same warm-up ratio as in the paper for both options (a short calculation follows the list below).

  1. The total number of training steps is 150,000. LayoutLMv3's ablation models adopt this setup (see the caption of Table 3 in the paper). Note that we use a larger learning rate at the same time.
  2. The total number of training steps is 500,000, and you can stop the job at 150,000 steps.
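
As a concrete illustration of keeping the warm-up ratio when shortening the schedule (my own calculation from the warm-up numbers quoted above, not an official config):

```python
# The base config above implies a warm-up ratio of 24,000 / 500,000 = 4.8%.
warmup_ratio = 24_000 / 500_000                 # 0.048 for LayoutLMv3-base
total_steps = 150_000                           # shortened schedule (option 1)
warmup_steps = int(total_steps * warmup_ratio)
print(warmup_steps)                             # 7,200 warm-up steps; option 2 keeps the original 24,000
```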


kash203 commented Nov 22, 2022

Thanks for the additional advice!
The batch size difference puzzles me too, since the models have almost the same number of parameters.

I see that I was wrong about the warm-up: it's a ratio, not an absolute number of steps.
I wonder whether you used lr = 3e-4 when the number of steps is 150,000.
I thought you were using a scheduler like transformers.get_constant_schedule_with_warmup, but maybe you are using transformers.get_linear_schedule_with_warmup?
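
For reference, the two schedulers I am comparing would be set up roughly like this (illustrative only; the optimizer and the warm-up/step counts below are placeholders, not the authors' confirmed config):

```python
import torch
from transformers import (
    get_constant_schedule_with_warmup,
    get_linear_schedule_with_warmup,
)

# Placeholder optimizer just to construct the schedulers.
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3e-4)
warmup_steps, total_steps = 24_000, 500_000

# Constant learning rate after the warm-up phase:
constant_sched = get_constant_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps
)

# Linear decay to zero after the warm-up phase:
linear_sched = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
```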
