Data parallelism (multi-GPU training) + pure ViT work + small modifications #150
Conversation
That's great news! |
Thanks a lot, I benefit greatly from every pull request you review. I tested again with both ViT and hybrid; only the problem below remains. Traceback (most recent call last):
File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 102, in <module>
train(args)
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 27, in train
model = get_model(args, training=True)
File "/yuhang/LaTeX-OCR/pix2tex/models/utils.py", line 49, in get_model
en_attn_layers = encoder.module.attn_layers if available_gpus > 1 else encoder.attn_layers
File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__
type(self).__name__, name))
AttributeError: 'CustomVisionTransformer' object has no attribute 'attn_layers'
If there is a better way to rewrite 0aefdbf, please modify it. |
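The check in get_model decides by GPU count, but the traceback shows the pure-ViT encoder is not always wrapped. A minimal sketch of a more robust pattern (hypothetical; only the attribute names come from the traceback above):

import torch.nn as nn

# Unwrap only if the encoder was actually wrapped in nn.DataParallel,
# instead of inferring the wrapper from the number of visible GPUs.
base_encoder = encoder.module if isinstance(encoder, nn.DataParallel) else encoder
en_attn_layers = base_encoder.attn_layers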
Sorry, my fault. I didn't try with … Why don't you just watch the … |
…into data-parallelism
Because I don't know what it used to do before, like the … I read the doc and found it related to the … Thanks for your guidance. |
Based on this Stack Overflow Q&A, it seems the git pull command fetched 6c53105 and automatically merged it locally, then created commit 4ffbeca; that was rather beyond my expectation. I thought it would only commit 5cbbcb9. |
Before I merge this: There is something that went wrong along the way. Missing key(s) in state_dict: "decoder.net.attn_layers.layers.0.1.to_out.weight", "decoder.net.attn_layers.layers.0.1.to_out.bias", "decoder.net.attn_layers.layers.1.1.to_out.weight", "decoder.net.attn_layers.layers.1.1.to_out.bias", "decoder.net.attn_layers.layers.2.1.net.0.0.weight", "decoder.net.attn_layers.layers.2.1.net.0.0.bias", "decoder.net.attn_layers.layers.3.1.to_out.weight", "decoder.net.attn_layers.layers.3.1.to_out.bias", "decoder.net.attn_layers.layers.4.1.to_out.weight", "decoder.net.attn_layers.layers.4.1.to_out.bias", "decoder.net.attn_layers.layers.5.1.net.0.0.weight", "decoder.net.attn_layers.layers.5.1.net.0.0.bias", "decoder.net.attn_layers.layers.6.1.to_out.weight", "decoder.net.attn_layers.layers.6.1.to_out.bias", "decoder.net.attn_layers.layers.7.1.to_out.weight", "decoder.net.attn_layers.layers.7.1.to_out.bias", "decoder.net.attn_layers.layers.8.1.net.0.0.weight", "decoder.net.attn_layers.layers.8.1.net.0.0.bias", "decoder.net.attn_layers.layers.9.1.to_out.weight", "decoder.net.attn_layers.layers.9.1.to_out.bias", "decoder.net.attn_layers.layers.10.1.to_out.weight", "decoder.net.attn_layers.layers.10.1.to_out.bias", "decoder.net.attn_layers.layers.11.1.net.0.0.weight", "decoder.net.attn_layers.layers.11.1.net.0.0.bias".
Unexpected key(s) in state_dict: "decoder.net.attn_layers.layers.0.1.to_out.0.weight", "decoder.net.attn_layers.layers.0.1.to_out.0.bias", "decoder.net.attn_layers.layers.1.1.to_out.0.weight", "decoder.net.attn_layers.layers.1.1.to_out.0.bias", "decoder.net.attn_layers.layers.2.1.net.0.proj.weight", "decoder.net.attn_layers.layers.2.1.net.0.proj.bias", "decoder.net.attn_layers.layers.3.1.to_out.0.weight", "decoder.net.attn_layers.layers.3.1.to_out.0.bias", "decoder.net.attn_layers.layers.4.1.to_out.0.weight", "decoder.net.attn_layers.layers.4.1.to_out.0.bias", "decoder.net.attn_layers.layers.5.1.net.0.proj.weight", "decoder.net.attn_layers.layers.5.1.net.0.proj.bias", "decoder.net.attn_layers.layers.6.1.to_out.0.weight", "decoder.net.attn_layers.layers.6.1.to_out.0.bias", "decoder.net.attn_layers.layers.7.1.to_out.0.weight", "decoder.net.attn_layers.layers.7.1.to_out.0.bias", "decoder.net.attn_layers.layers.8.1.net.0.proj.weight", "decoder.net.attn_layers.layers.8.1.net.0.proj.bias", "decoder.net.attn_layers.layers.9.1.to_out.0.weight", "decoder.net.attn_layers.layers.9.1.to_out.0.bias", "decoder.net.attn_layers.layers.10.1.to_out.0.weight", "decoder.net.attn_layers.layers.10.1.to_out.0.bias", "decoder.net.attn_layers.layers.11.1.net.0.proj.weight", "decoder.net.attn_layers.layers.11.1.net.0.proj.bias". I'll look into it |
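The mismatch looks like a checkpoint saved under a different x_transformers version (to_out became a Sequential, net.0.0 became net.0.proj). One possible workaround, sketched under the assumption that the renaming really is this mechanical, is to remap the checkpoint keys before loading:

# Hypothetical key remap: rename the newer-style checkpoint keys to the
# older module names this model expects. Verify the pairs before relying on it.
remapped = {}
for k, v in state_dict.items():
    k = k.replace('.to_out.0.', '.to_out.')
    k = k.replace('.net.0.proj.', '.net.0.0.')
    remapped[k] = v
model.load_state_dict(remapped)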
One other thing I just remembered: what if you train your model on multiple GPUs and then try to fine-tune or evaluate on a single-GPU machine? |
I will test that situation right now.
There is no rush to merge this PR, I think we have plenty of time to test and troubleshoot the situation. |
Solved in ff2641c |
You are right, I am dealing with it. |
I referenced nn.DataParallel.forward and the parallelism_tutorial; the solution is gradually emerging. I wrote a function that only uses multiple GPUs when it is called, so there is no need for a wrapper anymore, or to change any other part of the code to stay compatible with … But I encounter an AssertionError after some steps; I hope I can solve it as soon as possible in the next few days. Loss: 0.9900: 4%|█████▍ | 289/7891 [01:35<41:53, 3.03it/s]
Traceback (most recent call last):
File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 123, in <module>
train(args)
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 75, in train
encoded = data_parallel(encoder,inputs=im[j:j+microbatch].to(device), device_ids=[0,1,2])
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 29, in data_parallel
outputs = nn.parallel.parallel_apply(replicas, inputs,kwargs)
File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 40, in parallel_apply
assert len(modules) == len(kwargs_tup)
AssertionError |
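The assertion fires because nn.parallel.parallel_apply expects one kwargs dict per replica (a tuple as long as modules), not a single dict. A minimal sketch of such a helper, following the pattern from the PyTorch parallelism tutorial (function and argument names are illustrative):

import torch.nn as nn

def data_parallel(module, inputs, device_ids, output_device=None):
    # One-off multi-GPU forward pass without permanently wrapping
    # the module in nn.DataParallel.
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)
    # scatter splits the batch along dim 0, one chunk per device
    scattered = nn.parallel.scatter(inputs, device_ids)
    replicas = replicas[:len(scattered)]
    # passing no kwargs lets parallel_apply build a matching empty tuple itself
    outputs = nn.parallel.parallel_apply(replicas, scattered)
    return nn.parallel.gather(outputs, output_device)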
Which way do you think is better for switching GPUs? @lukas-blecher
Option 1 (environment variable):
#multi-gpu
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m pix2tex.train --config model/settings/config-vit.yaml
#one-gpu
export CUDA_VISIBLE_DEVICES=0
python -m pix2tex.train --config model/settings/config-vit.yaml

Option 2 (yaml config):
#multi-gpu
gpu_indices: [0,1,2,3,4,5,6,7]
python -m pix2tex.train --config model/settings/config-vit.yaml
#one GPU
gpu_indices: [0]
python -m pix2tex.train --config model/settings/config-vit.yaml
#or default is one GPU and set gpu_indices: null
gpu_indices: null |
Option 1 is better. With export CUDA_VISIBLE_DEVICES=2,3, only those devices are visible in Python. Also, you don't need to change the yaml setting when running the script on another machine or multiple times at the same time. |
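To illustrate the effect (an aside, not from the original comment): after exporting the variable, PyTorch renumbers the visible devices from zero.

import os
import torch

# After `export CUDA_VISIBLE_DEVICES=2,3` in the shell, this process only
# sees two cards, addressed as cuda:0 and cuda:1.
print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # 2,3
print(torch.cuda.device_count())               # 2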
Agreed, one command and the problem is solved. After a couple of hours of use, I found that when the GPU cards on a server are shared among different people, some users already occupy part of the memory on several GPUs. Then I need to check the Linux shell history or search Google to copy the command again, so I set option 1 as the default and kept option 2 compatible for this situation. = ̄ω ̄= Code tested in … |
I've never seen the … |
The pytorch lib is imported first. I think the method is almost the same as DataParallel.forward, but I am not quite sure.
I am testing it. The reason I haven't used … An alternative solution that comes to mind is to rewrite model.forward to include encoder.forward and decoder.forward; then an nn.DataParallel-wrapped model could directly call model(x) for the forward pass in multi-GPU mode (I guess, I haven't tried). But I notice you already wrote generate in model.forward. |
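A minimal sketch of that alternative (class and argument names are hypothetical; only the idea of fusing the encoder and decoder into one forward comes from the comment above):

import torch.nn as nn

class EncoderDecoder(nn.Module):
    # Putting the whole pipeline into forward() lets nn.DataParallel
    # split the image batch across GPUs automatically.
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, images, tokens, **kwargs):
        encoded = self.encoder(images)
        return self.decoder(tokens, context=encoded, **kwargs)

# usage sketch:
# model = nn.DataParallel(EncoderDecoder(encoder, decoder), device_ids=[0, 1])
# loss = model(im, tok).mean()  # average the per-replica losses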
Ok I see, you basically reimplemented that function. Looks fine, and it should also have the same performance. Another way would have been to do something like this when saving:

from collections import OrderedDict

new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # strip the `module.` prefix (len('module.') == 7)
    new_state_dict[name] = v

It's quite hacky. I like your solution better. |
Batch size 512, each trained for 2 epochs:
BLEU: 0.637, ED: 3.04e-01, ACC: 0.292: 0%| | 0/389 [00:00<?, ?it/s]
Loss: 0.0629: 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 1016/1067 [26:33<01:19, 1.57s/it]
BLEU: 0.599, ED: 3.38e-01, ACC: 0.321: 3%|███▍ | 10/389 [00:16<10:07, 1.60s/it]
Loss: 0.0806: 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 1019/1067 [23:02<01:05, 1.36s/it]
I created a branch from ff2641c (git checkout -b nn_parallel ff2641cd4bddaa672e22823bb0a405f8e87bcf15), then: BLEU: 0.069, ED: 2.81e+00, ACC: 0.039: 0%| | 0/389 [00:08<?, ?it/s]
Loss: 0.3394: 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 1016/1067 [22:50<01:08, 1.35s/it]
BLEU: 0.340, ED: 7.07e-01, ACC: 0.192: 3%|███▍ | 10/389 [00:22<14:18, 2.26s/it]
Loss: 0.1870: 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 1019/1067 [23:06<01:05, 1.36s/it] |
Got it, will try this tomorrow. |
I don't think this will be necessary. You found a better solution already.
This. That's the way to go. Handling the parallel data stuff in the forward method of … |
Gotcha, I will implement it ASAP. |
Tested again; it seems the same as approach 2, nn.DataParallel. BLEU: 0.069, ED: 2.81e+00, ACC: 0.039: 0%| | 0/389 [00:08<?, ?it/s]
Loss: 0.3394: 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 1016/1067 [21:28<01:04, 1.27s/it]
BLEU: 0.341, ED: 7.05e-01, ACC: 0.192: 3%|███▍ | 10/389 [00:22<14:31, 2.30s/it]
Loss: 0.1870: 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 1019/1067 [21:29<01:00, 1.27s/it] |
This reverts some parts of commit e2b55fb.
The … Also moved the … |
Should I test again, or is there anything else we need to do? |
This shouldn't have broken anything. But maybe check whether it runs on multiple GPUs at all. If it does, there is a good chance it still works as before :) I also think this PR is almost ready to merge. How does the ViT performance compare to the hybrid? |
Oops, same config as before, but something happened. Loss: 0.4742: 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 999/1067 [21:24<01:27, 1.29s/it]
Traceback (most recent call last):
File "/root/anaconda3/envs/pix2tex/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/root/anaconda3/envs/pix2tex/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 116, in <module>
train(args)
File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 81, in train
bleu_score, edit_distance, token_accuracy = evaluate(model, valdataloader, args, num_batches=int(args.valbatches*e/args.epochs), name='val')
File "/root/anaconda3/envs/pix2tex/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/yuhang/LaTeX-OCR/pix2tex/eval.py", line 54, in evaluate
dec = model.generate(im.to(device), temperature=args.get('temperature', .2))
File "/root/anaconda3/envs/pix2tex/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/yuhang/LaTeX-OCR/pix2tex/models/utils.py", line 37, in generate
eos_token=self.args.eos_token, context=self.encoder(x), temperature=temperature)
File "/root/anaconda3/envs/pix2tex/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/yuhang/LaTeX-OCR/pix2tex/models/transformer.py", line 40, in generate
out = torch.cat((out, sample), dim=-1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 20 for tensor number 1 in the list.
I am looking at it; do you have any idea about it?
Up to now, the best result is BLEU 0.8 with the default config; if there is any news, I will share it with you. |
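A plausible reading of the traceback (an assumption, not the confirmed fix): generation starts from a single BOS row while the encoder context has batch size 20, so torch.cat fails on the batch dimension. A sketch with the start tokens expanded to the batch size (variable names hypothetical):

import torch

# One BOS start token per image in the batch, so `out` keeps the same
# batch dimension as each newly sampled `sample` tensor.
bos = torch.LongTensor([args.bos_token] * x.size(0))[:, None].to(x.device)
dec = decoder.generate(bos, args.max_seq_len,
                       eos_token=args.eos_token,
                       context=encoder(x), temperature=0.2)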
Should be fine now |
Yeah, it works fine. BLEU: 0.180, ED: 1.12e+00, ACC: 0.092: 0%| | 0/389 [00:01<?, ?it/s]
Loss: 0.3501: 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 1016/1067 [23:01<01:09, 1.36s/it]
Loss: 0.4497: 30%|██████████████████████████████████████████████▋ | 319/1067 [07:17<11:13, 1.11it/s] |
Pure ViT structure
We discussed the pure ViT structure in #131.
And I came up with the same result: the model can't converge. In fact, I had hoped that a larger pure ViT could achieve high performance, so it really frustrated me. But in recent days, #147 (comment) gave me an idea, because the training loss curve there looks very similar to the pure ViT training curve; so I think the reason pure ViT can't fit may be the batch size.
I took and modified models.py from 844bc21.
Here is the good news: it's working.
How to use
Data parallelism
I think multi-GPU training can save time and allow a larger batch size, so I referred to some documents and blog posts and made these changes.
Also, it's compatible with a single GPU.
How to use
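Presumably usage follows option 1 from the discussion above (commands repeated from the earlier comment):

#multi-gpu
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m pix2tex.train --config model/settings/config-vit.yaml
#one-gpu
export CUDA_VISIBLE_DEVICES=0
python -m pix2tex.train --config model/settings/config-vit.yaml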
References: nn.DataParallel.forward and the PyTorch parallelism_tutorial.
Small modifications
I think both hybrid and pure ViT work, so why not put them together? I created a folder named structures. The branch is based on 720978d; please feel free to correct any inappropriate code. 😁