Data parallelism【multi-gpu train】+pure ViT work + small modify #150

Merged
merged 21 commits into lukas-blecher:main from data-parallelism on May 20, 2022

Conversation

@TITC (Collaborator) commented May 17, 2022

pure ViT structure

We discussed pure ViT structure at #131 .

  1. Initially, I used a pure ViT (6ecc3f4). But the encoder was just not performing very well. The model produced latex code but it has nothing to do with the input image.

I got the same result: the model couldn't converge. In fact, I had hoped that a larger pure ViT could reach higher performance, and it really frustrated me. But in recent days, #147 (comment) gave me an idea: the training loss curve there looks very similar to the pure ViT training curve, so I think the reason the pure ViT can't fit may be the batch size.

I took models.py from 844bc21 and modified it.

Here is the good news: it's working.

image

How to use

# for vit
python -m pix2tex.train --config model/settings/config-vit.yaml --structure vit
# for hybrid (the default)
python -m pix2tex.train --config model/settings/config.yaml --structure hybrid
python -m pix2tex.train --config model/settings/config.yaml

Data parallelism

I think multi-GPU training can save time and allow a larger batch size, so I referred to some documents and blog posts and made the corresponding changes (a rough sketch follows the references below).
It is also compatible with a single GPU.

How to use

# for one GPU
export CUDA_VISIBLE_DEVICES=6
python -m pix2tex.train --config model/settings/config-vit.yaml --structure vit
# for multiple GPUs
export CUDA_VISIBLE_DEVICES=6,7
python -m pix2tex.train --config model/settings/config-vit.yaml --structure vit

References:

  1. Technique 1: Data Parallelism
  2. data_parallel_tutorial.ipynb
  3. AttributeError: 'DataParallel' object has no attribute 'train_model' jytime/Mask_RCNN_Pytorch#2 (comment)
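
Roughly, the core of the change is to wrap the encoder and decoder in nn.DataParallel only when more than one GPU is visible. A minimal sketch (illustrative only, not the exact code in this PR):

import torch
import torch.nn as nn

def maybe_parallelize(module: nn.Module) -> nn.Module:
    """Wrap a module in nn.DataParallel only when several GPUs are visible."""
    if torch.cuda.device_count() > 1:
        # nn.DataParallel splits each input batch along dim 0 across the visible GPUs
        return nn.DataParallel(module)
    return module

# hypothetical usage inside the training setup:
# encoder = maybe_parallelize(encoder).to(device)
# decoder = maybe_parallelize(decoder).to(device)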

Small modifications

Since both the hybrid and the pure ViT structures now work, why not put them together? So I created a folder named structures.


The branch is based on 720978d. Please feel free to correct any inappropriate code. 😁

@lukas-blecher (Owner) commented May 17, 2022

That's great news!
I've restructured the models code a little bit so that it is easier to call and removed some duplicate code.
Feel free to take a look, and tell me if I messed up somewhere.
I've moved the structure argument into the config and combined everything in the new models module.
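
Conceptually the combined models module just dispatches on that structure value. A tiny self-contained sketch with stand-in encoders (not the actual pix2tex classes):

import torch.nn as nn

class TinyViTEncoder(nn.Module):
    # stand-in for the pure ViT encoder: patchify with a strided convolution
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=16, stride=16)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)

class TinyHybridEncoder(nn.Module):
    # stand-in for the CNN-backbone + transformer hybrid encoder
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1))

    def forward(self, x):
        return self.backbone(x).flatten(2).transpose(1, 2)

def get_encoder(structure: str) -> nn.Module:
    if structure == "vit":
        return TinyViTEncoder()
    if structure == "hybrid":
        return TinyHybridEncoder()
    raise ValueError(f"unknown encoder structure: {structure}")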

@lukas-blecher added the enhancement label on May 17, 2022
@TITC (Collaborator, Author) commented May 17, 2022

Thanks a lot, I benefit greatly from every pull request you review.


I tested again with both vit and hybrid; only the problem below remains.

Traceback (most recent call last):                                                                   
  File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 193, in _run_module_as_main 
    "__main__", mod_spec)                                                                                                                                                                                  
  File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 85, in _run_code                                                                                                                 
    exec(code, run_globals)                                                                          
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 102, in <module>                                                                                                                                         
    train(args)                                                                                                                                                                                            
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 27, in train                              
    model = get_model(args, training=True)                                                           
  File "/yuhang/LaTeX-OCR/pix2tex/models/utils.py", line 49, in get_model                            
    en_attn_layers = encoder.module.attn_layers if available_gpus > 1 else encoder.attn_layers                                                                                                             
  File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__                                                                             
    type(self).__name__, name))
AttributeError: 'CustomVisionTransformer' object has no attribute 'attn_layers'

If there is a better way to re-write 0aefdbf, please modify it.
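
For reference, one common way to guard that kind of attribute access is to unwrap conditionally (just a sketch, not necessarily what we should do here):

import torch.nn as nn

def unwrap(module: nn.Module) -> nn.Module:
    # return the underlying module whether or not it is wrapped in nn.DataParallel
    return module.module if isinstance(module, nn.DataParallel) else module

# en_attn_layers = unwrap(encoder).attn_layers  # works for wrapped and plain encoders alike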

@lukas-blecher (Owner)

Sorry, my fault. I didn't try with wandb: True

Why don't you just watch the model = Model(encoder, decoder, args)?
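
Something along these lines, for example (the project name, stand-in model and log settings here are placeholders):

import torch.nn as nn
import wandb

# stand-in for Model(encoder, decoder, args); illustration only
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

wandb.init(project="LaTeX-OCR", mode="offline")    # project name and mode are assumptions
wandb.watch(model, log="gradients", log_freq=100)  # log gradient histograms while training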

@TITC (Collaborator, Author) commented May 17, 2022

Because I didn't know what it was used for before, like the random-seed stuff, haha.

I read the docs and found it is related to gradients; maybe it can give some further clues about fitting the model.
image

Thanks for your guidance.

@TITC (Collaborator, Author) commented May 18, 2022

Based on this Stack Overflow Q&A, it seems the git pull command fetched 6c53105 and automatically merged it locally, then created commit 4ffbeca. That was a bit beyond my expectation; I thought it would only commit 5cbbcb9.

@lukas-blecher (Owner)

Before I merge this: something went wrong along the way.
When I try to load the pretrained model, the decoder architecture has changed:

        Missing key(s) in state_dict: "decoder.net.attn_layers.layers.0.1.to_out.weight", "decoder.net.attn_layers.layers.0.1.to_out.bias", "decoder.net.attn_layers.layers.1.1.to_out.weight", "decoder.net.attn_layers.layers.1.1.to_out.bias", "decoder.net.attn_layers.layers.2.1.net.0.0.weight", "decoder.net.attn_layers.layers.2.1.net.0.0.bias", "decoder.net.attn_layers.layers.3.1.to_out.weight", "decoder.net.attn_layers.layers.3.1.to_out.bias", "decoder.net.attn_layers.layers.4.1.to_out.weight", "decoder.net.attn_layers.layers.4.1.to_out.bias", "decoder.net.attn_layers.layers.5.1.net.0.0.weight", "decoder.net.attn_layers.layers.5.1.net.0.0.bias", "decoder.net.attn_layers.layers.6.1.to_out.weight", "decoder.net.attn_layers.layers.6.1.to_out.bias", "decoder.net.attn_layers.layers.7.1.to_out.weight", "decoder.net.attn_layers.layers.7.1.to_out.bias", "decoder.net.attn_layers.layers.8.1.net.0.0.weight", "decoder.net.attn_layers.layers.8.1.net.0.0.bias", "decoder.net.attn_layers.layers.9.1.to_out.weight", "decoder.net.attn_layers.layers.9.1.to_out.bias", "decoder.net.attn_layers.layers.10.1.to_out.weight", "decoder.net.attn_layers.layers.10.1.to_out.bias", "decoder.net.attn_layers.layers.11.1.net.0.0.weight", "decoder.net.attn_layers.layers.11.1.net.0.0.bias".
        Unexpected key(s) in state_dict: "decoder.net.attn_layers.layers.0.1.to_out.0.weight", "decoder.net.attn_layers.layers.0.1.to_out.0.bias", "decoder.net.attn_layers.layers.1.1.to_out.0.weight", "decoder.net.attn_layers.layers.1.1.to_out.0.bias", "decoder.net.attn_layers.layers.2.1.net.0.proj.weight", "decoder.net.attn_layers.layers.2.1.net.0.proj.bias", "decoder.net.attn_layers.layers.3.1.to_out.0.weight", "decoder.net.attn_layers.layers.3.1.to_out.0.bias", "decoder.net.attn_layers.layers.4.1.to_out.0.weight", "decoder.net.attn_layers.layers.4.1.to_out.0.bias", "decoder.net.attn_layers.layers.5.1.net.0.proj.weight", "decoder.net.attn_layers.layers.5.1.net.0.proj.bias", "decoder.net.attn_layers.layers.6.1.to_out.0.weight", "decoder.net.attn_layers.layers.6.1.to_out.0.bias", "decoder.net.attn_layers.layers.7.1.to_out.0.weight", "decoder.net.attn_layers.layers.7.1.to_out.0.bias", "decoder.net.attn_layers.layers.8.1.net.0.proj.weight", "decoder.net.attn_layers.layers.8.1.net.0.proj.bias", "decoder.net.attn_layers.layers.9.1.to_out.0.weight", "decoder.net.attn_layers.layers.9.1.to_out.0.bias", "decoder.net.attn_layers.layers.10.1.to_out.0.weight", "decoder.net.attn_layers.layers.10.1.to_out.0.bias", "decoder.net.attn_layers.layers.11.1.net.0.proj.weight", "decoder.net.attn_layers.layers.11.1.net.0.proj.bias".

I'll look into it

@lukas-blecher (Owner)

One other thing I just remembered:

What if you train your model on multiple GPUs and then try to finetune or evaluate on a single-GPU machine?
Won't there be an error because of the nn.DataParallel wrapper? I think the state dict should be saved without the wrapper, which makes saving and loading a bit more complicated.
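
One common pattern is to unwrap before saving, roughly like this (a sketch, not a prescription for this PR):

import torch
import torch.nn as nn

def save_checkpoint(model: nn.Module, path: str) -> None:
    # strip the nn.DataParallel wrapper so the checkpoint has no "module." prefix
    state = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
    torch.save(state, path)

# a single-GPU (or CPU) machine can then load it without renaming any keys:
# model.load_state_dict(torch.load(path, map_location="cpu"))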

@TITC (Collaborator, Author) commented May 18, 2022

I will test that situation right now.

What if you train your model on multiple GPUs and then try to finetune or evaluate on a single-GPU machine?

There is no rush to merge this PR, I think we have plenty of time to test and troubleshoot the situation.

@lukas-blecher (Owner)

Before I merge this: something went wrong along the way. When I try to load the pretrained model, the decoder architecture has changed [state_dict key mismatch quoted above].

I'll look into it

Solved in ff2641c

@TITC (Collaborator, Author) commented May 18, 2022

You are right, I am dealing with it.

What if you train your model on multiple GPUs and then try to finetune or evaluate on a single-GPU machine?

@TITC (Collaborator, Author) commented May 18, 2022

I referenced nn.DataParallel.forward and the parallelism_tutorial, and the solution is gradually emerging: write a function that only uses multiple GPUs when it is called, so there is no need for the wrapper anymore, and no other code has to change to be compatible with nn.DataParallel.

But I hit an AssertionError after some steps; I hope I can solve it in the next few days.

Loss: 0.9900:   4%|█████▍                                                                                                                                               | 289/7891 [01:35<41:53,  3.03it/s]
Traceback (most recent call last):
  File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 123, in <module>
    train(args)
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 75, in train
    encoded = data_parallel(encoder,inputs=im[j:j+microbatch].to(device), device_ids=[0,1,2])
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 29, in data_parallel
    outputs = nn.parallel.parallel_apply(replicas, inputs,kwargs)
  File "/root/anaconda3/envs/latex_ocr_test/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 40, in parallel_apply
    assert len(modules) == len(kwargs_tup)
AssertionError
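
For reference, the helper is roughly along these lines, adapted from the PyTorch parallelism tutorial (illustrative, not the exact code in this branch). Note that parallel_apply expects one input chunk, and one kwargs dict if any, per replica; a length mismatch there is exactly the kind of thing behind the AssertionError above.

import torch.nn as nn

def data_parallel(module, inputs, device_ids, output_device=None):
    # one forward pass split across several GPUs, without keeping an nn.DataParallel wrapper;
    # `module` is assumed to already live on device_ids[0]
    if not device_ids or len(device_ids) == 1:
        return module(inputs)
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)  # one copy of the module per GPU
    scattered = nn.parallel.scatter(inputs, device_ids)   # split the batch along dim 0
    replicas = replicas[:len(scattered)]                  # a small last batch may use fewer GPUs
    outputs = nn.parallel.parallel_apply(replicas, scattered)
    return nn.parallel.gather(outputs, output_device)     # concatenate the results on one device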

@TITC (Collaborator, Author) commented May 19, 2022

Which way do you think is better for switching GPUs? @lukas-blecher

  1. Use an environment variable to specify the GPU indices

# multi-GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m pix2tex.train --config model/settings/config-vit.yaml
# one GPU
export CUDA_VISIBLE_DEVICES=0
python -m pix2tex.train --config model/settings/config-vit.yaml

  2. Write the GPU indices into config.yaml

# multi-GPU
gpu_indices: [0,1,2,3,4,5,6,7]
python -m pix2tex.train --config model/settings/config-vit.yaml
# one GPU
gpu_indices: [0]
python -m pix2tex.train --config model/settings/config-vit.yaml
# or default to one GPU with gpu_indices set to null
gpu_indices: null

@lukas-blecher (Owner)

Option 1 is better.
I think selecting the GPUs with that option is straightforward.
This is how I believe it works (not 100% sure): say

export CUDA_VISIBLE_DEVICES=2,3

In Python, cuda:0 and cuda:1 will then correspond to physical devices 2 and 3.

Also, you don't need to change the yaml settings when running the script on another machine or multiple times at the same time.
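
A quick way to check the renumbering (assuming CUDA_VISIBLE_DEVICES=2,3 was exported before launching Python):

import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "2,3"
print(torch.cuda.device_count())               # 2 -- PyTorch only sees the visible cards
# inside this process, cuda:0 and cuda:1 refer to physical GPUs 2 and 3 respectively
x = torch.zeros(1, device="cuda:0")            # allocated on physical GPU 2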

@TITC (Collaborator, Author) commented May 19, 2022

Agreed, one command and the problem is solved.

After a couple of hours of use, though, I found that when the GPUs on a server are shared with other people, some of the cards already have part of their memory occupied. Then I have to check the shell history or search Google to copy the right export CUDA_VISIBLE_DEVICES=xxxxx and switch to the GPUs with free memory; doing that once is convenient, but doing it frequently gets kind of annoying.

So I set option 1 as the default and kept compatibility with option 2 for this situation. = ̄ω ̄=
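
Roughly, the selection logic I have in mind looks like this (a sketch; the function name and the gpu_indices key are placeholders):

import os
from typing import List, Optional

def resolve_gpu_indices(config_indices: Optional[List[int]] = None) -> List[int]:
    # option 1: prefer CUDA_VISIBLE_DEVICES if it is set
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env:
        # PyTorch renumbers the visible cards from 0, so only the count matters here
        return list(range(len([d for d in env.split(",") if d.strip()])))
    # option 2: fall back to a gpu_indices entry in the config
    if config_indices:
        return config_indices
    return [0]  # default: a single GPU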


Code tested with:

  • a model trained in a one-GPU environment and loaded in a multi-GPU environment (ViT)
  • a model trained in a multi-GPU environment and loaded in a one-GPU environment (ViT)
  • training on multiple GPUs, both ViT and hybrid
  • training on one GPU, both ViT and hybrid

@lukas-blecher (Owner)

I've never seen nn.parallel used directly before. Did you notice performance differences between the nn.DataParallel class and the new method?

@TITC (Collaborator, Author) commented May 19, 2022

The PyTorch library first imports DataParallel and data_parallel from .data_parallel in nn/parallel/__init__.py, then does from .parallel import DataParallel in nn/__init__.py.

nn.DataParallel is nn.parallel.data_parallel.DataParallel, so it is basically part of nn.parallel.
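
A quick check that they are the same object:

import torch.nn as nn
from torch.nn.parallel import DataParallel, data_parallel  # class and functional form

# nn.DataParallel is just the class from nn.parallel re-exported at the torch.nn level
assert nn.DataParallel is DataParallel is nn.parallel.DataParallel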


I think the method is almost the same as DataParallel.forward, but I am not quite sure.

Did you notice performance differences between the nn.DataParallel class and the new method?

I am testing it.


The reason I haven't used nn.DataParallel is that it only parallelizes self.module. If I wrap the whole model in nn.DataParallel, it will not cover the encoder and decoder, which the training loop calls separately; if I wrap the encoder and decoder directly, I haven't found a solution for loading the weights.

An alternative solution that came to mind is to rewrite model.forward to include encoder.forward and decoder.forward, so a model wrapped in nn.DataParallel could directly call model(xxxx) to run the forward pass in multi-GPU mode (I guess; I haven't tried it).

But I noticed you already wrote generate as model.forward.

@lukas-blecher (Owner)

Ok I see, you basically reimplemented that function. Looks fine, should also have the same performance.

Another way would have been to do something like this, when saving:

from collections import OrderedDict

new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # strip the leading `module.` prefix added by nn.DataParallel
    new_state_dict[name] = v

It's quite hacky. I like your solution better.

@TITC (Collaborator, Author) commented May 19, 2022

batchsize: 512
micro_batchsize: 128
gpu: RTX3090*3
encoder: vit

Each was trained for 2 epochs:

  1. this method
BLEU: 0.637, ED: 3.04e-01, ACC: 0.292:   0%|                                                                                                                                              | 0/389 [00:00<?, ?it/s]
Loss: 0.0629:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌       | 1016/1067 [26:33<01:19,  1.57s/it]
BLEU: 0.599, ED: 3.38e-01, ACC: 0.321:   3%|███▍                                                                                                                                 | 10/389 [00:16<10:07,  1.60s/it]
Loss: 0.0806:  96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████       | 1019/1067 [23:02<01:05,  1.36s/it]
  2. nn.DataParallel

I created a branch from ff2641c:

git checkout -b nn_parallel ff2641cd4bddaa672e22823bb0a405f8e87bcf15

then

BLEU: 0.069, ED: 2.81e+00, ACC: 0.039:   0%|                                                                                                                                              | 0/389 [00:08<?, ?it/s]
Loss: 0.3394:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌       | 1016/1067 [22:50<01:08,  1.35s/it]
BLEU: 0.340, ED: 7.07e-01, ACC: 0.192:   3%|███▍                                                                                                                                 | 10/389 [00:22<14:18,  2.26s/it]
Loss: 0.1870:  96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████       | 1019/1067 [23:06<01:05,  1.36s/it]

@TITC (Collaborator, Author) commented May 19, 2022

Another way would have been to do something like this

Got it, will try this tomorrow.

@lukas-blecher (Owner)

Another way would have been to do something like this

Got it, will try this tomorrow.

I don't think this will be necessary. You found a better solution already.

An alternative solution that came to mind is to rewrite model.forward to include encoder.forward and decoder.forward,

This. That's the way to go: handle the data-parallel logic in the forward method of Model.
Move the current forward to generate or something similar and replace all model calls in eval and cli with the generate function.
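
Roughly like this (a sketch with hypothetical encoder/decoder interfaces, not the exact pix2tex signatures):

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, images: torch.Tensor, tgt_seq: torch.Tensor) -> torch.Tensor:
        # encoder and decoder both run inside forward, so wrapping the whole Model
        # (or passing it to a data_parallel helper) splits the full training step across GPUs
        context = self.encoder(images)
        return self.decoder(tgt_seq, context=context)

    @torch.no_grad()
    def generate(self, images: torch.Tensor, temperature: float = 0.2):
        # the old forward (autoregressive sampling) moves here; eval and cli then call
        # model.generate(...) instead of model(...)
        context = self.encoder(images)
        return self.decoder.generate(context, temperature=temperature)  # hypothetical decoder API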

@TITC (Collaborator, Author) commented May 19, 2022

Gotcha, I will implement it ASAP.

@TITC (Collaborator, Author) commented May 20, 2022

Tested again; the results seem about the same as 2. nn.DataParallel.

BLEU: 0.069, ED: 2.81e+00, ACC: 0.039:   0%|                                                                                                                                              | 0/389 [00:08<?, ?it/s]
Loss: 0.3394:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌       | 1016/1067 [21:28<01:04,  1.27s/it]
BLEU: 0.341, ED: 7.05e-01, ACC: 0.192:   3%|███▍                                                                                                                                 | 10/389 [00:22<14:31,  2.30s/it]
Loss: 0.1870:  96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████       | 1019/1067 [21:29<01:00,  1.27s/it]

@lukas-blecher (Owner)

The generate function doesn't exist for the LatexOCR class. I've corrected this from a previous commit, replacing the relevant lines of code with the new generate function of Model.

I also moved the data_parallel handling into the Model class.

@TITC (Collaborator, Author) commented May 20, 2022

Should I test again, or is there anything else we need to do?

@lukas-blecher (Owner)

This shouldn't have broken anything. But maybe check whether it runs on multiple GPUs at all. If it does, there is a good chance it still works as before :)

I also think this PR is almost ready to merge.

How does the ViT performance compare to the hybrid?

@TITC (Collaborator, Author) commented May 20, 2022

Oops, same config as before, but something happened.

Loss: 0.4742:  94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████          | 999/1067 [21:24<01:27,  1.29s/it]
Traceback (most recent call last):
  File "/root/anaconda3/envs/pix2tex/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/pix2tex/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 116, in <module>
    train(args)
  File "/yuhang/LaTeX-OCR/pix2tex/train.py", line 81, in train
    bleu_score, edit_distance, token_accuracy = evaluate(model, valdataloader, args, num_batches=int(args.valbatches*e/args.epochs), name='val')
  File "/root/anaconda3/envs/pix2tex/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/yuhang/LaTeX-OCR/pix2tex/eval.py", line 54, in evaluate
    dec = model.generate(im.to(device), temperature=args.get('temperature', .2))
  File "/root/anaconda3/envs/pix2tex/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/yuhang/LaTeX-OCR/pix2tex/models/utils.py", line 37, in generate
    eos_token=self.args.eos_token, context=self.encoder(x), temperature=temperature)
  File "/root/anaconda3/envs/pix2tex/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/yuhang/LaTeX-OCR/pix2tex/models/transformer.py", line 40, in generate
    out = torch.cat((out, sample), dim=-1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 20 for tensor number 1 in the list.

I am looking into it; do you have any idea what might cause it?


How does the ViT performance compare to the hybrid?

Up to now, the best result is a BLEU of 0.8 with the default config; if there is any news I will share it with you.

@lukas-blecher (Owner)

Should be fine now

@TITC (Collaborator, Author) commented May 20, 2022

Yeah, it works fine.

BLEU: 0.180, ED: 1.12e+00, ACC: 0.092:   0%|                                                                                                                                              | 0/389 [00:01<?, ?it/s]
Loss: 0.3501:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌       | 1016/1067 [23:01<01:09,  1.36s/it]
Loss: 0.4497:  30%|██████████████████████████████████████████████▋                                                                                                             | 319/1067 [07:17<11:13,  1.11it/s]

@lukas-blecher lukas-blecher merged commit 06b7a9a into lukas-blecher:main May 20, 2022
@TITC TITC deleted the data-parallelism branch May 21, 2022 01:11