VAE Training Time #42

Closed
yufeng9819 opened this issue Apr 20, 2023 · 2 comments

Comments

@yufeng9819

Hey @ZENGXH,
Thanks for your great work.
I would like to ask how long it takes to train the VAE on all categories.
I have been training the VAE on all categories on 8 V100 16GB GPUs for 15 days with batch size 12, but only about 4000 epochs have completed so far.

2023-04-20 18:52:20.615 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E4112 iter[371/372] | [Loss] 8847.80 | [exp] ../exp/0405/all/1c389bh_hvae_lion_B12 | [step] 1530035 | [url] none | [time] 5.0m (~325h) |[best] 199 0.001x1e-2
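For a rough sanity check of the numbers in that log line, the fields can be pulled out and turned into elapsed and remaining time. This is only a sketch: the regular expression is written for this one line, and reading `5.0m` as minutes per epoch and `~325h` as the trainer's estimate of remaining time is an assumption, not something the log states.

```python
import re

# Log line copied from the post above.
LOG_LINE = (
    "2023-04-20 18:52:20.615 | INFO | trainers.base_trainer:train_epochs:256 - "
    "[R0] | E4112 iter[371/372] | [Loss] 8847.80 | "
    "[exp] ../exp/0405/all/1c389bh_hvae_lion_B12 | [step] 1530035 | "
    "[url] none | [time] 5.0m (~325h) |[best] 199 0.001x1e-2"
)

# Hypothetical pattern, written only to match the line above.
PATTERN = re.compile(
    r"E(?P<epoch>\d+) iter\[(?P<it>\d+)/(?P<its>\d+)\].*?"
    r"\[step\] (?P<step>\d+).*?"
    r"\[time\] (?P<minutes>[\d.]+)m \(~(?P<hours>\d+)h\)"
)


def parse_progress(line):
    """Extract epoch, step and timing fields from one trainer log line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("unrecognized log line")
    return {
        "epoch": int(m["epoch"]),
        "iters_per_epoch": int(m["its"]),
        "step": int(m["step"]),
        "minutes_per_epoch": float(m["minutes"]),  # assumed meaning of "5.0m"
        "hours_remaining": int(m["hours"]),        # assumed meaning of "~325h"
    }


def summarize(p):
    """Convert the parsed fields into elapsed days and epochs still to run."""
    return {
        "elapsed_days": p["epoch"] * p["minutes_per_epoch"] / 60 / 24,
        "epochs_left": p["hours_remaining"] * 60 / p["minutes_per_epoch"],
        "steps_per_epoch": p["step"] / p["epoch"],
    }


progress = parse_progress(LOG_LINE)
summary = summarize(progress)
print(summary)
```

Under those assumptions, 4112 epochs at 5 minutes each comes to about 14.3 days, which is consistent with the 15 days reported above.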

Is there any way to accelerate the training process (for example, by increasing the batch size)?

Another problem is that it is hard for me to judge whether the VAE is well trained; I don't think visualisation alone is a comprehensive way to assess it. Since training takes so long, it is important to be able to verify that it is actually working.

@ZENGXH
Collaborator

ZENGXH commented Apr 21, 2023

Hi,
I think 15 days is probably enough. (I only trained for 7 days on 4 A100s; I stopped early because of the paper deadline.) For the 55-class setting, we don't need to run the same number of epochs as for single-class data, since there is much more data.

In terms of acceleration, yes, increasing the batch size should help, especially for diffusion model training.

One thing you can try is to inspect the loss curves and see whether they have reached the flat region (i.e., converged). For the diffusion model training, you can evaluate the 1-NNA metric to see whether it has fully converged.
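For readers unfamiliar with 1-NNA (1-nearest-neighbour accuracy): it pools generated and reference samples, classifies each sample by the label of its nearest neighbour among the others, and reports the accuracy. A value close to 50% means the two sets are hard to tell apart; a value close to 100% means they are easy to separate. Below is a minimal sketch on a precomputed distance matrix. The function names and the 1-D toy distance are illustrative assumptions, not the repository's evaluation code; in practice the matrix would hold whatever pairwise shape distance the evaluation uses.

```python
def pairwise_abs(points):
    """Toy distance matrix for 1-D points (stand-in for a real shape distance)."""
    return [[abs(a - b) for b in points] for a in points]


def one_nna(dist, n_generated):
    """Leave-one-out 1-NN accuracy.

    dist: square pairwise distance matrix whose first n_generated rows/columns
          are generated samples and whose remaining ones are reference samples.
    """
    n = len(dist)
    labels = [0] * n_generated + [1] * (n - n_generated)
    correct = 0
    for i in range(n):
        # Nearest neighbour of sample i, excluding i itself.
        j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        correct += labels[j] == labels[i]
    return correct / n


# Generated samples far from the reference samples: trivially separable.
separated = one_nna(pairwise_abs([0.0, 0.1, 10.0, 10.1]), n_generated=2)

# Each generated sample sits right next to a reference sample.
paired = one_nna(pairwise_abs([0.0, 10.1, 0.1, 10.0]), n_generated=2)

print(separated, paired)
```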

For reference, these are my reconstruction results from when I stopped my VAE training:
[image: valrecont_step_889600]
These are my training curves:
[three images: training curves]
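
The "flat region" check suggested above can also be done numerically instead of by eye. One simple way, sketched here under assumptions, is to compare the mean loss over the most recent window of logged values with the mean over the window before it. The function names, window size and tolerance are all hypothetical choices that would need tuning for a real run.

```python
def relative_improvement(losses, window):
    """Relative drop in mean loss between the previous and the latest window."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    if prev == 0:
        return 0.0
    return (prev - last) / abs(prev)


def has_plateaued(losses, window=100, tol=0.01):
    """True if the latest window improved on the previous one by less than tol.

    Note that a loss that is getting worse also counts as "not improving".
    """
    return relative_improvement(losses, window) < tol


# Toy curves: one still decreasing steadily, one completely flat.
still_improving = [100.0 - i for i in range(20)]
flat = [50.0] * 20

print(has_plateaued(still_improving, window=5))
print(has_plateaued(flat, window=5))
```

A check like this only says that the loss has stopped moving; it does not by itself say the reconstructions are good, so it complements rather than replaces looking at the reconstruction results.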

@yufeng9819
Author

Great!
Thanks for your response.
