
anyone reproduced the celeba-HQ results in the paper #37

Open
winwinJJiang opened this issue Jul 26, 2018 · 12 comments

Comments
@winwinJJiang

Hi, has anyone reproduced the HQ (256*256) images? My problem is that my GPUs cannot train for such a long time.

@gwern

gwern commented Aug 2, 2018

The Glow paper is very unclear about the computational demands, but if you look at the README's example command for CelebA, or the blog post, you'll see that they train the CelebA model on 40 GPUs for an unspecified (but probably more than a week of) time. That's almost 1 GPU-year, so it's no surprise that people trying it out on 1 or 2 GPUs (like myself) for a few days - or weeks at most - haven't reached similar results.

If you just want to generate 256px images, you might be better off with ProGAN; at least until Glow gets self-attention or progressive growing, it won't be competitive. Consider what it would cost to reproduce on AWS spot right now: ~$0.3/h for a single p2.xlarge instance at spot; 40 of those, for say a week, which is 7*24=168h; 0.3*40*168=$2,016 - assuming nothing goes wrong. (And things might well go wrong: I've run into a bunch of Glow crashes due to losing 'invertibility'. There's no mention of this in the Glow repo or issues, and the default checkpointing is very infrequent, so I assume it wasn't a problem for them because of the very large minibatches from using 40 GPUs.)
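Spelled out, that back-of-the-envelope estimate (all inputs are assumptions - spot price, instance count, and a week of wall-clock time):

```python
# Rough AWS spot-cost estimate for a 40-GPU, one-week Glow run.
# All inputs are assumptions, not measured figures.
spot_price_per_hour = 0.30   # USD per p2.xlarge at spot (approximate)
n_instances = 40             # one GPU each, matching the paper's 40 GPUs
hours = 7 * 24               # one week of wall-clock training = 168 h

total_cost = spot_price_per_hour * n_instances * hours
print(f"Estimated cost: ${total_cost:,.0f}")  # Estimated cost: $2,016
```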

@prafullasd
Contributor

prafullasd commented Aug 2, 2018

Yes, we trained with 40 GPUs for about a week, but samples did start to look good after a couple of days. If you're getting invertibility errors with small batch sizes, try increasing the warmup epochs or decreasing the learning rate.
A repository that seems to be able to get similar results to ours is https://github.com/musyoku/chainer-glow

To train faster, you could work on a smaller resolution, use a smaller model, or try to tweak the learning rate / optimizer for faster convergence (especially if you're using big batch sizes). If you want to use a larger minibatch per GPU, you can try implementing the O(1)-memory version, which uses the reversibility to avoid storing activations while backpropagating, thus using GPU memory that is independent of the depth of the model. An example implementation of O(1)-memory reversible flow models in TensorFlow is here (this one does RealNVP) - https://github.com/unixpickle/cnn-toys/tree/master/cnn_toys/real_nvp
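To illustrate the O(1)-memory idea (this is not the repo's implementation, just a minimal NumPy sketch of an additive coupling layer): because the inverse is exact, intermediate activations can be recomputed from the outputs during backpropagation instead of being stored, so memory no longer grows with depth. The `nn` function and weight matrix here are hypothetical stand-ins for the coupling network.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # hypothetical weights of the coupling network

def nn(x_half):
    """Stand-in for the coupling network (any function of the first half)."""
    return np.tanh(x_half @ W)

def coupling_forward(x):
    # Additive coupling: y1 = x1, y2 = x2 + nn(x1)
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([x1, x2 + nn(x1)], axis=-1)

def coupling_inverse(y):
    # Exact inverse: x1 = y1, x2 = y2 - nn(y1).
    # Since inputs are recoverable from outputs, the forward activations
    # don't need to be stored for the backward pass; they can be recomputed
    # layer by layer, giving memory use independent of model depth.
    y1, y2 = np.split(y, 2, axis=-1)
    return np.concatenate([y1, y2 - nn(y1)], axis=-1)

x = rng.normal(size=(4, 16))
assert np.allclose(coupling_inverse(coupling_forward(x)), x)
```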

@prafullasd prafullasd reopened this Aug 2, 2018
@iRmantou

Hi @prafullasd, I have read your paper and code; the results are amazing. But I am new to Horovod, and I notice your commands use mpiexec ... without "-H" or other parameters. That's very simple compared with the usage example on the Horovod GitHub site, which is as follows:

# run on 4 machines with 4 GPUs each
$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py

Is there anything I missed? Please give me some advice, thank you very much!
@gwern how many GPUs did you use? On one machine or a GPU cluster?

@gwern

gwern commented Aug 21, 2018

2x1080ti.

@iRmantou

@gwern thanks for your reply! Could you give me the command you used?
Just mpiexec -n 8 python train.py --problem cifar ... without any other parameters, such as -H, -bind-to none, and so on?

@gwern

gwern commented Aug 22, 2018

Yes, I just copied their command with mpiexec -n 2 (since I only have 2 GPUs, of course) and it worked. I didn't add any of the stuff you mentioned.

@iRmantou

@gwern Thank you so much!

@Avmb

Avmb commented Aug 28, 2018

@prafullasd About the invertibility issue, would it make sense to force approximate orthogonality of the 1x1 convolutions using a penalty? You'd avoid non-invertibility and numerical-instability errors; moreover, if the approximation is good enough, you can even save computation time by replacing matrix inversion with the transpose and dropping the determinant computation (you'd only need it once at init to determine whether it's 1 or -1, since it stays the same during training - actually you don't even need that, since only the absolute value matters).
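Something like this, as a rough TensorFlow sketch (the coefficient and where it gets added to the objective are just examples, not anything from the repo):

```python
import tensorflow as tf

def orthogonality_penalty(w, coeff=20.0):
    """Penalty ~ ||W^T W - I||_F^2, pushing the 1x1-conv weight towards orthogonality.

    w:     [c, c] weight matrix of the invertible 1x1 convolution.
    coeff: penalty multiplier (a hyperparameter; 20 is only an example value).
    If W stays near-orthogonal, its inverse is approximately its transpose and
    log|det W| is approximately 0.
    """
    c = int(w.shape[-1])
    gram = tf.matmul(w, w, transpose_a=True)       # W^T W
    diff = gram - tf.eye(c, dtype=w.dtype)         # W^T W - I
    return coeff * tf.reduce_sum(tf.square(diff))  # squared Frobenius norm

# e.g. added to the loss being minimized:
# loss = nll + orthogonality_penalty(w)
```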

@nshepperd

A good alternative to fix the invertibility issue would be to use the LU decomposition (which is included in the code, in model.py, but not used by default), with the diagonal entries of both triangular matrices fixed to 1 (which is not currently the case in the code). This would fix the determinant to 1 and ensure the matrix is always invertible.

Forcing approximate orthogonality with a penalty term is not a bad idea either.
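A minimal sketch of the LU idea above, independent of the actual model.py code (names and shapes here are mine): with the diagonals of both triangular factors fixed to 1, the determinant is ±1 and the matrix can never become singular.

```python
import numpy as np

def build_w(p, l_params, u_params):
    """Invertible 1x1-conv weight W = P @ L @ U with unit-diagonal L and U.

    p        : [c, c] fixed permutation matrix (not trained)
    l_params : [c, c] free parameters; only the strictly lower triangle is used
    u_params : [c, c] free parameters; only the strictly upper triangle is used
    """
    c = p.shape[0]
    l = np.tril(l_params, k=-1) + np.eye(c)  # unit lower-triangular
    u = np.triu(u_params, k=1) + np.eye(c)   # unit upper-triangular
    # det(W) = det(P) * det(L) * det(U) = (+/-1) * 1 * 1, so log|det W| = 0
    # and W stays invertible throughout training.
    return p @ l @ u

rng = np.random.default_rng(0)
c = 4
p = np.eye(c)[rng.permutation(c)]
w = build_w(p, rng.normal(size=(c, c)), rng.normal(size=(c, c)))
print(abs(np.linalg.det(w)))  # 1.0 up to floating-point error
```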

@nshepperd

To follow up on this, I implemented the orthogonality penalty, as a simple -20*||(w'w - I)||_F^2 term in the objective function (at invertible_1x1_conv). That is, the summed elementwise squared difference between w'w and the identity matrix, where w is the weights of the 1x1 convolution. 20 was the lowest penalty multiplier that seemed to reliably keep the total squared difference small (<0.4).

After this, I still had an invertibility crash, so I thought it had to be some sudden spiking gradient issue / numerical instability bringing it away from invertibility, as the orthogonality penalty would have brought it back to invertibility if it was something gradual. Looking at the revnet2d_step, I saw that the code applies a sigmoid function to the scale factors, to produce the value "s" from the paper, (interestingly, yes, sigmoid, not exp as in the paper). I was pretty suspicious of this sigmoid, as it can get arbitrarily close to 0, which means that the log(s) calculated for the determinant (as well as the 1/s for the reverse step) could in principle produce an arbitrarily large value, and hence gradient...

My solution to this was to add an epsilon (currently 0.1, but I haven't experimented with this hyperparameter much yet) to the output of tf.nn.sigmoid(h[:, :, :, 1::2] + 2.), to constrain it to be >>0. I haven't had an invertibility crash with this yet, after running all day, and the epsilon doesn't seem to meaningfully affect the model power. This has also had the positive effect of removing some artifacts in the samples that are clearly due to that 1/s becoming very large.
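For reference, the change amounts to something like this (paraphrasing the scale computation in revnet2d_step; the 0.1 epsilon is the value mentioned above and is a hyperparameter, not something in the repo):

```python
import tensorflow as tf

EPS = 0.1  # keeps the scale bounded away from 0; not tuned much yet

def affine_scale(h, eps=EPS):
    """Scale for the affine coupling, bounded away from zero.

    h: output of the coupling network; the odd channels parameterize the scale
       (Glow uses sigmoid(h + 2) here rather than exp as in the paper).
    With the epsilon, the scale lies in (eps, 1 + eps), so log(scale) in the
    log-determinant and 1/scale in the reverse pass both stay bounded.
    """
    return tf.nn.sigmoid(h[:, :, :, 1::2] + 2.) + eps
```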

@Avmb

Avmb commented Aug 30, 2018

Bounding s away from zero makes sense. I suppose they didn't do it because the optimization objective generally maximizes |s|, which is probably why they used a sigmoid instead of an exp; but in some cases the model may try to go for a lower s for some reason (maybe to reduce the entropy if the input has too much compared to the target latent?).

@nuges01

nuges01 commented Oct 4, 2018

@nshepperd, did your solution in your last paragraph above end up fixing the issue as you continued training beyond 1 day?

Also, would you say it was the combination of modifications you made that fixed it, or would it suffice to just add the epsilon? Would you mind adding snippets of your changes for the rest of us who are struggling with the issue? Thanks!
