getting KL divergence to work #92
Comments
I agree with you. So I temporarily set kl_weight to zero; otherwise the recon_loss cannot be reduced. In this version, the kl_loss conflicts with the recon_loss.
Maybe someone can email the paper authors to see if this loss was used at all?
It must have been used, as they mention an increasing weight parameter in the paper. Still, I am trying, but I can't seem to figure out his email address. On the paper it says Aditya Ramesh <_@adityaramesh.com, so I tried Aditya_Ramesh@adityaramesh.com and Aditya.Ramesh@adityaramesh.com, but they don't exist...
Has anyone had any more insights/updates on this? I'm running into the exact same issue (on an independent DALL-E repro) and bashing my head against the wall trying to understand the behaviour!
In the train_vae script the kl_loss is set to zero via the weight parameter, and in my own extensive experiments I also found that including the KL term does more harm than good. @karpathy also mentioned having trouble getting it to work properly.
Did anyone make any progress on this?
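Regarding the weight parameter: a gradual warm-up of kl_weight (which the "increasing weight parameter" in the paper presumably refers to) might be worth trying instead of a fixed value. This is only a sketch; the names (kl_weight_at, recon_loss, kl_div, global_step) and constants are placeholders, not what train_vae.py actually does.

```python
def kl_weight_at(step, max_kl_weight=1.0, warmup_steps=5000):
    # linear warm-up from 0 to max_kl_weight; setting max_kl_weight = 0
    # recovers the current behaviour of dropping the KL term entirely.
    # The constants here are guesses, not taken from the paper or the repo.
    return max_kl_weight * min(1.0, step / warmup_steps)

# in the training loop (sketch):
#   loss = recon_loss + kl_weight_at(global_step) * kl_div
```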
Also, shouldn't this:
DALLE-pytorch/dalle_pytorch/dalle_pytorch.py
Line 196 in 7658e60
rather use the soft_one_hot values than the raw logits?
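To make that concrete, here is a minimal sketch of computing the KL term against the uniform prior from a properly normalized distribution (the softmax posterior, or the soft_one_hot samples) rather than from the raw, unnormalized logits. The shapes and names below are my assumptions, not the repo's actual code.

```python
import torch

def kl_to_uniform(qy):
    """KL(q || uniform prior) for categorical codebook posteriors.

    qy: probabilities over codebook entries, shape (batch, positions, num_tokens),
        e.g. softmax of the encoder logits or the soft_one_hot samples --
        NOT the raw logits.
    """
    num_tokens = qy.shape[-1]
    log_qy = torch.log(qy + 1e-10)  # avoid log(0)
    log_uniform = -torch.log(torch.tensor(float(num_tokens), device=qy.device))
    # KL(q || U) = sum_i q_i * (log q_i - log(1/K)), averaged over batch and positions
    return (qy * (log_qy - log_uniform)).sum(dim=-1).mean()
```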
Also, I find it a little confusing that we are annealing the temperature of the Gumbel-Softmax, thus steering it towards one-hot sampling, while at the same time trying to encourage the distribution to stay close to a uniform prior. Isn't this a contradiction?
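For reference, this is roughly what I mean by the annealing: a low tau pushes the soft_one_hot samples towards hard one-hot vectors, which does seem to pull against a KL term that rewards staying close to uniform. The schedule and its constants below are illustrative, not the paper's or the repo's.

```python
import math
import torch.nn.functional as F

def annealed_temperature(step, temp_start=1.0, temp_min=0.5, anneal_rate=1e-4):
    # exponential decay, a common Gumbel-Softmax schedule; constants are placeholders
    return max(temp_min, temp_start * math.exp(-anneal_rate * step))

# in the training loop (sketch), with logits of shape (batch, num_tokens, h, w):
#   temp = annealed_temperature(global_step)
#   soft_one_hot = F.gumbel_softmax(logits, tau=temp, dim=1, hard=False)
```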