KL Loss is right? #6
Hi @BridgetteSong. Yes, the closed-form KL divergence between two Gaussians is different from our KL loss. That's because we use the KL divergence between a Gaussian and a normalizing flow, rather than between two Gaussians, so there is no closed form as there is in the Gaussian case. Equation 4 of our paper shows that the prior distribution is not Gaussian. If you're not familiar with normalizing flows, or don't know how to calculate their log-likelihood (which is needed for calculating the KL), it would be better to look at these blog posts first: nf1 and nf2. They are great illustrative posts about normalizing flows that include model implementations.
Thank you for your reply.
Is my understanding right? Or is this kl_loss equal to your KL loss?
@BridgetteSong You're right. The posterior is Gaussian, and the prior is the product of a Gaussian density and a Jacobian determinant. Let me explain the KL loss in detail. For brevity, and without loss of generality, I'll assume the channel dimension of the latent variables is one. The KL divergence is the mean of the difference of log probabilities, as follows:
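The embedded equation is not reproduced above; my reconstruction of it, consistent with the rest of the thread, is:

```latex
\mathrm{kl} = \mathbb{E}_{z \sim q(z|x)}\bigl[\log q(z|x) - \log p(z|c)\bigr]
```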
As q(z|x) is Gaussian, we can calculate the closed-form mean of log(q(z|x)), which is the negative entropy of a Gaussian (see https://en.wikipedia.org/wiki/Normal_distribution):
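In symbols (a reconstruction, writing logs_q for the log standard deviation of q as in the code):

```latex
\mathbb{E}_{z \sim q(z|x)}\bigl[\log q(z|x)\bigr]
  = -\,logs\_q - \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}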
On the other hand, the mean of log(p(z|c)) has no closed-form solution, so we have to calculate log(p(z|c)) for each sampled z and then average:
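That is, a Monte Carlo estimate (my reconstruction; in practice a single sample, N = 1, is used):

```latex
\mathbb{E}_{z \sim q(z|x)}\bigl[\log p(z|c)\bigr]
  \approx \frac{1}{N}\sum_{n=1}^{N} \log p(z_n|c),
  \qquad z_n \sim q(z|x)
```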
As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (mean-only) operations in its coupling layers, the Jacobian determinant of the prior is one (see Line 449 and Lines 179 to 209 in 2e561ba).
Then, kl = the average of (negative entropy of q(z|x) − log(p(z|c))) is:
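Combining the two expectations (the ½·log(2π) constants cancel), my reconstruction of the final expression, using the code's variable names with z_p = f(z):

```latex
\mathrm{kl} = logs\_p - logs\_q - \tfrac{1}{2}
  + \tfrac{1}{2}\, e^{-2\, logs\_p}\,(z_p - m_p)^2,
  \qquad z_p = f(z),\; z \sim q(z|x)
```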
This is the explanation of the KL loss (Lines 57 to 60 in 2e561ba).
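Putting the pieces together, the referenced loss can be sketched roughly as follows (a paraphrase of the idea, not the repository's exact code; z_p is the flow-transformed posterior sample, logs_q/logs_p are log standard deviations, and z_mask masks padded frames):

```python
import torch

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    """KL(q || p) per element, averaged over unmasked positions.

    z_p:    flow-transformed posterior sample f(z)
    logs_q: log std of the posterior q(z|x)
    m_p, logs_p: mean and log std of the prior Gaussian
    z_mask: 1.0 for valid frames, 0.0 for padding
    """
    # negative entropy of q minus the prior log-likelihood at z_p;
    # the 0.5 * log(2 * pi) constants cancel between the two terms,
    # and the volume-preserving flow contributes zero log-determinant.
    kl = logs_p - logs_q - 0.5
    kl = kl + 0.5 * (z_p - m_p) ** 2 * torch.exp(-2.0 * logs_p)
    return torch.sum(kl * z_mask) / torch.sum(z_mask)
```

Note that a single sample gives an unbiased but noisy estimate, which is why individual batch values can dip below zero even though the true KL is non-negative.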
Thank you very much for your patience and detailed answer; I got it.
BTW, as the prior is the product of a Gaussian and the Jacobian determinant, and considering the properties of the Gaussian distribution (if X ~ N(μ, σ²), then aX + b ~ N(aμ + b, (aσ)²)), the prior is always a Gaussian distribution when the Jacobian determinant is a constant. So can we calculate the KL divergence using the two-Gaussian closed form mentioned above, or use the torch API to get the KL divergence directly?
Good point! In the case where the channel dimension of the latent variables is one, the prior is Gaussian when the Jacobian determinant is a constant. However, the normalizing flow of the prior also provides non-linear transformations via neural networks while maintaining a constant Jacobian determinant, resulting in a non-Gaussian prior distribution. If the normalizing flow of the prior only allows linear transformations, or the channel dimension of the latent variables is one, you can use the KL divergence between two Gaussians. But in general, you cannot.
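For the special case where both distributions really are Gaussian, PyTorch's `torch.distributions` API does give the closed form directly. A small illustration (not part of the repository):

```python
import torch
from torch.distributions import Normal, kl_divergence

# closed form: log(s_p / s_q) + (s_q^2 + (m_q - m_p)^2) / (2 * s_p^2) - 0.5
q = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
p = Normal(loc=torch.tensor(1.0), scale=torch.tensor(1.0))
kl = kl_divergence(q, p)  # 0 + (1 + 1) / 2 - 0.5 = 0.5
```

Once the prior has been pushed through a non-linear flow, however, it no longer has a `torch.distributions` closed form, which is why the thread falls back to the sampled estimate.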
Thank you very much again. I totally understand now. I learned a lot from your detailed answer.
Hi @jaywalnut310, thanks for your detailed answer. Very helpful! I'd like to ask two more questions; I'd appreciate it if I could have your answers.
1. You mentioned you set up the normalizing flow to be volume-preserving. Does this benefit the model? In my understanding, it could be replaced by a more complicated non-volume-preserving flow.
2. As far as I know, the KL divergence lies in the range [0, +inf). But according to your formula, its value could be negative? (ref: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8)
How about taking the absolute value to keep the KL loss from going negative?

```diff
--- a/losses.py
+++ b/losses.py
@@ -54,7 +54,7 @@ def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
   logs_p = logs_p.float()
   z_mask = z_mask.float()
-  kl = logs_p - logs_q - 0.5
+  kl = torch.abs(logs_p - logs_q - 0.5)
```
Hi, I'm a bit confused by "The KL divergence is the mean of the difference of log probabilities: mean(log(q(z|x))) − mean(log(p(z|c))), where z ~ q(z|x)". Doesn't the KL divergence need integration? You directly compute kl = mean(log(q(z|x))) − mean(log(p(z|c))); is this an approximate formula? In that case, isn't it more convenient to calculate the negative log-likelihood based on z_p (obtained via the flow transformation), m_p, and logs_p?
@yanggeng1995 I will add some supplementary notes:
@BridgetteSong Thanks for your answer. One more question: why not calculate the negative log-likelihood of the Gaussian distribution directly from z_p, m_p, and logs_p? Isn't that more convenient?
@yanggeng1995 It is very convenient to compute ∫ q(z|x) · log(q(z|x)) dz, as q is Gaussian. As for ∫ q(z|x) · log(p(z|c)) dz, it is also convenient if you use a single-sample Monte Carlo estimate: ∫ q(z|x) · log(p(z|c)) dz ≈ log(p(z|c)) for z ~ q(z|x). Here p(z|c) is the product of a Gaussian density and a Jacobian determinant. To compute log(p(z|c)), we first sample z from the posterior, get z_p = NormalizingFlow(z), and finally use z_p to compute the log-likelihood of the prior Gaussian N(m_p, exp(logs_p)²): log(p(z|c)) = logdet(df/dz) + log(N(z_p; m_p, logs_p)) = 0 − logs_p − 0.5 · log(2π) − 0.5 · exp(−2 · logs_p) · (z_p − m_p)². I think it is also correct to directly use kl_loss ≈ log(q(z|x)) − log(p(z|c)), with log(q(z|x)) = −logs_q − 0.5 · log(2π) − 0.5 · exp(−2 · logs_q) · (z − m_q)², where z ~ posterior(m_q, logs_q), and log(p(z|c)) as above. But I think the author's method is more concise and more accurate.
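As a sanity check on the formula above, the hand-written Gaussian log-likelihood can be compared against `torch.distributions.Normal.log_prob`. An illustrative sketch (the helper name `log_prior_likelihood` is mine, not from the repository):

```python
import math
import torch

def log_prior_likelihood(z_p, m_p, logs_p):
    # log N(z_p; m_p, exp(logs_p)^2); the flow is volume-preserving,
    # so logdet(df/dz) = 0 and contributes nothing to the sum.
    return (-logs_p - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2)

z_p = torch.tensor([0.3, -1.2])
m_p = torch.tensor([0.0, -1.0])
logs_p = torch.tensor([0.0, 0.5])

manual = log_prior_likelihood(z_p, m_p, logs_p)
ref = torch.distributions.Normal(m_p, torch.exp(logs_p)).log_prob(z_p)
```

The two results agree elementwise, confirming that the expression is just the Gaussian log-density with the zero log-determinant of the shift-only flow added.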
Hi, does it work?
@980202006 It will not work. Usually, the KL loss will not be negative if your inputs and network are right. When kl_loss < 0, it means your prior distribution is almost the same as your posterior distribution, i.e. the posterior has failed to learn a sufficiently complicated distribution.
But usually you do not need to add this constraint: when your KL loss goes below zero, it means training has already gone wrong, and adding the constraint will not give you the right results.
A larger batch size may help :) I think some abnormal or extreme data points may be ruining your model. Once I enlarged my batch size, the problem disappeared.
When I searched for the KL divergence between two Gaussians, I found this, which is different from your KL loss:
https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians