KL Loss is right? #6

Closed
BridgetteSong opened this issue Jun 15, 2021 · 17 comments
Labels
good first issue Good for newcomers

Comments

@BridgetteSong

BridgetteSong commented Jun 15, 2021

When I searched for the KL divergence between two Gaussians, I found this, which is different from your KL loss:
https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians

@jaywalnut310
Owner

jaywalnut310 commented Jun 15, 2021

Hi @BridgetteSong. Yes, the closed-form KL divergence between two Gaussians is different from our KL loss. That's because we compute the KL divergence between a Gaussian and a distribution defined through a normalizing flow, rather than between two Gaussians, so there is no closed form as in the Gaussian case. Equation 4 of our paper shows that the prior distribution is not Gaussian.

If you're not familiar with normalizing flows, or don't know how to calculate their log-likelihood (which is needed for calculating the KL), it would be better to look at these blog posts first: nf1 and nf2. They are great illustrative posts about normalizing flows and include model implementations.
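
For intuition, here is a minimal, self-contained sketch (not the repo's code) of how a flow's log-likelihood is computed via the change-of-variables formula, using a toy element-wise affine map in place of the real coupling layers:

import torch
from torch.distributions import Normal

# Toy affine flow f(z) = a*z + b mapping z into the base (Gaussian) space.
# Change of variables: log p(z) = log p_base(f(z)) + log|det df/dz|.
a, b = torch.tensor(2.0), torch.tensor(0.5)
base = Normal(loc=0.0, scale=1.0)

def flow_log_prob(z):
    fz = a * z + b                     # forward transform
    log_det = torch.log(torch.abs(a))  # log|det df/dz| of the affine map
    return base.log_prob(fz) + log_det

z = torch.randn(5)
print(flow_log_prob(z))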

@BridgetteSong
Author

Thank you for your reply.
As I understand it, the posterior distribution is Gaussian, and the prior distribution is the product of a Gaussian and the absolute value of the Jacobian determinant (Equation 4). So the KL loss would be:

1. q(z/x) = torch.distributions.normal.Normal(m_q, exp(logs_q))
2. p(z/c) = torch.distributions.normal.Normal(m_p, exp(logs_p)) * torch.abs(jacobian determinant)
3. kl_loss = torch.distributions.kl.kl_divergence(q(z/x), p(z/c))

Is my understanding right? And is this kl_loss equal to your KL loss?
I would appreciate a detailed explanation.
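
For reference, the closed-form two-Gaussian KL from step 3 can be sketched with torch.distributions like this (toy tensors; it ignores the Jacobian factor from step 2, which torch.distributions cannot represent directly, and, as explained in the replies below, it is not the loss VITS actually uses):

import torch
from torch.distributions import Normal, kl_divergence

# Closed-form KL between two diagonal Gaussians (NOT the VITS loss).
m_q, logs_q = torch.zeros(4), torch.zeros(4)
m_p, logs_p = torch.ones(4), 0.5 * torch.ones(4)

q = Normal(m_q, torch.exp(logs_q))
p = Normal(m_p, torch.exp(logs_p))
print(kl_divergence(q, p))  # element-wise KL, shape (4,)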

@jaywalnut310
Owner

jaywalnut310 commented Jun 16, 2021

@BridgetteSong You're right: the posterior is Gaussian, and the prior is the product of a Gaussian and the Jacobian determinant.

Let me explain the kl loss in detail. For brevity, and without loss of generality, I'll assume the channel dimension of latent variables is one.

The kl divergence is the mean of the difference of log probabilities as follows:

  • mean(log(q(z/x))) - mean(log(p(z/c))), where z ~ q(z/x)

As q(z/x) is Gaussian, the mean of log(q(z/x)) has a closed form, namely the negative entropy of a Gaussian (see https://en.wikipedia.org/wiki/Normal_distribution):

  • mean of log(q(z/x)) = negative entropy of q(z|x) = -logs_q - 0.5 - 0.5 * log(2*pi)
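
A quick numerical check of this closed form (a sketch with made-up values, not repo code):

import math
import torch
from torch.distributions import Normal

# E_q[log q(z)] for q = N(m_q, exp(logs_q)^2) equals -logs_q - 0.5 - 0.5*log(2*pi).
m_q, logs_q = torch.tensor(0.3), torch.tensor(-1.2)
q = Normal(m_q, torch.exp(logs_q))

closed_form = -logs_q - 0.5 - 0.5 * math.log(2 * math.pi)
print(closed_form.item(), -q.entropy().item())         # identical
print(q.log_prob(q.sample((100000,))).mean().item())   # Monte Carlo estimate, close to the above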

On the other hand, the mean of log(p(z/c)) has no closed-form solution. So we have to calculate log(p(z/c)) for each sampled z and then average them out:

  • log(p(z/c)) = log(N(f(z)|m_p, logs_p)) + logdet(df/dz), where f is a normalizing flow.

As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (= mean-only) operations in the coupling layers, the Jacobian determinant of the prior flow is one (see

vits/models.py

Line 449 in 2e561ba

self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
and

vits/models.py

Lines 179 to 209 in 2e561ba

class ResidualCouplingBlock(nn.Module):
  def __init__(self,
      channels,
      hidden_channels,
      kernel_size,
      dilation_rate,
      n_layers,
      n_flows=4,
      gin_channels=0):
    super().__init__()
    self.channels = channels
    self.hidden_channels = hidden_channels
    self.kernel_size = kernel_size
    self.dilation_rate = dilation_rate
    self.n_layers = n_layers
    self.n_flows = n_flows
    self.gin_channels = gin_channels

    self.flows = nn.ModuleList()
    for i in range(n_flows):
      self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
      self.flows.append(modules.Flip())

  def forward(self, x, x_mask, g=None, reverse=False):
    if not reverse:
      for flow in self.flows:
        x, _ = flow(x, x_mask, g=g, reverse=reverse)
    else:
      for flow in reversed(self.flows):
        x = flow(x, x_mask, g=g, reverse=reverse)
    return x
):

  • log(p(z/c)) = log(N(f(z)|m_p, logs_p)) + 0 = -logs_p - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_p) * (f(z) - m_p) ** 2

Then, kl = (negative entropy of q(z/x)) - (average of log(p(z/c))), which for each sample works out to:

  • (logs_p - logs_q - 0.5) + 0.5 * exp(-2 * logs_p) * (f(z) - m_p) ** 2, where f(z) is z_p in our code.

This is the explanation of the kl loss (

vits/losses.py

Lines 57 to 60 in 2e561ba

kl = logs_p - logs_q - 0.5
kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
kl = torch.sum(kl * z_mask)
l = kl / torch.sum(z_mask)
).
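
Putting the pieces together, here is a self-contained sketch: the four lines above wrapped into a standalone function (using the kl_loss signature quoted later in this thread), fed with toy tensors. Here z_p is just a reparameterized posterior sample standing in for the flow output f(z), which in the model comes from the flow.

import torch

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    # (logs_p - logs_q - 0.5) + 0.5 * exp(-2*logs_p) * (z_p - m_p)**2,
    # summed over unmasked positions and normalized by the mask sum.
    kl = logs_p - logs_q - 0.5
    kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p)
    kl = torch.sum(kl * z_mask)
    return kl / torch.sum(z_mask)

# Toy shapes (batch, channels, frames)
B, C, T = 2, 8, 50
m_q, logs_q = torch.randn(B, C, T), 0.1 * torch.randn(B, C, T)
m_p, logs_p = torch.randn(B, C, T), 0.1 * torch.randn(B, C, T)
z = m_q + torch.randn_like(m_q) * torch.exp(logs_q)  # reparameterized sample from q(z/x)
z_mask = torch.ones(B, 1, T)
print(kl_loss(z, logs_q, m_p, logs_p, z_mask))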

@jaywalnut310 jaywalnut310 added the good first issue Good for newcomers label Jun 16, 2021
@BridgetteSong
Author

Thank you very much for your patience and detailed answer, I got it.

@BridgetteSong
Author

BTW, since the prior is the product of a Gaussian and the Jacobian determinant, and given the properties of the Gaussian distribution (if X ~ N(u, σ**2), then aX + b ~ N(au + b, (aσ)**2)), the prior is always Gaussian when the Jacobian determinant is a constant. So can we calculate the KL divergence with the two-Gaussian closed form mentioned above, or use the torch API to get it directly, like this?
kl_loss = torch.distributions.kl.kl_divergence(q(z/x), p(z/c))

@jaywalnut310
Owner

Good point! When the channel dimension of the latent variables is one, the prior is indeed Gaussian if the Jacobian determinant is constant.
However, when the channel dimension exceeds one, that is no longer true.
For example, let (x1, x2) ~ N((0, 0), I) and transform it into (y1, y2) = (x1, cos(x1) + x2).
Because of the non-linear transformation, the joint distribution of (y1, y2) is not Gaussian.
However, the Jacobian determinant is still one, as the first-order derivatives are dy1/dx1 = 1, dy1/dx2 = 0, dy2/dx1 = -sin(x1), dy2/dx2 = 1.

The normalizing flow of the prior likewise provides a non-linear transformation through neural networks while maintaining a constant Jacobian determinant, resulting in a non-Gaussian prior distribution. If the flow only allowed linear transformations, or if the channel dimension of the latent variables were one, you could use the KL divergence between two Gaussians; in general, you cannot.
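
A small numerical sketch of the counter-example above (toy code, not from the repo): the correlation between y1 and y2 is roughly zero, yet y2 clearly depends on y1, so the pair cannot be jointly Gaussian.

import torch

torch.manual_seed(0)

# (x1, x2) ~ N(0, I), (y1, y2) = (x1, cos(x1) + x2); Jacobian determinant = 1 everywhere.
N = 1000000
x1, x2 = torch.randn(N), torch.randn(N)
y1, y2 = x1, torch.cos(x1) + x2

# Correlation is ~0, since E[x1 * cos(x1)] = 0 by symmetry ...
print("corr(y1, y2) ≈", torch.corrcoef(torch.stack([y1, y2]))[0, 1].item())

# ... but y2 still depends on y1 (E[y2 | y1] = cos(y1)). For a jointly Gaussian pair,
# zero correlation would imply independence, so these conditional means would match.
print("E[y2 | |y1| < 0.5] ≈", y2[y1.abs() < 0.5].mean().item())  # clearly positive
print("E[y2 | |y1| > 2.0] ≈", y2[y1.abs() > 2.0].mean().item())  # clearly negative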

@BridgetteSong
Author

Thank you very much again. I totally understand now. I learned much from your detailed answer.

@haoheliu

haoheliu commented Dec 7, 2021

Hi @jaywalnut310, thanks for your detailed answer, it was very helpful! I'd like to ask two more questions, and I'd appreciate your answers.

As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (= mean-only) operations in the coupling layers, the Jacobian determinant of the prior flow is one.

You mentioned that you set up the normalizing flow to be volume-preserving. Does this choice benefit the model? In my understanding, it could be replaced by a more expressive non-volume-preserving flow.

The kl divergence is the mean of the difference of log probabilities as follows:
mean(log(q(z/x))) - mean(log(p(z/c))), where z ~ q(z/x)

As far as I know, the KL divergence lies in the range [0, +inf). But according to your formula, its value could be negative? (ref: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8)

@candlewill

How about taking the absolute value to keep the KL loss from going negative?

--- a/losses.py
+++ b/losses.py
@@ -54,7 +54,7 @@ def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
   logs_p = logs_p.float()
   z_mask = z_mask.float()
 
-  kl = logs_p - logs_q - 0.5
+  kl = torch.abs(logs_p - logs_q - 0.5)

@yanggeng1995

(Quoted @jaywalnut310's full KL loss explanation above.)

Hi, I'm a bit confused by "The kl divergence is the mean of the difference of log probabilities as follows: mean(log(q(z/x))) - mean(log(p(z/c))), where z ~ q(z/x)". Doesn't the KL divergence require an integral? You directly set kl = mean(log(q(z/x))) - mean(log(p(z/c))); is this an approximation? And in that case, isn't it more convenient to compute the negative log-likelihood directly from the flow output z_p together with m_p and logs_p?

@BridgetteSong
Author

BridgetteSong commented Sep 9, 2022

(Quoted @yanggeng1995's question above.)

@yanggeng1995 Let me add a few supplementary points:

  1. KL_loss = ∫q(z/x) * (log(q(z/x)) - log(p(z/c))) dz = ∫q(z/x) * log(q(z/x)) dz - ∫q(z/x) * log(p(z/c)) dz
  2. As q(z/x) is Gaussian, ∫q(z/x) * log(q(z/x)) dz = -logs_q - 0.5 - 0.5 * log(2*pi).
  3. We can't compute ∫q(z/x) * log(p(z/c)) dz directly, so we approximate it by sampling: draw some z values and average log(p(z/c)) over them. In VAE code, sampling a single z per step is usually enough, so ∫q(z/x) * log(p(z/c)) dz ≈ mean(log(p(z/c))) = log(p(z/c)) (see the sketch after this list).
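
Here is the promised sketch of point 3, using toy 1-D Gaussians for q and p so the exact cross term is available for comparison (assumed values, not repo code):

import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Estimate the cross term E_q[log p(z)] by sampling from q.
m_q, s_q = 0.0, 1.0
m_p, s_p = 1.0, 2.0
q, p = Normal(m_q, s_q), Normal(m_p, s_p)

# Exact value for Gaussians: -log(s_p) - 0.5*log(2*pi) - (s_q^2 + (m_q - m_p)^2) / (2*s_p^2)
exact = -math.log(s_p) - 0.5 * math.log(2 * math.pi) - (s_q**2 + (m_q - m_p)**2) / (2 * s_p**2)

one_sample = p.log_prob(q.sample()).item()                  # the "one z is enough" estimate
many_samples = p.log_prob(q.sample((100000,))).mean().item()

print(exact, one_sample, many_samples)
# The single-sample estimate is noisy but unbiased; averaged over training steps
# (and over time/channel positions), it converges to the exact expectation.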

@yanggeng1995

yanggeng1995 commented Sep 9, 2022

(Quoted @jaywalnut310's explanation and @BridgetteSong's supplementary points above.)

@BridgetteSong Thanks for your answer. One more question: why not just compute the negative log-likelihood of the Gaussian from z_p, m_p and logs_p? Wouldn't that be more convenient?

@BridgetteSong
Author

BridgetteSong commented Sep 9, 2022

@yanggeng1995 It is easy to compute ∫q(z/x) * log(q(z/x)) dz because q is Gaussian. And ∫q(z/x) * log(p(z/c)) dz is also easy to compute once you accept the sampling approximation: ∫q(z/x) * log(p(z/c)) dz ≈ log(p(z/c)).

p(z/c) is the product of a Gaussian and the Jacobian determinant. To compute log(p(z/c)), we first sample z from the posterior, get z_p = flow(z), and finally use z_p to evaluate the log-likelihood of the prior Gaussian N(m_p, logs_p).

So log(p(z/c)) = logdet(df/dz) + log(N(z_p|m_p, logs_p)) = 0 - logs_p - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_p) * (z_p - m_p) ** 2.

I think it would also be valid to use kl_loss ≈ log(q(z/x)) - log(p(z/c)) directly, with log(q(z/x)) = -logs_q - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_q) * (z - m_q) ** 2 where z ~ posterior(m_q, logs_q), and log(p(z/c)) computed as above. But I think the author's method is more concise and more accurate (see the comparison sketch below).
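
A toy comparison of the two estimators (a sketch on a plain Gaussian pair, no flow, so the exact KL is available for reference):

import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

q = Normal(torch.tensor(0.0), torch.tensor(1.0))
p = Normal(torch.tensor(1.0), torch.tensor(2.0))

z = q.sample((100000,))

full_mc = q.log_prob(z) - p.log_prob(z)        # log q(z) - log p(z), both terms sampled
half_closed = -q.entropy() - p.log_prob(z)     # closed-form negative entropy + sampled cross term

print("exact KL           :", kl_divergence(q, p).item())
print("fully sampled  mean:", full_mc.mean().item(), " std:", full_mc.std().item())
print("half closed    mean:", half_closed.mean().item(), " std:", half_closed.std().item())
# Both estimators are unbiased; the half-closed-form one typically has smaller variance,
# which matches the "more concise and more accurate" remark above.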

@980202006

(Quoted @candlewill's suggestion above to take the absolute value of the first term.)

Hi, does this work?

@BridgetteSong
Author

@980202006 It will not work. Normally the KL loss will not be negative if your inputs and network are correct. When kl_loss < 0, it means your prior distribution is almost the same as your posterior distribution, i.e. the posterior has failed to learn a sufficiently complex distribution.
So when kl_loss < 0, the first thing to do is check your inputs and network. If you must add a constraint to the loss formula, apply it to the whole expression rather than to the first term only, like this:

  • kl = logs_p - logs_q - 0.5
  • kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
  • kl = torch.clamp(kl, min=0.0)

But usually you do not need this constraint: if your KL loss goes negative, the network is not training successfully, and adding the constraint will not give you correct results either. (The sketch below shows why individual terms can dip below zero even though the true KL is non-negative.)
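
A small sketch of why per-sample terms can be negative while the true KL stays non-negative. Take p = q (true KL = 0, the "prior ≈ posterior" situation described above): each sample gives (logs_p - logs_q - 0.5) + 0.5 * exp(-2*logs_p) * (z - m_p)**2, which is negative whenever |z - m_p| < exp(logs_p), i.e. about 68% of the time, while the average over many samples stays near zero.

import torch

torch.manual_seed(0)

m, logs = torch.tensor(0.0), torch.tensor(0.0)
z = m + torch.randn(100000) * torch.exp(logs)   # z ~ q = N(m, exp(logs)^2), and here p = q

per_sample = (logs - logs - 0.5) + 0.5 * torch.exp(-2 * logs) * (z - m) ** 2
print("fraction of negative per-sample terms:", (per_sample < 0).float().mean().item())
print("mean (≈ true KL = 0):", per_sample.mean().item())

# Clamping caps individual terms at zero but biases the estimate upward;
# it does not fix the underlying training issue.
print("clamped mean:", torch.clamp(per_sample, min=0.0).mean().item())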

@fenling

fenling commented Sep 14, 2023

(Quoted @BridgetteSong's supplementary points above.)

@BridgetteSong Hi, I want to know why mean(log(p(z/c))) = log(p(z/c)). Why is sampling one z enough?

@Cheneng

Cheneng commented Dec 21, 2023

(Quoted @BridgetteSong's answer above about negative KL loss and clamping.)

A larger batch size may help :) I think some abnormal or extreme data points may be ruining your model; once I enlarged my batch size, the problem disappeared.
