
How to calculate the loss in the EM setting? #2

Closed
ngthanhtin opened this issue Oct 18, 2022 · 7 comments

Comments

@ngthanhtin

Hi,
Thank you for your great work; it is really interesting, and complicated too.
I still do not understand how you calculate the loss in the EM setting. You said in the paper that the EM case is self-supervised and computes the pixel-wise NLL between the pseudo-target and the joint-likelihood block.
But I found that in your code you still feed the image label into the NLL loss, and here did you take the sum over the two last channels, which outputs a specific vector, and compare it with the image label? If that is what you have done, it contradicts what you said in the paper, right?

The second one is that the attribution maps of p(z,y^|x) and p(y^|x,z) here are similar, so doesn't this violate the equation p(y, z|x) = p(y|x,z)p(z|x)? Can you also clarify why these two maps are so similar?
[Screenshot: Screen Shot 2022-10-18 at 2 47 49 PM]

Hope you can clarify these, they are the things I am thinking about a lot.

Best,
Tin

@ngthanhtin
Author

Hello @coallaoh @junsukchoe @naver-ai, could you please explain this to me? It would be really helpful for my research. Thanks!

@coallaoh
Collaborator

coallaoh commented Oct 21, 2022

The final loss for CALM-EM is at calm/main.py, line 163 (commit 0ebb8a5):

    loss = self.criterion(features, target)

That is, it's in the form NLL(features, target).
From the reference for torch.nn.NLLLoss (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html), you can see that NLL(features, target) assumes features are already log probabilities and merely computes:

- \sum_i features_{i, target_i} (or the mean, depending on the reduction you set)

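[Editor's note: a tiny sanity check of the gather-and-negate behaviour described above; the tensors here are made up for illustration and are not from the CALM repo.]

```python
import torch
import torch.nn as nn

# nn.NLLLoss assumes its input already holds log-probabilities; it simply
# gathers the entry at the target index for each row and negates it
# (summed here via reduction="sum"; the default reduction is "mean").
log_probs = torch.log_softmax(torch.randn(4, 3), dim=1)  # "features"
target = torch.tensor([0, 2, 1, 2])

nll = nn.NLLLoss(reduction="sum")(log_probs, target)
manual = -log_probs[torch.arange(4), target].sum()  # - \sum_i features_{i, target_i}
assert torch.allclose(nll, manual)
```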
Now, let's take a look at how features are computed for CALM-EM.

Your link at

    inputs = inputs.sum(dim=[2, 3])

is the right location to look into this. There,

    features = inputs = (latent_posterior * joint_likelihood).sum(dim=[2, 3])

In maths notation, this quantity is

features_{ic} = \sum_k p'(z=k | x^i, y=c) log p(y=c, z=k | x^i)

where p' denotes the probability distribution computed from the previous-iteration model f_former, which is not trained (no backpropagation); see the detach() operation on latent_posterior. Please also note that joint_likelihood is already log-ed (is_log=True).

Combining everything together, we have the following expression:

loss = - \sum_i features_{i, target_i}
     = - \sum_i \sum_k p'(z=k | x^i, y=target_i) log p(y=target_i, z=k | x^i)

Note that this may be interpreted as self-supervising the pixel(z)-wise predictions p(y, z|x) with the model's own estimate of the cue location z for the true class y: p'(z|x, y). I understand that "self-supervision" has the connotation of not using any human-supplied annotation, while ours uses the GT class label. We used "self-supervision" here in the sense that no pixel-wise GT is used; it is replaced with a pseudo pixel-wise GT generated from a mere GT class label.
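[Editor's note: a minimal sketch of the loss as explained above. The shapes, the names B/C/H/W, and the random tensors are assumptions for illustration; this is not the actual repo code.]

```python
import torch

B, C, H, W = 2, 5, 7, 7  # assumed: samples, classes, spatial locations

# p'(z=k | x_i, y=c): normalised over locations k for each (sample, class).
latent_posterior = torch.softmax(
    torch.randn(B, C, H, W).view(B, C, -1), dim=-1).view(B, C, H, W)

# log p(y=c, z=k | x_i): a joint over (class, location), already in log space.
joint_likelihood = torch.log_softmax(
    torch.randn(B, C, H, W).view(B, -1), dim=-1).view(B, C, H, W)

# p' comes from the previous-iteration model, so gradients are blocked.
features = (latent_posterior.detach() * joint_likelihood).sum(dim=[2, 3])  # (B, C)

target = torch.tensor([1, 3])                # GT class label for each sample i
loss = torch.nn.NLLLoss()(features, target)  # - mean_i features[i, target_i]
```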

@coallaoh
Collaborator

The maps of p(z,y^|x) and p(y^|x,z) are similar because they are identical up to a linear scaling. Heatmaps are usually drawn with max-normalisation, which outputs the same heatmap for inputs that are identical up to linear scaling.
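[Editor's note: a toy check of the max-normalisation point above; the values are made up, and the arrays are only stand-ins for the two maps.]

```python
import numpy as np

# Two maps that differ only by a positive scale factor render identically
# once each is divided by its own maximum.
joint = np.array([[0.1, 0.4], [0.2, 0.3]])  # stand-in for p(z, y^ | x)
scaled = 2.5 * joint                        # same map up to linear scaling

max_norm = lambda m: m / m.max()
assert np.allclose(max_norm(joint), max_norm(scaled))
```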

@ngthanhtin
Author

Hi @coallaoh, thank you for your quick reply. From my understanding, you said that features_{ic} refers to the input pixel (x) at location i and the class label c.

And in the loss function:

loss = - \sum_i features_{i, target_i}
= - \sum_i \sum_k p'(z=k | x^i, y=target_i) log p(y=target_i, z=k | x^i)

the target_i means the label of the pixel at location i. So my question is: is target_i the same for all locations i, and equal to the class label c?

@coallaoh
Collaborator

No, all the indices are mixed up here.

the target_i means the label of the pixel at location i.

No. target_i means the GT class label for sample i. i is the sample index; k is the location index (so you can write z=k, etc.).

... equal to the class label c, is it right?

Class label c is a free index, not a designated index. One could, for example, set c = target_i, meaning you set the class index c to the GT class label for sample i.
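[Editor's note: a toy illustration of the index roles just described; the shapes and values are assumptions, and the location index k is already summed out inside features.]

```python
import torch

B, C = 3, 4                              # i indexes samples, c indexes classes
features = torch.randn(B, C)             # features[i, c]
target = torch.tensor([2, 0, 3])         # target[i]: GT class label of sample i

# c is a free index; choosing c = target_i picks one entry per sample.
picked = features[torch.arange(B), target]  # shape (B,)
loss = -picked.sum()                        # - \sum_i features[i, target_i]
```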

@ngthanhtin
Author

ngthanhtin commented Oct 21, 2022

Yes, thanks @coallaoh.
This figure made me misunderstand your code and your explanation here (computing the loss between the pseudo-target and the joint distribution). Now I get the picture, thank you.
[Screenshot: Screen Shot 2022-10-21 at 1 16 17 PM]

@coallaoh
Collaborator

Sounds great. Thanks for your interest in our work :)
