dumb theoretical questions #27

Closed
hlsfin opened this issue Jan 18, 2022 · 2 comments

hlsfin commented Jan 18, 2022

I have not read the paper entirely so forgive me; but is there a way that information seeps from the unlabeled dataset to the label dataset? And if it doesn't, can we just have any dataset that could take the place of the unlabeled dataset, or does it have to look 'similar' to the label dataset? Thank you.


dagleaves commented Jan 18, 2022

tl;dr: 1. yes... 2. maybe, but it is hard to say for certain.

I am not the creator of this repository, but I have read the original paper and used the idea. I will answer with the assumption that Meta Pseudo Labels works properly, as discussed in the original paper. With that in mind, here is what I would say about your question as I understand it.

There is no way information can be transferred ("seep") between the datasets themselves without outright moving data from one set to the other. However, from your question, I believe you are referring more to information leakage through the model itself, which is actually a goal of MPL.

With semi-supervised, dual-network architectures (including, but not limited to, MPL), the goal is to approximate the labels of unknown samples using a model trained on labeled samples. This works best if the unlabeled dataset looks "similar" to the labeled dataset, as you said.

MPL introduces a feedback signal between the student network and the teacher network that acts as a metric for how beneficial the teacher's pseudo-labels are for the student's performance on the labeled data. The idea is that if the student performs worse on the labeled data after being trained on a batch of pseudo-labels, then those pseudo-labels must have been wrong, and the teacher is adjusted accordingly. Assuming MPL works the way the paper describes, information is certainly transferred from the unlabeled data to the teacher model (which was trained only on the labeled data). This theoretically generalizes both the teacher and the student models to the unlabeled data, as you are asking, though for some reason the paper only considers the student model.

Whether the unlabeled set has to look similar to the labeled set comes down to whether MPL works the way we/they think it does. From what I can tell: would it help for it to look similar? Absolutely. Is it necessary? Not entirely. The point at which it becomes too dissimilar is unknown and likely varies from dataset to dataset.

To give a concrete example: if you were training a model to classify images containing cats, and the labeled dataset contained only house cats while the unlabeled set contained wild cats, you would probably be fine. It might even perform close to how it would if the labeled set contained both. However, if your unlabeled dataset contained only dogs, I do not know how you would fare. In my testing, with my specific application, I found it performed better when I seeded the unlabeled set with a portion of the labeled data (i.e., moved some of the labeled data into the unlabeled dataset). Was that because the unlabeled set was too dissimilar? Maybe. Was it because the unlabeled set was unbalanced (too few ground-truth positive samples)? Also maybe. It is hard to say.

I hope this helps.


hlsfin commented Jan 20, 2022

So, if my understanding is correct: yes, information does leak from the unlabeled validation set to the labeled training set, but not in the form of a direct "here are the labels for the validation set"; rather, it is in a similarity-versus-dissimilarity sense, per image. Correct?

(By the way, I really appreciate that you took the time to write all this.)

hlsfin closed this as completed Feb 22, 2022