Assign Doubt for Dissimilarity from Labelled Set #12

Closed
koaning opened this issue Nov 14, 2021 · 10 comments

koaning commented Nov 14, 2021

Suppose that y can contain NaN values for datapoints that aren't labeled yet. In that case, we may want to favor a subset of these unlabeled datapoints. In particular: the ones that differ substantially from the already labeled datapoints.

The idea here is that we may be able to sample more diverse datapoints.
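A rough sketch of what that could look like (this is not doubtlab's API, just an illustration using sklearn's `pairwise_distances`): score every unlabeled row by its distance to the nearest labeled row, so the most dissimilar unlabeled datapoints can be surfaced first.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def dissimilarity_scores(X, y):
    """Distance from each unlabeled row (y == NaN) to the nearest labeled row."""
    unlabeled = np.isnan(y)
    scores = np.zeros(len(X))
    if unlabeled.any() and (~unlabeled).any():
        dists = pairwise_distances(X[unlabeled], X[~unlabeled])
        scores[unlabeled] = dists.min(axis=1)
    return scores

X = np.array([[0.0], [0.1], [5.0], [0.2], [9.0]])
y = np.array([0.0, 1.0, np.nan, 0.0, np.nan])
print(dissimilarity_scores(X, y))  # only the unlabeled rows (index 2 and 4) get nonzero scores
```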

koaning commented Nov 30, 2021

Snorkel seems to have a similar notion with its ABSTAIN label. In their case -1 indicates "no label" and non-negative integers indicate a label. Not 100% sure if I want to commit to treating -1 as a special citizen though. Maybe NaN is better. Dunno yet.
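For what it's worth, switching between the two conventions is a one-liner either way; a tiny illustration (assuming NaN marks "no label"):

```python
import numpy as np

y = np.array([0.0, 1.0, np.nan, np.nan, 1.0])          # NaN = "no label"
y_sentinel = np.where(np.isnan(y), -1, y).astype(int)  # -1 = "no label", like Snorkel/sklearn
print(y_sentinel)  # [ 0  1 -1 -1  1]
```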

glevv commented Dec 13, 2021

It should be doable with the help of sklearn's semi-supervised methods.
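Something along these lines, perhaps (just a sketch, not a concrete proposal; sklearn's semi-supervised estimators mark unlabeled rows with -1):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.array([[0.0], [0.2], [0.9], [1.1], [0.5], [2.5]])
y = np.array([0, 0, 1, 1, -1, -1])  # -1 marks the unlabeled rows

model = LabelSpreading(kernel="knn", n_neighbors=3).fit(X, y)
confidence = model.predict_proba(X).max(axis=1)

# Unlabeled rows that the propagated model is unsure about are candidates
# worth showing to the user first.
print(np.where((y == -1) & (confidence < 0.9))[0])
```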

koaning commented Dec 13, 2021

Before working on an implementation, it would be good to first confirm via a relevant example that the approach has merit. But I agree that it'd be grand to re-use sklearn tools.

Garve commented Dec 15, 2021

I don't exactly understand the goal here. The usual reasons are supposed to surface weirdly labeled samples in some way, while here you want to surface weird samples, right? So we'd be mixing two goals. If the ensemble outputs indices, we'd first have to check whether the corresponding sample has a label (in which case: review that label) or not (in which case: label it yourself, since it's a unique or perhaps hard-to-classify sample and labeling it would make things easier for any model).

Do you mean it like this?

koaning commented Dec 15, 2021

You're right to say this library tries to help find "weirdly labeled samples". But ... the idea is that we may also want to find examples that haven't been labeled yet but which certainly deserve attention. For example, one could argue that an outlier deserves attention, even without a label attached.

What I'm proposing here is that we might also help the user find examples worth checking even when only a few labels exist. The use case here might be the early phase of a project where we only have relatively few labeled datapoints.
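To make the outlier angle concrete, a minimal sketch (the names and setup here are mine, nothing that exists in doubtlab yet) that flags unlabeled rows an IsolationForest considers anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[0.0], [0.1], [0.2], [0.15], [8.0]])
y = np.array([0.0, 1.0, np.nan, 0.0, np.nan])  # NaN = not labeled yet

flags = IsolationForest(random_state=0).fit_predict(X)   # -1 = outlier
candidates = np.where(np.isnan(y) & (flags == -1))[0]
print(candidates)  # unlabeled rows that also look like outliers
```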

Garve commented Dec 15, 2021

This is also something that we can do with ModAL already, a great active learning library.

koaning commented Dec 15, 2021

Never heard of that, got a link?

Garve commented Dec 15, 2021

Here.

I think it also has some overlap with what you want for doubtlab.
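For anyone landing here later, a minimal sketch of what a modAL query looks like (this follows modAL's documented ActiveLearner API as I understand it, so treat the details as an assumption rather than a doubtlab recipe):

```python
import numpy as np
from modAL.models import ActiveLearner
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0]])
y_train = np.array([0, 1])
X_pool = np.array([[0.1], [0.5], [0.9]])

learner = ActiveLearner(estimator=LogisticRegression(),
                        X_training=X_train, y_training=y_train)

# The default query strategy is uncertainty sampling: it returns the pool
# index (and row) the current model is least sure about.
query_idx, query_instance = learner.query(X_pool)
print(query_idx, query_instance)
```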

koaning commented Dec 15, 2021

Hah! How have I not known about this library before? It's grand!

I think we might be able to host some specific query strategies in this library (for text/images specifically let's say). But before going there I may just make some calmcode videos on this library. It looks really well designed.

Garve commented Dec 15, 2021

Happy to inspire you! I also like this one a lot.

koaning closed this as completed Nov 29, 2023