Adding support for multiple mask tokens. #14716
Conversation
- Original implem: huggingface#10222 Co-authored-by: njafer <naveen.jafer@oracle.com>
In order to accommodate optionally multimodal models like Perceiver, we add information to the tasks to specify the tasks where we know for sure whether the tokenizer/feature_extractor is needed or not.
NO_FEATURE_EXTRACTOR_TASKS = set()
NO_TOKENIZER_TASKS = set()
for task, values in SUPPORTED_TASKS.items():
    if values["type"] == "text":
        NO_FEATURE_EXTRACTOR_TASKS.add(task)
    elif values["type"] in {"audio", "image"}:
        NO_TOKENIZER_TASKS.add(task)
    elif values["type"] != "multimodal":
        raise ValueError(f"SUPPORTED_TASK {task} contains invalid type {values['type']}")
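As an illustrative sketch (the real `SUPPORTED_TASKS` registry lives in `transformers.pipelines` and is much larger; the toy registry below is hypothetical), the classification loop above partitions tasks by modality like this:

```python
# Hypothetical, minimal stand-in for the real SUPPORTED_TASKS registry.
SUPPORTED_TASKS = {
    "fill-mask": {"type": "text"},
    "image-classification": {"type": "image"},
    "automatic-speech-recognition": {"type": "audio"},
    "visual-question-answering": {"type": "multimodal"},
}

NO_FEATURE_EXTRACTOR_TASKS = set()
NO_TOKENIZER_TASKS = set()
for task, values in SUPPORTED_TASKS.items():
    if values["type"] == "text":
        # Pure text tasks never need a feature extractor.
        NO_FEATURE_EXTRACTOR_TASKS.add(task)
    elif values["type"] in {"audio", "image"}:
        # Pure audio/image tasks never need a tokenizer.
        NO_TOKENIZER_TASKS.add(task)
    elif values["type"] != "multimodal":
        raise ValueError(f"SUPPORTED_TASK {task} contains invalid type {values['type']}")

print(sorted(NO_FEATURE_EXTRACTOR_TASKS))  # text-only tasks
print(sorted(NO_TOKENIZER_TASKS))          # audio/image-only tasks
```

Multimodal tasks end up in neither set, so the pipeline keeps both preprocessors available for them.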
I like this approach, it should make the pipelines more robust to models with different capabilities in terms of preprocessors.
Looks good to me as well!
@@ -340,3 +338,27 @@ def fill_mask_with_duplicate_targets_and_top_k(self, model, tokenizer):
        # The target list contains duplicates, so we can't output more
        # than them
        self.assertEqual(len(outputs), 3)

    def fill_mask_with_multiple_masks(self, model, tokenizer):
Can we perhaps add a test for Perceiver (similar to the image classification models)?
Or is this not required here?
I think this PR is pretty orthogonal to Perceiver.
We could add a slow test for sure, but it doesn't have to be Perceiver specific.
In fact, I'll add something on the random model (it just needs to be consistent, actual values are less important)
Ok this looks good to me! Looking forward to the additional test, feel free to merge whenever.
* Adding support for multiple mask tokens. Original implem: huggingface#10222. Co-authored-by: njafer <naveen.jafer@oracle.com>
* In order to accommodate optionally multimodal models like Perceiver, we add information to the tasks to specify the tasks where we know for sure whether the tokenizer/feature_extractor is needed or not.
* Adding info in the documentation about multi masks, marked as experimental.
* Add a copy() to prevent overriding the same tensor over and over.
* Fixup.
* Adding small test for multi mask with real values.

Co-authored-by: njafer <naveen.jafer@oracle.com>
What does this PR do?
When presented with multiple masks, it's impossible to retrieve the joint probabilities. Instead of trying to work around that (see discussions in the previous PR), this PR simply outputs the raw top_k propositions at each mask locus: finding a good proxy for "joint probabilities" gets tricky, so rather than trying to solve this impossible problem we show exactly what the model outputs.
@naveenjafer is mentioned as co-author since much of this PR was pulled from there.
This PR was resurrected partly because Perceiver (a byte-level model) needs this type of masking to be useful.
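To illustrate the per-locus behaviour described above (not code from this PR; the vocabulary, logits, and helper name below are all hypothetical), each mask position gets its own independent softmax and top_k ranking, and the pipeline-style output is one candidate list per mask rather than a single joint ranking:

```python
import math


def topk_per_mask(logits_per_mask, vocab, k=2):
    """For each mask position, independently return the top-k tokens
    with softmax scores. No joint probability across masks is computed."""
    results = []
    for logits in logits_per_mask:
        # Softmax over the toy vocabulary for this mask position.
        exps = [math.exp(x) for x in logits]
        total = sum(exps)
        scores = [e / total for e in exps]
        ranked = sorted(zip(vocab, scores), key=lambda p: p[1], reverse=True)
        results.append([{"token_str": t, "score": round(s, 4)} for t, s in ranked[:k]])
    return results


# Toy vocabulary and logits for two mask positions (illustrative values only).
vocab = ["Paris", "Lyon", "cat"]
logits_per_mask = [[3.0, 1.0, 0.1], [0.2, 2.5, 2.0]]
print(topk_per_mask(logits_per_mask, vocab))
```

The nested-list shape (one inner list of `top_k` candidates per mask) mirrors what the fill-mask pipeline returns for a multi-mask input, while a single-mask input keeps the flat single-list output.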
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.