-
Notifications
You must be signed in to change notification settings - Fork 301
Stripping the MASK token #876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
def detokenize(self, ids): | ||
blank_token_id = self.token_to_id("") | ||
ids = tf.where(ids == self.mask_token_id, blank_token_id, ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We make frequent use of tf.ragged.boolean_mask
to do stuff like this.
Would ids = tf.ragged.boolean_mask(ids, tf.not_equal(ids, self.mask_token_id))
work?
Also please add a unit test to verify this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattdangerw I have added the requested changes. A few things to note are:
- In
tf.ragged.boolean_mask(ids, tf.not_equal(ids, self.mask_token_id))
, I set thedefault _value
to the blank token. - Successfully added a simple unit test.
Do let me know if there are any changes to be made. Thank You!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the proper fix based on the results after seeing the checks below. I am going to try to use my original method and the added unit test to see if that works correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you just don't need the blank ids at all anymore. There is no need for a default value, essentially what you are doing is passing a mask of all location that should be kept to the boolean mask function.
Did you try the ids = tf.ragged.boolean_mask(ids, tf.not_equal(ids, self.mask_token_id))
line? Remove blank_token_id
entirely.
def detokenize(self, ids): | ||
blank_token_id = self.token_to_id("") | ||
mask = tf.not_equal(ids, self.mask_token_id) | ||
ids = tf.ragged.boolean_mask(ids, mask, default_value=blank_token_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheAthleticCoder, your tests are failing because default_value
is not a valid argument: https://www.tensorflow.org/api_docs/python/tf/ragged/boolean_mask. Please remove default_value
and trigger the tests again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* Stripping the MASK token * Stripping the MASK token * added unit test * fixed detokenize function * check if test unit is correct * changed MASK token index * trial using tokenizer mask id * using tf ragged boolean mask * reformatted the prev commit
* Stripping the MASK token * Stripping the MASK token * added unit test * fixed detokenize function * check if test unit is correct * changed MASK token index * trial using tokenizer mask id * using tf ragged boolean mask * reformatted the prev commit
Resolves #829
I hope this PR solves the issue! @mattdangerw and @abheesht17
Would like to finish this issue so that I can resolve the issue of Speeding up testing for the Deberta model.