TheAthleticCoder
Contributor

Resolves #829
I hope this PR resolves the issue, @mattdangerw and @abheesht17.
I would like to finish it so that I can move on to the issue of speeding up testing for the DeBERTa model.


def detokenize(self, ids):
    blank_token_id = self.token_to_id("")
    ids = tf.where(ids == self.mask_token_id, blank_token_id, ids)
Member

We make frequent use of tf.ragged.boolean_mask to do stuff like this.

Would ids = tf.ragged.boolean_mask(ids, tf.not_equal(ids, self.mask_token_id)) work?

Also please add a unit test to verify this behavior.

Contributor Author

@mattdangerw I have made the requested changes. A few things to note:

  1. In tf.ragged.boolean_mask(ids, tf.not_equal(ids, self.mask_token_id)), I set default_value to the blank token.
  2. I added a simple unit test.

Let me know if any further changes are needed. Thank you!

Contributor Author

Judging by the failing checks below, this is not the proper fix. I am going to go back to my original method, together with the added unit test, to see whether that works correctly.

Member

I think you just don't need the blank ids at all anymore. There is no need for a default value. Essentially, what you are doing is passing a mask of all locations that should be kept to the boolean mask function.

Did you try the ids = tf.ragged.boolean_mask(ids, tf.not_equal(ids, self.mask_token_id)) line? Remove blank_token_id entirely.
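
For illustration, here is a minimal sketch of the masking approach described above, run outside the tokenizer. The token ids and MASK_TOKEN_ID are made-up values standing in for self.mask_token_id, not the real DeBERTa vocabulary; the only library calls used are the tf.not_equal and tf.ragged.boolean_mask named in this thread.

```python
import tensorflow as tf

# Hypothetical mask id, standing in for `self.mask_token_id`.
MASK_TOKEN_ID = 4

# A ragged batch of token ids; rows have different lengths.
ids = tf.ragged.constant([[1, 4, 9], [4, 4], [7, 8, 4, 3]])

# True wherever a token should be kept, i.e. wherever it is not the mask token.
keep = tf.not_equal(ids, MASK_TOKEN_ID)

# Drop the masked positions; each row simply becomes shorter.
stripped = tf.ragged.boolean_mask(ids, keep)

print(stripped.to_list())  # [[1, 9], [], [7, 8, 3]]
```

Because the masked positions are removed rather than replaced, no blank token id is involved. A row made up only of mask tokens comes back empty.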

def detokenize(self, ids):
    blank_token_id = self.token_to_id("")
    mask = tf.not_equal(ids, self.mask_token_id)
    ids = tf.ragged.boolean_mask(ids, mask, default_value=blank_token_id)
Collaborator

@TheAthleticCoder, your tests are failing because default_value is not a valid argument: https://www.tensorflow.org/api_docs/python/tf/ragged/boolean_mask. Please remove default_value and trigger the tests again.
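
A small sketch of the failure mode, assuming a recent TensorFlow 2.x release where an unknown keyword argument surfaces as a TypeError (the exact exception and message may differ between versions). The ids and the mask id 4 are made up for the example.

```python
import tensorflow as tf

ids = tf.ragged.constant([[1, 4, 9], [4, 4]])
keep = tf.not_equal(ids, 4)  # 4 is a made-up mask token id

try:
    # `default_value` is not a parameter of tf.ragged.boolean_mask.
    tf.ragged.boolean_mask(ids, keep, default_value=0)
    rejected = False
except TypeError:
    # Assumed exception type for an unexpected keyword argument.
    rejected = True

# The documented call takes only the data and the mask (plus an optional name).
stripped = tf.ragged.boolean_mask(ids, keep)
```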

Member

@mattdangerw mattdangerw left a comment

LGTM!

@mattdangerw mattdangerw merged commit b35e83f into keras-team:master Mar 23, 2023
@TheAthleticCoder TheAthleticCoder deleted the issue829 branch March 23, 2023 07:01
kanpuriyanawab pushed a commit to kanpuriyanawab/keras-nlp that referenced this pull request Mar 26, 2023
* Stripping the MASK token

* Stripping the MASK token

* added unit test

* fixed detokenize function

* check if test unit is correct

* changed MASK token index

* trial using tokenizer mask id

* using tf ragged boolean mask

* reformatted the prev commit
Successfully merging this pull request may close these issues.

Deberta tokenizer.detokenize() errors out with mask token
3 participants