
Custom tokenizer layer #75

Closed
ptamas88 opened this issue Sep 21, 2020 · 5 comments

@ptamas88

Hi,
I would like to incorporate the tokenization process into a model that uses a BERT layer.
Here is my custom layer:

import tensorflow as tf
import bert  # kpe/bert-for-tf2

class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, max_length, **kwargs):
        self.vocab_path = vocab_path
        self.length = max_length
        # Pure-Python WordPiece tokenizer; it expects str input.
        self.tokenizer = bert.bert_tokenization.FullTokenizer(vocab_path, do_lower_case=False)
        super(TokenizationLayer, self).__init__(**kwargs)

    def call(self, inputs):
        # Fails during functional model construction: `inputs` is a
        # symbolic tf.Tensor here, not a Python string.
        tokens = self.tokenizer.tokenize(inputs)
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        ids += [self.tokenizer.vocab['[PAD]']] * (self.length - len(ids))
        return ids

And here is my code to test the custom layer within a dummy model:

inputs = tf.keras.layers.Input(shape=(), dtype='string')
tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

I get the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-8df4885e5c7a> in <module>
      1 inputs = tf.keras.layers.Input(shape=(), dtype='string')
      2 tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
----> 3 outputs = tokenization_layer(inputs)
      4 model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    924     if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
    925       return self._functional_construction_call(inputs, args, kwargs,
--> 926                                                 input_list)
    927 
    928     # Maintains info about the `Layer.call` stack.

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
   1115           try:
   1116             with ops.enable_auto_cast_variables(self._compute_dtype_object):
-> 1117               outputs = call_fn(cast_inputs, *args, **kwargs)
   1118 
   1119           except errors.OperatorNotAllowedInGraphError as e:

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    256       except Exception as e:  # pylint:disable=broad-except
    257         if hasattr(e, 'ag_error_metadata'):
--> 258           raise e.ag_error_metadata.to_exception(e)
    259         else:
    260           raise

ValueError: in user code:

    <ipython-input-60-d6c12f7d1b14>:17 call  *
        tokens = self.tokenizer.tokenize(inputs)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:172 tokenize  *
        for token in self.basic_tokenizer.tokenize(text):
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:198 tokenize  *
        text = convert_to_unicode(text)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:86 convert_to_unicode  *
        raise ValueError("Unsupported string type: %s" % (type(text)))

    ValueError: Unsupported string type: <class 'tensorflow.python.framework.ops.Tensor'>

Can you please help me solve this issue?
I think the problem is that the tokenizer receives tensors rather than strings, and that is why it can't tokenize them.
But if that is the case, how should I make this work?
Thanks
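For context, the usual escape hatch when a pure-Python tokenizer has to run inside a TensorFlow graph is tf.py_function, which executes a Python callable eagerly at runtime. A minimal sketch of that approach (untested here; the vocab path, max length, and input string are placeholders):

import tensorflow as tf
import bert

max_length = 10  # placeholder
tokenizer = bert.bert_tokenization.FullTokenizer("vocab.txt", do_lower_case=False)  # placeholder path
pad_id = tokenizer.vocab["[PAD]"]

def tokenize_eagerly(text):
    # Inside tf.py_function the argument is an EagerTensor, so .numpy() is available.
    tokens = tokenizer.tokenize(text.numpy().decode("utf-8"))
    ids = tokenizer.convert_tokens_to_ids(tokens)[:max_length]
    ids += [pad_id] * (max_length - len(ids))
    return tf.constant(ids, dtype=tf.int32)

ids = tf.py_function(tokenize_eagerly, inp=[tf.constant("hello world")], Tout=tf.int32)

This keeps the Python code out of the compiled graph, at the cost of running serialized Python on every call, which is one reason tokenization is usually done in the input pipeline instead (see the answer below).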

@ptamas88 changed the title from "Customer tokenizer layer" to "Custom tokenizer layer" on Sep 21, 2020
@Shiro-LK

@ptamas88 Did you manage to make it work? I have the same question.

@kpe
Owner

kpe commented Oct 21, 2020

Yes, usually the tokenizer is not part of the graph. To run it in-graph you'll need a tokenizer that has a TF implementation, like SentencePiece when using ALBERT. For BERT you might try the tf.text BertTokenizer (https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/BertTokenizer.md). I haven't used it myself, but it should work.

@kpe
Owner

kpe commented Oct 21, 2020

hope that helps:

pip install tensorflow-text

and then try something along these lines:

import os
import tensorflow_text as text

# ckpt_dir and max_seq_len are assumed to be defined elsewhere.
tokenizer = text.BertTokenizer(os.path.join(ckpt_dir, 'vocab.txt'))
tok_ids = tokenizer.tokenize(["hello, cruel world!", "abcccccccd"]).merge_dims(-2, -1).to_tensor(shape=(2, max_seq_len))
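To tie this back to the original question: since tf.text's BertTokenizer operates on tensors in-graph, it can be wrapped in a Keras layer. A rough, untested sketch (the layer name is made up, lower_case=False mirrors the original code, and padding with 0 assumes [PAD] has id 0, as in the standard BERT vocabularies):

import tensorflow as tf
import tensorflow_text as text

class BertTokenizerLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, max_seq_len, **kwargs):
        super(BertTokenizerLayer, self).__init__(**kwargs)
        self.max_seq_len = max_seq_len
        self.tokenizer = text.BertTokenizer(vocab_path, lower_case=False)

    def call(self, inputs):
        tok_ids = self.tokenizer.tokenize(inputs)  # RaggedTensor: [batch, words, wordpieces]
        tok_ids = tok_ids.merge_dims(-2, -1)       # flatten wordpieces: [batch, tokens]
        # Pad (and truncate) to a fixed length; 0 is assumed to be the [PAD] id.
        return tok_ids.to_tensor(shape=(None, self.max_seq_len))

inputs = tf.keras.layers.Input(shape=(), dtype=tf.string)
outputs = BertTokenizerLayer('vocab.txt', max_seq_len=10)(inputs)  # placeholder vocab path
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)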

@kpe kpe closed this as completed Oct 21, 2020
@ptamas88
Author

@ptamas88 Did you manage to make it work? I have the same question.

Haven't tried since, but I will check out the solution @kpe mentioned.

@keeson

keeson commented Jan 26, 2021

It didn't work; it still throws OperatorNotAllowedInGraphError.
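For anyone hitting the same error: OperatorNotAllowedInGraphError is raised when a symbolic tensor is used in a plain Python operation during graph tracing, e.g. iterated over or tested as a bool, which is what pure-Python tokenizer code ends up doing internally. A minimal, hypothetical repro of the error class, not of this issue's exact code (autograph is disabled so the loop is not auto-converted):

import tensorflow as tf

@tf.function(autograph=False)
def broken(x):
    for token in x:  # iterating over a symbolic Tensor raises
        pass         # OperatorNotAllowedInGraphError at trace time
    return x

broken(tf.constant(["hello", "world"]))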
