
Add a byte pair encoding (BPE) tokenizer layer #46

Closed
mattdangerw opened this issue Mar 16, 2022 · 15 comments
Labels
type:feature New feature or request

Comments

@mattdangerw
Member

We would like to add a BPE tokenizer (used by GPT-2, RoBERTa, and others). Ideally this should be configurable to be compatible with the actual tokenization used by GPT-2 and RoBERTa, and it should run inside a TensorFlow graph.

mattdangerw added the type:feature (New feature or request) label on Mar 16, 2022
@abheesht17
Collaborator

Nice! Should I draw up a rough implementation and share a Colab notebook?

@mattdangerw
Member Author

mattdangerw commented Mar 16, 2022

This is a pretty important feature: it is widely used and will unlock some important models.

However, there are currently some technical roadblocks. We would like to keep our tokenizers running inside the TensorFlow graph using TensorFlow ops, and right now all of the tokenization ops are provided by tf-text.

tf-text does not offer a BPE tokenizer, but in theory SentencePiece should be configurable in a way that is compatible. See tensorflow/text#763.

The first thing to do would be to see if that is possible. Try configuring the SentencePiece tokenizer from tf-text and see whether it can be made actually compatible with the GPT-2 and RoBERTa tokenizers (testing against the Hugging Face tokenizers is probably the simplest way to do this). A Colab showing compatibility would "unblock" this work; if it's not currently possible, we may have to land some fixes in tf-text and SentencePiece.
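
A compatibility check along those lines could look roughly like this (a sketch only; it assumes the transformers and tensorflow_text packages and a hypothetical candidate model proto, gpt2_candidate.model, produced by whatever SentencePiece configuration is being tested):

```python
# Rough sketch of the compatibility check described above. Assumes:
# - transformers and tensorflow_text are installed,
# - "gpt2_candidate.model" is a hypothetical SentencePiece model proto built
#   from whatever configuration we are testing.
import tensorflow_text as tf_text
from transformers import GPT2Tokenizer

reference = GPT2Tokenizer.from_pretrained("gpt2")  # Hugging Face reference

with open("gpt2_candidate.model", "rb") as f:
    candidate = tf_text.SentencepieceTokenizer(model=f.read())

for text in ["The quick brown fox.", "Hello, world!"]:
    hf_ids = reference.encode(text)
    tf_ids = candidate.tokenize(text).numpy().tolist()
    print(text, "match:", hf_ids == tf_ids)
```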

From there we could produce a design that essentially hides the complexity of SentencePiece under the hood. We would also need to think about the vocab format we provide (a vocab file and a merges file?).

@mattdangerw
Member Author

@abheesht17 you are definitely welcome to help with this! It will require some diving into other libraries to understand the support we have today.

@abheesht17
Collaborator

Great, will do 👍🏼

@abheesht17
Collaborator

Hey, @mattdangerw. I went through this issue. So, essentially, this is what you want me to do:

  1. Use the SentencePiece library and configure it to train a byte-level BPE tokeniser on a small text corpus.
  2. Pass the .model file obtained after training to TensorFlow Text's SentencePiece tokeniser class.
  3. Train Hugging Face's GPT-2 tokeniser on the same corpus, check whether the resulting vocabulary is similar, and compare the output on a few input samples.

Is this correct?
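
For step 1, I'm imagining something like this minimal sketch (assuming the sentencepiece pip package and a hypothetical small corpus.txt; SentencePiece's BPE mode is not exactly GPT-2's byte-level BPE, so this is only an approximation):

```python
# Minimal sketch of step 1: train a BPE-type SentencePiece model.
# Assumes the sentencepiece package and a hypothetical small corpus.txt.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",    # small text corpus, one sentence per line
    model_prefix="bpe",    # writes bpe.model and bpe.vocab
    vocab_size=1000,
    model_type="bpe",
)
```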

@mattdangerw
Member Author

I'm not sure we need to actually train a SentencePiece model, though that might help in understanding things.

Basically, the public API we can rely on that might give us the op support we need is tf-text's SentencepieceTokenizer, but that takes a SentencePiece model proto as input.

End users will probably want to use this layer with the "vocab.json" and "merges.txt" files provided by the official GPT/RoBERTa GitHub repos or by Hugging Face. We can keep thinking about the file format we would want, but asking end users to construct a SentencePiece model themselves is probably a non-starter.

So, the question we could try to answer is: can we manually construct a SentencePiece model proto from the GPT vocab and merges files in a way that's compatible? If so, we could build this layer on top of the existing tf-text API, without ruling out more direct support from tf-text in the future. If not, we will need to go back to the drawing board a bit and figure out how to get op-level support here.

So putting that into a list:

  1. Start with the vocab and merges files for, say, GPT-2.
  2. Generate some known-correct output for some sample text (probably easiest to use Hugging Face here; could also try the tokenizer implementation from the GPT-2 GitHub repo).
  3. Try building a tf-text SentencepieceTokenizer from those files that matches the real tokenizer output.

It may turn out we are more blocked here than we think, per tensorflow/text#763, but this would be the way to find out.
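
To make the experiment concrete, here is a very rough sketch of what I mean (it assumes the sentencepiece pip package exposes its sentencepiece_model_pb2 protobuf bindings and that a local vocab.json is available; getting the piece scores to encode merge priority is exactly the part that may not be possible):

```python
# Very rough sketch: build a SentencePiece model proto whose pieces come from
# a GPT-2 style vocab.json, then load it with tf-text. Assumes the
# sentencepiece pip package ships sentencepiece_model_pb2, and that
# vocab.json is a local file. Whether this can ever match GPT-2's byte-level
# BPE is exactly the open question.
import json

import tensorflow_text as tf_text
from sentencepiece import sentencepiece_model_pb2 as sp_model

with open("vocab.json") as f:
    vocab = json.load(f)  # token -> id

proto = sp_model.ModelProto()
proto.trainer_spec.model_type = sp_model.TrainerSpec.BPE
for token, _ in sorted(vocab.items(), key=lambda kv: kv[1]):
    piece = proto.pieces.add()
    piece.piece = token
    piece.score = 0.0  # placeholder; scores would need to encode merge order

tokenizer = tf_text.SentencepieceTokenizer(model=proto.SerializeToString())
print(tokenizer.tokenize("The quick brown fox."))
```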

@abheesht17
Collaborator

Ah, understood. Thanks for clarifying!

@abheesht17
Collaborator

abheesht17 commented Mar 20, 2022

Some useful links about how Hugging Face tokenises the input text (given vocab.json and merges.txt):

huggingface/transformers#1083 (comment)
huggingface/transformers#4777

  1. Tokenise text using merges.txt
  2. Map the tokens to indices using vocab.json
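
A minimal sketch of those two steps with the transformers package (using the hosted "gpt2" checkpoint for illustration):

```python
# Minimal sketch of the two steps above, using transformers' GPT-2 tokenizer.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tok.tokenize("The quick brown fox.")  # step 1: apply merges.txt
ids = tok.convert_tokens_to_ids(tokens)        # step 2: look up in vocab.json
print(tokens)
print(ids)
```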

@abheesht17
Collaborator

abheesht17 commented Mar 30, 2022

Hey, @mattdangerw. Sorry for the delay; this slipped my mind. I opened an issue on the SentencePiece repository: google/sentencepiece#739. The author of the repo says: "manual model modification/creation is totally unsupported."

However, it looks like we may be able to add tokens from the vocab to the pieces attribute. I don't think they have Python wrappers/APIs for adding "pieces", but they do have a C++ function, AddPieces. See this unit test: https://github.com/google/sentencepiece/blob/bc53923a9147dc8ffa54034c8ed774de78cc4d39/src/bpe_model_test.cc#L52. I'll try to use this function and reproduce the output we get with HF. Give me a day or two.

@aleemkhan62

Hi All,

Just curious whether anyone has found any sort of workaround for this issue. My conclusion after reading the related issues is that it's not currently possible to incorporate the popular BPE tokenizers (RoBERTa/GPT-2) in tensorflow-text pipelines. Is that right?

@chenmoneygithub
Contributor

@aleemkhan62 Currently you can use BPE via tf_text.SentencePieceTokenizer only if you have a pretrained model proto. We are looking into a better solution for this. Please stay tuned, thanks!
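
A minimal sketch of that interim path (assuming tensorflow_text and a hypothetical pretrained bpe.model file):

```python
# Minimal sketch: tokenize inside the TF graph with a pretrained
# SentencePiece model proto. bpe.model is a hypothetical pretrained model.
import tensorflow as tf
import tensorflow_text as tf_text

with open("bpe.model", "rb") as f:
    tokenizer = tf_text.SentencepieceTokenizer(model=f.read())

ids = tokenizer.tokenize(tf.constant(["The quick brown fox."]))
print(ids)  # a tf.RaggedTensor of token ids
```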

@mattdangerw
Member Author

To add a little more color for others finding this issue: you can train a BPE-style vocabulary with sentencepiece today, and a sentencepiece model can be used with tensorflow-text or with the SentencePieceTokenizer in this library. However, that might not have exactly the same behavior as roberta/gpt2 tokenization.

We are currently working on a way to support the actual vocabulary files used by roberta/gpt2 (merges.txt and vocab.json), with exactly equivalent tokenization, running inside the tf graph.
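
For the interim workflow, a minimal sketch (assuming a bpe.model trained with sentencepiece as in the earlier sketch, and that this library's SentencePieceTokenizer accepts the serialized model via its proto argument):

```python
# Minimal sketch of the interim workflow: load a sentencepiece-trained BPE
# model into this library's SentencePieceTokenizer. Assumes bpe.model exists
# and that the `proto` argument accepts the serialized model bytes.
import keras_nlp

with open("bpe.model", "rb") as f:
    tokenizer = keras_nlp.tokenizers.SentencePieceTokenizer(proto=f.read())

print(tokenizer("The quick brown fox."))
```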

@piEsposito

Any updates here?

@abheesht17
Collaborator

"Any updates here?"

See #303.

@mattdangerw
Member Author

Closing this! We have an implementation released -> https://keras.io/api/keras_nlp/tokenizers/byte_pair_tokenizer/

If anyone encounters issues with the tokenizer, please file a bug!
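
A minimal usage sketch (assuming keras_nlp and local GPT-2 style vocab.json / merges.txt files; see the linked docs for the authoritative API):

```python
# Minimal usage sketch of the released BytePairTokenizer. Assumes keras_nlp
# is installed and GPT-2 style vocab.json / merges.txt files are local.
import keras_nlp

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary="vocab.json",  # token -> id mapping
    merges="merges.txt",      # BPE merge rules, one per line
)
print(tokenizer("The quick brown fox."))
```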
