
How do I pre-train the T5 model in HuggingFace library using my own text corpus? #5079

Closed
abhisheknovoic opened this issue Jun 17, 2020 · 17 comments

@abhisheknovoic

Hello,

I understand how the T5 architecture works, and I have my own large corpus where I mask spans of tokens and replace them with sentinel tokens.

I also understand the tokenizers in HuggingFace, especially the T5 tokenizer.

Can someone point me to a document, or refer me to the class I need, to pretrain the T5 model on my corpus using the masked language modeling approach?

Thanks

@patil-suraj
Contributor

Hi @abhisheknovoic, this might help you: https://huggingface.co/transformers/model_doc/t5.html#training
Check the "Unsupervised denoising training" section.
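
A minimal sketch of that denoising setup (assuming a master build, where lm_labels has been renamed to labels; the example strings are the ones from the docs):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Corrupted input: each masked span is replaced by one sentinel token.
input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
# Target: the sentinels followed by the spans they replaced.
labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')

# The forward pass builds decoder_input_ids by shifting the labels right.
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs[0]  # cross-entropy loss over the sentinel targets
loss.backward()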

@abhisheknovoic
Author

@patil-suraj , do you mean this class? - T5ForConditionalGeneration

Also, at the top of the page, there is the following code:

lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)

Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter.

Thanks

@patil-suraj
Contributor

patil-suraj commented Jun 17, 2020

Yes, it's T5ForConditionalGeneration, and lm_labels has now been renamed to labels.

Pinging @patrickvonplaten for more details.

@abhisheknovoic
Author

@patil-suraj , I tried the following code which throws an error. Any idea why? Thanks

In [32]: from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

In [33]: input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')

In [34]: labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')

In [35]: config = T5Config()

In [36]: model = T5ForConditionalGeneration(config=config)

In [37]: model(input_ids=input_ids, lm_labels=labels)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-37-6717b0ecfbf5> in <module>
----> 1 model(input_ids=input_ids, lm_labels=labels)

/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in forward(self, input_ids, attention_mask, encoder_outputs, decoder_input_ids, decoder_attention_mask, decoder_past_key_value_states, use_cache, lm_labels, inputs_embeds, decoder_inputs_embeds, head_mask)
   1068         if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
   1069             # get decoder inputs from shifting lm labels to the right
-> 1070             decoder_input_ids = self._shift_right(lm_labels)
   1071
   1072         # If decoding with past key value states, only the last tokens

/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in _shift_right(self, input_ids)
    609         assert (
    610             decoder_start_token_id is not None
--> 611         ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
    612
    613         # shift inputs to the right

AssertionError: self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information

My versions are

transformers==2.11.0
tokenizers==0.7.0

@patil-suraj
Contributor

If you are using 2.11.0, use lm_labels; if you are on master, use labels.
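
In other words, the same call differs only in the keyword name:

# transformers 2.11.0
model(input_ids=input_ids, lm_labels=labels)

# current master
model(input_ids=input_ids, labels=labels)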

@abhisheknovoic
Author

@patil-suraj , thanks. I have installed the master version, but it still fails with the same error. It seems I need to specify something for decoder_start_token_id.

@abhisheknovoic
Author

Ok, I got it working. I initialized the config as follows:

config = T5Config(decoder_start_token_id=tokenizer.convert_tokens_to_ids(['<pad>'])[0])
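
For completeness, a sketch of the full working snippet on 2.11.0 (tokenizer.pad_token_id is an equivalent way to get the '<pad>' id):

from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
# T5 conventionally uses the pad token as the decoder start token.
config = T5Config(decoder_start_token_id=tokenizer.pad_token_id)
model = T5ForConditionalGeneration(config=config)

input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')

outputs = model(input_ids=input_ids, lm_labels=lm_labels)
loss = outputs[0]  # the model is randomly initialized, so expect a high loss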

@abhisheknovoic
Author

@patil-suraj , however, if we use the master branch, the tokenizers seem to be broken: the T5 tokenizer doesn't tokenize the sentinel tokens correctly.

@patrickvonplaten
Contributor

> Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter.

Feel free to also open a PR to correct lm_labels to labels in the comment :-)

@patrickvonplaten
Contributor

Just saw that @patil-suraj already did this - awesome thanks :-)

@abhisheknovoic regarding the T5 tokenizer, can you post some code here that shows that T5 tokenization is broken? (It would be great if we can reproduce the error easily.)
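
For instance, a minimal check along these lines (assuming the stock t5-small tokenizer) would show whether the sentinels survive as single tokens:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')
print(tokenizer.tokenize('<extra_id_1> cute dog <extra_id_2>'))
# A healthy tokenizer keeps each sentinel intact, e.g.
# ['<extra_id_1>', '▁cute', '▁dog', '<extra_id_2>']
# If the sentinels come back split into sub-pieces, tokenization is broken.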

@patil-suraj
Contributor

@patrickvonplaten it would be nice if we also added seq-2-seq (T5, BART) pre-training examples to the official examples.

cc @sshleifer

@patrickvonplaten
Contributor

Definitely!

@ncoop57
Contributor

ncoop57 commented Jun 24, 2020

Not sure if this should be a separate issue or not, but I am having difficulty training my own T5 tokenizer. When I train a BPE tokenizer using the amazing huggingface tokenizers library and attempt to load it via

tokenizer = T5Tokenizer.from_pretrained('./tokenizer')

I get the following error:

OSError: Model name './tokenizer/' was not found in tokenizers model name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). We assumed './tokenizer/' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

I then attempted to train a sentencepiece model instead, using the, again amazing, huggingface tokenizers library, but I get the same error because the tokenizer.save method does not actually generate the spiece.model file.

Am I doing something wrong?

Transformers version: 2.11.0
Tokenizers version: 0.7.0

Here is a colab to reproduce the error: https://colab.research.google.com/drive/1WX1Q2Ze9k0SxFMLLv1aFgVGBFMEVTyDe?usp=sharing
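
One possible workaround, not confirmed in this thread: T5Tokenizer loads a SentencePiece file named spiece.model, which the tokenizers library does not emit, so the vocabulary could be trained with the sentencepiece package directly. A sketch, with 'corpus.txt' as a hypothetical one-sentence-per-line input:

import sentencepiece as spm

# Writes spiece.model and spiece.vocab to the working directory.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spiece',
    vocab_size=32000,
)

from transformers import T5Tokenizer
# Point from_pretrained at the directory containing spiece.model.
tokenizer = T5Tokenizer.from_pretrained('.')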

@patrickvonplaten
Contributor

@mfuntowicz @n1t0 - maybe you can help here

@santhoshkolloju

> Definitely!

The pre-training scripts would really help. The original Mesh TensorFlow implementation is very complicated to understand.

@stale

stale bot commented Aug 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@PiotrNawrot

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (rather than Flax).

You can take a look!

Any suggestions are more than welcome.
