
How do I pre-train the T5 model in HuggingFace library using my own text corpus? #5079

Closed
abhisheknovoic opened this issue Jun 17, 2020 · 17 comments

@abhisheknovoic

Hello,

I understand how the T5 architecture works, and I have my own large corpus where I mask spans of tokens and replace them with sentinel tokens.

I also understand the tokenizers in HuggingFace, especially the T5 tokenizer.

Can someone point me to a document, or refer me to the class I need, to pretrain the T5 model on my corpus using the masked language modeling approach?

Thanks

@patil-suraj
Contributor

Hi @abhisheknovoic, this might help you: https://huggingface.co/transformers/model_doc/t5.html#training
Check the "Unsupervised denoising training" section.
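
A minimal sketch of that denoising setup (assuming a master build, where lm_labels has been renamed to labels; the example strings are the ones from the docs):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Corrupted input: each masked span is replaced by one sentinel token.
input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
# Target: the sentinels followed by the spans they replaced.
labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')

# The forward pass builds decoder_input_ids by shifting the labels right.
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs[0]  # cross-entropy loss over the sentinel targets
loss.backward()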

@abhisheknovoic
Author

@patil-suraj , do you mean this class? - T5ForConditionalGeneration

Also, at the top of the page, there is the following code:

lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)

Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter.

Thanks

@patil-suraj
Contributor

patil-suraj commented Jun 17, 2020

Yes, it's T5ForConditionalGeneration, and lm_labels has now been renamed to labels.

Pinging @patrickvonplaten for more details.

@abhisheknovoic
Author

@patil-suraj , I tried the following code which throws an error. Any idea why? Thanks

In [32]: from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

In [33]: input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')

In [34]: labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')

In [35]: config = T5Config()

In [36]: model = T5ForConditionalGeneration(config=config)

In [37]: model(input_ids=input_ids, lm_labels=labels)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-37-6717b0ecfbf5> in <module>
----> 1 model(input_ids=input_ids, lm_labels=labels)

/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in forward(self, input_ids, attention_mask, encoder_outputs, decoder_input_ids, decoder_attention_mask, decoder_past_key_value_states, use_cache, lm_labels, inputs_embeds, decoder_inputs_embeds, head_mask)
   1068         if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
   1069             # get decoder inputs from shifting lm labels to the right
-> 1070             decoder_input_ids = self._shift_right(lm_labels)
   1071
   1072         # If decoding with past key value states, only the last tokens

/usr/local/lib/python3.7/site-packages/transformers/modeling_t5.py in _shift_right(self, input_ids)
    609         assert (
    610             decoder_start_token_id is not None
--> 611         ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
    612
    613         # shift inputs to the right

AssertionError: self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information

My versions are

transformers==2.11.0
tokenizers==0.7.0

@patil-suraj
Contributor

If you are using 2.11.0, use lm_labels; if you are on master, use labels.
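
In other words, the same call differs only in the keyword name:

# transformers 2.11.0
model(input_ids=input_ids, lm_labels=labels)

# current master
model(input_ids=input_ids, labels=labels)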

@abhisheknovoic
Author

@patil-suraj , thanks. I have installed the master version, but it still fails with the same error. It seems I need to specify something for decoder_start_token_id.

@abhisheknovoic
Author

Ok, I got it working. I initialized the config as follows:

config = T5Config(decoder_start_token_id=tokenizer.convert_tokens_to_ids(['<pad>'])[0])
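
For completeness, a sketch of the full working snippet on 2.11.0 (tokenizer.pad_token_id is an equivalent way to get the '<pad>' id):

from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
# T5 conventionally uses the pad token as the decoder start token.
config = T5Config(decoder_start_token_id=tokenizer.pad_token_id)
model = T5ForConditionalGeneration(config=config)

input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')

outputs = model(input_ids=input_ids, lm_labels=lm_labels)
loss = outputs[0]  # the model is randomly initialized, so expect a high loss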

@abhisheknovoic
Author

@patil-suraj , however, if we use the master branch, the tokenizers seem to be broken: the T5 tokenizer doesn't tokenize the sentinel tokens correctly.

@patrickvonplaten
Contributor

> Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter.

Feel free to also open a PR to correct lm_labels to labels in the comment :-)

@patrickvonplaten
Contributor

Just saw that @patil-suraj already did this - awesome thanks :-)

@abhisheknovoic regarding the T5 tokenizer, can you post some code here that shows that T5 tokenization is broken? (It would be great if we can reproduce the error easily.)
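
For instance, a minimal check along these lines (assuming the stock t5-small tokenizer) would show whether the sentinels survive as single tokens:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')
print(tokenizer.tokenize('<extra_id_1> cute dog <extra_id_2>'))
# A healthy tokenizer keeps each sentinel intact, e.g.
# ['<extra_id_1>', '▁cute', '▁dog', '<extra_id_2>']
# If the sentinels come back split into sub-pieces, tokenization is broken.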

@patil-suraj
Contributor

@patrickvonplaten it would be nice if we also added seq-2-seq (T5, BART) pre-training examples to the official examples.

cc @sshleifer

@patrickvonplaten
Contributor

Definitely!

@ncoop57
Contributor

ncoop57 commented Jun 24, 2020

Not sure if this should be a separate issue or not, but I am having difficulty training my own T5 tokenizer. When I train a BPE tokenizer using the amazing huggingface tokenizers library and attempt to load it via

tokenizer = T5Tokenizer.from_pretrained('./tokenizer')

I get the following error:

OSError: Model name './tokenizer/' was not found in tokenizers model name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). We assumed './tokenizer/' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

I then attempted to train a sentencepiece model instead, using the, again amazing, huggingface tokenizers library, but I get the same error because the tokenizer.save method does not actually generate the spiece.model file.

Am I doing something wrong?

Transformers version: 2.11.0
Tokenizers version: 0.7.0

Here is a colab to reproduce the error: https://colab.research.google.com/drive/1WX1Q2Ze9k0SxFMLLv1aFgVGBFMEVTyDe?usp=sharing
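
One possible workaround, not confirmed in this thread: T5Tokenizer loads a SentencePiece file named spiece.model, which the tokenizers library does not emit, so the vocabulary could be trained with the sentencepiece package directly. A sketch, with 'corpus.txt' as a hypothetical one-sentence-per-line input:

import sentencepiece as spm

# Writes spiece.model and spiece.vocab to the working directory.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spiece',
    vocab_size=32000,
)

from transformers import T5Tokenizer
# Point from_pretrained at the directory containing spiece.model.
tokenizer = T5Tokenizer.from_pretrained('.')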

@patrickvonplaten
Contributor

@mfuntowicz @n1t0 - maybe you can help here

@santhoshkolloju

> Definitely!

The pre-training scripts would really help. The original Mesh TensorFlow implementation is very complicated to understand.

@stale

stale bot commented Aug 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@PiotrNawrot

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (rather than Flax).

You can take a look!

Any suggestions are more than welcome.
