
[t5 tokenizer] add info logs #9897

Merged
merged 4 commits into from Feb 13, 2021

Conversation

stas00
Contributor

@stas00 stas00 commented Jan 30, 2021

This PR (modified from the original):

  • adds info logs that correlate to the tokenizer files saved by tokenizer.save_pretrained()

original PR note

This PR

  • adds code to save the t5 fast tokenizer's tokenizer.json file on tokenizer.save_pretrained()
  • adds info logs that correlate to the tokenizer files saved by tokenizer.save_pretrained()

Context:

  • I needed to create a new smallish t5 model, and the resulting model won't work without tokenizer.json.
  • While debugging why that file was missing, I enabled logging and saw that we were getting info logs for every saved file except the tokenizer files, so this PR fixes that, making the output consistent and helping one see if something is missing.

Here is an example:

TRANSFORMERS_VERBOSITY=info PYTHONPATH=/hf/transformers-master/src python t5-make-very-small-model.py
[....]
Configuration saved in t5-very-small-random/config.json
Model weights saved in t5-very-small-random/pytorch_model.bin
Configuration saved in t5-very-small-random/config.json
tokenizer config file saved in t5-very-small-random/tokenizer_config.json
Special tokens file saved in t5-very-small-random/special_tokens_map.json
Copy vocab file to t5-very-small-random/spiece.model
tokenizer config file saved in t5-very-small-random/tokenizer_config.json
Special tokens file saved in t5-very-small-random/special_tokens_map.json
Copy vocab file to t5-very-small-random/spiece.model
Copy tokenizer file to t5-very-small-random/tokenizer.json
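The logging pattern this PR adds can be sketched with stdlib logging (function and label names here are hypothetical, not the actual transformers implementation):

```python
# Minimal sketch of the per-file info logs this PR adds, so that a missing
# tokenizer file is visible when TRANSFORMERS_VERBOSITY=info is set.
# Names are hypothetical; the real code lives in the tokenizer save path.
import logging

logger = logging.getLogger("tokenization_sketch")

def save_tokenizer_files(save_directory, files):
    """Pretend to save each tokenizer file and emit one info log per file."""
    saved = []
    for label, filename in files.items():
        path = f"{save_directory}/{filename}"
        logger.info("%s saved in %s", label, path)
        saved.append(path)
    return saved

paths = save_tokenizer_files(
    "t5-very-small-random",
    {
        "tokenizer config file": "tokenizer_config.json",
        "Special tokens file": "special_tokens_map.json",
    },
)
print(paths)
```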

I'm not sure why I needed to save both:

tokenizer.save_pretrained(mname_very_small)
tokenizer_fast.save_pretrained(mname_very_small)

Note that tokenization_t5.py doesn't declare tokenizer.json. The VOCAB_FILES_NAMES of the two t5 tokenizer files (slow, then fast):

VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}
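The practical consequence of the two mappings can be sketched directly: the fast tokenizer declares one extra file, so a checkpoint saved only through the slow tokenizer's file list is missing exactly that file.

```python
# Sketch: the file sets the two T5 tokenizer classes declare (copied from
# the two VOCAB_FILES_NAMES dicts above).
SLOW_FILES = {"vocab_file": "spiece.model"}
FAST_FILES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}

# Files the fast tokenizer expects but the slow file list never mentions:
missing = set(FAST_FILES.values()) - set(SLOW_FILES.values())
print(missing)
```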

As I flagged on Slack, https://huggingface.co/sshleifer/t5-tinier-random fails to load because this fast tokenizer.json file is missing from its set of files on S3:

Traceback (most recent call last): 
 File "./finetune_trainer.py", line 373, in <module> 
   main() 
 File "./finetune_trainer.py", line 205, in main 
   tokenizer = AutoTokenizer.from_pretrained( 
 File "/home/stas/hf/transformers/src/transformers/models/auto/tokenization_auto.py", line 385, in from_pretrained 
   return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) 
 File "/home/stas/hf/transformers/src/transformers/tokenization_utils_base.py", line 1768, in from_pretrained 
   return cls._from_pretrained( 
 File "/home/stas/hf/transformers/src/transformers/tokenization_utils_base.py", line 1841, in _from_pretrained 
   tokenizer = cls(*init_inputs, **init_kwargs) 
 File "/home/stas/hf/transformers/src/transformers/models/t5/tokenization_t5_fast.py", line 139, in __init__ 
   super().__init__( 
 File "/home/stas/hf/transformers/src/transformers/tokenization_utils_fast.py", line 86, in __init__ 
   fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)

It could be a symptom of another problem in our code.

@LysandreJik, @sgugger

@sgugger
Collaborator

sgugger commented Jan 30, 2021

I don't think this is something that should be done in save_vocabulary. You have the option in save_pretrained to set legacy_format to False to generate that tokenizer.json file. I'm not an expert on the tokenization side, with all the stuff that was added for backward compatibility, so I don't know if there is a better option.
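As a rough sketch of what the legacy_format switch controls (a simulation with hypothetical file lists, not the transformers implementation):

```python
# Simulation of the legacy_format flag in save_pretrained: legacy_format=False
# is the setting that generates the unified tokenizer.json file. File lists
# are illustrative only; the real logic is in the tokenizer base class.
def saved_tokenizer_files(legacy_format):
    common = ["tokenizer_config.json", "special_tokens_map.json"]
    if legacy_format:
        # legacy layout: slow-tokenizer vocab files such as spiece.model
        return common + ["spiece.model"]
    # legacy_format=False generates the unified fast-tokenizer file
    return common + ["tokenizer.json"]

print(saved_tokenizer_files(legacy_format=False))
```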

I wasn't aware having this file was mandatory for some models to use the fast tokenizer. Are you sure you have sentencepiece installed? It might be due to this that the slow-to-fast conversion does not work automatically.

Anyhow, once we have found the right way to generate that tokenizer.json file, it should be added on the model sharing doc page, next to the section on how to generate TF/PyTorch checkpoints, so that people know what to do to have the most complete model on the hub.

@stas00
Contributor Author

stas00 commented Jan 31, 2021

I don't have a problem with adding it anywhere else; who do we tag on this?

  1. Let the code speak for itself:
python -c "from transformers import T5Tokenizer, T5TokenizerFast; mname_from='sshleifer/t5-tinier-random'; tokenizer = T5Tokenizer.from_pretrained(mname_from); tokenizer_fast = T5TokenizerFast.from_pretrained(mname_from)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/tokenization_utils_base.py", line 1762, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/tokenization_utils_base.py", line 1835, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/models/t5/tokenization_t5_fast.py", line 139, in __init__
    super().__init__(
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/tokenization_utils_fast.py", line 86, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
  2. If FooTokenizer.from_pretrained() fetches tokenizer.json, then FooTokenizer.save_pretrained() must save it too.
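The invariant in point 2 can be expressed as a small check (a sketch with hypothetical file sets and fake save logic, not the transformers code):

```python
# Sketch of the save/load round-trip invariant: every file from_pretrained
# needs must also be produced by save_pretrained. File names hypothetical.
import json
import os
import tempfile

REQUIRED_BY_LOAD = {"tokenizer_config.json", "tokenizer.json"}

def fake_save_pretrained(directory, include_fast_file):
    """Write a legacy-only or complete file set into directory."""
    names = ["tokenizer_config.json"]
    if include_fast_file:
        names.append("tokenizer.json")
    for name in names:
        with open(os.path.join(directory, name), "w") as f:
            json.dump({}, f)

def round_trip_ok(directory):
    """True when every file the loader needs is actually present."""
    return REQUIRED_BY_LOAD <= set(os.listdir(directory))

with tempfile.TemporaryDirectory() as d:
    fake_save_pretrained(d, include_fast_file=False)
    broken = round_trip_ok(d)  # tokenizer.json missing -> invariant fails
with tempfile.TemporaryDirectory() as d:
    fake_save_pretrained(d, include_fast_file=True)
    fixed = round_trip_ok(d)
print(broken, fixed)
```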

I wasn't aware having this file was mandatory for some models to use the fast tokenizer. Are you sure you have sentencepiece installed? It might be due to this that the slow-to-fast conversion does not work automatically.

pip install sentencepiece
Requirement already satisfied: sentencepiece in /mnt/nvme1/anaconda3/envs/main-38/lib/python3.8/site-packages (0.1.91)

If you look at the traceback, it is looking for that file and can't find it.

Anyhow, once we have found the right way to generate that tokenizer.json file, it should be added on the model sharing doc page, next to the section on how to generate TF/PyTorch checkpoints, so that people know what to do to have the most complete model on the hub.

Agreed!

@LysandreJik, @n1t0

@stas00
Contributor Author

stas00 commented Feb 11, 2021

OK, so as @sgugger suggested on Slack, the fast tokenizer saving will be handled at the core level some time in the future, so I removed that part from this PR, leaving just the logger part.

@stas00 stas00 changed the title [t5 tokenizer] save fast tokenizer + add info logs [t5 tokenizer] add info logs Feb 11, 2021
Collaborator

@sgugger sgugger left a comment


LGTM, thanks for updating!

Member

@LysandreJik LysandreJik left a comment


LGTM, thanks @stas00

@LysandreJik LysandreJik merged commit 8fae93c into huggingface:master Feb 13, 2021
@stas00 stas00 deleted the save_tokenizer branch February 14, 2021 05:00