
[t5 tokenizer] add info logs #9897

Merged
merged 4 commits into from Feb 13, 2021

Conversation

stas00
Contributor

@stas00 stas00 commented Jan 30, 2021

This PR (modified from the original):

  • adds info logs that correlate to the tokenizer files saved by tokenizer.save_pretrained()

original PR note

This PR

  • adds code to save the t5 fast tokenizer's tokenizer.json file on tokenizer.save_pretrained()
  • adds info logs that correlate to the tokenizer files saved by tokenizer.save_pretrained()

Context:

  • I needed to create a new smallish t5 model, and the resulting model won't work without tokenizer.json.
  • While debugging why that file was missing, I enabled logging and saw that we were getting info logs for every saved file except the tokenizer files, so this PR fixes that, making the output consistent and helping one see if something is missing.

Here is an example:

TRANSFORMERS_VERBOSITY=info PYTHONPATH=/hf/transformers-master/src python t5-make-very-small-model.py
[....]
Configuration saved in t5-very-small-random/config.json
Model weights saved in t5-very-small-random/pytorch_model.bin
Configuration saved in t5-very-small-random/config.json
tokenizer config file saved in t5-very-small-random/tokenizer_config.json
Special tokens file saved in t5-very-small-random/special_tokens_map.json
Copy vocab file to t5-very-small-random/spiece.model
tokenizer config file saved in t5-very-small-random/tokenizer_config.json
Special tokens file saved in t5-very-small-random/special_tokens_map.json
Copy vocab file to t5-very-small-random/spiece.model
Copy tokenizer file to t5-very-small-random/tokenizer.json
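The logging pattern this PR adds can be sketched with stdlib logging (function and label names here are hypothetical, not the actual transformers implementation):

```python
# Minimal sketch of the per-file info logs this PR adds, so that a missing
# tokenizer file is visible when TRANSFORMERS_VERBOSITY=info is set.
# Names are hypothetical; the real code lives in the tokenizer save path.
import logging

logger = logging.getLogger("tokenization_sketch")

def save_tokenizer_files(save_directory, files):
    """Pretend to save each tokenizer file and emit one info log per file."""
    saved = []
    for label, filename in files.items():
        path = f"{save_directory}/{filename}"
        logger.info("%s saved in %s", label, path)
        saved.append(path)
    return saved

paths = save_tokenizer_files(
    "t5-very-small-random",
    {
        "tokenizer config file": "tokenizer_config.json",
        "Special tokens file": "special_tokens_map.json",
    },
)
print(paths)
```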

I'm not sure why I needed to save both:

tokenizer.save_pretrained(mname_very_small)
tokenizer_fast.save_pretrained(mname_very_small)

Note that tokenization_t5.py doesn't declare tokenizer.json. The VOCAB_FILES_NAMES of the two t5 tokenizer files (slow, then fast):

VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}
VOCAB_FILES_NAMES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}
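The practical consequence of the two mappings can be sketched directly: the fast tokenizer declares one extra file, so a checkpoint saved only through the slow tokenizer's file list is missing exactly that file.

```python
# Sketch: the file sets the two T5 tokenizer classes declare (copied from
# the two VOCAB_FILES_NAMES dicts above).
SLOW_FILES = {"vocab_file": "spiece.model"}
FAST_FILES = {"vocab_file": "spiece.model", "tokenizer_file": "tokenizer.json"}

# Files the fast tokenizer expects but the slow file list never mentions:
missing = set(FAST_FILES.values()) - set(SLOW_FILES.values())
print(missing)
```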

As I flagged on Slack, https://huggingface.co/sshleifer/t5-tinier-random fails to load because this fast tokenizer.json file is missing from its set of files on S3:

Traceback (most recent call last): 
 File "./finetune_trainer.py", line 373, in <module> 
   main() 
 File "./finetune_trainer.py", line 205, in main 
   tokenizer = AutoTokenizer.from_pretrained( 
 File "/home/stas/hf/transformers/src/transformers/models/auto/tokenization_auto.py", line 385, in from_pretrained 
   return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) 
 File "/home/stas/hf/transformers/src/transformers/tokenization_utils_base.py", line 1768, in from_pretrained 
   return cls._from_pretrained( 
 File "/home/stas/hf/transformers/src/transformers/tokenization_utils_base.py", line 1841, in _from_pretrained 
   tokenizer = cls(*init_inputs, **init_kwargs) 
 File "/home/stas/hf/transformers/src/transformers/models/t5/tokenization_t5_fast.py", line 139, in __init__ 
   super().__init__( 
 File "/home/stas/hf/transformers/src/transformers/tokenization_utils_fast.py", line 86, in __init__ 
   fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)

It could be a symptom of another problem in our code.

@LysandreJik, @sgugger

@sgugger
Collaborator

sgugger commented Jan 30, 2021

I don't think this is something that should be done in save_vocabulary. You have the option in save_pretrained to set legacy_format to False to generate that tokenizer.json file. I'm not an expert on the tokenization side, with all the stuff that was added for backward compatibility, so I don't know if there is a better option.
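As a rough sketch of what the legacy_format switch controls (a simulation with hypothetical file lists, not the transformers implementation):

```python
# Simulation of the legacy_format flag in save_pretrained: legacy_format=False
# is the setting that generates the unified tokenizer.json file. File lists
# are illustrative only; the real logic is in the tokenizer base class.
def saved_tokenizer_files(legacy_format):
    common = ["tokenizer_config.json", "special_tokens_map.json"]
    if legacy_format:
        # legacy layout: slow-tokenizer vocab files such as spiece.model
        return common + ["spiece.model"]
    # legacy_format=False generates the unified fast-tokenizer file
    return common + ["tokenizer.json"]

print(saved_tokenizer_files(legacy_format=False))
```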

I wasn't aware having this file was mandatory for some models to use the fast tokenizer. Are you sure you have sentencepiece installed? It might be due to this that the slow-to-fast conversion does not work automatically.

Anyhow, once we have found the right way to generate that tokenizer.json file, it should be added on the model sharing doc page, next to the section on how to generate TF/PyTorch checkpoints, so that people know what to do to have the most complete model on the hub.

@stas00
Contributor Author

stas00 commented Jan 31, 2021

I don't have a problem with adding it anywhere else; who do we tag on this?

  1. Let the code speak for itself:
python -c "from transformers import T5Tokenizer, T5TokenizerFast; mname_from='sshleifer/t5-tinier-random'; tokenizer = T5Tokenizer.from_pretrained(mname_from); tokenizer_fast = T5TokenizerFast.from_pretrained(mname_from)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/tokenization_utils_base.py", line 1762, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/tokenization_utils_base.py", line 1835, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/models/t5/tokenization_t5_fast.py", line 139, in __init__
    super().__init__(
  File "/mnt/disc1/data/trash/src/transformers/src/transformers/tokenization_utils_fast.py", line 86, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
  2. If FooTokenizer.from_pretrained() fetches tokenizer.json, then FooTokenizer.save_pretrained() must save it too.
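The invariant in point 2 can be expressed as a small check (a sketch with hypothetical file sets and fake save logic, not the transformers code):

```python
# Sketch of the save/load round-trip invariant: every file from_pretrained
# needs must also be produced by save_pretrained. File names hypothetical.
import json
import os
import tempfile

REQUIRED_BY_LOAD = {"tokenizer_config.json", "tokenizer.json"}

def fake_save_pretrained(directory, include_fast_file):
    """Write a legacy-only or complete file set into directory."""
    names = ["tokenizer_config.json"]
    if include_fast_file:
        names.append("tokenizer.json")
    for name in names:
        with open(os.path.join(directory, name), "w") as f:
            json.dump({}, f)

def round_trip_ok(directory):
    """True when every file the loader needs is actually present."""
    return REQUIRED_BY_LOAD <= set(os.listdir(directory))

with tempfile.TemporaryDirectory() as d:
    fake_save_pretrained(d, include_fast_file=False)
    broken = round_trip_ok(d)  # tokenizer.json missing -> invariant fails
with tempfile.TemporaryDirectory() as d:
    fake_save_pretrained(d, include_fast_file=True)
    fixed = round_trip_ok(d)
print(broken, fixed)
```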

I wasn't aware having this file was mandatory for some models to use the fast tokenizer. Are you sure you have sentencepiece installed? It might be due to this that the slow-to-fast conversion does not work automatically.

pip install sentencepiece
Requirement already satisfied: sentencepiece in /mnt/nvme1/anaconda3/envs/main-38/lib/python3.8/site-packages (0.1.91)

If you look at the traceback, it is looking for that file and can't find it.

Anyhow, once we have found the right way to generate that tokenizer.json file, it should be added on the model sharing doc page, next to the section on how to generate TF/PyTorch checkpoints, so that people know what to do to have the most complete model on the hub.

Agreed!

@LysandreJik, @n1t0

@stas00
Contributor Author

stas00 commented Feb 11, 2021

OK, so as @sgugger suggested on Slack, the fast tokenizer saving will be handled at the core level some time in the future, so I removed that part from this PR, leaving just the logger part.

@stas00 stas00 changed the title [t5 tokenizer] save fast tokenizer + add info logs [t5 tokenizer] add info logs Feb 11, 2021
Collaborator

@sgugger sgugger left a comment


LGTM, thanks for updating!

Member

@LysandreJik LysandreJik left a comment


LGTM, thanks @stas00

@LysandreJik LysandreJik merged commit 8fae93c into huggingface:master Feb 13, 2021
@stas00 stas00 deleted the save_tokenizer branch February 14, 2021 05:00