New model: RoBERTa
Thanks to Myle Ott from Facebook for his help.
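As a quick illustration, here is a minimal usage sketch. The RobertaModel / RobertaTokenizer class names, the "roberta-base" checkpoint and the pytorch_transformers package name are assumptions based on the library's naming conventions, not taken from these notes:

```python
import torch
from pytorch_transformers import RobertaModel, RobertaTokenizer  # package/class names assumed

# Load the pre-trained RoBERTa tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

# Encode one sentence, adding the model-specific special tokens (<s> ... </s>)
input_ids = torch.tensor([tokenizer.encode("Hello world!", add_special_tokens=True)])

with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # (batch_size, sequence_length, hidden_size)
```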
Tokenizer sequence pair handling
Tokenizers get two new methods that add the model-specific special tokens to sequences: one for single sequences and one for sequence pairs. The sequence-pair method builds the list of tokens with the separator tokens placed according to the way the model was trained, as shown in the examples and the sketch below.
Sequence pair examples:

BERT: [CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
RoBERTa: <s> SEQUENCE_0 </s></s> SEQUENCE_1 </s>
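A minimal sketch of the two methods in use with a BERT tokenizer. The method names add_special_tokens_single_sentence and add_special_tokens_sentences_pair are not spelled out above, so treat them as assumptions; both operate on lists of token ids:

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Convert each sequence to token ids first
ids_0 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Hello world"))
ids_1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("How are you?"))

# Single sequence -> [CLS] SEQUENCE_0 [SEP]  (method name assumed)
single = tokenizer.add_special_tokens_single_sentence(ids_0)

# Sequence pair -> [CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]  (method name assumed)
pair = tokenizer.add_special_tokens_sentences_pair(ids_0, ids_1)
```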
Tokenizer encoding function
The tokenizer encode function gets two new arguments:

tokenizer.encode(text, text_pair=None, add_special_tokens=False)

If text_pair is specified, encode will return a tuple of encoded sequences. If add_special_tokens is set to True, the sequences will be built with the model's respective special tokens using the previously described methods.
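For example, with a BERT tokenizer (a sketch following the signature and behaviour described above):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sequence, no special tokens (the default)
ids = tokenizer.encode("Hello world")

# Sequence pair without special tokens: a tuple (ids_0, ids_1)
ids_0, ids_1 = tokenizer.encode("Hello world", text_pair="How are you?")

# Sequence pair with special tokens: a single list laid out as
# [CLS] hello world [SEP] how are you ? [SEP] for BERT
pair_ids = tokenizer.encode("Hello world", text_pair="How are you?", add_special_tokens=True)
```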
AutoConfig, AutoModel and AutoTokenizer
There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration:

- AutoConfig
- AutoModel
- AutoTokenizer
Those classes take as input a pre-trained model name or path and instantiate one of the corresponding classes. The input string indicates to the class which architecture should be instantiated. If the string contains "bert":

- AutoConfig instantiates a BertConfig
- AutoModel instantiates a BertModel
- AutoTokenizer instantiates a BertTokenizer
The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found here.
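A short sketch of this routing behaviour (the "bert-base-uncased" checkpoint name is just an example):

```python
from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

# "bert-base-uncased" contains "bert", so the BERT classes are selected
config = AutoConfig.from_pretrained("bert-base-uncased")        # -> BertConfig
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # -> BertTokenizer
model = AutoModel.from_pretrained("bert-base-uncased")          # -> BertModel
```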
Examples
Some examples have been refactored to better reflect the current library, among them finetune_on_pregenerated.py, as well as run_glue.py, which has been adapted to the RoBERTa model. run_glue.py also has better dataset processing with caching.
Bug fixes and improvements to the library modules
- Fixed multi-gpu training when using FP16 (@zijunsun)
- Re-added the possibility to import BertPreTrainedModel (@thomwolf)
- Improvements to TensorFlow -> PyTorch checkpoint conversion (@dhpollack)
- Fixed save_pretrained to save the correct added tokens (@joelgrus)
- Fixed version issues in run_openai_gpt (@rabeehk)
- Fixed an issue with line returns in Chinese BERT (@Yiqing-Zhou)
- Added more flexibility regarding the
- Fixed issues regarding backward compatibility with PyTorch 1.0.0 (@thomwolf)
- Added the unknown token to GPT-2 (@thomwolf)