New tokenizer API, TensorFlow improvements, enhanced documentation & tutorials

@LysandreJik LysandreJik released this 29 Jun 15:11

Breaking changes since v2

  • In #4874 the language modeling BERT has been split in two: BertForMaskedLM and BertLMHeadModel. BertForMaskedLM therefore cannot do causal language modeling anymore, and cannot accept the lm_labels argument.
  • The Trainer data collator is now a method instead of a class
  • Directly setting a tokenizer special token attribute (e.g. tokenizer.mask_token = '<mask>') now only associates the token with that attribute of the tokenizer; it no longer adds the token to the vocabulary if it is missing. Tokens are only added through the tokenizer.add_special_tokens() and tokenizer.add_tokens() methods.
  • The prepare_for_model method was removed as part of the new tokenizer API.
  • The truncation method is now only_first by default.
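
To make the last point concrete, here is a minimal, library-independent sketch of what an "only_first" strategy means for a pair of sequences: only the first sequence is shortened until the pair fits. The function name truncate_only_first, the token ids and the choice to raise an error are illustrative assumptions, and special tokens are ignored for simplicity; this is not the library's implementation.

```python
from typing import List, Tuple


def truncate_only_first(
    first: List[int], second: List[int], max_length: int
) -> Tuple[List[int], List[int]]:
    """Toy "only_first" strategy: shorten only the first sequence of a pair."""
    overflow = len(first) + len(second) - max_length
    if overflow <= 0:
        return first, second  # the pair already fits, nothing to do
    if overflow >= len(first):
        # Shortening the first sequence alone cannot make the pair fit.
        # Raising here is an assumption of this sketch.
        raise ValueError("first sequence is too short to absorb the truncation")
    return first[: len(first) - overflow], second


question = [11, 12, 13, 14, 15, 16]  # made-up token ids
context = [21, 22, 23]
short_question, same_context = truncate_only_first(question, context, max_length=7)
print(short_question, same_context)
```

The second sequence is returned untouched, which is the property that distinguishes this strategy from one that trims whichever sequence is currently longest.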

New Tokenizer API (@n1t0, @thomwolf, @mfuntowicz)

The tokenizers evolved quickly in version 2, with the addition of Rust tokenizers. They now have a simpler and more flexible API, aligned between Python (slow) and Rust (fast) tokenizers. This new API lets you control truncation and padding more precisely, allowing things like dynamic padding or padding to a multiple of 8.

The redesigned API is explained in detail here #4510 and here:

Notable changes:

  • it's now possible to truncate to the max input length of a model while padding the longest sequence in a batch
  • padding and truncation are decoupled and easier to control
  • it's possible to pad to a multiple of a predefined length, e.g. 8, which can give significant speedups on recent NVIDIA GPUs (V100)
  • a generic wrapper using tokenizer.__call__ can be used for all cases (single sequences, pairs of sequences, groups, batches, etc.)
  • tokenizers now accept pre-tokenized inputs (when the input is already split in word strings e.g. for NER)
  • All the Rust tokenizers are now fully tested like slow tokenizers
  • A new class AddedToken can be used to have a more fine-grained control on how added tokens behave during tokenization. In particular the user can control (1) whether left and right spaces are removed around the token during tokenization (2) whether the token will be identified inside another word and (3) whether the token will be recognized in normalized forms (e.g. in lower case if the tokenizer uses lower-casing)
  • Serialization issues were fixed
  • Possibility to create NumPy tensors when using the return_tensors parameter on tokenizers.
  • Introduced a new enum TensorType to map all the possible tensor backends we support: TensorType.TENSORFLOW, TensorType.PYTORCH, TensorType.NUMPY
  • Tokenizers now accept the TensorType enum for the return_tensors parameter of the encode(...), encode_plus(...) and batch_encode_plus(...) tokenizer methods.
  • The new BatchEncoding property is_fast indicates whether the BatchEncoding comes from a Python (slow) tokenizer or a Rust (fast) tokenizer.
  • Slow and Fast Tokenizers are now picklable. So is their output, the dict sub-class BatchEncoding.
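
The padding behaviour described in the list above (dynamic padding to the longest sequence in a batch, optionally rounded up to a multiple such as 8) can be sketched in plain Python. This is only an illustration of the idea under stated assumptions: the helper pad_batch, its pad_id argument and the token ids are hypothetical, and it is not the tokenizer code itself.

```python
from typing import Dict, List, Optional


def pad_batch(
    batch: List[List[int]],
    pad_id: int = 0,
    pad_to_multiple_of: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one, optionally rounded up to a multiple."""
    target = max(len(seq) for seq in batch)  # dynamic padding: longest in this batch
    if pad_to_multiple_of is not None and target % pad_to_multiple_of != 0:
        # Round up to the next multiple, e.g. 11 -> 16 for a multiple of 8.
        target = (target // pad_to_multiple_of + 1) * pad_to_multiple_of
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # made-up token ids
dynamic = pad_batch(batch)
aligned = pad_batch(batch, pad_to_multiple_of=8)
print(len(dynamic["input_ids"][0]), len(aligned["input_ids"][0]))
```

In the library, the same request is expressed through arguments of the tokenizer call, such as padding, truncation and pad_to_multiple_of (the latter was added in #5054).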

Several PRs to make the API more stable have been made:

  • [tokenizers] Fix #5081 and improve backward compatibility #5125 (@thomwolf)
  • Tokenizers API developments #5103 (@thomwolf)
  • Clearer error message in the use-case of #5169 (@thomwolf)
  • Add more tests on tokenizers serialization - fix bugs #5056 (@thomwolf)
  • [Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING #5252 (@thomwolf)
  • [tokenizers] Several small improvements and bug fixes #5287
  • Add pad_to_multiple_of on tokenizers (reimport) #5054 (@mfuntowicz)
  • [tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308

TensorFlow improvements (@jplu, @dzorlu, @LysandreJik)

Very big release for TensorFlow!

  • TensorFlow models can now compute the loss themselves, using the TFPreTrainedModel.compute_loss method. #4530
  • Can now resize token embeddings in TensorFlow #4351
  • Cleaning TensorFlow models #5229

Enhanced documentation (@sgugger)

We welcome @sgugger as a team member in New York. He already introduced a lot of very cool documentation changes:

  • Added a model summary #4789
  • Expose classes used in documentation #4808
  • Explain how to preview the docs in a PR #4795
  • Clean documentation #4849
  • Remove old doc page and add note about cache in installation #5027
  • Fix all Sphinx warnings #5068 (@sgugger)
  • Update pipeline examples to doctest syntax #5030
  • Reorganize documentation #5064
  • Update installation page and add contributing to the doc #5084
  • Update glossary #5148
  • Quick tour #5145
  • Switch master/stable doc and add older releases #5193
  • Add version control menu #5222
  • Don't recreate old docs #5243
  • Tokenization tutorial #5257
  • Remove links for all docs #5280
  • New model sharing tutorial #5323

Training & fine-tuning quickstart

  • Our own @joeddav added a training & fine-tuning quickstart to the documentation #5034!


MobileBERT, from the paper MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, was added to the library for both PyTorch and TensorFlow.

A single checkpoint is added: mobilebert-uncased which is the uncased_L-24_H-128_B-512_A-4_F-4_OPT checkpoint converted to our API.

This model was first implemented in PyTorch by @lonePatient, ported to the library by @vshampor, then finalized and implemented in TensorFlow by @LysandreJik.

Eli5 examples (@yjernite) #4968

  • The examples/eli5 folder contains training code for the dense retriever and to fine-tune a BART model, the jupyter notebook for the blog post, and the code for the live demo.

  • The RetriBert model implements the dense passage retriever. It's basically a wrapper for two Bert models and projection matrices. Because it does gradient checkpointing in a way that is very different from a concurrent PR, Yacine thought it would be easier to write a separate class for now and see later whether it can be merged into the BART code.

Enhanced examples/seq2seq (@sshleifer)

  • the examples/seq2seq folder is a combination of the old examples/summarization and examples/translation folders.
  • Finetuning works well for summarization, more experiments needed for translation. Finetuning works on multi-gpu, saves rouge scores during validation, and provides --freeze_encoder and --freeze_embeds options. These options make finetuning BART 5x faster on the cnn/dailymail dataset.
  • Distilbart code is added. It only supports summarization, for now.
  • Evaluation works well for both summarization and translation.
  • New Weights & Biases shared task for collaboration on the XSUM summarization task

Distilbart (@sshleifer)

  • Distilbart models are smaller versions of bart-large-cnn and bart-large-xsum. They can be loaded using BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-xsum-12-6'), for example. See this tweet for more info on available models and their speed/performance.
  • Commands to reproduce are available in the examples/seq2seq folder

BERT Loses Patience (@JetRunner)

Add BERT Loses Patience (Patience-based Early Exit) based on the paper and the official implementation
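
As a rough illustration of the patience-based early exit idea, the sketch below stops at the first layer whose internal classifier has agreed with the previous one for a given number of consecutive layers. The function name, the per-layer logits and the patience value are made up for this example; it is a simplified reading of the technique, not the code added to the library.

```python
from typing import Optional, Sequence, Tuple


def patience_early_exit(
    layer_logits: Sequence[Sequence[float]], patience: int
) -> Tuple[int, int]:
    """Return (predicted_class, layers_used) under a patience-based exit rule."""
    if not layer_logits:
        raise ValueError("need at least one layer")
    streak = 0  # consecutive layers whose prediction matched the previous layer
    previous: Optional[int] = None  # prediction of the previous internal classifier
    for depth, logits in enumerate(layer_logits, start=1):
        prediction = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        streak = streak + 1 if prediction == previous else 0
        previous = prediction
        if streak >= patience:
            return prediction, depth  # stable for long enough: stop early
    return previous, len(layer_logits)  # never stabilized: use the last layer


# Hypothetical logits from six internal classifiers on a two-class task.
logits_per_layer = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.4, 0.6],
    [0.05, 0.95],
]
result = patience_early_exit(logits_per_layer, patience=2)
print(result)
```

With a patience of 2, the prediction changes at the second layer and is then repeated at the third and fourth, so inference stops after four of the six layers.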

Unifying label arguments (@sgugger) #4722

  • Deprecates every label argument that is not named labels (such as masked_lm_labels, lm_labels, etc.) in favor of labels.
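
A common way to carry out this kind of rename while keeping old call sites working is a small shim that accepts the legacy keyword, warns, and forwards the value. The sketch below is a generic illustration: the standalone forward function, the warning text and the warning category are assumptions, not the library's code.

```python
import warnings
from typing import Any, Dict, Optional


def forward(input_ids: Any, labels: Optional[Any] = None, **kwargs: Any) -> Dict[str, Any]:
    """Toy forward pass that accepts `labels` and tolerates the old argument names."""
    for old_name in ("masked_lm_labels", "lm_labels"):
        if old_name in kwargs:
            warnings.warn(
                f"`{old_name}` is deprecated, use `labels` instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            labels = kwargs.pop(old_name)  # forward the legacy value
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return {"input_ids": input_ids, "labels": labels}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the deprecation is recorded
    old_style = forward([1, 2, 3], lm_labels=[1, 2, 3])
new_style = forward([1, 2, 3], labels=[1, 2, 3])
print(old_style == new_style, len(caught))
```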

NumPy type in tokenizers (@mfuntowicz) #4585

Introduce a new tensor type for return_tensors on tokenizer for NumPy.

  • As we're introducing more than two tensor backend alternatives, I created an enum TensorType listing all the possible tensors we can create: TensorType.TENSORFLOW, TensorType.PYTORCH, TensorType.NUMPY. This might help newcomers who don't know about "tf" and "pt".
    Note: TensorType values are compatible with the previous "tf" and "pt" strings, and now "np", to preserve backward compatibility (covered by a unit test).

  • NumPy is now a possible target when creating tensors. This is useful for JAX.
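
The backward compatibility between the enum and the short strings can be pictured with a small stand-in. The class below only mirrors the three members and string codes named above, and the helper resolve_tensor_type is hypothetical; the real enum lives in the library and may differ in detail.

```python
from enum import Enum
from typing import Union


class TensorType(Enum):
    """Stand-in for the enum described above; values are the short string codes."""

    PYTORCH = "pt"
    TENSORFLOW = "tf"
    NUMPY = "np"


def resolve_tensor_type(return_tensors: Union[str, TensorType]) -> TensorType:
    """Accept either an enum member or its legacy string code."""
    if isinstance(return_tensors, TensorType):
        return return_tensors
    return TensorType(return_tensors)  # raises ValueError for unknown codes


print(resolve_tensor_type("np") is resolve_tensor_type(TensorType.NUMPY))
```

Looking a member up by value is what lets existing code that passes "tf" or "pt" keep working unchanged.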

Community notebooks

Benchmarks (@patrickvonplaten)

The benchmark script was consolidated and some features were added:

Adds the ability to benchmark the following configurations for TF and PT (#4912):

  • TensorFlow:
    • Inference: CPU, GPU, GPU + XLA, GPU + eager mode, CPU + eager mode, TPU
  • PyTorch:
    • Inference: CPU, CPU + torchscript, GPU, GPU + torchscript, GPU + mixed precision, Torch/XLA TPU
    • Training: CPU, GPU, GPU + mixed precision, Torch/XLA TPU

  • [Benchmark] Add encoder decoder to benchmark and clean labels #4810
  • [Benchmark] add tpu and torchscript for benchmark #4850
  • [Benchmark] Extend Benchmark to all model type extensions #5241
  • [Benchmarks] improve Example Plotter #5245

Hidden states, attentions and cache

Before v3.0.0, the way to handle attentions, model hidden states, and whether to use the cache in models that have it for sequential decoding was to specify an argument in the configuration. In version v3.0.0, while we do maintain that argument for backwards compatibility, we introduce a new way of handling these through the forward and call methods.
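
The pattern can be sketched with a toy model: a per-call argument takes precedence, and leaving it unset falls back to the configuration value. ToyConfig, ToyModel and the placeholder computations are invented for this illustration; only the output_attentions name is borrowed from the library's vocabulary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyConfig:
    output_attentions: bool = False  # legacy, configuration-level switch


class ToyModel:
    def __init__(self, config: ToyConfig) -> None:
        self.config = config

    def forward(
        self, inputs: List[float], output_attentions: Optional[bool] = None
    ) -> Tuple[List[float], ...]:
        # A per-call argument wins; None means "fall back to the configuration".
        if output_attentions is None:
            output_attentions = self.config.output_attentions
        hidden = [2.0 * x for x in inputs]  # placeholder computation
        attentions = [1.0 / len(inputs)] * len(inputs)  # placeholder uniform weights
        return (hidden, attentions) if output_attentions else (hidden,)

    __call__ = forward  # calling the model runs forward, as in this toy only


model = ToyModel(ToyConfig())
default_out = model([1.0, 2.0])
with_attn = model([1.0, 2.0], output_attentions=True)
print(len(default_out), len(with_attn))
```

Keeping the configuration value as the default is what preserves backwards compatibility for code written against v2.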

Revamped AutoModels (@patrickvonplaten)

The AutoModelWithLMHead encompasses all models with a language modeling head, not making the distinction between causal, masked and seq2seq models. Three new auto models are added:

  • AutoModelForCausalLM for Autoregressive models
  • AutoModelForMaskedLM for Autoencoding models
  • AutoModelForSeq2SeqLM for Sequence-to-sequence models with causal LM for the decoder

New model & tokenizer architectures


  • Fixed a bug causing invalid ordering of the inputs in the underlying ONNX IR.
  • Increased logging to give the user more information about the exported variables.

Bug fixes and improvements