
v4.23.0: Whisper, Deformable DETR, Conditional DETR, MarkupLM, MSN, `safetensors`

Released by @LysandreJik on 10 Oct · 5141 commits to main since this release

Whisper

The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.

Whisper is an encoder-decoder Transformer trained on 680,000 hours of labeled (transcribed) audio. The model shows impressive performance and robustness in a zero-shot setting across multiple languages.
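
A minimal transcription sketch, assuming the release-era API and the `openai/whisper-tiny.en` checkpoint (swap in a real recording for the silent placeholder):

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

# Whisper expects 16 kHz mono audio; one second of silence stands in for a
# real recording (load one with e.g. librosa or the datasets library).
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# The encoder consumes log-Mel features; the decoder generates text autoregressively.
generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```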

Deformable DETR

The Deformable DETR model was proposed in Deformable DETR: Deformable Transformers for End-to-End Object Detection by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai.

Deformable DETR mitigates the slow convergence issues and limited feature spatial resolution of the original DETR by leveraging a new deformable attention module which only attends to a small set of key sampling points around a reference.
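
A hedged object-detection sketch; the `SenseTime/deformable-detr` checkpoint name follows the Hub, and the final step relies on the standardized post-processing methods described in the overhaul below:

```python
import torch
import requests
from PIL import Image
from transformers import AutoFeatureExtractor, DeformableDetrForObjectDetection

feature_extractor = AutoFeatureExtractor.from_pretrained("SenseTime/deformable-detr")
model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits into boxes, labels, and scores with the standardized
# post-processing method introduced in this release.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = feature_extractor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```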

Conditional DETR

The Conditional DETR model was proposed in Conditional DETR for Fast Training Convergence by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang.

Conditional DETR presents a conditional cross-attention mechanism for fast DETR training. Conditional DETR converges 6.7× to 10× faster than DETR.
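
Usage mirrors the Deformable DETR sketch above; only the classes and checkpoint (assumed to be `microsoft/conditional-detr-resnet-50`) change:

```python
from transformers import AutoFeatureExtractor, ConditionalDetrForObjectDetection

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/conditional-detr-resnet-50")
model = ConditionalDetrForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50")
# Preprocessing and post_process_object_detection work as in the Deformable DETR sketch.
```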

Time Series Transformer

The Time Series Transformer model is a vanilla encoder-decoder Transformer for time series forecasting.

The model is trained in a similar way to how one would train an encoder-decoder Transformer (like T5 or BART) for machine translation; i.e. teacher forcing is used. At inference time, one can autoregressively generate samples, one time step at a time.

⚠️ This is a recently introduced model and modality, so the API hasn't been tested extensively. There may be some bugs or slight breaking changes to fix in the future. If you see something strange, file a GitHub issue.
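
A minimal sketch of both modes with random tensors, assuming the release-era configuration and forward signature (all sizes below are illustrative):

```python
import torch
from transformers import TimeSeriesTransformerConfig, TimeSeriesTransformerForPrediction

config = TimeSeriesTransformerConfig(
    prediction_length=24,
    context_length=48,
    num_time_features=1,
    lags_sequence=[1, 2, 3],  # past window must cover context_length + max(lags_sequence)
)
model = TimeSeriesTransformerForPrediction(config)

batch_size = 2
past_length = config.context_length + max(config.lags_sequence)  # 48 + 3 = 51
past_values = torch.randn(batch_size, past_length)
past_time_features = torch.randn(batch_size, past_length, config.num_time_features)
past_observed_mask = torch.ones(batch_size, past_length)

# Training: teacher forcing against the ground-truth future window.
outputs = model(
    past_values=past_values,
    past_time_features=past_time_features,
    past_observed_mask=past_observed_mask,
    future_values=torch.randn(batch_size, config.prediction_length),
    future_time_features=torch.randn(batch_size, config.prediction_length, config.num_time_features),
)
outputs.loss.backward()

# Inference: autoregressively draw sample trajectories, one time step at a time.
predictions = model.generate(
    past_values=past_values,
    past_time_features=past_time_features,
    past_observed_mask=past_observed_mask,
    future_time_features=torch.randn(batch_size, config.prediction_length, config.num_time_features),
)
print(predictions.sequences.shape)  # (batch, num_parallel_samples, prediction_length)
```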

Masked Siamese Networks

The ViTMSN model was proposed in Masked Siamese Networks for Label-Efficient Learning by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.

MSN (Masked Siamese Networks) consists of a joint-embedding architecture that matches the prototypes of masked patches with those of the unmasked patches. With this setup, the method yields excellent performance in the low-shot and extreme low-shot regimes for image classification, outperforming other self-supervised methods such as DINO. For instance, with 1% of ImageNet-1K labels, the method achieves 75.7% top-1 accuracy.
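
A minimal image-classification sketch, assuming the `facebook/vit-msn-small` checkpoint; this checkpoint ships a self-supervised backbone, so the classification head is randomly initialized until fine-tuned:

```python
import torch
import requests
from PIL import Image
from transformers import AutoFeatureExtractor, ViTMSNForImageClassification

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-small")
model = ViTMSNForImageClassification.from_pretrained("facebook/vit-msn-small")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Predictions are only meaningful after fine-tuning the head on labeled data.
print(model.config.id2label[logits.argmax(-1).item()])
```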

MarkupLM

The MarkupLM model was proposed in MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei.

MarkupLM is BERT, but applied to HTML pages instead of raw text documents. The model incorporates additional embedding layers to improve performance, similar to LayoutLM.

The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains state-of-the-art results on two important benchmarks: WebSRC and SWDE.
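
A minimal feature-extraction sketch, assuming the `microsoft/markuplm-base` checkpoint; the processor parses the HTML into nodes and XPaths before tokenizing:

```python
from transformers import MarkupLMProcessor, MarkupLMModel

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html = "<html><body><h1>Welcome</h1><p>Here is my website.</p></body></html>"
# The encoding carries xpath_tags_seq/xpath_subs_seq alongside the usual token
# ids, feeding the model's extra markup embedding layers.
encoding = processor(html, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)
```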

Security & safety

We are exploring a new serialization format that does not rely on Pickle and that can be leveraged across the three frameworks we support: PyTorch, TensorFlow, and JAX. We use the safetensors library for this.

Support currently covers PyTorch models only and is still experimental.
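
A minimal sketch of the safetensors PyTorch API (the transformers integration is experimental, but the library can be used directly):

```python
import torch
from safetensors.torch import save_file, load_file

# Tensors are stored in a flat, zero-copy format; no pickling is involved,
# so loading an untrusted file cannot execute arbitrary code.
state_dict = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(state_dict, "model.safetensors")

reloaded = load_file("model.safetensors")
assert torch.equal(state_dict["weight"], reloaded["weight"])
```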

Computer vision post-processing methods overhaul

The processors for computer vision have been overhauled to ensure they have consistent naming, input arguments, and outputs.
⚠️ The existing methods superseded by the newly introduced post_process_object_detection, post_process_semantic_segmentation, post_process_instance_segmentation, and post_process_panoptic_segmentation are now deprecated.
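
For example, the new semantic-segmentation method resizes logits back to the original image and returns one class-index map per image (a sketch assuming a SegFormer checkpoint):

```python
import torch
import requests
from PIL import Image
from transformers import AutoFeatureExtractor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

segmentation = feature_extractor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # (height, width) of the original image
)[0]
print(segmentation.shape)  # per-pixel class indices at the original resolution
```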

🚨 Breaking changes

The following changes are bugfixes that we have chosen to fix even though they change the resulting behavior. We mark them as breaking changes, so if you are using this part of the codebase, we recommend you take a look at the PRs to understand exactly what changed.

Breaking change for ViT parameter initialization

Breaking change for the top_p argument of the TopPLogitsWarper of the generate method.

Model head additions

OPT and BLOOM now have question answering heads available.
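
A minimal extractive-QA sketch with the new OPT head; `facebook/opt-350m` has no fine-tuned QA head, so the span logits below come from a randomly initialized head and are illustrative only:

```python
import torch
from transformers import AutoTokenizer, OPTForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = OPTForQuestionAnswering.from_pretrained("facebook/opt-350m")

question, context = "Who wrote the report?", "The report was written by Jane."
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end token positions and decode the span.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
print(tokenizer.decode(inputs.input_ids[0, start : end + 1]))
```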

Pipelines

There is now a zero-shot object detection pipeline.
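
A hedged sketch backed by an OWL-ViT checkpoint; note that the name of the label argument has varied across versions:

```python
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

predictions = detector(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["cat", "remote control", "couch"],  # may be `text_queries` in early versions
)
for pred in predictions:
    print(pred["label"], round(pred["score"], 3), pred["box"])
```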

TensorFlow architectures

The GroupViT model is now available in TensorFlow.
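
A CLIP-style zero-shot classification sketch in TensorFlow, assuming the `nvidia/groupvit-gcc-yfcc` checkpoint:

```python
import tensorflow as tf
import requests
from PIL import Image
from transformers import AutoProcessor, TFGroupViTModel

# Pass from_pt=True if the checkpoint only ships PyTorch weights.
processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")
model = TFGroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="tf",
    padding=True,
)
outputs = model(**inputs)
probs = tf.nn.softmax(outputs.logits_per_image, axis=-1)  # image-text similarity
print(probs.numpy())
```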

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @flozi00
    • german autoclass (#19049)
    • correct spelling in README (#19092)
    • german processing (#19121)
    • german training, accelerate and model sharing (#19171)
  • @DeppMeng
    • Add support for conditional detr (#18948)
  • @sayakpaul
    • MSN (Masked Siamese Networks) for ViT (#18815)
    • fix: ckpt paths. (#19159)
    • Add expected output to the sample code for ViTMSNForImageClassification (#19183)
  • @IMvision12
    • Updated hf_argparser.py (#19188)
    • Added tests for yaml and json parser (#19219)
    • Added Type hints for LED TF (#19315)
    • Making ConvBert Tokenizer independent from bert Tokenizer (#19347)
    • Added Type hints for XLM TF (#19333)
  • @ariG23498
    • [TensorFlow] Adding GroupViT (#18020)
    • fix: renamed variable name (#18850)
  • @Mustapha-AJEGHRIR
    • Fix m2m_100.mdx doc example missing labels (#19149)
    • Making camembert independent from roberta, clean (#19337)
    • Make Camembert TF version independent from Roberta (#19364)
  • @D3xter1922
    • removing XLMConfig inheritance from FlaubertConfig (#19326)
    • [WIP]remove XLMTokenizer inheritance from FlaubertTokenizer (#19330)
    • remove RobertaConfig inheritance from MarkupLMConfig (#19404)
  • @srhrshr
    • Frees LongformerTokenizer of the Roberta dependency (#19346)
    • Removes Roberta and Bert config dependencies from Longformer (#19343)
    • removes prophet config dependencies from xlm-prophet (#19400)
  • @sahamrit
    • [WIP] Add ZeroShotObjectDetectionPipeline (#18445) (#18930)
  • @Davidy22
    • Copy BertTokenizer dependency into retribert tokenizer (#19371)
  • @rchan26
    • Remove dependency of Bert from Squeezebert tokenizer (#19403)
    • Remove dependency of Roberta in Blenderbot (#19411)
  • @harry7337
    • Removed Bert and XML Dependency from Herbert (#19410)
  • @Infrared1029
    • Remove Dependency between Bart and LED (slow/fast) (#19408)
  • @Steboss89
    • Add Italian translation for add_new_model.mdx (#18713)