2019/12/10: We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).


Relevant links:

About the paper:

TL;DR: A simple module consistently outperforms self-attention and the Transformer model on the main NMT datasets, with SoTA performance.

We ask three questions:

  • Is attention alone good enough?
  • Is parallel representation learning applicable to sequence data and tasks?
  • How can we design a module that combines the inductive biases of both convolution and self-attention?

We find that stand-alone self-attention has shortcomings, and present a new module that maps the input to a hidden space and performs three operations in parallel: self-attention, convolution and a nonlinearity. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on the main NMT tasks under the standard setting.

Key features:

  • Designs a multi-branch scheme that extends self-attention, and is the first to successfully combine convolution and self-attention in one module for sequence tasks, via the proposed shared projection.
  • Achieves SOTA results on three main translation datasets: WMT14 En-Fr, WMT14 En-De and IWSLT14 De-En.
  • Learns sequence representations in parallel and thus has potential for acceleration.
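
As a rough illustration of the idea described above, the following pure-Python sketch projects the input once and then runs self-attention, a local convolution and a nonlinearity on that shared hidden representation, summing the three outputs. It is not the code in this repository: the function names (`parallel_block`, `local_conv`, and so on), the single attention head without learned query/key projections, the single kernel shared across channels, the ReLU nonlinearity and the plain sum are all simplifying assumptions made for readability.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(h):
    """Scaled dot-product attention using h as queries, keys and values
    (assumption: no separate learned query/key projections)."""
    scale = math.sqrt(len(h[0]))
    scores = [[sum(q * k for q, k in zip(qi, kj)) / scale for kj in h] for qi in h]
    weights = [softmax(r) for r in scores]
    return matmul(weights, h)


def local_conv(h, kernel):
    """Same-length 1-D convolution over positions, zero-padded at the edges.
    Assumption: one kernel is shared by every channel."""
    half = len(kernel) // 2
    n, d = len(h), len(h[0])
    out = []
    for t in range(n):
        row = [0.0] * d
        for offset, w in enumerate(kernel):
            src = t + offset - half
            if 0 <= src < n:
                for c in range(d):
                    row[c] += w * h[src][c]
        out.append(row)
    return out


def relu(h):
    """Position-wise nonlinearity."""
    return [[max(0.0, v) for v in row] for row in h]


def parallel_block(x, w_shared, kernel):
    """Project once, run the three branches on the same hidden states, sum them."""
    h = matmul(x, w_shared)  # shared projection into the hidden space
    branches = (self_attention(h), local_conv(h, kernel), relu(h))
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*branches)]


# Toy usage: 3 positions, input width 2, hidden width 3.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_shared = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
out = parallel_block(x, w_shared, kernel=[0.25, 0.5, 0.25])
print(len(out), len(out[0]))  # prints: 3 3
```

Because the three branches only read the shared hidden states and do not depend on each other, they could in principle be computed concurrently, which is where the potential for acceleration comes from.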


  1. Outperforms previous models on large NMT datasets, and also scales down to small datasets and the base model setting.
  2. The shared projection is key to combining convolution and self-attention; the model also generates better long sequences and has potential for acceleration.
Task            Size    Test (BLEU)
IWSLT14 De-En   Base    36.3
WMT14 En-De     Large   29.9
WMT14 En-Fr     Large   43.5

Requirements and Installation

  • PyTorch version >= 1.0.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • torch==1.3.1 with cuda==10.0
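
If you want to check your environment against the list above before installing, a small script along these lines may help. It is only a sketch: `parse_version` and `check_environment` are hypothetical helper names, not part of this repository or of fairseq, and the script does not check for NCCL or for the exact torch==1.3.1 / cuda==10.0 combination.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 6)    # "Python version >= 3.6"
MIN_TORCH = (1, 0, 0)  # "PyTorch version >= 1.0.0"


def parse_version(text):
    """Turn a version string such as '1.3.1+cu100' into (1, 3, 1)."""
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_environment():
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.6 is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems
    import torch  # imported late so the script still runs without PyTorch
    if parse_version(torch.__version__) < MIN_TORCH:
        problems.append("PyTorch >= 1.0.0 is required")
    if not torch.cuda.is_available():
        problems.append("no CUDA device visible (needed for training new models)")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```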

Installing from source

To install from source and develop locally:

pip install --editable . --user

We provide pre-trained models and detailed training and evaluation examples in examples/parallel_intersected_multi-scale_attention(Prime)/


Please cite as:

  title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
  author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
  journal={arXiv preprint arXiv:1911.09483},


The code is based on fairseq-0.6.2. The main code can be found in fairseq\models\ (for parallel representation learning), fairseq\models\ (bigger matrix, code for acceleration), and fairseq\modules\ (for combining convolution and self-attention).

