News

2019/12/10 We have changed the model name from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).

Introduction

Core Code:

Code for parallel representation learning: fairseq\models\combine_transformer.py
Code for combining convolution and self-attention: fairseq\modules\multihead_attention.py
Code for acceleration ("bm" stands for big matrix): fairseq\models\transformer_bm.py

Relevant links:

Arxiv pdf: https://arxiv.org/abs/1911.09483
Pre-trained models as well as instructions for training: examples/parallel_intersected_multi-scale_attention(Prime)/README.md
Reddit post link

About the paper:

TL;DR: A simple module consistently outperforms self-attention and the Transformer model on the main NMT datasets, reaching SoTA performance.

We ask three questions:

Is attention alone good enough?
Is parallel representation learning applicable to sequence data and tasks?
How can we design a module that combines the inductive biases of both convolution and self-attention?

We find that stand-alone self-attention has shortcomings, and present a new module that maps the input to the hidden space and performs three operations in parallel: self-attention, convolution, and a nonlinearity. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on the main NMT tasks under the standard setting.
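To make the description above concrete, here is a minimal PyTorch sketch of a block that runs self-attention, a convolution, and a pointwise nonlinearity in parallel over the same input. It is an illustration only, not the code in fairseq\modules\multihead_attention.py. The class name ParallelBranchesSketch, the single depthwise kernel of size 3, the 4x feed-forward width, and the choice to sum the three branches before one output projection are all assumptions made for this example. The value projection is reused by the attention and convolution branches, loosely mirroring the shared projection mentioned under Key features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelBranchesSketch(nn.Module):
    """Hypothetical block: attention, convolution and a pointwise
    nonlinearity computed in parallel, then summed (an assumption)."""

    def __init__(self, dim, num_heads=4, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        assert kernel_size % 2 == 1, "odd kernel keeps the sequence length"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        # One value projection feeds both the attention and the conv branch.
        self.v_proj = nn.Linear(dim, dim)
        # Depthwise 1-D convolution over the time axis.
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        # Pointwise (position-wise) nonlinearity branch.
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.out_proj = nn.Linear(dim, dim)

    def _split_heads(self, t):
        # (batch, time, dim) -> (batch, heads, time, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        # x: (batch, time, dim)
        b, n, d = x.shape
        v = self.v_proj(x)  # shared by the two branches below

        # Branch 1: scaled dot-product self-attention (global context).
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = torch.matmul(F.softmax(scores, dim=-1), self._split_heads(v))
        attn = attn.transpose(1, 2).reshape(b, n, d)

        # Branch 2: depthwise convolution on the shared values (local context).
        conv = self.conv(v.transpose(1, 2)).transpose(1, 2)

        # Branch 3: pointwise nonlinearity on the raw input.
        pointwise = self.ffn(x)

        # Summing the branches is an assumption of this sketch.
        return self.out_proj(attn + conv + pointwise)


if __name__ == "__main__":
    block = ParallelBranchesSketch(dim=16, num_heads=4)
    out = block(torch.randn(2, 5, 16))
    print(tuple(out.shape))  # (2, 5, 16)
```

Because the three branches do not depend on each other, they can in principle be evaluated concurrently, which is the intuition behind the acceleration potential noted under Key features.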

Key features:

Designs a multi-branch schema that evolves self-attention, and is the first to successfully combine convolution and self-attention in one module for sequence tasks, using the proposed shared projection.
Achieves SOTA on three main translation datasets: WMT14 En-Fr, WMT14 En-De, and IWSLT14 De-En.
Learns sequence representations in parallel, and thus has potential for acceleration.

Results:

Better than previous models on large NMT datasets; also scales down to small datasets and the base model setting.
The shared projection is key to combining convolution and self-attention; the model generates better long sequences and has potential for acceleration.

Task	size	test (BLEU)
IWSLT14 De-En	Base	36.3
WMT14 En-De	Large	29.9
WMT14 En-Fr	Large	43.5

Requirements and Installation

PyTorch version >= 1.0.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL
torch==1.3.1 with cuda==10.0
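As a quick sanity check before training, the requirements above can be verified with a short script. This is a sketch, not part of the repository: the function name check_environment and the shape of the returned report are hypothetical, and it only compares major and minor version numbers against the stated minimums (Python 3.6, PyTorch 1.0). It does not check for NCCL.

```python
import importlib
import sys


def check_environment(min_python=(3, 6), min_torch=(1, 0)):
    """Hypothetical helper: report whether the minimum versions are met."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed, so nothing else can be checked.
        report["torch_ok"] = False
        report["cuda_available"] = False
        return report
    # torch.__version__ looks like "1.3.1" or "1.3.1+cu100";
    # drop any local suffix and compare major/minor only.
    major, minor = torch.__version__.split("+")[0].split(".")[:2]
    report["torch_ok"] = (int(major), int(minor)) >= min_torch
    # Training new models needs a GPU that PyTorch can see.
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(name, ok)
```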

Installing from source

To install from source and develop locally:

pip install --editable . --user

We provide pre-trained models and detailed example training and evaluation in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.

Citation

Please cite as:

@article{zhao2019muse,
  title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
  author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
  journal={arXiv preprint arXiv:1911.09483},
  year={2019}
}

Notes

The code is based on fairseq-0.6.2
