Compressed Transformer

Application of tensor decomposition methods for compressing a neural machine translation model.

The project supports training and translation with a trained model (without beam search).

Note that this project is still a work in progress.

  • We use the WMT16 (Multi30k) and IWSLT14 datasets for our experiments.
  • To learn more about the Transformer, you can read "The Annotated Transformer" [3].

Main Results

We reduced the model size by up to 2.6 times on the IWSLT14 dataset. Excluding the embedding layer, our model is up to 6.2 times smaller than the original when small ranks are used.

It turns out that the number of parameters of a Transformer-based model can be reduced by a factor of 2 using tensor decomposition without loss of quality.


In this project we explore compression techniques that can both decrease the size of the model and speed up inference. We evaluate and compare several compression techniques on a two-layer MLP for the MNIST task, pick the best of them (the Tensor Train and Tucker methods), and show their efficacy for a state-of-the-art NLP model, the Transformer network [1].
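For illustration, here is a minimal sketch of one of the compared techniques: low-rank (SVD) compression of a single linear layer, applied to a two-layer MNIST MLP. It is not the repository's code; the helper name svd_compress_linear, the chosen rank, and the use of the modern torch.linalg.svd API are assumptions.

import torch
import torch.nn as nn

def svd_compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Replace W (out x in) by the product of two thin factors from a truncated SVD,
    # so the layer stores rank * (in + out) parameters instead of in * out.
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:rank, :]               # (rank, in_features)
    second.weight.data = U[:, :rank] * S[:rank]    # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

# Example: compress the hidden layer of a two-layer MNIST MLP with rank 32.
mlp = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
mlp[1] = svd_compress_linear(mlp[1], rank=32)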

Requirements

  • pytorch 0.4.1+
  • nltk
  • torchtext

For MNIST experiments

  • tntorch

Run NMT

  • Run (with CUDA, if it is available):
PYTHONPATH="." python3 nmt/train.py --seed 45  --save_model_after 1000 \
 --valid_max_num 120  --lower --min_freq 3 --lower --tokenize --batch 82

To train a compressed model:

PYTHONPATH="." python3 nmt/train.py --seed 45  --save_model_after 1000 \
 --valid_max_num 120  --lower --min_freq 3 --lower --tokenize --batch 82 \
 --compress --compress_mode tt --exp compressed

Note 1: enable multi-GPU mode via the parameters --multi-gpu --num_devices #

Note 2: the default decomposition is Tensor Train, but you can try Tucker via --compress_mode tucker (preferable).

To test an uncompressed model, use:

sh scripts/run_test.sh $path_to_your_uncompressed_model

or, for a compressed model:

sh scripts/run_test_compressed.sh $path_to_your_compressed_model

How can the model be compressed?

The feed-forward network is a simple block with one hidden linear layer and one projection layer. The most popular choice of dimensions is 512x2048 for the hidden layer and 2048x512 for the projection layer, so a single feed-forward network holds about 2M parameters, and this block is repeated in each of the 2N Transformer blocks. That is why we decided to compress these parts of the model. We applied the tensorizing-neural-networks method [2] to every linear layer in the feed-forward block to obtain significant compression.

The weight matrix of the first layer is represented as a tensor with input modes 2x4x8x4x2 and output modes 4x4x8x4x4, and the second layer uses the same modes in reverse order. Each tensor is then decomposed with one of the tensor decomposition methods; we refer the reader to [2] for details.
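As a quick sanity check of this layout, the snippet below counts the parameters of such a TT-matrix layer, assuming the core shape (r_{k-1}, m_k, n_k, r_k) from [2]; it is an illustration, not the repository's code. With a maximal TT-rank of 2 it reproduces the 416 parameters per compressed fc layer reported in the tables below.

# Parameter count of a TT-matrix fc layer with the modes stated above.
in_modes = [2, 4, 8, 4, 2]    # 2*4*8*4*2 = 512
out_modes = [4, 4, 8, 4, 4]   # 4*4*8*4*4 = 2048
max_rank = 2                  # the boundary ranks are always 1
ranks = [1] + [max_rank] * (len(in_modes) - 1) + [1]

# Core k has shape (r_{k-1}, m_k, n_k, r_k).
tt_params = sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
                for k in range(len(in_modes)))
print(tt_params)              # 416, versus 512 * 2048 = 1,048,576 for the dense layer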

Results

Fully-compressed model

We conducted experiments with the Tucker decomposition applied to all weight matrices except the embeddings and the logits projection.
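As a reference for how such a decomposition can be computed, the sketch below uses a truncated higher-order SVD, one standard way to obtain a Tucker decomposition. It is only an illustration: the hosvd helper, the reshaping of the weight into a 4-way tensor, and the chosen ranks are hypothetical and not taken from the repository.

import torch

def hosvd(t: torch.Tensor, ranks):
    # Truncated higher-order SVD: factor U_k comes from the mode-k unfolding of t,
    # and the core is t contracted with every factor (Tucker format: core x_k U_k).
    factors = []
    for mode, r in enumerate(ranks):
        unfolding = t.movedim(mode, 0).reshape(t.shape[mode], -1)
        U, _, _ = torch.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])                        # (n_mode, r)
    core = t
    for U in factors:
        # Contracting dimension 0 each time cycles the modes, so after all
        # contractions the core has shape tuple(ranks).
        core = torch.tensordot(core, U, dims=([0], [0]))
    return core, factors

# Example: a 512x2048 weight reshaped into a 4-way tensor and Tucker-compressed.
w = torch.randn(512, 2048).reshape(16, 32, 32, 64)
core, factors = hosvd(w, ranks=[4, 4, 4, 4])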

In this setting (Tucker applied everywhere except the embeddings), we achieve the highest compression ratio: up to 2.6 for the full model and up to 6.2 without embeddings.

| Dataset  | Test BLEU | Compression Rate |
|----------|-----------|------------------|
| Multi30k | 0.441     | 1.61             |
| IWSLT14  | 0.297     | 1.61             |

Note: all results below are given without compressing the self-attention and embedding layers, which contain the majority of the parameters.

Multi30k

An example of training for the WMT'16 Multimodal Translation task (http://www.statmt.org/wmt16/multimodal-task.html).

Please use as large a batch size as possible.
The network can converge after 5 epochs with a batch size of 128.

The training set contains 29,100 sentence pairs (German-English).

Total number of parameters in the model: 64,231,918.

Number of epochs: 20; batch size: 150. The small-transformer model has 3 block layers instead of 6.

| Method               | BLEU  | N_comp/N | Compression Rate | Time   |
|----------------------|-------|----------|------------------|--------|
| original model       | 0.442 | 0/6      | 1.0              | 41.38  |
| tt-transformer       | 0.407 | 6/6      | 1.644            | 84.56  |
| small-transformer    | 0.403 | 0/3      | 1.6              | -      |
| tt-small-transformer | 0.396 | 3/3      | 2.1713           | 34.94  |
| tt-transformer       | 0.468 | 5/6      | 1.484            | 59.833 |
| tt-transformer       | 0.455 | 4/6      | 1.353            | 60.4   |
| tt-transformer       | 0.472 | 3/6      | 1.243            | 45.28  |
| tt-transformer       | 0.450 | 2/6      | 1.150            | 43.97  |
| tt-transformer       | 0.369 | 1/6      | 1.07             | 34.63  |
| tucker               | 0.447 | 6/6      | 1.64             | 43.83  |

Number of parameters in the Tucker model: 39,062,062.

TT with max_rank = 2: the number of parameters in a compressed fc layer is 416.

| Method            | BLEU  | Compression Rate | Time  |
|-------------------|-------|------------------|-------|
| original model    | 0.442 | 1.0              | 41.38 |
| all tt2           | 0.434 | 1.65             | 76.89 |
| 5/6 tt2           | 0.489 | 1.485            | 75.24 |
| 3/6 tt2           | 0.412 | 1.243            | 71.36 |
| 1/6 tt2           | 0.414 | 1.069            | 40.97 |
| small-transformer | 0.403 | 1.6              | 31.28 |
| tt4               | 0.407 | 1.64             | 84.56 |
| 1/2 tt4           | 0.472 | 1.24             | 45.28 |
| Tucker            | 0.447 | 1.64             | 43.83 |

We use the TT decomposition for every fc layer in the encoder and decoder in the following way:

  • ranks of the first layer: 2x4x4x2
  • ranks of the second layer: 4x4x4x4 (the boundary ranks of 1 are omitted)

The compression ratio is the ratio of #(original parameters) to #(parameters in the compressed network); the sketch below illustrates it for a single fc layer.
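The sketch below (an illustration under the assumptions above, not the repository's implementation) builds random TT-cores with these ranks for the first fc layer, contracts them back into the dense 512x2048 matrix, and computes the resulting per-layer compression ratio. Whole-model ratios are much smaller because self-attention and embeddings are left uncompressed.

import math
import torch

in_modes, out_modes = [2, 4, 8, 4, 2], [4, 4, 8, 4, 4]
ranks = [1, 2, 4, 4, 2, 1]    # 2x4x4x2 plus the boundary ranks of 1

# TT-cores with the layout (r_{k-1}, m_k, n_k, r_k) from [2].
cores = [torch.randn(ranks[k], in_modes[k], out_modes[k], ranks[k + 1])
         for k in range(len(in_modes))]

# Contract the cores back into the dense weight. This is done here only to show
# the equivalence; in practice the reshaped input is multiplied by the cores directly.
t = cores[0]
for core in cores[1:]:
    t = torch.einsum('...a,abcd->...bcd', t, core)
t = t.squeeze(0).squeeze(-1)                     # (m1, n1, m2, n2, ..., m5, n5)
d = len(in_modes)
t = t.permute(*range(0, 2 * d, 2), *range(1, 2 * d, 2)).contiguous()
W = t.reshape(math.prod(in_modes), math.prod(out_modes))    # (512, 2048)

tt_params = sum(c.numel() for c in cores)        # 1312 for these ranks
ratio = (math.prod(in_modes) * math.prod(out_modes)) / tt_params
print(W.shape, ratio)                            # roughly 800x for this single layer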

IWSLT14

Note: for the experiments with IWSLT we use a batch size of 256 and 4 GPUs with at least 10 GB of memory each.

Interestingly, the network with fewer parameters can be trained more efficiently with a small batch size.

| Model        | BLEU (valid) | BLEU (test) | Time   |
|--------------|--------------|-------------|--------|
| transformer  | 0.165        | 0.1363      | 146.16 |
| compressed 5 | 0.212        | 0.197       | 237.97 |
| compressed 3 | 0.212        | 0.192       | 187.35 |

TT and Tucker are used with rank = 2.

| Model        | BLEU  | Time   |
|--------------|-------|--------|
| transformer  | 0.291 | 154.23 |
| compressed 6 | 0.292 | 241.24 |
| compressed 3 | 0.297 | 180.81 |
| tucker       | 0.283 | 198.3  |

*Time refers to inference time.


Results for small batch sizes are shown in the figures in the repository.

MNIST experiments

| Method                     | Test Accuracy | Compression Rate |
|----------------------------|---------------|------------------|
| Original Model             | 0.9770        | 1.00             |
| Magnitude Pruning          | 0.9546        | 8.15             |
| SVD                        | 0.9669        | 7.54             |
| SVD + Fine-Tuning          | 0.9641        | 12.68            |
| Variational Dropout        | 0.9841        | 23.39            |
| Tucker                     | 0.9247        | 28.82            |
| Tensor Train               | 0.8507        | 72.02            |
| Tensor Train + Fine-Tuning | 0.9620        | 72.02            |
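For completeness, the magnitude-pruning baseline from the table can be sketched as follows. The helper magnitude_prune_ and the sparsity level are hypothetical; this only zeroes weights, and realizing the reported compression rate additionally requires storing the pruned weights in a sparse format (and, typically, fine-tuning).

import torch
import torch.nn as nn

def magnitude_prune_(layer: nn.Linear, sparsity: float) -> None:
    # Zero out the given fraction of smallest-magnitude weights in place.
    w = layer.weight.data
    k = max(1, int(sparsity * w.numel()))
    threshold = w.abs().flatten().kthvalue(k).values
    w[w.abs() <= threshold] = 0.0

fc = nn.Linear(784, 512)
magnitude_prune_(fc, sparsity=0.9)    # keep roughly 10% of the weights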

Feedback

If you have ideas or find mistakes, please leave a note.

TODO

-coming soon

Citation

Please cite this repository if it was useful for your research:

@misc{compressed_transformer,
  title={Compressed Transformer},
  author={Khakhulin, Taras and Saparina, Irina and Shevchenko, Aleksandr and Konobeev, Michael},
  year={2019},
  publisher={Github},
  journal={GitHub repository},
  howpublished={\url{https://github.com/khakhulin/compressed-transformer}},
}

References

  1. Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
  2. Novikov, Alexander, et al. "Tensorizing neural networks." Advances in Neural Information Processing Systems. 2015.
  3. Rush, Alexander, et al. "The Annotated Transformer." NLP Open Source Software, 2018.