
AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers

This repository contains the code, data, and pretrained models used in AutoMoE (pre-print). It builds on the Hardware Aware Transformer (HAT) repository.

AutoMoE Framework (figure)

AutoMoE Key Results

The following tables show the performance of AutoMoE versus baselines on standard machine translation benchmarks: WMT'14 En-De, WMT'14 En-Fr, and WMT'19 En-De.

| WMT'14 En-De        | Network         | # Active Params (M) | Sparsity (%) | FLOPs (G) | BLEU | GPU Hours |
|---------------------|-----------------|---------------------|--------------|-----------|------|-----------|
| Transformer         | Dense           | 176                 | 0            | 10.6      | 28.4 | 184       |
| Evolved Transformer | NAS over Dense  | 47                  | 0            | 2.9       | 28.2 | 2,192,000 |
| HAT                 | NAS over Dense  | 56                  | 0            | 3.5       | 28.2 | 264       |
| AutoMoE (6 Experts) | NAS over Sparse | 45                  | 62           | 2.9       | 28.2 | 224       |

| WMT'14 En-Fr         | Network         | # Active Params (M) | Sparsity (%) | FLOPs (G) | BLEU | GPU Hours |
|----------------------|-----------------|---------------------|--------------|-----------|------|-----------|
| Transformer          | Dense           | 176                 | 0            | 10.6      | 41.2 | 240       |
| Evolved Transformer  | NAS over Dense  | 175                 | 0            | 10.8      | 41.3 | 2,192,000 |
| HAT                  | NAS over Dense  | 57                  | 0            | 3.6       | 41.5 | 248       |
| AutoMoE (6 Experts)  | NAS over Sparse | 46                  | 72           | 2.9       | 41.6 | 236       |
| AutoMoE (16 Experts) | NAS over Sparse | 135                 | 65           | 3.0       | 41.9 | 236       |

| WMT'19 En-De         | Network         | # Active Params (M) | Sparsity (%) | FLOPs (G) | BLEU | GPU Hours |
|----------------------|-----------------|---------------------|--------------|-----------|------|-----------|
| Transformer          | Dense           | 176                 | 0            | 10.6      | 46.1 | 184       |
| HAT                  | NAS over Dense  | 63                  | 0            | 4.1       | 45.8 | 264       |
| AutoMoE (2 Experts)  | NAS over Sparse | 45                  | 41           | 2.8       | 45.5 | 248       |
| AutoMoE (16 Experts) | NAS over Sparse | 69                  | 81           | 3.2       | 45.9 | 248       |

Quick Setup

(1) Install

Run the following commands to install AutoMoE:

git clone https://github.com/UBC-NLP/AutoMoE.git
cd AutoMoE
pip install --editable .
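
As an optional sanity check (a sketch that assumes the package installs under the fairseq namespace, as in the upstream HAT/fairseq codebase this repository builds on), you can verify the editable install with:

# optional check; assumes the fairseq package namespace from the upstream HAT/fairseq code
python -c "import fairseq; print(fairseq.__version__)"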

(2) Prepare Data

Run the following command to download the preprocessed MT data:

bash configs/[task_name]/get_preprocessed.sh

where [task_name] can be wmt14.en-de, wmt14.en-fr, or wmt19.en-de.
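
For example, the preprocessed WMT'14 En-De data can be fetched with:

bash configs/wmt14.en-de/get_preprocessed.sh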

(3) Run full AutoMoE pipeline

Run the following commands to start the AutoMoE pipeline:

python generate_script.py --task wmt14.en-de --output_dir /tmp --num_gpus 4 --trial_run 0 --hardware_spec gpu_titanxp --max_experts 6 --frac_experts 1 > automoe.sh
bash automoe.sh

where:

  • task - MT dataset to use: wmt14.en-de, wmt14.en-fr, or wmt19.en-de (default: wmt14.en-de)
  • output_dir - Output directory for files generated during the experiment (default: /tmp)
  • num_gpus - Number of GPUs to use (default: 4)
  • trial_run - Whether to perform a trial run (useful for quickly checking that everything runs without errors): 0 (full run) or 1 (dry/trial run) (default: 0)
  • hardware_spec - Hardware specification: gpu_titanxp (for GPU) (default: gpu_titanxp)
  • max_experts - Maximum number of experts in the Supernet (default: 6)
  • frac_experts - Fractional experts (varying FFN intermediate sizes): 0 (standard experts) or 1 (fractional experts) (default: 1)
  • supernet_ckpt - Skip Supernet training by specifying a checkpoint from the pretrained models; see the example after this list (default: None)
  • latency_compute - Use (partially) gold or predictor-based latency (default: gold)
  • latiter - Number of latency measurements when using (partially) gold latency (default: 100)
  • latency_constraint - Latency constraint in milliseconds (default: 200)
  • evo_iter - Number of evolutionary search iterations (default: 10)
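
For instance, to reuse a pretrained Supernet checkpoint and search under a tighter latency budget, the pipeline could be generated as below. This is only a sketch combining the flags documented above; the checkpoint path is a placeholder for one of the pretrained models, and the 150 ms budget is an arbitrary choice:

python generate_script.py --task wmt14.en-fr --output_dir /tmp --num_gpus 4 --trial_run 0 --hardware_spec gpu_titanxp --max_experts 16 --frac_experts 1 --supernet_ckpt /path/to/pretrained_supernet.pt --latency_constraint 150 --evo_iter 10 > automoe.sh
bash automoe.sh

Setting --trial_run 1 in this or the earlier command is a quick way to verify the end-to-end setup before committing to a full run.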

Contact

If you have questions, contact Ganesh (ganeshjwhr@gmail.com), Subho (Subhabrata.Mukherjee@microsoft.com), and/or create a GitHub issue.

Citation

If you use this code, please cite:

@misc{jawahar2022automoe,
      title={AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers}, 
      author={Ganesh Jawahar and Subhabrata Mukherjee and Xiaodong Liu and Young Jin Kim and Muhammad Abdul-Mageed and Laks V. S. Lakshmanan and Ahmed Hassan Awadallah and Sebastien Bubeck and Jianfeng Gao},
      year={2022},
      eprint={2210.07535},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

See LICENSE.txt for license information.

Acknowledgements

This repository builds on the Hardware Aware Transformer (HAT) codebase.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
