
# Are Sixteen Heads Really Better than One?

This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than One?.


First, you will need python >=3.6 with pytorch>=1.0. Then, clone our forks of fairseq (for MT experiments) and pytorch-pretrained-BERT (for BERT):

```sh
# Fairseq
git clone
# Pytorch pretrained BERT
git clone
cd pytorch-pretrained-BERT
git checkout paul
cd ..
```

You will also need sacrebleu to evaluate BLEU scores (`pip install sacrebleu`).
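The version requirements above can be checked programmatically before cloning anything. A minimal sketch; the `meets_min_version` helper is illustrative and not part of this repository:

```python
import sys

def meets_min_version(version_str, minimum):
    """Compare a dotted version string (e.g. "1.1.0") against a (major, minor) tuple."""
    parts = []
    for p in version_str.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts[:2]) >= minimum

# Python itself must be >= 3.6
assert sys.version_info >= (3, 6), "Python >= 3.6 is required"

# PyTorch is only checked if it is importable
try:
    import torch
    assert meets_min_version(torch.__version__, (1, 0)), "PyTorch >= 1.0 is required"
except ImportError:
    print("PyTorch is not installed yet")
```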

## Ablation experiments



```sh
bash experiments/BERT/ MNLI
```

This will fine-tune a pretrained BERT model on MNLI (stored in ./models/MNLI) and perform the individual head ablation experiment from Section 3.1 of the paper. Alternatively, you can run the experiment with CoLA, MRPC, or SST-2 as the task in place of MNLI.
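Conceptually, ablating a head means zeroing out that head's output before the attention layer's output projection. A minimal NumPy sketch of the idea; the function name and shapes are illustrative, not the repo's actual implementation:

```python
import numpy as np

def ablate_heads(head_outputs, heads_to_ablate):
    """Zero out the outputs of selected attention heads.

    head_outputs: array of shape (num_heads, seq_len, head_dim), i.e. the
    per-head outputs before they are concatenated and projected.
    """
    out = head_outputs.copy()
    out[list(heads_to_ablate)] = 0.0
    return out

# Example: 4 heads, sequence length 3, head dimension 2; ablate heads 1 and 3
x = np.ones((4, 3, 2))
y = ablate_heads(x, heads_to_ablate=[1, 3])
```

After ablation, the model is evaluated on the task as usual; the drop (or lack thereof) in accuracy measures how much that head mattered.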


You can obtain the pretrained WMT model from the link in the fairseq repo. Use the Moses tokenizer and subword-nmt in conjunction with the BPE codes provided with the pretrained model to prepare any input file you want. Then run:


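Assuming a standard Moses + subword-nmt setup, preparing an input file might look like the following sketch; the path to the Moses scripts and the name of the BPE codes file are placeholders you will need to adapt:

```sh
# Tokenize with the Moses tokenizer (path to mosesdecoder is a placeholder)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < input.txt > input.tok

# Apply the BPE codes shipped with the pretrained model
subword-nmt apply-bpe -c bpecodes < input.tok > input.bpe
```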
## Systematic Pruning Experiments


To iteratively prune heads, 10% at a time, in order of increasing importance, run:

```sh
bash experiments/BERT/ MNLI --normalize_pruning_by_layer
```

This will reuse the fine-tuned BERT model if you have already run the ablation experiment (otherwise it will fine-tune one for you). The output is very verbose, but you can get the gist of the results by running `grep "strategy\|results" -A1` on it.
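The pruning schedule itself is simple to sketch: at each iteration, re-estimate head importance and remove the lowest-scoring 10% of heads. A hedged Python sketch, where the function and parameter names are illustrative and `estimate_importance` stands in for the paper's gradient-based importance proxy:

```python
def prune_iteratively(head_ids, estimate_importance, prune_fraction=0.1):
    """Repeatedly remove the least important heads.

    estimate_importance: callable mapping the list of remaining head ids to
    a dict {head_id: importance score}; treated as an opaque black box here.
    Returns the heads in the order they were pruned (least important first).
    """
    remaining = list(head_ids)
    step = max(1, int(len(head_ids) * prune_fraction))
    pruned = []
    while remaining:
        scores = estimate_importance(remaining)
        # Prune the `step` heads with the lowest current importance
        batch = sorted(remaining, key=lambda h: scores[h])[:step]
        for h in batch:
            remaining.remove(h)
        pruned.extend(batch)
    return pruned
```

With a fixed importance function this reduces to sorting heads by score; the point of the iterative scheme is that importance is re-estimated after each pruning step, since remaining heads can change in importance once others are removed.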


Similarly, just run:


You might want to change the paths in the experiment files to point to the binarized fairseq dataset on which you want to estimate importance scores.
