BERT and SpanBERT for Coreference Resolution

This repository contains code and models for the paper BERT for Coreference Resolution: Baselines and Analysis. It also includes the coreference resolution model from the paper SpanBERT: Improving Pre-training by Representing and Predicting Spans, which is the current state of the art on OntoNotes (79.6 F1). Please refer to the SpanBERT repository for other tasks.

The model architecture itself is an extension of the e2e-coref model.


Setup

  • Install python3 requirements: pip install -r requirements.txt
  • export data_dir=</path/to/data_dir>
  • ./setup_all.sh: this builds the custom kernels

Pretrained Coreference Models

Please download the following files to use the pretrained coreference models on your data. If you want to train your own coreference model, you can skip this step.

Model F1 (%)
BERT-base 73.9
SpanBERT-base 77.7
BERT-large 76.9
SpanBERT-large 79.6

./download_pretrained.sh <model_name> (e.g., bert_base, bert_large, spanbert_base, spanbert_large; assumes that $data_dir is set). This downloads BERT/SpanBERT models finetuned on OntoNotes. The original (non-finetuned) SpanBERT weights are available in the SpanBERT repository. You can use these models with evaluate.py and predict.py (see the section on Batched Prediction Instructions).

Training / Finetuning Instructions

  • Finetuning a BERT/SpanBERT large model on OntoNotes requires access to a 32GB GPU. You might be able to train the large model on a 16GB GPU with a smaller max_seq_length, max_training_sentences, ffnn_size, and model_heads = false; this will almost certainly lower performance on OntoNotes.
  • Running/testing a large pretrained model is still possible on a 16GB GPU. You should be able to finetune the base models on smaller GPUs.
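For example, a 16GB-friendly override in experiments.conf might look like the following sketch (HOCON syntax; the entry name and all values are illustrative, not tuned settings from the paper):

```
# Hypothetical config entry: inherits a large-model config and shrinks
# the memory-hungry knobs named above. Values are illustrative only.
train_spanbert_large_16gb = ${train_spanbert_large} {
  max_seq_length = 384        # down from 512
  max_training_sentences = 3  # fewer segments per training example
  ffnn_size = 1000            # smaller span-scoring FFNN
  model_heads = false
}
```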

Setup for training

This assumes access to OntoNotes 5.0. Run ./setup_training.sh <ontonotes/path/ontonotes-release-5.0> $data_dir. This preprocesses the OntoNotes corpus and downloads the original (not finetuned on OntoNotes) BERT models, which will be finetuned using train.py.

  • Experiment configurations are found in experiments.conf. Choose an experiment that you would like to run, e.g. bert_base
  • Note that configs without the prefix train_ load checkpoints already tuned on OntoNotes.
  • Training: GPU=0 python train.py <experiment>
  • Results are stored in the log_root directory (see experiments.conf) and can be viewed via TensorBoard.
  • Evaluation: GPU=0 python evaluate.py <experiment>. This currently evaluates on the dev set.

Batched Prediction Instructions

  • Create a file where each line is similar to cased_config_vocab/trial.jsonlines (make sure to strip newlines so that each line is well-formed JSON):
  {
  "clusters": [], # leave this empty
  "doc_key": "nw", # genre key closest to your domain; "nw" is newswire. See the OntoNotes documentation.
  "sentences": [["[CLS]", "subword1", "##subword1", ".", "[SEP]"]], # list of BERT-tokenized segments. Each segment should be shorter than the max_segment_len in your config
  "speakers": [["[SPL]", "-", "-", "-", "[SPL]"]], # speaker information for each subword in sentences
  "sentence_map": [0, 0, 0, 0, 0], # flat list where each element is the sentence index of the corresponding subword
  "subtoken_map": [0, 0, 0, 1, 1]  # flat list giving the original word index of each subword. [CLS] and the first word share the same index
  }
  • clusters should be left empty and is only used for evaluation purposes.
  • doc_key indicates the genre, which can be one of the following: "bc", "bn", "mz", "nw", "pt", "tc", "wb"
  • speakers indicates the speaker of each word. These can all be empty strings if there is only one known speaker.
  • Run GPU=0 python predict.py <experiment> <input_file> <output_file>, which outputs the input jsonlines with an additional key predicted_clusters.
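The input format above can also be built programmatically. A minimal sketch in Python (the subword split below is written out by hand to illustrate the format; in practice you would use a real BERT WordPiece tokenizer):

```python
import json

# Hand-tokenized toy document: "world" is split into "wor", "##ld"
# purely for illustration of the subword format.
subtokens = ["[CLS]", "Hello", "wor", "##ld", ".", "[SEP]"]
# subtoken_map: original-word index for each subword;
# [CLS] shares the first word's index, [SEP] the last word's.
subtoken_map = [0, 0, 1, 1, 2, 2]
# sentence_map: sentence index of each subword (single sentence here).
sentence_map = [0] * len(subtokens)

doc = {
    "clusters": [],                  # left empty for prediction
    "doc_key": "nw",                 # genre key closest to your domain
    "sentences": [subtokens],        # one BERT segment
    "speakers": [["[SPL]"] + ["-"] * (len(subtokens) - 2) + ["[SPL]"]],
    "sentence_map": sentence_map,
    "subtoken_map": subtoken_map,
}

# One well-formed JSON object per line, with no embedded newlines.
line = json.dumps(doc)
print(line)
```

The subtoken_map is what lets you map the predicted_clusters (which are spans over subword indices) back to the original words.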


  • The current config runs the Independent model.
  • When running on test, change the eval_path and conll_eval_path from dev to test.
  • The model_dir inside the log_root contains stdout.log. Check max_f1 after 57000 steps, for example: 2019-06-12 12:43:11,926 - INFO - __main__ - [57000] eval_f1=0.7694, max_f1=0.7697
  • You can also load PyTorch-based model files (ending in .pt) that share BERT's architecture.
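As a convenience, the best dev F1 can be pulled out of such a log line with a small helper (hypothetical, not part of this repository; the line format matches the example above):

```python
import re

# Example stdout.log line, copied from this README.
log_line = ("2019-06-12 12:43:11,926 - INFO - __main__ - "
            "[57000] eval_f1=0.7694, max_f1=0.7697")

def extract_max_f1(line):
    """Return the max_f1 value from a log line, or None if absent."""
    m = re.search(r"max_f1=([0-9.]+)", line)
    return float(m.group(1)) if m else None

print(extract_max_f1(log_line))  # 0.7697
```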

Important Config Keys

  • log_root: This is where all models and logs are stored. Check this before running anything.
  • bert_learning_rate: The learning rate for the BERT parameters. Typically, 1e-5 and 2e-5 work well.
  • task_learning_rate: The learning rate for the other parameters. Typically, LRs between 0.0001 and 0.0003 work well.
  • init_checkpoint: The checkpoint file from which BERT parameters are initialized. Both TF and PyTorch checkpoints work as long as they use the same BERT architecture. Use *.ckpt files for TF and *.pt for PyTorch.
  • max_segment_len: The maximum size of the BERT context window. Larger segments work better for SpanBERT while BERT suffers a sharp drop at 512.
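Put together, a config entry touching these keys might look like the following sketch (HOCON syntax as used in experiments.conf; the entry name and all values here are illustrative, not tuned):

```
# Hypothetical experiment entry; paths and values are placeholders.
my_experiment = ${best} {
  log_root = /path/to/logs               # check this before running anything
  bert_learning_rate = 1e-05
  task_learning_rate = 0.0002
  init_checkpoint = /path/to/model.ckpt  # *.ckpt for TF, *.pt for PyTorch
  max_segment_len = 384
}
```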


If you have access to a slurm GPU cluster, you can use the following set of commands for training.

  • python tune.py --generate_configs --data_dir <coref_data_dir>: This generates multiple configs for tuning (BERT and task) learning rates, embedding models, and max_segment_len. This modifies experiments.conf. Use --trial to print to stdout instead. If you need to generate these from scratch, refer to basic.conf.
  • grep "\{best\}" experiments.conf | cut -d = -f 1 > torun.txt: This creates a list of configs that can be used by the script to launch jobs. You can use a regexp to restrict the list of configs. For example, grep "\{best\}" experiments.conf | grep "sl512*" | cut -d = -f 1 > torun.txt will select configs with max_segment_len = 512.
  • python tune.py --data_dir <coref_data_dir> --run_jobs: This launches jobs from torun.txt on the slurm cluster.


Citations

If you use the pretrained BERT-based coreference model (or this implementation), please cite the paper BERT for Coreference Resolution: Baselines and Analysis.

    @inproceedings{joshi2019coref,
        title={{BERT} for Coreference Resolution: Baselines and Analysis},
        author={Mandar Joshi and Omer Levy and Daniel S. Weld and Luke Zettlemoyer},
        booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
        year={2019}
    }

Additionally, if you use the pretrained SpanBERT coreference model, please cite the paper, SpanBERT: Improving Pre-training by Representing and Predicting Spans.

    @article{joshi2019spanbert,
        title={{SpanBERT}: Improving Pre-training by Representing and Predicting Spans},
        author={Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy},
        journal={arXiv preprint arXiv:1907.10529},
        year={2019}
    }