💃DiscoBERT: Discourse-Aware Neural Extractive Text Summarization

Code repository for the ACL 2020 paper "Discourse-Aware Neural Extractive Text Summarization".

Authors: Jiacheng Xu (University of Texas at Austin), Zhe Gan, Yu Cheng, and Jingjing Liu (Microsoft Dynamics 365 AI Research).

Contact: jcxu at cs dot utexas dot edu


EDU Segmentation & Parsing

Here is an example of discourse segmentation and RST tree conversion.

Construction of Graphs: An Example

The proposed discourse-aware model selects EDUs 1-1, 2-1, 5-2, 20-1, 20-3, 22-1. The right side of the figure illustrates the two discourse graphs we use: (1) Coref(erence) Graph (with the mentions of `Pulitzer prizes' highlighted as examples); and (2) RST Graph (induced by RST discourse trees).
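As a rough illustration of the Coref Graph idea, the sketch below connects every pair of EDUs that contain mentions from the same coreference cluster. It is a simplified sketch and not the repository's implementation: the function name `build_coref_edges`, the cluster names, and the assignment of mentions to EDUs are all invented for illustration. Only the "sentence-EDU" id pattern (such as 1-1 or 20-3) follows the figure.

```python
from itertools import combinations


def build_coref_edges(clusters):
    """Return undirected edges between EDUs that share a coreference cluster.

    `clusters` maps a (hypothetical) cluster name to the ids of the EDUs
    that contain a mention belonging to that cluster.
    """
    edges = set()
    for edu_ids in clusters.values():
        # Connect every pair of distinct EDUs mentioning the same entity.
        for a, b in combinations(sorted(set(edu_ids)), 2):
            edges.add((a, b))
    return edges


# Toy input: which EDUs mention which entity is made up for this example.
clusters = {
    "pulitzer_prizes": ["1-1", "2-1", "20-3"],
    "the_author": ["5-2", "22-1"],
}
coref_edges = build_coref_edges(clusters)
for edge in sorted(coref_edges):
    print(edge)
```

The resulting edge set could then be handed to a graph library such as DGL; the RST Graph would be built analogously from the dependencies induced by the discourse tree.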


The code is based on AllenNLP (v0.9) and is developed with Python 3 and pytorch>=1.0. For more requirements, please check requirements.txt.

Preprocessed Dataset & Model Archive

The preprocessed CNNDM dataset, the pre-trained CNNDM model with discourse and coreference graphs, and the pre-trained NYT model with discourse and coreference graphs are provided in

The split of NYT is provided at data_preparation/urls_nyt/mapping_{train, valid, test}.txt.
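The brace notation above stands for three separate files. A minimal sketch for locating them, assuming the script runs from the repository root (the helper name `nyt_mapping_paths` is hypothetical, and the format of the files' contents is not assumed here):

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")


def nyt_mapping_paths(repo_root="."):
    """Map each split name to its mapping file under data_preparation/urls_nyt/."""
    base = Path(repo_root) / "data_preparation" / "urls_nyt"
    return {split: base / f"mapping_{split}.txt" for split in SPLITS}


paths = nyt_mapping_paths()
for split, path in paths.items():
    # Report whether each file is present in the current checkout.
    status = "found" if path.exists() else "missing"
    print(f"{split}: {path} ({status})")
```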


The model framework (training, evaluation, etc.) is based on AllenNLP (v0.9). For the usage of most framework-related hyper-parameters, such as batch size, CUDA device, and number of samples per epoch, please refer to the AllenNLP documentation.

Here are some model-related hyper-parameters:

| Hyper-parameter | Value | Usage |
|---|---|---|
| use_disco | bool | Whether to use EDUs as the selection unit. If not, sentences are used instead. |
| trigram_block | bool | Whether to use trigram blocking. |
| min_pred_unit & max_pred_unit | int | The minimum and maximum number of units (either EDUs or sentences) to choose during inference. The typical values for selecting EDUs on CNNDM and NYT are [5,8) and [5,8), respectively. |
| use_disco_graph | bool | Whether to use the discourse graph for graph encoding. |
| use_coref | bool | Whether to use the coreference mention graph for graph encoding. |
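To make the interplay of trigram_block, min_pred_unit, and max_pred_unit concrete, here is a minimal sketch of greedy unit selection with trigram blocking. It is not the repository's inference code. It assumes that the unit count is bounded by the half-open range [min_pred_unit, max_pred_unit), that units are whitespace-tokenized, and that blocked units are used as backfill when blocking leaves fewer than min_pred_unit units. The backfill rule and the function names are assumptions made for illustration.

```python
def trigrams(tokens):
    """Return the set of consecutive token triples in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_units(units, scores, min_pred_unit=5, max_pred_unit=8, trigram_block=True):
    """Greedily pick the highest-scoring units (EDUs or sentences).

    Returns the chosen unit indices in document order.
    """
    order = sorted(range(len(units)), key=lambda i: scores[i], reverse=True)
    chosen, blocked, seen = [], [], set()
    for i in order:
        if len(chosen) >= max_pred_unit - 1:  # upper bound is exclusive
            break
        grams = trigrams(units[i].split())
        if trigram_block and grams & seen:
            blocked.append(i)  # shares a trigram with an already selected unit
            continue
        chosen.append(i)
        seen |= grams
    # Assumption: backfill with the best blocked units if too few survived.
    for i in blocked:
        if len(chosen) >= min_pred_unit:
            break
        chosen.append(i)
    return sorted(chosen)


units = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "dogs bark loudly at night",
    "birds sing",
]
scores = [0.9, 0.8, 0.7, 0.6]
print(select_units(units, scores, min_pred_unit=2, max_pred_unit=4))
print(select_units(units, scores, min_pred_unit=2, max_pred_unit=4, trigram_block=False))
```

With blocking enabled, the second unit is skipped because it repeats the trigram "the cat sat", so a lower-scoring but non-redundant unit takes its place.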


  • The hyper-parameters for the BERT encoder are almost the same as the configuration from PreSumm.
  • We inflate the number of predicted units for EDU-based models because EDUs are generally shorter than sentences. For CNNDM, we found that picking 5 EDUs yields the best ROUGE F-1 score, whereas the sentence-based model picks four sentences.
  • We hardcoded some of the vector dimensions to 768 because we use the bert-base-uncased model.
  • We tried roberta-base in place of the bert-base-uncased used in this code repo and the paper, but empirically it didn't perform better in our preliminary experiments.
  • The maximum document length is set to 768 BPEs, although we found that max_len=768 doesn't bring a significant gain over max_len=512.

To train or modify a model, there are several files to start with.

  • model/ is the model file. There are some unused conditions and hyper-parameters starting with "semantic_red"; you should ignore them.
  • configs/DiscoBERT.jsonnet is the configuration file read by the AllenNLP framework. The pre-trained model section provides the configuration files for reference. We adopted most of the hyper-parameters from PreSumm.
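One way to experiment with the model-related hyper-parameters without editing the configuration file is AllenNLP's `--overrides` option. The sketch below only assembles and prints such a command. It assumes the standard `allennlp train` entry point is used, that the hyper-parameters sit under the "model" key of the config, and that the custom classes are importable as the package `model`. None of these is confirmed by this README, and the output directory is a placeholder.

```python
import json
import shlex

# Assumption: these keys live under "model" in configs/DiscoBERT.jsonnet;
# check the file itself before relying on this layout.
overrides = {
    "model": {
        "use_disco": True,
        "trigram_block": True,
        "min_pred_unit": 5,
        "max_pred_unit": 8,
        "use_disco_graph": True,
        "use_coref": True,
    }
}

command = [
    "allennlp", "train", "configs/DiscoBERT.jsonnet",
    "--serialization-dir", "output/discobert_run",  # placeholder directory
    "--include-package", "model",                   # assumed package name
    "--overrides", json.dumps(overrides),
]

# Print a shell-safe version of the command instead of executing it.
print(" ".join(shlex.quote(part) for part in command))
```

Printing the command first makes it easy to inspect the JSON override string before launching a long training run.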

Here is a quick reference about our model performance based on bert-base-uncased.

CNNDM:

| Model | R1/R2/RL |
|---|---|
| DiscoBERT | 43.38/20.44/40.21 |
| DiscoBERT w. RST and Coref Graphs | 43.77/20.85/40.67 |

NYT:

| Model | R1/R2/RL |
|---|---|
| DiscoBERT | 49.78/30.30/42.44 |
| DiscoBERT w. RST and Coref Graphs | 50.00/30.38/42.70 |


    title = {Discourse-Aware Neural Extractive Text Summarization},
    author = {Xu, Jiacheng and Gan, Zhe and Cheng, Yu and Liu, Jingjing},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = {2020},
    publisher = "Association for Computational Linguistics"


  • The data preprocessing (dataset handler, oracle creation, etc.) is partially based on PreSumm by Yang Liu and Mirella Lapata.
  • Data preprocessing (tokenization, sentence splitting, coreference resolution, etc.) used CoreNLP.
  • RST discourse segmentation is generated with NeuEDUSeg. I slightly modified the code to run with GPU. Please check my modification here.
  • RST discourse parsing is generated with DPLP. My customized version is here, featuring batch implementation and remaining file detection. Empirically, I found that NeuEDUSeg provided better segmentation output than DPLP, so we use NeuEDUSeg for segmentation and DPLP for parsing.
  • The implementation of the graph module is based on DGL.

