Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

## Extractive Summarization on CNN/DM Dataset using Transformer Version of BertSum


### Summary

This notebook demonstrates how to fine-tune Transformers for extractive text summarization. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.

BertSum refers to [Fine-tune BERT for Extractive Summarization](https://arxiv.org/pdf/1903.10318.pdf) and its [published example](https://github.com/nlpyang/BertSum/). The Transformer version of BertSum refers to our modification of BertSum; its source code is available at https://github.com/daden-ms/BertSum/.

Extractive summarization is usually used for document summarization, where each input document consists of multiple sentences. Preprocessing the training data involves assigning a label of 0 or 1 to each document sentence based on the given summary. The summarization problem is thereby simplified to classifying whether each document sentence should be included in the summary.
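As a rough illustration of how sentences can be labeled from a reference summary, the sketch below greedily adds the sentence that most improves a unigram-overlap F1 score and stops when no sentence helps. It is a simplified stand-in, not the repo's implementation: `unigram_f1`, `greedy_labels`, and `max_selected` are hypothetical names, and whitespace tokenization replaces the real tokenizers.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """F1 over unigram overlap, a simplified stand-in for ROUGE-1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(doc_sentences, summary, max_selected=3):
    """Return one 0/1 label per document sentence (hypothetical helper)."""
    reference = summary.lower().split()
    tokenized = [s.lower().split() for s in doc_sentences]
    selected, best_score = [], 0.0
    while len(selected) < max_selected:
        best_idx = None
        for i, tokens in enumerate(tokenized):
            if i in selected:
                continue
            # Score the already selected sentences plus this candidate.
            candidate = [t for j in selected + [i] for t in tokenized[j]]
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(doc_sentences))]


doc = [
    "the cat sat on the mat",
    "stocks fell sharply on monday",
    "the dog barked at the mailman",
]
labels = greedy_labels(doc, "stocks fell on monday")
print(labels)
```

Only the second sentence overlaps strongly with the summary, so it is the only one labeled 1; adding either of the others would lower the score.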

The figure below illustrates how BERTSum can be fine-tuned for the extractive summarization task. A [CLS] token is inserted at the beginning of each sentence and a [SEP] token at the end. Interval segment embeddings and positional embeddings are added to the token embeddings before they are fed into the BERT model. The [CLS] token representation is used as the sentence embedding, and only the [CLS] representations are passed to the summarization layer. The summarization layer predicts, for each sentence, the probability that it should be included in the summary. Techniques like trigram blocking can be used to improve model accuracy.

<img src="https://nlpbp.blob.core.windows.net/images/BertSum.PNG">
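Trigram blocking, mentioned above, can be sketched as follows: sentences are visited from highest to lowest score, and a sentence is skipped if it shares any trigram with the sentences already chosen. This is a minimal sketch under stated assumptions: the function names and `max_sentences` are hypothetical, tokenization is plain whitespace splitting, and returning the result in document order is a choice made here for readability.

```python
def get_trigrams(sentence):
    """Return the set of word trigrams in a sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(sentences, scores, max_sentences=3):
    """Pick top-scoring sentences, skipping any that repeat a chosen trigram."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, seen = [], set()
    for idx in ranked:
        trigrams = get_trigrams(sentences[idx])
        if trigrams & seen:  # overlaps with the summary built so far
            continue
        chosen.append(idx)
        seen |= trigrams
        if len(chosen) == max_sentences:
            break
    return sorted(chosen)  # document order


sentences = [
    "the president spoke to reporters today",
    "the president spoke to the press",
    "markets closed higher on friday",
    "rain is expected this weekend",
]
scores = [0.9, 0.8, 0.7, 0.1]
picked = select_with_trigram_blocking(sentences, scores, max_sentences=2)
print(picked)
```

The second sentence has the second-highest score but repeats "the president spoke", so the third sentence is selected instead.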


### Before You Start

The running time shown in this notebook is on a Standard_NC24s_v3 Azure Deep Learning Virtual Machine with 4 NVIDIA Tesla V100 GPUs. 
> **Tip**: If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data and a smaller number of epochs. 

The table below provides reference running times on different machine configurations.

|QUICK_RUN|USE_PREPROCESSED_DATA|encoder|Machine Configuration|Running time|
|:---------|:---------|:---------|:----------------------|:------------|
|True|True|baseline|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 20 minutes |
|False|True|baseline|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 60 minutes |
|True|False|baseline|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 20 minutes |
|True|True|transformer|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 80 minutes |
|False|True|transformer|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 6.5 hours |
|True|False|transformer|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| ~ 80 minutes |
|False|False|any|1 NVIDIA Tesla V100 GPU, 16GB GPU memory| > 24 hours |

In [1]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = True
USE_PREPROCESSED_DATA = True

### Configuration

Before running the notebook, set the following environment variables so that the GPUs on your machine are accessible.

In [2]:
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

In [3]:
%load_ext autoreload

In [4]:
%autoreload 2

In [5]:
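import sys
import os

# Make the repo root importable so that utils_nlp can be found.
nlp_path = os.path.abspath("../../")
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)
sys.path.insert(0, "./")
# Machine-specific path from the author's environment; point it at your local BertSum checkout.
sys.path.insert(0, "/dadendev/nlp/examples/text_summarization/BertSum/")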

In [6]:
print(sys.path)

['/dadendev/nlp/examples/text_summarization/BertSum/', './', '/dadendev/nlp', '/dadendev/anaconda3/envs/cm3/lib/python36.zip', '/dadendev/anaconda3/envs/cm3/lib/python3.6', '/dadendev/anaconda3/envs/cm3/lib/python3.6/lib-dynload', '', '/home/daden/.local/lib/python3.6/site-packages', '/dadendev/anaconda3/envs/cm3/lib/python3.6/site-packages', '/dadendev/anaconda3/envs/cm3/lib/python3.6/site-packages/pyrouge-0.1.3-py3.6.egg', '/dadendev/anaconda3/envs/cm3/lib/python3.6/site-packages/IPython/extensions', '/home/daden/.ipython']


Also, we need to install the dependencies for pyrouge.

In [7]:
# dependencies for ROUGE-1.5.5.pl
!sudo apt-get update
!sudo apt-get install expat
!sudo apt-get install libexpat-dev -y

Hit:1 http://azure.archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://azure.archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://azure.archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:4 https://packages.microsoft.com/repos/microsoft-ubuntu-xenial-prod xenial InRelease
Get:5 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]    
Ign:6 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:7 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Fetched 252 kB in 1s (392 kB/s)                              
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
expat is already the newest version (2.2.5-3ubuntu0.2).
The following packages were automatically installed and are no longer required:
  linux-azure-cloud-tools-5.0.0-1018 linux-azure-cloud-tools-5.0.0-1020
  linux-azure-headers-5.0.0-101

Run the following commands in your terminal to install the prerequisites for pyrouge.
1. sudo cpan install XML::Parser
1. sudo cpan install XML::Parser::PerlSAX
1. sudo cpan install XML::DOM

Download ROUGE-1.5.5 from https://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-1.5.5.
Run the following command in your terminal.
* pyrouge_set_rouge_path $ABSOLUTE_DIRECTORY_TO_ROUGE-1.5.5.pl

### Data Preprocessing

The dataset used in this notebook is the CNN/DM dataset, which contains the documents and accompanying questions from the news articles of CNN and Daily Mail. The highlights in each article are used as the summary. The dataset consists of ~289K training examples, ~11K validation examples, and ~11K test examples. You can choose [Option 1] below to preprocess the data or [Option 2] to use the preprocessed version from the [BERTSum published example](https://github.com/nlpyang/BertSum/). You don't need to manually download either dataset, as the code below handles this. Since preprocessing the training data takes up to 28 hours on 10 Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, we suggest you first continue with USE_PREPROCESSED_DATA set to True and experiment with data preprocessing with QUICK_RUN set to True.

##### Details of Data Preprocessing

The purpose of preprocessing is to convert the input articles into the format that BertSum takes. The functions in harvardnlp_cnndm_preprocess are specific to the CNN/DM dataset as processed by harvardnlp. However, they provide a skeleton for preprocessing data into the BertSum format. Assuming you have all articles and target summaries each in a file, separated by line breaks, the steps to preprocess the data are:
1. sentence tokenization
2. word tokenization
3. label the sentences in the article, with 1 meaning the sentence is selected and 0 meaning it is not. The options for the selection algorithm are "greedy" and "combination"
4. convert each example to BertSum format
    - filter the sentences in the example based on the min_src_ntokens argument. If the number of remaining sentences is less than min_nsents, the example is discarded.
    - truncate the sentences in the example if the length is greater than max_src_ntokens
    - truncate the sentences in the example and the labels if the total number of sentences is greater than max_nsents
    - [CLS] and [SEP] are inserted before and after each sentence
    - wordPiece tokenization
    - truncate the example to 512 tokens
    - convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary.
    - segment ids are generated
    - [CLS] token positions are logged
    - the [CLS] token labels are truncated if the example is longer than 512 tokens, which is the maximum input length that can be taken by the BERT model.
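
The format-conversion step above can be sketched in a few lines. This is a simplified illustration, not the repo's code: `to_bertsum_format` is a hypothetical helper, the input is assumed to be already tokenized, and WordPiece tokenization and the token-to-index lookup are omitted. It shows how [CLS]/[SEP] insertion, truncation, interval segment ids, [CLS] positions, and label truncation fit together.

```python
def to_bertsum_format(sentences, labels, max_tokens=512, max_nsents=100):
    """sentences: list of token lists; labels: one 0/1 per sentence."""
    # Cap the number of sentences (and their labels).
    sentences, labels = sentences[:max_nsents], labels[:max_nsents]

    # Wrap every sentence in [CLS] ... [SEP], then truncate the whole example.
    tokens = []
    for sent in sentences:
        tokens += ["[CLS]"] + sent + ["[SEP]"]
    tokens = tokens[:max_tokens]

    # Interval segment ids alternate 0/1 per sentence; record [CLS] positions.
    segs, clss, seg_id = [], [], -1
    for pos, tok in enumerate(tokens):
        if tok == "[CLS]":
            clss.append(pos)
            seg_id += 1
        segs.append(seg_id % 2)

    # Drop labels for sentences whose [CLS] token was truncated away.
    labels = labels[:len(clss)]
    return {"tokens": tokens, "segs": segs, "clss": clss, "labels": labels}


example = to_bertsum_format(
    [["hello", "world"], ["good", "morning", "all"], ["bye"]],
    labels=[1, 0, 1],
    max_tokens=9,  # tiny limit so the truncation is visible
)
print(example["clss"], example["segs"], example["labels"])
```

With the 9-token limit, the third sentence is cut off entirely, so only two [CLS] positions and two labels remain.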
    
    
Note that the original BERTSum paper uses Stanford CoreNLP for data preprocessing. Here we first show how to use the NLTK version, and then provide instructions for setting up Stanford CoreNLP along with code examples of how to use it.

##### [Option 1] Preprocess Data
The code in following cell will download the CNN/DM dataset listed at https://github.com/harvardnlp/sent-summary/.

In [8]:
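from utils_nlp.dataset.cnndm import CNNDMSummarization

# Use a small subset for a quick run. Without a fallback, top_n would be undefined
# when QUICK_RUN is False. Assumption: top_n=-1 selects the full dataset.
top_n = 100 if QUICK_RUN else -1

train_dataset, test_dataset = CNNDMSummarization(top_n=top_n)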

[nltk_data] Downloading package punkt to /home/daden/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
I1030 19:42:44.432886 140491949197120 file_utils.py:39] PyTorch version 1.2.0 available.
I1030 19:42:44.466703 140491949197120 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1030 19:42:44.483136 140491949197120 modeling.py:230] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1030 19:42:44.489104 140491949197120 utils.py:173] Opening tar file .data/cnndm.tar.gz.
I1030 19:42:44.490219 140491949197120 utils.py:181] .data/test.txt.src already extracted.
I1030 19:42:44.778111 140491949197120 utils.py:181] .data/test.txt.tgt.tagged already extracted.
I1030 19:42:44.804572 140491949197120 utils.py:181] .data/train.txt.src already extracted.
I1030 19:42:52.298821 140491949197120 utils.py:181] .data/train.txt.tgt.tagged already extracted.
I1030 19:42:52.910166 140491949

Preprocess the data and save the data to disk.

In [9]:
from utils_nlp.models.transformers.extractive_summarization import ExtSumProcessor

processor = ExtSumProcessor(model_name="distilbert-base-uncased")
ext_sum_train = processor.preprocess(train_dataset, train_dataset.get_target())
ext_sum_test = processor.preprocess(test_dataset, test_dataset.get_target())

I1030 19:42:54.824942 140491949197120 tokenization_utils.py:374] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


In [10]:
data_path = "./temp_data4/"

In [11]:
!mkdir -p $data_path

In [12]:
from utils_nlp.dataset.cnndm import CNNDMBertSumProcessedData

train_files = CNNDMBertSumProcessedData.save_data(
    ext_sum_train, is_test=False, path_and_prefix=data_path, chunk_size=25
)
test_files = CNNDMBertSumProcessedData.save_data(
    ext_sum_test, is_test=True, path_and_prefix=data_path, chunk_size=None
)

In [13]:
train_files

['./temp_data4/_0_train',
 './temp_data4/_1_train',
 './temp_data4/_2_train',
 './temp_data4/_3_train']

In [14]:
test_files

['./temp_data4/_0_test']

In [15]:
train_iter, test_iter = CNNDMBertSumProcessedData().splits(root=data_path)

##### [Option 2] Reuse Preprocessed Data

In [16]:
if USE_PREPROCESSED_DATA:
    data_path = "./temp_data5/"
    if not os.path.exists(data_path):
        os.mkdir(data_path)
    CNNDMBertSumProcessedData.download(local_path=data_path)
    train_iter, test_iter = CNNDMBertSumProcessedData().splits(root=data_path)
    

I1030 19:42:57.929424 140491949197120 utils.py:88] Downloading from Google Drive; may take a few minutes
I1030 19:42:58.700399 140491949197120 utils.py:60] File ./temp_data5/bertsum_data.zip already exists.


In [17]:
train_iter

<generator object get_dataset at 0x7fc616e03f68>

#### Inspect Data

In [18]:
train_files

['./temp_data4/_0_train',
 './temp_data4/_1_train',
 './temp_data4/_2_train',
 './temp_data4/_3_train']

In [19]:
import torch
bert_format_data = torch.load(train_files[0])
print(len(bert_format_data))
bert_format_data[0].keys()


25


dict_keys(['src', 'labels', 'segs', 'clss', 'src_txt', 'tgt_txt'])

In [20]:
bert_format_data[0]['labels']

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
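
To see which sentences the labels point to, the labels can be paired with the sentence texts. The sketch below assumes that `labels` aligns index by index with `src_txt` (if one list is shorter, `zip` stops at the shorter one); `selected_sentences` and the toy example are hypothetical and only mirror the dictionary keys shown above.

```python
def selected_sentences(example):
    """Return the source sentences whose label is 1."""
    return [
        sent
        for sent, label in zip(example["src_txt"], example["labels"])
        if label == 1
    ]


# Toy example with the same keys as the preprocessed data.
toy_example = {
    "src_txt": ["first sentence .", "second sentence .", "third sentence ."],
    "labels": [0, 1, 0],
}
picked = selected_sentences(toy_example)
print(picked)
```

On the real data, the same helper could be applied to `bert_format_data[0]`.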

### Model training
To start model training, we need to create an instance of ExtractiveSummarizer.
#### Choose the transformer model.
Currently ExtractiveSummarizer supports two models:
- distilbert-base-uncased
- bert-base-uncased

Potentially, RoBERTa-based models and XLNet can be supported, but this needs to be tested.
#### Choose the encoder algorithm.
There are four options:
- baseline: uses a smaller transformer model in place of the BERT model, with a transformer summarization layer
- classifier: uses pretrained BERT and fine-tunes it with a **simple logistic classification** summarization layer
- transformer: uses pretrained BERT and fine-tunes it with a **transformer** summarization layer
- RNN: uses pretrained BERT and fine-tunes it with an **LSTM** summarization layer
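
To make the "classifier" option concrete, the sketch below scores each sentence with a single logistic unit applied to its [CLS] vector, sigmoid(w · h + b), and zeroes out padded sentence slots with a mask. It uses plain Python lists for readability, whereas the notebook's models operate on PyTorch tensors; `classifier_layer` and the toy weights are hypothetical and not taken from the repo.

```python
import math


def classifier_layer(cls_vectors, weights, bias, mask):
    """Score each sentence with sigmoid(w . h + b); masked slots score 0."""
    scores = []
    for vector, keep in zip(cls_vectors, mask):
        logit = sum(w * h for w, h in zip(weights, vector)) + bias
        scores.append(keep / (1.0 + math.exp(-logit)))
    return scores


# Three sentence slots with 2-dimensional toy embeddings; the last one is padding.
cls_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
scores = classifier_layer(cls_vectors, weights=[2.0, -2.0], bias=0.0, mask=[1, 1, 0])
print(scores)
```

The transformer and RNN options replace this single unit with layers that let each sentence representation attend to, or be conditioned on, the other sentences before scoring.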

In [21]:
# notebook parameters
DATA_FOLDER = "./temp"
CACHE_DIR = "./temp"
DEVICE = "cuda"
BATCH_SIZE = 3000
NUM_GPUS = 1
encoder = "transformer"
model_name = "distilbert-base-uncased"

In [22]:
from utils_nlp.models.transformers.extractive_summarization import ExtractiveSummarizer
summarizer = ExtractiveSummarizer(model_name, encoder, CACHE_DIR)

I1030 19:43:17.238647 140491949197120 configuration_utils.py:151] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json from cache at ./temp/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.1ccd1a11c9ff276830e114ea477ea2407100f4a3be7bdc45d37be9e37fa71c7e
I1030 19:43:17.239963 140491949197120 configuration_utils.py:168] Model config {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": null,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "n_heads": 12,
  "n_layers": 6,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torchscript": false,
  "use_bfloat16": false,
  "vocab_size": 30522
}

I1030 19:43:17.363380 140491949197120 modeling_u

In [23]:
train_files

['./temp_data4/_0_train',
 './temp_data4/_1_train',
 './temp_data4/_2_train',
 './temp_data4/_3_train']

In [24]:
from utils_nlp.models.transformers.extractive_summarization import  get_dataset, get_dataloader

In [None]:
summarizer.fit(
    train_iter,
    device=DEVICE,
    batch_size=BATCH_SIZE,
    num_gpus=NUM_GPUS,
    gradient_accumulation_steps=2,
    max_steps=1e4,
    lr=2e-3,
    warmup_steps=1e4 * 0.5,
    verbose=True,
    report_every=100,
)

cuda
loss: 80.055855, time: 21.066953, examples number: 5.000000, step 100.000000 out of total 10000.000000
loss: 0.210719, time: 0.097609, examples number: 5.000000, step 100.000000 out of total 10000.000000
loss: 37.006704, time: 21.019811, examples number: 7.000000, step 200.000000 out of total 10000.000000
loss: 0.189547, time: 0.085794, examples number: 6.000000, step 200.000000 out of total 10000.000000
loss: 34.051729, time: 20.893883, examples number: 5.000000, step 300.000000 out of total 10000.000000
loss: 0.154256, time: 0.098439, examples number: 5.000000, step 300.000000 out of total 10000.000000
loss: 33.769536, time: 21.259850, examples number: 5.000000, step 400.000000 out of total 10000.000000
loss: 0.194510, time: 0.098374, examples number: 5.000000, step 400.000000 out of total 10000.000000
loss: 33.363134, time: 20.839859, examples number: 5.000000, step 500.000000 out of total 10000.000000
loss: 0.127341, time: 0.104356, examples number: 5.000000, step 500.000000 o

In [None]:
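# File name matches model_name ("distilbert-base-uncased") and the file loaded in the next cell.
torch.save(summarizer.model, "cnndm_transformersum_distilbert-base-uncased_bertsum_processed_data.pt")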

In [None]:
import torch
summarizer.model = torch.load("cnndm_transformersum_distilbert-base-uncased_bertsum_processed_data.pt")

### Model Evaluation

[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), or Recall-Oriented Understudy for Gisting Evaluation, is commonly used for evaluating text summarization.
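
As a rough intuition for what ROUGE measures, the sketch below computes ROUGE-1 style recall, precision, and F1 from clipped unigram overlap between a candidate and a reference summary. It is a simplified illustration only: `rouge_1` is a hypothetical helper, tokenization is whitespace splitting, and the official ROUGE-1.5.5 script used through pyrouge applies additional processing, so its numbers will differ.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap recall, precision, and F1 (simplified ROUGE-1)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Clipped overlap: each reference word can be matched at most as often as it occurs.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    recall = overlap / len(ref) if ref else 0.0
    precision = overlap / len(cand) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Five of the six reference words are matched here, so recall, precision, and F1 all equal 5/6.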

In [None]:
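import os

import torch
from utils_nlp.models.bert.extractive_text_summarization import get_data_iter

# Assumption: BERT_DATA_PATH is not defined earlier in this notebook. Here it is taken
# to be the folder holding the preprocessed BertSum files downloaded in [Option 2].
BERT_DATA_PATH = data_path

# Load the six test shards into a single list of examples.
test_dataset = []
for i in range(0, 6):
    filename = os.path.join(BERT_DATA_PATH, "cnndm.test.{0}.bert.pt".format(i))
    test_dataset.extend(torch.load(filename))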

In [None]:
prediction = summarizer.predict(get_data_iter(test_dataset),
                               device=DEVICE,)

In [None]:
target = [example["tgt_txt"] for example in test_dataset]

In [None]:
from utils_nlp.eval.evaluate_summarization import get_rouge
rouge_transformer = get_rouge(prediction, target, "./results/")

In [None]:
prediction[0]

In [None]:
test_dataset[0]['tgt_txt']