Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

## Abstractive Summarization on CNN/DM Dataset using Transformers


### Summary

This notebook demonstrates how to fine-tune Transformer models such as [BART](https://arxiv.org/abs/1910.13461) and [T5](https://arxiv.org/abs/1910.10683) with HuggingFace's [transformers library](https://github.com/huggingface/transformers) for abstractive text summarization. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.




### Before You Start

Set QUICK_RUN = True to run the notebook on a small subset of the data with fewer training steps. With QUICK_RUN = True, the notebook takes about 5 minutes to run on a VM with one Tesla K80 GPU (12GB of GPU memory). With QUICK_RUN = False, it takes around 15 minutes for data preprocessing, 15 minutes for fine-tuning, and 3 hours to run evaluation on the whole CNN/DM test dataset.

### Additional Notes

* **ROUGE Evaluation**: To run ROUGE evaluation, please refer to the compute_rouge_perl section of [summarization_evaluation.ipynb](./summarization_evaluation.ipynb) for setup.

* **Distributed Training**:
Please note that a Jupyter notebook can only use PyTorch [DataParallel](https://pytorch.org/docs/master/nn.html#dataparallel). Higher speed and larger batch sizes can be achieved with PyTorch [DistributedDataParallel](https://pytorch.org/docs/master/notes/ddp.html) (DDP). The script [extractive_summarization_cnndm_distributed_train.py](./extractive_summarization_cnndm_distributed_train.py) shows an example of how to use DDP.



In [1]:
%load_ext autoreload

%autoreload 2
# Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of steps.
QUICK_RUN = False

### Configuration


In [2]:
import os
import shutil
import sys
from tempfile import TemporaryDirectory
import time
import torch

nlp_path = os.path.abspath("../../")
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp.dataset.cnndm import CNNDMSummarizationDataset
from utils_nlp.eval import compute_rouge_python, compute_rouge_perl
from utils_nlp.models.transformers.abstractive_summarization_bartt5 import (
    AbstractiveSummarizer)

from utils_nlp.models.transformers.datasets import SummarizationDataset
import nltk
from nltk import tokenize

import pandas as pd
import scrapbook as sb
import pprint
start_time = time.time()




### Configuration: choose the transformer model to be used

Several pretrained models have been made available by [Hugging Face](https://github.com/huggingface/transformers). For abstractive summarization, the following pretrained models are supported. 

In [3]:
pd.DataFrame({"model_name": AbstractiveSummarizer.list_supported_models()})

| | model_name |
|---|---|
| 0 | bart-large |
| 1 | bart-large-mnli |
| 2 | bart-large-cnn |
| 3 | bart-large-xsum |
| 4 | t5-small |
| 5 | t5-base |
| 6 | t5-large |
| 7 | t5-3b |
| 8 | t5-11b |


In [4]:
# Transformer model being used
# MODEL_NAME = "bart-large"
MODEL_NAME = "t5-small"
# notebook parameters
# the cache data path during fine-tuning
CACHE_DIR = "./t5_cache" #TemporaryDirectory().name
summarizer = AbstractiveSummarizer(MODEL_NAME, cache_dir=CACHE_DIR)

### Data Preprocessing

The dataset used in this notebook is the CNN/DM dataset, which contains documents and accompanying questions from news articles of CNN and the Daily Mail. The highlights of each article are used as its summary. The dataset consists of ~287K training examples, ~13K validation examples, and ~11K test examples. The code in the following cell downloads the CNN/DM dataset listed at https://github.com/harvardnlp/sent-summary/.


In [5]:
# the data path used to save the downloaded data file
DATA_PATH = "./bartt5_cnndm" #TemporaryDirectory().name
# The number of lines at the head of data file used for preprocessing. -1 means all the lines.
TOP_N = 100
if not QUICK_RUN:
    TOP_N = -1

In [6]:
train_dataset, test_dataset = CNNDMSummarizationDataset(top_n=TOP_N, local_cache_path=DATA_PATH, sent_split=False)

In [7]:
test_dataset[0]['tgt_txt']

"<t> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </t> <t> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </t> <t> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </t>\n"
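
The reference summary above wraps each highlight sentence in `<t>` and `</t>` markers. A minimal sketch, using only the standard library, of how such a string could be split into individual sentences. The helper name `split_highlights` is hypothetical and not part of `utils_nlp`, and the sample string is made up for illustration:

```python
import re

def split_highlights(tgt_txt):
    """Return the sentences enclosed in <t> ... </t> markers (hypothetical helper)."""
    # Non-greedy match so that each <t> ... </t> pair yields one sentence.
    matches = re.findall(r"<t>(.*?)</t>", tgt_txt, flags=re.DOTALL)
    return [sentence.strip() for sentence in matches]

# Made-up sample in the same format as test_dataset[0]['tgt_txt'].
sample = "<t> first highlight . </t> <t> second highlight . </t>\n"
sentences = split_highlights(sample)
print(sentences)
```

Later in the notebook the markers are simply removed with `str.replace` before scoring; splitting is only useful when the sentence boundaries themselves are needed.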

In [8]:
print(len(train_dataset))
print(len(test_dataset))

287227
11490


Preprocess the data.

In [15]:
from multiprocessing import Pool
def preprocess(summarizer, input_data_list, num_workers=50, chunk_size=100, internal_batch_size=5e3):
        """Preprocess the data for abstractive summarization.

        Args:
            summarizer (AbstractiveSummarizer): the summarizer whose processor
                encodes the examples.
            input_data_list (list of dict): input list where each item is
                a dictionary with string fields "src" and "tgt".
            num_workers (int, optional): The number of workers in the pool.
                Defaults to 50.
            chunk_size (int, optional): The number of examples handed to a
                worker at a time. Defaults to 100.
            internal_batch_size (int, optional): The number of examples encoded
                and saved per pass. Defaults to 5000. Reduce this number if you
                see a segmentation fault.

        Returns:
            list of dict: the encoded examples, with fields "source_ids",
                "source_mask" and "target_ids".
        """
        i = 0
        temp_dir = TemporaryDirectory().name
        os.makedirs(temp_dir, mode=0o777, exist_ok=False)
        temp_file = ".temp_preprocess"
        processed_length = 0
        result = []
        print(len(input_data_list))
        pool = Pool(num_workers, initializer=summarizer.processor.initializer)
        while processed_length < len(input_data_list):
            max_length = int(min(processed_length+internal_batch_size, len(input_data_list)))
            temp = []
            for j in range(processed_length, max_length):
                temp.append(input_data_list[j])
            result_generator = pool.imap(summarizer.processor.encode_example, temp, chunk_size)
            torch.save(list(result_generator), os.path.join(temp_dir, temp_file+str(i)))
            i += 1
            processed_length = max_length
            #print(processed_length)

        pool.close()
        pool.join()
        result = []
        total_batch_number = i
        for i in range(total_batch_number):
            result.extend(torch.load(os.path.join(temp_dir, temp_file+str(i))))
        if os.path.exists(temp_dir):
            shutil.rmtree(temp_dir, ignore_errors=True)
        return result
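
The function above walks the input in slices of at most `internal_batch_size` items, so each slice is encoded and saved to disk before the next one starts. The slicing arithmetic on its own can be sketched as follows; `batch_bounds` is a hypothetical helper used only for illustration:

```python
def batch_bounds(total, internal_batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    start = 0
    while start < total:
        # int(...) mirrors the function above, where the default 5e3 is a float.
        end = int(min(start + internal_batch_size, total))
        yield start, end
        start = end

# The 11,490 test examples with the default internal batch size of 5e3.
bounds = list(batch_bounds(11490, 5e3))
print(bounds)
```

With these numbers the test set is processed in three slices, the last one shorter than the others.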



In [10]:
%%time
# abs_sum_train = summarizer.processor.preprocess(train_dataset)
abs_sum_train = preprocess(summarizer, train_dataset)
# torch.save(abs_sum_train,  os.path.join(DATA_PATH, "train_{0}_full.pt".format(MODEL_NAME)))
# abs_sum_train = torch.load(os.path.join(DATA_PATH, "train_{0}_full.pt".format(MODEL_NAME)))

287227
CPU times: user 1min 43s, sys: 57.7 s, total: 2min 40s
Wall time: 13min 22s


In [11]:
# abs_sum_train = torch.load(os.path.join(DATA_PATH, "train_{0}_full.pt".format(MODEL_NAME)))

# torch.save(abs_sum_test,  os.path.join(DATA_PATH, "test_{0}_full.pt".format(MODEL_NAME)))
# abs_sum_test = torch.load(os.path.join(DATA_PATH, "test_{0}_full.pt".format(MODEL_NAME)))

In [16]:
%%time
# abs_sum_test= summarizer.processor.preprocess(test_dataset)
abs_sum_test= preprocess(summarizer, test_dataset)

11490
CPU times: user 4.37 s, sys: 5.82 s, total: 10.2 s
Wall time: 37.6 s


In [13]:
print(len(abs_sum_train))
print(len(abs_sum_test))

287227
11490


#### Inspect Data

In [14]:
abs_sum_train[0].keys()

dict_keys(['source_ids', 'source_mask', 'target_ids'])

In [15]:
abs_sum_train[0]

{'source_ids': tensor([21603,    10,  6005,  ...,     3,    31,     7]),
 'source_mask': tensor([1, 1, 1,  ..., 1, 1, 1]),
 'target_ids': tensor([19367,     3,  1092,    16, 11171,    16,  1337,  3690,    33,   629,
            26,    30,     8,    96, 11821,  1501,    96,  5191,     3,   849,
          1926,    90,    99,   348,   845,   167,    33,   132,    38,     3,
             9,   741,    13,    96,  1792,   179,  3110,   106,   725,    96,
           298,     3,    75,    29,    29,  8108,  3064,     3,     6,  1868,
         14314,     7,     3,    10,    96,     3,    23,   183,     8,   520,
            13,     8,  2753,    96,    90,    99,   348,   845,     8,   358,
            19,    73,  4998,    11,     3,    88,     3,    31,     7,  6237,
            21,   483,     3,     5,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,  

## Fine tune model
To start model fine-tuning, we need to specify the parameters as follows.

In [17]:
BATCH_SIZE_PER_GPU = 4
GRADIENT_ACCUMULATION_STEPS = 1
MAX_POS_LENGTH = 512

# GPU used for training
NUM_GPUS = torch.cuda.device_count()


# Learning rate
LEARNING_RATE=3e-5
MAX_GRAD_NORM=0.1

# How often the statistics reports show up in training, unit is step.
REPORT_EVERY=100
SAVE_EVERY=1000

# total number of steps for training
MAX_STEPS=100
# number of steps for warm up
WARMUP_STEPS=5e1
    
if not QUICK_RUN:
    MAX_STEPS=1000
    WARMUP_STEPS=5e2
    
# inference parameters
TEST_PER_GPU_BATCH_SIZE = 96
BEAM_SIZE = 3
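
As a rough illustration of how these settings interact, the sketch below computes the number of examples consumed per optimizer step and a warm-up learning rate. The linear warm-up followed by linear decay is an assumption about the schedule (a common choice when fine-tuning Transformers), not something this notebook confirms. Both `effective_batch_size` and `lr_at_step` are hypothetical helpers.

```python
def effective_batch_size(batch_size_per_gpu, num_gpus, grad_accum_steps):
    """Number of examples that contribute to one optimizer update."""
    return batch_size_per_gpu * num_gpus * grad_accum_steps

def lr_at_step(step, base_lr, warmup_steps, max_steps):
    """Assumed schedule: linear warm-up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(1, max_steps - warmup_steps)  # avoid division by zero
    return base_lr * max(0.0, (max_steps - step) / decay_steps)

# Values from the non-quick run, assuming a single GPU.
print(effective_batch_size(4, 1, 1))
for step in (0, 250, 500, 750, 1000):
    print(step, lr_at_step(step, 3e-5, 500, 1000))
```

Under the single-GPU assumption, four examples per step gives 400 examples per 100-step reporting interval.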
 

In [17]:

summarizer.fit(
            abs_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE_PER_GPU*NUM_GPUS,
            gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            max_grad_norm=MAX_GRAD_NORM,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
        )


timestamp: 21/05/2020 14:55:00, average loss: 2.950966, time duration: 91.118061,
                            number of examples in current reporting: 400, step 100
                            out of total 1000
timestamp: 21/05/2020 14:56:28, average loss: 2.496747, time duration: 87.725076,
                            number of examples in current reporting: 400, step 200
                            out of total 1000
timestamp: 21/05/2020 14:57:57, average loss: 2.232086, time duration: 88.453045,
                            number of examples in current reporting: 400, step 300
                            out of total 1000
timestamp: 21/05/2020 14:59:25, average loss: 2.104675, time duration: 88.361590,
                            number of examples in current reporting: 400, step 400
                            out of total 1000
timestamp: 21/05/2020 15:00:53, average loss: 1.996439, time duration: 88.524355,
                            number of examples in current reporting: 400, 

In [18]:
# save a finetuned model and load a previous saved model
"""
import torch
model_path = os.path.join(
        CACHE_DIR,
        "abssum_modelname_{0}_steps_{1}.pt".format(
            MODEL_NAME, MAX_STEPS
        ))
summarizer.save_model(global_step=MAX_STEPS, full_name=model_path)

summarizer = AbstractiveSummarizer(MODEL_NAME, cache_dir=CACHE_DIR)
summarizer.model.load_state_dict(torch.load(model_path, map_location="cpu")['model'])
"""


### Model Evaluation

[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), or Recall-Oriented Understudy for Gisting Evaluation, is commonly used for evaluating text summarization.
For the settings in this notebook with QUICK_RUN=False, you should get ROUGE scores close to the following numbers:

```
{'rouge-1': {'f': 0.3532833731474843,
             'p': 0.5062112092750258,
             'r': 0.2854026986121758},
 'rouge-2': {'f': 0.1627400891022247,
             'p': 0.23802173638805246,
             'r': 0.13034686738843493},
 'rouge-l': {'f': 0.2587374492685969,
             'p': 0.3710902340617733,
             'r': 0.20909466938819835}}
```
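
To make the `p`, `r`, and `f` entries concrete, here is a minimal sketch of ROUGE-1 on whitespace tokens: precision is the share of candidate unigrams that also occur in the reference, recall is the share of reference unigrams that the candidate recovers, and `f` is their harmonic mean. This is a simplified illustration. `compute_rouge_python` may tokenize and aggregate differently, and `rouge_1` is a hypothetical helper.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 on whitespace tokens."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy case the candidate is shorter than the reference, so precision is high and recall is lower, the same pattern as in the scores above.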

In [18]:
source = []
target = []

    
for i in test_dataset:
    source.append(i["src_txt"]) 
    target.append(i['tgt'].replace("<t>","").replace("</t>", "").replace("\n", "")) 

In [19]:
%%time
prediction = summarizer.predict(abs_sum_test, 
                                num_gpus=NUM_GPUS, 
                                batch_size=TEST_PER_GPU_BATCH_SIZE*NUM_GPUS,
                                min_length=24, 
                                max_length=48,
                                num_beams=BEAM_SIZE,
                                checkpoint="./t5_cache/fine_tuned/abssum_t5-small.pt")

Generating summary:   0%|          | 0/120 [00:00<?, ?it/s]

dataset length is 11490


Generating summary: 100%|██████████| 120/120 [3:22:25<00:00, 101.21s/it] 


CPU times: user 3h 21min 52s, sys: 25.4 s, total: 3h 22min 17s
Wall time: 3h 22min 28s
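
The progress bar total and its elapsed time above are consistent with the batch settings. A quick check, assuming a single GPU (so the batch size is `TEST_PER_GPU_BATCH_SIZE * 1 = 96`):

```python
import math

num_examples = 11490        # size of the test set
batch_size = 96             # TEST_PER_GPU_BATCH_SIZE * NUM_GPUS, assuming NUM_GPUS == 1
seconds_per_batch = 101.21  # rate reported by the progress bar above

num_batches = math.ceil(num_examples / batch_size)
total_seconds = int(num_batches * seconds_per_batch)
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(num_batches, f"{hours}h {minutes}m {seconds}s")
```

This gives 120 batches, matching the progress bar total, and a little over 3 hours and 22 minutes of generation time.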


In [20]:
%%time
prediction = summarizer.predict(abs_sum_test, 
                                num_gpus=NUM_GPUS, 
                                batch_size=TEST_PER_GPU_BATCH_SIZE*NUM_GPUS,
                                min_length=24, 
                                max_length=48,
                                num_beams=BEAM_SIZE)


dataset length is 11490
CPU times: user 3h 22min 45s, sys: 6.08 s, total: 3h 22min 51s
Wall time: 3h 22min 14s


In [23]:
rouge_scores = compute_rouge_python(cand=prediction, ref=target)
pprint.pprint(rouge_scores)

Number of candidates: 11490
Number of references: 11490
{'rouge-1': {'f': 0.3532833731474843,
             'p': 0.5062112092750258,
             'r': 0.2854026986121758},
 'rouge-2': {'f': 0.1627400891022247,
             'p': 0.23802173638805246,
             'r': 0.13034686738843493},
 'rouge-l': {'f': 0.2587374492685969,
             'p': 0.3710902340617733,
             'r': 0.20909466938819835}}


In [None]:
prediction[0]

In [None]:
target[0]

In [None]:
# for testing
sb.glue("rouge_2_f_score", rouge_scores['rouge-2']['f'])

## Prediction on a single input sample

In [None]:
source = """
But under the new rule, set to be announced in the next 48 hours, Border Patrol agents would immediately return anyone to Mexico — without any detainment and without any due process — who attempts to cross the southwestern border between the legal ports of entry. The person would not be held for any length of time in an American facility.

Although they advised that details could change before the announcement, administration officials said the measure was needed to avert what they fear could be a systemwide outbreak of the coronavirus inside detention facilities along the border. Such an outbreak could spread quickly through the immigrant population and could infect large numbers of Border Patrol agents, leaving the southwestern border defenses weakened, the officials argued.
The Trump administration plans to immediately turn back all asylum seekers and other foreigners attempting to enter the United States from Mexico illegally, saying the nation cannot risk allowing the coronavirus to spread through detention facilities and Border Patrol agents, four administration officials said.
The administration officials said the ports of entry would remain open to American citizens, green-card holders and foreigners with proper documentation. Some foreigners would be blocked, including Europeans currently subject to earlier travel restrictions imposed by the administration. The points of entry will also be open to commercial traffic."""

In [None]:
small_test_dataset = SummarizationDataset(
    None,
    source=[source],
    n_processes=1
)
preprocessed_dataset =  summarizer.processor.preprocess(small_test_dataset)


In [None]:
preprocessed_dataset[0].keys()

In [None]:
prediction = summarizer.predict(preprocessed_dataset, num_gpus=0, batch_size=1, min_length=24, max_length=48,)

In [None]:
prediction[0]

In [None]:
print("Total notebook running time {}".format(time.time() - start_time))

## Clean up temporary folders

In [None]:
"""
if os.path.exists(DATA_PATH):
    shutil.rmtree(DATA_PATH, ignore_errors=True)
if os.path.exists(CACHE_DIR):
    shutil.rmtree(CACHE_DIR, ignore_errors=True)
"""