Copyright (c) Microsoft Corporation.  
Licensed under the MIT License.

# Abstractive Summarization using UniLM on CNN/DailyMail

## Before you start
Set `QUICK_RUN = True` to run the notebook on a small subset of the data with fewer training steps.

In [None]:
QUICK_RUN = False

## Summary
This notebook demonstrates how to fine-tune the [Unified Language Model](https://arxiv.org/abs/1905.03197) (UniLM) for the abstractive summarization task. Utility functions and classes in the microsoft/nlp-recipes repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.

### Abstractive Summarization
Abstractive summarization is the task of taking an input text and summarizing its content in a shorter output text. In contrast to extractive summarization, abstractive summarization doesn't copy sentences directly from the input text; instead, it rephrases the input text.

### UniLM
UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is first pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can then be fine-tuned on different types of labeled data for various NLP tasks such as text classification and abstractive summarization.  
The figure below shows the UniLM architecture. During pre-training, the model parameters are shared across the LM objectives (i.e., bidirectional LM, unidirectional LM, and sequence-to-sequence LM). For different NLP tasks, UniLM uses different self-attention masks to control which context each word token can access.  
The seq-to-seq LM in the third row of the figure is used for the summarization task. In the seq-to-seq LM, word tokens in the input sequence can access all the other tokens in the input sequence, but cannot access the word tokens in the output sequence. Word tokens in the output sequence can access all the tokens in the input sequence and the tokens in the output sequence generated before the current position.  
<img src="https://nlpbp.blob.core.windows.net/images/unilm_architecture.PNG" width="600" height="600">
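
The access rules described above for the seq-to-seq LM can be illustrated with a small attention mask. The following is only a minimal sketch in plain Python, not the UniLM implementation: the function name `seq2seq_attention_mask` is hypothetical, and letting each output token also attend to itself is an assumption that follows the usual causal-mask convention.

```python
def seq2seq_attention_mask(source_len, target_len):
    """Build a 0/1 mask where mask[i][j] == 1 means token i may attend to token j.

    Positions 0..source_len-1 are input (source) tokens, the remaining
    positions are output (target) tokens. Hypothetical helper for illustration.
    """
    total = source_len + target_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < source_len:
                # Every token, source or target, can access the whole input sequence.
                mask[i][j] = 1
            elif i >= source_len and j <= i:
                # Output tokens can access earlier output tokens
                # (and, by assumption, their own position).
                mask[i][j] = 1
    return mask


mask = seq2seq_attention_mask(source_len=3, target_len=2)
for row in mask:
    print(row)
```

The first three rows (input tokens) contain ones only in the input columns, while the last two rows (output tokens) additionally open up the output columns one position at a time.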


In [None]:
%load_ext autoreload
%autoreload 2
import time
from utils_nlp.dataset.cnndm import CNNDMSummarizationDatasetOrg
from utils_nlp.models import S2SAbsSumProcessor, S2SAbstractiveSummarizer
from utils_nlp.eval import compute_rouge_python

In [None]:
OUTPUT_FILE = './nlp_cnndm_finetuning_results.txt'

# model parameters
MODEL_NAME = "unilm-large-cased"
MAX_SEQ_LEN = 768
MAX_SOURCE_SEQ_LENGTH = 640
MAX_TARGET_SEQ_LENGTH = 128

# fine-tuning parameters
TRAIN_PER_GPU_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 2
LEARNING_RATE = 3e-5

TOP_N = -1
WARMUP_STEPS = 500
MAX_STEPS = 5000
if QUICK_RUN:
    TOP_N = 100
    WARMUP_STEPS = 5
    MAX_STEPS = 50

# inference parameters
TEST_PER_GPU_BATCH_SIZE = 12
BEAM_SIZE = 5
FORBID_IGNORE_WORD = "."

# mixed precision setting
FP16 = False

## Load the CNN/DailyMail dataset
The [CNN/DailyMail dataset](https://cs.nyu.edu/~kcho/DMQA/) was originally introduced for Q&A research. Multiple versions of the dataset, processed for the summarization task, are available on the web. The `CNNDMSummarizationDatasetOrg` function downloads a version from the [UniLM repo](https://github.com/microsoft/unilm) with minimal processing. The function returns the training and testing datasets as `SummarizationDataset` objects, which can be further processed for model training and testing.

In [None]:
train_ds, test_ds = CNNDMSummarizationDatasetOrg(top_n=TOP_N)
print(len(train_ds))
print(len(test_ds))

## Preprocessing
The `S2SAbsSumProcessor` has multiple methods for converting input data in `SummarizationDataset`, `IterableSummarizationDataset`, or JSON files into the format required for model training and testing. The preprocessing steps include:
- Tokenize input text
- Convert tokens into token ids

In [None]:
processor = S2SAbsSumProcessor(model_name=MODEL_NAME)

In [None]:
train_dataset = processor.train_dataset_from_sum_ds(train_ds, load_cached_features=True)
test_dataset = processor.test_dataset_from_sum_ds(test_ds)

## Fine-tune the model

The `S2SAbstractiveSummarizer` loads a pre-trained UniLM model specified by `model_name`.  
Call `S2SAbstractiveSummarizer.list_supported_models()` to see all the supported models.  
If you want to use a model on the local disk, specify `load_model_from_dir` and `model_file_name`. This is particularly useful if you want to load a previously fine-tuned model and use it for inference directly without fine-tuning. 

In [None]:
abs_summarizer = S2SAbstractiveSummarizer(
    model_name=MODEL_NAME,
    max_seq_len=MAX_SEQ_LEN,
    max_source_seq_length=MAX_SOURCE_SEQ_LENGTH,
    max_target_seq_length=MAX_TARGET_SEQ_LENGTH,
)

## To load a model on the local disk
# abs_summarizer = S2SAbstractiveSummarizer(
#     model_name=MODEL_NAME,
#     max_seq_len=MAX_SEQ_LEN,
#     max_source_seq_length=MAX_SOURCE_SEQ_LENGTH,
#     max_target_seq_length=MAX_TARGET_SEQ_LENGTH,
#     load_model_from_dir="./",
#     model_file_name="model.5000.bin",
# )


In [None]:
abs_summarizer.fit(
    train_dataset=train_dataset,
    per_gpu_batch_size=TRAIN_PER_GPU_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    max_steps=MAX_STEPS,
    fp16=FP16
)

## Generate summaries on the test dataset

In [None]:
res = abs_summarizer.predict(
    test_dataset=test_dataset,
    per_gpu_batch_size=TEST_PER_GPU_BATCH_SIZE,
    beam_size=BEAM_SIZE,
    forbid_ignore_word=FORBID_IGNORE_WORD,
    fp16=FP16
)

In [None]:
for r in res[:5]:
    print(r)

In [None]:
with open(OUTPUT_FILE, 'w') as f:
    for line in res:
        f.write(line + '\n')

## Evaluation
We provide utility functions for evaluating summarization models; details can be found in the [summarization evaluation notebook](./summarization_evaluation.ipynb).  
For the settings in this notebook with `QUICK_RUN = False`, you should get ROUGE scores close to the following numbers:  
{'rouge-1': {'f': 0.37109626943068647,
  'p': 0.4692792272280924,
  'r': 0.33322322114381886},  
 'rouge-2': {'f': 0.1690495786379728,
  'p': 0.21782900161918375,
  'r': 0.15079122430118444},  
 'rouge-l': {'f': 0.2671310062443078,
  'p': 0.3414039392451434,
  'r': 0.2392756715930202}}
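
To make the `p`, `r`, and `f` entries above easier to read, here is a minimal sketch of how ROUGE-1 precision, recall, and F-score can be computed from unigram overlap. It is a simplified illustration under the assumption of whitespace tokenization; the function `rouge_1` is hypothetical and is not the implementation behind `compute_rouge_python`, which may tokenize, stem, and aggregate differently.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 for one candidate/reference pair."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f_score}


scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

In this toy pair, five of the six candidate words also appear in the reference, so precision and recall are both 5/6.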

In [None]:
print(compute_rouge_python(cand=res, ref=test_ds.get_target()))

## Distributed training with DistributedDataParallel (DDP)
This notebook uses DataParallel for multi-GPU training by default. In general, DistributedDataParallel (DDP) is recommended because of its better performance. See details [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).  
Since DDP requires multiple processes and cannot be run from the notebook, we provide a Python script [abstractive_summarization_unilm_cnndm.py](./abstractive_summarization_unilm_cnndm.py) to demonstrate how to use DDP.  
**Note that the Python script sets `FP16=True` by default. If you don't have apex installed, set `FP16=False` instead.**