<a href="https://colab.research.google.com/github/neqkir/working-with-tranformers/blob/main/BERT/bertsumabs-text-summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%load_ext autoreload
%autoreload 2

QUICK_RUN = True

In [None]:
!pip install --upgrade 
!pip install -q git+https://github.com/microsoft/nlp-recipes.git
!pip install transformers
!pip install eval
!pip install rouge
!pip install jsonlines
!pip install pyrouge
!pip install scrapbook
!pip install indicnlp
#!pip install indicnlp.tokenize

import os
import shutil
import sys
from tempfile import TemporaryDirectory
import torch
import nltk
from nltk import tokenize
import pandas as pd
import pprint
import scrapbook as sb

nlp_path = os.path.abspath("../../")
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp import models
from utils_nlp.models import transformers 
from utils_nlp.models.transformers.datasets import SummarizationDataset
from utils_nlp import eval
from utils_nlp.eval import rouge
from utils_nlp.dataset.cnndm import CNNDMSummarizationDataset
from utils_nlp.eval import compute_rouge_python

from utils_nlp.models.transformers.abstractive_summarization_bertsum import (
    BertSumAbs,
    BertSumAbsProcessor,
)

This notebook uses the CNN/DM dataset, which contains news articles from CNN and the Daily Mail together with their accompanying questions. The highlights of each article serve as its summary. The dataset consists of ~289K training examples, ~11K validation examples and ~11K test examples. News articles average 781 tokens; summaries average 3.75 sentences and 56 tokens.
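
As a rough illustration of how per-summary averages like the ones quoted above can be computed, here is a minimal sketch. It assumes whitespace tokenization and a naive regex sentence splitter, so it will not reproduce the published figures exactly. The helper `summary_stats` and the toy strings are hypothetical and are not part of `utils_nlp` or of CNN/DM.

```python
import re


def summary_stats(texts):
    """Return (mean tokens, mean sentences) for a list of strings.

    Tokens are whitespace-separated; sentences are split after '.', '!' or '?'
    when followed by whitespace. Both are rough approximations.
    """
    token_counts = [len(t.split()) for t in texts]
    sentence_counts = [
        len([s for s in re.split(r"(?<=[.!?])\s+", t.strip()) if s])
        for t in texts
    ]
    n = len(texts)
    return sum(token_counts) / n, sum(sentence_counts) / n


# Toy data invented for illustration only (not taken from CNN/DM).
toy_summaries = [
    "The council approved the budget. Work starts in May.",
    "Flights were delayed by fog.",
]
mean_tokens, mean_sentences = summary_stats(toy_summaries)
print(mean_tokens, mean_sentences)
```

The same function could be applied to the joined target sentences of a loaded dataset, keeping in mind that a different tokenizer will give different counts.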

In [None]:
# the data path used to save the downloaded data file
DATA_PATH = TemporaryDirectory().name
# The number of lines at the head of data file used for preprocessing. -1 means all the lines.
TOP_N = 100
if not QUICK_RUN:
    TOP_N = -1

train_dataset, test_dataset = CNNDMSummarizationDataset(
    top_n=TOP_N, local_cache_path=DATA_PATH, prepare_extractive=False
)

Model finetuning

In [None]:
# notebook parameters
# the cache path
CACHE_PATH = TemporaryDirectory().name

# model parameters
MODEL_NAME = "bert-base-uncased"
MAX_POS = 768
MAX_SOURCE_SEQ_LENGTH = 640
MAX_TARGET_SEQ_LENGTH = 140

# mixed precision setting. To enable mixed precision training, follow instructions in SETUP.md.
FP16 = False
if FP16:
    FP16_OPT_LEVEL = "O2"

# fine-tuning parameters
# batch size, unit is the number of tokens
BATCH_SIZE_PER_GPU = 1


# GPU used for training
NUM_GPUS = torch.cuda.device_count()
if NUM_GPUS > 0:
    BATCH_SIZE = NUM_GPUS * BATCH_SIZE_PER_GPU
else:
    BATCH_SIZE = 1


# Learning rate
LEARNING_RATE_BERT = 5e-4 / 2.0
LEARNING_RATE_DEC = 0.05 / 2.0


# How often training statistics are reported and checkpoints are saved, in steps.
REPORT_EVERY = 10
SAVE_EVERY = 500

# total number of steps for training
MAX_STEPS = 1e3

if not QUICK_RUN:
    MAX_STEPS = 5e3

WARMUP_STEPS_BERT = 2000
WARMUP_STEPS_DEC = 1000
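
The separate learning rates and warm-up step counts for the encoder and decoder follow the BertSum recipe, in which each optimizer warms up linearly and then decays with the inverse square root of the step. The sketch below shows that schedule shape. It is an assumption that `BertSumAbs.fit` applies exactly this formula, and `noam_lr` is a hypothetical helper, not a `utils_nlp` function.

```python
def noam_lr(step, base_lr, warmup_steps):
    """Learning rate at a 1-indexed step: linear warm-up, then inverse-sqrt decay.

    Assumed form: base_lr * min(step^-0.5, step * warmup_steps^-1.5).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# Values copied from the parameter cell above.
LEARNING_RATE_BERT = 5e-4 / 2.0
WARMUP_STEPS_BERT = 2000

for step in (100, 1000, 2000, 5000):
    print(step, noam_lr(step, LEARNING_RATE_BERT, WARMUP_STEPS_BERT))
```

Under this assumed schedule the rate peaks at `step == warmup_steps`, so a run with `MAX_STEPS = 1e3` and `WARMUP_STEPS_BERT = 2000` would stop while the encoder rate is still ramping up.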

In [None]:
# processor which contains the collate function to load the preprocessed data
processor = BertSumAbsProcessor(cache_dir=CACHE_PATH, max_src_len=MAX_SOURCE_SEQ_LENGTH, max_tgt_len=MAX_TARGET_SEQ_LENGTH)
# summarizer
summarizer = BertSumAbs(
    processor, cache_dir=CACHE_PATH, max_pos_length=MAX_POS
)

In [None]:
BATCH_SIZE_PER_GPU*NUM_GPUS

In [None]:
summarizer.fit(
    train_dataset,
    num_gpus=NUM_GPUS,
    batch_size=BATCH_SIZE,
    max_steps=MAX_STEPS,
    learning_rate_bert=LEARNING_RATE_BERT,
    learning_rate_dec=LEARNING_RATE_DEC,
    warmup_steps_bert=WARMUP_STEPS_BERT,
    warmup_steps_dec=WARMUP_STEPS_DEC,
    save_every=SAVE_EVERY,
    report_every=REPORT_EVERY * 5,
    fp16=FP16,
    # checkpoint="saved checkpoint path"
)

In [None]:
summarizer.save_model(MAX_STEPS, os.path.join(CACHE_PATH, "bertsumabs.pt"))

Model Evaluation

To run ROUGE evaluation, refer to the compute_rouge_perl section of summarization_evaluation.ipynb for setup. With the settings in this notebook and QUICK_RUN=False, you should get ROUGE scores close to the following:
```
{'rouge-1': {'f': 0.34819639878321873, 'p': 0.39977932634737307, 'r': 0.34429079596863604},
 'rouge-2': {'f': 0.13919271352557894, 'p': 0.16129965067780644, 'r': 0.1372938054050938},
 'rouge-l': {'f': 0.2313282318854973, 'p': 0.26664667422849747, 'r': 0.22850294283399628}}
```
Better performance can be achieved by increasing MAX_STEPS.
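
To make the `p`, `r` and `f` entries in the scores above concrete, here is a simplified sketch of ROUGE-1 as clipped unigram overlap. It assumes lowercased whitespace tokens and no stemming, so its numbers will differ from those of `compute_rouge_python` and the Perl toolkit. The function `rouge_1` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    p = overlap / len(cand_tokens) if cand_tokens else 0.0
    r = overlap / len(ref_tokens) if ref_tokens else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


# Toy sentences invented for illustration only.
scores = rouge_1("the cat sat on the mat", "the cat lay on a mat")
print(scores)
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.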

In [None]:
TEST_TOP_N = 32
if not QUICK_RUN:
    TEST_TOP_N = len(test_dataset)

if NUM_GPUS:
    BATCH_SIZE = NUM_GPUS * BATCH_SIZE_PER_GPU
else:
    BATCH_SIZE = 1
    
shortened_dataset = test_dataset.shorten(top_n=TEST_TOP_N)
src = shortened_dataset.get_source()
reference_summaries = [" ".join(t).rstrip("\n") for t in shortened_dataset.get_target()]
generated_summaries = summarizer.predict(
    shortened_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS
)
assert len(generated_summaries) == len(reference_summaries)

In [None]:
src[0]

In [None]:
generated_summaries[0]

In [None]:
reference_summaries[0]

In [None]:
rouge_scores = compute_rouge_python(cand=generated_summaries, ref=reference_summaries)
pprint.pprint(rouge_scores)

In [None]:
# for testing
sb.glue("rouge_2_f_score", rouge_scores['rouge-2']['f'])

Example

In [None]:
source = """
But under the new rule, set to be announced in the next 48 hours, Border Patrol agents would immediately return anyone to Mexico — without any detainment and without any due process — who attempts to cross the southwestern border between the legal ports of entry. The person would not be held for any length of time in an American facility.

Although they advised that details could change before the announcement, administration officials said the measure was needed to avert what they fear could be a systemwide outbreak of the coronavirus inside detention facilities along the border. Such an outbreak could spread quickly through the immigrant population and could infect large numbers of Border Patrol agents, leaving the southwestern border defenses weakened, the officials argued.
The Trump administration plans to immediately turn back all asylum seekers and other foreigners attempting to enter the United States from Mexico illegally, saying the nation cannot risk allowing the coronavirus to spread through detention facilities and Border Patrol agents, four administration officials said.
The administration officials said the ports of entry would remain open to American citizens, green-card holders and foreigners with proper documentation. Some foreigners would be blocked, including Europeans currently subject to earlier travel restrictions imposed by the administration. The points of entry will also be open to commercial traffic."""

In [None]:
test_dataset = SummarizationDataset(
    None, source=[source], source_preprocessing=[tokenize.sent_tokenize],
)
generated_summaries = summarizer.predict(test_dataset, batch_size=1, num_gpus=NUM_GPUS)

In [None]:
generated_summaries[0]

In [None]:
if os.path.exists(DATA_PATH):
    shutil.rmtree(DATA_PATH, ignore_errors=True)
if os.path.exists(CACHE_PATH):
    shutil.rmtree(CACHE_PATH, ignore_errors=True)