# Summarization

## Transformers seq2seq models for summarization
[Source](https://rubikscode.net/2022/04/25/text-summarization-with-huggingface-transformers/)

In [None]:
!pip install transformers
import transformers



In [None]:
!pip install sentencepiece




### Pegasus
Pegasus is a standard Transformer encoder-decoder. Its pre-training task resembles extractive summarization: important sentences are removed from an input document and joined together into one output sequence, which the model has to generate from the remaining sentences.

In other words, the encoder receives the document with the selected sentences masked, and the decoder generates these gap sentences. The Pegasus paper introduces gap-sentence generation and explains strategies for selecting those sentences. More details can be found in [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.](https://arxiv.org/pdf/1912.08777.pdf)
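The gap-sentence idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the sentence splitter is naive, sentence importance is approximated by unigram overlap with the rest of the document (a stand-in for the ROUGE-based selection described in the paper), and the helper names `split_sentences`, `overlap_score` and `make_gsg_example` are hypothetical.

```python
import re

MASK = "<mask_1>"  # assumed sentence-level mask token

def split_sentences(text):
    """Naive sentence splitter (assumption: sentences end with . ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(sentence, others):
    """Unigram-overlap F1 between one sentence and the rest of the document."""
    sent = set(sentence.lower().split())
    rest = set(" ".join(others).lower().split())
    common = len(sent & rest)
    if common == 0:
        return 0.0
    precision = common / len(sent)
    recall = common / len(rest)
    return 2 * precision * recall / (precision + recall)

def make_gsg_example(text, gap_ratio=0.3):
    """Return (encoder_input, decoder_target) for one document."""
    sentences = split_sentences(text)
    n_gaps = max(1, round(len(sentences) * gap_ratio))
    scores = [
        overlap_score(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    # Indices of the highest-scoring sentences, restored to document order.
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n_gaps])
    encoder_input = " ".join(MASK if i in top else s for i, s in enumerate(sentences))
    decoder_target = " ".join(sentences[i] for i in top)
    return encoder_input, decoder_target

enc, tgt = make_gsg_example("The tower is tall. The tower is in Paris. Cats sleep a lot.")
print(enc)
print(tgt)
```

The encoder input keeps the document with one sentence replaced by the mask token, and the decoder target is the removed sentence, which mirrors the shape of a summarization training pair.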

### BART
BART is a sequence-to-sequence model trained as a denoising autoencoder: the input text is corrupted with a noising function, and the model learns to reconstruct the original text. Because it maps an input sequence to a different output sequence, BART has been applied to many tasks besides text summarization, such as question answering and machine translation.

The `facebook/bart-large-cnn` checkpoint is pre-trained on English text and fine-tuned on CNN/Daily Mail. More information about the model can be found in the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Lewis et al.](https://arxiv.org/abs/1910.13461)
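To make the denoising objective concrete, here is a minimal sketch of text infilling, one of the noising schemes described in the BART paper: a span of tokens is replaced by a single `<mask>` token and the model is trained to recover the original sequence. The function name `text_infilling` and its parameters are hypothetical, and span lengths are drawn uniformly here for simplicity, whereas the paper samples them from a Poisson distribution.

```python
import random

def text_infilling(tokens, mask_ratio=0.3, max_span=3, seed=0):
    """Corrupt a token list by replacing random spans with a single <mask>."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * mask_ratio))  # tokens still allowed to be masked
    corrupted, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(rng.randint(1, max_span), budget, len(tokens) - i)
            corrupted.append("<mask>")  # the whole span collapses into one token
            i += span
            budget -= span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted

original = "the tower is the tallest structure in Paris".split()
noisy = text_infilling(original)
# Training pair for the denoising objective: (noisy input, original target).
print(" ".join(noisy), "->", " ".join(original))
```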

### T5
XL-Sum is a dataset of about 1 million annotated article-summary pairs from the BBC. It covers 44 languages and is the largest abstractive summarization dataset in terms of the number of samples collected from a single source.

The `csebuetnlp/mT5_multilingual_XLSum` checkpoint is mT5, the multilingual variant of T5, fine-tuned on the XL-Sum dataset. More details can be found in [XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages](https://aclanthology.org/2021.findings-acl.413.pdf).


*The tower is 324 meters (1,063 ft) tall, about the same height
as an 81-storey building, and the tallest structure in Paris. Its base is square,
measuring 125 meters (410 ft) on each side. During its construction, the Eiffel
Tower surpassed the Washington Monument to become the tallest man-made structure
in the world, a title it held for 41 years until the Chrysler Building in New York
City was finished in 1930. It was the first structure to reach a height of 300 meters.
Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is
now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters,
the Eiffel Tower is the second tallest free-standing structure in France
after the Millau Viaduct.*

In [None]:
text_example = 'The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 meters. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.'

text_example_ru = 'В первом тайме голландцы забили дважды: на 10-й минуте отличился форвард «Барселоны» Мемфис Депай, а на 45-й — полузащитник Дэйли Блинд из «Аякса». Американцы отыграли один мяч на 76-й минуте после точного удара Хаджи Райта из «Антальяспора», а окончательный счет спустя пять минут установил Дензел Дюмфрис.'

text_example_ru2 = 'Российский актер Иван Краско прокомментировал информацию ряда СМИ о том, что его якобы госпитализировали с гипертоническим кризом в Мариинскую больницу в Санкт-Петербурге. Об этом сообщает сайт "Комсомольской правды". 92-летний актер заявил, что госпитализация была плановой. "У меня просто профилактика, лег в больницу, чтобы меня прокачали, прокололи. Это все планово", – рассказал он. Краско подчеркнул, что у него не было гипертонического криза.'

Using Pipeline

In [None]:
from transformers import pipeline

In [None]:
summarizer = pipeline("summarization", model = "google/pegasus-xsum")
summarizer(text_example)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': 'The Eiffel Tower is a free-standing structure in Paris, France.'}]

In [None]:
summarizer = pipeline("summarization", model = "facebook/bart-large-cnn")
summarizer(text_example)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': 'The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building. Its base is square, measuring 125 meters (410 ft) on each side. It is the second tallest free-standing structure in France after the Millau Viaduct.'}]

In [None]:
#summarizer = pipeline("summarization", model= "csebuetnlp/mT5_multilingual_XLSum")
#summarizer(text_example)

In [None]:
#summarizer(text_example_ru2)

Using AutoModel

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/pegasus-xsum')
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-xsum')

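# Pegasus needs no task prefix ("summarize: " is a T5 convention), so the text is passed as is.
tokens_input = tokenizer.encode(text_example, return_tensors='pt', max_length=512, truncation=True)
# min_length and max_length bound the length of the generated summary in tokens.
ids = model.generate(tokens_input, min_length=80, max_length=120)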
summary = tokenizer.decode(ids[0], skip_special_tokens=True)

print(summary)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

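# BART needs no task prefix either ("summarize: " is a T5 convention).
tokens_input = tokenizer.encode(text_example, return_tensors='pt', max_length=512, truncation=True)
# min_length and max_length bound the length of the generated summary in tokens.
ids = model.generate(tokens_input, min_length=80, max_length=120)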
summary = tokenizer.decode(ids[0], skip_special_tokens=True)

print(summary)


## BERT Extractive Summarization

### Source:
https://deeplearninganalytics.org/text-summarization/

https://github.com/nlpyang/BertSum


Idea: use BERT embeddings of the sentences of the source text in a binary classification task that selects the most significant sentences, which then make up the summary.

To obtain embeddings for several sentences at once, a separate start-of-sentence token **[CLS]** is inserted before each sentence of the text, and a **[SEP]** token after each sentence. For the segment embeddings (which BERT uses during pre-training to distinguish the first and the second sentence of a sentence pair), segment ids 0 and 1 alternate along the sequence of sentences.

_[sent1, sent2, sent3, sent4, sent5] -> [EA, EB, EA, EB, EA]._
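The input construction described above can be sketched as follows. Whitespace tokenization is an assumption made for brevity (BertSum uses BERT's WordPiece tokenizer), and `build_bertsum_input` is a hypothetical helper name; the returned lists play the same roles as the `src`, `segs` and `clss` fields of the preprocessed data.

```python
def build_bertsum_input(sentences):
    """Return tokens, interval segment ids and [CLS] positions."""
    tokens, segs, clss = [], [], []
    for i, sentence in enumerate(sentences):
        clss.append(len(tokens))           # position of this sentence's [CLS]
        piece = ["[CLS]"] + sentence.split() + ["[SEP]"]
        tokens.extend(piece)
        segs.extend([i % 2] * len(piece))  # E_A for even, E_B for odd sentences
    return tokens, segs, clss

tokens, segs, clss = build_bertsum_input(["he was flown back", "he died on sunday"])
print(clss)  # [0, 6]
print(segs)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```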

The vectors of the [CLS] tokens from the last BERT layer are used as the sentence vectors. These sentence vectors are fed to a classifier (the paper describes three variants):
1. linear layer + sigmoid
2. Transformer + sigmoid
3. LSTM + sigmoid
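The first variant (linear layer + sigmoid) can be illustrated without any deep learning framework. The sketch below uses toy 3-dimensional vectors in place of real [CLS] vectors and hand-picked weights; all values and the helper names `score_sentences` and `select_top_k` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_sentences(sentence_vectors, weights, bias):
    """One linear layer followed by a sigmoid, applied to every sentence vector."""
    return [
        sigmoid(sum(w * v for w, v in zip(weights, vec)) + bias)
        for vec in sentence_vectors
    ]

def select_top_k(scores, k):
    """Indices of the k highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy "[CLS] vectors" and weights (hypothetical values, not learned).
vectors = [[0.2, -1.0, 0.5], [1.5, 0.3, 0.1], [-0.7, 0.0, 0.2], [0.9, 0.8, -0.4]]
weights, bias = [1.0, 0.5, -0.5], 0.0
scores = score_sentences(vectors, weights, bias)
print(select_top_k(scores, k=2))  # [1, 3]
```

Each score is the predicted probability that the sentence belongs to the summary; during training it is compared with the 0/1 sentence label using binary cross-entropy.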

![bertsum](bertsum.png)

In [None]:
!pip install --force-reinstall torch==1.1.0


In [None]:
!pip install pytorch-pretrained-bert


In [None]:
!pip install tensorboardX



In [None]:
!pip install pyrouge




In [None]:
!pip install multiprocess



### Data

The CNN/Daily Mail dataset.

Download the preprocessed data.

In [None]:
%cd /content

/content


In [None]:
!wget --no-check-certificate --load-cookies /tmp/cookies.txt "http://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'http://docs.google.com/uc?export=download&id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1x0d61LP9UAN389YN00z0Pv-7jQgirVg6" -O bertsum_data.zip && rm -rf /tmp/cookies.txt


In [None]:
!git clone https://github.com/nlpyang/BertSum



In [None]:
%cd BertSum

/content/BertSum


In [None]:
!unzip ../bertsum_data.zip -d ./bert_data


In [None]:
%cd src

/content/BertSum/src


Example of the input data

In [None]:
import torch
cnn_test_samp = torch.load("/content/BertSum/bert_data/cnndm.test.0.bert.pt")

In [None]:
cnn_test_samp0 = cnn_test_samp[0]

In [None]:
cnn_test_samp0.keys()

dict_keys(['src', 'labels', 'segs', 'clss', 'src_txt', 'tgt_txt'])

In [None]:
print(cnn_test_samp0['clss']) # positions of the [CLS] tokens of the input sentences
print(cnn_test_samp0['labels']) # target labels of the sentences (1 = in the summary, 0 = not in the summary)
print(cnn_test_samp0['segs']) # segment ids of the tokens (alternate between sentences)
print(cnn_test_samp0['src']) # token ids

[0, 25, 57, 78, 112, 136, 174, 197, 223, 245, 285, 301, 337, 358, 382, 416, 452]
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,

In [None]:
cnn_test_samp0['src_txt'] # input text


['a university of iowa student has died nearly three months after a fall in rome in a suspected robbery attack in rome .',
 'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january .',
 'he was flown back to chicago via air ambulance on march 20 , but he died on sunday .',
 'andrew mogni , 20 , from glen ellyn , illinois , a university of iowa student has died nearly three months after a fall in rome in a suspected robbery',
 'he was taken to a medical facility in the chicago area , close to his family home in glen ellyn .',
 "he died on sunday at northwestern memorial hospital - medical examiner 's office spokesman frank shuftan says a cause of death wo n't be released until monday at the earliest .",
 'initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed .',
 "on sunday , his cousin abby wrote online : ` this morning my cous

In [None]:
cnn_test_samp0['tgt_txt'] # target summary

'andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program when the incident happened in january<q>he was flown back to chicago via air on march 20 but he died on sunday<q>initial police reports indicated the fall was an accident but authorities are investigating the possibility that mogni was robbed<q>his cousin claims he was attacked and thrown 40ft from a bridge'
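To see how the fields relate to each other, the sentences marked with label 1 can be read back from `src_txt`. The sketch below uses a hypothetical miniature sample with the same keys as the real data, so it runs without the downloaded files; `oracle_sentences` is an illustrative helper, not part of BertSum.

```python
def oracle_sentences(sample):
    """Return the source sentences whose label is 1."""
    return [s for s, y in zip(sample["src_txt"], sample["labels"]) if y == 1]

# Hypothetical miniature sample with the same keys as the preprocessed data.
toy_sample = {
    "src_txt": [
        "a student has died .",
        "he had just arrived in italy .",
        "he died on sunday .",
    ],
    "labels": [0, 1, 1],
    "tgt_txt": "he had just arrived<q>he died on sunday",
}

print(oracle_sentences(toy_sample))        # extractive targets taken from the source
print(toy_sample["tgt_txt"].split("<q>"))  # reference summary sentences
```

The labelled sentences are the extractive approximation of the reference summary stored in `tgt_txt`, which is why the two lists are similar but not identical.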

Training the model

In [None]:
!python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0  -gpu_ranks 0 -world_size 1 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 2000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000

[2022-11-26 15:53:46,518 INFO] Device ID 0
[2022-11-26 15:53:46,518 INFO] Device cuda

Evaluation on the validation and test data

In [None]:
!python train.py -mode validate -bert_data_path ../bert_data/cnndm -model_path ../models/bert_classifier  -visible_gpus 0  -gpu_ranks 0 -batch_size 30000  -log_file ../logs/bert_classifier_valid  -result_path ../results/cnndm -test_all -block_trigram true
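The `-block_trigram true` flag enables trigram blocking when the summary is assembled: a candidate sentence is skipped if it shares a trigram with the sentences already selected. A minimal sketch of the idea follows; it is an illustration with hypothetical helper names, not the code from the BertSum repository.

```python
def trigrams(sentence):
    words = sentence.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def select_with_trigram_blocking(sentences, scores, max_sentences=3):
    """Pick top-scoring sentences, skipping any that repeats a trigram
    already present in the summary."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, seen = [], set()
    for i in order:
        grams = trigrams(sentences[i])
        if grams & seen:
            continue  # overlaps with the summary so far: blocked
        chosen.append(i)
        seen |= grams
        if len(chosen) == max_sentences:
            break
    return [sentences[i] for i in chosen]

sentences = [
    "he died on sunday in chicago .",
    "police say he died on sunday .",
    "the fall was an accident .",
]
print(select_with_trigram_blocking(sentences, scores=[0.9, 0.8, 0.4]))
```

The second sentence is dropped despite its high score because it repeats "he died on", which reduces redundancy in the extracted summary.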

Summary examples

In [None]:
# extracted summary
N = 20
# train.py was run from /content/BertSum/src with -result_path ../results/cnndm
with open("/content/BertSum/results/cnndm_step2000.candidate") as f:
    dec = [next(f) for _ in range(N)]

In [None]:
# target summary
N = 20
# train.py was run from /content/BertSum/src with -result_path ../results/cnndm
with open("/content/BertSum/results/cnndm_step2000.gold") as f:
    ref = [next(f) for _ in range(N)]

In [None]:
ref[0].split('<q>')

['the 79th masters tournament gets underway at augusta national on thursday',
 'rory mcilroy and tiger woods will be the star attractions in the field bidding for the green jacket at 2015 masters',
 'mcilroy , justin rose , ian poulter , graeme mcdowell and more gave sportsmail the verdict on each hole at augusta',
 'click on the brilliant interactive graphic below for details on each hole of the masters 2015 course',
 'click here for all the latest news from the masters 2015\n']

In [None]:
dec[0].split('<q>')

['to help get you in the mood for the first major of the year , rory mcilroy , ian poulter , graeme mcdowell and justin rose , plus past masters champions nick faldo and charl schwartzel , give the lowdown on every hole at the world-famous augusta national golf club .',
 'the masters 2015 is almost here .',
 'click on the graphic below to get a closer look at what the biggest names in the game will face when they tee off on thursday .\n']

In [None]:
ref[1].split('<q>')

["jeff powell looks ahead to saturday 's fight at the mgm grand",
 'floyd mayweather takes on manny pacquiao in $ 300m showdown',
 'both fighters arrived in las vegas on tuesday with public appearances',
 'read : mayweather makes official arrival ahead of manny pacquiao fight',
 'al haymon : the man behind mayweather who is revolutionising boxing',
 "mayweather vs pacquiao takes centre stage ... but who 's on the undercard ?\n"]

In [None]:
dec[1].split('<q>')

["powell reflects on the pair 's arrivals on the las vegas strip and looks forward to the rest of the week .",
 'both boxers made public appearances on tuesday as their $ 300million showdown draws ever closer , and our man powell was there .',
 "sportsmail 's boxing correspondent jeff powell looks ahead to saturday 's mega-fight at the mgm grand after witnessing floyd mayweather and manny pacquiao 's grand arrivals in las vegas .\n"]

In [None]:
ref[2].split('<q>')

['gary locke has been interim manager since start of february',
 'locke has won two and drawn four of his seven games in charge',
 'the 37-year-old took over when allan johnston quit\n']

In [None]:
dec[2].split('<q>')

['the former hearts boss joined the club as assistant boss to allan johnston last summer but took control of the team when his ex-tynecastle team-mate quit at the start of february .',
 'gary locke has been given the job at kilmarnock on a permanent basis after a successful interim spell',
 'the 39-year-old - who will speak at a press conference on friday morning - has lost just once in seven games since taking over at rugby park .\n']
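Pairs such as `dec[i]` and `ref[i]` are usually compared with ROUGE, which counts overlapping n-grams. The following is a simplified sketch of ROUGE-N on whitespace tokens; the official ROUGE toolkit adds options such as stemming, so its numbers can differ. The function names and the example strings are illustrative only.

```python
from collections import Counter

def ngrams(text, n):
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

candidate = "gary locke has been given the job"
reference = "gary locke has been interim manager"
for n in (1, 2):
    print(n, rouge_n(candidate, reference, n))
```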

## Transformer-based metrics

#### BARTScore

In [15]:
!wget https://github.com/neulab/BARTScore/raw/refs/heads/main/bart_score.py



In [17]:
# BARTScore with the BART checkpoint fine-tuned on CNN/Daily Mail
from bart_score import BARTScorer
bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
# Scores the generation of each text in the second list given the matching text in the first list
# (an average token log-likelihood, so values are negative and higher is better).
bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4)

# BARTScore with the authors' ParaBank-trained weights (requires the bart.pth file)
#from bart_score import BARTScorer
#bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
#bart_scorer.load(path='bart.pth')
#bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4)


[-2.510651111602783]

#### BERTScore

In [7]:
!pip install evaluate
!pip install bert_score



In [9]:
from evaluate import load
bertscore = load("bertscore")
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
#results = bertscore.compute(predictions=predictions, references=references, lang="en")
results = bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased")
print(results)

{'precision': [1.0, 1.0], 'recall': [1.0, 1.0], 'f1': [1.0, 1.0], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.44.2)'}


## Sumy
Sumy is one of the most complete and actively maintained libraries for extractive summarization. It implements a range of algorithms, provides a command-line interface, and has a [web demo](https://huggingface.co/spaces/issam9/sumy_space) you can experiment with. It accepts both raw text and web links. Sumy also includes the necessary preprocessing components (parsers, tokenizers, and stemmers) and supports many languages.

In [None]:
!pip install gensim spacy numpy nltk sumy rouge


In [None]:
!pip install datasets
import datasets
import numpy as np
import rouge



In [None]:
dataset = datasets.load_dataset("cnn_dailymail", '3.0.0')
first_entry = dataset['train'][0]


In [None]:
print(first_entry)

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char

### Luhn's summarizer

Luhn's summarizer was one of the first attempts at automatic text summarization. In his 1958 paper "The Automatic Creation of Literature Abstracts", Luhn proposes that the frequency of a word determines its significance.

At the preprocessing stage, words are stemmed and stop words are removed. The stems are then sorted by decreasing frequency, and the most frequent ones are treated as significant. A sentence is considered representative of the text if it contains a dense cluster of significant words, that is, significant words separated by no more than 4 or 5 non-significant words. Only such portions, bounded by significant words, are scored rather than the whole sentence: the significance factor of a portion is the square of the number of significant words in it divided by the total number of words in the portion. If a sentence contains several portions, it is assigned the maximum significance factor among them. Finally, the top-scoring sentences are included in the summary.
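The scoring of a single sentence can be sketched in plain Python. This is a simplified illustration and not sumy's implementation: the function name `luhn_sentence_score` is hypothetical, the set of significant stems is assumed to be given, and a portion is scored as the squared number of significant words divided by the length of the portion.

```python
def luhn_sentence_score(words, significant, max_gap=4):
    """Significance factor of one sentence.

    words: tokens of the sentence, in order
    significant: set of significant words (stems)
    max_gap: maximum number of non-significant words allowed between
             two significant words inside one portion
    """
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best, start, count = 0.0, positions[0], 1
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev - 1 > max_gap:
            # Gap too large: close the current portion and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = cur, 0
        count += 1
    last = positions[-1]
    return max(best, count ** 2 / (last - start + 1))

words = "the tower is the tallest structure in paris".split()
print(luhn_sentence_score(words, {"tower", "tallest", "structure"}))  # 1.8
```

In the example the portion runs from "tower" to "structure" (5 words) and contains 3 significant words, so the score is 3² / 5 = 1.8.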

In [None]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.parsers.plaintext import PlaintextParser
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
def summarize_luhn(article: str, sentence_count: int) -> str:
    ''' Utility function to perform Luhn's summarization.

        By default, LuhnSummarizer treats 100% of the non-stop-word stems as
        significant, but you can override the significant_percentage attribute with a
        fraction: summarizerLuhn.significant_percentage = 1/3

    '''

    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    summarizerLuhn = LuhnSummarizer(Stemmer('english'))
    summarizerLuhn.stop_words = get_stop_words('english')
    luhn_summary = summarizerLuhn(parser.document, sentences_count = sentence_count)
    return ' '.join([str(sentence) for sentence in luhn_summary])

In [None]:
summarize_luhn(first_entry['article'], sentence_count = 2)

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.'

### Task 1.
Compute ROUGE-1, ROUGE-2, and ROUGE-3 for the summary of this article.

### Task 2.
Implement LSA summarization.

### Task 3.
Summarize 100 random articles and compare the quality of LSA and Luhn using ROUGE-1, ROUGE-2, ROUGE-3, and METEOR.