<a href="https://colab.research.google.com/github/kperv/summarizer_app/blob/main/summarization_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extractive summarization task**

**Main highlights**

*   Supported languages are Russian and Spanish.
*   Text is broken into sentences by Spacy.
*   Sentences are tokenized by pretrained BertTokenizer.
*   sklearn is used for the main solution

**Benchmark** Clustering algorithm on aggregated word vectors.

**Neural Net architecture** The attempted idea is to train a centroid in a vector space by using Stacked Convolutional Layers.

**Metric is Bert_score** as it supports target languages



# Installations and imports

In [1]:
%%capture
!pip install transformers[sentencepiece]
!pip install pytorch-lightning
!pip install -U bert_score
!pip install datasets
!pip install https://huggingface.co/spacy/ru_core_news_md/resolve/main/ru_core_news_md-any-py3-none-any.whl
!pip install https://huggingface.co/spacy/es_core_news_md/resolve/main/es_core_news_md-any-py3-none-any.whl


import os
import numpy as np
import torch
import spacy
import transformers
import datasets
import pytorch_lightning as pl
import bert_score
import torch.nn.functional as F
import sklearn


from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
from datasets import load_dataset
from transformers import BertTokenizer, BertModel
from bert_score import score
from torch import nn
from torchvision import transforms
from torch.utils.data import DataLoader

In [2]:
nlp_ru = spacy.load("ru_core_news_md")
nlp_es = spacy.load("es_core_news_md")
nlp = {"ru": nlp_ru, "es": nlp_es}

pl.seed_everything(42)

42

### requirements

In [200]:
# necessary
print("numpy=={}".format(np.__version__))
print("spacy=={}".format(spacy.__version__))
print("transformers=={}".format(transformers.__version__))
print("bert_score=={}".format(bert_score.__version__))
# only for benchmark
print("sklearn=={}".format(sklearn.__version__))
# only for nets
print("datasets=={}".format(datasets.__version__))
print("torch=={}".format(torch.__version__))
print("PyTorch Lightning=={}".format(pl.__version__))

numpy==1.19.5
spacy==3.1.1
transformers==4.9.2
bert_score==0.3.10
sklearn==0.22.2.post1
datasets==1.11.0
torch==1.9.0+cu102
PyTorch Lightning==1.4.2


### Examples and checks of Bert score

Predictions correspond to a news headline and references to a ferst paragraph of the same article. Expected to get high scores.

In [4]:
predictions = ["Аналитик прокомментировал предстоящий визит Меркель в Москву"]
references = ["Ведущий научный сотрудник Центра германских исследований Института Европы РАН Александр Камкин прокомментировал в беседе с RT сообщение о том, что канцлер Германии Ангела Меркель и президент России Владимир Путин проведут переговоры в Москве 20 августа"]
P, R, F1 = score(predictions, references, lang='ru')
print(f"System level F1 score: {F1.mean():.3f}")
print(f"System level P score: {P.mean():.3f}")
print(f"System level R score: {R.mean():.3f}")

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/714M [00:00<?, ?B/s]

System level F1 score: 0.680
System level P score: 0.738
System level R score: 0.632


In [5]:
predictions = ["Qué sabemos sobre el diálogo entre la oposición y Maduro que está programado para comenzar este viernes en México"]
references = ["Tras años de tensiones, protestas y negociaciones estancadas, en medio de una situación económica muy deteriorada y complicada aún más por la pandemia de covid-19, el gobierno de Venezuela y la oposición intentará por quinta vez llegar a una solución para la crisis política mediante un diálogo, esta vez en México."]
P, R, F1 = score(predictions, references, lang='es')
print(f"System level F1 score: {F1.mean():.3f}")
print(f"System level P score: {P.mean():.3f}")
print(f"System level R score: {R.mean():.3f}")

System level F1 score: 0.636
System level P score: 0.663
System level R score: 0.611


# Clustering solution

### data

In [215]:
number = 3
lang = "ru"

Some random paragraph from news

In [216]:
text = "Очень непросто делать прогнозы. С одной стороны, достаточно долго длится фаза подъема, после 11 июня началась третья волна в Свердловской области. Мы достигли очень высокого уровня заболеваемости и смертности и стабилизировались на нем. Как долго это будет продолжаться, зависит от доли восприимчивого к вирусу населения. На момент начала третьей волны число привитых или переболевших свердловчан не превышало 50 процентов, — отметил Соловьев. — Чтобы третья волна остановилась, мы должны достичь высокого уровня коллективного иммунитета. С помощью вакцинации мы уже не успеваем его достичь. Второй вариант — заболеваемость. Официальная статистика не отражает реальное число людей, которые встретились с коронавирусом. Сейчас приблизительно 65% жителей переболели или вакцинировались от коронавируса. Это должно сказываться на снижении заболеваемости, но пока этого не происходит."

### Separate sentences with SpaCy

In [217]:
def break_text(text, lang):
  doc = nlp[lang](text)
  assert doc.has_annotation("SENT_START")
  sentences = [str(sent) for sent in doc.sents]
  return sentences

In [218]:
sentences = break_text(text, lang)

### Convert words to tokens and run through Bert encoder to get word embeddings

In [219]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
encoded_sentences = tokenizer(sentences, truncation=True, padding=True, return_tensors="pt")
outputs = model(encoded_sentences.input_ids)
embeddings = outputs.last_hidden_state

I'm using *mean* to get sentence embeddings. This step can be improved.

In [220]:
embeddings = embeddings.detach().numpy()
embeddings = embeddings.mean(axis=1)

Cluster embeddings in a space into n (number) clusters to get centroids. Take distance metric between sentence embeddings and centroids and choose n closest to put into the summary.

### Collect summary sentences together

In [221]:
kmeans = KMeans(n_clusters=number).fit(embeddings)
centroids = kmeans.cluster_centers_
result = pairwise_distances_argmin_min(embeddings, centroids)
ummary_sent_positions = list(np.argsort(result[1])[:number])

In [222]:
summary = ""
for idx in summary_sent_positions:
  summary += sentences[idx]
summary

'На момент начала третьей волны число привитых или переболевших свердловчан не превышало 50 процентов, — отметил Соловьев.Очень непросто делать прогнозы.Второй вариант — заболеваемость.'

### Check the result

In [223]:
predictions = [summary]
references = [text]
P, R, F1 = score(predictions, references, lang=lang)
print(f"System level F1 score: {F1.mean():.3f}")
print(f"System level P score: {P.mean():.3f}")
print(f"System level R score: {R.mean():.3f}")

System level F1 score: 0.770
System level P score: 0.898
System level R score: 0.675


# Neural Net solution - Not implemented


### Checks for long training

In [None]:
!nvidia-smi

In [None]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
device

### Dataset is large-scale MultiLingual SUMmarization dataset

**es**

Data Splits: Train 266367	Val 10358	Test 1392

Total amount of disk used: 1764.09 MB


**ru**

Data Splits:Train 25556	Val 750	Test 757

Total amount of disk used: 364.68 MB



In [None]:
%%capture
dataset_ru = load_dataset("mlsum", "ru")
dataset_es = load_dataset("mlsum", "es")

In [7]:
dataset_ru.num_rows

{'test': 757, 'train': 25556, 'validation': 750}

In [None]:
dataset_es.num_rows

### Assemble structure

In [None]:
class MlSumDataModule(pl.LightningDataModule):
  def __init__(self, batch_size=32):
    super.__init__()
    self.batch_size = batch_size

  def prepare_data(self):
    self.dataset_ru = load_dataset("mlsum", "ru")
    self.dataset_es = load_dataset("mlsum", "es")
    nlp_ru = spacy.load("ru_core_news_md")
    nlp_es = spacy.load("es_core_news_md")

  def train_dataloader(self):
    dataset_ru_train = DataLoader(
        self.dataset_ru["train"], 
        batch_size=self.batch_size)
    dataset_es_train = DataLoader(
        self.dataset_es["train"], 
        batch_size=self.batch_size)
    loaders = [dataset_ru_train, dataset_es_train]
    return train_loaders

  def val_dataloader(self):
    dataset_ru_val = DataLoader(
        self.dataset_ru["val"], 
        batch_size=self.batch_size)
    dataset_es_val = DataLoader(
        self.dataset_es["val"], 
        batch_size=self.batch_size)
    loaders = [dataset_ru_val, dataset_es_val]
    return val_loaders

  def test_dataloader(self):
    dataset_ru_test = DataLoader(
        self.dataset_ru["test"], 
        batch_size=self.batch_size)
    dataset_es_test = DataLoader(
        self.dataset_es["test"], 
        batch_size=self.batch_size)
    loaders = [dataset_ru_test, dataset_es_test]
    return test_loaders

In [None]:
class Model(pl.LightningModule):
  def __init__(self):
    super.__init__()
    self.l1 = nn.Linear(x, y)

  def forward(self, x):
    return torch.relu()

  def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.pairwise_distance(y_hat, y)
    self.log("train_loss", loss, on_epoch=True)
    return loss

  def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.pairwise_distance(y_hat, y)
    self.log("val_loss", loss, on_epoch=True)

  def test_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.pairwise_distance(y_hat, y)
    self.log("test_loss", loss, on_epoch=True)


  def configure_optimizers(self):
    return torch.optim.Adam(self.parameters(), lr=self.lr)

### Train

In [None]:
trainer = pl.Trainer()
model = Model()

trainer.fit(model, train_loader)
trainer.test(model, test_dataloaders=val_dataloader)
trainer.test(test_dataloaders=test_dataloader)

In [None]:
# call after training
trainer = pl.Trainer()
trainer.fit(model)
trainer.test(dataloaders=test_dataloader)

In [None]:
# or call with pretrained model
model = MyLightningModule.load_from_checkpoint(PATH)
trainer = pl.Trainer()
trainer.test(model, dataloaders=test_dataloader)

### Predict

In [None]:
# get predictions
PATH = "../saved_model"
my_model = Model.load_from_checkpoint(PATH)
my_model.freeze()
prediction = my_model(new_text)

# Possible improvements


*   Stack Convolution layers (similar to InceptionNet)
*   Use Multi-Head Attention instead of CNNs (similar to Hie-BART)

