<a href="https://colab.research.google.com/github/kperv/summarizer_app/blob/main/summarization_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Extractive summarization task***

### Main highlights

*   Supported languages are Russian and Spanish.
*   Text is split into sentences with spaCy.
*   Sentences are tokenized with a pretrained BertTokenizer.
*   The framework is PyTorch (with PyTorch Lightning).


### Neural Net architecture

The idea attempted here is to train a centroid in a vector space using stacked convolutional layers.
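
As a rough illustration of this idea (a sketch under assumptions, not the trained model of this notebook): a stack of 1-D convolutions runs over the sentence vectors of one document and is pooled into a single centroid vector, and sentences are then ranked by their similarity to that centroid. The names `CentroidNet` and `rank_sentences`, the layer sizes, and the kernel width are hypothetical.

```python
import torch
from torch import nn
import torch.nn.functional as F


class CentroidNet(nn.Module):
    """Hypothetical sketch: stacked Conv1d layers that map a document's
    sentence vectors to one centroid vector of the same dimension."""

    def __init__(self, dim=768, hidden=256, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged
        self.convs = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size, padding=padding),
            nn.ReLU(),
            nn.Conv1d(hidden, dim, kernel_size, padding=padding),
        )

    def forward(self, sentence_vectors):
        # sentence_vectors: (batch, num_sentences, dim)
        x = sentence_vectors.transpose(1, 2)  # Conv1d expects (batch, dim, length)
        x = self.convs(x)
        return x.mean(dim=2)                  # pooled centroid: (batch, dim)


def rank_sentences(sentence_vectors, centroid):
    """Return sentence indices sorted from most to least similar to the centroid."""
    sims = F.cosine_similarity(sentence_vectors, centroid.unsqueeze(0), dim=1)
    return torch.argsort(sims, descending=True)


torch.manual_seed(0)
doc = torch.randn(1, 5, 768)   # one document with 5 random stand-in sentence vectors
net = CentroidNet()
centroid = net(doc)            # shape (1, 768)
order = rank_sentences(doc[0], centroid[0])
```

The first few indices in `order` would form the extractive summary. The network here is untrained, so the ranking is only meaningful after fitting it against reference summaries.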

### Benchmark

Apply a clustering algorithm to aggregated word vectors to get a result quickly.
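
One possible shape of such a benchmark is sketched below, assuming mean aggregation of word vectors and a plain k-means written in NumPy. The notebook does not say which clustering algorithm is meant, so the choice of k-means, the helper names, and the toy 2-D vectors are all assumptions made for illustration.

```python
import numpy as np


def sentence_vectors(word_vectors_per_sentence):
    """Aggregate word vectors into one vector per sentence by averaging."""
    return np.stack([np.mean(words, axis=0) for words in word_vectors_per_sentence])


def kmeans(points, k, n_iter=20):
    """Plain k-means. The first k points are the initial centroids
    (an assumption made here so that the result is deterministic)."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Distance of every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def select_summary_indices(points, k):
    """Pick, for each cluster, the sentence closest to its centroid."""
    centroids, _ = kmeans(points, k)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(dists.argmin(axis=0).tolist()))


# Toy document: 4 sentences, each given as a list of 2-D stand-in word vectors.
toy = [
    np.array([[0.0, 0.0]]),               # sentence 0
    np.array([[10.0, 10.0]]),             # sentence 1
    np.array([[0.5, 0.0], [1.5, 0.0]]),   # sentence 2, mean (1.0, 0.0)
    np.array([[0.4, 0.0]]),               # sentence 3
]
vectors = sentence_vectors(toy)
chosen = select_summary_indices(vectors, k=2)
```

In this toy case sentences 0, 2 and 3 fall into one cluster and sentence 1 into the other, so one representative is chosen from each.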

### Metric: BERTScore


*   Supports the target languages
*   Based on vector distance, which is better suited for the task
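
The vector-distance idea behind the metric can be sketched as follows. This is a simplified illustration of greedy cosine matching between token embeddings, without the IDF weighting or baseline rescaling that the `bert_score` package offers. The function name `greedy_match_scores` and the toy embeddings are assumptions.

```python
import numpy as np


def greedy_match_scores(cand_emb, ref_emb):
    """Precision, recall and F1 from token embeddings via greedy cosine matching.

    cand_emb: (num_candidate_tokens, dim), ref_emb: (num_reference_tokens, dim).
    Simplification: no IDF weighting and no baseline rescaling.
    """
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                   # cosine similarities, (n_cand, n_ref)
    precision = sim.max(axis=1).mean()   # best match for each candidate token
    recall = sim.max(axis=0).mean()      # best match for each reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


cand_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p, r, f1 = greedy_match_scores(cand_tokens, ref_tokens)
```

Here every candidate token has an exact match in the reference, so precision is 1, while the third reference token is only partly covered, which lowers recall.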



# **Installations and imports**

In [1]:
%%capture
!pip install transformers[sentencepiece]
!pip install pytorch-lightning
!pip install -U bert_score
!pip install datasets
!pip install https://huggingface.co/spacy/ru_core_news_md/resolve/main/ru_core_news_md-any-py3-none-any.whl
!pip install https://huggingface.co/spacy/es_core_news_md/resolve/main/es_core_news_md-any-py3-none-any.whl


import os
import numpy as np
import torch
import spacy
import transformers
import datasets
import pytorch_lightning as pl
import bert_score
import torch.nn.functional as F


from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
from datasets import load_dataset
from transformers import BertTokenizer, BertModel
from bert_score import score
from torch import nn
from torchvision import transforms
from torch.utils.data import DataLoader

In [2]:
pl.seed_everything(42)

## Requirements

In [3]:
print("Spacy = ", spacy.__version__)
print("torch = ", torch.__version__)
print("numpy = ", np.__version__)
print("datasets = ", datasets.__version__)
print("Transformers = ", transformers.__version__)
print("PyTorch Lightning = ", pl.__version__)
print("Bert Score = ", bert_score.__version__)

Spacy =  3.1.1
torch =  1.9.0+cu102
numpy =  1.19.5
datasets =  1.11.0
Transformers =  4.9.2
PyTorch Lightning =  1.4.2
Bert Score =  0.3.10


## Examples and checks of BERTScore

Each prediction is a news headline and each reference is the first paragraph of the same article, so high scores are expected.

In [4]:
predictions = ["Аналитик прокомментировал предстоящий визит Меркель в Москву"]
references = ["Ведущий научный сотрудник Центра германских исследований Института Европы РАН Александр Камкин прокомментировал в беседе с RT сообщение о том, что канцлер Германии Ангела Меркель и президент России Владимир Путин проведут переговоры в Москве 20 августа"]
P, R, F1 = score(predictions, references, lang='ru')
print(f"System level F1 score: {F1.mean():.3f}")
print(f"System level P score: {P.mean():.3f}")
print(f"System level R score: {R.mean():.3f}")

System level F1 score: 0.680
System level P score: 0.738
System level R score: 0.632


In [5]:
predictions = ["Qué sabemos sobre el diálogo entre la oposición y Maduro que está programado para comenzar este viernes en México"]
references = ["Tras años de tensiones, protestas y negociaciones estancadas, en medio de una situación económica muy deteriorada y complicada aún más por la pandemia de covid-19, el gobierno de Venezuela y la oposición intentará por quinta vez llegar a una solución para la crisis política mediante un diálogo, esta vez en México."]
P, R, F1 = score(predictions, references, lang='es')
print(f"System level F1 score: {F1.mean():.3f}")
print(f"System level P score: {P.mean():.3f}")
print(f"System level R score: {R.mean():.3f}")

System level F1 score: 0.636
System level P score: 0.663
System level R score: 0.611


## **Dataset: MLSUM, the large-scale MultiLingual SUMmarization dataset**

### Main info:

**es**

Size of downloaded dataset files: 489.53 MB

Size of the generated dataset: 1274.55 MB

Total amount of disk used: 1764.09 MB

**Data Splits:** Train 266367, Val 10358, Test 13920



**ru**

Size of downloaded dataset files: 101.30 MB

Size of the generated dataset: 263.38 MB

Total amount of disk used: 364.68 MB

**Data Splits:** Train 25556, Val 750, Test 757

In [6]:
%%capture
dataset_ru = load_dataset("mlsum", "ru")
dataset_es = load_dataset("mlsum", "es")

nlp_ru = spacy.load("ru_core_news_md")
nlp_es = spacy.load("es_core_news_md")

In [7]:
dataset_ru.num_rows

{'test': 757, 'train': 25556, 'validation': 750}

# ***Clustering solution***

In [8]:
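def break_text(text, nlp=nlp_es):
  """Split `text` into sentences with a spaCy pipeline (Spanish by default)."""
  # Pass `nlp_ru` for Russian text.
  doc = nlp(text)
  # Sentence boundaries must be set, otherwise `doc.sents` is unavailable.
  assert doc.has_annotation("SENT_START")
  sentences = [str(sent) for sent in doc.sents]
  return sentences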

In [11]:
def tokenize_function(example):
    # MLSUM stores the article body under "text" and the reference summary under "summary".
    return tokenizer(example["text"], example["summary"], truncation=True)

In [10]:
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
encoded_sentences = tokenizer(predictions, padding=True, truncation=True, return_tensors="pt")
#decoded_string = tokenizer.decode(encoded_sentences.input_ids[0])
# Pass the attention mask together with the input ids so that padding is ignored;
# gradients are not needed for feature extraction.
with torch.no_grad():
    outputs = model(**encoded_sentences)
embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
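
To turn token embeddings like these into one vector per sentence, a common choice is mean pooling over the non-padding tokens. The sketch below assumes that aggregation (the notebook does not fix one) and uses small stand-in tensors so it runs without downloading a model. `mean_pool` is a hypothetical helper name.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings per sentence, ignoring padded positions.

    last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Stand-in tensors: one sentence with three token positions, the last one is padding.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # (1, 3, 2)
mask = torch.tensor([[1, 1, 0]])
sentence_vector = mean_pool(hidden, mask)
```

With the tensors from the cell above, the call would be `mean_pool(embeddings, encoded_sentences.attention_mask)`.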

In [9]:

#tokenized_sents = []
#ids = []
#for sent in break_text(text):
#  tokens = tokenizer.tokenize(sent)
#  tokenized_sents.append(tokens)
#  id = tokenizer.convert_tokens_to_ids(tokens)
#  ids.append(id)
#input_ids = torch.tensor([ids[0]])
#input_ids

## *Check GPU availability for long training*

In [18]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [19]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cpu')

# ***Neural Net solution***

In [21]:
class MlSumDataModule(pl.LightningDataModule):
  """Serves the Russian and Spanish MLSUM splits as lists of DataLoaders."""

  def __init__(self, batch_size=32):
    super().__init__()
    self.batch_size = batch_size

  def prepare_data(self):
    # Download only; state is assigned in `setup`.
    load_dataset("mlsum", "ru")
    load_dataset("mlsum", "es")

  def setup(self, stage=None):
    self.dataset_ru = load_dataset("mlsum", "ru")
    self.dataset_es = load_dataset("mlsum", "es")

  def _make_loaders(self, split):
    # One DataLoader per language for the given split.
    return [
        DataLoader(dataset[split], batch_size=self.batch_size)
        for dataset in (self.dataset_ru, self.dataset_es)
    ]

  def train_dataloader(self):
    return self._make_loaders("train")

  def val_dataloader(self):
    # The split is named "validation" in MLSUM, not "val".
    return self._make_loaders("validation")

  def test_dataloader(self):
    return self._make_loaders("test")
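
A `DataLoader` built directly on MLSUM records yields batches of raw strings, while the training code unpacks each batch as a pair. One way to bridge that gap is a custom `collate_fn`, sketched below. The function name `collate_text_summary` is hypothetical, and a real pipeline would still need to encode the strings into tensors before they reach the model.

```python
def collate_text_summary(records):
    """Turn a list of MLSUM-style records into parallel lists of texts and summaries."""
    texts = [record["text"] for record in records]
    summaries = [record["summary"] for record in records]
    return texts, summaries


# Two stand-in records with the "text" and "summary" fields used by MLSUM.
batch = collate_text_summary([
    {"text": "First article body.", "summary": "First summary."},
    {"text": "Second article body.", "summary": "Second summary."},
])
texts, summaries = batch
```

It would be passed to the loader as `DataLoader(dataset, batch_size=32, collate_fn=collate_text_summary)`.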

In [29]:
class Model(pl.LightningModule):
  def __init__(self, in_dim=768, out_dim=768, lr=1e-3):
    # 768 is the hidden size of bert-base-multilingual-cased; the output size
    # and the learning rate are placeholder defaults.
    super().__init__()
    self.lr = lr
    self.l1 = nn.Linear(in_dim, out_dim)

  def forward(self, x):
    return torch.relu(self.l1(x))

  def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    # pairwise_distance returns one value per row; average to get a scalar loss.
    loss = F.pairwise_distance(y_hat, y).mean()
    self.log("train_loss", loss, on_epoch=True)
    return loss

  def validation_step(self, batch, batch_idx, dataloader_idx=0):
    x, y = batch
    y_hat = self(x)
    loss = F.pairwise_distance(y_hat, y).mean()
    self.log("val_loss", loss, on_epoch=True)

  def test_step(self, batch, batch_idx, dataloader_idx=0):
    x, y = batch
    y_hat = self(x)
    loss = F.pairwise_distance(y_hat, y).mean()
    self.log("test_loss", loss, on_epoch=True)


  def configure_optimizers(self):
    return torch.optim.Adam(self.parameters(), lr=self.lr)
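
A note on the loss used in the steps above: `F.pairwise_distance` returns one distance per row of the batch, whereas `backward()` needs a scalar, so the distances have to be reduced first. The small check below illustrates this with stand-in tensors. Taking the mean is one possible reduction and is an assumption here.

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([[0.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[3.0, 4.0], [1.0, 1.0]])

per_example = F.pairwise_distance(y_hat, y)  # one distance per row, shape (2,)
loss = per_example.mean()                    # scalar suitable for backpropagation
```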

In [30]:
trainer = pl.Trainer()
model = Model()
datamodule = MlSumDataModule()

# The data module supplies the train, validation and test loaders.
trainer.fit(model, datamodule=datamodule)
trainer.validate(model, datamodule=datamodule)
trainer.test(model, datamodule=datamodule)

TypeError: ignored

In [None]:
# call after training
datamodule = MlSumDataModule()
trainer = pl.Trainer()
trainer.fit(model, datamodule=datamodule)
trainer.test(datamodule=datamodule)

In [None]:
# or call with pretrained model
# PATH is the path to a saved checkpoint.
model = Model.load_from_checkpoint(PATH)
trainer = pl.Trainer()
trainer.test(model, datamodule=MlSumDataModule())

In [None]:
# get predictions
PATH = "../saved_model"
my_model = Model.load_from_checkpoint(PATH)
my_model.freeze()
# `new_text` is a placeholder for the model input built from a new document.
prediction = my_model(new_text)

# Further improvements



*   Stack convolutional layers (similar to InceptionNet)
*   Use Multi-Head Attention instead of CNNs (similar to Hie-BART)

