<a href="https://colab.research.google.com/github/kperv/summarizer_app/blob/main/T5small_mlsum_ru.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization project
### for ods.ai Natural Language Processing course

Pretrained language model **t5-small**
(60 million parameters)

https://arxiv.org/abs/1910.10683

https://github.com/google-research/text-to-text-transfer-transformer

**Dataset** is the Russian part of MLSUM, the Large-scale MultiLingual SUMmarization dataset.

Splits:

| Train | Val | Test |
|-------|-----|------|
| 25556 | 750 | 757  |

https://github.com/ThomasScialom/MLSUM

The project is based on the official **hf.co tutorial**

https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb

In [1]:
%%capture
! pip install datasets nltk torch
! pip install rouge==0.3.1
! pip install -U transformers

In [2]:
!nvidia-smi

Tue Dec 14 08:06:40 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
import nltk
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import DatasetDict
from datasets import Dataset
from datasets import load_dataset
from rouge import Rouge

nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Fine-tuning a model on a summarization task

### Loading the dataset

In [4]:
dataset = load_dataset("mlsum", 'ru')


Downloading and preparing dataset mlsum/ru (download: 101.30 MiB, generated: 263.38 MiB, post-processed: Unknown size, total: 364.68 MiB) to /root/.cache/huggingface/datasets/mlsum/ru/1.0.0/77f23eb185781f439927ac2569ab1da1083195d8b2dab2b2f6bbe52feb600688...



Dataset mlsum downloaded and prepared to /root/.cache/huggingface/datasets/mlsum/ru/1.0.0/77f23eb185781f439927ac2569ab1da1083195d8b2dab2b2f6bbe52feb600688. Subsequent calls will reuse this data.



In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 25556
    })
    validation: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 750
    })
    test: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 757
    })
})

### Preprocessing the data

In [6]:
model_checkpoint = "t5-small"

In [7]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [8]:
max_input_length = 1024
max_target_length = 128


def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
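
To make the effect of the task prefix and truncation concrete, here is a minimal sketch that uses a whitespace split as a stand-in for the real tokenizer. `toy_tokenize` and `toy_preprocess` are hypothetical helpers for illustration only and are not part of `transformers`; the actual T5 tokenizer produces subword IDs rather than words.

```python
PREFIX = "summarize: "  # task prefix, as in preprocess_function


def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' (an assumption for illustration) with truncation."""
    return text.split()[:max_length]


def toy_preprocess(examples, max_input_length=8, max_target_length=4):
    # Prepend the task prefix to every document, then truncate inputs and targets
    inputs = [PREFIX + doc for doc in examples["text"]]
    model_inputs = {"input_ids": [toy_tokenize(doc, max_input_length) for doc in inputs]}
    model_inputs["labels"] = [
        toy_tokenize(summary, max_target_length) for summary in examples["summary"]
    ]
    return model_inputs


batch = {
    "text": ["one two three four five six seven eight nine ten"],
    "summary": ["one two three four five"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # the prefix token plus the first seven words
print(out["labels"][0])     # the first four words of the summary
```

Note that the prefix counts against the input length budget, so a long article loses slightly more of its tail than `max_input_length` alone suggests.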

In [9]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)


### Fine-tuning the model

In [None]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

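# Fall back to CPU when no GPU is available (note that fp16=True below requires CUDA)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)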

In [11]:
batch_size = 4
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-mlsum-ru",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    logging_dir='logs',
)
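
With these settings, the number of optimizer steps can be estimated from the training split size. The sketch below is a back-of-the-envelope calculation, assuming a single GPU and no gradient accumulation (both are assumptions, not values set explicitly in the arguments above).

```python
import math

num_train_examples = 25556       # size of the train split
batch_size = 4                   # per_device_train_batch_size
num_train_epochs = 5
num_devices = 1                  # assumption: a single GPU
gradient_accumulation_steps = 1  # assumption: no gradient accumulation

# Number of examples consumed per optimizer step
effective_batch = batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 6389 31945
```

The total agrees with the "Total optimization steps" value printed in the training log.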

In [12]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
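
The collator pads each batch to its longest sequence. As a rough illustration of the idea, the sketch below pads inputs with the pad token id and labels with -100, the value that the cross-entropy loss ignores. `pad_batch` is a hypothetical toy function, not the library implementation; the real collator also returns tensors and can prepare decoder inputs.

```python
PAD_TOKEN_ID = 0     # pad token id of the T5 tokenizer
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad every sequence in the batch to the longest one (toy collator)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [9, 1], "labels": [8, 3, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[21, 13, 7, 1], [9, 1, 0, 0]]
print(batch["labels"])     # [[5, 1, -100], [8, 3, 1]]
```

This is why `compute_metrics` has to replace -100 with the pad token id before decoding the labels.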

In [13]:
def compute_metrics(eval_pred):
    rouge = Rouge()
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    decoded_preds = [nltk.sent_tokenize(pred.strip()) for pred in decoded_preds]
    decoded_labels = [nltk.sent_tokenize(label.strip()) for label in decoded_labels]
    
    # Guard against empty outputs with a one-letter placeholder (Cyrillic 'а')
    decoded_preds = [pred if len(pred) else ['а'] for pred in decoded_preds]
    decoded_labels = [label if len(label) else ['а'] for label in decoded_labels]


    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(pred) for pred in decoded_preds]
    decoded_labels = ["\n".join(label) for label in decoded_labels]
    
    result = rouge.get_scores(hyps=decoded_preds, refs=decoded_labels, avg=True)
    # Keep the F-measure of each ROUGE variant, expressed as a percentage
    result = {key: value['f'] * 100 for key, value in result.items()}
    
    return {k: round(v, 4) for k, v in result.items()}
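
For intuition about the numbers this function returns, here is a minimal sketch of a ROUGE-N F-measure using the textbook clipped n-gram overlap over whitespace tokens. `ngrams` and `rouge_n_f1` are hypothetical helpers; the `rouge` package may differ in tokenization and counting details, so the values are not guaranteed to match it exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(hypothesis, reference, n=1):
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


hyp = "the cat sat on the mat"
ref = "the cat lay on the mat"
print(round(rouge_n_f1(hyp, ref, n=1) * 100, 4))  # 83.3333
print(round(rouge_n_f1(hyp, ref, n=2) * 100, 4))  # 60.0
```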

In [14]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Using amp half precision backend


In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: url, summary, topic, title, date, text.
***** Running training *****
  Num examples = 25556
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 31945


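| Epoch | Training Loss | Validation Loss | Rouge-1 | Rouge-2 | Rouge-l |
|-------|---------------|-----------------|---------|---------|---------|
| 1 | 2.2135 | 2.185042 | 2.6589 | 0.2871 | 1.8472 |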


Saving model checkpoint to t5-small-mlsum-ru/checkpoint-500
Configuration saved in t5-small-mlsum-ru/checkpoint-500/config.json
Model weights saved in t5-small-mlsum-ru/checkpoint-500/pytorch_model.bin
tokenizer config file saved in t5-small-mlsum-ru/checkpoint-500/tokenizer_config.json
Special tokens file saved in t5-small-mlsum-ru/checkpoint-500/special_tokens_map.json
Saving model checkpoint to t5-small-mlsum-ru/checkpoint-1000
Configuration saved in t5-small-mlsum-ru/checkpoint-1000/config.json
Model weights saved in t5-small-mlsum-ru/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in t5-small-mlsum-ru/checkpoint-1000/tokenizer_config.json
Special tokens file saved in t5-small-mlsum-ru/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to t5-small-mlsum-ru/checkpoint-1500
Configuration saved in t5-small-mlsum-ru/checkpoint-1500/config.json
Model weights saved in t5-small-mlsum-ru/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in t5-small-ml

### Get model predictions

In [None]:
predictions = trainer.predict(tokenized_datasets["test"])

In [None]:
summaries = tokenizer.batch_decode(predictions.predictions, skip_special_tokens=True)

In [None]:
dataset['test'].set_format('pandas')
test_df = dataset['test'][:]
test_df = test_df.drop(columns=['summary'])
test_df['summary'] = summaries
test_df.head(5)

### Calculate Rouge scores

In [None]:
def get_rouge_score(sample):
    """Score one row of test_df with ROUGE F-measures.

    Note: the generated summary in sample['summary'] is compared against the
    article text in sample['text'], not against a reference summary.
    """
    rouge = Rouge()
    preprocess_exs = lambda exs: [ex.strip().lower() for ex in exs]
    predictions = preprocess_exs([sample['summary']])
    references = preprocess_exs([sample['text']])
    # Guard against empty predictions with a one-letter placeholder
    predictions = [pred if len(pred) else 'а' for pred in predictions]
    rouge_scores = rouge.get_scores(predictions, references, avg=True)
    return {k: round(v['f'], 3) for k, v in rouge_scores.items()}
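
The `rouge-l` score is based on the longest common subsequence (LCS) of the two texts. A simplified version over lowercased whitespace tokens can be sketched as follows. `lcs_length` and `rouge_l_f1` are hypothetical helpers, and this plain F1 is an assumption for illustration; it is not necessarily identical to what the `rouge` package computes.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


generated = "Police arrested two suspects"
reference = "Police arrested two suspects on Monday"
print(round(rouge_l_f1(generated, reference), 3))  # 0.8
```

Because recall is divided by the reference length, scoring a short summary against a much longer text drives the F-measure down even when every generated word appears in that text.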

In [None]:
def add_metrics(dataset):
    # Work on a copy so the caller's DataFrame is not modified
    dataset = dataset.loc[:, ['text', 'summary']].copy()
    # One dict of ROUGE scores per row, expanded into columns
    scores = pd.DataFrame(
        list(dataset.apply(get_rouge_score, axis=1)), index=dataset.index
    )
    dataset = dataset.join(scores)
    return dataset.reindex(
        columns=['text', 'summary', 'rouge-1', 'rouge-2', 'rouge-l']
    )

In [None]:
test_df = add_metrics(test_df)
test_df.head()

In [None]:
round(test_df['rouge-1'].mean(), 3)

In [None]:
round(test_df['rouge-2'].mean(), 3)

In [None]:
round(test_df['rouge-l'].mean(), 3)