# Title scoring model

Entered or generated titles need to be evaluated. The main goal is to increase the number of views, so the view counts of previously published articles serve as the criterion. Articles are ranked by view count and divided into "well-viewed" (class 1) and "rarely read" (class 0). The classifier outputs the probability that an article falls into the well-viewed class, and the score is obtained by multiplying this probability by 10.
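
As a minimal sketch of this scoring rule, assume the classifier returns two raw logits (for classes 0 and 1) that are converted to probabilities with a softmax. The helper name `title_score` and the example logits are hypothetical and only illustrate the arithmetic:

```python
import math

def title_score(logit_rare: float, logit_well: float) -> float:
    """Map two class logits to a 0-10 score (hypothetical helper)."""
    # Softmax over the two classes, shifted by the max for numerical stability.
    m = max(logit_rare, logit_well)
    e_rare = math.exp(logit_rare - m)
    e_well = math.exp(logit_well - m)
    p_well = e_well / (e_rare + e_well)  # probability of the "well-viewed" class
    return 10.0 * p_well

# Equal logits mean probability 0.5, so the score is 5.0.
print(round(title_score(0.0, 0.0), 2))
```

A title the classifier is confident about lands close to 10, and one it considers unlikely to be read lands close to 0.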

# 1. Preparing data for the title scoring model



In [1]:
import os
import gc
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import numpy as np
import pandas as pd
import pickle

try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    data_path='/content/gdrive/My Drive/Colab Notebooks/title/data'
except ImportError:
    # not running in Colab: fall back to a local path
    data_path='../../DATASETS/IT_TEXTS/PREPROCESSING'

In [2]:
Xy = pd.read_feather(f'{data_path}/Xy.feather')
Xy.fillna('', inplace=True)
X, y = Xy[['title', 'summary']], Xy['class']
del Xy

Let's split the data into training, validation and test sets. The validation set is used for evaluation and tuning; the test set stays untouched and is used only to evaluate the final model.

In [3]:
from sklearn.model_selection import train_test_split
train_texts, test_texts, train_labels, test_labels = train_test_split(X, y, test_size=.2, random_state=0)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2, random_state=0)
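
Because the second split is applied to what is left after the first one, the effective proportions are not 60/20/20. A quick check of the arithmetic, using only the two `test_size` values from the cell above (the variable names are illustrative):

```python
test_size = 0.2  # fraction held out in the first split
val_size = 0.2   # fraction of the remainder held out in the second split

test_frac = test_size
val_frac = (1 - test_size) * val_size
train_frac = (1 - test_size) * (1 - val_size)

print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# train=0.64, val=0.16, test=0.20
```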

# 2. Building a deep learning model

To work with texts, we use the [transformers](https://huggingface.co/transformers/) library, which is well suited to natural language understanding (NLU) and natural language generation (NLG) tasks. The library provides a convenient interface to pretrained NLP models based on the transformer architecture. These are PyTorch models that can be converted to TensorFlow models and back.

In [4]:
# %%bash
# pip3 install transformers

In [5]:
from transformers import BertTokenizer
from transformers import BertForSequenceClassification

model_name = 'DeepPavlov/rubert-base-cased'
model = BertForSequenceClassification.from_pretrained(model_name,
                                                      num_labels=2,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

# if this raises an error, consider the fast version of the tokenizer
# https://fantashit.com/autotokenizer-from-pretrained-bert-throws-typeerror-when-encoding-certain-input/
tokenizer = BertTokenizer.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at DeepPavlov/rubert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Embedding(119547, 768, padding_idx=0)

Let's pass the texts to the tokenizer. The `padding=True` flag pads every sequence to the length of the longest one in the batch, and `truncation=True` cuts sequences that exceed `max_length` tokens.
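
To make the effect of the two flags concrete, here is a plain-Python sketch of padding and truncation on lists of token ids. It is not the tokenizer's implementation: the real tokenizer also handles special tokens and text pairs. The helper `pad_and_truncate` and the token ids are made up for illustration:

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Illustrative only: mimics the effect of truncation=True and padding=True."""
    # Truncation: cut every sequence to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the longest length left in the batch.
    target_len = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (target_len - len(seq)) for seq in truncated]
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(ids)   # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```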

In [6]:
# max_length = 96
# train_encodings = tokenizer(train_texts['title'].to_list(),
#                             train_texts['summary'].to_list(),
#                             truncation=True,
#                             padding=True,
#                             max_length=max_length,
#                             return_tensors="pt")

# val_encodings = tokenizer(val_texts['title'].to_list(),
#                           val_texts['summary'].to_list(),
#                           truncation=True,
#                           padding=True,
#                           max_length=max_length,
#                           return_tensors="pt")

# test_encodings = tokenizer(test_texts['title'].to_list(),
#                            test_texts['summary'].to_list(),
#                            truncation=True,
#                            padding=True,
#                            max_length=max_length,
#                            return_tensors="pt")

In [7]:
# with open(f'{data_path}/train_encodings.pickle', 'wb') as f:
#     pickle.dump(train_encodings, f)

# with open(f'{data_path}/val_encodings.pickle', 'wb') as f:
#     pickle.dump(val_encodings, f)

# with open(f'{data_path}/test_encodings.pickle', 'wb') as f:
#     pickle.dump(test_encodings, f)

In [8]:
# Tokenization takes a long time, so the encodings are pickled once and loaded from disk on later runs.

with open(f'{data_path}/train_encodings.pickle', 'rb') as f:
    train_encodings = pickle.load(f)

with open(f'{data_path}/val_encodings.pickle', 'rb') as f:
    val_encodings = pickle.load(f)

with open(f'{data_path}/test_encodings.pickle', 'rb') as f:
    test_encodings = pickle.load(f)
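
The comment/uncomment workflow above can also be written as a single cache-or-compute helper. This is a sketch under assumptions: `load_or_compute` and `slow_tokenize` are hypothetical names, and the dummy dictionary stands in for the real tokenizer output:

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it first if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_tokenize():
    # Stand-in for the expensive tokenizer call; records how often it runs.
    calls.append(1)
    return {'input_ids': [[101, 102]]}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'encodings.pickle')
    first = load_or_compute(cache_file, slow_tokenize)   # computes and saves
    second = load_or_compute(cache_file, slow_tokenize)  # loads from disk

print(first == second, len(calls))  # True 1
```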

Now let's represent the encoded texts and their labels as a `Dataset` object. To do this, we subclass `torch.utils.data.Dataset` and implement the `__len__` and `__getitem__` methods. This lets the data be fed in batches and the model be trained through its `forward()` method.

In [9]:
import torch

class TitlesDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # val[idx] is already a tensor; clone().detach() copies it without the torch.tensor() warning
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    

train_dataset = TitlesDataset(train_encodings, train_labels.to_list())
val_dataset = TitlesDataset(val_encodings, val_labels.to_list())
test_dataset = TitlesDataset(test_encodings, test_labels.to_list())
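
To see why `__len__` and `__getitem__` are enough for batching, here is a torch-free sketch of the same protocol. `ToyDataset` and `iter_batches` are hypothetical stand-ins; a real `DataLoader` additionally shuffles and stacks tensors:

```python
class ToyDataset:
    """Same __len__/__getitem__ protocol as TitlesDataset, with plain lists."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

def iter_batches(dataset, batch_size):
    """Yield dicts of lists, roughly what a non-shuffling loader would collate."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        yield {key: [item[key] for item in items] for key in items[0]}

toy = ToyDataset({'input_ids': [[1, 2], [3, 4], [5, 6]]}, [0, 1, 1])
batches = list(iter_batches(toy, batch_size=2))
print(batches[0])    # {'input_ids': [[1, 2], [3, 4]], 'labels': [0, 1]}
print(len(batches))  # 2
```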

In [12]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=f'{data_path}/results',
    save_steps=1000,
    num_train_epochs=40,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    warmup_steps=500,
    logging_dir=f'{data_path}/logs',
    logging_steps=100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

gc.collect()
torch.cuda.empty_cache()
trainer.train()
#trainer.train(f"{data_path}/results/checkpoint-9000")



RuntimeError: CUDA out of memory. Tried to allocate 352.00 MiB (GPU 0; 1.96 GiB total capacity; 1007.75 MiB already allocated; 316.31 MiB free; 1.09 GiB reserved in total by PyTorch)

In [None]:
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
torch.cuda.empty_cache()

model_name = 'DeepPavlov/rubert-base-cased'
model = BertForSequenceClassification.from_pretrained(model_name,
                                                      num_labels=2,
                                                      output_attentions=False,
                                                      output_hidden_states=False)
model.to(device)
model.train()

train_loader = DataLoader(train_dataset,
                          batch_size=1,
                          shuffle=True)

# torch.optim.AdamW replaces the deprecated transformers.AdamW
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # segment ids tell BERT which tokens belong to the title and which to the summary
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        labels=labels)
        loss = outputs[0]  # the first output is the loss when labels are passed
        print(loss.item())
        loss.backward()
        optim.step()

model.eval()

In [None]:
!rm -rf sample_data