### Assignment  

1. Take the dataset  
https://huggingface.co/datasets/merionum/ru_paraphraser  
and solve the paraphrase task.  
2. (Optional additional task) Pick one of:  
https://huggingface.co/datasets/sberquad  
https://huggingface.co/datasets/blinoff/medical_qa_ru_data  
and train any model for a question-answering system.

In [1]:
import numpy as np
import torch

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction

### Loading the dataset

In [2]:
dataset = load_dataset('merionum/ru_paraphraser')
dataset

Using custom data configuration merionum--ru_paraphraser-46b7ccf402279b95
Reusing dataset json (C:\Users\Kartsev.ES\.cache\huggingface\datasets\merionum___json\merionum--ru_paraphraser-46b7ccf402279b95\0.0.0\da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5)


DatasetDict({
    train: Dataset({
        features: ['id', 'id_1', 'id_2', 'text_1', 'text_2', 'class'],
        num_rows: 7227
    })
    test: Dataset({
        features: ['id', 'id_1', 'id_2', 'text_1', 'text_2', 'class'],
        num_rows: 1924
    })
})

In [3]:
label_list = list(set(dataset['train']['class']))
label_list

['0', '1', '-1']

Since the source dataset contains pairs of sequences (`text_1` and `text_2`) and a class label (`class`), I treat the paraphrase task as a classification problem rather than a generation problem.
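
Because the three classes are mutually exclusive, the class strings could also be mapped directly to integer ids for single-label classification. The sketch below is only an illustration: `CLASS_NAMES`, `class_label2id`, `class_id2label` and `encode_classes` are hypothetical names that are not used elsewhere in this notebook, and sorting by integer value is an assumption made to get a stable order (the order of `list(set(...))` is not guaranteed).

```python
# Hypothetical helper names; the class strings match the `class` column shown above.
raw_classes = ['0', '1', '-1']  # order as returned by list(set(...)); may vary between runs

# Sort by integer value to get a reproducible order: ['-1', '0', '1']
CLASS_NAMES = sorted(raw_classes, key=int)
class_label2id = {name: idx for idx, name in enumerate(CLASS_NAMES)}
class_id2label = {idx: name for name, idx in class_label2id.items()}

def encode_classes(classes):
    """Map class strings such as '-1' to integer ids such as 0."""
    return [class_label2id[c] for c in classes]

print(class_label2id)
print(encode_classes(['0', '1', '-1']))
```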

### Preprocessing

In [4]:
def one_hot_encoding(example):
    example_class = example['class']
    oh1 = example_class =='-1'
    oh2 = example_class =='0'
    oh3 = example_class =='1'
    return {'class_-1': oh1, 'class_0': oh2, 'class_1': oh3}

In [5]:
ohe_dataset = dataset.map(one_hot_encoding)




In [6]:
ohe_dataset['train'][0]

{'id': '1',
 'id_1': '201',
 'id_2': '8159',
 'text_1': 'Полицейским разрешат стрелять на поражение по гражданам с травматикой.',
 'text_2': 'Полиции могут разрешить стрелять по хулиганам с травматикой.',
 'class': '0',
 'class_-1': False,
 'class_0': True,
 'class_1': False}

In [7]:
labels = ['class_-1', 'class_0', 'class_1']
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
id2label

{0: 'class_-1', 1: 'class_0', 2: 'class_1'}

In [8]:
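# Note: "bert-base-uncased" is an English checkpoint; its WordPiece vocabulary
# splits Russian text into very small pieces (see the token ids printed below).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")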

def preprocess_data(examples):
    # take a batch of texts
    text_1 = examples["text_1"]
    text_2 = examples["text_2"]
    # encode them
    encoding = tokenizer(text_1, text_2, padding="max_length", truncation=True, max_length=128)
    # add labels
    labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text_1), len(labels)))
    # fill numpy array
    for idx, label in enumerate(labels):
        labels_matrix[:, idx] = labels_batch[label]
        
    encoding["labels"] = labels_matrix.tolist()
    
    return encoding

In [9]:
encoded_dataset = ohe_dataset.map(preprocess_data, batched=True, remove_columns=ohe_dataset['train'].column_names)


In [10]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [11]:
example['labels']

[0.0, 1.0, 0.0]

In [12]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['class_0']

In [13]:
encoded_dataset.set_format("torch")

### Loading the model

In [14]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [15]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [16]:
encoded_dataset['train']['input_ids'][0]

tensor([  101,  1194, 14150, 29436, 10325, 29751, 15290, 10325, 29747, 23925,
        10325, 29745,  1195, 10260, 29744, 16856, 15290, 29753, 10260, 22919,
         1196, 22919, 16856, 15290, 29436, 17432, 22919, 23742,  1192, 10260,
         1194, 14150, 16856, 10260, 29743, 15290, 18947, 10325, 15290,  1194,
        14150,  1183, 16856, 10260, 29743, 29742, 28995, 10260, 29745,  1196,
         1197, 16856, 10260, 25529, 29745, 10260, 22919, 10325, 23925, 14150,
        10325,  1012,   102,  1194, 14150, 29436, 10325, 29751, 15414,  1191,
        14150, 29741, 29748, 22919,  1195, 10260, 29744, 16856, 15290, 29753,
        10325, 22919, 23742,  1196, 22919, 16856, 15290, 29436, 17432, 22919,
        23742,  1194, 14150,  1200, 29748, 29436, 10325, 29741, 28995, 10260,
        29745,  1196,  1197, 16856, 10260, 25529, 29745, 10260, 22919, 10325,
        23925, 14150, 10325,  1012,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])

In [17]:
# Forward pass on the first training example (no attention_mask is passed here)
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0),
                labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.7265, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.2958, -0.2201,  0.2289]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
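
The reported loss can be checked by hand. With `problem_type="multi_label_classification"` the loss is binary cross-entropy with logits (as the `BinaryCrossEntropyWithLogitsBackward0` node in the output indicates), averaged over the three outputs. The sketch below recomputes it in plain Python from the rounded logits printed above and, for comparison, computes the softmax cross-entropy that a single-label setup would use. The helper names are illustrative and not part of the notebook.

```python
import math

logits = [-0.2958, -0.2201, 0.2289]  # rounded values printed above
target = [0.0, 1.0, 0.0]             # one-hot label of the first training example

def bce_with_logits(logits, target):
    """Mean binary cross-entropy over independent sigmoid outputs."""
    total = 0.0
    for x, y in zip(logits, target):
        # -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))] = softplus(x) - y*x
        total += math.log1p(math.exp(x)) - y * x
    return total / len(logits)

def softmax_cross_entropy(logits, class_index):
    """Cross-entropy of a softmax distribution against a single true class."""
    log_sum_exp = math.log(sum(math.exp(x) for x in logits))
    return log_sum_exp - logits[class_index]

bce = bce_with_logits(logits, target)
ce = softmax_cross_entropy(logits, target.index(1.0))
print(f"BCE with logits: {bce:.4f}")  # agrees with the loss above up to rounding of the logits
print(f"Softmax cross-entropy: {ce:.4f}")
```

The two losses differ because the sigmoid-based loss treats each class as an independent yes/no decision, while softmax cross-entropy assumes exactly one class is correct.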

In [18]:
softmax = torch.nn.Softmax(dim=-1)

### Prediction before training

In [19]:
def predict(idx):
    # Print one test pair with its gold class, then the model's prediction
    example = dataset['test'][idx]
    print(f"Text_1: {example['text_1']}")
    print(f"Text_2: {example['text_2']}")
    print(f"Class: {example['class']}")
    print("-------------------------------------------")
    encoding = tokenizer(example['text_1'], example['text_2'], return_tensors="pt")
    encoding = {k: v.to(model.device) for k, v in encoding.items()}
    # Inference only: no gradients are needed
    with torch.no_grad():
        outputs = model(**encoding)
    probs = softmax(outputs.logits.squeeze().cpu()).numpy()
    print(f"Probs: {probs}")
    print(f"Label: {model.config.id2label[int(probs.argmax(axis=-1))]}")

In [20]:
predict(0)

Text_1: Цены на нефть восстанавливаются
Text_2: Парламент Словакии поблагодарил народы бывшего СССР за Победу
Class: -1
-------------------------------------------
Probs: [0.32632005 0.25832552 0.41535443]
Label: class_1


In [21]:
predict(5)

Text_1: Вертолет с 11 иностранцами на борту упал в Пакистане
Text_2: В Пакистане упал вертолет с 11 иностранцами
Class: 1
-------------------------------------------
Probs: [0.37363523 0.23425287 0.39211184]
Label: class_1


### Training

In [22]:
batch_size = 8
metric_name = "f1"

In [23]:
small_train_dataset = encoded_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = encoded_dataset["test"].shuffle(seed=42).select(range(200))

In [24]:
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # `threshold` is unused: the sigmoid + threshold step of the source
    # was replaced here by softmax + argmax
    # first, apply softmax on predictions which are of shape (batch_size, num_labels)
    softmax = torch.nn.Softmax(dim=-1)
    probs = softmax(torch.Tensor(predictions))

    # next, use argmax to turn them into one-hot predictions (one class per example)
    y_pred = np.zeros(probs.shape)
    y_pred[np.arange(len(y_pred)), probs.argmax(dim=-1).numpy()] = 1

    # finally, compute metrics against the one-hot ground truth
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed on hard 0/1 predictions here, not on probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average='micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
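
With one-hot argmax predictions, every mistake contributes exactly one false positive and one false negative, so micro-averaged F1 coincides with accuracy, and a micro ROC AUC computed on hard 0/1 predictions reduces to the mean of the true-positive and true-negative rates. The plain-Python sketch below illustrates this on synthetic labels. The data and the helper names `one_hot` and `micro_metrics` are made up for illustration and do not reproduce `sklearn` internals.

```python
def one_hot(index, num_classes=3):
    return [1 if i == index else 0 for i in range(num_classes)]

def micro_metrics(y_true, y_pred):
    """Micro F1, accuracy and hard-label micro ROC AUC for one-hot rows."""
    tp = fp = fn = tn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t == 1 and p == 1
            fp += t == 0 and p == 1
            fn += t == 1 and p == 0
            tn += t == 0 and p == 0
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # With a single operating point, the area under the ROC curve is (TPR + TNR) / 2
    roc_auc = (tp / (tp + fn) + tn / (tn + fp)) / 2
    return {"f1": f1, "accuracy": accuracy, "roc_auc": roc_auc}

# Synthetic data: 200 examples, 109 of them classified correctly
true_ids = [i % 3 for i in range(200)]
pred_ids = [c if i < 109 else (c + 1) % 3 for i, c in enumerate(true_ids)]
metrics = micro_metrics([one_hot(c) for c in true_ids],
                        [one_hot(c) for c in pred_ids])
print(metrics)
```

For 109 correct predictions out of 200 this gives F1 = accuracy = 0.545 and ROC AUC = 0.65875, the same values as the epoch 4 row of the training log, which is consistent with the F1 and Accuracy columns being identical there.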

In [25]:
args = TrainingArguments(
    "bert-finetuned",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

In [26]:
trainer = Trainer(
    model,
    args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [27]:
trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 625


Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,No log,0.603961,0.45,0.5875,0.45
2,No log,0.576186,0.49,0.6175,0.49
3,No log,0.597678,0.525,0.64375,0.525
4,0.545300,0.593498,0.545,0.65875,0.545
5,0.545300,0.618576,0.495,0.62125,0.495


***** Running Evaluation *****
  Num examples = 200
  Batch size = 8
Saving model checkpoint to bert-finetuned\checkpoint-125
Configuration saved in bert-finetuned\checkpoint-125\config.json
Model weights saved in bert-finetuned\checkpoint-125\pytorch_model.bin
tokenizer config file saved in bert-finetuned\checkpoint-125\tokenizer_config.json
Special tokens file saved in bert-finetuned\checkpoint-125\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8
Saving model checkpoint to bert-finetuned\checkpoint-250
Configuration saved in bert-finetuned\checkpoint-250\config.json
Model weights saved in bert-finetuned\checkpoint-250\pytorch_model.bin
tokenizer config file saved in bert-finetuned\checkpoint-250\tokenizer_config.json
Special tokens file saved in bert-finetuned\checkpoint-250\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8
Saving model checkpoint to bert-finetuned\checkpoint-375
Configuration save

TrainOutput(global_step=625, training_loss=0.5190874633789062, metrics={'train_runtime': 2556.3043, 'train_samples_per_second': 1.956, 'train_steps_per_second': 0.244, 'total_flos': 328891772160000.0, 'train_loss': 0.5190874633789062, 'epoch': 5.0})
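
The step count and throughput figures in the log follow from simple arithmetic. A minimal check, assuming a single device and no gradient accumulation (as the log states); the variable names are illustrative:

```python
import math

num_examples = 1000        # size of small_train_dataset
batch_size = 8
num_epochs = 5
train_runtime = 2556.3043  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

With `save_strategy="epoch"`, one checkpoint is written per epoch, that is every 125 steps, which matches the `checkpoint-125`, `checkpoint-250` and `checkpoint-375` directories in the log.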

In [28]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'eval_loss': 0.5934982299804688,
 'eval_f1': 0.545,
 'eval_roc_auc': 0.65875,
 'eval_accuracy': 0.545,
 'eval_runtime': 22.7767,
 'eval_samples_per_second': 8.781,
 'eval_steps_per_second': 1.098,
 'epoch': 5.0}
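
The values returned by `trainer.evaluate()` match the epoch 4 row of the training table rather than the last epoch, which is consistent with `load_best_model_at_end=True` and `metric_for_best_model="f1"`. The sketch below reproduces that selection from the logged rows; the `history` list is typed in by hand from the table above and is not an object produced by the `Trainer`.

```python
# Rows copied from the training table: (epoch, validation loss, F1)
history = [
    (1, 0.603961, 0.450),
    (2, 0.576186, 0.490),
    (3, 0.597678, 0.525),
    (4, 0.593498, 0.545),
    (5, 0.618576, 0.495),
]

# Pick the epoch with the highest F1 (higher is better for this metric)
best_epoch, best_loss, best_f1 = max(history, key=lambda row: row[2])
print(best_epoch, best_loss, best_f1)
```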

### Prediction after training

In [29]:
predict(0)

Text_1: Цены на нефть восстанавливаются
Text_2: Парламент Словакии поблагодарил народы бывшего СССР за Победу
Class: -1
-------------------------------------------
Probs: [0.95633435 0.03367946 0.00998617]
Label: class_-1


In [30]:
predict(5)

Text_1: Вертолет с 11 иностранцами на борту упал в Пакистане
Text_2: В Пакистане упал вертолет с 11 иностранцами
Class: 1
-------------------------------------------
Probs: [0.03318259 0.67381936 0.29299802]
Label: class_0
