#  🤗 Transformers Finetuning

__Task author: Blokhin N.V. (NVBlokhin@fa.ru)__

Materials:
* https://huggingface.co/docs/transformers/training
* https://huggingface.co/docs/datasets/main/en/repository_structure
* https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset
* https://huggingface.co/docs/transformers/v4.35.2/en/training#prepare-a-dataset
* https://huggingface.co/docs/datasets/process
* https://huggingface.co/docs/evaluate/index
* https://huggingface.co/docs/transformers/main_classes/trainer
* https://huggingface.co/docs/transformers/v4.35.2/en/main_classes/trainer#transformers.TrainingArguments

## Tasks for joint discussion

1\. Discuss the main steps of fine-tuning models from the 🤗 Transformers ecosystem.

## Tasks for independent work

<p class="task" id="1"></p>

1\. Split the data from the file `reviews_polarity.csv` into a training set and a validation set in an 80/20 ratio. Create a folder named `reviews_polarity_dataset` and save the resulting splits in it as `train.csv` and `test.csv`. Create a `datasets.Dataset` object using the `load_dataset` function.

Tokenize the rows with the tokenizer that matches the `rubert-base-cased-sentiment` model. After tokenization, remove the `text` field from the dataset, rename the `class` field to `labels`, and convert the data to `torch` tensors.

Create two `DataLoader` objects, one for the training set and one for the validation set. Fetch one batch from the training set and print it.

- [ ] Checked at the seminar
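The tokenization step relies on padding and truncation so that every example in a batch has the same length. The sketch below imitates that behaviour in plain Python; it is an illustration only, not the tokenizer's actual implementation. The helper `pad_or_truncate`, the token ids, and `pad_id=0` are hypothetical, and unlike a real BERT tokenizer this simplified version does not keep the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])       # truncation: drop everything past max_length
    mask = [1] * len(ids)                    # 1 marks a real token
    n_pad = max_length - len(ids)            # how many padding positions are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


short_ids = [101, 7592, 102]                 # hypothetical ids of a short review
long_ids = [101, 1, 2, 3, 4, 5, 6, 102]      # hypothetical ids of a long review

print(pad_or_truncate(short_ids, max_length=5))
# ([101, 7592, 102, 0, 0], [1, 1, 1, 0, 0])
print(pad_or_truncate(long_ids, max_length=5))
# ([101, 1, 2, 3, 4], [1, 1, 1, 1, 1])
```

Equal-length rows are what allows the default `DataLoader` collation to stack examples into a single tensor.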

In [None]:
!pip install datasets

In [None]:
import transformers
import torch as th
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('reviews_polarity.csv')

X_train, X_test, y_train, y_test  = train_test_split(
    df['text'], df['class'], test_size=0.2, random_state=42
  )
train_data = pd.DataFrame({'text':X_train, 'class':y_train})
test_data = pd.DataFrame({'text':X_test, 'class':y_test})

In [None]:
import os

os.makedirs('reviews_polarity_dataset', exist_ok=True)
train_data.to_csv('reviews_polarity_dataset/train.csv', index=False)
test_data.to_csv('reviews_polarity_dataset/test.csv', index=False)

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    'csv',
    data_files={'train': 'reviews_polarity_dataset/train.csv', 'test': 'reviews_polarity_dataset/test.csv'}
    )


In [None]:
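from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('blanchefort/rubert-base-cased-sentiment')

def tokenize_data(example):
    # Tokenize the review text explicitly. Passing the whole `example` dict does
    # not encode the review: every row ends up with the same short input_ids.
    # max_length=128 is an assumed limit: this checkpoint defines no default,
    # so without it padding and truncation are skipped.
    return tokenizer(example['text'], padding='max_length', truncation=True, max_length=128)

dataset = dataset.map(tokenize_data)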


Map:   0%|          | 0/30574 [00:00<?, ? examples/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/7644 [00:00<?, ? examples/s]

In [None]:
dataset = dataset.remove_columns(['text'])
dataset = dataset.rename_column('class', 'labels')

In [None]:
dataset.set_format(type="torch")

In [None]:
from torch.utils.data import DataLoader
train_loader = DataLoader(dataset['train'], batch_size=32)
test_loader = DataLoader(dataset['test'], batch_size=32)

In [None]:
for batch in train_loader:
    print(batch)
    break

{'labels': tensor([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
        0, 0, 1, 1, 1, 1, 1, 1]), 'input_ids': tensor([[  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395,   102, 14108,   102],
        [  101, 10873, 29395, 

<p class="task" id="2"></p>

2\. Create a model with the `AutoModelForSequenceClassification` class, replacing the model head to fit a binary classification task. Using a standard `torch` training loop, fine-tune the model for binary classification. During training, print the loss on the training set (use the ready-made values produced by the model) and the F1 score on the validation set.

Here and below, to speed up training you may freeze part of the network or shrink the datasets by selecting a small subset of examples.

- [ ] Checked at the seminar
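Freezing part of a network amounts to switching off `requires_grad` on the parameters that should stay fixed and giving the optimizer only the remaining ones. The sketch below shows this on a tiny stand-in model; `TinyClassifier` and its layer sizes are hypothetical and merely mimic the "backbone plus `classifier` head" layout of a sequence classification model.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained backbone with a classification head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(self.backbone(x))


model = TinyClassifier()
model.requires_grad_(False)             # freeze every parameter ...
model.classifier.requires_grad_(True)   # ... then unfreeze only the head

# Hand the optimizer just the parameters that will actually be updated.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
# trainable parameters: 18 of 154
```

For the other option, shrinking the data, `Dataset.select(range(n))` from 🤗 Datasets keeps only the first `n` examples; the value of `n` is up to you.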

In [None]:
import torch.optim as optim
from sklearn.metrics import f1_score

In [None]:
from transformers import AutoModelForSequenceClassification

mname = 'blanchefort/rubert-base-cased-sentiment'
model = AutoModelForSequenceClassification.from_pretrained(mname, num_labels=2, ignore_mismatched_sizes=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at blanchefort/rubert-base-cased-sentiment and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.requires_grad_(False)
model.classifier.requires_grad_(True)
model = model.to(device='cuda')

In [None]:
n_epochs = 5
lr = 0.001
# Only the classifier head is trainable, so the optimizer gets just those parameters.
optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)

for epoch in range(n_epochs):
    losses = []
    trues = []
    batch_outs = []
    test_trues = []
    test_batch_outs = []

    model.train()
    for batch in train_loader:
        batch = {k: v.to(device='cuda') for k, v in batch.items()}
        out = model(**batch)
        loss = out.loss  # the model computes the loss itself because `labels` is in the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        trues.extend(batch['labels'].tolist())
        batch_outs.extend(out.logits.argmax(dim=1).tolist())
    print('epoch= ', epoch, 'loss= ', th.tensor(losses).mean().item())
    print('f1_score train = ', f1_score(trues, batch_outs))

    # Validation: switch off dropout and gradient tracking.
    model.eval()
    with th.no_grad():
        for test_batch in test_loader:
            test_batch = {k: v.to(device='cuda') for k, v in test_batch.items()}
            out = model(**test_batch)
            test_trues.extend(test_batch['labels'].tolist())
            test_batch_outs.extend(out.logits.argmax(dim=1).tolist())
    print('f1_score test = ', f1_score(test_trues, test_batch_outs))

epoch=  0 loss=  0.531309187412262
f1_score train =  0.8797097199824074
f1_score test =  0.7272727272727273
epoch=  1 loss=  0.5314652323722839
f1_score train =  0.8800732600732601
f1_score test =  0.7272727272727273
epoch=  2 loss=  0.5314401388168335
f1_score train =  0.8800732600732601
f1_score test =  0.7272727272727273
epoch=  3 loss=  0.531433641910553
f1_score train =  0.8800732600732601
f1_score test =  0.7272727272727273
epoch=  4 loss=  0.5314317345619202
f1_score train =  0.8800732600732601
f1_score test =  0.7272727272727273


<p class="task" id="3"></p>

3\. Create a model with the `AutoModelForSequenceClassification` class, replacing the model head to fit a binary classification task. Using `transformers.Trainer`, fine-tune the model for binary classification. When configuring the `Trainer`, set the number of epochs to 5. During training, print the loss on the training set and the F1 score on the validation set.

- [ ] Checked at the seminar
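A metric function for the validation F1 has to turn logits into class predictions and then compare them with the labels. The sketch below spells out that computation in plain Python so the arithmetic is visible; the helpers `argmax`, `binary_f1`, and `f1_from_logits` and the sample logits are hypothetical, and in practice a library routine such as `sklearn.metrics.f1_score` would do the same job.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_f1(labels, preds, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_from_logits(logits, labels):
    """Mirror of a compute_metrics-style function: logits -> predictions -> F1."""
    preds = [argmax(row) for row in logits]
    return {"f1": binary_f1(labels, preds)}


logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]  # made-up model outputs
labels = [0, 1, 1, 1]

print(round(f1_from_logits(logits, labels)["f1"], 3))  # 0.8
# A classifier that always answers "1" still scores well on skewed labels:
print(round(binary_f1(labels, [1, 1, 1, 1]), 3))       # 0.857
```

The second value is worth keeping in mind: an F1 score that stays identical from epoch to epoch may simply mean the model predicts the majority class for every example.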


In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

In [None]:
mname = 'blanchefort/rubert-base-cased-sentiment'
model = AutoModelForSequenceClassification.from_pretrained(mname, num_labels=2, ignore_mismatched_sizes=True)
model.requires_grad_(False)
model.classifier.requires_grad_(True)
model = model.to(device='cuda')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at blanchefort/rubert-base-cased-sentiment and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install transformers[torch]

In [None]:
!pip install accelerate -U

In [None]:
import accelerate
import torch

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch"
)

In [None]:
from sklearn.metrics import f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds)
    return {"f1": f1}

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,F1
1,0.5307,0.513863,0.882946
2,0.5089,0.513935,0.882946
3,0.5327,0.514155,0.882946
4,0.5211,0.513399,0.882946
5,0.5214,0.513403,0.882946


TrainOutput(global_step=9555, training_loss=0.5222066510977513, metrics={'train_runtime': 343.2768, 'train_samples_per_second': 445.326, 'train_steps_per_second': 27.835, 'total_flos': 471349066791600.0, 'train_loss': 0.5222066510977513, 'epoch': 5.0})

<p class="task" id="4"></p>

4\. Using the `distiluse-base-multilingual-cased-v1` embeddings from the `sentence_transformers` package, solve the binary classification task. To do this, add several fully connected layers on top of the `SentenceTransformer` model. Freeze the part of the model responsible for generating embeddings. During training, print the loss on the training set and the F1 score on the validation set.

- [ ] Checked at the seminar
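The layout described in this task, a frozen embedding model with a small trainable head on top, can be sketched independently of any particular encoder. In the example below `FrozenEncoderClassifier` is a hypothetical wrapper and the `nn.Linear(32, 512)` encoder is only a stand-in for a sentence embedding model with 512-dimensional outputs; the hidden size of 64 is likewise an arbitrary choice.

```python
import torch
import torch.nn as nn


class FrozenEncoderClassifier(nn.Module):
    """Trainable MLP head on top of an encoder whose weights stay fixed."""

    def __init__(self, encoder, embedding_dim, hidden_dim=64, n_classes=2):
        super().__init__()
        self.encoder = encoder.requires_grad_(False)  # freeze the embedding part
        self.head = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, x):
        with torch.no_grad():           # no gradient graph is built for the encoder
            embeddings = self.encoder(x)
        return self.head(embeddings)


torch.manual_seed(0)
encoder = nn.Linear(32, 512)            # stand-in for a real embedding model
model = FrozenEncoderClassifier(encoder, embedding_dim=512)

x = torch.randn(4, 32)                  # made-up batch of 4 inputs
y = torch.tensor([0, 1, 1, 0])          # both classes contribute to the loss
weights_before = encoder.weight.clone()

optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()

print(model(x).shape)                                # torch.Size([4, 2])
print(torch.equal(weights_before, encoder.weight))   # True: the encoder did not change
```

Because the encoder never changes, its embeddings could also be computed once and cached, leaving only the small head to train on each epoch.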

In [None]:
!pip install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer

model_ = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')


In [None]:
model_.requires_grad_(False)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)

In [None]:
from sentence_transformers import SentenceTransformer
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.base_model = model
        self.classifier = nn.Sequential(
            nn.Linear(512, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, X):
        out = self.base_model(X)['sentence_embedding']
        out = self.classifier(out)
        return out

In [None]:
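model = Net(model_)
model = model.to(device='cuda')

n_epochs = 5
lr = 0.001
# The embedding model is frozen, so only the classifier layers are optimized.
optimizer = optim.Adam(model.classifier.parameters(), lr=lr)
# Both classes must contribute to the loss, so no ignore_index is set.
crit = nn.CrossEntropyLoss()

# Caveat: train_loader and test_loader hold ids produced by the rubert tokenizer,
# while this encoder ships its own vocab.txt; for meaningful embeddings the texts
# would need to be tokenized with the encoder's own tokenizer.
for epoch in range(n_epochs):
    losses = []
    trues = []
    batch_outs = []
    test_trues = []
    test_batch_outs = []

    model.train()
    for batch in train_loader:
        batch = {k: v.to(device='cuda') for k, v in batch.items()}
        y = batch.pop('labels')
        batch.pop('token_type_ids')  # the DistilBERT encoder does not take token_type_ids
        out = model(batch)
        loss = crit(out, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        trues.extend(y.tolist())
        batch_outs.extend(out.argmax(dim=1).tolist())
    print('epoch= ', epoch, 'loss= ', th.tensor(losses).mean().item())
    print('f1_score train = ', f1_score(trues, batch_outs))

    # Validation: switch off dropout and gradient tracking.
    model.eval()
    with th.no_grad():
        for test_batch in test_loader:
            test_batch = {k: v.to(device='cuda') for k, v in test_batch.items()}
            y = test_batch.pop('labels')
            test_batch.pop('token_type_ids')
            out = model(test_batch)
            test_trues.extend(y.tolist())
            test_batch_outs.extend(out.argmax(dim=1).tolist())
    print('f1_score test = ', f1_score(test_trues, test_batch_outs))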

epoch=  0 loss=  0.024874551221728325
f1_score train =  0.8800732600732601
f1_score test =  0.8829460762823323
epoch=  1 loss=  0.00013644000864587724
f1_score train =  0.8800732600732601
f1_score test =  0.8829460762823323
epoch=  2 loss=  4.3038551666541025e-05
f1_score train =  0.8800732600732601
f1_score test =  0.8829460762823323
epoch=  3 loss=  1.891094507300295e-05
f1_score train =  0.8800732600732601
f1_score test =  0.8829460762823323
epoch=  4 loss=  9.510884410701692e-06
f1_score train =  0.8800732600732601
f1_score test =  0.8829460762823323


## Feedback
- [x] I would like to receive feedback on my solution