# Задание 2. Классификация с помощью BERT

Возьмите набор данных эмоциональных окрасок отзывов [emotions dataset](https://huggingface.co/datasets/emotion). В датасете 5 классов. Получите эмбеддинг из BERT-подобного кодировщика (возьмите 3-4 разных), классифицируйте тексты с помощью методов ML или  нейростеи, использовав эмбеддинги в качестве входов.

Произведите fine-tuning кодировщика на ваших данных (классификатор на основе bert), и сравните, как изменилось качество классификации.

**Дополнительно**

* Попробуйте классифицировать, использовав такие методы, как TF-IDF, word2vec и другие.

* Попробуйте использовать косинусную меру для оценки близости текстов. Получается ли с её помощью разделять классы?

In [1]:
!pip install -q transformers
!pip install -q datasets

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.2/521.2 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m13.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import torch
from datasets import load_dataset
import warnings

warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
from IPython.display import clear_output

train = load_dataset("SetFit/emotion", split="train")
clear_output()
train_df = pd.DataFrame({"text": train["text"], "emotion": train["label"]})
train_df.head()

Unnamed: 0,text,emotion
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned...,0
2,im grabbing a minute to post i feel greedy wrong,3
3,i am ever feeling nostalgic about the fireplac...,2
4,i am feeling grouchy,3


In [4]:
train_df['emotion'].value_counts()

1    5362
0    4666
3    2159
4    1937
2    1304
5     572
Name: emotion, dtype: int64

Инициализация токенайзера и модели. Перед тем, как отправить текст в модель, его следует токенизировать.

**Учитывайте**, что:
* Выходы моделей могут отличаться (разные имена ключей словаря, разная вложенность, т.е. фантомные размерности).
* Длина эмбеддинга длинее 2048 в данной задаче не потребуется. А чем длинее эмбеддинг, тем медленнее всё будет учиться.


API токенизатора [BERT](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer) (как пример).

Руководство по использованию бертов с [hugging face c towardsdatascience.com](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209).




Возьмите модель "rubert-tiny2" и соответствующий токенизатор

In [None]:
!pip install transformers --progress-bar off

In [32]:

import torch
from transformers import AutoTokenizer, AutoModel
from transformers import BertTokenizer, BertForSequenceClassification
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2", model_max_length=312)
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2").to(device)
#tokenizer = BertTokenizer.from_pretrained("cointegrated/rubert-tiny2")
#model = BertForSequenceClassification.from_pretrained("cointegrated/rubert-tiny2", num_labels=6).to(device)


In [None]:
# подать векторы на рандом форест или бустинг

Посмотрим, как устроен датасет:

In [6]:
# Your code here
print(len(train_df))

16000


Посмотрите на выходы модели:

https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel.forward.returns

Среди них есть эмбеддинги в ключах: pooler_output и hidden_state

Попробуйте использовать один из них для классификации.

В данном задании достаточно использовать 5000 фраз (3000 train, 1000 val,1000 test).




In [7]:
test = load_dataset("SetFit/emotion", split="test")
test_df = pd.DataFrame({"text": test["text"], "emotion": test["label"]})
test_df.head()

Unnamed: 0,text,emotion
0,im feeling rather rotten so im not very ambiti...,0
1,im updating my blog because i feel shitty,0
2,i never make her separate from me because i do...,0
3,i left with my bouquet of red and yellow tulip...,1
4,i was feeling a little vain when i did this one,0


In [8]:
val = load_dataset("SetFit/emotion", split="validation")
val_df = pd.DataFrame({"text": val["text"], "emotion": val["label"]})
val_df.head()

Unnamed: 0,text,emotion
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


In [9]:
X_train = train_df['text'].sample(frac=3/16)
y_train = train_df['emotion'].sample(frac=3/16)
X_val = val_df['text'].sample(frac=0.5)
y_val = val_df['emotion'].sample(frac=0.5)
X_test = test_df['text'].sample(frac=0.5)
y_test = test_df['emotion'].sample(frac=0.5)

Сохраните эмбеддинги модели и постройте поверх них классификатор.

In [10]:
tokenized_train = tokenizer(X_train.values.tolist(), padding = True, truncation = True, return_tensors="pt")
tokenized_val = tokenizer(X_val.values.tolist() , padding = True, truncation = True,  return_tensors="pt")

print(tokenized_train.keys())

#move on device (GPU)
tokenized_train = {k:torch.tensor(v).to(device) for k,v in tokenized_train.items()}
tokenized_val = {k:torch.tensor(v).to(device) for k,v in tokenized_val.items()}

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [30]:
tokenized_train

{'input_ids': tensor([[    2,    76,   772,  ...,     0,     0,     0],
         [    2,    76, 12235,  ...,     0,     0,     0],
         [    2,    76,   768,  ...,     0,     0,     0],
         ...,
         [    2,    76,   880,  ...,     0,     0,     0],
         [    2,    76,   782,  ...,     0,     0,     0],
         [    2, 10551,  2703,  ...,     0,     0,     0]], device='cuda:0'),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}

In [11]:
with torch.no_grad():
  hidden_train = model(**tokenized_train) #dim : [batch_size(nr_sentences), tokens, emb_dim]
  hidden_val = model(**tokenized_val)

In [12]:
hidden_train

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0541, -0.2967,  0.5910,  ..., -0.8362,  0.4133, -1.1161],
         [-0.6164,  0.2336,  2.2585,  ..., -1.1374,  0.7806, -2.0107],
         [ 0.7787,  0.2303,  0.3477,  ..., -0.2127,  0.8619, -2.2262],
         ...,
         [-0.4156, -0.1559,  0.6747,  ..., -1.5931,  0.0327, -1.0409],
         [-0.2447, -0.3099,  0.8128,  ..., -1.4527,  0.1188, -1.2305],
         [-0.4167, -0.0931,  1.0927,  ..., -1.6998,  0.2477, -1.1915]],

        [[-0.1730, -0.5607,  0.8248,  ..., -1.1353, -0.0262, -0.8950],
         [-0.4755, -0.2832,  2.5157,  ..., -1.0320,  0.5585, -1.4182],
         [ 0.0075, -1.0687,  2.0982,  ..., -1.2623,  0.0439, -0.6056],
         ...,
         [-0.5051, -0.9245,  1.2226,  ..., -1.6779, -0.0068, -0.7962],
         [-0.1406, -1.0932,  1.1818,  ..., -1.3671,  0.4554, -1.1932],
         [ 0.0445, -0.9453,  1.2892,  ..., -1.6911,  0.1658, -1.3134]],

        [[ 0.2132, -0.6074,  0.2383,  ..., -0.8288,  

In [None]:
hidden_val

In [14]:
with torch.no_grad():
  hidden_train = model(**tokenized_train) #dim : [batch_size(nr_sentences), tokens, emb_dim]
  hidden_val = model(**tokenized_val)

#get only the [CLS] hidden states
cls_train = hidden_train.pooler_output
cls_val = hidden_val.pooler_output

In [64]:
print(hidden_train.keys())

odict_keys(['last_hidden_state', 'pooler_output'])


In [65]:
print(hidden_val.keys())

odict_keys(['last_hidden_state', 'pooler_output'])


In [40]:
hidden_train.pooler_output

tensor([[ 0.1363,  0.0126,  0.1126,  ..., -0.0747, -0.0309,  0.0144],
        [-0.1190,  0.0380,  0.1518,  ...,  0.1062, -0.1095, -0.0587],
        [ 0.0816,  0.0450, -0.1239,  ...,  0.2059,  0.0529, -0.0431],
        ...,
        [ 0.1140,  0.0222,  0.1265,  ...,  0.1583, -0.1228, -0.2148],
        [-0.0566, -0.1058,  0.1700,  ...,  0.0970, -0.1028, -0.1261],
        [-0.0654, -0.0252, -0.0682,  ...,  0.0705, -0.1706, -0.0582]],
       device='cuda:0')

In [15]:
from sklearn.metrics import classification_report

In [16]:
x_train = cls_train.to("cpu")
x_val = cls_val.to("cpu")

print(x_train.shape, y_train.shape, x_val.shape, y_val.shape)

torch.Size([3000, 312]) (3000,) torch.Size([1000, 312]) (1000,)


In [43]:
x_train

tensor([[ 0.1363,  0.0126,  0.1126,  ..., -0.0747, -0.0309,  0.0144],
        [-0.1190,  0.0380,  0.1518,  ...,  0.1062, -0.1095, -0.0587],
        [ 0.0816,  0.0450, -0.1239,  ...,  0.2059,  0.0529, -0.0431],
        ...,
        [ 0.1140,  0.0222,  0.1265,  ...,  0.1583, -0.1228, -0.2148],
        [-0.0566, -0.1058,  0.1700,  ...,  0.0970, -0.1028, -0.1261],
        [-0.0654, -0.0252, -0.0682,  ...,  0.0705, -0.1706, -0.0582]])

In [53]:
x_train.numpy()

array([[ 0.13630863,  0.01262598,  0.11260218, ..., -0.07466914,
        -0.03089368,  0.01441864],
       [-0.11903999,  0.03797973,  0.15180744, ...,  0.10620189,
        -0.10948235, -0.05866457],
       [ 0.08163867,  0.04500313, -0.1238536 , ...,  0.20588058,
         0.0528654 , -0.04305964],
       ...,
       [ 0.11395024,  0.02215588,  0.12654135, ...,  0.15832627,
        -0.12275476, -0.21483377],
       [-0.05659902, -0.10581838,  0.17004511, ...,  0.09696726,
        -0.10282239, -0.12610713],
       [-0.0653757 , -0.02521783, -0.06822247, ...,  0.07049131,
        -0.17063232, -0.05822738]], dtype=float32)

In [44]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_val_pred = rf.predict(x_val)

torch.Size([3000, 312]) (3000,) torch.Size([1000, 312]) (1000,)


In [47]:
!pip install -q catboost

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [48]:
import xgboost
import catboost
import lightgbm
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold

In [68]:
KF = KFold(n_splits=3, random_state=42, shuffle=True)
catbust = catboost.CatBoostClassifier(verbose=0, random_seed=42)
xgb = xgboost.XGBClassifier(random_state=42)
lgbm = lightgbm.LGBMClassifier(random_state=42)

In [69]:
catb = []
XGB = []
LGBM = []

In [70]:
catbust.fit(x_train.numpy(), y_train)
y_pred1 = catbust.predict(x_val.numpy())
catb.append(y_pred1)

In [None]:
print(classification_report(y_pred1, y_val.values))

              precision    recall  f1-score   support

           0       0.38      0.27      0.31       390
           1       0.59      0.36      0.45       571
           2       0.00      0.00      0.00         1
           3       0.02      0.12      0.04        25
           4       0.03      0.23      0.05        13
           5       0.00      0.00      0.00         0

    accuracy                           0.32      1000
   macro avg       0.17      0.16      0.14      1000
weighted avg       0.49      0.32      0.38      1000



In [73]:
xgb.fit(x_train.numpy(), y_train)
y_pred2 = xgb.predict(x_val.numpy())
XGB.append(y_pred2)

In [74]:
print(classification_report(y_pred2, y_val.values))

              precision    recall  f1-score   support

           0       0.42      0.31      0.36       380
           1       0.60      0.37      0.46       570
           2       0.00      0.00      0.00         2
           3       0.02      0.09      0.03        32
           4       0.01      0.07      0.02        15
           5       0.00      0.00      0.00         1

    accuracy                           0.33      1000
   macro avg       0.18      0.14      0.14      1000
weighted avg       0.51      0.33      0.40      1000



In [75]:
lgbm.fit(x_train.numpy(), y_train)
y_pred3 = lgbm.predict(x_val.numpy())
LGBM.append(y_pred3)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.011742 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 79560
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 312
[LightGBM] [Info] Start training from score -1.231001
[LightGBM] [Info] Start training from score -1.110685
[LightGBM] [Info] Start training from score -2.509199
[LightGBM] [Info] Start training from score -1.963735
[LightGBM] [Info] Start training from score -2.131437
[LightGBM] [Info] Start training from score -3.261435


In [76]:
print(classification_report(y_pred3, y_val.values))

              precision    recall  f1-score   support

           0       0.38      0.29      0.33       363
           1       0.61      0.35      0.44       607
           2       0.00      0.00      0.00         1
           3       0.02      0.18      0.04        17
           4       0.02      0.17      0.04        12
           5       0.00      0.00      0.00         0

    accuracy                           0.32      1000
   macro avg       0.17      0.16      0.14      1000
weighted avg       0.51      0.32      0.39      1000



Попробуйте дообучить BERT и произвести классификацию заново

In [None]:
# Your code here

In [None]:
! pip install -U accelerate
! pip install -U transformers

Предобработка данных

In [37]:
setfit = load_dataset("SetFit/emotion")

In [38]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [39]:
tokenized_setfit = setfit.map(preprocess_function, batched=True)

Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [60]:
tokenized_setfit['train']

Dataset({
    features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 16000
})

In [87]:
tokenized_setfit_1 = tokenized_setfit.remove_columns("label_text")
tokenized_setfit_1 = tokenized_setfit_1.remove_columns('token_type_ids')

In [88]:
tokenized_setfit_1

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [94]:
tokenized_setfit_1.shape

{'train': (16000, 4), 'validation': (2000, 4), 'test': (2000, 4)}

In [95]:
tokenizer

BertTokenizerFast(name_or_path='cointegrated/rubert-tiny2', vocab_size=83828, model_max_length=312, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [40]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [41]:
import evaluate

accuracy = evaluate.load("accuracy")

In [96]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [1]:
!pip install -q -U transformers accelerate git+https://github.com/huggingface/peft.git datasets evaluate --progress-bar off

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [97]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1, #about 15 minutes for 1 epoch
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_setfit_1['train'],
    eval_dataset=tokenized_setfit_1['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [98]:
trainer.train()

TypeError: ignored

## Формат результата

Получить значение качества классификации