# Pretrained model from HuggingFace
Because compute resources are limited, I will first try a BERT model from HuggingFace that has already been trained on the RuSentiment dataset.
Model link: https://huggingface.co/blanchefort/rubert-base-cased-sentiment-rusentiment

In [None]:
from transformers import (
    AutoTokenizer,
    BertTokenizerFast,
    AutoModelForSequenceClassification,
    pipeline
)
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm

In [None]:
# load the pretrained model
model_name = 'blanchefort/rubert-base-cased-sentiment-rusentiment'

tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=True)

print(f"Model loaded! Parameters: {model.num_parameters():,}")
print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

Model loaded! Parameters: 177,855,747
Device: CUDA


In [None]:
# label mapping used by the model
print(model.config.id2label)
print(model.config.label2id)

{0: 'NEUTRAL', 1: 'POSITIVE', 2: 'NEGATIVE'}
{'NEUTRAL': 0, 'POSITIVE': 1, 'NEGATIVE': 2}
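
Before comparing predictions with a dataset's label column, it is worth confirming that both sides use the same integer encoding. Below is a minimal sketch of such a check. The `model_id2label` dictionary is copied from the output above, while `dataset_label2id` is a hand-written, hypothetical mapping (an assumption, not something read from the dataset) and `build_remap` is an illustrative helper, not part of any library.

```python
# id -> name mapping, copied from model.config.id2label printed above
model_id2label = {0: "NEUTRAL", 1: "POSITIVE", 2: "NEGATIVE"}

# ASSUMPTION: hypothetical dataset encoding, written by hand for illustration;
# the real encoding should be taken from the dataset card.
dataset_label2id = {"NEUTRAL": 0, "POSITIVE": 1, "NEGATIVE": 2}


def build_remap(model_id2label, dataset_label2id):
    """Return {model_id: dataset_id}; raise if a label name has no counterpart."""
    missing = set(model_id2label.values()) - set(dataset_label2id)
    if missing:
        raise ValueError(f"labels missing from dataset mapping: {sorted(missing)}")
    return {mid: dataset_label2id[name] for mid, name in model_id2label.items()}


remap = build_remap(model_id2label, dataset_label2id)
# True only when every model id maps to the identical dataset id
labels_aligned = all(mid == did for mid, did in remap.items())
print(remap, labels_aligned)
```

If `labels_aligned` turns out to be `False`, the predicted ids can be passed through `remap` before computing metrics.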


In [None]:
# Load only the validation split, which serves as the test set here
df_test = load_dataset("MonoHime/ru_sentiment_dataset", split="validation")
df_test

Dataset({
    features: ['Unnamed: 0', 'text', 'sentiment'],
    num_rows: 21098
})

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

In [None]:
@torch.no_grad()
def predict(texts: list[str]) -> np.ndarray:
    """Return the predicted class id for each text in a single batch."""
    # Pad to the longest text in the batch and truncate at 256 tokens
    inputs = tokenizer(texts, max_length=256, padding=True, truncation=True, return_tensors='pt').to(device)
    outputs = model(**inputs)
    # Softmax is monotonic, so argmax over the raw logits selects the same class
    return torch.argmax(outputs.logits, dim=1).cpu().numpy()

def predict_by_batch(texts: list[str], batch_size: int) -> np.ndarray:
    """Run predict over texts in consecutive chunks of batch_size."""
    all_preds = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Predicting"):
        batch_texts = texts[i:i + batch_size]
        all_preds.extend(predict(batch_texts))

    return np.array(all_preds)
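
Softmax preserves the ordering of the logits, so the argmax in `predict` picks the same class whether or not softmax is applied first. Here is a small pure-Python sketch of that property. The logit values are made up for illustration, and `softmax` and `argmax` are local helpers rather than the PyTorch functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


logits = [0.3, 2.1, -1.0]  # made-up scores for the three classes
probs = softmax(logits)

# Both calls select the same index because exp() is strictly increasing
same_class = argmax(logits) == argmax(probs)
print(argmax(logits), argmax(probs), same_class)
```

Probabilities are only needed when the confidence itself is of interest; for hard labels the logits are enough.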

In [None]:
y_test = np.array(df_test["sentiment"])
y_test_pred = predict_by_batch(list(df_test['text']), batch_size=32)

Predicting: 100%|██████████| 660/660 [05:22<00:00,  2.05it/s]
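
The 660 iterations in the progress bar follow directly from the dataset size and the batch size. The sketch below reproduces that arithmetic and gives a rough throughput estimate; all three input numbers are taken from the outputs above, and the variable names are only illustrative.

```python
import math

n_texts = 21098            # num_rows reported for the validation split
batch_size = 32            # value passed to predict_by_batch
elapsed_s = 5 * 60 + 22    # 05:22 shown by the progress bar

# range(0, n, batch_size) yields ceil(n / batch_size) chunks
n_batches = math.ceil(n_texts / batch_size)
# The final chunk holds whatever is left after the full batches
last_batch = n_texts - (n_batches - 1) * batch_size
texts_per_s = n_texts / elapsed_s

print(n_batches, last_batch, round(texts_per_s, 1))
```

The last batch is smaller than the others, which is why `predict` has to work with a variable number of texts.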


In [None]:
accuracy_score(y_test, y_test_pred)

0.6086358896577875

In [None]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.43      0.56      0.49      5560
           1       0.74      0.67      0.71     10026
           2       0.63      0.54      0.58      5512

    accuracy                           0.61     21098
   macro avg       0.60      0.59      0.59     21098
weighted avg       0.63      0.61      0.61     21098
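
To put these numbers in context, it helps to compare accuracy with a trivial baseline that always predicts the most frequent class. The sketch below does this using the per-class supports and F1 scores from the report. The inputs are the rounded values printed above, so the results are approximate.

```python
support = {0: 5560, 1: 10026, 2: 5512}   # per-class counts from the report
f1 = {0: 0.49, 1: 0.71, 2: 0.58}         # per-class F1 from the report (rounded)
accuracy = 0.6086                        # accuracy_score output, rounded

total = sum(support.values())
# Always predicting the largest class is right exactly for that class's share
majority_class = max(support, key=support.get)
baseline_acc = support[majority_class] / total
# Macro average: unweighted mean over classes
macro_f1 = sum(f1.values()) / len(f1)

print(f"majority baseline: {baseline_acc:.3f}")
print(f"gain over baseline: {accuracy - baseline_acc:.3f}")
print(f"macro F1: {macro_f1:.2f}")
```

The model beats the majority-class baseline, but class 0 has the lowest precision and F1 of the three classes, which pulls the macro average down.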



The quality is fairly low (accuracy of about 0.61, macro F1 of about 0.59), so let's try fine-tuning our own BERT model.