# 3. BERT + classic model

Будем получать эмбеддинги от BERT, а дальше так же, как и с TF-IDF, в полученном признаковом простанстве строить классические модели.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data.csv')

In [3]:
df.head(3)

Unnamed: 0,id,answer1,score1,answer2,score2,answer3,score3,result
0,train_0,для анализа массивов данных необходимых в работе,2.0,для анализа массивов данных необходимых в работе,2.0,"стараюсь всегда брать задачи, выполнение котор...",2.0,6.0
1,train_1,Буду использовать полученные знания в работе д...,2.0,Автоматизирую процесс сбора данных и дальнейше...,2.0,Задача по анализу кода и содержанию пакетов - ...,1.5,5.5
2,train_2,хочу стать топовым программистом во всём мире ...,1.5,изучаю программирование,1.5,-,0.0,3.0


In [4]:
# !pip install transformers torch

In [5]:
!set HF_HUB_DISABLE_SYMLINKS_WARNING=1

In [6]:
from transformers import AutoTokenizer, AutoModel
import torch

In [7]:
#Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained("ai-forever/sbert_large_nlu_ru")
model = AutoModel.from_pretrained("ai-forever/sbert_large_nlu_ru")

In [8]:
from sklearn.model_selection import cross_val_predict

def get_best_threshold(model, X_train, y_train):
    """Ищет лучший порог отсечения, который максимизирует метрику f1_macro с использованием кросс-валидации"""
    # Получаем вероятности предсказаний с помощью кросс-валидации
    y_probs = cross_val_predict(model, X_train, y_train, cv=5, method='predict_proba')[:, 1]

    thresholds = np.arange(0.0, 1.0, 0.01)
    f1_scores = []
    
    for threshold in thresholds:
        y_pred = (y_probs >= threshold).astype(int)
        f1 = f1_score(y_train, y_pred, average='macro')
        f1_scores.append(f1)
    
    optimal_threshold = thresholds[np.argmax(f1_scores)]
    optimal_f1 = max(f1_scores)
    return optimal_threshold

## 3.1. BERT + KNN

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, f1_score

In [10]:
df.isna().sum()

id         0
answer1    0
score1     0
answer2    1
score2     0
answer3    0
score3     0
result     0
dtype: int64

Необходимо заполнить пропуски, так как BERT не сможет нормально работать с ними:

In [11]:
df = df.fillna('')

In [12]:
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

In [13]:
def get_bert_embeddings(sentences: list[str]):
    """Создает BERT-эмбеддинги для каждого текстового ответа"""
    #Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=50, return_tensors='pt')
    
    #Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    #Perform pooling. In this case, mean pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    return sentence_embeddings

### 3.1.1. Пробная модель для первого вопроса

In [14]:
df.head(1)

Unnamed: 0,id,answer1,score1,answer2,score2,answer3,score3,result
0,train_0,для анализа массивов данных необходимых в работе,2.0,для анализа массивов данных необходимых в работе,2.0,"стараюсь всегда брать задачи, выполнение котор...",2.0,6.0


In [15]:
X, y = df[['answer1']].copy(), df['score1'].copy()
y = (y > 1.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [16]:
sentence_embeddings_train = get_bert_embeddings(X_train['answer1'].tolist())
sentence_embeddings_test = get_bert_embeddings(X_test['answer1'].tolist())

In [17]:
knn = KNeighborsClassifier()
knn.fit(sentence_embeddings_train, y_train)

y_train_pred = knn.predict(sentence_embeddings_train)
y_test_pred = knn.predict(sentence_embeddings_test)

print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.82      0.87      0.85       286
           1       0.76      0.67      0.71       169

    accuracy                           0.80       455
   macro avg       0.79      0.77      0.78       455
weighted avg       0.80      0.80      0.80       455

              precision    recall  f1-score   support

           0       0.70      0.67      0.69       123
           1       0.48      0.51      0.49        73

    accuracy                           0.61       196
   macro avg       0.59      0.59      0.59       196
weighted avg       0.62      0.61      0.61       196



In [18]:
import optuna
import logging
from sklearn.model_selection import cross_val_score

optuna.logging.set_verbosity(optuna.logging.ERROR) # отключаем стандартный вывод optuna

In [19]:
def objective(trial):
    model = KNeighborsClassifier(
        n_neighbors=trial.suggest_int('n_neighbors', 1, 50),
        weights=trial.suggest_categorical('weights', ['uniform', 'distance']),
        metric=trial.suggest_categorical('metric', ['euclidean', 'cosine']),
        leaf_size=trial.suggest_int('leaf_size', 1, 100),
    )

    # Оцениваем модель с помощью кросс-валидации
    score = cross_val_score(model, sentence_embeddings_train, y_train, cv=5, scoring='f1_macro').mean()

    # Выводим логи
    print(f'Trial {trial.number+1}: score = {score}')
    
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000)

Trial 1: score = 0.684626103722526
Trial 2: score = 0.639916789764689
Trial 3: score = 0.6617942140159512
Trial 4: score = 0.6500021286851122
Trial 5: score = 0.6852113713817654
Trial 6: score = 0.6605864073475145
Trial 7: score = 0.6485623925908268
Trial 8: score = 0.6765115760222987
Trial 9: score = 0.6765672781051253
Trial 10: score = 0.6765115760222987
Trial 11: score = 0.6834933642395139
Trial 12: score = 0.6904910810307511
Trial 13: score = 0.6929523447947359
Trial 14: score = 0.6947397383096601
Trial 15: score = 0.6730512125213639
Trial 16: score = 0.6713165077374927
Trial 17: score = 0.6864878979323246
Trial 18: score = 0.6713165077374927
Trial 19: score = 0.6679041901437446
Trial 20: score = 0.6475135425619651
Trial 21: score = 0.666002130174009
Trial 22: score = 0.6929523447947359
Trial 23: score = 0.6929523447947359
Trial 24: score = 0.6827070501364
Trial 25: score = 0.6827070501364
Trial 26: score = 0.6929523447947359
Trial 27: score = 0.6781892070776167
Trial 28: score = 0

In [20]:
study.best_params

{'n_neighbors': 38, 'weights': 'distance', 'metric': 'cosine', 'leaf_size': 51}

In [21]:
best_model = KNeighborsClassifier(**study.best_params)
best_model.fit(sentence_embeddings_train, y_train)

optimal_threshold = get_best_threshold(best_model, sentence_embeddings_train, y_train)
y_train_pred = (best_model.predict_proba(sentence_embeddings_train)[:, 1] >= optimal_threshold).astype(int)
y_test_pred = (best_model.predict_proba(sentence_embeddings_test)[:, 1] >= optimal_threshold).astype(int)

print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       286
           1       1.00      1.00      1.00       169

    accuracy                           1.00       455
   macro avg       1.00      1.00      1.00       455
weighted avg       1.00      1.00      1.00       455

              precision    recall  f1-score   support

           0       0.75      0.73      0.74       123
           1       0.57      0.59      0.58        73

    accuracy                           0.68       196
   macro avg       0.66      0.66      0.66       196
weighted avg       0.68      0.68      0.68       196



## 3.2. BERT + LogReg

### 3.2.1. Пробная модель для первого вопроса

In [22]:
from sklearn.linear_model import LogisticRegression

In [23]:
def objective(trial):
    model = LogisticRegression(
        C=trial.suggest_float('C', 1e-4, 1e4, log=True),
        penalty=trial.suggest_categorical('penalty', ['l1', 'l2']),
        class_weight=trial.suggest_categorical('class_weight', [None, 'balanced']),
        solver='liblinear'
    )

    # Оцениваем модель с помощью кросс-валидации
    score = cross_val_score(model, sentence_embeddings_train, y_train, cv=5, scoring='f1_macro').mean()

    # Выводим логи
    print(f'Trial {trial.number+1}: score = {score}')
    
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

Trial 1: score = 0.6614812748806322
Trial 2: score = 0.3859604571013967
Trial 3: score = 0.6894930499652834
Trial 4: score = 0.7018409577493976
Trial 5: score = 0.7012338544717803
Trial 6: score = 0.5982152105127241
Trial 7: score = 0.6974121548570018
Trial 8: score = 0.7017917265898284
Trial 9: score = 0.6611744831215469
Trial 10: score = 0.6653287132993304
Trial 11: score = 0.6741382907308712
Trial 12: score = 0.6647788978811398
Trial 13: score = 0.6928075121626522
Trial 14: score = 0.6700814312455255
Trial 15: score = 0.69690173802947
Trial 16: score = 0.6991421213372484
Trial 17: score = 0.6828285147736338
Trial 18: score = 0.664082626410262
Trial 19: score = 0.3859604571013967
Trial 20: score = 0.7134468911383731
Trial 21: score = 0.6973285907707198
Trial 22: score = 0.6955994968559998
Trial 23: score = 0.716036750136338
Trial 24: score = 0.7032073937831473
Trial 25: score = 0.716036750136338
Trial 26: score = 0.7140548217055015
Trial 27: score = 0.6715489898511652
Trial 28: score

In [24]:
study.best_params

{'C': 0.09099591441633904, 'penalty': 'l2', 'class_weight': 'balanced'}

In [25]:
best_model = LogisticRegression(**study.best_params, solver='liblinear')
best_model.fit(sentence_embeddings_train, y_train)

optimal_threshold = get_best_threshold(best_model, sentence_embeddings_train, y_train)
y_train_pred = (best_model.predict_proba(sentence_embeddings_train)[:, 1] >= optimal_threshold).astype(int)
y_test_pred = (best_model.predict_proba(sentence_embeddings_test)[:, 1] >= optimal_threshold).astype(int)

print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93       286
           1       0.85      0.92      0.89       169

    accuracy                           0.91       455
   macro avg       0.90      0.91      0.91       455
weighted avg       0.92      0.91      0.91       455

              precision    recall  f1-score   support

           0       0.79      0.72      0.75       123
           1       0.59      0.68      0.63        73

    accuracy                           0.70       196
   macro avg       0.69      0.70      0.69       196
weighted avg       0.72      0.70      0.71       196



## 3.3. BERT + Naive Bayes

### 3.3.1. Пробная модель для первого вопроса

In [26]:
from sklearn.naive_bayes import GaussianNB

In [27]:
model = GaussianNB()
model.fit(sentence_embeddings_train, y_train)

optimal_threshold = get_best_threshold(best_model, sentence_embeddings_train, y_train)
y_train_pred = (model.predict_proba(sentence_embeddings_train)[:, 1] >= optimal_threshold).astype(int)
y_test_pred = (model.predict_proba(sentence_embeddings_test)[:, 1] >= optimal_threshold).astype(int)

print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.85      0.68      0.76       286
           1       0.60      0.80      0.69       169

    accuracy                           0.73       455
   macro avg       0.73      0.74      0.72       455
weighted avg       0.76      0.73      0.73       455

              precision    recall  f1-score   support

           0       0.82      0.67      0.74       123
           1       0.57      0.75      0.65        73

    accuracy                           0.70       196
   macro avg       0.70      0.71      0.69       196
weighted avg       0.73      0.70      0.70       196



Все три модели показывают худшее качество при классификации первого вопроса, чем при использовании TF-IDF + LogReg. Это может быть связано с тем, что BERT - довольно мощная и сложная модель и на небольшой выборке (которая имеется у нас) модели, использующие эмбеддинги BERT, могут легко переобучиться на тренировочный сет. BERT лучше применять в задачах с большим объемом данных, где он может получить больше информации для понимания контекста, и в следствии этого, лучше классифицировать тексты.