# Wikishop, BERT


The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative. A dataset labelled for the toxicity of edits is available.

The *F1* quality metric must be at least 0.75.
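For reference, F1 is the harmonic mean of precision and recall, computed here for the positive (toxic) class. Below is a minimal pure-Python sketch; the labels are made up for illustration and the helper name `f1_from_labels` is hypothetical.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = toxic, 0 = non-toxic
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f1_from_labels(y_true, y_pred))  # 3 TP, 1 FP, 1 FN -> 0.75
```

In this toy case precision and recall are both 0.75, so the result sits exactly on the required threshold.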

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.






## Text preprocessing

In [1]:
pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [71]:
# Standard library
import re

# Numerical stack and deep learning
import numpy as np
import pandas as pd
import torch
from tqdm import notebook  # provides notebook.tqdm progress bars
from transformers import AutoTokenizer, AutoModel

# Modelling and evaluation
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split


**Read and clean the data**

In [72]:
# read the data
df_comments = pd.read_csv('./toxic_comments.csv')

In [73]:
def clear_text(text):
    text = str(text).lower()  # lowercase the text
    text = re.sub(r"[^a-z' ]", ' ', text)  # keep only Latin letters, apostrophes and spaces
    words = [w for w in text.split() if len(w) >= 3]  # drop words shorter than 3 characters
    return " ".join(words)

# apply the cleaning function
df_comments['text'] = df_comments['text'].apply(clear_text)
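To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same rules to a made-up comment. The sample string is an assumption and is not taken from the dataset.

```python
import re


def clear_text(text):
    """Lowercase, keep only Latin letters, apostrophes and spaces, drop words under 3 characters."""
    text = str(text).lower()
    text = re.sub(r"[^a-z' ]", " ", text)
    words = [w for w in text.split() if len(w) >= 3]
    return " ".join(words)


sample = "Hello, World! It's a GREAT day :)"  # hypothetical comment
print(clear_text(sample))  # hello world it's great day
```

Note that the length filter also drops short words such as "no" or "is", so it may be worth checking whether this step actually helps a BERT-based model.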

**Apply BERT**

In [74]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [75]:
tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")
model = AutoModel.from_pretrained("unitary/toxic-bert").to(device)

In [77]:
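# tokenize every comment; truncation=True cuts sequences to the tokenizer's maximum length
tokenized = df_comments['text'].apply(
    lambda x: tokenizer.encode(x, truncation=True, add_special_tokens=True))

# length of the longest tokenized comment
max_len = max(len(ids) for ids in tokenized.values)

# right-pad every sequence with 0 up to max_len
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)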
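The padding and masking logic can be checked on a toy example without loading a tokenizer. This is a sketch: the token ids below are made up and are not real tokenizer output.

```python
import numpy as np

# Made-up token id sequences of different lengths (not real tokenizer output)
tokenized = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]

max_len = max(len(ids) for ids in tokenized)

# Right-pad with 0 so every row has max_len entries
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

# The mask is 1 where a real token is present and 0 on padding
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```

The mask tells the model which positions to ignore, so the zeros appended to the shorter sequence do not influence its embedding.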

In [78]:
batch_size = 100
embeddings = []
# note: integer division skips a final incomplete batch, so trailing rows get no embedding
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    # inference only, so gradients are not needed
    with torch.no_grad():
        batch_embeddings = model(batch.to(device), attention_mask=attention_mask_batch.to(device))

    # keep the first-token ([CLS]) vector of the last hidden state as the comment embedding
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())

features = np.concatenate(embeddings)

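The loop above runs over `padded.shape[0] // batch_size` full batches, so any rows left after the last full batch receive no embedding. Below is a sketch of a variant that also covers the remainder. The function `fake_embed` is a hypothetical stand-in for the BERT forward pass, and `embed_in_batches` is likewise an illustrative name, so the example runs without a model.

```python
import numpy as np


def fake_embed(batch):
    """Hypothetical stand-in for the BERT forward pass: one 4-dim vector per row."""
    return np.ones((batch.shape[0], 4), dtype=np.float32)


def embed_in_batches(padded, batch_size):
    """Embed all rows, including a final batch smaller than batch_size."""
    chunks = []
    # Stepping by batch_size makes the last slice shorter instead of skipping it
    for start in range(0, padded.shape[0], batch_size):
        chunks.append(fake_embed(padded[start:start + batch_size]))
    return np.concatenate(chunks)


padded = np.zeros((250, 16), dtype=np.int64)  # 250 toy rows, not divisible by 100
features = embed_in_batches(padded, batch_size=100)
print(features.shape)  # (250, 4)
```

With a loop of this shape, `features` would have one row per comment, and the target would not need to be shortened to match it.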

In [79]:
target = df_comments['toxic']

In [80]:
features.shape

(159200, 768)

In [88]:
# align target with features: keep only the first 159200 rows, which received embeddings
target = target.loc[0:159199]

In [89]:
target.shape

(159200,)

**Split into train and test sets**

In [90]:
bert_x_train, bert_x_test = train_test_split(features, test_size=0.1, shuffle = False)

In [96]:
target_train, target_test = train_test_split(target, test_size=0.1, shuffle = False)
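Splitting features and target in two separate calls keeps the rows aligned only because `shuffle=False` and both objects have the same length. As a sketch on toy arrays (the data below is an assumption), both can be passed to a single `train_test_split` call, which applies the same row indices to each.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the embedding matrix and the labels
features = np.arange(40).reshape(20, 2)
target = np.arange(20) % 2

# One call splits both arrays with the same row indices
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.1, shuffle=False)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

A single call would also stay aligned if shuffling were enabled later, which two independent calls do not guarantee unless they share the same `random_state`.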

## Training

### CatBoostClassifier

In [103]:
model_1 = CatBoostClassifier()
scores = cross_val_score(model_1, bert_x_train, target_train, cv=5, scoring='f1')
final_score = scores.mean()
print('Mean cross-validated F1:', final_score)

Output was truncated to the last 5000 lines.
5:	learn: 0.1342180	total: 3.56s	remaining: 9m 50s
6:	learn: 0.1146344	total: 3.95s	remaining: 9m 19s
7:	learn: 0.1001192	total: 4.37s	remaining: 9m 1s
8:	learn: 0.0896390	total: 4.84s	remaining: 8m 53s
9:	learn: 0.0823251	total: 5.29s	remaining: 8m 43s
10:	learn: 0.0768638	total: 5.7s	remaining: 8m 32s
11:	learn: 0.0723707	total: 6.09s	remaining: 8m 21s
12:	learn: 0.0689858	total: 6.46s	remaining: 8m 10s
13:	learn: 0.0659701	total: 6.86s	remaining: 8m 2s
14:	learn: 0.0639934	total: 7.27s	remaining: 7m 57s
15:	learn: 0.0622473	total: 7.69s	remaining: 7m 53s
16:	learn: 0.0606446	total: 8.07s	remaining: 7m 46s
17:	learn: 0.0592555	total: 8.43s	remaining: 7m 39s
18:	learn: 0.0582004	total: 8.85s	remaining: 7m 36s
19:	learn: 0.0571928	total: 9.22s	remaining: 7m 31s
20:	learn: 0.0563657	total: 9.63s	remaining: 7m 29s
21:	learn: 0.0557920	total: 10s	remaining: 7m 25s
22:	learn: 0.0552466	total: 10.4s	remaining:

### LogisticRegression

In [104]:
model_2 = LogisticRegression(random_state=42, max_iter=1000)
scores = cross_val_score(model_2, bert_x_train, target_train, cv=5, scoring='f1')
final_score = sum(scores) / len(scores)
final_score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.9007811981517257
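The convergence warnings above recommend scaling the data. One way to try this is sketched below on synthetic data: `make_classification` stands in for the BERT embeddings, and the setup has not been evaluated on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding matrix and labels
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# The scaler is fitted inside each CV fold, so validation data does not leak into it
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, max_iter=1000),
)

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(scores.mean())
```

Standardized features often let the solver converge in fewer iterations, though whether it removes the warning or changes F1 on the actual embeddings would need to be checked.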

## Model testing

In [101]:
model_2 = LogisticRegression(random_state=42, max_iter=1000)
model_2.fit(bert_x_train, target_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [102]:
predictions = model_2.predict(bert_x_test)
f1 = f1_score(target_test, predictions)

print(f1)

0.8909370199692781


## Conclusion