# Wikishop project with BERT

The online store Wikishop is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others.  
The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative (toxic), with *F1* of at least 0.75.


**Data description:**

- *text*: the comment text;
- *toxic*: the target label (whether the comment is toxic).


## Preparation

In [1]:
!pip install transformers -q
!pip install catboost -q
import sys
!{sys.executable} -m pip install spacy -q
!{sys.executable} -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.utils import shuffle
import warnings
import re
import lightgbm as lgb
import nltk
import torch
import spacy
from numpy import sqrt
from numpy import argmax
from numpy import arange
from sklearn.dummy import DummyClassifier
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_randfloat
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, roc_curve
from matplotlib import pyplot
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer 
from tqdm import notebook
from tqdm import tqdm
from nltk.stem import WordNetLemmatizer
import transformers
import catboost
from catboost import CatBoostClassifier
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv', index_col=0)

In [4]:
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


The data contains 159,292 rows and two columns. There are no missing values, and the comments are in English.
Next, prepare the data for model training.

Lemmatize the text.

In [6]:
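# Load the small English model; the parser and NER are not needed for lemmatization.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def text_preprocessing(text):
    """Lemmatize a comment and keep only Latin letters separated by single spaces."""
    doc = nlp(text)
    lemmas = " ".join(token.lemma_ for token in doc)
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', lemmas)
    return " ".join(letters_only.split())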

In [7]:
tqdm.pandas() 
df['lemm_text']=df['text'].progress_apply(text_preprocessing).str.lower()

100%|██████████| 159292/159292 [18:04<00:00, 146.89it/s]
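
To see what the cleanup part of `text_preprocessing` does on its own, here is a minimal sketch that skips the spaCy lemmatization step and applies only the regular expression, whitespace collapsing, and lowercasing. The helper name `clean_text` is hypothetical and is not part of the notebook.

```python
import re

def clean_text(text: str) -> str:
    """Replace every character that is not a Latin letter or a space with a space,
    collapse runs of whitespace, and lowercase the result.
    Assumption: lemmatization is skipped to keep the sketch dependency-free."""
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(letters_only.split()).lower()

sample = "D'aww! He matches this background colour"
print(clean_text(sample))
```

Punctuation, digits, and line breaks all become word separators, which is why contractions such as `D'aww` are split into two tokens in the `lemm_text` column.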


In [8]:
df.head()

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour i be see...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man i be really not try to edit war it be ...
3,"""\nMore\nI can't make any real suggestions on ...",0,more i can not make any real suggestion on imp...
4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you remember wha...


Split the data into training, validation, and test sets in a 60/20/20 ratio.


In [9]:
features = df['lemm_text']
target = df['toxic']

In [10]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, random_state=12345)
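
The two calls above produce a 60/20/20 split: the first holds out 40% of the rows, and the second halves that hold-out. A rough check of the resulting sizes is sketched below. It assumes that the held-out part of each split is rounded up, which is my understanding of how scikit-learn treats a float `test_size`; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
from math import ceil

def split_sizes(n_rows: int, holdout: float = 0.4, test_share: float = 0.5):
    """Return (train, valid, test) sizes for two consecutive splits.
    Assumption: the held-out part of each split is rounded up."""
    n_holdout = ceil(n_rows * holdout)      # rows set aside by the first split
    n_test = ceil(n_holdout * test_share)   # rows set aside by the second split
    n_valid = n_holdout - n_test
    n_train = n_rows - n_holdout
    return n_train, n_valid, n_test

train_n, valid_n, test_n = split_sizes(159292)
print(train_n, valid_n, test_n)
```

The three sizes add up to the 159,292 rows reported by `df.info()`.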

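Load the English stop words.

In [11]:
# Reference the corpus through nltk so that re-running this cell still works
# after the imported name `stopwords` has been rebound to a set.
stopwords = set(nltk.corpus.stopwords.words('english'))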

## Training

Create a helper function that records each model's results.

In [12]:
results = {'name': [], 'best_params':[], 'F1':[]}

def add_model_result(results, name, best_params, f1):
    results['name'].append(name)
    results['best_params'].append(best_params)
    results['F1'].append(f1)

Now tune the models' hyperparameters.

### Logistic regression

In [13]:
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(stop_words=stopwords)),
        ("clf", LogisticRegression()),
    ]
)
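# Note: since scikit-learn 0.22 LogisticRegression defaults to the lbfgs solver, which
# supports only the l2 penalty, so the "l1" candidates fail to fit and are scored as NaN.
parameters = {"clf__C": [1, 10, 100, 1000], "clf__penalty": ["l1", "l2"]}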
grid_lr = GridSearchCV(pipeline, parameters, scoring="f1", cv=3)
grid_lr.fit(features_train, target_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
         

In [14]:
print('LogisticRegression', grid_lr.best_score_, grid_lr.best_params_)

LogisticRegression 0.7653882922788018 {'clf__C': 10, 'clf__penalty': 'l2'}


In [15]:
add_model_result(results, 'LogisticRegression', grid_lr.best_params_, grid_lr.best_score_)

Try tuning the classification threshold.

In [16]:
def to_labels(predictions, threshold):
    """Turn predicted probabilities into 0/1 labels using the given threshold."""
    return (predictions >= threshold).astype('int')

In [17]:
predict = grid_lr.best_estimator_.predict_proba(features_valid)[:, 1]
thresholds = arange(0, 1, 0.001)
scores = [f1_score(target_valid, to_labels(predict, t)) for t in thresholds]
ix = argmax(scores)
print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Threshold=0.367, F-Score=0.79098
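
The threshold search above can be illustrated without any libraries. The sketch below computes F1 from true positives, false positives, and false negatives, then scans a grid of thresholds and keeps the first one with the highest score. The labels and probabilities are invented toy values, and `f1_binary` and `best_threshold` are hypothetical helpers rather than the notebook's code.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, step=0.001):
    """Scan thresholds 0, step, 2*step, ... below 1 and return (threshold, F1)
    for the first threshold that reaches the highest F1."""
    best_t, best_f1 = 0.0, -1.0
    for k in range(int(round(1 / step))):
        t = k * step
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Toy data, for illustration only.
toy_true = [0, 0, 0, 0, 1, 0, 1, 1]
toy_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90]

print("F1 at 0.5:", f1_binary(toy_true, [1 if p >= 0.5 else 0 for p in toy_probs]))
print("best threshold and F1:", best_threshold(toy_true, toy_probs))
```

On this toy data, lowering the threshold below 0.5 trades one false positive for one recovered positive, which raises F1. The same trade-off explains why the tuned thresholds in this notebook fall below 0.5 for an imbalanced target.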


In [18]:
add_model_result(results, 'LogisticRegression_trsh', thresholds[ix], scores[ix])

### LightGBM 

In [19]:
pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(stop_words=stopwords)),
        ("clf", lgb.LGBMClassifier()),
    ]
)
f1 = (cross_val_score(pipeline, features_train, target_train,scoring='f1', cv=3)).mean()
print('F1 = ', f1)

F1 =  0.7477460390446508


In [20]:
add_model_result(results, 'LightGBM', 'None', f1)

Tune the classification threshold.

In [21]:
pipeline.fit(features_train, target_train)
predict = pipeline.predict_proba(features_valid)[:, 1]
thresholds = arange(0, 1, 0.001)
scores = [f1_score(target_valid, to_labels(predict, t)) for t in thresholds]
ix = argmax(scores)
print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Threshold=0.277, F-Score=0.78777


In [22]:
add_model_result(results, 'LightGBM_trsh', thresholds[ix], scores[ix])

### CatBoost

Carve training, validation, and test sets out of the original (lemmatized) data.

In [23]:
# Fix the seed so that the 60/20/20 split is reproducible.
train = df.sample(frac=0.6, random_state=12345).copy()
validation = df[~df.index.isin(train.index)].copy()
test = validation.sample(frac=0.5, random_state=12345).copy()
val = validation[~validation.index.isin(test.index)].copy()

In [24]:
X_col = ['lemm_text']
y_col = ['toxic']
text_features = ['lemm_text']

In [25]:
%%time
model = CatBoostClassifier(text_features=text_features, silent=True, task_type="GPU")

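# scipy's uniform(loc, scale) samples from [loc, loc + scale], i.e. [0.01, 0.07] here;
# randint(low, high) excludes the upper bound.
parameters = {'depth': sp_randint(4, 10),
              'learning_rate': sp_randfloat(0.01, 0.06),
              'iterations': sp_randint(500, 1000)}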

grid_cat = RandomizedSearchCV(model,parameters, scoring='f1', cv=3)
grid_cat.fit(train[X_col],train[y_col])

CPU times: user 16min 29s, sys: 2min 19s, total: 18min 48s
Wall time: 12min 5s


RandomizedSearchCV(cv=3,
                   estimator=<catboost.core.CatBoostClassifier object at 0x7f752c6f1710>,
                   param_distributions={'depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f752c6f1110>,
                                        'iterations': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f752c6f1b50>,
                                        'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f752c6f1d90>},
                   scoring='f1')

In [26]:
print('CatBoost', grid_cat.best_score_, grid_cat.best_params_)

CatBoost 0.7830470888340056 {'depth': 9, 'iterations': 629, 'learning_rate': 0.06785217697316344}
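
The best `learning_rate` found above (about 0.068) is larger than 0.06 because `scipy.stats.uniform(loc, scale)` describes the interval `[loc, loc + scale]`, not `[low, high]`. A quick check, using the same arguments as in the search space:

```python
from scipy.stats import uniform

# Same arguments as sp_randfloat(0.01, 0.06) in the search space above.
learning_rate_dist = uniform(0.01, 0.06)

# ppf(0) and ppf(1) are the lower and upper ends of the distribution's support.
low = learning_rate_dist.ppf(0.0)
high = learning_rate_dist.ppf(1.0)
print(f"learning_rate is sampled from [{low:.2f}, {high:.2f}]")
```

To sample from 0.01 to 0.06 instead, the second argument would have to be 0.05.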


In [27]:
add_model_result(results, 'CatBoost', grid_cat.best_params_, grid_cat.best_score_)

Tune the classification threshold.

In [28]:
predict = grid_cat.best_estimator_.predict_proba(val[X_col])[:, 1]
thresholds = arange(0, 1, 0.001)
scores = [f1_score(val[y_col], to_labels(predict, t)) for t in thresholds]
ix = argmax(scores)
print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Threshold=0.292, F-Score=0.78657


In [29]:
add_model_result(results, 'CatBoost_trsh', thresholds[ix], scores[ix])

### BERT 

In [30]:
model = transformers.AutoModel.from_pretrained('unitary/toxic-bert')
tokenizer = transformers.AutoTokenizer.from_pretrained('unitary/toxic-bert')


Some weights of the model checkpoint at unitary/toxic-bert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



BERT embeddings are computed in batches. We set `batch_size = 100`, and the loop below processes only full batches, so we trim the dataset to a multiple of 100 rows.

In [31]:
df.shape

(159292, 3)

In [32]:
df_bert = df[:159200].copy()
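
Trimming drops the last 92 comments. As a sketch of an alternative (not what this notebook does), the batch boundaries can be generated so that the final batch is simply shorter. `batch_slices` is a hypothetical helper.

```python
def batch_slices(n_rows: int, batch_size: int):
    """Yield (start, stop) index pairs that cover all rows;
    the last pair may span fewer than batch_size rows."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

slices = list(batch_slices(159292, 100))
print(len(slices), slices[-1])
```

Each pair could then be used as `padded[start:stop]`, so no rows would have to be discarded.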

The model accepts at most 512 tokens, so we tell the tokenizer to truncate any sequence that exceeds this limit.

In [33]:
tokenized = df_bert['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))
n = max(len(x) for x in tokenized)
padded = np.array([i + [0]*(n - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)
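
The padding and mask construction can be illustrated on a toy input. The token ids below are made up for illustration; the only assumption carried over from the cell above is that id 0 is the padding token.

```python
import numpy as np

# Hypothetical token-id sequences of different lengths.
toy_tokenized = [[101, 7592, 2088, 102], [101, 7592, 102], [101, 102]]

# Right-pad every sequence with zeros up to the length of the longest one.
max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding that the model should ignore.
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

All rows end up with the same length, which is what allows them to be stacked into a single tensor per batch.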

In [34]:
batch_size = 100
embeddings = []
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)  # move the model to the device once, not on every batch
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the hidden state of the first ([CLS]) token as the comment embedding.
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())


Build the feature matrix and the target, then split them into training, validation, and test sets.


In [35]:
features_bert = np.concatenate(embeddings)
target_bert = df_bert['toxic']
features_bert_train, features_bert_valid, target_bert_train, target_bert_valid = train_test_split(
    features_bert, target_bert, test_size=0.4, random_state=12345)
features_bert_valid, features_bert_test, target_bert_valid, target_bert_test = train_test_split(
    features_bert_valid, target_bert_valid, test_size=0.5, random_state=12345)

In [36]:
model = LogisticRegression()
parameters = {"C":[1, 10, 100, 1000], "penalty":["l1","l2"]}

grid_lr_bert = GridSearchCV(model,parameters, scoring="f1", cv=3)
grid_lr_bert.fit(features_bert_train, target_bert_train)

GridSearchCV(cv=3, estimator=LogisticRegression(),
             param_grid={'C': [1, 10, 100, 1000], 'penalty': ['l1', 'l2']},
             scoring='f1')

In [37]:
print('BERT', grid_lr_bert.best_score_, grid_lr_bert.best_params_)

BERT 0.9416481207229989 {'C': 1, 'penalty': 'l2'}


In [38]:
add_model_result(results, 'BERT', grid_lr_bert.best_params_, grid_lr_bert.best_score_)

Tune the classification threshold.

In [39]:
predict = grid_lr_bert.best_estimator_.predict_proba(features_bert_valid)[:, 1]
thresholds = arange(0, 1, 0.001)
scores = [f1_score(target_bert_valid, to_labels(predict, t)) for t in thresholds]
ix = argmax(scores)
print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Threshold=0.489, F-Score=0.94944


In [40]:
add_model_result(results, 'BERT_trsh', thresholds[ix], scores[ix])

### Model results

In [41]:
pd.DataFrame(results).sort_values('F1', ascending=False)


Unnamed: 0,name,best_params,F1
7,BERT_trsh,0.489,0.949442
6,BERT,"{'C': 1, 'penalty': 'l2'}",0.941648
1,LogisticRegression_trsh,0.367,0.79098
3,LightGBM_trsh,0.277,0.787773
5,CatBoost_trsh,0.292,0.786571
4,CatBoost,"{'depth': 9, 'iterations': 629, 'learning_rate...",0.783047
0,LogisticRegression,"{'clf__C': 10, 'clf__penalty': 'l2'}",0.765388
2,LightGBM,,0.747746


The best model is therefore LogisticRegression trained on features produced by BERT (unitary/toxic-bert), with `C = 1, penalty = 'l2'` and a classification threshold of 0.489. It reached F1 = 0.949 on the validation set.

## Testing the model

In [43]:
features_bert_train = np.concatenate([features_bert_train, features_bert_valid])
target_bert_train = np.concatenate([target_bert_train, target_bert_valid])

In [44]:
model = grid_lr_bert.best_estimator_
model.fit(features_bert_train, target_bert_train)
threshold = 0.489
# Reuse to_labels() so the >= rule matches the one used when the threshold was chosen.
predictions = to_labels(model.predict_proba(features_bert_test)[:, 1], threshold)
print('F1 on the test set =', f1_score(target_bert_test, predictions))

F1 on the test set = 0.9525378050606378


### Sanity check

Compare the model with a random baseline.

In [45]:
dummy = DummyClassifier(strategy="uniform")
dummy.fit(features_bert_train, target_bert_train)
pred = dummy.predict(features_bert_test)
print('F1 dummy =', f1_score(target_bert_test, pred))

F1 dummy = 0.172559669922181
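
The dummy score can be cross-checked by hand. A classifier that predicts 1 with probability 0.5 regardless of the input has an expected recall of about 0.5 and a precision about equal to the share of positives p, so F1 ≈ 2·p·0.5 / (p + 0.5) = p / (p + 0.5). The sketch below evaluates this expression; the positive share of 0.10 is an assumed, illustrative value, not one computed in this notebook, and `expected_f1_uniform` is a hypothetical helper.

```python
def expected_f1_uniform(positive_share: float) -> float:
    """Approximate F1 of a classifier that guesses the positive class with probability 0.5.
    Precision is taken as the positive share and recall as 0.5."""
    precision = positive_share
    recall = 0.5
    return 2 * precision * recall / (precision + recall)

# Assumed share of toxic comments, for illustration only.
print(round(expected_f1_uniform(0.10), 3))
```

If roughly one comment in ten is toxic, this approximation lands close to the dummy F1 reported above.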


On the test set the model reached F1 = 0.95, which meets the requirement of at least 0.75 and is far above the random baseline.

## Conclusions

We prepared the data and trained several models. The best model by F1 is LogisticRegression trained on features produced by BERT (unitary/toxic-bert), with `C = 1, penalty = 'l2'` and a classification threshold of 0.489. It reached F1 = 0.95 on the test set.