<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>TF-IDF</a></span></li><li><span><a href="#Tunned-BERT-+-emmbeding" data-toc-modified-id="Tunned-BERT-+-emmbeding-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Tuned BERT + embeddings</a></span></li><li><span><a href="#BERT-+-Fine-tunning" data-toc-modified-id="BERT-+-Fine-tunning-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>BERT + fine-tuning</a></span></li><li><span><a href="#Использование-и-тестирование-обученной-BERT-модели" data-toc-modified-id="Использование-и-тестирование-обученной-BERT-модели-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Using and testing the trained BERT model</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Comment toxicity analysis project for an online store

An online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

In [1]:
## install missing libraries
# %%script echo
! pip install -q evaluate
! pip install -q sentencepiece
! pip install -U sacremoses



In [2]:
# %%script echo
! python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 1.3 MB/s eta 0:00:01
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import ssl
import os

import evaluate
import nltk
import numpy as np
import pandas as pd
import spacy
import torch
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import notebook
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          TextClassificationPipeline,
                          get_scheduler)

from datasets import Dataset

pd.options.mode.chained_assignment = None
os.environ["TOKENIZERS_PARALLELISM"] = "true"

'1.21.1'

In [None]:
## constants for the upcoming training
TEST_SIZE = 0.2
RANDOM_STATE = 15071982
CV_SIZE = 5

In [4]:
try:
    data = pd.read_csv('datasets/toxic_comments.csv')
except FileNotFoundError:
    data = pd.read_csv('/datasets/toxic_comments.csv')

In [5]:
data.head()

   Unnamed: 0                                               text  toxic
0           0  Explanation\nWhy the edits made under my usern...      0
1           1  D'aww! He matches this background colour I'm s...      0
2           2  Hey man, I'm really not trying to edit war. It...      0
3           3  "\nMore\nI can't make any real suggestions on ...      0
4           4  You, sir, are my hero. Any chance you remember...      0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


## Preparation

In [8]:
### drop the unneeded column
data.drop(columns='Unnamed: 0', inplace=True)


In [9]:
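## Lemmatization
spacy_lemmatizer = spacy.load("en_core_web_sm")
def do_lemmatize(text):
    parsed_text = spacy_lemmatizer(text)
    return " ".join([w.lemma_ for w in parsed_text if not w.is_punct and not w.like_num and not w.is_space])

# NOTE: the TF-IDF cells below read data['text_lemm'], which this export never creates.
# Presumably it was built with: data['text_lemm'] = data['text'].apply(do_lemmatize)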

## Training

### TF-IDF

In [193]:
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('stopwords')
stopwords = list(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [194]:
data_train, data_test = train_test_split(data, test_size=TEST_SIZE, stratify=data['toxic'], random_state=RANDOM_STATE)

In [195]:
def create_tf_idf_pipeline():
    return Pipeline([
        ("vect", TfidfVectorizer(stop_words = stopwords)),
        ("model", LogisticRegression(max_iter=1000))
    ])

In [196]:
%%script echo
tf_idf_pipeline = create_tf_idf_pipeline()
tf_idf_pipeline.fit(data_train['text_lemm'],data_train['toxic'])

predictions_test = tf_idf_pipeline.predict(data_test['text_lemm'])
f1_score_test = f1_score(data_test['toxic'], predictions_test)
print(f'f1-score on the test set: {f1_score_test}')




In [197]:
%%script echo

parameters = {'model__C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(create_tf_idf_pipeline(),
    parameters, scoring='f1', cv=CV_SIZE, n_jobs=-1)

grid_search.fit(data_train['text_lemm'], data_train['toxic'])

print('Best parameters: ', grid_search.best_params_)
print(f'f1-score on cross-validation: {grid_search.best_score_}')




In [198]:
%%script echo
predictions_test = grid_search.predict(data_test['text_lemm'])
f1_score_test = f1_score(data_test['toxic'], predictions_test)
print(f'f1-score on the test set: {f1_score_test}')


### Obtained on a home machine
# Best parameters:  {'model__C': 21.05271052631579}
# f1-score on cross-validation: 0.7757574153726411
# f1-score on the test set: 0.7732558139534884
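
The reported best value `C=21.05271052631579` is the fifth point of the linear grid `np.linspace(0.0001, 100, 20)`. A linear grid covers small values of `C` sparsely: it jumps from 0.0001 straight to about 5.26. The sketch below reproduces that grid and, as a hedged alternative that the notebook did not run, builds a logarithmic grid; the bounds `1e-3` and `1e3` and the name `log_grid` are illustrative assumptions.

```python
import numpy as np

# The grid used in the notebook: 20 evenly spaced values of C.
linear_grid = np.linspace(0.0001, 100, 20)
print(linear_grid[:3])   # first candidates: 0.0001, then roughly 5.26 and 10.53
print(linear_grid[4])    # the value GridSearchCV reported as best

# Hypothetical alternative: 7 log-spaced candidates from 1e-3 to 1e3.
log_grid = np.logspace(-3, 3, 7)
print(log_grid)
```

A logarithmic grid is a common choice for regularization strength because the effect of `C` tends to change by orders of magnitude rather than by fixed steps.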




#### Conclusions
The TF-IDF + LogisticRegression model with hyperparameter C=21.05 reached the required quality on the test set: f1-score = 0.77.
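
To make the mechanics of this approach concrete, here is a minimal sketch of a TF-IDF + logistic regression pipeline on a toy corpus. The texts, labels and the name `toy_pipeline` are hypothetical and only illustrate the flow; they are not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to show the mechanics of the pipeline.
texts = [
    "you are an idiot",
    "what a stupid idiot",
    "thank you for the helpful edit",
    "great article, thank you",
]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),                   # text -> sparse TF-IDF matrix
    ("model", LogisticRegression(max_iter=1000)),  # linear classifier on top
])
toy_pipeline.fit(texts, labels)

# predict_proba returns one row per text: [P(class 0), P(class 1)].
probabilities = toy_pipeline.predict_proba(["thank you", "stupid edit"])
print(probabilities.shape)
```

Keeping the vectorizer inside the `Pipeline` matters for cross-validation: `GridSearchCV` refits the whole pipeline on each training fold, so the vocabulary and IDF weights never see the validation fold.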

### Tuned BERT + embeddings

Load a ready-made model and tokenizer from Hugging Face.

In [199]:
model_name = 'martin-ha/toxic-comment-model'
# model_name = 'bert-large-uncased'
# model_name = 'bert-base-uncased'
# model_name = 'distilbert-base-uncased'
# model_name = 'unitary/toxic-bert'

tokenizer = torch.hub.load('huggingface/pytorch-transformers',
                           'tokenizer',
                           model_name,
                           do_lower_case=True)
model = torch.hub.load('huggingface/pytorch-transformers', 'model', model_name)


Using cache found in /Users/user/.cache/torch/hub/huggingface_pytorch-transformers_main
Using cache found in /Users/user/.cache/torch/hub/huggingface_pytorch-transformers_main
Some weights of the model checkpoint at martin-ha/toxic-comment-model were not used when initializing DistilBertModel: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Encode the texts.

In [200]:
BERT_DATA_SIZE = 10000
df_bert = data.sample(BERT_DATA_SIZE, random_state=RANDOM_STATE)
df_bert['toxic'].value_counts(normalize=True)

0    0.8965
1    0.1035
Name: toxic, dtype: float64

In [201]:
tokenized = df_bert['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512))

In [202]:
max_len = max(len(ids) for ids in tokenized.values)

print(f'max length = {max_len}')

# pad with 0 (assumed to be this tokenizer's [PAD] id) and mask out the padding
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)

max length = 512
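
The padding step above can be sketched on a tiny example. The token ids below are made up, not real tokenizer output, and `pad_id = 0` mirrors the assumption in the notebook's code. The sketch builds the attention mask from sequence lengths rather than from `padded != 0`, which is a slightly more defensive variant, not the notebook's method.

```python
import numpy as np

# Hypothetical token-id sequences of different lengths (not real tokenizer output).
sequences = [[101, 7592, 102], [101, 102], [101, 2023, 2003, 102]]
pad_id = 0  # assumption: 0 is the padding id, as in the notebook's code

lengths = np.array([len(ids) for ids in sequences])
max_len = lengths.max()

# Right-pad every sequence to the same length.
padded = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
for row, ids in enumerate(sequences):
    padded[row, :len(ids)] = ids

# Build the mask from lengths instead of comparing against pad_id,
# so a real token that happened to share the pad id would not be masked.
attention_mask = (np.arange(max_len) < lengths[:, None]).astype(np.int64)
print(padded)
print(attention_mask)
```

Both variants give the same mask whenever the pad id never occurs as a real token.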


In [203]:
%%script echo
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size * i:batch_size * (i + 1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size * i:batch_size * (i + 1)])

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    embeddings.append(batch_embeddings[0][:, 0, :].numpy())
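
The loop above runs `padded.shape[0] // batch_size` iterations, so it would silently skip trailing rows if the sample size were not a multiple of `batch_size` (with 10000 rows and a batch of 100 nothing is lost). A minimal sketch of batch bounds that also cover a shorter final batch; the helper name `iter_batch_bounds` is hypothetical:

```python
def iter_batch_bounds(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover all n_rows, including a shorter final batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

bounds = list(iter_batch_bounds(10, 4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

Each pair can be used directly as a slice, for example `padded[start:stop]`.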





Train the model.

In [204]:
%%script echo
features = np.concatenate(embeddings)
features_train, features_test, target_train, target_test = train_test_split(features,
                                                                            df_bert['toxic'],
                                                                            test_size=TEST_SIZE,
                                                                            stratify=df_bert['toxic'],
                                                                            random_state=RANDOM_STATE)

parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(max_iter=10000), parameters, scoring='f1', cv=CV_SIZE, n_jobs=-1)
grid_search.fit(features_train, target_train)

print('Best parameters: ', grid_search.best_params_)
print(f'f1-score on cross-validation: {grid_search.best_score_}')

predictions_test = grid_search.predict(features_test)
print(f'f1-score on the test set: {f1_score(target_test, predictions_test)}')

print()

### Obtained on a home machine
# bert-large-uncased (datasize = 400)
# Best parameters:  {'C': 10.526405263157894}
# f1-score on cross-validation: 0.5938375350140056
# f1-score on the test set: 0.28571428571428575

# bert-base-uncased (datasize = 400)
# Best parameters:  {'C': 31.579015789473683}
# f1-score on cross-validation: 0.5004195804195805
# f1-score on the test set: 0.625

# distilbert-base-uncased (datasize = 400)
# Best parameters:  {'C': 5.263252631578947}
# f1-score on cross-validation: 0.6257664884135472
# f1-score on the test set: 0.7499999999999999

# martin-ha/toxic-comment-model (datasize = 400)
# Best parameters:  {'C': 5.263252631578947}
# f1-score on cross-validation: 0.6229181929181928
# f1-score on the test set: 0.888888888888889

# martin-ha/toxic-comment-model (datasize = 10000)
# Best parameters:  {'C': 89.4736947368421}
# f1-score on cross-validation: 0.7406059478496869
# f1-score on the test set: 0.7206703910614525




Using the ready-made toxic-comment model as a feature extractor for logistic regression, we reached an f1-score of 0.72 on the test set.

### BERT + fine-tuning

In [205]:
BERT_DATA_SIZE = 400

LOADER_BATCH_SIZE = 8
NUM_EPOCHS = 1

df, df_save = train_test_split(data, test_size=0.2, stratify=data['toxic'], random_state=RANDOM_STATE)

if BERT_DATA_SIZE >= 100_000:
    df = df.copy()
else:
    df = df.sample(BERT_DATA_SIZE, random_state=RANDOM_STATE)

data_train, data_test = train_test_split(df, test_size=TEST_SIZE, stratify=df['toxic'], random_state=RANDOM_STATE)
print(f'shape train {data_train.shape}')
print(f'shape test {data_test.shape}')

data_train = Dataset.from_pandas(data_train)
data_test = Dataset.from_pandas(data_test)

shape train (320, 2)
shape test (80, 2)


In [206]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)


def ds_preproc(ds):
    ds = ds.map(tokenize_function)
    ds = ds.remove_columns(['text', '__index_level_0__'])
    ds = ds.rename_column('toxic', 'labels')
    ds.set_format('torch')
    return ds

In [207]:
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_train = ds_preproc(data_train)
tokenized_test = ds_preproc(data_test)

Map:   0%|          | 0/320 [00:00<?, ? examples/s]

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

In [208]:
train_dataloader = DataLoader(tokenized_train, shuffle=True, batch_size=LOADER_BATCH_SIZE)
test_dataloader = DataLoader(tokenized_test, batch_size=LOADER_BATCH_SIZE)

In [209]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [210]:
%%script echo

optimizer = AdamW(model.parameters(), lr=5e-6)

num_training_steps = NUM_EPOCHS * len(train_dataloader)

lr_scheduler = get_scheduler(
    name='linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps)

device = 'cpu'
model.to(device)

# Run the loop of...
for epoch in notebook.tqdm(range(NUM_EPOCHS)):

    #... training
    model.train()
    for batch in notebook.tqdm(train_dataloader, leave=False):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    #... evaluation
    metric = evaluate.load('f1')

    model.eval()
    for batch in notebook.tqdm(test_dataloader, leave=False):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        metric.add_batch(predictions=predictions, references=batch['labels'])

    print(f'epoch {epoch} -', metric.compute())

### Obtained on a home machine
# BERT_DATA_SIZE = 4000
# LOADER_BATCH_SIZE = 100
# NUM_EPOCHS = 1
# 2:51:00
# epoch 0 - {'f1': 0.0}
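
An f1 of 0.0 means the evaluation counted no true positives. Since only about 10% of the comments are toxic, one possible explanation (an assumption, not verified here) is that after a single epoch the model predicts the majority class for every test comment. The sketch below computes binary F1 from raw labels to show that such a model scores 0.0 even though its accuracy would be 90%; the function name and the toy labels are hypothetical.

```python
def f1_from_labels(y_true, y_pred):
    """Binary F1 for the positive class 1; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]      # hypothetical labels, about 10% positive
all_negative = [0] * 10                       # a model that never predicts "toxic"
print(f1_from_labels(y_true, all_negative))   # 0.0
print(f1_from_labels(y_true, y_true))         # 1.0
```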




In [211]:
%%script echo
# Save the model
save_directory = f'./models/my_pretrained_toxic_{NUM_EPOCHS}_{LOADER_BATCH_SIZE}_{BERT_DATA_SIZE}'
model.save_pretrained(save_directory)




### Using and testing the trained BERT model

In [212]:
df = data.sample(10000, random_state=RANDOM_STATE)
model_path = "martin-ha/toxic-comment-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
toxic_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

In [213]:
def get_toxic_flag(text):
    result = toxic_pipeline(text, padding=True, truncation=True)
    if result[0]['label'] == 'toxic':
        return 1
    return 0

In [214]:
## example of the ready-made model in action
for text in ['you are Mother fucker', 'you are lucky guy']:
    print(f'for "{text}" toxic flag is {get_toxic_flag(text)}')

for "you are Mother fucker" toxic flag is 1
for "you are lucky guy" toxic flag is 0
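
`get_toxic_flag` calls the pipeline once per comment. A Hugging Face text-classification pipeline also accepts a list of strings and returns one `{'label': ..., 'score': ...}` dict per input, so the label-to-flag conversion can be applied to a whole batch. The sketch below shows only that conversion; `fake_pipeline` is a hypothetical stand-in that mimics the output format so the example runs without downloading a model, and its label names and scores are made up.

```python
def labels_to_flags(results, positive_label="toxic"):
    """Map pipeline output dicts to 0/1 flags; 1 means the positive label was predicted."""
    return [1 if item["label"] == positive_label else 0 for item in results]

def fake_pipeline(texts):
    """Hypothetical stand-in that mimics the output format of a text-classification pipeline."""
    return [
        {"label": "toxic" if "idiot" in text else "non-toxic", "score": 0.99}
        for text in texts
    ]

flags = labels_to_flags(fake_pipeline(["you are an idiot", "you are a lucky guy"]))
print(flags)  # [1, 0]
```

With the real `toxic_pipeline`, the same helper could be applied to the result of a call on a list of comments, which avoids the per-row Python overhead of `DataFrame.apply`.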


In [215]:
%%script echo
%time df['toxic_flag_prediction'] = df['text'].apply(get_toxic_flag)

### Obtained on a home machine
# sample size: 10000
# CPU times: user 53min 35s, sys: 2min 4s, total: 55min 40s
# Wall time: 7min 8s




In [216]:
%%script echo
df['toxic_flag_prediction'].value_counts(normalize=True)




In [217]:
%%script echo
f1_score_test = f1_score(df['toxic'],df['toxic_flag_prediction'])
print(f'f1-score : {f1_score_test}')

## Obtained on a home machine
# sample size: 10000
# f1-score : 0.6963048498845266




## Conclusions

In this study we trained a toxic-comment detection model based on TF-IDF and LogisticRegression. With hyperparameter C=21.05, the model reached f1-score = 0.77 on the test set, which meets the task requirements.

We also attempted to build a model on top of a pretrained BERT model, in two variants:
* building embeddings with the ready-made model [martin-ha/toxic-comment-model](https://huggingface.co/martin-ha/toxic-comment-model) + logistic regression: f1-score on the test set 0.72
* fine-tuning bert-base-uncased for binary classification. Unfortunately, the available hardware cannot fine-tune this model to the required quality in reasonable time, so no results are reported; the code is kept for reference
