# Project for «Викишоп» (Wikishop) with BERT

The online store Wikishop is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative, that is, non-toxic or toxic.



## Preparation

In [2]:
!pip install ydata_profiling nltk lightgbm torch transformers tqdm -q

In [8]:
import os
import pandas as pd
import numpy as np
import ydata_profiling
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
import re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
nltk.download("stopwords") # stop-word lists
nltk.download('punkt') # tokenizer models used by word_tokenize
nltk.download('wordnet') # WordNet data used for lemmatization
stopwords = list(stopwords.words('english'))
nltk.download('averaged_perceptron_tagger') # POS tagger used by nltk.pos_tag

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

import torch
import transformers as tf
from tqdm import notebook

import pickle
import time

import resource, sys
resource.setrlimit(resource.RLIMIT_STACK, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
sys.setrecursionlimit(10**6)

RANDOM_STATE = 42
CROSSVAL = 5
%matplotlib inline

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [4]:
try:
    data = pd.read_csv('toxic_comments.csv', index_col='Unnamed: 0')
except FileNotFoundError:
    # fall back to the remote copy when the local file is missing
    data = pd.read_csv('https://****.csv', index_col='Unnamed: 0')

In [5]:
data.head(20)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [1]:
data.profile_report()

The dataset contains 159292 comments with no missing values and no duplicates; comment length varies.

The classes are heavily imbalanced, which will affect the metric and needs to be addressed.

## Training

Text cleaning and lemmatization

In [6]:
# Lemmatize with POS Tag

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# # 1. Init Lemmatizer
# lemmatizer = WordNetLemmatizer()
# # 2. Lemmatize Single Word with the appropriate POS tag
# word = 'feet'
# print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))
# # 3. Lemmatize a Sentence with the appropriate POS tag
# sentence = "The striped bats are hanging on their feet for best"
# print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
# #> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']

In [9]:
%%time

lemmatizer = nltk.WordNetLemmatizer()

def cleaning(row):
    # replace non-letter characters with spaces and lowercase the text
    text = re.sub('[^a-zA-Z]', ' ', row.text).lower()
    # tokenization
    text = nltk.word_tokenize(text, language = 'english')
    # lemmatization with a POS tag for every token
    text = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in text]
    # join the lemmas back into a single string
    row['text_new'] = ' '.join(text)
    return row

clean_data = data.apply(cleaning, axis=1).drop(['text'],axis = 1)

CPU times: user 19min 44s, sys: 1min 41s, total: 21min 25s
Wall time: 21min 26s
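
Much of this time is likely spent in `get_wordnet_pos`, which calls `nltk.pos_tag` once per token. Because each word is tagged in isolation, the result depends only on the word itself, so one possible speed-up is to memoize the per-word work. The sketch below illustrates the idea with a hypothetical stand-in `slow_lemma` (a toy rule, not the WordNet lemmatizer); `cached_lemma` and `clean` are illustrative names as well.

```python
from functools import lru_cache

call_count = {"n": 0}

def slow_lemma(word: str) -> str:
    """Hypothetical stand-in for an expensive per-word lemmatization call."""
    call_count["n"] += 1
    return word[:-1] if word.endswith("s") else word  # toy rule, for illustration only

@lru_cache(maxsize=None)
def cached_lemma(word: str) -> str:
    # Each distinct word reaches slow_lemma only once; repeats are served from the cache
    return slow_lemma(word)

def clean(text: str) -> str:
    return " ".join(cached_lemma(w) for w in text.lower().split())

texts = ["cats chase cats", "dogs chase cats"]
cleaned = [clean(t) for t in texts]
print(cleaned, call_count["n"])
```

In the notebook the same decorator could wrap a function that combines `get_wordnet_pos` and `lemmatizer.lemmatize` for a single word; whether this helps enough depends on how many distinct words the corpus contains.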


In [10]:
clean_data

Unnamed: 0,toxic,text_new
0,0,explanation why the edits make under my userna...
1,0,d aww he match this background colour i m seem...
2,0,hey man i m really not try to edit war it s ju...
3,0,more i can t make any real suggestion on impro...
4,0,you sir be my hero any chance you remember wha...
...,...,...
159446,0,and for the second time of ask when your view ...
159447,0,you should be ashamed of yourself that be a ho...
159448,0,spitzer umm there no actual article for prosti...
159449,0,and it look like it be actually you who put on...


Train/test split + TF-IDF

In [11]:
x_train, x_test , y_train, y_test = train_test_split(
    clean_data['text_new'],
    clean_data['toxic'],
    test_size = 0.3,
    random_state = RANDOM_STATE)

In [12]:
count_tf_idf = TfidfVectorizer(stop_words = stopwords)

In [13]:
%%time
tf_idf_x_train = count_tf_idf.fit_transform(x_train)

CPU times: user 4.51 s, sys: 56.2 ms, total: 4.57 s
Wall time: 4.57 s
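
Here the vectorizer is fitted on the whole training split before cross-validation, so every validation fold has already contributed to the vocabulary and IDF weights. A minimal sketch of one alternative, assuming a toy corpus in place of `x_train`: wrapping the vectorizer and the classifier in a pipeline lets `cross_val_score` refit TF-IDF inside every fold, and the fitted pipeline can then score held-out raw text directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for x_train / y_train (illustrative data only)
texts = ["thank you for the edit", "you are an idiot", "nice work on the article",
         "shut up you moron", "please see the talk page", "what a stupid idiot",
         "great addition thanks", "you stupid moron"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipe = make_pipeline(TfidfVectorizer(),
                     LogisticRegression(class_weight="balanced", max_iter=1000))

# TF-IDF is refitted on the training part of each fold
scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1")

pipe.fit(texts, labels)
pred = pipe.predict(["you are a moron"])  # raw text in, label out
```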


Class rebalancing

In [14]:
class_ratio = data['toxic'].value_counts()[0] / data['toxic'].value_counts()[1]
class_ratio

8.841344371679229
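
This ratio serves as a manual weight for the toxic class, while the later model relies on `class_weight='balanced'`. According to the scikit-learn documentation, the balanced mode sets each weight to `n_samples / (n_classes * count(class))`, so the ratio between the two class weights equals the class ratio; the two settings differ only by a common scale factor, which can still interact with the regularization strength. A small check on toy labels (the label vector is made up for the example):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 9 non-toxic comments for every toxic one (illustrative only)
y = np.array([0] * 9 + [1])

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_ratio = (y == 0).sum() / (y == 1).sum()

print(weights)                  # weight per class
print(weights[1] / weights[0])  # matches class_ratio
```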

In [15]:
%%time
classificator = LogisticRegression()
train_f1 = cross_val_score(classificator, 
                      tf_idf_x_train, 
                      y_train, 
                      cv=CROSSVAL, 
                      scoring='f1').mean()
print('F1 on CV', train_f1)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 on CV 0.7112140481801166
CPU times: user 1min 53s, sys: 1min 54s, total: 3min 47s
Wall time: 3min 48s


In [16]:
dict_classes={0:1, 1:class_ratio}

classificator = LogisticRegression(class_weight=dict_classes)

train_f1_balanced = cross_val_score(classificator, 
                                    tf_idf_x_train, 
                                    y_train, 
                                    cv=CROSSVAL, 
                                    scoring='f1').mean()
print('F1 on CV with balanced classes', train_f1_balanced)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 on CV with balanced classes 0.7493947756056886


### LogisticRegression

In [17]:
modelLR = LogisticRegression(solver = 'lbfgs',
                             max_iter = 1000,
                             class_weight='balanced')

In [18]:
%%time
LR = modelLR.fit(tf_idf_x_train, y_train)

CPU times: user 27.3 s, sys: 27.9 s, total: 55.2 s
Wall time: 55.2 s


In [19]:
LR_score = cross_val_score(LR, tf_idf_x_train, y_train, cv=CROSSVAL,scoring='f1')
LR_score.mean()

0.7434481087151483

### LightGBM

In [20]:
modelLGB = LGBMClassifier(random_state = RANDOM_STATE)
param_search = {'learning_rate' : [0.1],
                'n_estimators' : [100, 200, 1000],
                'verbose': [0]
               }
gsearchLGB = GridSearchCV(estimator=modelLGB, cv=CROSSVAL,
                        param_grid=param_search,scoring = 'f1')


In [21]:
%%time
gsearchLGB.fit(tf_idf_x_train, y_train)

You can set `force_col_wise=true` to remove the overhead.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.

GridSearchCV(cv=5, estimator=LGBMClassifier(random_state=42),
             param_grid={'learning_rate': [0.1],
                         'n_estimators': [100, 200, 1000], 'verbose': [0]},
             scoring='f1')

In [22]:
print(gsearchLGB.best_score_)
print(gsearchLGB.best_params_)

0.7668145344599745
{'learning_rate': 0.1, 'n_estimators': 1000, 'verbose': 0}


The LGBM classifier reaches the target metric in cross-validation on the training set (F1 of about 0.767). Since the margin over the 0.75 threshold is small, we also check how a BERT model performs on this dataset.

### BERT

Looking for a GPU

In [23]:
print('torch.cuda.is_available() ', torch.cuda.is_available())
print('torch.cuda.current_device() ', torch.cuda.current_device())
print('torch.cuda.device_count()', torch.cuda.device_count())
print('torch.cuda.get_device_name(0)', torch.cuda.get_device_name(0))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

torch.cuda.is_available()  True
torch.cuda.current_device()  0
torch.cuda.device_count() 1
torch.cuda.get_device_name(0) Tesla M40
cuda:0


Tokenizer

In [25]:
tokenizer = tf.BertTokenizer.from_pretrained('unitary/toxic-bert')

In [26]:
text = list(data['text'])

In [27]:
%%time
tokenized = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

CPU times: user 7min 20s, sys: 17.9 s, total: 7min 38s
Wall time: 7min 27s


Model

In [79]:
model = tf.BertModel.from_pretrained('unitary/toxic-bert')  # `transformers` is imported as `tf` above
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

Computing the embeddings

In [97]:
%%time
batch_size = 150
embeddings = []

# Integer division drops the final partial batch (rows beyond num_batches * batch_size)
num_batches = tokenized['input_ids'].shape[0] // batch_size

for i in notebook.tqdm(range(num_batches)):
    start, end = batch_size * i, batch_size * (i + 1)
    batch = {'input_ids': tokenized['input_ids'][start:end].to(device),
             'token_type_ids': tokenized['token_type_ids'][start:end].to(device),
             'attention_mask': tokenized['attention_mask'][start:end].to(device)}

    start_time = time.time()
    print(f"Processing batch {i+1}/{num_batches}")

    with torch.no_grad():
        batch_embeddings = model(**batch)

    # keep the [CLS] token vector of the last hidden state as the comment embedding
    embeddings.append(batch_embeddings[0][:, 0, :].to('cpu').numpy())

    elapsed_time = time.time() - start_time
    remaining_batches = num_batches - (i + 1)
    remaining_time = elapsed_time * remaining_batches
    print(f"Batch {i+1} processed. Estimated time remaining: {remaining_time:.2f} sec ({(remaining_time/60):.2f} min)")

torch.cuda.empty_cache()

  0%|          | 0/1061 [00:00<?, ?it/s]

Processing batch 1/1061
Batch 1 processed. Estimated time remaining: 6307.12 sec (105.12 min)
Processing batch 2/1061
Batch 2 processed. Estimated time remaining: 8773.26 sec (146.22 min)
Processing batch 3/1061
Batch 3 processed. Estimated time remaining: 10092.57 sec (168.21 min)
Processing batch 4/1061
Batch 4 processed. Estimated time remaining: 10490.02 sec (174.83 min)
Processing batch 5/1061
Batch 5 processed. Estimated time remaining: 10641.26 sec (177.35 min)
Processing batch 6/1061
Batch 6 processed. Estimated time remaining: 10784.83 sec (179.75 min)
Processing batch 7/1061
Batch 7 processed. Estimated time remaining: 11056.60 sec (184.28 min)
Processing batch 8/1061
Batch 8 processed. Estimated time remaining: 11147.71 sec (185.80 min)
Processing batch 9/1061
Batch 9 processed. Estimated time remaining: 11019.74 sec (183.66 min)
Processing batch 10/1061
Batch 10 processed. Estimated time remaining: 11198.54 sec (186.64 min)
Processing batch 11/1061
Batch 11 processed. Estim

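On a Tesla M40, computing the embeddings took 3 hours 7 minutes, which is very slow. Note that this is inference only: the BERT model itself is not trained here.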
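
Part of this cost probably comes from padding: the tokenizer was called once on the whole corpus with `padding=True`, so every comment is padded to the longest sequence in the corpus (up to 512 tokens), and each batch is processed at that length. One common alternative is to pad each batch only to its own longest sequence. The sketch below shows the idea on plain token-id lists with NumPy; `pad_batch` and the `pad_id=0` default are illustrative assumptions, not part of the `transformers` API.

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Pad variable-length token-id lists to the longest sequence in this batch."""
    max_len = max(len(s) for s in seqs)
    input_ids = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(seqs), max_len), dtype=np.int64)
    for row, seq in enumerate(seqs):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask

# Toy token ids: two short comments and one long one (illustrative values)
token_lists = [[101, 7592, 102], [101, 2017, 2024, 102], [101] + [2000] * 10 + [102]]

# Sorting by length keeps similarly sized comments in the same batch
order = sorted(range(len(token_lists)), key=lambda i: len(token_lists[i]))
batch_size = 2
for start in range(0, len(order), batch_size):
    idx = order[start:start + batch_size]
    ids, mask = pad_batch([token_lists[i] for i in idx])
    print(idx, ids.shape)
```

If batches are built from a sorted order like this, the resulting embeddings have to be put back into the original row order (using `order`) before they are paired with the labels.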

In [99]:
# with open('embeddings.pickle', 'wb') as f:
#     pickle.dump(embeddings, f)
# None

In [28]:
with open('embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)

Building the feature matrix and splitting it into train and test sets

In [29]:
x = np.concatenate(embeddings)
y = data['toxic'].iloc[:len(x)]  # positional slice: only the first 159150 rows have embeddings
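
The labels are limited to 159150 rows because the embedding loop iterates over `n // batch_size` full batches and never reaches the last 142 of the 159292 comments. A small sketch of how batch boundaries could be generated so that the final partial batch is included; `batch_bounds` is a hypothetical helper name.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) index pairs covering all rows, including a short last batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

n_rows, batch_size = 159292, 150

full_only = n_rows // batch_size            # batches visited with floor division
bounds = list(batch_bounds(n_rows, batch_size))

print(full_only, full_only * batch_size)    # batches and rows covered by floor division
print(len(bounds), bounds[-1])              # batches and last slice when the remainder is kept
```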

In [30]:
x, x_test, y, y_test = train_test_split(x, y, random_state=RANDOM_STATE, stratify=y)
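
Passing `stratify=y` keeps the share of toxic comments the same in both parts, which matters with an imbalance of roughly 9 to 1. A minimal illustration on synthetic labels (the arrays below are made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 20% positive class (illustrative only)
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 160 + [1] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(y_train.mean(), y_test.mean())  # positive-class share in each part
```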

Training the classification model

In [31]:
model = LogisticRegression(random_state=RANDOM_STATE, n_jobs=-1)

In [32]:
param_grid = {
    'solver':['lbfgs', 'saga'],
    'C':[0.05, 0.1, 0.15],
    'class_weight': ['balanced'],
                  
}

gs_lr = GridSearchCV(
    model,
    param_grid,
    scoring='f1',
    n_jobs = -1,
    verbose=1
)

In [33]:
%time gs_lr.fit(x, y)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

CPU times: user 4min 40s, sys: 44.7 s, total: 5min 25s
Wall time: 10min 18s




In [34]:
pd.DataFrame(gs_lr.cv_results_)[
    ['param_C',
     'param_solver',
     'mean_test_score',
     'rank_test_score']
]

Unnamed: 0,param_C,param_solver,mean_test_score,rank_test_score
0,0.05,lbfgs,0.92226,5
1,0.05,saga,0.922076,6
2,0.1,lbfgs,0.923242,4
3,0.1,saga,0.923264,3
4,0.15,lbfgs,0.923753,2
5,0.15,saga,0.924357,1


Predictions on the test set

In [47]:
predictions = gs_lr.best_estimator_.predict(x_test)

cr = classification_report(y_test, predictions, output_dict=True)
display(pd.DataFrame(cr).round(decimals=3).transpose())

Unnamed: 0,precision,recall,f1-score,support
0,0.998,0.984,0.991,35744.0
1,0.871,0.986,0.925,4044.0
accuracy,0.984,0.984,0.984,0.984
macro avg,0.935,0.985,0.958,39788.0
weighted avg,0.985,0.984,0.984,39788.0
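
The F1 value for the toxic class follows from the precision and recall in the same row, since F1 is their harmonic mean. A quick check using the rounded values from the report:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall of class 1 taken from the report above
f1_toxic = f1_from(0.871, 0.986)
print(round(f1_toxic, 3))
```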


## Conclusions

In this project the data was analyzed and prepared for the LR and LGBM models.

The LGBM model reached a fairly high cross-validation score on the training set: `f1 ~ 0.77`.
Embeddings were then computed with a BERT model and passed to LR. The drawback was the long time needed to compute the embeddings, but the results were good: on the test set the toxic class reached `f1 > 0.92`.