# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучите модель классифицировать комментарии на позитивные и негативные. В вашем распоряжении набор данных с разметкой о токсичности правок.

Постройте модель со значением метрики качества *F1* не меньше 0.75. 

**Инструкция по выполнению проекта**

1. Загрузите и подготовьте данные.
2. Обучите разные модели. 
3. Сделайте выводы.

Для выполнения проекта применять *BERT* необязательно, но вы можете попробовать.

**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

In [84]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from pymystem3 import Mystem
import re
from tqdm import tqdm, notebook
from time import sleep
import pymorphy2
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score
from hyperopt import hp, fmin, tpe, rand, STATUS_OK, Trials
from hyperopt.pyll.base import scope
from sklearn.neighbors import KNeighborsClassifier
import torch
import spacy
from spacy.tokenizer import Tokenizer
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier, Pool
from sklearn.pipeline import Pipeline

In [2]:
np.random.seed(0)

In [3]:
tqdm.pandas()

In [4]:
torch.device('cuda')

device(type='cuda')

## Подготовка

In [5]:
try:
    data = pd.read_csv('datasets/toxic_comments.csv', index_col=0)
except FileNotFoundError: 
    data = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)

In [6]:
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [7]:
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)
data.reset_index(inplace=True, drop=True)

In [8]:
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [9]:
# data_sample = data.sample(n=data.shape[0])
# data_sample.reset_index(inplace=True, drop=True)
# data_sample

In [10]:
def prepare_string(text):
    text = re.sub(r'[^A-Za-z]', ' ', text)
    text = text.lower().replace('/n', ' ').split()
    
    for word in text:
        if len(word) < 3:
            text.remove(word)
    return ' '.join(text)

In [11]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatize_spacy(text):
    return " ".join([token.lemma_ for token in nlp(text)])

### Токенизация и очистка

In [12]:
X_corpus = data['text'].progress_apply(prepare_string)
X_corpus

100%|██████████| 159292/159292 [00:06<00:00, 23069.08it/s]


0         explanation why the edits made under username ...
1         aww matches this background colour m seemingly...
2         hey man m really not trying edit war s just th...
3         more can make any real suggestions improvement...
4         you sir are hero any chance you remember what ...
                                ...                        
159287    and for the second time asking when your view ...
159288    you should ashamed yourself that a horrible th...
159289    spitzer umm theres actual article for prostitu...
159290    and looks like was actually you who put the sp...
159291    and really don think you understand came here ...
Name: text, Length: 159292, dtype: object

### Лемматизация

In [13]:
X_corpus_lemma = X_corpus.progress_apply(lemmatize_spacy)
X_corpus_lemma

100%|██████████| 159292/159292 [11:17<00:00, 235.13it/s]


0         explanation why the edit make under username h...
1         aww match this background colour m seemingly s...
2         hey man m really not try edit war s just that ...
3         more can make any real suggestion improvement ...
4         you sir be hero any chance you remember what p...
                                ...                        
159287    and for the second time ask when your view com...
159288    you should ashamed yourself that a horrible th...
159289    spitzer umm there s actual article for prostit...
159290    and look like be actually you who put the spee...
159291    and really don think you understand come here ...
Name: text, Length: 159292, dtype: object

In [14]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mig29fulcrum\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Делим выборки

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_corpus_lemma, data['toxic'], test_size=0.1, stratify=data['toxic'])

### Вычисление TF-IDF

In [16]:
lr_pipe = Pipeline(
    [
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("clf", LogisticRegression(max_iter=10000, class_weight='balanced')),
    ]
)

In [17]:
lgbm_pipe = Pipeline(
    [
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("clf", LGBMClassifier()),
    ]
)

### Кросс-валидация моделей

In [18]:
cross_lr = cross_val_score(estimator=lr_pipe, 
                           X=X_train, 
                           y=y_train, 
                           cv=StratifiedKFold(5, shuffle=True),
                           error_score='raise',
                           scoring='f1')
cross_lr.mean(), cross_lr.std()

(0.7524863651573724, 0.0014079186484297864)

In [19]:
cross_lgbm = cross_val_score(lgbm_pipe, 
                                  X=X_train, 
                                  y=y_train, 
                                  cv=StratifiedKFold(5, shuffle=True),  
                                  scoring='f1')
cross_lgbm.mean(), cross_lgbm.std()

(0.7518180521453741, 0.004292083280278978)

* Выбираем логистическую регрессию (метрика лучше и разница по фолдам меньше)

## Обучение

### Подбор гиперпараметров

In [20]:
class OptHyperparams:
    def __init__ (self, 
                  estimator, 
                  space=dict(),
                  other_params=dict(), 
                  cv=5, 
                  max_evals=20, 
                  scoring='neg_log_loss'):
        
        self.estimator = estimator
        self.space=space
        self.other_params = other_params
        self.cv=cv
        self.max_evals = max_evals
        self.scoring = scoring
        
    def __cross_val__(self, params):
        cross_val = cross_val_score(self.estimator(**params, **self.other_params), 
                        X=self.X, 
                        y=self.y,           
                        scoring=self.scoring,
                        cv=self.cv,
                        error_score='raise'
                       )
        result = {'loss': -cross_val.mean(), 'status': STATUS_OK}

        print(f'Средний {self.scoring}: {cross_val.mean():.4f}')
        print(f'Стандартное отклонение по фолдам: {cross_val.std():.4f}')
        print(f'Tекущие параметры: {params}')
        return result


    def __get_best_params__(self):

        best_params = fmin(self.__cross_val__, 
                           space = self.space, 
                           algo=tpe.suggest, 
                           trials=Trials(), 
                           max_evals=self.max_evals)
        print ('Лучшие гиперпараметры:', best_params)
        return best_params
    
    def fit(self, X, y):
        self.X=X
        self.y=y
        self.model = self.estimator(**self.__get_best_params__(), **self.other_params)
        self.model.fit(X, y)
        
    def predict(self, X_pred):
        return self.model.predict(X_pred)
    
#     def predict_proba(self, X_pred_prob):
#         return self.model.predict_proba(X_pred_prob)

In [21]:
space_lr= {
    'C' : hp.lognormal('C', 0, 1), 
    # 'penalty' : hp.choice('penalty', ['l1','l2']),
    # 'class_weight' : hp.choice('class_weight', ['balanced', None,])
}

other_params_lr = {
    'max_iter' : 10000,
    'solver': 'liblinear',
    'class_weight' : 'balanced'
}
    

In [22]:
lr_grid_pipe = Pipeline(
    [
        ("vect", TfidfVectorizer(stop_words=stopwords)),
        ("clf",  OptHyperparams(LogisticRegression,
                                space=space_lr, 
                                other_params=other_params_lr,
                                max_evals=20, 
                                scoring='f1', 
                                cv = StratifiedKFold(n_splits=5, shuffle=True),
                               )
        )
    ]
)

In [23]:
lr_grid_pipe.fit(X_train, y_train)

Средний f1: 0.7511                                    
Стандартное отклонение по фолдам: 0.0016              
Tекущие параметры: {'C': 0.7762759041144033}          
Средний f1: 0.7457                                                               
Стандартное отклонение по фолдам: 0.0063                                         
Tекущие параметры: {'C': 0.6107760237790363}                                     
Средний f1: 0.7355                                                               
Стандартное отклонение по фолдам: 0.0040                                         
Tекущие параметры: {'C': 0.28358727735875916}                                    
Средний f1: 0.7606                                                               
Стандартное отклонение по фолдам: 0.0032                                         
Tекущие параметры: {'C': 1.694356719428899}                                      
Средний f1: 0.7439                                                               
Стандартное отк

Pipeline(steps=[('vect',
                 TfidfVectorizer(stop_words={'a', 'about', 'above', 'after',
                                             'again', 'against', 'ain', 'all',
                                             'am', 'an', 'and', 'any', 'are',
                                             'aren', "aren't", 'as', 'at', 'be',
                                             'because', 'been', 'before',
                                             'being', 'below', 'between',
                                             'both', 'but', 'by', 'can',
                                             'couldn', "couldn't", ...})),
                ('clf',
                 <__main__.OptHyperparams object at 0x000001C572CB89A0>)])

In [24]:
f1_score(y_test, lr_grid_pipe.predict(X_test))

0.7575338678462814

* Подбор параметров немного улучшил метрику
* Целевая метрика достигнута

# Проект для «Викишоп» с BERT

In [25]:
import torch
import transformers
from tqdm import notebook
import tensorflow as tf
from tensorflow import keras
from torch.utils.data import Dataset, DataLoader

In [26]:
df = pd.read_csv('datasets/toxic_comments.csv', index_col=0)

In [27]:
n_samples = 1000

In [28]:
data = df.sample(n_samples)

In [29]:
X = data['text']
y = data['toxic']

In [30]:
tokenizer = transformers.BertTokenizer(
    vocab_file='datasets/ds_bert/vocab.txt')

tokenized = data['text'].progress_apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512,))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)

100%|██████████| 1000/1000 [00:01<00:00, 630.50it/s]


In [31]:
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

In [32]:
config = transformers.BertConfig.from_json_file(
    'datasets/ds_bert/bert_config.json')
model = transformers.BertModel.from_pretrained('unitary/toxic-bert', 
                                               config=config,  
                                               ignore_mismatched_sizes=True)

Some weights of the model checkpoint at unitary/toxic-bert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at unitary/toxic-bert and are newly initialized because the shapes did not match:
- bert.embeddings.word_embeddings.weight: found shape torch.Size([30522, 768]) in the checkpoint and torch.Size([28996, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

In [33]:
batch_size = 200
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        embeddings.append(batch_embeddings[0][:, 0, :].numpy())

  0%|          | 0/5 [00:00<?, ?it/s]

In [34]:
features = np.concatenate(embeddings)
features.shape

(1000, 768)

In [124]:
X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.5, shuffle=True, stratify=y)

In [125]:
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

In [136]:
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
print(class_weights)

clf = CatBoostClassifier(devices='GPU', 
                         silent=True, 
                         eval_metric='F1',
                         class_weights=class_weights, 
                         n_estimators=2000, 
                         learning_rate=0.001, 
                         max_depth=2
                         
                        )

{0: 0.5580357142857143, 1: 4.8076923076923075}


In [137]:
clf.fit(train_pool, 
        plot=True, 
        eval_set=test_pool, 
        #use_best_model=True,
        #early_stopping_rounds=100
       )

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x1c50b04be50>

In [120]:
clf = CatBoostClassifier(devices='GPU', 
                         silent=True, 
                         eval_metric='TotalF1',
                         #class_weights=class_weights, 
                         n_estimators=200, 
                         #learning_rate=0.1, 
                         #max_depth=4
                        )

In [121]:
clf.fit(train_pool,
        plot=True, 
        eval_set=test_pool,
       #use_best_model=True
       )

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x1c50d456d60>

## Выводы

* Целевая метрика достигнута.