<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span><ul class="toc-item"><li><span><a href="#1.LogisticRegression" data-toc-modified-id="1.LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>1.LogisticRegression</a></span></li><li><span><a href="#2.-SGDClassifier" data-toc-modified-id="2.-SGDClassifier-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>2. SGDClassifier</a></span></li><li><span><a href="#3.-Catboost" data-toc-modified-id="3.-Catboost-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>3. Catboost</a></span></li><li><span><a href="#4.-LGBMClassifier" data-toc-modified-id="4.-LGBMClassifier-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>4. LGBMClassifier</a></span></li><li><span><a href="#5.-BERT" data-toc-modified-id="5.-BERT-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>5. BERT</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Выводы</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

# Проект для «Викишоп» c BERT

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучите модель классифицировать комментарии на позитивные и негативные. В вашем распоряжении набор данных с разметкой о токсичности правок.

Постройте модель со значением метрики качества *F1* не меньше 0.75. 

**Инструкция по выполнению проекта**

1. Загрузите и подготовьте данные.
2. Обучите разные модели. 
3. Сделайте выводы.

Для выполнения проекта применять *BERT* необязательно, но вы можете попробовать.

**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## Подготовка

In [5]:
import pandas as pd
from pymystem3 import Mystem
import nltk
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier, LGBMRegressor
import lightgbm as lgbm
import catboost as cb
from sklearn.linear_model import SGDClassifier
import torch
import torch
import transformers 
import numpy as np

In [6]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [7]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/ramen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [8]:
lemmatizer = WordNetLemmatizer()

In [None]:
all_lemm_sentences = []
for i in range(df.shape[0]):
    sentence = df.loc[i]['text']
    word_list = re.sub(r'[^a-zA-Z ]', ' ', df.loc[i]['text']).split()
    lemmatized_output =  ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    all_lemm_sentences.append(lemmatized_output)

In [None]:
df['lemm'] = pd.Series(all_lemm_sentences)

In [8]:
train,oth = train_test_split(df, test_size=0.4)

In [9]:
valid, test = train_test_split(oth, test_size=0.5)

In [10]:
train_val = pd.concat([train, valid])

In [11]:
corpus = train['lemm'].values

In [12]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/ramen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
text_transformer = TfidfVectorizer(stop_words=stop_words, ngram_range=(1, 2), lowercase=True)

In [14]:
X_train_text = text_transformer.fit_transform(corpus)

In [15]:
X_test_text = text_transformer.transform(test['lemm'].values)

## Обучение

Итак данные готовы, теперь применим 4 модели(LogRegressing, Catboost, SGDClassifier, LGBMClassifier), и посмотрим кто же лучше всего может описать и предсказать по нашим данным<br>


### 1.LogisticRegression

In [18]:
logit = LogisticRegression(C=5e1, solver='lbfgs', multi_class='multinomial', random_state=17, n_jobs=-1)    

In [19]:
logit.fit(X_train_text, train['toxic'])

LogisticRegression(C=50.0, multi_class='multinomial', n_jobs=-1,
                   random_state=17)

In [20]:
logit.predict(X_test_text)

array([0, 0, 0, ..., 0, 0, 0])

In [21]:
accuracy_score(test['toxic'], logit.predict(X_test_text))

0.9606767977440075

In [22]:
f1_score(test['toxic'], logit.predict(X_test_text))

0.7850659359479363

### 2. SGDClassifier

In [23]:
from sklearn.linear_model import SGDClassifier

In [24]:
clf = SGDClassifier(loss = "hinge", penalty = "l1")

In [25]:
clf.fit(X_train_text, train['toxic'])

SGDClassifier(penalty='l1')

In [26]:
f1_score(test['toxic'], clf.predict(X_test_text))

0.6533066132264529

### 3. Catboost

In [40]:
model = cb.CatBoostClassifier(iterations=100, depth=10, loss_function='Logloss',random_seed=12345)

In [42]:
model.fit(X_train_text, train['toxic'])

Custom logger is already specified. Specify more than one logger at same time is not thread safe.

Learning rate set to 0.5
0:	learn: 0.3449195	total: 19.1s	remaining: 31m 31s
1:	learn: 0.2516541	total: 38.2s	remaining: 31m 9s
2:	learn: 0.2218982	total: 57.2s	remaining: 30m 49s
3:	learn: 0.2084444	total: 1m 16s	remaining: 30m 33s
4:	learn: 0.1983441	total: 1m 35s	remaining: 30m 15s
5:	learn: 0.1912460	total: 1m 54s	remaining: 29m 55s
6:	learn: 0.1857448	total: 2m 13s	remaining: 29m 37s
7:	learn: 0.1805690	total: 2m 32s	remaining: 29m 19s
8:	learn: 0.1762914	total: 2m 52s	remaining: 29m
9:	learn: 0.1716879	total: 3m 11s	remaining: 28m 41s
10:	learn: 0.1687829	total: 3m 30s	remaining: 28m 23s
11:	learn: 0.1667784	total: 3m 49s	remaining: 28m 5s
12:	learn: 0.1637667	total: 4m 9s	remaining: 27m 47s
13:	learn: 0.1612539	total: 4m 28s	remaining: 27m 28s
14:	learn: 0.1591714	total: 4m 47s	remaining: 27m 9s
15:	learn: 0.1578823	total: 5m 6s	remaining: 26m 50s
16:	learn: 0.1556772	total: 5m 26s	remaining: 26m 32s
17:	learn: 0.1536626	total: 5m 45s	remaining: 26m 13s
18:	learn: 0.1517668	tota

<catboost.core.CatBoostClassifier at 0x7fc39b7a7f70>

In [45]:
f1_score(test['toxic'], model.predict(X_test_text))

0.7532690984170681

### 4. LGBMClassifier

In [27]:
clf_LGBM = lgbm.LGBMClassifier(verbose=-1, learning_rate=0.5, max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000)

In [28]:
clf_LGBM.fit(X_train_text, train['toxic'])

LGBMClassifier(learning_rate=0.5, max_bin=2000, max_depth=20, n_estimators=120,
               num_leaves=50, verbose=-1)

In [29]:
f1_score(test['toxic'], clf_LGBM.predict(X_test_text))

0.7664162038549494

### 5. BERT

И так приступим, для начала я удаляю все переменные использованные ранее чтобы наше ядро не крашнулось, и памяти хватило. Затем токенизирую все текста(ну почти все).<br>


In [1]:
from IPython import get_ipython
get_ipython().magic('reset -sf') 

In [65]:
import pandas as pd
import torch
import transformers 
import numpy as np
from tqdm import notebook
from tokenizers import BertWordPieceTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [3]:
tokenizer = BertWordPieceTokenizer(
  clean_text=False,
  handle_chinese_chars=False,
  strip_accents=False,
  lowercase=True,
)

In [4]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [5]:
#df = df.iloc[:10000]

In [6]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

In [7]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased', model_max_len=800)

In [23]:
tokenized = df['text'].apply(
  lambda x: tokenizer.encode(x, add_special_tokens=True))

Так как длина в нашей макс модели 512 то нам нужно ограничить наш диапазон на 512


In [24]:
index = []
for i in range(len(tokenized)):
    if len(tokenized.iloc[i]) > 512:
        index.append(i)

In [25]:
tokenized.drop(index=index, inplace=True)

In [26]:
tokenized

0         [101, 16409, 1643, 20592, 2116, 2009, 1103, 14...
1         [101, 141, 112, 170, 2246, 2246, 106, 1124, 26...
2         [101, 4403, 1299, 117, 146, 112, 182, 1541, 11...
3         [101, 107, 3046, 146, 1169, 112, 189, 1294, 12...
4         [101, 1192, 117, 6442, 117, 1132, 1139, 6485, ...
                                ...                        
159566    [101, 107, 131, 131, 131, 131, 131, 1262, 1111...
159567    [101, 1192, 1431, 1129, 16155, 1104, 3739, 133...
159568    [101, 156, 18965, 6198, 12189, 1306, 117, 1175...
159569    [101, 1262, 1122, 2736, 1176, 1122, 1108, 2140...
159570    [101, 107, 1262, 119, 119, 119, 146, 1541, 127...
Name: text, Length: 155749, dtype: object

В целом не так уж и много данных ушло<br>

In [11]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)

In [29]:
del tokenized

In [13]:
model = transformers.BertModel.from_pretrained('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Embedding**

In [15]:
%%time
from tqdm import notebook
batch_size = 100 # для примера возьмем такой батч, где будет всего две строки датасета
embeddings = [] 
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device) # закидываем тензор на GPU
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)
        
        with torch.no_grad():
            model.to(device)
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy()) # перевод обратно на проц, чтобы в нумпай кинуть
        del batch
        del attention_mask_batch
        del batch_embeddings
        
features = np.concatenate(embeddings)

  0%|          | 0/1557 [00:00<?, ?it/s]

CPU times: user 1h 42min 55s, sys: 1.41 s, total: 1h 42min 56s
Wall time: 1h 42min 57s


In [19]:
features.shape

(155700, 768)

In [28]:
df = df.drop(index=index).iloc[:155700]

In [38]:
df

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159517,"""\nYou've got no call to start attacking me ad...",0
159518,"Agreed, besides debating it now won't repair d...",0
159519,"REDIRECT Talk:Harry Yates (footballer, born 1925)",0
159520,HEY HOW ABOUT SHE PUTS OUT A NEW SINGLE AM I R...,0


In [47]:
features[:124560]

array([[ 0.5411031 ,  0.01921145,  0.11053315, ..., -0.3546025 ,
         0.3625402 ,  0.0970161 ],
       [ 0.3427579 , -0.10840041,  0.31300983, ...,  0.2157756 ,
        -0.01797534,  0.13064337],
       [ 0.56802046,  0.23963812, -0.15950912, ..., -0.28971767,
         0.16949017, -0.04347336],
       ...,
       [ 0.22448054,  0.03906267, -0.09136067, ..., -0.5576799 ,
         0.1713735 ,  0.5461098 ],
       [ 0.58344555, -0.10617191,  0.1454983 , ..., -0.18654124,
         0.44484606,  0.42675206],
       [ 0.38280675,  0.10741405, -0.02836372, ..., -0.18946797,
        -0.04672127, -0.01444094]], dtype=float32)

In [48]:
df.iloc[:124560]

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
127622,"""\n\nHi - I started editing in January of this...",0
127623,Thanks for your note \n\nI have been dealing w...,0
127624,"""\n\n This person is noteable \nNotability of ...",0
127625,Spam \nQuit inserting linkspam into the Trinid...,0


После embedding'a настало время научить модель на одной из моделей, например LogRegression<br>


In [69]:
log_model = LogisticRegression(max_iter=124560, C=5e1, solver='lbfgs', multi_class='multinomial', random_state=17, n_jobs=-1)
log_model.fit(features[:124560], df.iloc[:124560]['toxic'])

LogisticRegression(C=50.0, max_iter=124560, multi_class='multinomial',
                   n_jobs=-1, random_state=17)

In [70]:
features[124560:].shape

(31140, 768)

In [71]:
features.shape

(155700, 768)

In [72]:
log_model.predict(features[124560:])

array([0, 0, 0, ..., 0, 1, 0])

In [73]:
f1_score(df[124560:]['toxic'], log_model.predict(features[124560:]))

0.7124113475177306

## Выводы

In [74]:
pd.DataFrame([0.7850659359479363,0.6533066132264529,0.7532690984170681,0.7664162038549494, 0.712411347517730], columns=['f1_score'], 
             index=['LogisticRegression','SGDClassifier','CatBoostClassifier','LGBMClassifier', 'BERT'])

Unnamed: 0,f1_score
LogisticRegression,0.785066
SGDClassifier,0.653307
CatBoostClassifier,0.753269
LGBMClassifier,0.766416
BERT,0.712411


Полученные результаты свидетельствут о том что в целом все модели предсказывают достаточно неплохо, но в рамках данного проекта нужно преодолеть значение f1 > 0.75 с чем справились все, за исключением стохастического градиентого спуска. А также BERT не достиг значения 0,75