
# Wikishop project with BERT

The Wikishop online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which the edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.
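
As a reminder of what the target metric measures, here is a minimal sketch of F1 for the positive (toxic) class computed from two label lists. The function name `f1_from_labels` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (illustrative only, not taken from the dataset)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
f1_example = f1_from_labels(y_true, y_pred)
print(f1_example)
```

Because F1 is the harmonic mean of precision and recall, a model cannot reach 0.75 by favoring only one of them on an imbalanced dataset.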

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
!pip install lightgbm



In [2]:
import re
import numpy as np
import pandas as pd
import torch
import transformers
import nltk
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
from tqdm import notebook

In [3]:
nltk.download('stopwords')
stopwords = set(stopwords.words('english')) 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Artem\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
data = pd.read_csv('datasets/toxic_comments.csv')

In [5]:
data.sample(50)

Unnamed: 0,text,toxic
141152,Your vandalism\nPlease stop your stupidity. No...,1
67937,"""\n\nIs it necessary to have this many picture...",0
53259,"look dude, they came, they saw, they kicked cr...",1
95824,Fullerton Securities \n\n Company Overview \n\...,0
116112,"""\n\n Supposed influence on historical events ...",0
133635,I removed his bullshits. 114.179.18.37,1
73608,REDIRECT Talk:German Americans/Archive 2,0
155669,As spoken in.... (don't mention the States!) \...,0
79960,"""\n\nI will think about what you said. I remo...",0
118302,Great page\nLove that phrase. Hope this artic...,0


In [6]:
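# Replace line breaks with spaces so that words on adjacent lines are not glued together
data['text'] = data['text'].apply(lambda x: x.replace("\n", " "))
# Keep Latin letters only; every other character becomes a space
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))
data['text'] = data['text'].str.lower()
# Drop English stop words
data['text'] = data['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))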
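
To make the effect of the cleaning steps concrete, the following sketch applies the same idea to a single string. `TOY_STOPWORDS` is a hypothetical mini stop-word list used only for illustration; the notebook uses NLTK's full English list.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "is", "a", "to", "your"}


def clean_text(text, stop_words=TOY_STOPWORDS):
    """Keep Latin letters only, lowercase the text, and drop stop words."""
    text = re.sub(r"[^a-zA-Z]", " ", text)  # non-letters (including "\n") become spaces
    words = text.lower().split()
    return " ".join(w for w in words if w not in stop_words)


cleaned = clean_text("Your vandalism\nPlease stop, this is the 2nd warning!")
print(cleaned)
```

Note that digits are removed as well, so a token such as `2nd` is reduced to `nd`.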

Check for duplicates and remove them.

In [7]:
data.duplicated().sum()

1758

In [8]:
data = data.drop_duplicates(subset=['text']).reset_index(drop=True)

Lemmatize the texts.

In [9]:
# WordNetLemmatizer needs the WordNet corpus: run nltk.download('wordnet') once if it is missing
lemma = nltk.stem.WordNetLemmatizer()
data['lem_text'] = data['text'].apply(lambda x: " ".join([lemma.lemmatize(i) for i in x.split()]))

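Split the data into a training set and a hold-out set (the hold-out set is later divided into validation and test sets).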

In [10]:
train,test = train_test_split(data,test_size=0.4,random_state=12345)

Check the class balance in the training set.

In [11]:
print(train['toxic'].value_counts())
print('the negative class occurs {} times more often'.format(round(train['toxic'].value_counts()[0]/train['toxic'].value_counts()[1],2)))

0    85028
1     9643
Name: toxic, dtype: int64
the negative class occurs 8.82 times more often


In [12]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled
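
The upsampling idea can be checked on a tiny example. The sketch below is a pure-Python analogue that works on plain lists instead of pandas objects; `upsample_rows` and the toy data are assumptions made for illustration.

```python
import random


def upsample_rows(rows, labels, repeat, seed=12345):
    """Repeat every positive (label 1) row `repeat` times, then shuffle."""
    negatives = [(r, l) for r, l in zip(rows, labels) if l == 0]
    positives = [(r, l) for r, l in zip(rows, labels) if l == 1]
    combined = negatives + positives * repeat
    # Shuffle rows and labels together so that the pairs stay aligned
    random.Random(seed).shuffle(combined)
    new_rows = [r for r, _ in combined]
    new_labels = [l for _, l in combined]
    return new_rows, new_labels


rows = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
up_rows, up_labels = upsample_rows(rows, labels, repeat=4)
print(up_labels.count(0), up_labels.count(1))
```

Upsampling should be applied to the training part only; repeating rows before the split would place copies of the same comment in both the training and evaluation sets.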


In [13]:
upsampled = pd.DataFrame()
upsampled['lem_text'],upsampled['toxic'] = upsample(train['lem_text'],train['toxic'],9)

In [14]:
upsampled['toxic'].value_counts()

1    86787
0    85028
Name: toxic, dtype: int64

In [15]:
# Pass the stop words as a list: recent scikit-learn versions do not accept a set here
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))

In [16]:
features_train = count_tf_idf.fit_transform(upsampled['lem_text'])
features_test = count_tf_idf.transform(test['lem_text'])
target_train = upsampled['toxic']
target_test = test['toxic']
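
For intuition about what the vectorizer produces, here is a simplified pure-Python sketch of TF-IDF with a smoothed IDF and L2-normalized rows, which, as far as I know, corresponds to the default settings of `TfidfVectorizer`. Tokenization is reduced to whitespace splitting, and the function name and toy documents are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each word occurs
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors


docs = ["stop vandalism page", "great page", "great article page"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(v, 3) for v in vectors[0]])
```

A word that occurs in every document (`page`) receives a lower weight than a word that occurs in only one (`stop`), which is why rare, distinctive vocabulary dominates the features.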


In [17]:
features_valid,features_test,target_valid,target_test = train_test_split(features_test,target_test,train_size=0.5,random_state=12345)
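
The two-step split used above (40% held out, then halved) yields roughly 60/20/20 proportions. The sketch below reproduces that arithmetic on a plain list; `three_way_split` is a hypothetical helper, not part of scikit-learn.

```python
import random


def three_way_split(items, holdout_share=0.4, seed=12345):
    """Split items into train / validation / test parts (60/20/20 by default)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_share))
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    # The hold-out part is divided equally into validation and test
    half = len(holdout) // 2
    return train, holdout[:half], holdout[half:]


train_part, valid_part, test_part = three_way_split(list(range(100)))
print(len(train_part), len(valid_part), len(test_part))
```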

## Training

In [18]:
model = RandomForestClassifier(n_jobs=8,random_state=12345)
model.fit(features_train,target_train)
predictions = model.predict(features_valid)
print(f1_score(target_valid,predictions))

0.6364309514994445


In [19]:
model = LogisticRegression(max_iter=500,n_jobs=-1,random_state=12345)
params = {'solver':['sag','saga'],
          'penalty':['l2']}
grid = GridSearchCV(model,params,scoring='f1',cv=2)
grid.fit(features_train,target_train)
predictions = grid.predict(features_valid)
log_reg_estimator = grid.best_estimator_
print('F1 - ',f1_score(target_valid,predictions))
print(grid.best_params_)

F1 -  0.7354497354497354
{'penalty': 'l2', 'solver': 'sag'}


In [20]:
# Class-1 probabilities do not depend on the threshold, so compute them once
probabilities_one_valid = log_reg_estimator.predict_proba(features_valid)[:, 1]
best_score = 0
for i in np.arange(0.10,0.7,0.01):
    predictions = probabilities_one_valid > i
    score = f1_score(target_valid,predictions)
    if score > best_score:
        best_score = score
        best_threshold = i
print('F1 - ',best_score)
print('threshold - ',best_threshold.round(2))

F1 -  0.7692804130364633
threshold -  0.68
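
The threshold search can be illustrated on toy numbers. In this sketch the probabilities, labels, and candidate thresholds are made up, and `best_threshold_by_f1` is a hypothetical helper; it shows how raising the threshold trades false positives for false negatives.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold_by_f1(y_true, probs, thresholds):
    """Return the (threshold, F1) pair with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        preds = [1 if p > t else 0 for p in probs]
        score = f1_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data (illustrative only)
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.2, 0.55, 0.6, 0.65, 0.7, 0.9]
best_t, best_f1 = best_threshold_by_f1(y_true, probs, [0.3, 0.5, 0.68])
print(best_t, best_f1)
```

The threshold should be selected on the validation set only and then applied unchanged to the test set, as is done in the testing section of this notebook.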


In [21]:
model = LGBMClassifier(n_estimators=500,random_state=12345,n_jobs=8)
model.fit(features_train,target_train)
predictions = model.predict(features_valid)
print('F1 -',f1_score(target_valid,predictions))

F1 - 0.7550960978450786


### BERT

Tokenize the texts for `BERT`.

In [22]:
data = data.sample(16000).reset_index(drop=True) # limit the sample to 16,000 rows

In [23]:
# Load the tokenizer from a DistilBERT checkpoint so that it matches the tokenizer class
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
tokenized = data['text'].apply(
  lambda x: tokenizer.encode(x,truncation=True,max_length=512,add_special_tokens=True))


In [24]:
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
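
The padding and masking step can be sketched without NumPy. The token ids below are hypothetical, with 101 and 102 standing in for the special start and end tokens, and `pad_and_mask` is an illustrative helper.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Hypothetical token ids; 101 and 102 stand in for the special tokens
token_ids = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
padded_ids, mask = pad_and_mask(token_ids)
print(padded_ids)
print(mask)
```

Building the mask from sequence lengths rather than from `padded != 0` keeps it correct even if the padding id is not zero.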

In [25]:
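# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Note: building the model from a bare config gives randomly initialized weights;
# DistilBertModel.from_pretrained('distilbert-base-uncased') would load pretrained ones
config = transformers.DistilBertConfig()
model = transformers.DistilBertModel(config)
configuration = model.config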

In [26]:
batch_size = 16

embeddings = []

# Move the model to the device and switch to inference mode once, before the loop
model.to(device)
model.eval()

for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    # convert the batch of token ids to a tensor
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
    # convert the attention mask
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)
    # a single forward pass without gradient tracking
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)
    # take the first-token ([CLS]) vector and convert it to numpy.array
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())
features_bert = np.concatenate(embeddings)
features_bert = pd.DataFrame(features_bert)

  0%|          | 0/1000 [00:00<?, ?it/s]
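
The loop above iterates over `padded.shape[0] // batch_size` batches, which covers every row only when the number of rows is a multiple of the batch size (16000 / 16 = 1000 here). For other sizes, a generator along these lines would also yield the shorter final batch; `iter_batches` is a hypothetical helper shown on a plain list.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, including a shorter final batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches = list(iter_batches(list(range(10)), batch_size=4))
print([len(b) for b in batches])
```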

In [27]:
features_train_bert,features_valid_bert,target_train_bert,target_valid_bert = train_test_split(features_bert,data['toxic'],test_size=0.2)
features_valid_bert,features_test_bert,target_valid_bert,target_test_bert = train_test_split(features_valid_bert,target_valid_bert,test_size=0.5)

In [28]:
print(target_train_bert.value_counts())
print('the negative class occurs {} times more often'.format(round(target_train_bert.value_counts()[0]/target_train_bert.value_counts()[1],2)))

0    11496
1     1304
Name: toxic, dtype: int64
the negative class occurs 8.82 times more often


In [29]:
up_features_train,up_target_train = upsample(features_train_bert,target_train_bert,9)

In [30]:
model = LogisticRegression(max_iter=500,random_state=12345)
params = {'solver':['sag'],
          'penalty':['l2'],
          'C':np.arange(0.6,1,0.2)}
grid = GridSearchCV(model,params,scoring='f1',cv=2,n_jobs=-1)
grid.fit(up_features_train,up_target_train)
predictions = grid.predict(features_valid_bert)
print('F1 - ',f1_score(target_valid_bert,predictions))
print(grid.best_params_)

F1 -  0.4475524475524475
{'C': 0.8, 'penalty': 'l2', 'solver': 'sag'}




In [31]:
# Class-1 probabilities do not depend on the threshold, so compute them once
probabilities_one_valid = grid.predict_proba(features_valid_bert)[:, 1]
best_score = 0
for i in np.arange(0.10,0.7,0.01):
    predictions = probabilities_one_valid > i
    score = f1_score(target_valid_bert,predictions)
    if score > best_score:
        best_score = score
        best_threshold = i
print('F1 - ',best_score)
print('threshold - ',best_threshold.round(2))

F1 -  0.5437352245862884
threshold -  0.69


## Testing

Evaluate the best model (logistic regression) on the test set.

In [32]:
model = log_reg_estimator
probabilities = model.predict_proba(features_test)
probabilities_one_test = probabilities[:, 1]
# 0.68 is the threshold selected on the validation set
predictions = probabilities_one_test > 0.68
print('F1 - ',f1_score(target_test,predictions))

F1 -  0.7618276085547635


## Conclusions

In this project we:
1. Preprocessed the data (removed extraneous characters, lemmatized the texts, computed TF-IDF, balanced the target feature, and split the data into three sets).
2. Trained machine learning models and evaluated them on the validation set.
3. Evaluated the best model on the test set (F1 = 0.76).