**Project for Wikishop**

The online store Wikishop («Викишоп») is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an F1 score of at least 0.75.
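
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from raw counts; the function name `f1_from_counts` and the sample counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 for the positive class from TP, FP and FN counts."""
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 75 toxic comments caught, 25 false alarms, 25 missed.
print(round(f1_from_counts(tp=75, fp=25, fn=25), 2))  # 0.75
```

Because both precision and recall enter the formula, a model that flags almost nothing or almost everything scores poorly on F1 even when its accuracy looks high.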

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

**Data description**

The data is stored in the file toxic_comments.csv. The text column contains the comment text, and toxic is the target feature.

# Preparation

In [1]:
!pip install fast_ml torch transformers seaborn stop_words catboost



In [13]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from tqdm.auto import tqdm
import torch
import transformers
import re 
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from fast_ml.model_development import train_valid_test_split
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

<div class="alert alert-block alert-info">
<b>Tip: </b> Ideally, all imports should be collected in the first cell of the notebook. If whoever runs your notebook is missing some libraries, they will see it right away rather than midway through.
</div>

In [3]:
data = pd.read_csv('/Users/kate/Desktop/DS/yandex/Projects/toxic_comments.csv')
data.head(5)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
data['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

<div class="alert alert-block alert-success">
<b>Success:</b> The data is loaded correctly and an initial inspection has been done. Good that the class balance was examined.
</div>
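
The class counts above show a strong imbalance, which is why `class_weight='balanced'` is used for some models later. As a rough sketch, the weights that this heuristic produces can be reproduced by hand from the counts; the variable names here are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 143346, 1: 16225}
n_samples = sum(counts.values())
n_classes = len(counts)

# Share of toxic comments in the dataset.
toxic_share = counts[1] / n_samples

# The 'balanced' heuristic: n_samples / (n_classes * count_of_class).
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
print({label: round(weight, 2) for label, weight in balanced_weights.items()})
```

The rare class gets a weight roughly nine times larger than the frequent one, so errors on toxic comments cost more during training.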

In [6]:
def clear_text(text):
    new_text = []
    for i in text:
        new_text.append(" ".join(re.sub(r'[^a-zA-Z]', ' ', i).split()))
    return new_text

data['new_text'] = clear_text(data['text'])
data

Unnamed: 0,text,toxic,new_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...


<div class="alert alert-block alert-success">
<b>Success:</b> The cleaning was done correctly.
</div>

In [7]:
nltk.download('stopwords')
stop_words = set(nltk_stopwords.words('english'))

def stopwords(text):
    new_text = []
    for word_list in text:
        # split each cleaned comment into words and drop the stop words
        line = [x for x in word_list.split() if x not in stop_words]
        new_text.append(line)
    return new_text

data['new_text'] = stopwords(data['new_text'])
data

[nltk_data] Downloading package stopwords to /Users/kate/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,toxic,new_text
0,Explanation\nWhy the edits made under my usern...,0,"[Explanation, Why, edits, made, username, Hard..."
1,D'aww! He matches this background colour I'm s...,0,"[D, aww, He, matches, background, colour, I, s..."
2,"Hey man, I'm really not trying to edit war. It...",0,"[Hey, man, I, really, trying, edit, war, It, g..."
3,"""\nMore\nI can't make any real suggestions on ...",0,"[More, I, make, real, suggestions, improvement..."
4,"You, sir, are my hero. Any chance you remember...",0,"[You, sir, hero, Any, chance, remember, page]"
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,"[And, second, time, asking, view, completely, ..."
159567,You should be ashamed of yourself \n\nThat is ...,0,"[You, ashamed, That, horrible, thing, put, tal..."
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,"[Spitzer, Umm, theres, actual, article, prosti..."
159569,And it looks like it was actually you who put ...,0,"[And, looks, like, actually, put, speedy, firs..."
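
In the output above, capitalized words such as "Why", "He" and "It" survive the filter, because NLTK's English stop-word list is lowercase and the comparison is case-sensitive. A minimal sketch of lowercasing before filtering follows; the small `STOP_WORDS` set and the function name are hypothetical stand-ins so that the example runs without downloading NLTK data.

```python
import re

# Hypothetical, tiny stop-word set for illustration; NLTK's list is also lowercase.
STOP_WORDS = {"why", "the", "he", "i", "it", "my", "under"}


def tokenize_and_filter(text: str) -> list:
    """Keep letters only, lowercase, then drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


print(tokenize_and_filter("Explanation\nWhy the edits made under my username"))
# ['explanation', 'edits', 'made', 'username']
```

Lowercasing also matches what `bert-base-uncased` and the default `TfidfVectorizer` do internally, so it should not lose information for the models used here.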


In [8]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    lemm_text = []
    for word_list in text:
        # lemmatize word by word, then join back into a single string
        lemm_text.append(" ".join(lemmatizer.lemmatize(word) for word in word_list))
    return lemm_text

data['new_text'] = lemmatize(data['new_text'])
data

[nltk_data] Downloading package wordnet to /Users/kate/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,toxic,new_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why edits made username Hardcore M...
1,D'aww! He matches this background colour I'm s...,0,D aww He match background colour I seemingly s...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I really trying edit war It guy consta...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I make real suggestion improvement I wond...
4,"You, sir, are my hero. Any chance you remember...",0,You sir hero Any chance remember page
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And second time asking view completely contrad...
159567,You should be ashamed of yourself \n\nThat is ...,0,You ashamed That horrible thing put talk page
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm there actual article prostitution ...
159569,And it looks like it was actually you who put ...,0,And look like actually put speedy first versio...


**CONCLUSION 1**

- the data has been imported
- only words are kept in the text, stop words are removed, and the text is lemmatized

<div class="alert alert-block alert-success">
<b>Success:</b> Great that the lemmatizer was applied to individual words.
</div>

# Training

**The text will be converted to vectors in two ways:**

1. BERT
2. TfidfVectorizer

## BERT

In [19]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

# tokenized = data['new_text'].apply(
#     lambda x: tokenizer.encode(x, add_special_tokens=True, padding = "max_length", 
#                               truncation = True, return_attention_mask = True, return_tensors = "pt"))

# tokenized

# convert the text to token ids
encoded_text = tokenizer(list(data['new_text']), add_special_tokens=True, padding = True, 
                         return_attention_mask = True, return_tensors = "pt")
encoded_text
# for i in tokenized.values:
#     if len(i) > max_len:
#         max_len = len(i)

# padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# attention_mask = np.where(padded != 0, 1, 0)

Token indices sequence length is longer than the specified maximum sequence length for this model (678 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': tensor([[  101,  7526,  2339,  ...,     0,     0,     0],
        [  101,  1040, 22091,  ...,     0,     0,     0],
        [  101,  4931,  2158,  ...,     0,     0,     0],
        ...,
        [  101, 13183,  6290,  ...,     0,     0,     0],
        [  101,  1998,  2298,  ...,     0,     0,     0],
        [  101,  1998,  1045,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

In [20]:
encoded_text.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
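
Of the three keys, `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). The sketch below rebuilds the idea with plain lists; the function name and the short id sequences are made up for illustration, while ids 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` ids of `bert-base-uncased`.

```python
PAD_ID = 0  # [PAD] id in bert-base-uncased, matching the zeros in the output above


def pad_and_mask(sequences, pad_id=PAD_ID):
    """Pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical token ids: 101 = [CLS], 102 = [SEP].
ids, mask = pad_and_mask([[101, 7526, 2339, 102], [101, 1040, 102]])
print(ids)   # [[101, 7526, 2339, 102], [101, 1040, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```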

In [21]:
# keep at most 512 positions per row (BERT's maximum sequence length)
encoded_text['input_ids'] = encoded_text['input_ids'][:, :512]
print(encoded_text['input_ids'].shape)

encoded_text['token_type_ids'] = encoded_text['token_type_ids'][:, :512]
print(encoded_text['token_type_ids'].shape)

encoded_text['attention_mask'] = encoded_text['attention_mask'][:, :512]
print(encoded_text['attention_mask'].shape)

torch.Size([159571, 512])
torch.Size([159571, 512])
torch.Size([159571, 512])
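
One caveat with cutting the tensors after tokenization: for comments longer than 512 tokens the closing `[SEP]` token is sliced off. Passing `truncation=True, max_length=512` to the tokenizer call is the usual alternative and, to my understanding, keeps the special tokens in place. The toy sketch below only illustrates the difference on plain lists; `MAX_LEN = 8` and the id values are assumptions chosen to keep the example small.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids in bert-base-uncased
MAX_LEN = 8                # small hypothetical limit; BERT's real limit is 512


def truncate_keep_sep(token_ids, max_len=MAX_LEN):
    """Truncate an encoded sequence but keep [SEP] as the final token."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[: max_len - 1]) + [SEP_ID]


encoded = [CLS_ID] + list(range(1000, 1010)) + [SEP_ID]  # 12 made-up ids
naive = encoded[:MAX_LEN]             # plain slicing after tokenization
careful = truncate_keep_sep(encoded)  # truncation that preserves [SEP]

print(naive[-1] == SEP_ID)    # False
print(careful[-1] == SEP_ID)  # True
```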


In [22]:
# only the first 500 rows are used
encoded_text_bert = encoded_text.copy()
encoded_text_bert['input_ids'] = encoded_text_bert['input_ids'][:500]
encoded_text_bert['token_type_ids'] = encoded_text_bert['token_type_ids'][:500]
encoded_text_bert['attention_mask'] = encoded_text_bert['attention_mask'][:500]

In [23]:
# compute embeddings
model = transformers.BertModel.from_pretrained('bert-base-uncased')
batch_size = 1
embeddings = []
with torch.no_grad():
    for i in notebook.tqdm(range(0, len(encoded_text_bert['input_ids']), batch_size)):
        bert_embeddings = model(
            input_ids=encoded_text_bert['input_ids'][i: i + batch_size],
            attention_mask=encoded_text_bert['attention_mask'][i: i + batch_size]
        )
        # keep the [CLS] vector of the last hidden layer as the text embedding
        embeddings.append(bert_embeddings[0][:, 0, :].numpy())


# for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
#         batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
#         attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
        
#         with torch.no_grad():
#             batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
#         embeddings.append(batch_embeddings[0][:,0,:].numpy())

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
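
The loop above runs with `batch_size = 1`, that is, one forward pass per comment. One common way to speed this up is a larger batch, so that each pass handles several rows; how much it helps depends on the available memory and hardware. A small sketch of the index arithmetic, where `batch_slices` is a hypothetical helper and 32 is an arbitrary batch size:

```python
def batch_slices(n_rows: int, batch_size: int):
    """Yield (start, stop) index pairs that cover n_rows in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# 500 rows as in the cell above; batch_size=32 is an illustrative choice.
slices = list(batch_slices(500, 32))
print(len(slices))  # 16 forward passes instead of 500
print(slices[-1])   # (480, 500)
```

The existing slicing `[i: i + batch_size]` already handles a shorter last batch, so only the value of `batch_size` would need to change.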






In [43]:
features = np.vstack(embeddings)

In [54]:
pd.DataFrame(features)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.003943,-0.038551,-0.191658,-0.416463,-0.100273,-0.181285,0.586726,0.504233,-0.136987,-0.255546,...,0.123200,-0.140592,0.253080,-0.075918,0.367628,-0.005794,-0.419067,-0.402295,-0.012533,0.486601
1,-0.370478,-0.229204,0.351519,-0.246613,-0.471595,-0.353639,0.769191,0.443525,0.105449,-0.058893,...,0.188431,-0.236262,0.221057,0.097152,0.034832,-0.159341,0.003167,-0.648572,0.359582,0.716788
2,0.124796,0.196310,0.029785,-0.136838,-0.732282,-0.408913,0.496076,0.881864,-0.173776,-0.327263,...,-0.305325,-0.191583,0.075978,0.080356,0.393167,0.148950,-0.178105,-0.359515,0.618104,0.506780
3,0.007574,0.238556,0.315752,-0.162564,-0.227713,-0.431189,0.172111,0.384288,-0.128905,-0.052831,...,0.071006,-0.351603,-0.056626,0.399258,0.306327,-0.137928,-0.485954,-0.476714,0.260979,0.525012
4,-0.026004,0.145654,0.151565,-0.167552,-0.324778,-0.263809,0.298572,0.608470,0.088514,0.055187,...,-0.025698,-0.367768,0.093500,0.191290,0.029543,-0.097696,-0.211805,-0.541219,-0.000663,0.404261
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,-0.124434,-0.045278,-0.144127,0.130008,-0.176664,-0.470352,0.490547,0.630391,-0.341693,-0.024476,...,0.109039,0.078683,0.111596,-0.286132,-0.004543,-0.390831,-0.041959,-0.011874,0.557075,0.407108
496,0.045821,0.045132,0.495879,-0.199855,-0.107832,-0.286765,0.486568,0.418575,-0.118376,-0.358943,...,-0.278910,-0.332389,-0.256643,0.289449,0.269922,-0.180460,-0.362857,-0.243672,0.172332,0.068301
497,-0.467393,0.221900,-0.451334,0.196944,-0.770069,0.592165,0.480514,0.561038,-0.697607,0.151172,...,-0.069998,0.344805,0.196829,0.232389,0.231368,-0.437250,-0.167758,-0.081547,0.442517,0.026120
498,-0.280228,0.055083,-0.051474,-0.056231,0.045012,-0.264153,0.300203,0.549051,-0.346395,-0.061949,...,-0.122788,0.029805,-0.119288,-0.137450,0.098796,-0.036382,-0.488359,-0.371308,0.241362,0.345105


In [58]:
target = data['toxic'][:500]
pd.DataFrame(target)

Unnamed: 0,toxic
0,0
1,0
2,0
3,0
4,0
...,...
495,0
496,0
497,1
498,0


In [115]:
from sklearn.model_selection import train_test_split
X_train, X_rem, y_train, y_rem = train_test_split(features,target, train_size=0.6)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5)

In [108]:
result = []
def model(estimator, model_name):
    # fit on the training split and record F1 on the validation split
    estimator.fit(X_train, y_train)
    pred = estimator.predict(X_valid)
    result.append(pd.Series({'Estimator': model_name,
                             'F1 score:': f1_score(y_valid, pred)}))

In [109]:
model(LogisticRegression(random_state=12345, class_weight='balanced'), 'LogisticRegression')

In [110]:
model(DecisionTreeClassifier(random_state = 12345, max_depth=10), 'DecisionTreeClassifier')

In [111]:
model(RandomForestClassifier(random_state=12345, class_weight='balanced', n_estimators=100, max_depth=10), 
     'RandomForestClassifier')

In [112]:
model(CatBoostClassifier(iterations=5, learning_rate=1, loss_function='Logloss', depth=10), 'CatBoostClassifier')

0:	learn: 0.2555684	total: 1.98s	remaining: 7.91s
1:	learn: 0.1190436	total: 3.68s	remaining: 5.52s
2:	learn: 0.0625236	total: 5.36s	remaining: 3.57s
3:	learn: 0.0424755	total: 7.15s	remaining: 1.79s
4:	learn: 0.0315904	total: 9.21s	remaining: 0us


In [113]:
best_results = pd.concat(result, axis=1).T.set_index('Estimator')
best_results

Unnamed: 0_level_0,F1 score:
Estimator,Unnamed: 1_level_1
LogisticRegression,0.714286
DecisionTreeClassifier,0.25
RandomForestClassifier,0.444444
CatBoostClassifier,0.2


In [116]:
model = LogisticRegression(random_state=12345, class_weight='balanced')
model.fit(X_train, y_train)
pred = model.predict(X_test)
print('F1 score:', f1_score(y_test, pred))

F1 score: 0.6


**CONCLUSION 2.1**

- the BERT model was used
- LogisticRegression showed the best result: F1 = 0.71 on the validation set and 0.60 on the test set; the result would most likely improve with hyperparameter tuning via GridSearchCV

------------

<div class="alert alert-block alert-success">
<b>Success:</b> Well done mastering vectorization with BERT!
</div>

## TfidfVectorizer 

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer 
#corpus = data['new_text'].values.astype('U')
# tf_idf = TfidfVectorizer().fit_transform(corpus)
# tf_idf.shape

In [24]:
X = data['new_text']
y = data['toxic']

In [25]:
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5)

<div class="alert alert-block alert-danger">
<b>Error:</b> The vectorizer may only be fitted after the sample has been split into parts, and it must be fitted on the training part only.
</div>

<div class="alert alert-block alert-warning">
<b>Changes:</b> The vectorizer is now fitted after the split.
</div>

In [26]:
corpus = X_train.values.astype('U')
corpus2 = X_valid.values.astype('U')
corpus3 = X_test.values.astype('U')

In [28]:
vectorizer = TfidfVectorizer()
# fit on the training part only, then reuse the fitted vectorizer for the other parts
X_train = vectorizer.fit_transform(corpus)
X_valid = vectorizer.transform(corpus2)
X_test = vectorizer.transform(corpus3)
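
To make the effect of fitting on the training part only more concrete, here is a small sketch on made-up sentences. The texts and the `demo_vectorizer` name are assumptions; the point is that the vocabulary and IDF weights come from the training texts alone, and words seen only in validation are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpora, for illustration only.
train_texts = ["you are my hero", "this edit is vandalism"]
valid_texts = ["my hero reverted the vandalism yesterday"]

demo_vectorizer = TfidfVectorizer()
train_matrix = demo_vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
valid_matrix = demo_vectorizer.transform(valid_texts)      # reuses them unchanged

# Words seen only in validation ("reverted", "the", "yesterday") get no column.
print("yesterday" in demo_vectorizer.vocabulary_)      # False
print(train_matrix.shape[1] == valid_matrix.shape[1])  # True
```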

In [29]:
result = []
def model(estimator, model_name):
    # fit on the training split and record F1 on the validation split
    estimator.fit(X_train, y_train)
    pred = estimator.predict(X_valid)
    result.append(pd.Series({'Estimator': model_name,
                             'F1 score:': f1_score(y_valid, pred)}))

In [30]:
model(LogisticRegression(random_state=12345, class_weight='balanced'), 'LogisticRegression')

In [31]:
model(DecisionTreeClassifier(random_state = 12345, max_depth=10), 'DecisionTreeClassifier')

In [32]:
model(RandomForestClassifier(random_state=12345, n_estimators=100, max_depth=10, class_weight='balanced'), 
      'RandomForestClassifier')

In [33]:
best_results = pd.concat(result, axis=1).T.set_index('Estimator')
best_results

Unnamed: 0_level_0,F1 score:
Estimator,Unnamed: 1_level_1
LogisticRegression,0.750343
DecisionTreeClassifier,0.582074
RandomForestClassifier,0.368487


In [35]:
model = LogisticRegression(random_state=12345, class_weight='balanced')
model.fit(X_train, y_train)
pred = model.predict(X_test)
print('F1 score:', f1_score(y_test, pred))

F1 score: 0.7496288298015927


<div class="alert alert-block alert-info">
<b>Tip: </b> The hyperparameters could have been tuned. Keep in mind that cross-validation splits the sample into training and validation parts internally. If the vectorizer is fitted beforehand, it has seen the entire sample passed to cross-validation, which is not quite correct. To avoid this effect you can use a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipeline</a>. There is an example <a href="https://medium.com/analytics-vidhya/ml-pipelines-using-scikit-learn-and-gridsearchcv-fe605a7f9e05">here</a>.
</div>
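
A minimal sketch of what the tip describes: the vectorizer and the classifier are wrapped in one `Pipeline`, so `GridSearchCV` refits the vectorizer on the training part of every fold. The toy corpus, the step names `tfidf` and `clf`, and the grid values are assumptions made for illustration and are not tuned for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus; 1 marks a toxic comment.
texts = [
    "you are an idiot", "thanks for the helpful edit", "stupid idiot go away",
    "great article well sourced", "what a stupid edit", "nice work on the references",
    "shut up you idiot", "please see the talk page", "you stupid vandal",
    "thank you for fixing the typo", "go away stupid troll", "the citation looks correct",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", random_state=12345)),
])

# Illustrative grid over the regularization strength of the classifier.
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)  # the vectorizer is refitted inside every fold

print(search.best_params_)
```

After fitting, `search.best_estimator_` is a complete pipeline that accepts raw strings, so `search.predict(["some new comment"])` would need no separate vectorization step.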

**CONCLUSION 2.2**

- TfidfVectorizer was used
- LogisticRegression showed the best result: F1 = 0.75 on the validation set and approximately 0.75 (0.7496) on the test set

**FINAL CONCLUSION**

- only words are kept in the text, stop words are removed, and the text is lemmatized
- the text was then converted to vectors in two ways
  1. BERT
  2. TfidfVectorizer
---
  1. BERT
- converting text with BERT is a laborious process that takes a lot of memory and time (on a laptop, processing all rows would have taken more than 88 hours)
- the following models were used: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, CatBoostClassifier
- nevertheless, with BERT (taking into account that only 500 rows were converted) logistic regression reached F1 = 0.71 on the validation set, which can most likely be improved with GridSearchCV
---
  2. TfidfVectorizer
- converting text with TfidfVectorizer gave a higher validation F1 for LogisticRegression and DecisionTreeClassifier, but not for RandomForestClassifier (0.37 versus 0.44 with BERT)
- the following models were used: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier
- logistic regression reached F1 = 0.75 on the validation set and approximately 0.75 (0.7496) on the test set

<div class="alert alert-block alert-success">
<b>Success:</b> It is nice to see a detailed conclusion at the end of the project!
</div>

**QUESTIONS FOR THE REVIEWER**


1. Is there a way to put the text preprocessing into a pipeline?
2. Was BERT applied correctly? Can the process be sped up somehow?
3. What is attention_mask used for?
4. Where can I read more about torch?

<div class="alert alert-block alert-info">
<b>Answers: </b>
    
1. Yes, you can. To do that, write your own classes with the required methods: https://stackoverflow.com/questions/43232506/using-pipeline-with-custom-classes-in-sklearn , https://medium.com/analytics-vidhya/scikit-learn-pipelines-with-custom-transformer-a-step-by-step-guide-9b9b886fd2cc
2. Yes, and yes. See the very first comment for details.
3. So that padding tokens are ignored when the text is encoded. BERT accepts at most 512 tokens per sequence, and shorter texts are padded to a common length, which is why padding is used.
4. If you need something specific, look in the documentation; otherwise you can search for courses and articles online.
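
As a rough illustration of answer 1, a custom class can expose `fit` and `transform` and then be placed in front of the vectorizer. The class name `TextCleaner` and the step names are hypothetical; the regular expression mirrors the one used in `clear_text` earlier in the notebook.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keep Latin letters only and collapse whitespace."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return [" ".join(re.sub(r"[^a-zA-Z]", " ", text).split()) for text in X]


preprocessing = Pipeline([
    ("clean", TextCleaner()),
    ("tfidf", TfidfVectorizer()),
])

matrix = preprocessing.fit_transform(["D'aww! He matches", "You, sir, are my hero."])
print(sorted(preprocessing.named_steps["tfidf"].vocabulary_))
```

A classifier can be appended as a final step, after which the whole chain takes raw comment strings as input.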