# Toxic Comment Detection

The task is to detect toxic comments using a dataset labeled for toxicity.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column holds the comment text, and *toxic* is the target.

## Preparation

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import time

import pandas as pd
import numpy as np

import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
import spacy

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import f1_score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Loading

In [3]:
df = pd.read_csv('toxic_comments.csv')

In [4]:
df.head()

|   | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [6]:
df.isna().value_counts()

Unnamed: 0  text   toxic
False       False  False    159292
dtype: int64

In [7]:
df.duplicated().value_counts()

False    159292
dtype: int64

In [8]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

Conclusion: the data contains no missing values or duplicates. The target is imbalanced (about 10% of the comments are toxic), and the models should account for this.
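To make the imbalance concrete, here is a small sketch that turns the class counts printed above into a toxic share and into per-class weights. The weight formula is the one scikit-learn documents for `class_weight='balanced'`, which the models later in this notebook use; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# Fraction of comments labeled toxic.
toxic_share = counts[1] / n_samples

# Formula documented for class_weight='balanced':
# n_samples / (n_classes * number_of_samples_in_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

The rare toxic class gets a weight roughly nine times larger than the non-toxic class, so its errors count proportionally more during training.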

### Cleaning and lemmatization

In [9]:
# normalize and clean the text: lowercase, keep letters only, collapse whitespace

def clear_text(text):
    text = text.lower()
    cleared = re.sub(r'[^a-zA-Z]', " ", text).split()
    return " ".join(cleared)
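As a quick check of what the cleaning step does, the sketch below repeats `clear_text` as defined above and applies it to a made-up sample string (the sample is an assumption for illustration, not a row from the dataset).

```python
import re


def clear_text(text):
    # Same logic as the notebook cell: lowercase, replace every
    # non-letter with a space, then collapse repeated whitespace.
    text = text.lower()
    cleared = re.sub(r'[^a-zA-Z]', " ", text).split()
    return " ".join(cleared)


sample = "D'aww! He matches 2 colours."
cleaned = clear_text(sample)
print(cleaned)
```

Note that apostrophes are replaced by spaces too, so contractions split into separate tokens ("d aww"), which matches the lemmatized output shown further down.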

In [10]:
%%time

# lemmatize the text (the %%time cell magic is kept as the first line of the cell)

lemmatizer = WordNetLemmatizer()

def nltk_lemmatize(text):
    word_list = nltk.word_tokenize(text)
    lemm_text = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
        
    return lemm_text


df['nltk_lem_text'] = df['text'].apply(lambda x: nltk_lemmatize(clear_text(x)))

print(df['nltk_lem_text'])

0         explanation why the edits made under my userna...
1         d aww he match this background colour i m seem...
2         hey man i m really not trying to edit war it s...
3         more i can t make any real suggestion on impro...
4         you sir are my hero any chance you remember wh...
                                ...                        
159287    and for the second time of asking when your vi...
159288    you should be ashamed of yourself that is a ho...
159289    spitzer umm there no actual article for prosti...
159290    and it look like it wa actually you who put on...
159291    and i really don t think you understand i came...
Name: nltk_lem_text, Length: 159292, dtype: object
CPU times: user 1min 38s, sys: 678 ms, total: 1min 39s
Wall time: 1min 39s


In [11]:
%%time

sp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def spacy_lemmatize(text):
    text = text.lower()
    lemm_text = sp(text)
    lemm_text = " ".join([token.lemma_ for token in lemm_text])
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', lemm_text)
    
    return " ".join(cleared_text.split())

df['spacy_lem_text'] = df['text'].apply(spacy_lemmatize)

print(df['spacy_lem_text'])

0         explanation why the edit make under my usernam...
1         d aww he match this background colour I be see...
2         hey man I be really not try to edit war it be ...
3         more I can not make any real suggestion on imp...
4         you sir be my hero any chance you remember wha...
                                ...                        
159287    and for the second time of ask when your view ...
159288    you should be ashamed of yourself that be a ho...
159289    spitzer umm there s no actual article for pros...
159290    and it look like it be actually you who put on...
159291    and I really do not think you understand I com...
Name: spacy_lem_text, Length: 159292, dtype: object
CPU times: user 23min 23s, sys: 9.7 s, total: 23min 33s
Wall time: 23min 33s


Conclusion: spaCy lemmatization takes far longer than NLTK (about 23.5 minutes versus about 1.7 minutes). Next we look at how the choice of lemmatizer affects the machine learning models.

## Training

First we choose a model and tune its hyperparameters, then run the best model on both lemmatization variants and test the better of the two.

In [12]:
# split into training and validation sets (NLTK-lemmatized text)

target = df['toxic']
features_nltk = df['nltk_lem_text']

features_train_nltk, features_valid_nltk, target_train_nltk, target_valid_nltk = train_test_split(features_nltk, 
                                                                               target, 
                                                                               test_size=0.2, 
                                                                               random_state=12345)
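The split above does not stratify by the target. With an imbalanced target, one option (an alternative, not what this notebook does) is to pass `stratify=` to `train_test_split` so both parts keep the same class ratio. A minimal sketch on made-up data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 comments, 10% labeled toxic.
texts = [f"comment {i}" for i in range(100)]
labels = [1] * 10 + [0] * 90

x_train, x_valid, y_train, y_valid = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=12345,
    stratify=labels,  # keep the 10% toxic share in both parts
)

print("toxic in train:", sum(y_train), "of", len(y_train))
print("toxic in valid:", sum(y_valid), "of", len(y_valid))
```

On a dataset of roughly 159 thousand rows the unstratified split is unlikely to drift far from the overall ratio, so this is mainly a safeguard.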

In [13]:
# TF-IDF

# stop_words is passed as a list: recent scikit-learn versions reject a set here
stopwords = list(nltk_stopwords.words('english'))

count_tf_idf = TfidfVectorizer(stop_words=stopwords)

features_train_nltk = count_tf_idf.fit_transform(features_train_nltk.values.astype('U'))
 
features_valid_nltk = count_tf_idf.transform(features_valid_nltk.values.astype('U'))
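The fit/transform pattern used above can be illustrated on a tiny made-up corpus: the vocabulary and IDF weights are learned from the training texts only, and words that appear only in the validation texts are ignored. The texts below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts, not taken from the dataset.
train_texts = ["you are great", "you are awful"]
valid_texts = ["awful unseen words"]

vectorizer = TfidfVectorizer()

# Learn vocabulary and IDF weights from the training texts only.
x_train = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary; unknown words are dropped.
x_valid = vectorizer.transform(valid_texts)

print(sorted(vectorizer.vocabulary_))
print(x_train.shape, x_valid.shape)
print("non-zero entries in validation row:", x_valid.nnz)
```

Both matrices have the same number of columns, which is why the shapes printed by the notebook share the second dimension.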

In [14]:
print('features_train', features_train_nltk.shape)
print('features_valid', features_valid_nltk.shape)
print('target_train', target_train_nltk.shape)
print('target_valid', target_valid_nltk.shape)

features_train (127433, 138358)
features_valid (31859, 138358)
target_train (127433,)
target_valid (31859,)


In [15]:
# split into training and validation sets (spaCy-lemmatized text)

target = df['toxic']
features_spacy = df['spacy_lem_text']

features_train_spacy, features_valid_spacy, target_train_spacy, target_valid_spacy = train_test_split(features_spacy, 
                                                                               target, 
                                                                               test_size=0.2, 
                                                                               random_state=12345)

In [16]:
# TF-IDF

features_train_spacy = count_tf_idf.fit_transform(features_train_spacy.values.astype('U'))
 
features_valid_spacy = count_tf_idf.transform(features_valid_spacy.values.astype('U'))

In [17]:
# Logistic Regression

#%%time

lr_model = LogisticRegression(class_weight='balanced', random_state = 12345)

params = {
    "C": [0.01, 1, 10]
 }

lr_gs = GridSearchCV(lr_model, params, cv=3, scoring='f1', verbose=True).fit(features_train_spacy, target_train_spacy)

print('best parameters: ', lr_gs.best_params_)
print('best f1 score: ', lr_gs.best_score_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
best parameters:  {'C': 10}
best f1 score:  0.7592394403658248
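In the grid search above, TF-IDF was fitted once on the whole training split, so each cross-validation fold is scored with IDF weights that already saw its held-out part. One way to avoid this is to wrap the vectorizer and the classifier in a `Pipeline` (imported at the top but unused). The sketch below runs on a made-up toy corpus and is an alternative arrangement, not the procedure that produced the scores in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 neutral and 6 toxic comments.
toy_texts = [
    "thank you for the helpful edit", "great article well written",
    "please see the talk page", "i added a reliable source",
    "nice work on the references", "thanks for fixing the typo",
    "you are a stupid idiot", "shut up you moron",
    "this is stupid garbage", "what an idiot you are",
    "get lost you stupid troll", "you moron stop editing",
]
toy_labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced", random_state=12345)),
])

# Step parameters are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.01, 1, 10]}

# The vectorizer is refitted inside every fold on that fold's training part.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(toy_texts, toy_labels)

print("best parameters:", search.best_params_)
predictions = search.predict(["you stupid idiot", "thanks for the source"])
print(predictions)
```

The fitted `search` object accepts raw strings at prediction time, which also removes the need to keep a separately fitted vectorizer around.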


In [18]:
# Decision Tree

#%%time

tree = DecisionTreeClassifier(class_weight='balanced', random_state = 12345)

params = {
    'criterion':['gini', 'entropy'],        
    'max_depth':list(range(1,15,5)) 
 }
    
tree_gs = GridSearchCV(tree, params, cv=3, scoring='f1', verbose=True).fit(features_train_spacy, target_train_spacy)

print('best parameters: ', tree_gs.best_params_)
print('best f1 score: ', tree_gs.best_score_)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
best parameters:  {'criterion': 'entropy', 'max_depth': 11}
best f1 score:  0.5815049365666366


In [19]:
# Random Forest

#%%time

rf_model = RandomForestClassifier(class_weight='balanced', random_state = 12345)

params = {
    'n_estimators':[10, 50, 100],        
    'max_depth':list(range(1,15,5)) 
 }
    
rf_gs = GridSearchCV(rf_model, params, cv=3, scoring='f1', verbose=True).fit(features_train_spacy, target_train_spacy)

print('best parameters: ', rf_gs.best_params_)
print('best f1 score: ', rf_gs.best_score_)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
best parameters:  {'max_depth': 11, 'n_estimators': 100}
best f1 score:  0.3666099429776586


Conclusion: logistic regression gave the best result (cross-validated F1 of about 0.76, versus 0.58 for the decision tree and 0.37 for the random forest). Next we check the NLTK lemmatization with logistic regression.

### Evaluating the lemmatization

In [20]:
# Logistic Regression

#%%time

lr_model = LogisticRegression(class_weight='balanced', random_state = 12345)

params = {
    "C": [0.01, 1, 10]
 }

lr_gs_nltk = GridSearchCV(lr_model, params, cv=3, scoring='f1', verbose=True).fit(features_train_nltk, target_train_nltk)

print('best parameters: ', lr_gs_nltk.best_params_)
print('best f1 score: ', lr_gs_nltk.best_score_)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
best parameters:  {'C': 10}
best f1 score:  0.7568265757418132


Conclusion: the best F1 came from logistic regression on the spaCy-lemmatized data (0.759 versus 0.757 with NLTK in cross-validation). Now we test.

### Testing

In [21]:
lr_model_test = LogisticRegression(class_weight='balanced', C = 10, random_state = 12345)

lr_model_test.fit(features_train_spacy, target_train_spacy)

predictions = lr_model_test.predict(features_valid_spacy)

print('f1_score:', f1_score(target_valid_spacy, predictions))

f1_score: 0.7705739229998569
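For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand from confusion counts; the counts are made up for illustration and are not this model's results.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true/false positives and false negatives."""
    precision = tp / (tp + fp)  # share of predicted toxic that is truly toxic
    recall = tp / (tp + fn)     # share of truly toxic that was found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 toxic found, 2 false alarms, 4 toxic missed.
example_f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(f"f1: {example_f1:.4f}")
```

Because true negatives do not enter the formula, F1 is not inflated by the large non-toxic majority, which makes it a reasonable metric for this imbalanced target.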


In [22]:
lr_model_test = LogisticRegression(class_weight='balanced', C = 10, random_state = 12345)

lr_model_test.fit(features_train_nltk, target_train_nltk)

predictions = lr_model_test.predict(features_valid_nltk)

print('f1_score:', f1_score(target_valid_nltk, predictions))

f1_score: 0.76527698458024


## Conclusions

The data was loaded and contained no missing values or duplicates. Preparation consisted of lowercasing the text and removing punctuation, digits, and every other non-letter character. Lemmatization was done in two ways: NLTK (faster) and spaCy (slower). For training, the texts were vectorized with TF-IDF. Logistic regression and tree-based models were trained, and logistic regression gave the best result. spaCy lemmatization came out slightly ahead of NLTK (F1 of 0.771 versus 0.765 on the held-out split).

Possible next steps for improving the metric are tokenization, training gradient boosting models, and vectorizing the text with a BERT neural network.