<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for an online store

The online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

We will train a model to classify comments as toxic or non-toxic. A dataset with toxicity labels for the edits is available.

The goal is a model with an *F1* score of at least 0.75.

## Preparation

In [1]:
import re
import warnings

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

from catboost import CatBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split, cross_val_score

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# NLTK resources used by the tokenizer, the lemmatizer and the POS tagger
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# Read the local copy if it exists, otherwise fall back to the remote file
try:
    data = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

Let's look at the data, the column names and the general information about the dataset.

In [3]:
data.head()

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
data.columns

Index(['Unnamed: 0', 'text', 'toxic'], dtype='object')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


Drop the column that duplicates the index.

In [6]:
data = data.drop(['Unnamed: 0'], axis=1)

In [7]:
data.head()

,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


Check for missing values.

In [8]:
pd.DataFrame(round(data.isna().mean() * 100)).style.background_gradient('coolwarm')

,0
text,0.0
toxic,0.0


Look at the summary statistics with `describe`.

In [9]:
data.describe()

,toxic
count,159292.0
mean,0.101612
std,0.302139
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


Check the class balance of the target.

In [10]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
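
The counts above can be turned into a class share to see how imbalanced the target is. A small sketch using only the numbers printed by `value_counts()`; the variable names are ours:

```python
# Class counts copied from the value_counts() output above
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total
imbalance_ratio = counts[0] / counts[1]

print(f"total comments: {total}")
print(f"toxic share: {toxic_share:.4f}")
print(f"non-toxic comments per toxic one: {imbalance_ratio:.2f}")
```

The toxic share agrees with the `mean` of `toxic` reported by `describe()`, so roughly one comment in ten is toxic. This is why the split below is stratified and the linear models use `class_weight='balanced'`.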

Preprocess the text: clean it, tokenize it, remove stop words and lemmatize the remaining tokens.

In [11]:
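def get_wordnet_pos(word):
    """Map the Penn Treebank POS tag of a word to a WordNet POS constant."""
    # Tag the word in isolation and keep the first letter of its tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Any other tag falls back to noun
    return tag_dict.get(tag, wordnet.NOUN)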
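
To make the mapping concrete without running the tagger, here is a sketch that applies the same first-letter rule to a few Penn Treebank tags. The WordNet constants are inlined as plain letters so the snippet runs without NLTK; `penn_to_wordnet` is a hypothetical helper, not part of the notebook:

```python
# Values of nltk.corpus.wordnet.ADJ / NOUN / VERB / ADV, inlined for illustration
WORDNET_ADJ, WORDNET_NOUN, WORDNET_VERB, WORDNET_ADV = "a", "n", "v", "r"

TAG_TO_WORDNET = {
    "J": WORDNET_ADJ,
    "N": WORDNET_NOUN,
    "V": WORDNET_VERB,
    "R": WORDNET_ADV,
}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[0].upper(), WORDNET_NOUN)

for tag in ["JJ", "NNS", "VBD", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups, such as the preposition tag `IN`, are treated as nouns, which matches the fallback in `get_wordnet_pos`.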

In [12]:
lemmatizer = WordNetLemmatizer()

In [13]:
def lemmatize(text):
    """Tokenize, drop English stop words and lemmatize each remaining token."""
    word_list = nltk.word_tokenize(text)
    word_list = [word for word in word_list if word not in stopwords]
    return ' '.join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in word_list)

In [14]:
def clear_text(text):
    """Keep only Latin letters and spaces, lowercase, and collapse whitespace."""
    text = re.sub(r"[^a-zA-Z ]", ' ', str(text))
    text = text.lower()
    return ' '.join(text.split())
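
A quick check of what the cleaning step does to a raw comment. The function is repeated so the snippet runs on its own; the sample string is made up for illustration:

```python
import re

def clear_text(text):
    # Replace everything except Latin letters and spaces, lowercase, collapse whitespace
    text = re.sub(r"[^a-zA-Z ]", " ", str(text))
    text = text.lower()
    return " ".join(text.split())

sample = "D'aww! He matches this\nbackground colour, 100%."
cleaned = clear_text(sample)
print(cleaned)
```

Punctuation, digits and line breaks all become spaces, so a contraction such as `D'aww` is split into separate fragments before tokenization.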

In [15]:
data['tokens'] = data['text'].apply(lambda x: lemmatize(clear_text(x)))

In [16]:
data.head()

,text,toxic,tokens
0,Explanation\nWhy the edits made under my usern...,0,explanation edits make username hardcore metal...
1,D'aww! He matches this background colour I'm s...,0,aww match background colour seemingly stuck th...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man really try edit war guy constantly rem...
3,"""\nMore\nI can't make any real suggestions on ...",0,make real suggestion improvement wonder sectio...
4,"You, sir, are my hero. Any chance you remember...",0,sir hero chance remember page


Split the data into training and test sets.

In [17]:
features = data['tokens']
target = data['toxic']

In [18]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345, stratify=target) 

In [19]:
features_train.shape, features_test.shape, target_train.shape, target_test.shape

((127433,), (31859,), (127433,), (31859,))
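
The shapes above can be reproduced by hand. A sketch of the arithmetic, assuming that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the rest to the training set:

```python
import math

n_samples = 159292   # rows reported by data.info()
test_size = 0.20     # value passed to train_test_split

# A fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```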

Create the features.

TF-IDF

In [20]:
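# Fit the vocabulary and idf weights on the training set only;
# stop_words is passed as a list, which is the collection type TfidfVectorizer documents
tf_idf = TfidfVectorizer(stop_words=list(stopwords))
tf_idf_train = tf_idf.fit_transform(features_train)
print("TF-IDF matrix shape:", tf_idf_train.shape)

TF-IDF matrix shape: (127433, 132707)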


In [21]:
# Reuse the vocabulary learned on the training set
tf_idf_test = tf_idf.transform(features_test)
print("TF-IDF test matrix shape:", tf_idf_test.shape)

TF-IDF test matrix shape: (31859, 132707)
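
To show what the vectorizer computes for each cell of these matrices, here is a pure-Python sketch on a toy corpus. It assumes the `TfidfVectorizer` defaults: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 normalization of each row. The three documents are made up:

```python
import math
from collections import Counter

# Hypothetical, already lemmatized comments
docs = [
    "thank edit page",
    "stupid edit",
    "thank thank help",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    # Raw count times smoothed idf
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized row

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second document the rarer word `stupid` gets a larger weight than `edit`, which also occurs in another document. This is the property that lets a linear model pick up on distinctive toxic vocabulary.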


In [22]:
target_train = target_train.values 
target_test = target_test.values

Summary: we loaded and explored the data, checked for missing values, reviewed the columns and dropped the redundant index column. We then cleaned the text, tokenized it, removed stop words and lemmatized it, and vectorized the result with TF-IDF for training.

## Training

In [23]:
scorer = make_scorer(f1_score)

In [24]:
# Note: .max() keeps the best of the 3 folds; the mean over folds is the more conservative summary
lr_model = LogisticRegression(random_state=12345, class_weight='balanced', penalty='l2', max_iter=40, solver='newton-cg', multi_class='multinomial')
lr_f1_tfidf = cross_val_score(lr_model, tf_idf_train, target_train, scoring=scorer, cv=3).max()

lr_f1_tfidf

0.7570957797384926
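
The value above is the best of the three fold scores. For comparison, here is a sketch of the more common summary, mean and standard deviation over folds. The fold scores are hypothetical and are NOT the notebook's results:

```python
import statistics

# Hypothetical per-fold F1 scores, for illustration only
fold_scores = [0.74, 0.75, 0.76]

best = max(fold_scores)
mean = statistics.mean(fold_scores)
spread = statistics.stdev(fold_scores)

print(f"best fold: {best:.3f}, mean: {mean:.3f} +/- {spread:.3f}")
```

The best fold is never lower than the mean, so ranking models by `.max()` gives an optimistic picture; the held-out test set remains the final check.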

In [25]:
# Linear model trained with SGD; as above, .max() keeps the best of the 3 folds
sgd_model = SGDClassifier(random_state=12345, class_weight='balanced', penalty='l2', loss='modified_huber', max_iter=20)
sgd_f1_tfidf = cross_val_score(sgd_model, tf_idf_train, target_train, scoring=scorer, cv=3).max()

sgd_f1_tfidf

0.7497660879509305
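
Both linear models use `class_weight='balanced'`. A sketch of the weights this setting implies, using the scikit-learn formula `n_samples / (n_classes * class_count)`. The counts are those of the full dataset; the stratified training split has almost the same proportions, so the actual weights are close to these:

```python
# Full-dataset class counts from the value_counts() output
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(label, round(weight, 3))
```

Each toxic comment weighs almost nine times as much as a non-toxic one, so both classes contribute equally to the loss in total.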

In [26]:
# Gradient boosting; verbose=20 logs the training F1 every 20 iterations, .max() keeps the best of the 3 folds
cat_model = CatBoostClassifier(random_state=12345, learning_rate=0.7, iterations=100, verbose=20, eval_metric='F1')
cat_f1_tfidf = cross_val_score(cat_model, tf_idf_train, target_train, scoring=scorer, cv=3).max()

cat_f1_tfidf

0:	learn: 0.4190649	total: 2.3s	remaining: 3m 48s
20:	learn: 0.7218844	total: 34.9s	remaining: 2m 11s
40:	learn: 0.7566687	total: 1m 7s	remaining: 1m 36s
60:	learn: 0.7739177	total: 1m 38s	remaining: 1m 3s
80:	learn: 0.7802469	total: 2m 9s	remaining: 30.4s
99:	learn: 0.7942514	total: 2m 39s	remaining: 0us
0:	learn: 0.4719004	total: 2.21s	remaining: 3m 38s
20:	learn: 0.7173698	total: 34.1s	remaining: 2m 8s
40:	learn: 0.7581139	total: 1m 5s	remaining: 1m 33s
60:	learn: 0.7760396	total: 1m 35s	remaining: 1m 1s
80:	learn: 0.7874162	total: 2m 5s	remaining: 29.5s
99:	learn: 0.7982086	total: 2m 34s	remaining: 0us
0:	learn: 0.5010431	total: 2.08s	remaining: 3m 25s
20:	learn: 0.7242673	total: 33.6s	remaining: 2m 6s
40:	learn: 0.7620970	total: 1m 4s	remaining: 1m 32s
60:	learn: 0.7818121	total: 1m 35s	remaining: 1m
80:	learn: 0.7920205	total: 2m 6s	remaining: 29.6s
99:	learn: 0.8025628	total: 2m 36s	remaining: 0us


0.7424506777613744

Summary: we trained several models with 3-fold cross-validation and compared their best-fold F1 scores: logistic regression 0.757, SGD 0.750, CatBoost 0.742. Logistic regression scored highest and was selected.
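
In the cells above the TF-IDF vocabulary and idf weights are fitted on the whole training set before cross-validation, so every validation fold has already influenced the features. One possible alternative, sketched here on a made-up toy corpus and not used in this notebook, is to wrap the vectorizer and the model in a scikit-learn pipeline so that the vectorizer is refitted inside each fold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = toxic, 0 = non-toxic
texts = [
    "thank you for the helpful edit", "great article nice work",
    "please add a source here", "i agree with this change",
    "good point thanks", "welcome to the project",
    "you are a stupid idiot", "shut up you moron",
    "this is garbage you fool", "idiot stop editing",
    "what a stupid moron", "you fool this is trash",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# The vectorizer is refitted inside every fold, on that fold's training part only
scores = cross_val_score(pipeline, texts, labels, scoring="f1", cv=3)
print(scores.mean(), scores.std())
```

With raw text as input, the same pipeline object can later be fitted on the full training set and applied directly to the test comments.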

## Conclusions

In [27]:
# Refit the selected model on the full training set and score it on the held-out test set
lr_model.fit(tf_idf_train, target_train)
predicted_lr = lr_model.predict(tf_idf_test)
f1_lr = f1_score(target_test, predicted_lr)
f1_lr

0.7550740537575424

In [28]:
# Sanity check: a constant baseline that always predicts the majority class
model_dummy = DummyClassifier()
model_dummy.fit(tf_idf_train, target_train)
predicted_dummy = model_dummy.predict(tf_idf_test)
f1_dummy = f1_score(target_test, predicted_dummy)
f1_dummy

0.0
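
The baseline score of zero follows directly from the definition of F1. A sketch on a hypothetical test set of 100 comments shows why a model that never predicts the toxic class cannot score above 0; `f1_from_counts` is our own helper:

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical test set: 90 non-toxic and 10 toxic comments
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100   # constant "always non-toxic" baseline

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

baseline_f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, baseline_f1)
```

With no true positives the numerator is zero, so any model that finds at least some toxic comments beats this baseline.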

Summary: the selected model, logistic regression, reached F1 = 0.755 on the test set, which meets the target of at least 0.75. As a sanity check we compared it with a constant baseline, which scored F1 = 0.0, so the model passes the check.