<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#CatBoost" data-toc-modified-id="CatBoost-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>CatBoost</a></span></li></ul></li><li><span><a href="#Тестирование-лучшей-модели" data-toc-modified-id="Тестирование-лучшей-модели-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Testing the best model</a></span></li><li><span><a href="#Вывод" data-toc-modified-id="Вывод-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop («Викишоп»)

The Wikishop online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.



**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

 <div class="alert alert-success">
<h2> Reviewer's comment <a class="tocSkip"></a> </h2>

<b>All good!👍:</b> 
    
I see the project description you added. Well done! It will help you place the right emphasis in your conclusions.
</div>

## Preparation

In [1]:
import pandas as pd
import numpy as np
import spacy
import re
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm.notebook import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, cross_val_score

import warnings

In [2]:
RANDOM_STATE = 12345

In [3]:
warnings.filterwarnings("ignore")

In [4]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
df.head()

| | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [5]:
df.groupby('toxic')['toxic'].agg(['count'])

| toxic | count |
|---|---|
| 0 | 143106 |
| 1 | 16186 |


The classes are strongly imbalanced: only about 10% of the comments are toxic. This has to be taken into account both when splitting the data and when training the models.
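
To make the imbalance concrete, here is a minimal sketch that turns the class counts from the table above into the toxic share and into "balanced" class weights. It assumes the heuristic documented for scikit-learn's `class_weight='balanced'`, `n_samples / (n_classes * class_count)`. The variable names are illustrative, and the counts are for the full dataset rather than the training split.

```python
# Class counts taken from the groupby output above (full dataset)
class_counts = {0: 143106, 1: 16186}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Share of toxic comments in the data
toxic_share = class_counts[1] / n_samples

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under this heuristic the rare toxic class gets a weight of roughly 4.92 and the majority class roughly 0.56, so errors on toxic comments cost about nine times more during training.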

In [6]:
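def clear_text(text):
    # Keep only Latin letters; every other character becomes a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Collapse runs of whitespace into single spaces
    cleared_text = ' '.join(cleared_text.split())
    return cleared_text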
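
As a quick illustration of what the cleaning step does, the sketch below repeats the same function on two short strings (the sample strings are made up for the example). Note that apostrophes are replaced by spaces, so contractions are split into separate tokens.

```python
import re


def clear_text(text):
    # Keep only Latin letters; every other character becomes a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return ' '.join(cleared_text.split())


sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample)
print(cleaned)

# Digits, punctuation and line breaks are all dropped
print(clear_text("Edit #42:\nplease   stop!!!"))
```

The first call yields `D aww He matches this background colour`. Case is left unchanged here; `TfidfVectorizer` lowercases text by default later on.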

In [7]:
nlp = spacy.load('en_core_web_sm')

In [8]:
def lemmatize_text(text):
    # Replace each spaCy token with its lemma and rejoin with spaces
    lemm_text = " ".join([token.lemma_ for token in nlp(text)])
    return lemm_text

In [9]:
tqdm.pandas()
df['lemm_text'] = df['text'].progress_apply(clear_text)
df.head()

| | Unnamed: 0 | text | toxic | lemm_text |
|---|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | Explanation Why the edits made under my userna... |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | D aww He matches this background colour I m se... |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | Hey man I m really not trying to edit war It s... |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I can t make any real suggestions on impr... |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | You sir are my hero Any chance you remember wh... |


In [10]:
df['lemm_text'] = df['lemm_text'].progress_apply(lemmatize_text)
df.head()

| | Unnamed: 0 | text | toxic | lemm_text |
|---|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edit make under my usernam... |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | d aww he match this background colour I m seem... |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man I m really not try to edit war it s ju... |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I can t make any real suggestion on impro... |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir be my hero any chance you remember wha... |


In [11]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
df_train, df_test = train_test_split(df, test_size=.25, stratify=df['toxic'], random_state=RANDOM_STATE)
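
The effect of `stratify` can be sketched on synthetic labels. The snippet below is only an illustration: the array sizes and the 10% positive share are made up to resemble the imbalance in this dataset, and it assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 900 negatives and 100 positives (10% positive)
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified split keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345
)

print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```

Without `stratify`, the positive share in a small test part can drift away from the share in the full data, which makes F1 on the rare class less stable.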

In [13]:
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
tf_idf_train = count_tf_idf.fit_transform(df_train['lemm_text'])
tf_idf_train.shape

(119469, 132298)

In [14]:
tf_idf_test = count_tf_idf.transform(df_test.lemm_text)
tf_idf_test.shape

(39823, 132298)
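
The train and test matrices have the same number of columns because the vocabulary is learned only from the training texts and then reused for the test texts. A minimal sketch of this behavior on toy documents (the documents are made up, and it assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["you be my hero", "edit war be bad", "thank for the edit"]
test_docs = ["my hero edit the page"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts
train_matrix = vectorizer.fit_transform(train_docs)
# transform reuses them; words unseen during fitting are ignored
test_matrix = vectorizer.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print("'page' in vocabulary:", "page" in vectorizer.vocabulary_)
```

Fitting the vectorizer on the training part only keeps information about the test texts out of the features.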

In [15]:
target_train = df_train['toxic']
target_test = df_test['toxic']

## Training

### LogisticRegression

In [16]:
model_log_reg = LogisticRegression(class_weight='balanced', random_state=RANDOM_STATE, C=6)
log_reg_scores = cross_val_score(model_log_reg, 
                            tf_idf_train, 
                            target_train, 
                            cv=3, 
                            scoring='f1')

In [17]:
print('F1 score:', log_reg_scores.mean())

F1 score: 0.7631433012881711


### CatBoost

In [18]:
train_pool = Pool(data=tf_idf_train,
                  label=target_train,
                 )

catboost = CatBoostClassifier(loss_function='Logloss',
                              eval_metric='F1',
                              verbose=20, 
                              random_state=RANDOM_STATE,
                              iterations=40,
                              depth=5,
                              auto_class_weights='Balanced')

parameters_cat = {'learning_rate':np.arange(0.1,1,0.2)}

catboost_grid = catboost.grid_search(parameters_cat, train_pool,
            verbose=True,
            plot=False)

0:	learn: 0.4361062	test: 0.4240730	best: 0.4240730 (0)	total: 1.75s	remaining: 1m 8s
20:	learn: 0.7580884	test: 0.7498820	best: 0.7498820 (20)	total: 29.6s	remaining: 26.7s
39:	learn: 0.7955000	test: 0.7989114	best: 0.7989114 (39)	total: 55.5s	remaining: 0us

bestTest = 0.7989113901
bestIteration = 39

0:	loss: 0.7989114	best: 0.7989114 (0)	total: 1m 45s	remaining: 7m 1s
0:	learn: 0.4361062	test: 0.4240730	best: 0.4240730 (0)	total: 1.7s	remaining: 1m 6s
20:	learn: 0.8178774	test: 0.8185972	best: 0.8185972 (20)	total: 28.7s	remaining: 26s
39:	learn: 0.8508112	test: 0.8461702	best: 0.8461702 (39)	total: 54.6s	remaining: 0us

bestTest = 0.8461702342
bestIteration = 39

1:	loss: 0.8461702	best: 0.8461702 (1)	total: 2m 40s	remaining: 4m
0:	learn: 0.4361062	test: 0.4240730	best: 0.4240730 (0)	total: 1.8s	remaining: 1m 10s
20:	learn: 0.8359085	test: 0.8366693	best: 0.8386310 (19)	total: 29.4s	remaining: 26.6s
39:	learn: 0.8687301	test: 0.8568970	best: 0.8568970 (39)	total: 54.4s	remaining: 

In [19]:
print('F1 score:', max(catboost_grid['cv_results']['test-F1-mean']))

F1 score: 0.858943973771193


CatBoost has the better cross-validation score (F1 of about 0.859 versus 0.763 for LogisticRegression), so we select it for testing.

## Testing the best model

In [20]:
catboost_grid['params']

{'learning_rate': 0.9000000000000001}

In [21]:
train_pool = Pool(data=tf_idf_train,
                  label=target_train)

test_pool = Pool(data=tf_idf_test,
                  label=target_test)

catboost_model = CatBoostClassifier(loss_function='Logloss',
                              eval_metric='F1', 
                              random_state=RANDOM_STATE,
                              iterations=40,
                              depth=5,
                              auto_class_weights='Balanced',
                              learning_rate=0.9)

catboost_model.fit(train_pool, eval_set=test_pool)

0:	learn: 0.4789017	test: 0.4664065	best: 0.4664065 (0)	total: 1.9s	remaining: 1m 14s
1:	learn: 0.5899162	test: 0.5860098	best: 0.5860098 (1)	total: 3.4s	remaining: 1m 4s
2:	learn: 0.6274950	test: 0.6192452	best: 0.6192452 (2)	total: 4.89s	remaining: 1m
3:	learn: 0.6681541	test: 0.6583624	best: 0.6583624 (3)	total: 6.44s	remaining: 58s
4:	learn: 0.7073342	test: 0.6976670	best: 0.6976670 (4)	total: 7.99s	remaining: 55.9s
5:	learn: 0.7319713	test: 0.7236329	best: 0.7236329 (5)	total: 9.44s	remaining: 53.5s
6:	learn: 0.7562815	test: 0.7443180	best: 0.7443180 (6)	total: 10.9s	remaining: 51.5s
7:	learn: 0.7696588	test: 0.7579779	best: 0.7579779 (7)	total: 12.4s	remaining: 49.7s
8:	learn: 0.7803342	test: 0.7713563	best: 0.7713563 (8)	total: 13.8s	remaining: 47.6s
9:	learn: 0.7879346	test: 0.7759542	best: 0.7759542 (9)	total: 15.3s	remaining: 45.8s
10:	learn: 0.7965385	test: 0.7863864	best: 0.7863864 (10)	total: 16.7s	remaining: 44s
11:	learn: 0.8037192	test: 0.7917001	best: 0.7917001 (11)	to

<catboost.core.CatBoostClassifier at 0x7f4a645c2430>

In [22]:
print('F1 score:', catboost_model.best_score_['validation']['F1'])

F1 score: 0.8654049253770559


As a sanity check, let's compare the result with a constant model.

In [23]:
dummy_model = DummyClassifier(random_state=RANDOM_STATE, strategy='most_frequent')
dummy_model.fit(tf_idf_train, target_train)
dummy_predictions = dummy_model.predict(tf_idf_test)
f1 = f1_score(target_test, dummy_predictions)
print('F1 for dummy: ', f1)

F1 for dummy:  0.0


The check passes: our model performs far better than the baseline. The constant model always predicts the majority class 0, so it has no true positives (TP = 0) and its F1 is 0.
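
Why the constant model scores exactly zero can be seen from the definition of F1. The sketch below is a plain-Python illustration, not scikit-learn's implementation. The helper name and the toy labels are made up, and returning 0.0 when there are no true positives is an assumption that matches what `f1_score` reports by default in that case.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1) computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0, 0, 0, 0, 1, 1]

# A constant "most frequent" model predicts class 0 for everything
constant_pred = [0] * len(y_true)
print(f1_from_labels(y_true, constant_pred))

# A model that finds one of the two positives and raises one false alarm
some_pred = [0, 0, 0, 1, 1, 0]
print(f1_from_labels(y_true, some_pred))
```

The constant predictions give 0.0, while the second set of predictions gives 0.5 (precision and recall are both 0.5).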

## Conclusion

- cleaned and lemmatized the comment texts during data preparation
- vectorized the texts with TF-IDF using TfidfVectorizer
- accounted for class imbalance in the train/test split with the stratify parameter
- compared two models, LogisticRegression and CatBoost, to pick the best one
- balanced the class weights during training: class_weight='balanced' for LogisticRegression and auto_class_weights='Balanced' for CatBoost
- LogisticRegression reached F1 = 0.763 on cross-validation
- CatBoost reached F1 = 0.859 on cross-validation
- on the test set the best model reached F1 = 0.865 (the best_score_ value with the test pool passed as eval_set), which is above the required F1 = 0.75
- sanity-checked the model against a constant model