<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Выводы</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучите модель классифицировать комментарии на позитивные и негативные. В вашем распоряжении набор данных с разметкой о токсичности правок.

Постройте модель со значением метрики качества *F1* не меньше 0.75. 

**Инструкция по выполнению проекта**

1. Загрузите и подготовьте данные.
2. Обучите разные модели. 
3. Сделайте выводы.


**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## Подготовка

In [1]:
# Импорт необходимых библиотек
import os
import numpy as np
import pandas as pd
from pymystem3 import Mystem
import re
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
data_path_local_0 = 'C:/Users/Maroznik/Documents/dev/Яндекс.Практикум/projects/data/project_13/toxic_comments.csv'

#данные на сервере
data_path_server_0 = '/datasets/toxic_comments.csv'
    
if (os.path.exists(data_path_local_0)):
    data = pd.read_csv(data_path_local_0, sep=',')
    print('Успешное чтение файлов')
elif (os.path.exists(data_path_server_0)):
    data = pd.read_csv(data_path_server_0, sep=',')
else:
    print('Путь к файлу не найден')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


In [5]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
def lemmatize(text):
    lemm_list = lemmatizer.lemmatize(text)
    return "".join(lemm_list)


def clear_text(text):
    clr_list = re.sub(r'[^a-zA-Z ]', ' ', text)
    clr_list = clr_list.split()
    clr_text = ' '.join(clr_list)
    return clr_text

In [7]:
data['lemm_text'] = data['text'].apply(lambda x: lemmatize(clear_text(x)))

In [8]:
# разделим выборку на обучающую и валидационную
features_train, features_valid, target_train, target_valid = train_test_split(
    data.lemm_text, data.toxic, test_size=0.25, random_state=12345, stratify=data['toxic'])
display(features_train.shape)
display(features_valid.shape)
display(target_train.shape)
display(target_valid.shape)
display(features_valid.shape[0] / (features_train.shape[0] + features_valid.shape[0]))

(119678,)

(39893,)

(119678,)

(39893,)

0.25000156670071627

In [9]:
#исследование дисбаланса классов
features_zeros = features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(107509,)
(12169,)
(107509,)
(12169,)


Положительного класса почти в 9 раза меньше, чем отрицательного. Дисбаланс классов имеет место.

**Балансирование классов**

Балансировку можно проводить следующими способами: взвешиванием классов, увеличением выборки, уменьшением выборки. Чтобы не потерять классы для обучения не будем использовать уменьшение выборки, а из оставшихся вариантов выберем один. Произведем увеличение выборки.

In [10]:
def upsample(features, target, repeat):
    features_zeros = features[target_train == 0]
    features_ones = features[target_train == 1]
    target_zeros = target[target_train == 0]
    target_ones = target[target_train == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat) 
    return features_upsampled, target_upsampled
 
features_upsampled, target_upsampled = upsample(features_train, target_train, 8)
features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

features_zeros = features_upsampled[target_upsampled == 0]
features_ones = features_upsampled[target_upsampled == 1]
target_zeros = target_upsampled[target_upsampled == 0]
target_ones = target_upsampled[target_upsampled == 1]

print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(107509,)
(97352,)
(107509,)
(97352,)


Дисбаланс устранен

In [11]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
 
tf_idf_train = count_tf_idf.fit_transform(features_upsampled)
tf_idf_test = count_tf_idf.transform(features_valid)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Обучение

Обучим модель логистической регрессии

In [12]:
model = LogisticRegression(random_state=12345)

In [13]:
%%time
model.fit(tf_idf_train, target_upsampled)

CPU times: user 19.2 s, sys: 36.1 s, total: 55.3 s
Wall time: 55.3 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=12345)

In [14]:
%%time
predictions = model.predict(tf_idf_test)

CPU times: user 6.01 ms, sys: 2.01 ms, total: 8.03 ms
Wall time: 79.9 ms


In [15]:
result = f1_score(target_valid, predictions)
display(result)

0.7638873175698608

Обучим моодель решающего дерева с выбором  оптимальных параметров

In [16]:
dt_model = DecisionTreeClassifier()
parametrs = { 'max_depth': range (5,10),
              'min_samples_leaf': range (1,5) }
grid_dt_model = GridSearchCV(dt_model, parametrs, scoring=f1_score, cv=5)

In [17]:
%%time
grid_dt_model.fit(tf_idf_train, target_upsampled)
grid_dt_model.best_params_

117656    1
82533     0
118946    0
11087     0
         ..
40453     1
123637    1
99164     1
145574    1
99047     1
Name: toxic, Length: 40973, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 74, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1068, in f1_score
    return fbeta_score(y_true, y_pred, beta=1, labels=labels,
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1192, in fbeta_score
    _, _

CPU times: user 22min 25s, sys: 7.8 s, total: 22min 32s
Wall time: 22min 34s


{'max_depth': 5, 'min_samples_leaf': 1}

In [18]:
best_dt_model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=1)

In [19]:
%%time
best_dt_model.fit(tf_idf_train, target_upsampled)

CPU times: user 3.3 s, sys: 47.9 ms, total: 3.35 s
Wall time: 3.35 s


DecisionTreeClassifier(max_depth=5)

In [20]:
%%time
predictions = best_dt_model.predict(tf_idf_test)

CPU times: user 20.1 ms, sys: 0 ns, total: 20.1 ms
Wall time: 18.8 ms


In [21]:
result = f1_score(target_valid, predictions)
display(result)

0.4603058994901675

Обучим моодель случайного леса с выбором  оптимальных параметров

In [22]:
rf_model = RandomForestClassifier(random_state=12345) 

In [23]:
parametrs = { 'max_depth': range (5,10),
              'n_estimators': range (40,60) }
grid_rf_model = GridSearchCV(rf_model, parametrs, scoring=f1_score, cv=5)

In [24]:
%%time
grid_rf_model.fit(tf_idf_train, target_upsampled)
grid_rf_model.best_params_

117656    1
82533     0
118946    0
11087     0
         ..
40453     1
123637    1
99164     1
145574    1
99047     1
Name: toxic, Length: 40973, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 74, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1068, in f1_score
    return fbeta_score(y_true, y_pred, beta=1, labels=labels,
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1192, in fbeta_score
    _, _

CPU times: user 1h 22min 3s, sys: 27.2 s, total: 1h 22min 30s
Wall time: 1h 22min 41s


{'max_depth': 5, 'n_estimators': 40}

In [35]:
best_rf_model = RandomForestClassifier(max_depth=5, n_estimators=40)

In [36]:
%%time
best_rf_model.fit(tf_idf_train, target_upsampled)

CPU times: user 1.55 s, sys: 32 ms, total: 1.59 s
Wall time: 1.59 s


RandomForestClassifier(max_depth=5, n_estimators=40)

In [37]:
%%time
predictions = best_rf_model.predict(tf_idf_test)

CPU times: user 230 ms, sys: 3.97 ms, total: 234 ms
Wall time: 242 ms


In [38]:
result = f1_score(target_valid, predictions)
display(result)

0.19370354175776128

In [29]:
'''
%%time
model_lgbm = LGBMClassifier ()
parametrs = { 'max_depth': range (2,6),
              'learning_rate': np.arange (0.1,0.5,0.1) }
grid_lgbm = GridSearchCV(model_lgbm, parametrs, scoring=f1_score, cv=5)
grid_lgbm.fit(tf_idf_train, target_upsampled, verbose=False)
display(grid_lgbm.best_params_)
'''

"\n%%time\nmodel_lgbm = LGBMClassifier ()\nparametrs = { 'max_depth': range (2,6),\n              'learning_rate': np.arange (0.1,0.5,0.1) }\ngrid_lgbm = GridSearchCV(model_lgbm, parametrs, scoring=f1_score, cv=5)\ngrid_lgbm.fit(tf_idf_train, target_upsampled, verbose=False)\ndisplay(grid_lgbm.best_params_)\n"

In [30]:
'''
%%time
best_lgbm_model = LGBMClassifier(learning_rate=0.3, max_depth=5)
'''

'\n%%time\nbest_lgbm_model = LGBMClassifier(learning_rate=0.3, max_depth=5)\n'

In [31]:
'''
%%time
best_lgbm_model.fit(tf_idf_train, target_upsampled, verbose=False)
'''

'\n%%time\nbest_lgbm_model.fit(tf_idf_train, target_upsampled, verbose=False)\n'

In [32]:
'''
%%time
predictions = best_lgbm_model.predict(tf_idf_test)
'''

'\n%%time\npredictions = best_lgbm_model.predict(tf_idf_test)\n'

In [33]:
'''
result = f1_score(target_valid, predictions)
display(result)
'''

'\nresult = f1_score(target_valid, predictions)\ndisplay(result)\n'

## Выводы

Наиболее точная модель и единственная, которая удовлетворяет требованию по точности (f1 > 0.75) это модель логистической регрессии (0.7638873175698608). Модели решающего дерева и случайного леса показали результаты значительно хуже (0.4603058994901675 и 0.19370354175776128 соответственно). Модели LGBM и CatBoost и BERT не получилось запустить на локальном компьютере, в JupiterHub и Google Colab ввиду нехватки аппаратных средств.

## Чек-лист проверки

- [x]  Jupyter Notebook открыт
- [ ]  Весь код выполняется без ошибок
- [ ]  Ячейки с кодом расположены в порядке исполнения
- [ ]  Данные загружены и подготовлены
- [ ]  Модели обучены
- [ ]  Значение метрики *F1* не меньше 0.75
- [ ]  Выводы написаны