<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Выгрузка-данных" data-toc-modified-id="Выгрузка-данных-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading the data</a></span></li><li><span><a href="#Очистка-данных" data-toc-modified-id="Очистка-данных-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Cleaning the data</a></span></li><li><span><a href="#Обработка-с-помощью-BERT" data-toc-modified-id="Обработка-с-помощью-BERT-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Processing with BERT</a></span></li><li><span><a href="#Вывод" data-toc-modified-id="Вывод-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Summary</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#CatBoost" data-toc-modified-id="CatBoost-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>CatBoost</a></span></li><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LightGBM</a></span></li><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Дерево-решений" data-toc-modified-id="Дерево-решений-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Decision tree</a></span></li><li><span><a href="#Случайный-лес" data-toc-modified-id="Случайный-лес-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#Вывод" data-toc-modified-id="Вывод-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Summary</a></span></li></ul></li><li><span><a href="#Тестирование" data-toc-modified-id="Тестирование-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Testing</a></span></li><li><span><a href="#Вывод" data-toc-modified-id="Вывод-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Classifying positive and negative comments with BERT

An online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.
<br/>
<br/> **Project goal**: train a model that classifies comments as positive or negative (toxic) and reaches an *F1* score of at least 0.75.

## Preparation

### Loading the data

In [1]:
!pip install lightgbm

In [2]:
!pip install catboost

In [3]:
!pip install torch

In [4]:
!pip install transformers

We import all libraries and tools, load the DataFrame and display its first rows.

In [1]:
import pandas as pd
import re
import nltk.corpus
import numpy as np
import warnings
import torch
import transformers as ppb
from tqdm import tqdm
from tqdm import notebook
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv(r'C:\Users\maria\Downloads\toxic_comments.csv')
display(df.head())

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


Let's make sure there are no missing values or duplicates.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
df.duplicated().sum()

0

Because the dataset is large, we take a random sample of only 1000 rows to keep processing time manageable.

In [7]:
df_new = df.sample(1000).reset_index(drop=True)

There are no missing values or duplicates, but the first five rows show that the text contains many extraneous symbols, punctuation marks and stop words. Let's clean the data.

### Cleaning the data

We write a function that converts the text to lower case and removes symbols and numbers.

In [10]:
def clean_text(df, text_field, new_text_field_name):
    # convert the text to lower case
    df[new_text_field_name] = df[text_field].str.lower()
    # replace mentions, links, punctuation and other symbols with spaces
    df[new_text_field_name] = df[new_text_field_name].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", " ", elem))  
    # remove numbers
    df[new_text_field_name] = df[new_text_field_name].apply(lambda elem: re.sub(r"\d+", "", elem))
    return df
data_clean = clean_text(df_new, 'text', 'text_clean')
data_clean = data_clean.drop('text', axis=1)
data_clean.head()

Unnamed: 0,toxic,text_clean
0,0,kotniski it seems that varsovian embarked...
1,0,vandalism on cheap trick putting back an albu...
2,0,i agree completely that policy debate belong...
3,0,if you look at file stephania jpg now you...
4,0,i think i fixed the grammar although i m ...
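
To see what the two substitutions do to a single string, here is a minimal sketch that applies the same regular expressions without pandas. The helper name `clean_one` and the final whitespace collapsing are additions for illustration and are not part of the notebook.

```python
import re

# same pattern as in clean_text: mentions, non-alphanumeric symbols, links
SYMBOLS = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?")


def clean_one(text):
    """Lowercase one string, replace symbols with spaces and drop digits."""
    lowered = text.lower()
    no_symbols = SYMBOLS.sub(" ", lowered)
    no_digits = re.sub(r"\d+", "", no_symbols)
    # collapsing repeated spaces is an extra step added here for readability
    return " ".join(no_digits.split())


print(clean_one("D'aww! He matches this background colour"))
# d aww he matches this background colour
```

Because symbols are replaced by spaces rather than deleted, contractions such as `I'm` split into separate tokens (`i m`), which is visible in the output above.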


In [11]:
# remove stop words (a set makes the membership test fast)
nltk.download('stopwords')
stop = set(stopwords.words('english'))
data_clean['text_clean'] = data_clean['text_clean'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
data_clean.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,toxic,text_clean
0,0,kotniski seems varsovian embarked mass project...
1,0,vandalism cheap trick putting back album relea...
2,0,agree completely policy debate belongs somewhe...
3,0,look file stephania jpg see originally uploade...
4,0,think fixed grammar although sure missed hami
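
The stop-word filter can be tried on one sentence without downloading the NLTK corpus. This is a minimal sketch: the tiny `STOP_WORDS` set below is a hand-picked stand-in, whereas the notebook uses NLTK's full English list.

```python
# hypothetical mini stop list, for illustration only
STOP_WORDS = {"i", "the", "a", "although", "m", "it", "that"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in stop_words)


print(remove_stop_words("i think i fixed the grammar although i m sure"))
# think fixed grammar sure
```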


The data has been cleaned and sampled and is ready for processing with BERT.

### Processing with BERT

We load a pre-trained BERT model and its tokenizer.

In [12]:
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'unitary/toxic-bert')

In [13]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at unitary/toxic-bert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We tokenize the dataset and make all vectors the same length by padding shorter sequences with token id 0. Then we build an attention mask so that the model ignores the padding: the mask is 1 for real tokens and 0 for padded positions.

In [14]:
from torch.nn.utils.rnn import pad_sequence
tqdm.pandas()

# tokenize each comment, truncating to BERT's limit of 512 tokens
tokenized = data_clean['text_clean'].progress_apply(
    lambda x: tokenizer.encode(x, max_length=512, truncation=True, add_special_tokens=True))

# pad shorter sequences with token id 0 up to the longest sequence
padded = pad_sequence([torch.as_tensor(seq) for seq in tokenized], batch_first=True)

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

100%|██████████| 1000/1000 [00:01<00:00, 674.99it/s]


(1000, 512)
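
The padding and masking step can be illustrated with plain Python lists. This is a sketch under stated assumptions: the function name `pad_and_mask` and the token ids are hypothetical, with 101 and 102 standing in for the special start and end tokens.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token id lists to equal length and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1 if token != pad_id else 0 for token in row] for row in padded]
    return padded, mask


# hypothetical token ids for two short comments
padded, mask = pad_and_mask([[101, 7592, 2088, 102], [101, 2748, 102]])
print(padded)  # [[101, 7592, 2088, 102], [101, 2748, 102, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In the notebook the mask shape is `(1000, 512)`, which suggests that at least one comment reached the 512-token truncation limit.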

In [15]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
    
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)
    embeddings.append(batch_embeddings[0][:,0,:].numpy())

  0%|          | 0/10 [00:00<?, ?it/s]
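
The loop above iterates over `padded.shape[0] // batch_size` batches, which covers every row only when the number of rows is a multiple of the batch size (as with 1000 rows and batches of 100). A minimal sketch of batch boundaries that also include a final partial batch; `batch_bounds` is a hypothetical helper, not part of the notebook:

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) index pairs covering all rows, including a partial last batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


print(list(batch_bounds(1000, 100))[-1])   # (900, 1000)
print(list(batch_bounds(1050, 100))[-1])   # (1000, 1050)
```

Each pair could then be used as `padded[start:end]` and `attention_mask[start:end]` inside the embedding loop.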

We build the feature matrix from the embeddings, select the target and split the data into training and test sets.

In [16]:
features = pd.DataFrame(np.concatenate(embeddings))

In [17]:
target = data_clean['toxic']

In [18]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25)

In [19]:
target_train.value_counts()

0    675
1     75
Name: toxic, dtype: int64
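
With a rare positive class, a plain random split can give the two sets slightly different class ratios. One option, shown here as a sketch on synthetic stand-in data rather than the notebook's embeddings, is to pass `stratify` and `random_state` to `train_test_split` so the split is reproducible and keeps the class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data: 1000 rows, 10% positive class
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345)

# both parts keep the 10% share of the positive class
print(y_train.mean(), y_test.mean())
```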

Since the classes are imbalanced, we address this by increasing the number of objects in the minority class. To do so, we apply the `upsample` function.

In [20]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

In [21]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 9)
target_upsampled.value_counts()

0    675
1    675
Name: toxic, dtype: int64
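
The repeat factor of 9 follows from the class counts: 675 / 75 = 9. A minimal sketch of deriving it from the target instead of hard-coding it; `upsample_repeat` is a hypothetical helper name:

```python
from collections import Counter


def upsample_repeat(target):
    """Return how many times the minority class must be repeated to match the majority."""
    counts = Counter(target)
    majority = max(counts.values())
    minority = min(counts.values())
    return round(majority / minority)


target = [0] * 675 + [1] * 75
print(upsample_repeat(target))  # 9
```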

### Summary

The data preparation consisted of the following steps:
- all required libraries and tools were installed and imported;
- the data was loaded and cleaned of symbols, punctuation, numbers and stop words;
- a random sample of 1000 rows was taken, and each comment was truncated to at most 512 tokens;
- the data was processed with BERT;
- the features and the target were split into training and test sets in a 3:1 ratio;
- the toxic comments in the training set were upsampled to remove the class imbalance.

## Training

We train two gradient boosting models (CatBoost and LightGBM), a Decision tree, a Random forest and a Logistic regression. For each model we use `GridSearchCV()` for hyperparameter search and cross-validation.
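
The cost of a grid search grows with the product of the grid sizes. The sketch below counts candidates and fits for two of the grids used in this section, assuming scikit-learn's default of 5-fold cross-validation when `cv` is not set; `count_fits` is a hypothetical helper.

```python
from itertools import product


def count_fits(param_grid, n_folds=5):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * n_folds


grids = {
    "CatBoost": {"learning_rate": [0.1, 0.2, 0.3], "depth": range(10, 50, 10)},
    "Decision tree": {"max_depth": range(20, 80, 10),
                      "min_samples_leaf": range(1, 10),
                      "min_samples_split": range(2, 10)},
}

for name, grid in grids.items():
    print(name, count_fits(grid))
# CatBoost (12, 60)
# Decision tree (432, 2160)
```

On top of these fits, `GridSearchCV` refits the best candidate once on the full training data.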

### CatBoost

In [22]:
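# CatBoost documents a maximum tree depth of 16, so the depths 20-40 in this grid
# are expected to fail during fitting; effectively only depth=10 is evaluated
parameters_cbc = {'learning_rate': [0.1, 0.2, 0.3],
                  'depth': range(10, 50, 10)
                 }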

In [23]:
%%time

CBC = GridSearchCV(CatBoostClassifier(iterations=100), parameters_cbc, scoring='f1')
CBC.fit(features_upsampled, target_upsampled)
best_model_cbc = CBC.best_estimator_
print(CBC.best_params_)
print('F1 score:', CBC.best_score_)

{'depth': 10, 'learning_rate': 0.1}
F1 score: 0.9873887805271119
Wall time: 45min 26s


### LightGBM

In [26]:
parameters_lgbm = {'learning_rate' : [0.1, 0.2, 0.3]
                  }

In [27]:
%%time

LGBMC = GridSearchCV(LGBMClassifier(), parameters_lgbm, scoring='f1')
LGBMC.fit(features_upsampled, target_upsampled)
best_model_lgbmc = LGBMC.best_estimator_
print(LGBMC.best_params_)
print('F1 score:', LGBMC.best_score_)

{'learning_rate': 0.1}
F1 score: 0.9926631007400673
Wall time: 40.1 s


### Logistic regression

In [22]:
parameters_lr = {'C': np.linspace(0.0001, 100, 20)}

In [23]:
%%time

LR = GridSearchCV(LogisticRegression(), parameters_lr, scoring='f1')
LR.fit(features_upsampled, target_upsampled)
best_model_lr = LR.best_estimator_
print(LR.best_params_)
print('F1 score:', LR.best_score_)

{'C': 10.526405263157894}
F1 score: 0.9904654169360052
Wall time: 21.3 s


### Decision tree

In [28]:
state = np.random.RandomState(12345)

In [29]:
parameters_dtc = {'max_depth': range (20, 80, 10),
                 'min_samples_leaf': range (1,10),
                 'min_samples_split': range(2,10)
                }

In [30]:
%%time

DTC = GridSearchCV(DecisionTreeClassifier(random_state=state), parameters_dtc, scoring='f1')
DTC.fit(features_upsampled, target_upsampled)
best_model_dtc = DTC.best_estimator_
print(DTC.best_params_)
print('F1 score:', DTC.best_score_)

{'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2}
F1 score: 0.9883103165587611
Wall time: 6min 15s


### Random forest

In [31]:
parameters_rfc = {'max_depth': range (20, 80, 10),
                 'n_estimators': [100, 200, 300]
                }

In [32]:
%%time

RFC = GridSearchCV(RandomForestClassifier(random_state=state), parameters_rfc, scoring='f1')
RFC.fit(features_upsampled, target_upsampled)
best_model_rfc = RFC.best_estimator_
print(RFC.best_params_)
print('F1 score:', RFC.best_score_)

{'max_depth': 20, 'n_estimators': 100}
F1 score: 0.9919252117537336
Wall time: 1min 55s


### Summary

- In terms of quality, all models score high on cross-validation. LightGBM (0.993), Random forest (0.992) and Logistic regression (0.990) have an F1 score above 0.99; Decision tree (0.988) and CatBoost (0.987) score slightly lower, so they are not the first choice.
- In terms of time, CatBoost is by far the slowest (over 45 minutes for the grid search), so it is not recommended.
- Logistic regression with hyperparameter `C` ≈ 10.5 is the fastest model (21.3 s), and its F1 score is within 0.003 of the best one. We check this model on the test set and compare its F1 score with a Dummy model.
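
The cross-validated scores above may be optimistic. The training set was upsampled before cross-validation, so copies of the same toxic comment can land in both the training folds and the validation fold. The sketch below illustrates this effect with row ids only, using the class counts from this notebook; the function name, the seed and the simple "first fold is validation" split are assumptions for illustration and do not reproduce `GridSearchCV` exactly.

```python
import random


def leaked_validation_rows(n_neg=675, n_pos=75, repeat=9, n_folds=5, seed=12345):
    """Count validation rows whose original comment also occurs in the training folds."""
    # each row is (original_id, label); positive rows are repeated `repeat` times
    rows = [(i, 0) for i in range(n_neg)]
    rows += [(n_neg + j, 1) for j in range(n_pos)] * repeat
    random.Random(seed).shuffle(rows)

    # treat the first fold as validation and the rest as training
    fold_size = len(rows) // n_folds
    validation, training = rows[:fold_size], rows[fold_size:]
    training_ids = {original_id for original_id, _ in training}
    leaked = [row for row in validation if row[0] in training_ids]
    return len(leaked), len(validation)


leaked, total = leaked_validation_rows()
print(f"{leaked} of {total} validation rows also appear in the training folds")
```

A common way to avoid this is to upsample inside each training fold only, so that validation folds contain original rows exclusively.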

## Testing

In [24]:
%%time

best_model_lr.fit(features_upsampled, target_upsampled)
prediction = best_model_lr.predict(features_test)
score_lr = f1_score(target_test, prediction)
print('F1 score on the test set:', score_lr.round(3))

F1 score on the test set: 0.842
Wall time: 245 ms


In [25]:
dummy = DummyClassifier()
dummy.fit(features_upsampled, target_upsampled)
prediction_clf = dummy.predict(features_test)
score_test = f1_score(target_test, prediction_clf)
print('F1 score of the dummy model on the test set:', score_test.round(3))

F1 score of the dummy model on the test set: 0.0
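
An F1 score of exactly 0.0 means the baseline produced no true positives. That is what happens when a model predicts class 0 for every row, which is an assumption about what `DummyClassifier` did here rather than something the output confirms. A minimal hand-written F1 for the positive class shows the arithmetic; `f1_binary` is a hypothetical helper, not scikit-learn's implementation.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1), returning 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# a constant "never toxic" prediction finds no toxic comments
print(f1_binary([0, 1, 1, 0], [0, 0, 0, 0]))  # 0.0
```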


## Conclusion

To classify comments as positive or negative, we recommend the Logistic regression model with hyperparameter C ≈ 10.5: its cross-validated F1 score on the training set (0.990) is close to the best result, and training and prediction take little time.

On the test set the model reaches an F1 score of 0.842, which meets the project requirement of at least 0.75. The model also passes the sanity check against the dummy baseline, whose F1 score is 0.0.