<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Обучение</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Выводы</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Чек-лист проверки</a></span></li></ul></div>

# Проект для «Викишоп»

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Обучите модель классифицировать комментарии на позитивные и негативные. В вашем распоряжении набор данных с разметкой о токсичности правок.

Постройте модель со значением метрики качества *F1* не меньше 0.75. 

**Инструкция по выполнению проекта**

1. Загрузите и подготовьте данные.
2. Обучите разные модели. 
3. Сделайте выводы.

Для выполнения проекта применять *BERT* необязательно, но вы можете попробовать.

**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.

## Подготовка

In [1]:
# импорт библиотек
import pandas as pd
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') 
from nltk.corpus import stopwords 
nltk.download('stopwords') 
from sklearn.metrics import f1_score
import torch
import transformers
from scipy.stats import randint
import re
from tqdm import notebook
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
import catboost
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import RandomizedSearchCV
!pip install bayesian-optimization
!pip install scikit-optimize
from bayes_opt import BayesianOptimization, UtilityFunction
from skopt  import BayesSearchCV 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dande\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dande\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




In [2]:
#чтение файлов
try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except:
    df = pd.read_csv('toxic_comments.csv')

In [3]:
df.head()
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


None

В дататсете 2 столбца: текст и оценка токсичности комментария, 159571 строк, пропусков нет, однако в тексте встречаются цифры и знаки препинания.

In [4]:
#почистим данные от цифр и символов
def withsub(st):
    new_s = re.sub(r"[^a-zA-Z ']", ' ', st)
    return new_s
df['text'] = df['text'].apply(withsub)
df['text']

0         Explanation Why the edits made under my userna...
1         D'aww  He matches this background colour I'm s...
2         Hey man  I'm really not trying to edit war  It...
3           More I can't make any real suggestions on im...
4         You  sir  are my hero  Any chance you remember...
                                ...                        
159566          And for the second time of asking  when ...
159567    You should be ashamed of yourself   That is a ...
159568    Spitzer   Umm  theres no actual article for pr...
159569    And it looks like it was actually you who put ...
159570      And     I really don't think you understand ...
Name: text, Length: 159571, dtype: object

In [5]:
#lemmatize
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    tokens = w_tokenizer.tokenize(text)
    lemmas = []
    for w in tokens:
        lemmas.append(lemmatizer.lemmatize(w))   
    return ' '.join(lemmas)


In [6]:
#проверка лемматизатора
test = df['text'].values[0]
print(test)
print(lemmatize_text(test))

Explanation Why the edits made under my username Hardcore Metallica Fan were reverted  They weren't vandalisms  just closure on some GAs after I voted at New York Dolls FAC  And please don't remove the template from the talk page since I'm retired now             
Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They weren't vandalism just closure on some GAs after I voted at New York Dolls FAC And please don't remove the template from the talk page since I'm retired now


Вроде бы работает, можно применть функцию ко всему датасету

In [7]:
df['text'] = df['text'].apply(lemmatize_text)
print(df['text'])

0         Explanation Why the edits made under my userna...
1         D'aww He match this background colour I'm seem...
2         Hey man I'm really not trying to edit war It's...
3         More I can't make any real suggestion on impro...
4         You sir are my hero Any chance you remember wh...
                                ...                        
159566    And for the second time of asking when your vi...
159567    You should be ashamed of yourself That is a ho...
159568    Spitzer Umm there no actual article for prosti...
159569    And it look like it wa actually you who put on...
159570    And I really don't think you understand I came...
Name: text, Length: 159571, dtype: object


In [8]:
#инициализация векторизатора
stop_words = set(stopwords.words('english')) 
count_tf_idf = TfidfVectorizer(stop_words=stop_words, max_features=20000) 

In [9]:
# разбиение выборки на части
target = df['toxic']
features = df['text']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.4, random_state=42, stratify=target)

In [10]:
X_train = count_tf_idf.fit_transform(X_train.values) 
X_train.shape

(95742, 20000)

In [11]:
X_test = count_tf_idf.transform(X_test.values) 
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
X_test.shape

(31915, 20000)

Похоже всё готово к обучению

## Обучение

In [12]:
#подбор гиперпараметров логистической регресии
#parameters = {'C': np.linspace(0.05, 50, 10), 'solver': ['liblinear']}
#grid_search = GridSearchCV(LogisticRegression(), parameters)
#grid_search.fit(X_train, y_train)

#print('best parameters: ', grid_search.best_params_)
#print('best scrores: ', grid_search.best_score_)

In [34]:
#LogisticRegression #C=5.26
model_1 = LogisticRegression(solver='liblinear', C=11.15)
model_1.fit(X_train, y_train)
y_pred = model_1.predict(X_valid)
print(f1_score(y_valid, y_pred))

0.7831649831649832


In [14]:
y_pred = model_1.predict(X_test)
print(f1_score(y_test, y_pred))

0.7709478021978021


Логистическая регрессия в целом справляется с задачей. Подбор гиперпараметров убран в комментарий, т.к. время выполнения этой ячейки очень велико

In [15]:
%time
#RandomForestClassifier
#подбор гиперпараметров
#max_depth
best_result = 0
best_depth = 0
for depth in range(3, 10, 1):
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    score = f1_score(y_valid, y_pred)
    if score > best_result:
        model_2 = model
        best_result = score
        best_depth = depth
print('best_result = ', best_result, 'depth = ', best_depth)

Wall time: 0 ns
best_result =  0.002459268367660621 depth =  9


In [16]:
#RandomForestClassifier
#подбор гиперпараметров
#n_estimators
%time
for est in range(20, 150, 20):
    model = RandomForestClassifier(n_estimators=est, max_depth=best_depth, random_state=12345)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    score = f1_score(y_valid, y_pred)
    if score > best_result:
        model_2 = model
        best_result = score
        best_est = est
    else:
        best_est = 20
print('best_result = ', best_result, 'est = ', best_est)

Wall time: 0 ns
best_result =  0.002459268367660621 est =  20


Случайный лес как будто совсем не подходит для этой задачи, f1-мера очень маленькая

In [17]:
%time
#cat boost
model_cat = CatBoostClassifier(loss_function='Logloss', eval_metric='F1', iterations=300,  early_stopping_rounds=50)
model_cat.fit(X_train, y_train, verbose=30)
y_pred = model_cat.predict(X_valid)
score = f1_score(y_valid, y_pred)
print(score)

Wall time: 0 ns
Learning rate set to 0.217944
0:	learn: 0.4434275	total: 3.95s	remaining: 19m 40s
30:	learn: 0.6210665	total: 1m 1s	remaining: 8m 55s
60:	learn: 0.6824949	total: 1m 54s	remaining: 7m 28s
90:	learn: 0.7168778	total: 2m 46s	remaining: 6m 22s
120:	learn: 0.7347527	total: 3m 40s	remaining: 5m 25s
150:	learn: 0.7500618	total: 4m 31s	remaining: 4m 28s
180:	learn: 0.7641164	total: 5m 23s	remaining: 3m 32s
210:	learn: 0.7711854	total: 6m 15s	remaining: 2m 38s
240:	learn: 0.7777442	total: 7m 7s	remaining: 1m 44s
270:	learn: 0.7851950	total: 8m	remaining: 51.4s
299:	learn: 0.7916866	total: 8m 51s	remaining: 0us
0.7508067407673001


In [18]:
y_pred = model_cat.predict(X_test)
score = f1_score(y_test, y_pred)
print(score)

0.7371283538796228


Модель CatBoost неплохо справляется, но до порогового значения точности не хватает, попробуем подобрать гиперпараметры

In [19]:
%%time

cbc = CatBoostClassifier(eval_metric='F1',iterations=100, 
                              loss_function='Logloss',  early_stopping_rounds=50)

# Creating the hyperparameter grid
param_dist = { "learning_rate": np.linspace(0,0.2,5), "max_depth": randint(3, 10), "eval_metric": ['F1'], "iterations": [100], 
                               "early_stopping_rounds": [50]}
               
#Instantiate RandomSearchCV object
rscv = RandomizedSearchCV(cbc , param_dist, scoring='f1', cv = 3)

#Fit the model
rscv.fit(X_train,y_train)

# Print the tuned parameters and score
print(rscv.best_params_)
print(rscv.best_score_)

Traceback (most recent call last):
  File "C:\Users\dande\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 4921, in fit
    self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2176, in _fit
    train_params = self._prepare_train_params(
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2108, in _prepare_train_params
    _check_train_params(params)
  File "_catboost.pyx", line 5855, in _catboost._check_train_params
  File "_catboost.pyx", line 5874, in _catboost._check_train_params
_catboost.CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/private/libs/options/boosting_options.cpp:79: Learning r

0:	learn: 0.3501965	total: 2.58s	remaining: 4m 15s
1:	learn: 0.3662783	total: 4.91s	remaining: 4m
2:	learn: 0.4455528	total: 7.2s	remaining: 3m 52s
3:	learn: 0.4201498	total: 9.5s	remaining: 3m 48s
4:	learn: 0.4250301	total: 11.8s	remaining: 3m 45s
5:	learn: 0.4225488	total: 14.2s	remaining: 3m 42s
6:	learn: 0.4304010	total: 16.5s	remaining: 3m 39s
7:	learn: 0.4276276	total: 18.8s	remaining: 3m 36s
8:	learn: 0.4381179	total: 21.1s	remaining: 3m 33s
9:	learn: 0.4413023	total: 23.4s	remaining: 3m 30s
10:	learn: 0.4978884	total: 25.7s	remaining: 3m 28s
11:	learn: 0.4696970	total: 28.1s	remaining: 3m 25s
12:	learn: 0.4705334	total: 30.4s	remaining: 3m 23s
13:	learn: 0.4689140	total: 32.8s	remaining: 3m 21s
14:	learn: 0.4724648	total: 35.2s	remaining: 3m 19s
15:	learn: 0.4921273	total: 37.5s	remaining: 3m 17s
16:	learn: 0.4919540	total: 39.9s	remaining: 3m 14s
17:	learn: 0.5132443	total: 42.2s	remaining: 3m 12s
18:	learn: 0.4762347	total: 44.5s	remaining: 3m 9s
19:	learn: 0.5039298	total: 4

58:	learn: 0.5550439	total: 2m 19s	remaining: 1m 37s
59:	learn: 0.5553973	total: 2m 22s	remaining: 1m 34s
60:	learn: 0.5627998	total: 2m 24s	remaining: 1m 32s
61:	learn: 0.5625477	total: 2m 26s	remaining: 1m 29s
62:	learn: 0.5628272	total: 2m 29s	remaining: 1m 27s
63:	learn: 0.5632974	total: 2m 31s	remaining: 1m 25s
64:	learn: 0.5659598	total: 2m 33s	remaining: 1m 22s
65:	learn: 0.5677699	total: 2m 36s	remaining: 1m 20s
66:	learn: 0.5685764	total: 2m 38s	remaining: 1m 17s
67:	learn: 0.5686998	total: 2m 40s	remaining: 1m 15s
68:	learn: 0.5687615	total: 2m 42s	remaining: 1m 13s
69:	learn: 0.5681695	total: 2m 45s	remaining: 1m 10s
70:	learn: 0.5680139	total: 2m 47s	remaining: 1m 8s
71:	learn: 0.5675470	total: 2m 49s	remaining: 1m 5s
72:	learn: 0.5756888	total: 2m 51s	remaining: 1m 3s
73:	learn: 0.5772551	total: 2m 54s	remaining: 1m 1s
74:	learn: 0.5772551	total: 2m 56s	remaining: 58.8s
75:	learn: 0.5775620	total: 2m 58s	remaining: 56.4s
76:	learn: 0.5780222	total: 3m 1s	remaining: 54.1s
7

17:	learn: 0.4988004	total: 10.2s	remaining: 46.3s
18:	learn: 0.5025068	total: 10.7s	remaining: 45.7s
19:	learn: 0.5179337	total: 11.3s	remaining: 45s
20:	learn: 0.5189031	total: 11.8s	remaining: 44.4s
21:	learn: 0.5195127	total: 12.4s	remaining: 43.8s
22:	learn: 0.5053519	total: 12.9s	remaining: 43.2s
23:	learn: 0.5046154	total: 13.5s	remaining: 42.6s
24:	learn: 0.5065999	total: 14s	remaining: 42.1s
25:	learn: 0.5071672	total: 14.6s	remaining: 41.7s
26:	learn: 0.5076188	total: 15.2s	remaining: 41.1s
27:	learn: 0.5075137	total: 15.8s	remaining: 40.6s
28:	learn: 0.5093761	total: 16.3s	remaining: 40s
29:	learn: 0.5112270	total: 16.9s	remaining: 39.4s
30:	learn: 0.5126346	total: 17.4s	remaining: 38.8s
31:	learn: 0.5129135	total: 18s	remaining: 38.3s
32:	learn: 0.5148739	total: 18.5s	remaining: 37.7s
33:	learn: 0.5157133	total: 19.1s	remaining: 37.1s
34:	learn: 0.5195625	total: 19.6s	remaining: 36.5s
35:	learn: 0.5212299	total: 20.2s	remaining: 35.9s
36:	learn: 0.5234401	total: 20.7s	remai

80:	learn: 0.5703150	total: 44.4s	remaining: 10.4s
81:	learn: 0.5705316	total: 45s	remaining: 9.87s
82:	learn: 0.5708099	total: 45.5s	remaining: 9.32s
83:	learn: 0.5712739	total: 46s	remaining: 8.76s
84:	learn: 0.5742959	total: 46.5s	remaining: 8.21s
85:	learn: 0.5792925	total: 47.1s	remaining: 7.66s
86:	learn: 0.5800838	total: 47.6s	remaining: 7.11s
87:	learn: 0.5807837	total: 48.2s	remaining: 6.57s
88:	learn: 0.5868520	total: 48.7s	remaining: 6.02s
89:	learn: 0.5874186	total: 49.3s	remaining: 5.48s
90:	learn: 0.5884739	total: 50s	remaining: 4.94s
91:	learn: 0.5894400	total: 50.7s	remaining: 4.41s
92:	learn: 0.5906427	total: 51.2s	remaining: 3.85s
93:	learn: 0.5914803	total: 51.7s	remaining: 3.3s
94:	learn: 0.5917172	total: 52.3s	remaining: 2.75s
95:	learn: 0.5984752	total: 52.8s	remaining: 2.2s
96:	learn: 0.5989203	total: 53.3s	remaining: 1.65s
97:	learn: 0.6000212	total: 53.9s	remaining: 1.1s
98:	learn: 0.6016484	total: 54.4s	remaining: 549ms
99:	learn: 0.6018802	total: 54.9s	remain

42:	learn: 0.6227121	total: 56.5s	remaining: 1m 14s
43:	learn: 0.6274347	total: 57.8s	remaining: 1m 13s
44:	learn: 0.6285064	total: 59.1s	remaining: 1m 12s
45:	learn: 0.6314268	total: 1m	remaining: 1m 10s
46:	learn: 0.6374191	total: 1m 1s	remaining: 1m 9s
47:	learn: 0.6390399	total: 1m 3s	remaining: 1m 8s
48:	learn: 0.6415481	total: 1m 4s	remaining: 1m 6s
49:	learn: 0.6432952	total: 1m 5s	remaining: 1m 5s
50:	learn: 0.6443446	total: 1m 6s	remaining: 1m 4s
51:	learn: 0.6485607	total: 1m 8s	remaining: 1m 2s
52:	learn: 0.6496543	total: 1m 9s	remaining: 1m 1s
53:	learn: 0.6507566	total: 1m 10s	remaining: 1m
54:	learn: 0.6511675	total: 1m 12s	remaining: 58.9s
55:	learn: 0.6517812	total: 1m 13s	remaining: 57.6s
56:	learn: 0.6521254	total: 1m 14s	remaining: 56.3s
57:	learn: 0.6527383	total: 1m 15s	remaining: 55s
58:	learn: 0.6532184	total: 1m 17s	remaining: 53.6s
59:	learn: 0.6540371	total: 1m 18s	remaining: 52.3s
60:	learn: 0.6547848	total: 1m 19s	remaining: 51s
61:	learn: 0.6568251	total: 1

2:	learn: 0.4104314	total: 4.1s	remaining: 2m 12s
3:	learn: 0.4157863	total: 5.42s	remaining: 2m 10s
4:	learn: 0.4368666	total: 6.79s	remaining: 2m 9s
5:	learn: 0.4678963	total: 8.11s	remaining: 2m 7s
6:	learn: 0.4639381	total: 9.42s	remaining: 2m 5s
7:	learn: 0.4807335	total: 10.7s	remaining: 2m 3s
8:	learn: 0.5025711	total: 12.1s	remaining: 2m 1s
9:	learn: 0.4956382	total: 13.4s	remaining: 2m
10:	learn: 0.5206928	total: 14.7s	remaining: 1m 58s
11:	learn: 0.5215144	total: 16s	remaining: 1m 57s
12:	learn: 0.5364172	total: 17.3s	remaining: 1m 55s
13:	learn: 0.5072217	total: 18.6s	remaining: 1m 54s
14:	learn: 0.5223461	total: 19.9s	remaining: 1m 52s
15:	learn: 0.5414867	total: 21.2s	remaining: 1m 51s
16:	learn: 0.5556652	total: 22.5s	remaining: 1m 49s
17:	learn: 0.5609251	total: 23.8s	remaining: 1m 48s
18:	learn: 0.5731576	total: 25.1s	remaining: 1m 47s
19:	learn: 0.5746680	total: 26.4s	remaining: 1m 45s
20:	learn: 0.5736853	total: 27.7s	remaining: 1m 44s
21:	learn: 0.5755349	total: 29s	

Traceback (most recent call last):
  File "C:\Users\dande\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 4921, in fit
    self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2176, in _fit
    train_params = self._prepare_train_params(
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2108, in _prepare_train_params
    _check_train_params(params)
  File "_catboost.pyx", line 5855, in _catboost._check_train_params
  File "_catboost.pyx", line 5874, in _catboost._check_train_params
_catboost.CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/private/libs/options/boosting_options.cpp:79: Learning r

Traceback (most recent call last):
  File "C:\Users\dande\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 4921, in fit
    self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2176, in _fit
    train_params = self._prepare_train_params(
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2108, in _prepare_train_params
    _check_train_params(params)
  File "_catboost.pyx", line 5855, in _catboost._check_train_params
  File "_catboost.pyx", line 5874, in _catboost._check_train_params
_catboost.CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/private/libs/options/boosting_options.cpp:79: Learning r

0:	learn: 0.3501521	total: 1.39s	remaining: 2m 17s
1:	learn: 0.3645258	total: 2.72s	remaining: 2m 13s
2:	learn: 0.3534472	total: 4.04s	remaining: 2m 10s
3:	learn: 0.4288972	total: 5.35s	remaining: 2m 8s
4:	learn: 0.4346057	total: 6.69s	remaining: 2m 7s
5:	learn: 0.4844075	total: 8.03s	remaining: 2m 5s
6:	learn: 0.4882543	total: 9.35s	remaining: 2m 4s
7:	learn: 0.4476337	total: 10.6s	remaining: 2m 2s
8:	learn: 0.5135411	total: 12s	remaining: 2m 1s
9:	learn: 0.5029680	total: 13.3s	remaining: 1m 59s
10:	learn: 0.5054143	total: 14.6s	remaining: 1m 58s
11:	learn: 0.5068244	total: 15.9s	remaining: 1m 56s
12:	learn: 0.5242806	total: 17.2s	remaining: 1m 55s
13:	learn: 0.5237452	total: 18.5s	remaining: 1m 53s
14:	learn: 0.5143763	total: 19.8s	remaining: 1m 52s
15:	learn: 0.5319624	total: 21.1s	remaining: 1m 50s
16:	learn: 0.5351610	total: 22.5s	remaining: 1m 49s
17:	learn: 0.5365093	total: 23.8s	remaining: 1m 48s
18:	learn: 0.5532382	total: 25.1s	remaining: 1m 46s
19:	learn: 0.5641472	total: 26

60:	learn: 0.6360098	total: 1m 20s	remaining: 51.3s
61:	learn: 0.6353761	total: 1m 21s	remaining: 50s
62:	learn: 0.6404001	total: 1m 22s	remaining: 48.7s
63:	learn: 0.6409629	total: 1m 24s	remaining: 47.4s
64:	learn: 0.6419250	total: 1m 25s	remaining: 46s
65:	learn: 0.6446231	total: 1m 26s	remaining: 44.7s
66:	learn: 0.6495033	total: 1m 28s	remaining: 43.4s
67:	learn: 0.6498429	total: 1m 29s	remaining: 42.1s
68:	learn: 0.6477215	total: 1m 30s	remaining: 40.8s
69:	learn: 0.6514309	total: 1m 32s	remaining: 39.5s
70:	learn: 0.6513604	total: 1m 33s	remaining: 38.1s
71:	learn: 0.6517740	total: 1m 34s	remaining: 36.8s
72:	learn: 0.6531395	total: 1m 36s	remaining: 35.5s
73:	learn: 0.6547727	total: 1m 37s	remaining: 34.3s
74:	learn: 0.6553809	total: 1m 39s	remaining: 33s
75:	learn: 0.6590978	total: 1m 40s	remaining: 31.9s
76:	learn: 0.6591662	total: 1m 42s	remaining: 30.5s
77:	learn: 0.6599056	total: 1m 43s	remaining: 29.2s
78:	learn: 0.6645323	total: 1m 44s	remaining: 27.9s
79:	learn: 0.66620

Traceback (most recent call last):
  File "C:\Users\dande\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 4921, in fit
    self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 2192, in _fit
    self._train(
  File "C:\Users\dande\anaconda3\lib\site-packages\catboost\core.py", line 1619, in _train
    self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
  File "_catboost.pyx", line 4408, in _catboost._CatBoost._train
  File "_catboost.pyx", line 4457, in _catboost._CatBoost._train
_catboost.CatBoostError: bad allocation

Traceback (most recent call last):
  File "C:\Users\dande\anaconda3\lib\site-packa

0:	learn: 0.4434275	total: 2.05s	remaining: 3m 22s
1:	learn: 0.4351011	total: 3.87s	remaining: 3m 9s
2:	learn: 0.4837423	total: 5.66s	remaining: 3m 2s
3:	learn: 0.4909604	total: 7.45s	remaining: 2m 58s
4:	learn: 0.4641856	total: 9.22s	remaining: 2m 55s
5:	learn: 0.4685510	total: 11s	remaining: 2m 52s
6:	learn: 0.5165008	total: 12.8s	remaining: 2m 49s
7:	learn: 0.4962314	total: 14.5s	remaining: 2m 47s
8:	learn: 0.4997343	total: 16.3s	remaining: 2m 44s
9:	learn: 0.5229481	total: 18.1s	remaining: 2m 42s
10:	learn: 0.5261116	total: 19.8s	remaining: 2m 40s
11:	learn: 0.5424702	total: 21.6s	remaining: 2m 38s
12:	learn: 0.5362748	total: 23.3s	remaining: 2m 35s
13:	learn: 0.5558635	total: 25.1s	remaining: 2m 34s
14:	learn: 0.5592321	total: 26.9s	remaining: 2m 32s
15:	learn: 0.5683313	total: 28.6s	remaining: 2m 30s
16:	learn: 0.5593714	total: 30.4s	remaining: 2m 28s
17:	learn: 0.5527337	total: 32.1s	remaining: 2m 26s
18:	learn: 0.5570929	total: 33.8s	remaining: 2m 24s
19:	learn: 0.5580481	total

In [27]:
cbc = CatBoostClassifier(eval_metric='F1',iterations=300, 
                              loss_function='Logloss', learning_rate=0.15, max_depth=6)
cbc.fit(X_train, y_train, verbose=30)
y_pred = model_cat.predict(X_valid)
score = f1_score(y_valid, y_pred)
print(score)

0:	learn: 0.4434275	total: 3.18s	remaining: 15m 50s
30:	learn: 0.5899188	total: 1m 11s	remaining: 10m 21s
60:	learn: 0.6414117	total: 2m 16s	remaining: 8m 54s
90:	learn: 0.6819883	total: 3m 25s	remaining: 7m 51s
120:	learn: 0.7029537	total: 4m 31s	remaining: 6m 41s
150:	learn: 0.7210884	total: 5m 35s	remaining: 5m 31s
180:	learn: 0.7348399	total: 6m 42s	remaining: 4m 24s
210:	learn: 0.7467347	total: 7m 51s	remaining: 3m 18s
240:	learn: 0.7589990	total: 9m 12s	remaining: 2m 15s
270:	learn: 0.7670462	total: 10m 30s	remaining: 1m 7s
299:	learn: 0.7719042	total: 11m 44s	remaining: 0us
0.7508067407673001


In [28]:
y_pred = cbc.predict(X_test)
score = f1_score(y_test, y_pred)
print(score)

0.7273732718894009


После RandomSearch ничего не изменилось, стало даже немного хуже, хотя поиск занял много времени

In [29]:
gbm = lgb.LGBMClassifier()
gbm.fit(X_train, y_train)
predictions = gbm.predict(X_valid, num_iteration=1000)
score = f1_score(y_valid, predictions)
print(score)

0.7544867193108399


In [30]:
y_pred = gbm.predict(X_test)
score = f1_score(y_test, y_pred)
print(score)

0.7446961524631428


LGBM без тюнинга на валидационной выборке немного опережает CatBoost, но может быть станет лучше после подбора гиперпараметров. В качестве метода для подбора я решила попробовать байесовскую оптимизацию.

In [31]:
%%time
#LightGBM
def bayes_parameter_opt_lgb(X, y, init_round=15, opt_round=25, n_folds=3, random_seed=6,
                            n_estimators=10000, output_process=False):
    # prepare data
    train_data = lgb.Dataset(data=X, label=y, free_raw_data=False)
    # parameters
    def lgb_eval(learning_rate,num_leaves, feature_fraction, bagging_fraction, max_depth, max_bin, 
                 min_data_in_leaf,min_sum_hessian_in_leaf,subsample):
        params = {'application':'binary', 'metric':'auc'}
        params['learning_rate'] = max(min(learning_rate, 1), 0)
        params["num_leaves"] = int(round(num_leaves))
        params['feature_fraction'] = max(min(feature_fraction, 1), 0)
        params['bagging_fraction'] = max(min(bagging_fraction, 1), 0)
        params['max_depth'] = int(round(max_depth))
        params['max_bin'] = int(round(max_depth))
        params['min_data_in_leaf'] = int(round(min_data_in_leaf))
        params['min_sum_hessian_in_leaf'] = min_sum_hessian_in_leaf
        params['subsample'] = max(min(subsample, 1), 0)
        
        cv_result = lgb.cv(params, train_data, nfold=n_folds, seed=random_seed, stratified=True, verbose_eval =200, 
                           metrics=['auc'])
        return max(cv_result['auc-mean'])
     
    lgbBO = BayesianOptimization(lgb_eval, {'learning_rate': (0.01, 1.0),
                                            'num_leaves': (24, 80),
                                            'feature_fraction': (0.1, 0.9),
                                            'bagging_fraction': (0.8, 1),
                                            'max_depth': (5, 30),
                                            'max_bin':(20,90),
                                            'min_data_in_leaf': (20, 80),
                                            'min_sum_hessian_in_leaf':(0,100),
                                           'subsample': (0.01, 1.0)}, random_state=200)

    
    #n_iter: How many steps of bayesian optimization you want to perform. The more steps the more likely to find 
    #a good maximum you are.
    #init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying
    #the exploration space.
    
    lgbBO.maximize(init_points=init_round, n_iter=opt_round)
    
    model_auc=[]
    for model in range(len(lgbBO.res)):
        model_auc.append(lgbBO.res[model]['target'])
    
    # return best parameters
    return lgbBO.res[pd.Series(model_auc).idxmax()]['target'],lgbBO.res[pd.Series(model_auc).idxmax()]['params']

opt_params = bayes_parameter_opt_lgb(X_train, y_train, init_round=5, opt_round=10, n_folds=3, random_seed=6,n_estimators=10000)

|   iter    |  target   | baggin... | featur... | learni... |  max_bin  | max_depth | min_da... | min_su... | num_le... | subsample |
-------------------------------------------------------------------------------------------------------------------------------------
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 170425
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 9584
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 170425
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 9584
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Tota



[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 1       [0m | [0m 0.8845  [0m | [0m 0.9895  [0m | [0m 0.2812  [0m | [0m 0.5985  [0m | [0m 49.98   [0m | [0m 24.1    [0m | [0m 20.17   [0m | [0m 35.74   [0m | [0m 74.94   [0m | [0m 0.4615  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 54080
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4160
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 54080
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4160


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 54080
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4160




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [95m 2       [0m | [95m 0.9063  [0m | [95m 0.9964  [0m | [95m 0.7939  [0m | [95m 0.9862  [0m | [95m 84.63   [0m | [95m 12.59   [0m | [95m 70.77   [0m | [95m 12.12   [0m | [95m 67.99   [0m | [95m 0.258   [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 129218
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4973
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 129218
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4973


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 129218
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4973




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 3       [0m | [0m 0.8618  [0m | [0m 0.8192  [0m | [0m 0.8548  [0m | [0m 0.8278  [0m | [0m 56.28   [0m | [0m 26.84   [0m | [0m 54.7    [0m | [0m 45.01   [0m | [0m 62.09   [0m | [0m 0.4252  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 110321
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4627
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 110321
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4627


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 110321
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4627




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
| [0m 4       [0m | [0m 0.9009  [0m | [0m 0.9281  [0m | [0m 0.5869  [0m | [0m 0.1144  [0m | [0m 87.62   [0m | [0m 23.97   [0m | [0m 60.78   [0m | [0m 32.93   [0m | [0m 25.48   [0m | [0m 0.8056  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 56460
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 5646
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the 



[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 5       [0m | [0m 0.849   [0m | [0m 0.9946  [0m | [0m 0.3264  [0m | [0m 0.6526  [0m | [0m 38.59   [0m | [0m 9.692   [0m | [0m 45.14   [0m | [0m 66.6    [0m | [0m 52.98   [0m | [0m 0.8559  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 48672
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4056
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 48672
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4056


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 48672
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4056




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701


| [95m 6       [0m | [95m 0.9367  [0m | [95m 1.0     [0m | [95m 0.7296  [0m | [95m 0.7464  [0m | [95m 90.0    [0m | [95m 11.76   [0m | [95m 73.6    [0m | [95m 0.0     [0m | [95m 36.63   [0m | [95m 0.4786  [0m |


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 60648
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4332
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 60648
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4332
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 60648
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4332




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 7       [0m | [0m 0.9324  [0m | [0m 0.9216  [0m | [0m 0.2936  [0m | [0m 0.141   [0m | [0m 85.6    [0m | [0m 13.98   [0m | [0m 66.63   [0m | [0m 13.1    [0m | [0m 70.1    [0m | [0m 0.9925  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 50336
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4576
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 50336
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4576


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 50336
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4576




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 8       [0m | [0m 0.9351  [0m | [0m 0.883   [0m | [0m 0.4122  [0m | [0m 0.2459  [0m | [0m 81.03   [0m | [0m 11.23   [0m | [0m 61.89   [0m | [0m 11.57   [0m | [0m 73.14   [0m | [0m 0.9497  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 88308
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4906
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 88308
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4906


[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 88308
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4906




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701


| [0m 9       [0m | [0m 0.907   [0m | [0m 0.8     [0m | [0m 0.1     [0m | [0m 0.01    [0m | [0m 87.9    [0m | [0m 17.78   [0m | [0m 55.93   [0m | [0m 18.55   [0m | [0m 69.77   [0m | [0m 1.0     [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36248
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4531
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36248
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4531
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 36248
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4531




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 10      [0m | [0m 0.9234  [0m | [0m 0.9777  [0m | [0m 0.5503  [0m | [0m 0.1127  [0m | [0m 87.46   [0m | [0m 7.71    [0m | [0m 63.09   [0m | [0m 13.76   [0m | [0m 78.5    [0m | [0m 0.4     [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 28784
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4112
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 28784
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4112
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 28784
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4112




[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 11      [0m | [0m 0.9193  [0m | [0m 0.9673  [0m | [0m 0.1272  [0m | [0m 0.01253 [0m | [0m 83.12   [0m | [0m 6.602   [0m | [0m 72.19   [0m | [0m 0.0     [0m | [0m 28.45   [0m | [0m 0.991   [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 89520
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4476
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 89520
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4476
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[Light



[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
| [0m 12      [0m | [0m 0.9273  [0m | [0m 0.8702  [0m | [0m 0.1     [0m | [0m 0.01    [0m | [0m 84.88   [0m | [0m 20.36   [0m | [0m 63.79   [0m | [0m 6.851   [0m | [0m 77.99   [0m | [0m 1.0     [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 89286
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 3882
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_col_wise=true` to remove the 



[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
| [0m 13      [0m | [0m 0.9057  [0m | [0m 0.803   [0m | [0m 0.9     [0m | [0m 0.9194  [0m | [0m 90.0    [0m | [0m 22.66   [0m | [0m 79.24   [0m | [0m 0.0     [0m | [0m 32.04   [0m | [0m 0.9748  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 68608
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4288
[LightGBM] [Info] Number of positive: 6490, number of 



[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [95m 14      [0m | [95m 0.9416  [0m | [95m 0.8122  [0m | [95m 0.5114  [0m | [95m 0.4233  [0m | [95m 87.67   [0m | [95m 15.71   [0m | [95m 68.32   [0m | [95m 6.974   [0m | [95m 43.2    [0m | [95m 0.4561  [0m |
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36608
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4576
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 36608
[LightGBM] [Info] Number of data points in the train set: 63828, number of used features: 4576
[LightGBM] [Info] Number of positive: 6490, number of negative: 57338
You can set `force_row_wise=true` to remove the overh



[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101680 -> initscore=-2.178701
[LightGBM] [Info] Start training from score -2.178701




| [0m 15      [0m | [0m 0.9235  [0m | [0m 0.8285  [0m | [0m 0.1175  [0m | [0m 0.01    [0m | [0m 90.0    [0m | [0m 8.088   [0m | [0m 62.08   [0m | [0m 0.0     [0m | [0m 42.97   [0m | [0m 1.0     [0m |
Wall time: 6min 30s


In [32]:
opt_params[1]["num_leaves"] = int(round(opt_params[1]["num_leaves"]))
opt_params[1]['max_depth'] = int(round(opt_params[1]['max_depth']))
opt_params[1]['min_data_in_leaf'] = int(round(opt_params[1]['min_data_in_leaf']))
opt_params[1]['max_bin'] = int(round(opt_params[1]['max_bin']))
opt_params[1]['objective']='binary'
opt_params[1]['metric']='auc'
opt_params[1]['is_unbalance']=True
opt_params[1]['boost_from_average']=False
opt_params=opt_params[1]
opt_params

{'bagging_fraction': 0.8122414528726879,
 'feature_fraction': 0.5113856611559656,
 'learning_rate': 0.4232585793073392,
 'max_bin': 88,
 'max_depth': 16,
 'min_data_in_leaf': 68,
 'min_sum_hessian_in_leaf': 6.973665921734473,
 'num_leaves': 43,
 'subsample': 0.45613013986179224,
 'objective': 'binary',
 'metric': 'auc',
 'is_unbalance': True,
 'boost_from_average': False}

In [33]:
gbm = lgb.LGBMClassifier(**opt_params)
gbm.fit(X_train, y_train)
predictions = gbm.predict(X_valid, num_iteration=1000)
score = f1_score(y_valid, predictions)
print(score)

0.7163572299769616


Несмотря на оптимизацию параметров, LGBM не справляется с заданной планкой.

## Обучение №2

In [42]:
#инициализация векторизатора
stop_words = set(stopwords.words('english')) 
count_tf_idf = TfidfVectorizer(stop_words=stop_words, max_features=100000) 

In [43]:
# разбиение выборки на части
target = df['toxic']
features = df['text']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.4, random_state=42, stratify=target)

In [44]:
X_train = count_tf_idf.fit_transform(X_train.values) 
X_train.shape

(95742, 100000)

In [45]:
X_test = count_tf_idf.transform(X_test.values) 
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
X_test.shape

(31915, 100000)

In [46]:
#LogisticRegression #C=5.26
model_1 = LogisticRegression(solver='liblinear', C=11.15)
model_1.fit(X_train, y_train)
y_pred = model_1.predict(X_valid)
print(f1_score(y_valid, y_pred))

0.7788902087222129


In [47]:
y_pred = model_1.predict(X_test)
print(f1_score(y_test, y_pred))

0.7680027763317716


In [48]:
gbm = lgb.LGBMClassifier()
gbm.fit(X_train, y_train)
predictions = gbm.predict(X_valid, num_iteration=1000)
score = f1_score(y_valid, predictions)
print(score)

0.7543985637342909


In [49]:
y_pred = gbm.predict(X_test)
score = f1_score(y_test, y_pred)
print(score)

0.7417146974063399


Вывод: результаты стали не лучше, а даже немного хуже. Это скорее всего связяно с переизбытком информации при обучении. В любом случае, нужная модель найдена и это логистическая регрессия.

## Выводы

Natural language processing оказалась принципиально новой задачей, потребовавшей много времени и сил. В ней важную роль играет этап превращения текста в признаки, понятные моделям. В данном проекте это реализовано с помощью TF-IDF, в ходе работы стало понятно, что имеет значение не только тип векторизотора, но и количество признаков, которое он генерирует. При слишком большом и слишком маленьком количестве признаков страдает качество обучения моделей. При этом лучшую метрику качества f1 показала модель логистической регрессии, возможно потому что ей проще всего работать с векторизированными признаками.