<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Random Forest</a></span></li><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LightGBM</a></span></li><li><span><a href="#Предсказание-на-тестовой-выборке" data-toc-modified-id="Предсказание-на-тестовой-выборке-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Prediction on the test set</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for «Викишоп»

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labelled for toxicity.

Build a model with an *F1* score of at least 0.75.
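
As a quick reference for the target above, F1 is the harmonic mean of precision and recall on the positive (toxic) class. The sketch below computes it from raw confusion counts. The function name `f1_from_counts` and the counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the formula.
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 3))  # 0.762
```

Because F1 ignores true negatives, a model cannot reach the 0.75 target on this kind of data simply by labelling every comment as non-toxic.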

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. Its *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [None]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score
from sklearn.dummy import DummyClassifier
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re

from tqdm.notebook import tqdm
tqdm.pandas()

lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
df = pd.read_csv("/datasets/toxic_comments.csv")

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [None]:
df = df.drop(['Unnamed: 0'], axis=1)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
display(df['toxic'].value_counts())
display(df['toxic'].value_counts()[0] / df['toxic'].value_counts()[1])

0    143106
1     16186
Name: toxic, dtype: int64

8.841344371679229
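
The roughly 8.8:1 imbalance is the reason the models in the training section are created with `class_weight='balanced'`. As a rough sketch of what that setting does, scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count)`. The snippet below reproduces that arithmetic with the counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# Balanced weight per class: n_samples / (n_classes * class_count).
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(label, round(weight, 3))
```

The ratio of the two weights equals the imbalance ratio, so each toxic comment counts about 8.8 times as much as a non-toxic one in the loss.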

In [None]:
def clear_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = ' '.join(text.split())
    return text

In [None]:
df['text'] = df['text'].progress_apply(clear_text)

  0%|          | 0/159292 [00:00<?, ?it/s]
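
To make the cleaning step concrete, here is the same `clear_text` logic applied to one made-up comment (the sample string is an assumption, not a row from the dataset). Apostrophes, digits and punctuation are replaced by spaces, so contractions such as "I'm" split into separate tokens.

```python
import re


def clear_text(text: str) -> str:
    # Lowercase, replace every non-letter with a space, then collapse whitespace.
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(text.split())


sample = "D'aww! I'm NOT trying to edit-war :) 123"
cleaned = clear_text(sample)
print(cleaned)  # d aww i m not trying to edit war
```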

In [None]:
# Map the POS tag of a word to the matching WordNet constant (defaults to noun)
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV
               }
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
def lemmatize(text):
    text = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(text)]
    return ' '.join(text)

In [None]:
df['text'] = df['text'].progress_apply(lemmatize)

  0%|          | 0/159292 [00:00<?, ?it/s]

In [None]:
df.head(10)

Unnamed: 0,text,toxic
0,explanation why the edits make under my userna...,0
1,d aww he match this background colour i m seem...,0
2,hey man i m really not try to edit war it s ju...,0
3,more i can t make any real suggestion on impro...,0
4,you sir be my hero any chance you remember wha...,0
5,congratulation from me a well use the tool wel...,0
6,cocksucker before you piss around on my work,1
7,your vandalism to the matt shirvington article...,0
8,sorry if the word nonsense be offensive to you...,0
9,alignment on this subject and which be contrar...,0


In [None]:
df.isna().sum()

text     0
toxic    0
dtype: int64

In [None]:
features = df.drop(['toxic'], axis=1)
target = df['toxic']

features_train, features_test, target_train, target_test = train_test_split(features,
                                                                              target,
                                                                              test_size=0.1,
                                                                              random_state=123)
features_train = features_train.text
features_test = features_test.text
print(features_train.shape)
print(features_test.shape)

(143362,)
(15930,)


<div class="alert" style="background-color:#ead7f7;color:#8737bf">
   
Set the test split to 10%.

</div>

In [None]:
#stopwords = set(nltk_stopwords.words('english'))
#count_tf_idf = TfidfVectorizer(stop_words=stopwords)

#features_train_idf = count_tf_idf.fit_transform(features_train)
#features_test_idf = count_tf_idf.transform(features_test)

#print(features_train_idf.shape)
#print(features_test_idf.shape)

(143362, 142521)
(15930, 142521)


## Training

### Logistic Regression

In [None]:
pipeline = Pipeline([("vect", TfidfVectorizer(stop_words='english', sublinear_tf=True)),
                     ("lr", LogisticRegression(random_state=123, class_weight='balanced'))])
param = {'lr__C': ([n for n in range(5, 15, 1)])}
#model = LogisticRegression(random_state=123, class_weight='balanced', max_iter=200, 'solver': ['liblinear', 'lbfgs', 'newton-cg']

In [None]:
%%time

model_lr = GridSearchCV(pipeline,
                          param_grid=param,
                          scoring='f1',
                          cv=3,
                          verbose=3,
                          n_jobs=-1)

model_lr.fit(features_train, target_train)
display(model_lr.best_params_)
display(model_lr.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/3] END ........................................lr__C=5; total time= 1.1min
[CV 2/3] END ........................................lr__C=5; total time= 1.0min
[CV 3/3] END ........................................lr__C=5; total time= 1.0min
[CV 1/3] END ........................................lr__C=6; total time= 1.0min
[CV 2/3] END ........................................lr__C=6; total time= 1.0min
[CV 3/3] END ........................................lr__C=6; total time= 1.0min
[CV 1/3] END ........................................lr__C=7; total time=  58.6s
[CV 2/3] END ........................................lr__C=7; total time= 1.0min
[CV 3/3] END ........................................lr__C=7; total time= 1.0min
[CV 1/3] END ........................................lr__C=8; total time= 1.1min
[CV 2/3] END ........................................lr__C=8; total time= 1.0min
[CV 3/3] END ........................................lr__C=8; total time= 1.1min
[CV 1/3] END ........................................lr__C=9; total time=  59.1s
[CV 2/3] END ........................................lr__C=9; total time= 1.1min
[CV 3/3] END ........................................lr__C=9; total time= 1.0min
[CV 1/3] END .......................................lr__C=10; total time= 1.0min
[CV 2/3] END .......................................lr__C=10; total time= 1.1min
[CV 3/3] END .......................................lr__C=10; total time= 1.1min
[CV 1/3] END .......................................lr__C=11; total time= 1.1min
[CV 2/3] END .......................................lr__C=11; total time= 1.1min
[CV 3/3] END .......................................lr__C=11; total time= 1.1min
[CV 1/3] END .......................................lr__C=12; total time= 1.1min
[CV 2/3] END .......................................lr__C=12; total time= 1.1min
[CV 3/3] END .......................................lr__C=12; total time= 1.0min
[CV 1/3] END .......................................lr__C=13; total time= 1.1min
[CV 2/3] END .......................................lr__C=13; total time= 1.0min
[CV 3/3] END .......................................lr__C=13; total time=  57.5s
[CV 1/3] END .......................................lr__C=14; total time= 1.1min
[CV 2/3] END .......................................lr__C=14; total time= 1.0min
[CV 3/3] END .......................................lr__C=14; total time=  59.7s


{'lr__C': 8}

0.7583355070874354

CPU times: user 17min 6s, sys: 14min 47s, total: 31min 53s
Wall time: 32min 23s
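
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` warning in the log above means the solver hit its iteration cap (100 by default in `LogisticRegression`) before converging. One common remedy, sketched here on a tiny made-up corpus rather than the project data, is to pass a larger `max_iter`. The value 1000, the toy texts and the reduced `C` grid are assumptions for illustration only, and the effect on the scores reported above was not measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): four neutral and four toxic comments.
texts = [
    "thank you for the helpful edit",
    "great article nice work",
    "please add a source for this claim",
    "interesting point about the history section",
    "you are a stupid idiot",
    "shut up you stupid moron",
    "idiot go away nobody likes you",
    "what a pathetic moron",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Same pipeline shape as above, with max_iter raised from the default of 100.
pipeline = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english", sublinear_tf=True)),
    ("lr", LogisticRegression(random_state=123, class_weight="balanced", max_iter=1000)),
])

search = GridSearchCV(pipeline, param_grid={"lr__C": [1, 8]}, scoring="f1", cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Keeping the vectorizer inside the pipeline, as the notebook already does, means TF-IDF statistics are refit on each training fold, so the validation folds do not leak into the vocabulary or IDF weights.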


### Random Forest

In [None]:
pipeline = Pipeline([("vect", TfidfVectorizer(stop_words='english')),
                     ("rfc", RandomForestClassifier(random_state=123, class_weight='balanced'))])
param = {'rfc__max_depth': [n for n in range(6, 19, 3)], 'rfc__n_estimators': [25, 75]}
#model = RandomForestClassifier(random_state=123, class_weight='balanced')

In [None]:
%%time

model_rf = GridSearchCV(pipeline,
                          param_grid=param,
                          scoring='f1',
                          cv=3,
                          verbose=3,
                          n_jobs=-1)

model_rf.fit(features_train, target_train)
display(model_rf.best_params_)
display(model_rf.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END .........rfc__max_depth=6, rfc__n_estimators=25; total time=  10.3s
[CV 2/3] END .........rfc__max_depth=6, rfc__n_estimators=25; total time=  10.8s
[CV 3/3] END .........rfc__max_depth=6, rfc__n_estimators=25; total time=  10.5s
[CV 1/3] END .........rfc__max_depth=6, rfc__n_estimators=75; total time=  12.4s
[CV 2/3] END .........rfc__max_depth=6, rfc__n_estimators=75; total time=  12.4s
[CV 3/3] END .........rfc__max_depth=6, rfc__n_estimators=75; total time=  12.4s
[CV 1/3] END .........rfc__max_depth=9, rfc__n_estimators=25; total time=  10.6s
[CV 2/3] END .........rfc__max_depth=9, rfc__n_estimators=25; total time=  11.1s
[CV 3/3] END .........rfc__max_depth=9, rfc__n_estimators=25; total time=  11.1s
[CV 1/3] END .........rfc__max_depth=9, rfc__n_estimators=75; total time=  13.9s
[CV 2/3] END .........rfc__max_depth=9, rfc__n_estimators=75; total time=  13.7s
[CV 3/3] END .........rfc__max_depth=9, rfc__n_e

{'rfc__max_depth': 18, 'rfc__n_estimators': 75}

0.3806782078335879

CPU times: user 6min 40s, sys: 0 ns, total: 6min 40s
Wall time: 6min 53s


### LightGBM

In [None]:
pipeline = Pipeline([("vect", TfidfVectorizer(stop_words='english')),
                     ("lgbm", LGBMClassifier(random_state=123, class_weight='balanced'))])
param = {'lgbm__n_estimators': [15], 'lgbm__num_leaves': [30, 80], 'lgbm__max_depth' : [15, 45]}
#model = LGBMClassifier(random_state=123, class_weight='balanced', learning_rate=0.1)

In [None]:
%%time

model_lg = GridSearchCV(pipeline,
                          param_grid=param,
                          scoring='f1',
                          cv=3,
                          verbose=3,
                          n_jobs=-1)

model_lg.fit(features_train, target_train)
display(model_lg.best_params_)
display(model_lg.best_score_)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END lgbm__max_depth=15, lgbm__n_estimators=15, lgbm__num_leaves=30; total time= 4.1min
[CV 2/3] END lgbm__max_depth=15, lgbm__n_estimators=15, lgbm__num_leaves=30; total time= 1.8min
[CV 3/3] END lgbm__max_depth=15, lgbm__n_estimators=15, lgbm__num_leaves=30; total time= 2.2min
[CV 1/3] END lgbm__max_depth=15, lgbm__n_estimators=15, lgbm__num_leaves=80; total time= 1.1min
[CV 2/3] END lgbm__max_depth=15, lgbm__n_estimators=15, lgbm__num_leaves=80; total time= 1.5min
[CV 3/3] END lgbm__max_depth=15, lgbm__n_estimators=15, lgbm__num_leaves=80; total time= 3.1min
[CV 1/3] END lgbm__max_depth=45, lgbm__n_estimators=15, lgbm__num_leaves=30; total time= 1.3min
[CV 2/3] END lgbm__max_depth=45, lgbm__n_estimators=15, lgbm__num_leaves=30; total time=  52.2s
[CV 3/3] END lgbm__max_depth=45, lgbm__n_estimators=15, lgbm__num_leaves=30; total time=  57.3s
[CV 1/3] END lgbm__max_depth=45, lgbm__n_estimators=15, lgbm__num_leaves=80;

{'lgbm__max_depth': 45, 'lgbm__n_estimators': 15, 'lgbm__num_leaves': 30}

0.6810377599558581

CPU times: user 22min 50s, sys: 20.9 s, total: 23min 11s
Wall time: 23min 20s


After training, Logistic Regression gave the best cross-validated F1: 0.758.

### Prediction on the test set

In [None]:
predictions_test = model_lr.best_estimator_.predict(features_test)
f1 = f1_score(target_test, predictions_test)
print("LogisticRegression: F1 =", f1)

LogisticRegression: F1 = 0.7560275545350171


In [None]:
dummy_model = DummyClassifier(strategy='uniform', random_state=123)
dummy_model.fit(features_train, target_train)
print("DummyClassifier: F1=", f1_score(target_test, dummy_model.predict(features_test)))

DummyClassifier: F1= 0.17142557211841278
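
The dummy score can be sanity-checked by hand. The `uniform` strategy labels each comment toxic with probability 0.5, so recall is expected to be about 0.5 and precision about the share of toxic comments. The sketch below plugs in the class share of the full dataset, which is an approximation because the exact share in the test split is not shown here.

```python
# Class counts from the full dataset (value_counts() output in the preparation section).
toxic, total = 16186, 159292

prevalence = toxic / total

# Random 50/50 guessing: about half of the toxic comments are caught,
# and the flagged comments are toxic at roughly the base rate.
recall = 0.5
precision = prevalence

expected_f1 = 2 * precision * recall / (precision + recall)
print(round(expected_f1, 3))  # 0.169
```

The estimate is close to the measured 0.171, which suggests the baseline behaves as expected.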


On the test set, logistic regression reaches F1 = 0.756. The model passes the sanity check against the dummy baseline (F1 = 0.171) and meets the project requirement.

## Conclusions

This study covered the following steps:

- Prepared the data for model training.
- Split the data into training and test sets.
- Trained the models and selected the best one.

During training, hyperparameters were tuned with 3-fold cross-validation, and the models achieved the following results:

- Logistic Regression: F1 = 0.76
- Random Forest: F1 = 0.38
- LightGBM: F1 = 0.68

On the test set, logistic regression met the project target: 0.756 > 0.75.

To sanity-check the model, its F1 was compared with the F1 of a Dummy model (0.171). The selected model performs substantially better than the Dummy model.
