# Машинное обучение для текстов: классификация комментариев

Заказчику - некому интернет-магазину - для оптимизации модерирования комментариев необходима модель, которая будет определять, токсичный ли комментарий оставил пользователь, или нет. 

Цель - обучить модель, которая решала бы задачу классификации, разделяя комментарии на позитивные и негативные. Для обучения модели есть размеченный набор комментариев. 

Заказчик требует от модели значение метрики *F1* не меньше 0.75. 

Столбец *text* в исходном датафрейме содержит текст комментария, *toxic* — указание, токсичный коментарий, или нет, целевой признак.

# 1. Подготовка

In [1]:
import pandas as pd
import numpy as np
import copy
import math
import nltk

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import f1_score



In [2]:
import re

In [None]:
%%time
pip install catboost

In [None]:
from catboost import CatBoostClassifier

In [3]:
%%time

nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


Wall time: 11.8 s


True

In [4]:
%%time

from nltk.stem import WordNetLemmatizer 
#from nltk.corpus import wordnet

Wall time: 0 ns


In [5]:
state = 12345

In [8]:
data = pd.read_csv('...')

In [9]:
data.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [11]:
len(data[data.text.isna()==True])

0

In [12]:
len(data[data.toxic.isna()==True])

0

In [13]:
len(data[data.text.duplicated()==True])

0

## Приведение к нижнему регистру

Приведем классифицируемый текст к нижнему регистру, это облегчит дальнейшую лемматизацию и обучение модели.

In [14]:
data['original'] = data.text

In [15]:
data.text = data.text.str.lower()

In [16]:
data.head(10)

Unnamed: 0,text,toxic,original
0,explanation\nwhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...
1,d'aww! he matches this background colour i'm s...,0,D'aww! He matches this background colour I'm s...
2,"hey man, i'm really not trying to edit war. it...",0,"Hey man, I'm really not trying to edit war. It..."
3,"""\nmore\ni can't make any real suggestions on ...",0,"""\nMore\nI can't make any real suggestions on ..."
4,"you, sir, are my hero. any chance you remember...",0,"You, sir, are my hero. Any chance you remember..."
5,"""\n\ncongratulations from me as well, use the ...",0,"""\n\nCongratulations from me as well, use the ..."
6,cocksucker before you piss around on my work,1,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
7,your vandalism to the matt shirvington article...,0,Your vandalism to the Matt Shirvington article...
8,sorry if the word 'nonsense' was offensive to ...,0,Sorry if the word 'nonsense' was offensive to ...
9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...


## Очистка корпуса

Отчистим текст от лишних знаков, оставим только буквы (в данном случае - английского алфавита) и пробелы. Мы привели текст к нижнему регистру, поэтому достаточно было бы указать только диапазон знаков a-z и пробел, однако на всякий случай оставим также диапазон A-Z.  
Для этого определим функцию и проверим ее работу на сэмпле.

In [18]:
def text_clearing(text):
    space_text = re.sub(r'[^a-zA-Z ]',' ', text)
    cleared_text = " ".join(space_text.split())
    return cleared_text

Применим функцию очистки текста к корпусу.

In [24]:
%%time
clear_text = data['text'].apply(text_clearing)

Wall time: 1.52 s


In [25]:
data2 = copy.deepcopy(data)
data2['clear_text'] = clear_text

## Токенизация корпуса

Произведем токенизацию очищенного корпуса с помощью возможностей nltk.

In [27]:

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [31]:
%%time
tokenized_text = clear_text.apply(
    lambda x: nltk.word_tokenize(x))

Wall time: 30.5 s


In [32]:
tokenized_text

0         [explanation, why, the, edits, made, under, my...
1         [d, aww, he, matches, this, background, colour...
2         [hey, man, i, m, really, not, trying, to, edit...
3         [more, i, can, t, make, any, real, suggestions...
4         [you, sir, are, my, hero, any, chance, you, re...
                                ...                        
159566    [and, for, the, second, time, of, asking, when...
159567    [you, should, be, ashamed, of, yourself, that,...
159568    [spitzer, umm, theres, no, actual, article, fo...
159569    [and, it, looks, like, it, was, actually, you,...
159570    [and, i, really, don, t, think, you, understan...
Name: text, Length: 159571, dtype: object

In [33]:
data2['tokenized_text'] = tokenized_text
data2.head(10)

Unnamed: 0,text,toxic,original,clear_text,tokenized_text
0,explanation\nwhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...,explanation why the edits made under my userna...,"[explanation, why, the, edits, made, under, my..."
1,d'aww! he matches this background colour i'm s...,0,D'aww! He matches this background colour I'm s...,d aww he matches this background colour i m se...,"[d, aww, he, matches, this, background, colour..."
2,"hey man, i'm really not trying to edit war. it...",0,"Hey man, I'm really not trying to edit war. It...",hey man i m really not trying to edit war it s...,"[hey, man, i, m, really, not, trying, to, edit..."
3,"""\nmore\ni can't make any real suggestions on ...",0,"""\nMore\nI can't make any real suggestions on ...",more i can t make any real suggestions on impr...,"[more, i, can, t, make, any, real, suggestions..."
4,"you, sir, are my hero. any chance you remember...",0,"You, sir, are my hero. Any chance you remember...",you sir are my hero any chance you remember wh...,"[you, sir, are, my, hero, any, chance, you, re..."
5,"""\n\ncongratulations from me as well, use the ...",0,"""\n\nCongratulations from me as well, use the ...",congratulations from me as well use the tools ...,"[congratulations, from, me, as, well, use, the..."
6,cocksucker before you piss around on my work,1,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,cocksucker before you piss around on my work,"[cocksucker, before, you, piss, around, on, my..."
7,your vandalism to the matt shirvington article...,0,Your vandalism to the Matt Shirvington article...,your vandalism to the matt shirvington article...,"[your, vandalism, to, the, matt, shirvington, ..."
8,sorry if the word 'nonsense' was offensive to ...,0,Sorry if the word 'nonsense' was offensive to ...,sorry if the word nonsense was offensive to yo...,"[sorry, if, the, word, nonsense, was, offensiv..."
9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...,alignment on this subject and which are contra...,"[alignment, on, this, subject, and, which, are..."


## Лемматизация

Напишем функцию, которая бы производила лемматизацию текста с помощьюю nltk WorldNetLemmatizer.

In [36]:
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    lemm_text = " ".join([lemmatizer.lemmatize(w) for w in text])
    return lemm_text

In [41]:
%%time
lemm_text = tokenized_text.apply(lemmatize)

Wall time: 23.3 s


In [42]:
lemm_text

0         explanation why the edits made under my userna...
1         d aww he match this background colour i m seem...
2         hey man i m really not trying to edit war it s...
3         more i can t make any real suggestion on impro...
4         you sir are my hero any chance you remember wh...
                                ...                        
159566    and for the second time of asking when your vi...
159567    you should be ashamed of yourself that is a ho...
159568    spitzer umm there no actual article for prosti...
159569    and it look like it wa actually you who put on...
159570    and i really don t think you understand i came...
Name: text, Length: 159571, dtype: object

Обратим внимание на то, что несмотря на произведенную лемматизацию многие слова не были приведены в начальную форму. Причина этого в том, что не проставлены Part-of-Speech теги, из-за этого лемматайзер плохо разбирает части речи и, соответственно, не знает, какая у слова начальная форма. Исправим это.

In [43]:
%%time
import nltk
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...


Wall time: 1.22 s


[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [44]:
from nltk.corpus import wordnet

In [45]:

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_POS(text):
    lemmatizer = WordNetLemmatizer()
    lemm_POS_text = " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in text])
    return lemm_POS_text

In [48]:
%%time

lemm_text2 = tokenized_text.apply(lemmatize_POS)

Wall time: 47min 8s


Невероятно долгая лемматизация. Посмотрим, насколько лучше стал лемматизирован текст.

In [176]:
data2['lemm_text'] = lemm_text

In [177]:
data2['lemm_text_POS'] = lemm_text2

In [179]:
data2.loc[100:120,['original','lemm_text','lemm_text_POS']]

Unnamed: 0,original,lemm_text,lemm_text_POS
100,"However, the Moonlite edit noted by golden dap...",however the moonlite edit noted by golden daph...,however the moonlite edit note by golden daph ...
101,Check the following websites:\n\nhttp://www.ir...,check the following website http www iranchamb...,check the follow website http www iranchamber ...
102,i can't believe no one has already put up this...,i can t believe no one ha already put up this ...,i can t believe no one have already put up thi...
103,"""\n\nWell, after I asked you to provide the di...",well after i asked you to provide the diffs wi...,well after i ask you to provide the diffs with...
104,What page shoudld there be for important chara...,what page shoudld there be for important chara...,what page shoudld there be for important chara...
105,A pair of jew-hating weiner nazi schmucks.,a pair of jew hating weiner nazi schmuck,a pair of jew hat weiner nazi schmuck
106,I tend to think that when the list is longer t...,i tend to think that when the list is longer t...,i tend to think that when the list be longer t...
107,"""\n\n What's up with this? \n""""If you are a re...",what s up with this if you are a religiously o...,what s up with this if you be a religiously or...
108,I'm not vandalizing \n\nI'm just having fun m...,i m not vandalizing i m just having fun man yo...,i m not vandalize i m just have fun man you ha...
109,Welcome to Wikipedia ! [bla] Discover Ekopedia...,welcome to wikipedia bla discover ekopedia the...,welcome to wikipedia bla discover ekopedia the...


Положа руку на сердце, заметим, что набор лемматизированных текстов стал несильно лучше. Но, как минимум, почти все глаголы встали в инфинитив. 

## Векторизация текстов

Перед тем, как приступить к обучению моделей и выбору лучшей, разобьем массив текстов на обучающий корпус, валидационную и тестовую выборки и векторизируем их. 

In [50]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Пользователь\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [110]:
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.7, stop_words=stopwords)

In [111]:
lemm_text_train_valid, lemm_text_test, target_train_valid, target_test = train_test_split(lemm_text2, data.toxic, 
                                                                                          random_state = state, 
                                                                                          test_size = 0.25)
lemm_text_train, lemm_text_valid, target_train, target_valid = train_test_split(lemm_text_train_valid,
                                                                                          target_train_valid, 
                                                                                          random_state = state, 
                                                                                          test_size = 0.3)

In [152]:
lemm_text_train_valid_noPOS, lemm_text_test_noPOS, target_train_valid_noPOS, target_test_noPOS = train_test_split(lemm_text, data.toxic, 
                                                                                          random_state = state, 
                                                                                          test_size = 0.25)
lemm_text_train_noPOS, lemm_text_valid_noPOS, target_train_noPOS, target_valid_noPOS = train_test_split(lemm_text_train_valid_noPOS,
                                                                                          target_train_valid_noPOS, 
                                                                                          random_state = state, 
                                                                                          test_size = 0.3)

In [112]:
%%time
corpus_train = lemm_text_train.values.astype('U')
corpus_valid = lemm_text_valid.values.astype('U')
corpus_test = lemm_text_test.values.astype('U')

Wall time: 1.19 s


In [113]:
%%time
features_train = count_tf_idf.fit_transform(corpus_train)

Wall time: 3.43 s


In [114]:
%%time
features_valid = count_tf_idf.transform(corpus_valid)


Wall time: 1.39 s


In [115]:
%%time
features_test = count_tf_idf.transform(corpus_test)


Wall time: 1.53 s


In [116]:
features_train = features_train.toarray()

In [117]:
features_valid = features_valid.toarray()
features_test = features_test.toarray()

Основные корпусы текстов готовы. Однако мы проделаем то же самое и с текстом, лемматизированным без POS-меток, чтобы в дальнейшем сравнить качество не только алгоритмов, но и способов лемматизации.

In [153]:
%%time
corpus_train_noPOS = lemm_text_train_noPOS.values.astype('U')
corpus_valid_noPOS = lemm_text_valid_noPOS.values.astype('U')
corpus_test_noPOS = lemm_text_test_noPOS.values.astype('U')

count_tf_idf_noPOS = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.7, stop_words=stopwords)

features_train_noPOS = count_tf_idf_noPOS.fit_transform(corpus_train).toarray()
features_valid_noPOS = count_tf_idf_noPOS.transform(corpus_valid).toarray()
features_test_noPOS = count_tf_idf_noPOS.transform(corpus_test).toarray()



Wall time: 11.7 s


# 2. Обучение

## Логистическая регрессия

In [118]:
log_reg = LogisticRegression()

In [119]:
%%time
log_reg.fit(features_train, target_train)

Wall time: 24.6 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [120]:
%%time
log_reg_predicted_valid = log_reg.predict(features_valid)

Wall time: 251 ms


In [121]:
log_reg_f1 = f1_score(target_valid, log_reg_predicted_valid)
log_reg_f1

0.7325484994196652

Логистическая регрессия продемонстировала неплохой результат на лемматизированных с POS-метками текстах, однако он все-таки неудовлетворительный. 

In [154]:
log_reg_noPOS = LogisticRegression()
log_reg_noPOS.fit(features_train_noPOS, target_train_noPOS)
log_reg_predicted_valid_noPOS = log_reg_noPOS.predict(features_valid_noPOS)
log_reg_f1_noPOS = f1_score(target_valid_noPOS, log_reg_predicted_valid_noPOS)
log_reg_f1_noPOS

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.7317397078353254

Логистическая регрессия на данных без POS-меток продемонстрировала близкий, хоть и чуть худший результат.

## Случайный лес

In [136]:
%%time
def forest_search(features_train, target_train, features_valid, target_valid):
    forest_depth_col = []
    forest_estim_col = []
    forest_score_col = []
    for depth in [12, 30]:
        for n_estim in [120, 200]:
            forest_model = RandomForestClassifier(n_estimators = n_estim, 
                                                   max_depth = depth, 
                                                   random_state = state)
            forest_model.fit(features_train, target_train)
            forest_predicted_valid = forest_model.predict(features_valid)
            forest_score = f1_score(target_valid, forest_predicted_valid)
            forest_depth_col.append(depth)
            forest_estim_col.append(n_estim)
            forest_score_col.append(forest_score)
    forest_hyperparameters_dict = {'max_depth': forest_depth_col, 'n_estimators': forest_estim_col, 'f1_score': forest_score_col}
    forest_hyperparameters = pd.DataFrame(data = forest_hyperparameters_dict)
    return forest_hyperparameters


Wall time: 0 ns


In [137]:
%%time
forest_results = forest_search(features_train, target_train, features_valid, target_valid)
forest_results

Wall time: 4min 17s


Unnamed: 0,max_depth,n_estimators,f1_score
0,30,200,0.364735


Случайный лес продемонстрировал очень низкий показатель, значительно ниже, чем логистическая регрессия.

## CatBoost

Попробуем применить алгоритм градиентного бустинга. Для начала - пробный вариант.

In [130]:
%%time
catbo = CatBoostClassifier(loss_function="Logloss", iterations=120)
catbo.fit(features_train, target_train, verbose=10)

Learning rate set to 0.476972
0:	learn: 0.3443652	total: 660ms	remaining: 1m 18s
10:	learn: 0.1846563	total: 2.48s	remaining: 24.6s
20:	learn: 0.1646103	total: 4.05s	remaining: 19.1s
30:	learn: 0.1499102	total: 5.61s	remaining: 16.1s
40:	learn: 0.1411258	total: 7.16s	remaining: 13.8s
50:	learn: 0.1345853	total: 8.74s	remaining: 11.8s
60:	learn: 0.1287429	total: 10.3s	remaining: 9.99s
70:	learn: 0.1249347	total: 11.9s	remaining: 8.19s
80:	learn: 0.1217940	total: 13.5s	remaining: 6.49s
90:	learn: 0.1183494	total: 15s	remaining: 4.79s
100:	learn: 0.1148458	total: 16.6s	remaining: 3.13s
110:	learn: 0.1121057	total: 18.2s	remaining: 1.48s
119:	learn: 0.1102410	total: 19.6s	remaining: 0us
Wall time: 58 s


<catboost.core.CatBoostClassifier at 0x234a89c1988>

In [131]:
%%time
catbo_predicted_valid = catbo.predict(features_valid)
catbo_f1 = f1_score(target_valid, catbo_predicted_valid)
catbo_f1

Wall time: 16.7 s


0.7394796016704143

Результат довольно близок к требуемому. Благо, мы можем попробовать другие гиперпараметры, чтобы преодолеть требуемый порог качества.

In [169]:
%%time
def CatBoost_search(features_train, target_train, features_valid, target_valid):
    learning_rate_col = []
    iterations_col = []
    f1_score_col = []
    l2_reg_col = []
    for learn_rate in [0.6,0.7]:
        for l2_leaf_reg in [8,12]:
            for iterations in [190, 200]:
                catboost_model = CatBoostClassifier(loss_function = 'Logloss',
                                                       iterations = iterations,
                                                       learning_rate = learn_rate, 
                                                    l2_leaf_reg = l2_leaf_reg,
                                                       random_state = state, 
                                                       verbose = False)
                catboost_model.fit(features_train, 
                                       target_train, 
                                       verbose=False)
                catboost_predicted_valid = catboost_model.predict(features_valid)
                catbo_score = f1_score(target_valid, catboost_predicted_valid)
                l2_reg_col.append(l2_leaf_reg)
                learning_rate_col.append(learn_rate)
                iterations_col.append(iterations)
                f1_score_col.append(catbo_score)
    catboost_hyperparameters_dict = {'learning_rate': learning_rate_col, 
                                     'iterations': iterations_col, 
                                     'l2_leaf_reg': l2_reg_col,
                                     'f1_score': f1_score_col}
    catboost_hyperparameters = pd.DataFrame(data = catboost_hyperparameters_dict)
    return catboost_hyperparameters

Wall time: 0 ns


In [170]:
%%time
catbo_results = CatBoost_search(features_train, target_train, features_valid, target_valid)
catbo_results.sort_values(by = 'f1_score', ascending = False)

Wall time: 12min 59s


Unnamed: 0,learning_rate,iterations,l2_leaf_reg,f1_score
7,0.7,200,12,0.755424
0,0.6,190,8,0.755381
6,0.7,190,12,0.754528
1,0.6,200,8,0.75361
3,0.6,200,12,0.749058
5,0.7,200,8,0.748985
2,0.6,190,12,0.748901
4,0.7,190,8,0.74828


In [171]:
catbo_best = catbo_results.sort_values(by = 'f1_score', ascending = False).head(1)
catbo_best

Unnamed: 0,learning_rate,iterations,l2_leaf_reg,f1_score
7,0.7,200,12,0.755424


In [172]:
catbo_best_rate = float(catbo_best.learning_rate)
catbo_best_iterations = int(catbo_best.iterations)
catbo_best_reg = int(catbo_best.l2_leaf_reg)

In [181]:
%%time
catbo_results_noPOS = CatBoost_search(features_train_noPOS, target_train_noPOS, features_valid_noPOS, target_valid_noPOS)
catbo_results_noPOS

Wall time: 14min 38s


Unnamed: 0,learning_rate,iterations,l2_leaf_reg,f1_score
0,0.6,190,8,0.7455
1,0.6,200,8,0.74625
2,0.6,190,12,0.750988
3,0.6,200,12,0.750672
4,0.7,190,8,0.750705
5,0.7,200,8,0.750352
6,0.7,190,12,0.752225
7,0.7,200,12,0.752225


In [182]:
catbo_results_noPOS.sort_values(by = 'f1_score', ascending = False).head(1)

Unnamed: 0,learning_rate,iterations,l2_leaf_reg,f1_score
6,0.7,190,12,0.752225


Модели на лемматизированных без меток данных показали удовлетворительное качество на валидационной выборке, уступая моделям на данных с POS-тегами самую малость. Впрочем, именно этой небольшой разницы может не хватить для удовлетворительного результата на тестовой выборке (на ней качество обычно ниже по сравнению с валидационной), поэтому тестировать будем модель CatBoost на данных с POS-тегами.

## Тестирование

У нас есть модель, которая показала удовлетворительное качество на валидационном массиве. Протестируем ее.

In [173]:
%%time
catbo_best = CatBoostClassifier(loss_function="Logloss", l2_leaf_reg = catbo_best_reg, learning_rate = catbo_best_rate, iterations=catbo_best_iterations)
catbo_best.fit(features_train, target_train, verbose=10)
predicted_test = catbo_best.predict(features_test)
f1_test = f1_score(target_test, predicted_test)
f1_test

0:	learn: 0.2813201	total: 231ms	remaining: 46s
10:	learn: 0.1729919	total: 2.31s	remaining: 39.7s
20:	learn: 0.1526421	total: 4.38s	remaining: 37.3s
30:	learn: 0.1400870	total: 6.43s	remaining: 35.1s
40:	learn: 0.1325819	total: 8.46s	remaining: 32.8s
50:	learn: 0.1275616	total: 10.5s	remaining: 30.6s
60:	learn: 0.1242400	total: 12.5s	remaining: 28.5s
70:	learn: 0.1222648	total: 14.5s	remaining: 26.4s
80:	learn: 0.1200765	total: 16.6s	remaining: 24.3s
90:	learn: 0.1157384	total: 18.7s	remaining: 22.4s
100:	learn: 0.1123619	total: 20.8s	remaining: 20.3s
110:	learn: 0.1109482	total: 22.8s	remaining: 18.3s
120:	learn: 0.1099555	total: 24.9s	remaining: 16.2s
130:	learn: 0.1085034	total: 26.9s	remaining: 14.2s
140:	learn: 0.1070719	total: 29s	remaining: 12.1s
150:	learn: 0.1065479	total: 31s	remaining: 10.1s
160:	learn: 0.1058025	total: 33s	remaining: 8s
170:	learn: 0.1050202	total: 35s	remaining: 5.94s
180:	learn: 0.1043385	total: 37s	remaining: 3.89s
190:	learn: 0.1032147	total: 39.1s	rem

0.7524177949709864

F1 на тестовой выборке больше 0.75, а значит, требования заказчика выполнены. Однако мы можем обучить модель на корпусе, слитом из обучающего и валидационного, чтобы посмотреть весь потенциал модели.

In [174]:
features_train_valid = count_tf_idf.transform(lemm_text_train_valid.values.astype('U')).toarray()


In [175]:
%%time
catbo_best2 = CatBoostClassifier(loss_function="Logloss", l2_leaf_reg = catbo_best_reg, learning_rate = catbo_best_rate, iterations=catbo_best_iterations)
catbo_best2.fit(features_train_valid, target_train_valid, verbose=10)
predicted_test2 = catbo_best2.predict(features_test)
f1_test2 = f1_score(target_test, predicted_test2)
f1_test2

0:	learn: 0.2886323	total: 582ms	remaining: 1m 55s
10:	learn: 0.1719158	total: 3.67s	remaining: 1m 3s
20:	learn: 0.1518512	total: 6.46s	remaining: 55.1s
30:	learn: 0.1407321	total: 9.25s	remaining: 50.5s
40:	learn: 0.1331755	total: 12.1s	remaining: 46.8s
50:	learn: 0.1277977	total: 14.9s	remaining: 43.6s
60:	learn: 0.1237537	total: 17.8s	remaining: 40.5s
70:	learn: 0.1218356	total: 20.6s	remaining: 37.4s
80:	learn: 0.1190993	total: 23.4s	remaining: 34.4s
90:	learn: 0.1164301	total: 26.2s	remaining: 31.3s
100:	learn: 0.1145110	total: 28.9s	remaining: 28.4s
110:	learn: 0.1119770	total: 31.7s	remaining: 25.4s
120:	learn: 0.1104388	total: 34.6s	remaining: 22.6s
130:	learn: 0.1091784	total: 37.4s	remaining: 19.7s
140:	learn: 0.1082627	total: 40.1s	remaining: 16.8s
150:	learn: 0.1074990	total: 42.9s	remaining: 13.9s
160:	learn: 0.1064795	total: 45.7s	remaining: 11.1s
170:	learn: 0.1060333	total: 48.5s	remaining: 8.23s
180:	learn: 0.1052480	total: 51.3s	remaining: 5.39s
190:	learn: 0.1049272	

0.7564752638070439

Качество предсказания лучше, если обучить модель CatBoost на корпусе из обучающей и валидационной выборки, хоть и не очень сильно.

# 3. Выводы

Мы выполнили требование заказчика с помощью модели градиентного бустинга на алгоритме CatBoost, обученной на предобработанных - очищенных, токенизированных и лемматизированных с POS-тегами - данных.  
Можно отметить, что качество моделей на данных с POS-тегами выше, чем у моделей на данных, лемматизированных без POS-тегов, но не сильно. Можно предположить, что лемматизация в данном случае - а мы работали с английским языком, в котором в тексте используется сравнительно меньше словоформ, чем в других языках, как минимум из-за отсутствия падежей - играла не столь значительную роль, за счет этого и разница между лучше и хуже лемматизированными текстами не давала различий в качестве.  
Однако нельзя исключать, что лемматизация во втором случае, даже с POS-тегами прошла не лучшим образом, некоторые слова все равно не были возвращены в начальную форму. Возможно, на это оказала влияние очистка корпуса, которая могла помимо мешающих элементов исключить и важные, способствующие определению формы слова и части речи симоволы, например, апостроф.  

Тем не менее, требование заказчика выполнено. 