<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>TF-IDF</a></span></li><li><span><a href="#BERT" data-toc-modified-id="BERT-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>BERT</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Рандомный-лес" data-toc-modified-id="Рандомный-лес-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#Лучшая-модель" data-toc-modified-id="Лучшая-модель-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Best model</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Comment sentiment classification (BERT)

An online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

**Goal**

Train a model that classifies comments as toxic (negative) or non-toxic (positive).

**Task**

Build a model with an *F1* score of at least 0.75.
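For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the function name `f1_from_counts` and the counts themselves are hypothetical and only illustrate why a model that never flags a comment scores zero, however high its accuracy on imbalanced data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive (toxic) class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 20 missed
f1_balanced = f1_from_counts(tp=80, fp=20, fn=20)

# A model that predicts "non-toxic" for everything finds no positives at all
f1_all_negative = f1_from_counts(tp=0, fp=0, fn=100)

print(f"{f1_balanced:.2f}", f"{f1_all_negative:.2f}")
```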

**Data description**

The dataset is labeled for edit toxicity. The *text* column contains the comment text, and *toxic* is the target.

In [1]:
!pip install transformers -q



In [2]:
import torch
torch.cuda.is_available()

False

In [3]:
import time
import numpy as np
import os
import pandas as pd
import warnings

# TF-IDF data preparation
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# BERT data preparation
import transformers
from tqdm import tqdm
from tqdm import notebook

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# ML models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [4]:
# constants
RANDOM_STATE = 42
TEST_SIZE = 0.20
SAMPLES = 5000      # sample size for BERT
MAX_LEN = 512       # maximum sequence length (in tokens) for BERT
BATCH_SIZE = 100    # batch size for BERT embeddings


# remove the limit on the number of displayed columns
pd.set_option('display.max_columns', None)

# remove the limit on column width
pd.set_option('display.max_colwidth', None)

# limit the number of displayed decimal places
pd.options.display.float_format = '{:,.2f}'.format

# register tqdm_notebook with pandas to show progress for apply calls
notebook.tqdm_notebook.pandas()

# download NLTK resources
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
def import_data(file_name: str) -> pd.DataFrame:
    '''
    Read a CSV file from the current directory or, failing that, from the Yandex server.
    '''
    local_pth = file_name
    server_pth = 'https://example.ru/' + file_name

    try:
        data = pd.read_csv(local_pth, index_col='Unnamed: 0')
    except FileNotFoundError:
        # no local copy: fall back to the remote file
        data = pd.read_csv(server_pth, index_col='Unnamed: 0')
    
    return data

## Preparation

In [6]:
# load the file
df_comments = import_data('toxic_comments.csv')

In [7]:
# show random rows
display(df_comments.sample(10, random_state=RANDOM_STATE))
print()
# table info
df_comments.info()

Unnamed: 0,text,toxic
31055,"Sometime back, I just happened to log on to www.izoom.in with a friend’s reference and I was amazed to see the concept Fresh Ideas Entertainment has come up with. So many deals… all under one roof. This website is very user friendly and easy to use and is fun to be on.\nYou have Gossip, Games, Facts… Another exciting feature to add to it is Face of the Week… Every week, 4 new faces are selected and put up as izoom faces. It’s great to have been selected in four out of a group of millions. \nThis new start up has already got many a deals in its kitty. Few of them being TheFortune Hotel, The Beach… are my personal favorites. izoom.in has a USP of mobile coupons. Coupons are available even when a user cannot access internet. You just need to SMS izoom support to 56767 and you get attended immediately.\nAll I can say is izoom.in is a must visit website for everyone before they go out for shopping or dining or for outing.\nCheers!!!",0
102929,"""\n\nThe latest edit is much better, don't make this article state """"super."""" at all. 71.237.70.49 """,0
67385,""" October 2007 (UTC)\n\nI would think you'd be able to get your point across, and be immune to any objections, were you to simply embellish the second sentence of the article by changing """"he was schooled at Thornleigh Salesian College"""" to """"he was schooled at (the then all-Catholic) Thornleigh Salesian College"""". \n\nGood suggestion from an Anon - what do you think? Rgds, - 07:53, 5""",0
81167,Thanks for the tip on the currency translation. Think it's all done now.,0
90182,"I would argue that if content on the Con in comparison to the Arts Music is out of proportion, then it warrants further contribution to the article, not the removal of an indepth piece of content. Also, as I mentioned before, the Arts Music unit has a notable history comparable to that of the Con itself. Because of this, I would further argue that content on the Arts Music Unit is more relevant to this article than the information on the Newcastle Conservatorium.",0
1860,"""=Reliable sources===\nCheating:\n""""Barry Bonds:Cheater"""" from CBS, yea I kinda think that is reliable. \n""""Dear Barry Bonds, You are either an outright cheater or very stupid"""" from the USA Today \n""""Yes, Barry Bonds is a cheater. He is a cheater of the worst sort"""" \nLying:\n""""It's clear, Barry Bonds' a liar"""" New York Daily News, another pretty freakin' reliable source. \n""""Barry Lamar Bonds is a bad man"""" Baseball Digest \n""""but Bonds is a liar, a cheater, a whiner and a bad influence on America's youth"""" Mark Barnes\n\n==""",1
125422,WTF=\n\nHow The Fuck Does This Person Merit A Page On Wikipedia.,1
149142,"cajuns, acadians\nCajuns, acadians, louisianans, they're so many different names for different americans of french descent because their culture is so rich and somewhat so different but so close at the same time. I'm an acadian but more importantly I'm a french american so I really don't see why there should be a difference. \n\nIf you say there should be two different list, it doesn't make sense. The people on the french-american list should be in one of wiki-invented list of cajuns or acadians. I understand there are some more recent french-americans who are only 1 or 2 generations-old americans but this distinction isn't made for italian-americans, german-americans. I'm surprised to see Albert Einstein a fairly recent immigrant in American history next to Katherine Heigl, a 10 to 12 generations american. \\\n\nThis is all race-based, biased because french-bashing don't stop at the bush government level.",0
89784,"Hi - I dropped a pin in Google Maps at the ceremonial site near Chief Tayak's grave and recorded the lat/long. Is this not permitted? It maps correctly in other sites, as well, so I believe they're the correct coordinates. I live nearby, so could also visit the site and get the coordinates from GPS.",0
64323,Re removal of accessdate= for urls books \n\nThis is from Template:Cite book\naccessdate: Full date when url was accessed. Should be used when url field is used.\nSomeone has claimed the exact opposite & removed the accessdates,0



<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


1. The data contains English-language comments, most of which need to be cleaned of extraneous characters ('\n', punctuation, mathematical symbols).
2. There are 159292 comments with no missing values.
3. The column types are correct.

In [8]:
print(f"Number of exact duplicates: {df_comments.duplicated().sum()}")

Number of exact duplicates: 0


In [9]:
# show the class distribution
print('Target class distribution:')
df_comments.toxic.value_counts(normalize=True)

Target class distribution:


0   0.90
1   0.10
Name: toxic, dtype: float64

In [10]:
# show toxic comments
df_comments.query('toxic==1').sample(10, random_state=RANDOM_STATE)

Unnamed: 0,text,toxic
92964,"Dude, get a fucking life you queer.. get a real paying job instead of e-wiki policing around, you 50 year old piece of shite... and your mom said to make your bed.",1
120398,The only nonsense I will approve is nonsense committed by 1)chaka zulu warriors. 2)anglican gay priests 3)cocksuck ers,1
47519,Thou shalt be ravished in thy buttocks \n\nIt's sure to be hurtin!,1
815,"Groin \n\nYou sad fuck. I'm going to restore my edit, and you are going to leave it, or else I wil knee you in the groin yourself, prick. 91.108.208.101",1
55660,"Cause of Death \n\nFuck exicution, put down Murder was the mutha fuckin' case, and Shawtnegger was tha murdera!",1
97286,"Getting REALLY annoyed\nOK the list here is barely usable. It sucks. I don't think we should change it because the Arcs and other useful information supposedly don't fit here. BUT WHY CANT I PUT A SIMPLE LINK TO A WEBSITE THAT DOES HAVE THAT INFORMATION???? That website has all kinds of valuable information that is DIRECTLY RELATED TO THE ARTICLE. Yet some WikiNazi thinks that this is his personal article and keeps making up some BS about how my link is a form of advertisement or some shit. There are NO ADS. It is another wiki site, and it is 100% ON TOPIC WITH USEFUL INFORMATION THAT EVERYONE KEEPS TRYING TO ADD TO THIS ARTICLE. Since he won't let his PERSONAL wiki site be altered with USEFUL INFORMATION then let me keep the link. I'm so sick of pedantic fools making wikipedia so useless and difficult to use. I get that this page should reflect wiki-standards but that link I added is 100% relevant and is far more justified than plenty of external links I've seen on other pages.",1
148604,"Harun Yahya/Adnan Oktar \n\nThis is quoted from the wikipedia article on Harun Yahya:\n\nEven though he often writes about science, he has never actually studied any science at a university level. (, , )\n\nThere are also many disturbing data in that article: he claims to be a moral authority and an expert on Muslim ethics but has anal and oral sex with children, claiming that it is halal. If we are to keep his work in this article he needs to be demonstrated as having some standing as a Quranic scholar. Is he so received in the Muslim world? Is there any evidence for this?",1
33368,""":::: """"I will ignore but rest assured that, should you cause any further disruption to articles or talk pages, I will report you and ask that you be blocked longer than you were last time."""" \nSeems your only response is violence? Am i right.... American self appointed dictator asshole? Whats the matter? Cant handle it when it gets tough? Sometimes the other one hits back and what do you do American fag? Run home to mother you fuck. \n\n""",1
39960,Ima rude son of a bitch too because I have a period remember im gay...,1
35349,This guy is so dirty. He's awesome FUCK YEAH AHAHAHHAAHA,1


In [11]:
print(f"Maximum comment length: {max(df_comments['text'].str.len())}")

Maximum comment length: 5000


### TF-IDF

To train the models faster, we reduce the data to a sample.

In [12]:
# take a random sample of SAMPLES * 10 = 50,000 rows from the original data
data = df_comments.sample(random_state=RANDOM_STATE, n=SAMPLES * 10).reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    50000 non-null  object
 1   toxic   50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [13]:
# show the class distribution
data.toxic.value_counts(normalize=True)

0   0.90
1   0.10
Name: toxic, dtype: float64

In [14]:
class TextProcessor:
    """
    Text-processing helper: strips unwanted characters, removes stop words, and lemmatizes.

    Attributes:
        stopwords (set): Stop words for the chosen language.
        wnl (WordNetLemmatizer): WordNet lemmatizer instance.

    Methods:
        clear_text(text): Removes non-letter characters and stop words.
        lemm_text(text): Lemmatizes the text.
        postag_lemm_text(text): Lemmatizes the text using part-of-speech tags.
        get_wordnet_pos(word): Returns the WordNet POS tag for a word.

    """
    # Class initialization
    def __init__(self, stopwords_language='english'):
        self.stopwords = set(stopwords.words(stopwords_language))  # load the stop-word list
        self.wnl = WordNetLemmatizer()
        
    # Remove non-letter characters and stop words
    def clear_text(self, text):
        text = text.lower() 
        word_list = re.sub(r"[^a-z ]", ' ', text).split()
        word_notstop_list = [w for w in word_list if w not in self.stopwords]
        return ' '.join(word_notstop_list)
    
    # Lemmatize the text (every word is treated as a noun by default)
    def lemm_text(self, text):
        word_list = text.split()
        lemmatized_text = ' '.join([self.wnl.lemmatize(w) for w in word_list])
        return lemmatized_text
    
    # Lemmatize the text using part-of-speech tags
    def postag_lemm_text(self, text):
        word_list = text.split()
        lemmatized_text = ' '.join([self.wnl.lemmatize(w, self.get_wordnet_pos(w)) for w in word_list])
        return lemmatized_text
    
    # Static method: determine a word's part of speech with pos_tag
    @staticmethod
    def get_wordnet_pos(word):
        # First letter of the POS tag assigned to the word by pos_tag
        tag = nltk.pos_tag([word])[0][1][0].upper()
        # Map that first letter to the WordNet POS constant expected by lemmatize
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        # Return the matching WordNet POS tag, or 'n' (noun) by default
        return tag_dict.get(tag, wordnet.NOUN)

# Create the text-processing object
text_processor = TextProcessor()
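To make the cleaning step concrete, here is a minimal standalone sketch of the same regex-plus-stop-word logic as `clear_text`. The tiny `STOPWORDS` set is a hypothetical stand-in for NLTK's English list, chosen only so the example runs without downloading any resources.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's full English list
STOPWORDS = {"the", "is", "at", "all", "this", "don", "t"}


def clear_text(text: str, stopwords=STOPWORDS) -> str:
    """Lowercase, replace every non-letter with a space, drop stop words."""
    words = re.sub(r"[^a-z ]", " ", text.lower()).split()
    return " ".join(w for w in words if w not in stopwords)


sample = "The latest edit is much better, don't make this article state \"super.\" at all. 71.237.70.49"
cleaned = clear_text(sample)
print(cleaned)
```

Note that the apostrophe in "don't" is replaced by a space, so the contraction splits into "don" and "t" before stop-word filtering; digits and the IP address disappear entirely.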

In [16]:
# clean the text
data['clean_text'] = data['text'].progress_apply(text_processor.clear_text)

# # plain lemmatization
# data['wnl_text'] = data['clean_text'].progress_apply(text_processor.lemm_text)

# lemmatization with POS tags
data['wnlpostag_text'] = data['clean_text'].progress_apply(text_processor.postag_lemm_text)


In [17]:
data.sample(5, random_state=RANDOM_STATE)

Unnamed: 0,text,toxic,clean_text,wnlpostag_text
1,"""\n\nThe latest edit is much better, don't make this article state """"super."""" at all. 71.237.70.49 """,0,latest edit much better make article state super,late edit much well make article state super
4,"I would argue that if content on the Con in comparison to the Arts Music is out of proportion, then it warrants further contribution to the article, not the removal of an indepth piece of content. Also, as I mentioned before, the Arts Music unit has a notable history comparable to that of the Con itself. Because of this, I would further argue that content on the Arts Music Unit is more relevant to this article than the information on the Newcastle Conservatorium.",0,would argue content con comparison arts music proportion warrants contribution article removal indepth piece content also mentioned arts music unit notable history comparable con would argue content arts music unit relevant article information newcastle conservatorium,would argue content con comparison art music proportion warrant contribution article removal indepth piece content also mention art music unit notable history comparable con would argue content art music unit relevant article information newcastle conservatorium
2,""" October 2007 (UTC)\n\nI would think you'd be able to get your point across, and be immune to any objections, were you to simply embellish the second sentence of the article by changing """"he was schooled at Thornleigh Salesian College"""" to """"he was schooled at (the then all-Catholic) Thornleigh Salesian College"""". \n\nGood suggestion from an Anon - what do you think? Rgds, - 07:53, 5""",0,october utc would think able get point across immune objections simply embellish second sentence article changing schooled thornleigh salesian college schooled catholic thornleigh salesian college good suggestion anon think rgds,october utc would think able get point across immune objection simply embellish second sentence article change school thornleigh salesian college school catholic thornleigh salesian college good suggestion anon think rgds
0,"Sometime back, I just happened to log on to www.izoom.in with a friend’s reference and I was amazed to see the concept Fresh Ideas Entertainment has come up with. So many deals… all under one roof. This website is very user friendly and easy to use and is fun to be on.\nYou have Gossip, Games, Facts… Another exciting feature to add to it is Face of the Week… Every week, 4 new faces are selected and put up as izoom faces. It’s great to have been selected in four out of a group of millions. \nThis new start up has already got many a deals in its kitty. Few of them being TheFortune Hotel, The Beach… are my personal favorites. izoom.in has a USP of mobile coupons. Coupons are available even when a user cannot access internet. You just need to SMS izoom support to 56767 and you get attended immediately.\nAll I can say is izoom.in is a must visit website for everyone before they go out for shopping or dining or for outing.\nCheers!!!",0,sometime back happened log www izoom friend reference amazed see concept fresh ideas entertainment come many deals one roof website user friendly easy use fun gossip games facts another exciting feature add face week every week new faces selected put izoom faces great selected four group millions new start already got many deals kitty thefortune hotel beach personal favorites izoom usp mobile coupons coupons available even user cannot access internet need sms izoom support get attended immediately say izoom must visit website everyone go shopping dining outing cheers,sometime back happen log www izoom friend reference amaze see concept fresh idea entertainment come many deal one roof website user friendly easy use fun gossip game fact another excite feature add face week every week new face select put izoom face great select four group million new start already get many deal kitty thefortune hotel beach personal favorite izoom usp mobile coupon coupon available even user cannot access internet need sm izoom support get attend immediately say izoom must visit website everyone go shopping din out cheer
3,Thanks for the tip on the currency translation. Think it's all done now.,0,thanks tip currency translation think done,thanks tip currency translation think do


Split the data into training, validation, and test sets.

In [16]:
# split off the test set
features, features_test, target, target_test = train_test_split(
    data['wnlpostag_text'], data['toxic'], 
    stratify=data['toxic'],
    test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# split the remainder into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    features, target,
    stratify=target,
    random_state=RANDOM_STATE
)
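The resulting set sizes follow from simple arithmetic. This sketch assumes that the second call, which passes no `test_size`, falls back to scikit-learn's default hold-out share of 0.25; `DEFAULT_VALID_SIZE` is a name introduced here for illustration only.

```python
TEST_SIZE = 0.20
DEFAULT_VALID_SIZE = 0.25  # assumed default of train_test_split when test_size is omitted
n_total = 50_000           # rows in the TF-IDF sample

n_test = round(n_total * TEST_SIZE)
n_rest = n_total - n_test
n_valid = round(n_rest * DEFAULT_VALID_SIZE)
n_train = n_rest - n_valid

# Shares of the full sample: 60% train, 20% validation, 20% test
shares = {name: n / n_total for name, n in
          [("train", n_train), ("valid", n_valid), ("test", n_test)]}
print(n_train, n_valid, n_test, shares)
```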

The data can now be vectorized, but this is done inside the pipeline so that the vectorizer is fitted only on the training folds and no information leaks during cross-validation.
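The leakage concern can be sketched without scikit-learn. In the simplified TF-IDF below, the IDF weights are learned from training documents only and then applied to a held-out document. The helper names `fit_idf` and `transform` are hypothetical, and the formula (smoothed IDF with L2 normalization) is meant to mirror, to my understanding, the defaults of `TfidfVectorizer`; it is not the library's code.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def transform(doc, idf):
    """L2-normalized TF-IDF weights for one document, as a dict."""
    counts = Counter(w for w in doc.split() if w in idf)  # unseen words are dropped
    weights = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}


train_docs = ["thanks tip currency", "edit much well", "thanks edit article"]
valid_doc = "thanks rude edit"

idf = fit_idf(train_docs)               # statistics come from the training fold only
valid_vec = transform(valid_doc, idf)   # "rude" was never seen, so it gets no weight
print(valid_vec)
```

Fitting the vectorizer on all rows before cross-validation would let the validation fold influence the vocabulary and IDF weights, which is what wrapping it in a `Pipeline` avoids.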

### BERT

For BERT the sample has to be reduced further, because tokenizing the text and computing embeddings takes a long time.

In [17]:
# reduce the amount of data
data_mini = df_comments.sample(random_state=RANDOM_STATE, n=SAMPLES).reset_index(drop=True)
data_mini.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5000 non-null   object
 1   toxic   5000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 78.2+ KB


In [18]:
# show the class distribution
data_mini.toxic.value_counts(normalize=True)

0   0.90
1   0.10
Name: toxic, dtype: float64

---
---

**embeddings with cuda**

Create embeddings with the pretrained model `"unitary/toxic-bert"`.

In [19]:
model_name = "unitary/toxic-bert"
model = transformers.AutoModel.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at unitary/toxic-bert were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# tokenize the data (sequences longer than MAX_LEN are truncated)
tokenized = data_mini['text'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=MAX_LEN, truncation=True)
)

# pad every sequence with zeros to the same length
padded = np.array([i + [0]*(MAX_LEN - len(i)) for i in tokenized.values if len(i) <= MAX_LEN])

# build the attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
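The padding and masking logic above can be illustrated on a toy input in plain Python. The token IDs and the helper name `pad_and_mask` are illustrative assumptions; only the zero-padding convention is taken from the cell above.

```python
PAD_ID = 0  # padding value used in the notebook cell


def pad_and_mask(token_ids, max_len, pad_id=PAD_ID):
    """Right-pad each sequence to max_len and mark real tokens with 1."""
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    mask = [[int(tok != pad_id) for tok in row] for row in padded]
    return padded, mask


# Illustrative token IDs for two short sequences of different lengths
toy_tokens = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
padded, mask = pad_and_mask(toy_tokens, max_len=6)

for row, m in zip(padded, mask):
    print(row, m)
```

The mask tells the model which positions to ignore, so the padded zeros do not influence the embeddings.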

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
device

In [None]:
# compute embeddings batch by batch, keeping the vector of the first ([CLS]) token
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // BATCH_SIZE)):
    batch = torch.LongTensor(padded[BATCH_SIZE * i : BATCH_SIZE * (i + 1)])
    attention_mask_batch = torch.LongTensor(attention_mask[BATCH_SIZE * i : BATCH_SIZE * (i + 1)])
    batch = batch.to(device)
    attention_mask_batch = attention_mask_batch.to(device)

    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())
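One detail of the loop above: `padded.shape[0] // BATCH_SIZE` covers every row only when the row count is a multiple of the batch size, as it is here (5000 rows, batches of 100). A hedged sketch of a variant that also covers a trailing partial batch, with the hypothetical helper `batch_slices`:

```python
import math


def batch_slices(n_rows: int, batch_size: int):
    """Return (start, stop) index pairs that cover all rows, including a final partial batch."""
    n_batches = math.ceil(n_rows / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_rows)) for i in range(n_batches)]


# With 5000 rows and batches of 100, floor division already covers everything
full = batch_slices(5000, 100)

# With a hypothetical 5050 rows, floor division would give 50 batches and skip the last 50 rows
uneven = batch_slices(5050, 100)
print(len(full), len(uneven), uneven[-1])
```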

In [None]:
# concatenate the batch results into one array
np_embeddings = np.concatenate(embeddings)

# save the embeddings to the current directory
np.save('toxic_comments_embeddings.npy', np_embeddings)

np_embeddings.shape

---
---
**embeddings from local path**

In [20]:
# load precomputed embeddings from the current directory
np_embeddings = np.load('toxic_comments_embeddings.npy')
np_embeddings.shape

(5000, 768)

---
---

Split the data into sets.

In [21]:
# sets for BERT
features_bert, features_test_bert, target_bert, target_test_bert = train_test_split(
    np_embeddings, data_mini['toxic'], 
    test_size=TEST_SIZE, random_state=RANDOM_STATE
)

In [22]:
X_train_bert, X_valid_bert, y_train_bert, y_valid_bert = train_test_split(features_bert, target_bert, stratify=target_bert, random_state=RANDOM_STATE)

### Выводы

1. The data has been loaded; there are no missing values or duplicates.
2. The text has been cleaned and lemmatized for TF-IDF vectorization, and BERT embeddings have been generated.
3. The data has been split into training, validation, and test sets.

## Training

In [23]:
# create a dataframe to record the hyperparameter search results
models_results = pd.DataFrame(columns=['model_name', 'params', 'score', 'time'])

In [24]:
def gscv_fit(model, param_grid, X, y, model_name):
    '''
    Hyperparameter search with GridSearchCV(), scored by F1.
    
    Updates models_results.
    Returns "best_estimator_".
    '''
    
    gscv = GridSearchCV(model, param_grid, scoring='f1', cv=5, n_jobs=-1, verbose=0)

    gscv.fit(X, y)
    # time a single refit of the best estimator on the full training data
    start = time.time()
    gscv.best_estimator_.fit(X, y)
    end = time.time()
    print(f'Best CV-F1 score Train: {gscv.best_score_:.2f}')
    
    global models_results
    # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate
    new_row = pd.DataFrame([{'model_name': model_name, 'params': gscv.best_params_, 'score': gscv.best_score_, 'time': end - start}])
    models_results = pd.concat([models_results, new_row], ignore_index=True)
    
    return gscv.best_estimator_

In [25]:
# TF-IDF vectorizer
count_tf_idf = TfidfVectorizer()

# logistic regression model and its hyperparameter grids
model_logreg = LogisticRegression(class_weight='balanced', max_iter=500, random_state=RANDOM_STATE)
param_grid_lr_tf_idf = {'model__C': [i/100 for i in range(1, 500, 50)]}
param_grid_lr = {'C': [i/100 for i in range(1, 500, 50)]}

# random forest model and its hyperparameter grids
model_forest =  RandomForestClassifier(class_weight='balanced', random_state=RANDOM_STATE)
param_grid_frst_tf_idf = {
    'model__n_estimators': range(10, 41, 10),
    'model__max_depth': range(6, 16, 5)
}
param_grid_frst = {
    'n_estimators': range(10, 41, 10),
    'max_depth': range(6, 16, 5)
}
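It can help to see what these grids expand to. The sketch below enumerates the candidate values and counts the model fits that a 5-fold search performs (the `cv=5` value comes from `gscv_fit`; the variable names are introduced here for illustration).

```python
from itertools import product

CV_FOLDS = 5  # as set in gscv_fit

# Logistic regression: 10 values of C from 0.01 to 4.51 in steps of 0.5
c_grid = [i / 100 for i in range(1, 500, 50)]

# Random forest: every (n_estimators, max_depth) combination
forest_grid = list(product(range(10, 41, 10), range(6, 16, 5)))

print("C values:", c_grid)
print("forest combinations:", forest_grid)
print("fits during search:", len(c_grid) * CV_FOLDS, "and", len(forest_grid) * CV_FOLDS)
```

The `model__` prefix in the pipeline grids routes each parameter to the pipeline step named `'model'`; the grids without the prefix are for the bare estimators trained on BERT embeddings.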

### Logistic regression

Train linear models on the two feature representations.

In [26]:
log_reg_wnl = gscv_fit(
    Pipeline([('tf_idf', count_tf_idf), ('model', model_logreg)]),
    param_grid_lr_tf_idf,
    X_train, y_train,
    'LogReg_TF-IDF')

Best CV-F1 score Train: 0.75


In [27]:
log_reg_bert = gscv_fit(model_logreg, param_grid_lr, X_train_bert, y_train_bert, 'LogReg_BERT')

Best CV-F1 score Train: 0.94


### Random forest

Train random forests on the two feature representations.

In [28]:
forest_postag = gscv_fit(
    Pipeline([('tf_idf', count_tf_idf), ('model', model_forest)]),
    param_grid_frst_tf_idf,
    X_train, y_train,
    'RandForest_TF-IDF')

Best CV-F1 score Train: 0.36


In [29]:
forest_bert = gscv_fit(model_forest, param_grid_frst, X_train_bert, y_train_bert, 'RandForest_BERT')

Best CV-F1 score Train: 0.95


### Best model

In [30]:
# show the model training results
models_results

Unnamed: 0,model_name,params,score,time
0,LogReg_TF-IDF,{'model__C': 4.51},0.75,15.7
1,LogReg_BERT,{'C': 0.51},0.94,2.7
2,RandForest_TF-IDF,"{'model__max_depth': 11, 'model__n_estimators': 40}",0.36,1.2
3,RandForest_BERT,"{'max_depth': 6, 'n_estimators': 30}",0.95,0.47


Models trained on BERT embeddings predict more accurately than models trained on TF-IDF vectors.

The random forest scored slightly higher than logistic regression in cross-validation on the training set (0.95 vs. 0.94).

Check the models on the validation sets.

In [31]:
print(f'LogReg F1 valid: {f1_score(y_valid_bert, log_reg_bert.predict(X_valid_bert)):.2f}')
print(f'RandForest F1 valid: {f1_score(y_valid_bert, forest_bert.predict(X_valid_bert)):.2f}')

LogReg F1 valid: 0.92
RandForest F1 valid: 0.91


Logistic regression did slightly better on the validation data (0.92 vs. 0.91), which suggests that the forest model overfit the training data somewhat.
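The comparison can be made explicit by computing the drop from the cross-validated score to the validation score for each model. The numbers are the rounded values reported above; the dictionary layout is just an illustration.

```python
# Rounded F1 scores reported in the notebook outputs
scores = {
    "LogReg_BERT": {"cv": 0.94, "valid": 0.92},
    "RandForest_BERT": {"cv": 0.95, "valid": 0.91},
}

# Gap between cross-validated and validation F1; a larger gap hints at more overfitting
gaps = {name: round(s["cv"] - s["valid"], 2) for name, s in scores.items()}
print(gaps)
```

With only two decimal places available, these gaps are indicative rather than conclusive.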

Retrain the forest model on the combined training and validation data.

In [32]:
# refit the model on features_bert == X_train_bert + X_valid_bert
forest_bert.fit(features_bert, target_bert)
# model predictions on the test data
preds = forest_bert.predict(features_test_bert)
# print the metric
print(f'F1 on the test data: {f1_score(target_test_bert, preds):.2f}')

F1 on the test data: 0.96


The metric meets the prediction-quality requirement (F1 of at least 0.75).

## Conclusions

1. The data was loaded; no missing values or duplicates were found.
2. To speed things up, the samples were reduced to 50000 rows for TF-IDF vectorization and to 5000 rows for BERT embeddings.
3. Two model types were trained with hyperparameter tuning: logistic regression and random forest. Two text representations were used: TF-IDF vectors built from POS-aware lemmatized text, and BERT embeddings.
4. Models trained on BERT embeddings detect toxic comments better. The random forest achieved the highest cross-validated score on the training set. To reduce the effect of overfitting to the training data, the model was refitted on the combined training and validation sets.
5. The best model's F1 on the test set is 0.96.