Project: toxic comment classification (F1 ≥ 0.75)

Data: load toxic_comments.csv, clean the text, split into train/val/test.

Features: TF-IDF (n-grams) or pretrained embeddings.

Models: logistic regression, SVM, LightGBM (plus optional BERT fine-tuning) with hyperparameter tuning.

Evaluation: reach F1 ≥ 0.75 on validation, then check on the test set.
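
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from raw counts; the function name `f1_from_labels` and the toy labels are illustrative only and are not taken from the project data.

```python
def f1_from_labels(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: 4 toxic comments, 3 of them found, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = f1_from_labels(y_true, y_pred)
print(precision, recall, f1)
```

Because true negatives do not enter the formula, F1 is not inflated by the large share of non-toxic comments.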











## Preparation

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import re
import string
import subprocess
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from sklearn import svm
from sklearn.linear_model import SGDClassifier, LogisticRegression
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
from torch.nn.functional import softmax
from tqdm.auto import tqdm
import gensim.downloader as api
from gensim.models import Word2Vec
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertModel, BertTokenizer
import gc
from collections import Counter
from tqdm import notebook
from sklearn.neighbors import KNeighborsClassifier

tqdm.pandas()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
try:
    df = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to the Kaggle path when the local dataset is absent.
    df = pd.read_csv('/kaggle/input/toxic-c/toxic_comments.csv')

In [None]:
df

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [None]:
print(df['text'].sample(1).item())

Thanks, gents, I'll try the second option. Cheers


As we can see, the dataframe does contain text.

<div class="alert alert-block alert-success">
<b>Success:</b> The data were loaded correctly.
</div>

## Training

In [None]:
X_train, X_test, y_train, y_test= train_test_split(df['text'], df['toxic'], test_size=0.2, random_state=0)
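
The project plan calls for a train/val/test split, while the cell above produces only train and test. One possible way to get three stratified parts is sketched below on synthetic data. The `_s` variable names and the 60/20/20 proportions are assumptions made for illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df['text'] / df['toxic'] with roughly 10% positives.
rng = np.random.default_rng(0)
texts = np.array([f"comment {i}" for i in range(1000)])
labels = (rng.random(1000) < 0.1).astype(int)

# First carve out 20% for the test set, then 25% of the remainder
# (20% of the total) for validation; stratify keeps the class ratio.
X_tmp, X_test_s, y_tmp, y_test_s = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)
X_train_s, X_val_s, y_train_s, y_val_s = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

print(len(X_train_s), len(X_val_s), len(X_test_s))
print(y_train_s.mean(), y_val_s.mean(), y_test_s.mean())
```

With a split like this, model selection can be done on the validation part and the test part is touched only once at the end.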


## TF-IDF

We start by building TF-IDF features.

In [None]:
nltk.download('wordnet', download_dir='/kaggle/working/')
command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
subprocess.run(command.split())
nltk.data.path.append('/kaggle/working/')

[nltk_data] Downloading package wordnet to /kaggle/working/...
[nltk_data]   Package wordnet is already up-to-date!
Archive:  /kaggle/working/corpora/wordnet.zip


replace /kaggle/working/corpora/wordnet/lexnames? [y]es, [n]o, [A]ll, [N]one, [r]ename:  NULL
(EOF or read error, treating as "[N]one" ...)


Let us write a simple lemmatizer and tokenizer for the text.

In [None]:
nltk.download('wordnet', download_dir='/kaggle/working/')
nltk.download('punkt', download_dir='/kaggle/working/')
nltk.download('stopwords', download_dir='/kaggle/working/')


stopWords = set(stopwords.words('english'))
wnl = nltk.WordNetLemmatizer()

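def preproc_nltk(text):
    # Replace punctuation with spaces; re.escape keeps characters such as
    # ']', '\' and '^' from being read as character-class syntax.
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    # Lowercase, tokenize, drop English stop words, lemmatize.
    return ' '.join(wnl.lemmatize(word) for word in word_tokenize(text.lower()) if word not in stopWords)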

[nltk_data] Downloading package wordnet to /kaggle/working/...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /kaggle/working/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /kaggle/working/...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
print(preproc_nltk(np.array(X_train)[0]))

snyder ny map look talk


In [None]:
print(np.array(X_train)[0])

Snyder, NY map 

I'll look into it.  (talk)


In [None]:
X_train_lem = [preproc_nltk(text) for text in np.array(X_train)]

In [None]:
vectorizer = TfidfVectorizer(max_df=0.7, max_features=2000, preprocessor=preproc_nltk)
vectors = vectorizer.fit_transform(np.array(X_train))

In [None]:
dense_vectors = vectors.todense()
dense_vectors_test = vectorizer.transform(np.array(X_test)).todense()

In [None]:
svc = LogisticRegression()
svc.fit(np.array(dense_vectors), y_train)
print(f1_score(y_test, svc.predict(np.array(dense_vectors_test))))

0.7266964951528712


Logistic regression on our TF-IDF features reaches F1 ≈ 0.73, just short of the 0.75 target.
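
The plan also mentions n-grams and hyperparameter tuning, which the baseline above does not use. A minimal sketch of how that could look with a scikit-learn `Pipeline` and `GridSearchCV` follows. The toy corpus, the parameter grid, and `cv=3` are assumptions chosen so that the example runs in seconds; they are not tuned values for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for X_train / y_train.
toy_texts = [
    "you are an idiot", "thanks for the help", "shut up you moron",
    "nice edit, well done", "what a stupid idiot", "please see the talk page",
    "you moron, go away", "good point, I agree", "stupid stupid article",
    "I will look into it", "idiot vandal", "cheers, have a nice day",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # output stays sparse
    ("clf", LogisticRegression(max_iter=1000)),   # accepts sparse input
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(toy_texts, toy_labels)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the TF-IDF matrix sparse inside the pipeline also avoids the `.todense()` conversion, which can be memory-hungry on the full training set.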

## Pretrained RoBERTa toxicity model

Let us try loading a ready-made toxicity classification model.

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('SkolkovoInstitute/roberta_toxicity_classifier')
model = RobertaForSequenceClassification.from_pretrained('SkolkovoInstitute/roberta_toxicity_classifier')
model.to(device)
model.eval()

def evaluate_toxicity(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = softmax(logits, dim=1)
    toxic_prob = probabilities[0][1].item()
    return toxic_prob







  return self.fget.__get__(instance, owner)()
Some weights of the model checkpoint at SkolkovoInstitute/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
f = X_test.progress_apply(evaluate_toxicity)

  0%|          | 0/31859 [00:00<?, ?it/s]

In [None]:
print(f1_score(y_test, f > 0.5))

0.8607594936708861


As we can see, the pretrained model solves the task (F1 ≈ 0.86, above the 0.75 target). Let us check other approaches as well.

## Glove-200

Let us take embeddings trained on tweets and look at the result.

In [None]:
embeddings_pretrained = api.load('glove-twitter-200')



IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)










In [None]:
proc_words = [preproc_nltk(text).split() for text in X_train]


In [None]:
embeddings_trained = Word2Vec(proc_words,vector_size=1000,
                  min_count=50,
                 window=3).wv

In [None]:
def vectorize_sum(comment, embeddings):
    embedding_dim = embeddings.vectors.shape[1]
    features = np.zeros([embedding_dim], dtype='float32')

    for word in preproc_nltk(comment).split():
        if word in embeddings:
            features += embeddings[word]

    return features

In [None]:
X_wv = np.stack([vectorize_sum(text, embeddings_pretrained) for text in df['text']])
X_train_wv, X_test_wv, y_train, y_test = train_test_split(X_wv, df['toxic'], test_size=0.2, random_state=0)
X_train_wv.shape, X_test_wv.shape

((127433, 200), (31859, 200))

In [None]:
svc = LogisticRegression()
svc.fit(X_train_wv, y_train)
print(f1_score(y_test, svc.predict(X_test_wv)))

0.6210914740726354


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The result (F1 ≈ 0.62) is worse than the TF-IDF baseline (F1 ≈ 0.73).
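
`vectorize_sum` adds up word vectors, so the norm of the resulting feature grows with comment length. Averaging is a common alternative that removes this dependence and may also help the solver, which warned about hitting the iteration limit above. The sketch below uses a toy dictionary in place of `embeddings_pretrained`. The name `vectorize_mean` and the 3-dimensional vectors are hypothetical, and whether averaging improves F1 on this dataset has not been checked here.

```python
import numpy as np

def vectorize_mean(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0).astype("float32")

# Toy 3-dimensional "embeddings" (hypothetical values, not GloVe).
toy_embeddings = {
    "good": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "bad": np.array([0.0, 1.0, 0.0], dtype="float32"),
}

short = vectorize_mean(["bad"], toy_embeddings, dim=3)
long = vectorize_mean(["bad"] * 50 + ["unknownword"], toy_embeddings, dim=3)
print(short, long)  # the two vectors coincide despite the length difference
```

The same membership test (`t in embeddings`) and lookup (`embeddings[t]`) work with a gensim `KeyedVectors` object, so the function can be applied to the tokens produced by `preproc_nltk(comment).split()`.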

## Pretrained XLMR toxicity model

Let us take one more pretrained model, this time XLM-R.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("textdetox/xlmr-large-toxicity-classifier")
model = AutoModelForSequenceClassification.from_pretrained("textdetox/xlmr-large-toxicity-classifier")
model.to(device)
model.eval()

def evaluate_toxicity_xlmr(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = softmax(logits, dim=1)
    toxic_prob = probabilities[0][1].item()
    return toxic_prob







In [None]:
torch.cuda.empty_cache()


In [None]:
d = X_test.progress_apply(evaluate_toxicity_xlmr)

  0%|          | 0/31859 [00:00<?, ?it/s]

In [None]:
print(f1_score(y_test, d > 0.5))

0.6890713834491471


As we can see, XLM-R falls short of the target (F1 ≈ 0.69), so I suggest using the pretrained RoBERTa model.
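
Both pretrained classifiers are scored here with a fixed 0.5 cut-off on the predicted probability. When a validation split is available, the cut-off can instead be chosen there to maximize F1. The sketch below shows the idea on synthetic scores. The function name `best_f1_threshold`, the threshold grid, and the data are assumptions, and nothing here claims that a different threshold would change the ranking of the models above.

```python
import numpy as np

def best_f1_threshold(y_true, scores, thresholds):
    """Return (threshold, f1) maximizing F1 over the candidate thresholds."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        pred = scores > t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Synthetic validation scores (hypothetical): positives tend to score higher.
rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.1).astype(int)
scores_val = np.clip(0.25 * y_val + rng.normal(0.2, 0.1, 2000), 0, 1)

t, f1 = best_f1_threshold(y_val, scores_val, np.linspace(0.05, 0.95, 19))
print(t, f1)
```

The chosen threshold should then be frozen and applied once to the test split, so that the test F1 stays an unbiased estimate.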

## Own BERT Embeddings

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(device)


In [None]:

tokenized = X_train.apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, padding=True, truncation=True, max_length=500))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)





In [None]:
X_train

35270         Snyder, NY map \n\nI'll look into it.  (talk)
23948     ", 24 November 2011 (UTC)\n\nMy apology if you...
85882                      Newsflash!!  Schuminweb is GAY!!
35392            Shut your mouth and stop talking. good boy
38364     "\n\n Thanks for the message, I'll admit I was...
                                ...                        
97639                             Gun Powder Ma]] 09:45, 30
95939     What Volumes? \n\nFor whoever noted that there...
152315    I agree with wikireader, the source Iqinn gave...
117952    It has now been a week and a half and not and ...
43567     "\n\nWell, let's go through this one-by-one. W...
Name: text, Length: 127433, dtype: object

In [None]:
batch_size = 19
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)

        embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())


  0%|          | 0/6707 [00:00<?, ?it/s]

In [None]:
features_train = np.concatenate(embeddings)


In [None]:
X_test

33703     "\nWell, since I am blocked, I shall temporari...
86675     "\n\nHahahahaha. Typical. \n\nThe article is a...
47557     Which you made after I was encouraged by a med...
96900     I regard you a racist. I will request you be b...
66242     "\n\nA broken chair is not a chair\nA broken c...
                                ...                        
127382                             |listas = Yury of Moscow
94183     January 22, 2007 \n\nPlease stop. If you conti...
121785                       join the wikipedia vandal club
13184          No, because a distinction needs to be drawn.
69417       Blue Balls\n| note19 = featuring Slaughterhouse
Name: text, Length: 31859, dtype: object

In [None]:
features_train.shape

(127433, 768)

In [None]:
tokenized = X_test.apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, padding=True, truncation=True, max_length=500))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)




batch_size = 19
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)

        embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())


  0%|          | 0/1676 [00:00<?, ?it/s]

In [None]:
features_valid = np.concatenate(embeddings)


In [None]:
features_valid.shape

(31844, 768)

In [None]:
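svc = LogisticRegression()
svc.fit(features_train, y_train)
# The embedding loop above skips the last 31859 % 19 = 15 test rows
# (floor division by batch_size), so the labels are trimmed to match.
print(f1_score(y_test[:-15], svc.predict(features_valid)))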

0.7239896818572658


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


BERT embeddings with logistic regression give F1 ≈ 0.72, on par with the TF-IDF baseline (≈ 0.73) but still below the threshold, so we will use RoBERTa.
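
The test features have 31844 rows instead of 31859 because the batching loop uses floor division (`padded.shape[0] // batch_size`) and never processes the final partial batch. A minimal sketch of a loop that covers every row is shown below. The helper `iter_batches` and the array `fake_padded` are hypothetical stand-ins, and the model call is replaced by a trivial pass-through.

```python
import math
import numpy as np

def iter_batches(n_rows, batch_size):
    """Yield (start, stop) index pairs covering all rows, including the tail."""
    for i in range(math.ceil(n_rows / batch_size)):
        start = i * batch_size
        yield start, min(start + batch_size, n_rows)

# Stand-in for `padded`: 31859 rows as in the test split, 4 fake token ids each.
fake_padded = np.zeros((31859, 4), dtype=np.int64)

outputs = []
for start, stop in iter_batches(fake_padded.shape[0], batch_size=19):
    batch = fake_padded[start:stop]
    # The real loop would run the model here; we pass one value per row through.
    outputs.append(batch[:, 0])

features = np.concatenate(outputs)
print(features.shape)  # (31859,)
```

With all rows embedded, the `y_test[:-15]` trimming is no longer needed and the score is computed on the full test split.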

## Conclusions

Based on these experiments, we choose the pretrained RoBERTa model for this task. The models built on our own embeddings did not live up to expectations, and neither did the XLM-R model. RoBERTa scored above 0.75 on F1 (≈ 0.86), so that is the one we will use.