# Project for "Wikishop"

Goal:
build a machine learning model that detects toxic comments, trained on existing labeled comments, and sends them to manual moderation.

Steps:
* [load the data](#loading-and-preprocessing)
* [remove words that carry no useful signal from the messages](#stopwords)
* [lemmatize the words in the messages](#lemmatization)
* [convert the messages to vectors](#vectorization)
* [train several models](#training)
* [choose the best one and check that its F1 score on the validation and test sets is at least 0.75](#evaluation-on-the-test-set)
* [conclusions](#conclusions)


In [1]:
import pandas as pd
import re
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import en_core_web_sm
from nltk.corpus import stopwords

In [2]:
STATE = 12345

### Loading and preprocessing

In [3]:
df = pd.read_csv('C:\\Users\\Freo\\Desktop\\projects\\datasets\\p12.csv')

In [4]:
print(df)

        Unnamed: 0                                               text  toxic
0                0  Explanation\nWhy the edits made under my usern...      0
1                1  D'aww! He matches this background colour I'm s...      0
2                2  Hey man, I'm really not trying to edit war. It...      0
3                3  "\nMore\nI can't make any real suggestions on ...      0
4                4  You, sir, are my hero. Any chance you remember...      0
...            ...                                                ...    ...
159287      159446  ":::::And for the second time of asking, when ...      0
159288      159447  You should be ashamed of yourself \n\nThat is ...      0
159289      159448  Spitzer \n\nUmm, theres no actual article for ...      0
159290      159449  And it looks like it was actually you who put ...      0
159291      159450  "\nAnd ... I really don't think you understand...      0

[159292 rows x 3 columns]


In [5]:
df = df.drop('Unnamed: 0', axis = 1)

In [6]:
df['toxic'].value_counts()

toxic
0    143106
1     16186
Name: count, dtype: int64

The classes are imbalanced (about 10% of the comments are toxic), so the split into subsets has to account for this.
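The imbalance can be quantified from the counts printed above. A minimal sketch that rebuilds the two counts by hand (the `counts` Series is constructed here for illustration; in the notebook it would come from `df['toxic'].value_counts()`):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 143106, 1: 16186}, name="count")

# Share of each class in the whole dataset.
shares = counts / counts.sum()
toxic_share = shares.loc[1]

print(shares.round(4))
print(f"toxic share: {toxic_share:.2%}")
```

Roughly one comment in ten is toxic, which is why a stratified split is a sensible choice: it keeps this ratio the same in every subset.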

### stopwords

In [7]:
stop_words = set(stopwords.words('english'))


In [8]:
nlp = en_core_web_sm.load(disable=['parser','ner'])

In [9]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [10]:
# add() inserts the string as one element; update() would iterate over its characters
stop_words.add('u')

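Note that `'i'` is in the stop-word list above, just like `'you'`, yet "I" survives in the lemmatized text below: the lemmatizer returns the capitalized lemma `I`, and the stop-word check is case-sensitive.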
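When extending the stop-word set, keep in mind that `set.update` iterates over its argument, so a multi-character string is split into letters, while `set.add` inserts the word whole. A small illustration with a made-up set (the names `words_update` and `words_add` are not part of the notebook):

```python
# set.update() treats a string as an iterable of characters,
# while set.add() inserts the string as one element.
words_update = {"the", "a"}
words_update.update("you")   # adds 'y', 'o', 'u' separately

words_add = {"the", "a"}
words_add.add("you")         # adds the single word 'you'

print(sorted(words_update))
print(sorted(words_add))
```

For a one-letter entry such as `'u'` both calls give the same result; the difference only shows up with longer words.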

In [11]:
df

                                                     text  toxic
0       Explanation\nWhy the edits made under my usern...      0
1       D'aww! He matches this background colour I'm s...      0
2       Hey man, I'm really not trying to edit war. It...      0
3       "\nMore\nI can't make any real suggestions on ...      0
4       You, sir, are my hero. Any chance you remember...      0
...                                                   ...    ...
159287  ":::::And for the second time of asking, when ...      0
159288  You should be ashamed of yourself \n\nThat is ...      0
159289  Spitzer \n\nUmm, theres no actual article for ...      0
159290  And it looks like it was actually you who put ...      0


### Lemmatization

We create two functions: one to clean a string and one to lemmatize it.

In [12]:
def clear_text(text):
    text = re.sub(r'[^A-Za-z]', ' ', text)
    text = text.split()
    res = ' '.join(text)
    return res.lower()
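To see what the cleaning step does to a single comment, here is a self-contained sketch that applies an equivalent, slightly condensed `clear_text` to a fragment of the second row of the dataset (the `sample` string is shortened for illustration):

```python
import re

def clear_text(text):
    # Replace everything except Latin letters with a space,
    # collapse runs of whitespace, and lowercase the result.
    text = re.sub(r'[^A-Za-z]', ' ', text)
    return ' '.join(text.split()).lower()

sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample)
print(cleaned)
```

Because apostrophes are replaced with spaces, contractions are split into fragments ("I'm" becomes "i m"), and digits are dropped entirely.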

In [13]:
def lemmatize(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc if token.lemma_ not in stop_words])

In [14]:
df['text'].head()

0    Explanation\nWhy the edits made under my usern...
1    D'aww! He matches this background colour I'm s...
2    Hey man, I'm really not trying to edit war. It...
3    "\nMore\nI can't make any real suggestions on ...
4    You, sir, are my hero. Any chance you remember...
Name: text, dtype: object

In [15]:
%%time
df['text'] = df['text'].apply(clear_text)

CPU times: total: 4.84 s
Wall time: 4.87 s


In [16]:
df['text'].head()

0    explanation why the edits made under my userna...
1    d aww he matches this background colour i m se...
2    hey man i m really not trying to edit war it s...
3    more i can t make any real suggestions on impr...
4    you sir are my hero any chance you remember wh...
Name: text, dtype: object

In [17]:
%%time
df['text'] = df['text'].apply(lemmatize)

CPU times: total: 14min 28s
Wall time: 14min 39s


In [18]:
df['text'].head()

0    explanation edit make username hardcore metall...
1    aww match background colour I seemingly stuck ...
2    hey man I really try edit war guy constantly r...
3    I make real suggestion improvement I wonder se...
4                        sir hero chance remember page
Name: text, dtype: object
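The pronoun "I" remains in the lemmatized text above even though `'i'` is in the stop-word set: the lemmatizer emits the capitalized lemma `I`, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter, assuming a hypothetical helper `drop_stop_words` and a tiny stand-in stop-word set (neither is part of the notebook):

```python
# Hypothetical helper: compare lowercased lemmas against the stop-word set,
# so the lemma "I" matches the entry "i".
def drop_stop_words(lemmas, stop_words):
    return [lemma for lemma in lemmas if lemma.lower() not in stop_words]

stop_words = {"i", "he", "this"}   # tiny stand-in for the NLTK set
lemmas = ["aww", "match", "background", "colour", "I", "seemingly", "stuck"]

filtered = drop_stop_words(lemmas, stop_words)
print(" ".join(filtered))
```

Applying the same idea inside `lemmatize` would change the features, so the scores reported below would have to be recomputed.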

In [19]:
x = df['text']
y = df['toxic']

Split the data into training, validation, and test sets (67.5% / 22.5% / 10%), stratified by the target.

In [21]:
x_temp, x_test, y_temp, y_test = train_test_split(x, y, test_size=0.1, random_state=STATE, stratify = y)
x_train, x_valid, y_train, y_valid = train_test_split(x_temp, y_temp, test_size=0.25,random_state=STATE, stratify = y_temp)
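The two-step split can be sanity-checked on synthetic labels. The sketch below assumes a made-up dataset of 1000 samples with 10% positives and verifies that the resulting sizes are 67.5% / 22.5% / 10% and that the class share is preserved in each part:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 1000 samples, 10% positives.
y = np.array([1] * 100 + [0] * 900)
x = np.arange(len(y))

# Same two-step scheme as in the notebook: 10% test, then 25% of the rest.
x_temp, x_test, y_temp, y_test = train_test_split(
    x, y, test_size=0.1, random_state=12345, stratify=y)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_temp, y_temp, test_size=0.25, random_state=12345, stratify=y_temp)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```

On the real data the same scheme yields 107521 training rows, which matches the row count reported in the LightGBM logs further down.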

Build the text corpora and vectorize them with TF-IDF.

In [26]:
corpus_x_train = x_train.values
corpus_x_valid = x_valid.values
corpus_x_test = x_test.values

### Vectorization

In [27]:
count_tf_idf = TfidfVectorizer(ngram_range=(1,1))

In [28]:
tf_idf_train = count_tf_idf.fit_transform(corpus_x_train)
tf_idf_valid = count_tf_idf.transform(corpus_x_valid)
tf_idf_test = count_tf_idf.transform(corpus_x_test)
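The vectorizer is fitted on the training corpus only and then reused for the other two subsets, so no information from validation or test leaks into the vocabulary or the IDF weights. A toy sketch of that behavior, with made-up texts standing in for the lemmatized comments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora standing in for the lemmatized train and validation texts.
train_texts = ["edit war page", "nice page thank", "stupid edit"]
valid_texts = ["stupid vandal page"]   # 'vandal' never occurs in train

vectorizer = TfidfVectorizer(ngram_range=(1, 1))
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
valid_matrix = vectorizer.transform(valid_texts)      # reuses them unchanged

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_matrix.shape, valid_matrix.shape)
```

Words that appear only outside the training corpus are silently ignored at transform time, so both matrices share the same columns.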

## Training

### LogisticRegression

In [32]:
model = LogisticRegression(solver='liblinear', random_state=STATE, class_weight = 'balanced')
model.fit(tf_idf_train,y_train)
pred = model.predict(tf_idf_valid)
print('f1', f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)

f1 0.7446300715990454
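F1 is the harmonic mean of precision and recall. scikit-learn documents the call as `f1_score(y_true, y_pred)`; in the binary case swapping the arguments only exchanges precision and recall, so the F1 value stays the same. A small sketch on made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]   # 2 true positives, 2 false positives, 1 false negative

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

f1 = f1_score(y_true, y_pred)
f1_swapped = f1_score(y_pred, y_true)   # precision and recall trade places

print(precision, recall, f1, f1_swapped)
```

The symmetry does not extend to precision or recall themselves, so keeping the documented argument order avoids surprises once other metrics are added.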


### DecisionTreeClassifier

In [33]:
for de in range(7,40):
    model = DecisionTreeClassifier(random_state = 123, max_depth = de)
    model.fit(tf_idf_train,y_train)
    pred = model.predict(tf_idf_valid)
    print(de, f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)

7 0.5609022556390978
8 0.5751412429378531
9 0.5940194459732159
10 0.6016081871345029
11 0.6138116591928251
12 0.6229566453447051
13 0.6308535076868704
14 0.6342648845686513
15 0.6399860188745194
16 0.6449387825487153
17 0.6519355943816375
18 0.6539053153307127
19 0.6629346904156065
20 0.662947937795808
21 0.6701289998324678
22 0.669907795473596
23 0.6709136630343672
24 0.670894526034713
25 0.6780617324925323
26 0.6765194287612089
27 0.6814569536423841
28 0.6803305785123966
29 0.6869120316883974
30 0.6893172165958837
31 0.6877147063634876
32 0.6899237384390718
33 0.6885299396312612
34 0.6920213626800453
35 0.6932167718957423
36 0.6942202131094608
37 0.696466031950944
38 0.6979736249597941
39 0.6961699388477631
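Rather than reading the best depth off the printed list, the scores can be collected in a dictionary and the maximum taken. A minimal sketch using the last four values printed above (the `scores` dictionary is filled in by hand here; in the loop it would be populated on each iteration):

```python
# Validation F1 by max_depth, copied from the last rows printed above.
scores = {
    36: 0.6942202131094608,
    37: 0.696466031950944,
    38: 0.6979736249597941,
    39: 0.6961699388477631,
}

# The key with the highest score is the best depth in the scanned range.
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

In the scanned range the maximum falls at depth 38, marginally above depth 39; either way the tree stays well below the 0.75 target.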


In [34]:
model = DecisionTreeClassifier(random_state = 123, max_depth = 39)
model.fit(tf_idf_train,y_train)
pred = model.predict(tf_idf_valid)
print('f1', f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)

f1 0.6961699388477631


### CatBoostClassifier

In [35]:
for i in range(2, 15):
    model = CatBoostClassifier(iterations=200, depth=i)
    model.fit(tf_idf_train, y_train, verbose=25)
    pred = model.predict(tf_idf_valid)
    # f1_score expects (y_true, y_pred)
    print(f1_score(y_valid, pred))
    print(i)

Learning rate set to 0.332152
0:	learn: 0.4290266	total: 317ms	remaining: 1m 3s
25:	learn: 0.2000186	total: 4.37s	remaining: 29.2s
50:	learn: 0.1779673	total: 8.45s	remaining: 24.7s
75:	learn: 0.1655835	total: 12.5s	remaining: 20.4s
100:	learn: 0.1574405	total: 16.6s	remaining: 16.3s
125:	learn: 0.1504579	total: 20.6s	remaining: 12.1s
150:	learn: 0.1451605	total: 24.6s	remaining: 8s
175:	learn: 0.1410573	total: 28.7s	remaining: 3.91s
199:	learn: 0.1376668	total: 32.6s	remaining: 0us
0.7198132088058706
2
Learning rate set to 0.332152
0:	learn: 0.4271336	total: 249ms	remaining: 49.5s
25:	learn: 0.1872284	total: 6.03s	remaining: 40.3s
50:	learn: 0.1653380	total: 11.8s	remaining: 34.6s
75:	learn: 0.1533076	total: 17.6s	remaining: 28.7s
100:	learn: 0.1451087	total: 23.3s	remaining: 22.9s
125:	learn: 0.1391730	total: 29s	remaining: 17.1s
150:	learn: 0.1336268	total: 34.9s	remaining: 11.3s
175:	learn: 0.1298397	total: 40.8s	remaining: 5.56s
199:	learn: 0.1267097	total: 46.2s	remaining: 0us
0.

In [36]:
for i in range(100, 501, 50):
    model = CatBoostClassifier(iterations=i, depth=7)
    model.fit(tf_idf_train, y_train, verbose=25)
    pred = model.predict(tf_idf_valid)
    # f1_score expects (y_true, y_pred)
    print(f1_score(y_valid, pred))
    print(i)

Learning rate set to 0.5
0:	learn: 0.3342696	total: 843ms	remaining: 1m 23s
25:	learn: 0.1470437	total: 20.9s	remaining: 59.6s
50:	learn: 0.1276775	total: 40.8s	remaining: 39.2s
75:	learn: 0.1171343	total: 1m	remaining: 19.2s
99:	learn: 0.1098084	total: 1m 20s	remaining: 0us
0.7526679221594476
100
Learning rate set to 0.43242
0:	learn: 0.3621140	total: 827ms	remaining: 2m 3s
25:	learn: 0.1525892	total: 20.8s	remaining: 1m 39s
50:	learn: 0.1323047	total: 40.6s	remaining: 1m 18s
75:	learn: 0.1209870	total: 1m	remaining: 58.9s
100:	learn: 0.1127696	total: 1m 20s	remaining: 38.9s
125:	learn: 0.1065566	total: 1m 40s	remaining: 19s
149:	learn: 0.1019906	total: 1m 59s	remaining: 0us
0.760193719731292
150
Learning rate set to 0.332152
0:	learn: 0.4139885	total: 829ms	remaining: 2m 44s
25:	learn: 0.1612226	total: 20.8s	remaining: 2m 19s
50:	learn: 0.1400567	total: 40.7s	remaining: 1m 58s
75:	learn: 0.1280034	total: 1m	remaining: 1m 38s
100:	learn: 0.1201656	total: 1m 20s	remaining: 1m 18s
125:	

In [37]:
model = CatBoostClassifier(iterations=200, depth=7)
model.fit(tf_idf_train,y_train, verbose=25)
pred = model.predict(tf_idf_valid)
print('f1', f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)

Learning rate set to 0.332152
0:	learn: 0.4139885	total: 831ms	remaining: 2m 45s
25:	learn: 0.1612226	total: 20.5s	remaining: 2m 17s
50:	learn: 0.1400567	total: 40.1s	remaining: 1m 57s
75:	learn: 0.1280034	total: 59.7s	remaining: 1m 37s
100:	learn: 0.1201656	total: 1m 19s	remaining: 1m 17s
125:	learn: 0.1142808	total: 1m 38s	remaining: 58.1s
150:	learn: 0.1098386	total: 1m 58s	remaining: 38.4s
175:	learn: 0.1059136	total: 2m 17s	remaining: 18.8s
199:	learn: 0.1018940	total: 2m 36s	remaining: 0us
f1 0.7590987868284229


### LGBMClassifier

In [38]:
for i in range(2, 13):
    model = LGBMClassifier(max_depth=i, learning_rate=0.5)
    model.fit(tf_idf_train, y_train)
    pred = model.predict(tf_idf_valid)
    # f1_score expects (y_true, y_pred)
    print(f1_score(y_valid, pred))
    print('dpt', i)

[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.041622 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of used features: 8803
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101608 -> initscore=-2.179484
[LightGBM] [Info] Start training from score -2.179484
0.7360850531582239
dpt 2
[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.077728 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of use

In [39]:
for i in range(5, 51, 5):
    model = LGBMClassifier(max_depth=3, learning_rate=(i * 0.01))
    model.fit(tf_idf_train,y_train)
    pred = model.predict(tf_idf_valid)
    print(f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)
    print('lr', (i * 0.01))

[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.998139 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of used features: 8803
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101608 -> initscore=-2.179484
[LightGBM] [Info] Start training from score -2.179484
0.5055837563451777
lr 0.05
[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.944638 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of u

In [40]:
for i in range(40, 55, 1):
    model = LGBMClassifier(max_depth=3, learning_rate=(i * 0.01))
    model.fit(tf_idf_train,y_train)
    pred = model.predict(tf_idf_valid)
    print(f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)
    print('lr', (i * 0.01))

[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.979645 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of used features: 8803
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101608 -> initscore=-2.179484
[LightGBM] [Info] Start training from score -2.179484
0.7331064657267866
lr 0.4
[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.985106 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of used features: 8803
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101608 -> initscore=-2.179484
[LightGBM] [Info] Start traini

In [41]:
model = LGBMClassifier(max_depth=3, learning_rate=0.48)
model.fit(tf_idf_train,y_train)
pred = model.predict(tf_idf_valid)
print('f1', f1_score(y_valid, pred))  # f1_score expects (y_true, y_pred)

[LightGBM] [Info] Number of positive: 10925, number of negative: 96596
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.891868 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 476442
[LightGBM] [Info] Number of data points in the train set: 107521, number of used features: 8803
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101608 -> initscore=-2.179484
[LightGBM] [Info] Start training from score -2.179484
f1 0.7393290281934163


## Evaluation on the test set

CatBoost reached the target on the validation set (F1 of about 0.759), so let us confirm that the result holds on the test set.

In [42]:
model = CatBoostClassifier(iterations=200, depth=7)
model.fit(tf_idf_train,y_train, verbose=25)
pred = model.predict(tf_idf_test)
print(f1_score(y_test, pred))  # f1_score expects (y_true, y_pred)

Learning rate set to 0.332152
0:	learn: 0.4139885	total: 833ms	remaining: 2m 45s
25:	learn: 0.1612226	total: 20.5s	remaining: 2m 17s
50:	learn: 0.1400567	total: 40.2s	remaining: 1m 57s
75:	learn: 0.1280034	total: 59.8s	remaining: 1m 37s
100:	learn: 0.1201656	total: 1m 19s	remaining: 1m 17s
125:	learn: 0.1142808	total: 1m 38s	remaining: 58.1s
150:	learn: 0.1098386	total: 1m 58s	remaining: 38.5s
175:	learn: 0.1059136	total: 2m 18s	remaining: 18.8s
199:	learn: 0.1018940	total: 2m 36s	remaining: 0us
0.7522345370039328


## Conclusions

We built a machine learning model that detects toxic comments with sufficient quality, learning from the existing labeled comments.

In the course of the work we:
* preprocessed the messages: removed words that carry no useful signal, lemmatized the text, and converted it to vectors
* trained several models
* found that CatBoost met the required criterion (validation F1 of about 0.759)
* confirmed that the requirement also holds on the test set (F1 of about 0.752)

Vectorizing the text with TF-IDF was enough to reach the required baseline. A notable property of this approach here is that prediction quality grows with the size of the training sample (as does computation time). A further gain could probably be obtained by merging the training and validation sets before the final fit.
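Merging the training and validation sets for the final fit could look roughly like the sketch below. It uses toy texts and a `LogisticRegression` as a lightweight stand-in for the CatBoost model; the variable names `x_full`, `y_full`, `features_full`, and `final_model` are illustrative and do not appear in the notebook. The test set stays untouched.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the notebook's x_train / x_valid / y_train / y_valid.
x_train = pd.Series(["nice page thank", "stupid idiot edit", "good edit"])
y_train = pd.Series([0, 1, 0])
x_valid = pd.Series(["idiot vandal", "thank good work"])
y_valid = pd.Series([1, 0])

# Merge the two subsets once the hyperparameters are fixed.
x_full = pd.concat([x_train, x_valid], ignore_index=True)
y_full = pd.concat([y_train, y_valid], ignore_index=True)

# Refit the vectorizer on the merged texts, then train the final model.
vectorizer = TfidfVectorizer()
features_full = vectorizer.fit_transform(x_full)
final_model = LogisticRegression(solver="liblinear", random_state=12345)
final_model.fit(features_full, y_full)

print(features_full.shape)
```

After such a refit there is no validation set left for model selection, so this step only makes sense once the model and its settings have been chosen.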