The online store "Викишоп" (Wikishop) is launching a new service. The store needs a tool that finds toxic comments and sends them for moderation.

We need a model that can classify comments as positive or negative (that is, non-toxic or toxic).

Build a model with an *F1* score of at least 0.75.
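For reference, here is a minimal sketch of how *F1* is computed from the counts of true positives, false positives, and false negatives. The function name `f1_from_counts` and the counts are hypothetical and are not taken from this dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed.
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 3))
```

Because F1 ignores true negatives, it is a reasonable choice when the toxic class is the rare one of interest.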


### Data description

The data is stored in the file `toxic_comments.csv`. Its *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

In [1]:
import re
import warnings

import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from pymystem3 import Mystem
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

nltk.download('stopwords')   # stop-word lists passed to TfidfVectorizer below
m = Mystem()                 # used by lemmatize() below; note that Mystem is a Russian-language analyzer
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
comments = pd.read_csv('/datasets/toxic_comments.csv')
comments.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


Strip non-letter characters from the texts and convert everything to lowercase:

In [3]:
def clean_data(row):
    row = re.sub(r"(?:\n|\r)", " ", row)
    row = re.sub(r"[^a-zA-Z ]+", "", row).strip()
    row = row.lower()
    return row

comments['text'] = comments['text'].apply(clean_data)
comments.head()

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,daww he matches this background colour im seem...,0
2,hey man im really not trying to edit war its j...,0
3,more i cant make any real suggestions on impro...,0
4,you sir are my hero any chance you remember wh...,0
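As a quick check of what the cleaning step does, the sketch below reruns the same `clean_data` logic on made-up strings. The sample inputs are hypothetical. Note that removing digits leaves the surrounding spaces in place, which is why double spaces can appear in cleaned text.

```python
import re


def clean_data(row: str) -> str:
    # Replace line breaks with spaces so words on adjacent lines do not merge.
    row = re.sub(r"(?:\n|\r)", " ", row)
    # Drop everything except Latin letters and spaces, then trim the ends.
    row = re.sub(r"[^a-zA-Z ]+", "", row).strip()
    return row.lower()


sample = "D'aww! He matches\nthis background colour, 100%."
cleaned = clean_data(sample)
print(cleaned)

# Digits vanish but the spaces around them stay.
with_digits = clean_data("launched in 1997 not 1998")
print(repr(with_digits))
```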


In [4]:
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Split the dataset into training, validation, and test sets:

In [5]:
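# Hold out 10% for testing, then take 25% of the remainder for validation
# (about 67.5% / 22.5% / 10% overall).
df_train_valid, df_test = train_test_split(comments, test_size=0.1, random_state=12345)
# shuffle=False keeps the row order, so random_state would have no effect here.
df_train, df_valid = train_test_split(df_train_valid, shuffle=False, test_size=0.25)
print(df_train.shape, df_valid.shape, df_test.shape)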

(107709, 2) (35904, 2) (15958, 2)
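The printed shapes can be reproduced with simple arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, which is consistent with the sizes shown above. The helper `split_sizes` is a hypothetical name introduced only for this illustration.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 159571  # number of rows reported by comments.info()
n_train_valid, n_test = split_sizes(n_total, 0.1)
n_train, n_valid = split_sizes(n_train_valid, 0.25)
print(n_train, n_valid, n_test)
```

The two-stage split therefore gives roughly 67.5% of the rows to training, 22.5% to validation, and 10% to testing.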


In [6]:
features_train = df_train.drop('toxic', axis=1).values
target_train = df_train['toxic'].values

features_valid = df_valid.drop('toxic', axis=1).values
target_valid = df_valid['toxic'].values

features_test = df_test.drop('toxic', axis=1).values
target_test = df_test['toxic'].values

print(features_train.shape, features_valid.shape, features_test.shape, target_train.shape[0])

(107709, 1) (35904, 1) (15958, 1) 107709


Build the text corpus and write a lemmatization function:

In [7]:

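train_corpus = df_train['text'].values.astype('U')

def lemmatize(text):
    # Mystem returns a list of tokens (lemmas and separators); join them back into one string.
    lemm = m.lemmatize(text)
    return "".join(lemm)

# Note: only the first document is lemmatized here; the rest of the corpus stays as cleaned text.
train_corpus[0] = lemmatize(train_corpus[0])
train_corpus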

array(['there is something wrong with you madam   madam stop all of this nonsence on the abc kids page i believe that disneys one saturday morning launched in  not  and im thinking maters tall tales was not on abc kids so cut this crap out darby or im gonna kill you     july  utc\n',
       'amyas godfrey the page has come under scrutiny as of late how can we fix it so it is more in line with wiki guidelines should it stay or should it go',
       'i have now reported your fourth revert on wpanrr if you want to avoid a block you could reinsert the paragraph again   talk',
       ...,
       'shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the fuck up shut the f
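The cell above lemmatizes only the element at index 0. A minimal sketch of applying a lemmatizer to every document is shown below. It uses a hypothetical lookup-table function, `toy_lemmatize`, as a stand-in so that the example runs without external models; it is not the lemmatizer used in this notebook, and a real English lemmatizer would replace it in practice.

```python
# Hypothetical stand-in for a real lemmatizer: a tiny lookup table.
TOY_LEMMAS = {"edits": "edit", "made": "make", "matches": "match", "was": "be"}


def toy_lemmatize(text: str) -> str:
    """Map each whitespace-separated token to its lemma if it is known."""
    return " ".join(TOY_LEMMAS.get(token, token) for token in text.split())


corpus = ["the edits made under my username", "he matches this colour"]

# Apply the function to every document, not just the first one.
lemmatized_corpus = [toy_lemmatize(doc) for doc in corpus]
print(lemmatized_corpus)
```

Lemmatizing the whole corpus on roughly 160 thousand comments can be slow, so the run time is worth measuring before applying it to all three samples.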

In [9]:
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords, ngram_range=(1,1))
tf_idf_train = count_tf_idf.fit_transform(train_corpus)
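The vectorizer is fitted on the training corpus only, and the validation and test corpora are later transformed with the vocabulary and weights learned here. The sketch below illustrates that idea in plain Python. It assumes whitespace tokenization, raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization; the helper names `fit_idf` and `transform` are hypothetical, and the real `TfidfVectorizer` also handles tokenization rules and stop words.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from training docs only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(doc, idf):
    """Weight term counts by IDF, drop unseen terms, then L2-normalise."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


train_docs = ["good edit thanks", "bad edit", "thanks for the edit"]
idf = fit_idf(train_docs)

# "vandal" never appeared in training, so it is ignored at transform time.
vector = transform("bad bad vandal edit", idf)
print(vector)
```

Fitting on the training data alone keeps information from the validation and test sets out of the feature weights.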

In [10]:
valid_corpus = df_valid['text'].values.astype('U')

# As with the training corpus, only the first document is lemmatized.
valid_corpus[0] = lemmatize(valid_corpus[0])
valid_corpus

array(['they won a grammy for it what more proof do you want\n',
       'keep quite you dumb ass the wiki guidelines clearly state that discuss the issue on talk page before removing the propdel  dumb fools like you are the one who have brought noncredibility to wikipedia as well as to englandwhich cannot even survive a war against india for more than  days the whole world knows that england is not a filthy and backward country with no technologyuk buys all defence equipment from us cannot construct its own missiles the only slbm of uk is brought from us no wonder uk is almost in gutter  shamless editor instead of accepting mistake rant like a bastard christian',
       'btw did you know the article links to a site where you can get a live fatwa online',
       ...,
       'hmm yes as i said i also watch the emily ruetearticle as i wrote most of it and i did notice it got vandalized too anyway in the sandbox you can experiment all you wantbeeing bold or you can beup in the sky or color

In [11]:
tf_idf_valid = count_tf_idf.transform(valid_corpus)

In [12]:
test_corpus = df_test['text'].values.astype('U')

# As with the training corpus, only the first document is lemmatized.
test_corpus[0] = lemmatize(test_corpus[0])
test_corpus

array(['ahh shut the fuck up you douchebag sand nigger   go blow up some more people you muslim piece of shit fuck you sand nigger i will find u in real life and slit your throat\n',
       'reply there is no such thing as texas commerce bank of chicago  likewise there is no such thing as the united farmers bank of baltimore and albuquerque  so salvio you are incorrect  if you want to prevent even the remote possibility of confusion then you should not be allowed to use your name salvio because there may be confusion that you are related to salvador dali',
       'reply hey you could at least mention jasenovac and   killed not only serbs but you say its all bs well what is vandalism death of innocent or putting truth here',
       ..., 'replied in new section there',
       'the legal threat was withdrawn and kudos to you for approaching that subject with half a brain  i cant argue against an extended block but to be honest im just going to move on to a new name instead of waiting  wee

In [13]:
tf_idf_test = count_tf_idf.transform(test_corpus)

# 2. Training

Try a logistic regression model:

In [14]:
%%time
lr = LogisticRegression(random_state=1, solver='liblinear', max_iter=100)
params = {
   'penalty':['l1', 'l2'],        
   'C':list(range(1,15,3)) 
}



lr_gs = GridSearchCV(lr, params, cv=3, scoring='f1', verbose=True).fit(tf_idf_train, target_train)

print ("Best Params", lr_gs.best_params_)
print ("Best Score", lr_gs.best_score_)



Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  3.8min finished


Best Params {'C': 4, 'penalty': 'l1'}
Best Score 0.7695274826603975
CPU times: user 2min 24s, sys: 1min 23s, total: 3min 47s
Wall time: 3min 48s
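The "10 candidates, totalling 30 fits" line in the log follows directly from the grid definition. A small sketch that enumerates the same combinations with `itertools.product` (the variable names here are illustrative):

```python
from itertools import product

param_grid = {
    "penalty": ["l1", "l2"],
    "C": list(range(1, 15, 3)),  # [1, 4, 7, 10, 13]
}
n_folds = 3

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Each candidate is trained once per fold, so widening the grid or raising `cv` multiplies the run time accordingly.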


Check on the validation set. Note that `class_weight='balanced'` is added here and was not part of the grid search:

In [17]:
lr_best = LogisticRegression(random_state=1, class_weight = 'balanced', C = 4, penalty = 'l1', solver='liblinear', max_iter=100)
lr_best.fit(tf_idf_train, target_train)


LogisticRegression(C=4, class_weight='balanced', dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [18]:
pred1 = lr_best.predict(tf_idf_valid)
f1_score(target_valid, pred1)


0.7699745547073792
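The `class_weight='balanced'` setting used above weights each class inversely to its frequency, using `n_samples / (n_classes * count)`. The sketch below computes those weights for made-up labels. The helper `balanced_class_weights` and the 9-to-1 label ratio are hypothetical and do not reflect the actual class balance of this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return the weight n_samples / (n_classes * count) for each class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: nine non-toxic comments and one toxic comment.
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these weights, errors on the rare class cost more during training, which usually trades some precision for recall.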

Now try a decision tree:

In [19]:
%%time
tree = DecisionTreeClassifier(random_state = 123)
params = {
   'criterion':['gini', 'entropy'],        
   'max_depth':list(range(1,15,5)) 
}



tree_gs = GridSearchCV(tree, params, cv=3, scoring='f1', verbose=True).fit(tf_idf_train, target_train)

print ("Best Params", tree_gs.best_params_)
print ("Best Score", tree_gs.best_score_)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  3.9min finished


Best Params {'criterion': 'gini', 'max_depth': 11}
Best Score 0.584180160956813
CPU times: user 3min 59s, sys: 0 ns, total: 3min 59s
Wall time: 4min 4s


Check on the validation set:

In [20]:
tree_best = DecisionTreeClassifier(random_state = 123, criterion='gini', max_depth=11)
tree_best.fit(tf_idf_train, target_train)
pred2 = tree_best.predict(tf_idf_valid)
f1_score(target_valid, pred2)

0.5916154680159017

Evaluate the logistic regression on the test set, since it has the better metric value:

In [21]:
pred1 = lr_best.predict(tf_idf_test)      
f1_lr = f1_score(target_test, pred1)     
f1_lr

0.7626591230551627

# 3. Conclusions

The 0.75 threshold for *F1* is exceeded: the model scores about 0.76 on the test set (0.77 on validation). The best model is logistic regression with the tuned hyperparameters C = 4, penalty = 'l1'.