<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for «Викишоп»

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which each edit is labeled as toxic or not.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

In [1]:
import gc
import re

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

from pymystem3 import Mystem

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from sklearn.model_selection import train_test_split, GridSearchCV

from catboost import CatBoostClassifier, Pool
from catboost.text_processing import Tokenizer



## Preparation

In [2]:
nltk.download('stopwords')
stopwords = list(set(nltk_stopwords.words('english')))
stopwords

[nltk_data] Downloading package stopwords to /home/ltz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['then',
 'its',
 'an',
 'y',
 'they',
 'off',
 'it',
 'to',
 'most',
 'own',
 'were',
 "weren't",
 'why',
 'itself',
 'with',
 'being',
 'was',
 'during',
 'their',
 "won't",
 'wouldn',
 'shouldn',
 'won',
 'there',
 'more',
 'this',
 'too',
 'wasn',
 'about',
 'a',
 'on',
 're',
 'up',
 'be',
 'myself',
 'all',
 "you'd",
 'he',
 'some',
 'can',
 'will',
 'yours',
 'further',
 'does',
 "haven't",
 'been',
 'below',
 "shouldn't",
 "she's",
 "it's",
 'that',
 'those',
 'how',
 'didn',
 'isn',
 'our',
 "mightn't",
 'where',
 'such',
 'i',
 'his',
 "you're",
 'ain',
 'but',
 'between',
 'she',
 'ourselves',
 'of',
 'nor',
 'is',
 'do',
 'should',
 'or',
 'no',
 "don't",
 'd',
 'other',
 'theirs',
 'mightn',
 "you've",
 'themselves',
 'same',
 'o',
 'because',
 'from',
 "shan't",
 'has',
 'did',
 'through',
 'again',
 'the',
 'until',
 'them',
 'any',
 "hadn't",
 'just',
 "doesn't",
 'you',
 'll',
 'if',
 "isn't",
 'her',
 'hasn',
 'down',
 'under',
 'few',
 'am',
 'herself',
 'your',
 "wa

In [3]:
df = pd.read_csv('/datasets/toxic_comments.csv',index_col=0,encoding='utf-8')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


|   | text | toxic |
|---|------|-------|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [4]:
%%time
lst = LancasterStemmer()
stemmer = lst.stem

def lemmatize(text):
    # Lowercase, then replace every run of characters other than a-z and ' with one space.
    text = re.sub(r"[^a-z']+", ' ', text.lower())
    # Note: the stemmer receives the whole comment as a single string, so it can
    # only alter the end of that string. Stemming every word would need
    # ' '.join(stemmer(w) for w in text.split()) instead.
    return stemmer(text)

df['lemmas'] = df.text.apply(lemmatize)

# Print the first 99 characters of ten random comments.
for st in df.sample(10).lemmas:
    print(f" {st:.99s}")

 from jayanthv thanx i dont know wat a admin does anyway but please dont block me jayanthv answer on
 if i agree wha what do i have to do with this 
  thepalace com domain article states thepalace com domain is available but was actually purchased s
 instead of turning r
 having discussed this matter at length i think wikipedia will benefit from an absence by me for sev
 adding fr to templates please have a look at this diff to see how you should have added fr to a tem
  these quotes do not constitute evidence for the statement modern calculus has solved the mathemati
 jewish american don't worry madmax you did not drag me in i went voluntary unfortunately there are 
 about the fake history of a fake marine
 redirect talk jeopardy video games 
CPU times: user 4.18 s, sys: 106 ms, total: 4.28 s
Wall time: 4.28 s
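
The `lemmatize` helper above hands the whole comment to the stemmer as one string. A minimal sketch of stemming word by word instead, assuming NLTK is installed; `clean` and `stem_words` are hypothetical helper names that do not appear in the notebook:

```python
import re
from nltk.stem import LancasterStemmer

_stemmer = LancasterStemmer()

def clean(text):
    """Lowercase and keep only a-z and apostrophes; everything else becomes a space."""
    return re.sub(r"[^a-z']+", " ", text.lower()).strip()

def stem_words(text):
    """Stem each whitespace-separated token on its own."""
    return " ".join(_stemmer.stem(tok) for tok in clean(text).split())

sample = "The cats are RUNNING!"
whole = _stemmer.stem(clean(sample))   # the stemmer sees one long "word"
per_word = stem_words(sample)          # the stemmer sees four separate words
print(whole)
print(per_word)
```

Per-word stemming makes one stemmer call per token, so on about 159 thousand comments it will likely take noticeably longer than the single call per comment used above.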

In [5]:
%%time

# Fix the seed so the split is reproducible (the value itself is arbitrary).
tr, te = train_test_split(df, test_size=.25, shuffle=True, random_state=499)
X = ['lemmas']
y = 'toxic'


CPU times: user 7.62 ms, sys: 51 ms, total: 58.6 ms
Wall time: 56.9 ms
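
Only about one comment in ten is toxic, so a plain random split can leave the two parts with slightly different class ratios. A minimal sketch of a stratified split on a made-up frame; `toy`, `tr_toy` and `te_toy` are illustrative names, and the seed is arbitrary:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with roughly the same imbalance as the task data (10% positive).
toy = pd.DataFrame({
    "lemmas": [f"comment {i}" for i in range(1000)],
    "toxic": [1 if i % 10 == 0 else 0 for i in range(1000)],
})

tr_toy, te_toy = train_test_split(
    toy,
    test_size=0.25,
    shuffle=True,
    stratify=toy["toxic"],  # keep the class ratio equal in both parts
    random_state=42,        # arbitrary seed, only for reproducibility
)
print(tr_toy["toxic"].mean(), te_toy["toxic"].mean())
```

With `stratify` the share of positive rows is the same in the train and test parts, which makes F1 on the test part less sensitive to the luck of the split.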


## Training

In [6]:
%%time
tfidf = TfidfVectorizer(stop_words = stopwords,max_features = 10_000,min_df=1/5_000,ngram_range=(1,3))
vectors = tfidf.fit_transform(tr.lemmas)
vectors 

CPU times: user 25.1 s, sys: 1.58 s, total: 26.7 s
Wall time: 26.6 s


<119469x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 3330140 stored elements in Compressed Sparse Row format>
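
A float `min_df` is read by scikit-learn as a fraction of documents, not as a count. A quick check of what `1/5_000` means for the 119469 training rows reported above; the variable names are illustrative:

```python
import math

n_docs = 119_469      # rows in the training matrix shown above
min_df = 1 / 5_000    # fraction of documents, as passed to TfidfVectorizer

# A term is kept when its document frequency is at least min_df * n_docs,
# so the smallest integer count that passes is the ceiling of that product.
min_doc_count = math.ceil(min_df * n_docs)
print(min_doc_count)  # 24
```

So with these settings an n-gram has to occur in at least 24 training comments before it can compete for one of the 10,000 feature slots.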

In [None]:
%%time
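gs = GridSearchCV(
    LogisticRegression(random_state=4999, max_iter=2000),
    {'class_weight': ['balanced', None],
     'solver': ['newton-cg', 'lbfgs', 'liblinear']},
    # Binary F1 on the toxic class; for two classes 'f1_micro' equals accuracy.
    scoring='f1',
    cv=3)
gs.fit(vectors, tr[y])
# max_iter and random_state are not part of the grid, so pass them again explicitly.
lr = LogisticRegression(random_state=4999, max_iter=2000, **gs.best_params_)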
gs.cv_results_
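
In the cell above, TF-IDF is fitted on the whole training set before cross-validation, so every validation fold has already influenced the vocabulary and the IDF weights. A minimal sketch of one way to avoid that, assuming scikit-learn's `Pipeline`; the toy corpus and the names `pipe` and `search` are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; 1 marks a "toxic" text.
texts = [
    "you are an idiot", "thanks for the helpful edit", "stupid idiot go away",
    "please see the talk page", "what an idiot move", "nice work on the article",
    "go away you stupid troll", "i added a reference", "stupid edit you idiot",
    "the source looks reliable", "you troll go away", "thanks for fixing the typo",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is a pipeline step, so it is refitted on the training part of each fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=2000)),
])
search = GridSearchCV(
    pipe,
    {"clf__class_weight": ["balanced", None]},
    scoring="f1",
    cv=3,
)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search` object accepts raw text directly, for example `search.predict(["some new comment"])`, because the best pipeline carries its own vectorizer.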

In [None]:
te_vectors = tfidf.transform(te.lemmas)
lr.fit(vectors,tr[y])
pr = lr.predict(te_vectors)
for metric in [f1_score,accuracy_score,precision_score,recall_score]:
    print( f"{metric.__name__}:{round( metric(te[y],pr), 4)}\t" )
"confusion_matrix:\n",confusion_matrix(te[y],pr)


In [None]:
import time
del lr
del gs
time.sleep(5)
gc.collect()
time.sleep(5)

In [7]:
cbi = CatBoostClassifier(learning_rate = .075,n_estimators =8192 ,
                         bootstrap_type='Bernoulli',subsample=.2,
                         verbose= 128, eval_metric='F1',random_state=499)
cbi.fit(vectors,tr[y])

0:	learn: 0.3296584	total: 5.91s	remaining: 13h 27m 27s
128:	learn: 0.6231289	total: 1m 20s	remaining: 1h 24m 20s
256:	learn: 0.6803874	total: 2m 47s	remaining: 1h 26m 1s
384:	learn: 0.7072470	total: 4m 13s	remaining: 1h 25m 36s
512:	learn: 0.7257902	total: 5m 31s	remaining: 1h 22m 47s
640:	learn: 0.7370985	total: 6m 48s	remaining: 1h 20m 13s
768:	learn: 0.7473814	total: 7m 58s	remaining: 1h 16m 55s
896:	learn: 0.7561617	total: 9m 8s	remaining: 1h 14m 18s
1024:	learn: 0.7593365	total: 10m 17s	remaining: 1h 11m 57s
1152:	learn: 0.7642716	total: 11m 26s	remaining: 1h 9m 52s
1280:	learn: 0.7675067	total: 12m 35s	remaining: 1h 7m 55s
1408:	learn: 0.7697397	total: 13m 44s	remaining: 1h 6m 7s
1536:	learn: 0.7724731	total: 14m 52s	remaining: 1h 4m 25s
1664:	learn: 0.7744210	total: 16m 2s	remaining: 1h 2m 52s
1792:	learn: 0.7774822	total: 17m 10s	remaining: 1h 1m 18s
1920:	learn: 0.7785420	total: 18m 19s	remaining: 59m 48s
2048:	learn: 0.7805340	total: 19m 28s	remaining: 58m 21s
2176:	learn: 0

<catboost.core.CatBoostClassifier at 0x7fb135636250>

In [8]:
te_vectors = tfidf.transform(te.lemmas)
pr = cbi.predict(te_vectors)
for metric in [f1_score,accuracy_score,precision_score,recall_score]:
    print( f"{metric.__name__}:{round( metric(te[y],pr), 4)}\t" )
"confusion_matrix:\n",confusion_matrix(te[y],pr)

f1_score:0.7613	
accuracy_score:0.9576	
precision_score:0.8776	
recall_score:0.6722	


('confusion_matrix:\n',
 array([[35438,   376],
        [ 1314,  2695]]))
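
The four metrics printed above can be recomputed by hand from the confusion matrix, which is a useful sanity check on how precision and recall combine into F1. A short sketch using only the numbers shown above:

```python
# Confusion matrix as printed above: rows are true classes, columns are predictions.
tn, fp = 35438, 376
fn, tp = 1314, 2695

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(f1, 4), round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.7613 0.9576 0.8776 0.6722
```

Accuracy is high mainly because non-toxic comments dominate; the low recall (1314 toxic comments missed) is what holds F1 down.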

In [12]:
for t in np.arange(.2,.6,.025):
    cbi.set_probability_threshold(t)
    pr = cbi.predict(te_vectors)
    #confusion_matrix(te[y],pr)
    
    s = f"{round(t,2)} -->\t"
    for metric in [f1_score,accuracy_score,precision_score,recall_score]:
        s+=f"{metric.__name__}:{round( metric(te[y],pr), 4)}\t"
    print(s) 

0.2 -->	f1_score:0.7603	accuracy_score:0.9491	precision_score:0.7228	recall_score:0.8019	
0.22 -->	f1_score:0.7681	accuracy_score:0.9519	precision_score:0.7464	recall_score:0.7912	
0.25 -->	f1_score:0.7732	accuracy_score:0.954	precision_score:0.7683	recall_score:0.7782	
0.28 -->	f1_score:0.7727	accuracy_score:0.9547	precision_score:0.7811	recall_score:0.7645	
0.3 -->	f1_score:0.7752	accuracy_score:0.9559	precision_score:0.7958	recall_score:0.7556	
0.32 -->	f1_score:0.7757	accuracy_score:0.9566	precision_score:0.8081	recall_score:0.7458	
0.35 -->	f1_score:0.7744	accuracy_score:0.9569	precision_score:0.8182	recall_score:0.7351	
0.38 -->	f1_score:0.7735	accuracy_score:0.9573	precision_score:0.8291	recall_score:0.7249	
0.4 -->	f1_score:0.7737	accuracy_score:0.9579	precision_score:0.8423	recall_score:0.7154	
0.42 -->	f1_score:0.7723	accuracy_score:0.9581	precision_score:0.8522	recall_score:0.7062	
0.45 -->	f1_score:0.7691	accuracy_score:0.958	precision_score:0.8617	recall_score:0.6944	
0.47
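
The sweep above scores every threshold on the test set, so picking the best one from that table makes the reported F1 optimistic. A minimal sketch of choosing the threshold on a separate validation part instead; `f1_at`, `best_threshold` and the toy labels and probabilities are hypothetical and not taken from the notebook:

```python
def f1_at(y_true, proba, threshold):
    """F1 of the positive class when predicting 1 for proba >= threshold."""
    tp = sum(1 for y, p in zip(y_true, proba) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, proba) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, proba) if y == 1 and p < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, proba, thresholds):
    """Return the threshold with the highest F1 (the first one on ties)."""
    return max(thresholds, key=lambda t: f1_at(y_true, proba, t))

# Made-up validation labels and predicted probabilities of the positive class.
y_val = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
p_val = [0.05, 0.10, 0.30, 0.45, 0.35, 0.60, 0.80, 0.20, 0.40, 0.15]

grid = [round(0.2 + 0.025 * i, 3) for i in range(16)]  # 0.2 ... 0.575
t_star = best_threshold(y_val, p_val, grid)
print(t_star, round(f1_at(y_val, p_val, t_star), 4))
```

The chosen `t_star` would then be applied once to the test set, so the test score stays an honest estimate.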

In [9]:
%%time
tfidf = TfidfVectorizer(stop_words = stopwords,max_features = 10_000,min_df = 1/10_000,ngram_range=(1,3))
vectors = tfidf.fit_transform(df.lemmas)
vectors 

KeyboardInterrupt: 

In [None]:
text_options = {
    "tokenizers": [
        {
            "tokenizer_id": "Space",
            "delimiter": " ,",
            "lowercasing": "true",
            # "lemmatizing": "true",
            "split_by_set": "true",
        },
        {
            "tokenizer_id": "Sense",
            "separator_type": "BySense",
        },
    ],
    "dictionaries": [
        {
            "dictionary_id": "BiGram",
            "max_dictionary_size": "100000",
            "occurrence_lower_bound": "10",
            "gram_order": "2",
        },
        {
            "dictionary_id": "Word",
            "occurrence_lower_bound": "20",
            "gram_order": "1",
        },
    ],
    "feature_processing": {
        "default": [
            {
                "dictionaries_names": ["Word", "BiGram"],
                "feature_calcers": ["BoW"],
                "tokenizers_names": ["Sense"],
            }
        ]
    },
}
cbm = CatBoostClassifier(learning_rate = .075,n_estimators =8192 ,
                         bootstrap_type='Bernoulli',subsample=.2,
                         text_processing = text_options,
                         verbose= 128, eval_metric='F1',random_state=499)

In [None]:
p_tr = Pool(tr[X],tr[y], text_features = X ) 
p_te = Pool( te[X], text_features = X )
cbm.fit(p_tr)    


In [None]:
%%time
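# Declare the text column explicitly; otherwise CatBoost tries to parse it as numeric.
cbm.fit(tr[X], tr[y], text_features=X, verbose=128)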

In [None]:
pr = cbm.predict(p_te)
confusion_matrix(te[y],pr)

In [10]:
for t in np.arange(.2,.6,.025):
    cbm.set_probability_threshold(t)
    # Predict on the Pool so that 'lemmas' is passed as a text feature, not a numeric one.
    pr = cbm.predict(p_te)

    s = f"{round(t,2)} -->\t"
    for metric in [f1_score,accuracy_score,precision_score,recall_score]:
        s+=f"{metric.__name__}:{round( metric(te[y],pr), 4)}\t"
    print(s)

In [None]:
cbi.set_probability_threshold(.33)
# cbi was trained on TF-IDF vectors, so it must be given te_vectors, not the raw text column.
pr = cbi.predict(te_vectors)
for metric in [f1_score,accuracy_score,precision_score,recall_score]:
    print( f"{metric.__name__}:{round( metric(te[y],pr), 4)}\t" )
"confusion_matrix:\n",confusion_matrix(te[y],pr)


In [None]:
%%time
params_grid = {
    'learning_rate': [0.3, .1, .03, .01],
    'auto_class_weights': ['Balanced']
}
cbc = GridSearchCV(
    CatBoostClassifier(eval_metric='F1', text_features=X, verbose=100),
    params_grid,
    # Score the toxic class directly, matching the project metric.
    scoring='f1',
    cv=4)
cbc.fit(tr[X], tr[y])

In [None]:
cbc.cv_results_

In [None]:
pr = cbc.predict(te[X])

In [None]:
# metric_period sets how often metrics are computed during training.
model = CatBoostClassifier(eval_metric='F1', metric_period=50, text_features=X,
                           verbose=100, n_estimators=500, learning_rate=.03)

## Conclusions

## Review checklist

- [x]  Jupyter Notebook is open
- [ ]  All code runs without errors
- [ ]  Code cells are arranged in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written