<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Random Forest</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which each edit is labelled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [3]:
!/opt/conda/bin/python -m pip install spacy
# The model is fetched through spaCy's own download command;
# `pip install download en_core_web_sm` would install an unrelated package named `download`.
!/opt/conda/bin/python -m spacy download en_core_web_sm

import re
import warnings

import en_core_web_sm
import nltk
import numpy as np
import pandas as pd
import spacy
from catboost import CatBoostClassifier
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from tqdm import notebook

# Small English spaCy model, used for lemmatization.
nlp = spacy.load("en_core_web_sm")
warnings.filterwarnings('ignore')


In [4]:
data = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [6]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159446,""":::::And for the second time of asking, when ...",0
159447,You should be ashamed of yourself \n\nThat is ...,0
159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159449,And it looks like it was actually you who put ...,0


In [7]:
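def clear_text(text):
    """Keep only Latin letters and collapse repeated whitespace."""
    # Every character that is not a Latin letter or a space becomes a space.
    cleaned = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(cleaned.split())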

In [8]:
data['text'] = data['text'].apply(clear_text)

In [9]:
data.text

0         Explanation Why the edits made under my userna...
1         D aww He matches this background colour I m se...
2         Hey man I m really not trying to edit war It s...
3         More I can t make any real suggestions on impr...
4         You sir are my hero Any chance you remember wh...
                                ...                        
159446    And for the second time of asking when your vi...
159447    You should be ashamed of yourself That is a ho...
159448    Spitzer Umm theres no actual article for prost...
159449    And it looks like it was actually you who put ...
159450    And I really don t think you understand I came...
Name: text, Length: 159292, dtype: object
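The output above shows that the cleaning step splits contractions ("I'm" becomes "I m"), because the apostrophe is replaced by a space. A minimal sketch of an alternative is given here; `clear_text_keep_apostrophes` is a hypothetical helper that is not part of this notebook, and whether it improves the final score has not been checked.

```python
import re


def clear_text_keep_apostrophes(text):
    """Hypothetical variant of clear_text that keeps apostrophes inside words."""
    # Keep Latin letters, spaces and apostrophes; everything else becomes a space.
    cleaned = re.sub(r"[^a-zA-Z' ]", " ", text)
    # Drop apostrophes that are not surrounded by letters (quotation marks etc.).
    cleaned = re.sub(r"(?<![a-zA-Z])'|'(?![a-zA-Z])", " ", cleaned)
    return " ".join(cleaned.split())


sample = "Hey man, I'm really not trying to edit war."

# Same rule as clear_text in the notebook: letters and spaces only.
original_style = " ".join(re.sub(r"[^a-zA-Z ]", " ", sample).split())
variant_style = clear_text_keep_apostrophes(sample)

print(original_style)
print(variant_style)
```

With contractions intact, the lemmatizer receives "I'm" as written instead of the stray tokens "I" and "m".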

In [10]:
nlp = en_core_web_sm.load()

def lemmetaze(text):
    """Return the text with every token replaced by its spaCy lemma."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [11]:
sentence1 = "The striped bats are hanging on their feet for best"
sentence2 = "you should be ashamed of yourself went worked"
df_my = pd.DataFrame([sentence1, sentence2], columns = ['text'])

print(df_my)


print(df_my['text'].apply(lemmetaze))

                                                text
0  The striped bats are hanging on their feet for...
1      you should be ashamed of yourself went worked
0    the stripe bat be hang on their foot for good
1        you should be ashamed of yourself go work
Name: text, dtype: object


In [13]:
data['text'] = data['text'].apply(lemmetaze)

In [14]:
data

Unnamed: 0,text,toxic
0,explanation why the edit make under my usernam...,0
1,d aww he match this background colour I m seem...,0
2,hey man I m really not try to edit war it s ju...,0
3,More I can t make any real suggestion on impro...,0
4,you sir be my hero any chance you remember wha...,0
...,...,...
159446,and for the second time of ask when your view ...,0
159447,you should be ashamed of yourself that be a ho...,0
159448,Spitzer Umm there s no actual article for pros...,0
159449,and it look like it be actually you who put on...,0


In [15]:
features = data['text']
target = data['toxic']

## Training

In [16]:
features_train, features_test, target_train, target_test = train_test_split(features, target, 
                                                                            random_state=42,
                                                                            test_size=0.25,
                                                                            stratify=target)
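The `stratify=target` argument keeps the share of toxic comments the same in the training and test parts, which matters when one class is rare. The sketch below illustrates this on synthetic labels; the 10% positive share and the toy texts are assumptions made for the example, not figures taken from the dataset.

```python
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 20 positive and 180 negative examples.
toy_labels = [1] * 20 + [0] * 180
toy_texts = [f"comment {i}" for i in range(len(toy_labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels,
    random_state=42,
    test_size=0.25,
    stratify=toy_labels,  # preserve the class ratio in both parts
)

train_share = sum(y_tr) / len(y_tr)
test_share = sum(y_te) / len(y_te)
print(f"positive share: train={train_share:.3f}, test={test_share:.3f}")
```

The same check on the real data would be `target_train.mean()` against `target_test.mean()`.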

In [17]:
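nltk.download('stopwords')

# NLTK stop words are passed to CountVectorizer inside the pipelines below.
# Recent scikit-learn releases validate `stop_words` and expect a list, not a set;
# there, use list(nltk_stopwords.words('english')) instead.
stopwords = set(nltk_stopwords.words('english'))

# Standalone TF-IDF step, disabled because the pipelines below do the same work.
#count_tf_idf = TfidfVectorizer(stop_words=stopwords)
#tf_idf_train = count_tf_idf.fit_transform(features_train)
#tf_idf_test = count_tf_idf.transform(features_test)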

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<div class="alert alert-info"> <b>Student comment:</b> I commented this out because the same steps are performed later inside the pipeline.</div>
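Commenting out the standalone vectorizer is safe because `CountVectorizer` followed by `TfidfTransformer` yields the same matrix as a single `TfidfVectorizer` with matching settings. The sketch below checks this on three toy sentences (an assumption for illustration), using default settings without stop words.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

toy_docs = [
    "you be my hero",
    "you should be ashamed of yourself",
    "hey man I be really not try to edit war",
]

# Two-step variant, as used in the pipelines of this notebook.
two_step = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer())])
# One-step variant, as in the commented-out code.
one_step = TfidfVectorizer()

matrix_two_step = two_step.fit_transform(toy_docs)
matrix_one_step = one_step.fit_transform(toy_docs)

same_matrix = np.allclose(matrix_two_step.toarray(), matrix_one_step.toarray())
print("identical TF-IDF matrices:", same_matrix)
```

Keeping the vectorizer inside the pipeline also means each cross-validation fold fits its vocabulary and IDF weights on that fold's training part only.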

### Logistic Regression

In [19]:
# NOTE: with the default lbfgs solver, LogisticRegression supports only the l2 penalty,
# so the l1 candidates fail to fit and score NaN; only the l2 candidates are compared.
params = {'model__C': [5, 10],
          'model__penalty': ['l1', 'l2']}
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=stopwords, ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer()),
    ('model', LogisticRegression(random_state=42))])
grid = GridSearchCV(pipeline, cv=5, n_jobs=-1, param_grid=params, scoring='f1')
grid.fit(features_train, target_train)

print("Best parameters:", grid.best_params_)
print("F1:", grid.best_score_)

Best parameters: {'model__C': 10, 'model__penalty': 'l2'}
F1: 0.7753037963091165
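The grid above lists both `l1` and `l2`, but the default `lbfgs` solver accepts only `l2`, so the `l1` candidates are never really evaluated. A minimal sketch of a grid where both penalties are fitted follows, assuming the `liblinear` solver is acceptable. The eight toy sentences and their labels are invented for the example, and the numbers it prints say nothing about the real dataset.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = toxic, 0 = not toxic.
toy_texts = [
    "you be an idiot", "stupid idiot go away",
    "shut up you moron", "what a stupid moron",
    "thank you for the edit", "nice work on the article",
    "please see the talk page", "I agree with the change",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # liblinear supports both l1 and l2 penalties.
    ("model", LogisticRegression(solver="liblinear", random_state=42)),
])
toy_params = {"model__C": [5, 10], "model__penalty": ["l1", "l2"]}

# error_score="raise" makes an unsupported combination fail loudly
# instead of being silently scored as NaN.
toy_grid = GridSearchCV(toy_pipeline, param_grid=toy_params, cv=2,
                        scoring="f1", error_score="raise")
toy_grid.fit(toy_texts, toy_labels)

all_scored = all(not math.isnan(s)
                 for s in toy_grid.cv_results_["mean_test_score"])
print("every candidate scored:", all_scored)
print("best parameters on toy data:", toy_grid.best_params_)
```

On the full dataset `solver="saga"` is another option that accepts both penalties, though it may need more iterations to converge.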


### Random Forest

In [30]:
params_rf = {'classifier__n_estimators': [20, 30, 50],
             'classifier__max_depth': [2, 4],
             'classifier__min_samples_leaf': [2, 4]}

In [31]:
pipeline_rf = Pipeline([
    ('vect', CountVectorizer(stop_words=stopwords, ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer()),
    ('classifier', RandomForestClassifier(random_state=42))])

In [32]:
grid_rf = GridSearchCV(pipeline_rf, cv=5, param_grid=params_rf, scoring='f1')
grid_rf.fit(features_train, target_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
          

In [33]:
print("Best parameters:", grid_rf.best_params_)
print("F1:", grid_rf.best_score_)

Best parameters: {'classifier__max_depth': 2, 'classifier__min_samples_leaf': 2, 'classifier__n_estimators': 20}
F1: 0.0
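An F1 of exactly 0.0 means the model produced no true positives. A plausible explanation, not verified in this notebook, is that very shallow trees (`max_depth` of 2 or 4) on sparse TF-IDF features with a rare positive class always predict the majority class. The sketch below uses a hand-written F1 on invented labels to show how such a model can look accurate and still score zero.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # With no positives at all, F1 is undefined; report 0.0 by convention.
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Invented example: 1 toxic comment out of 10.
y_true = [0] * 9 + [1]
always_negative = [0] * 10  # a model that never predicts "toxic"

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
f1_all_negative = f1_from_labels(y_true, always_negative)

print("accuracy:", accuracy)
print("F1:", f1_all_negative)
```

If this is what happens here, deeper trees or `class_weight='balanced'` in `RandomForestClassifier` would be natural things to try.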


## Conclusions

In [36]:
grid.best_estimator_

Pipeline(steps=[('vect',
                 CountVectorizer(stop_words={'a', 'about', 'above', 'after',
                                             'again', 'against', 'ain', 'all',
                                             'am', 'an', 'and', 'any', 'are',
                                             'aren', "aren't", 'as', 'at', 'be',
                                             'because', 'been', 'before',
                                             'being', 'below', 'between',
                                             'both', 'but', 'by', 'can',
                                             'couldn', "couldn't", ...})),
                ('tfidf', TfidfTransformer()),
                ('model', LogisticRegression(C=10, random_state=42))])

In [37]:
pred = grid.predict(features_test)
# f1_score expects the true labels first, then the predictions.
f1_score(target_test, pred)

0.7774242633835938

<b>Conclusion</b>

In this project:
<br>We explored the data, stripped non-letter characters from the text and lemmatized it. Stop words are removed inside the pipeline.
We trained two models, logistic regression and a random forest. Based on 5-fold cross-validation on the training set we chose logistic regression, which reached F1 ≈ 0.775, while the random forest scored F1 = 0.0 on the tested grid. On the held-out test set the chosen model scored F1 ≈ 0.777, which meets the required threshold of 0.75.