# Project for Wikishop

The online store Wikishop («Викишоп») is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative (that is, non-toxic or toxic). You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from pymystem3 import Mystem
m = Mystem()

import warnings
warnings.filterwarnings("ignore")

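# None disables column truncation (pandas >= 1.0; older versions used -1, which newer ones reject)
pd.set_option('display.max_colwidth', None)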

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0
...,...,...
159566,""":::::And for the second time of asking, when your view completely contradicts the coverage in reliable sources, why should anyone care what you feel? You can't even give a consistent argument - is the opening only supposed to mention significant aspects, or the """"most significant"""" ones? \n\n""",0
159567,You should be ashamed of yourself \n\nThat is a horrible thing you put on my talk page. 128.61.19.93,0
159568,"Spitzer \n\nUmm, theres no actual article for prostitution ring. - Crunch Captain.",0
159569,And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.,0


The dataset contains about 159 thousand rows and 2 columns: the comment text and the toxicity label (0 for non-toxic, 1 for toxic). There are no missing values. Besides words, the text contains digits and extraneous symbols that should be removed. We will also lemmatize and tokenize the text.

In [4]:
print('Number of duplicates -', df.duplicated().sum())

Number of duplicates - 0


There are no duplicates.

In [5]:
df['toxic'].value_counts(normalize=True)

0    0.898321
1    0.101679
Name: toxic, dtype: float64

Toxic comments are far less common than non-toxic ones: about 10% versus 90% of the rows.
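
To make the imbalance concrete, here is a minimal sketch of the weights implied by the "balanced" heuristic, where each class is weighted by `n_samples / (n_classes * class_count)`. It works from the class shares printed above; the helper name `balanced_class_weights` is an assumption for illustration, not something from the notebook.

```python
def balanced_class_weights(shares):
    """Return per-class weights from class shares (fractions that sum to 1).

    With shares instead of counts, n_samples / (n_classes * count)
    simplifies to 1 / (n_classes * share).
    """
    n_classes = len(shares)
    return {label: 1.0 / (n_classes * share) for label, share in shares.items()}


# Class shares taken from the value_counts(normalize=True) output above.
shares = {0: 0.898321, 1: 0.101679}
weights = balanced_class_weights(shares)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Under this heuristic an error on a toxic comment costs roughly nine times as much as an error on a non-toxic one.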

Next, lemmatize the text.

In [6]:
df['text'] = df['text'].apply(lambda x: "".join(m.lemmatize(x)))
df

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27\n",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)\n",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.\n",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominationsTransport ""\n",0
4,"You, sir, are my hero. Any chance you remember what page that's on?\n",0
...,...,...
159566,""":::::And for the second time of asking, when your view completely contradicts the coverage in reliable sources, why should anyone care what you feel? You can't even give a consistent argument - is the opening only supposed to mention significant aspects, or the """"most significant"""" ones? \n\n""\n",0
159567,You should be ashamed of yourself \n\nThat is a horrible thing you put on my talk page. 128.61.19.93\n,0
159568,"Spitzer \n\nUmm, theres no actual article for prostitution ring. - Crunch Captain.\n",0
159569,And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.\n,0
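
Apart from the trailing newline, the rows above look essentially unchanged, which suggests that Mystem, a morphological analyzer built for Russian, leaves the English words as they are. Below is a minimal sketch of per-word lemmatization with a pluggable lemmatizer. The dictionary `TOY_LEMMAS` and both function names are hypothetical and exist only for illustration; for real English text one option would be to pass `nltk.stem.WordNetLemmatizer().lemmatize` after downloading the `wordnet` corpus.

```python
import re

# Toy lemma dictionary: a hypothetical stand-in for a real English lemmatizer.
TOY_LEMMAS = {"edits": "edit", "were": "be", "reverted": "revert", "voted": "vote"}


def toy_lemmatize_word(word):
    """Look the word up in the toy dictionary; unknown words pass through."""
    return TOY_LEMMAS.get(word, word)


def lemmatize_text(text, lemmatize_word):
    """Lowercase the text, keep alphabetic tokens, lemmatize each of them."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return " ".join(lemmatize_word(token) for token in tokens)


lemmatized = lemmatize_text("Why the edits were reverted?", toy_lemmatize_word)
print(lemmatized)  # why the edit be revert
```

Keeping the lemmatizer as a parameter makes it easy to swap the toy lookup for a real one without touching the rest of the pipeline.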


Next, remove extraneous characters with the `re` regular-expression module and convert the text to lowercase.

In [7]:
df['text'] = df['text'].apply(lambda x: " ".join(re.sub(r'[^a-zA-Z ]', ' ', x).split()))
df

Unnamed: 0,text,toxic
0,Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They weren t vandalisms just closure on some GAs after I voted at New York Dolls FAC And please don t remove the template from the talk page since I m retired now,0
1,D aww He matches this background colour I m seemingly stuck with Thanks talk January UTC,0
2,Hey man I m really not trying to edit war It s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info,0
3,More I can t make any real suggestions on improvement I wondered if the section statistics should be later on or a subsection of types of accidents I think the references may need tidying so that they are all in the exact same format ie date format etc I can do that later on if no one else does first if you have any preferences for formatting style on references or want to do it yourself please let me know There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up It s listed in the relevant form eg Wikipedia Good article nominationsTransport,0
4,You sir are my hero Any chance you remember what page that s on,0
...,...,...
159566,And for the second time of asking when your view completely contradicts the coverage in reliable sources why should anyone care what you feel You can t even give a consistent argument is the opening only supposed to mention significant aspects or the most significant ones,0
159567,You should be ashamed of yourself That is a horrible thing you put on my talk page,0
159568,Spitzer Umm theres no actual article for prostitution ring Crunch Captain,0
159569,And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it,0


In [8]:
df['text'] = df['text'].apply(lambda x: x.lower())
df

Unnamed: 0,text,toxic
0,explanation why the edits made under my username hardcore metallica fan were reverted they weren t vandalisms just closure on some gas after i voted at new york dolls fac and please don t remove the template from the talk page since i m retired now,0
1,d aww he matches this background colour i m seemingly stuck with thanks talk january utc,0
2,hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info,0
3,more i can t make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if no one else does first if you have any preferences for formatting style on references or want to do it yourself please let me know there appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up it s listed in the relevant form eg wikipedia good article nominationstransport,0
4,you sir are my hero any chance you remember what page that s on,0
...,...,...
159566,and for the second time of asking when your view completely contradicts the coverage in reliable sources why should anyone care what you feel you can t even give a consistent argument is the opening only supposed to mention significant aspects or the most significant ones,0
159567,you should be ashamed of yourself that is a horrible thing you put on my talk page,0
159568,spitzer umm theres no actual article for prostitution ring crunch captain,0
159569,and it looks like it was actually you who put on the speedy to have the first version deleted now that i look at it,0
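
The two cleaning cells above can also be folded into one helper. This is a sketch using the same regular expression; the function name `clean_text` is an assumption, and the sample string is row 1 of the dataset as displayed earlier.

```python
import re


def clean_text(text):
    """Replace every non-letter with a space, collapse whitespace, lowercase."""
    letters_only = re.sub(r"[^a-zA-Z ]", " ", text)
    return " ".join(letters_only.split()).lower()


sample = ("D'aww! He matches this background colour I'm seemingly stuck with. "
          "Thanks. (talk) 21:51, January 11, 2016 (UTC)")
cleaned = clean_text(sample)
print(cleaned)
# d aww he matches this background colour i m seemingly stuck with thanks talk january utc
```

Note that contractions are split into fragments such as `i m` and `d aww`; most of these fragments are later dropped as stop words or single-character tokens.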


In [9]:
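# A set gives constant-time membership checks
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    # Split on whitespace and keep only tokens that are not stop words (returns a list)
    clean_text = [w for w in text.split() if w not in stop_words]
    return clean_text

df['text'] = df['text'].apply(remove_stopwords)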
df

Unnamed: 0,text,toxic
0,"[explanation, edits, made, username, hardcore, metallica, fan, reverted, vandalisms, closure, gas, voted, new, york, dolls, fac, please, remove, template, talk, page, since, retired]",0
1,"[aww, matches, background, colour, seemingly, stuck, thanks, talk, january, utc]",0
2,"[hey, man, really, trying, edit, war, guy, constantly, removing, relevant, information, talking, edits, instead, talk, page, seems, care, formatting, actual, info]",0
3,"[make, real, suggestions, improvement, wondered, section, statistics, later, subsection, types, accidents, think, references, may, need, tidying, exact, format, ie, date, format, etc, later, one, else, first, preferences, formatting, style, references, want, please, let, know, appears, backlog, articles, review, guess, may, delay, reviewer, turns, listed, relevant, form, eg, wikipedia, good, article, nominationstransport]",0
4,"[sir, hero, chance, remember, page]",0
...,...,...
159566,"[second, time, asking, view, completely, contradicts, coverage, reliable, sources, anyone, care, feel, even, give, consistent, argument, opening, supposed, mention, significant, aspects, significant, ones]",0
159567,"[ashamed, horrible, thing, put, talk, page]",0
159568,"[spitzer, umm, theres, actual, article, prostitution, ring, crunch, captain]",0
159569,"[looks, like, actually, put, speedy, first, version, deleted, look]",0
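
The cell above stores each comment as a list of tokens. A possible variant, sketched here, joins the tokens back into a single string so that the column stays plain text for the vectorizer. `STOP_WORDS` is a small hand-picked subset that stands in for the NLTK list, and the function takes the stop-word set as a parameter, which differs from the notebook's version.

```python
# Small illustrative subset; the notebook uses the full NLTK English list.
STOP_WORDS = {"you", "are", "my", "any", "what", "that", "s", "on"}


def remove_stopwords(text, stop_words):
    """Drop stop words and return the remaining tokens as one string."""
    return " ".join(w for w in text.split() if w not in stop_words)


row = "you sir are my hero any chance you remember what page that s on"
filtered = remove_stopwords(row, STOP_WORDS)
print(filtered)  # sir hero chance remember page
```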


## Training

Split the data into the feature and the target, and then into training and test sets.

In [10]:
features = df['text']
target = df['toxic']

features_train, features_test, target_train, target_test = train_test_split(features, 
                                                                            target, 
                                                                            train_size=0.80, 
                                                                            test_size=0.20, 
                                                                            random_state=123)

In [11]:
print('Size of features_train:', features_train.shape)
print('Size of target_train:', target_train.shape)
print('Size of features_test:', features_test.shape)
print('Size of target_test:', target_test.shape)

Size of features_train: (127656,)
Size of target_train: (127656,)
Size of features_test: (31915,)
Size of target_test: (31915,)


The split succeeded: 80% of the rows went to the training set and 20% to the test set.
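
Because only about 10% of the comments are toxic, it may help to keep that share equal in both parts. The sketch below shows the idea of a stratified split in plain Python; the function name and the toy labels are assumptions. In scikit-learn the same effect is available by passing `stratify=target` to `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=123):
    """Split row indices so each class keeps the same share in train and test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 90/10 imbalance, similar in spirit to the dataset.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```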

Convert the token lists to Unicode strings so that the vectorizer can process them.

In [12]:
corpus_train = features_train.astype('U')
corpus_test = features_test.astype('U')
corpus_test[:3]

50446    ['redirect', 'names']                                                                                                                                                 
81571    ['sinebot', 'please', 'read', 'comments']                                                                                                                             
25983    ['thank', 'good', 'answer', 'realized', 'reading', 'question', 'stupid', 'probably', 'doublechecking', 'names', 'adding', 'appropriate', 'redirects', 'thank', 'work']
Name: text, dtype: object

Compute TF-IDF for the text corpus. The vectorizer is fitted on the training set only and then applied to both sets.

In [13]:
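# Recent scikit-learn versions accept 'english', a list, or None here, so convert the set to a list
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))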
tf_idf = count_tf_idf.fit(corpus_train)

In [14]:
train_X = tf_idf.transform(corpus_train)
test_X = tf_idf.transform(corpus_test)
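
As an illustration of what the vectorizer computes, here is a small pure-Python sketch of TF-IDF with smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which are the defaults of `TfidfVectorizer`. The function name and the three short documents are assumptions; unlike the real vectorizer, this sketch tokenizes by whitespace only and keeps single-character tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(value * value for value in raw.values())) or 1.0
        vectors.append({term: value / norm for term, value in raw.items()})
    return vectors


docs = ["redirect names", "sinebot please read comments", "thank good answer names"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first document `redirect` gets a higher weight than `names`, because `names` also occurs in another document and is therefore less distinctive.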

### LogisticRegression

In [15]:
model = LogisticRegression(random_state=123, class_weight='balanced') 
model.fit(train_X, target_train)
predictions = model.predict(test_X)

f1 = f1_score(target_test, predictions)
print('LogisticRegression:', f1)

LogisticRegression: 0.7500346981263011


A decent result: it just meets the task requirement of F1 ≥ 0.75.
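
For reference, F1 is the harmonic mean of precision and recall on the positive (toxic) class. The sketch below computes it by hand on a tiny made-up example; the function name and the label vectors are assumptions and are not taken from the model's predictions.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Because true negatives do not enter the formula, F1 is not inflated by the large non-toxic class the way accuracy would be.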

### RandomForestClassifier

First, try it without tuning hyperparameters.

In [16]:
model = RandomForestClassifier(random_state=123, class_weight = 'balanced') 
model.fit(train_X, target_train)
predictions = model.predict(test_X)

f1 = f1_score(target_test, predictions)
print('RandomForestClassifier', f1)

RandomForestClassifier 0.6048988285410011


In [17]:
%%time
results_rfc = []

for depth in range(1,11):
    
    for estimator in range(10, 101, 10):
        
        model = RandomForestClassifier(random_state=123, 
                                       class_weight = 'balanced', 
                                       n_estimators=estimator, 
                                       max_depth=depth) 
        
        model.fit(train_X, target_train)
        predictions = model.predict(test_X)

        f1 = f1_score(target_test, predictions)
        results_rfc.append({'Model': 'RandomForestClassifier', 
                            'Hyperparameters': {'random_state': 123, 
                                                'class_weight': 'balanced',
                                                'n_estimators': estimator, 
                                                'max_depth':depth}, 
                            'F1 score': f1})

CPU times: user 4min 14s, sys: 0 ns, total: 4min 14s
Wall time: 4min 16s


In [18]:
pd.DataFrame(results_rfc)

Unnamed: 0,Model,Hyperparameters,F1 score
0,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 10, 'max_depth': 1}",0.191673
1,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 20, 'max_depth': 1}",0.202914
2,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 30, 'max_depth': 1}",0.214708
3,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 40, 'max_depth': 1}",0.228027
4,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 50, 'max_depth': 1}",0.235229
...,...,...,...
95,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 60, 'max_depth': 10}",0.356708
96,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 70, 'max_depth': 10}",0.345813
97,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 80, 'max_depth': 10}",0.342980
98,RandomForestClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'n_estimators': 90, 'max_depth': 10}",0.342681


The results are much worse than for logistic regression.
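
The sweep above scores every configuration on the test set, so the best configuration would be picked with the same data that is meant to give the final estimate. A common alternative is to score candidates on a validation part held out from the training data and to touch the test set only once. The sketch below shows the search loop itself; `grid_search` and `toy_validation_score` are hypothetical names, and the toy scorer merely stands in for "fit the model on the training part and return F1 on the validation part".

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Hypothetical scorer, not a real measurement: peaks at max_depth=9, n_estimators=50.
    depth_penalty = abs(params["max_depth"] - 9) * 0.05
    trees_penalty = abs(params["n_estimators"] - 50) * 0.001
    return 1.0 - depth_penalty - trees_penalty


param_grid = {"max_depth": range(1, 11), "n_estimators": range(10, 101, 10)}
best_params, best_score = grid_search(param_grid, toy_validation_score)
print(best_params, best_score)
```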

### DecisionTreeClassifier

Try a decision tree without tuning hyperparameters.

In [19]:
model = DecisionTreeClassifier(random_state=123, class_weight = 'balanced')
model.fit(train_X, target_train)
predictions = model.predict(test_X)

f1 = f1_score(target_test, predictions)
print('DecisionTreeClassifier', f1)

DecisionTreeClassifier 0.6431767337807606


Now try a decision tree with a sweep over `max_depth`.

In [20]:
results_dtc = []

for depth in range(1,11):
    model = DecisionTreeClassifier(random_state=123, max_depth=depth, class_weight = 'balanced')
    model.fit(train_X, target_train)
    predictions = model.predict(test_X)

    f1 = f1_score(target_test, predictions)

    results_dtc.append({'Model': 'DecisionTreeClassifier', 
                        'Hyperparameters': {'random_state': 123, 
                                            'class_weight': 'balanced', 
                                            'max_depth': depth},
                        'F1 score': f1})

In [21]:
pd.DataFrame(results_dtc)

Unnamed: 0,Model,Hyperparameters,F1 score
0,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 1}",0.28378
1,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 2}",0.372474
2,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 3}",0.372474
3,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 4}",0.431239
4,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 5}",0.43276
5,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 6}",0.474365
6,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 7}",0.506631
7,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 8}",0.536789
8,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 9}",0.55809
9,DecisionTreeClassifier,"{'random_state': 123, 'class_weight': 'balanced', 'max_depth': 10}",0.556175


The results are slightly better than for the random forest, but they still fall short of the required F1.
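
Picking the best row from such a results table can be done with `max` and a key function. The sketch below re-enters the F1 values from the table above by hand; the flattened record layout is an assumption made for brevity.

```python
# F1 scores copied from the decision tree table above, keyed by max_depth.
f1_by_depth = {
    1: 0.28378, 2: 0.372474, 3: 0.372474, 4: 0.431239, 5: 0.43276,
    6: 0.474365, 7: 0.506631, 8: 0.536789, 9: 0.55809, 10: 0.556175,
}

results = [{"max_depth": depth, "f1": f1} for depth, f1 in f1_by_depth.items()]
best = max(results, key=lambda row: row["f1"])
print(best)  # {'max_depth': 9, 'f1': 0.55809}
```

Within this sweep the best tree is still below the unrestricted tree's 0.643 and well below the 0.75 target.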

## Conclusions

We prepared the text for training and testing the models. There were no duplicates. We ran a lemmatization step, removed extraneous characters and stop words, and converted the text to lowercase. Class 0 turned out to be far more common than class 1, so we set `class_weight='balanced'` when training the models. Of the models trained and tested, logistic regression performed best and reached the target F1 of 0.75.