# A new service for Wikishop («Викишоп»)

The online store Wikishop («Викишоп») is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative (toxic). A dataset with toxicity labels for the edits is available.

**Goal**

Identify the best model for this text classification task.

## Preparation

In [39]:
# pandas for tabular data
import pandas as pd

# os for working with files and directories
import os

# regular expressions for text cleaning
import re

# NumPy
import numpy as np

# NLTK (needed for nltk.download and the English stop-word list)
import nltk
from nltk.corpus import stopwords

# scikit-learn
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
from sklearn.feature_extraction.text import TfidfVectorizer

# LightGBM
from lightgbm import LGBMClassifier

nltk.download('stopwords')

pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nikka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [40]:
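# helper that loads the dataset either from a local path or from the server

def file_loader(pth1, pth2):
    # try the local path first, then the server path
    if os.path.exists(pth1):
        return pd.read_csv(pth1)
    elif os.path.exists(pth2):
        return pd.read_csv(pth2)
    else:
        # fail loudly instead of printing a message and returning None
        raise FileNotFoundError(f'Dataset not found at {pth1} or {pth2}')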

In [41]:
df = file_loader('toxic_comments.csv', '/datasets/toxic_comments.csv')

In [42]:
df.head(15)

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0
5,"""\n\nCongratulations from me as well, use the tools well. · talk """,0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,"Your vandalism to the Matt Shirvington article has been reverted. Please don't do it again, or you will be banned.",0
8,"Sorry if the word 'nonsense' was offensive to you. Anyway, I'm not intending to write anything in the article(wow they would jump on me for vandalism), I'm merely requesting that it be more encyclopedic so one can use it for school as a reference. I have been to the selective breeding page but it's almost a stub. It points to 'animal breeding' which is a short messy article that gives you no info. There must be someone around with expertise in eugenics? 93.161.107.169",0
9,alignment on this subject and which are contrary to those of DuLithgow,0


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [44]:
df.duplicated().sum()

0

In [45]:
df['toxic'].value_counts(1)

0   0.90
1   0.10
Name: toxic, dtype: float64

The data is in good shape: there are no missing values or duplicates, and the column types are appropriate. The classes are heavily imbalanced: about 90% of the comments are non-toxic (class 0) and only about 10% are toxic (class 1).
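The `class_weight='balanced'` option used for the models later in this notebook compensates for this imbalance. The sketch below shows the heuristic that scikit-learn documents for that option, `n_samples / (n_classes * class_count)`, in plain Python. The helper name `balanced_class_weights` and the toy labels are illustrative assumptions, not part of the notebook or the real dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels that mimic the roughly 90/10 split reported above (assumption).
toy_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

With a split like this, an error on a toxic comment costs roughly nine times as much as an error on a non-toxic one, which pushes the model away from always predicting the majority class.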

In [46]:
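# English stop words, passed to the vectorizer later on
stop_words = set(stopwords.words('english'))


def preprocess_text(text):
    # replace every character that is not a Latin letter or a space with a space
    result = re.sub(r'[^A-Za-z ]', ' ', text)
    # collapse runs of whitespace into single spaces
    result = result.split()
    result = " ".join(result)
    return result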
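To make the effect of this cleaning step concrete, here is a small self-contained sketch that applies the same regular expression to a made-up sample string. It also includes a hypothetical variant, `preprocess_keep_apostrophes`, which is not in the notebook and only illustrates one possible alternative.

```python
import re


def preprocess_text(text):
    # Same logic as the notebook's function: keep Latin letters and spaces only.
    result = re.sub(r'[^A-Za-z ]', ' ', text)
    return " ".join(result.split())


def preprocess_keep_apostrophes(text):
    # Hypothetical variant (not in the notebook): apostrophes are kept as well.
    result = re.sub(r"[^A-Za-z' ]", ' ', text)
    return " ".join(result.split())


# Made-up sample with punctuation, a newline and an IP-like suffix.
sample = "Please don't remove the template!\n89.205.38.27"
cleaned = preprocess_text(sample)
cleaned_alt = preprocess_keep_apostrophes(sample)
print(cleaned)
print(cleaned_alt)
```

The notebook's version splits contractions such as "don't" into "don" and "t". With the default `token_pattern` of `TfidfVectorizer`, single-character tokens are dropped anyway, so the stray "t" and "s" fragments should not reach the feature matrix.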

In [47]:
%%time

df['text'] = df['text'].apply(preprocess_text)

Wall time: 2.12 s


In [48]:
df

Unnamed: 0,text,toxic
0,Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They weren t vandalisms just closure on some GAs after I voted at New York Dolls FAC And please don t remove the template from the talk page since I m retired now,0
1,D aww He matches this background colour I m seemingly stuck with Thanks talk January UTC,0
2,Hey man I m really not trying to edit war It s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info,0
3,More I can t make any real suggestions on improvement I wondered if the section statistics should be later on or a subsection of types of accidents I think the references may need tidying so that they are all in the exact same format ie date format etc I can do that later on if no one else does first if you have any preferences for formatting style on references or want to do it yourself please let me know There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up It s listed in the relevant form eg Wikipedia Good article nominations Transport,0
4,You sir are my hero Any chance you remember what page that s on,0
...,...,...
159566,And for the second time of asking when your view completely contradicts the coverage in reliable sources why should anyone care what you feel You can t even give a consistent argument is the opening only supposed to mention significant aspects or the most significant ones,0
159567,You should be ashamed of yourself That is a horrible thing you put on my talk page,0
159568,Spitzer Umm theres no actual article for prostitution ring Crunch Captain,0
159569,And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it,0


In [49]:
corpus = df['text'].values

In [50]:
corpus.shape

(159571,)

In [51]:
features = corpus
target = df['toxic']

In [52]:
features[5]

'Congratulations from me as well use the tools well talk'

In [53]:
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.5)
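The split above passes neither `random_state` nor `stratify`, so each run produces a different partition and the share of toxic comments can drift slightly between the two halves. `train_test_split` accepts both parameters. The plain-Python sketch below only illustrates what stratification does; the function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.5, seed=0):
    """Return (train_idx, test_idx) that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle inside each class, then cut each class at the same ratio.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 90/10 class ratio (assumption, not the real dataset).
toy_labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels, test_size=0.5, seed=1982)
train_share = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

Fixing the seed makes the split repeatable, and splitting per class keeps the toxic share identical in both halves of this toy example.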

In [54]:
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))  # recent scikit-learn versions expect a list, not a set
train_features = count_tf_idf.fit_transform(train_features)
test_features = count_tf_idf.transform(test_features)

In [55]:
train_features.shape, test_features.shape

((79785, 113400), (79786, 113400))

In [56]:
print(train_features[5])

  (0, 104736)	0.3557795636177494
  (0, 3158)	0.2682745924973377
  (0, 17862)	0.38694657865112486
  (0, 41126)	0.4200836968515402
  (0, 24307)	0.29699644344715337
  (0, 73630)	0.5647832571599587
  (0, 19839)	0.2608920524922392
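Each line of the sparse row printed above pairs a (row, column) index with a TF-IDF weight. As a rough illustration of where such weights come from, the sketch below reimplements the default scheme documented for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalisation) in plain Python. It is a simplification: tokens are split on whitespace, no stop words are removed, and the function name `tfidf_rows` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation (simplified sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: tf[t] * idf[t] for t in tf}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


toy_docs = ["thanks for the edit", "revert the edit", "thanks a lot"]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents receive a larger weight, and every row has unit length, which is why the weights in a single comment stay below 1.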


## Model training and hyperparameter tuning

1. Decision tree

In [57]:
%%time


model = DecisionTreeClassifier(random_state=1982)
depth_range = list(range(1, 21))  # max_depth must be at least 1, so the range starts at 1
params = {'max_depth': depth_range}
tree_model = RandomizedSearchCV(model, params, n_iter=3, scoring='f1', verbose=1, random_state=1982)
tree_model.fit(train_features, train_target)
tree_best_params = tree_model.best_params_
tree_best_params

Fitting 5 folds for each of 3 candidates, totalling 15 fits
Wall time: 2min 29s


{'max_depth': 13}
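The log line "Fitting 5 folds for each of 3 candidates, totalling 15 fits" follows from how a randomized search works: it draws `n_iter` parameter combinations from the grid and cross-validates each of them. The sketch below mimics only the sampling step in plain Python. The helper `sample_candidates` is hypothetical, and its draws will not match the ones scikit-learn makes for the same seed.

```python
import itertools
import random


def sample_candidates(param_grid, n_iter, seed=0):
    """Draw n_iter distinct parameter combinations from a grid of lists."""
    names = sorted(param_grid)
    grid = [dict(zip(names, values))
            for values in itertools.product(*(param_grid[n] for n in names))]
    rng = random.Random(seed)
    # Sampling without replacement, so no combination is evaluated twice.
    return rng.sample(grid, min(n_iter, len(grid)))


param_grid = {'max_depth': list(range(1, 21))}
candidates = sample_candidates(param_grid, n_iter=3, seed=1982)
n_folds = 5  # number of folds reported in the search log above
total_fits = len(candidates) * n_folds
print(candidates, total_fits)
```

With only 3 of 20 depths tried, the reported best depth is the best among the sampled candidates, not necessarily the best in the whole range.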

In [58]:
tree_model = DecisionTreeClassifier(**tree_best_params, class_weight='balanced', random_state=1982)

2. Random forest

In [59]:
%%time

model = RandomForestClassifier(class_weight='balanced', random_state=1982)
depth_range = list(range(1, 21))
est_range = list(range(1, 11))
params = {'max_depth': depth_range, 'n_estimators': est_range}
rf_model = RandomizedSearchCV(model, params, verbose=1, scoring='f1', random_state=1982)
rf_model.fit(train_features, train_target)
rf_best_params = rf_model.best_params_
rf_best_params

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Wall time: 1min 1s


{'n_estimators': 8, 'max_depth': 20}

In [60]:
rf_model = RandomForestClassifier(**rf_best_params, class_weight='balanced', random_state=1982)

3. Logistic regression

In [61]:
lr_model = LogisticRegression(class_weight='balanced', max_iter=500, random_state=1982)
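For intuition about what this model does with a TF-IDF row, here is a minimal sketch of the prediction step of a binary logistic regression: a weighted sum of the features passed through a sigmoid and compared with a threshold. The function `predict_toxic`, the term weights and the bias are made up for illustration; they are not the coefficients learned in this notebook.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_toxic(tfidf_row, weights, bias=0.0, threshold=0.5):
    """Return (probability, label) for one comment given as {term: tf-idf}."""
    z = bias + sum(weights.get(term, 0.0) * value
                   for term, value in tfidf_row.items())
    prob = sigmoid(z)
    return prob, int(prob >= threshold)


# Hypothetical coefficients: positive values push towards the toxic class.
weights = {'idiot': 4.0, 'thanks': -2.0}
prob_a, label_a = predict_toxic({'idiot': 0.8, 'you': 0.6}, weights, bias=-1.0)
prob_b, label_b = predict_toxic({'thanks': 0.7, 'edit': 0.7}, weights, bias=-1.0)
print(round(prob_a, 3), label_a)
print(round(prob_b, 3), label_b)
```

Because the model is linear in the TF-IDF features, it scales well to the very wide sparse matrix produced by the vectorizer.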

4. LightGBM

In [62]:
%%time

model = LGBMClassifier(class_weight='balanced', random_state=1982)
depth_range = list(range(1, 21))
est_range = [1, 3, 5, 10, 20, 100, 500]
speed_range = [0.03, 0.1, 0.5, 0.7, 1]
params = {'max_depth': depth_range, 'n_estimators': est_range, 'learning_rate': speed_range}
lgbm_model = RandomizedSearchCV(model, params, scoring='f1', n_iter=5, verbose=1, random_state=1982)
lgbm_model.fit(train_features, train_target)
lgbm_best_params1 = lgbm_model.best_params_
lgbm_best_params1

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Wall time: 1min 31s


{'n_estimators': 100, 'max_depth': 8, 'learning_rate': 1}

In [63]:
lgbm_model = LGBMClassifier(**lgbm_best_params1, class_weight='balanced', random_state=1982)

In [64]:
def model_fit(model):
    model.fit(train_features, train_target)

In [65]:
def model_test(model):
    test_predictions = model.predict(test_features)
    f1 = f1_score(test_target, test_predictions)
    return f1
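`f1_score` with its default settings reports the F1 of the positive class (here, toxic comments): the harmonic mean of precision and recall. The sketch below computes the same quantity by hand on made-up labels; the function name `f1_from_labels` and the toy vectors are assumptions for the example.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 of the positive class, computed from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 toxic comments out of 10 (assumption, not real data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
all_negative = f1_from_labels(y_true, [0] * 10)
print(round(score, 3), all_negative)
```

A classifier that labels every toy comment as non-toxic would be right 70% of the time yet scores an F1 of 0, which is why F1 is a more informative metric than accuracy on imbalanced data like this.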

In [66]:
pd.set_option('display.max_colwidth', 120)

models = [tree_model, rf_model, lr_model, lgbm_model]
results = []
for model in models:
    # %timeit -o returns the timing result; model_fit trains the model in place,
    # so no separate refit is needed before scoring
    l_time = %timeit -n1 -r1 -o model_fit(model)
    p_time = %timeit -n1 -r1 -o model_test(model)
    f1 = model_test(model)
    # keep only the leading part of the timing string, e.g. '4.57 s'
    results.append([str(model), f1, str(l_time)[:7], str(p_time)[:7]])
results = pd.DataFrame(results, columns=['Model', 'F1', 'Learning_time', 'Predicting_time'])
results

4.57 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
86.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
476 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
180 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
2.21 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
31.3 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
7.87 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
231 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Unnamed: 0,Model,F1,Learning_time,Predicting_time
0,"DecisionTreeClassifier(class_weight='balanced', max_depth=13, random_state=1982)",0.56,4.57 s,86.8 ms
1,"RandomForestClassifier(class_weight='balanced', max_depth=20, n_estimators=8,\n random_state=1...",0.33,476 ms,180 ms
2,"LogisticRegression(class_weight='balanced', max_iter=500, random_state=1982)",0.75,2.21 s,31.3 ms
3,"LGBMClassifier(class_weight='balanced', learning_rate=1, max_depth=8,\n random_state=1982)",0.71,7.87 s,231 ms
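The comparison in the table can be made explicit by ranking the models on F1. The snippet below copies the four F1 values from the table into a plain dictionary (model names are shortened to the class names) and sorts them; it is only a reading aid, not part of the original notebook.

```python
# F1 scores copied from the results table above.
f1_by_model = {
    'DecisionTreeClassifier': 0.56,
    'RandomForestClassifier': 0.33,
    'LogisticRegression': 0.75,
    'LGBMClassifier': 0.71,
}

# Sort from the highest F1 to the lowest.
ranking = sorted(f1_by_model.items(), key=lambda item: item[1], reverse=True)
best_model, best_f1 = ranking[0]
for name, score in ranking:
    print(f"{name:<24} {score:.2f}")
```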


## Conclusions

1. On texts vectorized with TfidfVectorizer, LogisticRegression performs best, reaching F1 = 0.75 on the test set.

2. The LightGBM boosting model comes close to the quality threshold set in the assignment, with F1 = 0.71.

3. The remaining models are clearly unsuitable for this task: the decision tree reaches F1 = 0.56 and the random forest only F1 = 0.33.

Logistic regression is the recommended model.