# Project for an online store

The online store is launching a service that lets customers add to product descriptions and edit them.

We need a tool that finds toxic comments and sends them for moderation.

The value of the *F1* quality metric must be at least 0.75.
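
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion counts; the function name `f1_from_counts` and the counts are hypothetical and only illustrate the formula, they are not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only
print(round(f1_from_counts(tp=80, fp=20, fn=30), 4))  # 0.7619
```

With these counts precision is 0.8 and recall is about 0.727, so a model can clear the 0.75 target only if both stay reasonably high.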

## Data preparation

In [1]:
import re

import lightgbm as lgb
import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

In [2]:
data = pd.read_csv("toxic_comments.csv")

In [3]:
data = data.drop(columns=["Unnamed: 0"])

We reduce the dataset to a random sample of 50,000 rows so that lemmatization and model training finish in a reasonable time instead of taking hours on a machine with limited RAM.

In [4]:
data = data.sample(50000).reset_index(drop=True)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
text     50000 non-null object
toxic    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [6]:
nlp = spacy.load("en_core_web_sm")

In [7]:
def lemmatize(text):
    # Run the spaCy pipeline loaded above and join the lemmas of all tokens
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [8]:
data["lemm"] = data["text"].apply(lemmatize)

In [9]:
data["lemm"] = [re.sub(r'[^a-zA-Z ]', ' ', (i)).lower() for i in data["lemm"]]
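
The cleaning step above keeps only Latin letters and spaces and lowercases the text. As a small variation (an assumption, not the notebook's exact code), the hypothetical helper `clean_text` below also collapses runs of whitespace, which makes the result easier to inspect by eye.

```python
import re


def clean_text(text: str) -> str:
    """Keep only Latin letters, lowercase them, and collapse repeated whitespace."""
    # Replace every run of non-letter characters with a single space
    letters_only = re.sub(r"[^a-zA-Z]+", " ", text)
    # split() drops leading, trailing, and repeated whitespace
    return " ".join(letters_only.lower().split())


print(clean_text("You're   SO wrong!!! 100%"))  # you re so wrong
```

Since the vectorizer splits text into word tokens, the extra spaces left by the original step should not change the resulting features; the collapse is purely cosmetic.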

In [10]:
features = data["lemm"]
target = data["toxic"]

In [11]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.4, random_state=12345)


In [12]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

In [13]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

In [14]:
target_upsampled.value_counts()

1    30350
0    26965
Name: toxic, dtype: int64
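
The counts above show that `repeat=10` slightly overshoots: the toxic class (3,035 original rows, repeated 10 times) ends up larger than the non-toxic class. A minimal sketch of deriving the factor from the class counts instead; `balancing_repeat` is a hypothetical helper, not part of the notebook.

```python
def balancing_repeat(n_negative: int, n_positive: int) -> int:
    """Integer factor that brings the positive class closest to the negative class size."""
    return max(1, round(n_negative / n_positive))


# Training-set counts taken from the value_counts() output above
n_negative = 26965
n_positive = 30350 // 10  # 3035 original toxic rows before upsampling

repeat = balancing_repeat(n_negative, n_positive)
print(repeat, n_positive * repeat)  # 9 27315
```

Whether 9 or 10 gives the better test score is an empirical question; the sketch only shows where a balanced factor would come from.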

In [15]:
nltk.download("stopwords")
# Recent scikit-learn versions expect stop_words to be a list rather than a set
stopwords = list(nltk_stopwords.words("english"))
# count_vect = TfidfVectorizer(stop_words=stopwords)
# tf_idf_train = count_vect.fit_transform(features_train)
# tf_idf_test = count_vect.transform(features_test)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kiril\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Training

### Logistic regression

#### Experiment - logistic regression and class balancing

In [16]:
pipe_lr_cb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('log_r', LogisticRegression(class_weight={0: 1, 1: 10}))])

parameters_lr_cb = {
                "log_r__C" : [1, 10, 100]
                }

grid_lr_cb = GridSearchCV(estimator=pipe_lr_cb, param_grid=parameters_lr_cb, scoring= 'f1', n_jobs=-1)

grid_lr_cb.fit(features_train, target_train)
print(grid_lr_cb.best_score_)

0.7334929226242076


In [17]:
modelr_lr_cb = grid_lr_cb.best_estimator_

In [18]:
pred_lr_cb = modelr_lr_cb.predict(features_test)

print("F1 LR_cb =", f1_score(target_test, pred_lr_cb))

F1 LR_cb = 0.752724630661177
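
The experiment above fixes the class weights at `{0: 1, 1: 10}`. For comparison, scikit-learn's `class_weight="balanced"` option derives weights as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python for the training-set class counts (26,965 non-toxic, 3,035 toxic); `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weights proportional to inverse class frequency: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


weights = balanced_class_weights({0: 26965, 1: 3035})
ratio = weights[1] / weights[0]
print(weights, round(ratio, 2))
```

The resulting toxic-to-non-toxic weight ratio is about 8.9, so the hand-picked value of 10 is close to, but slightly stronger than, the balanced setting.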


#### Experiment - logistic regression and upsampling

In [19]:
pipe_lr_up = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('log_r', LogisticRegression())])

parameters_lr_up = {
                "log_r__C" : [1, 10, 100]
                }

grid_lr_up = GridSearchCV(estimator=pipe_lr_up, param_grid=parameters_lr_up, scoring= 'f1', n_jobs=-1)

grid_lr_up.fit(features_upsampled, target_upsampled)
print(grid_lr_up.best_score_)

0.9825062945368035
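
This cross-validation score is far higher than the test score the same model reaches. One likely explanation is that the data was upsampled before cross-validation, so copies of the same toxic comment land in both the training and the validation folds. The sketch below uses hypothetical row ids (not the notebook's data) and a simple fold assignment to estimate how often that happens.

```python
import random


def duplicate_leak_share(n_unique: int = 100, repeat: int = 10,
                         n_folds: int = 5, seed: int = 12345) -> float:
    """Share of validation rows whose original row also appears in the training folds."""
    # Each unique row id is repeated `repeat` times, as upsampling does
    rows = [row_id for row_id in range(n_unique) for _ in range(repeat)]
    random.Random(seed).shuffle(rows)
    # Deal the shuffled rows into n_folds folds of equal size
    folds = [rows[i::n_folds] for i in range(n_folds)]

    leaked = total = 0
    for k, validation in enumerate(folds):
        training_ids = {
            row_id
            for j, fold in enumerate(folds) if j != k
            for row_id in fold
        }
        leaked += sum(row_id in training_ids for row_id in validation)
        total += len(validation)
    return leaked / total


print(duplicate_leak_share())          # close to 1.0 with 10 copies per row
print(duplicate_leak_share(repeat=1))  # 0.0 without duplication
```

If this is the cause, upsampling inside each training fold (or relying on class weights) would give cross-validation scores that track the test score more closely.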


In [20]:
modelr_lr_up = grid_lr_up.best_estimator_

In [21]:
pred_lr_up = modelr_lr_up.predict(features_test)

print("F1 LR_up =", f1_score(target_test, pred_lr_up))

F1 LR_up = 0.7290322580645161


### Decision tree

#### Experiment - decision tree and class balancing

In [22]:
pipe_dtr_cb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('dtr', DecisionTreeClassifier(class_weight={0: 1, 1: 10}))])

params_dtr_cb = {"dtr__max_depth": range(1, 100, 5)}

grid_dtr_cb = GridSearchCV(pipe_dtr_cb, param_grid=params_dtr_cb, scoring= 'f1', n_jobs=-1)

grid_dtr_cb.fit(features_train, target_train)
print(grid_dtr_cb.best_score_)

0.6022311631837429


In [23]:
model_dtr_cb = grid_dtr_cb.best_estimator_

In [24]:
pred_dtr_cb = model_dtr_cb.predict(features_test)

print("F1 DTR_cb =", f1_score(target_test, pred_dtr_cb))

F1 DTR_cb = 0.620126079850572


#### Experiment - decision tree and upsampling

In [25]:
pipe_dtr_up = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('dtr', DecisionTreeClassifier())])

params_dtr_up = {"dtr__max_depth": range(1, 100, 5)}

grid_dtr_up = GridSearchCV(pipe_dtr_up, param_grid=params_dtr_up, scoring= 'f1', n_jobs=-1)

grid_dtr_up.fit(features_upsampled, target_upsampled)
print(grid_dtr_up.best_score_)

0.9233544578363887


In [26]:
model_dtr_up = grid_dtr_up.best_estimator_

In [27]:
pred_dtr_up = model_dtr_up.predict(features_test)

print("F1 DTR_up =", f1_score(target_test, pred_dtr_up))

F1 DTR_up = 0.6198444496818291


### Random forest

#### Experiment - random forest and class balancing

In [28]:
pipe_rfc_cb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('rfc', RandomForestClassifier(class_weight={0: 1, 1: 10}))])

params_rfc_cb = {
    "rfc__n_estimators": [100,200],
    "rfc__max_depth": range(1, 10)
}

grid_rfc_cb = GridSearchCV(pipe_rfc_cb, param_grid=params_rfc_cb, scoring= 'f1', n_jobs=-1)

grid_rfc_cb.fit(features_train, target_train)
print(grid_rfc_cb.best_score_)

0.27527642273083147


In [29]:
model_rfc_cb = grid_rfc_cb.best_estimator_

In [30]:
model_rfc_cb.fit(features_train, target_train)
pred_rfc_cb = model_rfc_cb.predict(features_test)
print("F1 RFC_cb =", f1_score(target_test, pred_rfc_cb))

F1 RFC_cb = 0.26903944510614447


#### Experiment - random forest and upsampling

In [31]:
pipe_rfc_up = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('rfc', RandomForestClassifier())])

params_rfc_up = {
    "rfc__n_estimators": [100,200],
    "rfc__max_depth": range(1, 10)
}

grid_rfc_up = GridSearchCV(pipe_rfc_up, param_grid=params_rfc_up, scoring= 'f1', n_jobs=-1)

grid_rfc_up.fit(features_upsampled, target_upsampled)
print(grid_rfc_up.best_score_)

0.7901851375718838


In [32]:
model_rfc_up = grid_rfc_up.best_estimator_

In [33]:
model_rfc_up.fit(features_upsampled, target_upsampled)
pred_rfc_up = model_rfc_up.predict(features_test)
print("F1 RFC_up =", f1_score(target_test, pred_rfc_up))

F1 RFC_up = 0.26636883987156224


### LGBM

#### Experiment - LGBM and class balancing

In [34]:
pipe_lgbm_cb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('lgbm', lgb.LGBMClassifier(class_weight={0: 1, 1: 10}))])

parameters_lgbm_cb = {
    'lgbm__boosting_type': ['gbdt'],
    'lgbm__num_leaves': [25,35],
    "lgbm__max_depth": [1, 10],
    'lgbm__n_estimators': [50, 100]
}

grid_lgbm_cb = GridSearchCV(pipe_lgbm_cb, param_grid=parameters_lgbm_cb, scoring= 'f1', n_jobs=-1)

grid_lgbm_cb.fit(features_train, target_train)
print(grid_lgbm_cb.best_score_)

0.683337966071608


In [35]:
model_lgbm_cb = grid_lgbm_cb.best_estimator_

In [36]:
model_lgbm_cb.fit(features_train, target_train)

pred_lgbm_cb = model_lgbm_cb.predict(features_test)

print("F1 LGBM_cb =", f1_score(target_test, pred_lgbm_cb))

F1 LGBM_cb = 0.6948185345811457


#### Experiment - LGBM and upsampling

In [37]:
pipe_lgbm_up = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                    ('lgbm', lgb.LGBMClassifier())])

parameters_lgbm_up = {
    'lgbm__boosting_type': ['gbdt'],
    'lgbm__num_leaves': [25,35],
    "lgbm__max_depth": [1, 10],
    'lgbm__n_estimators': [50, 100]
}

grid_lgbm_up = GridSearchCV(pipe_lgbm_up, param_grid=parameters_lgbm_up, scoring= 'f1', n_jobs=-1)

grid_lgbm_up.fit(features_upsampled, target_upsampled)
print(grid_lgbm_up.best_score_)

0.9082116069939681


In [38]:
model_lgbm_up = grid_lgbm_up.best_estimator_

In [39]:
model_lgbm_up.fit(features_upsampled, target_upsampled)

pred_lgbm_up = model_lgbm_up.predict(features_test)

print("F1 LGBM_up =", f1_score(target_test, pred_lgbm_up))

F1 LGBM_up = 0.7007984969469235


## Conclusions

Let's collect the test-set **F1_score** results in one table so the differences between models are easy to compare.

In [40]:
table = [
    [
        f1_score(target_test, pred_lr_cb),
        f1_score(target_test, pred_dtr_cb),
        f1_score(target_test, pred_rfc_cb),
        f1_score(target_test, pred_lgbm_cb),
    ],
    [
        f1_score(target_test, pred_lr_up),
        f1_score(target_test, pred_dtr_up),
        f1_score(target_test, pred_rfc_up),
        f1_score(target_test, pred_lgbm_up),
    ]
]
columns = ["lr", "dtr", "rfc", "lgbm"]
index = ["F1_score_cb", "F1_score_up"]
table = pd.DataFrame(table, index, columns)
display(table)

| | lr | dtr | rfc | lgbm |
|---|---|---|---|---|
| F1_score_cb | 0.752725 | 0.620126 | 0.269039 | 0.694819 |
| F1_score_up | 0.729032 | 0.619844 | 0.266369 | 0.700798 |
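
The comparison against the 0.75 target can also be done programmatically. The sketch below copies the test scores from the table into a plain dictionary (the name `f1_test` and the `(model, balancing)` key layout are assumptions made for illustration) and picks the best configuration.

```python
# Test-set F1 scores copied from the table; keys are (model, balancing method)
f1_test = {
    ("lr", "cb"): 0.752725, ("dtr", "cb"): 0.620126,
    ("rfc", "cb"): 0.269039, ("lgbm", "cb"): 0.694819,
    ("lr", "up"): 0.729032, ("dtr", "up"): 0.619844,
    ("rfc", "up"): 0.266369, ("lgbm", "up"): 0.700798,
}
threshold = 0.75  # minimum F1 required by the task

best_key = max(f1_test, key=f1_test.get)
passing = [key for key, score in f1_test.items() if score >= threshold]

print(best_key, f1_test[best_key])  # ('lr', 'cb') 0.752725
print(passing)                      # [('lr', 'cb')]
```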


The models were built on a 50,000-row sample of the dataframe, because lemmatizing the full dataset took too long on a machine with limited RAM.

The most successful model is **logistic regression** with class weighting: its **F1_score** on the test set is 0.752725, which is above the minimum of 0.75 set in the task.