
# Project for «Викишоп» (Wikishop)

**Project title**: Toxic comment classification for the «Викишоп» online store

**Research goal:**

Build a machine learning model that automatically detects toxic comments in user edits of product descriptions in the «Викишоп» online store, so that they can be sent to moderation.

**Research tasks:**

1. Load and preprocess the text data (cleaning, lemmatization, stop-word removal).
2. Train several classification models to predict comment toxicity.
3. Evaluate the models with the F1 score and pick the best one (target value: F1 ≥ 0.75).
4. Check the model on the test data and draw conclusions about its applicability in production.

**Input data:**

The `toxic_comments.csv` dataset, which contains:

1. `text`: the comment text.
2. `toxic`: the binary target (1 for toxic, 0 for neutral/positive).

**Importing libraries**

In [1]:
pip install --upgrade scikit-learn -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    RandomizedSearchCV,
)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


import re
import spacy

RANDOM_STATE = 43
TEST_SIZE = 0.25

pd.set_option('display.max_colwidth', None)

## Preparation

### Loading the data

In [3]:
df = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)
df.head()

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [5]:
df.shape

(159292, 2)

*Conclusion:*

1. The data were loaded and match the task specification.
2. The column names need no correction.
3. The dataframe has shape (159292, 2).
4. No missing values were found.

### Data preprocessing

In [6]:
df['text'].isna().sum()

0

In [7]:
df['text'].duplicated().sum()

0

*Conclusion:*

1. The `text` column has no missing values.
2. The `text` column has no duplicates.

### Lemmatization and stop-word removal

In [8]:
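def lemmatize(text, nlp_object):
    '''
    Lemmatize the text, lowercase it, replace everything except
    Latin letters and spaces with a space, and collapse extra whitespace.
    Stop words are not removed here.
    '''
    doc = nlp_object(text)  # use the model passed in, not the global `nlp`
    lemm = " ".join([token.lemma_ for token in doc])
    clear = re.sub(r'[^a-zA-Z ]', ' ', lemm.lower())
    return ' '.join(clear.split())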

In [9]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [10]:
df['clear_text'] = df['text'].apply(lambda x: lemmatize(x, nlp))
df.head()

Unnamed: 0,text,toxic,clear_text
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,explanation why the edit make under my username hardcore metallica fan be revert they be not vandalism just closure on some gas after i vote at new york dolls fac and please do not remove the template from the talk page since i be retire now
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,d aww he match this background colour i be seemingly stuck with thank talk january utc
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0,hey man i be really not try to edit war it be just that this guy be constantly remove relevant information and talk to i through edit instead of my talk page he seem to care more about the formatting than the actual info
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0,more i can not make any real suggestion on improvement i wonder if the section statistic should be later on or a subsection of type of accident i think the reference may need tidy so that they be all in the exact same format ie date format etc i can do that later on if no one else do first if you have any preference for format style on reference or want to do it yourself please let i know there appear to be a backlog on article for review so i guess there may be a delay until a reviewer turn up it be list in the relevant form eg wikipedia good article nominations transport
4,"You, sir, are my hero. Any chance you remember what page that's on?",0,you sir be my hero any chance you remember what page that be on
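The cleaning step inside `lemmatize` can be looked at in isolation, without spaCy. The sketch below reuses the same regular expression; the helper name `clean_text` and the sample strings are illustrative and not part of the notebook. Note that digits, punctuation, and any non-Latin letters are all replaced by spaces.

```python
import re


def clean_text(text):
    """Lowercase, replace non-letter characters with spaces, collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text.lower())
    return ' '.join(letters_only.split())


sample = "D'aww! He   matches 21:51"
cleaned = clean_text(sample)
print(cleaned)
```

Because apostrophes are also replaced, a contraction such as `D'aww` is split into two tokens, which matches the `d aww` seen in the `clear_text` column.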


In [11]:
df['clear_text'].duplicated().sum()

1323

Cleaning produced duplicates, so they are removed next and the class balance is checked.

In [12]:
df = df.drop_duplicates(subset=['clear_text'])
df['toxic'].value_counts() / len(df)

0    0.898436
1    0.101564
Name: toxic, dtype: float64

The classes are strongly imbalanced: only about 10% of the comments are toxic.

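*Conclusions:*

1. A function was written that lemmatizes the text, lowercases it, and strips non-letter characters and extra whitespace; it does not remove stop words.
2. The 1323 duplicates that appeared after cleaning were removed.
3. The classes are imbalanced (about 90% non-toxic, 10% toxic).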
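With roughly 10% positives, accuracy can look good even for a useless model, which is one reason F1 is a reasonable metric here. A minimal sketch on made-up labels (the helpers `f1_binary` and `accuracy` are illustrative and not part of the notebook):

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels with a 10% positive share (illustrative, not the real data)
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100  # a "model" that never flags anything

baseline_accuracy = accuracy(y_true, always_negative)
baseline_f1 = f1_binary(y_true, always_negative)
print(baseline_accuracy, baseline_f1)
```

The always-negative baseline is right on 90% of the toy labels yet never finds a toxic comment, so its F1 for the toxic class is zero.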

## Training

In [13]:
X = df['clear_text']
y = df['toxic']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
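The `stratify=y` argument keeps the share of toxic comments the same in both parts of the split. A small sketch on synthetic labels (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 positives out of 1000 (illustrative, not the real data)
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=43
)

# Share of the positive class in each part
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, the two shares would generally differ a little from split to split, which matters more when the positive class is rare.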

In [15]:
# Not used below: the pipeline creates its own TfidfVectorizer() without stop_words
count_tf_idf = TfidfVectorizer(stop_words='english')

In [16]:
pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer()),
        ("models", LogisticRegression(random_state=RANDOM_STATE)),
    ]
)
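The step names `"vect"` and `"models"` matter because nested parameters are addressed as `<step name>__<parameter>`, which is the naming the search grid relies on. A minimal sketch with a made-up four-sentence corpus (the corpus, labels, and the name `toy_pipeline` are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer()),
        ("models", LogisticRegression()),
    ]
)

# Nested parameters are addressed as "<step name>__<parameter>"
toy_pipeline.set_params(vect__max_features=5000, models__C=10)
params = toy_pipeline.get_params()
print(params["vect__max_features"], params["models__C"])

# Tiny made-up corpus, only to show that fit/predict run end to end
texts = ["you are great", "thank you so much", "you are an idiot", "stupid idiot"]
labels = [0, 0, 1, 1]
toy_pipeline.fit(texts, labels)
predictions = toy_pipeline.predict(texts)
print(predictions)
```

Fitting the vectorizer inside the pipeline also means that, during cross-validation, the TF-IDF vocabulary is learned only from the training folds.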

In [17]:
param_distributions = [
    {
        'models': [KNeighborsClassifier()],
        'models__n_neighbors': [3, 5, 7]
    },
    {
        'models': [DecisionTreeClassifier(random_state=RANDOM_STATE)],
        'models__max_depth': [3, 5],
        'models__min_samples_split': [2, 5],
    },
    {
        'models': [LogisticRegression(random_state=RANDOM_STATE, solver='liblinear', max_iter=1000)],
        'models__C': [0.1, 1, 10], 
        'vect__max_features': [5000, 10000],
        'vect__max_df': [0.4, 0.6, 0.8], 
        'vect__min_df': [1, 3],
        'vect__norm': ['l1', 'l2'],
    }
]
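The three sub-grids differ a lot in size, which affects what a small random search is likely to try. The sketch below counts the combinations in each sub-grid. The estimators are replaced by placeholder strings, and the interpretation assumes that candidates are drawn uniformly from the combined grid when every value is a list, which is how scikit-learn's parameter sampling is documented to behave as far as I know.

```python
from math import prod

# Same shape as param_distributions above; estimators are replaced by
# placeholder strings because only the number of options matters here
search_space = [
    {"models": ["knn"], "models__n_neighbors": [3, 5, 7]},
    {
        "models": ["tree"],
        "models__max_depth": [3, 5],
        "models__min_samples_split": [2, 5],
    },
    {
        "models": ["logreg"],
        "models__C": [0.1, 1, 10],
        "vect__max_features": [5000, 10000],
        "vect__max_df": [0.4, 0.6, 0.8],
        "vect__min_df": [1, 3],
        "vect__norm": ["l1", "l2"],
    },
]

# Number of parameter combinations per sub-grid
sizes = {
    grid["models"][0]: prod(len(values) for values in grid.values())
    for grid in search_space
}
total = sum(sizes.values())
shares = {name: size / total for name, size in sizes.items()}
print(sizes, total, shares)
```

Under that assumption, about 91% of the combined grid belongs to LogisticRegression, so a search with a small `n_iter` may sample few or no KNeighborsClassifier and DecisionTreeClassifier candidates.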

In [18]:
rndom = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,
    cv=5,
    n_jobs=-1,
    scoring='f1'
)

In [19]:
rndom.fit(X_train, y_train)

In [20]:
results = rndom.cv_results_
results_df = pd.DataFrame(results)

ranking_columns = ['params', 'mean_test_score', 'rank_test_score']
ranking_df = results_df[ranking_columns].sort_values(by='rank_test_score')

print("Model ranking:")
display(ranking_df.head(10))

Model ranking:


Unnamed: 0,params,mean_test_score,rank_test_score
3,"{'vect__norm': 'l2', 'vect__min_df': 1, 'vect__max_features': 10000, 'vect__max_df': 0.6, 'models__C': 10, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.783392,1
6,"{'vect__norm': 'l2', 'vect__min_df': 1, 'vect__max_features': 5000, 'vect__max_df': 0.6, 'models__C': 10, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.779239,2
4,"{'vect__norm': 'l2', 'vect__min_df': 3, 'vect__max_features': 10000, 'vect__max_df': 0.4, 'models__C': 10, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.775095,3
0,"{'vect__norm': 'l2', 'vect__min_df': 3, 'vect__max_features': 5000, 'vect__max_df': 0.8, 'models__C': 1, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.747363,4
7,"{'vect__norm': 'l1', 'vect__min_df': 3, 'vect__max_features': 10000, 'vect__max_df': 0.8, 'models__C': 10, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.686504,5
2,"{'vect__norm': 'l1', 'vect__min_df': 1, 'vect__max_features': 10000, 'vect__max_df': 0.8, 'models__C': 10, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.68646,6
5,"{'vect__norm': 'l2', 'vect__min_df': 1, 'vect__max_features': 5000, 'vect__max_df': 0.6, 'models__C': 0.1, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.569413,7
8,"{'vect__norm': 'l1', 'vect__min_df': 1, 'vect__max_features': 5000, 'vect__max_df': 0.4, 'models__C': 1, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.512368,8
1,"{'vect__norm': 'l1', 'vect__min_df': 1, 'vect__max_features': 10000, 'vect__max_df': 0.8, 'models__C': 1, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.492017,9
9,"{'vect__norm': 'l1', 'vect__min_df': 3, 'vect__max_features': 10000, 'vect__max_df': 0.8, 'models__C': 0.1, 'models': LogisticRegression(max_iter=1000, random_state=43, solver='liblinear')}",0.114829,10


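Best model: **LogisticRegression**(C=10, max_iter=1000, random_state=43, solver='liblinear') with TF-IDF settings norm='l2', min_df=1, max_features=10000, max_df=0.6. Its mean cross-validated F1 (`mean_test_score`) is **0.783**.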

In [21]:
best_estimator = rndom.best_estimator_
vect = best_estimator[0]
model = best_estimator[1]

In [22]:
tf_idf_test = vect.transform(X_test)

In [23]:
y_pred = model.predict(tf_idf_test)
f1_score(y_test, y_pred)

0.7868265123329199

The success criterion is F1 ≥ 0.75, and the test-set F1 is about 0.787, so the model meets it.
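Taking the vectorizer and the classifier out of the fitted pipeline and applying them by hand should give the same predictions as calling `predict` on the pipeline itself. A sketch on made-up texts (all data and the name `fitted` are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Made-up texts and labels, only for illustration
train_texts = ["thank you", "nice edit", "you idiot",
               "stupid idiot", "great work", "idiot vandal"]
train_labels = [0, 0, 1, 1, 0, 1]
test_texts = ["thank you for the edit", "what an idiot"]
test_labels = [0, 1]

fitted = Pipeline(
    [("vect", TfidfVectorizer()), ("models", LogisticRegression(solver="liblinear"))]
).fit(train_texts, train_labels)

# Route 1: transform and predict step by step
manual_pred = fitted[1].predict(fitted[0].transform(test_texts))

# Route 2: let the pipeline chain the steps itself
pipeline_pred = fitted.predict(test_texts)

same = np.array_equal(manual_pred, pipeline_pred)
score = f1_score(test_labels, pipeline_pred)
print(same, score)
```

The second route is shorter and leaves less room for applying the wrong vectorizer to the test data.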

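*Conclusions:*

1. A pipeline was built (TF-IDF vectorizer plus classifier).
2. The search space covered KNeighborsClassifier, DecisionTreeClassifier, and LogisticRegression; all 10 configurations sampled by RandomizedSearchCV turned out to be LogisticRegression, so the other two models were not actually evaluated in this run.
3. Hyperparameters were tuned with a randomized search (10 candidates, 5-fold cross-validation).
4. Best model by cross-validation on the training set: LogisticRegression with mean F1 = 0.783.
5. F1 on the test set: **0.787**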

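## Conclusions

**Overall conclusion**

1. The data were loaded and match the task specification.
2. The column names need no correction.
3. The raw dataframe has shape (159292, 2).
4. No missing values were found.
5. A function was written that lemmatizes the text, lowercases it, and strips non-letter characters and extra whitespace; it does not remove stop words.
6. The 1323 duplicates that appeared after cleaning were removed.
7. The classes are imbalanced (about 10% of the comments are toxic).
8. A pipeline was built (TF-IDF vectorizer plus classifier).
9. The search space covered KNeighborsClassifier, DecisionTreeClassifier, and LogisticRegression; all 10 configurations sampled by RandomizedSearchCV turned out to be LogisticRegression.
10. Hyperparameters were tuned with a randomized search (10 candidates, 5-fold cross-validation).
11. Best model by cross-validation on the training set: **LogisticRegression**(C=10, max_iter=1000, random_state=43, solver='liblinear') with mean F1 = **0.783**
12. F1 on the test set: **0.787**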