# Project for Wikishop («Викишоп»)

### Project description

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative (non-toxic or toxic). We have a dataset in which edits are labeled for toxicity.

The model must reach an *F1* score of at least 0.75.
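As a reminder of what the target metric measures, here is a minimal sketch that computes F1 for the positive (toxic) class from confusion-matrix counts. The function name `f1_from_counts` and the counts are hypothetical and chosen only for illustration; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives, false negatives."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the dataset
score = f1_from_counts(tp=60, fp=20, fn=20)
print(round(score, 2))
```

Because F1 ignores true negatives, a model that labels almost everything as non-toxic can have high accuracy on imbalanced data and still a low F1.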

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.


## Preparation

In [1]:
import re

import nltk
import numpy as np
import pandas as pd
import spacy
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

In [2]:
STATE = np.random.RandomState(12345)

In [3]:
df_toxic = pd.read_csv('/datasets/toxic_comments.csv')

In [4]:
df_toxic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
df_toxic.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [6]:
# Drop the uninformative column Unnamed: 0
df_toxic = df_toxic.drop(['Unnamed: 0'], axis=1)

In [7]:
# Check the data for duplicates
df_toxic.duplicated().sum()

0

In [8]:
# Look at the class frequencies in the target
df_toxic['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
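The counts above show a strong class imbalance. The sketch below turns them into the share of toxic comments and into the per-class weights that scikit-learn's `class_weight='balanced'` option derives, `n_samples / (n_classes * class_count)`. It uses the full-dataset counts for illustration; during training the weights would be computed from the training folds, so the exact values there may differ slightly.

```python
# Class counts taken from the value_counts() output above
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of the positive (toxic) class
toxic_share = counts[1] / n_samples

# Weights as computed by the 'balanced' heuristic:
# n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
print({label: round(w, 2) for label, w in balanced_weights.items()})
```

The rarer toxic class receives the larger weight, so its misclassifications cost more during training.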

In [12]:
# Shortest texts by character length
df_toxic.iloc[df_toxic['text'].str.len().sort_values(ascending=False).tail(20).index]

Unnamed: 0,text,toxic
15588,You've got mail,0
61351,31 January 2007,0
54257,Speedy deletion,0
100520,hello cow head,1
3631,September 2008,0
20550,September 2015,0
10193,"64.86.141.133""",0
141644,Unblock Please,0
101306,GONE FISHING.,0
38743,"88.104.31.21""",0


In [13]:
# Lemmatization with spaCy; the parser and NER components are disabled for speed
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatize(row):
    doc = nlp(row['text'])
    lemm_doc = " ".join([token.lemma_ for token in doc])
    return lemm_doc

In [14]:
df_toxic['lemm_text'] = df_toxic.apply(lemmatize, axis=1)
df_toxic.head(10)

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation \n why the edit make under my user...
1,D'aww! He matches this background colour I'm s...,0,D'aww ! he match this background colour I be s...
2,"Hey man, I'm really not trying to edit war. It...",0,"hey man , I be really not try to edit war . it..."
3,"""\nMore\nI can't make any real suggestions on ...",0,""" \n More \n I can not make any real suggestio..."
4,"You, sir, are my hero. Any chance you remember...",0,"you , sir , be my hero . any chance you rememb..."
5,"""\n\nCongratulations from me as well, use the ...",0,""" \n\n Congratulations from I as well , use th..."
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,COCKSUCKER before you pis around on my work
7,Your vandalism to the Matt Shirvington article...,0,your vandalism to the Matt Shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,0,sorry if the word ' nonsense ' be offensive to...
9,alignment on this subject and which are contra...,0,alignment on this subject and which be contrar...


In [15]:
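# Strip everything except Latin letters and spaces, lowercase, collapse whitespace.
# Note: this reads the raw 'text' column, so assigning the result to 'lemm_text'
# in the next cell replaces the lemmatized text with cleaned, unlemmatized text.

def clear_text(row):
    doc = re.sub(r'[^a-zA-Z ]', ' ', row['text'])
    text_clear = " ".join(doc.lower().split())
    return text_clear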

In [16]:
df_toxic['lemm_text'] = df_toxic.apply(clear_text, axis=1)
df_toxic.head(10)

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,d aww he matches this background colour i m se...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man i m really not trying to edit war it s...
3,"""\nMore\nI can't make any real suggestions on ...",0,more i can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,you sir are my hero any chance you remember wh...
5,"""\n\nCongratulations from me as well, use the ...",0,congratulations from me as well use the tools ...
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,cocksucker before you piss around on my work
7,Your vandalism to the Matt Shirvington article...,0,your vandalism to the matt shirvington article...
8,Sorry if the word 'nonsense' was offensive to ...,0,sorry if the word nonsense was offensive to yo...
9,alignment on this subject and which are contra...,0,alignment on this subject and which are contra...


In [22]:
df_toxic.iloc[df_toxic['lemm_text'].str.len().sort_values(ascending=False).tail(20).index]

Unnamed: 0,text,toxic,lemm_text
132833,True. 75.85.112.149,0,true
72382,"Also, 128.61.126.234",0,also
89846,"203.124.2.14]] 07:29, 27 July",0,july
20524,ISBN 978-5-98227-158-7,0,isbn
177,"86.29.244.57|86.29.244.57]] 04:21, 14 May 2007",0,may
72859,"""32]]''''' 06:18, 4 May""",0,may
12273,"@]] 04:15, 2005 Jan 7",0,jan
150648,"04:59, 22 Au",0,au
151107,10 - 2010 04 08 to 2010 05 12,0,to
53679,92.24.199.233|92.24.199.233]],0,
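The cleaning step can be reproduced on standalone strings to see why some rows shrink to a single word or to nothing. This is a minimal sketch: `clean` is a hypothetical helper that applies the same regular expression as `clear_text` to a plain string, and the samples are taken from the tables above.

```python
import re


def clean(text: str) -> str:
    # Same steps as clear_text, applied to a plain string instead of a DataFrame row
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(letters_only.lower().split())


samples = [
    "D'aww! He matches this background colour",
    "ISBN 978-5-98227-158-7",
    "92.24.199.233|92.24.199.233]]",
]

cleaned = [clean(s) for s in samples]
print(cleaned)

# Strings that contain no Latin letters become empty after cleaning
non_empty = [c for c in cleaned if c]
print(len(cleaned) - len(non_empty), "sample(s) became empty")
```

Rows that end up empty carry no tokens for TF-IDF. Filtering them before training is one option, though the notebook does not do so.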


In [62]:
# Build the corpus of comments
corpus = df_toxic['lemm_text']

In [63]:
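# Load the English stop words as a list: recent scikit-learn versions
# validate TfidfVectorizer's stop_words parameter and do not accept a set
nltk.download('stopwords')
stopwords = list(nltk_stopwords.words('english'))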

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [54]:
# Extract the target column from the dataset
target = df_toxic['toxic']

In [56]:
# Split the data into train and test sets
corpus_train, corpus_test, target_train, target_test = train_test_split(
    corpus, target, test_size=0.2, random_state=STATE)
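With about 10% toxic comments, a plain random split can leave slightly different class shares in train and test. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic labels with a similar imbalance; the arrays `X` and `y` and the integer seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the dataset (10% positives)
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

# Both parts keep the positive-class share of the full data
print(y_train.mean(), y_test.mean())
```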

## Training

In [75]:
%%time
# Train the logistic regression model

logistic = LogisticRegression(max_iter=10000, class_weight='balanced', random_state=STATE)
pipe_log = Pipeline([
    ('vect', TfidfVectorizer(stop_words=stopwords)),
    ('logistic', logistic)
])
# Note: np.logspace takes exponents, so this grid spans C = 10**0.01 to 10**50
param_grid = {'logistic__C': np.logspace(0.01, 50, 10)}
grid_log = GridSearchCV(
    estimator=pipe_log, param_grid=param_grid, scoring='f1', cv=3, n_jobs=-1, verbose=10
)
grid_log.fit(corpus_train, target_train)
print(grid_log.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3; 1/10] START logistic__C=1.023292992280754..............................
[CV 1/3; 1/10] END ............logistic__C=1.023292992280754; total time=  37.0s
[CV 2/3; 1/10] START logistic__C=1.023292992280754..............................
[CV 2/3; 1/10] END ............logistic__C=1.023292992280754; total time=  51.6s
[CV 3/3; 1/10] START logistic__C=1.023292992280754..............................
[CV 3/3; 1/10] END ............logistic__C=1.023292992280754; total time=  45.3s
[CV 1/3; 2/10] START logistic__C=366812.76823930326.............................
[CV 1/3; 2/10] END ...........logistic__C=366812.76823930326; total time=32.8min
[CV 2/3; 2/10] START logistic__C=366812.76823930326.............................
[CV 2/3; 2/10] END ...........logistic__C=366812.76823930326; total time=28.8min
[CV 3/3; 2/10] START logistic__C=366812.76823930326.............................
[CV 3/3; 2/10] END ...........logistic__C=366812
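The log shows the second candidate, `C=366812.77`, taking about half an hour per fold. That follows from how `np.logspace` reads its arguments: they are exponents of 10, not the endpoints themselves. The sketch below prints the grid as written and, for comparison, a hypothetical narrower grid from 0.01 to 100. The alternative range is an assumption for illustration, not a value tested in this notebook.

```python
import numpy as np

# Grid as written in the notebook: exponents run from 0.01 to 50
grid_as_written = np.logspace(0.01, 50, 10)
print(grid_as_written[:2])   # first two C values, as seen in the search log
print(grid_as_written[-1])   # largest C in the grid

# Hypothetical alternative: exponents from -2 to 2, i.e. C from 0.01 to 100
grid_alt = np.logspace(-2, 2, 10)
print(grid_alt)
```

Very large `C` values nearly switch off regularization, which tends to slow solver convergence on sparse TF-IDF features.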

In [77]:
pred_train = grid_log.predict(corpus_train)
f1 = f1_score(target_train, pred_train)
f1

0.8430936995153474

In [88]:
%%time
# Train the LightGBM model

lgbm = LGBMClassifier(boosting_type='gbdt', random_state=STATE) 
pipe_lgbm = Pipeline([
    ('vect', TfidfVectorizer(stop_words=stopwords)),
    ('lgbm', lgbm)
])
param_grid = {'lgbm__max_depth': [2, 4, 6],
             'lgbm__n_estimators': [20, 40, 60]}
grid_lgbm = GridSearchCV(
    estimator=pipe_lgbm, param_grid=param_grid, scoring='f1', cv=3, n_jobs=-1, verbose=10
)
grid_lgbm.fit(corpus_train, target_train)
print(grid_lgbm.best_score_)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV 1/3; 1/9] START lgbm__max_depth=2, lgbm__n_estimators=20....................
[CV 1/3; 1/9] END ..lgbm__max_depth=2, lgbm__n_estimators=20; total time=  38.5s
[CV 2/3; 1/9] START lgbm__max_depth=2, lgbm__n_estimators=20....................
[CV 2/3; 1/9] END ..lgbm__max_depth=2, lgbm__n_estimators=20; total time=  37.6s
[CV 3/3; 1/9] START lgbm__max_depth=2, lgbm__n_estimators=20....................
[CV 3/3; 1/9] END ..lgbm__max_depth=2, lgbm__n_estimators=20; total time=  38.1s
[CV 1/3; 2/9] START lgbm__max_depth=2, lgbm__n_estimators=40....................
[CV 1/3; 2/9] END ..lgbm__max_depth=2, lgbm__n_estimators=40; total time=  42.8s
[CV 2/3; 2/9] START lgbm__max_depth=2, lgbm__n_estimators=40....................
[CV 2/3; 2/9] END ..lgbm__max_depth=2, lgbm__n_estimators=40; total time=  47.6s
[CV 3/3; 2/9] START lgbm__max_depth=2, lgbm__n_estimators=40....................
[CV 3/3; 2/9] END ..lgbm__max_depth=2, lgbm__n_es

In [89]:
pred_train = grid_lgbm.predict(corpus_train)
f1 = f1_score(target_train, pred_train)
f1

0.6194167589516426

I trained two models. Logistic regression reached F1 = 0.84 on the training set. LightGBM performed worse, with F1 = 0.62 on the same data.

In [90]:
# Evaluate the logistic regression model on the test set
pred_test = grid_log.predict(corpus_test)
f1 = f1_score(target_test, pred_test)
f1

0.7514592099904981

## Conclusions

I loaded, explored, and prepared the data for further work: I lemmatized and cleaned the text and loaded the stop words. I then trained two models, logistic regression and LightGBM. Logistic regression met the required F1 threshold of 0.75 on both the training set (0.84) and the test set (0.75).