# Project for "Wikishop"

The online store "Wikishop" is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on each other's changes. The store needs a tool that detects toxic comments and sends them for moderation.

**To Be Done**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.

**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

## Data Preparation

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import lightgbm as lgbm
from catboost import CatBoostClassifier
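import warnings
warnings.filterwarnings('ignore')

# NLTK resources used below; newer NLTK releases may also require 'punkt_tab'.
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)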

In [None]:
data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

First look at the data

In [None]:
data.sample(10)

Unnamed: 0,text,toxic
141896,Whoooo HOOO Kitty!! \n\n I thought you Doucheb...,0
85873,Proposed deletion from because of Wikipedia:G...,0
79509,"""\n{{unblock|I have not been had """"several war...",0
142075,Image copyright problem with Image:Annang_Map_...,0
144890,Second nice guy I meet here. Hope there are mo...,0
124035,I'm interested in how this checkuser turned ou...,0
66581,"""\n\nAlmost forgot about this, here are the is...",0
22040,Re:Disam \n\nWhy thank you! ) I really appreci...,0
59230,In Wikipedia the terms to be used are generall...,0
55136,Question \nI have seen portions of the discus...,0


The dataset has two columns: the comment text and the target label. At first glance, everything looks clean.

In [None]:
data['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

The classes are imbalanced: only about 10% of the comments are toxic. When training the models, we will try the class-weighting parameters.
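
To make the imbalance concrete, here is a minimal sketch that computes the toxic share and the weights a "balanced" heuristic would assign. The formula `n_samples / (n_classes * count)` is the one scikit-learn documents for `class_weight='balanced'`; other libraries may weight classes differently, so treat the numbers as illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 143346, 1: 16225}

n_samples = sum(counts.values())
n_classes = len(counts)

# Fraction of comments labelled toxic.
toxic_share = counts[1] / n_samples

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so the rare class gets a proportionally larger weight.
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
print({label: round(weight, 3) for label, weight in balanced_weights.items()})
```

With these weights, each toxic comment counts roughly nine times as much as a non-toxic one in the loss.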

In [None]:
data.duplicated().sum()

0

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


No duplicates or missing values found.

<br>
Let's preprocess the texts: first we strip everything except Latin letters (digits, punctuation, etc.), then split the text into lowercase tokens, lemmatize them, and remove the stop words.

In [None]:
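# Build these once; creating them inside the functions repeats the work for every row.
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))

def clear_text(text):
    # Replace every non-letter with a space, then collapse runs of whitespace.
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(text.split())

def lemmatize(text):
    # Without a POS tag, WordNetLemmatizer treats every token as a noun.
    return [LEMMATIZER.lemmatize(word) for word in text]

def tokenization(text):
    # Split into word tokens and lowercase them.
    return [word.lower() for word in word_tokenize(text)]

def remove_stop_words(text):
    # A set gives constant-time membership checks.
    return [word for word in text if word.lower() not in STOP_WORDS]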
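
As a quick illustration of the cleaning step alone, the sketch below reuses `clear_text` from the cell above on a made-up comment (the input string is hypothetical, not taken verbatim from the dataset). It needs no NLTK resources.

```python
import re

def clear_text(text):
    # Same logic as the notebook's clear_text: keep Latin letters only.
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(text.split())

# Hypothetical input, used only to show the effect of the regex.
raw = "D'aww! He matches this background colour, 21:51 UTC"
cleaned = clear_text(raw)
print(cleaned)
```

Note the side effects: apostrophes split contractions into fragments such as `D aww`, and any non-ASCII letters are removed along with digits and punctuation.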

In [None]:
data['text'] = data['text'].apply(clear_text)
data['tokens'] = data['text'].apply(tokenization)
data['lemmatize'] = data['tokens'].apply(lemmatize)
data['clear_text'] = data['lemmatize'].apply(remove_stop_words)

In [None]:
data.head(3)

Unnamed: 0,text,toxic,tokens,lemmatize,clear_text
0,Explanation Why the edits made under my userna...,0,"[explanation, why, the, edits, made, under, my...","[explanation, why, the, edits, made, under, my...","[explanation, edits, made, username, hardcore,..."
1,D aww He matches this background colour I m se...,0,"[d, aww, he, matches, this, background, colour...","[d, aww, he, match, this, background, colour, ...","[aww, match, background, colour, seemingly, st..."
2,Hey man I m really not trying to edit war It s...,0,"[hey, man, i, m, really, not, trying, to, edit...","[hey, man, i, m, really, not, trying, to, edit...","[hey, man, really, trying, edit, war, guy, con..."


The data is ready for vectorization; from here on we use the **clear_text** column.

## Training

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['clear_text'], data['toxic'], test_size=0.25)

In [None]:
tf_idf = TfidfVectorizer()

In [None]:
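# Each row is a list of tokens; astype('U') casts it to its string form, which
# the vectorizer's default token pattern splits back into words
# (tokens shorter than two characters are dropped).
X_train = X_train.astype('U')
X_train = tf_idf.fit_transform(X_train)  # fit the vocabulary on the training set only
X_test = X_test.astype('U')
X_test = tf_idf.transform(X_test)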
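
Casting token lists to strings works here because of how the vectorizer tokenizes. The sketch below assumes the cast yields Python's string form of the list and applies scikit-learn's documented default `token_pattern` with plain `re`; it shows that the recovered tokens match those from a space-joined string. The sample tokens are taken from the `clear_text` preview above.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens = ['hey', 'man', 'really', 'trying', 'edit', 'war']

as_repr = str(tokens)         # string form of the list, brackets and quotes included
as_joined = ' '.join(tokens)  # the more conventional input format

from_repr = TOKEN_PATTERN.findall(as_repr)
from_joined = TOKEN_PATTERN.findall(as_joined)

print(from_repr == from_joined == tokens)
```

Joining the tokens with spaces before vectorizing would give the same features and makes the intent more explicit.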

### Model training

<br>
The results of parameter selection and model training are stored in a separate variable for convenience.

In [None]:
f1_results = [{'best_params': {'C': 10.0, 'class_weight': None, 'penalty': 'l2'},
  'f1_score': 0.7728907330567081,
  'model': 'LogisticRegression'},
              
 {'best_params': {'auto_class_weights': 'Balanced',
   'depth': 4,
   'iterations': 500,
   'l2_leaf_reg': 3,
   'learning_rate': 0.03},
  'f1_score': 0.7177248052867595,
  'model': 'CatBoostClassifier'},
              
 {'best_params': {'class_weight': None,
   'learning_rate': 0.03,
   'min_data_in_leaf': 30,
   'num_leaves': 80,
   'objective': 'binary',
   'reg_alpha': 0.1},
  'f1_score': 0.7249448123620309,
  'model': 'LGBMClassifier'}]
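
The search that produced these numbers is not shown. The sketch below illustrates how a comparable search for logistic regression could be set up with `GridSearchCV` and F1 scoring. The parameter grid, the fold count, and the synthetic data are assumptions made for illustration; only the parameter names `C` and `class_weight` come from the stored `best_params`, and the penalty is left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the TF-IDF matrix (about 10% positives).
X, y = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: the values actually searched are not given in the source.
param_grid = {
    'C': [0.1, 1.0, 10.0],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring='f1',  # optimize the same metric the notebook reports
    cv=3,
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out split.
test_f1 = f1_score(y_te, search.predict(X_te))
print(search.best_params_, round(test_f1, 3))
```

Scoring on F1 during the search matters with imbalanced classes, since accuracy would reward predicting the majority class.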

In [None]:
pd.DataFrame(data=f1_results, columns=['model', 'f1_score', 'best_params']).sort_values(by='f1_score', ascending=False)

Unnamed: 0,model,f1_score,best_params
0,LogisticRegression,0.772891,"{'C': 10.0, 'class_weight': None, 'penalty': '..."
2,LGBMClassifier,0.724945,"{'class_weight': None, 'learning_rate': 0.03, ..."
1,CatBoostClassifier,0.717725,"{'auto_class_weights': 'Balanced', 'depth': 4,..."
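
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion counts (the counts are hypothetical) and then checks the stored scores against the 0.75 target named in the conclusion.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 given true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts, for illustration only.
example = f1_from_counts(tp=80, fp=20, fn=40)

# Scores copied from the results table above.
scores = {
    'LogisticRegression': 0.772891,
    'LGBMClassifier': 0.724945,
    'CatBoostClassifier': 0.717725,
}
passing = [name for name, score in scores.items() if score > 0.75]

print(round(example, 3), passing)
```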


## Conclusion

Of the models trained across several hyperparameter settings, only logistic regression reached the required F1 > 0.75 (F1 ≈ 0.773). Class weighting did not help it: the best configuration used `class_weight=None`.