
# Project for "Wikishop"

The online store "Wikishop" is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as toxic or non-toxic. A dataset labeled for the toxicity of edits is available.

Build a model with an *F1* score of at least 0.75.
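
As a reminder of what the target metric measures, here is a minimal sketch of *F1* computed from confusion-matrix counts. The helper name `f1_from_counts` and the counts are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)  # share of flagged comments that are toxic
    recall = tp / (tp + fn)     # share of toxic comments that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 75 toxic comments found, 25 false alarms, 25 missed
score = f1_from_counts(tp=75, fp=25, fn=25)
print(round(score, 2))  # 0.75
```

With these counts both precision and recall equal 0.75, so the score sits exactly at the project threshold.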

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

## Preparation

In [2]:
# Standard library
import re
import warnings

# Data handling and text processing
import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords

# Modeling and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

In [3]:
warnings.filterwarnings('ignore')

In [5]:
df = pd.read_csv('~/toxic_comments.csv')

### Let's get acquainted with the data

In [6]:
df.head()

|  | Unnamed: 0 | text | toxic |
| --- | --- | --- | --- |
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


The comments are in English; let's note that, since it determines the choice of lemmatizer and stop words.

In [8]:
df.duplicated().sum()

0

In [9]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

There is a clear class imbalance: toxic comments make up only about 10% of the dataset.
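
The imbalance can be quantified directly from the counts printed above. The sketch below only redoes that arithmetic; the variable names are illustrative. It also shows why accuracy would be a misleading metric here: a hypothetical model that labels every comment as non-toxic would already be right about 90% of the time while finding no toxic comments at all.

```python
# Class counts copied from the df['toxic'].value_counts() output above
counts = {0: 143106, 1: 16186}

total = sum(counts.values())
toxic_share = counts[1] / total            # fraction of toxic comments
neg_per_pos = counts[0] / counts[1]        # non-toxic comments per toxic one
always_zero_accuracy = counts[0] / total   # accuracy of "predict 0 for everything"

print(f"toxic share: {toxic_share:.3f}")                      # 0.102
print(f"non-toxic per toxic: {neg_per_pos:.2f}")              # 8.84
print(f"accuracy of constant 0: {always_zero_accuracy:.3f}")  # 0.898
```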

### Preparing the features

#### spaCy lemmatization

In [19]:
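# Load the small English pipeline; the parser and NER are not needed for lemmas
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def spacy_lemm(text):
    """Lemmatize a comment and keep only Latin letters separated by single spaces."""
    doc = nlp(text)
    lemma = ' '.join(token.lemma_ for token in doc)
    lemma = re.sub(r'[^A-Za-z]', ' ', lemma)  # replace every non-letter with a space
    return ' '.join(lemma.split())            # collapse repeated whitespace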

In [20]:
%%time
df['lemm']=df['text'].apply(spacy_lemm)

CPU times: user 16min 23s, sys: 7.25 s, total: 16min 30s
Wall time: 16min 33s


Lemmatization is slow: about 16.5 minutes for 159,292 comments.
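
One option that may reduce this time is spaCy's `nlp.pipe`, which processes texts in batches instead of calling the pipeline once per row. The sketch below is an untimed assumption, not something measured in this notebook. The helper name `lemmatize_batch` and the batch size are hypothetical, and the pipeline is passed in as an argument so that the function does not depend on a particular model.

```python
import re


def lemmatize_batch(nlp, texts, batch_size=1000):
    """Lemmatize an iterable of texts using nlp.pipe (batched processing)."""
    lemmas = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        joined = ' '.join(token.lemma_ for token in doc)
        joined = re.sub(r'[^A-Za-z]', ' ', joined)  # same cleaning as spacy_lemm
        lemmas.append(' '.join(joined.split()))
    return lemmas


# Possible usage in this notebook (assumes `nlp` and `df` from the cells above):
# df['lemm'] = lemmatize_batch(nlp, df['text'])
```

The cleaning steps mirror `spacy_lemm`, so the resulting column should be the same; only the way documents are fed to the pipeline differs.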

In [21]:
df.head()

|  | Unnamed: 0 | text | toxic | lemm |
| --- | --- | --- | --- | --- |
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edit make under my usernam... |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | d aww he match this background colour I be see... |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man I be really not try to edit war it be ... |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I can not make any real suggestion on imp... |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir be my hero any chance you remember wha... |


### Preparing the samples
Let's split the data into training and validation sets before tokenization, so that the TF-IDF vocabulary is fitted on the training set only.

In [22]:
target = df['toxic']
features = df['lemm']

features_train, features_valid, target_train, target_valid = train_test_split(features, 
                                                                              target, 
                                                                              test_size=0.2, 
                                                                              random_state=12345)

In [23]:
print('train:',features_train.shape[0])
print('valid:',features_valid.shape[0])

train: 127433
valid: 31859


In [24]:
features_train.head()

97400     bushranger you be a GRASS with no sense of hum...
4383      need administrative help I have be block iniqu...
103680    I would also like to point out that he have us...
38573        you can not block I you fuck retard BRB nigger
128311    I believe that the frequency of the wave need ...
Name: lemm, dtype: object

#### Tokenization

In [25]:
nltk.download('stopwords')
stop_words_my = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/konn4/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
corpus_train = features_train
corpus_valid = features_valid

In [31]:
%%time
# stop_words is documented as a list, so convert the set before passing it
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words_my))

# Fit the vocabulary on the training corpus only, then reuse it for validation
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_valid = count_tf_idf.transform(corpus_valid)

print("Train matrix shape:", tf_idf_train.shape)
print("Valid matrix shape:", tf_idf_valid.shape)

Train matrix shape: (127433, 137453)
Valid matrix shape: (31859, 137453)
CPU times: user 3.6 s, sys: 49.6 ms, total: 3.65 s
Wall time: 3.65 s
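
To make the weighting behind these matrices concrete, here is a minimal pure-Python sketch of TF-IDF with smoothed IDF and L2 row normalization, which is the default scheme of `TfidfVectorizer`. The helper name `tfidf_rows`, the toy documents and the whitespace tokenizer are assumptions made for illustration; the real vectorizer uses its own token pattern and stop-word filtering and returns a sparse matrix.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {t: c * idf[t] for t, c in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Toy corpus (hypothetical): "edit" occurs everywhere, the other words once
rows = tfidf_rows(["edit war", "edit page", "edit hero"])
print(rows[0])  # "war" gets a larger weight than the ubiquitous "edit"
```

Terms that appear in every document carry little information about a single comment, which is why their weight is pushed down relative to rarer terms.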


## Training
### LogisticRegression on unbalanced classes

In [32]:
cv = KFold(n_splits=3, shuffle=True, random_state=12345)

In [33]:
%%time

model_lr = LogisticRegression()

train_f1 = cross_val_score(model_lr, 
                      tf_idf_train, 
                      target_train, 
                      cv=cv, 
                      scoring='f1')

print('F1 on CV', train_f1.mean())

F1 on CV 0.7089543561992754
CPU times: user 6.64 s, sys: 1.83 s, total: 8.47 s
Wall time: 4.76 s


In [34]:
model_lr.fit(tf_idf_train, target_train)

prediction_lr_valid = model_lr.predict(tf_idf_valid)

print('F1 valid', f1_score(target_valid, prediction_lr_valid))

F1 valid 0.7497681320719717


### LogisticRegression with class_weight='balanced'

In [35]:
%%time

model_lr_bal = LogisticRegression(class_weight='balanced')

train_f1 = cross_val_score(model_lr_bal, 
                      tf_idf_train, 
                      target_train, 
                      cv=cv, 
                      scoring='f1')

print('F1 on CV', train_f1.mean())

F1 on CV 0.7472192456303787
CPU times: user 6.85 s, sys: 1.88 s, total: 8.73 s
Wall time: 4.91 s


In [36]:
model_lr_bal.fit(tf_idf_train, target_train)

prediction_lr_valid = model_lr_bal.predict(tf_idf_valid)

print('F1 valid', f1_score(target_valid, prediction_lr_valid))

F1 valid 0.750943396226415
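
For reference, `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count_of_class)`, so that both classes contribute equally to the loss in total. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` is hypothetical, and the counts are those of the full dataset shown earlier; the model itself is fitted on the training split, where the counts are slightly different.

```python
def balanced_class_weights(counts):
    """Weights per class: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Full-dataset counts from df['toxic'].value_counts(), used for illustration
counts = {0: 143106, 1: 16186}
weights = balanced_class_weights(counts)

print({label: round(w, 2) for label, w in weights.items()})  # {0: 0.56, 1: 4.92}
```

Each toxic comment therefore counts roughly nine times as much as a non-toxic one during training, which matches the class ratio.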


### DecisionTree with class_weight='balanced'

In [37]:
%%time

model_dt = DecisionTreeClassifier()
params = [{'max_depth':[10,30,50,100],
                'random_state':[12345],
                'class_weight':['balanced']}]

gridsearch = GridSearchCV(model_dt, params, scoring='f1',cv=cv)
gridsearch.fit(tf_idf_train, target_train)
print(gridsearch.best_params_)

{'class_weight': 'balanced', 'max_depth': 100, 'random_state': 12345}
CPU times: user 4min 18s, sys: 784 ms, total: 4min 19s
Wall time: 4min 19s


In [38]:
gridsearch.best_score_

0.6423755884378052

In [39]:
prediction_dt_valid = gridsearch.best_estimator_.predict(tf_idf_valid)

print('DecisionTree F1 valid', f1_score(target_valid, prediction_dt_valid))

DecisionTree F1 valid 0.6489563567362429


### Balancing the classes by downsampling

In [40]:
# target_train.index holds index labels, so select the rows with .loc
toxic_comments_train = df.loc[target_train.index]

target_train_0 = toxic_comments_train[toxic_comments_train['toxic'] == 0]['toxic']
target_train_1 = toxic_comments_train[toxic_comments_train['toxic'] == 1]['toxic']

In [41]:
print('Class imbalance: 1 to', target_train_0.shape[0]/target_train_1.shape[0])

Class imbalance: 1 to 8.835828959555418


In [42]:
# Keep as many non-toxic comments as there are toxic ones
target_train_0_downsample = target_train_0.sample(target_train_1.shape[0], random_state=12345)
target_train_downsample = pd.concat([target_train_0_downsample, target_train_1])

# Select the matching rows by index label
features_train_downsample = df.loc[target_train_downsample.index]

features_train_downsample, target_train_downsample = shuffle(features_train_downsample,
                                                             target_train_downsample,
                                                             random_state=12345)

# Reuse the TF-IDF vocabulary fitted on the full training set
features_train_downsample = count_tf_idf.transform(features_train_downsample['lemm'].values.astype('U'))

### Logistic regression on the downsampled set

In [43]:
%%time

model_lr_ds = LogisticRegression()

train_f1 = cross_val_score(model_lr_ds, 
                      features_train_downsample, 
                      target_train_downsample, 
                      cv=cv, 
                      scoring='f1')

print('F1 on CV', train_f1.mean())

F1 on CV 0.888140831709479
CPU times: user 3.65 s, sys: 1.75 s, total: 5.4 s
Wall time: 1.66 s


In [44]:
model_lr_ds.fit(features_train_downsample, target_train_downsample)

prediction_lr_valid_ds = model_lr_ds.predict(tf_idf_valid)

print('F1 valid downsampled', f1_score(target_valid, prediction_lr_valid_ds))

F1 valid downsampled 0.7006211180124224
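
The gap between the cross-validation score and the validation score is expected: the folds come from the balanced, downsampled set, while the validation set keeps the original class ratio. The sketch below illustrates the effect with a fixed classifier evaluated at two class ratios. The function name `f1_at_ratio` and the recall and false-positive-rate values are hypothetical, chosen only to show the direction of the change, and are not measurements from this notebook.

```python
def f1_at_ratio(recall: float, fpr: float, neg_per_pos: float) -> float:
    """F1 of a fixed classifier when there are `neg_per_pos` negatives per positive."""
    true_pos = recall              # per one positive example
    false_pos = fpr * neg_per_pos  # false alarms scale with the number of negatives
    precision = true_pos / (true_pos + false_pos)
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier: finds 88% of toxic comments, 10% false-positive rate
f1_balanced = f1_at_ratio(recall=0.88, fpr=0.10, neg_per_pos=1.0)
f1_original = f1_at_ratio(recall=0.88, fpr=0.10, neg_per_pos=8.84)

print(f"balanced folds: {f1_balanced:.3f}")
print(f"original ratio: {f1_original:.3f}")
```

Recall does not depend on the class ratio, but precision does: with about nine times more non-toxic comments, the same false-positive rate produces about nine times more false alarms, which lowers *F1*.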


## Conclusions

| Model | F1 (CV on train) | F1 valid | Pass |
| ----------- | ----------- | -------- | --- |
| LogisticRegression | 0.7090 | 0.7498 | no |
| LogisticRegression balanced | 0.7472 | 0.7509 | yes |
| LogisticRegression downsampled | 0.8881 | 0.7006 | no |
| DecisionTree balanced | 0.6424 | 0.6490 | no |

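### Conclusion
Logistic regression with `class_weight='balanced'` reached an *F1* of 0.7509 on the validation set, just above the required 0.75. Two things helped: lemmatizing the text, and keeping the training set larger by splitting the data only into training and validation parts (80/20), without a separate test set. Downsampling gave the highest cross-validation score (0.8881), but that score was computed on balanced folds; on the validation set with the original class ratio the same model reached only 0.7006.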

## Checklist

- [x] Jupyter Notebook opens
- [x] All code runs without errors
- [x] Cells with code are arranged in execution order
- [x] Data loaded and prepared
- [x] Models trained
- [x] Metric value *F1* not less than 0.75
- [x] Conclusions written