The store is launching a new service. Now users can edit and add product descriptions. Customers propose their edits and comment on others' edits. The store needs a tool that will search for toxic comments and submit them for moderation.

Train a model to classify comments as positive or negative (toxic). We have a dataset with toxicity labels for the edits.

Build a model with an *F1* score of at least 0.75.
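
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it in plain Python from raw counts. The function name `f1_from_labels` and the toy labels are illustrative assumptions only; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: this sketch simply reports 0.0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): 2 hits, 1 false alarm, 1 miss
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Here precision and recall are both 2/3, so F1 is also 2/3. True negatives do not enter the formula at all.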

#### Table of contents <a id='contents'></a>


1. [About data](#meatdata)

- 1.1. [Library import and check data](#library_table)
- 1.2. [Duplicates and nulls](#nulls_dupls)
- 1.3. [Clear text](#lemmas)
- 1.4. [Conclusion](#conclusion_1)

2. [Train](#training)

- 2.1. [Split data](#split)
- 2.2. [Vectorize and transform features](#vect_transform_feauters)
- 2.3. [Get params](#params)
- 2.4. [Train](#train)
- 2.5. [Test](#test)
- 2.6. [Conclusion](#conclusion_2)

3. [Final conclusion](#conclusion_total)

# 1.  About data<a id='meatdata'></a>

### Library import and check data<a id='library_table'></a>

In [1]:
#library import
import re

import nltk
import pandas as pd
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#tagger model used by nltk.pos_tag during lemmatization
nltk.download('averaged_perceptron_tagger')
#other NLTK resources used below: tokenizer, WordNet and the stop-word list
for resource in ('punkt', 'wordnet', 'stopwords'):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Helga\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
#check data
toxic_comments = pd.read_csv('toxic_comments.csv')
display(toxic_comments.head())

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [3]:
toxic_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


### Duplicates and nulls<a id='nulls_dupls'></a>

In [4]:
#check duplicates
toxic_comments.duplicated().sum()

0

In [5]:
#check nulls
toxic_comments.isnull().sum()

text     0
toxic    0
dtype: int64

### Clear text<a id='lemmas'></a>

In [6]:
#text to string
toxic_comments['text'] = toxic_comments['text'].astype(str)

In [7]:
#clear text function: keep Latin letters only and collapse whitespace
def clear_text(text):
    new_text = re.sub(r'[^a-zA-Z]',' ', text)
    return " ".join(new_text.split())

toxic_comments['lemm_text']  = toxic_comments['text'].apply(clear_text)
display(toxic_comments.head())

| | text | toxic | lemm_text |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | Explanation Why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | D aww He matches this background colour I m se... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | Hey man I m really not trying to edit war It s... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I can t make any real suggestions on impr... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | You sir are my hero Any chance you remember wh... |


In [8]:
#str.lower and type string
toxic_comments['lemm_text'] = toxic_comments['lemm_text'].str.lower()
toxic_comments['lemm_text'] = toxic_comments['lemm_text'].astype(str)

In [9]:
#lemmatize function
def get_wordnet_pos(word):
    """Map the word's POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)


def lemm_lemm(sentence):
    """Lemmatize every token of the sentence and join the lemmas with spaces"""
    return " ".join(lemmatizer.lemmatize(w, get_wordnet_pos(w))
                    for w in nltk.word_tokenize(sentence))
    

toxic_comments['lemm_text']  = toxic_comments['lemm_text'].apply(lemm_lemm)
display(toxic_comments.head())

| | text | toxic | lemm_text |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits make under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | d aww he match this background colour i m seem... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man i m really not try to edit war it s ju... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i can t make any real suggestion on impro... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir be my hero any chance you remember wha... |


In [10]:
#check types
toxic_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 3 columns):
text         159571 non-null object
toxic        159571 non-null int64
lemm_text    159571 non-null object
dtypes: int64(1), object(2)
memory usage: 3.7+ MB


### Conclusion <a id='conclusion_1'></a>

The dataset has 159571 entries and 2 columns:
- text - the comment text;
- toxic - toxicity flag of the comment (0 - no, 1 - yes).

We later added a third column:
- lemm_text - the cleaned and lemmatized text.

There are no missing values or duplicates.

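Before lemmatization, the first function cleans the text: only Latin letters and spaces are kept, while punctuation marks and digits are replaced with spaces. The text is then converted to lowercase.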
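
The cleaning step can be tried on a single string. The sketch below repeats the notebook's `clear_text` function and applies it to the visible start of the second comment from the preview table; the second sample string is made up for illustration.

```python
import re


def clear_text(text):
    """Replace every character that is not a Latin letter with a space,
    then collapse runs of whitespace into single spaces."""
    new_text = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(new_text.split())


sample = "D'aww! He matches this background colour"
cleaned = clear_text(sample).lower()
print(cleaned)  # d aww he matches this background colour

# Digits disappear too, which can leave fragments such as "nd" from "2nd"
with_digits = clear_text("Room 101, 2nd floor")
print(with_digits)  # Room nd floor
```

Note that apostrophes are also replaced, so contractions such as "I'm" are split into "I m", as seen in the lemm_text column.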

The second function performs the lemmatization: it reduces each word to its dictionary form (lemma).
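
The lemmatizer needs to know the part of speech of each word, and the helper inside the lemmatization function derives it from the first letter of the tag returned by the tagger. The sketch below shows only that mapping, without calling NLTK. The one-letter codes are assumed to equal `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`, and the Penn Treebank tags are written by hand rather than produced by a tagger.

```python
# One-letter WordNet POS codes (assumed values of wordnet.ADJ, wordnet.NOUN,
# wordnet.VERB and wordnet.ADV in NLTK).
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def wordnet_pos_from_tag(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[0].upper(), "n")


# Hand-written example tags: adjective, plural noun, past-tense verb,
# adverb and preposition.
mapped = {tag: wordnet_pos_from_tag(tag) for tag in ["JJ", "NNS", "VBD", "RB", "IN"]}
print(mapped)  # {'JJ': 'a', 'NNS': 'n', 'VBD': 'v', 'RB': 'r', 'IN': 'n'}
```

Any tag outside the four groups falls back to noun. Because the notebook tags each word in isolation, the tag it gets may differ from the one a tagger would assign when given the whole sentence.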


[To contents](#contents)

# 2. Train<a id='training'></a>

### Split data<a id='split'></a>

In [11]:
#features & target
features = toxic_comments['lemm_text']
target = toxic_comments['toxic']

#hold out 20% of the data as the test set
features_train_valid, features_test, target_train_valid, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345)

#split the remaining 80% into train (60% of all data) and validation (20%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_valid, target_train_valid, test_size=0.25, random_state=12345)

print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(95742,)
(31914,)
(31915,)
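
The three shapes printed above follow from applying `test_size=0.20` and then `test_size=0.25` to the remainder, which gives roughly a 60/20/20 split. The arithmetic can be sketched in plain Python as below. The helper `split_sizes` is hypothetical, and it assumes that the test part is rounded up to a whole number of rows, which is consistent with the sizes printed by the notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test part is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 159571
n_train_valid, n_test = split_sizes(n_total, 0.20)    # first split: hold out the test set
n_train, n_valid = split_sizes(n_train_valid, 0.25)   # second split: hold out validation
print(n_train, n_valid, n_test)  # 95742 31914 31915
print(round(n_train / n_total, 2), round(n_valid / n_total, 2), round(n_test / n_total, 2))
```

Taking 25% of the remaining 80% gives 20% of the whole dataset, so the validation and test sets end up almost the same size.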


### Vectorize and transform features<a id='vect_transform_feauters'></a>

In [12]:
#TfidfVectorizer
vect = TfidfVectorizer(stop_words=nltk_stopwords.words('english'),lowercase=True)


new_features_train = vect.fit_transform(features_train)

new_features_valid = vect.transform(features_valid)

new_features_test = vect.transform(features_test)


print(new_features_train.shape)
print(new_features_valid.shape)
print(new_features_test.shape)

(95742, 110880)
(31914, 110880)
(31915, 110880)


### Get params<a id='params'></a>

In [13]:
#params depth for DecisionTreeClassifier
for depth in range(15, 21):
    model =  DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(new_features_train, target_train)
    predicted_valid = model.predict(new_features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("max_depth =", depth, ": ", end='')
    print("f1:", f1) 

max_depth = 15 : f1: 0.6446788111217642
max_depth = 16 : f1: 0.6466882992937584
max_depth = 17 : f1: 0.6520090978013646
max_depth = 18 : f1: 0.6502463054187192
max_depth = 19 : f1: 0.6600188146754468
max_depth = 20 : f1: 0.6654177594604721


In [14]:
#params n_estimators for RandomForestClassifier
for est in range(15, 20):
    model =  RandomForestClassifier(random_state=12345, n_estimators = est)
    model.fit(new_features_train, target_train)
    predicted_valid = model.predict(new_features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("max_est =", est, ": ", end='')
    print("f1:", f1)

max_est = 15 : f1: 0.69660014781966
max_est = 16 : f1: 0.6789234268385141
max_est = 17 : f1: 0.7014042867701404
max_est = 18 : f1: 0.6863816161235637
max_est = 19 : f1: 0.7023126734505087


In [15]:
#params for LightGBM min_data_in_leaf
for min_data in range(5, 10):
    model = LGBMClassifier(random_state=12345, min_data_in_leaf=min_data)
    model.fit(new_features_train,target_train)
    predicted_valid = model.predict(new_features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("min_data_in_leaf =", min_data, ": ", end='') 
    print('f1:', f1)

min_data_in_leaf = 5 : f1: 0.7485051002462187
min_data_in_leaf = 6 : f1: 0.750394667602175
min_data_in_leaf = 7 : f1: 0.749472202674173
min_data_in_leaf = 8 : f1: 0.7467509659290482
min_data_in_leaf = 9 : f1: 0.748509294984216


In [16]:
#params for LightGBM max_depth
for depth in range(28, 35):
    model =  LGBMClassifier(random_state=12345,min_data_in_leaf=6, max_depth=depth)
    model.fit(new_features_train,target_train)
    predicted_valid = model.predict(new_features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("max_depth =", depth, ": ", end='') 
    print('f1:', f1)

max_depth = 28 : f1: 0.7490333919156416
max_depth = 29 : f1: 0.7499125568380552
max_depth = 30 : f1: 0.750394667602175
max_depth = 31 : f1: 0.750394667602175
max_depth = 32 : f1: 0.750394667602175
max_depth = 33 : f1: 0.750394667602175
max_depth = 34 : f1: 0.750394667602175


In [17]:
#params for LightGBM num_leaves
for nl in range(29, 35):
    model =  LGBMClassifier(random_state=12345,min_data_in_leaf=6, max_depth=31, num_leaves=nl)
    model.fit(new_features_train,target_train)
    predicted_valid = model.predict(new_features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("num_leaves =", nl, ": ", end='') 
    print('f1:', f1)

num_leaves = 29 : f1: 0.7466007416563659
num_leaves = 30 : f1: 0.7463946535349982
num_leaves = 31 : f1: 0.750394667602175
num_leaves = 32 : f1: 0.7491662278392137
num_leaves = 33 : f1: 0.7519664394336656
num_leaves = 34 : f1: 0.7526656178989686
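
Picking the best value from such a loop can also be done programmatically instead of by eye. The sketch below uses the `num_leaves` scores copied from the output above; the helper `best_param` is a hypothetical name introduced for illustration.

```python
# Validation F1 for each num_leaves value, copied from the output above
f1_by_num_leaves = {
    29: 0.7466007416563659,
    30: 0.7463946535349982,
    31: 0.750394667602175,
    32: 0.7491662278392137,
    33: 0.7519664394336656,
    34: 0.7526656178989686,
}


def best_param(scores):
    """Return the (value, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


best_value, best_f1 = best_param(f1_by_num_leaves)
print(best_value, round(best_f1, 4))  # 34 0.7527
```

The best score sits at the edge of the searched range, so a wider range might be worth trying. The differences are in the third decimal place and come from a single validation split, so they may not be significant.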


### Train<a id='train'></a>

In [18]:
#train function
def classifier_model(model):
    """Fit the model on the training set and print its F1 on the validation set"""
    model.fit(new_features_train, target_train)
    predicted_valid = model.predict(new_features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print("f1:", f1)

In [19]:
#train LogisticRegression
classifier_model(LogisticRegression())



f1: 0.7338444687842279


In [20]:
#train RandomForestClassifier
classifier_model(RandomForestClassifier(random_state=12345, n_estimators = 19))

f1: 0.7023126734505087


In [21]:
#train DecisionTreeClassifier
classifier_model(DecisionTreeClassifier(random_state=12345, max_depth= 20))

f1: 0.6654177594604721


In [22]:
#train LGBMClassifier
classifier_model(LGBMClassifier(random_state=12345,max_depth=33, max_bin=6, min_data_in_leaf = 6,num_leaves=33))

f1: 0.7470049330514447


***LGBMClassifier has the best validation F1, so we check it on the test set***

### Test<a id='test'></a>

In [23]:
#test LGBMClassifier
model = LGBMClassifier(random_state=12345,max_depth=33, max_bin=6, min_data_in_leaf = 6,num_leaves=33)
model.fit(new_features_train, target_train)
predicted_test = model.predict(new_features_test)
f1 = f1_score(target_test, predicted_test)
print("f1:", f1)

f1: 0.7482993197278912


### Conclusion <a id='conclusion_2'></a>

We use a standard train/validation/test split (60/20/20); the new part is the feature preparation.

We vectorize the features with TF-IDF. The vectorizer is fitted on features_train only; the validation and test features are only transformed with the already fitted vectorizer.
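
Why the vectorizer is fitted on the training set only can be shown with a simplified TF-IDF in plain Python. This is a sketch under stated assumptions, not the code of `TfidfVectorizer`: it uses a smoothed IDF, skips normalization and stop-word filtering, and the function names `fit_idf` and `transform` as well as the toy documents are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Weight term counts by the learned IDF; terms unseen during fitting are dropped."""
    counts = Counter(word for word in doc.split() if word in idf)
    return {term: count * idf[term] for term, count in counts.items()}


train_docs = ["you be my hero", "you be rude", "edit war again"]
idf = fit_idf(train_docs)

# A validation document is only transformed; "a" and "troll" were not seen
# during fitting, so they do not become features.
valid_vector = transform("you be a rude troll", idf)
print(sorted(valid_vector))  # ['be', 'rude', 'you']
```

The vocabulary and the weights come from the training documents alone, so the validation and test sets always have the same columns as the training matrix and do not leak information into the features.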

On validation, LGBMClassifier has the best F1 (0.75), followed by LogisticRegression (0.73).

On the test set we evaluate only LGBMClassifier: F1 = 0.748, which rounds to the desired 0.75 but is marginally below it before rounding.

[To contents](#contents)

# 3. Final conclusion  <a id='conclusion_total'></a>

The dataset has 159571 entries and 2 columns: the comment text and the toxicity label (0/1).

We added another column with cleaned and lemmatized text (in other words, the words were reduced to their dictionary form). Then we converted this column into TF-IDF features with TfidfVectorizer: fit_transform on the training set, transform on the validation and test sets.

We compared the following models on the validation set:
- Logistic regression: f1 = 0.73
- Random forest: f1 = 0.70
- Decision tree: f1 = 0.67
- LGBMClassifier: f1 = 0.75

As a result, LGBMClassifier is the winner; on the test set it reaches f1 = 0.748 (0.75 after rounding).
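
The comparison can be reproduced from the unrounded scores printed in the notebook. The sketch below ranks the models by validation F1 and checks the test score against the target; the variable names are illustrative.

```python
TARGET_F1 = 0.75

# Validation F1 of each model, copied from the training cells
validation_f1 = {
    "LogisticRegression": 0.7338444687842279,
    "RandomForestClassifier": 0.7023126734505087,
    "DecisionTreeClassifier": 0.6654177594604721,
    "LGBMClassifier": 0.7470049330514447,
}
# F1 of LGBMClassifier on the test set, copied from the test cell
test_f1 = 0.7482993197278912

ranking = sorted(validation_f1, key=validation_f1.get, reverse=True)
print(ranking[0])            # LGBMClassifier
print(round(test_f1, 2))     # 0.75
print(test_f1 >= TARGET_F1)  # False
```

The test score reaches the target only after rounding to two decimals; the unrounded value is about 0.002 short of 0.75.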


[To contents](#contents)