# Project for the Wikishop online store

## Preparation

### Loading the necessary libraries

In [1]:
!pip install tqdm

from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, roc_curve, roc_auc_score
from sklearn.dummy import DummyClassifier
#from sklearn.naive_bayes import GaussianNB, BernoulliNB

import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords as nltk_stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline

import seaborn as sns
from matplotlib import pyplot as plt

#!pip install pymystem3
#from pymystem3 import Mystem
import os

import warnings
warnings.simplefilter('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\exeve\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\exeve\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\exeve\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\exeve\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Loading the data

In [2]:
path1 = r'C:\Users\exeve\Downloads\toxic_comments.csv'
path2 = '/datasets/toxic_comments.csv'

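# Load the dataset from whichever known location exists.
# os.path.exists does not raise, so a missing file has to be reported explicitly.
if os.path.exists(path1):
    data = pd.read_csv(path1)
elif os.path.exists(path2):
    data = pd.read_csv(path2)
else:
    raise FileNotFoundError('toxic_comments.csv was not found in any known location')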

### Analyzing the data

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [4]:
data.head(10)

   Unnamed: 0                                               text  toxic
0           0  Explanation\nWhy the edits made under my usern...      0
1           1  D'aww! He matches this background colour I'm s...      0
2           2  Hey man, I'm really not trying to edit war. It...      0
3           3  "\nMore\nI can't make any real suggestions on ...      0
4           4  You, sir, are my hero. Any chance you remember...      0
5           5  "\n\nCongratulations from me as well, use the ...      0
6           6       COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK      1
7           7  Your vandalism to the Matt Shirvington article...      0
8           8  Sorry if the word 'nonsense' was offensive to ...      0
9           9  alignment on this subject and which are contra...      0


In [5]:
# The 'Unnamed: 0' column duplicates the index, so we drop it.
data = data.drop('Unnamed: 0', axis=1)

In [6]:
# to lowercase
data['text'] = data['text'].str.lower()

In [7]:
data['text'].sample(20)

41499     civility \n\nplease refrain from comments such...
44226                   your ip edits have been discovered.
56653     ]\n\nhi bhaddani i totally totally aunderstand...
5931      " november 2014 (utc)\n\n and  i don't believe...
21428     please refrain from adding nonsense to wikiped...
16159     "\nabout the communist, not hard to believe th...
129569    guido\nhello blueboy i had a feeling this woul...
662       hebrew name of lydia \nappologies to til eulen...
113609    the article was written by the person in quest...
132331                    yes yes. why you deleted my work?
90717     "\n\nhelp\n\nhelp! a user deleted my in-progre...
9911      myself. you may leave a message at the talk pa...
25662     about the missing motivation section of the ge...
127839    "\n\n file:b0000dz6ke.03. ss500 sclzzzzzzz .jp...
139102    ":there are some major differences between the...
87029     "\ni am going to write a ""how to maintain thi...
119303    youth \n\ndoes anyone else agr

The texts contain many special characters, digits, and other noise, so they need to be cleaned.

### Creating the corpus

In [8]:
corpus = data['text'].values

In [9]:
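import sys
# Note: `corpus` is a view that does not own its data, so getsizeof reports only
# the array object itself, not the texts it refers to.
size = sys.getsizeof(corpus)/1024/1024
f'Formatted data size: {size} MB'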

'Formatted data size: 0.0001068115234375 MB'
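
The figure above corresponds to 112 bytes, which is only the NumPy array header: `sys.getsizeof` does not count the buffer of an array that does not own its data, and it never counts the Python strings that an object array points to. A minimal sketch of the effect, and of one way to measure the texts themselves, using a small made-up sample rather than the project data:

```python
import sys
import numpy as np
import pandas as pd

# Made-up sample texts; the project corpus itself is not used here.
texts = ["hello world", "please refrain from adding nonsense", "thanks"] * 1000

owner = np.array(texts, dtype=object)  # array that owns its pointer buffer
view = owner[:]                        # view on the same buffer (owns nothing)

# getsizeof counts the buffer only for the owning array, and never the strings.
header_only = sys.getsizeof(view)
with_buffer = sys.getsizeof(owner)

# memory_usage(deep=True) also adds up the Python string objects.
deep_bytes = pd.Series(texts, dtype=object).memory_usage(index=False, deep=True)

print(f"view: {header_only} B, owner: {with_buffer} B, "
      f"deep: {deep_bytes / 1024 / 1024:.3f} MB")
```

In the notebook, something like `data['text'].memory_usage(deep=True)` would probably give a more meaningful estimate of the corpus size.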

### Lemmatization function

In [10]:
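def get_wordnet_pos(word):
    """Map the NLTK POS tag of a single word to a WordNet POS constant."""
    # First letter of the Penn Treebank tag: J (adjective), N (noun), V (verb), R (adverb)
    tag = nltk.pos_tag([word])[0][1][0].upper()

    tag_dict = {"J": wn.ADJ,
                "N": wn.NOUN,
                "V": wn.VERB,
                "R": wn.ADV}

    # Any other tag falls back to noun, the lemmatizer's default
    return tag_dict.get(tag, wn.NOUN)

def lemmatize(text):
    """Tokenize `text` and lemmatize each token using its POS tag."""
    m = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    lemmatized = [m.lemmatize(w, get_wordnet_pos(w)) for w in tokens]
    output = ' '.join(lemmatized)
    return output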
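
`get_wordnet_pos` tags every token on its own, so a frequent word is tagged again on each occurrence. One possible speed-up, sketched here under the assumption that a word's isolated tag never changes between calls, is to memoize the lookup with `functools.lru_cache`. The `slow_tag` function below is a hypothetical stand-in for `nltk.pos_tag([word])[0][1]`, so the sketch runs without NLTK data.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive lookup really runs

def slow_tag(word):
    """Hypothetical stand-in for nltk.pos_tag([word])[0][1]."""
    calls["count"] += 1
    return "VBG" if word.endswith("ing") else "NN"

@lru_cache(maxsize=None)
def cached_pos(word):
    # First letter of the tag, as in get_wordnet_pos
    return slow_tag(word)[0].upper()

tokens = "editing the page and editing the talk page".split()
tags = [cached_pos(w) for w in tokens]

# Eight tokens, but only five distinct words reach the tagger
print(tags, calls["count"])
```

In the notebook, the same decorator could be applied to `get_wordnet_pos` directly; whether it helps noticeably depends on how often words repeat in the corpus.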

### Cleaning function

In [11]:
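def clear_text(text):
    # Replace everything except Latin letters and spaces with a space,
    # then collapse runs of whitespace into single spaces
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())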
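
Because apostrophes are replaced as well, contractions are split into fragments such as `weren t` and `i m`. The sketch below reproduces the rule on a short sample and compares it with a hypothetical variant, `clear_text_keep_apostrophes`, that keeps apostrophes; the variant is an illustration, not something used in this project.

```python
import re

def clear_text(text):
    # Same rule as above: anything that is not a Latin letter or a space becomes a space
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())

def clear_text_keep_apostrophes(text):
    # Hypothetical variant: apostrophes survive, so contractions stay in one piece
    text = re.sub(r"[^a-zA-Z' ]", ' ', text)
    return " ".join(text.split())

sample = "they weren't vandalisms.\ni'm retired now.89.205.38.27"
print(clear_text(sample))
print(clear_text_keep_apostrophes(sample))
```

Whether keeping apostrophes improves the final score is an open question here, since the tokenizer and the stop-word list treat contractions in their own ways.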

### Checking that the functions work

In [12]:
print("Source text:", corpus[0])
print("Cleaned and lemmatized text:", lemmatize(clear_text(corpus[0])))

Source text: explanation
why the edits made under my username hardcore metallica fan were reverted? they weren't vandalisms, just closure on some gas after i voted at new york dolls fac. and please don't remove the template from the talk page since i'm retired now.89.205.38.27
Cleaned and lemmatized text: explanation why the edits make under my username hardcore metallica fan be revert they weren t vandalism just closure on some gas after i vote at new york doll fac and please don t remove the template from the talk page since i m retire now


### Corpus lemmatization

In [13]:
corpus_lemm = pd.Series(corpus).progress_apply(lambda x: lemmatize(clear_text(x)))


In [14]:
corpus_lemm.head()

0    explanation why the edits make under my userna...
1    d aww he match this background colour i m seem...
2    hey man i m really not try to edit war it s ju...
3    more i can t make any real suggestion on impro...
4    you sir be my hero any chance you remember wha...
dtype: object

### Splitting into training and test sets

In [15]:
features_train, features_test, target_train, target_test = train_test_split(corpus_lemm, data['toxic'],
                                                                            test_size = 0.5,
                                                                            random_state = 12345)

In [16]:
print(features_train.shape)
print(features_test.shape)
print(target_train.shape)
print(target_test.shape)

(79646,)
(79646,)
(79646,)
(79646,)
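
The split above is purely random. If the target is imbalanced (the models below use `class_weight = 'balanced'`, which suggests it is), one option is to pass `stratify` so that both halves keep the same share of toxic comments. A minimal sketch on made-up labels; the 10% positive rate is an assumption for illustration, not the project's class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 10 of them positive
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=12345, stratify=y
)

# Both halves should show the same positive rate
print(y_train.mean(), y_test.mean())
```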


### Text vectorization

In [17]:
# The flags are booleans; recent scikit-learn versions reject integers here
vector = TfidfVectorizer(stop_words = 'english',
                         ngram_range = (1,2),
                         min_df=3,
                         max_df=0.9,
                         use_idf=True,
                         smooth_idf=True,
                         sublinear_tf=True)
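
With `sublinear_tf` and `smooth_idf` enabled, scikit-learn documents the weight of a term, before row normalization, as (1 + ln tf) · (ln((1 + n) / (1 + df)) + 1), where tf is the term count in the document, n the number of documents, and df the number of documents containing the term. A small worked sketch in plain Python with made-up counts:

```python
import math

def sublinear_tf(count):
    # sublinear_tf=True replaces the raw count tf with 1 + ln(tf)
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    # smooth_idf=True: idf = ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Made-up numbers: a term appears 3 times in a document and in 1 of 4 documents
weight = sublinear_tf(3) * smooth_idf(n_docs=4, doc_freq=1)
print(round(weight, 4))
```

The logarithm in the term frequency dampens the effect of a word repeated many times in one comment, which can matter for comments that repeat the same insult.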

In [18]:
vector.fit(features_train)

features_train_tfidf = vector.transform(features_train)
features_test_tfidf = vector.transform(features_test)


In [19]:
print(features_train_tfidf.shape)
print(features_test_tfidf.shape)

(79646, 131306)
(79646, 131306)


## Training

- We will create a function for hyperparameter tuning.
- There is not much data, and the required F1 score is only 0.75. Given the resources that models such as BERT, ELMo, or LSTM consume, LinearSVC or logistic regression should be more than sufficient, so these are the models we will use.
- We will choose the better of the two.

### Hyperparameter tuning function

In [20]:
def grid(model, params, features, target, cv):
    """Tune `model` with a grid search scored by F1.

    Returns the best parameters, the best mean cross-validated F1 score,
    and the mean fit and score times (in seconds) of the best candidate.
    """
    grid_search = GridSearchCV(model,
                               param_grid=params,
                               cv=cv,
                               scoring='f1',
                               n_jobs=-1,
                               verbose=10)

    grid_search.fit(features, target)

    best = grid_search.best_index_
    return (grid_search.best_params_, grid_search.best_score_,
            grid_search.cv_results_['mean_fit_time'][best],
            grid_search.cv_results_['mean_score_time'][best])
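
`grid` receives features that were vectorized on the whole training set, so each validation fold has already influenced the vocabulary and the IDF weights. One way to avoid that, sketched here on a tiny made-up corpus, is to put the vectorizer and the classifier into a `Pipeline` and tune the pipeline as a whole. The step names `tfidf` and `clf` and the `C` values are arbitrary choices for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, for illustration only
texts = ["you are an idiot", "thank you for the edit", "stupid idiot go away",
         "nice work on the article", "what an idiot", "please see the talk page",
         "go away stupid", "thanks for the help"] * 3
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LinearSVC(random_state=12345, class_weight="balanced")),
])

# Parameters of a step are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1, 10]}, cv=3, scoring="f1")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

With this layout the vectorizer is refit inside every fold, at the cost of repeating the vectorization for each candidate.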

### Linear Support Vector Classification (LinearSVC)

In [21]:
LSVC = LinearSVC(random_state = 12345, class_weight = 'balanced')

LSVC_params ={'C': range(1,31,7)}

LSVC_results = grid(LSVC, LSVC_params, features_train_tfidf, target_train, 3)


LSVC_param = LSVC_results[0]
LSVC_score = LSVC_results[1]
LSVC_fit = LSVC_results[2]
LSVC_pred = LSVC_results[3]

print('Best parameters:', LSVC_param,
'\nBest score:', LSVC_score,
'\nFitting time:', LSVC_fit,
'\nScoring time:', LSVC_pred)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
Best parameters: {'C': 1} 
Best score: 0.7594930849730966 
Fitting time: 4.880761543909709 
Scoring time: 0.05551640192667643
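
The best value, C = 1, sits at the lower edge of the searched grid `range(1, 31, 7)`, so smaller values were never tried. A logarithmic grid is one common way to cover several orders of magnitude. The bounds below are an assumption for illustration, not values tested in this project.

```python
# Candidates actually searched above
searched = list(range(1, 31, 7))

# Hypothetical log-spaced alternative: 0.01, 0.1, 1, 10, 100
log_grid = [10.0 ** k for k in range(-2, 3)]

print(searched)
print(log_grid)
```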


### Logistic Regression

In [22]:
log_reg = LogisticRegression(random_state = 12345,
                             class_weight = 'balanced')

log_params = {'C' : range(10,20,1)}

log_results = grid(log_reg,
                   log_params,
                   features_train_tfidf,
                   target_train,
                   cv = 3)

log_param = log_results[0]
log_score = log_results[1]
log_fit = log_results[2]
log_pred = log_results[3]

print('Best parameters:', log_param,
'\nBest score:', log_score,
'\nFitting time:', log_fit,
'\nScoring time:', log_pred)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best parameters: {'C': 10} 
Best score: 0.7657188790976778 
Fitting time: 11.827710390090942 
Scoring time: 0.04787158966064453


## Testing

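The cross-validated F1 scores of the two models are almost the same (0.759 for LinearSVC and 0.766 for logistic regression), but we choose LinearSVC because it trains faster (a mean fit time of about 4.9 s versus 11.8 s).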

### Testing the best model (LinearSVC)

In [23]:
LSVC = LinearSVC(random_state = 12345, class_weight = 'balanced', C = 1)
LSVC.fit(features_train_tfidf, target_train)
preds = LSVC.predict(features_test_tfidf)
score = f1_score(target_test, preds)

In [24]:
print('Testing LinearSVC F1-Score:', score)

Testing LinearSVC F1-Score: 0.7640793339862879


The F1 score on the test set (0.764) is also above the required 0.75, so the model meets the requirement.
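
The target metric in this project is F1, the harmonic mean of precision and recall, rather than plain accuracy. A short sketch of why the distinction matters on an imbalanced target, with made-up confusion counts (not results from this project):

```python
def f1_from_counts(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts for 1000 comments, 100 of them toxic
tp, fp, fn, tn = 70, 30, 30, 870

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = f1_from_counts(tp, fp, fn)
print(round(accuracy, 2), round(f1, 2))
```

In this toy case accuracy looks high even though three in ten toxic comments are missed, which is the kind of gap F1 is meant to expose.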

## Conclusion

During the project the following tasks were completed:
- We created a corpus of texts for analysis.
- We cleaned the corpus of non-letter characters and lemmatized the texts; stop words were removed at the vectorization step.
- We split the data into training and test sets of equal size (test_size = 0.5).
- We converted the texts into TF-IDF vectors for training and prediction.
- We compared LinearSVC and logistic regression; their cross-validated F1 scores were nearly the same (0.759 and 0.766), and we chose LinearSVC because it trains faster.
- On the test set, the F1 score (0.764) exceeded the required minimum of 0.75, so the model meets the requirement and is recommended.