# Classifying Toxic Comments

After EDA and preprocessing, the next step is to build a classification model. In this section of the project, the text data is transformed into a numerical representation with tf-idf, and a logistic regression model is fit for each of the 6 classification tags.

Note: This machine learning model is originally by [Bojan Tunguz](https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams) and can be found on Kaggle. I've tweaked the hyperparameters of the classifier as well as the TfidfVectorizer to optimize the ROC-AUC metric.

In [1]:
#load all necessary libraries
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

from scipy.sparse import hstack

np.random.seed(7)

In [2]:
df = pd.read_csv("processed/train_clean.csv", encoding = "latin1")

In [3]:
df = df.dropna(subset = ["comment_text"], axis = 0)

In [4]:
word_vectorizer = TfidfVectorizer(ngram_range = (1, 4),
                                  stop_words = "english",
                                  max_features = 20000,
                                  sublinear_tf = True,
                                  analyzer = "word")

#stop_words only applies when analyzer = "word", so it is omitted here
char_vectorizer = TfidfVectorizer(ngram_range = (2, 5),
                                  max_features = 20000,
                                  sublinear_tf = True,
                                  analyzer = "char")

Tf-idf is used twice: once at the character level and once at the word level. It is used here instead of conventional count encoding because count encoding does not penalize words that appear frequently across the corpus. The comments are vectorized at the character level because this is [recommended](https://developers.google.com/machine-learning/guides/text-classification/step-3) whenever the corpus contains a lot of misspelled words. The purpose of the word vectorizer is to capture sequences of words that are likely to occur together. Lastly, both tf-idf vectorizers are capped at 20 000 features because the same guide by Google reports that many of the datasets they used peaked in performance at around 20 000 features.
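The two ideas in this paragraph can be checked on a toy example. The sketch below is only an illustration: the comments and the misspelled word are made up, and the vectorizers use default settings rather than the ones from the cell above. It shows that tf-idf gives a lower idf weight to a word that occurs in every comment, and that character n-grams still find overlap between a word and its misspelling while whole-word features do not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy comments, made up for illustration: "the" occurs in every comment,
# "stupid" occurs in only one.
toy_comments = ["the article is fine", "the edit is stupid", "the page is long"]

tfidf = TfidfVectorizer()
tfidf.fit(toy_comments)
idf_the = tfidf.idf_[tfidf.vocabulary_["the"]]
idf_stupid = tfidf.idf_[tfidf.vocabulary_["stupid"]]
print("idf('the') = {:.3f}, idf('stupid') = {:.3f}".format(idf_the, idf_stupid))

# A word and a misspelling of it, each treated as its own tiny document.
pair = ["stupid", "stuupid"]

word_level = TfidfVectorizer(analyzer="word").fit_transform(pair)
char_level = TfidfVectorizer(analyzer="char", ngram_range=(2, 5)).fit_transform(pair)

# Cosine similarity between the two documents under each representation.
word_sim = cosine_similarity(word_level)[0, 1]
char_sim = cosine_similarity(char_level)[0, 1]
print("word-level similarity: {:.3f}".format(word_sim))
print("char-level similarity: {:.3f}".format(char_sim))
```

At the word level the two spellings are unrelated tokens, so their similarity is zero. At the character level they share n-grams such as "st", "pid" and "upid", so the similarity is positive.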

The n-gram range of 1 to 4 in the word vectorizer comes from a [paper](https://arxiv.org/pdf/1708.08123.pdf) by Vedhera, Grossman, and Cormack. In their study, this n-gram range gave the highest scores when using logistic regression on tf-idf features.
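To make the n-gram range concrete, here is a small sketch of what a word analyzer with `ngram_range = (1, 4)` produces for a single made-up comment. Stop-word removal is left out on purpose so that every n-gram stays visible. With `stop_words = "english"`, as in the cell above, stop words are dropped before the n-grams are built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn a
# raw string into its list of features (here: word 1-grams to 4-grams).
analyzer = TfidfVectorizer(analyzer="word", ngram_range=(1, 4)).build_analyzer()

# Made-up example comment.
ngrams = analyzer("please stop vandalizing pages")

for ngram in ngrams:
    print(ngram)
```

A four-word comment yields 4 unigrams, 3 bigrams, 2 trigrams and 1 four-gram, which is why `max_features` is needed to keep the vocabulary manageable on a large corpus.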

In [5]:
train_word = word_vectorizer.fit_transform(df["comment_text"])
train_char = char_vectorizer.fit_transform(df["comment_text"])

In [6]:
train = hstack([train_char, train_word])

The hstack function combines the character-level and word-level feature representations column-wise into a single sparse matrix.
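As a minimal illustration of what this stacking does, the sketch below uses two tiny hand-written sparse matrices in place of the real feature matrices. The values are arbitrary. The `format = "csr"` argument is an optional addition that is not used in the cell above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two toy feature blocks with the same number of rows (documents).
char_block = csr_matrix(np.array([[1.0, 0.0, 2.0],
                                  [0.0, 3.0, 0.0]]))
word_block = csr_matrix(np.array([[0.0, 4.0],
                                  [5.0, 0.0]]))

# Columns are placed side by side; the number of rows stays the same.
combined = hstack([char_block, word_block], format="csr")

print(combined.shape)
print(combined.toarray())
```

Each row of the combined matrix holds the character features followed by the word features of the same document, so both blocks must have the same number of rows.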

In [7]:
#initialization
classes = ["identity_hate", "insult", "obscene", "severe_toxic", "threat", "toxic"]

scores = []

#fit model
for class_category in classes:
    train_target = df[class_category]
    classifier = LogisticRegression(random_state = 6)

    cv_score = cross_val_score(classifier, train, train_target, cv = 3, scoring='roc_auc')
    scores.append(cv_score)
    print('CV score for class {} has mean of {:.4f} and std of {:.4f}'.format(class_category, np.mean(cv_score), np.std(cv_score)))

print("ROC_AUC mean: {}".format(np.mean(scores)))
print("ROC_AUC std: {}".format(np.std(scores)))

CV score for class identity_hate has mean of 0.9824 and std of 0.0010
CV score for class insult has mean of 0.9821 and std of 0.0006
CV score for class obscene has mean of 0.9896 and std of 0.0010
CV score for class severe_toxic has mean of 0.9878 and std of 0.0014
CV score for class threat has mean of 0.9894 and std of 0.0024
CV score for class toxic has mean of 0.9767 and std of 0.0010
ROC_AUC mean: 0.9846678156975965
ROC_AUC std: 0.004881091193700803


The average ROC-AUC of the logistic regression models is about 98.47%. The models for all 6 classification tags are also consistent, ranging from 97.67% up to 98.96%, with very small standard deviations.
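One detail to keep in mind when reading these scores: the vectorizers were fit on the full training set before cross-validation, so the vocabulary and idf weights also reflect the validation folds. The effect is usually small for an unsupervised step like tf-idf, but it can be avoided by placing the vectorizers inside a pipeline, so that they are refit on the training part of each fold. The sketch below is one way to do that, not the approach used in this notebook. The comments and labels are made up, and `max_features` and stop-word removal are left out because the toy corpus is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, make_pipeline

# Made-up toy data: 6 toxic and 6 clean comments.
toy_comments = [
    "you are an idiot", "what a stupid edit", "shut up you moron",
    "this is garbage you fool", "you idiot stop editing", "stupid moron go away",
    "thanks for the helpful edit", "please add a source here",
    "the article looks good now", "I fixed the broken link",
    "great work on this page", "see the talk page for details",
]
toy_labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# FeatureUnion concatenates the word-level and char-level features,
# playing the same role as hstack.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 4), sublinear_tf=True)),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 5), sublinear_tf=True)),
])

# The whole pipeline is refit on the training part of every fold.
model = make_pipeline(features, LogisticRegression(random_state=6))

toy_scores = cross_val_score(model, toy_comments, toy_labels, cv=3, scoring="roc_auc")
print("fold scores:", toy_scores)
```

On the real data, the same pipeline could be passed to `cross_val_score` with `df["comment_text"]` and each label column in turn, at the cost of refitting the vectorizers for every fold and class.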

I've also tried Naive Bayes and a bidirectional GRU, but neither had much to add. With Naive Bayes, the ROC-AUC was a lot lower than with Logistic Regression, and with the GRU there was very little increase in performance. Another reason for choosing Logistic Regression over the GRU is that Logistic Regression is a lot easier to maintain and trains a lot faster.