# A Toxic Comment Identifier Application - Doc2Vec Data Transformation
<a id='section_id2'></a>

In [7]:
from gensim.models import doc2vec
from nltk.tokenize import word_tokenize
from numpy import savetxt
import pickle
import numpy as np

### Public Methods
Below are public methods which can be called to generate vectors for a dataframe that is passed in.

In [8]:
"""
A method to get a list of doc2vec vectors based on the model and dataframe passed in
The dataframe should just consist of 1 column which contains the document or comment
to be vectorized
:param model : a doc2vec model that has already been initialized and defined
:param data : the dataframe which contains the document or comment to be vectorized
:return a list of vectors corresponding to the data passed in
"""
def get_doc2vec_vectors(model, data):
    # #using default values for now
    tokenized_comments = tokenize_comments(data)
    tagged_documents = get_tagged_documents(tokenized_comments)

    # build the vocabulary
    # input a list of documents
    model.build_vocab(x for x in tagged_documents)

    # Train the model
    model.train(tagged_documents, total_examples = model.corpus_count, epochs = model.epochs)

    #print("Inferring "+str(len(tokenized_comments)) +" comments into doc2vec vectors.")
    vectors = infer_vectors(model, tokenized_comments, "")
    return vectors

"""
A method that infers a list of vectors from a trained Doc2Vec model
: param model : a Doc2Vec model which is already trained with vocab built
: param input : a data frame to infer Doc2Vec vectors from
: param save_file_name [OPTIONAL] : If a string is provided,
the vectors will be saved using this file name.
"""
def infer_vectors(model, tokenized_comments, save_file_name):
    #print("Inferring "+str(len(tokenized_comments)) +" comments into doc2vec vectors.")
    vectors = []
    for comment in tokenized_comments:
        #count = count + 1
        #print("Vectorizing: "+str(count)+" comment.")
        vectors.append(model.infer_vector(comment))

    #print("Created "+str(len(vectors)) + " doc2vec vectors.")
    #save to file if a file name is present
    if save_file_name != "":
        print("Saving vectors to file: " + str(save_file_name))
        savetxt(save_file_name, vectors)

    return vectors

### Helper Methods

Below are helper methods used by the public methods above.

In [9]:
"""
A function to tokenize all data in a dataframe
:param data: a dataframe containing comments to tokenize
"""

def tokenize_comments(dataframe):
    data = []
    for row in dataframe:
        data.append(tokenize_each_comment(row))
    return data
"""
A function to tokenize a single comment
:param data: a single comment to tokenize
"""
def tokenize_each_comment(comment):
    temp = []
    for j in word_tokenize(comment):
        temp.append(j)
    return temp

"""
A function to generate a list of tagged documents to train a
Doc2Vec model
:param list_of_tokenized_comments: A list of tokenized comments
"""
def tagged_document(list_of_tokenized_comments):
  for x, ListOfWords in enumerate(list_of_tokenized_comments):
    yield doc2vec.TaggedDocument(ListOfWords, [x])

"""
A function to get tagged documents from
a list of tokenized comments
"""
def get_tagged_documents(list_of_tokenized_comments):
    return list(tagged_document(list_of_tokenized_comments))


### Hyperparameter Tuning of Doc2Vec Model

To tune the parameters that we can customize for a Doc2Vec model, we used a Custom Rocchio and the NearestCentroid classifier as a model to validate the parameters.  After doing a lot of research, it was decided to create a custom method to tune the Doc2Vec parameters. T  First, we created a dictionary of the parameters I want to tune.

The parameters we chose to tune was the following (Note, the descriptions below were taken from this source: https://medium.com/betacom/hyperparameters-tuning-tf-idf-and-doc2vec-models-73dd418b4d):
* dm: it defines the training algorithm. If dm=1, PV-DM is used. Otherwise, PV-DBOW is employed.
* vector_size: dimensionality of the feature vectors.
* window: the maximum distance between the current and predicted word within a sentence.
* hs: if 1, hierarchical softmax will be used for model training; if set to 0, and negative is non-zero, negative sampling will be used.



In [10]:
import itertools
dm = [0, 1]
vector_size = [500, 1000]
window = [2,5]
hs = [1]
paramsList = [{'dm': item[0],
               'vector_size': item[1],
               'window': item[2],
               'hs': item[3],
               'negative': 0
               } for item in
                 list(itertools.product(*[dm,
                                          vector_size,
                                          window,
                                          hs]))
              ]

#Note: commented out so it doesn't run when called from another notebook
#print("The list of parameters for tuning the Doc2Vec Model:"+str(paramsList))

### Evaluating Doc2Vec Model with Custom Rocchio Classifier
Below we evaluated the Doc2Vec model using a custom Rocchio classifier which uses cosine distance as the metric.  For evaluation, we went with a simple metric which prints the accuracy of the classifier based on how it categorized the test data.

The best accuracy we received was with the parameter settings below:

Evaluating {'dm': 0, 'vector_size': 1000, 'window': 5, 'hs': 1, 'negative': 0}
Rocchio classifier - Number of test instances classified correctly:1026 Percent Accuracy: 67.67810026385224

In [11]:
%run Toxic_App_Rocchio_Classifier.ipynb

file_object = open('clean_data1.p', 'rb')
clean_data = pickle.load(file_object)
train_ds = clean_data[0]
test_ds = clean_data[1]

train_comments_df = train_ds['comment_text']
test_comments_df = test_ds['comment_text']
train_labels_df = train_ds['toxicity_level']
test_labels_df = test_ds['toxicity_level']

def evaluateDoc2VecParams():
    # Tag docs
    train_tokenized_comments = tokenize_comments(train_comments_df)
    train_tagged_documents = get_tagged_documents(train_tokenized_comments)
    scoreList = []
    for param in paramsList:
      print("Evaluating "+str(param))
      try:
        d2v_model = doc2vec.Doc2Vec(train_tagged_documents,
                        dm=param['dm'],
                        vector_size=param['vector_size'],
                        window=param['window'],
                        min_count=1,
                        epochs=10,
                        hs=param['hs'],
                        seed=516)
        train_vectors = get_doc2vec_vectors(d2v_model,train_comments_df)

        tokenized_test_comments = tokenize_comments(test_comments_df)
        test_vectors = infer_vectors(d2v_model, tokenized_test_comments, "")

        train_labels = np.array(train_labels_df)

        #classify test data using prototype vectors
        test_labels = np.array(test_labels_df)

        # generate the prototype vectors
        prototype_vectors = rocchio_train(train_vectors, train_labels)

        #classify test data using prototype vectors
        rocchio_evaluate(test_vectors, test_labels, prototype_vectors)
      except Exception as error:
        print(f'Cannot evaluate model with parameters {param} because of error: {error}')
        continue
    return scoreList

#Note: commented out so it doesn't run when called from another notebook
#evaluateDoc2VecParams()


### Evaluating Doc2Vec Model with Nearest Centroid Classifier
Below we evaluated the Doc2Vec model using the Nearest Centroid classifier which uses euclidean distance as the metric.  Overall, for this model we achieve lower accuracy scores compared to the Custom Rocchio model above.

The best accuracy we received was for the settings below:

Evaluating {'dm': 0, 'vector_size': 1000, 'window': 2, 'hs': 1, 'negative': 0}
Nearest Centroid classifier - Number of test instances classified correctly:960 Percent Accuracy: 63.3245382585752

In [12]:
%run Toxic_App_Rocchio_Classifier.ipynb

file_object = open('clean_data1.p', 'rb')
clean_data = pickle.load(file_object)
train_ds = clean_data[0]
test_ds = clean_data[1]

train_comments_df = train_ds['comment_text']
test_comments_df = test_ds['comment_text']
train_labels_df = train_ds['toxicity_level']
test_labels_df = test_ds['toxicity_level']

def evaluateDoc2VecParams():
    # Tag docs
    train_tokenized_comments = tokenize_comments(train_comments_df)
    train_tagged_documents = get_tagged_documents(train_tokenized_comments)
    scoreList = []
    for param in paramsList:
      print("Evaluating "+str(param))
      try:
        d2v_model = doc2vec.Doc2Vec(train_tagged_documents,
                        dm=param['dm'],
                        vector_size=param['vector_size'],
                        window=param['window'],
                        min_count=1,
                        epochs=10,
                        hs=param['hs'],
                        seed=516)
        train_vectors = get_doc2vec_vectors(d2v_model,train_comments_df)

        tokenized_test_comments = tokenize_comments(test_comments_df)
        test_vectors = infer_vectors(d2v_model, tokenized_test_comments, "")

        train_labels = np.array(train_labels_df)

        #classify test data using prototype vectors
        test_labels = np.array(test_labels_df)

        rocchio_classifier_nearest_centroid(train_vectors, train_labels, test_vectors, test_labels)

      except Exception as error:
        print(f'Cannot evaluate model with parameters {param} because of error: {error}')
        continue
    return scoreList

#Note: commented out so it doesn't run when called from another notebook
#evaluateDoc2VecParams()

### Conclusion

The 2 parameters below received the highest accuracies from both evaluations above:

* Custom Rocchio: Evaluating {'dm': 0, 'vector_size': 1000, 'window': 5, 'hs': 1, 'negative': 0}
* Nearest Centroid: Evaluating {'dm': 0, 'vector_size': 1000, 'window': 2, 'hs': 1, 'negative': 0}

The only difference was with the window parameter. Considering this parameter controls the maximum distance between the current and predicted word of a sentence, we are inclined to go with the higher window.  Therefore, we will be using these parameters for the Doc2Vec model we use to transfor our data.

Final set of parameters to use: 'dm': 0, 'vector_size': 1000, 'window': 5, 'hs': 1, 'negative': 0