# Rocchio Classifier

This notebook contains all methods related to creating and tuning the model for the Rocchio Classifier.

A `clean_data1.p` pickle file was created to store the cleaned training and test data sets. Below, the clean data is read and loaded into pandas DataFrames.

In [3]:
import pandas as pd
import numpy as np
import pickle
%run Doc2Vec.ipynb

In [6]:
# load the pickled data sets; the context manager closes the file afterwards
with open('clean_data1.p', 'rb') as file_object:
    clean_data = pickle.load(file_object)
train_ds = clean_data[0]
test_ds = clean_data[1]

print(train_ds.shape)
print(test_ds.shape)

(4398, 5)
(1516, 5)


In [7]:
from gensim.models import doc2vec
%run Doc2Vec.ipynb
# retrieve the comments only
train_comments_df = train_ds['comment_text']
test_comments_df = test_ds['comment_text']
d2v_model = doc2vec.Doc2Vec(vector_size = 1000, min_count = 5, epochs = 40)

train_vectors = get_doc2vec_vectors(d2v_model, train_comments_df)

# using default values for now
tokenized_test_comments = tokenize_comments(test_comments_df)
test_vectors = infer_vectors(d2v_model, tokenized_test_comments, "")

#store the test labels
test_labels = test_ds['toxicity_level']

### Methods for Custom Rocchio Classifier

Below are methods to use the Rocchio Classifier.

In [8]:
# helper function
def get_number_of_classes(classes):
    return len(np.unique(classes))

def compute_tf_idf_matrix(trainingData):
    # compute term frequencies to get an idea of their distributions across the corpus.
    termFreq = trainingData.sum(axis=1)
    # print("termFreq:" + str(termFreq))

    # compute the document frequency: the number of docs in which each term appears
    # (rows are terms, columns are docs); the division by the total number of docs happens in the IDF step
    docFreq = pd.DataFrame([(trainingData != 0).sum(1)]).T  # if TD!=0 add 1 and sum it up across
    # print("docFreq:" + str(docFreq))

    # Create a matrix with all entries = number of docs
    numDocs = trainingData.shape[1]
    newMatrix = np.ones(np.shape(trainingData), dtype=float) * numDocs
    np.set_printoptions(precision=2, suppress=True, linewidth=120)
    # print("newMatrix" + str(newMatrix))

    # Convert each entry into IDF values
    # IDF is the log of the inverse of document frequency
    # Note that IDF is only a function of the term, so all columns will be identical.
    idf = np.log2(np.divide(newMatrix, np.array(docFreq)))
    # print("idf" + str(idf))

    # Finally compute the TFxIDF values for each document-term entry
    tf_idf = trainingData * idf
    pd.set_option("display.precision", 2)
    # print("tf_idf" + str(tf_idf))
    return tf_idf

def rocchio_train(train, labels):
    """A Training Function for the Rocchio Classification Algorithm
    :param train: an np array of the training data formatted as a doc-term frequency matrix,
    a tf*idf matrix or else a doc2vec vector matrix
    :param labels: an np array of the corresponding labels for the data
    :returns: a dictionary mapping each class (a.k.a. label) to its prototype vector
    """
    # the prototype is a dictionary with unique class labels as keys and one-dimensional arrays
    # representing the prototype vectors as values.
    prototype = {}

    # add each unique class to the prototype dictionary
    # the class is the key
    # a one-dimensional array equal to the length of a document vector is initialized with all 0s
    for label in np.unique(labels):
        prototype[label] = np.zeros(len(train[0]), dtype=float)

    for i in range(len(train)):
        # first figure out which class this entry belongs to
        label = labels[i]

        # next add this document vector to the running sum for its class
        prototype[label] = prototype[label] + np.array(train[i])

    # note: the prototypes are per-class sums rather than means; cosine similarity
    # ignores vector length, so the scaling does not change the predicted class
    return prototype

def rocchio_classifier(prototype, test_instance):
    """A Classifier Function for the Rocchio Classification Algorithm
    :param prototype: a dictionary of prototype vectors where the class is a key and the value is the vector
    for the corresponding class.
    :param test_instance: the test instance to classify
    :returns: the predicted class
    """
    max_cos_sim = -2  # lower than any cosine similarity, which lies in [-1, 1]
    predicted_class = -1  # initialize the predicted class to -1 by default
    test_instance_norm = np.linalg.norm(test_instance)
    for class_label in prototype:
        # compute the cosine similarity to the prototype vector:
        # the dot product divided by the product of the two L2 norms
        term_vector = prototype[class_label]
        prototype_norm = np.linalg.norm(term_vector)
        cos_sim = np.dot(term_vector, test_instance) / (prototype_norm * test_instance_norm)

        # keep the class with the highest similarity so far
        if cos_sim > max_cos_sim:
            max_cos_sim = cos_sim
            predicted_class = class_label

    return predicted_class

def rocchio_evaluate(test_data, test_labels, prototype):
    """An Evaluation Function for the Rocchio Classification Algorithm
    :param test_data: an np array of the test data formatted as a doc-term frequency matrix,
    a tf*idf matrix or else a doc2vec vector matrix
    :param test_labels: an np array of the corresponding labels for the data
    :param prototype: a dictionary of prototype vectors to use for classification
    :returns: None; the number of correct predictions and the percent accuracy are printed
    """
    num_correct = 0

    # iterate through the test instances and call the rocchio classifier
    test_data_len = len(test_data)
    for i in range(test_data_len):
        predicted_label = rocchio_classifier(prototype, test_data[i])
        if predicted_label == test_labels[i]:
            num_correct = num_correct + 1
    accuracy = (num_correct / test_data_len * 100.0)
    print("Rocchio classifier - Number of test instances classified correctly:" + str(
        num_correct) + " Percent Accuracy: " + str(accuracy))
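
To make the training and classification steps above concrete, here is a minimal, vectorized sketch on toy data. The names `train_prototypes` and `classify` and the toy vectors are illustrative assumptions, not part of this notebook. The sketch uses class means as prototypes, which rank classes the same way under cosine similarity as the class sums used above.

```python
import numpy as np

def train_prototypes(X, y):
    """Return {label: mean vector of the rows of X with that label}."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(prototypes, x):
    """Return the label whose prototype has the highest cosine similarity to x."""
    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(prototypes, key=lambda label: cos_sim(prototypes[label], x))

# toy data (hypothetical): two classes pointing in different directions
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])

prototypes = train_prototypes(X, y)
prediction = classify(prototypes, np.array([0.7, 0.1]))
print(prediction)
```

Because cosine similarity depends only on direction, a test vector is assigned to the class whose prototype points most nearly the same way, regardless of how long either vector is.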

### Running Custom Rocchio Implementation using Doc2Vec vectors

In [9]:
# create np arrays for inputs
train_vectors = np.array(train_vectors)
train_labels = np.array(train_ds['toxicity_level'])

# generate the prototype vectors
prototype_vectors = rocchio_train(train_vectors, train_labels)

#classify test data using prototype vectors
test_vectors = np.array(test_vectors)
test_labels = np.array(test_labels)
rocchio_evaluate(test_vectors, test_labels, prototype_vectors)

Rocchio classifier - Number of test instances classified correctly:960 Percent Accuracy: 63.3245382585752


### Running Nearest Centroid from sklearn using Doc2Vec vectors

### Methods for Nearest Centroid

In [10]:
from sklearn.neighbors import NearestCentroid

def rocchio_classifier_nearest_centroid(train, train_labels, test_data, test_labels):
    """A Classifier Function for the Nearest Centroid Algorithm
    :param train: an np array of the train data formatted as a doc-term frequency matrix,
    a tf*idf matrix or else a doc2vec vector matrix
    :param train_labels: an np array of the corresponding labels for the train data
    :param test_data: an np array of the test data formatted as a doc-term frequency matrix,
    a tf*idf matrix or else a doc2vec vector matrix
    :param test_labels: an np array of the corresponding labels for the test data
    :returns: None; the number of correct predictions and the percent accuracy are printed
    """
    # NearestCentroid uses the Euclidean metric by default
    clf = NearestCentroid()
    clf.fit(train, train_labels)
    predictions = clf.predict(test_data)

    # count the predictions that match the true labels
    numCorrect = 0
    test_data_len = len(test_data)
    for i in range(test_data_len):
        if predictions[i] == test_labels[i]:
            numCorrect = numCorrect + 1

    accuracy = (numCorrect / test_data_len * 100.0)
    print("Nearest Centroid classifier - Number of test instances classified correctly:" + str(
        numCorrect) + " Percent Accuracy: " + str(accuracy))


In [11]:
rocchio_classifier_nearest_centroid(train_vectors, train_labels, test_vectors, test_labels)

Nearest Centroid classifier - Number of test instances classified correctly:836 Percent Accuracy: 55.145118733509236
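
One possible contributor to the gap between the two results is the similarity measure: the custom classifier ranks prototypes by cosine similarity, while `NearestCentroid` defaults to Euclidean distance. The sketch below, on made-up vectors, shows how the two measures can pick different centroids for the same point. It is an illustration only, not a diagnosis of the numbers above.

```python
import numpy as np

# hypothetical centroids: "a" is long, "b" is short and points in a different direction
centroids = {"a": np.array([10.0, 0.0]), "b": np.array([0.0, 1.0])}
x = np.array([1.0, 0.5])  # short vector pointing mostly along "a"

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# cosine similarity compares directions only
nearest_by_cosine = max(centroids, key=lambda k: cosine_similarity(centroids[k], x))
# Euclidean distance also depends on vector length
nearest_by_euclidean = min(centroids, key=lambda k: np.linalg.norm(centroids[k] - x))

print(nearest_by_cosine, nearest_by_euclidean)
```

Here the point is closest in direction to "a" but closest in distance to "b", so the two measures disagree.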


### Custom Doc2Vec Parameter Tuning

In [12]:
import itertools
dm = [0]
vector_size = [500, 1000]
window = [2,5]
hs = [1]
paramsList = [{'dm': item[0],
               'vector_size': item[1],
               'window': item[2],
               'hs': item[3],
               'negative': 0
               } for item in
                 list(itertools.product(*[dm,
                                          vector_size,
                                          window,
                                          hs]))
              ]

print(paramsList)

[{'dm': 0, 'vector_size': 500, 'window': 2, 'hs': 1, 'negative': 0}, {'dm': 0, 'vector_size': 500, 'window': 5, 'hs': 1, 'negative': 0}, {'dm': 0, 'vector_size': 1000, 'window': 2, 'hs': 1, 'negative': 0}, {'dm': 0, 'vector_size': 1000, 'window': 5, 'hs': 1, 'negative': 0}]
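
The same kind of grid can be built without positional indexing by expanding a dictionary of option lists. This is a small sketch; the helper name `build_param_grid` is an assumption and not part of this notebook.

```python
import itertools

def build_param_grid(options):
    """Expand {name: [values, ...]} into a list of {name: value} dicts."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in itertools.product(*options.values())]

grid = build_param_grid({
    'dm': [0],
    'vector_size': [500, 1000],
    'window': [2, 5],
    'hs': [1],
    'negative': [0],
})
print(len(grid))
print(grid[0])
```

Adding a new parameter then only requires a new dictionary entry, and every listed value takes part in the product.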


In [14]:
from gensim.models import doc2vec
import pickle
import numpy as np
import pandas as pd

%run Doc2Vec.ipynb

# load the pickled data sets; the context manager closes the file afterwards
with open('clean_data1.p', 'rb') as file_object:
    clean_data = pickle.load(file_object)
train_ds = clean_data[0]
test_ds = clean_data[1]

#pre-process the corpus
train_comments_df = train_ds['comment_text']
test_comments_df = test_ds['comment_text']
train_labels_df = train_ds['toxicity_level']
test_labels_df = test_ds['toxicity_level']

def evaluateDoc2VecParams():
    # Tag docs
    train_tokenized_comments = tokenize_comments(train_comments_df)
    train_tagged_documents = get_tagged_documents(train_tokenized_comments)

    # the test comments and the labels do not depend on the parameters, so prepare them once
    tokenized_test_comments = tokenize_comments(test_comments_df)
    train_labels = np.array(train_labels_df)
    test_labels = np.array(test_labels_df)

    # note: nothing is appended to scoreList because the classifier helper prints
    # the accuracy instead of returning it, so an empty list is returned
    scoreList = []
    for param in paramsList:
        print("Evaluating " + str(param))
        try:
            # note: param['negative'] is not passed, so gensim's default for negative is used
            d2v_model = doc2vec.Doc2Vec(train_tagged_documents,
                                        dm=param['dm'],
                                        vector_size=param['vector_size'],
                                        window=param['window'],
                                        min_count=1,
                                        epochs=10,
                                        hs=param['hs'],
                                        seed=516)
            train_vectors = get_doc2vec_vectors(d2v_model, train_comments_df)
            test_vectors = infer_vectors(d2v_model, tokenized_test_comments, "")

            # classify the test data with sklearn's nearest centroid
            rocchio_classifier_nearest_centroid(train_vectors, train_labels, test_vectors, test_labels)

            # alternatively, use the custom implementation:
            #prototype_vectors = rocchio_train(train_vectors, train_labels)
            #rocchio_evaluate(test_vectors, test_labels, prototype_vectors)
        except Exception as error:
            print(f'Cannot evaluate model with parameters {param} because of error: {error}')
            continue
    return scoreList

evaluateDoc2VecParams()


Evaluating {'dm': 0, 'vector_size': 500, 'window': 2, 'hs': 1, 'negative': 0}
Nearest Centroid classifier - Number of test instances classified correctly:957 Percent Accuracy: 63.12664907651715
Evaluating {'dm': 0, 'vector_size': 500, 'window': 5, 'hs': 1, 'negative': 0}
Nearest Centroid classifier - Number of test instances classified correctly:955 Percent Accuracy: 62.99472295514512
Evaluating {'dm': 0, 'vector_size': 1000, 'window': 2, 'hs': 1, 'negative': 0}
Nearest Centroid classifier - Number of test instances classified correctly:959 Percent Accuracy: 63.25857519788918
Evaluating {'dm': 0, 'vector_size': 1000, 'window': 5, 'hs': 1, 'negative': 0}
Nearest Centroid classifier - Number of test instances classified correctly:958 Percent Accuracy: 63.19261213720316


[]
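
The empty list printed above is `scoreList`, which the loop never appends to. A minimal sketch of collecting one score per parameter set and picking the best follows. The names `accuracy_percent`, `evaluate_params`, and the stub `score_fn` are hypothetical stand-ins for the notebook's train-and-classify step, and the stub's data is made up.

```python
import numpy as np

def accuracy_percent(predictions, labels):
    """Return the percentage of predictions equal to the true labels."""
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    return float(np.mean(predictions == labels) * 100.0)

def evaluate_params(params_list, score_fn):
    """Call score_fn(params) for each parameter dict and return (params, score) pairs."""
    scores = []
    for params in params_list:
        try:
            scores.append((params, score_fn(params)))
        except Exception as error:
            print(f'Cannot evaluate {params}: {error}')
    return scores

# stub scorer for illustration only: pretends that a larger window predicts better
def score_fn(params):
    labels = [0, 1, 1, 0]
    predictions = [0, 1, 1, 0] if params['window'] >= 5 else [0, 1, 0, 0]
    return accuracy_percent(predictions, labels)

scores = evaluate_params([{'window': 2}, {'window': 5}], score_fn)
best_params, best_score = max(scores, key=lambda pair: pair[1])
print(best_params, best_score)
```

Returning the accuracy from the scoring step, rather than only printing it, is what lets the caller compare parameter sets programmatically.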

### Using Pipeline to Tune Doc2Vec (these didn't work for me)

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import NearestCentroid
from gensim.models import doc2vec
from sklearn.model_selection import GridSearchCV

train_comments = np.array(train_ds['comment_text'])
train_labels = np.array(train_ds['toxicity_level'])

param_grid = {'doc2vec__vector_size': [1000, 5000, 10000],
              'doc2vec__min_count': [2, 5, 10],
              'doc2vec__epochs': [40, 400],
}

#pipe_rocchio = Pipeline([('doc2vec', doc2vec.Doc2Vec()), ('rocchio', NearestCentroid())])
Rocchio_params =[
    {
        'token_value': [doc2vec.Doc2Vec()],
        'doc2vec__vector_size': [1000, 5000, 10000],
        'doc2vec__min_count': [2, 5, 10],
        'reduce_dim': ['passthrough'],
        'clf__metric': ['euclidian', 'cosine']
    }
]

# Rocchio Pipeline
pipe_Rocchio = Pipeline(
    [
        ('token_value', 'passthrough'),
        ('reduce_dim', 'passthrough'),
        ('clf', NearestCentroid())
    ]
)

rocchio_grid = GridSearchCV(pipe_Rocchio,
                        param_grid=Rocchio_params,
                        scoring="accuracy",
                        verbose=3,
                        n_jobs=2)

fitted = rocchio_grid.fit(train_comments,train_labels)

# Best parameters
print("Best Parameters: {}\n".format(rocchio_grid.best_params_))
print("Best accuracy: {}\n".format(rocchio_grid.best_score_))
print("Finished.")






Fitting 5 folds for each of 18 candidates, totalling 90 fits


ValueError: Invalid parameter doc2vec for estimator Pipeline(steps=[('token_value',
                 <gensim.models.doc2vec.Doc2Vec object at 0x7fd2b95762b0>),
                ('reduce_dim', 'passthrough'), ('clf', NearestCentroid())]). Check the list of available parameters with `estimator.get_params().keys()`.
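
The `ValueError` above arises because grid keys must take the form `<step name>__<parameter>`, and the pipeline has no step named `doc2vec` (its steps are `token_value`, `reduce_dim`, and `clf`). Separately, the metric name is spelled `euclidean`, and gensim's `Doc2Vec` does not provide the `fit`/`transform` methods that a pipeline expects from a transformer step. The sketch below shows how to check valid key names. It uses `StandardScaler` purely as a stand-in transformer, which is an assumption for illustration and not part of this notebook.

```python
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in pipeline: the step names here are 'scale' and 'clf'
pipe = Pipeline([('scale', StandardScaler()), ('clf', NearestCentroid())])

# valid grid keys are exactly the names reported by get_params()
valid_keys = set(pipe.get_params().keys())
print('clf__metric' in valid_keys)           # key built from the existing step 'clf'
print('doc2vec__vector_size' in valid_keys)  # key built from a step name that does not exist

# setting a parameter under an unknown step name is rejected
try:
    pipe.set_params(doc2vec__vector_size=100)
    rejected = False
except ValueError as error:
    rejected = True
    print('rejected:', type(error).__name__)
```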

In [22]:
import sys
import os
from time import time
from operator import itemgetter
import pickle
import pandas as pd
import numpy as np
from argparse import ArgumentParser

from gensim.models.doc2vec import Doc2Vec
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

from sklearn.base import BaseEstimator

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
%run Doc2Vec.ipynb



train_comments = np.array(train_ds['comment_text'])
train_labels = np.array(train_ds['toxicity_level'])

class Doc2VecModel(BaseEstimator):

    def __init__(self, size=1, window=1, dm=1):
        self.d2v_model = None
        self.size = size
        self.window = window
        self.dm = dm

    def fit(self, raw_documents, y=None):
        # Initialize model
        doc2vec.Doc2Vec
        self.d2v_model = Doc2Vec(size=self.size, window=self.window, dm=self.dm, iter=5, alpha=0.025, min_alpha=0.001)
        # Tag docs
        tokenized_comments = Doc2Vec.tokenize_comments(train_ds['comment_text'])
        tagged_documents = Doc2Vec.get_tagged_documents(tokenized_comments)
        # Build vocabulary
        self.d2v_model.build_vocab(tagged_documents)
        # Train model
        self.d2v_model.train(tagged_documents, total_examples=len(tagged_documents), epochs=self.d2v_model.iter)
        return self

    def transform(self, raw_documents):
        X = []
        for index, row in raw_documents.iteritems():
            X.append(self.d2v_model.infer_vector(row))
        X = pd.DataFrame(X, index=raw_documents.index)
        return X

    def fit_transform(self, raw_documents, y=None):
        self.fit(raw_documents)
        return self.transform(raw_documents)


param_grid = {'doc2vec__window': [2, 3],
              'doc2vec__dm': [0,1],
              'doc2vec__size': [100,200]
}

pipe_log = Pipeline([('doc2vec', Doc2VecModel()), ('log', LogisticRegression())])

log_grid = GridSearchCV(pipe_log,
                        param_grid=param_grid,
                        scoring="accuracy",
                        verbose=3,
                        n_jobs=1)

fitted = log_grid.fit(train_ds['comment_text'], train_ds['toxicity_level'])

# Best parameters
print("Best Parameters: {}\n".format(log_grid.best_params_))
print("Best accuracy: {}\n".format(log_grid.best_score_))
print("Finished.")

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV 1/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=2;, score=nan total time=   0.0s
[CV 2/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=2;, score=nan total time=   0.0s
[CV 3/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=2;, score=nan total time=   0.0s
[CV 4/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=2;, score=nan total time=   0.0s
[CV 5/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=2;, score=nan total time=   0.0s
[CV 1/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=3;, score=nan total time=   0.0s
[CV 2/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=3;, score=nan total time=   0.0s
[CV 3/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=3;, score=nan total time=   0.0s
[CV 4/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=3;, score=nan total time=   0.0s
[CV 5/5] END doc2vec__dm=0, doc2vec__size=100, doc2vec__window=3;, score=nan

40 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/lisasaurus01/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/lisasaurus01/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/Users/lisasaurus01/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/Users/lisasaurus01/opt/anaconda3/lib/python3.9/site-packages/joblib/me

TypeError: __init__() got an unexpected keyword argument 'size'
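
The `TypeError` above is consistent with gensim 4.x, where the `Doc2Vec` keyword `size` was renamed to `vector_size` and `iter` to `epochs`. Below is a small sketch of mapping the old keyword names to the new ones before constructing the model. The helper `translate_legacy_doc2vec_kwargs` is hypothetical and not a gensim function.

```python
# old keyword name -> gensim 4.x keyword name
LEGACY_RENAMES = {'size': 'vector_size', 'iter': 'epochs'}

def translate_legacy_doc2vec_kwargs(kwargs):
    """Return a copy of kwargs with pre-4.0 Doc2Vec keyword names renamed."""
    return {LEGACY_RENAMES.get(name, name): value for name, value in kwargs.items()}

old_style = {'size': 100, 'window': 2, 'dm': 1, 'iter': 5, 'alpha': 0.025, 'min_alpha': 0.001}
new_style = translate_legacy_doc2vec_kwargs(old_style)
print(new_style)
# the translated dict could then be passed on, e.g. Doc2Vec(**new_style)
```

With the renamed keywords, the matching attribute on the model would be `epochs` rather than `iter`. The wrapper also calls `tokenize_comments` and `get_tagged_documents` as attributes of gensim's `Doc2Vec` class, although elsewhere in this notebook they are plain helper functions loaded from `Doc2Vec.ipynb`, so those calls would probably fail next.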

