# Doc2Vec Vectorization

### Import Modules

In [5]:
import pandas as pd
import numpy as np

import pickle

from sklearn import utils
from sklearn.metrics import classification_report

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from collections import OrderedDict
import multiprocessing

import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

### Import Clean DataFrame

In [4]:
with open('../Data/df_train.pkl', 'rb') as f:
    df_train = pickle.load(f)

### Import Stop Word List

In [7]:
with open('../Data/stop_word_list.pkl', 'rb') as f:
    stop_word_list = pickle.load(f)

### Doc2Vec

Many current NLP systems and techniques treat words as atomic units: each word is represented as an index in a vocabulary, with no notion of similarity between words.  The most common fixed-length document representation built this way is the bag-of-words, which loses the ordering of the words and ignores their semantics.
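To make the loss of word order concrete, here is a minimal sketch using only the standard library. The two example sentences and the helper name bag_of_words are made up for illustration and are not taken from the review data.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its word counts, discarding word order."""
    return Counter(text.lower().split())

# Two made-up sentences with opposite meanings but identical words
doc_a = "the food was good not bad"
doc_b = "the food was bad not good"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a == bow_b)  # True: the bag-of-words cannot tell them apart
```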

Unlike bag-of-words models, word embeddings represent words in an N-dimensional vector space so that semantically similar words (e.g. “king” and “monarch”) or semantically related words (e.g. “bird” and “fly”) lie close together.  Word2Vec is a technique for learning high-quality word vectors: it embeds words in a lower-dimensional vector space using a shallow neural network.  The result of this unsupervised framework is a set of word vectors in which vectors close together have similar meanings based on context, and vectors distant from each other have differing meanings.
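"Closeness" in the vector space is usually measured with cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors purely for illustration; real Word2Vec vectors are learned from data and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-picked for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "monarch": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["king"], toy_vectors["monarch"])
sim_far = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])

print(sim_close > sim_far)  # True
```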

Doc2Vec applies the same logic as Word2Vec at the document level: it extends the Word2Vec algorithm to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.  There are two methods of implementing Doc2Vec: Distributed Memory (DM) and Distributed Bag of Words (DBOW).  DM attempts to predict a word given its previous words and the paragraph vector.  DBOW predicts words randomly sampled from a paragraph given only its paragraph vector.
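The difference between the two methods comes down to what is fed in and what is predicted. The following sketch only builds the (input, target) pairs described above, with no neural network involved. The function names, the window size and the example sentence are assumptions for illustration, not gensim's implementation.

```python
import random

def pv_dm_examples(doc_tag, words, window=2):
    """PV-DM: (paragraph tag + preceding words) -> next word."""
    examples = []
    for i in range(window, len(words)):
        context = words[i - window:i]
        examples.append(((doc_tag, tuple(context)), words[i]))
    return examples

def pv_dbow_examples(doc_tag, words, n_samples=3, seed=0):
    """PV-DBOW: paragraph tag alone -> words sampled from the paragraph."""
    rng = random.Random(seed)
    return [(doc_tag, rng.choice(words)) for _ in range(n_samples)]

words = "the plot was slow but the acting was superb".split()
dm = pv_dm_examples(doc_tag=0, words=words)
dbow = pv_dbow_examples(doc_tag=0, words=words)

print(dm[0])    # ((0, ('the', 'plot')), 'was')
print(len(dm))  # 7
print(dbow)     # three (tag, word) pairs; the words depend on the seed
```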

----

I use the Python library gensim for its readable implementation of Doc2Vec.  I start by splitting each review into words, tagging each review with its index in the training set, and appending the resulting TaggedDocument to the tagged_documents list.

In [14]:
tagged_documents = []
for indx, doc in enumerate(df_train["review"].values):
    # words: whitespace-split tokens of the review; tags: the review's index
    tagged_documents.append(TaggedDocument(doc.split(), [indx]))

### Export Tagged Documents

In [15]:
with open('../Data/tagged_documents.pkl', 'wb+') as f:
    pickle.dump(tagged_documents, f)

After tagging the reviews, the data is ready for training and I can start modeling.  The dm parameter defines the training algorithm in Doc2Vec: dm=1 means ‘distributed memory’ (PV-DM) and dm=0 means ‘distributed bag of words’ (PV-DBOW).  The distributed memory model uses the words surrounding each target word and so retains word-order information, whereas distributed bag of words takes the bag-of-words approach, which doesn’t preserve any word order.  I implemented the DM algorithm in two ways: one which averages the context vectors (dm_mean) and one which concatenates them (dm_concat).  This follows a [Doc2Vec tutorial](https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5).
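The practical difference between averaging and concatenating shows up in the size of the vector passed to the prediction layer and in whether the order of the context survives. The sketch below uses made-up 4-dimensional vectors and hypothetical helper names; it illustrates the idea and is not gensim's internal code.

```python
def combine_mean(paragraph_vec, context_vecs):
    """Average the paragraph vector with the context word vectors."""
    vectors = [paragraph_vec] + context_vecs
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def combine_concat(paragraph_vec, context_vecs):
    """Concatenate the paragraph vector and context word vectors in order."""
    combined = list(paragraph_vec)
    for vec in context_vecs:
        combined.extend(vec)
    return combined

# Made-up vectors for one paragraph and two context words
paragraph_vec = [1.0, 0.0, 0.0, 0.0]
context_vecs = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

mean_input = combine_mean(paragraph_vec, context_vecs)
concat_input = combine_concat(paragraph_vec, context_vecs)

print(len(mean_input))    # 4: same size as a single vector
print(len(concat_input))  # 12: grows with the number of context words
```

Swapping the two context vectors leaves mean_input unchanged but alters concat_input, which is why concatenation keeps the order of the context at the cost of a much larger input layer.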

I used the paper [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation](https://arxiv.org/pdf/1607.05368.pdf) as guidance when choosing the hyperparameters of the various Doc2Vec models, since gensim does not appear to provide a built-in way to tune them.

- I capped the vector size at 300, since larger vectors take up a lot of memory and did not seem to improve my model performance.

- Setting the negative parameter above 0 turns on negative sampling; its value specifies how many “noise words” are drawn for each training sample.  The gensim documentation recommends a range of 5-20.

- hs is set to 0 so that, with negative > 0, negative sampling is used instead of hierarchical softmax.  Because of the size of the dataset I am working with, the neural network within Doc2Vec has a lot of weights, all of which would otherwise be updated by every one of the training samples.  Negative sampling addresses this by having each training sample modify only a small percentage of the weights.

- min_count ignores all words whose total frequency is lower than the set number.

- alpha sets the initial learning rate of the model.
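To illustrate what drawing “noise words” means, here is a rough sketch that samples negatives from the word-frequency distribution raised to the 0.75 power, the smoothing used in the original word2vec negative-sampling formulation. The word counts and the helper name draw_noise_words are hypothetical, and the sketch leaves out the weight update itself.

```python
import random

# Hypothetical word counts standing in for a real vocabulary
word_counts = {"the": 500, "movie": 120, "great": 40, "dull": 15, "superb": 5}

def draw_noise_words(target, counts, negative=5, seed=42):
    """Draw `negative` noise words, never returning the true target word."""
    rng = random.Random(seed)
    candidates = [w for w in counts if w != target]
    # Smoothing with the 0.75 exponent gives rare words a better chance
    weights = [counts[w] ** 0.75 for w in candidates]
    return rng.choices(candidates, weights=weights, k=negative)

noise = draw_noise_words("great", word_counts, negative=5)

# Only the target word and the drawn noise words get their output weights updated
updated = {"great", *noise}

print(len(noise))          # 5
print("great" in noise)    # False
```

With a real vocabulary of tens of thousands of words, the target plus 5 noise words is a tiny fraction of the output layer, which is where the savings come from.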

In [16]:
cores = multiprocessing.cpu_count()
vec_size = 300

model_dbow = Doc2Vec(dm=0, dbow_words=1, vector_size=vec_size, negative=5, hs=0, min_count=2, sample=0, 
             workers=cores)

model_dm_mean = Doc2Vec(dm=1, dm_mean=1, vector_size=vec_size, window=10, negative=5, hs=0, min_count=2, sample=0, 
                workers=cores, alpha=0.05, comment='alpha=0.05')

model_dm_concat = Doc2Vec(dm=1, dm_concat=1, vector_size=vec_size, window=5, negative=5, hs=0, min_count=2, sample=0, 
                  workers=cores)

Next, I set up a for loop that scans the corpus and initializes the vocabulary for each of these models. The vocabulary is a dictionary **(accessible via model.wv.vocab)** of all of the unique words extracted from the training corpus, along with their counts **(e.g., model.wv.vocab['penalty'].count gives the count for the word penalty)**. These accessors belong to gensim releases before 4.0; gensim 4.x replaces them with model.wv.key_to_index and model.wv.get_vecattr('penalty', 'count').
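Conceptually, the vocabulary scan counts every token and then drops the ones that fall below min_count. A minimal standard-library sketch of that idea, with made-up documents and a hypothetical scan_vocab helper (gensim's own build_vocab does considerably more, such as preparing the sampling tables):

```python
from collections import Counter

def scan_vocab(token_lists, min_count=2):
    """Count every token, then keep those seen at least `min_count` times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: n for word, n in counts.items() if n >= min_count}

# Made-up tokenized documents
docs = [
    "the penalty was harsh".split(),
    "the penalty stood".split(),
    "a harsh call".split(),
]

vocab = scan_vocab(docs, min_count=2)

print(vocab)  # {'the': 2, 'penalty': 2, 'harsh': 2}
```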

In [17]:
models = [(model_dbow, 'model_dbow'), (model_dm_mean, 'model_dm_mean'), (model_dm_concat, 'model_dm_concat')]

for model in models:
    model[0].build_vocab(tagged_documents)
    print("%s vocabulary scanned & state initialized" % model[0])
    
models_by_name = OrderedDict((str(model[1]), model[0]) for model in models)

Doc2Vec(dbow+w,d150,n5,w5,mc2,t4) vocabulary scanned & state initialized
Doc2Vec("alpha=0.05",dm/m,d150,n5,w10,mc2,t4) vocabulary scanned & state initialized
Doc2Vec(dm/c,d150,n5,w5,mc2,t4) vocabulary scanned & state initialized


When training the models, I shuffle the corpus before each pass because the original corpus is stacked: all the negative-sentiment documents come first, followed by the positive-sentiment documents.  Shuffling breaks up these groupings and should lead to better results.
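The effect of reshuffling on every pass can be seen on a toy stacked corpus. This sketch uses the standard library's random module as a stand-in for sklearn's utils.shuffle; the labels, the helper name shuffled_copy and the per-epoch seeds are assumptions for illustration.

```python
import random

# Made-up stacked corpus: all negative reviews first, then all positive
stacked = [("neg", i) for i in range(5)] + [("pos", i) for i in range(5)]

def shuffled_copy(items, seed):
    """Return a shuffled copy, leaving the original list untouched."""
    rng = random.Random(seed)
    return rng.sample(items, k=len(items))

for epoch in range(3):
    epoch_order = shuffled_copy(stacked, seed=epoch)
    # Each pass sees the same documents in a different, mixed order
    print(epoch, [label for label, _ in epoch_order])
```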

In [18]:
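n_epochs = 30

for model in models:
    # Decay the learning rate linearly from its initial value to min_alpha across all passes;
    # subtracting a fixed 0.002 per pass would push alpha below zero before 30 passes
    alphas = np.linspace(model[0].alpha, model[0].min_alpha, n_epochs + 1)
    for epoch in range(n_epochs):
        print('Epoch: {0}'.format(epoch), 'Model: %s' % (model[0]))
        model[0].train(utils.shuffle(tagged_documents), total_examples=len(tagged_documents), epochs=1,
                       start_alpha=alphas[epoch], end_alpha=alphas[epoch + 1])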

Epoch: 0 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 1 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 2 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 3 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 4 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 5 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 6 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 7 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 8 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 9 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 10 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 11 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 12 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 13 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 14 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 15 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 16 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 17 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 18 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 19 Model: Doc2Vec(dbow+w,d150,n5,w
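When the epochs are driven manually like this, it is worth checking that the learning rate stays positive on every pass. Below is a minimal sketch of a linear schedule over 30 passes, assuming a starting rate of 0.025 and a floor of 0.0001 (the values gensim documents as its defaults). The helper name linear_alpha_schedule is hypothetical.

```python
def linear_alpha_schedule(start_alpha, end_alpha, n_epochs):
    """Return (start, end) learning-rate pairs, one per epoch, decaying linearly."""
    step = (start_alpha - end_alpha) / n_epochs
    return [
        (start_alpha - epoch * step, start_alpha - (epoch + 1) * step)
        for epoch in range(n_epochs)
    ]

schedule = linear_alpha_schedule(start_alpha=0.025, end_alpha=0.0001, n_epochs=30)

print(len(schedule))                        # 30
print(all(end > 0 for _, end in schedule))  # True

# For comparison: a fixed step of 0.002 per pass from 0.025 crosses zero
fixed_step = [0.025 - epoch * 0.002 for epoch in range(30)]
print(min(fixed_step) < 0)                  # True
```

Each pair can be passed to train() as start_alpha and end_alpha for a single-epoch call.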

### Pickle Models

In [19]:
with open('../Models/model_dbow.pkl', 'wb+') as f:
    pickle.dump(model_dbow, f)

In [26]:
with open('../Models/model_dm_mean.pkl', 'wb+') as f:
    pickle.dump(model_dm_mean, f)

In [27]:
with open('../Models/model_dm_concat.pkl', 'wb+') as f:
    pickle.dump(model_dm_concat, f)

### Define X and y

In [20]:
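# One (n_documents, vec_size) matrix of document vectors per model
X = {}

for model in models:
    X[model[1]] = np.zeros((df_train.shape[0], vec_size))
    for i in range(df_train.shape[0]):
        # docvecs is the gensim < 4.0 accessor; gensim 4.x names it dv
        X[model[1]][i] = model[0].docvecs[i]

# X is keyed by model name: 'model_dbow', 'model_dm_mean', 'model_dm_concat'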

In [21]:
y = df_train['label'].values

### Export X and y 

In [24]:
with open('../Data/X_doc2vec.pkl', 'wb+') as f:
    pickle.dump(X, f)

In [25]:
with open('../Data/y_doc2vec.pkl', 'wb+') as f:
    pickle.dump(y, f)