
# Doc2Vec

Credit: https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5

In [1]:
# Import dependencies (word_tokenize needs NLTK's punkt tokenizer data to be downloaded)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [2]:
data = ["I love machine learning. It's awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amazingly well"]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

In [12]:
help(TaggedDocument)

Help on class TaggedDocument in module gensim.models.doc2vec:

class TaggedDocument(TaggedDocument)
 |  Represents a document along with a tag, input document format for :class:`~gensim.models.doc2vec.Doc2Vec`.
 |  
 |  A single document, made up of `words` (a list of unicode string tokens) and `tags` (a list of tokens).
 |  Tags may be one or more unicode string tokens, but typical practice (which will also be the most memory-efficient)
 |  is for the tags list to include a unique integer id as the only tag.
 |  
 |  Replaces "sentence as a list of words" from :class:`gensim.models.word2vec.Word2Vec`.
 |  
 |  Method resolution order:
 |      TaggedDocument
 |      TaggedDocument
 |      builtins.tuple
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __str__(self)
 |      Human readable representation of the object's state, used for debugging.
 |      
 |      Returns
 |      -------
 |      str
 |         Human readable representation of the object's state (words and t

In [4]:
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,  # `size` was renamed to `vector_size` in newer gensim releases
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

# Note: dm selects the training algorithm. dm=1 is "distributed memory" (PV-DM);
# dm=0 is "distributed bag of words" (PV-DBOW). PV-DM predicts a word from its
# surrounding context words together with the document vector, so it retains some
# word-order information; PV-DBOW ignores context and word order entirely.

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    if epoch % 10 == 0: 
        print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
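                epochs=model.epochs)  # `model.iter` in older gensim releases
    # decrease the learning rate by hand after each train() call
    model.alpha -= 0.0002
    # pin min_alpha to alpha so the next train() call uses a constant rate
    model.min_alpha = model.alpha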

model.save("d2v.model")
print("Model Saved")



iteration 0
iteration 10
iteration 20
iteration 30
iteration 40
iteration 50
iteration 60
iteration 70
iteration 80
iteration 90
Model Saved
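
The manual learning-rate decay in the loop above can be checked with plain arithmetic. The sketch below is only an illustration and does not touch gensim: the helper name `alpha_schedule` is hypothetical, and it simply replays the `model.alpha -= 0.0002` step for 100 passes starting from 0.025.

```python
def alpha_schedule(start_alpha, step, n_passes):
    """Return the learning rate in effect at the start of each pass
    when `step` is subtracted after every pass."""
    alphas = []
    alpha = start_alpha
    for _ in range(n_passes):
        alphas.append(alpha)
        alpha -= step
    return alphas

# Values taken from the training cell: alpha = 0.025, step = 0.0002, 100 passes
alphas = alpha_schedule(start_alpha=0.025, step=0.0002, n_passes=100)

print(round(alphas[0], 4))   # rate at the start of the first pass
print(round(alphas[-1], 4))  # rate at the start of the last pass
print(min(alphas) > 0)       # the rate never reaches zero with these settings
```

With these numbers the rate stays positive for the whole run. A larger step or more passes could push it to zero or below, which is one reason the gensim documentation generally suggests a single `train()` call with `epochs` set to the full count, letting the library decay the rate from `alpha` to `min_alpha` itself.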


In [6]:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("d2v.model")

# Infer a vector for a document that was not in the training data
test_data = word_tokenize("I love chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)


V1_infer [-0.00615856  0.02282158 -0.0098063  -0.00422864  0.03034436  0.00866225
  0.00908293 -0.02194043  0.01761457 -0.00817535 -0.00170775  0.00439633
  0.0137575  -0.02104217 -0.01473705  0.00050565  0.01018822 -0.02339925
  0.0040169  -0.01591701]


In [8]:
# Find the training documents most similar to the one tagged '1' (gensim 4.x exposes docvecs as model.dv)
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)

[('0', 0.9914426803588867), ('2', 0.9892101883888245), ('3', 0.9888399243354797)]
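
The scores returned by `most_similar` are cosine similarities between document vectors. As a minimal sketch of that measure, the pure-Python function below computes it for small made-up vectors (the vectors `u`, `v`, `w` and the helper name `cosine_similarity` are illustrative assumptions, not values or functions from the model).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, chosen only to show the two extremes
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u
w = [-3.0, 0.0, 1.0]   # orthogonal to u

sim_parallel = cosine_similarity(u, v)
sim_orthogonal = cosine_similarity(u, w)
print(sim_parallel)    # close to 1: the vectors point the same way
print(sim_orthogonal)  # close to 0: the vectors are unrelated in direction
```

In the output above all three scores are near 0.99, so with only four short training documents the ranking probably carries little information.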




In [9]:
# Look up the trained vector for the training document tagged '1' (the document at index 1 in `data`)
print(model.docvecs['1'])

[-0.10517474 -0.10617201 -0.14613736 -0.24473928  0.2275966  -0.06443435
  0.3328848  -0.00885167  0.2031824   0.13060504 -0.4419818  -0.06489877
 -0.06304339 -0.08129008  0.04666416 -0.5370674  -0.45441043  0.00670689
 -0.28969312 -0.30347893]
