## Doc2Vec
In this notebook, we demonstrate how to train a Doc2Vec model on a custom corpus using gensim.

In [None]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tag each lowercased, tokenized sentence with its index in the corpus.
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(data)]


In [None]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [None]:
# DBOW (distributed bag of words): dm=0
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)


In [None]:
print(model_dbow.infer_vector(['man', 'eats', 'food']))  # feature vector for "man eats food"

[-1.6232926e-02  7.2174487e-03 -1.8149924e-02  1.9396201e-02
 -1.0752278e-02  2.1854458e-02 -1.0387233e-02  5.0634018e-04
 -1.0485573e-02 -2.3734096e-02 -2.1500036e-02  1.1494607e-02
 -7.5711228e-05 -9.6793557e-03 -1.1162300e-02  2.3743849e-02
  5.5664307e-03 -2.3691252e-02  1.7469667e-02 -8.0082798e-03]


In [None]:
model_dbow.wv.most_similar("man", topn=5)  # top 5 most similar words

[('dog', 0.1856406182050705),
 ('meat', 0.12032049894332886),
 ('bites', 0.037392228841781616),
 ('food', -0.027777723968029022),
 ('eats', -0.29439008235931396)]

In [None]:
model_dbow.wv.n_similarity(["dog"], ["man"])

0.1856406
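To our understanding, `n_similarity` returns the cosine similarity between the mean vectors of the two word lists, which is why the single-word call above matches the score `most_similar` reported for `dog`. A minimal NumPy sketch of that computation follows; the helper name `mean_cosine_similarity` and the toy 2-d vectors are made up for illustration and are not taken from the model.

```python
import numpy as np

def mean_cosine_similarity(vectors_a, vectors_b):
    """Cosine similarity between the mean vectors of two sets of vectors."""
    mean_a = np.mean(np.asarray(vectors_a, dtype=float), axis=0)
    mean_b = np.mean(np.asarray(vectors_b, dtype=float), axis=0)
    return float(np.dot(mean_a, mean_b) / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))

# Hypothetical 2-d "word vectors", chosen so the results are easy to check by hand.
toy_dog = [1.0, 0.0]
toy_man = [0.0, 1.0]

same = mean_cosine_similarity([toy_dog], [toy_dog])        # identical direction
orthogonal = mean_cosine_similarity([toy_dog], [toy_man])  # perpendicular vectors
print(same, orthogonal)
```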

In [None]:
# DM (distributed memory): dm=1
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [-1.6232852e-02  7.2173858e-03 -1.8149856e-02  1.9396329e-02
 -1.0752306e-02  2.1854490e-02 -1.0387184e-02  5.0630077e-04
 -1.0485582e-02 -2.3733964e-02 -2.1500139e-02  1.1494617e-02
 -7.5761047e-05 -9.6794488e-03 -1.1162374e-02  2.3743976e-02
  5.5664619e-03 -2.3691194e-02  1.7469568e-02 -8.0082249e-03]
Most similar words to man in our corpus
 [('dog', 0.1856406182050705), ('meat', 0.12032049894332886), ('bites', 0.037392228841781616), ('food', -0.027777723968029022), ('eats', -0.29439008235931396)]
Similarity between man and dog:  0.1856406


What happens when we compare words that are not in the vocabulary?

In [None]:
model_dm.wv.n_similarity(['covid'],['man'])

KeyError: ignored
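The lookup fails with a `KeyError` because `covid` never appeared in the training corpus, so the model has no vector for it. One way to guard against this is to split the query words into known and unknown ones before calling the similarity functions. The sketch below uses a hypothetical helper, `split_by_vocab`, and a plain dict as a stand-in vocabulary so that it runs without a trained model; with gensim 4.x we assume `model_dm.wv.key_to_index` could be passed in its place.

```python
def split_by_vocab(words, vocab):
    """Return (known, unknown) words according to membership in `vocab`."""
    known = [w for w in words if w in vocab]
    unknown = [w for w in words if w not in vocab]
    return known, unknown

# Stand-in for the model vocabulary: the six words of the toy corpus.
# Assumption: with gensim 4.x, model_dm.wv.key_to_index plays this role.
toy_vocab = {"dog": 0, "bites": 1, "man": 2, "eats": 3, "meat": 4, "food": 5}

known, unknown = split_by_vocab(["covid", "man"], toy_vocab)
print("known:", known)
print("unknown:", unknown)
```

Whether to skip the unknown words, report them, or abort the comparison depends on the application; the helper only makes the out-of-vocabulary cases visible.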