<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Intro-on-Doc2Vec" data-toc-modified-id="Intro-on-Doc2Vec-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Intro on Doc2Vec</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Doc2Vec-by-averaging" data-toc-modified-id="Doc2Vec-by-averaging-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Doc2Vec by averaging</a></span></li><li><span><a href="#Training-Doc2Vec" data-toc-modified-id="Training-Doc2Vec-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Training Doc2Vec</a></span></li><li><span><a href="#References" data-toc-modified-id="References-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# Introduction
<hr style="border:2px solid black"> </hr>

<div class="alert alert-block alert-warning">
<font color=black>

**What?** Doc2Vec embedding

</font>
</div>

# Intro on Doc2Vec
<hr style="border:2px solid black"> </hr>

<div class="alert alert-block alert-info">
<font color=black>

- In the Doc2Vec embedding scheme, we learn a direct representation for an entire document (sentence or paragraph) rather than for each word.
- Just as we used word and character embeddings as features for text classification, we can also use Doc2Vec vectors as a feature representation.
- Doc2Vec lets us learn representations directly for texts of arbitrary length (phrases, sentences, paragraphs, and documents) by taking the context of the words in the text into account.

</font>
</div>

# Imports
<hr style="border:2px solid black"> </hr>

In [None]:
import spacy
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

In [None]:
# downloading en_core_web_sm, assuming spacy is already installed
!python -m spacy download en_core_web_sm

In [7]:
# Here the nlp object is an instance of the 'en_core_web_sm' language model.
nlp = spacy.load("en_core_web_sm")

# Doc2Vec by averaging
<hr style="border:2px solid black"> </hr>

In [8]:
# Assume each sentence in documents corresponds to a separate document.
documents = ["Dog bites man.", "Man bites dog.",
             "Dog eats meat.", "Man eats food."]
# Lowercase each document and strip the full stops.
processed_docs = [doc.lower().replace(".", "") for doc in documents]
print("Document After Pre-Processing:", processed_docs)

Document After Pre-Processing: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']


In [6]:
# Run each document through the spaCy pipeline.
for doc in processed_docs:
    # Create a spaCy "Doc" object, a container for accessing linguistic annotations.
    doc_nlp = nlp(doc)

    print("-"*30)
    # doc_nlp.vector is the average of the document's token vectors.
    print("Average Vector of '{}'\n".format(doc), doc_nlp.vector)
    for token in doc_nlp:
        print()
        # Print the text of each token in the doc along with its vector.
        print(token.text, token.vector)

------------------------------
Average Vector of 'dog bites man'
 [ 1.73471165e+00  6.59988284e-01 -3.19469976e-03  1.16046118e-02
 -6.73563778e-01  3.18560489e-02 -3.09772134e-01  3.27946335e-01
 -3.91988426e-01  4.72732484e-02  5.01355052e-01  4.57249850e-01
  2.93650508e-01  6.01201467e-02 -5.18603086e-01  1.02208388e+00
 -4.08442408e-01  5.55521190e-01 -2.39790127e-01 -1.11812174e-01
 -8.04212391e-01  5.24548590e-01 -8.18855941e-01 -1.03225298e-01
 -5.79499066e-01 -4.91997510e-01  2.11589560e-01 -3.03037763e-01
 -2.11566295e-02  2.10528076e-01 -1.63071573e-01 -9.90999520e-01
 -3.07075799e-01 -4.42660958e-01  5.17583370e-01  3.94171029e-01
 -1.43115565e-01 -1.47346497e-01  1.00340843e-02  1.65966237e+00
 -7.00365484e-01 -2.15463758e-01 -7.12421238e-01  1.03067887e+00
 -4.61727791e-02 -1.03162311e-01  2.11046147e-03  7.27808416e-01
  8.60032082e-01  1.04048915e-01  6.03652179e-01  4.28187847e-02
 -4.26694632e-01  6.76044464e-01  8.63265574e-01 -6.79133654e-01
 -5.75602293e-01 -7.9803
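
One consequence of averaging is worth making explicit: with static (context-independent) word vectors, the mean does not depend on word order. The sketch below uses made-up 3-dimensional vectors (`toy_vectors` and `average_vector` are hypothetical names, not part of spaCy) to show that "dog bites man" and "man bites dog" then receive the same document vector. Models that produce context-sensitive token vectors need not behave this way, so treat this as an illustration of the averaging idea rather than a statement about the output above.

```python
# Hypothetical 3-dimensional word vectors, invented for illustration only;
# real pretrained vectors have many more dimensions.
toy_vectors = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 1.0, 0.0],
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of the given tokens."""
    dims = len(next(iter(vectors.values())))
    totals = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(vectors[token]):
            totals[i] += value
    return [total / len(tokens) for total in totals]


doc_a = average_vector("dog bites man".split(), toy_vectors)
doc_b = average_vector("man bites dog".split(), toy_vectors)

print(doc_a)
print("Same vector for both word orders:", doc_a == doc_b)
```

This loss of word order is one motivation for training a dedicated document embedding instead of averaging word vectors.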

# Training Doc2Vec
<hr style="border:2px solid black"> </hr>

In [10]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tokenize each document and tag it with its index (as a string).
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
               for i, doc in enumerate(data)]

In [11]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [13]:
# Distributed Bag of Words variant (dm=0)
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)

In [14]:
# Infer a feature vector for the document "man eats food"
print(model_dbow.infer_vector(['man', 'eats', 'food']))

[-0.0125304  -0.00402047 -0.02311893  0.00096551  0.01455172 -0.00070387
  0.00353274 -0.01841839 -0.00211448 -0.01939035  0.01752053  0.01105619
  0.00875513  0.01472122 -0.02045602 -0.00492593 -0.01968692  0.00734338
  0.00015174  0.01738879]


In [15]:
# Top 5 words most similar to "man"
model_dbow.wv.most_similar("man", topn=5)

[('meat', 0.39641645550727844),
 ('bites', 0.05595850199460983),
 ('dog', 0.050179000943899155),
 ('food', -0.06502582132816315),
 ('eats', -0.2928891181945801)]

In [16]:
model_dbow.wv.n_similarity(["dog"],["man"])

0.050179023
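
The value above matches the `dog` entry in the `most_similar("man")` output, which is consistent with `n_similarity` being a cosine similarity. To my understanding, gensim averages the vectors of each word list and then takes the cosine of the two means, so with single-word lists it reduces to the plain cosine between two word vectors. The following is a minimal pure-Python sketch of that computation under this assumption; `mean_vector`, `cosine_similarity`, `set_similarity`, and the 2-dimensional `toy` vectors are hypothetical and are not gensim code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def set_similarity(vectors_a, vectors_b):
    """Cosine similarity between the mean vectors of two sets (assumed behaviour)."""
    return cosine_similarity(mean_vector(vectors_a), mean_vector(vectors_b))


# Hypothetical 2-dimensional word vectors, invented for illustration only.
toy = {"dog": [1.0, 0.0], "man": [0.0, 1.0], "food": [1.0, 1.0]}

same = set_similarity([toy["dog"]], [toy["dog"]])
orthogonal = set_similarity([toy["dog"]], [toy["man"]])
mixed = set_similarity([toy["dog"], toy["man"]], [toy["food"]])

print(same, orthogonal, mixed)
```

Identical vectors score 1, orthogonal vectors score 0, and the mean of `dog` and `man` points in the same direction as `food`, so that pair scores approximately 1.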

In [17]:
# Distributed Memory variant (dm=1)
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)

print("Inference Vector of man eats food\n ",
      model_dm.infer_vector(['man', 'eats', 'food']))

print("Most similar words to man in our corpus\n",
      model_dm.wv.most_similar("man", topn=5))
print("Similarity between man and dog: ",
      model_dm.wv.n_similarity(["dog"], ["man"]))

Inference Vector of man eats food
  [-0.01253042 -0.00402056 -0.02311884  0.00096535  0.01455176 -0.00070395
  0.00353278 -0.01841841 -0.00211436 -0.01939049  0.01752049  0.01105618
  0.00875503  0.01472113 -0.02045604 -0.00492576 -0.01968694  0.00734341
  0.00015168  0.01738894]
Most similar words to man in our corpus
 [('meat', 0.39641645550727844), ('bites', 0.05595850199460983), ('dog', 0.050179000943899155), ('food', -0.06502582132816315), ('eats', -0.2928891181945801)]
Similarity between man and dog:  0.050179023


In [None]:
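# What happens when we compare words that are not in the vocabulary?
# 'covid' never appears in the training data, so gensim has no vector for it
# and this lookup is expected to raise a KeyError.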
model_dm.wv.n_similarity(['covid'],['man'])
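
One way to avoid the failure is to check vocabulary membership before asking for a similarity. The sketch below is a hedged, library-free illustration: `split_by_vocabulary` is a hypothetical helper, and `toy_vocabulary` is a plain set standing in for the trained model's vocabulary (its words are the ones in the corpus above).

```python
def split_by_vocabulary(tokens, vocabulary):
    """Split tokens into (known, unknown) lists based on membership in vocabulary."""
    known = [token for token in tokens if token in vocabulary]
    unknown = [token for token in tokens if token not in vocabulary]
    return known, unknown


# Stand-in for the model's vocabulary (assumption: a plain set of corpus words).
toy_vocabulary = {"dog", "bites", "man", "eats", "meat", "food"}

known, unknown = split_by_vocabulary(["covid", "man"], toy_vocabulary)

print("In vocabulary:", known)
print("Out of vocabulary:", unknown)
```

In the notebook, the same helper should work with `model_dm.wv` passed as `vocabulary`, assuming the word-vector object supports the `in` operator (as gensim's keyed vectors do, to my knowledge). The similarity would then be computed only when both filtered lists are non-empty.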

# References
<hr style="border:2px solid black"> </hr>

<div class="alert alert-warning">
<font color=black>

- https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/07_DocVectors_using_averaging_Via_spacy.ipynb
- https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/08_Training_Dov2Vec_using_Gensim.ipynb
</font>
</div>