

# Introduction
<hr style="border:2px solid black"> </hr>


**What?** Doc2Vec embedding



# Intro on Doc2Vec
<hr style="border:2px solid black"> </hr>


- Doc2vec learns a single vector representation for an entire text (sentence, paragraph, or document) rather than one vector per word.
- Just as word and character embeddings can serve as features for text classification, Doc2vec vectors can be used as a feature representation.
- Doc2vec can learn representations for texts of arbitrary length (phrases, sentences, paragraphs, and documents) by taking the context of the words in the text into account.



# Imports
<hr style="border:2px solid black"> </hr>

In [1]:
import spacy
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

Slow version of gensim.models.doc2vec is being used
Slow version of Fasttext is being used
[nltk_data] Downloading package punkt to /Users/gm_main/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
# downloading en_core_web_sm, assuming spacy is already installed
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.4.0
    Uninstalling en-core-web-sm-3.4.0:
      Successfully uninstalled en-core-web-sm-3.4.0
Successfully installed en-core-web-sm-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
# Load the 'en_core_web_sm' pipeline; `nlp` is the resulting spaCy Language object.
nlp = spacy.load("en_core_web_sm")

# Doc2Vec by averaging
<hr style="border:2px solid black"> </hr>

In [4]:
# Treat each sentence in `documents` as a separate document.
documents = ["Dog bites man.", "Man bites dog.",
             "Dog eats meat.", "Man eats food."]
# Lowercase each document and strip the full stops.
processed_docs = [doc.lower().replace(".", "") for doc in documents]
print("Document After Pre-Processing:", processed_docs)

Document After Pre-Processing: ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']


In [5]:
# Run each document through the spaCy pipeline.
# Note: en_core_web_sm ships without static word vectors, so the vectors below
# are context-sensitive tensors; the md/lg models provide pretrained word vectors.
for doc in processed_docs:
    # nlp(doc) returns a spaCy "Doc", a container for the tokens and their annotations.
    doc_nlp = nlp(doc)

    print("-"*30)
    # Doc.vector defaults to the average of the token vectors.
    print("Average Vector of '{}'\n".format(doc), doc_nlp.vector)
    for token in doc_nlp:
        print()
        # Print each token's text and its vector.
        print(token.text, token.vector)

------------------------------
Average Vector of 'dog bites man'
 [ 0.9667074  -0.7464984   0.36472026 -0.7160862   0.05632142  0.3010411
  0.03581474  0.1674742   0.6921275  -0.4247689  -0.40176523  0.28563294
  0.267582    0.04325875 -0.41354653 -0.71943134  0.06804088  0.55628216
  0.25607497 -0.00654431 -0.08122383 -0.25795496  0.03947112 -0.6283007
 -0.00591467  0.57891047  0.30958834 -0.62156993  0.36322308  0.1362602
 -0.09057757  0.09294707 -1.1436163  -0.6169702   0.22781153  0.3885541
 -0.34815928  0.9676342  -0.13708536  0.01889853 -0.01454741  0.18896584
  0.01767497 -0.99440145  0.09768162 -0.88023967  0.27920863  0.7918339
 -0.68609625 -0.6622398   0.12798305 -0.24238761 -0.5436991  -0.5641633
 -0.7091424   0.19476144  0.15599926 -1.2776715  -0.43213868 -0.0783754
 -0.9465809  -0.5943868   0.33934966 -0.3902187   0.5428256  -0.8500302
  0.02790977  0.31925896  0.21651608 -0.6812822   0.08517497  0.65442723
  0.07228329  0.26916853  0.36432457 -0.15105744  0.24893479  0.25
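
The averaging step above can be reproduced by hand. The sketch below is a minimal illustration in plain Python, not what spaCy does internally: `toy_vectors` and `average_vector` are hypothetical names, and the numbers are made up. It also shows a known limitation of averaging static word vectors: word order is lost, so "dog bites man" and "man bites dog" receive the same document vector.

```python
# Toy static word vectors (made-up numbers, for illustration only).
toy_vectors = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 3.0, 0.0],
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of the given tokens."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


doc_a = average_vector("dog bites man".split(), toy_vectors)
doc_b = average_vector("man bites dog".split(), toy_vectors)

print("dog bites man ->", doc_a)
print("man bites dog ->", doc_b)
print("Same vector despite different word order:", doc_a == doc_b)
```

Because the mean ignores order, the two sentences are indistinguishable here; this is one motivation for training document vectors directly, as in the next section.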

# Training Doc2Vec
<hr style="border:2px solid black"> </hr>

In [6]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tokenize each document and tag it with its index, as Doc2Vec expects.
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
               for i, doc in enumerate(data)]

In [7]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [10]:
# DBOW (distributed bag of words): dm=0
#model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)
model_dbow = Doc2Vec(tagged_data, min_count=1, dm=0)

In [11]:
# Infer a feature vector for the document "man eats food"
print(model_dbow.infer_vector(['man','eats','food']))

[-1.5055673e-03 -6.5643340e-04 -1.5337862e-03  4.4471645e-03
 -1.1858434e-03  4.0370282e-03  4.9832887e-03 -4.6497909e-03
 -1.6673182e-03  1.6169664e-05  1.1205728e-03  9.0012058e-05
  4.1281856e-03  5.2525555e-03  5.4032938e-03  1.2685859e-03
  3.7547171e-03 -2.3299565e-03  5.2124885e-04  7.9722534e-04
 -4.7293515e-03 -3.9906022e-03  2.5666279e-03  2.0991445e-03
  1.3862473e-03 -5.2972109e-04  8.0923975e-04 -6.5183174e-04
 -2.0638751e-03  3.2214874e-03 -5.5095917e-03  4.0627550e-03
 -4.5546186e-03 -2.1431690e-04 -4.8075812e-03 -5.3252000e-03
 -1.8881884e-03 -1.5379010e-03 -3.3728560e-03 -3.6469745e-03
  5.1650684e-03  3.1540168e-03  3.3927790e-03 -5.6977815e-04
  5.7716673e-04  3.0867255e-03  2.0285179e-03  1.5525875e-03
 -3.0417431e-03 -2.0042835e-04  2.8212524e-03  3.5411720e-03
  3.2089865e-03 -4.0924284e-03  7.4126833e-04 -1.5213282e-03
  3.0659388e-03 -4.2737378e-03  3.8189387e-03 -1.5778744e-03
 -1.4482663e-04 -3.9257295e-03 -4.7147609e-04 -2.5935107e-04
  3.0449959e-03  2.70537

In [12]:
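# Top 5 most similar words to "man".
# With dm=0 and the default dbow_words=0, gensim does not train word vectors,
# so these scores largely reflect the random initialization.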
model_dbow.wv.most_similar("man", topn=5)

[('meat', 0.14971570670604706),
 ('food', 0.0940275639295578),
 ('eats', 0.06420493870973587),
 ('dog', 0.04808744788169861),
 ('bites', -0.19711486995220184)]

In [13]:
model_dbow.wv.n_similarity(["dog"],["man"])

0.04808744582323465
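
`n_similarity` returns the cosine similarity between the mean vectors of two word lists, which is why the single-word call above matches the score for `dog` in the `most_similar` output. The plain-Python sketch below illustrates that computation; `cosine`, `mean_vector`, `n_similarity_sketch`, and the toy vectors are hypothetical names and made-up numbers, not gensim code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def n_similarity_sketch(vectors_a, vectors_b):
    """Cosine similarity between the means of two sets of vectors."""
    return cosine(mean_vector(vectors_a), mean_vector(vectors_b))


# Made-up 2-d vectors, for illustration only.
toy = {"dog": [1.0, 0.0], "man": [0.0, 1.0], "bites": [1.0, 1.0]}

# Single word vs. single word: reduces to plain cosine similarity.
single = n_similarity_sketch([toy["dog"]], [toy["man"]])
# Two words vs. one word: the first list is averaged before comparing.
multi = n_similarity_sketch([toy["dog"], toy["man"]], [toy["bites"]])

print("dog vs man:", single)
print("[dog, man] vs [bites]:", multi)
```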

In [15]:
# DM (distributed memory): dm=1
#model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)
model_dm = Doc2Vec(tagged_data, min_count=1, dm=1)

print("Inference Vector of man eats food\n ",
      model_dm.infer_vector(['man', 'eats', 'food']))

print("Most similar words to man in our corpus\n",
      model_dm.wv.most_similar("man", topn=5))
print("Similarity between man and dog: ",
      model_dm.wv.n_similarity(["dog"], ["man"]))

Inference Vector of man eats food
  [-8.9369353e-04 -7.2097947e-04 -1.3077069e-03  4.3011000e-03
 -1.1055426e-03  4.5322534e-03  4.2835274e-03 -4.4452804e-03
 -1.6298419e-03 -7.8196434e-05  1.0140239e-03  1.7354690e-04
  3.4750437e-03  4.7489721e-03  4.9432544e-03  1.7002359e-03
  4.5833136e-03 -2.5901839e-03 -7.5883014e-05  1.0104821e-03
 -4.1430281e-03 -3.6695832e-03  2.5926216e-03  1.1491202e-03
  1.1015469e-03 -6.3824776e-04  7.4395444e-04  1.6525861e-05
 -2.1716817e-03  3.1743161e-03 -4.9198712e-03  4.3016486e-03
 -4.6971282e-03 -1.7302616e-04 -4.6032029e-03 -4.9301139e-03
 -1.8194761e-03 -8.1060466e-04 -3.5031340e-03 -3.8361144e-03
  4.8423083e-03  2.7547572e-03  3.8806605e-03 -9.9162245e-04
  1.1126343e-03  3.5640367e-03  2.0425448e-03  1.9070003e-03
 -2.7080572e-03 -3.8688310e-04  3.2736980e-03  3.0209492e-03
  2.4876369e-03 -3.5065734e-03  1.1925988e-03 -1.9386517e-03
  2.8000250e-03 -4.4494877e-03  3.9097210e-03 -1.8667324e-03
  7.1551220e-04 -4.4491235e-03 -1.9327666e-04 -3.

In [16]:
# What happens when we compare words that are not in the vocabulary?
model_dm.wv.n_similarity(['covid'],['man'])

KeyError: "word 'covid' not in vocabulary"
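
One way to avoid the `KeyError` is to check membership before comparing. The sketch below is a minimal illustration under stated assumptions: `safe_n_similarity` is a hypothetical helper, and it assumes the vectors object supports `in` membership tests and an `n_similarity` method, as `model_dm.wv` does. `ToyVectors` is a stand-in used only so the example runs without a trained model; its score is a placeholder, not a real similarity.

```python
def safe_n_similarity(wv, words_a, words_b):
    """Return wv.n_similarity(words_a, words_b), or None if any word is unknown."""
    missing = [word for word in words_a + words_b if word not in wv]
    if missing:
        return None
    return wv.n_similarity(words_a, words_b)


class ToyVectors:
    """Hypothetical stand-in exposing only the two members the helper relies on."""

    def __init__(self, vocab):
        self._vocab = set(vocab)

    def __contains__(self, word):
        return word in self._vocab

    def n_similarity(self, words_a, words_b):
        # Placeholder score; a trained model would return a cosine similarity.
        return 1.0


wv = ToyVectors(["dog", "bites", "man", "eats", "meat", "food"])

out_of_vocab = safe_n_similarity(wv, ["covid"], ["man"])
in_vocab = safe_n_similarity(wv, ["dog"], ["man"])

print("covid vs man:", out_of_vocab)
print("dog vs man:", in_vocab)
```

With a trained model, the call would be `safe_n_similarity(model_dm.wv, ['covid'], ['man'])`, which returns `None` instead of raising.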

# References
<hr style="border:2px solid black"> </hr>


- https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/07_DocVectors_using_averaging_Via_spacy.ipynb
- https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/08_Training_Dov2Vec_using_Gensim.ipynb
