## Doc2Vec
In this notebook, we demonstrate how to train a Doc2Vec model on a custom corpus.

In [1]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

# !pip install gensim==3.6.0
# !pip install spacy==2.2.4
# !pip install nltk==3.2.5

# ===========================

In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mccar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tokenize each document and tag it with its index in the corpus
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
               for i, doc in enumerate(data)]


In [5]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [6]:
# Distributed bag of words (DBOW): dm=0
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)


In [7]:
print(model_dbow.infer_vector(['man', 'eats', 'food']))  # inferred feature vector for "man eats food"

[ 0.01854673  0.02034257 -0.01173941  0.00879014  0.00221492  0.00485233
  0.01129949 -0.01174115  0.0134295  -0.00252968  0.0004261  -0.01701962
 -0.01295847  0.01288585  0.00518897 -0.00707532  0.00639923 -0.01267522
 -0.00673448 -0.0132132 ]


In [8]:
model_dbow.wv.most_similar("man", topn=5)  # top 5 words most similar to "man"

[('meat', 0.39641639590263367),
 ('bites', 0.05595849081873894),
 ('dog', 0.05017898976802826),
 ('food', -0.06502573937177658),
 ('eats', -0.2928890585899353)]

In [9]:
model_dbow.wv.n_similarity(["dog"],["man"])

0.050178975
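
`n_similarity` returns the cosine similarity between the mean vectors of the two word lists. With one word per list, this reduces to the cosine similarity between the two word vectors, which is why the value above matches the `dog` entry returned by `most_similar("man")`. The pure-Python sketch below illustrates that computation; the function names and the toy 3-dimensional vectors are made up for illustration and are not taken from the trained model or from gensim.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def set_similarity(vectors_1, vectors_2):
    """Cosine similarity between the mean vectors of two sets of vectors."""
    return cosine_similarity(mean_vector(vectors_1), mean_vector(vectors_2))

# Toy vectors with hypothetical values, standing in for learned word vectors
dog = [1.0, 0.0, 0.0]
man = [1.0, 1.0, 0.0]

# With a single vector on each side, this is plain cosine similarity
print(set_similarity([dog], [man]))
```

Because the two lists are averaged before the comparison, passing several words per list compares the "centroids" of the two word sets rather than any individual pair.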

In [10]:
# Distributed memory (DM): dm=1
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [ 0.01854671  0.02034247 -0.01173931  0.00878998  0.00221495  0.00485226
  0.01129953 -0.01174116  0.01342962 -0.00252981  0.00042606 -0.01701963
 -0.01295856  0.01288576  0.00518894 -0.00707516  0.00639921 -0.0126752
 -0.00673455 -0.01321305]
Most similar words to man in our corpus
 [('meat', 0.39641639590263367), ('bites', 0.05595849081873894), ('dog', 0.05017898976802826), ('food', -0.06502573937177658), ('eats', -0.2928890585899353)]
Similarity between man and dog:  0.050178975


What happens when we compare words that are not in the vocabulary?

In [11]:
model_dm.wv.n_similarity(['covid'],['man'])

0.0
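
The `0.0` above does not mean the model has learned that `covid` and `man` are unrelated: `covid` never appeared in the training corpus, so the model has no vector for it. How an out-of-vocabulary word is handled may differ across gensim versions (some may raise a `KeyError` instead of returning a number), so it can be safer to check membership before comparing. A minimal sketch, assuming a hypothetical helper `split_by_vocabulary` and using a plain Python set as a stand-in for the model's vocabulary:

```python
def split_by_vocabulary(words, vocabulary):
    """Split words into those present in the vocabulary and those missing.

    `vocabulary` can be any container that supports the `in` operator.
    """
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w not in vocabulary]
    return known, unknown

# Stand-in vocabulary: the six distinct words of the four-sentence corpus above
corpus_vocabulary = {"dog", "bites", "man", "eats", "meat", "food"}

known, unknown = split_by_vocabulary(["covid", "man"], corpus_vocabulary)
print("known:", known)
print("unknown:", unknown)
```

With the trained model, the same helper could be called with `model_dm.wv` in place of the set, on the assumption that the word-vector object supports `word in model_dm.wv` in the installed gensim version. Similarities would then be computed only when no word ends up in `unknown`.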