## Doc2Vec
In this notebook we demonstrate how to train a doc2vec model on a custom corpus.

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install gensim==3.6.0
!pip install spacy==2.2.4
!pip install nltk==3.2.5

# ===========================



In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
import warnings
warnings.filterwarnings('ignore')
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from pprint import pprint
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
data = ["dog bites man",
        "man bites dog",
        "dog eats meat",
        "man eats food"]

# Tokenize each sentence and tag it with its index in the corpus
tagged_data = [TaggedDocument(words=word_tokenize(sentence.lower()), tags=[str(i)]) for i, sentence in enumerate(data)]


In [5]:
tagged_data

[TaggedDocument(words=['dog', 'bites', 'man'], tags=['0']),
 TaggedDocument(words=['man', 'bites', 'dog'], tags=['1']),
 TaggedDocument(words=['dog', 'eats', 'meat'], tags=['2']),
 TaggedDocument(words=['man', 'eats', 'food'], tags=['3'])]

In [6]:
# DBOW (distributed bag of words): dm=0
model_dbow = Doc2Vec(tagged_data, vector_size=20, min_count=1, epochs=2, dm=0)


In [7]:
# Infer a feature vector for the document "man eats food"
print(model_dbow.infer_vector(['man', 'eats', 'food']))

[-1.0145655e-02 -5.4906374e-03 -2.1160658e-02 -1.1651787e-02
  3.5484456e-03 -7.0642815e-03 -9.2761237e-03 -2.8323701e-03
  2.3504248e-02 -9.2086571e-05  2.2652479e-02 -8.9776609e-03
  1.1970512e-02 -1.1935705e-02  1.3459495e-02 -2.2505892e-02
  1.8962136e-02 -1.0934918e-02  1.7853366e-02 -1.4978001e-02]


In [8]:
model_dbow.wv.most_similar("man", topn=5)  # top 5 most similar words

[('dog', 0.2630311846733093),
 ('eats', 0.23952406644821167),
 ('food', -0.11896046996116638),
 ('meat', -0.2617309093475342),
 ('bites', -0.306953489780426)]

In [9]:
model_dbow.wv.n_similarity(["dog"], ["man"])

0.26303118
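
`n_similarity` compares two lists of words by taking the cosine similarity between the mean vector of each list. For single-word lists this is the plain cosine similarity of the two word vectors, which is why the value above matches the score `most_similar` reported for "dog". The sketch below reproduces that computation in plain Python. The helpers `mean_vector` and `cosine_similarity` and the 3-dimensional `toy_vectors` are made up for illustration; they are not gensim functions or values from the trained model.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only (not taken from the model above)
toy_vectors = {
    "dog": [1.0, 0.0, 1.0],
    "man": [1.0, 1.0, 0.0],
}

# Same shape of computation as n_similarity(["dog"], ["man"])
sim = cosine_similarity(
    mean_vector([toy_vectors["dog"]]),
    mean_vector([toy_vectors["man"]]),
)
print(round(sim, 3))  # 0.5
```
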

In [10]:
# DM (distributed memory): dm=1
model_dm = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2, dm=1)

print("Inference Vector of man eats food\n ",model_dm.infer_vector(['man','eats','food']))

print("Most similar words to man in our corpus\n",model_dm.wv.most_similar("man",topn=5))
print("Similarity between man and dog: ",model_dm.wv.n_similarity(["dog"],["man"]))

Inference Vector of man eats food
  [-1.01456400e-02 -5.49062993e-03 -2.11605523e-02 -1.16518466e-02
  3.54836439e-03 -7.06422143e-03 -9.27604642e-03 -2.83227302e-03
  2.35041156e-02 -9.20040839e-05  2.26525515e-02 -8.97767674e-03
  1.19706187e-02 -1.19358245e-02  1.34595484e-02 -2.25058738e-02
  1.89621784e-02 -1.09350523e-02  1.78532843e-02 -1.49779590e-02]
Most similar words to man in our corpus
 [('dog', 0.2630311846733093), ('eats', 0.23952406644821167), ('food', -0.11896046996116638), ('meat', -0.2617309093475342), ('bites', -0.306953489780426)]
Similarity between man and dog:  0.26303118


What happens when we compare words that are not in the vocabulary?

In [11]:
model_dm.wv.n_similarity(['covid'],['man'])

KeyError: ignored
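
The `KeyError` arises because "covid" never appeared in the training corpus, so the model has no vector for it. One way to avoid the exception is to check vocabulary membership before computing a similarity. The sketch below assumes the word-vector object supports the `in` operator and an `n_similarity` method, as gensim's `model.wv` does. `safe_n_similarity` is a hypothetical helper, and `ToyVectors` is a made-up stand-in so the example runs without a trained model.

```python
def safe_n_similarity(wv, words_a, words_b):
    """Return wv.n_similarity(words_a, words_b), or None if any word is out of vocabulary."""
    missing = [w for w in list(words_a) + list(words_b) if w not in wv]
    if missing:
        return None
    return wv.n_similarity(words_a, words_b)


class ToyVectors:
    """Hypothetical stand-in for model.wv, used only to make this sketch runnable."""

    def __init__(self, vocab):
        self.vocab = set(vocab)

    def __contains__(self, word):
        return word in self.vocab

    def n_similarity(self, ws1, ws2):
        return 1.0  # placeholder score, not a real similarity


toy_wv = ToyVectors(["dog", "bites", "man", "eats", "meat", "food"])

print(safe_n_similarity(toy_wv, ["covid"], ["man"]))  # None: "covid" is out of vocabulary
print(safe_n_similarity(toy_wv, ["dog"], ["man"]))    # 1.0 (placeholder score)
```

With the model trained in this notebook, the equivalent call would be `safe_n_similarity(model_dm.wv, ['covid'], ['man'])`.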