# embed-text-doc2vec

Based on https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5

First, let's install some dependencies. A guide to installing packages from within Jupyter: https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

In [1]:
# Install a conda package in the current Jupyter kernel
import sys
!conda install --yes --prefix {sys.prefix} gensim nltk

Fetching package metadata .........
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /Users/m/anaconda3/envs/parse-html:
#
gensim                    2.3.0               np113py36_0  
nltk                      3.2.4                    py36_0  


In [2]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

Let’s prepare data for training our doc2vec model

In [6]:
data_dir = '../../data/'

# our list of documents
data = []

In [10]:
import glob
txt_files = glob.glob(f"{data_dir}/*.txt")
print(len(txt_files))

141


In [51]:
# show an example of just the filename without the path
txt_files[0][11:]

'Trieu et al. - 2017 - News Classification from Social Media Using Twitte.txt'
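
The slice `[11:]` works only because the prefix `'../../data/'` happens to be 11 characters long. As a possible alternative, here is a minimal sketch using `os.path.basename` from the standard library, which strips the directory whatever its length. The example path is made up for illustration and is not one of the corpus files.

```python
import os

# hypothetical path, shaped like the ones glob returns above
example_path = '../../data/Some Paper - 2017 - Title.txt'

# slicing depends on the exact length of the directory prefix
by_slice = example_path[len('../../data/'):]

# basename does not care how long the prefix is
by_basename = os.path.basename(example_path)

print(by_slice)
print(by_basename)
```

Both lines print the same filename here; the difference only shows up once `data_dir` changes.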

In [11]:
for path in txt_files:
    # the with block closes the file automatically
    with open(path, 'r', encoding="utf-8") as f:
        data.append(f.read())

print(len(data))

141


In [19]:
from random import randrange
# randrange excludes its upper bound, so this covers every index in data
random_index = randrange(len(data))

# print the first 1000 characters of a random document from our corpus
print(data[random_index][0:1000])

Bubbleworld
A New Visual Information Retrieval Technique
Christopher Van Berendonck

Timothy Jacobs

Chris.Vanberendonck@defence.gov.au

Timothy.Jacobs@afit.edu

Air Force Institute of Technology
Wright-Patterson Air Force Base
Ohio 45433 USA

Abstract
Visualisation has significant advantages over traditional textual
lists for improving cognition in information retrieval. To realise
these advantages, we identify a set of cognitive principles and
usage patterns for information retrieval. We apply these
principles and patterns to the design of a prototype visual
information retrieval system, Bubbleworld. In Bubbleworld, we
apply a variety of visual techniques that successfully transform
the internal mental representations of the information retrieval
problem to an efficient external view and, through visual cues,
provide cognitive amplification at key stages of the information
retrieval process. We enhance the knowledge acquisition
process by providing query refinement and interaction
te

In [20]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), 
    tags=[str(i)]) for i, _d in enumerate(data)]

Here we have a list of 141 documents as training data. Each document is now tokenized and tagged with its index in `data`, so it's ready for training. Let's start training our model.
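
To make the tagging step concrete, here is a small sketch of what each element of `tagged_data` looks like. It assumes gensim's `TaggedDocument` behaves like a named tuple with `words` and `tags` fields. To run without gensim or nltk, it uses a stand-in named tuple (`TaggedDoc`, a made-up name) and plain `str.split` instead of `word_tokenize`, so tokens will differ from NLTK's around punctuation. The two toy documents are invented.

```python
from collections import namedtuple

# stand-in for gensim's TaggedDocument (assumption: same two fields)
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

# invented toy corpus, not the real data
toy_data = ["Bubbleworld A New Visual Technique",
            "News Classification from Social Media"]

# same shape as the list comprehension above: lowercase, tokenize, tag by index
toy_tagged = [TaggedDoc(words=d.lower().split(), tags=[str(i)])
              for i, d in enumerate(toy_data)]

for doc in toy_tagged:
    print(doc.tags, doc.words)
```

The tag is the string form of the document's position, which is why later cells look documents up with keys like `'1'`.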

In [21]:
max_epochs = 100
vec_size = 20
alpha = 0.025
# gensim 2.x API (the version installed above); newer releases call this parameter vector_size
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

In [22]:
model.build_vocab(tagged_data)

In [23]:
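for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    # each call runs model.iter passes over the corpus
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate by a fixed step after every call;
    # note that 100 steps of 0.002 from 0.025 push alpha below zero
    model.alpha -= 0.002
    # fix the learning rate, no decay within a call
    model.min_alpha = model.alpha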
    
model.save("d2v.model")
print("Model d2v.model Saved")

iteration 0
iteration 1
iteration 2
...
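
One thing worth checking in the loop above is where the learning rate ends up. The following sketch only replays the arithmetic (start at 0.025, subtract 0.002 after every pass, 100 passes); it does not touch gensim. It suggests that alpha drops below zero partway through, which may be why the document vector printed further down has such large values. A commonly recommended alternative, stated here as an assumption rather than something this notebook tested, is a single `model.train(...)` call with `epochs=max_epochs`, letting gensim decay the rate from `alpha` to `min_alpha` on its own.

```python
alpha = 0.025      # starting learning rate, as in the notebook
step = 0.002       # amount subtracted after every pass
max_epochs = 100

first_negative = None
for epoch in range(max_epochs):
    # alpha here is the rate that pass number `epoch` would train with
    if alpha < 0 and first_negative is None:
        first_negative = epoch
    alpha -= step

print("first pass with a negative learning rate:", first_negative)
print("learning rate after the last pass:", round(alpha, 3))
```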

Note: `dm` selects the training algorithm. `dm=1` means ‘distributed memory’ (PV-DM) and `dm=0` means ‘distributed bag of words’ (PV-DBOW). PV-DM predicts each word from its surrounding context words together with the document vector, so it keeps some local word-order information, whereas PV-DBOW predicts words from the document vector alone and ignores word order entirely.
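
To illustrate the difference, here is a deliberately simplified sketch of the kind of prediction tasks each mode sets up. The function names are made up, the window is fixed rather than sampled, and nothing here trains anything or calls gensim; it only lists which inputs would be used to predict which word.

```python
def pv_dm_examples(tag, words, window=2):
    """(inputs, target) pairs: doc tag plus neighbouring words predict the centre word."""
    examples = []
    for pos, target in enumerate(words):
        left = words[max(0, pos - window):pos]
        right = words[pos + 1:pos + 1 + window]
        examples.append(([tag] + left + right, target))
    return examples


def pv_dbow_examples(tag, words):
    """(input, target) pairs: the doc tag alone predicts every word in the document."""
    return [(tag, w) for w in words]


words = "i love chatbots".split()
dm_pairs = pv_dm_examples("0", words, window=1)
dbow_pairs = pv_dbow_examples("0", words)

print(dm_pairs)
print(dbow_pairs)
```

In the PV-DM pairs the inputs change with position in the text, while the PV-DBOW pairs would be the same set for any shuffling of the words.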

So we have saved the model and it’s ready to use. Let's play with it.

In [24]:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("d2v.model")

# infer a vector for a document that is not in the training data
test_data = word_tokenize("I love chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)




V1_infer [-0.08711238  0.30913809  0.22616503  0.11435614 -0.01954506  0.08124611
 -0.15061171  0.04631089  0.08107837 -0.00901058 -0.13561398  0.08290039
  0.14645907  0.01209032  0.37373963  0.28964242 -0.12185849  0.10028809
  0.38866341 -0.07192592]


In [25]:
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)

[('70', 0.9756021499633789), ('89', 0.6453909873962402), ('101', 0.6296983361244202), ('30', 0.6239702701568604), ('98', 0.6230168342590332), ('121', 0.6164370179176331), ('82', 0.6149293184280396), ('50', 0.6094334721565247), ('122', 0.6089662909507751), ('127', 0.605032205581665)]
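
The scores returned by `most_similar` are cosine similarities between document vectors. Here is a minimal standard-library sketch of that measure, on made-up 3D vectors rather than the model's real ones:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = [1.0, 2.0, 3.0]
doc_b = [2.0, 4.0, 6.0]    # same direction as doc_a, twice as long
doc_c = [3.0, 0.0, -1.0]   # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because only direction matters, a score close to 1 (like document 70 above) means the two vectors point almost the same way, regardless of their lengths.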


In [26]:
# to find vector of doc in training data using tags
# or in other words printing the vector of the document 
# at index 1 in the training data
print(model.docvecs['1'])

[ 20.57692528 -54.21049118 -13.5384798  -22.32372665 -36.1111412
  25.23378944   2.2354517   22.41984367   4.04739571  -1.5552386
   3.65032578   0.99818057   5.76238346  -1.77558267 -46.97889328
   6.29866123 -38.61899567  10.46997738 -50.46998978 -69.23514557]


In [41]:
# how many dimensions does our doc2vec document space have?
dimensions = len(model.docvecs['1'])
print(dimensions)

20


Cool! This dimensionality is determined by the `vec_size` parameter we specified at training time.

In [45]:
# create column headers for csv file
headers = ['doc'] + [f"v{i}" for i in range(dimensions)]
    
print(headers)

['doc', 'v0', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19']


In [52]:
# retrieve vectors of all documents in training data
# write vectors to a csv file
import csv

with open('document-vectors.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"')
    writer.writerow(headers)
    
    # documents were tagged with str(i) in the same order as txt_files
    for i, path in enumerate(txt_files):
        # strip the 11-character '../../data/' prefix
        doc_name = path[11:]
        vec = list(model.docvecs[str(i)])
        writer.writerow([doc_name] + vec)


In [55]:
# read vectors in from csv file
import csv

imported_vectors = []

# read with the same encoding the file was written with
with open('document-vectors.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        imported_vectors.append(row)
        
print(imported_vectors[0:2])

[['doc', 'v0', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19']]
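
`csv.reader` returns every field as a string, so the vectors need converting back to floats before any numeric work such as a projection. Below is a sketch, assuming the file has the layout written above: a header row, then one row per document with the name followed by the vector components. It reads from an in-memory string with invented values, truncated to 3 dimensions, so it runs without the real file.

```python
import csv
import io

# invented stand-in for document-vectors.csv
fake_csv = io.StringIO(
    "doc,v0,v1,v2\n"
    "\"Doe, 2017 - Title.txt\",0.5,-1.25,3.0\n"
    "Other - 2018 - Title.txt,2.0,0.0,-0.75\n"
)

reader = csv.reader(fake_csv, delimiter=',', quotechar='"')
header = next(reader)   # first row holds the column names

names, vectors = [], []
for row in reader:
    names.append(row[0])                         # document name
    vectors.append([float(x) for x in row[1:]])  # components as floats

print(header)
print(names)
print(vectors)
```

Keeping `names` and `vectors` as parallel lists means row `i` of the numeric data can still be labeled with its document name later.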


In [37]:
# project from 20D to 2D with t-SNE

In [31]:
# visualize t-SNE projection

In [38]:
# project from 20D to 2D with UMAP

In [33]:
# visualize UMAP projection