# embed-text-doc2vec

Based on https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5

First, let's install some dependencies. A guide to doing this from within Jupyter: https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

In [1]:
# Install a conda package in the current Jupyter kernel
import sys
!conda install --yes --prefix {sys.prefix} gensim nltk

Fetching package metadata .........
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /Users/m/anaconda3/envs/parse-html:
#
gensim                    2.3.0               np113py36_0  
nltk                      3.2.4                    py36_0  


In [1]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

Let’s prepare data for training our doc2vec model

In [2]:
data_dir = '../../data/'

# our list of documents
data = []

In [3]:
import glob
txt_files = glob.glob(f"{data_dir}/*.txt")
print(len(txt_files))

141


In [4]:
# show an example of just the filename without the path
import os
os.path.basename(txt_files[0])

'Trieu et al. - 2017 - News Classification from Social Media Using Twitte.txt'

In [5]:
for path in txt_files:
    # the with-block closes the file automatically, so no explicit close() is needed
    with open(path, 'r', encoding="utf-8") as f:
        data.append(f.read())

print(len(data))

141


In [6]:
from random import randrange
# randrange excludes its upper bound, so this can pick any index in data
random_index = randrange(len(data))

# print the first 1000 characters of a random document from our corpus
print(data[random_index][0:1000])

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2016.2557324, IEEE
Transactions on Knowledge and Data Engineering
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 11, NO. 12, 2015

1

Interactive Visualization
of Large Data Sets
Parke Godfrey ∗ , Jarek Gryz ∗ , Piotr Lasek ∗ †
{godfrey, jarek, plasek}@cse.yorku.ca
York University, Canada ∗
Rzeszów University, Poland †
Abstract—Visualization provides a powerful means for data analysis. But to be practical, visual analytics tools must support smooth
and flexible use of visualizations at a fast rate. This becomes increasingly onerous with the ever-increasing size of real-world datasets.
First, large databases make interaction more difficult once query response time exceeds several seconds. Second, any attempt to show
all data points will overload the visualization, resulting in ch

In [7]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), 
    tags=[str(i)]) for i, _d in enumerate(data)]

Here we have a list of 141 documents as training data. Each document has been lowercased, tokenized, and tagged with its index in `data`, so it's ready for training. Let's start training our model.
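
To make the shape of the tagged data concrete, here is a minimal sketch that does not need gensim or nltk. It assumes a stand-in `TaggedDoc` namedtuple with the same `words` and `tags` fields as `TaggedDocument`, and uses `str.split` as a crude substitute for `word_tokenize`. The two toy strings are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (same field names).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_documents(texts):
    """Lowercase and split each text, then tag it with its index as a string."""
    return [
        TaggedDoc(words=text.lower().split(), tags=[str(i)])
        for i, text in enumerate(texts)
    ]


# Made-up example texts, not taken from the corpus.
toy_texts = [
    "Visualization of Large Data Sets",
    "News Classification from Social Media",
]
toy_tagged = tag_documents(toy_texts)
print(toy_tagged[1])
```

Each entry pairs a token list with a one-element tag list, and the tag is what later identifies the document's vector in the trained model.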

In [8]:
max_epochs = 100
vec_size = 20
alpha = 0.025
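# `size` is the parameter name in the gensim 2.3.0 installed above; newer releases call it `vector_size`
model = Doc2Vec(size=vec_size,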
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

In [9]:
model.build_vocab(tagged_data)

In [None]:
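for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    # model.iter is the number of passes per train() call in gensim 2.3.0
    # (newer releases name it model.epochs)
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate; the step must keep alpha positive for all
    # max_epochs iterations (0.025 - 100 * 0.0002 = 0.005), whereas a step of
    # 0.002 would turn alpha negative after 13 iterations
    model.alpha -= 0.0002
    # keep the learning rate constant within each train() call
    model.min_alpha = model.alpha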
    
model.save("d2v.model")
print("Model d2v.model Saved")

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
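
The loop above lowers `model.alpha` by a fixed step after every iteration, so the step has to be small enough for the learning rate to stay positive across all `max_epochs` iterations. As a quick sanity check, the following sketch (plain Python, no gensim; `alpha_schedule` and `first_non_positive` are hypothetical helper names) lists the learning rate used in each iteration for a given starting value and step.

```python
def alpha_schedule(start_alpha, step, epochs):
    """Return the learning rate used in each iteration under a fixed decrement."""
    return [start_alpha - step * epoch for epoch in range(epochs)]


def first_non_positive(schedule):
    """Index of the first iteration whose rate is <= 0, or None if all are positive."""
    for epoch, alpha in enumerate(schedule):
        if alpha <= 0:
            return epoch
    return None


small_step = alpha_schedule(0.025, 0.0002, 100)
large_step = alpha_schedule(0.025, 0.002, 100)

print(round(small_step[-1], 4))        # learning rate in the final iteration
print(first_non_positive(small_step))  # None means the rate stays positive
print(first_non_positive(large_step))  # iteration at which the rate stops being positive
```

With a starting rate of 0.025 and 100 iterations, a step of 0.0002 stays positive throughout, while a step of 0.002 reaches a non-positive rate at iteration 13 (counting from 0).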


Note: `dm` selects the training algorithm: `dm=1` means ‘distributed memory’ (PV-DM) and `dm=0` means ‘distributed bag of words’ (PV-DBOW). PV-DM predicts a word from the document vector together with its surrounding context words, so it retains some local word-order information, whereas PV-DBOW predicts words from the document vector alone and ignores word order.
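
To illustrate the difference, the sketch below builds the (input, target) training examples each mode would learn from for one toy document. It is a simplification under stated assumptions: the function names and the `window` parameter are hypothetical, the toy sentence is made up, and real implementations such as gensim add details like negative sampling or hierarchical softmax that are omitted here.

```python
def pv_dm_examples(doc_tag, words, window=2):
    """PV-DM: predict each word from the document tag plus nearby context words."""
    examples = []
    for pos, target in enumerate(words):
        left = words[max(0, pos - window):pos]
        right = words[pos + 1:pos + 1 + window]
        examples.append(((doc_tag, tuple(left + right)), target))
    return examples


def pv_dbow_examples(doc_tag, words):
    """PV-DBOW: predict each word from the document tag alone, with no context words."""
    return [(doc_tag, target) for target in words]


# Made-up toy document.
words = ["visual", "analytics", "tools", "must", "be", "fast"]

for example in pv_dm_examples("0", words)[:3]:
    print("PV-DM  ", example)
for example in pv_dbow_examples("0", words)[:3]:
    print("PV-DBOW", example)
```

Shuffling `words` changes the PV-DM inputs but leaves the collection of PV-DBOW examples unchanged, which is the sense in which PV-DBOW ignores word order.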

So we have saved the model and it’s ready to use. Let’s play with it.

In [None]:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("d2v.model")

# to find the vector of a document which is not in the training data
test_data = word_tokenize("I love chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)




In [None]:
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
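
`most_similar` ranks the other documents by cosine similarity to the query document's vector. Here is a minimal pure-Python sketch of that ranking, assuming a small dictionary of made-up 3-dimensional vectors in place of the trained model (`rank_by_cosine` is a hypothetical helper, not a gensim function):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_tag, vectors, topn=10):
    """Return (tag, similarity) pairs for all other tags, most similar first."""
    query = vectors[query_tag]
    scores = [
        (tag, cosine_similarity(query, vec))
        for tag, vec in vectors.items()
        if tag != query_tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors keyed by document tag.
toy_vectors = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.9, 0.1, 0.0],
    "2": [0.0, 1.0, 0.0],
    "3": [-1.0, 0.0, 0.0],
}
print(rank_by_cosine("1", toy_vectors))
```

Because cosine similarity depends only on direction, two document vectors of very different lengths can still count as near neighbours.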

In [None]:
# to find vector of doc in training data using tags
# or in other words printing the vector of the document 
# at index 1 in the training data
print(model.docvecs['1'])

In [None]:
# how many dimensions does our doc2vec document space have?
dimensions = len(model.docvecs['1'])
print(dimensions)

Cool! This dimensionality is determined by the `vec_size` parameter we specified at training time.

In [None]:
# create column headers for csv file: 'doc' followed by one column per dimension
headers = ['doc'] + [f"v{i}" for i in range(dimensions)]
    
print(headers)

In [None]:
# retrieve vectors of all documents in training data
# write vectors to a csv file
import csv
import os

with open('document-vectors.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"')
    writer.writerow(headers)

    # documents were tagged with their index, so vector i belongs to txt_files[i]
    for i, path in enumerate(txt_files):
        doc_name = os.path.basename(path)
        vec = list(model.docvecs[i])
        writer.writerow([doc_name] + vec)


In [56]:
# read vectors in from csv file
import csv

imported_vectors = []

with open('document-vectors.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        imported_vectors.append(row)
        
print(imported_vectors[0:2])

[['doc', 'v0', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19'], ['Trieu et al. - 2017 - News Classification from Social Media Using Twitte.txt', '37.8572', '9.51042', '-12.9113', '-15.5441', '33.3247', '-21.667', '-26.5227', '5.00173', '-6.14793', '48.2965', '50.8205', '15.4903', '-5.50221', '-20.1339', '-38.5444', '5.88733', '6.06908', '9.33432', '-25.3571', '-12.2529']]
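
As the output above shows, `csv.reader` returns every field as a string, including the vector components. Here is a minimal sketch of turning such rows back into numeric vectors keyed by document name. It reads from an in-memory string with made-up values so that it runs on its own, and `rows_to_vectors` is a hypothetical helper name.

```python
import csv
import io


def rows_to_vectors(rows):
    """Skip the header row and map each document name to a list of floats."""
    return {row[0]: [float(value) for value in row[1:]] for row in rows[1:]}


# Made-up CSV content with the same layout as document-vectors.csv.
toy_csv = 'doc,v0,v1\n"a, b.txt",1.5,-2.0\nc.txt,0.25,3.0\n'
toy_rows = list(csv.reader(io.StringIO(toy_csv), delimiter=',', quotechar='"'))

parsed = rows_to_vectors(toy_rows)
print(parsed)
```

The same helper could be applied to `imported_vectors` before passing the vectors to a numeric routine such as a dimensionality reduction.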


In [37]:
# project from 20D to 2D with t-SNE

In [31]:
# visualize t-SNE projection

In [38]:
# project from 20D to 2D with UMAP

In [33]:
# visualize UMAP projection