# Applying Doc2Vec to TREC Disks 1-5 articles

In [1]:
import logging, sys
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]

## Basic Setup

Let's import the Doc2Vec module.

In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

## Preparing the corpus

First, download the data for all of TREC disks 1-5 from [here](https://trec.nist.gov/data/docs_eng.html).
Second, convert the TREC-format files into a TrecCorpus, which constructs a corpus from a TREC disk.

For more details on TrecCorpus, see the docstring of the TrecCorpus class.

In [3]:
from trec.treccorpus import TrecCorpus

trecdata_path = "F:/Corpus/TrecData/"
# Passing dictionary={} skips building the dictionary here, which avoids a wait of about 8 hours.
# The vocabulary is built anyway when training the Doc2Vec model, so there is no need to do it here.
trec = TrecCorpus(trecdata_path, dictionary={}) 

gensim.corpora.textcorpus - INFO - Input stream provided but dictionary already initialized
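
The implementation of TrecCorpus is not shown in this notebook. As a rough illustration of what converting the TREC format involves, here is a minimal sketch that assumes each document is wrapped in `<DOC>`, carries its ID in `<DOCNO>`, and keeps its body in `<TEXT>`. The helper `iter_trec_docs` and the sample string are hypothetical and are not part of TrecCorpus; real TREC collections differ in their tag sets, so treat this as a starting point only.

```python
import re

# Hypothetical helper, not part of TrecCorpus.
# Assumes <DOC>, <DOCNO> and <TEXT> tags; real collections use more tags.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def iter_trec_docs(raw):
    """Yield (doc_id, tokens) for every <DOC> block found in raw."""
    for doc in DOC_RE.finditer(raw):
        body = doc.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip blocks without an ID
        # Join all <TEXT> sections and drop any nested markup
        text = " ".join(m.group(1) for m in TEXT_RE.finditer(body))
        text = TAG_RE.sub(" ", text)
        # Naive tokenization: lowercase and split on whitespace
        yield docno.group(1), text.lower().split()

# Made-up sample in the assumed layout
sample = """<DOC>
<DOCNO> AP880212-0001 </DOCNO>
<TEXT>
Stocks rose <P>sharply</P> today.
</TEXT>
</DOC>
<DOC>
<DOCNO> WSJ870324-0001 </DOCNO>
<TEXT>
Oil prices fell.
</TEXT>
</DOC>"""

parsed = list(iter_trec_docs(sample))
print(parsed)
```

The whitespace tokenizer here is deliberately naive; a real pipeline would likely apply proper tokenization or lemmatization instead.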


Define a **TaggedTrecDocument** class to convert TrecCorpus into a form suitable for Doc2Vec.

In [4]:
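class TaggedTrecDocument(object):
    """Wrap a TrecCorpus so that it yields TaggedDocument objects for Doc2Vec."""
    def __init__(self, trec):
        self.trec = trec
        # With metadata enabled, get_texts() yields (content, (doc_id, title))
        self.trec.metadata = True
    def __iter__(self):
        for content, (doc_id, title) in self.trec.get_texts():
            # Tag each document with its TREC document ID
            yield TaggedDocument(content, [doc_id])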

In [None]:
documents = TaggedTrecDocument(trec)
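
The wrapper is a class with `__iter__` rather than a plain generator because Doc2Vec has to pass over the corpus more than once: once in `build_vocab` and again in every training epoch. The sketch below illustrates the difference on toy data. To run without gensim, it uses a `namedtuple` stand-in with the same two fields as gensim's `TaggedDocument` (`words`, `tags`); `TaggedSample`, `tagged_generator` and `SAMPLE` are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags),
# so this sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical in-memory data in the (content, (doc_id, title)) shape
SAMPLE = [
    (["stocks", "rose"], ("DOC-1", "Markets")),
    (["oil", "fell"], ("DOC-2", "Energy")),
]

class TaggedSample:
    """Restartable: every call to __iter__ starts a fresh pass."""
    def __init__(self, data):
        self.data = data
    def __iter__(self):
        for content, (doc_id, _title) in self.data:
            yield TaggedDocument(content, [doc_id])

def tagged_generator(data):
    """One-shot: exhausted after a single pass."""
    for content, (doc_id, _title) in data:
        yield TaggedDocument(content, [doc_id])

restartable = TaggedSample(SAMPLE)
one_shot = tagged_generator(SAMPLE)

# Count the documents seen on two consecutive passes over each source
passes_restartable = [len(list(restartable)) for _ in range(2)]
passes_one_shot = [len(list(one_shot)) for _ in range(2)]
print(passes_restartable, passes_one_shot)
```

The generator yields nothing on its second pass, which in a real run would mean training on an empty corpus after the vocabulary scan.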

In [3]:
from utils import read_content

class TaggedLemmaDocument(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for content, (doc_id, title) in read_content(self.filename):
            yield TaggedDocument(content, [doc_id])

In [4]:
lemmatized = "f:/Corpus/lemmatized_trec_all.dat"
documents = TaggedLemmaDocument(lemmatized)

## Preprocessing
To match the vocabulary size of the original paper, we first calculate the optimal **min_count** parameter.
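
The notebook does not include the code for this step. One possible approach, sketched here under stated assumptions, is to count word frequencies over the corpus and pick the smallest `min_count` whose surviving vocabulary fits a target size. The function names, the toy documents and the target size are all hypothetical; the real target would come from the original paper. The sketch relies on `min_count` keeping words whose total frequency is at least that value.

```python
from collections import Counter

def vocab_size_at(freqs, min_count):
    """Number of words whose total frequency is at least min_count."""
    return sum(1 for count in freqs.values() if count >= min_count)

def smallest_min_count(freqs, target_vocab_size):
    """Smallest min_count whose vocabulary has at most target_vocab_size words."""
    min_count = 1
    while vocab_size_at(freqs, min_count) > target_vocab_size:
        min_count += 1
    return min_count

# Toy corpus; in the notebook this would be the words of each TaggedDocument
toy_docs = [
    ["a", "b", "c", "a"],
    ["a", "b", "d"],
    ["a", "b", "c"],
]
freqs = Counter(word for doc in toy_docs for word in doc)

# Hypothetical target size, chosen only for this toy example
chosen = smallest_min_count(freqs, target_vocab_size=2)
print(chosen, vocab_size_at(freqs, chosen))
```

On a corpus of this scale, the frequency counts only need to be computed once; trying different thresholds afterwards is cheap.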

## Training the Doc2Vec Model
Doc2Vec can be trained with several methods, such as DBOW and DM. Here we define a DBOW model; the DM variant and a list of models are kept below as commented-out alternatives.

In [5]:
cores = multiprocessing.cpu_count()

# PV-DBOW (dbow_words=1 also trains word vectors alongside the document vectors)
model = Doc2Vec(dm=0, dbow_words=1, vector_size=1000, window=8, min_count=20, epochs=10, workers=cores)

# PV-DM w/average
# model = Doc2Vec(dm=1, dm_mean=1, vector_size=1000, window=8, min_count=20, epochs=10, workers=cores)
# 
# models = [
#     # PV-DBOW
#     Doc2Vec(dm=0, dbow_words=1, vector_size=300, window=8, min_count=10, epochs=10, workers=cores),
#     # PV-DM w/average
#     # Doc2Vec(dm=1, dm_mean=1, vector_size=300, window=8, min_count=10, epochs=10, workers=cores),
# ]

In [6]:
%%time

model.build_vocab(documents)
print(str(model))
# models[1].reset_from(models[0])
# print(str(models[1]))

gensim.models.doc2vec - INFO - collecting all words and their counts
gensim.models.doc2vec - INFO - PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #10000, processed 2380229 words (1985982/s), 53110 word types, 10000 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #20000, processed 4766904 words (2670516/s), 74249 word types, 20000 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #30000, processed 7123010 words (2705687/s), 91948 word types, 30000 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #40000, processed 9521068 words (2689660/s), 106236 word types, 40000 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #50000, processed 11856370 words (2609621/s), 118372 word types, 50000 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #60000, processed 14265842 words (2003478/s), 129561 word types, 60000 tags
gensim.models.doc2vec - INFO - PROGRESS: at example #70000, processe

Doc2Vec(dbow+w,d1000,n5,w8,mc20,s0.001,t12)
Wall time: 3min 43s


Now we're ready to train Doc2Vec on TREC disks 1-5.

In [7]:
%%time 
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

gensim.models.base_any2vec - INFO - training model with 12 workers on 151574 vocabulary and 1000 features, using sg=1 hs=0 sample=0.001 negative=5 window=8
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.03% examples, 75251 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.08% examples, 117354 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.12% examples, 113515 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.17% examples, 123054 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.20% examples, 126288 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.24% examples, 127256 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO - EPOCH 1 - PROGRESS: at 0.27% examples, 126646 words/s, in_qsize 23, out_qsize 0
gensim.models.base_any2vec - INFO

Wall time: 7h 4min 2s

## Save trained model

In [8]:
fname = "F:/Models/Lemma_DBOW_2_doc2vec_trec_d1000_n5_w8_mc20_t12_e10_dbow.model"
model.save(fname)

gensim.utils - INFO - saving Doc2Vec object under F:/Models/Lemma_DBOW_2_doc2vec_trec_d1000_n5_w8_mc20_t12_e10_dbow.model, separately None
gensim.utils - INFO - storing np array 'syn1neg' to F:/Models/Lemma_DBOW_2_doc2vec_trec_d1000_n5_w8_mc20_t12_e10_dbow.model.trainables.syn1neg.npy
gensim.utils - INFO - storing np array 'vectors' to F:/Models/Lemma_DBOW_2_doc2vec_trec_d1000_n5_w8_mc20_t12_e10_dbow.model.wv.vectors.npy
gensim.utils - INFO - storing np array 'vectors_docs' to F:/Models/Lemma_DBOW_2_doc2vec_trec_d1000_n5_w8_mc20_t12_e10_dbow.model.docvecs.vectors_docs.npy
gensim.utils - INFO - saved F:/Models/Lemma_DBOW_2_doc2vec_trec_d1000_n5_w8_mc20_t12_e10_dbow.model


## Similarity interface

After that, let's test both models! The DBOW model shows results similar to those in the original paper. First, we calculate the cosine similarity for "Machine learning" using Paragraph Vectors. Word Vectors and Document Vectors are stored separately, so we have to add .docvecs after the model name to extract Document Vectors from the Doc2Vec model.

In [None]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))

The DBOW model interprets 'Machine learning' as part of the Computer Science field, while the DM model treats it as a Data Science related field.

Second, we calculate the cosine similarity for "Lady Gaga" using Paragraph Vectors.

In [None]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Lady Gaga"], topn=10))

The DBOW model returns similar singers in the U.S., while the DM model finds that many of Lady Gaga's songs are similar to the word "Lady Gaga".

Third, we calculate the cosine similarity for "Lady Gaga" - "American" + "Japanese" using Document Vectors and Word Vectors. "American" and "Japanese" are Word Vectors, not Paragraph Vectors. Word Vectors are already lowercased by WikiCorpus.

In [None]:
for model in models:
    print(str(model))
    vec = [model.docvecs["Lady Gaga"] - model.wv["american"] + model.wv["japanese"]]
    pprint([m for m in model.docvecs.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])

As a result, the DBOW model returns artists in Japan similar to Lady Gaga, such as 'Perfume', the most famous idol in Japan. On the other hand, the DM model's results don't include Japanese artists among the top 10 similar documents. They are almost the same as the results obtained without any vector arithmetic.

These results demonstrate that DBOW, the method employed in the original paper, is outstanding for calculating similarity between Document Vectors and Word Vectors.
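
The queries above combine two mechanics: vector arithmetic between a Document Vector and Word Vectors, and ranking by cosine similarity. The following is a minimal pure-Python sketch of both on made-up 2-dimensional vectors. The names `cosine`, `most_similar`, `doc_vectors` and `word_vectors` are hypothetical and only mirror the idea; this is not gensim's implementation, and the numbers carry no meaning beyond the illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Rank the tags in vectors by cosine similarity to query, best first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors standing in for trained document and word vectors
doc_vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
word_vectors = {"x": [1.0, 0.0], "y": [0.0, 1.0]}

# Same shape as the "Lady Gaga" - "american" + "japanese" query: doc_a - x + y
query = [
    d - x + y
    for d, x, y in zip(doc_vectors["doc_a"], word_vectors["x"], word_vectors["y"])
]

ranked = most_similar(query, doc_vectors)
print(ranked)
```

Subtracting one word direction and adding another moves the query away from the starting document, which is why the starting document itself can drop down the ranking.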