# Doc2Vec on Wikipedia articles

We run an experiment similar to the one in **Document Embedding with Paragraph Vectors** (http://arxiv.org/abs/1507.07998).
The paper shows only DBOW results on Wikipedia data, so we replicate the experiment using not only DBOW but also DM.

## Basic Setup

Let's import the Doc2Vec module.

In [1]:
import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

## Preparing the corpus

First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/) (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps).

Second, convert the articles to a WikiCorpus. WikiCorpus constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.

For more details on WikiCorpus, see [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html).

In [2]:
#wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
#wiki = WikiCorpus("enwiki-YYYYMMDD-pages-articles.xml.bz2")
#wiki.save("enwikicorpus")
wiki = WikiCorpus.load("enwikicorpus")

Define a **TaggedWikiDocument** class that converts the WikiCorpus into a form suitable for Doc2Vec.

In [3]:
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])

In [None]:
documents = TaggedWikiDocument(wiki)
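
The corpus is read more than once (once to build the vocabulary, then again for every training pass), which is why the wrapper is a class with `__iter__` rather than a one-shot generator. The sketch below illustrates the difference without gensim; `ToyTaggedDocument`, `ToyTaggedCorpus`, and the two toy articles are hypothetical stand-ins, not part of the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument so the sketch runs without gensim
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])


class ToyTaggedCorpus(object):
    """Re-iterable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, articles):
        # articles: list of (title, tokens) pairs, made up for illustration
        self.articles = articles

    def __iter__(self):
        for title, tokens in self.articles:
            yield ToyTaggedDocument(tokens, [title])


toy_articles = [
    ("Machine learning", ["machine", "learning", "is", "a", "field"]),
    ("Lady Gaga", ["lady", "gaga", "is", "a", "singer"]),
]
toy_corpus = ToyTaggedCorpus(toy_articles)

# Two full passes over the same object both see every document
first_pass = [doc.tags[0] for doc in toy_corpus]
second_pass = [doc.tags[0] for doc in toy_corpus]

# A plain generator, by contrast, is exhausted after its first pass
one_shot = (ToyTaggedDocument(tokens, [title]) for title, tokens in toy_articles)
consumed = list(one_shot)
leftover = list(one_shot)  # nothing left for a second pass
```
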

## Training the Doc2Vec Model

First, we define the models to compare.

In [None]:
cores = multiprocessing.cpu_count()
vocab = 915715  # vocabulary size, following the paper

models = [
    # PV-DBOW 
    Doc2Vec(dm=0, dbow_words=1, size=200, window=5, min_count=5, max_vocab_size=vocab, workers=cores),
    # PV-DM w/concatenation
    Doc2Vec(dm=1, dm_concat=1, size=200, window=5, min_count=5, max_vocab_size=vocab, workers=cores),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=200, window=5, min_count=5, max_vocab_size=vocab, workers=cores),
]
models[0].build_vocab(documents)
print(str(models[0]))
for model in models[1:]:
    model.reset_from(models[0])
    print(str(model))

Now we’re ready to train Doc2Vec on the English Wikipedia.

In [None]:
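alpha, min_alpha, passes = 0.025, 0.001, 10
alpha_delta = (alpha - min_alpha) / passes

for model in models:
    print(str(model))
    # Restart the learning-rate schedule for every model
    current_alpha = alpha
    for epoch in range(passes):
        # Hold the learning rate fixed within a pass and lower it between passes
        model.alpha, model.min_alpha = current_alpha, current_alpha
        # Train on the tagged corpus defined above
        model.train(documents)
        current_alpha -= alpha_delta
    # str(model) can contain '/' (e.g. for the DM variants), so make it file-name safe
    model.save(str(model).replace('/', '-'))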
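
The values of `alpha`, `min_alpha`, and `passes` above imply a fixed step of `alpha_delta` per pass. As a rough sketch, assuming the learning rate is lowered by that step after each pass, the resulting sequence of rates can be listed without touching any model. The helper name `linear_alpha_schedule` is hypothetical.

```python
def linear_alpha_schedule(alpha, min_alpha, passes):
    """Return the learning rate used in each pass under a linear step-down."""
    alpha_delta = (alpha - min_alpha) / passes
    return [alpha - i * alpha_delta for i in range(passes)]


# Same constants as in the training cell
schedule = linear_alpha_schedule(0.025, 0.001, 10)
for epoch, rate in enumerate(schedule):
    print("pass {}: alpha = {:.4f}".format(epoch, rate))
```

With this stepping, the last pass runs one step above `min_alpha` rather than at it, because the rate is lowered only after each pass finishes.
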

## Similarity interface

First, we calculate the cosine similarity of **"Machine learning"** using Paragraph Vectors. Word vectors and document vectors are stored separately, so we have to add **.docvecs** after the model name to access the document vectors of a Doc2Vec model.

In [None]:
for model in models:
    print(str(model), model.docvecs.most_similar(positive=["Machine learning"], topn=20))
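
To make the ranking criterion concrete, here is a minimal pure-Python sketch of a cosine-similarity lookup over a handful of made-up 3-dimensional document vectors. The vectors and the helper `toy_most_similar` are assumptions for illustration only; they are not trained embeddings and not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors keyed by article title (illustrative only)
toy_docvecs = {
    "Machine learning": [0.9, 0.1, 0.0],
    "Artificial neural network": [0.8, 0.2, 0.1],
    "Lady Gaga": [0.0, 0.1, 0.9],
}


def toy_most_similar(title, vectors, topn=2):
    """Rank all other titles by cosine similarity to the given title."""
    query = vectors[title]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != title]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = toy_most_similar("Machine learning", toy_docvecs)
for other, score in ranking:
    print("{}: {:.3f}".format(other, score))
```
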

Second, we calculate the cosine similarity of **"Lady Gaga"** using Paragraph Vectors.

In [None]:
for model in models:
    print(str(model), model.docvecs.most_similar(positive=["Lady Gaga"], topn=10))

Third, we calculate the cosine similarity of **"Lady Gaga" - "American" + "Japanese"** using the document vector and word vectors. "American" and "Japanese" are word vectors, not Paragraph Vectors. Word vectors are already lowercased by WikiCorpus.

In [None]:
for model in models:
    print(str(model), model.docvecs.most_similar([model.docvecs["Lady Gaga"] - model["american"] + model["japanese"]], topn=10))
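
The query above mixes one document vector with two word vectors. A toy sketch of the same arithmetic, using hand-picked 3-dimensional vectors (all names and values below are hypothetical, chosen so the effect is easy to see), might look like this:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical axes: [singer-ness, american-ness, japanese-ness]
toy_docvecs = {
    "Lady Gaga": [1.0, 1.0, 0.0],
    "Some Japanese singer": [1.0, 0.0, 1.0],
    "Some American actor": [0.2, 1.0, 0.0],
}
toy_wordvecs = {
    "american": [0.0, 1.0, 0.0],
    "japanese": [0.0, 0.0, 1.0],
}

# document vector - word vector + word vector, element by element
query = [d - a + j for d, a, j in zip(toy_docvecs["Lady Gaga"],
                                      toy_wordvecs["american"],
                                      toy_wordvecs["japanese"])]

# Rank every document, including the starting article, against the combined vector
ranked = sorted(((title, cosine(query, vec)) for title, vec in toy_docvecs.items()),
                key=lambda pair: pair[1], reverse=True)
best_title, best_score = ranked[0]
```

In this toy setup the combined vector lands exactly on the hypothetical Japanese singer, while the starting article drops to second place.
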

As a result, DBOW reveal 