# Doc2Vec on Wikipedia articles

We conduct an experiment similar to the one in **Document Embedding with Paragraph Vectors** (http://arxiv.org/abs/1507.07998).

## Basic Setup

Let's import the Doc2Vec module.

In [1]:
import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

## Preparing the corpus

First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/) (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps).

Second, convert the articles to a WikiCorpus. WikiCorpus constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.

For more details on WikiCorpus, see [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html).

In [2]:
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
# For a date-specific dump, use this line instead (replace YYYYMMDD with the dump date):
#wiki = WikiCorpus("enwiki-YYYYMMDD-pages-articles.xml.bz2")
wiki.save("enwikicorpus")
#wiki = WikiCorpus.load("enwikicorpus")

Define a **TaggedWikiDocument** class that converts the WikiCorpus into a form suitable for Doc2Vec.

In [3]:
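class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        # With metadata enabled, get_texts() yields (tokens, (page_id, title)).
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            # Tokens are decoded from UTF-8 bytes here; if your gensim version
            # already yields str tokens, drop the decode call.
            # Each article is tagged with its title.
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])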
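The class above implements `__iter__` rather than being a one-shot generator because Doc2Vec passes over the corpus more than once (a vocabulary scan, then the training passes). The following minimal sketch illustrates that restartable pattern without requiring the Wikipedia dump. `FakeWiki` and its two articles are hypothetical stand-ins for WikiCorpus, and the `TaggedDocument` namedtuple mirrors the shape of gensim's class so the block runs without gensim installed.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (words, tags).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


class FakeWiki:
    """Hypothetical stand-in for a WikiCorpus with metadata enabled."""

    def get_texts(self):
        # Each item is (tokens, (page_id, title)), as in the class above.
        yield ["lady", "gaga", "is", "a", "singer"], (1, "Lady Gaga")
        yield ["j", "pop", "is", "a", "genre"], (2, "J-pop")


class TaggedWikiDocument:
    def __init__(self, wiki):
        self.wiki = wiki

    def __iter__(self):
        # A fresh generator is created on every call, so the corpus
        # can be streamed again for each pass.
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument(content, [title])


documents = TaggedWikiDocument(FakeWiki())
first_pass = [doc.tags[0] for doc in documents]
second_pass = [doc.tags[0] for doc in documents]
print(first_pass == second_pass)
```

Both passes see every article, whereas a plain generator object would be exhausted after the first loop.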

## Training the Doc2Vec Model

Now we’re ready to train a Doc2Vec model on the English Wikipedia.

In [4]:
documents = TaggedWikiDocument(wiki)
d2v = Doc2Vec(documents, size=500, window=8, min_count=5, workers=multiprocessing.cpu_count())
d2v.save("d2v")
#d2v = Doc2Vec.load("d2v")

## Similarity interface

First, we find the articles with the highest cosine similarity to **"Lady Gaga"** using document vectors. Word vectors and document vectors are stored separately, so we have to add **.docvecs** after the model name to access the document vectors of the Doc2Vec model.

In [31]:
d2v.docvecs.most_similar(positive=["Lady Gaga"], topn=30)

[('The Fame Monster', 0.4721682071685791),
 ('List of awards and nominations received by Lady Gaga', 0.4562109112739563),
 ('The Fame', 0.45485830307006836),
 ('Born This Way (song)', 0.45266470313072205),
 ('Beautiful, Dirty, Rich', 0.4488331377506256),
 ('Lisa Goes Gaga', 0.4479030966758728),
 ('Born This Way Foundation', 0.4456796646118164),
 ('Just Dance (song)', 0.4308058023452759),
 ('Bad Romance', 0.4264294505119324),
 ('Marry the Night', 0.42452168464660645),
 ('LoveGame', 0.42279836535453796),
 ('Alejandro (song)', 0.4224458932876587),
 ('The Monster Ball Tour', 0.4190131723880768),
 ('Poker Face (Lady Gaga song)', 0.4150453805923462),
 ('Aura (song)', 0.41388726234436035),
 ('Haus of Gaga', 0.41266217827796936),
 ('G.U.Y.', 0.412364661693573),
 ('Hair (Lady Gaga song)', 0.4060732424259186),
 ('Fame Kills: Starring Kanye West and Lady Gaga', 0.40446993708610535),
 ('Janet Jackson', 0.4010123014450073),
 ('Born This Way (album)', 0.4006291627883911),
 ('You and I (Lady Gaga son
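To make the scores above concrete, here is a minimal pure-Python sketch of ranking by cosine similarity. It is not gensim's implementation, and the three-dimensional vectors in `toy_docvecs` are made-up values for illustration, not vectors from the trained model. The helper names `cosine` and `most_similar` are likewise assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, tag, topn=10):
    """Rank every other tag by cosine similarity to the vector of `tag`."""
    query = vectors[tag]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != tag  # the query document itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up toy vectors, for illustration only.
toy_docvecs = {
    "Lady Gaga": [1.0, 0.0, 0.0],
    "The Fame": [0.9, 0.1, 0.0],
    "J-pop": [0.0, 1.0, 0.0],
}

ranking = most_similar(toy_docvecs, "Lady Gaga", topn=2)
for tag, score in ranking:
    print(tag, round(score, 4))
```

In this toy setup, "The Fame" points in nearly the same direction as "Lady Gaga" and ranks first, while "J-pop" is orthogonal to it and scores zero.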

Second, we find the articles with the highest cosine similarity to **"Lady Gaga" - "American" + "Japanese"** using the document vector and word vectors. "American" and "Japanese" are word vectors, not paragraph vectors. WikiCorpus has already converted all words to lowercase, so the word vectors are looked up in lowercase.

In [14]:
gagavec = d2v.docvecs["Lady Gaga"]
jpvec = d2v["japanese"]
amvec = d2v["american"]
# Request 31 results so that 30 remain after filtering out "Lady Gaga" itself.
[sim for sim in d2v.docvecs.most_similar([gagavec - amvec + jpvec], topn=31) if sim[0] != "Lady Gaga"]

[('The Fame Monster', 0.3138265907764435),
 ('Aura (song)', 0.3118044137954712),
 ('Lisa Goes Gaga', 0.3084213137626648),
 ('Venus (Lady Gaga song)', 0.3026971220970154),
 ('Born This Way (song)', 0.30016905069351196),
 ('Judas (Lady Gaga song)', 0.2984205186367035),
 ('Marry the Night', 0.2962271571159363),
 ('Beautiful, Dirty, Rich', 0.2950041890144348),
 ('G.U.Y.', 0.2893539071083069),
 ('List of awards and nominations received by Lady Gaga', 0.2887462377548218),
 ('Hime (rapper)', 0.2820761203765869),
 ('Feel (Kumi Koda song)', 0.281485915184021),
 ('Kokia (singer)', 0.2808469831943512),
 ('Lady Gaga discography', 0.2800357937812805),
 ('LoveGame', 0.2782258689403534),
 ('Nōdōteki Sanpunkan', 0.27804940938949585),
 ('J-pop', 0.2755916714668274),
 ('The Fame', 0.2755494713783264),
 ('Bad Romance', 0.27365168929100037),
 ('Born This Way (album)', 0.27221694588661194),
 ('Just Dance (song)', 0.27107709646224976),
 ('Alizée', 0.26866328716278076),
 ('Poker Face (Lady Gaga song)', 0.266
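The query above mixes a document vector with word vectors, which works because both live in the same vector space. The sketch below walks through the same arithmetic on made-up three-dimensional vectors. All values in `toy_docvecs` and `toy_wordvecs` are assumptions chosen for illustration, and the ranking is plain Python rather than gensim's `most_similar`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Made-up toy vectors: axis 0 ~ "pop singer", axis 1 ~ "american", axis 2 ~ "japanese".
toy_docvecs = {
    "Lady Gaga": [1.0, 1.0, 0.0],
    "The Fame": [0.9, 0.9, 0.1],
    "Kokia (singer)": [1.0, 0.0, 1.0],
}
toy_wordvecs = {
    "american": [0.0, 1.0, 0.0],
    "japanese": [0.0, 0.0, 1.0],
}

# Element-wise "Lady Gaga" - "american" + "japanese".
query = [
    g - a + j
    for g, a, j in zip(
        toy_docvecs["Lady Gaga"], toy_wordvecs["american"], toy_wordvecs["japanese"]
    )
]

# Rank the remaining documents against the shifted query vector.
ranked = sorted(
    ((tag, cosine(query, vec)) for tag, vec in toy_docvecs.items() if tag != "Lady Gaga"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])
```

Subtracting "american" and adding "japanese" moves the query away from "The Fame" and toward "Kokia (singer)" in this toy space, which is the same effect that brings J-pop articles into the real results above.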