# Doc2Vec on Wikipedia articles

We conduct an experiment similar to the one in **Document Embedding with Paragraph Vectors** (http://arxiv.org/abs/1507.07998).
That paper reports only DBOW results on Wikipedia data, so we replicate the experiment using both DBOW and DM.

## Basic Setup

Let's import the Doc2Vec module.

In [1]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

## Preparing the corpus

First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/) (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps).

Second, convert the articles to a WikiCorpus. WikiCorpus constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.

For more details on WikiCorpus, see [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html).

In [2]:
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
#wiki = WikiCorpus("enwiki-YYYYMMDD-pages-articles.xml.bz2")

Define a **TaggedWikiDocument** class to convert the WikiCorpus into a form suitable for Doc2Vec.

In [3]:
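class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        # Make get_texts() also yield (page_id, title) for each article.
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            # Tag each article with its title so it can be looked up by name later.
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])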

In [4]:
documents = TaggedWikiDocument(wiki)
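
A note on why this is a class with `__iter__` rather than a plain generator: the corpus is read more than once (once to build the vocabulary, then again for each training pass), so it has to be restartable. The sketch below illustrates the difference with toy data and no gensim dependency. `TaggedDoc`, `TOY_ARTICLES`, `TaggedToyDocument` and `one_shot` are hypothetical names made up for this illustration, not part of gensim.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (same field names), so this runs without gensim.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Made-up toy data shaped like the (tokens, (page_id, title)) stream used above.
TOY_ARTICLES = [
    (["machine", "learning", "is", "fun"], (1, "Machine learning")),
    (["lady", "gaga", "is", "a", "singer"], (2, "Lady Gaga")),
]


class TaggedToyDocument:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, articles):
        self.articles = articles

    def __iter__(self):
        for tokens, (page_id, title) in self.articles:
            yield TaggedDoc(tokens, [title])


def one_shot(articles):
    """Plain generator: it can be consumed only once."""
    for tokens, (page_id, title) in articles:
        yield TaggedDoc(tokens, [title])


restartable = TaggedToyDocument(TOY_ARTICLES)
first_pass = [doc.tags[0] for doc in restartable]
second_pass = [doc.tags[0] for doc in restartable]  # same titles again

gen = one_shot(TOY_ARTICLES)
gen_first = [doc.tags[0] for doc in gen]
gen_second = [doc.tags[0] for doc in gen]  # empty: the generator is exhausted

print(first_pass, second_pass)
print(gen_first, gen_second)
```

With a one-shot generator, the second pass would silently see no documents, which is why the wrapper class is the safer choice here.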

## Training the Doc2Vec Model

First, we define the two models to compare.

In [5]:
cores = multiprocessing.cpu_count()
vocab = 915715  # vocabulary size, following the paper

models = [
    # PV-DBOW 
    Doc2Vec(dm=0, dbow_words=1, size=200, window=5, min_count=5, max_vocab_size=vocab, iter=10, workers=cores),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=200, window=5, min_count=5, max_vocab_size=vocab, workers=cores),
]
models[0].build_vocab(documents)
print(str(models[0]))
models[1].reset_from(models[0])
print(str(models[1]))

Doc2Vec(dbow+w,d200,hs,w5,mc5,t8)
Doc2Vec(dm/m,d200,hs,w5,mc5,t8)
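
As a rough sketch of how the two configurations differ, following the description of PV-DM and PV-DBOW in the Paragraph Vector papers: PV-DM with averaging predicts a word from the mean of the document vector and the surrounding word vectors, while PV-DBOW predicts words from the document vector alone. The numbers below are made up, `mean_vector` is a hypothetical helper, and this only shows how the input to the prediction step is formed; it is not gensim's implementation.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Made-up 2-dimensional vectors, for illustration only.
doc_vector = [0.2, 0.4]
context_word_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.2]]

# PV-DM with averaging: the document vector is averaged together with
# the context word vectors before predicting the target word.
dm_input = mean_vector([doc_vector] + context_word_vectors)

# PV-DBOW: the document vector alone is used to predict words of the document.
dbow_input = doc_vector

print(dm_input)
print(dbow_input)
```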


Now we’re ready to train Doc2Vec on the English Wikipedia.

In [6]:
for model in models:
    %time model.train(documents)

CPU times: user 3d 18h 14min 32s, sys: 29min 16s, total: 3d 18h 43min 48s
Wall time: 21h 48min 45s
CPU times: user 11h 59min 25s, sys: 16min 43s, total: 12h 16min 9s
Wall time: 7h 43min 6s


## Similarity interface

After that, let's test both models! The **DBOW** model shows results similar to those in the original paper.
First, we compute the cosine similarity to **"Machine learning"** using Paragraph Vectors. Word Vectors and Document Vectors are stored separately, so we append **.docvecs** to the model name to access the Document Vectors of the Doc2Vec model.
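
For readers unfamiliar with the score, here is a minimal sketch of ranking documents by cosine similarity. The vectors are made up, and the `cosine` and `most_similar` functions below are hypothetical stand-ins written for this illustration; they are not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_tag, vectors, topn=3):
    """Rank all other tags by cosine similarity to the query tag's vector."""
    query = vectors[query_tag]
    scored = [
        (tag, cosine(query, vec))
        for tag, vec in vectors.items()
        if tag != query_tag  # skip the query document itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional "document vectors", for illustration only.
toy_docvecs = {
    "Machine learning": [1.0, 0.9, 0.0],
    "Data mining": [0.9, 1.0, 0.1],
    "Pattern recognition": [1.0, 0.5, 0.0],
    "Lady Gaga": [0.0, 0.1, 1.0],
}

ranking = most_similar("Machine learning", toy_docvecs, topn=3)
for tag, score in ranking:
    print(tag, round(score, 3))
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.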

In [7]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))

Doc2Vec(dbow+w,d200,hs,w5,mc5,t8)
[('Artificial neural network', 0.7255396246910095),
 ('Theoretical computer science', 0.7125768661499023),
 ('Data mining', 0.6895816326141357),
 ('Pattern recognition', 0.678891658782959),
 ('List of important publications in computer science', 0.6695302724838257),
 ('Outline of computer science', 0.667578935623169),
 ('Information visualization', 0.6667760014533997),
 ('Unsupervised learning', 0.6627277135848999),
 ('Bayesian network', 0.6622973680496216),
 ('Support vector machine', 0.6594343781471252),
 ('Algorithmic composition', 0.6593101024627686),
 ("Solomonoff's theory of inductive inference", 0.6554585695266724),
 ('Kriging', 0.6505937576293945),
 ('Model checking', 0.6501827239990234),
 ('Information theory', 0.6447420120239258),
 ('Computational learning theory', 0.6422973871231079),
 ('Generalization error', 0.6414266228675842),
 ('Complexity', 0.6391021609306335),
 ('Glossary of artificial intelligence', 0.6353012323379517),
 ('Theory of 

The **DBOW** model interprets 'Machine learning' as part of the Computer Science field, and the **DM** model as a Data Science related field.

Second, we compute the cosine similarity to **"Lady Gaga"** using Paragraph Vectors.

In [8]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Lady Gaga"], topn=10))

Doc2Vec(dbow+w,d200,hs,w5,mc5,t8)
[('Katy Perry', 0.7122470140457153),
 ('Nicki Minaj', 0.6723202466964722),
 ('Christina Aguilera', 0.6566343903541565),
 ('Adam Lambert', 0.6328957676887512),
 ('Beyoncé', 0.6275202035903931),
 ('Rihanna', 0.6249787211418152),
 ('Miley Cyrus', 0.6206135153770447),
 ('Nicole Scherzinger', 0.6199545860290527),
 ('List of awards and nominations received by Lady Gaga', 0.6158047914505005),
 ('Ariana Grande', 0.6092617511749268)]
Doc2Vec(dm/m,d200,hs,w5,mc5,t8)
[('Katy Perry', 0.6092739105224609),
 ('List of awards and nominations received by Lady Gaga', 0.581851065158844),
 ('Born This Way (song)', 0.5763623118400574),
 ('ArtRave: The Artpop Ball', 0.5581628084182739),
 ('The Monster Ball Tour', 0.5547659993171692),
 ('Born This Way Ball', 0.5475640892982483),
 ('Applause (Lady Gaga song)', 0.5422967672348022),
 ('The Fame', 0.5405881404876709),
 ('Rihanna', 0.5365649461746216),
 ('Paparazzi (Lady Gaga song)', 0.5345339775085449)]


The **DBOW** model returns similar singers in the U.S., while the **DM** model finds that many of Lady Gaga's songs are close to the **"Lady Gaga"** document.

Third, we compute the cosine similarity to **"Lady Gaga" - "American" + "Japanese"** using the Document Vector and Word Vectors. "American" and "Japanese" are Word Vectors, not Paragraph Vectors. WikiCorpus has already lowercased all words, so we look them up as "american" and "japanese".
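
The idea behind this query can be sketched with toy vectors: subtract one word vector from the document vector, add another, and rank all documents against the result. Everything below is made up for illustration. The three axes are hand-picked so the effect is visible (real embeddings have no such interpretable axes), and `cosine` and `add_sub` are hypothetical helpers, not gensim functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def add_sub(base, minus, plus):
    """Component-wise base - minus + plus."""
    return [b - m + p for b, m, p in zip(base, minus, plus)]


# Hand-picked toy axes: [pop music, "american", "japanese"].
doc_vectors = {
    "Lady Gaga": [1.0, 1.0, 0.0],
    "Katy Perry": [0.9, 1.0, 0.0],
    "AKB48": [0.9, 0.0, 1.0],
    "Kriging": [0.0, 0.3, 0.1],
}
word_vectors = {
    "american": [0.0, 1.0, 0.0],
    "japanese": [0.0, 0.0, 1.0],
}

query = add_sub(doc_vectors["Lady Gaga"], word_vectors["american"], word_vectors["japanese"])

# A raw query vector is compared against every document, so the source
# document has to be filtered out explicitly (as the cell below also does).
results = sorted(
    ((tag, cosine(query, vec)) for tag, vec in doc_vectors.items() if tag != "Lady Gaga"),
    key=lambda pair: pair[1],
    reverse=True,
)
for tag, score in results:
    print(tag, round(score, 3))
```

In this toy setup the Japanese act ranks first because the arithmetic moves the query away from the "american" axis and toward the "japanese" one. Whether a trained model behaves this way is exactly what the experiment below checks.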

In [9]:
for model in models:
    print(str(model))
    vec = [model.docvecs["Lady Gaga"] - model["american"] + model["japanese"]]
    pprint([m for m in model.docvecs.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])

Doc2Vec(dbow+w,d200,hs,w5,mc5,t8)
[('Nicki Minaj', 0.5384640097618103),
 ('AKB48', 0.5191181302070618),
 ('Katy Perry', 0.5179253816604614),
 ('Gwen Stefani', 0.48544663190841675),
 ('Ayumi Hamasaki', 0.4849669635295868),
 ('Mondai Girl', 0.4826279282569885),
 ('Koda Kumi', 0.4788415729999542),
 ('Momoiro Clover Z', 0.47876042127609253),
 ('Big Bang (South Korean band)', 0.47655385732650757),
 ('Kyary Pamyu Pamyu', 0.4751134514808655)]
Doc2Vec(dm/m,d200,hs,w5,mc5,t8)
[('Aura (song)', 0.4737752676010132),
 ('Katy Perry', 0.47200503945350647),
 ('Haus of Gaga', 0.448883056640625),
 ('Born This Way (album)', 0.44316208362579346),
 ('The Fame', 0.4316578209400177),
 ('Born This Way (song)', 0.4304528832435608),
 ('Artpop', 0.42924416065216064),
 ('Marry the Night', 0.4230160117149353),
 ('List of awards and nominations received by Lady Gaga', 0.420695424079895),
 ('Speechless (Lady Gaga song)', 0.4165343940258026)]


As a result, the **DBOW** model returns Japanese artists similar to Lady Gaga, such as 'AKB48', the most famous idol group in Japan, and 'Kyary Pamyu Pamyu', whose appearance is also distinctive.
On the other hand, the **DM** model's results don't include any Japanese artists in the top 10 similar documents. They are almost the same as the results obtained without the vector arithmetic.

These results suggest that **DBOW**, the model employed in the original paper, works particularly well for computing similarity between Document Vectors and Word Vectors.