# Training Doc2Vec on Wikipedia articles

This notebook replicates the **Document Embedding with Paragraph Vectors** paper, http://arxiv.org/abs/1507.07998.

In that paper, the authors only showed results from the DBOW ("distributed bag of words") mode, trained on the English Wikipedia. Here we replicate this experiment using not only DBOW, but also the DM mode of the "paragraph vector" algorithm aka Doc2Vec.

## Basic setup

Let's import the necessary modules and set up logging. The code below assumes Python 3.7+ and Gensim 4.0+.

In [1]:
import logging
import multiprocessing
from pprint import pprint

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Preparing the corpus

First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/). You want the file named `enwiki-latest-pages-articles.xml.bz2`.

Second, convert that Wikipedia article dump from the arcane Wikimedia XML format into a plain text file. This will make the subsequent training faster and also allow easy inspection of the data ("input eyeballing").

We'll preprocess each article at the same time, normalizing its text to lowercase, splitting into tokens, etc. Below I use a regexp tokenizer that simply looks for alphabetic sequences as tokens. But feel free to adapt the text preprocessing to your own domain. High quality preprocessing is often critical for the final pipeline accuracy – garbage in, garbage out!
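To illustrate how the preprocessing could be adapted, here is a minimal sketch of a drop-in tokenizer. It assumes the calling convention that Gensim documents for `tokenizer_func` (the text, a minimum token length, a maximum token length, and a lowercase flag). The name `simple_tokenize`, the regular expression, and the default lengths are hypothetical choices for this example, not part of Gensim.

```python
import re

# Hypothetical pattern: runs of ASCII letters only.
# Digits, punctuation and non-ASCII letters act as separators.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def simple_tokenize(text, token_min_len=2, token_max_len=15, lower=True):
    """Split `text` into alphabetic tokens, optionally lowercased, filtered by length."""
    if lower:
        text = text.lower()
    return [
        token
        for token in TOKEN_RE.findall(text)
        if token_min_len <= len(token) <= token_max_len
    ]


tokens = simple_tokenize("Anarchism is a political philosophy (est. 1840s)!")
print(tokens)
```

A function with this shape could be passed as `tokenizer_func=simple_tokenize` when constructing the corpus. Note how one-letter tokens such as "a" are dropped by the minimum length filter.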

In [None]:
wiki = WikiCorpus(
    "enwiki-latest-pages-articles.xml.bz2",  # path to the file you downloaded above
    tokenizer_func=tokenize,  # simple regexp; plug in your own tokenizer here
    metadata=True,  # also return the article titles and ids when parsing
    dictionary={},  # don't start processing the data yet
)

In [2]:
with smart_open.open("wiki.txt.gz", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")  # title_of_article [TAB] words of the article

2022-03-17 21:15:32,118 : INFO : processing article #0: 'Anarchism' (6538 tokens)
2022-03-17 21:30:00,138 : INFO : processing article #500000: 'Spiritual Formation Bible' (54 tokens)
2022-03-17 21:40:22,219 : INFO : processing article #1000000: 'Adolf von Liebenberg' (52 tokens)
2022-03-17 21:49:43,825 : INFO : processing article #1500000: 'Small nucleolar RNA U6-53/MBII-28' (123 tokens)
2022-03-17 21:59:23,620 : INFO : processing article #2000000: 'Xie Fei' (50 tokens)
2022-03-17 22:09:17,460 : INFO : processing article #2500000: 'Rhein, Saskatchewan' (185 tokens)
2022-03-17 22:19:39,293 : INFO : processing article #3000000: 'Kunyinsky District' (969 tokens)
2022-03-17 22:30:41,221 : INFO : processing article #3500000: 'Lake Saint-Charles' (555 tokens)
2022-03-17 22:41:17,487 : INFO : processing article #4000000: 'Mahāyānasaṃgraha' (612 tokens)
2022-03-17 22:52:27,834 : INFO : processing article #4500000: 'Liriomyza trifolii' (1493 tokens)
2022-03-17 23:04:41,464 : INFO : processing a

The above takes about 2 hours on my 2021 M1 MacBook Pro and creates a new ~5.8 GB file named `wiki.txt.gz`. We're compressing the text into `.gz` (GZIP) right away to save disk space, using the [smart_open](https://github.com/RaRe-Technologies/smart_open) library.

Next we'll set up a stream to load the preprocessed articles from `wiki.txt.gz` one by one, in the format expected by Doc2Vec, ready for training. We don't want to load everything into RAM at once, because that would blow up the memory. And it is not necessary – Gensim can handle streamed training data:

In [4]:
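class TaggedWikiCorpus:
    def __init__(self, wiki_text_path):
        self.wiki_text_path = wiki_text_path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated repeatedly (once per epoch).
        with smart_open.open(self.wiki_text_path, encoding='utf8') as fin:
            for line in fin:
                title, words = line.split('\t')  # title_of_article [TAB] words of the article
                yield TaggedDocument(words=words.split(), tags=[title])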

documents = TaggedWikiCorpus('wiki.txt.gz')  # A streamed iterable; nothing in RAM yet.
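Why a class with `__iter__` rather than a plain generator? Training makes several passes over the corpus (one to build the vocabulary, then one per epoch), and a generator is exhausted after its first pass. The sketch below illustrates the difference using only the standard library. `LineCorpus`, the sample titles and the temporary file are hypothetical stand-ins for `TaggedWikiCorpus` and `wiki.txt.gz`.

```python
import gzip
import os
import tempfile


class LineCorpus:
    """Restartable stream over a gzipped `title<TAB>words` file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is re-opened on every call, so each pass starts from the first line.
        with gzip.open(self.path, "rt", encoding="utf8") as fin:
            for line in fin:
                title, words = line.rstrip("\n").split("\t")
                yield title, words.split()


# Write a tiny two-article sample in the same layout as wiki.txt.gz.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt", encoding="utf8") as fout:
    fout.write("First article\tsome words of the first article\n")
    fout.write("Second article\tsome words of the second article\n")

corpus = LineCorpus(path)
first_pass = [title for title, _ in corpus]
second_pass = [title for title, _ in corpus]  # works again: __iter__ restarts the stream

generator = iter(corpus)  # by contrast, this is a one-shot generator
list(generator)           # consume it once
third_pass = list(generator)  # already exhausted, so nothing is left

print(first_pass, second_pass, third_pass)
```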

In [7]:
# Load and print the first preprocessed Wikipedia document, as a sanity check ("input eyeballing").
first_doc = next(iter(documents))
print(first_doc.tags, ': ', ' '.join(first_doc.words[:50] + ['………'] + first_doc.words[-50:]))

['Anarchism'] :  anarchism is political philosophy and movement that is sceptical of authority and rejects all involuntary coercive forms of hierarchy anarchism calls for the abolition of the state which it holds to be unnecessary undesirable and harmful as historically left wing movement placed on the farthest left of the political spectrum ……… criticism of philosophical anarchism defence of philosophical anarchism stating that both kinds of anarchism philosophical and political anarchism are philosophical and political claims anarchistic popular fiction novel an argument for philosophical anarchism external links anarchy archives anarchy archives is an online research center on the history and theory of anarchism


The document seems legit, so let's move on to finally training some Doc2Vec models.

## Training Doc2Vec

The original paper had a vocabulary size of 915,715 word types, so we'll try to match it by setting `max_final_vocab=915715` in the Doc2Vec constructor.
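Gensim's documentation describes `max_final_vocab` as limiting the vocabulary by automatically picking a matching `min_count`. The sketch below illustrates that general idea on toy counts. It is not Gensim's actual implementation, and the helper names `pick_min_count` and `trim_vocab` are hypothetical.

```python
from collections import Counter


def pick_min_count(word_counts, max_final_vocab, min_count=5):
    """Return the lowest frequency threshold (>= min_count) that keeps at most max_final_vocab words."""
    sorted_counts = sorted(word_counts.values(), reverse=True)
    if len(sorted_counts) <= max_final_vocab:
        return min_count  # the cap is not binding
    # The first word that would overflow the cap has this count;
    # every word with that count or lower has to be dropped.
    overflow_count = sorted_counts[max_final_vocab]
    return max(min_count, overflow_count + 1)


def trim_vocab(word_counts, max_final_vocab, min_count=5):
    """Keep only the words whose frequency reaches the chosen threshold."""
    threshold = pick_min_count(word_counts, max_final_vocab, min_count)
    return {word: count for word, count in word_counts.items() if count >= threshold}


# Toy counts: a=4, b=3, c=3, d=2, e=1
counts = Counter("a a a a b b b c c c d d e".split())

kept_cap2 = trim_vocab(counts, max_final_vocab=2, min_count=1)
kept_cap3 = trim_vocab(counts, max_final_vocab=3, min_count=1)
print(kept_cap2, kept_cap3)
```

With a frequency threshold, words tied at the cut-off are dropped together, so the final vocabulary can end up smaller than the requested cap (here a cap of 2 keeps only one word). This would be consistent with the training log reporting a vocabulary of 894,446 rather than exactly 915,715.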

In [8]:
cores = multiprocessing.cpu_count()

models = [
    # PV-DBOW 
    Doc2Vec(dm=0, dbow_words=1, vector_size=200, window=8, epochs=10, workers=cores, max_final_vocab=915715),
    # PV-DM with average
    Doc2Vec(dm=1, dm_mean=1, vector_size=200, window=8, epochs=10, workers=cores, max_final_vocab=915715),
]

2022-03-17 23:12:37,360 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t10>', 'datetime': '2022-03-17T23:12:37.360576', 'gensim': '4.1.3.dev0', 'python': '3.8.9 (default, Oct 26 2021, 07:25:53) \n[Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-12.2.1-arm64-arm-64bit', 'event': 'created'}
2022-03-17 23:12:37,365 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t10>', 'datetime': '2022-03-17T23:12:37.365118', 'gensim': '4.1.3.dev0', 'python': '3.8.9 (default, Oct 26 2021, 07:25:53) \n[Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-12.2.1-arm64-arm-64bit', 'event': 'created'}


In [10]:
models[0].build_vocab(documents, progress_per=500000)
print(models[0])

# Save some time by copying the vocabulary structures from the first model.
# Both models are built on top of exactly the same data, so there's no need to repeat the vocab-building step.
models[1].reset_from(models[0])
print(models[1])

2022-03-17 23:14:38,521 : INFO : collecting all words and their counts
2022-03-17 23:14:38,529 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2022-03-17 23:16:33,505 : INFO : PROGRESS: at example #500000, processed 654950164 words (5696698/s), 3222179 word types, 500000 tags
2022-03-17 23:17:41,900 : INFO : PROGRESS: at example #1000000, processed 1018611068 words (5317131/s), 4480366 word types, 1000000 tags
2022-03-17 23:18:36,271 : INFO : PROGRESS: at example #1500000, processed 1305140647 words (5269927/s), 5420104 word types, 1500000 tags
2022-03-17 23:19:23,908 : INFO : PROGRESS: at example #2000000, processed 1550245240 words (5145361/s), 6188355 word types, 2000000 tags
2022-03-17 23:20:10,242 : INFO : PROGRESS: at example #2500000, processed 1790661139 words (5188872/s), 6941128 word types, 2500000 tags
2022-03-17 23:20:56,600 : INFO : PROGRESS: at example #3000000, processed 2028261627 words (5125392/s), 7664997 word types, 3000000 tags
2022-0

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t10>
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t10>


Now we’re ready to train Doc2Vec on the English Wikipedia. **Warning!** Training the DBOW model takes ~16 hours, and DM ~4 hours, on my 2021 laptop.

In [14]:
for model in models:
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs, report_delay=30*60)

2022-03-17 23:29:03,320 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 10 workers on 894446 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-03-17T23:29:03.320153', 'gensim': '4.1.3.dev0', 'python': '3.8.9 (default, Oct 26 2021, 07:25:53) \n[Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-12.2.1-arm64-arm-64bit', 'event': 'train'}
2022-03-17 23:29:04,361 : INFO : EPOCH 1 - PROGRESS: at 0.00% examples, 379389 words/s, in_qsize 19, out_qsize 0
2022-03-17 23:59:04,372 : INFO : EPOCH 1 - PROGRESS: at 17.95% examples, 429937 words/s, in_qsize 19, out_qsize 0
2022-03-18 00:29:04,379 : INFO : EPOCH 1 - PROGRESS: at 55.55% examples, 437068 words/s, in_qsize 20, out_qsize 0
2022-03-18 00:59:04,423 : INFO : EPOCH 1 - PROGRESS: at 98.13% examples, 439343 words/s, in_qsize 20, out_qsize 0
2022-03-18 01:00:11,996 : INFO : worker thread finished; awaiting finish of 9 more threads
2022-03-18 01:00:12,013 : INF

2022-03-18 08:30:54,071 : INFO : worker thread finished; awaiting finish of 9 more threads
2022-03-18 08:30:54,085 : INFO : worker thread finished; awaiting finish of 8 more threads
2022-03-18 08:30:54,094 : INFO : worker thread finished; awaiting finish of 7 more threads
2022-03-18 08:30:54,131 : INFO : worker thread finished; awaiting finish of 6 more threads
2022-03-18 08:30:54,132 : INFO : worker thread finished; awaiting finish of 5 more threads
2022-03-18 08:30:54,145 : INFO : worker thread finished; awaiting finish of 4 more threads
2022-03-18 08:30:54,164 : INFO : worker thread finished; awaiting finish of 3 more threads
2022-03-18 08:30:54,171 : INFO : worker thread finished; awaiting finish of 2 more threads
2022-03-18 08:30:54,183 : INFO : worker thread finished; awaiting finish of 1 more threads
2022-03-18 08:30:54,189 : INFO : worker thread finished; awaiting finish of 0 more threads
2022-03-18 08:30:54,189 : INFO : EPOCH - 6 : training on 2996051328 raw words (2402970085 

2022-03-18 15:12:45,325 : INFO : worker thread finished; awaiting finish of 8 more threads
2022-03-18 15:12:45,326 : INFO : worker thread finished; awaiting finish of 7 more threads
2022-03-18 15:12:45,327 : INFO : worker thread finished; awaiting finish of 6 more threads
2022-03-18 15:12:45,327 : INFO : worker thread finished; awaiting finish of 5 more threads
2022-03-18 15:12:45,332 : INFO : worker thread finished; awaiting finish of 4 more threads
2022-03-18 15:12:45,338 : INFO : worker thread finished; awaiting finish of 3 more threads
2022-03-18 15:12:45,346 : INFO : worker thread finished; awaiting finish of 2 more threads
2022-03-18 15:12:45,348 : INFO : worker thread finished; awaiting finish of 1 more threads
2022-03-18 15:12:45,354 : INFO : worker thread finished; awaiting finish of 0 more threads
2022-03-18 15:12:45,355 : INFO : EPOCH - 1 : training on 2996051328 raw words (2402951760 effective words) took 1521.7s, 1579074 effective words/s
2022-03-18 15:12:46,373 : INFO : E

2022-03-18 18:00:31,056 : INFO : worker thread finished; awaiting finish of 7 more threads
2022-03-18 18:00:31,057 : INFO : worker thread finished; awaiting finish of 6 more threads
2022-03-18 18:00:31,059 : INFO : worker thread finished; awaiting finish of 5 more threads
2022-03-18 18:00:31,059 : INFO : worker thread finished; awaiting finish of 4 more threads
2022-03-18 18:00:31,068 : INFO : worker thread finished; awaiting finish of 3 more threads
2022-03-18 18:00:31,072 : INFO : worker thread finished; awaiting finish of 2 more threads
2022-03-18 18:00:31,075 : INFO : worker thread finished; awaiting finish of 1 more threads
2022-03-18 18:00:31,076 : INFO : worker thread finished; awaiting finish of 0 more threads
2022-03-18 18:00:31,076 : INFO : EPOCH - 8 : training on 2996051328 raw words (2402970402 effective words) took 1404.5s, 1710899 effective words/s
2022-03-18 18:00:32,091 : INFO : EPOCH 9 - PROGRESS: at 0.01% examples, 2063533 words/s, in_qsize 0, out_qsize 0
2022-03-18 1

## Similarity interface

After that, let's test both models! The DBOW model shows results similar to those in the original paper.

First, calculate the most similar Wikipedia articles to the "Machine learning" article. The calculated word vectors and document vectors are stored separately, in `model.wv` and `model.dv` respectively:

In [15]:
for model in models:
    print(model)
    pprint(model.dv.most_similar(positive=["Machine learning"], topn=20))

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t10>
[('Supervised learning', 0.7626678943634033),
 ('Pattern recognition', 0.7443839907646179),
 ('Artificial neural network', 0.7443667650222778),
 ('Boosting (machine learning)', 0.7209591865539551),
 ('Deep learning', 0.7030681371688843),
 ('Linear classifier', 0.6918482184410095),
 ('Feature selection', 0.6885010600090027),
 ('Knowledge retrieval', 0.6797034740447998),
 ('Convolutional neural network', 0.6789148449897766),
 ('Outline of computer science', 0.6732515096664429),
 ('Training, validation, and test sets', 0.6729527711868286),
 ('Support-vector machine', 0.6719434857368469),
 ('Learning classifier system', 0.6716565489768982),
 ('Outline of machine learning', 0.6692107915878296),
 ('Bayesian network', 0.6654112935066223),
 ('Manifold regularization', 0.6635575294494629),
 ('Multi-task learning', 0.6624512672424316),
 ('Fuzzy logic', 0.6605969667434692),
 ('Computer mathematics', 0.6600310206413269),
 ('Recurrent neural network', 0.657

Both results seem similar, but note that the DM model took about 4x less time to train.
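The scores returned by `most_similar` are cosine similarities between the query vector and every stored vector. As a rough illustration of that ranking (a plain-Python sketch on made-up 2-D vectors, not Gensim's optimized code), with `cosine`, `rank_by_cosine` and `toy_vectors` being hypothetical names:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, vectors, topn=3):
    """Return the `topn` (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up document vectors, for illustration only.
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [0.0, 1.0],
    "doc_d": [-1.0, 0.0],
}

ranking = rank_by_cosine([1.0, 0.0], toy_vectors, topn=3)
print(ranking)
```

Because cosine similarity ignores vector length, only the direction of a document vector matters for the ranking.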

Second, let's calculate the most similar Wikipedia entries to "Lady Gaga" using Paragraph Vector:

In [16]:
for model in models:
    print(model)
    pprint(model.dv.most_similar(positive=["Lady Gaga"], topn=10))

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t10>
[('Ariana Grande', 0.755739688873291),
 ('Katy Perry', 0.7534462213516235),
 ('Miley Cyrus', 0.7091007828712463),
 ('Adele', 0.6958011984825134),
 ('Demi Lovato', 0.6867919564247131),
 ('Nicki Minaj', 0.6783465147018433),
 ('Taylor Swift', 0.6691418886184692),
 ('Adam Lambert', 0.6638894081115723),
 ('Rihanna', 0.6437391638755798),
 ('Kesha', 0.6433634161949158)]
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t10>
[('Born This Way (album)', 0.6649508476257324),
 ('Artpop', 0.6616811752319336),
 ('Lady Gaga videography', 0.6363328695297241),
 ('Katy Perry', 0.6322777271270752),
 ('Beautiful, Dirty, Rich', 0.6277879476547241),
 ('Lady Gaga discography', 0.60688316822052),
 ('Applause (Lady Gaga song)', 0.6062529683113098),
 ('List of Lady Gaga live performances', 0.5975069403648376),
 ('Born This Way (song)', 0.5948888659477234),
 ('Madonna', 0.5918263792991638)]


The DBOW model reveals similar singers in the U.S., while the DM model seems to pay more attention to the word "Gaga" itself.

Finally, let's do some of the wilder arithmetic that vector embeddings are famous for. What are the entries most similar to "Lady Gaga" - "American" + "Japanese"?

Note that "American" and "Japanese" are word vectors, but they live in the same space as the document vectors, so we can add and subtract them at will, for some interesting results. All words were already lowercased by our tokenizer above, so we look up the lowercased versions here:

In [17]:
for model in models:
    print(model)
    vec = [model.dv["Lady Gaga"] - model.wv["american"] + model.wv["japanese"]]
    pprint([m for m in model.dv.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t10>
[('Ayumi Hamasaki', 0.604461669921875),
 ('2NE1', 0.5942890644073486),
 ('Katy Perry', 0.5932046175003052),
 ('Ariana Grande', 0.5865142941474915),
 ("Can't Stop the Disco", 0.5778986215591431),
 ("Girls' Generation", 0.5741134285926819),
 ('We Are "Lonely Girl"', 0.5682086944580078),
 ('Perfume (Japanese band)', 0.568188488483429),
 ('H (Ayumi Hamasaki EP)', 0.5679325461387634),
 ('Kyary Pamyu Pamyu', 0.5665541887283325)]
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t10>
[('Kaela Kimura', 0.5528751015663147),
 ('Chisato Moritaka', 0.551906943321228),
 ('Suzuki Ami Around the World: Live House Tour 2005', 0.5428911447525024),
 ('Pink Lady (duo)', 0.5385505557060242),
 ('Artpop', 0.5361125469207764),
 ('Kaede (dancer)', 0.535369873046875),
 ('Miliyah Kato', 0.5336685180664062),
 ('Liyuu', 0.5325193405151367),
 ('Ai (singer)', 0.5272262692451477),
 ('Momoiro Clover Z', 0.525260329246521)]


As a result, the DBOW model surfaced artists similar to Lady Gaga in Japan, such as **Ayumi Hamasaki** whose Wiki bio says:

> Ayumi Hamasaki is a Japanese singer, songwriter, record producer, actress, model, spokesperson, and entrepreneur.

So that sounds like a success.

Similarly, the DM model thought **Kaela Kimura** is the closest hit:

> Kaela Kimura is a Japanese pop rock singer, lyricist, fashion model and television presenter.

Also pretty good.

These results suggest that both training modes of the Paragraph Vector algorithm (DBOW, the mode reported in the original paper, and DM) work well for calculating similarity between document vectors, word vectors, or a combination of both. In this experiment, the DM mode had the added advantage of being about 4x faster to train (~4 hours versus ~16 hours).

If you want to continue working with these trained models, you can save them to disk, to avoid having to re-train them from scratch every time:

In [19]:
models[0].save('doc2vec_dbow.model')
models[1].save('doc2vec_dm.model')

2022-03-18 19:08:34,623 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dbow.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-03-18T19:08:34.622990', 'gensim': '4.1.3.dev0', 'python': '3.8.9 (default, Oct 26 2021, 07:25:53) \n[Clang 13.0.0 (clang-1300.0.29.30)]', 'platform': 'macOS-12.2.1-arm64-arm-64bit', 'event': 'saving'}
2022-03-18 19:08:34,641 : INFO : storing np array 'vectors' to doc2vec_dbow.model.dv.vectors.npy
2022-03-18 19:08:40,244 : INFO : storing np array 'vectors' to doc2vec_dbow.model.wv.vectors.npy
2022-03-18 19:08:46,811 : INFO : storing np array 'syn1neg' to doc2vec_dbow.model.syn1neg.npy
2022-03-18 19:08:48,564 : INFO : not storing attribute cum_table
2022-03-18 19:08:56,097 : INFO : saved doc2vec_dbow.model
2022-03-18 19:08:56,098 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dm.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-03-18T19:08:56.09876

To continue your doc2vec explorations, refer to the official API documentation in Gensim: https://radimrehurek.com/gensim/models/doc2vec.html