### Custom WikiCorpus
* [Doc2Vec tutorial (Lee corpus notebook)](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)
* [WikiCorpus processing](https://radimrehurek.com/gensim/corpora/wikicorpus.html)
* [Doc2Vec API documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [A gentle introduction to Doc2Vec](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
* iterable/iterator [tutorial](https://www.programiz.com/python-programming/iterator)
* [3 epochs](https://stackoverflow.com/questions/46856838/how-many-epochs-should-word2vec-be-trained-what-is-a-recommended-training-datas)
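
Doc2Vec makes several passes over its training data (one to build the vocabulary, then one per epoch), so the corpus has to be a restartable iterable rather than a one-shot iterator. The sketch below illustrates the difference. `ListCorpus` and `corpus_demo` are hypothetical names and are not the `wiki.Corpus` class used in this notebook; the `TaggedDocument` stand-in only mirrors the `words` and `tags` fields of gensim's namedtuple so that the example runs without gensim.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim.models.doc2vec.TaggedDocument,
# so that this sketch runs without gensim installed.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')


class ListCorpus:
    """Hypothetical restartable corpus: every iter() call starts a fresh pass."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        # A generator method returns a new generator on each call,
        # so the corpus can be traversed once per training pass.
        for doc_id, text in enumerate(self.texts):
            yield TaggedDocument(words=text.lower().split(), tags=[doc_id])


corpus_demo = ListCorpus(['Hello world', 'How are you'])
first_pass = list(corpus_demo)
second_pass = list(corpus_demo)   # the same documents again

one_shot = iter(corpus_demo)      # a plain iterator is exhausted after one pass
list(one_shot)
leftover = list(one_shot)         # nothing left to yield
```

Passing `iter(corpus)` instead of `corpus` to `build_vocab` or `train` would therefore leave later passes with nothing to read.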

**Test WikiCorpus**

In [None]:
import wiki

path_base = '/Users/harangju/Developer/data/wiki/partition/'
name_xml = 'enwiki-20190720-pages-articles-multistream1.xml-p10p30302.bz2'
name_index = 'enwiki-20190720-pages-articles-multistream-index1.txt-p10p30302.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
corpus = wiki.Corpus(path_xml, path_index)
i = iter(corpus)
td = next(i)
td.words[:7]

### Train small model

In [None]:
# small model
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=50,
                min_count=10,
                epochs=1)

In [None]:
print('Building vocabulary...')
model.build_vocab(corpus)
print('\nTraining...')
%time model.train(corpus,\
                  total_examples=model.corpus_count,\
                  epochs=model.epochs)
print('')

In [None]:
model.infer_vector(['hello', 'world', 'how', 'are', 'you'])

In [None]:
model.save(path_base + name_xml[:-4] + '-d2v-model-small')

### Test model

In [None]:
# Pick a random document from the corpus and infer a vector for it with the model
import random

doc_id = random.randint(0, len(corpus.names) - 1)
inferred_vector = model.infer_vector(corpus.doc_at(doc_id))
# Note: gensim 4.0 renamed `docvecs` to `dv`
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

print('Test Document ("{}" at {}): «{}...»\n'.format(
    corpus.names[doc_id], doc_id, ' '.join(corpus.doc_at(doc_id)[:10])))

# Show the ten most similar documents
for rank in range(10):
    print('Similarity rank {}: "{}" at {} with {}'.format(
        rank, corpus.names[sims[rank][0]], sims[rank][0], sims[rank][1]))
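
A simple sanity check on the ranking is to look up where the test document lands among its own most-similar results; a reasonably trained model would be expected to place it at or near rank 0. A minimal sketch, where `self_rank` is a hypothetical helper and `sims_demo` is made-up data in the `(tag, similarity)` format returned by `most_similar`:

```python
def self_rank(doc_id, sims):
    """Return the position of doc_id in a most_similar-style result list,
    or None if it does not appear."""
    for rank, (tag, _similarity) in enumerate(sims):
        if tag == doc_id:
            return rank
    return None


# Made-up example data: (tag, similarity) pairs sorted by similarity
sims_demo = [(7, 0.93), (2, 0.81), (5, 0.40)]
best = self_rank(7, sims_demo)
missing = self_rank(9, sims_demo)
```

With the real model, `self_rank(doc_id, sims)` would give the rank of the randomly chosen document.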

### Go full wikipedia

In [2]:
import wiki
from gensim.models.doc2vec import Doc2Vec

In [1]:
import wiki

path_base = '/Users/harangju/Developer/data/wiki/'
name_xml = 'enwiki-20190801-pages-articles-multistream.xml.bz2'
name_index = 'enwiki-20190801-pages-articles-multistream-index.txt.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
%time corpus = wiki.Corpus(path_xml, path_index)

Dump: Loading index...
Dump: Loaded.
CPU times: user 1min 11s, sys: 2.45 s, total: 1min 13s
Wall time: 1min 14s


In [2]:
model = Doc2Vec(dm=1, # PV-DM distributed memory
                vector_size=300,
                min_count=10,
                workers=4,
                epochs=3)

In [None]:
print('Building vocabulary...')
model.build_vocab(corpus)
model.save(path_base + 'models/' + name_xml[:-4] + '-d2v-model-full')

In [4]:
model = Doc2Vec.load(path_base + 'models/' + name_xml[:-4] + '-d2v-model-full')

In [None]:
print('\nTraining...')
for i in range(3):
    print('Epoch: ' + str(i))
    %time model.train(corpus,\
                      total_examples=model.corpus_count,\
                      epochs=1)
    print('')
    # Checkpoint after every epoch, to the same path the model was loaded from
    model.save(path_base + 'models/' + name_xml[:-4] + '-d2v-model-full')


Training...
Epoch: 0
Corpus index: 306606/19567244
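
Calling `train()` once per epoch, as in the loop above, makes checkpointing easy, but each call restarts the learning-rate decay from `model.alpha` unless `start_alpha` and `end_alpha` are passed explicitly. The sketch below shows one way to split a single linear decay across the calls. `alpha_schedule` is a hypothetical helper, and the values 0.025 and 0.0001 are assumed to match gensim's default `alpha` and `min_alpha`.

```python
def alpha_schedule(alpha, min_alpha, epochs):
    """Split one linear decay from alpha to min_alpha into per-epoch
    (start_alpha, end_alpha) pairs."""
    step = (alpha - min_alpha) / epochs
    return [(alpha - k * step, alpha - (k + 1) * step) for k in range(epochs)]


schedule = alpha_schedule(0.025, 0.0001, 3)

# Each pair could then be passed to one train() call, for example:
#   for start, end in schedule:
#       model.train(corpus, total_examples=model.corpus_count, epochs=1,
#                   start_alpha=start, end_alpha=end)
```

Each epoch then picks up the learning rate where the previous one stopped, which approximates a single `train()` call with `epochs=3`.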