### Custom WikiCorpus
* [Doc2Vec tutorial (Lee corpus notebook)](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)
* [Wiki Corpus processing](https://radimrehurek.com/gensim/corpora/wikicorpus.html)
* [Doc2Vec API documentation](https://radimrehurek.com/gensim/models/doc2vec.html)
* [A gentle introduction to Doc2Vec](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
* [Python iterable/iterator tutorial](https://www.programiz.com/python-programming/iterator)
* [How many epochs to train (about 3)](https://stackoverflow.com/questions/46856838/how-many-epochs-should-word2vec-be-trained-what-is-a-recommended-training-datas)
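
Doc2Vec passes over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), so the corpus has to be a restartable iterable rather than a one-shot iterator. The sketch below illustrates the difference using only the standard library. `TaggedDoc` and `ListCorpus` are hypothetical stand-ins for gensim's `TaggedDocument` and the custom `WikiCorpus`; they are not the actual implementations.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (a namedtuple of words and tags).
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])


class ListCorpus:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, texts):
        self.texts = texts

    def __iter__(self):
        for doc_id, text in enumerate(self.texts):
            # Lowercase, split on whitespace, and tag each document with its index.
            yield TaggedDoc(words=text.lower().split(), tags=[doc_id])


texts = ['The TARDIS travels in time', 'King Minos ruled Crete']
corpus_sketch = ListCorpus(texts)

# An iterable can be traversed repeatedly (vocabulary pass, then training passes).
first_pass = [doc.tags[0] for doc in corpus_sketch]
second_pass = [doc.tags[0] for doc in corpus_sketch]

# A bare iterator, by contrast, is exhausted after a single pass.
one_shot = iter(corpus_sketch)
consumed = list(one_shot)
leftover = list(one_shot)  # empty: nothing left for a second pass
```

If a plain generator were handed to the model instead, the training pass would see no documents once the vocabulary pass had consumed it.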

**Test WikiCorpus**

In [13]:
from wiki.corpus import WikiCorpus

path_base = '/Users/harangju/Developer/data/wiki/partition/'
name_xml = 'enwiki-20190720-pages-articles-multistream1.xml-p10p30302.bz2'
name_index = 'enwiki-20190720-pages-articles-multistream-index1.txt-p10p30302.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
corpus = WikiCorpus(path_xml, path_index)
# corpus.names = corpus.names[:1000]
# Pull the first tagged document from the corpus and peek at its tokens
it = iter(corpus)
td = next(it)
td.words[:7]

WikiDump: Loading index...
WikiDump: Loaded.
Corpus index: 0/19820

['the', 'tardis', 'time', 'and', 'relative', 'dimension', 'in']

### Train small model

In [14]:
# small model
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=50,
                min_count=10,
                epochs=1)

In [15]:
print('Building vocabulary...')
model.build_vocab(corpus)
print('\nTraining...')
%time model.train(corpus,\
                  total_examples=model.corpus_count,\
                  epochs=model.epochs)
print('')

Building vocabulary...
Corpus index: 19819/19820
Training...
Corpus index: 19819/19820
CPU times: user 11min 56s, sys: 7.06 s, total: 12min 3s
Wall time: 10min 59s



In [16]:
model.infer_vector(['hello', 'world', 'how', 'are', 'you'])

array([ 0.02355263, -0.01539653,  0.01153189, -0.00314219,  0.03802452,
       -0.00558852,  0.02279632,  0.00790861, -0.01366444,  0.01767311,
        0.00768275,  0.00399759,  0.0083572 , -0.01757905,  0.00783303,
       -0.03318266,  0.02648101,  0.00241776,  0.00269744,  0.00681946,
       -0.0240631 , -0.03282417, -0.03265889, -0.01961087,  0.03083852,
       -0.00756616,  0.02801166,  0.01585421, -0.02515791, -0.0201935 ,
        0.01929022,  0.00165877,  0.02372515, -0.00631725, -0.02663536,
        0.00820262, -0.00113813,  0.01115231, -0.00416908, -0.01063516,
       -0.00582522, -0.00242083, -0.02801578, -0.02043372,  0.01470257,
       -0.00280134, -0.02949255,  0.01079922, -0.00094088, -0.01385028],
      dtype=float32)
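
Inferred vectors like the one above are typically compared by cosine similarity, which is also the score `most_similar` reports. A minimal sketch with the standard library, using made-up three-dimensional vectors rather than real model output; `cosine_similarity` is a helper name introduced here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError('vectors must have the same length')
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only (not model output).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [-2.0, 1.0, 0.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1: same direction
sim_ac = cosine_similarity(vec_a, vec_c)  # close to 0: orthogonal
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector.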

In [17]:
model.save(path_base + name_xml[:-4] + '-d2v-model-small')

### Test model

In [43]:
# Pick a random document from the corpus (the same one the model was trained on)
# and infer a vector for it
import random

doc_id = random.randint(0, len(corpus.names) - 1)
inferred_vector = model.infer_vector(corpus.doc_at(doc_id))
# Rank every trained document vector by similarity to the inferred vector
# (gensim 4 renames model.docvecs to model.dv)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

print('Test Document ("{}" at {}): «{}...»\n'.format(
    corpus.names[doc_id], doc_id, ' '.join(corpus.doc_at(doc_id)[:10])))

# Show the ten most similar documents
for rank in range(10):
    print('Similarity rank {}: {} at {} with {}'.format(
        rank, corpus.names[sims[rank][0]], sims[rank][0], sims[rank][1]))

Test Document ("Minos" at 6401): «px thumb gustave doré illustration of king minos for dante...»

Similarity rank 0: Hades at 11420 with 0.8840001821517944
Similarity rank 1: Minos at 6401 with 0.883558988571167
Similarity rank 2: Cerberus at 15734 with 0.8781974911689758
Similarity rank 3: Hephaestus at 10646 with 0.8533622026443481
Similarity rank 4: Hera at 11431 with 0.8531811833381653
Similarity rank 5: Freyr at 12862 with 0.8461430668830872
Similarity rank 6: Heracles at 11034 with 0.8367626070976257
Similarity rank 7: Fenrir at 12910 with 0.8352528810501099
Similarity rank 8: Norns at 5796 with 0.835107147693634
Similarity rank 9: Freyja at 12861 with 0.8343520164489746
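
In the output above, the query document "Minos" comes back at rank 1 rather than rank 0. A common sanity check (something along these lines appears in the gensim Doc2Vec tutorial linked at the top) is to look up the rank at which a document retrieves itself. A minimal sketch, assuming `sims` is a list of `(doc_index, similarity)` pairs sorted by descending similarity as in the cell above; `self_rank` is a hypothetical helper, and the sample list copies the first three results.

```python
def self_rank(sims, doc_id):
    """Return the position of doc_id in a similarity-sorted list, or None if absent."""
    for rank, (candidate_id, _similarity) in enumerate(sims):
        if candidate_id == doc_id:
            return rank
    return None


# First three results from the output above: (document index, similarity).
sample_sims = [
    (11420, 0.8840001821517944),  # Hades
    (6401, 0.883558988571167),    # Minos
    (15734, 0.8781974911689758),  # Cerberus
]

minos_rank = self_rank(sample_sims, 6401)    # position of the query document itself
missing_rank = self_rank(sample_sims, 42)    # an index that is not in the list
```

Running this over many documents and counting how often the rank is 0 gives a rough measure of how consistently the model maps a document back to itself.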


### Train on full Wikipedia

In [None]:
from wikicorpus import WikiCorpus

path_base = '/Users/harangju/Developer/data/wiki/partition/'
name_xml = 'enwiki-20190720-pages-articles-multistream1.xml-p10p30302.bz2'
name_index = 'enwiki-20190720-pages-articles-multistream-index1.txt-p10p30302.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
corpus = WikiCorpus(path_xml, path_index)

In [None]:
# big model
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(dm=1,  # PV-DM (distributed memory)
                vector_size=300,
                min_count=10,
                workers=4,
                epochs=1)

In [None]:
print('Building vocabulary...')
model.build_vocab(corpus)
print('\nTraining...')
%time model.train(corpus,\
                  total_examples=model.corpus_count,\
                  epochs=model.epochs)
print('')

In [None]:
model.save(path_base + name_xml[:-4] + '-d2v-model-full')