### Custom WikiCorpus
* [tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)
* [Wiki Corpus processing](https://radimrehurek.com/gensim/corpora/wikicorpus.html)
* [doc](https://radimrehurek.com/gensim/models/doc2vec.html)
* [intro](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
* iterable/iterator [tutorial](https://www.programiz.com/python-programming/iterator)
* [3 epochs](https://stackoverflow.com/questions/46856838/how-many-epochs-should-word2vec-be-trained-what-is-a-recommended-training-datas)

In [138]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
import sys
%run wikidump.py

class WikiCorpus:
    def __init__(self, path_xml, path_index):
        self.dump = WikiDump(path_xml, path_index)
        self.names = list(self.dump.idx.keys())
    
    def __iter__(self):
        self.i = 0
        return self
    
    def __next__(self):
        if self.i < len(self.names):
            sys.stdout.write("\rCorpus index: " + str(self.i) + 
                             '/' + str(len(self.names)))
            sys.stdout.flush()
            self.i += 1
            return TaggedDocument(self.doc_at(self.i), [self.i])
        else:
            raise StopIteration
    
    def doc_at(self, index):
        doc = self.dump.load_page(self.names[index])
        return simple_preprocess(doc.strip_code())

**Test WikiCorpus**

In [139]:
path_base = '/Users/harangju/Developer/data/wiki/partition/'
name_xml = 'enwiki-20190720-pages-articles-multistream1.xml-p10p30302.bz2'
name_index = 'enwiki-20190720-pages-articles-multistream-index1.txt-p10p30302.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
corpus = WikiCorpus(path_xml, path_index)
i = iter(corpus)
td = next(i)
td.words[:7]

WikiDump: Loading index...
WikiDump: Loaded.
Corpus index: 0/19820

['the', 'tardis', 'time', 'and', 'relative', 'dimension', 'in']

### Train model

In [None]:
# small model
model = Doc2Vec(vector_size=50,
                min_count=10,
                epochs=1)

In [None]:
# big model
model = Doc2Vec(dm=1, # PV-DM distributed memory
                vector_size=300,
                min_count=10,
                workers=4,
                epochs=1)

In [128]:
print('Building vocabulary...')
model.build_vocab(corpus)
print('\nTraining...')
%time model.train(corpus,\
                  total_examples=model.corpus_count,\
                  epochs=model.epochs)
print('')


Training...
Index: 19819/19820CPU times: user 16min 3s, sys: 14.5 s, total: 16min 18s
Wall time: 20min 47s



In [129]:
model.infer_vector(['hello', 'world', 'how', 'are', 'you'])

array([ 0.00217859, -0.00407312, -0.0094516 ,  0.03701768,  0.00157459,
        0.01731165,  0.01991037,  0.02336471,  0.02469954, -0.03268813,
       -0.02404369,  0.0266294 ,  0.00915382, -0.01682209,  0.02136993,
        0.02303282, -0.00344342,  0.03854709, -0.0397099 ,  0.03860508,
       -0.00178146,  0.01096711, -0.01168419, -0.03181314,  0.01543378,
       -0.02217452, -0.02155361,  0.00148972, -0.01155852, -0.01389984,
       -0.02114898,  0.01661966,  0.02680798, -0.02321431, -0.01646995,
       -0.02106231, -0.06027954, -0.01270194, -0.04149376,  0.00120926,
       -0.03135706,  0.03331354, -0.02723764, -0.00970843,  0.00156369,
       -0.00028389, -0.02758574, -0.00632157, -0.03685561,  0.04988539],
      dtype=float32)

In [134]:
model.save(path_base + name_xml[:-4] + '-d2v-model')

### Test model

In [160]:
# Pick a random document from the test corpus and infer a vector from the model
import random

doc_id = random.randint(0, len(corpus.names) - 1)
inferred_vector = model.infer_vector(corpus.doc_at(doc_id))
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({} at {}): «{}»\n'.\
      format(corpus.names[doc_id], doc_id, ' '.join(corpus.doc_at(doc_id))))

index = 1
print('Second most similar ({} at {} with {}): «{}»'.\
     format(corpus.names[sims[index][0]], sims[index][0], sims[index][1],\
            ' '.join(corpus.doc_at(sims[index][0])) ))
# print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
# for label, index in [('2nd MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
#     print(u'%s %s %s: «%s»\n' %\
#           (label, corpus.names[sims[index][0]], sims[index], \
#            ' '.join(corpus.doc_at(sims[index][0])) ))

Test Document (Lisp (programming language) at 8213): «lisp historically lisp is family of computer programming languages with long history and distinctive fully parenthesized prefix notation originally specified in lisp is the second oldest high level programming language in widespread use today only fortran is older by one year lisp has changed since its early days and many dialects have existed over its history today the best known general purpose lisp dialects are clojure common lisp and scheme lisp was originally created as practical mathematical notation for computer programs influenced by the notation of alonzo church lambda calculus it quickly became the favored programming language for artificial intelligence ai research as one of the earliest programming languages lisp pioneered many ideas in computer science including tree data structures automatic storage management dynamic typing conditionals higher order functions recursion the self hosting compiler and the read eval print

Second most similar (Scheme (programming language) at 1425 with 0.9268268346786499): «scheme is programming language that supports multiple paradigms including functional and imperative programming it is one of the three main dialects of lisp alongside common lisp and clojure unlike common lisp scheme follows minimalist design philosophy specifying small standard core with powerful tools for language extension scheme was created during the at the mit ai lab and released by its developers guy steele and gerald jay sussman via series of memos now known as the lambda papers it was the first dialect of lisp to choose lexical scope and the first to require implementations to perform tail call optimization giving stronger support for functional programming and associated techniques such as recursive algorithms it was also one of the first programming languages to support first class continuations it had significant influence on the effort that led to the development of common lisp common lis