### Doc2Vec

* [tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb)
* [Wiki Corpus processing](https://radimrehurek.com/gensim/corpora/wikicorpus.html)
* [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)

### Custom WikiCorpus

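gensim streams the corpus through Python's iterable/iterator protocol ([tutorial](https://www.programiz.com/python-programming/iterator)). `WikiCorpus` below yields one `TaggedDocument` per Wikipedia page.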

In [5]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
%run wikidump.py

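class WikiCorpus:
    """Stream a Wikipedia dump as TaggedDocuments, one page at a time."""

    def __init__(self, path_xml, path_index):
        self.dump = WikiDump(path_xml, path_index)
        self.names = list(self.dump.idx.keys())  # page names from the index
        self.i = 0  # cursor into self.names, so next() works before iter()

    def __iter__(self):
        # Restart from the first page so the corpus can be re-read each epoch.
        self.i = 0
        return self

    def __next__(self):
        if self.i >= len(self.names):
            raise StopIteration
        page = self.dump.load_page(self.names[self.i])
        tokens = simple_preprocess(page.strip_code())  # lowercased word tokens
        doc = TaggedDocument(tokens, [self.i])  # tag = position in self.names
        self.i += 1
        return doc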
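
gensim reads the corpus once to build the vocabulary and again on every training epoch, so the corpus has to be restartable rather than a one-shot iterator. The class above handles this by resetting its cursor in `__iter__`. A minimal sketch of an alternative layout follows, in which `__iter__` is a generator so that every pass gets its own cursor. `ToyCorpus`, `Tagged`, and the whitespace tokenizer are hypothetical stand-ins for `WikiCorpus`, `TaggedDocument`, and `simple_preprocess`, so the cell runs without gensim or the dump.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (same two fields).
Tagged = namedtuple('Tagged', ['words', 'tags'])


class ToyCorpus:
    """Restartable iterable: each iter() call starts a fresh pass."""

    def __init__(self, texts):
        self.texts = list(texts)

    def __len__(self):
        return len(self.texts)

    def __iter__(self):
        # Generator-based __iter__: no cursor is stored on the instance,
        # so repeated or nested passes cannot interfere with each other.
        for i, text in enumerate(self.texts):
            yield Tagged(text.lower().split(), [i])


toy = ToyCorpus(['Anarchism is a political philosophy',
                 'Autism is a developmental condition'])
first_pass = [doc.tags[0] for doc in toy]
second_pass = [doc.tags[0] for doc in toy]  # second pass restarts from page 0
print(first_pass, second_pass)
```

Both passes visit every document, which is the property `build_vocab` and `train` rely on.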

**Test WikiCorpus**

In [None]:
path_base = '/Users/harangju/Developer/data/wiki/'
name_xml = 'enwiki-20190801-pages-articles-multistream.xml.bz2'
name_index = 'enwiki-20190801-pages-articles-multistream.txt.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
i = iter(WikiCorpus(path_xml, path_index))

WikiDump: Loading index...


In [None]:
i.names[:20]

In [None]:
# next(i)
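
`next(i)` pulls one document at a time. To peek at the first few documents of a lazily generated corpus without reading the whole dump, `itertools.islice` is one option. A small sketch, where `fake_corpus` is a hypothetical generator standing in for `WikiCorpus`:

```python
from itertools import islice


def fake_corpus(n_pages):
    """Hypothetical stand-in for WikiCorpus: lazily yields (tokens, tags)."""
    for i in range(n_pages):
        yield (['page', str(i)], [i])


# islice stops after three items, so the remaining pages are never generated.
preview = list(islice(fake_corpus(1_000_000), 3))
for tokens, tags in preview:
    print(tags, tokens)
```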

### Train model

In [None]:
model = Doc2Vec(dm=1,             # PV-DM (distributed memory)
                vector_size=300,  # dimensionality; TODO: pick at the knee
                min_count=10,     # TODO: find references for this cutoff
                workers=4,
                epochs=30)

In [None]:
corpus = WikiCorpus(path_xml, path_index)

In [None]:
%%time
model.build_vocab(corpus,  # positional: the keyword name differs across gensim versions
                  progress_per=1000,
                  keep_raw_vocab=False)

In [None]:
%%time
model.train(corpus,  # positional: the keyword name differs across gensim versions
            total_examples=len(corpus.names),
            # total_words=something,
            epochs=model.epochs,
            word_count=0,
            report_delay=10)

In [None]:
model.infer_vector(['hello', 'world', 'how', 'are', 'you'])

In [None]:
model.save(path_base + name_xml[:-4] + '-d2v-model')

### Test model

In [None]:
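# Pages most similar to an inferred vector (`model.dv` in gensim >= 4).
model.docvecs.most_similar([model.infer_vector(['hello', 'world'])], topn=5)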

In [None]:
model.wv.similarity('hello', 'world')  # cosine similarity of two word vectors
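
Similarity scores between Doc2Vec vectors are cosine similarities. A minimal sketch of that computation on plain Python lists follows. The vectors `a`, `b`, and `c` are made-up toy values, not model output, and `cosine_similarity` is a hypothetical helper rather than a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, so similarity is close to 1
c = [-3.0, 0.0, 1.0]  # orthogonal to a, so similarity is close to 0
sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

Because the score depends only on direction, scaling a document vector does not change its ranking among neighbours.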