# Training Doc2Vec on scientific articles

This notebook replicates the **Document Embedding with Paragraph Vectors** paper, http://arxiv.org/abs/1507.07998.

In that paper, the authors only showed results from the DBOW ("distributed bag of words") mode of the Paragraph Vector algorithm, also known as Doc2Vec. Here we adapt the experiment to a dataset of scientific articles: we train several DBOW models and tag each document with its `fos` (field-of-study) labels instead of a unique document ID. The DM ("distributed memory") mode is not trained in this notebook.

## Basic setup

Let's import the necessary modules and set up logging. The code below assumes Python 3.7+ and Gensim 4.0+.

In [1]:
import logging
import multiprocessing
from pprint import pprint

import smart_open
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


## Preparing the corpus

In [3]:
COLUMNS_TO_DROP = ['year', 'n_citation', 'references', 'authors', 'venue', 'lang', 'page_start', 'page_end', 'volume',
                   'issue', 'issn', 'isbn', 'doi', 'pdf', 'url']
RANDOM_STATE = 47
NUM_PARTS = 8

def get_text_data(file_path):
    """Load one JSON part, keep complete rows only, and merge title and abstract into a `text` column."""
    data = pd.read_json(file_path, dtype={'title': 'string', 'abstract': 'string'})
    data = data.drop(columns=COLUMNS_TO_DROP)
    # Mark empty strings as missing so that dropna() below removes those rows.
    data['abstract'] = data['abstract'].replace('', np.nan)
    data['title'] = data['title'].replace('', np.nan)
    # Series.replace([], ...) treats [] as an empty list of values to replace and changes nothing,
    # so empty tag lists are marked as missing explicitly.
    data['fos'] = data['fos'].apply(lambda tags: tags if isinstance(tags, list) and len(tags) > 0 else np.nan)
    data = data.dropna(subset=['keywords', 'abstract', 'title', 'fos'])
    data['text'] = data[['title', 'abstract']].apply(lambda row: '  '.join(row.astype(str)), axis=1).astype('string')
    return data.drop(columns=['title', 'abstract'])

In [4]:
articles = pd.concat(get_text_data(f'data/part_{i+1}.json') for i in range(NUM_PARTS))
articles.reset_index(drop=True, inplace=True)
articles.to_json('articles.json')
articles.to_csv('articles.csv')
articles = pd.read_json('articles.json')

articles.info()

2022-10-27 13:34:13,515 : INFO : Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-10-27 13:34:13,516 : INFO : NumExpr defaulting to 8 threads.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1824173 entries, 0 to 1824172
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   _id       object
 1   keywords  object
 2   fos       object
 3   text      object
dtypes: object(4)
memory usage: 69.6+ MB


In [5]:
test_articles = get_text_data('data/part_10.json')
test_articles.reset_index(drop=True, inplace=True)
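# Use a separate file name so the training set saved in 'articles.json' is not overwritten.
test_articles.to_json('test_articles.json')
test_articles = pd.read_json('test_articles.json')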

test_articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 258469 entries, 0 to 258468
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   _id       258469 non-null  object
 1   keywords  258469 non-null  object
 2   fos       258469 non-null  object
 3   text      258469 non-null  object
dtypes: object(4)
memory usage: 9.9+ MB


In [6]:
test_articles.head()

| | _id | keywords | fos | text |
|---|---|---|---|---|
| 0 | 558332610cf2485614700b00 | [adaptive control, control system synthesis, h... | [Adaptability, Simulation, Terrain, Systems de... | An adaptive mobility system for the disabled ... |
| 1 | 558331150cf2320d1b9975fe | [integrated services digital networks] | [Signal processing, Intersymbol interference, ... | Modeling and Analysis of Error Probability Per... |
| 2 | 558332640cf2320d1b99763a | [computational complexity, computational geome... | [Motion planning, Discrete mathematics, Polyhe... | Interference detection between non-convex poly... |
| 3 | 558331180cf2485614700ac3 | [mobile robots, position control, climbing rob... | [Robot control, Computer vision, Simulation, B... | A climbing robot with continuous motion A wal... |
| 4 | 558331250cf2485614700ac4 | [electric sensing devices, feedback, mobile ro... | [Teleoperation, Robot control, Simulation, Sup... | Stabilization of a mobile robot climbing stair... |


## Normalize data

In [7]:
%%time

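def normalize(sentence):
    """Lowercase the items of a tag list, or Porter-stem every token of a string."""
    # Imported inside the function to keep it self-contained for the pandarallel worker processes.
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    if isinstance(sentence, list):
        # Tag lists ('keywords', 'fos') are only lowercased, not stemmed.
        return [word.lower() for word in sentence]
    # Strings are stemmed token by token. applymap visits every cell, so this also touches '_id'.
    return ' '.join(porter.stem(word) for word in sentence.split())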

articles = articles.parallel_applymap(normalize)

CPU times: user 59.4 s, sys: 24.4 s, total: 1min 23s
Wall time: 7min 5s


## Topic selection

In [8]:
# Count how many articles carry each field-of-study ('fos') tag.
tag_freq = Counter()
for index, doc in articles.iterrows():
    tag_freq.update(doc['fos'])

tag_freq.most_common(50)

[('computer science', 1127637),
 ('mathematics', 413745),
 ('artificial intelligence', 341173),
 ('algorithm', 165940),
 ('computer network', 120221),
 ('computer vision', 117480),
 ('discrete mathematics', 114166),
 ('engineering', 110202),
 ('distributed computing', 104010),
 ('mathematical optimization', 99134),
 ('combinatorics', 96472),
 ('theoretical computer science', 93219),
 ('pattern recognition', 86478),
 ('data mining', 85147),
 ('control theory', 75038),
 ('world wide web', 72621),
 ('machine learning', 67116),
 ('programming language', 62127),
 ('information retrieval', 55339),
 ('multimedia', 54431),
 ('software', 53105),
 ('computer security', 52954),
 ('knowledge management', 50674),
 ('software engineering', 49672),
 ('human–computer interaction', 48856),
 ('parallel computing', 48698),
 ('electronic engineering', 47390),
 ('the internet', 46992),
 ('real-time computing', 44455),
 ('natural language processing', 44397),
 ('artificial neural network', 42081),
 ('embedd

In [9]:
# Same count for the author-supplied keywords, for comparison with the 'fos' tags.
tag_freq_kw = Counter()
for index, doc in articles.iterrows():
    tag_freq_kw.update(doc['keywords'])

tag_freq_kw.most_common(50)

[('data mining', 44934),
 ('computer science', 33771),
 ('internet', 33656),
 ('real time', 30102),
 ('satisfiability', 26700),
 ('feature extraction', 25331),
 ('computational complexity', 24608),
 ('neural network', 22856),
 ('indexing terms', 22481),
 ('protocols', 21203),
 ('computational modeling', 20720),
 ('algorithm design and analysis', 20581),
 ('algorithms', 20108),
 ('information retrieval', 19584),
 ('mathematical model', 19269),
 ('quality of service', 19064),
 ('optimization', 18836),
 ('computer architecture', 17998),
 ('indexation', 17877),
 ('genetic algorithm', 17351),
 ('software engineering', 17316),
 ('testing', 17162),
 ('wireless sensor networks', 17078),
 ('machine learning', 17042),
 ('hardware', 16453),
 ('scheduling', 16304),
 ('image segmentation', 16014),
 ('robustness', 15982),
 ('bandwidth', 15756),
 ('application software', 15547),
 ('security', 14338),
 ('computer vision', 14031),
 ('image processing', 13973),
 ('real time systems', 13919),
 ('informat

In [10]:
len(tag_freq)

130119

In [20]:
most_freq_tags = {tag:tag_freq[tag] for tag in tag_freq if tag_freq[tag] > 99}
len(most_freq_tags)

14869
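To make the effect of the frequency cut-off concrete, here is a minimal sketch on made-up data. The names `toy_fos`, `toy_freq` and `tags_above` are hypothetical and only mirror the `tag_freq[tag] > 99` filter above; the real corpus is not touched.

```python
from collections import Counter

# Toy stand-in for the 'fos' column: one list of tags per document (made-up data).
toy_fos = [
    ["computer science", "algorithm"],
    ["computer science", "algorithm", "combinatorics"],
    ["computer science", "control theory"],
    ["algorithm", "combinatorics"],
]

# Count how many documents carry each tag.
toy_freq = Counter(tag for tags in toy_fos for tag in tags)

def tags_above(freq, min_freq):
    """Return the tags whose count is strictly greater than min_freq."""
    return {tag for tag, count in freq.items() if count > min_freq}

for min_freq in (0, 1, 2):
    print(min_freq, sorted(tags_above(toy_freq, min_freq)))
# 0 ['algorithm', 'combinatorics', 'computer science', 'control theory']
# 1 ['algorithm', 'combinatorics', 'computer science']
# 2 ['algorithm', 'computer science']
```

Raising the threshold shrinks the tag set, which is the same trade-off explored below with thresholds of 99, 399, 1499 and 2999.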

In [23]:
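class TaggedCorpus:
    """Restartable stream of TaggedDocument objects built from the articles dataframe.

    Relies on the global `tag_freq` counter computed above.
    """

    def __init__(self, dataframe, min_freq=99):
        self.df = dataframe
        self.min_freq = min_freq

    def __iter__(self):
        for index, row in self.df.iterrows():
            # Keep only tags that occur more than min_freq times in the whole corpus.
            kws = {kw for kw in row['fos'] if tag_freq[kw] > self.min_freq}
            # 'computer science' labels most documents, so it carries little information.
            kws.discard('computer science')
            # Skip documents left with fewer than two tags.
            if len(kws) < 2:
                continue
            yield TaggedDocument(words=row['text'].split(), tags=list(kws))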
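Why a class with `__iter__` rather than a plain generator? Gensim reads the corpus once to build the vocabulary and then once per training epoch, so the corpus has to be restartable. The sketch below illustrates the difference without Gensim; `ToyTaggedDocument`, `one_shot` and `Restartable` are hypothetical stand-ins, not part of the notebook's pipeline.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument, so the sketch runs without Gensim.
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

ROWS = [
    ("deep learning for vision", ["computer vision", "machine learning"]),
    ("graph colouring bounds", ["combinatorics", "graph theory"]),
]

def one_shot(rows):
    """Generator: it can be consumed exactly once."""
    for text, tags in rows:
        yield ToyTaggedDocument(text.split(), tags)

class Restartable:
    """Iterable: every call to iter() starts a fresh pass over the rows."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for text, tags in self.rows:
            yield ToyTaggedDocument(text.split(), tags)

gen = one_shot(ROWS)
first_pass, second_pass = len(list(gen)), len(list(gen))

corpus = Restartable(ROWS)
passes = [len(list(corpus)) for _ in range(3)]

print(first_pass, second_pass, passes)  # 2 0 [2, 2, 2]
```

The generator is empty on its second pass, whereas the iterable class yields every document each time, which is what multi-epoch training needs.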

In [31]:
documents399 = TaggedCorpus(articles, min_freq=399)
documents99 = TaggedCorpus(articles, min_freq=99)
documents1499 = TaggedCorpus(articles, min_freq=1499)
#documents = [TaggedDocument(row['text'], [row['_id']]) for index, row in articles.iterrows()]

In [25]:
# Load and print the first preprocessed document, as a sanity check = "input eyeballing".
first_doc = next(iter(documents99))
print(first_doc.tags, ': ', first_doc.words)

['hydrology', 'irrigation', 'environmental science', 'canopy', 'moisture', 'agronomy', 'water content', 'soil water'] :  ['the', 'relationship', 'between', 'canopi', 'paramet', 'and', 'spectrum', 'of', 'winter', 'wheat', 'under', 'differ', 'irrig', 'in', 'hebei', 'province.', 'drought', 'is', 'the', 'first', 'place', 'in', 'all', 'the', 'natur', 'disast', 'in', 'the', 'world.', 'It', 'is', 'especi', 'seriou', 'in', 'north', 'china', 'plain.', 'In', 'thi', 'paper,', 'differ', 'soil', 'water', 'content', 'control', 'level', 'at', 'winter', 'wheat', 'growth', 'stage', 'are', 'perform', 'on', 'gucheng', 'ecological-meteorolog', 'integr', 'observ', 'experi', 'station', 'of', 'cams,', 'china.', 'some', 'canopi', 'parameters,', 'includ', 'growth', 'conditions,', 'dri', 'weight,', 'physiolog', 'paramet', 'and', 'hyperspectr', 'reflectance,', 'are', 'measur', 'from', 'erect', 'stage', 'to', 'milk', 'stage', 'for', 'winter', 'wheat', 'in', '2009.', 'the', 'relationship', 'between', 'canopi', 'pa

In [26]:
# Load and print the first preprocessed document, as a sanity check = "input eyeballing".
first_doc = next(iter(documents1499))
print(first_doc.tags, ': ', first_doc.words)

['algorithm', 'shortest path problem', 'statistics', 'sequential logic', 'monte carlo method'] :  ['time', 'yield', 'estim', 'use', 'statist', 'static', 'time', 'analysi', 'As', 'process', 'variat', 'becom', 'a', 'signific', 'problem', 'in', 'deep', 'sub-micron', 'technology,', 'a', 'shift', 'from', 'determinist', 'static', 'time', 'analysi', 'to', 'statist', 'static', 'time', 'analysi', 'for', 'high-perform', 'circuit', 'design', 'could', 'reduc', 'the', 'excess', 'conservat', 'that', 'is', 'built', 'into', 'current', 'time', 'design', 'methods.', 'We', 'address', 'the', 'time', 'yield', 'problem', 'for', 'sequenti', 'circuit', 'and', 'propos', 'a', 'statist', 'approach', 'to', 'handl', 'it.', 'We', 'consid', 'the', 'spatial', 'and', 'path', 'reconverg', 'correl', 'between', 'path', 'delays,', 'set-up', 'time', 'and', 'hold', 'time', 'constraints,', 'and', 'clock', 'skew', 'due', 'to', 'process', 'variations.', 'We', 'propos', 'a', 'method', 'to', 'get', 'the', 'time', 'yield', 'base'

The documents look reasonable, so let's move on to training some Doc2Vec models.

## Training Doc2Vec

The original paper had a vocabulary size of 915,715 word types, so we'll try to match it by setting `max_final_vocab` to 1,000,000 in the `Doc2Vec` constructor.

Other critical parameters were left unspecified in the paper, so we'll go with a window size of eight (a prediction window of 8 tokens to either side). It looks like the authors tried vector dimensionalities of 100, 300, 1,000 and 10,000 in the paper (with 10k dimensions performing best), but I'll only train with 200 dimensions here, to keep the RAM in check on my laptop.
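For a rough feel of why dimensionality matters for RAM, here is a back-of-the-envelope sketch. It assumes float32 weights and two word-level matrices of shape vocabulary × dimensions (the word vectors and the negative-sampling output layer), and it ignores the much smaller tag vectors and the vocabulary dictionary. The helper `weight_matrix_bytes` is hypothetical, and the vocabulary size is the `max_final_vocab` cap, not a measured value.

```python
BYTES_PER_FLOAT32 = 4

def weight_matrix_bytes(vocab_size, vector_size, n_matrices=2):
    """Approximate size of the word-level weight matrices, in bytes (assumes float32)."""
    return vocab_size * vector_size * BYTES_PER_FLOAT32 * n_matrices

VOCAB_CAP = 1_000_000  # upper bound set through max_final_vocab

for dims in (100, 200, 300, 1_000, 10_000):
    gib = weight_matrix_bytes(VOCAB_CAP, dims) / 2**30
    print(f"{dims:>6} dims: ~{gib:.1f} GiB")
```

Under these assumptions the estimate grows linearly with the number of dimensions, so 10,000 dimensions would need about 50 times the memory of 200.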

Feel free to tinker with these values yourself if you like:

In [27]:
workers = multiprocessing.cpu_count() - 2  # leave two cores for the OS & other stuff


# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow99 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)

# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow399 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)

# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow1499 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)


2022-10-27 14:08:57,006 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-27T14:08:56.973874', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-27 14:08:57,007 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-27T14:08:57.007875', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-27 14:08:57,008 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-27T14:08:57.008944', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}


In [28]:
workers

30

In [32]:
model_dbow99.build_vocab(documents99, progress_per=100000)
print(model_dbow99)
print('---------------------------------------')

model_dbow399.build_vocab(documents399, progress_per=100000)
print(model_dbow399)
print('---------------------------------------')

model_dbow1499.build_vocab(documents1499, progress_per=100000)
print(model_dbow1499)

# Save some time by copying the vocabulary structures from the DBOW model to the DM model.
# Both models are built on top of exactly the same data, so there's no need to repeat the vocab-building step.
#model_dm.reset_from(model_dbow)
#print(model_dm)

2022-10-27 14:15:00,203 : INFO : collecting all words and their counts
2022-10-27 14:15:00,205 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2022-10-27 14:15:12,819 : INFO : PROGRESS: at example #100000, processed 13844549 words (1097663 words/s), 381145 word types, 5708 tags
2022-10-27 14:15:25,343 : INFO : PROGRESS: at example #200000, processed 27386108 words (1081320 words/s), 627224 word types, 5708 tags
2022-10-27 14:15:37,949 : INFO : PROGRESS: at example #300000, processed 41843358 words (1146934 words/s), 840040 word types, 5708 tags
2022-10-27 14:15:50,623 : INFO : PROGRESS: at example #400000, processed 56246304 words (1136493 words/s), 1030567 word types, 5708 tags
2022-10-27 14:16:03,363 : INFO : PROGRESS: at example #500000, processed 70637102 words (1129619 words/s), 1209311 word types, 5708 tags
2022-10-27 14:16:16,090 : INFO : PROGRESS: at example #600000, processed 85030601 words (1131102 words/s), 1377326 word types, 5708 tags


Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
---------------------------------------


2022-10-27 14:19:13,736 : INFO : PROGRESS: at example #100000, processed 13843077 words (1105020 words/s), 379074 word types, 1727 tags
2022-10-27 14:19:26,326 : INFO : PROGRESS: at example #200000, processed 27463509 words (1081920 words/s), 624307 word types, 1727 tags
2022-10-27 14:19:39,135 : INFO : PROGRESS: at example #300000, processed 41922887 words (1128905 words/s), 835516 word types, 1727 tags
2022-10-27 14:19:51,890 : INFO : PROGRESS: at example #400000, processed 56307686 words (1127874 words/s), 1024516 word types, 1727 tags
2022-10-27 14:20:04,634 : INFO : PROGRESS: at example #500000, processed 70683572 words (1128112 words/s), 1201732 word types, 1727 tags
2022-10-27 14:20:17,543 : INFO : PROGRESS: at example #600000, processed 85086720 words (1115877 words/s), 1368075 word types, 1727 tags
2022-10-27 14:20:30,546 : INFO : PROGRESS: at example #700000, processed 99588058 words (1115303 words/s), 1528559 word types, 1727 tags
2022-10-27 14:20:43,307 : INFO : PROGRESS: a

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>


In [33]:
# Train DBOW doc2vec 399
# Report progress every 10 min.
model_dbow399.train(documents399, total_examples=model_dbow399.corpus_count, epochs=model_dbow399.epochs, report_delay=10*60)

2022-10-27 14:23:00,569 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 403737 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-27T14:23:00.568981', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-27 14:23:02,540 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 4157 words/s, in_qsize 60, out_qsize 0
2022-10-27 14:33:02,541 : INFO : EPOCH 0 - PROGRESS: at 45.20% examples, 158004 words/s, in_qsize 59, out_qsize 0
2022-10-27 14:43:02,572 : INFO : EPOCH 0 - PROGRESS: at 89.91% examples, 157977 words/s, in_qsize 59, out_qsize 0
2022-10-27 14:45:19,124 : INFO : EPOCH 0: training on 258838695 raw words (211614412 effective words) took 1338.5s, 158096 effective words/s
2022-10-27 14:45:21,046 : INFO : EPOCH 1 - PROGRESS: at 0.00% examples, 4217 words/s, in_qsize 59, out_qsiz

In [34]:
# Train DBOW doc2vec 99
# Report progress every 10 min.
model_dbow99.train(documents99, total_examples=model_dbow99.corpus_count, epochs=model_dbow99.epochs, report_delay=10*60)

2022-10-27 18:50:03,613 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 405281 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-27T18:50:03.612981', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-27 18:50:05,539 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 4262 words/s, in_qsize 59, out_qsize 0
2022-10-27 19:00:05,584 : INFO : EPOCH 0 - PROGRESS: at 42.82% examples, 151284 words/s, in_qsize 59, out_qsize 0
2022-10-27 19:10:05,660 : INFO : EPOCH 0 - PROGRESS: at 85.27% examples, 151350 words/s, in_qsize 59, out_qsize 0
2022-10-27 19:13:36,152 : INFO : EPOCH 0: training on 259498691 raw words (213950041 effective words) took 1412.5s, 151466 effective words/s
2022-10-27 19:13:38,132 : INFO : EPOCH 1 - PROGRESS: at 0.00% examples, 4191 words/s, in_qsize 60, out_qsiz

In [35]:
# Train DBOW doc2vec 1499
# Report progress every 10 min.
model_dbow1499.train(documents1499, total_examples=model_dbow1499.corpus_count, epochs=model_dbow1499.epochs, report_delay=10*60)

2022-10-27 23:32:11,464 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 397201 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-27T23:32:11.464899', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-27 23:32:13,237 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 4482 words/s, in_qsize 60, out_qsize 0
2022-10-27 23:42:13,270 : INFO : EPOCH 0 - PROGRESS: at 49.85% examples, 169377 words/s, in_qsize 59, out_qsize 0
2022-10-27 23:52:13,304 : INFO : EPOCH 0 - PROGRESS: at 98.94% examples, 169435 words/s, in_qsize 59, out_qsize 0
2022-10-27 23:52:25,964 : INFO : EPOCH 0: training on 255365971 raw words (205897723 effective words) took 1214.5s, 169535 effective words/s
2022-10-27 23:52:27,542 : INFO : EPOCH 1 - PROGRESS: at 0.00% examples, 5048 words/s, in_qsize 59, out_qsiz
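The epoch-0 lines above give enough numbers for a quick sanity check. As far as I understand the Gensim logs, "effective words" are the tokens that remain after rare words are dropped and very frequent words are downsampled. The sketch below computes that share and projects the total training time, assuming every epoch takes about as long as epoch 0. The helper `summarize` is hypothetical; the figures are copied from the logs.

```python
EPOCHS = 12  # value passed to each Doc2Vec constructor above

# (raw words, effective words, seconds) for epoch 0, copied from the training logs.
epoch0 = {
    "dbow399": (258_838_695, 211_614_412, 1338.5),
    "dbow99": (259_498_691, 213_950_041, 1412.5),
    "dbow1499": (255_365_971, 205_897_723, 1214.5),
}

def summarize(raw, effective, seconds, epochs=EPOCHS):
    """Return the share of raw words actually trained on and the projected hours for all epochs."""
    kept = effective / raw
    projected_hours = seconds * epochs / 3600
    return kept, projected_hours

for name, stats in epoch0.items():
    kept, hours = summarize(*stats)
    print(f"{name}: {kept:.1%} of raw words kept, ~{hours:.1f} h for {EPOCHS} epochs")
```

Each model comes out at roughly four to five hours, which is in line with the gaps between the start timestamps of the consecutive training runs.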

In [36]:
documents2999 = TaggedCorpus(articles, min_freq=2999)
# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow2999 = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=12, workers=workers, max_final_vocab=1000000,
)
model_dbow2999.build_vocab(documents2999, progress_per=100000)
print(model_dbow2999)

2022-10-28 03:35:16,887 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>', 'datetime': '2022-10-28T03:35:16.887666', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'created'}
2022-10-28 03:35:16,888 : INFO : collecting all words and their counts
2022-10-28 03:35:16,889 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2022-10-28 03:35:29,665 : INFO : PROGRESS: at example #100000, processed 13832088 words (1082698 words/s), 373187 word types, 789 tags
2022-10-28 03:35:42,461 : INFO : PROGRESS: at example #200000, processed 27545553 words (1071778 words/s), 614343 word types, 789 tags
2022-10-28 03:35:55,412 : INFO : PROGRESS: at example #300000, processed 42001821 words (1116280 words/s), 822074 word types, 789 tags
2022-10-28 03:36:08,248 : INFO : PROGRESS: at example #400000, processed 56376101 words (1119

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>


In [37]:
# Train DBOW doc2vec 2999
# Report progress every 10 min.
model_dbow2999.train(documents2999, total_examples=model_dbow2999.corpus_count, epochs=model_dbow2999.epochs, report_delay=10*60)

2022-10-28 03:39:11,107 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 30 workers on 384266 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-10-28T03:39:11.107801', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'train'}
2022-10-28 03:39:12,883 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 4400 words/s, in_qsize 59, out_qsize 0
2022-10-28 03:49:13,060 : INFO : EPOCH 0 - PROGRESS: at 54.01% examples, 176917 words/s, in_qsize 60, out_qsize 0
2022-10-28 03:57:53,274 : INFO : EPOCH 0: training on 248396459 raw words (198492524 effective words) took 1122.2s, 176885 effective words/s
2022-10-28 03:57:54,902 : INFO : EPOCH 1 - PROGRESS: at 0.00% examples, 4892 words/s, in_qsize 60, out_qsize 0
2022-10-28 04:07:54,906 : INFO : EPOCH 1 - PROGRESS: at 53.97% examples, 176870 words/s, in_qsize 59, out_qsiz

## Finding similar documents

In [141]:
models = [model_dbow99, model_dbow399, model_dbow1499, model_dbow2999]

In [39]:
for model in models:
    print(model)
    pprint(model.dv.most_similar(positive=["reinforcement learning"], topn=15))
    

Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
[('q-learning', 0.9321299195289612),
 ('temporal difference learning', 0.8832590579986572),
 ('reinforcement learning algorithm', 0.8621841669082642),
 ('error-driven learning', 0.8326339721679688),
 ('learning classifier system', 0.8043680191040039),
 ('action selection', 0.773815929889679),
 ('reinforcement', 0.7714371085166931),
 ('learning agent', 0.7500354051589966),
 ('markov decision process', 0.728828489780426),
 ('multiagent learning', 0.7118377089500427),
 ('robot learning', 0.6870455741882324),
 ('partially observable markov decision process', 0.6772484183311462),
 ('sequential decision', 0.6734935641288757),
 ('bellman equation', 0.6446964144706726),
 ('instance-based learning', 0.6378455758094788)]
Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t30>
[('q-learning', 0.9268630743026733),
 ('temporal difference learning', 0.8876643776893616),
 ('error-driven learning', 0.8351960778236389),
 ('learning classifier system', 0.7971616387367249),
 (
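The scores returned by `most_similar` are cosine similarities between the query tag's vector and every other tag vector. The sketch below reproduces that ranking in plain Python on made-up three-dimensional vectors; `cosine`, `toy_most_similar` and `toy_vectors` are hypothetical names, and the numbers have nothing to do with the trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(query, vectors, topn=2):
    """Rank every tag except the query by cosine similarity to the query's vector."""
    scores = [(tag, cosine(vectors[query], vec)) for tag, vec in vectors.items() if tag != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, chosen so that related tags point in similar directions.
toy_vectors = {
    "reinforcement learning": [1.0, 0.9, 0.0],
    "q-learning": [0.9, 1.0, 0.1],
    "markov decision process": [0.6, 0.4, 0.5],
    "image segmentation": [0.0, 0.1, 1.0],
}

for tag, score in toy_most_similar("reinforcement learning", toy_vectors):
    print(f"{tag}: {score:.3f}")
```

A score close to 1 means the two tag vectors point in almost the same direction, which is how 'q-learning' ends up at the top of the real lists above.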

## Model Saving



In [53]:
model_dbow399.save('doc2vec_dbow399.model')
model_dbow99.save('doc2vec_dbow99.model')
model_dbow1499.save('doc2vec_dbow1499.model')
model_dbow2999.save('doc2vec_dbow2999.model')

2022-10-28 13:59:44,698 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dbow399.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-10-28T13:59:44.698903', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'saving'}
2022-10-28 13:59:44,700 : INFO : storing np array 'vectors' to doc2vec_dbow399.model.wv.vectors.npy
2022-10-28 13:59:44,879 : INFO : storing np array 'syn1neg' to doc2vec_dbow399.model.syn1neg.npy
2022-10-28 13:59:45,049 : INFO : not storing attribute cum_table
2022-10-28 13:59:45,258 : INFO : saved doc2vec_dbow399.model
2022-10-28 13:59:45,259 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dbow99.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-10-28T13:59:45.259126', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'pla

In [137]:
model_dbow99 = Doc2Vec.load('doc2vec_dbow99.model')
model_dbow399 = Doc2Vec.load('doc2vec_dbow399.model')
model_dbow1499 = Doc2Vec.load('doc2vec_dbow1499.model')
model_dbow2999 = Doc2Vec.load('doc2vec_dbow2999.model')


2022-10-28 16:07:06,678 : INFO : loading Doc2Vec object from doc2vec_dbow99.model
2022-10-28 16:07:06,831 : INFO : loading dv recursively from doc2vec_dbow99.model.dv.* with mmap=None
2022-10-28 16:07:06,832 : INFO : loading wv recursively from doc2vec_dbow99.model.wv.* with mmap=None
2022-10-28 16:07:06,833 : INFO : loading vectors from doc2vec_dbow99.model.wv.vectors.npy with mmap=None
2022-10-28 16:07:06,906 : INFO : loading syn1neg from doc2vec_dbow99.model.syn1neg.npy with mmap=None
2022-10-28 16:07:06,978 : INFO : setting ignored attribute cum_table to None
2022-10-28 16:07:10,138 : INFO : Doc2Vec lifecycle event {'fname': 'doc2vec_dbow99.model', 'datetime': '2022-10-28T16:07:10.138734', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.11.0-34-generic-x86_64-with-glibc2.10', 'event': 'loaded'}
2022-10-28 16:07:10,139 : INFO : loading Doc2Vec object from doc2vec_dbow399.model
2022-10-28 16:07:10,270 : INFO : loading dv recur

In [222]:
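from sklearn.base import BaseEstimator

class Model(BaseEstimator):
    """Ensemble of trained Doc2Vec models that predicts tags for a piece of text."""

    def __init__(self, models, thresholds=None, repeat=5):
        self.models = models
        self.repeat = repeat
        # thresholds[i] is the number of nearest tags taken from models[i] (3 by default).
        if thresholds is None:
            self.thresholds = [3] * len(models)
        else:
            self.thresholds = thresholds

    def fit(self, X, y=None):
        # The Doc2Vec models are already trained, so there is nothing to fit.
        return self

    def predict(self, text):
        # Inference can give slightly different vectors on each call,
        # so it is repeated and the resulting tags are pooled.
        infer = set()
        for _ in range(self.repeat):
            for idx, model in enumerate(self.models):
                doc_vector = model.infer_vector(text.split())
                tags = {tag for tag, _ in
                    model.dv.most_similar([doc_vector], topn=self.thresholds[idx])}
                infer.update(tags)
        return sorted(infer)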
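The pooling logic in `predict` can be tried out without any trained model. In the sketch below, `StubModel` is a hypothetical stand-in that returns a slightly shuffled ranking on every call, imitating the run-to-run variation of vector inference; `union_of_top_tags` mirrors the nested loop above. None of these names belong to Gensim or scikit-learn.

```python
import random

class StubModel:
    """Hypothetical stand-in for a trained model: returns a noisy ranking of tags."""

    def __init__(self, ranked_tags, seed):
        self.ranked_tags = ranked_tags
        self.rng = random.Random(seed)

    def top_tags(self, topn):
        # Swap two neighbouring tags at random to imitate inference noise.
        tags = list(self.ranked_tags)
        i = self.rng.randrange(len(tags) - 1)
        tags[i], tags[i + 1] = tags[i + 1], tags[i]
        return tags[:topn]

def union_of_top_tags(stub_models, topn=3, repeat=5):
    """Pool the top-n tags of every model over several noisy runs."""
    found = set()
    for _ in range(repeat):
        for model in stub_models:
            found.update(model.top_tags(topn))
    return sorted(found)

RANKING_A = ["q-learning", "temporal difference learning", "markov decision process",
             "robot learning", "bellman equation"]
RANKING_B = ["q-learning", "markov decision process", "action selection",
             "learning agent", "robot learning"]

stub_models = [StubModel(RANKING_A, seed=1), StubModel(RANKING_B, seed=2)]
print(union_of_top_tags(stub_models))
```

Because the result is a union, a tag needs to reach the top of a single run to be reported, so more repeats or larger thresholds can only add tags and never remove them.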


In [None]:
estimator = Model(models=models)

In [None]:
import dill
with open("doc2vec_model_v2.pkl", "wb") as fp:
    dill.dump(estimator, fp)

## Examples

In [231]:
import random

# Pick 20 distinct test articles at random and compare predicted tags with the original ones.
indexes = random.sample(range(1, 10000), 20)
for idx in indexes:
    text = test_articles.text[idx]
    print(text + '\n')
    print('Predicted tags:\n')
    print('\n'.join(estimator.predict(text)) + '\n')
    print('Original fos:\n')
    print('\n'.join(topic.lower() for topic in test_articles.fos[idx]))
    print('\n----------------------------------------------------------------\n')

600 bps voice digitizer  This paper presents an analysis/synthesis method whereby speech may be transmitted at 600 bps, a data rate which is less than 1 percent of the PCM transmission rate for original speech sounds. This R&D effort was motivated by the pressing need for very-low-data rate (VLDR) voice digitizers to meet some of the current military voice communication requirements. The use of a VLDR voice digitizer makes it possible to transmit speech signals over adverse channels which support data rates of only a few hundred bps, or to transmit speech signals over more favorable channels with redundancies for error protection and other useful applications. The 600 bps synthesized speech loses some of its original speech quality, but the intelligibility is sufficiently high to permit the use of the system in certain specialized military applications. One of the most attractive features of the 600 bps voice digitizer is that it is a simple extension of the 2400 bps linear predictive 

applied mathematics
combinatorics
covariance
finite set
inverse
linear form
linear map
matrix (mathematics)
orthonormal basis
rational function
state-transition matrix
unit circle

Original fos:

autoregressive model
matrix (mathematics)
algorithm
stochastic process
white noise
transfer function
estimation theory
state space
mathematics
autocorrelation

----------------------------------------------------------------

Video synthesis of arbitrary views for approximately planar scenes  In this paper, we propose a method to synthesize arbitrary views of a planar scene, given a monocular video sequence. The method is based on the availability of knowledge of the angle between the original and synthesized views. Such a method has many impor- tant applications, one of them being gait recognition. Gait recog- nition algorithms rely on the availability of an approximate side- view of the person. From a realistic viewpoint, such an assumption is impractical in surveillance applications and it 

adaptive filter
digital filter
filter (signal processing)
filter design
interpolation
kernel adaptive filter
linear filter
recursive filter
smoothing

Original fos:

raised-cosine filter
digital filter
root-raised-cosine filter
mathematical analysis
control theory
sinc filter
low-pass filter
kernel adaptive filter
adaptive filter
mathematics
filter design

----------------------------------------------------------------

Efficient Detection Technique for Multiple Packet Collisions in OFDM Systems  Whenever there is a collision involving several transmitted packets, the traditional approach for MAC (Medium Access Control) protocols is to discard all packets involved and request their retransmission, with a consequent loss in the overall throughput. Based on conventional Multiple Input Multiple Output (MIMO) techniques, in this paper we propose a multipacket detector for OFDM schemes (Orthogonal Frequency Division Multiplexing) that allows an efficient packet separation in the presence o

To continue your Doc2Vec explorations, refer to the official API documentation in Gensim: https://radimrehurek.com/gensim/models/doc2vec.html