# Creating a Doc2Vec vector representation

Import the modules.

In [1]:
import re
import nltk
import pandas as pd
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maninaya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maninaya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Read the airline tweets sentiment dataset, which contains comments (text) related to airlines and their corresponding sentiment. The dataset can be obtained from https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv

In [3]:
data = pd.read_csv('https://www.dropbox.com/s/8yq0edd4q908xqw/airline_sentiment.csv?dl=1')

A sample of the dataset looks as follows:

In [4]:
data.head()

| Unnamed: 0 | airline_sentiment | text |
|---|---|---|
| 0 | 1 | @VirginAmerica plus you've added commercials t... |
| 1 | 0 | @VirginAmerica it's really aggressive to blast... |
| 2 | 0 | @VirginAmerica and it's a really big bad thing... |
| 3 | 0 | @VirginAmerica seriously would pay $30 a fligh... |
| 4 | 1 | @VirginAmerica yes, nearly every time I fly VX... |


To build the document vectors, I followed these steps:
* Preprocess the input sentences to remove punctuation.
* Lowercase all words.
* Remove the stop words (words that occur very frequently and add little context to a sentence, for example, "and" and "the").
* Tag each sentence with a sentence ID, so that every sentence has its own identifier.

In [5]:
stop = set(stopwords.words('english'))
def preprocess(text):
    text=text.lower()
    text=re.sub('[^0-9a-zA-Z]+',' ',text)
    words = text.split()
    words2 = [word for word in words if word not in stop]
    words3=' '.join(words2)
    return(words3)
data['text'] = data['text'].apply(preprocess)
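To see what each preprocessing step does, here is a minimal sketch that applies the same lowercase, regex, and stop-word logic to a single tweet. The stop-word set `demo_stop` and the function name `preprocess_demo` are hypothetical stand-ins so the example runs without the NLTK data; the notebook itself uses the full NLTK English list.

```python
import re

# Hypothetical, tiny stop-word set so the example runs without NLTK data;
# the notebook uses stopwords.words('english') instead.
demo_stop = {"it", "s", "to", "a", "and", "the"}

def preprocess_demo(text):
    text = text.lower()                        # lowercase everything
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)  # runs of non-alphanumerics become one space
    words = [w for w in text.split() if w not in demo_stop]  # drop stop words
    return ' '.join(words)

print(preprocess_demo("@VirginAmerica it's really aggressive to blast"))
```

Note that the regex splits "it's" into "it" and "s", so both fragments have to be in the stop-word list for the contraction to disappear.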

After preprocessing, the dataset looks as follows:

In [6]:
data.head()

| Unnamed: 0 | airline_sentiment | text |
|---|---|---|
| 0 | 1 | virginamerica plus added commercials experienc... |
| 1 | 0 | virginamerica really aggressive blast obnoxiou... |
| 2 | 0 | virginamerica really big bad thing |
| 3 | 0 | virginamerica seriously would pay 30 flight se... |
| 4 | 1 | virginamerica yes nearly every time fly vx ear... |


Create a list of tagged documents, where each document ID is generated alongside its text (tweet):

In [7]:
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data['text'])]
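The shape of each tagged document can be illustrated without gensim. In gensim, `TaggedDocument` is a named tuple with a `words` field and a `tags` field; the sketch below uses a stand-in named `TaggedDoc` (a hypothetical name) and plain `str.split()` in place of `word_tokenize`, which is a simplification.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (a namedtuple with `words` and `tags`),
# so the sketch runs without gensim installed.
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])

# Hypothetical two-tweet sample taken from the preprocessed text above.
sample_texts = ['united thanks sent', 'virginamerica really big bad thing']

# Same pattern as the notebook: the enumeration index, as a string, is the tag.
tagged_demo = [TaggedDoc(words=t.split(), tags=[str(i)])
               for i, t in enumerate(sample_texts)]

for doc in tagged_demo:
    print(doc.tags, doc.words)
```

Because the tags are strings, later lookups by tag use `'457'` rather than the integer `457`.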

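Initialize a model with the following parameters. In the code snippet below:
* size represents the vector size of the document (gensim 4.0 renamed this parameter to vector_size),
* alpha represents the initial learning rate,
* min_alpha represents the value the learning rate decays to during training,
* min_count represents the minimum frequency for a word to be considered,
* and dm = 1 selects the distributed memory model (PV-DM).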
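To make the relationship between alpha and min_alpha concrete, the following sketch interpolates the learning rate linearly between the two values used in this notebook (0.025 and 0.00025). The function `lr_at` is a hypothetical helper, and treating progress as a single fraction from 0 to 1 is an assumption; gensim adjusts the rate at a finer granularity internally.

```python
def lr_at(progress, alpha=0.025, min_alpha=0.00025):
    """Linearly interpolated learning rate; progress runs from 0.0 to 1.0."""
    return alpha - (alpha - min_alpha) * progress

# Learning rate at the start, midpoint, and end of training.
for p in (0.0, 0.5, 1.0):
    print(p, lr_at(p))
```

The rate starts at alpha and ends at min_alpha, so later passes over the data make progressively smaller updates.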

In [8]:
max_epochs = 100
vec_size = 300
alpha = 0.025
model = Doc2Vec(size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=30, dm =1)

  "C extension not loaded, training will be slow. "


Build a vocabulary

In [9]:
model.build_vocab(tagged_data)

2019-06-24 16:37:12,030 : INFO : collecting all words and their counts
2019-06-24 16:37:12,031 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-06-24 16:37:12,121 : INFO : PROGRESS: at example #10000, processed 108546 words (1236706/s), 11586 word types, 10000 tags
2019-06-24 16:37:12,135 : INFO : collected 12533 word types and 11541 unique tags from a corpus of 11541 examples and 125959 words
2019-06-24 16:37:12,138 : INFO : Loading a fresh vocabulary
2019-06-24 16:37:12,148 : INFO : effective_min_count=30 retains 669 unique words (5% of original 12533, drops 11864)
2019-06-24 16:37:12,150 : INFO : effective_min_count=30 leaves 90277 word corpus (71% of original 125959, drops 35682)
2019-06-24 16:37:12,157 : INFO : deleting the raw counts dictionary of 12533 items
2019-06-24 16:37:12,158 : INFO : sample=0.001 downsamples 78 most-common words
2019-06-24 16:37:12,159 : INFO : downsampling leaves estimated 68196 word corpus (75.5% of prior 90277)
2019-
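The log above reports that min_count=30 keeps only 669 of 12533 word types. A minimal sketch of that kind of frequency pruning, using a hypothetical helper `prune_vocab` and a three-document toy corpus rather than the airline data, could look like this:

```python
from collections import Counter

def prune_vocab(tokenized_docs, min_count):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    kept = {w for w, c in counts.items() if c >= min_count}
    return counts, kept

# Hypothetical toy corpus; the real notebook passes 11541 tagged tweets.
docs = [['united', 'thanks', 'sent'],
        ['united', 'sent', 'thanks'],
        ['united', 'late', 'flight']]

counts, kept = prune_vocab(docs, min_count=2)
print(sorted(kept))
```

Words below the threshold are dropped from the vocabulary entirely, which is why a high min_count on short tweets removes most word types while still retaining the bulk of the running text.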

Train the model on the tagged data for a large number of epochs (100 here).

In [10]:
model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)

2019-06-24 16:37:12,515 : INFO : training model with 3 workers on 669 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-06-24 16:37:21,645 : INFO : EPOCH 1 - PROGRESS: at 7.96% examples, 575 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:37:30,996 : INFO : EPOCH 1 - PROGRESS: at 31.28% examples, 1160 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:37:40,389 : INFO : EPOCH 1 - PROGRESS: at 56.57% examples, 1335 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:37:50,221 : INFO : EPOCH 1 - PROGRESS: at 79.61% examples, 1426 words/s, in_qsize 3, out_qsize 0
2019-06-24 16:37:50,493 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 16:37:50,637 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 16:37:52,128 : INFO : EPOCH 1 - PROGRESS: at 100.00% examples, 1720 words/s, in_qsize 0, out_qsize 1
2019-06-24 16:37:52,129 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 16:37:52,130 :

2019-06-24 16:42:07,117 : INFO : EPOCH 8 - PROGRESS: at 31.28% examples, 1182 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:42:16,380 : INFO : EPOCH 8 - PROGRESS: at 56.57% examples, 1359 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:42:17,480 : INFO : EPOCH 8 - PROGRESS: at 64.33% examples, 1498 words/s, in_qsize 5, out_qsize 0
2019-06-24 16:42:26,188 : INFO : EPOCH 8 - PROGRESS: at 79.61% examples, 1445 words/s, in_qsize 3, out_qsize 0
2019-06-24 16:42:27,164 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 16:42:27,733 : INFO : EPOCH 8 - PROGRESS: at 95.47% examples, 1673 words/s, in_qsize 1, out_qsize 1
2019-06-24 16:42:27,735 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 16:42:28,705 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 16:42:28,707 : INFO : EPOCH - 8 : training on 125959 raw words (68189 effective words) took 39.7s, 1716 effective words/s
2019-06-24 16:42:37,658 : INFO : EPOCH 

2019-06-24 16:46:19,672 : INFO : EPOCH 14 - PROGRESS: at 100.00% examples, 1729 words/s, in_qsize 0, out_qsize 1
2019-06-24 16:46:19,674 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 16:46:19,675 : INFO : EPOCH - 14 : training on 125959 raw words (68270 effective words) took 39.5s, 1729 effective words/s
2019-06-24 16:46:28,810 : INFO : EPOCH 15 - PROGRESS: at 7.96% examples, 578 words/s, in_qsize 5, out_qsize 0
2019-06-24 16:46:38,024 : INFO : EPOCH 15 - PROGRESS: at 31.28% examples, 1167 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:46:46,478 : INFO : EPOCH 15 - PROGRESS: at 56.57% examples, 1390 words/s, in_qsize 5, out_qsize 0
2019-06-24 16:46:48,129 : INFO : EPOCH 15 - PROGRESS: at 71.81% examples, 1696 words/s, in_qsize 4, out_qsize 0
2019-06-24 16:46:55,318 : INFO : EPOCH 15 - PROGRESS: at 79.61% examples, 1510 words/s, in_qsize 3, out_qsize 0
2019-06-24 16:46:56,215 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 16

2019-06-24 16:50:37,125 : INFO : EPOCH 21 - PROGRESS: at 71.81% examples, 1628 words/s, in_qsize 4, out_qsize 0
2019-06-24 16:50:44,557 : INFO : EPOCH 21 - PROGRESS: at 79.61% examples, 1452 words/s, in_qsize 3, out_qsize 0
2019-06-24 16:50:45,755 : INFO : EPOCH 21 - PROGRESS: at 87.72% examples, 1551 words/s, in_qsize 2, out_qsize 1
2019-06-24 16:50:45,758 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 16:50:46,759 : INFO : EPOCH 21 - PROGRESS: at 95.47% examples, 1653 words/s, in_qsize 1, out_qsize 1
2019-06-24 16:50:46,761 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 16:50:47,524 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 16:50:47,526 : INFO : EPOCH - 21 : training on 125959 raw words (68298 effective words) took 40.1s, 1705 effective words/s
2019-06-24 16:50:56,744 : INFO : EPOCH 22 - PROGRESS: at 7.96% examples, 572 words/s, in_qsize 5, out_qsize 0
2019-06-24 16:51:06,001 : INFO : EP

2019-06-24 16:55:12,972 : INFO : EPOCH 28 - PROGRESS: at 79.61% examples, 1542 words/s, in_qsize 3, out_qsize 0
2019-06-24 16:55:13,514 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 16:55:13,935 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 16:55:15,285 : INFO : EPOCH 28 - PROGRESS: at 100.00% examples, 1833 words/s, in_qsize 0, out_qsize 1
2019-06-24 16:55:15,287 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 16:55:15,290 : INFO : EPOCH - 28 : training on 125959 raw words (68115 effective words) took 37.2s, 1833 effective words/s
2019-06-24 16:55:24,344 : INFO : EPOCH 29 - PROGRESS: at 7.96% examples, 584 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:55:33,040 : INFO : EPOCH 29 - PROGRESS: at 31.28% examples, 1208 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:55:41,784 : INFO : EPOCH 29 - PROGRESS: at 56.57% examples, 1406 words/s, in_qsize 6, out_qsize 0
2019-06-24 16:55:43,037 : INFO : E

2019-06-24 16:59:43,185 : INFO : EPOCH 35 - PROGRESS: at 79.92% examples, 1575 words/s, in_qsize 3, out_qsize 0
2019-06-24 16:59:43,543 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 16:59:43,926 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 16:59:45,126 : INFO : EPOCH 35 - PROGRESS: at 100.00% examples, 1892 words/s, in_qsize 0, out_qsize 1
2019-06-24 16:59:45,128 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 16:59:45,129 : INFO : EPOCH - 35 : training on 125959 raw words (68206 effective words) took 36.0s, 1892 effective words/s
2019-06-24 16:59:54,204 : INFO : EPOCH 36 - PROGRESS: at 7.96% examples, 580 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:00:03,143 : INFO : EPOCH 36 - PROGRESS: at 31.28% examples, 1194 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:00:11,918 : INFO : EPOCH 36 - PROGRESS: at 56.57% examples, 1394 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:00:12,972 : INFO : E

2019-06-24 17:04:21,442 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:04:22,245 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:04:22,246 : INFO : EPOCH - 42 : training on 125959 raw words (68231 effective words) took 40.0s, 1708 effective words/s
2019-06-24 17:04:31,769 : INFO : EPOCH 43 - PROGRESS: at 7.96% examples, 556 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:04:40,513 : INFO : EPOCH 43 - PROGRESS: at 31.28% examples, 1173 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:04:48,058 : INFO : EPOCH 43 - PROGRESS: at 56.57% examples, 1440 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:04:49,315 : INFO : EPOCH 43 - PROGRESS: at 71.81% examples, 1779 words/s, in_qsize 4, out_qsize 0
2019-06-24 17:04:55,134 : INFO : EPOCH 43 - PROGRESS: at 79.61% examples, 1632 words/s, in_qsize 3, out_qsize 0
2019-06-24 17:04:56,660 : INFO : EPOCH 43 - PROGRESS: at 87.72% examples, 1720 words/s, in_qsize 2, out_qsize 1
2019-06-24 17:

2019-06-24 17:09:30,771 : INFO : EPOCH 50 - PROGRESS: at 31.28% examples, 1136 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:09:40,106 : INFO : EPOCH 50 - PROGRESS: at 56.57% examples, 1321 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:09:41,118 : INFO : EPOCH 50 - PROGRESS: at 64.33% examples, 1463 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:09:49,815 : INFO : EPOCH 50 - PROGRESS: at 79.61% examples, 1420 words/s, in_qsize 3, out_qsize 0
2019-06-24 17:09:50,489 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 17:09:51,117 : INFO : EPOCH 50 - PROGRESS: at 95.47% examples, 1654 words/s, in_qsize 1, out_qsize 1
2019-06-24 17:09:51,121 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:09:52,276 : INFO : EPOCH 50 - PROGRESS: at 100.00% examples, 1689 words/s, in_qsize 0, out_qsize 1
2019-06-24 17:09:52,279 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:09:52,280 : INFO : EPOCH - 50 : training

2019-06-24 17:13:59,135 : INFO : EPOCH 57 - PROGRESS: at 31.28% examples, 1170 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:14:08,058 : INFO : EPOCH 57 - PROGRESS: at 56.57% examples, 1369 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:14:09,282 : INFO : EPOCH 57 - PROGRESS: at 71.81% examples, 1699 words/s, in_qsize 4, out_qsize 0
2019-06-24 17:14:17,548 : INFO : EPOCH 57 - PROGRESS: at 79.61% examples, 1468 words/s, in_qsize 3, out_qsize 0
2019-06-24 17:14:17,663 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 17:14:18,496 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:14:19,777 : INFO : EPOCH 57 - PROGRESS: at 100.00% examples, 1753 words/s, in_qsize 0, out_qsize 1
2019-06-24 17:14:19,777 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:14:19,778 : INFO : EPOCH - 57 : training on 125959 raw words (68211 effective words) took 38.9s, 1753 effective words/s
2019-06-24 17:14:28,470 : INFO :

2019-06-24 17:18:48,592 : INFO : EPOCH 64 - PROGRESS: at 87.72% examples, 1571 words/s, in_qsize 2, out_qsize 1
2019-06-24 17:18:48,593 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 17:18:48,660 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:18:49,907 : INFO : EPOCH 64 - PROGRESS: at 100.00% examples, 1744 words/s, in_qsize 0, out_qsize 1
2019-06-24 17:18:49,909 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:18:49,909 : INFO : EPOCH - 64 : training on 125959 raw words (68250 effective words) took 39.1s, 1744 effective words/s
2019-06-24 17:18:58,907 : INFO : EPOCH 65 - PROGRESS: at 7.80% examples, 602 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:19:07,962 : INFO : EPOCH 65 - PROGRESS: at 31.28% examples, 1190 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:19:09,047 : INFO : EPOCH 65 - PROGRESS: at 48.07% examples, 1675 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:19:17,086 : INFO : E

2019-06-24 17:22:16,982 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:22:17,114 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:22:17,115 : INFO : EPOCH - 70 : training on 125959 raw words (68248 effective words) took 30.0s, 2277 effective words/s
2019-06-24 17:22:23,174 : INFO : EPOCH 71 - PROGRESS: at 7.80% examples, 890 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:22:24,619 : INFO : EPOCH 71 - PROGRESS: at 23.65% examples, 2145 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:22:29,555 : INFO : EPOCH 71 - PROGRESS: at 31.28% examples, 1713 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:22:30,595 : INFO : EPOCH 71 - PROGRESS: at 39.40% examples, 1980 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:22:32,205 : INFO : EPOCH 71 - PROGRESS: at 48.07% examples, 2119 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:22:35,410 : INFO : EPOCH 71 - PROGRESS: at 56.57% examples, 2035 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:

2019-06-24 17:25:10,074 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:25:10,623 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:25:10,624 : INFO : EPOCH - 76 : training on 125959 raw words (68236 effective words) took 27.5s, 2483 effective words/s
2019-06-24 17:25:16,476 : INFO : EPOCH 77 - PROGRESS: at 7.96% examples, 901 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:25:18,422 : INFO : EPOCH 77 - PROGRESS: at 23.65% examples, 2070 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:25:22,583 : INFO : EPOCH 77 - PROGRESS: at 31.28% examples, 1788 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:25:25,770 : INFO : EPOCH 77 - PROGRESS: at 48.07% examples, 2119 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:25:28,717 : INFO : EPOCH 77 - PROGRESS: at 56.57% examples, 2064 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:25:29,783 : INFO : EPOCH 77 - PROGRESS: at 64.33% examples, 2234 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:

2019-06-24 17:27:57,414 : INFO : EPOCH - 82 : training on 125959 raw words (68225 effective words) took 28.2s, 2423 effective words/s
2019-06-24 17:28:03,314 : INFO : EPOCH 83 - PROGRESS: at 7.96% examples, 890 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:28:05,045 : INFO : EPOCH 83 - PROGRESS: at 23.65% examples, 2119 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:28:09,202 : INFO : EPOCH 83 - PROGRESS: at 31.28% examples, 1816 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:28:12,543 : INFO : EPOCH 83 - PROGRESS: at 48.07% examples, 2116 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:28:14,892 : INFO : EPOCH 83 - PROGRESS: at 56.57% examples, 2131 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:28:16,195 : INFO : EPOCH 83 - PROGRESS: at 64.33% examples, 2274 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:28:20,103 : INFO : EPOCH 83 - PROGRESS: at 71.81% examples, 2127 words/s, in_qsize 4, out_qsize 0
2019-06-24 17:28:21,333 : INFO : EPOCH 83 - PROGRESS: at 79.61% examples, 2252 words

2019-06-24 17:30:57,951 : INFO : EPOCH 89 - PROGRESS: at 31.77% examples, 1766 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:30:59,989 : INFO : EPOCH 89 - PROGRESS: at 48.07% examples, 2251 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:31:03,674 : INFO : EPOCH 89 - PROGRESS: at 56.57% examples, 2078 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:31:07,369 : INFO : EPOCH 89 - PROGRESS: at 71.81% examples, 2233 words/s, in_qsize 4, out_qsize 0
2019-06-24 17:31:10,134 : INFO : EPOCH 89 - PROGRESS: at 79.61% examples, 2206 words/s, in_qsize 3, out_qsize 0
2019-06-24 17:31:10,628 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 17:31:12,725 : INFO : EPOCH 89 - PROGRESS: at 92.25% examples, 2321 words/s, in_qsize 1, out_qsize 1
2019-06-24 17:31:12,726 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:31:12,741 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:31:12,742 : INFO : EPOCH - 89 : training 

2019-06-24 17:33:48,177 : INFO : EPOCH 95 - PROGRESS: at 31.28% examples, 1708 words/s, in_qsize 6, out_qsize 0
2019-06-24 17:33:51,085 : INFO : EPOCH 95 - PROGRESS: at 48.07% examples, 2074 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:33:54,439 : INFO : EPOCH 95 - PROGRESS: at 56.57% examples, 1982 words/s, in_qsize 5, out_qsize 0
2019-06-24 17:33:59,226 : INFO : EPOCH 95 - PROGRESS: at 71.81% examples, 2045 words/s, in_qsize 4, out_qsize 0
2019-06-24 17:34:01,552 : INFO : EPOCH 95 - PROGRESS: at 79.61% examples, 2076 words/s, in_qsize 3, out_qsize 0
2019-06-24 17:34:02,236 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-24 17:34:04,872 : INFO : EPOCH 95 - PROGRESS: at 92.25% examples, 2142 words/s, in_qsize 1, out_qsize 1
2019-06-24 17:34:04,873 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-24 17:34:04,971 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-24 17:34:04,972 : INFO : EPOCH - 95 : training 

2019-06-24 17:38:36,248 : INFO : EPOCH - 100 : training on 125959 raw words (68367 effective words) took 80.7s, 847 effective words/s
2019-06-24 17:38:36,250 : INFO : training on a 12595900 raw words (6818518 effective words) took 3684.2s, 1851 effective words/s
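The totals in the final log line can be cross-checked against the per-epoch figures with a few lines of arithmetic. All numbers below are copied from the log output; the variable names are illustrative.

```python
raw_words_per_epoch = 125959   # "training on 125959 raw words" per epoch
epochs = 100
effective_total = 6818518      # effective words over the whole run
seconds = 3684.2               # total wall-clock time reported

total_raw = raw_words_per_epoch * epochs
print(total_raw)                            # raw words across all epochs
print(round(effective_total / seconds))     # effective words per second
print(round(seconds / 60, 1))               # total training time in minutes
```

The run took roughly an hour, which is consistent with the "C extension not loaded, training will be slow" warning shown when the model was created.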


The training process generates vectors for the words as well as for each document/paragraph ID.

Word vectors can be fetched as follows (model.wv['wife'] is the equivalent, non-deprecated form):

In [11]:
model['wife']

array([ 6.83363795e-01,  2.18556076e-01, -9.93281156e-02,  5.00977933e-01,
       -6.10099316e-01, -1.19297791e+00, -6.19753838e-01,  7.92771459e-01,
       -1.05113737e-01,  7.42656350e-01,  2.21066445e-01, -2.21875653e-01,
       -8.93157125e-02,  4.61845219e-01,  6.18790984e-01, -4.35925182e-03,
       -3.24997529e-02, -2.29817450e-01,  1.14406109e+00,  5.12469172e-01,
       -2.49639094e-01,  3.57245088e-01,  7.74364650e-01, -6.96844637e-01,
       -5.38131773e-01,  3.54839973e-02,  5.67080975e-01,  4.48563427e-01,
        1.83258921e-01, -4.79078926e-02,  2.34056741e-01,  5.83579063e-01,
       -2.33898222e-01,  3.37909490e-01, -5.54619543e-02,  5.78056216e-01,
       -8.57927442e-01,  3.00665982e-02, -3.65196437e-01, -8.23443174e-01,
       -7.23384367e-03,  7.39861310e-01, -6.21722460e-01,  9.41931084e-03,
        4.19006437e-01,  3.06935668e-01, -9.89769697e-01,  3.16349924e-01,
        1.40945539e-01,  4.35875922e-01, -7.24571645e-01,  3.10387105e-01,
       -6.24643207e-01,  

Document vectors can be fetched as follows:

In [12]:
model.docvecs[0]

array([-0.03598474, -0.00157027, -0.50850284,  0.11235338,  0.11084609,
       -0.283903  ,  0.05399168, -0.05833881, -0.00111604,  0.02569002,
       -0.25047883,  0.28985623, -0.22419295, -0.19107272, -0.08686312,
       -0.37746522,  0.08554612, -0.10853971, -0.10020198,  0.32795116,
       -0.19635145,  0.03694262,  0.02288098, -0.26752353, -0.06135087,
        0.03989259,  0.09079652,  0.15367931, -0.21049754, -0.09268438,
        0.17893073, -0.22629526,  0.07564218, -0.17506745,  0.1999165 ,
        0.03930481,  0.09206168,  0.17865096,  0.03661555, -0.16433404,
        0.16147546,  0.02905   ,  0.23946117, -0.1059329 , -0.11226641,
       -0.05089262, -0.27166086, -0.10594206, -0.32562187,  0.13954674,
        0.3673068 ,  0.1481816 ,  0.08556935,  0.03873833, -0.1373017 ,
        0.0636354 ,  0.05293239,  0.1463187 ,  0.12918085, -0.10092092,
       -0.13858603, -0.05592032, -0.41131228, -0.10259422, -0.0445575 ,
       -0.23184916, -0.2157029 ,  0.19692037,  0.3673854 , -0.24

Extract the documents most similar to a given document ID:

In [13]:
similar_doc = model.docvecs.most_similar('457')
print(similar_doc)

2019-06-24 17:38:36,732 : INFO : precomputing L2-norms of doc weight vectors


[('827', 0.9801170229911804), ('5076', 0.9537193179130554), ('2394', 0.9295538663864136), ('3900', 0.8774361610412598), ('5079', 0.8678953647613525), ('4903', 0.8643197417259216), ('6652', 0.8594534397125244), ('2773', 0.7951229810714722), ('1766', 0.7837889194488525), ('691', 0.7489890456199646)]


The preceding code snippet retrieves the documents most similar to document ID 457. The closest match is document 827, with a similarity score of about 0.98.
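The scores returned by most_similar are cosine similarities between document vectors. The sketch below reimplements that ranking in plain Python on hypothetical two-dimensional vectors; the function names `cosine` and `rank_similar` and the toy data are illustrative and are not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(query_tag, vectors, topn=3):
    """Rank every other tag by cosine similarity to the query tag."""
    q = vectors[query_tag]
    scores = [(tag, cosine(q, v)) for tag, v in vectors.items() if tag != query_tag]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors keyed by string tags, as in the notebook.
toy = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0]}
print(rank_similar('a', toy))
```

A score near 1 means the two vectors point in almost the same direction, regardless of their lengths.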

Let's look at the text of documents 457 and 827:

In [14]:
data['text'][457]

'united thanks sent'

In [15]:
data['text'][827]

'united sent thanks'
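The two tweets contain the same three tokens in a different order, which fits the high similarity score. A quick check, with the two strings copied from the outputs above:

```python
doc_457 = 'united thanks sent'
doc_827 = 'united sent thanks'

# Compare the tweets as unordered collections of tokens.
same_tokens = sorted(doc_457.split()) == sorted(doc_827.split())
print(same_tokens)
```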