In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import logging
import os
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",\
                   level=logging.INFO)

# Small In Memory Example
***
We will manually define some sentences, and use the `TaggedDocument` data structure to have a list of `TaggedDocument` objects. Each sentence is a document in this case, and its tag is its index in the list. 

In [3]:
documents = ["I love machine learning. Its awesome.",
       "I love coding in python",
       "I love building chatbots",
       "they chat amagingly well"]
# a list of TaggedDocument objects (sentences)
tagged_data = [TaggedDocument(words=word_tokenize(sentence.lower()), tags=[str(i)]) for i, sentence in enumerate(documents)]

Now we have a list of four sentences of training data.

In [5]:
tagged_data

[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']),
 TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']),
 TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']),
 TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]

`dm` defines the training algorithm. `dm=1` means distributed memory and `dm=0` means distributed bag of words. The distributed memory model preserves word order in a document, whereas bag of words does not

In [12]:
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
               alpha=alpha,
               min_alpha=0.00025,
               min_count=1,
               dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    logging.info("Epoch: {}".format(epoch))
    
    model.train(tagged_data, 
                total_examples=model.corpus_count, 
                epochs=model.iter)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha

  from ipykernel import kernelapp as app




Now we can save the model:

In [13]:
model.save("data/doc2vec.model")

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


Now we can load and play with it:

In [15]:
model = Doc2Vec.load("data/doc2vec.model")

test_sentence = "I love chatbots"
test_data = word_tokenize(test_sentence.lower())
vec = model.infer_vector(test_data)
print("Vector inferred from sentence:\n{}->\n{}".format(test_sentence,vec))

Vector inferred from sentence:
I love chatbots->
[ 0.02188428 -0.00057101 -0.01809273  0.04565398  0.0376552  -0.01517666
 -0.03909002 -0.0026375  -0.0059761   0.01059998 -0.01804322 -0.01692442
  0.05742316  0.01855473 -0.00890092 -0.07206926 -0.04325558  0.04320221
 -0.03700936 -0.00157744]


Find most similar doc using tags:

In [17]:
similar_doc = model.docvecs.most_similar("1",topn=1)
print(similar_doc)

[('0', 0.9935578107833862)]


Print the vector for document at index 1 in training data:

In [18]:
print(model.docvecs['1'])

[ 0.19112408 -0.05284713 -0.00806603  0.5364271   0.3120089  -0.08068791
 -0.15005417 -0.12068697 -0.15206137  0.2529414  -0.12449547 -0.24706307
  0.32597885  0.13146715  0.07459036 -0.37053815 -0.49766836  0.25876194
 -0.32637128 -0.13382992]


# Example Loading Multiple Data Files in a Directory
***
Now we will pass a directory path containing text files.

You could do it like this:

```python
data = []
for file in os.listdir(DIR_PATH):
    with open(os.path.join(DIR_PATH,file),'r') as file:
        text = file.read()
    cleaned_tokens = preprocess(text)
    data.append(cleaned_tokens)
```

But this is the same thing as just reading all of the text in your entire corpus into memory all at once. 

Instead, you should encapsulate a generator inside your class that you pass to your Doc2Vec Model.

The input to a Doc2Vec Model is an iterator of `LabeledSentence` objects. However, the `LabeledSentence` class is now deprecated. Use `TaggedDocument` instead ([documentation](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument)). 

A `TaggedDocument` is a single document, made up of words (a list of unicode string token) and tags (list of tokens). Best practice is to make the tags list a list of unique integer ids.

So we need to create a lazy iterator to consume the documents in a certain directory.

The input to Doc2Vec will be our iterator of `TaggedDocument` objects.

In [27]:
class Document_Directory_Iterator(object):
    """
    Given a directory, lazily loop over files and read each one as a document 
    """
    def __init__(self, DIR_NAME, EXT="txt", preprocess=None):
        """
        Args:
            DIR_NAME (str): name of directory
            EXT (str): extension of document files
            preprocess (function): preprocessing to perform
        """
        self.DIRS = [DIR_NAME]
        self.EXT = EXT
        self.doc_list = [os.path.join(DIR_NAME,file) for file in os.listdir(DIR_NAME) \
                         if file.endswith(self.EXT)]
        self.preprocess = preprocess
        # keep track of how many documents have been read in
        self.docs_read = 0
    def add_dir(self, DIR_NAME):
        """
        Add a directory name to list of directories. Add all of the new file paths to our 
        documents list in that case
        
        Args:
            DIR_NAME (str): name of directory to add to directories
        """
        self.DIRS.append(DIR_NAME)
        self.doc_list = self.doc_list + [os.path.join(DIR_NAME,file) for file in os.listdir(DIR_NAME) \
                                         if file.endswith(self.EXT)]
    def get_dirs(self):
        return self.DIRS
    def __iter__(self):
        """
        Loop over files in self.doc_list lazily, preprocess, tokenize and yield TaggedDocuments
        """
        for file_path in self.doc_list:
            self.docs_read += 1
            with open(file_path,"r") as file:
                document_text = file.read()
            # preprocess and tokenize the text here
            if self.preprocess is not None:
                document_text = self.preprocess(document_text)
                yield TaggedDocument(words=word_tokenize(document_text), 
                                 tags=[str(self.docs_read)])
            else:
                yield TaggedDocument(words=word_tokenize(document_text.lower()), 
                                 tags=[str(self.docs_read)])

And we can test our trained model, given another file:

In [None]:
print(model.most_similar("path-to-test-doc"))

# IMDB Sentiment Dataset Example
***
We will recreate the results of [Le and Mikolov 2014](https://arxiv.org/pdf/1405.4053.pdf).

## Load the Data
***

In [30]:
import re
def normalize_text(text):
    norm_text = text.lower()
    # replace line breaks with spaces
    norm_text = norm_text.replace("<br />", " ")
    # add spaces to both sides of every punctuation
    norm_text = re.sub(r"([\.\",\(\)!\?;:])"," \\1 ", norm_text)
    return norm_text

## Train Model
***
We load all of the directories into our lazy iterator, and then run the `build_vocab` over our iterator, as well as our `train` over the iterator.

In [31]:
%%time 
import multiprocessing
cores = multiprocessing.cpu_count()
base_dir = "data/imdb_movie_reviews/"
model = Doc2Vec(vector_size=100, window=10, min_count=5, workers=cores, alpha=0.025, min_alpha=0.025)

document_iterator = Document_Directory_Iterator(os.path.join(base_dir, "train","unsup"),EXT="txt",
                                               preprocess=normalize_text)

for data_set in ["train","test"]:
    for data_label in ["neg","pos"]:
        document_iterator.add_dir(os.path.join(base_dir, data_set, data_label))

print("Directories loaded: {}".format(document_iterator.get_dirs()))
%time model.build_vocab(document_iterator)

2019-08-20 16:34:19,032 : INFO : collecting all words and their counts
2019-08-20 16:34:19,034 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags


Directories loaded: ['data/imdb_movie_reviews/train/unsup', 'data/imdb_movie_reviews/train/neg', 'data/imdb_movie_reviews/train/pos', 'data/imdb_movie_reviews/test/neg', 'data/imdb_movie_reviews/test/pos']


2019-08-20 16:34:42,016 : INFO : PROGRESS: at example #10000, processed 2680835 words (116651/s), 64350 word types, 10000 tags
2019-08-20 16:35:04,993 : INFO : PROGRESS: at example #20000, processed 5381296 words (117535/s), 91560 word types, 20000 tags
2019-08-20 16:35:27,747 : INFO : PROGRESS: at example #30000, processed 8054600 words (117486/s), 112361 word types, 30000 tags
2019-08-20 16:35:51,093 : INFO : PROGRESS: at example #40000, processed 10793941 words (117341/s), 130895 word types, 40000 tags
2019-08-20 16:36:14,479 : INFO : PROGRESS: at example #50000, processed 13511422 words (116202/s), 147337 word types, 50000 tags
2019-08-20 16:36:37,657 : INFO : PROGRESS: at example #60000, processed 16176594 words (114991/s), 162993 word types, 60000 tags
2019-08-20 16:37:00,435 : INFO : PROGRESS: at example #70000, processed 18889135 words (119088/s), 178081 word types, 70000 tags
2019-08-20 16:37:23,198 : INFO : PROGRESS: at example #80000, processed 21577961 words (118124/s), 191

CPU times: user 3min 49s, sys: 1.18 s, total: 3min 50s
Wall time: 3min 51s
CPU times: user 3min 49s, sys: 1.25 s, total: 3min 51s
Wall time: 3min 51s


In [32]:
%time model.train(document_iterator, epochs=3,total_examples=model.corpus_count)

2019-08-20 16:38:10,079 : INFO : training model with 8 workers on 59273 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2019-08-20 16:38:11,698 : INFO : EPOCH 1 - PROGRESS: at 0.04% examples, 4242 words/s, in_qsize 10, out_qsize 0
2019-08-20 16:38:12,837 : INFO : EPOCH 1 - PROGRESS: at 0.31% examples, 19802 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:38:13,853 : INFO : EPOCH 1 - PROGRESS: at 0.72% examples, 34472 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:38:14,860 : INFO : EPOCH 1 - PROGRESS: at 1.13% examples, 43005 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:38:15,888 : INFO : EPOCH 1 - PROGRESS: at 1.54% examples, 48460 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:38:16,895 : INFO : EPOCH 1 - PROGRESS: at 1.94% examples, 52344 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:38:17,916 : INFO : EPOCH 1 - PROGRESS: at 2.34% examples, 55212 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:38:18,994 : INFO : EPOCH 1 - PROGRESS: at 2.80% exam

2019-08-20 16:39:27,329 : INFO : EPOCH 1 - PROGRESS: at 30.00% examples, 73072 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:39:28,347 : INFO : EPOCH 1 - PROGRESS: at 30.40% examples, 73089 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:39:29,358 : INFO : EPOCH 1 - PROGRESS: at 30.82% examples, 73112 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:39:30,361 : INFO : EPOCH 1 - PROGRESS: at 31.23% examples, 73152 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:39:31,433 : INFO : EPOCH 1 - PROGRESS: at 31.66% examples, 73196 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:39:32,524 : INFO : EPOCH 1 - PROGRESS: at 32.07% examples, 73224 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:39:33,637 : INFO : EPOCH 1 - PROGRESS: at 32.52% examples, 73244 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:39:34,747 : INFO : EPOCH 1 - PROGRESS: at 32.93% examples, 73252 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:39:35,905 : INFO : EPOCH 1 - PROGRESS: at 33.42% examples, 73294 words/s, in_qsize

2019-08-20 16:40:43,900 : INFO : EPOCH 1 - PROGRESS: at 60.07% examples, 73806 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:40:44,924 : INFO : EPOCH 1 - PROGRESS: at 60.45% examples, 73808 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:40:45,945 : INFO : EPOCH 1 - PROGRESS: at 60.85% examples, 73809 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:40:46,970 : INFO : EPOCH 1 - PROGRESS: at 61.26% examples, 73812 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:40:48,008 : INFO : EPOCH 1 - PROGRESS: at 61.66% examples, 73809 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:40:49,052 : INFO : EPOCH 1 - PROGRESS: at 62.14% examples, 73886 words/s, in_qsize 13, out_qsize 1
2019-08-20 16:40:50,074 : INFO : EPOCH 1 - PROGRESS: at 62.52% examples, 73847 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:40:51,133 : INFO : EPOCH 1 - PROGRESS: at 62.90% examples, 73831 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:40:52,185 : INFO : EPOCH 1 - PROGRESS: at 63.33% examples, 73859 words/s, in_qsize

2019-08-20 16:42:01,201 : INFO : EPOCH 1 - PROGRESS: at 90.46% examples, 73793 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:42:02,223 : INFO : EPOCH 1 - PROGRESS: at 90.86% examples, 73789 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:42:03,375 : INFO : EPOCH 1 - PROGRESS: at 91.33% examples, 73804 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:42:04,461 : INFO : EPOCH 1 - PROGRESS: at 91.79% examples, 73813 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:42:05,572 : INFO : EPOCH 1 - PROGRESS: at 92.24% examples, 73812 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:42:06,573 : INFO : EPOCH 1 - PROGRESS: at 92.66% examples, 73820 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:42:07,638 : INFO : EPOCH 1 - PROGRESS: at 93.11% examples, 73835 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:42:08,669 : INFO : EPOCH 1 - PROGRESS: at 93.52% examples, 73831 words/s, in_qsize 15, out_qsize 1
2019-08-20 16:42:09,883 : INFO : EPOCH 1 - PROGRESS: at 93.98% examples, 73827 words/s, in_qsize

2019-08-20 16:43:12,811 : INFO : EPOCH 2 - PROGRESS: at 18.63% examples, 70701 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:43:13,821 : INFO : EPOCH 2 - PROGRESS: at 19.04% examples, 70793 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:43:14,844 : INFO : EPOCH 2 - PROGRESS: at 19.43% examples, 70853 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:43:15,858 : INFO : EPOCH 2 - PROGRESS: at 19.83% examples, 70915 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:43:16,900 : INFO : EPOCH 2 - PROGRESS: at 20.24% examples, 70952 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:43:17,901 : INFO : EPOCH 2 - PROGRESS: at 20.65% examples, 71033 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:43:18,911 : INFO : EPOCH 2 - PROGRESS: at 21.05% examples, 71096 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:43:20,017 : INFO : EPOCH 2 - PROGRESS: at 21.47% examples, 71154 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:43:21,059 : INFO : EPOCH 2 - PROGRESS: at 21.88% examples, 71177 words/s, in_qsize

2019-08-20 16:44:29,962 : INFO : EPOCH 2 - PROGRESS: at 48.66% examples, 72631 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:44:30,972 : INFO : EPOCH 2 - PROGRESS: at 49.09% examples, 72695 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:44:32,044 : INFO : EPOCH 2 - PROGRESS: at 49.50% examples, 72678 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:44:33,103 : INFO : EPOCH 2 - PROGRESS: at 49.89% examples, 72665 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:44:34,112 : INFO : EPOCH 2 - PROGRESS: at 50.28% examples, 72680 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:44:35,148 : INFO : EPOCH 2 - PROGRESS: at 50.68% examples, 72684 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:44:36,182 : INFO : EPOCH 2 - PROGRESS: at 51.07% examples, 72687 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:44:37,215 : INFO : EPOCH 2 - PROGRESS: at 51.49% examples, 72691 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:44:38,347 : INFO : EPOCH 2 - PROGRESS: at 51.93% examples, 72692 words/s, in_qsize

2019-08-20 16:45:46,985 : INFO : EPOCH 2 - PROGRESS: at 78.96% examples, 73247 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:45:48,038 : INFO : EPOCH 2 - PROGRESS: at 79.38% examples, 73240 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:45:49,082 : INFO : EPOCH 2 - PROGRESS: at 79.81% examples, 73236 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:45:50,117 : INFO : EPOCH 2 - PROGRESS: at 80.20% examples, 73231 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:45:51,249 : INFO : EPOCH 2 - PROGRESS: at 80.66% examples, 73231 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:45:52,396 : INFO : EPOCH 2 - PROGRESS: at 81.13% examples, 73226 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:45:53,517 : INFO : EPOCH 2 - PROGRESS: at 81.59% examples, 73230 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:45:54,661 : INFO : EPOCH 2 - PROGRESS: at 82.05% examples, 73222 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:45:55,678 : INFO : EPOCH 2 - PROGRESS: at 82.44% examples, 73228 words/s, in_qsize

2019-08-20 16:46:56,308 : INFO : EPOCH 3 - PROGRESS: at 6.75% examples, 66186 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:46:57,393 : INFO : EPOCH 3 - PROGRESS: at 7.19% examples, 66708 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:46:58,503 : INFO : EPOCH 3 - PROGRESS: at 7.62% examples, 67121 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:46:59,512 : INFO : EPOCH 3 - PROGRESS: at 8.05% examples, 67474 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:47:00,612 : INFO : EPOCH 3 - PROGRESS: at 8.46% examples, 67831 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:47:01,667 : INFO : EPOCH 3 - PROGRESS: at 8.87% examples, 67969 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:47:02,757 : INFO : EPOCH 3 - PROGRESS: at 9.31% examples, 68319 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:47:03,853 : INFO : EPOCH 3 - PROGRESS: at 9.75% examples, 68592 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:47:04,929 : INFO : EPOCH 3 - PROGRESS: at 10.16% examples, 68924 words/s, in_qsize 15, out

2019-08-20 16:48:13,584 : INFO : EPOCH 3 - PROGRESS: at 37.03% examples, 72599 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:14,673 : INFO : EPOCH 3 - PROGRESS: at 37.47% examples, 72640 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:15,700 : INFO : EPOCH 3 - PROGRESS: at 37.87% examples, 72650 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:16,891 : INFO : EPOCH 3 - PROGRESS: at 38.33% examples, 72676 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:17,917 : INFO : EPOCH 3 - PROGRESS: at 38.74% examples, 72687 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:48:19,176 : INFO : EPOCH 3 - PROGRESS: at 39.23% examples, 72728 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:20,186 : INFO : EPOCH 3 - PROGRESS: at 39.64% examples, 72743 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:21,312 : INFO : EPOCH 3 - PROGRESS: at 40.06% examples, 72739 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:48:22,409 : INFO : EPOCH 3 - PROGRESS: at 40.51% examples, 72764 words/s, in_qsize

2019-08-20 16:49:31,024 : INFO : EPOCH 3 - PROGRESS: at 67.32% examples, 73193 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:49:32,109 : INFO : EPOCH 3 - PROGRESS: at 67.76% examples, 73210 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:49:33,308 : INFO : EPOCH 3 - PROGRESS: at 68.22% examples, 73215 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:49:34,333 : INFO : EPOCH 3 - PROGRESS: at 68.62% examples, 73220 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:49:35,409 : INFO : EPOCH 3 - PROGRESS: at 69.01% examples, 73234 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:49:36,433 : INFO : EPOCH 3 - PROGRESS: at 69.45% examples, 73273 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:49:37,481 : INFO : EPOCH 3 - PROGRESS: at 69.89% examples, 73304 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:49:38,557 : INFO : EPOCH 3 - PROGRESS: at 70.27% examples, 73284 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:49:39,569 : INFO : EPOCH 3 - PROGRESS: at 70.68% examples, 73293 words/s, in_qsize

2019-08-20 16:50:48,085 : INFO : EPOCH 3 - PROGRESS: at 98.00% examples, 73476 words/s, in_qsize 16, out_qsize 0
2019-08-20 16:50:49,167 : INFO : EPOCH 3 - PROGRESS: at 98.44% examples, 73487 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:50:50,184 : INFO : EPOCH 3 - PROGRESS: at 98.88% examples, 73491 words/s, in_qsize 15, out_qsize 0
2019-08-20 16:50:50,961 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-08-20 16:50:50,964 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-08-20 16:50:50,974 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-08-20 16:50:50,977 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-08-20 16:50:50,978 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-08-20 16:50:50,982 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-20 16:50:50,988 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-20 16:50:50,989 

CPU times: user 15min 20s, sys: 21.8 s, total: 15min 42s
Wall time: 12min 40s


Now lets look and see if we can understand some similarities in between reviews

In [37]:
with open("data/imdb_movie_reviews/train/pos/9901_8.txt","r") as fileHandle:
    text = normalize_text(fileHandle.read())
    
print(text)

this movie looked like a classic in the cheesy 80s slasher genre ,  which is my favorite genre of them all ,  so when i saw it was free on demand ,  i had to watch it !  it stars caroline munro ,  from both dr .  phibes films  ( she was his wife that died !  )  ,  dracula a . d .  1972 ,  the golden voyage of sinbad ,  captain kronos - vampire hunter ,  the spy who loved me ,  maniac ,  and faceless .   brought to you by the people behind don't open til christmas and pieces ,  heres my thoughts on this .  .  .   it opens on april fools day ,  where a bunch of kids play an elaborate prank on the school nerd- promising him sex in the shower ,  and giving him public humiliation and a face in the toilet  ( all while hes naked !  )  .  the coach puts a stop to it ,  but both parties swear revenge .  but ,  the cool kids end up burning the nerd alive .   cut to the future ,  and its th high school reunion ,  or so they think  ( bwahahha ?  )  .  the only ones with invitations were the gang w

Here is the inferred document vector for this review:

In [38]:
tokens = word_tokenize(text.lower())
new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector],topn=10)
print(sims)

[('45121', 0.39297741651535034), ('85756', 0.3914135992527008), ('44679', 0.3826449513435364), ('49843', 0.37741243839263916), ('836', 0.3691520392894745), ('55268', 0.3668164014816284), ('52728', 0.36188164353370667), ('64782', 0.3617124855518341), ('89269', 0.3613439202308655), ('65784', 0.35799914598464966)]


In [39]:
most_similar_doc_path = document_iterator.doc_list[67349]
with open(most_similar_doc_path, 'r') as file:
    text = normalize_text(file.read())
    print(text)

i have just recently been through a stage where i wanted to see why it is that horror films of the 90's can't hold a candle to 70's and 80's horror films .  i have been very public in this forum about the vileness of films like the haunting and urban legend and such .  i feel that they  ( and others like them )  don't know what true horror is .  and it bothered me to the point where it made me go to my local video store and rent some of the classic horror films .  i already own all the friday's so i rented the texas chainsaw massacre ,  the original nightmare on elm street ,  jaws ,  the exorcist ,  angel heart ,  the exorcist and halloween .  now the other films are classics in their own right but it is here that i want to tell you about halloween .  because what halloween does is perhaps something no other film in the history of horror film can do ,  and that is it uses subtle techniques ,  techniques that don't rely on blood and gore ,  and it uses these to scare the living daylight