# Adaptation of the Doc2Vec Tutorial on the Lee Dataset

In [1]:
# __future__ imports must come before any other statement in the cell
from __future__ import division
from __future__ import print_function

import collections
import os
import random

import gensim
import smart_open



## What is it?

Doc2Vec is an NLP tool for representing documents as vectors and is a generalization of the Word2Vec method. This tutorial serves as an introduction to Doc2Vec and presents ways to train and assess a Doc2Vec model.

## Resources

* [Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [Doc2Vec Paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
* [Dr. Michael D. Lee's Website](http://faculty.sites.uci.edu/mdlee)
* [Lee Corpus](http://faculty.sites.uci.edu/mdlee/similarity-data/)
* [IMDB Doc2Vec Tutorial](doc2vec-IMDB.ipynb)

## Getting Started

To get going, we'll need a set of documents to train our Doc2Vec model. In theory, a document could be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book. In NLP parlance, a collection or set of documents is often referred to as a <b>corpus</b>.

For this tutorial, we'll be training our model using the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included in gensim. This corpus contains 314 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

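We'll test our model by eye using the much shorter [Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf), which contains 50 documents. Note that the code cells in this adaptation read the local files `rev_kkma_data_all_1col_train.dat` and `rev_kkma_data_all_1col_test.dat` rather than the Lee files bundled with gensim.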

In [2]:
# Set file names for train and test data
#test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = 'rev_kkma_data_all_1col_train.dat'
lee_test_file = 'rev_kkma_data_all_1col_test.dat'

## Define a Function to Read and Preprocess Text

Below, we define a function that opens the train/test file (as UTF-8), reads it line by line, and pre-processes the text of each line with a simple gensim pre-processing tool (i.e., tokenizes the text into individual words, removes punctuation, lowercases, etc.), yielding a list of words per line. Each line is expected to hold two tab-separated fields, a tag followed by the document text, and lines whose tag does not start with `EE` are skipped. Each remaining line constitutes a single document, and the length of each line (i.e., document) can vary. To train the model, we need to associate a tag with each document of the training corpus; in our case, the tag is the first field of the line.

In [20]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line_elems = line.split('\t')
            if line_elems[0][0:2] != 'EE':
                #print(line_elems[0][0:2])
                continue
            if tokens_only:
                yield gensim.utils.simple_preprocess(line_elems[1])
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line_elems[1]), [line_elems[0],line_elems[0]])
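
To make the expected input format concrete, here is a minimal sketch of the same filtering logic run on a tiny in-memory sample. The sample lines, the tag values, and the `simple_tokenize` helper are all hypothetical; `simple_tokenize` is only a rough stand-in for `gensim.utils.simple_preprocess`, not a reimplementation of it.

```python
import io

# Hypothetical sample in the assumed layout: tag<TAB>text, one document per line.
SAMPLE = (
    u"EE01\tMobile web app platform\n"
    u"XX02\tThis line is skipped\n"
    u"EE03\tGateway server encryption\n"
)


def simple_tokenize(text):
    # Rough stand-in for gensim.utils.simple_preprocess: lowercase, split on whitespace.
    return text.lower().split()


def read_sample(stream):
    # Yield (tag, tokens) pairs for lines whose tag starts with 'EE'.
    for line in stream:
        fields = line.rstrip(u"\n").split(u"\t")
        if len(fields) < 2 or not fields[0].startswith(u"EE"):
            continue
        yield fields[0], simple_tokenize(fields[1])


docs = list(read_sample(io.StringIO(SAMPLE)))
for tag, tokens in docs:
    print(tag, tokens)
```

Only the two `EE` lines survive the filter, which mirrors the `continue` branch in `read_corpus`.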

In [21]:
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file))

Let's take a look at the training corpus:

In [23]:
#print lee_train_file
#print lee_test_file
print (train_corpus[:1])
#for x in train_corpus[0].words:
 #   print x

[TaggedDocument(words=[u'mobile', u'web', u'app', u'process', u'oriented', u'gateway', u'server', u'service', u'platform', u'encryption', u'\uc2a4\ub9c8\ud2b8', u'\ub2e8\ub9d0\uae30', u'\uae30\ubc18', u'\uc5c5\ubb34', u'\uc5c5\ubb34\uc2dc\uc2a4\ud15c', u'\uc2dc\uc2a4\ud15c', u'\uc5f0\ub3d9', u'\ubaa8\ubc14\uc77c', u'\ud50c\ub9ac\ucf00\uc774\uc158', u'\ucf00\uc774', u'\uc11c\ube44\uc2a4', u'\ucee8\ubc84\uc804\uc2a4', u'\ud50c\ub7ab\ud3fc', u'\uae30\uc220', u'\uac1c\ubc1c', u'\ud504\ub85c\uc138\uc2a4', u'\ud504\ub85c\uc138\uc2a4\uc911\uc2ec\ud615', u'\uc911\uc2ec', u'\uac8c\uc774\ud2b8\uc6e8\uc774', u'\uac8c\uc774\ud2b8\uc6e8\uc774\uc11c\ubc84', u'\uc11c\ubc84', u'\uc11c\ube44\uc2a4\ud50c\ub7ab\ud3fc', u'\uc554\ud638\ud654', u'\uae30\uc5c5', u'\uc5f0\uacc4', u'\ubbf8\ub4e4', u'\ubbf8\ub4e4\uc6e8\uc5b4', u'\uc6e8\uc5b4', u'\ud504\ub808\uc784', u'\ud504\ub808\uc784\uc6cc\ud06c', u'\uc6cc\ud06c', u'\uc801\uc6a9', u'\uacf5\uc778', u'\uc778\uc99d', u'\uc870\ud569', u'\uc870\ud569\ud615', u'\u

And the testing corpus looks like this:

In [24]:
#print(test_corpus[:1])

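Because `read_corpus` is called with its default `tokens_only=False` for both files, each test document is also a `TaggedDocument`. Passing `tokens_only=True` would instead yield plain lists of words without any tags.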

## Training the Model

### Instantiate a Doc2Vec Object 

Now, we'll instantiate a Doc2Vec model with a vector size of 15 dimensions that iterates over the training corpus 20 times. We set the minimum word count to 2 in order to discard words that appear only once in the corpus. Model accuracy can often be improved by increasing the number of iterations, but this generally increases the training time. Small datasets with short documents, like this one, can benefit from more training passes.

In [25]:
model = gensim.models.doc2vec.Doc2Vec(size=15, min_count=2, iter=20)
model.build_vocab(train_corpus)

### Build a Vocabulary

In [30]:

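def trial(the_size, the_iter):
    """Train a fresh model with the given settings and print train/test scores."""
    global model  # tester() reads the module-level model
    print("Trial with:", the_size, the_iter)
    model = gensim.models.doc2vec.Doc2Vec(size=the_size, min_count=2, iter=the_iter)
    model.build_vocab(train_corpus)
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
    print('Train:', end='')
    tester(train_corpus)
    print('Test:', end='')
    tester(test_corpus)


# tester() is defined in the "Assessing Model" section and must be run first.
# Sweep the iteration count from 5 to 100 in steps of 5 with a fixed vector size of 10.
for i in range(5, 105, 5):
    trial(the_size=10, the_iter=i)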

Trial with: 10 5
Train:0.652304009575
Test:0.541237113402
Trial with: 10 10
Train:0.694195092759
Test:0.618556701031
Trial with: 10 15
Train:0.731897067624
Test:0.649484536082
Trial with: 10 20
Train:0.748055056852
Test:0.639175257732
Trial with: 10 25
Train:0.749850388989
Test:0.701030927835
Trial with: 10 30
Train:0.742070616397
Test:0.685567010309
Trial with: 10 35
Train:0.755834829443
Test:0.716494845361
Trial with: 10 40
Train:0.760622381807
Test:0.685567010309
Trial with: 10 45
Train:0.742070616397
Test:0.675257731959
Trial with: 10 50
Train:0.767205266308
Test:0.680412371134
Trial with: 10 55
Train:0.76959904249
Test:0.701030927835
Trial with: 10 60
Train:0.77618192699
Test:0.716494845361
Trial with: 10 65
Train:0.754637941352
Test:0.726804123711
Trial with: 10 70
Train:0.766008378217
Test:0.731958762887
Trial with: 10 75
Train:0.771992818671
Test:0.737113402062
Trial with: 10 80
Train:0.77079593058
Test:0.762886597938
Trial with: 10 85
Train:0.777977259126
Test:0.721649484536
T
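
The trial output above is easier to compare once it is collected into a small table. The sketch below copies a handful of the printed (iterations, train, test) values by hand and picks the setting with the best test score. The printed output is cut off after 85 iterations, so nothing can be said here about the later trials.

```python
# (iterations, train score, test score), copied from some of the trial lines printed above.
results = [
    (5, 0.652304009575, 0.541237113402),
    (25, 0.749850388989, 0.701030927835),
    (50, 0.767205266308, 0.680412371134),
    (75, 0.771992818671, 0.737113402062),
    (80, 0.77079593058, 0.762886597938),
    (85, 0.777977259126, 0.721649484536),
]

# Pick the setting with the highest test score and report the train/test gap.
best_iter, best_train, best_test = max(results, key=lambda row: row[2])
print("best iter:", best_iter)
print("test score: %.3f, train-test gap: %.3f" % (best_test, best_train - best_test))
```

Among the listed trials, the test score fluctuates from run to run rather than rising steadily, so a single best value should be read with some caution.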

Essentially, the vocabulary is a dictionary (accessible via `model.wv.vocab`) of all of the unique words extracted from the training corpus along with the count (e.g., `model.wv.vocab['penalty'].count` for counts for the word `penalty`).
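
As a rough illustration of the idea, the sketch below counts words over a few hypothetical tokenized documents and keeps only those that meet a minimum count, in the spirit of `min_count=2`. It only mimics the counting and thresholding; gensim's own vocabulary entries hold more than a count.

```python
import collections

# Hypothetical tokenized documents.
toy_docs = [
    ["mobile", "web", "server"],
    ["web", "server", "platform"],
    ["web", "encryption"],
]

MIN_COUNT = 2  # same threshold as min_count=2 in the model above

# Count every word across all documents, then drop the rare ones.
counts = collections.Counter(word for doc in toy_docs for word in doc)
vocab = {word: n for word, n in counts.items() if n >= MIN_COUNT}
print(sorted(vocab.items()))
```

Words that occur only once ("mobile", "platform", "encryption") are dropped, which is why a very small corpus can end up with a much smaller vocabulary than its raw word list suggests.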

### Time to Train

If the BLAS library is being used, this should take no more than 3 seconds.
If the BLAS library is not being used, this should take no more than 2 minutes, so use BLAS if you value your time.

In [31]:
model.save('trained_with size=10&iter=100,EEonly')
#%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

### Inferring a Vector

One important thing to note is that you can now infer a vector for any piece of text without having to re-train the model by passing a list of words to the `model.infer_vector` function. This vector can then be compared with other vectors via cosine similarity.

In [None]:
model.infer_vector([u'데이터', u'알고리즘', u'소프트웨어'])
inferred_vector = model.infer_vector(train_corpus[0].words)
model.docvecs.most_similar([inferred_vector], topn=5)
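
For reference, cosine similarity itself is simple to compute. The following is a minimal pure-Python sketch on hypothetical three-dimensional vectors; it is meant to show the formula, not how gensim computes it internally.

```python
import math


def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # same direction as v1, different length
v3 = [0.0, 0.0, 3.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0: no shared direction
```

Because the measure depends only on direction, scaling a vector does not change its similarity to other vectors.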

## Assessing Model

To assess our new model, we'll first infer a new vector for each document of the training corpus, compare each inferred vector with the document vectors learned during training, and then check whether the document's own tag appears among the 5 most similar tags. Basically, we're pretending that the training corpus is new, unseen data and then seeing how it compares with the trained model. The expectation is that we've likely overfit our model (i.e., each document should rank at or near the top for its own tag), so we should be able to find similar documents very easily. Later, we'll also look at the second-ranked result for a comparison with a less similar document.

In [27]:
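def tester(corpus):
    """Print the fraction of documents whose own tag is among the 5 most similar tags."""
    total_success = 0
    total_data = 0
    # WARNING: only the first quarter of the corpus is evaluated; remove "// 4" to use all of it
    for doc_id in range(len(corpus) // 4):
        inferred_vector = model.infer_vector(corpus[doc_id].words)
        sims = model.docvecs.most_similar([inferred_vector], topn=5)
        # Count a success when the document's first tag appears in the top-5 results
        if corpus[doc_id].tags[0] in [doc_tag for doc_tag, sim in sims]:
            total_success += 1
        total_data += 1

    print(total_success / total_data)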

In [28]:
tester(train_corpus)
tester(test_corpus)

0.798922800718
0.711340206186
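
The two numbers printed by `tester` are top-k hit rates: the share of documents whose own tag shows up among the k nearest tags. The sketch below reproduces that metric on hypothetical two-dimensional vectors with plain Python, assuming cosine similarity as the ranking criterion. The tags, vectors, and helper names are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_tags(query, tagged_vectors, k):
    # Rank all tags by similarity to the query and keep the k best.
    ranked = sorted(tagged_vectors.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [tag for tag, _ in ranked[:k]]


def hit_rate(queries, tagged_vectors, k):
    # Fraction of queries whose true tag is among the k most similar tags.
    hits = sum(1 for true_tag, vec in queries if true_tag in top_k_tags(vec, tagged_vectors, k))
    return hits / float(len(queries))


# Hypothetical trained vectors (one per tag) and inferred query vectors.
tagged_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]}
queries = [("A", [0.9, 0.1]), ("B", [0.1, 0.9]), ("C", [0.8, 0.2])]

print(hit_rate(queries, tagged_vectors, k=1))
print(hit_rate(queries, tagged_vectors, k=3))
```

With k equal to the number of tags, every query counts as a hit, so the metric is only informative when k is small relative to the number of distinct tags.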


Before looking at individual documents, we save the trained model to disk:

In [None]:
model.save('trained_with size=10&iter=195')

In the run above, about 80% of the evaluated training documents and about 71% of the evaluated test documents have their own tag among the 5 most similar results. Checking an inferred vector against the trained vectors is a sort of 'sanity check' as to whether the model is behaving in a usefully consistent manner, though not a real 'accuracy' value.

This is great and not entirely surprising. We can take a look at an example:

In [None]:
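# doc_id and sims are local to tester(), so recompute them here for the first document
doc_id = 0
sims = model.docvecs.most_similar([model.infer_vector(train_corpus[doc_id].words)], topn=len(model.docvecs))
docs_by_tag = {doc.tags[0]: doc for doc in train_corpus}  # tags are strings; last document wins if a tag repeats
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(docs_by_tag[sims[index][0]].words)))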

Notice above that the most similar document has a similarity score of ~80% (or higher). However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different), and the reason becomes obvious when we examine the text itself:

In [None]:
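# Pick a random document from the training corpus and infer a vector from the model
doc_id = random.randint(0, len(train_corpus) - 1)  # randint includes both endpoints
inferred_vector = model.infer_vector(train_corpus[doc_id].words)
sims = model.docvecs.most_similar([inferred_vector], topn=2)
# Tags are strings here, so look documents up by tag (the last document wins if a tag repeats)
docs_by_tag = {doc.tags[0]: doc for doc in train_corpus}

# Print the document next to the document found for the second-ranked tag
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = sims[1]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(docs_by_tag[sim_id[0]].words)))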

## Testing the Model

Using the same approach as above, we'll infer the vector for a randomly chosen test document and compare the document to our model by eye.

In [None]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)  # randint includes both endpoints
inferred_vector = model.infer_vector(test_corpus[doc_id].words)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
# Tags are strings here, so look training documents up by tag (the last document wins if a tag repeats)
docs_by_tag = {doc.tags[0]: doc for doc in train_corpus}

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(docs_by_tag[sims[index][0]].words)))
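
The MOST/MEDIAN/LEAST selection relies on the similarity list being ordered from most to least similar. The sketch below shows the same indexing on a hypothetical list of (tag, similarity) pairs that is sorted explicitly; the tags and scores are made up.

```python
# Hypothetical (tag, similarity) pairs in arbitrary order.
scores = [("EE03", 0.12), ("EE01", 0.91), ("EE05", -0.30), ("EE02", 0.55), ("EE04", 0.40)]

# Sort from most to least similar, as the ranked list used above is assumed to be.
sims = sorted(scores, key=lambda pair: pair[1], reverse=True)

# First, middle, and last entries of the ranked list.
picks = {
    label: sims[index]
    for label, index in [("MOST", 0), ("MEDIAN", len(sims) // 2), ("LEAST", len(sims) - 1)]
}
for label in ("MOST", "MEDIAN", "LEAST"):
    print(label, picks[label])
```

Looking at all three positions gives a quick feel for how far similarity scores spread across the corpus, which a top-1 result alone does not show.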

### Wrapping Up

That's it! Doc2Vec is a great way to explore relationships between documents.