# Doc2Vec Tutorial on the Lee Dataset

Doc2Vec is an NLP tool for representing documents as vectors and is a generalization of the Word2Vec method. This tutorial serves as an introduction to Doc2Vec and presents ways to train and assess a Doc2Vec model.

In [1]:
import gensim
import os
import collections
import smart_open
import random



## Getting Started

In theory, a document could be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book. In NLP parlance, a collection or set of documents is often referred to as a corpus.

In [2]:
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

## Define a Function to Read and Preprocess Text

In [3]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [4]:
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))

In [9]:
train_corpus[:2]

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', '

In [10]:
print(test_corpus[:2])

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to'


Notice that the testing corpus is just a list of lists and does not contain any tags.
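To make the token lists above easier to interpret, here is a rough pure-Python stand-in for the preprocessing step. It is only an approximation of gensim.utils.simple_preprocess, written under the assumption that the function lowercases the text, keeps alphabetic tokens, and drops tokens shorter than 2 or longer than 15 characters; the name approx_simple_preprocess is hypothetical.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (an approximation only)."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop very short and very long tokens (assumed defaults: 2 and 15 characters).
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


print(approx_simple_preprocess("A huge bushfire at 3 pm, they don't know!"))
```

This would explain why the corpus output contains fragments such as 'don' and 've' and why numbers and single-letter words like 'a' are missing.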

## Training the Model

### Instantiate a Doc2Vec Object

Now, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions and iterate over the training corpus 10 times. We set the minimum word count to 2 in order to discard words that occur only once, since such rare words carry little information for training. Model accuracy can often be improved by increasing the number of iterations, but this generally increases the training time.

In [11]:
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=10)

### Build a Vocabulary

In [13]:
model.build_vocab(train_corpus)

Essentially, the vocabulary is a dictionary (accessible via model.vocab) of all of the unique words extracted from the training corpus, along with their counts (e.g., model.vocab['penalty'].count gives the count for the word penalty).
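As a minimal sketch of the idea, the snippet below counts words across a few toy documents and keeps only those that reach a minimum count. It is a simplification for illustration, not gensim's actual vocabulary-building code; build_vocab_counts and toy_docs are hypothetical names.

```python
import collections


def build_vocab_counts(token_lists, min_count=2):
    """Count every token across all documents and keep those seen at least min_count times."""
    counts = collections.Counter(tok for tokens in token_lists for tok in tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy corpus (made up for illustration): each document is a list of tokens.
toy_docs = [
    ["penalty", "kick", "goal"],
    ["penalty", "shootout"],
    ["goal", "penalty"],
]

vocab = build_vocab_counts(toy_docs, min_count=2)
print(vocab)  # words that appear only once ('kick', 'shootout') are dropped
```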

### Time to Train

In [14]:
%time model.train(train_corpus)

Wall time: 444 ms


426778

### Inferring a Vector

One important thing to note is that you can now infer a vector for any piece of text without having to re-train the model by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

In [15]:
model.infer_vector(['only', 'you', 'can', 'prevent', 'forrest', 'fires'])

array([-0.00398235, -0.00461269, -0.02679544, -0.00935559, -0.00363114,
       -0.01800475,  0.00599421, -0.04219684,  0.01010247, -0.01585881,
        0.03546004, -0.00756014,  0.02242384, -0.00484945, -0.00255826,
       -0.01466642,  0.0159989 , -0.00532978, -0.0121013 , -0.00150979,
        0.02956434,  0.00016201,  0.00784938, -0.01737342, -0.01330245,
       -0.01583373, -0.01426438,  0.01153005,  0.02079429,  0.01086976,
        0.00197235, -0.01115874, -0.00574612, -0.01148573,  0.02358464,
       -0.03167198,  0.02231628,  0.02417704,  0.03692144, -0.0199048 ,
       -0.00967707,  0.02325767, -0.04682001,  0.00731927,  0.02529767,
        0.03053166,  0.02304165,  0.00541707, -0.00276967, -0.03043545], dtype=float32)
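Cosine similarity, mentioned above as the way to compare vectors, can be sketched in plain Python as follows. The helper name cosine_similarity is an assumption made for this example; in practice you would more likely use NumPy or gensim's own similarity methods.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on direction, not on vector length, so it ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).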

## Assessing the Model

To assess our new model, we'll first infer a new vector for each document in the training corpus, compare the inferred vector with the document vectors learned during training, and then record the rank of the document based on self-similarity. Basically, we're pretending that the training corpus is new, unseen data and seeing how it compares with the trained model. The expectation is that we've likely overfit our model (i.e., all of the ranks will be less than 2), so each document should be found as its own closest match very easily. Additionally, we'll keep track of the second-ranked result for a comparison with less similar documents.

In [16]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    
    second_ranks.append(sims[1])
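The self-similarity ranking used in the loop can be illustrated without gensim on a few toy vectors. This is only a sketch of the logic: the vectors are made up, and self_similarity_rank is a hypothetical helper, not part of the gensim API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def self_similarity_rank(doc_id, inferred_vector, doc_vectors):
    """Rank of doc_id when all stored vectors are sorted by similarity to inferred_vector."""
    sims = sorted(
        ((other_id, cosine(inferred_vector, vec)) for other_id, vec in enumerate(doc_vectors)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [other_id for other_id, _ in sims].index(doc_id)


# Toy "trained" document vectors, standing in for the model's stored vectors.
toy_doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

# Pretend that inference on document 0 returned a slightly perturbed vector.
print(self_similarity_rank(0, [0.9, 0.1], toy_doc_vectors))  # rank 0: closest to itself
print(self_similarity_rank(1, [0.9, 0.1], toy_doc_vectors))  # rank 2: least similar
```

A rank of 0 means the document was retrieved as its own nearest neighbour; larger ranks mean other documents looked more similar to the inferred vector.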


Let's count how each document ranks with respect to the training corpus:

In [17]:
collections.Counter(ranks)  # rank 0 means a document was most similar to itself

Counter({0: 30,
         1: 25,
         2: 12,
         3: 15,
         4: 16,
         5: 8,
         6: 8,
         7: 11,
         8: 14,
         9: 6,
         10: 12,
         11: 11,
         12: 7,
         13: 5,
         14: 4,
         15: 4,
         16: 2,
         17: 4,
         18: 6,
         19: 2,
         20: 3,
         21: 4,
         22: 6,
         23: 4,
         24: 1,
         25: 5,
         26: 3,
         27: 3,
         28: 5,
         29: 3,
         30: 3,
         31: 2,
         33: 2,
         34: 1,
         35: 1,
         36: 2,
         37: 2,
         39: 1,
         40: 5,
         41: 1,
         42: 3,
         43: 1,
         45: 1,
         46: 3,
         47: 3,
         48: 2,
         49: 1,
         50: 1,
         53: 2,
         54: 1,
         57: 3,
         58: 2,
         59: 1,
         61: 1,
         62: 1,
         63: 1,
         65: 1,
         66: 1,
         68: 1,
         77: 1,
         80: 1,
         82: 1,
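A rank distribution like this can be condensed into a single number, such as the share of documents that rank within the top k. The sketch below uses made-up ranks rather than the output above, and top_k_share is a hypothetical helper name.

```python
import collections


def top_k_share(ranks, k=1):
    """Fraction of documents whose self-similarity rank is below k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r < k) / len(ranks)


# Toy ranks, made up for illustration only.
toy_ranks = [0, 0, 1, 0, 3, 0, 2, 0]

print(collections.Counter(toy_ranks))
print(top_k_share(toy_ranks, k=1))  # share of documents ranked first
print(top_k_share(toy_ranks, k=2))  # share of documents ranked first or second
```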
         

In [18]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not v

Notice above that the most similar document has a similarity score of ~80% (or higher). However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different), and the reason becomes obvious when we examine the text itself.

In [19]:
# Pick a random document from the training corpus
# (random.randint includes both endpoints, so the upper bound must be len - 1)
doc_id = random.randint(0, len(train_corpus) - 1)

# Print the document alongside its second-ranked (next most similar) document from the train corpus
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (231): «the us space shuttle endeavour has blasted off from the kennedy space centre en route to the international space station iss with replacement crew endeavour launch was delayed three times most recently by bad weather over the space centre yesterday the national aeronautics and space administration nasa had earlier pushed back the launch twice due to problems with the docking of russian cargo ship at the station rectified on monday by spacewalk completed by two russian cosmonauts the shuttle is taking to the station its fourth long term crew russian commander yuri onufrienko and americans carl walz and dan bursch and is due to return to earth on december with the current crew members who have been on the station since august the shuttle is also carrying the italian raffaello module laden with tonnes of equipment food supplies and materials for scientific experiments it was the first us space shuttle launch since september when hijacked airliners left around people

## Testing the Model

In [20]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)  # randint includes both endpoints
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Test Document (12): «drug squad detectives have asked the police ombudsman to investigate the taskforce that is examining allegations of widespread corruption within the squad this coincides with the creation of special unit within the taskforce to track the spending of at least serving and former squad members the corruption taskforce codenamed ceja will check tax records and financial statements in bid to establish if any of the suspects have accrued unexplained wealth over the past seven years but drug squad detectives have countered with their own set of allegations complaining to the ombudsman that the internal investigation is flawed biased and over zealous»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (143, 0.9983001351356506): «kashmiri militant groups denied involvement in thursday attack on the indian parliament accusing indian intelligence instead we want to make it clear that kashmiris have no connection with this attack said the muttahida

## Wrapping Up
That's it! Doc2Vec is a great way to explore relationships between documents.

---

# Try our Amazon Reviews!

In [24]:
A_train = test_data_dir + os.sep + 'Atrain.csv'
A_test = test_data_dir + os.sep + 'Atest.csv'

In [25]:
Atrain_corpus = list(read_corpus(A_train))
Atest_corpus = list(read_corpus(A_test, tokens_only=True))

In [27]:
Atrain_corpus[:2]

[TaggedDocument(words=['ve', 'owned', 'some', 'type', 'of', 'dustbuster', 'handvac', 'for', 'the', 'last', 'twenty', 'years', 'and', 'don', 'think', 've', 'ever', 'ever', 'had', 'one', 'that', 'didn', 'hate', 'too', 'little', 'suction', 'power', 'not', 'enough', 'charge', 'time', 'the', 'suction', 'nozzle', 'is', 'awkward', 'to', 'use', 'and', 'the', 'batteries', 'are', 'always', 'dead', 'after', 'couple', 'minutes', 'can', 'keep', 'them', 'plugged', 'into', 'the', 'wall', 'all', 'the', 'time', 'but', 'that', 'kills', 'the', 'battery', 'and', 'wastes', 'electricity', 'have', 'found', 'it', 'much', 'more', 'effective', 'to', 'just', 'use', 'the', 'suction', 'hose', 'on', 'my', 'upright', 'even', 'though', 'that', 'means', 'lugging', 'my', 'vacuum', 'around', 'so', 'decided', 'to', 'see', 'what', 'is', 'new', 'in', 'handvac', 'technology', 'and', 'if', 'they', 'have', 'improved', 'at', 'all', 'over', 'the', 'last', 'ten', 'years', 'fortunately', 'can', 'say', 'that', 'they', 'have', 'the

In [29]:
print(Atest_corpus[:2])

[['very', 'good', 'suction', 'and', 'the', 'battery', 'doesn', 'fade', 'while', 'being', 'used', 'very', 'impressed', 'and', 'will', 'only', 'use', 'this', 'type', 'of', 'dust', 'buster', 'going', 'forward'], ['plan', 'on', 'buying', 'this', 'for', 'both', 'of', 'my', 'children', 'it', 'is', 'great', 'for', 'vacuuming', 'out', 'my', 'car', 'use', 'this', 'every', 'day', 'and', 'it', 'has', 'good', 'strong', 'charge']]


In [30]:
Amodel = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, iter=100)

In [31]:
Amodel.build_vocab(Atrain_corpus)

In [32]:
%time Amodel.train(Atrain_corpus)

Wall time: 26.1 s


20566587

In [33]:
Aranks = []
Asecond_ranks = []
for doc_id in range(len(Atrain_corpus)):
    inferred_vector = Amodel.infer_vector(Atrain_corpus[doc_id].words)
    sims = Amodel.docvecs.most_similar([inferred_vector], topn=len(Amodel.docvecs))
    Arank = [docid for docid, sim in sims].index(doc_id)
    Aranks.append(Arank)
    
    Asecond_ranks.append(sims[1])

In [34]:
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(Atrain_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % Amodel)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(Atrain_corpus[sims[index][0]].words)))

Document (4499): «it is strong vacuum cleaner it is easy to use it gets charged fast and you use it quite long very affordable convenient and handy»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d100,n5,w5,mc5,s0.001,t3):

MOST (4499, 0.7602863311767578): «it is strong vacuum cleaner it is easy to use it gets charged fast and you use it quite long very affordable convenient and handy»

MEDIAN (1981, 0.3023831844329834): «fills up quickly and gets kind of clogged if you are cleaning up cat hair but works quite well most of the time seems like the brush is going to wear out quickly but time will tell plenty of power good unit like the upright storage base this is my first one so have nothing to compare it to but sure like having it»

LEAST (131, -0.015795066952705383): «this here black decker bust duster really saved my hash was at home by myself over weekend while my wife was away at work conference while she was gone thought try to make some chocolate from scratch with bunch of organ

In [36]:
# Pick a random document from the training corpus
# (random.randint includes both endpoints, so the upper bound must be len - 1)
doc_id = random.randint(0, len(Atrain_corpus) - 1)

# Print the document alongside its second-ranked (next most similar) document from the train corpus
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(Atrain_corpus[doc_id].words)))
sim_id = Asecond_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(Atrain_corpus[sim_id[0]].words)))

Train Document (1665): «was really hopeful about this vacuum it works well but its slightly disappointing in that bought it specifically for husky hair on my couch and it doesn do whole lot about that even with full battery and clean filter the cavity for debris is also bit small and find myself emptying it regularly the most useful thing have been doing with it lately is vacuuming up my sweeping piles oh well»

Similar Document (3515, 0.6362065076828003): «very pleased with purchase quick delivery»



In [38]:
# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(Atest_corpus) - 1)  # randint includes both endpoints
inferred_vector = Amodel.infer_vector(Atest_corpus[doc_id])
sims = Amodel.docvecs.most_similar([inferred_vector], topn=len(Amodel.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(Atest_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % Amodel)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(Atrain_corpus[sims[index][0]].words)))

Test Document (3659): «very little suction»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d100,n5,w5,mc5,s0.001,t3):

MOST (3547, 0.7797701954841614): «very strong suction works amazingly»

MEDIAN (1541, 0.29992538690567017): «the design is compact the charging station is handy but all in all it does not have all that muchsuction power it will pick up little dust motes in the back of drawer but didn do much forcleaning the crevices in my car»

LEAST (1228, 0.06226041540503502): «after trying other brands and being disappointed now just automatically update my when the suction on the old one goes use my vacuum so many times day should have holster for it this version is down from star rating because of the charging base it is hard to find place for since it has to store upright and it awkward to get properly seated on the base why did they change from such simple recharging system on the last series was so happy to be done with the old bulky wall mount now it bulky shelf mount still t

There are too few training samples, so the results are pretty poor. XD