# Word2Vec (gensim)

Based on [Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus by TextMiner](http://textminingonline.com/dive-into-nltk-part-x-play-with-word2vec-models-based-on-nltk-corpus)

## 1. Exploring the `gutenberg` corpus

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works. Most of the items in its collection are full texts of public domain books.

In [4]:
from nltk.corpus import gutenberg
gutenberg.readme().replace('\n', ' ')

'Project Gutenberg Selections http://gutenberg.net/  This corpus contains etexts from from Project Gutenberg, by the following authors:  * Jane Austen (3) * William Blake (2) * Thornton W. Burgess * Sarah Cone Bryant * Lewis Carroll * G. K. Chesterton (3) * Maria Edgeworth * King James Bible * Herman Melville * John Milton * William Shakespeare (3) * Walt Whitman  The beginning of the body of each book could not be identified automatically, so the semi-generic header of each file has been removed, and included below. Some source files ended with a line "End of The Project Gutenberg Etext...", and this has been deleted.  Information about Project Gutenberg (one page)  We produce about two million dollars for each hour we work.  The fifty hours is one conservative estimate for how long it we take to get any etext selected, entered, proofread, edited, copyright searched and analyzed, the copyright letters written, etc.  This projected audience is one hundred million readers.  If our value

In [6]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [9]:
bible_kjv_sents = gutenberg.sents('bible-kjv.txt')
len(bible_kjv_sents)

30103

## 2. Implementing Word2Vec

In [24]:
from string import punctuation

discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word not in punctuation and word.isalpha()] 
                                            for sent in bible_kjv_sents]
discard_punctuation_and_lowercased_sents[3]

['in',
 'the',
 'beginning',
 'god',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth']
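The filter above can be tried on a single token list without loading the corpus. This is a minimal sketch: `clean_sentence` and `raw_tokens` are illustrative names, and the token list is hand-written on the assumption that the tokenizer splits verse numbers and punctuation into separate tokens. It also suggests that the `word not in punctuation` check is redundant, since `isalpha()` already rejects punctuation and digits.

```python
from string import punctuation


def clean_sentence(tokens):
    """Lowercase the tokens and keep only the purely alphabetic ones."""
    return [token.lower() for token in tokens if token.isalpha()]


# Hand-written example tokens (an assumption, not read from the corpus).
raw_tokens = ['1', ':', '1', 'In', 'the', 'beginning', 'God', 'created',
              'the', 'heaven', 'and', 'the', 'earth', '.']
cleaned = clean_sentence(raw_tokens)
print(cleaned)

# No ASCII punctuation character is alphabetic, so isalpha() already filters them out.
punctuation_is_never_alpha = not any(ch.isalpha() for ch in punctuation)
print(punctuation_is_never_alpha)
```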

In [62]:
from gensim.models import Word2Vec

# Note: gensim 4.x renamed the `size` argument to `vector_size`.
bible_kjv_word2vec_model = Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, size=200)
bible_kjv_word2vec_model.save('bible_word2vec_gensim')
# model = Word2Vec.load(fname) # To load a model
word_vectors = bible_kjv_word2vec_model.wv
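del bible_kjv_word2vec_model # Once training is finished, the full model can be deleted; the word vectors alone are enough for querying.
word_vectors.save_word2vec_format('bible_word2vec_org', 'bible_word2vec_vocabulary') # Second argument is the optional vocabulary file.
len(word_vectors.vocab) # Vocabulary size (gensim 4.x: len(word_vectors.key_to_index))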

5279

In [52]:
# `most_similar` ranks words by the cosine similarity between their vectors and the query vector.
# Word2vec learns these vectors from word co-occurrence statistics over a large corpus. The main idea,
# called the distributional hypothesis, is that similar words appear in similar contexts.
# A famous demonstration is the analogy 'man is to woman as king is to X': simple vector
# arithmetic (king - man + woman) often lands closest to 'queen'.
word_vectors.most_similar(['god'])

[('truth', 0.7849754095077515),
 ('hosts', 0.7561447620391846),
 ('lord', 0.7557143568992615),
 ('christ', 0.739787220954895),
 ('spirit', 0.7170284986495972),
 ('salvation', 0.7115912437438965),
 ('faith', 0.707336962223053),
 ('glory', 0.6949458122253418),
 ('hope', 0.6832618117332458),
 ('mercy', 0.6805936098098755)]
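The scores in a ranking like this are cosine similarities between word vectors. Here is a minimal pure-Python sketch of that ranking, assuming tiny hand-made 3-dimensional vectors (`toy_vectors` is invented for illustration and has nothing to do with the trained model). The helper below is a stand-in written for this example, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    'god': [0.9, 0.1, 0.0],
    'lord': [0.8, 0.2, 0.1],
    'bread': [0.1, 0.9, 0.2],
    'sea': [0.0, 0.2, 0.9],
}
ranking = most_similar('god', toy_vectors)
print(ranking)
```

Because cosine similarity depends only on direction, vector length (which tends to correlate with word frequency) does not affect the ranking.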

In [53]:
word_vectors.most_similar(['heaven'], topn=3)

[('earth', 0.7416267991065979),
 ('heavens', 0.7133727073669434),
 ('sea', 0.6633787155151367)]

In [54]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('daughter', 0.6277151703834534)]

In [55]:
# The `_cosmul` variant combines the positive and negative examples multiplicatively instead of additively, which matters for analogy queries. Levy and Goldberg (2014) report that it does better on analogy tasks:
word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.9797199964523315)]
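The two analogy queries differ in how they combine similarities. The sketch below follows the additive (3CosAdd) and multiplicative (3CosMul) objectives as described by Levy and Goldberg. It assumes cosines are shifted to the range [0, 1] before multiplying, and it uses hand-made toy vectors. All names are illustrative, and this is not gensim's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy_cos_add(vectors, positive, negative):
    """Additive objective: sum of cosines to positives minus cosines to negatives."""
    def score(word):
        vec = vectors[word]
        return (sum(cosine(vec, vectors[p]) for p in positive)
                - sum(cosine(vec, vectors[n]) for n in negative))
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=score)


def analogy_cos_mul(vectors, positive, negative, eps=1e-6):
    """Multiplicative objective: product of shifted cosines to positives
    divided by product of shifted cosines to negatives."""
    def shifted(u, v):
        return (1.0 + cosine(u, v)) / 2.0  # map [-1, 1] to [0, 1]

    def score(word):
        vec = vectors[word]
        numerator = math.prod(shifted(vec, vectors[p]) for p in positive)
        denominator = math.prod(shifted(vec, vectors[n]) for n in negative)
        return numerator / (denominator + eps)
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=score)


# Hypothetical toy vectors with axes (male, female, royal).
toy_vectors = {
    'man': [1.0, 0.0, 0.0],
    'woman': [0.0, 1.0, 0.0],
    'king': [1.0, 0.0, 1.0],
    'queen': [0.0, 1.0, 1.0],
    'daughter': [0.1, 1.0, 0.2],
    'sea': [0.5, 0.5, 0.0],
}
add_answer = analogy_cos_add(toy_vectors, ['woman', 'king'], ['man'])
mul_answer = analogy_cos_mul(toy_vectors, ['woman', 'king'], ['man'])
print(add_answer, mul_answer)
```

On these clean toy vectors both objectives agree. With vectors trained on a small corpus, one large term can dominate the additive sum, which is one plausible reason the two queries above return different words.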

In [56]:
word_vectors.similarity('lord', 'god')

0.7557145774555314

In [57]:
word_vectors.doesnt_match("lord god salvation food spirit".split())

'food'
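One simple way to find the odd word out is to pick the word whose vector is least similar to the mean of all the (length-normalized) vectors. The sketch below assumes that approach and uses invented 2-dimensional toy vectors. `odd_one_out` is a hypothetical helper, not the gensim method.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the normalized vectors."""
    units = {word: unit(vectors[word]) for word in words}
    dims = len(vectors[words[0]])
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    # The dot product of two unit vectors is their cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    'lord': [0.9, 0.1],
    'god': [1.0, 0.0],
    'spirit': [0.8, 0.3],
    'food': [0.1, 1.0],
}
result = odd_one_out("lord god spirit food".split(), toy_vectors)
print(result)
```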

In [59]:
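# Probability of a text under the model
# bible_kjv_word2vec_model.score(["In the beginning".split()])
# This doesn't work here for two reasons:
# 1. The model was deleted above; only the keyed vectors remain, and they have no `score` method.
# 2. As of the time of writing, `score` is implemented only for models trained with certain arguments (hierarchical softmax, i.e. `hs=1, negative=0`).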

AttributeError: 'Word2VecKeyedVectors' object has no attribute 'score'