<a href="https://colab.research.google.com/github/mkane968/Text-Mining-Experiments/blob/main/NLTK/Tutorial%2010%3A%20Gensim%20Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 10: Gensim Word2Vec

Based on [Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus by TextMiner](https://textminingonline.com/dive-into-nltk-part-x-play-with-word2vec-models-based-on-nltk-corpus)

The word2vec algorithm uses a **neural network model** to learn **word associations** from a large corpus of text. Once trained, such a model can detect **synonymous words** or **suggest additional words** for a partial sentence. 

As the name implies, word2vec represents each distinct word with a **particular list of numbers** called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the **level of semantic similarity** between the words represented by those vectors. 
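
To make the cosine similarity concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up toy values, not output from a trained model, and `cosine_similarity` is a helper name introduced here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, for illustration only).
toy_vectors = {
    "heaven": [0.9, 0.1, 0.3],
    "earth": [0.8, 0.2, 0.4],
    "sword": [-0.5, 0.9, 0.1],
}

sim_close = cosine_similarity(toy_vectors["heaven"], toy_vectors["earth"])
sim_far = cosine_similarity(toy_vectors["heaven"], toy_vectors["sword"])
print(sim_close > sim_far)  # vectors pointing the same way score higher
```

Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.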

[Learn more...](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/)

### ***Step 1: Exploring the Gutenberg corpus***
Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works. Most of the items in its collection are full texts of public domain books.

In [2]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
gutenberg.readme().replace('\n', ' ')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


'Project Gutenberg Selections http://gutenberg.net/  This corpus contains etexts from from Project Gutenberg, by the following authors:  * Jane Austen (3) * William Blake (2) * Thornton W. Burgess * Sarah Cone Bryant * Lewis Carroll * G. K. Chesterton (3) * Maria Edgeworth * King James Bible * Herman Melville * John Milton * William Shakespeare (3) * Walt Whitman  The beginning of the body of each book could not be identified automatically, so the semi-generic header of each file has been removed, and included below. Some source files ended with a line "End of The Project Gutenberg Etext...", and this has been deleted.  Information about Project Gutenberg (one page)  We produce about two million dollars for each hour we work.  The fifty hours is one conservative estimate for how long it we take to get any etext selected, entered, proofread, edited, copyright searched and analyzed, the copyright letters written, etc.  This projected audience is one hundred million readers.  If our value

Explore file ids in Project Gutenberg - list of available texts.

In [3]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Get all sentences in the King James Bible text and print length.

In [5]:
nltk.download('punkt')
bible_kjv_sents = gutenberg.sents('bible-kjv.txt')
len(bible_kjv_sents)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


30103

### ***Step 2: Implementing Word2Vec***

Strip punctuation and lowercase every word in each sentence, then print one example sentence to check the cleaning.

In [6]:
from string import punctuation

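# Keep only alphabetic tokens, lowercased. word.isalpha() already excludes
# punctuation tokens, so the explicit punctuation check is a belt-and-braces guard.
discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word not in punctuation and word.isalpha()]
                                            for sent in bible_kjv_sents]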
discard_punctuation_and_lowercased_sents[3]

['in',
 'the',
 'beginning',
 'god',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth']

Import word2vec and train a model on the cleaned KJV sentences, then extract and save the word vectors.

In [8]:
from gensim.models import word2vec

# This cell targets gensim 3.x; in gensim 4.x the `size` parameter was renamed to `vector_size`.
bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, size=200)
bible_kjv_word2vec_model.save('bible_word2vec_gensim')
# model = word2vec.Word2Vec.load('bible_word2vec_gensim') # To load the saved model
word_vectors = bible_kjv_word2vec_model.wv
del bible_kjv_word2vec_model # Once training is finished, we can delete the model and keep only the word vectors.
# Save the vectors in word2vec text format, plus a separate vocabulary file.
word_vectors.save_word2vec_format('bible_word2vec_org', 'bible_word2vec_vocabulary')
len(word_vectors.vocab) # gensim 4.x: len(word_vectors.key_to_index)

5279

Get the words most similar to "god". "Most similar" means the words whose vectors have the highest cosine similarity to the query word's vector. Word2vec essentially captures co-occurrence patterns of words that hold in general over large corpora of text.

Consider the word analogy ‘man is to woman as king is to X’, which was famously demonstrated with word2vec. The algorithm is able to come up with the answer, *queen*, almost magically using simple vector differences. The main idea, called the distributional hypothesis, is that similar words appear in similar contexts.


In [9]:
word_vectors.most_similar(['god']) 

[('truth', 0.7798963785171509),
 ('salvation', 0.7758899927139282),
 ('lord', 0.7544775009155273),
 ('hosts', 0.7544087171554565),
 ('faith', 0.7490318417549133),
 ('spirit', 0.7441205978393555),
 ('christ', 0.7392501831054688),
 ('fear', 0.712788462638855),
 ('glory', 0.7098792791366577),
 ('grace', 0.7070826292037964)]
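
The distributional hypothesis mentioned above can be made concrete by listing the context words that surround each target word, which is roughly the raw material a word2vec model learns from. The sketch below is a simplification: `context_pairs` and the window size of 2 are illustrative choices, not gensim's internal implementation.

```python
def context_pairs(sentence, window=2):
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, sentence[j]

# The cleaned example sentence printed earlier in this tutorial.
sentence = ['in', 'the', 'beginning', 'god', 'created',
            'the', 'heaven', 'and', 'the', 'earth']

pairs = list(context_pairs(sentence, window=2))
god_contexts = [context for target, context in pairs if target == 'god']
print(god_contexts)
```

Words that keep turning up with the same neighbours across many sentences are pushed towards similar vectors during training.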

Get the most similar words to another word, this time displaying only the top 3 most similar.

In [10]:
word_vectors.most_similar(['heaven'], topn=3)

[('earth', 0.7397711277008057),
 ('heavens', 0.7071738243103027),
 ('darkness', 0.6421352624893188)]

Try out the analogy above: take the vector difference between "woman" and "man" and apply it to "king" to find the unknown word.

In [11]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.6202816367149353)]
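
The analogy query can be sketched with plain vector arithmetic: add the positive vectors, subtract the negative one, and rank the remaining words by cosine similarity to the result. The two-dimensional vectors below are hand-picked toy values (one axis loosely "royalty", the other loosely "gender"), and `analogy` is a hypothetical helper. Gensim additionally normalizes the input vectors before combining them, which this sketch omits.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    excluded = set(positive) | set(negative)  # input words are never returned
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vectors.items() if word not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors (assumed values, for illustration only).
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "bread": [-0.7, 0.1],
}

result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result[0][0])
```

In these toy vectors the "gender" offset between "man" and "woman" is identical to the one between "king" and "queen", so the query lands exactly on "queen". Real embeddings are noisier, which is why the trained model reports a similarity well below 1.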

The `_cosmul` variant uses a slightly different comparison when given multiple positive/negative examples (such as when asking about analogies). Levy and Goldberg (2014) report that this multiplicative objective performs better on analogy tasks.


In [12]:
word_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.970683217048645)]
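
As a rough sketch of the multiplicative comparison: instead of adding and subtracting vectors, each candidate is scored by multiplying its similarities to the positive words and dividing by its similarities to the negative words. Similar in spirit to gensim's implementation, the cosines are first shifted from [-1, 1] into [0, 1] so the ratio stays non-negative, and a small epsilon avoids division by zero. The toy vectors and the helper names `shifted` and `cosmul_score` are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shifted(u, v):
    """Map cosine similarity from [-1, 1] into [0, 1]."""
    return (1.0 + cosine_similarity(u, v)) / 2.0

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of similarities to positives over product of similarities to negatives."""
    numerator = math.prod(shifted(candidate, p) for p in positive)
    denominator = math.prod(shifted(candidate, n) for n in negative) + eps
    return numerator / denominator

# Toy vectors (assumed values, for illustration only).
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "bread": [-0.7, 0.1],
}

positive = [toy_vectors["woman"], toy_vectors["king"]]
negative = [toy_vectors["man"]]
scores = {word: cosmul_score(toy_vectors[word], positive, negative)
          for word in ("queen", "bread")}  # input words are excluded
best = max(scores, key=scores.get)
print(best)
```

Because the score is a ratio rather than a cosine, it is not directly comparable with the values returned by `most_similar`.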

Get the cosine similarity between two words.

In [13]:
word_vectors.similarity('lord', 'god')

0.7544775

Find the word that does not "match" the others in a list (the one whose vector differs most from the rest).

In [14]:
word_vectors.doesnt_match("lord god salvation food spirit".split())


'food'
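
One way to picture this query, and roughly what gensim appears to do internally: normalize each word vector to unit length, average them, and return the word whose vector has the lowest cosine similarity to that average. The toy vectors and the helper `odd_one_out` below are illustrative assumptions, not the trained model's values.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word least similar to the mean of all the given words."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][i] for w in words) / len(words) for i in range(dim)])
    # For unit vectors, the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Toy vectors (assumed values, for illustration only).
toy_vectors = {
    "lord":   [0.9, 0.2],
    "god":    [0.8, 0.3],
    "spirit": [0.7, 0.4],
    "food":   [-0.2, 0.9],
}

outlier = odd_one_out(toy_vectors, ["lord", "god", "spirit", "food"])
print(outlier)
```

Note that the outlier itself contributes to the average, so with very short lists the result can be less clear-cut than in this example.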