### Using third-party implementations of word2vec

Third-party implementations of word2vec are readily available.

The gensim library provides an implementation of word2vec. Keras does not provide any built-in support for
word2vec, so integrating the gensim implementation into Keras code is very common practice.

**To install gensim** (use either pip or conda):

        pip install --upgrade gensim
        conda install -c conda-forge gensim
        
**Dataset:** http://mattmahoney.net/dc/text8.zip 

The text8 corpus is a file containing about 17 million words derived from Wikipedia text.
The Wikipedia text was cleaned to remove markup, punctuation, and non-ASCII text, and the first 100 million characters of this cleaned text became the text8 corpus.

In [1]:
from gensim.models import word2vec
import logging
import os
from pathlib import Path

Read the words from text8 and split them into sentences of 50 words each. The gensim library provides a built-in text8 handler that does something similar. Since we want
to illustrate how to generate a model with any (preferably large) corpus that may or may not fit into
memory, we will show you how to generate these sentences using a Python generator.

In this case,
we do ingest the entire file into memory, but when traversing directories of files, generators
allow us to load parts of the data into memory at a time, process them, and yield them to the caller.
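
The directory-traversal case described above could look roughly like the following. This is only a sketch: `DirectorySentences` is a hypothetical helper that is not part of gensim, and it assumes plain-text files that contain line breaks, so that each file can be read lazily one line at a time.

```python
import os
import tempfile


class DirectorySentences:
    """Yield lists of at most `maxlen` words from every file under a directory.

    Hypothetical helper: only the current line of the current file is read
    into memory, so the corpus as a whole never has to fit in RAM.
    """

    def __init__(self, dirname, maxlen):
        self.dirname = dirname
        self.maxlen = maxlen

    def __iter__(self):
        for root, _dirs, files in os.walk(self.dirname):
            for fname in sorted(files):
                with open(os.path.join(root, fname), "r") as ftext:
                    words = []
                    for line in ftext:  # file objects iterate lazily by line
                        for word in line.split():
                            words.append(word)
                            if len(words) == self.maxlen:
                                yield words
                                words = []
                    if words:  # leftover words at the end of the file
                        yield words


# Tiny demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write("one two three four five\nsix seven\n")
    demo_sentences = list(DirectorySentences(tmp, maxlen=3))

print(demo_sentences)
```

Because the class defines `__iter__` rather than being a one-shot generator function, it can be iterated several times, which matters when a training routine makes more than one pass over the corpus.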

In [9]:
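class Text8Sentences(object):
    """Iterate over a text file as lists of at most `maxlen` words."""

    def __init__(self, fname, maxlen):
        self.fname = fname
        self.maxlen = maxlen

    def __iter__(self):
        # Open the path given to the constructor instead of a hard-coded one
        with open(self.fname, "r") as ftext:
            # Read the whole file and split it on single spaces
            text = ftext.read().split(" ")
            words = []
            for word in text:
                if len(words) >= self.maxlen:
                    yield words
                    words = []
                words.append(word)
            # Emit the final, possibly shorter, sentence
            yield words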

In [None]:
#Python logging to report on progress
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)

DATA_DIR = "data/"
MODEL_NAME = "word2vec_text8"
model_file = Path(MODEL_NAME)
sentences = Text8Sentences(os.path.join(DATA_DIR, "text8"), 50)

The next lines train the model on the sentences from the dataset.
We have chosen the size of the **embedding vectors to be 300**, and we only
consider **words that appear a minimum of 30 times** in the corpus.
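
To illustrate what a minimum-count threshold does to the vocabulary, here is a small pure-Python sketch. The function `prune_vocab` and the toy sentences are made up for this illustration and are not how gensim implements the filter internally.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count every word and keep only those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
vocab = prune_vocab(toy_sentences, min_count=2)
print(vocab)
```

With `min_count=2`, only "the" and "sat" survive; rare words are dropped before any vectors are trained for them.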

The default **window size is 5**, which means that for a word $w_i$ the model considers the context words from $w_{i-5}$ to $w_{i+5}$.
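
The following sketch shows which words fall inside such a window. The helper `context_window` is hypothetical and only illustrates the indexing; gensim treats the window as a maximum distance and handles this internally.

```python
def context_window(words, i, window=5):
    """Return the words within `window` positions of words[i], excluding words[i]."""
    start = max(0, i - window)  # clip at the start of the sentence
    return words[start:i] + words[i + 1:i + 1 + window]


words = "a b c d e f g h i j k l m".split()
print(context_window(words, 6))            # context of "g"
print(context_window(words, 1, window=2))  # near the sentence start, the window is clipped
```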

By default, the word2vec model created is CBOW, but you
can change that by setting sg=1 (skip-gram) in the parameters:

In [22]:
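if model_file.is_file():
    # Reuse a previously trained model if one was saved
    model = word2vec.Word2Vec.load(MODEL_NAME)
else:
    # Note: gensim 4.0 renamed the `size` parameter to `vector_size`
    model = word2vec.Word2Vec(sentences, size=300, min_count=30)
    # Save the model so that the load branch above can find it next time
    model.save(MODEL_NAME)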

2019-05-01 02:03:34,498 : INFO : collecting all words and their counts
2019-05-01 02:03:36,184 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-01 02:03:36,354 : INFO : PROGRESS: at sentence #10000, processed 500000 words, keeping 33464 word types
2019-05-01 02:03:36,533 : INFO : PROGRESS: at sentence #20000, processed 1000000 words, keeping 52755 word types
2019-05-01 02:03:36,706 : INFO : PROGRESS: at sentence #30000, processed 1500000 words, keeping 65589 word types
2019-05-01 02:03:36,881 : INFO : PROGRESS: at sentence #40000, processed 2000000 words, keeping 78383 word types
2019-05-01 02:03:37,065 : INFO : PROGRESS: at sentence #50000, processed 2500000 words, keeping 88008 word types
2019-05-01 02:03:37,237 : INFO : PROGRESS: at sentence #60000, processed 3000000 words, keeping 96645 word types
2019-05-01 02:03:37,413 : INFO : PROGRESS: at sentence #70000, processed 3500000 words, keeping 104309 word types
[output truncated]

2019-05-01 02:04:11,431 : INFO : EPOCH 2 - PROGRESS: at 22.11% examples, 444121 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:04:12,434 : INFO : EPOCH 2 - PROGRESS: at 27.93% examples, 481645 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:04:13,449 : INFO : EPOCH 2 - PROGRESS: at 33.64% examples, 506788 words/s, in_qsize 6, out_qsize 0
2019-05-01 02:04:14,458 : INFO : EPOCH 2 - PROGRESS: at 38.93% examples, 520693 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:04:15,459 : INFO : EPOCH 2 - PROGRESS: at 44.69% examples, 537726 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:04:16,470 : INFO : EPOCH 2 - PROGRESS: at 50.46% examples, 551156 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:04:17,475 : INFO : EPOCH 2 - PROGRESS: at 56.22% examples, 562664 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:04:18,481 : INFO : EPOCH 2 - PROGRESS: at 62.27% examples, 574712 words/s, in_qsize 5, out_qsize 0
[output truncated]

2019-05-01 02:05:17,543 : INFO : EPOCH 5 - PROGRESS: at 65.22% examples, 600269 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:05:18,545 : INFO : EPOCH 5 - PROGRESS: at 71.27% examples, 609287 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:05:19,551 : INFO : EPOCH 5 - PROGRESS: at 77.51% examples, 616967 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:05:20,555 : INFO : EPOCH 5 - PROGRESS: at 83.39% examples, 621809 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:05:21,582 : INFO : EPOCH 5 - PROGRESS: at 89.33% examples, 625922 words/s, in_qsize 6, out_qsize 1
2019-05-01 02:05:22,595 : INFO : EPOCH 5 - PROGRESS: at 94.85% examples, 627154 words/s, in_qsize 5, out_qsize 0
2019-05-01 02:05:23,683 : INFO : EPOCH 5 - PROGRESS: at 99.50% examples, 620284 words/s, in_qsize 6, out_qsize 0
2019-05-01 02:05:23,748 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-01 02:05:23,755 : INFO : worker thread finished; awaiting finish of 1 more threads
[output truncated]
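
To make the difference between the CBOW and skip-gram settings mentioned before the training cell more concrete, the sketch below builds the training examples each variant would see for a toy sentence. The function `training_examples` is a hypothetical illustration of the two formulations, not gensim's actual training loop.

```python
def training_examples(words, window, sg):
    """Build (input, target) examples for CBOW (sg=0) or skip-gram (sg=1)."""
    examples = []
    for i, center in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if sg:
            # Skip-gram: the center word predicts each context word separately
            examples.extend((center, c) for c in context)
        else:
            # CBOW: the whole context jointly predicts the center word
            examples.append((tuple(context), center))
    return examples


toy = ["the", "quick", "brown", "fox"]
cbow = training_examples(toy, window=1, sg=0)
skipgram = training_examples(toy, window=1, sg=1)
print(cbow)
print(skipgram)
```

Skip-gram produces one example per (center, context) pair, so it generates more training examples from the same text than CBOW does.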

We can also find words that are most similar to a certain word:

In [21]:
# The ten words closest to "woman" in the embedding space
print(model.wv.most_similar("woman"))

[('child', 0.7407524585723877), ('girl', 0.720534086227417), ('man', 0.6518779993057251), ('lover', 0.6452478766441345), ('prostitute', 0.6388713121414185), ('baby', 0.6201745867729187), ('herself', 0.6199727654457092), ('mother', 0.6185761094093323), ('person', 0.6185750365257263), ('lady', 0.6109557747840881)]

We can provide hints for finding word similarity. For example, the following command returns the top
10 words that are like woman and king but unlike man:

In [18]:
# Words similar to "woman" and "king" but dissimilar to "man"
print(model.wv.most_similar(positive=["woman", "king"],
                            negative=["man"], topn=10))

[('queen', 0.5695605874061584), ('isabella', 0.553027331829071), ('empress', 0.5448185205459595), ('princess', 0.5404852628707886), ('daughter', 0.5338696837425232), ('throne', 0.5229605436325073), ('elizabeth', 0.5115128755569458), ('pharaoh', 0.5077822804450989), ('prince', 0.5037112236022949), ('son', 0.5034430623054504)]

We can also compute the similarity between individual pairs of words. The following examples give a feel for how the positions of
words in the embedding space correlate with their semantic meanings:

In [14]:
model.wv.similarity("girl", "woman")

0.72053397

In [15]:
model.wv.similarity("girl", "man")

0.59525347

In [16]:
model.wv.similarity("girl", "car")

0.3213757

In [17]:
model.wv.similarity("bus", "car")

0.4743592
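
The scores above are cosine similarities between word vectors, and the positive/negative query shown earlier ranks words by cosine similarity to a combination of the input vectors. The sketch below reproduces both ideas on hand-made 3-D vectors. Everything in it is an assumption for illustration: the `toy_vectors` values are invented, and `analogy` is a simplified stand-in for what `most_similar` does, not gensim's code.

```python
import math

# Hypothetical 3-D embeddings (axes: royalty, male, female); real word2vec
# vectors are learned from data and have hundreds of dimensions.
toy_vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
    "child":  [0.0, 0.5, 0.5],
}


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in v))


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def analogy(positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = [0.0, 0.0, 0.0]
    for sign, group in ((1.0, positive), (-1.0, negative)):
        for word in group:
            vec = toy_vectors[word]
            length = norm(vec)
            # Add or subtract the unit-length version of each input vector
            query = [q + sign * a / length for q, a in zip(query, vec)]
    # The query words themselves are excluded from the results
    candidates = [w for w in toy_vectors if w not in positive + negative]
    scored = [(w, cosine(query, toy_vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(cosine(toy_vectors["king"], toy_vectors["queen"]))
print(analogy(positive=["woman", "king"], negative=["man"]))
```

On these toy vectors, "queen" ranks first for the woman + king - man query, mirroring the result obtained from the trained model.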