## Exploring two word2vec models (word2vec vs gensim)

In [1]:
import word2vec
import numpy

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import nltk

### 1. Python interface to Google word2vec

Ref: https://github.com/danielfrg/word2vec

In [2]:
def read_dataset(path):
    """Read a dataset, where the first column contains a real-valued score,
    followed by a tab and a string of words.
    """
    dataset = []
    with open(path, "r") as f:
        for line in f:
            line_parts = line.strip().split("\t")
            dataset.append((float(line_parts[0]), line_parts[1].lower()))
    return dataset

In [3]:
path = '/Users/lifa08/Local_documents/Machine_Learning/Miniproject_test/train.txt'
sentences_train = read_dataset(path)

In [4]:
f = open('sentence_forword2vector.txt', 'w')
for label, sentence in sentences_train:
    f.write(sentence)
f.close()

word2phrase groups up similar words "Los Angeles" to "Los_Angeles".

**Note:** word2phrase will create a phrases text file which can be used as a better input for word2vec. However, we can also use the original text file as input for word2vec, thus this step can be skipped.

In [5]:
word2vec.word2phrase('sentence_forword2vector.txt', 'sentence_phrases.txt', verbose=True)

Starting training using file sentence_forword2vector.txt
Words processed: 100K     Vocab size: 76K  
Vocab size (unigrams + bigrams): 51508
Words in train file: 133711
Words written: 100K

Train the model using the word2phrase output.

word2vec generates a bin file containing the word vectors in a binary format.

In [6]:
word2vec.word2vec('sentence_phrases.txt', 'sentence_phrases_bin.bin', size=100, verbose=True)

Starting training using file sentence_phrases.txt
Vocab size: 3062
Words in train file: 110763


word2clusters cluster the trained vectors.

The output file contains the cluster for every word in the vocabulary.

In [7]:
word2vec.word2clusters('sentence_phrases_bin.bin', 'sentence_phrases_clusters.txt', 100, verbose=True)

Starting training using file sentence_phrases_bin.bin
Vocab size: 9
Words in train file: 7295


In [8]:
model = word2vec.load('sentence_phrases_bin.bin')
model.vocab

  ret = sqrt(sqnorm)
  return (1.0 / LA.norm(vec, ord=2)) * vec


array(['</s>', ',', 'the', ..., 'ethnic', 'nonsense', 'earth'],
      dtype='<U78')

In [9]:
model.vectors.shape

(3062, 100)

In [10]:
model.vectors

array([[  6.58986646e-13,   7.27593387e-13,  -6.30599360e-13, ...,
          2.50888205e-13,   5.03685635e-13,   3.18791556e-14],
       [             nan,              nan,              nan, ...,
                     nan,              nan,              nan],
       [ -1.75078571e-01,  -3.41962308e-01,   1.35473743e-01, ...,
         -1.93995610e-02,  -2.25419521e-01,  -2.85199702e-01],
       ..., 
       [             nan,              nan,              nan, ...,
                     nan,              nan,              nan],
       [  8.89440253e-03,  -5.86260185e-02,   1.92473996e-02, ...,
          3.45590040e-02,  -2.70427559e-02,  -6.95693418e-02],
       [             nan,              nan,              nan, ...,
                     nan,              nan,              nan]])

Retreive the vector of individual words.

In [11]:
model['nonsense'].shape

(100,)

In [12]:
model[','][:10]

array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])

Do simple queries to retreive words similar to "the" based on cosine similarity.

This returned a tuple with 2 items:

    indexes: numpy array with the indexes of the similar words in the vocabulary
    metrics: numpy array with cosine similarity to each word

In [13]:
indexes, metrics = model.cosine(',')
indexes, metrics

(array([1016, 1025, 1024, 1023, 1022, 1021, 1020, 1019, 1018, 1017]),
 array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]))

In [14]:
model.vocab[indexes]

array(['faith', 'writer', 'class', 'ultimate', 'cute', 'fears', 'fit',
       '.nothing', 'steven', 'several'],
      dtype='<U78')

In [15]:
model.generate_response(indexes, metrics)

rec.array([('faith',  nan), ('writer',  nan), ('class',  nan),
           ('ultimate',  nan), ('cute',  nan), ('fears',  nan),
           ('fit',  nan), ('.nothing',  nan), ('steven',  nan),
           ('several',  nan)], 
          dtype=[('word', '<U78'), ('metric', '<f8')])

In [16]:
model.generate_response(indexes, metrics).tolist()

[('faith', nan),
 ('writer', nan),
 ('class', nan),
 ('ultimate', nan),
 ('cute', nan),
 ('fears', nan),
 ('fit', nan),
 ('.nothing', nan),
 ('steven', nan),
 ('several', nan)]

In [17]:
indexes, metrics = model.cosine('good')
model.generate_response(indexes, metrics).tolist()

[('blood', nan),
 ("._'a", nan),
 ('let', nan),
 ('flicks', nan),
 ('fare', nan),
 ('.is', nan),
 ('open', nan),
 ('aspects', nan),
 ('sequence', nan),
 ('low-key', nan)]

### 2. gensim word2vec model

Refs:

https://radimrehurek.com/gensim/models/word2vec.html

https://rare-technologies.com/word2vec-tutorial/

In [18]:
sentences_train = read_dataset(path)

In [19]:
print(sentences_train[:2])

[(0.69444, "the rock is destined to be the 21st century 's new `` conan '' and that he 's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal ."), (0.83333, "the gorgeously elaborate continuation of `` the lord of the rings '' trilogy is so huge that a column of words can not adequately describe co-writer\\/director peter jackson 's expanded vision of j.r.r. tolkien 's middle-earth .")]


In [20]:
sentences = []
i = 0
for label, sentence in sentences_train:
    sentences.append(sentence)

print(sentences[:2])

["the rock is destined to be the 21st century 's new `` conan '' and that he 's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .", "the gorgeously elaborate continuation of `` the lord of the rings '' trilogy is so huge that a column of words can not adequately describe co-writer\\/director peter jackson 's expanded vision of j.r.r. tolkien 's middle-earth ."]


In [21]:
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

In [22]:
# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
print(tokenized_sentences[:2])

[['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '``', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean-claud', 'van', 'damme', 'or', 'steven', 'segal', '.'], ['the', 'gorgeously', 'elaborate', 'continuation', 'of', '``', 'the', 'lord', 'of', 'the', 'rings', "''", 'trilogy', 'is', 'so', 'huge', 'that', 'a', 'column', 'of', 'words', 'can', 'not', 'adequately', 'describe', 'co-writer\\/director', 'peter', 'jackson', "'s", 'expanded', 'vision', 'of', 'j.r.r', '.', 'tolkien', "'s", 'middle-earth', '.']]


**Note**: 

`gensim.models.Word2Vec(sentences, iter)` will run **two passes** over the sentences 
iterator (or, in general iter+1 passes; default iter=5). 

First pass: collects words and their frequencies to build **an internal dictionary tree**

The second and subsequent passes: train the neural model.

In [23]:
embedding_size = 10
model = gensim.models.Word2Vec(tokenized_sentences[0:5], min_count=1, size=embedding_size, window=5, workers=4)

2018-07-07 11:57:05,040 : INFO : collecting all words and their counts
2018-07-07 11:57:05,043 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-07-07 11:57:05,046 : INFO : collected 100 word types from a corpus of 140 raw words and 5 sentences
2018-07-07 11:57:05,050 : INFO : Loading a fresh vocabulary
2018-07-07 11:57:05,054 : INFO : min_count=1 retains 100 unique words (100% of original 100, drops 0)
2018-07-07 11:57:05,056 : INFO : min_count=1 leaves 140 word corpus (100% of original 140, drops 0)
2018-07-07 11:57:05,059 : INFO : deleting the raw counts dictionary of 100 items
2018-07-07 11:57:05,065 : INFO : sample=0.001 downsamples 100 most-common words
2018-07-07 11:57:05,070 : INFO : downsampling leaves estimated 55 word corpus (40.0% of prior 140)
2018-07-07 11:57:05,073 : INFO : estimated required memory for 100 words and 10 dimensions: 58000 bytes
2018-07-07 11:57:05,076 : INFO : resetting layer weights
2018-07-07 11:57:05,082 : INFO : training 

Access the word vector for an individual word.

In [24]:
model.wv['but']

array([-0.04527435, -0.01737068,  0.04638617, -0.04020905, -0.04004258,
       -0.03544782, -0.00919336,  0.03392975,  0.04286923, -0.02652049], dtype=float32)

In [25]:
model.save('mygword2vmodel')

2018-07-07 11:57:05,154 : INFO : saving Word2Vec object under mygword2vmodel, separately None
2018-07-07 11:57:05,167 : INFO : not storing attribute syn0norm
2018-07-07 11:57:05,174 : INFO : not storing attribute cum_table
2018-07-07 11:57:05,193 : INFO : saved mygword2vmodel


In [26]:
del model

In [27]:
model = gensim.models.Word2Vec.load('mygword2vmodel')

2018-07-07 11:57:05,264 : INFO : loading Word2Vec object from mygword2vmodel
2018-07-07 11:57:05,290 : INFO : loading wv recursively from mygword2vmodel.wv.* with mmap=None
2018-07-07 11:57:05,300 : INFO : setting ignored attribute syn0norm to None
2018-07-07 11:57:05,306 : INFO : setting ignored attribute cum_table to None
2018-07-07 11:57:05,311 : INFO : loaded mygword2vmodel


In [28]:
model.wv['but']

array([-0.04527435, -0.01737068,  0.04638617, -0.04020905, -0.04004258,
       -0.03544782, -0.00919336,  0.03392975,  0.04286923, -0.02652049], dtype=float32)

`model.wv` is a dictionary that contains `model.wv.index2word` and `model.wv.syn0`.

`model.wv.syn0` contains the word embeddings and is thus of shape (num_words, embedding_size)

In [29]:
# create a dictionary that maps words to their corresponding embedding vectors
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
print(w2v['but'])
print(model.wv.syn0.shape)
print(model.wv.syn0[:3])

[-0.04527435 -0.01737068  0.04638617 -0.04020905 -0.04004258 -0.03544782
 -0.00919336  0.03392975  0.04286923 -0.02652049]
(100, 10)
[[-0.0092307   0.01956238  0.01146896  0.04565843 -0.00327957 -0.03220305
  -0.04059184 -0.04343898 -0.00902533  0.03680716]
 [-0.0008527   0.04095223 -0.03789742 -0.04003941 -0.01954055 -0.03402293
   0.01164207 -0.01927962 -0.01874233  0.04072549]
 [ 0.00829409  0.00988619 -0.03435677  0.02545083 -0.04681684  0.03951706
  -0.00840738  0.03969266  0.03700726 -0.01092675]]


In [30]:
# http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/
embedding_matrix = numpy.zeros((len(model.wv.vocab), embedding_size))
for i in range(len(model.wv.vocab)):
    embedding_vector = model.wv[model.wv.index2word[i]]
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(embedding_matrix[:3])

print(numpy.allclose(embedding_matrix, model.wv.syn0))

[[-0.0092307   0.01956238  0.01146896  0.04565843 -0.00327957 -0.03220305
  -0.04059184 -0.04343898 -0.00902533  0.03680716]
 [-0.0008527   0.04095223 -0.03789742 -0.04003941 -0.01954055 -0.03402293
   0.01164207 -0.01927962 -0.01874233  0.04072549]
 [ 0.00829409  0.00988619 -0.03435677  0.02545083 -0.04681684  0.03951706
  -0.00840738  0.03969266  0.03700726 -0.01092675]]
True


Transfer sentence words to their corresponding embeddings.

In [31]:
sentence_matrix = numpy.zeros((len(tokenized_sentences[0]), embedding_size))
for i, word in enumerate(tokenized_sentences[0]):
    sentence_matrix[i] = model.wv[word]
    # sentence_matrix[i] = w2v[word]
print(sentence_matrix[:2])

sentence_wv = model.wv[tokenized_sentences[0]]
print(sentence_wv[:2])

print(numpy.allclose(sentence_matrix, sentence_wv))

[[-0.0092307   0.01956238  0.01146896  0.04565843 -0.00327957 -0.03220305
  -0.04059184 -0.04343898 -0.00902533  0.03680716]
 [ 0.02223633 -0.04034107 -0.00503555  0.04257284  0.01365479  0.04847161
  -0.01993341 -0.03781372 -0.04177392  0.02818117]]
[[-0.0092307   0.01956238  0.01146896  0.04565843 -0.00327957 -0.03220305
  -0.04059184 -0.04343898 -0.00902533  0.03680716]
 [ 0.02223633 -0.04034107 -0.00503555  0.04257284  0.01365479  0.04847161
  -0.01993341 -0.03781372 -0.04177392  0.02818117]]
True


In [32]:
print(len(model.wv.vocab))

idx_sentence = numpy.zeros(len(tokenized_sentences[0]))
for i, word in enumerate(tokenized_sentences[0]):
    idx_sentence[i] = model.wv.vocab[word].index
print(idx_sentence)

100
[  0.  13.   6.  14.   7.  15.   0.  16.  17.   4.  18.   8.  19.   9.  20.
  10.  21.   4.  22.   7.  23.   3.  24.  25.  26.  27.  28.  29.   5.  30.
  31.  32.  33.  34.  35.   2.]


In [33]:
embedding_matrix[35]

array([-0.01843417, -0.00773418,  0.02616785, -0.04447961,  0.02029118,
       -0.01890256,  0.03755583, -0.00378606, -0.00943715,  0.04591386])

In [34]:
print(len(tokenized_sentences))

7161


In [35]:
def sentences_to_idxs(tokenized_sentences):
    idx_sentences = []
    for tokenized_sentence in tokenized_sentences:
        idx_one_sentence = numpy.zeros(len(tokenized_sentence))
        idx = 0
        for idx, word in enumerate(tokenized_sentence):
            idx_one_sentence[idx] = model.wv.vocab[word].index
        idx_sentences.append(idx_one_sentence) 
    return idx_sentences

In [36]:
idx_sentences = sentences_to_idxs(tokenized_sentences[0:5])
print(idx_sentences)

[array([  0.,  13.,   6.,  14.,   7.,  15.,   0.,  16.,  17.,   4.,  18.,
         8.,  19.,   9.,  20.,  10.,  21.,   4.,  22.,   7.,  23.,   3.,
        24.,  25.,  26.,  27.,  28.,  29.,   5.,  30.,  31.,  32.,  33.,
        34.,  35.,   2.]), array([  0.,  36.,  37.,  38.,   1.,   8.,   0.,  39.,   1.,   0.,  40.,
         9.,  41.,   6.,  42.,  43.,  10.,   3.,  44.,   1.,  45.,  46.,
        47.,  48.,  49.,  50.,  51.,  52.,   4.,  53.,  54.,   1.,  55.,
         2.,  56.,   4.,  57.,   2.]), array([ 58.,  59.,  60.,  61.,   3.,  62.,   1.,  63.,  11.,   3.,  12.,
        64.,  65.,   5.,   3.,  12.,  66.,  67.,  68.,   7.,   0.,  69.,
        11.,  70.,   0.,  71.,  72.,  73.,  74.,   0.,  75.,   5.,  76.,
         5.,  77.,   1.,   0.,  78.,   2.]), array([ 79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,   1.,
        89.,  90.,  91.,  92.,  93.,   1.,  94.,   2.]), array([ 95.,   0.,  96.,   6.,  97.,  98.,  99.,   2.])]


word2vec model does not need to tokenize sentence before training while gensim model expects a sequence of sentences which are composed of a list of words. Therefore, gensim model is often used together with natural language processing tools such as nltk. From the above explorations, we can see that gensim model is more stable.