# 10 Word2Vec Implemented on Keras
Keras is a high-level deep learning library. In this notebook, we implement two word2vec models, CBoW and Skip-gram, using the IMDB movie review dataset as the corpus: http://ai.stanford.edu/~amaas/data/sentiment/


## Agenda

1. How to load pre-trained word vectors
2. Reading in the IMDB Sentiment Dataset and Iterating over files in Python
3. Build Skip-gram Model
4. Build CBoW Model
5. Memory-friendly Data Generation Methods

## Part 1: Load pre-trained word vectors

- You can find the word2vec project here: https://code.google.com/archive/p/word2vec/
- Download the word embeddings from the section **Pre-trained word and phrase vectors**. The file is named `GoogleNews-vectors-negative300.bin.gz (3.4G)`; the code below assumes it has been decompressed to `GoogleNews-vectors-negative300.bin`.
- Use gensim to load these word vectors and query them with its built-in functions.

In [1]:
from gensim.models import KeyedVectors
# Load pretrained model (since intermediate data is not included, the model cannot be refined with additional data)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

dog = model['dog']
print(dog.shape)
print(dog[:10])

# Some predefined functions that show content related information for given words
print(model.most_similar(positive=['woman', 'king'], negative=['man']))

print(model.doesnt_match("breakfast cereal dinner lunch".split()))

print(model.similarity('woman', 'man'))

(300,)
[ 0.05126953 -0.02233887 -0.17285156  0.16113281 -0.08447266  0.05737305
  0.05859375 -0.08251953 -0.01538086 -0.06347656]
[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431607246399), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.5181134343147278), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411999702454)]
cereal
0.76640123
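The analogy query above can be pictured as vector arithmetic followed by a cosine-similarity ranking. The sketch below is a simplified, NumPy-only illustration of that idea. The 2-dimensional `toy_vectors` and the `analogy` helper are made up for this example and are not part of gensim; the real `most_similar` works on the 300-dimensional GoogleNews vectors.

```python
import numpy as np

# Hypothetical 2-d vectors (made up): axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    def unit(word):
        v = vectors[word]
        return v / np.linalg.norm(v)

    query = sum(unit(w) for w in positive) - sum(unit(w) for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(["woman", "king"], ["man"], toy_vectors))
```

With these made-up vectors the query lands closest to `queen`, mirroring the result obtained from the pre-trained model.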


In [2]:
# clear the memory
del model

## Part 2: Read in the IMDB Sentiment Dataset

- The IMDB data folder is located inside the `BT5153_data` folder.
- Each movie review is a separate text file, stored in one of two folders: `pos` and `neg`.
- We need to iterate over these files and load them one by one.

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
def load_imdb_dataset(imdb_path):
    # imdb_path is the base path 
    train_texts = []
    train_labels = []
    # contain two sub-folders named pos and neg
    for cat in ['pos', 'neg']:
        dset_path = os.path.join(imdb_path, cat)
        # loop in each folder and get the file name for each txt.
        for fname in sorted(os.listdir(dset_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(dset_path, fname), encoding='utf-8') as f:
                    train_texts.append(f.read()) # load the data into memory
                label = 0 if cat == 'neg' else 1
                train_labels.append(label)
    imdbdf = pd.DataFrame(
             {'text': train_texts,
              'label': train_labels}
             )
    # shuffle the whole dataset
    imdbdf = imdbdf.sample(frac=1).reset_index(drop=True)
    # Return the dataset in dataframe format
    return imdbdf

In [4]:
df_corpus = load_imdb_dataset('./imdb/imdb')
print ('Train samples shape :', df_corpus.shape[0])

Train samples shape : 25000


In [5]:
# 1 denotes positive and 0 is negative
print(df_corpus.head())

                                                text  label
0  No likeable characters (the lead is a combinat...      0
1  Man, I really wanted to like these shows. I am...      0
2  His choice of films, the basic 'conceit' of th...      1
3  The story of the untouchable who acted like a ...      1
4  It really impresses me that it got made. The d...      0


#### Raw Text Cleaning

In [6]:
from bs4 import BeautifulSoup 
import re

def clean_txt(raw_txt):
    """Clean a raw review: strip HTML, keep letters only, lowercase, normalize whitespace."""
    # 1. Remove HTML markup
    raw_txt = BeautifulSoup(raw_txt, "html.parser").get_text()
    # 2. Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", raw_txt)
    # 3. Convert to lower case and split into individual words
    words = letters_only.lower().split()
    # 4. Join the words back into one space-separated string
    return " ".join(words)
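To see what steps 2 to 4 do in isolation, here is a small standard-library sketch that skips the HTML step. The helper name `clean_letters_only` and the sample sentence are made up for illustration. Note the side effect on contractions: the apostrophe is replaced by a space, which is why fragments such as `doesn t` show up in the cleaned corpus.

```python
import re

def clean_letters_only(text):
    """Steps 2-4 of the cleaning routine, assuming the HTML has already been removed."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)   # non-letters become spaces
    return " ".join(letters_only.lower().split())   # lowercase, collapse whitespace

sample = "It simply doesn't translate well... 10/10!"
print(clean_letters_only(sample))
```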

In [7]:
df_corpus['text'] = df_corpus.text.apply(clean_txt)
corpus = df_corpus.text.tolist()

In [8]:
# check the corpus type, which is a list of strings
print(type(corpus))
print(type(corpus[1]))

<class 'list'>
<class 'str'>


#### Text tokenization from Keras

The `Tokenizer` class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, a word count, or a tf-idf weight.

In [10]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [11]:
tokenizer = Tokenizer()
# learn the vocab
tokenizer.fit_on_texts(corpus)

In [12]:
print(type(corpus[1]))
print(corpus[1])

<class 'str'>
man i really wanted to like these shows i am starving for some good television and i applaud tnt for providing these opportunites but sadly i am in the minority i guess when it comes to the cinematic stephen king as brilliant as king s writing is the irony is that it simply doesn t translate well to the screen big or small with few exceptions very few the king experience cannot be filmed with the same impact that the stories have when read many people would disagree with this but i m sure that in their heart of hearts they have to admit that the best filmed king story is but a pale memory of the one they read the reason is simple the average king story takes place in the mind scape of the characters in the story he gives us glimpses of their inner thoughts their emotions and their sometimes fractured or unreal points of view in short king takes the reader places where you can t put a panavision camera as an audience watching the filmed king we re left with less than half 

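The `fit_on_texts` method builds the vocabulary: it counts word frequencies across the corpus and assigns each word an integer index, with more frequent words receiving smaller indices.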
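The vocabulary-building step can be approximated in a few lines of plain Python. This is a simplified sketch, not the Keras implementation: the helper names `build_word_index` and `to_sequences` and the three demo sentences are hypothetical, and the real `Tokenizer` additionally filters punctuation and supports options such as `num_words` and `oov_token`.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

demo_texts = ["the cat sat", "the dog sat", "the bird flew"]
word_index = build_word_index(demo_texts)
print(word_index)
print(to_sequences(demo_texts, word_index))
```

Index 0 is never assigned to a word in this scheme, which leaves it free for other uses such as padding.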

In [13]:
# from a string to a sequence of integers
# each word will be converted to its vocab index
seq_corpus = tokenizer.texts_to_sequences(corpus)
print(seq_corpus[1])

[124, 9, 64, 463, 5, 38, 131, 285, 9, 238, 9697, 15, 48, 49, 678, 2, 9, 6051, 17938, 15, 3707, 131, 46354, 18, 1012, 9, 238, 8, 1, 5785, 9, 476, 53, 7, 261, 5, 1, 1337, 1630, 601, 14, 518, 14, 601, 12, 478, 6, 1, 3117, 6, 11, 7, 327, 151, 20, 7067, 71, 5, 1, 260, 191, 41, 384, 16, 169, 5563, 54, 169, 1, 601, 572, 553, 28, 796, 16, 1, 170, 1455, 11, 1, 525, 27, 53, 328, 108, 77, 60, 3407, 16, 10, 18, 9, 140, 248, 11, 8, 66, 468, 4, 3339, 32, 27, 5, 955, 11, 1, 116, 796, 601, 62, 6, 18, 3, 6279, 1722, 4, 1, 29, 32, 328, 1, 282, 6, 593, 1, 832, 601, 62, 301, 268, 8, 1, 326, 24100, 4, 1, 102, 8, 1, 62, 24, 402, 177, 7160, 4, 66, 2359, 2289, 66, 1416, 2, 66, 507, 15097, 41, 4865, 743, 4, 633, 8, 342, 601, 301, 1, 5050, 1339, 117, 21, 50, 20, 271, 3, 21016, 362, 14, 34, 299, 147, 1, 796, 601, 68, 149, 312, 16, 324, 72, 316, 1, 1602, 72, 1, 5050, 46, 4535, 5, 7, 12, 23, 97, 226, 3, 3174, 5, 2267, 11, 29, 452, 3, 103, 8, 3, 601, 62, 32, 328, 3100, 29, 6, 1737, 5, 4914, 14041, 4, 11, 170, 103, 

- In the following, we are going to use a toy corpus instead of the IMDB corpus for a quick demo.

In [14]:
# let us check the texts_to_sequences function
toy_corpus = ['king is a strong man', 
              'queen is a wise woman', 
              'boy is a young man',
              'girl is a young woman',
              'prince is a young king',
              'princess is a young queen',
               'man is strong', 
               'woman is pretty',
               'prince is a boy will be king',
               'princess is a girl will be queen']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(toy_corpus)
toy_seq_corpus = tokenizer.texts_to_sequences(toy_corpus)
print(toy_seq_corpus[0])
print(tokenizer.word_index['king'])
print(tokenizer.word_index['is'])
print(tokenizer.word_index['a'])
print(tokenizer.word_index['strong'])

[4, 1, 2, 8, 5]
4
1
2
8


In [15]:
print(tokenizer.index_word[1])

is


In [16]:
print(tokenizer.index_word[0])

KeyError: 0

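- The `KeyError` shows that no word is ever assigned index 0: the Tokenizer starts numbering at 1 and leaves 0 as a reserved index, which Keras uses for padding.
- In practice, the first row of the word embedding matrix therefore corresponds to this reserved index (padding, or unknown words in some setups), which is why the vocabulary size used later is `len(word_index) + 1`.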

## Part 3: Build Skip-gram Model

- Here, we only use the toy corpus for demonstration purposes.
- Target: predict the nearby words based on the center word.
<img src="word2vec-skip-gram.png" alt="skip-gram"
	title="skip-gram pic" width="250" height="150" />

In [17]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Reshape
from keras.utils import to_categorical
from keras.preprocessing import sequence
import keras.backend as K

- **Training data generation for skip-gram**:

The input x is the index of the center word, and the output y is the one-hot vector of a nearby word's index.
For example, suppose the toy corpus contains only two sentences.
```
I like apple 
I like reading books
```
1. The first step: build a vocabulary, which can be regarded as a mapping from words to integer indices.

Here, reserved (padding/OOV) -> 0, I -> 1, like -> 2, apple -> 3, reading -> 4, books -> 5.

2. Then, we scan the corpus and create pairs of a center word and a nearby word. Here, we set the window size to `one`.
We obtain the following pairs of input x and target y.

<pre>
words pair              numerical input       numerical output

(I, like)                       1               [0,0,1,0,0,0]

(like, I)                       2               [0,1,0,0,0,0]

(like, apple)                   2               [0,0,0,1,0,0]

(apple, like)                   3               [0,0,1,0,0,0]

(I, like)                       1               [0,0,1,0,0,0]

(like, I)                       2               [0,1,0,0,0,0]

(like, reading)                 2               [0,0,0,0,1,0]

(reading, like)                 4               [0,0,1,0,0,0]

(reading, books)                4               [0,0,0,0,0,1]

(books, reading)                5               [0,0,0,0,1,0]
</pre>
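The pair-generation procedure for the two-sentence example can be sketched in plain Python as follows. The helper names `skipgram_pairs` and `one_hot` are hypothetical and exist only for this illustration; the notebook's own `generate_data` function follows the same logic with Keras utilities.

```python
def skipgram_pairs(sentences, word_index, window_size=1):
    """Return (center index, nearby index) pairs for every word in every sentence."""
    pairs = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence.split()]
        for center_pos, center in enumerate(ids):
            lo = max(0, center_pos - window_size)
            hi = min(len(ids), center_pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != center_pos:
                    pairs.append((center, ids[ctx_pos]))
    return pairs

def one_hot(index, size):
    """One-hot list of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

word_index = {"I": 1, "like": 2, "apple": 3, "reading": 4, "books": 5}
pairs = skipgram_pairs(["I like apple", "I like reading books"], word_index)
for center, nearby in pairs:
    print(center, one_hot(nearby, 6))
```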

In [18]:
def generate_data(corpus, window_size, V):
    """
    corpus: list of sentences, each a list of word indices
    window_size: number of words on each side of the center word that count as 'nearby'
    V: vocabulary size (length of the one-hot target vectors)
    """
    labels = []
    in_words = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            s = index - window_size
            e = index + window_size + 1
            for i in range(s, e):
                if 0<= i < L and i != index:
                    in_words.append([word])
                    labels.append(to_categorical(words[i], V))
    return (in_words, labels)   

In [19]:
# plus one accounts for the reserved index 0
V = len(tokenizer.word_index) + 1
dim = 5
window_size = 4
ith = 0
input_x, target_y =  generate_data(toy_seq_corpus, window_size, V)
input_x           = np.array(input_x,dtype=np.int32)
target_y          = np.array(target_y,dtype=np.int32)

In [20]:
print('check the first pair of input and output')
print(input_x[0])
print(target_y[0])#the onehot vector
print('check the third pair of input and output')
print(input_x[2])
print(target_y[2])

check the first pair of input and output
[4]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
check the third pair of input and output
[4]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]


In [21]:
print(toy_seq_corpus[0])

[4, 1, 2, 8, 5]


- **Model Config**

It consists of two layers:

1. The first layer is an embedding layer, which performs a lookup: given a word index as input, it returns the corresponding vector.

2. The second layer is a softmax layer.

- **Embedding Layer**:

Turns positive integers (indexes) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

This layer can only be used as the first layer in a model.

1. input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
2. output_dim: int >= 0. Dimension of the dense embedding.
3. embeddings_initializer: Initializer for the embeddings matrix (see initializers).
4. input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect  Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).
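Conceptually, the embedding lookup is a row selection from a weight matrix, which gives the same result as multiplying a one-hot vector by that matrix. The NumPy sketch below illustrates this equivalence; the sizes and the random matrix `E` are hypothetical stand-ins for the layer's weights, not values taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 2                    # hypothetical sizes
E = rng.normal(size=(vocab_size, embed_dim))    # stands in for the embedding matrix

indices = np.array([[4], [2]])                  # batch of 2 inputs, each of length 1
looked_up = E[indices]                          # row lookup, shape (2, 1, 2)

one_hot = np.eye(vocab_size)[indices]           # shape (2, 1, 6)
via_matmul = one_hot @ E                        # same vectors via matrix product

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup form is what makes the layer cheap: no one-hot vectors are ever materialized for the input side.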

### Model Building

In [22]:
skipgram = Sequential()
skipgram.add(Embedding(input_dim=V, output_dim=dim, embeddings_initializer='glorot_uniform', input_length=1))
skipgram.add(Reshape((dim, )))
skipgram.add(Dense(V, activation='softmax'))

  


In [23]:
print(skipgram.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 5)              85        
_________________________________________________________________
reshape_1 (Reshape)          (None, 5)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 17)                102       
Total params: 187
Trainable params: 187
Non-trainable params: 0
_________________________________________________________________
None
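The parameter counts in the summary can be checked by hand, using the values `V = 17` and `dim = 5` from the cells above: the embedding layer stores one `dim`-sized vector per vocabulary index, and the dense layer has a `dim x V` weight matrix plus one bias per output unit.

```python
V, dim = 17, 5                      # vocabulary size and embedding dimension used above

embedding_params = V * dim          # one 5-d vector per index
dense_params = dim * V + V          # weights plus biases
total_params = embedding_params + dense_params

print(embedding_params, dense_params, total_params)
```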


In [24]:
skipgram.compile(loss='categorical_crossentropy', optimizer="adadelta")

In [25]:
skipgram.fit(input_x, target_y, batch_size=8, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x11f36160>

- **How to save the learned word vectors**

In [26]:
f = open('vectors.txt' ,'w')
f.write('{} {}\n'.format(V-1, dim))

5

In [27]:
vectors = skipgram.get_weights()[0]
for word, i in tokenizer.word_index.items():
    str_vec = ' '.join(map(str, list(vectors[i, :])))
    f.write('{} {}\n'.format(word, str_vec))
f.close()

- **Saved format for word vectors**: the first line holds the vocabulary size and the vector dimension (here 5); each following line holds a word followed by its vector components.
<img src="saved_format.jpg" alt="saved format"
	title="saved format" width="550" height="450" />
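The text format written above is simple enough to read back without any library. The sketch below round-trips two made-up vectors through an in-memory buffer; `write_vectors` and `read_vectors` are hypothetical helper names for this illustration, and the vector values are invented.

```python
import io

def write_vectors(handle, vectors):
    """Write a header line 'count dim', then one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors.values())))
    handle.write("{} {}\n".format(len(vectors), dim))
    for word, vec in vectors.items():
        handle.write("{} {}\n".format(word, " ".join(str(x) for x in vec)))

def read_vectors(handle):
    """Parse the format produced by write_vectors back into a dict."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for _ in range(count):
        parts = handle.readline().split()
        vec = [float(x) for x in parts[1:]]
        if len(vec) != dim:
            raise ValueError("unexpected vector length for " + parts[0])
        vectors[parts[0]] = vec
    return vectors

toy = {"king": [0.1, -0.2], "queen": [0.3, 0.4]}   # made-up vectors
buffer = io.StringIO()
write_vectors(buffer, toy)
buffer.seek(0)
restored = read_vectors(buffer)
print(restored)
```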

- **We can load the saved vectors with gensim**

Gensim is a production-ready open-source library for NLP problems.

https://radimrehurek.com/gensim/index.html


In [29]:
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format('./vectors.txt', binary=False)



In [30]:
w2v.most_similar(positive=['man'])

[('prince', 0.9203758239746094),
 ('woman', 0.8602206707000732),
 ('princess', 0.8566994667053223),
 ('queen', 0.7101820707321167),
 ('be', 0.6660196781158447),
 ('will', 0.5517114996910095),
 ('king', 0.5441617369651794),
 ('a', 0.38888657093048096),
 ('girl', 0.3134895861148834),
 ('is', 0.2508010268211365)]

## Part 4: Build CBoW Model

- CBoW's target is to predict the center word from its surrounding context words.
<img src="word2vec-cbow.png" alt="cbow"
	title="cbow pic" width="250" height="150" />

- **Training data generation for CBoW**:

The input x is the list of context word indices, and the output y is the one-hot vector of the center word.
For example, suppose the toy corpus contains only two sentences.
```
I like apple 
I like reading books
```
1. The first step: build a vocabulary, which can be regarded as a mapping from words to integer indices.
Here, reserved (padding/OOV) -> 0, I -> 1, like -> 2, apple -> 3, reading -> 4, books -> 5.

2. Then, we scan the corpus and create pairs of a list of nearby words and a center word. Here, we set the window size to `one`.
We obtain the following pairs of input x and target y.

<pre>
words pair                     numerical input       numerical output

([like], I)                        [2]                 [0,1,0,0,0,0]

([I, apple], like)                 [1,3]               [0,0,1,0,0,0]

([like], apple)                    [2]                 [0,0,0,1,0,0]

([like], I)                        [2]                 [0,1,0,0,0,0]
 
([I, reading], like)               [1,4]               [0,0,1,0,0,0]

([like, books], reading)           [2,5]               [0,0,0,0,1,0]

([reading], books)                 [4]                 [0,0,0,0,0,1]
</pre>

3. Finally, some center words do not have enough context on both sides. For example, the first pair's numerical input has only one word index instead of two. We handle this by padding the short inputs with 0 so that all inputs have the same length.
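Steps 2 and 3 can be sketched together in plain Python. The helper name `cbow_examples` is hypothetical; it pads on the left with 0, which mirrors the default behaviour of Keras `pad_sequences`.

```python
def cbow_examples(sentence_ids, window_size):
    """Return (padded context indices, center index) for every position in a sentence."""
    maxlen = 2 * window_size
    examples = []
    for pos, center in enumerate(sentence_ids):
        lo = max(0, pos - window_size)
        context = sentence_ids[lo:pos] + sentence_ids[pos + 1:pos + 1 + window_size]
        padded = [0] * (maxlen - len(context)) + context   # left-pad with the reserved index 0
        examples.append((padded, center))
    return examples

# "I like reading books" with I -> 1, like -> 2, reading -> 4, books -> 5
examples = cbow_examples([1, 2, 4, 5], window_size=1)
for context, center in examples:
    print(context, center)
```

Every context now has length `2 * window_size`, so the examples can be stacked into a single array.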

- **Prepare the training and labels**

In [31]:
from keras.preprocessing import sequence
import keras.backend as K
def generate_data(corpus, window_size, V):
    """
    corpus: list of sentences, each a list of word indices
    window_size: number of context words taken on each side of the center word
    V: vocabulary size (length of the one-hot target vectors)
    """
    context_words   = []
    center_words    = []
    maxlen = window_size*2
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            contexts = []
            labels   = []            
            s = index - window_size
            e = index + window_size + 1
            contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
            labels.append(word)           
            x = sequence.pad_sequences(contexts, maxlen=maxlen)
            y = to_categorical(labels, V)
            context_words.append(x)
            center_words.append(y)
    return context_words, center_words

In [32]:
ith = 0
input_x, target_y = generate_data(toy_seq_corpus, window_size, V)

In [33]:
input_x = np.array(input_x)
print(input_x.shape)
input_x = np.squeeze(input_x)  # squeeze out the singleton second dimension
print(input_x.shape)
target_y = np.array(target_y)
target_y = np.squeeze(target_y)

(50, 1, 8)
(50, 8)


In [34]:
print(input_x[0])
print(target_y[0])

[0 0 0 0 1 2 8 5]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [35]:
print(toy_seq_corpus[0])

[4, 1, 2, 8, 5]


In [36]:
cbow = Sequential()

- **Lambda Layer**:

Wraps an arbitrary expression as a `Layer` object.

1. function: The function to be evaluated. It takes the input tensor as its first argument and is usually written with backend operations.
2. output_shape: Expected output shape from the function.
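In the CBoW model the wrapped expression is a mean over the context axis. The NumPy sketch below shows the shape change this produces on a made-up batch; the array values are arbitrary and only serve to make the averaging visible.

```python
import numpy as np

# Made-up batch: 2 examples, 4 context positions, 3-d embeddings
batch = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Averaging over axis 1 collapses the context positions into one vector per example
averaged = batch.mean(axis=1)

print(batch.shape, "->", averaged.shape)
print(averaged[0])
```

Because no masking is applied in this simple setup, the embedding of the padding index 0 also enters the average for short contexts.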

In [37]:
from keras.layers import Lambda
cbow.add(Embedding(input_dim=V, output_dim=dim, input_length=window_size*2))
# average the context word embeddings
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,)))
## add softmax layer
cbow.add(Dense(V, activation='softmax'))

In [38]:
print(cbow.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 8, 5)              85        
_________________________________________________________________
lambda_1 (Lambda)            (None, 5)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 17)                102       
Total params: 187
Trainable params: 187
Non-trainable params: 0
_________________________________________________________________
None


In [39]:
cbow.compile(loss='categorical_crossentropy', optimizer='adam')
# Train the model, iterating on the data in batches of 8 samples
cbow.fit(input_x, target_y, batch_size=8, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x11550898>

## Part 5: Memory-friendly Data Generation

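- Here, we modify the data generation function of skip-gram so that it produces the training data one sentence at a time.
- `yield`: a function that contains `yield` returns a generator. A generator does not store all of its values in memory; it produces one value per iteration.

Sample code (a generator expression, which behaves the same way):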
```
generator = (x * x for x in range(3))
for i in generator:
    print(i)
```

https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
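For comparison, here is the same idea written with `yield` inside a function, plus a small batching generator. Both function names, `squares` and `batches`, are hypothetical examples and not part of the notebook's model code.

```python
def squares(n):
    """Yield the squares 0, 1, 4, ... one at a time instead of building a list."""
    for x in range(n):
        yield x * x

def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # emit the final, possibly shorter batch
        yield batch

gen = squares(3)
print(next(gen))                   # the first value is computed only when requested
print(list(gen))                   # the remaining values
print(list(batches(range(5), 2)))
```

Only one value (or one batch) is held in memory at any time, which is what makes this pattern suitable for large corpora.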


In [40]:
def generate_data_live(corpus, window_size, V):
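    """
    Yield one (x, y) batch per sentence instead of building the whole dataset in memory.
    corpus: list of sentences, each a list of word indices
    window_size: number of words on each side of the center word that count as 'nearby'
    V: vocabulary size (length of the one-hot target vectors)
    """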
    for words in corpus:
        labels   = []
        in_words = [] 
        L = len(words)
        for index, word in enumerate(words):
            s = index - window_size
            e = index + window_size + 1
            for i in range(s, e):
                if 0<= i < L and i != index:
                    in_words.append([word])
                    labels.append(words[i])
        x = np.array(in_words,dtype=np.int32)
        y = to_categorical(labels, V)
        yield (x, y) 


In [41]:
# NOTE: re-create and compile `skipgram` here to train from scratch;
# otherwise training continues from the previously fitted weights

for ite in range(50):
    loss = 0.
    for x, y in generate_data_live(toy_seq_corpus, window_size, V):
        # train_on_batch performs one gradient update on exactly the samples provided, whatever their number
        loss += skipgram.train_on_batch(x, y)
    print(ite, loss)

0 25.925078868865967
1 25.895569801330566
2 25.86624312400818
3 25.836897134780884
4 25.807478189468384
5 25.777966499328613
6 25.74836039543152
7 25.718678951263428
8 25.68894910812378
9 25.659205436706543
10 25.62948775291443
11 25.599831104278564
12 25.570274353027344
13 25.540847301483154
14 25.51158118247986
15 25.482503175735474
16 25.453638553619385
17 25.425008058547974
18 25.396633625030518
19 25.36853051185608
20 25.340715646743774
21 25.31320285797119
22 25.28600311279297
23 25.259127378463745
24 25.232582330703735
25 25.206376314163208
26 25.180516004562378
27 25.155004024505615
28 25.12984275817871
29 25.105035066604614
30 25.08058214187622
31 25.056483268737793
32 25.0327365398407
33 25.00934100151062
34 24.98629379272461
35 24.963590621948242
36 24.94122886657715
37 24.919200897216797
38 24.897504091262817
39 24.87613296508789
40 24.855081796646118
41 24.83434009552002
42 24.81390690803528
43 24.793773651123047
44 24.773934364318848
45 24.75438094139099
46 24.73510837554