## Using wrappers for Gensim models for working with Keras

This tutorial is about using gensim models as a part of your Keras models.

The wrappers available (as of now) are:
* Word2Vec (uses the function ```get_embedding_layer``` defined in ```gensim.models.keyedvectors```)

### Word2Vec

To use Word2Vec, we import the corresponding module:

In [2]:
from gensim.models import word2vec

Next we create a dummy set of sentences to train the Word2Vec model associated with the wrapper.

In [3]:
sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

Then, we call the wrapper and pass appropriate parameters.

In [4]:
model = word2vec.Word2Vec(sentences, size=100, min_count=1, hs=1)



We can use the methods and attributes associated with the Word2Vec model on the model returned by the wrapper.

In [5]:
sims = model.most_similar('graph', topn=10)  # words most similar to 'graph'
print(sims)

[('human', 0.21846070885658264), ('eps', 0.14406150579452515), ('system', 0.12887781858444214), ('time', 0.12749384343624115), ('computer', 0.10715052485466003), ('minors', 0.08211945742368698), ('user', 0.031229231506586075), ('interface', 0.016254138201475143), ('trees', 0.005966879427433014), ('survey', -0.10215148329734802)]


As with any Word2Vec model, the results obtained after training on such a small input can be unexpected.

#### Integration with Keras: Cosine Similarity Task

As an example of using the wrapper with Keras, we apply it to a word similarity task, where the cosine similarity between two word vectors serves as the similarity score of the two words.

In [6]:
import numpy as np
from keras.engine import Input
from keras.models import Model
from keras.layers import merge

We use the layer returned by the function `get_embedding_layer` in the Keras model.

In [7]:
model_wv = model.wv
embedding_layer = model_wv.get_embedding_layer()
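
Conceptually, an embedding layer such as the one returned here acts as a lookup table whose row `i` holds the vector of the word with vocabulary index `i`. The sketch below illustrates that idea with NumPy only. The toy vocabulary, the 4-dimensional random vectors and the helper name `embed` are assumptions made for illustration, not gensim or Keras internals.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> row index (illustration only).
toy_index = {'graph': 0, 'trees': 1, 'minors': 2}

# Hypothetical weight matrix: one 4-dimensional vector per word.
rng = np.random.RandomState(0)
toy_weights = rng.rand(len(toy_index), 4)

def embed(indices, weights):
    """Return the rows of `weights` selected by the integer `indices`."""
    return weights[np.asarray(indices, dtype='int32')]

# A batch of two samples, each a length-1 sequence of word indices.
batch = [[toy_index['graph']], [toy_index['trees']]]
vectors = embed(batch, toy_weights)
print(vectors.shape)  # (batch size, sequence length, embedding dimension)
```

This is why the Keras inputs in this tutorial are integer word indices rather than the words themselves.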

Next, we construct the Keras model. 

In [8]:
input_a = Input(shape=(1,), dtype='int32', name='input_a')
input_b = Input(shape=(1,), dtype='int32', name='input_b')
embedding_a = embedding_layer(input_a)
embedding_b = embedding_layer(input_b)
similarity = merge([embedding_a, embedding_b], mode='cos', dot_axes=2)

keras_model = Model(input=[input_a, input_b], output=similarity)
keras_model.compile(optimizer='sgd', loss='mse')

Now, we input the two words which we wish to compare and retrieve the value predicted by the model as the similarity score of the two words. 

In [9]:
word_a = 'graph'
word_b = 'trees'
index_a = np.asarray([model.wv.vocab[word_a].index])
index_b = np.asarray([model.wv.vocab[word_b].index])
output = keras_model.predict([index_a, index_b])  # cosine similarity of the two word vectors
print(output)

[[[[ 0.00596689]]]]
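
This score agrees with the value that `most_similar` reported for 'trees' earlier (about 0.00597), which is what we would expect if both compute the cosine of the angle between the two word vectors. As a minimal sketch of that computation, using two hypothetical toy vectors rather than the trained ones:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between the vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors (not taken from the trained model).
vec_a = [1.0, 0.0, 1.0]
vec_b = [0.0, 1.0, 1.0]
score = cosine_similarity(vec_a, vec_b)
print(score)  # close to 0.5 for these two vectors
```

With the trained model, the same function could presumably be applied to `model.wv['graph']` and `model.wv['trees']`.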


#### Integration with Keras: 20NewsGroups Task

To see how this wrapper could be used on a real supervised task, we consider the [20NewsGroups](http://qwone.com/~jason/20Newsgroups/) task. We use a smaller version of this data, taking only a subset of the documents to be classified. First, we import the necessary modules.

In [10]:
import os
import sys
import keras
import numpy as np
from gensim.models import word2vec
from keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Input, Dense, Flatten
from keras.layers import Conv1D, MaxPooling1D

# datapath = './datasets/'
datapath = '../../gensim/test/test_data/'

As the first step of the task, we iterate over the folder in which our text samples are stored and collect them into a list of samples. At the same time, we prepare a list of class indices matching the samples.

In [11]:
TEXT_DATA_DIR = datapath + '20_newsgroup_keras/'

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                f.close()
                labels.append(label_id)

Then, we format our text samples and labels into tensors that can be fed into a neural network. To do this, we rely on Keras utilities `keras.preprocessing.text.Tokenizer` and `keras.preprocessing.sequence.pad_sequences`.

In [12]:
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100

# Vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))

x_train = data
y_train = labels
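
To make the preprocessing concrete, here is a rough pure-Python sketch of what the two steps amount to. Words are mapped to integer ids (most frequent first, starting at 1), and each sequence is then left-padded with zeros or truncated from the front to a fixed length, which we believe mirrors the `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`). The helper names `build_word_index` and `pad_left` and the toy data are hypothetical, and the real `Tokenizer` additionally lowercases the text and strips punctuation.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to an id, most frequent first, starting at 1 (0 is left for padding)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending count, then alphabetically for a deterministic order.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def pad_left(seq, maxlen, value=0):
    """Keep the last `maxlen` items (maxlen > 0) and left-pad with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

toy_texts = [['graph', 'minors', 'trees'], ['graph', 'trees'], ['trees']]
toy_word_index = build_word_index(toy_texts)
toy_sequences = [[toy_word_index[w] for w in toks] for toks in toy_texts]
toy_data = [pad_left(s, maxlen=2) for s in toy_sequences]
print(toy_word_index)
print(toy_data)  # [[3, 1], [2, 1], [0, 1]]
```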

Next, we prepare the embedding layer using the wrapper, as follows.

In [13]:
Keras_w2v = word2vec.Word2Vec(word2vec.LineSentence(datapath + '20_newsgroup_keras_w2v_data.txt'), min_count=1)
Keras_w2v_wv = Keras_w2v.wv
embedding_layer = Keras_w2v_wv.get_embedding_layer()
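
One point worth checking at this step: the integer ids produced by the Keras `Tokenizer` are not guaranteed to coincide with the row order of the Word2Vec vectors, so it may be safer to build the input sequences from the Word2Vec vocabulary itself. The sketch below shows the idea on a hypothetical word-to-row mapping (`w2v_index`, standing in for the indices stored in the trained model's vocabulary). The helper `encode` and the choice to drop unknown words are assumptions for illustration.

```python
# Hypothetical mapping from word to embedding row, standing in for
# the indices stored in the trained model's vocabulary.
w2v_index = {'system': 0, 'user': 1, 'graph': 2, 'trees': 3}

def encode(tokens, index):
    """Convert tokens to embedding row ids, skipping out-of-vocabulary words."""
    return [index[tok] for tok in tokens if tok in index]

encoded = encode(['graph', 'of', 'trees'], w2v_index)
print(encoded)  # [2, 3]
```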



Finally, we create a small 1D convnet to solve our classification problem.

In [14]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

# model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=1)
model.fit(x_train, y_train, batch_size=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fda30551e90>

As can be seen from the results above, the accuracy obtained is not that high. This is because the training set is small; we could expect better accuracy with a larger training set.
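
As a side note on the convnet, the comment `# global max pooling` on `MaxPooling1D(35)` follows from the sequence lengths inside the network. Assuming unpadded ('valid') convolutions with stride 1 and pooling with a stride equal to the pool size, which we take to be the Keras defaults, the length can be traced as below. The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(length, kernel):
    """Output length of an unpadded, stride-1 convolution."""
    return length - kernel + 1

def pool_out(length, pool):
    """Output length of non-overlapping, unpadded max pooling."""
    return length // pool

n = 1000             # MAX_SEQUENCE_LENGTH
n = conv_out(n, 5)   # 996
n = pool_out(n, 5)   # 199
n = conv_out(n, 5)   # 195
n = pool_out(n, 5)   # 39
n = conv_out(n, 5)   # 35
print(n)             # 35, so MaxPooling1D(35) pools over the whole remaining sequence
final = pool_out(n, 35)
print(final)         # 1
```

Changing `MAX_SEQUENCE_LENGTH` or a kernel size would therefore also require adjusting the last pool size for it to remain global.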