# Keras sentiment analysis with Elmo Embeddings

One of the recent trends in Natural Language Processing is transfer learning. Transfer learning allows NLP models to learn more from fewer examples. In this notebook, we experiment with so-called [ELMo Embeddings](https://allennlp.org/elmo), a new approach to word embeddings that relies on a large unlabelled text corpus to understand word meaning in context. ELMo Embeddings are available from [Tensorflow Hub](https://alpha.tfhub.dev/google/elmo/2).

## Preparation

Let's first install and import all the required libraries.

In [1]:
#!pip install tensorflow_hub

In [2]:
import tensorflow as tf
import tensorflow_hub as hub



In [6]:
from tensorflow.keras.models import load_model, Model
from tensorflow.python.keras import backend as K

sess = tf.compat.v1.Session()
K.set_session(sess)


#tf.compat.v1.disable_eager_execution()

#hello = tf.constant('Hello, TensorFlow!')
#sess = tf.compat.v1.Session()
#print(sess.run(hello))

#sess = tf.Session()
#backend.set_session(sess)

I0000 00:00:1745291915.125088 1773960 pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
I0000 00:00:1745291915.125149 1773960 pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [25]:
elmo_model = tf.compat.v1.Module()
#elmo_model = hub.Module("https://alpha.tfhub.dev/google/elmo/2", trainable=True)
#sess.run(tf.initialize_all_variables())
#sess.run(tf.tables_initializer())
sess.run("https://alpha.tfhub.dev/google/elmo/2")

ValueError: Argument `fetch` = https://alpha.tfhub.dev/google/elmo/2 cannot be interpreted as a Tensor. (The name 'https://alpha.tfhub.dev/google/elmo/2' looks a like a Tensor name, but is not a valid one. Tensor names must be of the form "<op_name>:<output_index>".)

INFO:tensorflow:Initialize variable module/bilm/char_embed:0 from checkpoint b'/tmp/tfhub_modules/147211ae67f6cee45196ab9725932f31e8151552/variables/variables' with bilm/char_embed


In [22]:
?sess.run

[0;31mSignature:[0m [0msess[0m[0;34m.[0m[0mrun[0m[0;34m([0m[0mfetches[0m[0;34m,[0m [0mfeed_dict[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m [0moptions[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m [0mrun_metadata[0m[0;34m=[0m[0;32mNone[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Runs operations and evaluates tensors in `fetches`.

This method runs one "step" of TensorFlow computation, by
running the necessary graph fragment to execute every `Operation`
and evaluate every `Tensor` in `fetches`, substituting the values in
`feed_dict` for the corresponding input values.

The `fetches` argument may be a single graph element, or an arbitrarily
nested list, tuple, namedtuple, dict, or OrderedDict containing graph
elements at its leaves.  A graph element can be one of the following types:

* A `tf.Operation`.
  The corresponding fetched value will be `None`.
* A `tf.Tensor`.
  The corresponding fetched value will be a numpy ndarray containing the
  value o

## ELMo Embeddings

A quick example will illustrate how ELMo Embeddings work. When we pass to our model a list of sentences (either as strings or as lists of tokens), we get back a list of 1024-dimensional embeddings for every sentence. These are the ELMo embeddings of the tokens in the sentence. 

In [None]:
embeddings = elmo_model(
    ["the cat is on the mat", "dogs are in the fog"],
    signature="default",
    as_dict=True)["elmo"]

## Sentiment analysis

In this experiment, we're going to build a simple neural network for sentiment analysis. As our training and test data, we use the IMDB movie reviews that come pre-packaged with Keras. We shuffle the reviews and pad all texts to a maximum length of 500.

In [None]:
import random
import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence

VOCABULARY_SIZE = 50000
INDEX_FROM = 3
START_INDEX = 1
OOV_INDEX = 2
EMBEDDING_DIM = 300
SEQ_LENGTH = 500

(X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=VOCABULARY_SIZE,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=START_INDEX,
                                                      oov_char=OOV_INDEX,
                                                      index_from=INDEX_FROM)

train = list(zip(X_train, y_train))
random.shuffle(train)
X_train, y_train = zip(*train)
y_train = np.array(y_train)

X_train = sequence.pad_sequences(X_train, maxlen=SEQ_LENGTH)
X_test = sequence.pad_sequences(X_test, maxlen=SEQ_LENGTH)

## Simple embeddings

For our baseline, we're going to work with standard word embeddings. These map every token to a 300-dimensional embedding, irrespective of the context in which the token occurs. We'll use the English word embeddings from Facebook Research's [MUSE project](https://github.com/facebookresearch/MUSE).

In [None]:
!wget https://s3.amazonaws.com/arrival/embeddings/wiki.multi.en.vec -O /tmp/wiki.multi.en.vec

We load these embeddings and put the ones we need in an embedding matrix, where their row indices correspond to the token indices that Keras has assigned to the tokens in the IMDB corpus.

In [None]:
import numpy as np

def load_vectors(embedding_file_path):
    print("Loading vectors from", embedding_file_path)
    embeddings = []
    word2id = {}
    with open(embedding_file_path, 'r', encoding='utf-8') as f:
        next(f)
        for i, line in enumerate(f):
            word, emb = line.rstrip().split(' ', 1)
            emb = np.fromstring(emb, sep=' ')
            assert word not in word2id, 'word found twice'
            embeddings.append(emb)
            word2id[word] = len(word2id)

    embeddings = np.vstack(embeddings)
    return embeddings, word2id

embeddings_en, embedding_word2id_en = load_vectors("/tmp/wiki.multi.en.vec")

In [None]:
def create_embedding_matrix(target_word2id, embedding_word2id, embeddings, num_rows, num_columns):
    embedding_matrix = np.zeros((num_rows, num_columns))
    for word, i in target_word2id.items():
        if i >= num_rows:
            continue
        if word in embedding_word2id: 
            embedding_matrix[i] = embeddings[embedding_word2id[word]]
    return embedding_matrix

word2id_en = imdb.get_word_index()
word2id_en = {k:(v+INDEX_FROM) for k,v in word2id_en.items()}
word2id_en["<PAD>"] = 0
word2id_en["<START>"] = START_INDEX
word2id_en["<UNK>"] = OOV_INDEX

embedding_matrix_en = create_embedding_matrix(word2id_en, embedding_word2id_en, 
                                              embeddings_en, VOCABULARY_SIZE+INDEX_FROM-1, EMBEDDING_DIM)

## Models

We build two models for text classification, which are identical apart from the first layer. Our basic model has a simple embedding layer, were tokens are mapped to their static embeddings. Our ELMo model has a more complex first layer, where the static embedding for each token is concatenated to the ELMo embedding for that token in the relevant context. This results in a 1,324-dimensional embedding for each token in context. In both models, this embedding layer is followed by a simple convolution with kernel size 3, a maximum pooling operation, a dense layer, and finally a final layer that predicts the sentiment of each text as a number between 0 and 1.  

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Lambda, Input
from keras.layers import Flatten, Concatenate
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import Model
from keras.optimizers import Adam

ELMO_EMBEDDING_DIM = 1024

def ElmoEmbedding(x):
    y = elmo_model(tf.squeeze(x), signature="default", as_dict=True)["elmo"]
    return y

def create_basic_model():
    sequence = Input(shape=(500,))
    embedding = Embedding(VOCABULARY_SIZE+INDEX_FROM-1, EMBEDDING_DIM, input_length=SEQ_LENGTH, 
                              weights=[embedding_matrix_en], trainable=False)(sequence)
        
    conv = Conv1D(filters=64, kernel_size=3, padding='same', activation='relu')(embedding)
    pool = MaxPooling1D(pool_size=SEQ_LENGTH)(conv)
    flat = Flatten()(pool)
    dense = Dense(250, activation='relu')(flat)
    prediction = Dense(1, activation='sigmoid')(dense)

    model = Model(inputs=sequence, outputs=prediction)
    optimizer = Adam(lr=0.0001, decay=1e-3)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    #print(model.summary())
    
    return model
    
    
def create_elmo_model(): 
    token_sequence = Input(shape=(1,), dtype="string", name="elmo_input")
    index_sequence = Input(shape=(SEQ_LENGTH,), name="standard_input")

    embedding1 = Lambda(ElmoEmbedding, output_shape=(SEQ_LENGTH, ELMO_EMBEDDING_DIM,))(token_sequence)
    embedding2 = Embedding(VOCABULARY_SIZE+INDEX_FROM-1, EMBEDDING_DIM, input_length=SEQ_LENGTH, 
                          weights=[embedding_matrix_en], trainable=False)(index_sequence)
    embedding = Concatenate()([embedding1, embedding2])
        
    conv = Conv1D(filters=64, kernel_size=3, padding='same', activation='relu')(embedding)
    pool = MaxPooling1D(pool_size=SEQ_LENGTH)(conv)
    flat = Flatten()(pool)
    dense = Dense(250, activation='relu')(flat)
    prediction = Dense(1, activation='sigmoid')(dense)

    model = Model(inputs=[index_sequence, token_sequence], outputs=prediction)
    optimizer = Adam(lr=0.00001, decay=1e-3)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    #print(model.summary())
    
    return model

## Training

We train both models for a maximum of 100 epochs, but we stop earlier when the validation loss hasn't improved for two epochs. We save and evaluate the model with the lowest validation loss. Although we didn't make a big effort to tune the learning rate, we did find that the ELMo model benefits from having a much smaller initial learning rate than the basic model. We use the same decay rate for both models. 

In [None]:
import math

def train_basic_model(model, X_train, y_train, X_val, y_val, X_test, y_test): 
    batch_size = 16
    earlystop = EarlyStopping(monitor='val_loss', patience=2) 
    checkpoint = ModelCheckpoint('basic_model.hdf5', save_best_only=True, monitor='val_loss', mode='min')

    model.fit(X_train, y_train, validation_data=(X_val, y_val), 
              epochs=100, batch_size=batch_size, callbacks=[earlystop, checkpoint])
    model.load_weights(filepath='basic_model.hdf5')
    scores = model.evaluate(X_test, y_test, batch_size=batch_size)
    print("Accuracy: %.2f%%" % (scores[1]*100))
    return scores[1]*100


def train_elmo_model(model, X_train, E_train, y_train, X_val, E_val, y_val, X_test, E_test, y_test): 
    batch_size = 16
    earlystop = EarlyStopping(monitor='val_loss', patience=2)        
    checkpoint = ModelCheckpoint('elmo_model.hdf5', save_best_only=True, monitor='val_loss', mode='min')

    model.fit([X_train, E_train], y_train, validation_data=([X_val, E_val], y_val), 
              epochs=100, batch_size=batch_size, callbacks=[earlystop, checkpoint])
    model.load_weights(filepath='elmo_model.hdf5')
    scores = model.evaluate([X_test, E_test], y_test, batch_size=batch_size)
    print("Accuracy: %.2f%%" % (scores[1]*100))
    return scores[1]*100

Because the ELMo model is quite slow, we chose to work with relatively small datasets. We train on just 200 training examples, validate the model on another 200 examples after each epoch, and test its final performance on 500 test examples. We repeat this process 10 times, and choose the training and validation examples randomly from the larger IMDB training set on each iteration. The ELMo model is trained, validated and tested on exactly the same examples as the basic model.  

In [None]:
elmo_accuracies = []
basic_accuracies = []

test_size = 500
validation_size = 200
training_size = 200
id2word_en = {v:k for k,v in word2id_en.items()}

for i in range(10):     
    
    train = list(zip(X_train, y_train))
    test = list(zip(X_test, y_test))
    random.shuffle(train)
    X_train, y_train = zip(*train)
    X_test, y_test = zip(*test)
    
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    
    X_test = np.array(X_test)
    y_test = np.array(y_test)
            
    train_texts = [" ".join([id2word_en[idx] for idx in seq]) for seq in X_train[:training_size]]
    test_texts = [" ".join([id2word_en[idx] for idx in seq]) for seq in X_test[:test_size]]
    val_texts = [" ".join([id2word_en[idx] for idx in seq]) for seq in X_test[test_size:test_size+validation_size]]

    E_train = np.array(train_texts)
    E_test = np.array(test_texts)
    E_val = np.array(val_texts)
        
    model_baseline = create_basic_model()
    basic_acc = train_basic_model(model_baseline, X_train[:training_size], y_train[:training_size], 
                                  X_test[test_size:test_size+validation_size],
                                  y_test[test_size:test_size+validation_size], 
                                  X_test[:test_size], y_test[:test_size])
    basic_accuracies.append(basic_acc)

    model_elmo = create_elmo_model()
    elmo_acc = train_elmo_model(model_elmo, X_train[:training_size], E_train, y_train[:training_size],
                                X_test[test_size:test_size+validation_size], 
                                E_val, y_test[test_size:test_size+validation_size], 
                                X_test[:test_size], E_test, y_test[:test_size])
    elmo_accuracies.append(elmo_acc)
        
    print(basic_accuracies)
    print(elmo_accuracies)
    
print(np.mean(basic_accuracies))
print(np.mean(elmo_accuracies))


Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Accuracy: 77.60%
Train on 200 samples, validate on 200 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Accuracy: 76.80%
[73.2, 73.6, 52.400000000000006, 47.599999999999994, 63.6, 52.6, 77.60000000000001]
[78.0, 75.4, 75.4, 77.8, 75.6, 72.6, 76.8]
Train on 200 samples, validate on 200 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Accuracy: 70.40%
Train on 200 samples, validate on 200 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Accuracy: 75.20%
[73.2, 73.6, 52.400000000000006, 47.599999999999994, 63.6, 52.6, 77.60000000000001, 70.39999999999999]
[78.0, 75.4, 75.4, 77.8, 75.6, 72.6, 76.8, 75.2]
Train on 200 samples, validate on 2

## Results

When we train a basic sentiment analysis model on just 200 training examples, the results are hit and miss: the accuracies on unseen texts range from just 48% to 78%. When we replace the simple embedding layer by an ELMo embedding layer, however, the model performs about 10% better on average. Its accuracies were also much more consistent, between 73% and 78%. 

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

accuracies = pd.DataFrame({'basic' : basic_accuracies, 'elmo': elmo_accuracies})
plt.rcParams['figure.figsize'] = (10,6)
accuracies.boxplot()

In [None]:
# Pair plot
plt.scatter(np.zeros(len(basic_accuracies)), basic_accuracies)
plt.scatter(np.ones(len(elmo_accuracies)), elmo_accuracies)

for i in range(len(basic_accuracies)):
    plt.plot( [0,1], [basic_accuracies[i], elmo_accuracies[i]], c='k')

plt.xticks([0,1], ['basic', 'elmo'])

plt.show()

# Explore

In [11]:
!pip install simple_elmo

Collecting simple_elmo
  Downloading simple_elmo-0.9.2-py3-none-any.whl.metadata (10 kB)
Downloading simple_elmo-0.9.2-py3-none-any.whl (46 kB)
Installing collected packages: simple_elmo
Successfully installed simple_elmo-0.9.2


In [12]:
from simple_elmo import ElmoModel
model = ElmoModel()

In [10]:
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

AttributeError: module 'tensorflow_hub' has no attribute 'Module'

In [17]:
elmo_model = tf.compat.v1.Module()

In [15]:
?tf.compat.v1.Module

[0;31mInit signature:[0m [0mtf[0m[0;34m.[0m[0mcompat[0m[0;34m.[0m[0mv1[0m[0;34m.[0m[0mModule[0m[0;34m([0m[0mname[0m[0;34m=[0m[0;32mNone[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m     
Base neural network module class.

A module is a named container for `tf.Variable`s, other `tf.Module`s and
functions which apply to user input. For example a dense layer in a neural
network might be implemented as a `tf.Module`:

>>> class Dense(tf.Module):
...   def __init__(self, input_dim, output_size, name=None):
...     super().__init__(name=name)
...     self.w = tf.Variable(
...       tf.random.normal([input_dim, output_size]), name='w')
...     self.b = tf.Variable(tf.zeros([output_size]), name='b')
...   def __call__(self, x):
...     y = tf.matmul(x, self.w) + self.b
...     return tf.nn.relu(y)

You can use the Dense layer as you would expect:

>>> d = Dense(input_dim=3, output_size=2)
>>> d(tf.ones([1, 3]))
<tf.Tensor: shape=(1, 2), dtype=float32, n