In [None]:
!pip3 install -qU gluonnlp mxnet awscli botocore boto3 nltk sacremoses

In [None]:
import io
import random
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp

In [None]:
elmo_intro = """
Extensive experiments demonstrate that ELMo representations work extremely well in practice.
We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis.
The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions.
For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder.
Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM.
Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.
"""

elmo_intro_file = 'elmo_intro.txt'
with io.open(elmo_intro_file, 'w', encoding='utf8') as f:
    f.write(elmo_intro)

dataset = nlp.data.TextLineDataset(elmo_intro_file, 'utf8')
print(len(dataset))
print(dataset[2]) # print an example sentence from the input data

In [None]:
tokenizer = nlp.data.SacreMosesTokenizer()
dataset = dataset.transform(tokenizer)
dataset = dataset.transform(lambda x: ['<bos>'] + x + ['<eos>'])
print(dataset[2]) # print the same tokenized sentence as above

Now, let's transform each *word* into a series of character tokens. 

Values 0-255 are UTF-8 byte values, and a few higher ids have a special meaning:
  * bos_id (256) – The index of the beginning-of-sentence character
  * eos_id (257) – The index of the end-of-sentence character
  * bow_id (258) – The index of the beginning-of-word character
  * eow_id (259) – The index of the end-of-word character
  * pad_id (260) – The index of the padding character
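
To make the scheme above concrete, here is a minimal pure-Python sketch of how a single word could be turned into character ids. The helper `encode_word` and the fixed word length of 50 are assumptions made for illustration. They are not taken from the library, whose actual encoding may differ in details such as the word length, the handling of `<bos>`/`<eos>`, or an offset applied to the ids.

```python
# Special ids as listed above. MAX_WORD_LENGTH is an assumed value for illustration.
BOS_ID, EOS_ID, BOW_ID, EOW_ID, PAD_ID = 256, 257, 258, 259, 260
MAX_WORD_LENGTH = 50


def encode_word(word, max_len=MAX_WORD_LENGTH):
    """Hypothetical helper: map one word to a fixed-length list of character ids."""
    # UTF-8 bytes give values in 0-255; keep room for the two word markers.
    byte_ids = list(word.encode('utf8'))[:max_len - 2]
    ids = [BOW_ID] + byte_ids + [EOW_ID]
    # Right-pad so that every word ends up with the same length.
    return ids + [PAD_ID] * (max_len - len(ids))


encoded = encode_word('ELMo', max_len=8)
print(encoded)
```

Because the ids are derived from bytes rather than from a fixed word list, any word can be encoded, including words never seen during training.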

In [None]:
vocab = nlp.vocab.ELMoCharVocab()
dataset = dataset.transform(lambda x: (vocab[x], len(x)), lazy=False)

Here's the same sentence, now encoded as an array of 33 sub-arrays (one per token) and paired with its token count.
Each sub-array holds the character ids of a single token.

In [None]:
print(dataset[2])

In [None]:
batch_size = 4
dataset_batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Pad(),
                                              nlp.data.batchify.Stack())
data_loader = gluon.data.DataLoader(dataset,
                                    batch_size=batch_size,
                                    batchify_fn=dataset_batchify_fn)
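
The batchify function above pads the encoded sentences of a batch to a common length and stacks their token counts. The pure-Python sketch below illustrates the idea on toy data. `pad_batch` is a hypothetical helper, and each token is simplified to a single id instead of an array of character ids, so this is not the library's implementation.

```python
def pad_batch(sequences, pad_val=0):
    """Hypothetical helper: right-pad variable-length lists to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]


# Toy (token ids, length) samples, mirroring the pairs produced by the dataset.
samples = [([5, 6, 7], 3), ([8, 9], 2), ([1], 1)]
token_ids, lengths = zip(*samples)

padded = pad_batch(list(token_ids))  # the role played by batchify.Pad()
stacked_lengths = list(lengths)      # the role played by batchify.Stack()
print(padded)
print(stacked_lengths)
```

Keeping the original lengths alongside the padded data is what later allows padded positions to be masked out.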

In [None]:
elmo_bilm, _ = nlp.model.get_model('elmo_2x1024_128_2048cnn_1xhighway',
                                   dataset_name='gbw',
                                   pretrained=True,
                                   ctx=mx.cpu())
print(elmo_bilm)

In [None]:
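def get_features(data, valid_lengths):
    # Read the sizes from the batch itself so a smaller final batch also works
    batch_size, length = data.shape[0], data.shape[1]
    hidden_state = elmo_bilm.begin_state(mx.nd.zeros, batch_size=batch_size)
    # mask[i, j] is 1 where position j is a real token of sentence i, 0 where it is padding
    mask = mx.nd.arange(length).expand_dims(0).broadcast_axes(axis=(0,), size=(batch_size,))
    mask = mask < valid_lengths.expand_dims(1).astype('float32')
    output, hidden_state = elmo_bilm(data, hidden_state, mask)
    return output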

batch = next(iter(data_loader))
features = get_features(*batch)
print([x.shape for x in features])
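
The mask built inside `get_features` marks which positions of each padded sentence hold real tokens. As a rough illustration, the same comparison of positions against valid lengths can be written in plain Python. `build_mask` is a hypothetical helper used only for this sketch.

```python
def build_mask(valid_lengths, max_length):
    """Hypothetical helper: 1.0 for real tokens, 0.0 for padded positions."""
    return [[1.0 if pos < n else 0.0 for pos in range(max_length)]
            for n in valid_lengths]


mask = build_mask([3, 1, 2], max_length=3)
for row in mask:
    print(row)
```

Each row corresponds to one sentence of the batch, and the number of ones in a row equals that sentence's valid length.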

We get three outputs: one for the character-level CNN, and one for each of the two LSTMs. 

Each output is a three-dimensional array whose axes are:
  * the batch size,
  * the number of tokens in the longest sentence of the batch (shorter sentences are padded to that length),
  * the size of the embedding computed for each token.
  
Let's print the LSTM outputs for the sentence above.

In [None]:
print(features[1][2])
print(features[2][2])
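
Since every sentence is padded to the longest one in its batch, the arrays printed above can include rows for padded positions. A minimal sketch of dropping those rows using the valid lengths follows. It uses toy nested lists rather than real model outputs, and `trim_padding` is a hypothetical helper.

```python
def trim_padding(sentence_output, valid_length):
    """Hypothetical helper: keep only the rows that belong to real tokens."""
    return sentence_output[:valid_length]


# Toy layer output: batch of 2 sentences, 3 positions each, embedding size 2.
layer_output = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.0, 0.0], [0.0, 0.0]],
]
valid_lengths = [3, 1]

trimmed = [trim_padding(sentence, n)
           for sentence, n in zip(layer_output, valid_lengths)]
print([len(sentence) for sentence in trimmed])
```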