# Steps to train an ELMo model

Sources:
* https://github.com/allenai/bilm-tf
* https://appliedmachinelearning.blog/2019/11/30/training-elmo-from-scratch-on-custom-data-set-for-generating-embeddings-tensorflow/

In [21]:
import tensorflow as tf
import os
import numpy as np
import pandas as pd
from collections import Counter
from pathlib import Path
from bilm import Batcher, BidirectionalLanguageModel, weight_layers
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [2]:
tf.__version__

'1.15.0'

## Load the tokenized unaligned data

The corpus was already tokenized with the provided tokenizer, which inserts a space between tokens, so we load the tokenized files directly.

In [3]:
# Load data
DIRECTORY_URL = "data/train/"
FILE_NAMES = ["unalignedtok.en", "unalignedtok.fr"]

unaligned_en = []
unaligned_fr = []

# Read one sentence per line, stripping trailing whitespace and the newline.
# The `with` block closes each file automatically.
with open(os.path.join(DIRECTORY_URL, FILE_NAMES[0]), 'r', encoding="UTF-8") as en_file:
    for line in en_file:
        unaligned_en.append(line.rstrip())

with open(os.path.join(DIRECTORY_URL, FILE_NAMES[1]), 'r', encoding="UTF-8") as fr_file:
    for line in fr_file:
        unaligned_fr.append(line.rstrip())

In [4]:
len(unaligned_en), len(unaligned_fr)

(474000, 105395)

In [5]:
unaligned_en[0]

"for the second phase of the trials we just had different sizes , small , medium , large and extra - large . it 's true ."

In [6]:
unaligned_fr[0]

'nous n’ aurions pas pu dégager d’ accord sur un calendrier de conclusion de la cig sans l’ engagement politique de mes collègues du conseil européen .'

## Break down the unaligned dataset
Training requires multiple files with one sentence per line, so we split each dataset into files of 6 sentences. For the English data (474,000 sentences) this gives 79K files in the train directory; the French data (105,395 sentences) gives 17,566.

In [7]:
def create_trainset(train_folder, dataset):
    # Create the train folder if it does not exist yet
    os.makedirs(train_folder, exist_ok=True)

    # Break the dataset down into files of 6 sentences each,
    # named after the index of their first sentence (0.txt, 6.txt, ...)
    for i in range(0, dataset.shape[0], 6):
        text = "\n".join(dataset[i:i+6].tolist())
        with open(os.path.join(train_folder, str(i) + ".txt"), "w", encoding='UTF-8') as fp:
            fp.write(text)

In [20]:
# Write the English training files
create_trainset("swb_en\\train\\", np.array(unaligned_en))

In [21]:
# Write the French training files
create_trainset("swb_fr\\train\\", np.array(unaligned_fr))
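
As a quick sanity check on the split, the number of files each call should produce is the sentence count divided by six, rounded up. A minimal sketch, where `expected_file_count` is a hypothetical helper and the sentence counts are the ones printed when the data was loaded:

```python
import math

def expected_file_count(n_sentences, sentences_per_file=6):
    """Number of files produced when writing `sentences_per_file` sentences per file."""
    return math.ceil(n_sentences / sentences_per_file)

en_files = expected_file_count(474000)   # English: divides evenly
fr_files = expected_file_count(105395)   # French: the last file holds the remainder
print(en_files, fr_files)
```

These values can be compared with the number of `.txt` files actually present in each train folder, for example with `len(os.listdir(...))`.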

## Generate the vocabulary

In [8]:
def get_vocabulary(dataset, train_folder):
    # Count space-separated tokens over the whole dataset
    texts = " ".join(dataset.tolist())
    words = texts.split(" ")
    print("Number of tokens in Training data = ",len(words))
    dictionary = Counter(words)
    print("Size of Vocab",len(dictionary))
    # bilm expects the special tokens first, then tokens by descending frequency
    sorted_vocab = ["<S>","</S>","<UNK>"]
    sorted_vocab.extend([pair[0] for pair in dictionary.most_common()])

    # Write one token per line to <train_folder>/vocab.txt
    text = "\n".join(sorted_vocab)
    with open(os.path.join(train_folder, "vocab.txt"), "w", encoding='UTF-8') as fp:
        fp.write(text)
    return text

In [9]:
en_vocab = get_vocabulary(np.array(unaligned_en), "swb_en")

Number of tokens in Training data =  9814295
Size of Vocab 60023


In [10]:
fr_vocab = get_vocabulary(np.array(unaligned_fr), "swb_fr")

Number of tokens in Training data =  2389468
Size of Vocab 47190
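
The bilm-tf README asks for the three special tokens on the first lines of the vocabulary file, and splitting on a single space can produce an empty token if a sentence is empty or contains consecutive spaces. A sketch of a check that could be run on the generated file; `check_vocab` is a hypothetical helper, not part of bilm:

```python
def check_vocab(vocab_path):
    """Return a list of problems found in a one-token-per-line vocabulary file."""
    with open(vocab_path, "r", encoding="UTF-8") as fp:
        text = fp.read()
    # Ignore a single trailing newline so it is not reported as an empty token.
    if text.endswith("\n"):
        text = text[:-1]
    tokens = text.split("\n")

    problems = []
    if tokens[:3] != ["<S>", "</S>", "<UNK>"]:
        problems.append("special tokens are not the first three lines")
    if "" in tokens:
        problems.append("empty token (blank line) present")
    if len(set(tokens)) != len(tokens):
        problems.append("duplicate tokens present")
    return problems
```

An empty list means none of these three problems were found; it says nothing about the frequency ordering.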


## Train the biLM model

Training was run on a cluster. The steps were:
* Install the bilm package from its GitHub repository: https://github.com/allenai/bilm-tf
* Put the training data, the generated vocabulary, and the hyperparameter file (options.json) in the same folder (e.g. swb)
* Configure the hyperparameters (e.g. the projection dimension) in options.json and place the file in the checkpoint folder
* Run the training script: `python bin/train_elmo.py --train_prefix='swb/train/*' --vocab_file 'swb/vocab.txt' --save_dir 'swb/checkpoint'`
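
Since two models are trained here (English and French), the command can be assembled once per language. A minimal sketch that only builds and prints the command lines; the folder names `swb_en` and `swb_fr` come from the earlier cells, their layout on the cluster is an assumption, and `build_train_command` is a hypothetical helper:

```python
import shlex

def build_train_command(model_dir):
    """Build the bilm-tf training command for one model folder (uses '/' separators)."""
    return [
        "python", "bin/train_elmo.py",
        "--train_prefix", model_dir + "/train/*",
        "--vocab_file", model_dir + "/vocab.txt",
        "--save_dir", model_dir + "/checkpoint",
    ]

for model_dir in ("swb_en", "swb_fr"):
    cmd = build_train_command(model_dir)
    # Quote each argument so the shell does not expand the glob itself.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed lines can be pasted into a job script, or the list can be passed to `subprocess.run(cmd, check=True)` from the root of the bilm-tf checkout.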

## Generate weights

After training, convert the TensorFlow checkpoints to HDF5 weights by running this script:
* `python bin/dump_weights.py --save_dir 'swb/checkpoint' --outfile 'swb/swb_weights.hdf5'`

## Generate ELMo embeddings

Once the weights are generated, we can compute ELMo embeddings for a given sequence by calling the method defined below.

Before calling it, a few final adjustments are needed:
* Keep the dumped weights file in a newly created model folder.
* Create an options.json file for the newly trained model in the same folder.
* Always set n_characters to 262 in this file after training.
* Keep vocab.txt in the model directory.
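
The options adjustment in the list above can be scripted. A minimal sketch, assuming the training run left an `options.json` in the checkpoint folder and that `n_characters` sits under the `char_cnn` key, as in the options files shipped with bilm-tf; `write_eval_options` is a hypothetical helper:

```python
import json

def write_eval_options(train_options_path, eval_options_path):
    """Copy the training options, setting n_characters to 262 for inference."""
    with open(train_options_path, "r", encoding="UTF-8") as fp:
        options = json.load(fp)

    # Assumption: the character-CNN settings live under the 'char_cnn' key.
    options["char_cnn"]["n_characters"] = 262

    with open(eval_options_path, "w", encoding="UTF-8") as fp:
        json.dump(options, fp, indent=2)
    return options
```

For example, `write_eval_options("swb_en/checkpoint/options.json", "ELMo/swb_en/options_eval.json")`, where the source path is illustrative and should be adapted to the actual folder layout.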

In [18]:
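def get_elmo_emb(input_seq, max_length, options_file, weight_file, vocab_file):
    """Return the ELMo embeddings of one whitespace-tokenized sentence,
    zero-padded along the token axis to max_length."""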
    tf.compat.v1.disable_eager_execution()
    tf.reset_default_graph()
    
    print('getting the context of the sequence :', input_seq)
    # Create a Batcher to map text to character ids.
    batcher = Batcher(vocab_file, 50)
 
    # Input placeholders to the biLM.
    context_character_ids = tf.compat.v1.placeholder('int32', shape=(None, None, 50))
 
    # Build the biLM graph.
    bilm = BidirectionalLanguageModel(options_file, weight_file)
 
    # Get ops to compute the LM embeddings.
    context_embeddings_op = bilm(context_character_ids)
     
    # Get an op to compute ELMo (weighted average of the internal biLM layers)
    elmo_context_input = weight_layers('input', context_embeddings_op, l2_coef=0.0)
    
    # Tokenize on whitespace; the batcher expects a list of tokenized sentences.
    tokenized_context = [input_seq.split()]
    
    with tf.Session() as sess:
        # It is necessary to initialize variables once before running inference.
        sess.run(tf.global_variables_initializer())
 
        # Create batches of data.
        context_ids = batcher.batch_sentences(tokenized_context)
        #print("Shape of context ids = ", context_ids.shape)
 
        # Compute ELMo representations (here for the input only, for simplicity).
        elmo_context_input_ = sess.run(
            elmo_context_input['weighted_op'],
            feed_dict={context_character_ids: context_ids}
        )
    # Zero-pad the token axis (post-padding) up to max_length
    elmo_emb = pad_sequences(elmo_context_input_, maxlen=max_length, padding='post', dtype='float32')
    return elmo_emb

In [19]:
# Define the parameters:
input_seq = unaligned_en[0]
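# Note: this is the character count, an upper bound on the token count,
# so the returned array ends with zero-padded rows
max_length = len(unaligned_en[0])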
options_file = "ELMo/swb_en/options_eval.json"
weight_file = "ELMo/swb_en/swb_weights_en.hdf5"
vocab_file = "ELMo/swb_en/vocab.txt"
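
`len(unaligned_en[0])` counts characters, while `get_elmo_emb` pads along the token axis. If a token-based length is preferred, it can be derived from the same whitespace split that `get_elmo_emb` uses. A small sketch on the first English sentence (`token_length` is a hypothetical helper):

```python
def token_length(sentence):
    """Number of whitespace-separated tokens, matching input_seq.split()."""
    return len(sentence.split())

sentence = ("for the second phase of the trials we just had different sizes , "
            "small , medium , large and extra - large . it 's true .")
n_tokens = token_length(sentence)
n_chars = len(sentence)
print(n_tokens, n_chars)  # the token count is much smaller than the character count
```

For a whole corpus, `max(token_length(s) for s in unaligned_en)` would give the longest sentence in tokens, which is one possible choice for `max_length`.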

In [22]:
get_elmo_emb(input_seq, max_length, options_file, weight_file, vocab_file)

getting the context of the sequence : for the second phase of the trials we just had different sizes , small , medium , large and extra - large . it 's true .
USING SKIP CONNECTIONS


array([[[-0.42958167,  0.6997375 , -1.1938176 , ...,  1.2128806 ,
         -0.5734399 , -0.07925927],
        [ 0.18870763,  0.75108993,  0.35382527, ...,  0.55428684,
         -1.0110341 , -0.6422628 ],
        [ 1.8181396 ,  0.79422575, -0.23697373, ...,  0.43018457,
         -2.3241005 , -2.5005286 ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]]], dtype=float32)