# Embeddings from Language Models (ELMo)

# Acknowledgment

This Google Colab notebook was created by **`Manuel Escola`** using the library created by **`Dani El-Ayyass`** and **`Andrey Kutuzov`**. You can find all the documentation at https://github.com/ltgoslo/simple_elmo.

<br />
<img style="width: 30%; padding-left: 11rem;" width="30%" src="https://drive.google.com/uc?id=1j7mUForsRLryX7CCQOOiYVB4W-c0mmnc" />

_Note: This notebook has been designed to be run in Google Colab. If you run it locally or on another platform, please make sure the libraries are correctly installed on your machine and the datasets are loaded correctly into the notebook._

# Installing Simple ELMo library

In [1]:
%%capture
!pip install --upgrade simple_elmo

# Importing pre-trained model

In [2]:
from simple_elmo import ElmoModel

model = ElmoModel()

In [3]:
%%capture
# https://drive.google.com/file/d/1ILVz0nq5gJ3ZxbvHyU1osUZ4vuBdAgQ7/view?usp=sharing
! gdown --id 1ILVz0nq5gJ3ZxbvHyU1osUZ4vuBdAgQ7

In [4]:
PATH_TO_ELMO = '/content/ELMo.zip'

In [None]:
model.load(PATH_TO_ELMO)



# Creating useful functions

First, we define two helper functions: one tokenises sentences at word level, and the other finds the position of a given word (the one we want to compare) within each sentence. For better tokenisation, you can use the `spaCy` library.

In [None]:
def sentence_to_words(sentence):
    '''
    Split a sentence into a list of words.
    Commas and full stops are removed from the sentence
    before splitting on whitespace.
    '''
    # Remove commas and dots
    sentence = sentence.replace(',', '').replace('.', '')

    # Split the sentence into words and return the list
    return sentence.split()
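
The function above strips only commas and full stops, so other punctuation (for example `!` or `?`) stays attached to the neighbouring word. As a minimal sketch of an alternative that needs only the standard library, the hypothetical helper `sentence_to_words_regex` below (not part of the original notebook or of `simple_elmo`) keeps runs of word characters, including internal apostrophes, and drops everything else:

```python
import re

def sentence_to_words_regex(sentence):
    '''
    Hypothetical variant of sentence_to_words: keep runs of word
    characters (with internal apostrophes) and drop other punctuation.
    '''
    return re.findall(r"\w+(?:'\w+)*", sentence)

print(sentence_to_words_regex("When I arrived, she already had left!"))
print(sentence_to_words_regex("Don't you write with your left hand?"))
```

This is still a rough tokeniser; a dedicated library such as `spaCy` handles many more cases.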

In [None]:
# Testing the function

sentence1 = "When I arrived, she already had left"
sentence2 = "I write with my left hand"

sentence_1 = sentence_to_words(sentence1)
sentence_2 = sentence_to_words(sentence2)

print(sentence_1)
print(sentence_2)

In [None]:
def find_word_indices(sentences, word):
    '''
    Return, for each sentence, the index of the first occurrence
    of the word to compare.

    Input: a list of sentences (already tokenised) and the word to find
    Output: a list with one index per sentence
            (None if the word does not occur in that sentence)
    '''
    indices = []
    for sentence in sentences:
        try:
            # Find the index of the word in the sentence
            word_index = sentence.index(word)
            indices.append(word_index)
        except ValueError:
            # The word does not occur in this sentence: record None
            indices.append(None)
    return indices
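
Note that `list.index` returns only the first match, so a word that appears twice in a sentence is reported once. If every occurrence matters, one possible approach is sketched below; `find_all_word_indices` is a hypothetical helper introduced here for illustration, and the example sentences are made up:

```python
def find_all_word_indices(sentences, word):
    '''
    Hypothetical helper: for each tokenised sentence, return the list
    of every position at which `word` occurs (empty list if none).
    '''
    return [
        [i for i, token in enumerate(sentence) if token == word]
        for sentence in sentences
    ]

example = [
    ["turn", "left", "then", "left", "again"],
    ["I", "write", "with", "my", "left", "hand"],
    ["no", "match", "here"],
]
print(find_all_word_indices(example, "left"))
```

Returning an empty list instead of `None` keeps the output type uniform, which can make later loops simpler.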

In [None]:
# Testing the function

word_to_find = "left"

SENTENCES = [sentence_1,
             sentence_2]

find_word_indices(SENTENCES, word_to_find)

# Data preprocessing

We use the first function to transform each sentence into a list of words (i.e., we tokenise the sentences).

In [None]:
sentence1 = "When I arrived, she already had left"
sentence2 = "I write with my left hand"

sentence_1 = sentence_to_words(sentence1)
sentence_2 = sentence_to_words(sentence2)
print(sentence_1)
print(sentence_2)
print('\n')

In [None]:
# We save the two lists of words (tokenized sentence) into a new list

SENTENCES = [sentence_1,
             sentence_2]

num_sentences = len(SENTENCES)
num_words = max(len(sentence) for sentence in SENTENCES)

print(f'The number of sentences is {num_sentences}')
print(f'The number of words in the longest sentence is {num_words}')
print('The length of the embedding for each word is always 1024 in this ELMo version')
print('\n')
print(f'The shape of the array returned by the model should be: ({num_sentences}, {num_words}, 1024)')

# Implementing model

We use the model to transform each word in the sentences into a vector of 1024 values. This dimensionality is fixed by the pre-trained model we loaded; ELMo models trained with a different configuration can produce vectors of a different size.

In [None]:
result = model.get_elmo_vectors(SENTENCES)
print(f'The shape is: {result.shape}, as expected')
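
Because the second axis has the length of the longest sentence, shorter sentences occupy only part of their slice. The sketch below shows how the rows for real tokens could be separated from the rest. It uses a random stand-in array rather than the model output, and it assumes that unused positions are filled with zero vectors; that is our reading of the library's behaviour and is worth checking against its documentation. The names `fake_result`, `lengths` and `per_sentence` are illustrative only.

```python
import numpy as np

# Stand-in for the model output: 2 sentences, longest has 7 tokens, 1024 dimensions.
# Random values are used only so that the sketch runs without loading ELMo.
rng = np.random.default_rng(0)
lengths = [7, 6]  # number of tokens in each sentence
fake_result = np.zeros((2, 7, 1024))
for i, n in enumerate(lengths):
    fake_result[i, :n] = rng.normal(size=(n, 1024))

# Keep only the rows that belong to real tokens in each sentence
per_sentence = [fake_result[i, :n] for i, n in enumerate(lengths)]
for vectors in per_sentence:
    print(vectors.shape)
```

With the real output, `lengths` would be `[len(s) for s in SENTENCES]`.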

# Finding the embeddings of the word to compare (second function)

We find the index of the word whose embedding (vector) we want to compare in both sentences. <br>
* **Note**: Remember that the word is represented by a different vector within each sentence because the vector is created considering the context.

In [None]:
word_to_find = "left"
indices = find_word_indices(SENTENCES, word_to_find)
print(f'The index of the word "{word_to_find}" in sentence 1 is {indices[0]}')
print(f'The index of the word "{word_to_find}" in sentence 2 is {indices[1]}')

# Compute the cosine similarity

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

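# Stop early if the word was not found in one of the sentences
# (find_word_indices returns None for that sentence)
if None in indices:
    raise ValueError(f'"{word_to_find}" is missing from at least one sentence')

# Embedding of the word in sentence 1
embed_1 = result[0][indices[0]]
# Embedding of the word in sentence 2
embed_2 = result[1][indices[1]]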

# Reshape the embeddings to 2D arrays of shape [1, 1024]
embed_1_reshaped = np.reshape(embed_1, (1, -1))
embed_2_reshaped = np.reshape(embed_2, (1, -1))

# Calculate cosine similarity
similarity = cosine_similarity(embed_1_reshaped, embed_2_reshaped)[0][0]
print("The cosine similarity for word '{}' is: {:.4f}".format(word_to_find, similarity))
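
To make the metric itself explicit: the cosine similarity of two vectors is their dot product divided by the product of their norms, so it ranges from -1 (opposite directions) to 1 (same direction). The sketch below computes it directly with NumPy on small toy vectors; the function name `cosine` and the vectors `a` and `b` are illustrative and not part of the original notebook.

```python
import numpy as np

def cosine(u, v):
    '''Cosine similarity of two 1-D vectors: dot(u, v) / (||u|| * ||v||).'''
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(round(cosine(a, a), 4))   # same direction
print(round(cosine(a, b), 4))   # 45 degrees apart
print(round(cosine(a, -a), 4))  # opposite direction
```

Applied to the two 1024-dimensional ELMo vectors (before reshaping), this should agree with the value returned by `cosine_similarity` up to floating-point error.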