<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/embeddings/embeddings/ELMo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ELMo Embeddings

ELMo is a deep contextualized word representation that models:
- complex characteristics of word use (e.g., syntax and semantics)
- how these uses vary across linguistic contexts (i.e., to model polysemy).

These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.
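
In the ELMo paper, this "learned function" is a task-specific weighted sum: the per-layer vectors of a token are combined with softmax-normalized weights and scaled by a global factor γ. The sketch below illustrates that combination in plain Python. The function name `scalar_mix`, the toy 4-dimensional vectors and the weight values are assumptions made for illustration; they are not part of any library.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Weighted sum of the per-layer vectors of one token.

    layer_vectors: list of equal-length vectors, one per biLM layer
    raw_weights:   one unnormalized score per layer (learned in practice)
    gamma:         global scaling factor (also learned in practice)
    """
    weights = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]

# Toy per-layer vectors for a single token (made-up values).
layers = [
    [1.0, 0.0, 2.0, 4.0],   # input (character CNN) layer
    [3.0, 0.0, 2.0, 1.0],   # first biLSTM layer
    [2.0, 3.0, 2.0, 1.0],   # final biLSTM layer
]

# Equal raw scores give equal weights, so the mix reduces to a plain average.
mixed = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])
print(mixed)
```

With unequal scores, the layer with the largest score dominates the result, which is how a downstream task can favor one layer over the others.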



![elmo arch](https://drive.google.com/uc?id=19zI1rtdTW4INZ4PNAcyYeWlvl7UMcysN)

## Resources

- [AllenNLP](https://allennlp.org/elmo)
- [Analytics Vidhya Post on ELMo](https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/?utm_source=blog&utm_medium=pretrained-word-embeddings-nlp)

- [ELMo talk](https://www.youtube.com/watch?v=9JfGxKkmBc0)

- [NLP with Elmo and Flair](https://www.youtube.com/watch?v=ZEhWpZGlJvE)

- [ELMo Paper](https://arxiv.org/pdf/1802.05365.pdf)

- [Img ref](http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)

## Difference between ELMo and Word2Vec/GloVe Embeddings

### ELMo
---
**Advantages:**

- Embeddings are context dependent
    - For instance, in the sentence “He went to the prison cell with his cell phone to extract blood cell samples from inmates”, ELMo (like BERT) generates a different vector for each of the three occurrences of “cell”. The first “cell” (prison cell) would be closer to words like incarceration and crime, whereas the second “cell” (cell phone) would be closer to words like iphone, android and galaxy.

**Disadvantages:**

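- No simple arithmetic properties: a word has no single fixed vector, so analogies such as king - man + woman cannot be read off a lookup table
- No fixed word-to-word cosine similarities: the similarity between two words depends on the sentences they appear in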

### Word2Vec / GloVe
---
**Advantages:**

- Arithmetic properties (king - man + woman ≈ queen)
- Cosine similarities (dog ≈ husky)
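
Both properties can be illustrated with a toy example. The vectors below are hand-made 3-dimensional values chosen so that the analogy works out; real Word2Vec/GloVe vectors are learned from a corpus and have hundreds of dimensions. The helper names `cosine` and `analogy` are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors with dimensions (royalty, gender, person).
vectors = {
    "king":  [1.0,  1.0, 1.0],
    "queen": [1.0, -1.0, 1.0],
    "man":   [0.0,  1.0, 1.0],
    "woman": [0.0, -1.0, 1.0],
    "child": [0.0,  0.0, 1.0],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```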

**Disadvantages:**

- Embeddings are **not** context dependent
    - Each word gets a single numeric representation (its embedding/vector), regardless of where the word occurs in a sentence and regardless of the different meanings it may have. For instance, after we train word2vec/GloVe on a corpus (unsupervised training, no labels needed), we get one vector for, say, the word “cell”. So even in a sentence like “He went to the prison cell with his cell phone to extract blood cell samples from inmates”, where “cell” has a different meaning in each context, these models collapse all of them into one vector for “cell”.
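
A minimal sketch of this behavior, assuming a toy lookup table with invented 3-dimensional vectors (the names `static_table` and `UNKNOWN` are hypothetical): every occurrence of “cell” receives exactly the same vector because the lookup never consults the surrounding words.

```python
# Toy static embedding table: one fixed vector per word type (values invented).
static_table = {
    "cell":   [0.2, 0.7, 0.1],
    "prison": [0.9, 0.1, 0.0],
    "phone":  [0.0, 0.8, 0.3],
    "blood":  [0.1, 0.0, 0.9],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words missing from the toy table

sentence = ("He went to the prison cell with his cell phone "
            "to extract blood cell samples from inmates")
tokens = sentence.split()

# A static model embeds each token by lookup alone; context is never consulted.
embedded = [static_table.get(tok, UNKNOWN) for tok in tokens]

cell_positions = [i for i, tok in enumerate(tokens) if tok == "cell"]
cell_vectors = [embedded[i] for i in cell_positions]

print(cell_positions)                                       # [5, 8, 13]
print(all(vec == cell_vectors[0] for vec in cell_vectors))  # True
```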


---

[source](https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe)



# AllenNLP Library

[AllenNLP](https://allennlp.org/) is:

- #### Deep learning for NLP

  >  AllenNLP makes it easy to design and evaluate new deep learning models for nearly any NLP problem, along with the infrastructure to easily run them in the cloud or on your laptop.

- #### State-of-the-art models

  >  AllenNLP includes reference implementations of high quality models for both core NLP problems (e.g. semantic role labeling) and NLP applications (e.g. textual entailment).



## Initial Setup

In [0]:
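# ElmoEmbedder (used below) was removed in AllenNLP 1.0, so pin the 0.9.0 release this notebook was run with
!pip install allennlp==0.9.0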

## Using AllenNLP for ELMo Embeddings

In [0]:
import torch
from allennlp.commands.elmo import ElmoEmbedder

In [0]:
# small model
options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x1024_128_2048cnn_1xhighway/elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5"

# options and weight files for the other models are available here: https://allennlp.org/elmo

elmo = ElmoEmbedder(options_file=options_file, weight_file=weight_file)

In [0]:
sentence = "I ate an apple for breakfast"
words = sentence.split()

In [0]:
word_embeds = elmo.embed_sentence(words)

In [45]:
print(word_embeds.shape)

(3, 6, 256)


As we can see in the architecture diagram, embeddings are created by combining the representations from the `input` layer, the `first layer` and the `final layer`. ELMo word embeddings can, however, be constructed by combining these layers in different ways. The available combination strategies are:

**`all`**: Use the concatenation of the three ELMo layers.

**`top`**: Use the top ELMo layer.

**`average`**: Use the average of the three ELMo layers.


As the output above shows, `word_embeds` has shape `(3, 6, 256)`, i.e. `(layers, seq_length, emb_size)`.

Let's see how to perform the three operations listed above on `word_embeds`.



#### **`all`**: Use the concatenation of the three ELMo layers.

In [59]:
# conversion to torch tensors
word_tensors = torch.tensor(word_embeds)
word_tensors.shape

torch.Size([3, 6, 256])

In [61]:
all_mode = lambda x: torch.cat(x, 0)

all_embs = []
for token_idx in range(len(words)):
    # get this token's embedding from each of the three layers
    elmo_embedding_layers = [word_tensors[layer, token_idx, :] for layer in range(3)]
    # concatenate all the layer embeddings
    word_embedding = all_mode(elmo_embedding_layers)
    all_embs.append(word_embedding)

len(all_embs)

6

In [62]:
# concatenating the three layers gives an embedding of size 256 + 256 + 256 = 768
all_embs[0].shape

torch.Size([768])

#### **`top`**: Use the top ELMo layer

In [64]:
top_mode = lambda x: x[-1]

top_embs = []
for token_idx in range(len(words)):
    # get this token's embedding from each of the three layers
    elmo_embedding_layers = [word_tensors[layer, token_idx, :] for layer in range(3)]
    # keep only the embedding from the top (final) layer
    word_embedding = top_mode(elmo_embedding_layers)
    top_embs.append(word_embedding)

len(top_embs)

6

In [65]:
# taking only the top layer gives an embedding of size 256
top_embs[0].shape

torch.Size([256])

#### **`average`**: Use the average of the three ELMo layers

In [67]:
average_mode = lambda x: torch.mean(torch.stack(x), 0)

avg_embs = []
for token_idx in range(len(words)):
    # get this token's embedding from each of the three layers
    elmo_embedding_layers = [word_tensors[layer, token_idx, :] for layer in range(3)]
    # average the embeddings from all the layers
    word_embedding = average_mode(elmo_embedding_layers)
    avg_embs.append(word_embedding)

len(avg_embs)

6

In [69]:
# averaging is element-wise across the three layers, so the embedding size stays 256
avg_embs[0].shape

torch.Size([256])
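
The three loops above differ only in how the per-layer vectors are combined, so they can be folded into a single helper. The following is a dependency-free sketch on plain Python lists, with tiny made-up vectors standing in for the 256-dimensional ELMo layers. The function name `combine_layers` and its `mode` argument are illustrative and not an AllenNLP API.

```python
def combine_layers(layers, mode="all"):
    """Combine the per-layer vectors of one token.

    layers: list of equal-length vectors, ordered from input layer to top layer
    mode:   "all" (concatenate), "top" (last layer only) or "average"
    """
    if mode == "all":
        return [value for layer in layers for value in layer]
    if mode == "top":
        return list(layers[-1])
    if mode == "average":
        # element-wise mean across layers; the vector length does not change
        return [sum(values) / len(layers) for values in zip(*layers)]
    raise ValueError(f"unknown mode: {mode!r}")

# Three layers of size 2 for a single token (the small ELMo model uses size 256).
token_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]

print(combine_layers(token_layers, "all"))      # [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
print(combine_layers(token_layers, "top"))      # [5.0, 9.0]
print(combine_layers(token_layers, "average"))  # [3.0, 5.0]
```

With the `word_tensors` tensor from above, the same results should also be obtainable for all tokens at once without a Python loop: `torch.cat(word_tensors.unbind(0), dim=-1)` for `all`, `word_tensors[-1]` for `top` and `word_tensors.mean(dim=0)` for `average`.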

# Flair Library

[Flair](https://github.com/flairNLP/flair) is:


> A powerful NLP library. Flair allows you to apply our state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification.

> Multilingual. Thanks to the Flair community, we support a rapidly growing number of languages. We also now include 'one model, many languages' taggers, i.e. single models that predict PoS or NER tags for input text in various languages.

> A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including our proposed Flair embeddings, BERT embeddings and ELMo embeddings.

> A PyTorch NLP framework. Our framework builds directly on PyTorch, making it easy to train your own models and experiment with new approaches using Flair embeddings and classes.


## Initial Setup

In [1]:
!pip install flair


In [5]:
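# Flair's ELMoEmbeddings relies on AllenNLP; 0.9.0 is the release this notebook was run with
!pip install allennlp==0.9.0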


## Using Flair for ELMo Embeddings

Resources:

- [Flair tutorial on Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md)

- [List of available embeddings in Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md)

In [0]:
from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

In [0]:
# by default the mode is all
embedding = ELMoEmbeddings(model="small")

In [0]:
sentence = Sentence('I ate an apple for breakfast')

In [74]:
embedding.embed(sentence)

[Sentence: "I ate an apple for breakfast" - 6 Tokens]

In [75]:
for token in sentence:
    print(f"{token} \t emb: {token.embedding.shape}")

Token: 1 I 	 emb: torch.Size([768])
Token: 2 ate 	 emb: torch.Size([768])
Token: 3 an 	 emb: torch.Size([768])
Token: 4 apple 	 emb: torch.Size([768])
Token: 5 for 	 emb: torch.Size([768])
Token: 6 breakfast 	 emb: torch.Size([768])


#### **`all`**: Use the concatenation of the three ELMo layers.

In [76]:
# concatenating the three layers gives an embedding of size 256 + 256 + 256 = 768
for token in sentence:
    print(f"{token} \t emb: {token.embedding.shape}")

Token: 1 I 	 emb: torch.Size([768])
Token: 2 ate 	 emb: torch.Size([768])
Token: 3 an 	 emb: torch.Size([768])
Token: 4 apple 	 emb: torch.Size([768])
Token: 5 for 	 emb: torch.Size([768])
Token: 6 breakfast 	 emb: torch.Size([768])


#### Notes:

- **`top`** and **`average`** modes are not supported in flair==0.4.5.

- A PR adding this support has been raised [here](https://github.com/flairNLP/flair/pull/1547).

- A future release is expected to support these modes through an extra argument, `embedding_type`:
    - [ELMo blog by Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/ELMO_EMBEDDINGS.md)
    - [Source code](https://github.com/flairNLP/flair/blob/master/flair/embeddings/token.py#L1543)
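
In the meantime, `top` and `average` can be approximated by slicing the 768-dimensional `all` vector back into its three 256-dimensional layers. The sketch below assumes the concatenation order is input layer, first layer, final layer (the order used in the AllenNLP walkthrough earlier in this notebook). The helper name `split_layers` is hypothetical, and a dummy list stands in for `token.embedding.tolist()`.

```python
def split_layers(all_vector, num_layers=3):
    """Cut a concatenated embedding into equal-sized per-layer chunks."""
    if len(all_vector) % num_layers != 0:
        raise ValueError("vector length must be divisible by num_layers")
    size = len(all_vector) // num_layers
    return [all_vector[i * size:(i + 1) * size] for i in range(num_layers)]

# Dummy 768-d vector standing in for token.embedding.tolist():
# layer 0 is all 1.0, layer 1 is all 2.0, layer 2 is all 6.0.
all_vector = [1.0] * 256 + [2.0] * 256 + [6.0] * 256

layers = split_layers(all_vector)
top = layers[-1]                                               # final layer only
average = [sum(vals) / len(layers) for vals in zip(*layers)]   # element-wise mean

print(len(top), len(average))   # 256 256
print(top[0], average[0])       # 6.0 3.0
```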
