# ELMo
ELMo learns contextualized word vectors by running the text through a deep recurrent network.  
ELMo is trained without supervision, so it makes no use of the labels we have for our text classification task. The authors nevertheless show that contextualized word vectors obtained from ELMo improve performance on a wide range of tasks. Let's see whether we get a significant gain in our case!  

There are good examples of using ELMo in both [the AllenNLP GitHub repo](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) and [this AnalyticsVidhya post](https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/?utm_source=blog&utm_medium=top-pretrained-models-nlp-article). In this guide we'll use the Python package [Flair](https://github.com/zalandoresearch/flair) to get ELMo embeddings.

## Contextual word embeddings with ELMo in Flair
In Flair, you initialize a `Sentence` object from a string of tokens separated by spaces.  
`Sentence` has a few useful attributes and methods.

In [1]:
!pip install allennlp
!pip install flair





In [1]:
from flair.data import Sentence
sentence = Sentence('The grass is green .')

We also instantiate the class of the desired embedding method. Its `embed` method takes a `Sentence` and attaches the corresponding embedding to each of its tokens.

In [2]:
from flair.embeddings import ELMoEmbeddings

# init embedding
elmo_embedding = ELMoEmbeddings()

elmo_embedding.embed(sentence)
for token in sentence:
    print(token)
    print(token.embedding.shape)
    print(token.embedding)

Token: 1 The
torch.Size([3072])
tensor([-0.3288,  0.2022, -0.5940,  ..., -1.2773,  0.3049,  0.2150])
Token: 2 grass
torch.Size([3072])
tensor([ 0.2539, -0.2363,  0.5263,  ..., -0.7001,  0.8798,  1.4191])
Token: 3 is
torch.Size([3072])
tensor([ 0.1915,  0.2300, -0.2894,  ..., -0.3626,  1.9066,  1.4520])
Token: 4 green
torch.Size([3072])
tensor([ 0.1779,  0.1309, -0.1041,  ..., -0.1006,  1.6152,  0.3299])
Token: 5 .
torch.Size([3072])
tensor([-0.8872, -0.2004, -1.0601,  ..., -0.0106, -0.0833,  0.0669])


**Try it yourself:** Now, compare the embeddings obtained using ELMo for the same word in different contexts. Are they equal or different?

### Word sense disambiguation using ELMo
**Try it yourself:** Let's also try to see how ELMo handles word sense disambiguation. Below are 6 sentences with 2 different meanings of the word `bank`. Try to see if ELMo vectors indeed separate the two meanings.

In [3]:
sentences = [
    "I was walking along the river bank",
    "I saw a toad near the east bank of the river",
    "We had a nice picnic by the bank",
    "I need to deposit money from the bank",
    "The bank branch is closed",
    "He started working at the bank"
]
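
One way to approach this exercise is to compare the `bank` vectors pairwise using cosine similarity. The sketch below uses short made-up vectors in place of real ELMo output, so it runs without downloading a model; `toy_vectors`, `cosine_similarity`, and `find_token_index` are illustrative names and not part of Flair. With real data you would replace each toy vector with the embedding of the `bank` token in the corresponding embedded sentence.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_token_index(text, target="bank"):
    """Position of `target` in a whitespace-tokenized sentence (0-based)."""
    return text.split().index(target)

# Made-up 3-dimensional stand-ins for the `bank` vector of each sentence.
# Assumption: with Flair, each would come from the embedding of the `bank` token.
toy_vectors = [
    [0.9, 0.1, 0.0],   # river bank
    [0.8, 0.2, 0.1],   # east bank of the river
    [0.7, 0.3, 0.0],   # picnic by the bank
    [0.1, 0.9, 0.2],   # deposit money
    [0.0, 0.8, 0.3],   # bank branch
    [0.2, 0.9, 0.1],   # working at the bank
]

# Pairwise similarity matrix: entry [i][j] compares sentence i with sentence j.
similarity = [[cosine_similarity(u, v) for v in toy_vectors] for u in toy_vectors]

for row in similarity:
    print(" ".join(f"{value:.2f}" for value in row))
```

If ELMo separates the two senses, the matrix should show two blocks of high values, one per sense. The toy numbers here were chosen to look that way and say nothing about what ELMo actually produces.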

### Classify documents using the average of contextual word vectors
**Try it yourself:** *Optional:* In previous sections we've built a classifier using the average of non-contextual word vectors. Now, try to use contextual word embeddings on our dataset. Use the average of these vectors and apply a classifier on it to obtain the predictions. Is the performance better than for non-contextual word vectors?

### Sentence embedding using ELMo
We've used Flair to get embeddings for each word in the sentence. However, to classify an entire document, we need a way to combine all these vectors into a single document embedding. There are several methods for that; those interested may find this article useful: https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d

The most basic approach is to average the word embeddings into a single document embedding. In Flair, we do this using `DocumentPoolEmbeddings`.

In [4]:
from flair.embeddings import DocumentPoolEmbeddings
document_embeddings = DocumentPoolEmbeddings([elmo_embedding], pooling='mean')

document_embeddings.embed(sentence)
document_embeddings

DocumentPoolEmbeddings(
  fine_tune_mode=linear, pooling=mean
  (embeddings): StackedEmbeddings(
    (list_embedding_0): ELMoEmbeddings(model=elmo-original)
  )
  (embedding_flex): Linear(in_features=3072, out_features=3072, bias=False)
)

In [5]:
# now check out the embedded sentence.
print(sentence.get_embedding().shape)
print(sentence.get_embedding())

torch.Size([3072])
tensor([-0.1185,  0.0253, -0.3043,  ..., -0.4902,  0.9246,  0.6966],
       grad_fn=<CatBackward>)
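
Conceptually, mean pooling averages the token vectors dimension by dimension. Here is a minimal pure-Python sketch with made-up 4-dimensional vectors (the real ELMo vectors have 3072 dimensions). `mean_pool` is a hypothetical helper, and the sketch ignores the linear `embedding_flex` layer that appears in the printed module above.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    n_tokens = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens
    return [sum(dimension) / n_tokens for dimension in zip(*token_vectors)]

# Made-up vectors standing in for the embeddings of three tokens
token_vectors = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]

document_vector = mean_pool(token_vectors)
print(document_vector)  # [2.0, 2.0, 1.0, 0.0]
```

The document vector has the same dimension as a single token vector, whatever the length of the text.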


Alternatively, we can run an RNN over the word embeddings and use its last hidden state as the document embedding. In this case it is very helpful to train the model using the true labels of our task, so that the RNN is optimized for our own data and task. The example below uses GloVe word vectors as input:

In [6]:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
glove_embedding = WordEmbeddings('glove')
document_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')

In [7]:
# create an example sentence
sentence = Sentence('The grass is green . And the sky is blue .')

# embed the sentence with our document embedding
document_embeddings.embed(sentence)

# now check out the embedded sentence.
print(sentence.get_embedding())

tensor([ 0.1027, -0.1833,  0.1588, -0.0189, -0.0200, -0.0450,  0.0138,  0.0329,
        -0.1762,  0.0325,  0.2245, -0.0553, -0.0456, -0.0457,  0.3575,  0.0486,
         0.3647, -0.0574, -0.1374, -0.0072,  0.1516, -0.0705, -0.1483, -0.0227,
         0.0541, -0.0501, -0.0200, -0.3354,  0.1682,  0.0555, -0.2173, -0.0564,
        -0.0606,  0.0550,  0.1950,  0.0486, -0.0994,  0.1542,  0.0731, -0.0298,
         0.0631,  0.1804,  0.0638, -0.0166, -0.0665, -0.1210, -0.2998, -0.1263,
        -0.2099, -0.1629,  0.2091, -0.1292,  0.1359, -0.0130,  0.1932,  0.0120,
        -0.0294,  0.3837,  0.1122, -0.0678, -0.0818,  0.1508, -0.0017, -0.0788,
         0.0623,  0.0702,  0.2679, -0.0483, -0.0035, -0.1998,  0.0654, -0.2028,
        -0.1047,  0.1110,  0.1318, -0.2363,  0.0176, -0.0227,  0.0753,  0.0144,
        -0.0770, -0.0076, -0.0836, -0.1628,  0.0625,  0.0755, -0.1855,  0.0374,
         0.0665,  0.0162,  0.0702, -0.0649, -0.0161, -0.0772,  0.3199, -0.1396,
         0.1110,  0.0626, -0.0382,  0.19

Note that while `DocumentPoolEmbeddings` are immediately meaningful, `DocumentRNNEmbeddings` need to be tuned on the downstream task. This happens automatically in Flair if you train a new model with these embeddings. You can find an example of training a text classification model [here](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md#training-a-text-classification-model). Once the model is trained, you can access the tuned `DocumentRNNEmbeddings` object directly from the classifier object and use it to embed sentences.

`DocumentRNNEmbeddings` have a number of hyper-parameters that can be tuned to improve learning:

```
:param hidden_size: the number of hidden states in the rnn.
:param rnn_layers: the number of layers for the rnn.
:param reproject_words: boolean value, indicating whether to reproject the token embeddings in a separate linear
layer before putting them into the rnn or not.
:param reproject_words_dimension: output dimension of reprojecting token embeddings. If None the same output
dimension as before will be taken.
:param bidirectional: boolean value, indicating whether to use a bidirectional rnn or not.
:param dropout: the dropout value to be used.
:param word_dropout: the word dropout value to be used, if 0.0 word dropout is not used.
:param locked_dropout: the locked dropout value to be used, if 0.0 locked dropout is not used.
:param rnn_type: one of 'RNN' or 'LSTM'
```
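
To make `hidden_size` and the "last hidden state" idea concrete, here is a toy Elman-style RNN in plain Python. It is only a sketch of the general technique, not Flair's implementation: the weights are random and untrained, and `run_rnn` and its arguments are hypothetical names.

```python
import math
import random

def run_rnn(token_vectors, hidden_size, seed=0):
    """Run a single-layer tanh RNN over the tokens and return the last hidden state."""
    rng = random.Random(seed)
    input_size = len(token_vectors[0])
    # Randomly initialised weights; a real model would learn these from the labels.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]

    hidden = [0.0] * hidden_size
    for x in token_vectors:
        # h_t = tanh(W_in x_t + W_hid h_{t-1})
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_hid[j], hidden))
            )
            for j in range(hidden_size)
        ]
    return hidden

# Three made-up 2-dimensional token vectors
tokens = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.8]]
doc_embedding = run_rnn(tokens, hidden_size=4)
print(len(doc_embedding))  # 4: the document vector has hidden_size dimensions
```

Unlike mean pooling, the result of an RNN generally depends on the order of the tokens, and its size is set by `hidden_size` rather than by the input embedding dimension.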

### Loading dataset
The simplest way to load our data in Flair is from CSV files. You can learn about other methods in [the documentation](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md).

To create a `Corpus` for a text classification task, you need three files (train, dev, and test) located in one folder. For our dataset, the folder structure looks like this:
```text
data/train.csv
data/val.csv
data/test.csv
```
Now create a `CSVClassificationCorpus` by pointing to this folder (`data/`). 
Each line in a file is then converted to a `Sentence` object annotated with its labels.

Note: a line of text can contain multiple sentences, so a `Sentence` object is really a document and may consist of several sentences.
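
Before building the corpus, it can help to check which column positions hold the text and the label in your CSV. The sketch below writes a tiny file and inspects it with Python's `csv` module. The column names and rows are made up for illustration and will differ from the real dataset.

```python
import csv
import os
import tempfile

# Hypothetical rows; the real train.csv has its own columns and order.
rows = [
    ["id", "label", "text"],
    ["0", "rec.sport.hockey", "great game last night"],
    ["1", "talk.politics.guns", "the bill was debated again"],
]

data_dir = tempfile.mkdtemp()
train_path = os.path.join(data_dir, "train.csv")
with open(train_path, "w", newline="", encoding="utf-8") as handle:
    csv.writer(handle).writerows(rows)

with open(train_path, newline="", encoding="utf-8") as handle:
    reader = csv.reader(handle)
    header = next(reader)       # first line holds the column names
    first_row = next(reader)    # first data line

# Position of each column as seen by csv.reader (counting from 0)
positions = {name: index for index, name in enumerate(header)}
print(positions)
print(first_row[positions["text"]])
```

Printing the positions for your own file is a quick way to confirm the keys to use in a column map.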

In [12]:
from flair.data import Corpus
from flair.datasets import CSVClassificationCorpus

# this is the folder in which train, test and dev files reside
data_folder = 'data/'

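# column format indicating which columns hold the text and label(s).
# Flair uses these keys to index each parsed CSV row, so they are 0-based positions;
# an extra leading index column (e.g. one written by pandas) shifts them by one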
column_name_map = {5: "text", 4: "label"}

# load corpus containing training, test and dev data
corpus: Corpus = CSVClassificationCorpus(data_folder,
                                         column_name_map,
                                         skip_header=True,
                                         test_file='test.csv',
                                         dev_file='val.csv',
                                         train_file='train.csv')
    
label_dict = corpus.make_label_dictionary()

2019-10-25 06:39:43,110 Reading data from data
2019-10-25 06:39:43,111 Train: data\train.csv
2019-10-25 06:39:43,112 Dev: data\val.csv
2019-10-25 06:39:43,112 Test: data\test.csv
2019-10-25 06:39:43,125 Computing label dictionary. Progress:


100%|██████████████████████████████████████████████████████████████████████████████| 764/764 [00:00<00:00, 1152.60it/s]


2019-10-25 06:39:58,081 [b'talk.politics.guns', b'rec.sport.baseball', b'rec.sport.hockey', b'talk.politics.mideast']


In [13]:
corpus.train[0]

Sentence: "check again . you may find that the arrest warrant was issued after the first firefight . --" - 18 Tokens

### Training our own model
We're going to use an RNN to run through the contextual word embeddings we got from ELMo. We will use the hidden state at the end of the document as an embedding for the entire document. We will train the RNN on our labeled dataset, so that the final hidden state carries the most relevant information for our custom classification task. Note that the cell below still builds the classifier on GloVe vectors (see the TODO); to use ELMo, put `elmo_embedding` in the `word_embeddings` list instead.  

For more information on training your own model using Flair, see [this tutorial](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md).

In [14]:
#TODO: Change code below to fit our own dataset

In [15]:
from flair.data import Corpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 3. make a list of word embeddings
word_embeddings = [WordEmbeddings('glove')]

# 4. initialize document embedding by passing list of word embeddings
# Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
document_embeddings: DocumentRNNEmbeddings = DocumentRNNEmbeddings(word_embeddings,
                                                                     hidden_size=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=256,
                                                                     )

# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

In [None]:
# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier, corpus)

# 7. start the training
trainer.train('data/',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=150)

2019-10-25 06:40:32,949 ----------------------------------------------------------------------------------------------------
2019-10-25 06:40:32,950 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
    )
    (word_reprojection_map): Linear(in_features=100, out_features=256, bias=True)
    (rnn): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=512, out_features=4, bias=True)
  (loss_function): CrossEntropyLoss()
)"
2019-10-25 06:40:32,951 ----------------------------------------------------------------------------------------------------
2019-10-25 06:40:32,952 Corpus: "Corpus: 764 train + 164 dev + 164 test sentences"
2019-10-25 06:40:32,953 ----------------------------------------------------------------------------------------------------
2019-10-25 06:40:32,954 Parameters:
2019-10-25 06:40:32,955  - learning_rate: "0.1"
2019-10-

2019-10-25 06:41:18,046 ----------------------------------------------------------------------------------------------------
2019-10-25 06:41:29,540 epoch 2 - iter 0/24 - loss 1.38134038 - samples/sec: 89.03
2019-10-25 06:41:31,434 epoch 2 - iter 2/24 - loss 1.27515229 - samples/sec: 34.80
2019-10-25 06:41:33,158 epoch 2 - iter 4/24 - loss 1.24434588 - samples/sec: 38.15
2019-10-25 06:41:35,326 epoch 2 - iter 6/24 - loss 1.28255429 - samples/sec: 30.41
2019-10-25 06:41:37,470 epoch 2 - iter 8/24 - loss 1.29598440 - samples/sec: 30.82
2019-10-25 06:41:39,173 epoch 2 - iter 10/24 - loss 1.31206984 - samples/sec: 39.19
2019-10-25 06:41:40,820 epoch 2 - iter 12/24 - loss 1.30308888 - samples/sec: 40.35
2019-10-25 06:41:43,195 epoch 2 - iter 14/24 - loss 1.30053051 - samples/sec: 33.11
2019-10-25 06:41:45,092 epoch 2 - iter 16/24 - loss 1.29916115 - samples/sec: 34.79
2019-10-25 06:41:47,098 epoch 2 - iter 18/24 - loss 1.29539271 - samples/sec: 32.96
2019-10-25 06:41:48,824 epoch 2 - iter 2

2019-10-25 06:46:05,578 epoch 7 - iter 22/24 - loss 1.19991632 - samples/sec: 33.96
2019-10-25 06:46:06,903 ----------------------------------------------------------------------------------------------------
2019-10-25 06:46:06,905 EPOCH 7 done: loss 1.1955 - lr 0.1000
2019-10-25 06:46:18,289 DEV : loss 1.2214267253875732 - score 0.3537
2019-10-25 06:46:18,397 BAD EPOCHS (no improvement): 4
2019-10-25 06:46:18,400 ----------------------------------------------------------------------------------------------------
2019-10-25 06:46:32,221 epoch 8 - iter 0/24 - loss 1.11569440 - samples/sec: 86.55
2019-10-25 06:46:33,998 epoch 8 - iter 2/24 - loss 1.07375638 - samples/sec: 37.03
2019-10-25 06:46:35,886 epoch 8 - iter 4/24 - loss 1.16339598 - samples/sec: 34.99
2019-10-25 06:46:37,751 epoch 8 - iter 6/24 - loss 1.19701326 - samples/sec: 35.62
2019-10-25 06:46:39,480 epoch 8 - iter 8/24 - loss 1.17006695 - samples/sec: 38.20
2019-10-25 06:46:41,424 epoch 8 - iter 10/24 - loss 1.15266242 - 

2019-10-25 06:50:41,806 epoch 13 - iter 12/24 - loss 1.12940287 - samples/sec: 39.31
2019-10-25 06:50:43,675 epoch 13 - iter 14/24 - loss 1.11932804 - samples/sec: 35.45
2019-10-25 06:50:45,730 epoch 13 - iter 16/24 - loss 1.09389845 - samples/sec: 32.01
2019-10-25 06:50:47,497 epoch 13 - iter 18/24 - loss 1.11355928 - samples/sec: 37.72
2019-10-25 06:50:49,364 epoch 13 - iter 20/24 - loss 1.10975074 - samples/sec: 35.31
2019-10-25 06:50:50,897 epoch 13 - iter 22/24 - loss 1.10141043 - samples/sec: 43.20
2019-10-25 06:50:52,133 ----------------------------------------------------------------------------------------------------
2019-10-25 06:50:52,135 EPOCH 13 done: loss 1.0914 - lr 0.1000
2019-10-25 06:51:04,001 DEV : loss 1.09210205078125 - score 0.4634
2019-10-25 06:51:04,148 BAD EPOCHS (no improvement): 2
2019-10-25 06:51:04,151 ----------------------------------------------------------------------------------------------------
2019-10-25 06:51:15,745 epoch 14 - iter 0/24 - loss 0.9

2019-10-25 06:55:21,052 epoch 19 - iter 0/24 - loss 0.86436689 - samples/sec: 71.56
2019-10-25 06:55:22,854 epoch 19 - iter 2/24 - loss 0.82941147 - samples/sec: 37.04
2019-10-25 06:55:25,303 epoch 19 - iter 4/24 - loss 0.87983798 - samples/sec: 26.72
2019-10-25 06:55:27,475 epoch 19 - iter 6/24 - loss 0.90989699 - samples/sec: 30.51
2019-10-25 06:55:29,220 epoch 19 - iter 8/24 - loss 0.86983003 - samples/sec: 38.01
2019-10-25 06:55:30,989 epoch 19 - iter 10/24 - loss 0.85046833 - samples/sec: 37.23
2019-10-25 06:55:32,795 epoch 19 - iter 12/24 - loss 0.86228786 - samples/sec: 36.53
2019-10-25 06:55:34,756 epoch 19 - iter 14/24 - loss 0.86790332 - samples/sec: 34.55
2019-10-25 06:55:36,746 epoch 19 - iter 16/24 - loss 0.88605019 - samples/sec: 33.19
2019-10-25 06:55:38,837 epoch 19 - iter 18/24 - loss 0.89564519 - samples/sec: 31.57
2019-10-25 06:55:40,976 epoch 19 - iter 20/24 - loss 0.88883583 - samples/sec: 30.94
2019-10-25 06:55:43,470 epoch 19 - iter 22/24 - loss 0.87687095 - samp

2019-10-25 07:00:02,764 ----------------------------------------------------------------------------------------------------
2019-10-25 07:00:02,765 EPOCH 24 done: loss 0.8609 - lr 0.0500
2019-10-25 07:00:16,926 DEV : loss 1.1886277198791504 - score 0.4573
2019-10-25 07:00:17,052 BAD EPOCHS (no improvement): 4
2019-10-25 07:00:17,054 ----------------------------------------------------------------------------------------------------
2019-10-25 07:00:30,296 epoch 25 - iter 0/24 - loss 0.94394559 - samples/sec: 71.82
2019-10-25 07:00:32,452 epoch 25 - iter 2/24 - loss 0.85504725 - samples/sec: 30.78
2019-10-25 07:00:34,511 epoch 25 - iter 4/24 - loss 0.86950293 - samples/sec: 31.92
2019-10-25 07:00:36,385 epoch 25 - iter 6/24 - loss 0.83312696 - samples/sec: 35.53
2019-10-25 07:00:38,298 epoch 25 - iter 8/24 - loss 0.82939079 - samples/sec: 34.59
2019-10-25 07:00:40,451 epoch 25 - iter 10/24 - loss 0.78727212 - samples/sec: 30.52
2019-10-25 07:00:42,359 epoch 25 - iter 12/24 - loss 0.803

2019-10-25 07:04:58,626 epoch 30 - iter 12/24 - loss 0.88741197 - samples/sec: 28.94
2019-10-25 07:05:00,785 epoch 30 - iter 14/24 - loss 0.85185355 - samples/sec: 30.79
2019-10-25 07:05:02,956 epoch 30 - iter 16/24 - loss 0.84680405 - samples/sec: 30.14
2019-10-25 07:05:04,825 epoch 30 - iter 18/24 - loss 0.84991322 - samples/sec: 35.63
2019-10-25 07:05:07,172 epoch 30 - iter 20/24 - loss 0.83271367 - samples/sec: 27.96
2019-10-25 07:05:09,398 epoch 30 - iter 22/24 - loss 0.82647971 - samples/sec: 29.42
2019-10-25 07:05:10,547 ----------------------------------------------------------------------------------------------------
2019-10-25 07:05:10,548 EPOCH 30 done: loss 0.8452 - lr 0.0500
2019-10-25 07:05:23,707 DEV : loss 1.1903170347213745 - score 0.4573
2019-10-25 07:05:23,876 BAD EPOCHS (no improvement): 2
2019-10-25 07:05:23,879 ----------------------------------------------------------------------------------------------------
2019-10-25 07:05:37,868 epoch 31 - iter 0/24 - loss 1

2019-10-25 07:09:46,153 epoch 36 - iter 0/24 - loss 0.70948285 - samples/sec: 64.61
2019-10-25 07:09:48,258 epoch 36 - iter 2/24 - loss 0.73492930 - samples/sec: 31.68
2019-10-25 07:09:50,244 epoch 36 - iter 4/24 - loss 0.75701809 - samples/sec: 32.97
2019-10-25 07:09:52,093 epoch 36 - iter 6/24 - loss 0.71667857 - samples/sec: 35.81
2019-10-25 07:09:54,275 epoch 36 - iter 8/24 - loss 0.79711223 - samples/sec: 30.37
2019-10-25 07:09:56,328 epoch 36 - iter 10/24 - loss 0.79278141 - samples/sec: 31.96
2019-10-25 07:09:58,286 epoch 36 - iter 12/24 - loss 0.80666047 - samples/sec: 33.83
2019-10-25 07:10:00,111 epoch 36 - iter 14/24 - loss 0.79713227 - samples/sec: 36.43
2019-10-25 07:10:02,115 epoch 36 - iter 16/24 - loss 0.78026264 - samples/sec: 32.84
2019-10-25 07:10:04,118 epoch 36 - iter 18/24 - loss 0.77524734 - samples/sec: 32.87
2019-10-25 07:10:06,001 epoch 36 - iter 20/24 - loss 0.77600465 - samples/sec: 35.60
2019-10-25 07:10:07,753 epoch 36 - iter 22/24 - loss 0.78334551 - samp

2019-10-25 07:14:29,425 ----------------------------------------------------------------------------------------------------
2019-10-25 07:14:29,426 EPOCH 41 done: loss 0.7346 - lr 0.0500
2019-10-25 07:14:43,032 DEV : loss 1.0040267705917358 - score 0.5488
2019-10-25 07:14:43,160 BAD EPOCHS (no improvement): 1
2019-10-25 07:14:43,162 ----------------------------------------------------------------------------------------------------
2019-10-25 07:14:56,854 epoch 42 - iter 0/24 - loss 0.89064580 - samples/sec: 80.42
2019-10-25 07:14:58,664 epoch 42 - iter 2/24 - loss 0.86089611 - samples/sec: 36.57
2019-10-25 07:15:00,564 epoch 42 - iter 4/24 - loss 0.76156485 - samples/sec: 34.64
2019-10-25 07:15:02,579 epoch 42 - iter 6/24 - loss 0.76644603 - samples/sec: 32.74
2019-10-25 07:15:04,389 epoch 42 - iter 8/24 - loss 0.74369063 - samples/sec: 36.92
2019-10-25 07:15:06,559 epoch 42 - iter 10/24 - loss 0.76964226 - samples/sec: 30.20
2019-10-25 07:15:08,254 epoch 42 - iter 12/24 - loss 0.764

2019-10-25 07:18:58,276 epoch 47 - iter 12/24 - loss 0.70568141 - samples/sec: 39.51
2019-10-25 07:19:00,107 epoch 47 - iter 14/24 - loss 0.70317826 - samples/sec: 35.98
2019-10-25 07:19:01,809 epoch 47 - iter 16/24 - loss 0.69112670 - samples/sec: 39.02
2019-10-25 07:19:03,648 epoch 47 - iter 18/24 - loss 0.67279014 - samples/sec: 35.93
2019-10-25 07:19:05,795 epoch 47 - iter 20/24 - loss 0.66075538 - samples/sec: 30.62
2019-10-25 07:19:07,575 epoch 47 - iter 22/24 - loss 0.66707240 - samples/sec: 37.56
2019-10-25 07:19:08,736 ----------------------------------------------------------------------------------------------------
2019-10-25 07:19:08,737 EPOCH 47 done: loss 0.6761 - lr 0.0500
2019-10-25 07:19:20,288 DEV : loss 1.1154601573944092 - score 0.5427
2019-10-25 07:19:20,408 BAD EPOCHS (no improvement): 4
2019-10-25 07:19:20,410 ----------------------------------------------------------------------------------------------------
2019-10-25 07:19:32,086 epoch 48 - iter 0/24 - loss 1

2019-10-25 07:23:24,102 epoch 53 - iter 0/24 - loss 0.43452036 - samples/sec: 56.74
2019-10-25 07:23:26,304 epoch 53 - iter 2/24 - loss 0.51420826 - samples/sec: 29.85
2019-10-25 07:23:28,522 epoch 53 - iter 4/24 - loss 0.57091818 - samples/sec: 29.56
2019-10-25 07:23:30,633 epoch 53 - iter 6/24 - loss 0.54612073 - samples/sec: 31.56
2019-10-25 07:23:32,519 epoch 53 - iter 8/24 - loss 0.56167242 - samples/sec: 34.77
2019-10-25 07:23:34,302 epoch 53 - iter 10/24 - loss 0.58181300 - samples/sec: 36.72
2019-10-25 07:23:36,490 epoch 53 - iter 12/24 - loss 0.58255421 - samples/sec: 30.68
2019-10-25 07:23:38,437 epoch 53 - iter 14/24 - loss 0.58018672 - samples/sec: 33.69
2019-10-25 07:23:40,076 epoch 53 - iter 16/24 - loss 0.57383733 - samples/sec: 40.01
2019-10-25 07:23:41,808 epoch 53 - iter 18/24 - loss 0.56938091 - samples/sec: 38.56
2019-10-25 07:23:43,618 epoch 53 - iter 20/24 - loss 0.56099905 - samples/sec: 36.16
2019-10-25 07:23:45,563 epoch 53 - iter 22/24 - loss 0.55643257 - samp

2019-10-25 07:27:39,578 ----------------------------------------------------------------------------------------------------
2019-10-25 07:27:39,580 EPOCH 58 done: loss 0.5559 - lr 0.0250
2019-10-25 07:27:51,675 DEV : loss 0.8297656178474426 - score 0.622
2019-10-25 07:27:51,787 BAD EPOCHS (no improvement): 5
2019-10-25 07:27:51,788 ----------------------------------------------------------------------------------------------------
2019-10-25 07:28:03,734 epoch 59 - iter 0/24 - loss 0.43595353 - samples/sec: 74.36
2019-10-25 07:28:05,582 epoch 59 - iter 2/24 - loss 0.44809424 - samples/sec: 35.94
2019-10-25 07:28:07,382 epoch 59 - iter 4/24 - loss 0.44702618 - samples/sec: 36.57
2019-10-25 07:28:09,465 epoch 59 - iter 6/24 - loss 0.43417579 - samples/sec: 31.71
2019-10-25 07:28:11,277 epoch 59 - iter 8/24 - loss 0.46022122 - samples/sec: 36.59
2019-10-25 07:28:13,043 epoch 59 - iter 10/24 - loss 0.46588650 - samples/sec: 37.17
2019-10-25 07:28:15,104 epoch 59 - iter 12/24 - loss 0.4849

2019-10-25 07:32:06,694 epoch 64 - iter 12/24 - loss 0.50810600 - samples/sec: 33.00
2019-10-25 07:32:08,442 epoch 64 - iter 14/24 - loss 0.50678663 - samples/sec: 37.95
2019-10-25 07:32:10,323 epoch 64 - iter 16/24 - loss 0.48886657 - samples/sec: 35.03
2019-10-25 07:32:12,138 epoch 64 - iter 18/24 - loss 0.48734082 - samples/sec: 36.40
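
The `lr` values in the log drop from 0.1 to 0.05 and then to 0.025, which is the effect of `anneal_factor=0.5` together with `patience=5`. The following is a simplified sketch of plateau-based annealing in plain Python. It illustrates the general idea and is not Flair's actual trainer code, whose exact rules may differ; `anneal_on_plateau` is a hypothetical helper and the dev scores are made up.

```python
def anneal_on_plateau(dev_scores, learning_rate, anneal_factor, patience):
    """Return the learning rate in effect for each epoch, given the dev score after each epoch."""
    best_score = float("-inf")
    bad_epochs = 0
    history = []
    for score in dev_scores:
        history.append(learning_rate)   # rate used during this epoch
        if score > best_score:
            best_score = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Once more than `patience` epochs pass without improvement, shrink the rate.
        if bad_epochs > patience:
            learning_rate *= anneal_factor
            bad_epochs = 0
    return history

# Made-up dev scores: early improvement, then a plateau
scores = [0.30, 0.35, 0.34, 0.33, 0.35, 0.32, 0.31, 0.36]
rates = anneal_on_plateau(scores, learning_rate=0.1, anneal_factor=0.5, patience=2)
print(rates)
```

With a small `patience` the rate is halved after a short plateau; a larger value, such as the 5 used above, tolerates longer stretches of "bad epochs" before annealing.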


In [None]:
# 8. plot weight traces (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_weights('data/weights.txt')

## Contextual Word Vectors with BERT and Stacking Embeddings
Later we will use BERT, a state-of-the-art transformer model that was pre-trained on a very large corpus and can be fine-tuned for our own task. Flair can also derive contextual word embeddings from BERT and its successors. For adapting the weights of the BERT model itself to our task, however, we will use a different package, so that you also have sample code for that workflow.

Below, you can try BERT, RoBERTa, XLNet, or other models provided in Flair for contextual word embeddings.  
Flair also provides a simple way to stack vectors from different methods.

In [4]:
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init BERT
bert_embedding = BertEmbeddings('bert-base-uncased')


In [8]:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings
from flair.embeddings import StackedEmbeddings

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

In [10]:
sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token: 1 The
tensor([-1.4812e-07,  4.5007e-08,  6.0273e-07,  ...,  3.8287e-01,
         4.7210e-01,  2.9850e-01])
Token: 2 grass
tensor([ 1.6254e-04,  1.8764e-07, -7.9038e-09,  ...,  8.5283e-01,
        -5.0726e-02,  3.4476e-01])
Token: 3 is
tensor([-2.4521e-04,  3.4869e-07,  5.5841e-06,  ..., -1.8283e-01,
         7.1532e-01,  5.0841e-03])
Token: 4 green
tensor([8.3005e-05, 4.7261e-08, 5.7315e-07,  ..., 1.0157e+00, 7.5358e-01,
        1.1230e-01])
Token: 5 .
tensor([-8.3244e-07,  1.6451e-07, -1.7201e-08,  ..., -6.0930e-01,
         9.0591e-01,  1.7857e-01])
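
Stacking amounts to concatenating, for each token, the vectors produced by the individual embedding methods, so the stacked dimension is the sum of the parts. A toy illustration with made-up vectors (the names are hypothetical and no Flair objects are involved):

```python
def stack(*vectors):
    """Concatenate several vectors for the same token into one longer vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked

# Made-up vectors for one token from three different embedding methods
forward_vec = [0.1, 0.2]
backward_vec = [0.3, 0.4, 0.5]
bert_vec = [0.6]

token_vector = stack(forward_vec, backward_vec, bert_vec)
print(token_vector)       # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(len(token_vector))  # 6 = 2 + 3 + 1
```

Because the vector grows with every method added, stacked embeddings increase the input size of any classifier built on top of them.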


**Try it yourself:** Train a classifier using stacked embeddings of different models. Do you see an increase in performance?