In [None]:
# all_slow

# Tutorial - Easy Embeddings
> Using EasyWord, Stacked, and Document Embeddings in the AdaptNLP framework

## Finding Available Models with Hubs

We can search for models to use with embeddings through the `HFModelHub` and `FlairModelHub` classes. Here is an example:

In [None]:
from adaptnlp import EasyWordEmbeddings, EasyStackedEmbeddings, EasyDocumentEmbeddings
from adaptnlp.embeddings import DetailLevel
from adaptnlp.model_hub import HFModelHub, FlairModelHub

In [None]:
hub = HFModelHub()
models = hub.search_model_by_name('gpt2'); models

[Model Name: distilgpt2, Tasks: [text-generation],
 Model Name: gpt2-large, Tasks: [text-generation],
 Model Name: gpt2-medium, Tasks: [text-generation],
 Model Name: gpt2-xl, Tasks: [text-generation],
 Model Name: gpt2, Tasks: [text-generation]]

For this tutorial we'll use the `gpt2` base model:

In [None]:
model = models[-1]; model

Model Name: gpt2, Tasks: [text-generation]
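
Indexing with `models[-1]` depends on the order the hub returns its results in. A minimal sketch of picking a result by exact name instead is shown below. The `ModelInfo` class and the `pick_by_name` helper are hypothetical stand-ins written for this illustration, not part of AdaptNLP, and the sketch assumes each search result exposes its name as a `name` attribute.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelInfo:
    """Hypothetical stand-in for a hub search result (not an AdaptNLP class)."""
    name: str
    tasks: List[str] = field(default_factory=list)


def pick_by_name(results, name):
    """Return the first result whose name matches exactly, or raise LookupError."""
    for result in results:
        if result.name == name:
            return result
    raise LookupError(f"no model named {name!r} in the search results")


# Mirrors the search output shown above.
results = [
    ModelInfo("distilgpt2", ["text-generation"]),
    ModelInfo("gpt2-large", ["text-generation"]),
    ModelInfo("gpt2-medium", ["text-generation"]),
    ModelInfo("gpt2-xl", ["text-generation"]),
    ModelInfo("gpt2", ["text-generation"]),
]

chosen = pick_by_name(results, "gpt2")
print(chosen.name)
```

Matching on the name keeps the selection stable even if the hub later returns the same models in a different order.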

## Producing Embeddings using `EasyWordEmbeddings`

First we'll use some basic example text:

In [None]:
example_text = "This is Albert.  My last name is Einstein.  I like physics and atoms."

And then instantiate our embeddings tagger:

In [None]:
embeddings = EasyWordEmbeddings()

Now let's run our `gpt2` model we grabbed earlier to generate some `EmbeddingResult` objects:

In [None]:
res = embeddings.embed_text(example_text, model_name_or_path=model)

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.4.attn.masked_bias', 'h.8.attn.masked_bias', 'h.3.attn.masked_bias', 'h.6.attn.masked_bias', 'h.2.attn.masked_bias', 'h.0.attn.masked_bias', 'h.9.attn.masked_bias', 'h.7.attn.masked_bias', 'h.5.attn.masked_bias', 'h.11.attn.masked_bias', 'h.1.attn.masked_bias', 'h.10.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


These result objects contain all the data you might need, including the original text and the embeddings at both the token and sentence level. They also offer a convenience `to_dict` method that lets us specify the level of detail we want:

In [None]:
res

[OrderedDict([('inputs',
               'This is Albert.  My last name is Einstein.  I like physics and atoms.'),
              ('sentence_embeddings', tensor([])),
              ('token_embeddings',
               tensor([[ 1.2373, -0.8330,  0.5278,  ..., -1.0304, -0.7106,  0.7515],
                       [ 0.2954,  0.1950,  1.1355,  ...,  1.2045,  1.9508,  0.8828],
                       [-3.9810, -0.5063, -2.2954,  ..., -1.1560, -0.2203,  1.6024],
                       ...,
                       [ 4.3351,  1.2776, -1.1377,  ...,  0.1885, -0.0354,  0.4981],
                       [ 0.2538,  3.0154, -1.8020,  ..., -0.3545,  0.2821,  2.3718],
                       [ 1.0933,  0.3256,  1.6138,  ..., -1.0720, -1.3188,  0.3358]],
                      device='cuda:0'))])]

To grab the sentence or token embeddings, look them up by key:

> Note: Only `StackedEmbeddings` will have sentence embeddings

In [None]:
res[0]['token_embeddings']

tensor([[ 1.2373, -0.8330,  0.5278,  ..., -1.0304, -0.7106,  0.7515],
        [ 0.2954,  0.1950,  1.1355,  ...,  1.2045,  1.9508,  0.8828],
        [-3.9810, -0.5063, -2.2954,  ..., -1.1560, -0.2203,  1.6024],
        ...,
        [ 4.3351,  1.2776, -1.1377,  ...,  0.1885, -0.0354,  0.4981],
        [ 0.2538,  3.0154, -1.8020,  ..., -0.3545,  0.2821,  2.3718],
        [ 1.0933,  0.3256,  1.6138,  ..., -1.0720, -1.3188,  0.3358]],
       device='cuda:0')

Switching to a different model is easy. Let's try BERT embeddings with the `bert-base-cased` model instead.

Rather than passing in an `HFModelResult` or `FlairModelResult`, we can pass the model's name as a plain string:

In [None]:
res = embeddings.embed_text(example_text, model_name_or_path='bert-base-cased')

Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Just like in the last example, we can look at the embeddings in the same way:

In [None]:
res[0]['token_embeddings']

tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2279,  0.0967,  0.3887],
        [-0.1609, -0.2013,  0.9313,  ...,  0.0570,  0.6361,  0.6626],
        [-0.0846, -0.2399,  0.2524,  ..., -0.2886, -0.2588,  0.3147],
        ...,
        [ 0.2307,  0.0850, -0.3529,  ..., -0.1745,  0.5396, -0.1455],
        [-0.3223,  0.3806, -0.7739,  ..., -0.1101, -0.3259,  0.0197],
        [ 0.2760, -0.0849, -0.0120,  ..., -0.1703, -0.1642,  0.2458]],
       device='cuda:0')

We can also convert our output to an easy-to-use dictionary that carries a bit more information. First, let's leave the results unfiltered by passing `detail_level=None`:

In [None]:
res = embeddings.embed_text(example_text,
                            model_name_or_path='bert-base-cased',
                            detail_level=None)

Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
res

[EmbeddingResult: {
 	Inputs: This is Albert.  My last name is Einstein.  I like physics and atoms.
 	Token Embeddings Shape: torch.Size([16, 768])
 	Sentence Embeddings Shape: torch.Size([0])
 }]

We can see that the result is now an `EmbeddingResult`, which exposes everything we previously looked up by key as attributes:

In [None]:
res[0].inputs

'This is Albert.  My last name is Einstein.  I like physics and atoms.'

If we want to filter the object ourselves and convert it to a dictionary, we can use the `to_dict()` function:

In [None]:
res[0].to_dict()

OrderedDict([('inputs',
              'This is Albert.  My last name is Einstein.  I like physics and atoms.'),
             ('sentence_embeddings', tensor([])),
             ('token_embeddings',
              tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2279,  0.0967,  0.3887],
                      [-0.1609, -0.2013,  0.9313,  ...,  0.0570,  0.6361,  0.6626],
                      [-0.0846, -0.2399,  0.2524,  ..., -0.2886, -0.2588,  0.3147],
                      ...,
                      [ 0.2307,  0.0850, -0.3529,  ..., -0.1745,  0.5396, -0.1455],
                      [-0.3223,  0.3806, -0.7739,  ..., -0.1101, -0.3259,  0.0197],
                      [ 0.2760, -0.0849, -0.0120,  ..., -0.1703, -0.1642,  0.2458]],
                     device='cuda:0'))])

You can specify the level of detail by passing "low", "medium", or "high" to the `to_dict` method, or by using the convenience `DetailLevel` class:

In [None]:
res_dict = res[0].to_dict(DetailLevel.Medium)

In [None]:
res_dict

OrderedDict([('inputs',
              'This is Albert.  My last name is Einstein.  I like physics and atoms.'),
             ('sentence_embeddings', tensor([])),
             ('token_embeddings',
              tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2279,  0.0967,  0.3887],
                      [-0.1609, -0.2013,  0.9313,  ...,  0.0570,  0.6361,  0.6626],
                      [-0.0846, -0.2399,  0.2524,  ..., -0.2886, -0.2588,  0.3147],
                      ...,
                      [ 0.2307,  0.0850, -0.3529,  ..., -0.1745,  0.5396, -0.1455],
                      [-0.3223,  0.3806, -0.7739,  ..., -0.1101, -0.3259,  0.0197],
                      [ 0.2760, -0.0849, -0.0120,  ..., -0.1703, -0.1642,  0.2458]],
                     device='cuda:0')),
             ('This',
              {'embeddings': tensor([-6.7946e-01, -2.0409e-01,  1.0153e+00,  3.6316e-01, -1.4399e+00,
                        9.0887e-02, -2.4070e-01,  6.3402e-01,  4.7817e-01, -9.0081e-01,
                    

Each level returns more data from the outputs:
- Available at all levels:
  - `original_sentence`: The original sentence
  - `tokenized_sentence`: The tokenized sentence
  - `sentence_embeddings`: Embeddings from the actual sentence (if available)
  - `token_embeddings`: Concatenated embeddings from all the tokens passed
- `DetailLevel.Low` (or 'low'):
  - Returns information available at all levels
- `DetailLevel.Medium` (or 'medium'):
  - Everything from `DetailLevel.Low`
  - For each token a dictionary of the embeddings and word index is added
- `DetailLevel.High` (or 'high'):
  - Everything from `DetailLevel.Medium`
  - This will also include the original Flair `Sentence` result from the model
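
The way each level builds on the previous one can be sketched in plain Python. This is only an illustration of the layering described in the list above, not AdaptNLP's implementation: the function `to_dict_sketch`, the `word_idx` key, and the `sentence_object` key are assumptions made for the example, and plain lists stand in for tensors.

```python
from collections import OrderedDict

LEVELS = ("low", "medium", "high")


def to_dict_sketch(inputs, tokens, token_embeddings, sentence=None, detail_level="low"):
    """Build a result dictionary whose contents grow with the detail level."""
    if detail_level not in LEVELS:
        raise ValueError(f"detail_level must be one of {LEVELS}")

    # Available at all levels.
    out = OrderedDict()
    out["inputs"] = inputs
    out["token_embeddings"] = token_embeddings

    # "medium" and "high": one entry per token with its vector and position.
    # Note: a repeated token would overwrite its earlier entry in this sketch.
    if detail_level in ("medium", "high"):
        for index, (token, vector) in enumerate(zip(tokens, token_embeddings)):
            out[token] = {"embeddings": vector, "word_idx": index}

    # "high": also keep the underlying sentence object (a placeholder here).
    if detail_level == "high":
        out["sentence_object"] = sentence

    return out


tokens = ["This", "is", "Albert"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

low = to_dict_sketch("This is Albert", tokens, vectors, detail_level="low")
medium = to_dict_sketch("This is Albert", tokens, vectors, detail_level="medium")
high = to_dict_sketch("This is Albert", tokens, vectors, sentence="<Sentence>", detail_level="high")

print(list(low))
print(list(medium))
print(list(high))
```

Each level returns a superset of the keys of the level below it, which is why the `Medium` output above starts with the same entries as the unfiltered dictionary before listing individual tokens.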

Let's look at a final example with RoBERTa embeddings:

In [None]:
res = embeddings.embed_text(example_text, model_name_or_path="roberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


And our generated embeddings:

In [None]:
#hide_input
print(f'Original text: {res[0]["inputs"]}')
print(f'Model: roberta-base')
print(f'Embedding: {res[0]["token_embeddings"]}')

Original text: This is Albert.  My last name is Einstein.  I like physics and atoms.
Model: roberta-base
Embedding: tensor([[ 0.0752,  0.6170,  0.4389,  ...,  0.2334, -0.2718,  0.0739],
        [ 0.1961,  0.6164,  0.1019,  ..., -0.3410, -0.2461, -0.0403],
        [ 0.1772,  0.0369, -0.0483,  ...,  0.3179,  0.1806,  0.1607],
        ...,
        [ 0.1566,  0.3472, -0.0110,  ...,  0.0318,  0.3524, -0.3428],
        [ 0.0555,  0.2878,  0.1732,  ..., -0.1009,  0.2014, -0.4145],
        [ 0.1859, -0.9742, -0.0155,  ..., -0.0734, -0.0804, -0.1212]],
       device='cuda:0')


## Producing Stacked Embeddings with `EasyStackedEmbeddings`

`EasyStackedEmbeddings` lets you combine any number of language models to produce embeddings like those shown above. For our example we'll combine the `bert-base-cased` and `distilbert-base-cased` models.

First we'll instantiate our `EasyStackedEmbeddings`:

In [None]:
embeddings = EasyStackedEmbeddings("bert-base-cased", "distilbert-base-cased")

May need a couple moments to instantiate...


Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Some weights of the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing DistilBertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


And then generate our stacked word embeddings through our `embed_text` function:

In [None]:
res = embeddings.embed_text(example_text)

We can see our results below:

In [None]:
#hide_input
print(f'Original text: {res[0]["inputs"]}')
print('Models: bert-base-cased, distilbert-base-cased')
print(f'Embedding: {res[0]["token_embeddings"]}')

Original text: This is Albert.  My last name is Einstein.  I like physics and atoms.
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2426, -0.2324,  0.3107],
        [-0.1609, -0.2013,  0.9313,  ..., -0.0443,  0.6380,  0.7524],
        [-0.0846, -0.2399,  0.2524,  ...,  0.0154, -0.5154, -0.1708],
        ...,
        [ 0.2307,  0.0850, -0.3529,  ..., -0.6223,  0.1720, -0.2028],
        [-0.3223,  0.3806, -0.7739,  ...,  0.2957, -0.2913,  0.2791],
        [ 0.2760, -0.0849, -0.0120,  ..., -0.2799, -0.2166, -0.1328]],
       device='cuda:0')
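
Stacking in Flair concatenates, for each token, the vectors produced by the individual models. That matches the output above, where each row begins with the same values as the `bert-base-cased` embeddings shown earlier and ends with different ones. A minimal pure-Python sketch of the idea follows, with plain lists standing in for tensors. The helper name `stack_token_embeddings` and the toy values are illustrative assumptions, not library code.

```python
def stack_token_embeddings(*per_model):
    """Concatenate, token by token, the vectors produced by several models.

    Each argument is a list of token vectors from one model; every model
    must have embedded the same number of tokens.
    """
    token_counts = {len(model_vectors) for model_vectors in per_model}
    if len(token_counts) != 1:
        raise ValueError("all models must embed the same number of tokens")
    return [
        [value for vector in vectors_for_token for value in vector]
        for vectors_for_token in zip(*per_model)
    ]


# Two tokens embedded by two toy "models" of width 2 and 3.
model_a = [[1.0, 2.0], [3.0, 4.0]]
model_b = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

stacked = stack_token_embeddings(model_a, model_b)
print(stacked)
print(len(stacked[0]))  # width is the sum of the individual widths
```

The width of a stacked embedding is the sum of the widths of its parts, so adding more models makes every token vector longer.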


## Document Embeddings with `EasyDocumentEmbeddings`

Similar to the `EasyStackedEmbeddings`, `EasyDocumentEmbeddings` allows you to pool the embeddings from multiple models together with `embed_pool` and `embed_rnn`.

We'll use our `bert-base-cased` and `distilbert-base-cased` models again:

In [None]:
embeddings = EasyDocumentEmbeddings("bert-base-cased", "distilbert-base-cased")

May need a couple moments to instantiate...


Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing DistilBertModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTrainin

Pooled embedding loaded
RNN embeddings loaded


This time we will use the `embed_pool` method to generate `DocumentPoolEmbeddings`. These do an average over all the word embeddings in a sentence:

In [None]:
res = embeddings.embed_pool(example_text)

As a result, rather than having embeddings by token, we have embeddings *by document*:

In [None]:
res

[OrderedDict([('inputs',
               'This is Albert.  My last name is Einstein.  I like physics and atoms.'),
              ('sentence_embeddings',
               tensor([-0.2397,  0.2154,  0.1053,  ...,  0.0500,  0.0791,  0.2998],
                      device='cuda:0', grad_fn=<CatBackward>)),
              ('token_embeddings',
               tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2426, -0.2324,  0.3107],
                       [-0.1609, -0.2013,  0.9313,  ..., -0.0443,  0.6380,  0.7524],
                       [-0.0846, -0.2399,  0.2524,  ...,  0.0154, -0.5154, -0.1708],
                       ...,
                       [ 0.2307,  0.0850, -0.3529,  ..., -0.6223,  0.1720, -0.2028],
                       [-0.3223,  0.3806, -0.7739,  ...,  0.2957, -0.2913,  0.2791],
                       [ 0.2760, -0.0849, -0.0120,  ..., -0.2799, -0.2166, -0.1328]],
                      device='cuda:0'))])]

In [None]:
#hide_input
print(f'Original text: {res[0]["inputs"]}')
print('Models: bert-base-cased, distilbert-base-cased')
print(f'Embedding: {res[0]["token_embeddings"]}')

Original text: This is Albert.  My last name is Einstein.  I like physics and atoms.
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2426, -0.2324,  0.3107],
        [-0.1609, -0.2013,  0.9313,  ..., -0.0443,  0.6380,  0.7524],
        [-0.0846, -0.2399,  0.2524,  ...,  0.0154, -0.5154, -0.1708],
        ...,
        [ 0.2307,  0.0850, -0.3529,  ..., -0.6223,  0.1720, -0.2028],
        [-0.3223,  0.3806, -0.7739,  ...,  0.2957, -0.2913,  0.2791],
        [ 0.2760, -0.0849, -0.0120,  ..., -0.2799, -0.2166, -0.1328]],
       device='cuda:0')
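
The averaging that pooled document embeddings perform can be sketched in a few lines. This is a simplified illustration with plain lists in place of tensors. The `mean_pool` helper is a name chosen for this example, and the library's pooling may offer other modes beyond the mean.

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one document vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    width = len(token_embeddings[0])
    return [
        sum(vector[i] for vector in token_embeddings) / count
        for i in range(width)
    ]


# Three toy token vectors of width 2.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

document_vector = mean_pool(token_vectors)
print(document_vector)  # [3.0, 4.0]
```

However many tokens go in, a single vector of the same width comes out, which is why `sentence_embeddings` above is one-dimensional while `token_embeddings` has one row per token.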


We can also generate `DocumentRNNEmbeddings`. These run an RNN over all the words in the sentence and use the RNN's final state as the embedding.

First we'll call `embed_rnn`:

In [None]:
res = embeddings.embed_rnn(example_text)

And then look at our generated embeddings:

In [None]:
#hide_input
print(f'Original text: {res[0]["inputs"]}')
print('Models: bert-base-cased, distilbert-base-cased')
print(f'Embedding: {res[0]["token_embeddings"]}')

Original text: This is Albert.  My last name is Einstein.  I like physics and atoms.
Models: bert-base-cased, distilbert-base-cased
Embedding: tensor([[-0.6795, -0.2041,  1.0153,  ...,  0.2426, -0.2324,  0.3107],
        [-0.1609, -0.2013,  0.9313,  ..., -0.0443,  0.6380,  0.7524],
        [-0.0846, -0.2399,  0.2524,  ...,  0.0154, -0.5154, -0.1708],
        ...,
        [ 0.2307,  0.0850, -0.3529,  ..., -0.6223,  0.1720, -0.2028],
        [-0.3223,  0.3806, -0.7739,  ...,  0.2957, -0.2913,  0.2791],
        [ 0.2760, -0.0849, -0.0120,  ..., -0.2799, -0.2166, -0.1328]],
       device='cuda:0')
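
To make "use the final state of the RNN as the embedding" concrete, here is a toy sketch of a plain tanh RNN written with Python lists. The weights are arbitrary values chosen for illustration, the helper `rnn_final_state` is not a library function, and the recurrent cell the library actually uses may be of a different type (and is trainable rather than fixed).

```python
import math


def rnn_final_state(token_embeddings, w_in, w_rec, bias):
    """Run a simple tanh RNN over the tokens and return its last hidden state."""
    hidden = [0.0] * len(bias)
    for x in token_embeddings:
        hidden = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden


# Arbitrary illustrative weights: input width 2, hidden width 2.
w_in = [[0.5, -0.5], [0.25, 0.75]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]
bias = [0.0, 0.0]

token_vectors = [[1.0, 0.0], [0.0, 1.0]]

forward = rnn_final_state(token_vectors, w_in, w_rec, bias)
backward = rnn_final_state(token_vectors[::-1], w_in, w_rec, bias)
print(forward)
print(backward)
```

Unlike mean pooling, the final state depends on the order of the tokens, so reversing the input produces a different document vector.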
