# Using RoBERTa for text encoding

## Overview 


## References
- [Kaggle notebook on how to get embeddings from Roberta](https://www.kaggle.com/code/maostack/clrp-how-to-get-text-embedding-from-roberta)


In [1]:
from transformers import RobertaModel, RobertaConfig, RobertaTokenizer
from umap import UMAP 

# Display the value of every top-level expression in a cell, not only the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
config = RobertaConfig()

In [3]:
config

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [4]:
model = RobertaModel(config)

In [5]:
model

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768, padding_idx=1)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Drop

****

In [6]:
model02 = RobertaModel(config, add_pooling_layer=False)

In [7]:
model02

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768, padding_idx=1)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Drop

*****
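Both models built so far come from `RobertaModel(config)`, which initializes the weights randomly; only `from_pretrained` loads trained weights. The sketch below illustrates what `add_pooling_layer=False` changes. It uses a deliberately tiny configuration (`tiny_config`, with made-up sizes) and arbitrary token ids so that it runs quickly without downloading anything; none of these values come from the notebook.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Hypothetical tiny configuration, chosen only so the sketch runs fast.
tiny_config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)

# Both models start from randomly initialized weights.
with_pooler = RobertaModel(tiny_config)
without_pooler = RobertaModel(tiny_config, add_pooling_layer=False)


def count_parameters(module):
    """Total number of parameters in a torch module."""
    return sum(p.numel() for p in module.parameters())


# The only structural difference is the pooler: one dense layer (weights + bias).
pooler_params = count_parameters(with_pooler) - count_parameters(without_pooler)

# Arbitrary token ids for illustration (not produced by a real tokenizer).
input_ids = torch.tensor([[0, 10, 11, 12, 2]])
with torch.no_grad():
    out_with = with_pooler.eval()(input_ids=input_ids)
    out_without = without_pooler.eval()(input_ids=input_ids)

print(pooler_params)                 # hidden_size * hidden_size + hidden_size
print(out_with.pooler_output.shape)  # one pooled vector per sequence
print(out_without.pooler_output)     # no pooler, so nothing is returned here
```

If only `last_hidden_state` is needed, as in the rest of this notebook, leaving the pooler out saves a small number of parameters and avoids carrying an unused layer.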

In [8]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model03 = RobertaModel.from_pretrained('roberta-base')

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
text_corpus = ['one ring to rule them all', 'one ring to find them', 'and in the darkness bind them', 'where the shadows lie']
text_encoding = tokenizer(text_corpus, return_tensors='pt', padding=True)
text_encoding

{'input_ids': tensor([[    0,  1264,  3758,     7,  2178,   106,    70,     2],
        [    0,  1264,  3758,     7,   465,   106,     2,     1],
        [    0,   463,    11,     5, 15073, 23379,   106,     2],
        [    0,  8569,     5, 21841,  6105,     2,     1,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0]])}
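To make the relationship between `input_ids` and `attention_mask` concrete, the following sketch copies the two tensors from the output above and checks that the mask is 0 exactly where the padding id appears. The constant name `PAD_TOKEN_ID` is introduced here for illustration; its value matches `"pad_token_id": 1` in the printed config.

```python
import torch

# Values copied from the tokenizer output shown above.
input_ids = torch.tensor([
    [0, 1264, 3758, 7, 2178, 106, 70, 2],
    [0, 1264, 3758, 7, 465, 106, 2, 1],
    [0, 463, 11, 5, 15073, 23379, 106, 2],
    [0, 8569, 5, 21841, 6105, 2, 1, 1],
])
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
])

PAD_TOKEN_ID = 1  # illustrative constant; same value as pad_token_id in the config

# Number of real (non-padding) tokens in each text, including <s> and </s>.
real_token_counts = attention_mask.sum(dim=1).tolist()

# The mask should be 1 wherever the id is not the padding id.
mask_matches_padding = torch.equal(
    attention_mask, (input_ids != PAD_TOKEN_ID).long()
)

print(real_token_counts)
print(mask_matches_padding)
```

Every row starts with id 0 and its last real token is id 2, which correspond to `bos_token_id` and `eos_token_id` in the config.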

In [10]:
# Map each row of token ids back to its token strings.
for input_ids in text_encoding.input_ids:
    print(tokenizer.convert_ids_to_tokens(input_ids))

['<s>', 'one', 'Ġring', 'Ġto', 'Ġrule', 'Ġthem', 'Ġall', '</s>']

['<s>', 'one', 'Ġring', 'Ġto', 'Ġfind', 'Ġthem', '</s>', '<pad>']

['<s>', 'and', 'Ġin', 'Ġthe', 'Ġdarkness', 'Ġbind', 'Ġthem', '</s>']

['<s>', 'where', 'Ġthe', 'Ġshadows', 'Ġlie', '</s>', '<pad>', '<pad>']
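In these token lists, `Ġ` is how RoBERTa's byte-level BPE vocabulary marks a token that was preceded by a space, while `<s>`, `</s>` and `<pad>` are special tokens. The sketch below shows how the original text can be recovered from one of the lists. The helper `tokens_to_text` is hypothetical and written only for illustration; in practice the tokenizer's own `decode` method handles this.

```python
# One of the token lists printed above.
tokens = ['<s>', 'one', 'Ġring', 'Ġto', 'Ġfind', 'Ġthem', '</s>', '<pad>']

SPECIAL_TOKENS = {'<s>', '</s>', '<pad>'}


def tokens_to_text(tokens):
    """Illustrative helper: drop special tokens and turn 'Ġ' back into a space."""
    pieces = [token for token in tokens if token not in SPECIAL_TOKENS]
    return ''.join(pieces).replace('Ġ', ' ')


recovered = tokens_to_text(tokens)
print(recovered)
```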

In [11]:
embedding = model03(**text_encoding)

In [12]:
dir(embedding)

['__annotations__',
 '__class__',
 '__contains__',
 '__dataclass_fields__',
 '__dataclass_params__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__post_init__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'attentions',
 'clear',
 'copy',
 'cross_attentions',
 'fromkeys',
 'get',
 'hidden_states',
 'items',
 'keys',
 'last_hidden_state',
 'move_to_end',
 'past_key_values',
 'pooler_output',
 'pop',
 'popitem',
 'setdefault',
 'to_tuple',
 'update',
 'values']

In [13]:
embedding

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0382,  0.0822, -0.0111,  ..., -0.0629, -0.0729, -0.0390],
         [-0.1757, -0.4192,  0.1776,  ...,  0.3204, -0.0071, -0.1615],
         [ 0.1746,  0.0520,  0.0645,  ...,  0.2238, -0.1494, -0.0292],
         ...,
         [ 0.2547,  0.3574,  0.1010,  ...,  0.1376,  0.1597,  0.2394],
         [-0.0033,  0.0170,  0.0569,  ...,  0.2159, -0.2354,  0.1388],
         [-0.0305,  0.0831, -0.0361,  ..., -0.1029, -0.0824, -0.0772]],

        [[-0.0273,  0.0776, -0.0037,  ..., -0.0557, -0.0959, -0.0015],
         [-0.0379, -0.3423, -0.0414,  ...,  0.2339, -0.2113,  0.0935],
         [-0.0141,  0.0983,  0.0524,  ...,  0.1447, -0.2077,  0.2854],
         ...,
         [-0.0059,  0.0324,  0.0998,  ...,  0.1088, -0.0283,  0.1670],
         [-0.0215,  0.0783, -0.0231,  ..., -0.0902, -0.1017, -0.0278],
         [-0.0273,  0.0775, -0.0038,  ..., -0.0557, -0.0960, -0.0014]],

        [[-0.0280,  0.0751, -0.0374,  ..., -0.0854, -

In [14]:
last_hidden_state = embedding[0]
last_hidden_state.shape

torch.Size([4, 8, 768])

The shape reads as (texts, tokens, hidden size): there are 4 texts with 8 tokens each (after padding), and every token is embedded as a vector of length 768, the model's `hidden_size`.

In [15]:
last_hidden_state[0].shape

torch.Size([8, 768])

In [16]:
last_hidden_state[0][0].shape

torch.Size([768])

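## Grabbing the CLS token embedding from each text

RoBERTa's counterpart of BERT's `[CLS]` token is `<s>`, which the tokenizer places at position 0 of every sequence, so slicing index 0 along the token axis yields one 768-dimensional vector per text.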

In [17]:
cls_tokens = last_hidden_state[:, 0, :]
cls_tokens.shape

torch.Size([4, 768])
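Taking the first token is one way to get a single vector per text. A common alternative, shown here only as a sketch and not as part of the original notebook, is to average the token vectors while ignoring padding. To keep the example runnable without downloading the model, `last_hidden_state` is replaced by a random tensor of the same shape; the attention mask is copied from the tokenizer output earlier in the notebook.

```python
import torch

torch.manual_seed(0)

# Stand-in for the real model output: random values with the same shape.
last_hidden_state = torch.randn(4, 8, 768)

# Attention mask copied from the tokenizer output earlier in the notebook.
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
])

# First-token pooling, as in the cell above.
cls_embeddings = last_hidden_state[:, 0, :]

# Masked mean pooling: zero out padded positions, then divide by the
# number of real tokens in each text.
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (4, 8, 1)
summed = (last_hidden_state * mask).sum(dim=1)                   # (4, 768)
token_counts = mask.sum(dim=1).clamp(min=1)                      # (4, 1)
mean_embeddings = summed / token_counts

print(cls_embeddings.shape, mean_embeddings.shape)
```

With the real model, the same lines apply unchanged to `embedding.last_hidden_state` and `text_encoding.attention_mask`. Which pooling works better depends on the downstream task, so it is worth trying both.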