# Using RoBERTa for text encoding

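## Overview

This notebook uses the Hugging Face `transformers` library to turn text into RoBERTa embeddings. It first builds a model from a default `RobertaConfig` (with and without the pooling layer), then loads the pretrained `roberta-base` checkpoint, tokenizes a short text, and inspects the last hidden state.
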
## References
- [Kaggle notebook on how to get embeddings from RoBERTa](https://www.kaggle.com/code/maostack/clrp-how-to-get-text-embedding-from-roberta)


In [1]:
from transformers import RobertaModel, RobertaConfig, RobertaTokenizer

# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Default configuration object; no pretrained checkpoint is involved here
config = RobertaConfig()

In [3]:
config

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [4]:
# Built from a config only, so the weights are randomly initialized, not pretrained
model = RobertaModel(config)

In [5]:
model

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768, padding_idx=1)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Drop

****
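
Because `RobertaModel(config)` builds the architecture from the configuration alone, its weights are random rather than pretrained. The sketch below illustrates this under stated assumptions: the tiny configuration values and the names `tiny_config`, `model_a`, and `model_b` are hypothetical, chosen only so the example runs quickly, and are not part of the original notebook.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Hypothetical tiny configuration (assumption: chosen for speed, not from the notebook)
tiny_config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)

# Two models built from the same config receive independent random weights
model_a = RobertaModel(tiny_config)
model_b = RobertaModel(tiny_config)

weights_a = model_a.embeddings.word_embeddings.weight
weights_b = model_b.embeddings.word_embeddings.weight

same_shape = weights_a.shape == weights_b.shape
same_values = torch.equal(weights_a, weights_b)

# Total number of parameters in one of the models
num_params = sum(p.numel() for p in model_a.parameters())

print(same_shape, same_values, num_params)
```

If the two weight matrices share a shape but not their values, the models are structurally identical yet untrained, which is why meaningful embeddings need a checkpoint loaded with `from_pretrained`.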

In [6]:
# Same architecture, but without the pooling layer on top of the encoder
model02 = RobertaModel(config, add_pooling_layer=False)

In [7]:
model02

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768, padding_idx=1)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Drop

*****
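
The two printouts are cut off before the point where the models differ, so they look identical. With `add_pooling_layer=False` the model has no pooler module, and its output carries no pooled vector. A minimal sketch of how one might check this, assuming a hypothetical tiny configuration and made-up token ids (neither comes from the original notebook):

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Hypothetical tiny configuration (assumption: chosen for speed, not from the notebook)
tiny_config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)

with_pooler = RobertaModel(tiny_config).eval()
without_pooler = RobertaModel(tiny_config, add_pooling_layer=False).eval()

# Made-up ids for illustration; 0 and 2 are the default <s> and </s> ids
dummy_ids = torch.tensor([[0, 5, 6, 2]])

with torch.no_grad():
    out_with = with_pooler(input_ids=dummy_ids)
    out_without = without_pooler(input_ids=dummy_ids)

# The pooler attribute is a module in one model and None in the other
print(type(with_pooler.pooler).__name__, without_pooler.pooler)

# Both models return per-token hidden states; only one returns a pooled vector
print(out_with.last_hidden_state.shape, out_without.last_hidden_state.shape)
print(out_without.pooler_output is None)
```

When only the per-token hidden states are needed, dropping the pooler avoids carrying an unused layer.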

In [8]:
# Load the pretrained tokenizer and encoder weights for roberta-base
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model03 = RobertaModel.from_pretrained('roberta-base')

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# Despite its name, this holds both 'input_ids' and 'attention_mask' as PyTorch tensors
input_ids = tokenizer('hello there', return_tensors='pt')
input_ids

{'input_ids': tensor([[    0, 42891,    89,     2]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

In [10]:
embedding = model03(**input_ids)

In [11]:
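# The first element of the model output is the last hidden state
# (also available as embedding.last_hidden_state)
last_hidden_state = embedding[0]
last_hidden_state.shape
# Vector for the first token of the first (and only) text
last_hidden_state[0][0]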

torch.Size([1, 4, 768])

tensor([-9.5116e-02,  9.5225e-02, -3.7027e-03, -1.3327e-01,  5.8054e-02,
        -1.1925e-01, -8.9735e-03,  1.6545e-02,  9.5150e-02, -5.9935e-02,
        -5.0207e-03,  8.6815e-02,  6.2271e-02, -1.7962e-02,  9.0222e-02,
         1.9137e-02, -7.1721e-02, -3.1796e-02, -1.2938e-02, -4.5020e-02,
        -1.1467e-01,  5.5105e-02, -4.4007e-02,  9.4992e-02, -2.4072e-03,
         3.4001e-02,  8.8103e-02,  5.9016e-02, -8.6964e-02, -2.9370e-02,
        -3.1404e-02, -2.3733e-02,  5.7615e-02, -3.3627e-02,  1.3552e-02,
         6.4657e-02,  1.3288e-02,  1.2101e-03, -9.0675e-02,  2.8775e-02,
        -1.6822e-02,  8.8934e-02,  1.7067e-02,  1.0417e-02,  6.1437e-02,
         4.5098e-02,  8.5016e-03,  3.4433e-02, -2.6296e-02,  1.2864e-02,
         2.1041e-02,  9.1921e-02, -5.4271e-02,  1.3900e-02, -8.9060e-02,
         3.0105e-02, -7.5359e-03,  1.0409e-01,  5.5841e-02, -5.0041e-02,
         2.3604e-02, -1.3245e-01, -1.0336e-01, -3.6074e-02,  3.3968e-02,
        -9.1029e-03, -3.4693e-02, -1.5416e-03,  1.8

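The shape `[1, 4, 768]` reads as (batch size, sequence length, hidden size): there is 1 text with 4 tokens, each represented by a vector of length 768. The 4 tokens are the start token `<s>` (id 0), the two word tokens, and the end token `</s>` (id 2), so `last_hidden_state[0][0]` is the vector for `<s>`.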
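
To get a single vector per text from the per-token vectors, two common options are to take the first token's vector or to average over the tokens while ignoring padding. The sketch below shows both on stand-in tensors with the same shapes as above. The helper `mean_pool` and the random `hidden` tensor are assumptions made for illustration; mean pooling is one common choice, not necessarily what the referenced notebook uses.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per text, ignoring positions where the mask is 0."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp to avoid dividing by zero for a fully masked row
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Stand-in tensors with the shapes seen above (assumption: random values, not model output)
hidden = torch.randn(1, 4, 768)
mask = torch.ones(1, 4, dtype=torch.long)

# Option 1: vector of the first token of each text
first_token_vector = hidden[:, 0, :]

# Option 2: mask-aware mean over all tokens
sentence_vector = mean_pool(hidden, mask)

print(first_token_vector.shape, sentence_vector.shape)
```

With the real model, `hidden` would be `last_hidden_state` and `mask` the `attention_mask` returned by the tokenizer. The mask matters once several texts of different lengths are padded into one batch.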
