word or sentence embedding from BERT model #1950

Closed
ghost opened this issue Nov 26, 2019 · 34 comments

@ghost commented Nov 26, 2019

How can I extract embeddings for a sentence or a set of words directly from pre-trained models (standard BERT)? For example, I am using spaCy for this purpose at the moment, where I can do it as follows:

sentence vector:
sentence_vector = bert_model("This is an apple").vector

word_vectors:

words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]

I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).

@bilal2vec (Contributor)

You can use BertModel; it will return the hidden states for the input sentence.

@ghost (Author) commented Nov 26, 2019

Found it, thanks @bkkaggle. Just for others who are looking for the same information:

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

@maxzzze commented Nov 26, 2019

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] is the logits, and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states, which are located at outputs[1].

Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]
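
For reference, a minimal sketch of one way to set that flag when loading a ...ForSequenceClassification model (assuming bert-base-uncased; passing a modified config is just one option):

import torch
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]         # (1, num_labels)
hidden_states = outputs[1]  # tuple: embedding output + one tensor per layer, each (1, seq_len, 768)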

@TheEdoardo93 commented Nov 26, 2019

By using this code, you can obtain a PyTorch tensor of shape (1, N, 768), where N is the number of tokens produced by BertTokenizer. If you want to build the sentence vector from these N token vectors, how do you do that? @engrsfi

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

@ghost (Author) commented Nov 26, 2019

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] are the logits and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states which are located at outputs[1].

Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I am interested in the last hidden states, which can be seen as a kind of embedding. I think you are referring to all hidden states, including the output of the embedding layer.

**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.

@ghost (Author) commented Nov 26, 2019

By using this code, you can obtain a PyTorch tensor of (1, N, 768) shape, where N is the number of different tokens extracted from BertTokenizer. If you want to build the sentence vector by exploiting these N tensors, how do you do that? @engrsfi

Found it, thanks @bkkaggle . Just for others who are looking for the same information.
Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

You can take an average of them. However, I think the embedding at the first position ([CLS]) is considered a kind of sentence vector, because that is the only one fed to a further classifier (if any) for downstream tasks. Disclaimer: I am not sure about this.
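
A minimal sketch of both options, continuing from the BertModel example above (whether averaging or [CLS] works better is left open here, not a recommendation):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
last_hidden_states = model(input_ids)[0]        # (1, seq_len, 768)

cls_vector = last_hidden_states[:, 0, :]        # (1, 768): embedding at the [CLS] position
mean_vector = last_hidden_states.mean(dim=1)    # (1, 768): average over all token positions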

@maxzzze commented Nov 26, 2019

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] are the logits and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states which are located at outputs[1].
Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I am interested in the last hidden states, which can be seen as a kind of embedding. I think you are referring to all hidden states, including the output of the embedding layer.

**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.

Should be as simple as grabbing the last element in the list:

last_layer = hidden_states[-1]

@ghost (Author) commented Nov 26, 2019

@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.
https://huggingface.co/transformers/_modules/transformers/modeling_bert.html#BertModel

Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BTW, for me, the shape of hidden_states in the code below is (batch_size, 768) when I set this flag to True; I am not sure if I can extract the last hidden states from that.

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

@maxzzze commented Nov 26, 2019

@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.

Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BTW, for me, the shape of hidden_states in the below code is (batch_size, 768) whereas it should be (batch_size, num_heads, sequence_length, sequence_length).

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I believe your comment is in reference to the standard models, but it's hard to tell without a link. Can you link to where in the documentation the pasted docstring is from?

I don't know if you saw my original comment, but I was providing an example of how to get hidden_states from the ...ForSequenceClassification models, not the standard ones. The ...ForSequenceClassification models do not output hidden_states by default: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

@ghost (Author) commented Nov 26, 2019

Sorry, I missed that part :) I am referring to the standard BertModel. Doc link:
https://huggingface.co/transformers/model_doc/bert.html#bertmodel

@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.

Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BTW, for me, the shape of hidden_states in the below code is (batch_size, 768) whereas it should be (batch_size, num_heads, sequence_length, sequence_length).

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I believe your comment is in reference to the standard models, but its hard to tell without a link. Can you link where to where in the documentation the pasted doc string is from?

I dont know if you saw my original comment but I was providing an example for how to get hidden_states from the ..ForSequenceClassification models, not the standard ones. The ..ForSequenceClassification models do not output hidden_states by default: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

@TheEdoardo93

@engrsfi @maxzzze @bkkaggle
Please, look here. I hope it can help :)

@maxzzze commented Nov 26, 2019

@TheEdoardo93 is this example taking the first element in each of the hidden_states?

@bilal2vec (Contributor)

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.

Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.

If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

@maxzzze commented Nov 26, 2019

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.

Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.

If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Do you have any reference as to "people usually only take the hidden states of the [CLS] token of the last layer"?

@bilal2vec (Contributor)

Here are a few related links: 1, 2, 3

The [CLS] token isn't the only (or necessarily the best) way to fine-tune, but it is the easiest and is BERT's default.

@ghost (Author) commented Nov 27, 2019

There is some clarification about the use of the last hidden states in the BERT paper.
According to the paper, the last hidden state for [CLS] is mainly used for classification tasks, and the last hidden states for all tokens are used for token-level tasks such as sequence tagging or question answering.

From the paper:

At the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

Reference:
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805.pdf)

@alessiocancian

What about ALBERT? The output of the last hidden state isn't the same as the embedding, because in the paper they say that the embeddings have a size of 128 for every model (https://arxiv.org/pdf/1909.11942.pdf, page 6).
But I'm not sure if the 128-dimensional embedding referenced in the table is something used internally to represent words or the final word embedding.

@duan348733684

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

If the batch size is N, how do I convert this?

@ghost (Author) commented Jan 15, 2020

What about ALBERT? The output of the last hidden state isn't the same of the embedding because in the doc they say that the embedding have a size of 128 for every model (https://arxiv.org/pdf/1909.11942.pdf page 6).
But I'm not sure if the 128-embedding referenced in the table is something internally used to represent words or the final word embedding.

128 is used internally by ALBERT. The output of the model (the last hidden state) contains your actual word embeddings. To understand this better, you should read the following blog post from Google:
https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html

Quote:
"The key to optimizing performance, captured in the design of ALBERT, is to allocate the model’s capacity more efficiently. Input-level embeddings (words, sub-tokens, etc.) need to learn context-independent representations, a representation for the word “bank”, for example. In contrast, hidden-layer embeddings need to refine that into context-dependent representations, e.g., a representation for “bank” in the context of financial transactions, and a different representation for “bank” in the context of river-flow management.

This is achieved by factorization of the embedding parametrization — the embedding matrix is split between input-level embeddings with a relatively-low dimension (e.g., 128), while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). With this step alone, ALBERT achieves an 80% reduction in the parameters of the projection block, at the expense of only a minor drop in performance — 80.3 SQuAD2.0 score, down from 80.4; or 67.9 on RACE, down from 68.2 — with all other conditions the same as for BERT."
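
A minimal sketch illustrating this with albert-base-v2 (the checkpoint name is an assumption; the point is just the output shape):

import torch
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertModel.from_pretrained('albert-base-v2')

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
last_hidden_states = model(input_ids)[0]
print(last_hidden_states.shape)  # (1, seq_len, 768): the hidden size, not the 128-dim factorized input embedding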

@ghost (Author) commented Jan 15, 2020

Found it, thanks @bkkaggle . Just for others who are looking for the same information.
Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

if batch size is N, how to convert?

If I understand you correctly, you are asking how to get the last hidden states for all entries in a batch of size N. If that's the case, here is the explanation.

Your model expects input of the following shape:

(batch_size, sequence_length)

and returns last hidden states of the following shape:

(batch_size, sequence_length, hidden_size)

You can just go through the last hidden states to get the individual last hidden state for each input in the batch of size N.

Reference:
https://huggingface.co/transformers/model_doc/bert.html
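
A minimal sketch of that indexing for a small batch (the two example sentences are an assumption, chosen so they tokenize to the same length and no padding is needed):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor([tokenizer.encode("This is an apple"),
                          tokenizer.encode("He has a dog")])   # shape (N=2, sequence_length)
last_hidden_states = model(input_ids)[0]                       # shape (2, sequence_length, 768)
for i in range(last_hidden_states.size(0)):
    token_vectors = last_hidden_states[i]                      # (sequence_length, 768) for sentence i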

@sahand91

@engrsfi: What if I want to use the BERT embedding vector of each token as an input to an LSTM network? Can I get the embedding of each token of the sentence from the last hidden layer of the BERT model? In this case I think I can't just use the embedding for the [CLS] token, since I need the word embedding of each token.
I used the code below to get BERT's word embeddings for all tokens of my sentences. I padded all my sentences to a maximum length of 80 and also used an attention mask to ignore the padded elements. In this case the last_hidden_states element has shape (batch_size, 80, 768). However, when I inspect my embeddings, I see that the vectors for the padded elements are not all the same: I have a vector of size 768 for each token of the sentence (most of them padding tokens), but the vectors for the padded elements are not equal. Is that natural?

import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel

bert_model = TFBertModel.from_pretrained("bert-base-uncased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenized = x_train['token'].apply((lambda x: bert_tokenizer.encode(x, add_special_tokens=True, max_length=80)))
padded = np.array([i + [0]*(80-len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
input_ids = tf.constant(padded)
attention_mask = tf.constant(attention_mask)
output= bert_model(input_ids, attention_mask=attention_mask)
last_hidden_states=output[0]
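
One possible workaround (a sketch under the assumption that the goal is a fixed-size sentence vector; the model does produce non-zero, position-dependent vectors at padded positions, and the attention mask only stops them from influencing the other tokens) is to zero out the padded positions with the attention mask before averaging, continuing from the variables above:

# last_hidden_states: (batch_size, 80, 768), attention_mask: (batch_size, 80)
mask = tf.cast(attention_mask, tf.float32)[:, :, tf.newaxis]   # (batch_size, 80, 1)
summed = tf.reduce_sum(last_hidden_states * mask, axis=1)      # padded positions contribute nothing
n_real_tokens = tf.reduce_sum(mask, axis=1)                    # (batch_size, 1)
mean_pooled = summed / n_real_tokens                           # (batch_size, 768)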

@lisazhao9897

How can I extract embeddings for a sentence or a set of words directly from pre-trained models (Standard BERT)? For example, I am using Spacy for this purpose at the moment where I can do it as follows:

sentence vector:
sentence_vector = bert_model("This is an apple").vector

word_vectors:

words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]

I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).

Hi, could I ask how you would use Spacy to do this? Is there a link? Thanks a lot.

@ghost (Author) commented Mar 3, 2020

How can I extract embeddings for a sentence or a set of words directly from pre-trained models (Standard BERT)? For example, I am using Spacy for this purpose at the moment where I can do it as follows:
sentence vector:
sentence_vector = bert_model("This is an apple").vector
word_vectors:

words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]

I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).

Hi, could I ask how you would use Spacy to do this? Is there a link? Thanks a lot.

Here is the link:
https://spacy.io/usage/vectors-similarity

@sumitsidana

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.

Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.

If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:

cls_embeddings = embeddings_of_last_layer[0][0]? This is because embeddings_of_last_layer has dimensions 1 * #tokens * #hidden-units. Then, since [CLS] is the first token (and usually has 101 as its id), we want the embedding corresponding to just [CLS]. embeddings_of_last_layer[0] is of shape #tokens * #hidden-units and contains the embeddings of all the tokens.

@muhammadfahid51

@sahand91
sequence_output, pooled_output = bert_model(input_)
pooled_output.shape = (1, 768): one vector of 768 entries (represents the whole sentence)
sequence_output.shape = (batch_size, max_len, dim), e.g. (1, 256, 768) for bs = 1, n_tokens = 256
The sequence output gives a vector for each token of the sentence.

I have used the sequence output for classification tasks like sentiment analysis. As the documentation mentions, the pooled output is not a good representation of the whole sentence, so we use the sequence output and feed it further into a CNN or LSTM.

So I don't see any problem in using the sequence output for classification tasks, as we get to see the actual vector representation of a word, say "bank", in both contexts: "commercial" and "location" (the bank of a river).
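
A minimal sketch of that kind of pipeline in PyTorch (the LSTM size and the two-class head are arbitrary assumptions, not anything prescribed above):

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')
lstm = nn.LSTM(input_size=768, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, 2)             # e.g. binary sentiment

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
sequence_output = bert(input_ids)[0]       # (1, seq_len, 768): one vector per token
_, (h_n, _) = lstm(sequence_output)        # h_n: (1, 1, 128), final LSTM state
logits = classifier(h_n[-1])               # (1, 2)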

@giorgiomondauto

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.
Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.
If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:

cls_embeddings = embeddings_of_last_layer[0][0]? This is because embeddings_of_last_layer is of the dimension: 1*#tokens*#hidden-units. Then, since [CLS] is the first token (and usually have 101 as id), we want embedding corresponding to just [CLS]. embeddings_of_last_layer[0] is of shape #tokens*#hidden-units and contains embeddings of all the tokens.

Yes, I think the same. @sumitsidana
embeddings_of_last_layer[0][0].shape
Out[179]: torch.Size([144]) # where 144 in my case is the hidden_size

Can anyone confirm that embeddings_of_last_layer[0][0] is the embedding of the [CLS] token for each sequence?

@muhammadfahid51 commented Apr 24, 2020

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.
Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.
If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:
cls_embeddings = embeddings_of_last_layer[0][0]? This is because embeddings_of_last_layer is of the dimension: 1*#tokens*#hidden-units. Then, since [CLS] is the first token (and usually have 101 as id), we want embedding corresponding to just [CLS]. embeddings_of_last_layer[0] is of shape #tokens*#hidden-units and contains embeddings of all the tokens.

Yes i think the same. @sumitsidana
embeddings_of_last_layer[0][0].shape
Out[179]: torch.Size([144]) # where 144 in my case is the hidden_size

Anyone confirming that embeddings_of_last_layer[0][0] is the embedding related to CLS token for each sequence?

Yes, it is, but only for the first sentence in the batch. You will have to go through all the batches and get the first element ([CLS]) for each sentence.
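
In practice no Python loop over sentences is needed; slicing handles the whole batch at once (a minimal sketch, assuming last_hidden_states is outputs[0] of BertModel with shape (batch_size, seq_len, hidden_size)):

# last_hidden_states: (batch_size, seq_len, hidden_size)
cls_embeddings = last_hidden_states[:, 0, :]   # (batch_size, hidden_size): one [CLS] vector per sentence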

@giorgiomondauto

Yes gotcha. Thanks

@leopardv10

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] are the logits and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states which are located at outputs[1].

Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

logits = output[0] means the word embeddings. So, does hidden_states = output[1] mean the sentence-level embedding?

@Saichethan

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

outputs[0] is the sentence embedding for "Hello, my dog is cute", right?
Then what is outputs[1]?

@steveguang

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

If I want to encode a list of strings,
input_ids = torch.tensor(tokenizer.encode(["Hello, my dog is cute", "how are you"])).unsqueeze(0)
does not really give me a 2*768 array. The only option seems to be
input_ids = [torch.tensor([tokenizer.encode(text) for text in ["Hello, my dog is cute", "how are you"]]).unsqueeze(0)]
Is there anything to make this faster?
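
A minimal sketch of batching with the tokenizer's built-in padding (batch_encode_plus; the exact padding argument name varies a bit across transformers versions):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

texts = ["Hello, my dog is cute", "how are you"]
encoded = tokenizer.batch_encode_plus(texts, pad_to_max_length=True, return_tensors='pt')
outputs = model(encoded['input_ids'], attention_mask=encoded['attention_mask'])
last_hidden_states = outputs[0]   # (2, max_len, 768): one row of token vectors per input string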

@stale bot commented Jul 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@cerofrais

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

This is great. I am interested in how to get word vectors for out-of-vocabulary (OOV) tokens; any references would help, thanks.
For example, if I use this sentence: "This framework generates embeddings for each input sentence"
I get 11 tokens (+ start and end) when I have only 8 words; "embeddings" is out of vocabulary in my model.
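
Out-of-vocabulary words are split into WordPiece pieces (e.g. "embeddings" -> em, ##bed, ##ding, ##s), which is why 8 words become 11 tokens. One way to get one vector per word (an assumption about the desired output, not an official recipe) is to average the vectors of the pieces that make up each word:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentence = "This framework generates embeddings for each input sentence"
pieces = tokenizer.tokenize(sentence)                       # 11 WordPiece tokens
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
piece_vectors = model(input_ids)[0][0, 1:-1]                # drop [CLS]/[SEP]: (len(pieces), 768)

word_vectors, current = [], []
for piece, vec in zip(pieces, piece_vectors):
    if piece.startswith('##') and current:
        current.append(vec)                                 # continuation of the previous word
    else:
        if current:
            word_vectors.append(torch.stack(current).mean(dim=0))
        current = [vec]                                     # start a new word
if current:
    word_vectors.append(torch.stack(current).mean(dim=0))
# len(word_vectors) == 8: one 768-d vector per whitespace word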

@tvrbanec

@engrsfi

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple

It stops with an error at model = TFBertModel.from_pretrained('bert-base-uncased'):

model = TFBertModel.from_pretrained('bert-base-uncased')
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py", line 484, in from_pretrained
    model(model.dummy_inputs, training=False)  # build the network with dummy inputs
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 739, in call
    outputs = self.bert(inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 606, in call
    embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 709, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1966, in _maybe_build
    self.build(input_shapes)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 146, in build
    initializer=get_initializer(self.initializer_range),
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 389, in add_weight
    aggregation=aggregation)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/tracking/base.py", line 713, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 154, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 260, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2502, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 464, in __init__
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 608, in _init_from_args
    initial_value() if init_from_fn else initial_value,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 134, in <lambda>
    init_val = lambda: initializer(shape, dtype=dtype)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/init_ops_v2.py", line 341, in __call__
    dtype = _assert_float_dtype(dtype)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/init_ops_v2.py", line 769, in _assert_float_dtype
    raise ValueError("Expected floating point type, got %s." % dtype)
ValueError: Expected floating point type, got <dtype: 'int32'>.
