word or sentence embedding from BERT model #1950

Closed
ghost opened this issue Nov 26, 2019 · 34 comments

@ghost commented Nov 26, 2019

How can I extract embeddings for a sentence or a set of words directly from pre-trained models (standard BERT)? For example, I am using spaCy for this purpose at the moment, where I can do it as follows:

sentence vector:
sentence_vector = bert_model("This is an apple").vector

word_vectors:

words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]

I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).

@bilal2vec (Contributor)

You can use BertModel; it will return the hidden states for the input sentence.

@ghost (Author) commented Nov 26, 2019

Found it, thanks @bkkaggle. Just for others who are looking for the same information:

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

@maxzzze commented Nov 26, 2019

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] is the logits, and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states, which are located at outputs[1].

Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]
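
For reference, a minimal sketch of one way to set that flag when loading a ...ForSequenceClassification model (assuming bert-base-uncased; passing a modified config is just one option):

import torch
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
logits = outputs[0]         # (1, num_labels)
hidden_states = outputs[1]  # tuple: embedding output + one tensor per layer, each (1, seq_len, 768)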

@TheEdoardo93 commented Nov 26, 2019

By using this code, you can obtain a PyTorch tensor of shape (1, N, 768), where N is the number of tokens produced by BertTokenizer. If you want to build the sentence vector from these N token vectors, how do you do that? @engrsfi

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

@ghost (Author) commented Nov 26, 2019

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] are the logits and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states which are located at outputs[1].

Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I am interested in the last hidden states, which can be seen as a kind of embedding. I think you are referring to all hidden states, including the output of the embedding layer.

**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.

@ghost (Author) commented Nov 26, 2019

By using this code, you can obtain a PyTorch tensor of (1, N, 768) shape, where N is the number of different tokens extracted from BertTokenizer. If you want to build the sentence vector by exploiting these N tensors, how do you do that? @engrsfi

Found it, thanks @bkkaggle . Just for others who are looking for the same information.
Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

You can take an average of them. However, I think the embedding at the first position ([CLS]) is considered a kind of sentence vector, because that is the only one fed to a further classifier (if any) for downstream tasks. Disclaimer: I am not sure about this.
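
A minimal sketch of both options, continuing from the BertModel example above (whether averaging or [CLS] works better is left open here, not a recommendation):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
last_hidden_states = model(input_ids)[0]        # (1, seq_len, 768)

cls_vector = last_hidden_states[:, 0, :]        # (1, 768): embedding at the [CLS] position
mean_vector = last_hidden_states.mean(dim=1)    # (1, 768): average over all token positions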

@maxzzze commented Nov 26, 2019

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] are the logits and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states which are located at outputs[1].
Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I am interested in the last hidden states, which can be seen as a kind of embedding. I think you are referring to all hidden states, including the output of the embedding layer.

**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.

Should be as simple as grabbing the last element in the list:

last_layer = hidden_states[-1]

@ghost (Author) commented Nov 26, 2019

@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.
https://huggingface.co/transformers/_modules/transformers/modeling_bert.html#BertModel

Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BTW, for me, the shape of hidden_states in the code below is (batch_size, 768) when I set this flag to True; I am not sure if I can extract the last hidden states from that.

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

@maxzzze commented Nov 26, 2019

@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.

Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BTW, for me, the shape of hidden_states in the below code is (batch_size, 768) whereas it should be (batch_size, num_heads, sequence_length, sequence_length).

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I believe your comment is in reference to the standard models, but it's hard to tell without a link. Can you link to where in the documentation the pasted docstring is from?

I don't know if you saw my original comment, but I was providing an example of how to get hidden_states from the ...ForSequenceClassification models, not the standard ones. The ...ForSequenceClassification models do not output hidden_states by default: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

@ghost (Author) commented Nov 26, 2019

Sorry, I missed that part :) I am referring to the standard BertModel. Doc link:
https://huggingface.co/transformers/model_doc/bert.html#bertmodel

@maxzzze According to the documentation, one can get the last hidden states directly without setting this flag to True. See below.

Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

BTW, for me, the shape of hidden_states in the below code is (batch_size, 768) whereas it should be (batch_size, num_heads, sequence_length, sequence_length).

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

I believe your comment is in reference to the standard models, but its hard to tell without a link. Can you link where to where in the documentation the pasted doc string is from?

I dont know if you saw my original comment but I was providing an example for how to get hidden_states from the ..ForSequenceClassification models, not the standard ones. The ..ForSequenceClassification models do not output hidden_states by default: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

@TheEdoardo93

@engrsfi @maxzzze @bkkaggle
Please, look here. I hope it can help :)

@maxzzze commented Nov 26, 2019

@TheEdoardo93 is this example taking the first element in each of the hidden_states?

@bilal2vec (Contributor)

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.

Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.

If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

@maxzzze commented Nov 26, 2019

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.

Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.

If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Do you have any reference as to "people usually only take the hidden states of the [CLS] token of the last layer"?

@bilal2vec (Contributor)

Here are a few related links: 1, 2, 3

The [CLS] token isn't the only (or necessarily the best) way to fine-tune, but it is the easiest and is BERT's default.

@ghost (Author) commented Nov 27, 2019

There is some clarification about the use of the last hidden states in the BERT paper.
According to the paper, the last hidden state for [CLS] is mainly used for classification tasks, and the last hidden states for all tokens are used for token-level tasks such as sequence tagging or question answering.

From the paper:

At the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

Reference:
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805.pdf)

@alessiocancian

What about ALBERT? The output of the last hidden state isn't the same as the embedding, because in the paper they say that the embeddings have a size of 128 for every model (https://arxiv.org/pdf/1909.11942.pdf, page 6).
But I'm not sure if the 128-dimensional embedding referenced in the table is something used internally to represent words or the final word embedding.

@duan348733684

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

If the batch size is N, how do I convert this?

@ghost (Author) commented Jan 15, 2020

What about ALBERT? The output of the last hidden state isn't the same of the embedding because in the doc they say that the embedding have a size of 128 for every model (https://arxiv.org/pdf/1909.11942.pdf page 6).
But I'm not sure if the 128-embedding referenced in the table is something internally used to represent words or the final word embedding.

128 is used internally by ALBERT. The output of the model (the last hidden state) contains your actual word embeddings. To understand this better, you should read the following blog post from Google:
https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html

Quote:
"The key to optimizing performance, captured in the design of ALBERT, is to allocate the model’s capacity more efficiently. Input-level embeddings (words, sub-tokens, etc.) need to learn context-independent representations, a representation for the word “bank”, for example. In contrast, hidden-layer embeddings need to refine that into context-dependent representations, e.g., a representation for “bank” in the context of financial transactions, and a different representation for “bank” in the context of river-flow management.

This is achieved by factorization of the embedding parametrization — the embedding matrix is split between input-level embeddings with a relatively-low dimension (e.g., 128), while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). With this step alone, ALBERT achieves an 80% reduction in the parameters of the projection block, at the expense of only a minor drop in performance — 80.3 SQuAD2.0 score, down from 80.4; or 67.9 on RACE, down from 68.2 — with all other conditions the same as for BERT."
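
A minimal sketch illustrating this with albert-base-v2 (the checkpoint name is an assumption; the point is just the output shape):

import torch
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertModel.from_pretrained('albert-base-v2')

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
last_hidden_states = model(input_ids)[0]
print(last_hidden_states.shape)  # (1, seq_len, 768): the hidden size, not the 128-dim factorized input embedding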

@ghost (Author) commented Jan 15, 2020

Found it, thanks @bkkaggle . Just for others who are looking for the same information.
Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

if batch size is N, how to convert?

If I understand you correctly, you are asking how to get the last hidden states for all entries in a batch of size N. If that's the case, here is the explanation.

Your model expects input of the following shape:

(batch_size, sequence_length)

and returns last hidden states of the following shape:

(batch_size, sequence_length, hidden_size)

You can just go through the last hidden states to get the individual last hidden state for each input in the batch of size N.

Reference:
https://huggingface.co/transformers/model_doc/bert.html
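
A minimal sketch of that indexing for a small batch (the two example sentences are an assumption, chosen so they tokenize to the same length and no padding is needed):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor([tokenizer.encode("This is an apple"),
                          tokenizer.encode("He has a dog")])   # shape (N=2, sequence_length)
last_hidden_states = model(input_ids)[0]                       # shape (2, sequence_length, 768)
for i in range(last_hidden_states.size(0)):
    token_vectors = last_hidden_states[i]                      # (sequence_length, 768) for sentence i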

@sahand91

@engrsfi: What if I want to use the BERT embedding vector of each token as an input to an LSTM network? Can I get the embedding of each token of the sentence from the last hidden layer of the BERT model? In this case I think I can't just use the embedding for the [CLS] token, since I need the word embedding of each token.
I used the code below to get BERT's word embeddings for all tokens of my sentences. I padded all my sentences to a maximum length of 80 and also used an attention mask to ignore the padded elements. In this case the last_hidden_states element has shape (batch_size, 80, 768). However, when I inspect my embeddings, I see that the vectors for the padded elements are not all the same: I have a vector of size 768 for each token of the sentence (most of them padding tokens), but the vectors for the padded elements are not equal. Is that natural?

import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel

bert_model = TFBertModel.from_pretrained("bert-base-uncased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenized = x_train['token'].apply((lambda x: bert_tokenizer.encode(x, add_special_tokens=True, max_length=80)))
padded = np.array([i + [0]*(80-len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
input_ids = tf.constant(padded)
attention_mask = tf.constant(attention_mask)
output= bert_model(input_ids, attention_mask=attention_mask)
last_hidden_states=output[0]
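
One possible workaround (a sketch under the assumption that the goal is a fixed-size sentence vector; the model does produce non-zero, position-dependent vectors at padded positions, and the attention mask only stops them from influencing the other tokens) is to zero out the padded positions with the attention mask before averaging, continuing from the variables above:

# last_hidden_states: (batch_size, 80, 768), attention_mask: (batch_size, 80)
mask = tf.cast(attention_mask, tf.float32)[:, :, tf.newaxis]   # (batch_size, 80, 1)
summed = tf.reduce_sum(last_hidden_states * mask, axis=1)      # padded positions contribute nothing
n_real_tokens = tf.reduce_sum(mask, axis=1)                    # (batch_size, 1)
mean_pooled = summed / n_real_tokens                           # (batch_size, 768)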

@lisazhao9897

How can I extract embeddings for a sentence or a set of words directly from pre-trained models (Standard BERT)? For example, I am using Spacy for this purpose at the moment where I can do it as follows:

sentence vector:
sentence_vector = bert_model("This is an apple").vector

word_vectors:

words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]

I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).

Hi, could I ask how you would use Spacy to do this? Is there a link? Thanks a lot.

@ghost (Author) commented Mar 3, 2020

How can I extract embeddings for a sentence or a set of words directly from pre-trained models (Standard BERT)? For example, I am using Spacy for this purpose at the moment where I can do it as follows:
sentence vector:
sentence_vector = bert_model("This is an apple").vector
word_vectors:

words = bert_model("This is an apple")
word_vectors = [w.vector for w in words]

I am wondering if this is possible directly with huggingface pre-trained models (especially BERT).

Hi, could I ask how you would use Spacy to do this? Is there a link? Thanks a lot.

Here is the link:
https://spacy.io/usage/vectors-similarity

@sumitsidana

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.

Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.

If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:

cls_embeddings = embeddings_of_last_layer[0][0]? This is because embeddings_of_last_layer has dimensions 1 * #tokens * #hidden-units. Then, since [CLS] is the first token (and usually has 101 as its id), we want the embedding corresponding to just [CLS]. embeddings_of_last_layer[0] is of shape #tokens * #hidden-units and contains the embeddings of all the tokens.

@muhammadfahid51

@sahand91
sequence_output, pooled_output = bert_model(input_)
pooled_output.shape = (1, 768): one vector of 768 entries (represents the whole sentence)
sequence_output.shape = (batch_size, max_len, dim), e.g. (1, 256, 768) for bs = 1, n_tokens = 256
The sequence output gives a vector for each token of the sentence.

I have used the sequence output for classification tasks like sentiment analysis. As the documentation mentions, the pooled output is not a good representation of the whole sentence, so we use the sequence output and feed it further into a CNN or LSTM.

So I don't see any problem in using the sequence output for classification tasks, as we get to see the actual vector representation of a word, say "bank", in both contexts: "commercial" and "location" (the bank of a river).
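
A minimal sketch of that kind of pipeline in PyTorch (the LSTM size and the two-class head are arbitrary assumptions, not anything prescribed above):

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')
lstm = nn.LSTM(input_size=768, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, 2)             # e.g. binary sentiment

input_ids = torch.tensor(tokenizer.encode("This is an apple")).unsqueeze(0)
sequence_output = bert(input_ids)[0]       # (1, seq_len, 768): one vector per token
_, (h_n, _) = lstm(sequence_output)        # h_n: (1, 1, 128), final LSTM state
logits = classifier(h_n[-1])               # (1, 2)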

@giorgiomondauto

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.
Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.
If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:

cls_embeddings = embeddings_of_last_layer[0][0]? This is because embeddings_of_last_layer is of the dimension: 1*#tokens*#hidden-units. Then, since [CLS] is the first token (and usually have 101 as id), we want embedding corresponding to just [CLS]. embeddings_of_last_layer[0] is of shape #tokens*#hidden-units and contains embeddings of all the tokens.

Yes, I think the same. @sumitsidana
embeddings_of_last_layer[0][0].shape
Out[179]: torch.Size([144]) # where 144 in my case is the hidden_size

Can anyone confirm that embeddings_of_last_layer[0][0] is the embedding of the [CLS] token for each sequence?

@muhammadfahid51 commented Apr 24, 2020

@engrsfi You can process the hidden states of BERT (all layers or only the last layer) in whatever way you want.
Most people usually only take the hidden states of the [CLS] token of the last layer - using the hidden states for all tokens or from multiple layers doesn't usually help you that much.
If you want to get the embeddings for classification, just do something like:

input_sentence = torch.tensor(tokenizer.encode("[CLS] My sentence")).unsqueeze(0)
out = model(input_sentence)
embeddings_of_last_layer = out[0]
cls_embeddings = embeddings_of_last_layer[0]

Thank you for sharing the code. It really helped in understanding tokenization in BERT. I ran this and had a minor problem. Shouldn't it be:
cls_embeddings = embeddings_of_last_layer[0][0]? This is because embeddings_of_last_layer is of the dimension: 1*#tokens*#hidden-units. Then, since [CLS] is the first token (and usually have 101 as id), we want embedding corresponding to just [CLS]. embeddings_of_last_layer[0] is of shape #tokens*#hidden-units and contains embeddings of all the tokens.

Yes i think the same. @sumitsidana
embeddings_of_last_layer[0][0].shape
Out[179]: torch.Size([144]) # where 144 in my case is the hidden_size

Anyone confirming that embeddings_of_last_layer[0][0] is the embedding related to CLS token for each sequence?

Yes, it is, but only for the first sentence in the batch. You will have to go through all the batches and get the first element ([CLS]) for each sentence.
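
In practice no Python loop over sentences is needed; slicing handles the whole batch at once (a minimal sketch, assuming last_hidden_states is outputs[0] of BertModel with shape (batch_size, seq_len, hidden_size)):

# last_hidden_states: (batch_size, seq_len, hidden_size)
cls_embeddings = last_hidden_states[:, 0, :]   # (batch_size, hidden_size): one [CLS] vector per sentence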

@giorgiomondauto

Yes gotcha. Thanks

@leopardv10

This is a bit different for ...ForSequenceClassification models. I've found that the item at outputs[0] are the logits and the only way to get the hidden_states is to set config.output_hidden_states=True when initializing the model. Only then was I able to get the hidden_states which are located at outputs[1].

Example:

inputs = {
    "input_ids": batch[0],
    "attention_mask": batch[1]
}

output = bertmodel(**inputs)
logits = output[0]
hidden_states = output[1]

logits = output[0] means the word embeddings. So, does hidden_states = output[1] mean the sentence-level embedding?

@Saichethan

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

outputs[0] is the sentence embedding for "Hello, my dog is cute", right?
Then what is outputs[1]?

@steveguang

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

If I want to encode a list of strings,
input_ids = torch.tensor(tokenizer.encode(["Hello, my dog is cute", "how are you"])).unsqueeze(0)
does not really give me a 2*768 array. The only option seems to be
input_ids = [torch.tensor([tokenizer.encode(text) for text in ["Hello, my dog is cute", "how are you"]]).unsqueeze(0)]
Is there anything to make this faster?
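
A minimal sketch of batching with the tokenizer's built-in padding (batch_encode_plus; the exact padding argument name varies a bit across transformers versions):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

texts = ["Hello, my dog is cute", "how are you"]
encoded = tokenizer.batch_encode_plus(texts, pad_to_max_length=True, return_tensors='pt')
outputs = model(encoded['input_ids'], attention_mask=encoded['attention_mask'])
last_hidden_states = outputs[0]   # (2, max_len, 768): one row of token vectors per input string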

@stale bot commented Jul 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@cerofrais

Found it, thanks @bkkaggle . Just for others who are looking for the same information.

Using Pytorch:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using Tensorflow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

This is great. I am interested in how to get word vectors for out-of-vocabulary (OOV) tokens; any references would help, thanks.
For example, if I use this sentence: "This framework generates embeddings for each input sentence"
I get 11 tokens (+ start and end) when I have only 8 words; "embeddings" is out of vocabulary in my model.
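
Out-of-vocabulary words are split into WordPiece pieces (e.g. "embeddings" -> em, ##bed, ##ding, ##s), which is why 8 words become 11 tokens. One way to get one vector per word (an assumption about the desired output, not an official recipe) is to average the vectors of the pieces that make up each word:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentence = "This framework generates embeddings for each input sentence"
pieces = tokenizer.tokenize(sentence)                       # 11 WordPiece tokens
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
piece_vectors = model(input_ids)[0][0, 1:-1]                # drop [CLS]/[SEP]: (len(pieces), 768)

word_vectors, current = [], []
for piece, vec in zip(pieces, piece_vectors):
    if piece.startswith('##') and current:
        current.append(vec)                                 # continuation of the previous word
    else:
        if current:
            word_vectors.append(torch.stack(current).mean(dim=0))
        current = [vec]                                     # start a new word
if current:
    word_vectors.append(torch.stack(current).mean(dim=0))
# len(word_vectors) == 8: one 768-d vector per whitespace word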

@tvrbanec

@engrsfi

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple

It stops with an error at model = TFBertModel.from_pretrained('bert-base-uncased'):

model = TFBertModel.from_pretrained('bert-base-uncased')
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py", line 484, in from_pretrained
    model(model.dummy_inputs, training=False)  # build the network with dummy inputs
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 739, in call
    outputs = self.bert(inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 606, in call
    embedding_output = self.embeddings([input_ids, position_ids, token_type_ids, inputs_embeds], training=training)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 709, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1966, in _maybe_build
    self.build(input_shapes)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 146, in build
    initializer=get_initializer(self.initializer_range),
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 389, in add_weight
    aggregation=aggregation)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/tracking/base.py", line 713, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 154, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 260, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2502, in default_variable_creator
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 464, in __init__
    shape=shape)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 608, in _init_from_args
    initial_value() if init_from_fn else initial_value,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py", line 134, in <lambda>
    init_val = lambda: initializer(shape, dtype=dtype)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/init_ops_v2.py", line 341, in __call__
    dtype = _assert_float_dtype(dtype)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/init_ops_v2.py", line 769, in _assert_float_dtype
    raise ValueError("Expected floating point type, got %s." % dtype)
ValueError: Expected floating point type, got <dtype: 'int32'>.
