
How to enable tokenizer padding option in feature extraction pipeline? #9671

Closed
bowang-rw-02 opened this issue Jan 19, 2021 · 5 comments

@bowang-rw-02

I am trying to use the pipeline() to extract features for sentence tokens.
Because my sentences are not all the same length, and I am going to feed the token features to RNN-based models, I want to pad the sentences to a fixed length so that the features have the same size.
Before learning about the convenient pipeline() method, I used a more general approach to get the features. It works fine but is inconvenient, like this:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.'

encoded_input = tokenizer(text, padding='max_length', truncation=True, max_length=40)
indexed_tokens = encoded_input['input_ids']
segments_ids = encoded_input['token_type_ids']

tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

with torch.no_grad():
    outputs = model(tokens_tensor, segments_tensors)
    hidden_states = outputs[2]  # tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden]

Then I still need to merge (or select) the features from the returned hidden_states myself... and finally get the [40, 768] padded features for this sentence's tokens that I want. However, as you can see, it is very inconvenient.
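For anyone wondering what that merging step looks like, it is essentially just indexing into the returned tuple; here is a shape-only sketch with plain Python lists standing in for the tensors (the toy sizes below are made up for illustration):

```python
# Shape-only sketch of the "merge or select" step. Nested lists stand in for
# the tuple of tensors that output_hidden_states=True returns.
# Toy sizes (hypothetical): 3 encoder layers, batch 1, seq_len 5, hidden 4.
num_layers, seq_len, hidden = 3, 5, 4
hidden_states = tuple(
    [[[float(layer)] * hidden for _ in range(seq_len)]]  # one [batch=1, seq, hidden] "tensor"
    for layer in range(num_layers + 1)                   # embedding output + each encoder layer
)

# Selecting the last layer and dropping the batch dimension gives [seq_len, hidden],
# the per-token feature matrix (e.g. [40, 768] for the padded sentence above).
token_features = hidden_states[-1][0]
assert len(token_features) == seq_len
assert len(token_features[0]) == hidden
```

With real model outputs the same selection would be `outputs[2][-1][0]`; other merging strategies (summing or concatenating the last few layers) index the same tuple.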
By comparison, the pipeline method works very well and easily, needing only a few lines:

from transformers import AutoModel, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
nlp = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text)

Then I can directly get the token features for the original-length sentence, which has shape [22, 768].

However, how can I enable the tokenizer's padding option in the pipeline?
From #9432 and #9576 I learned that truncation options can now be passed to the pipeline object (here called nlp), so I imitated them and wrote this code:

text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
features = nlp(text, padding='max_length', truncation=True, max_length=40)

The program did not throw an error, but it just returned a [512, 768] vector...
So is there any way to correctly enable the padding options? Thank you!

@LysandreJik
Member

Hi! I think you're looking for padding="longest"?

@LysandreJik
Member

Your result is of length 512 because you asked for padding="max_length", and the tokenizer's max length is 512. If you ask for "longest", it will pad up to the longest value in your batch:

>>> text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
... features = nlp([text, text * 2], padding="longest", truncation=True, max_length=40)

returns features which are of size [42, 768].
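To make the two strategies concrete, here is a minimal plain-Python sketch of what they do to a toy batch of token-id lists (the pad id 0 below is an assumption for illustration; it happens to match BERT's [PAD] token id):

```python
def pad_batch(batch, strategy, max_length=None, pad_id=0):
    """Minimal illustration of the tokenizer's three padding strategies."""
    if strategy == "longest":
        target = max(len(ids) for ids in batch)  # pad up to the longest sample
    elif strategy == "max_length":
        target = max_length                      # pad up to a fixed length
    else:                                        # "do_not_pad"
        return batch
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
print(pad_batch(batch, "longest"))        # both rows now length 5
print(pad_batch(batch, "max_length", 8))  # both rows now length 8
```

This is why a single sentence passed with padding="max_length" comes back at the tokenizer's full 512: with no max_length override, the fixed target is the model maximum.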

@bowang-rw-02
Author

bowang-rw-02 commented Jan 19, 2021


Thank you very much! This method works, and I think the 'longest' padding strategy is enough for my dataset.
But I wonder: can I specify a fixed padding size, so that all sentences are padded to, say, length 40?
In my former 'inconvenient general method', I just use

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = 'After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.'

encoded_input = tokenizer(text, padding='max_length', truncation=True, max_length=40)

and get fixed-size padded sentences that way...
(I found this method in the official documentation: https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)
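The truncate-then-pad behaviour that padding='max_length' with truncation=True applies to a single sequence can be sketched in plain Python (pad id 0 is again an assumption for illustration, matching BERT's [PAD] token):

```python
def pad_to_fixed(ids, max_length=40, pad_id=0):
    """Sketch of what padding='max_length' + truncation=True does to one sequence."""
    ids = ids[:max_length]                           # truncate anything too long
    return ids + [pad_id] * (max_length - len(ids))  # pad anything too short

assert len(pad_to_fixed(list(range(22)))) == 40   # short input padded up to 40
assert len(pad_to_fixed(list(range(100)))) == 40  # long input truncated down to 40
```

So the direct tokenizer call above already gives the fixed [40, 768] shape; the open question in this thread is only whether those same tokenizer arguments are honoured when forwarded through the pipeline.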

@bowang-rw-02
Author

Well, it seems impossible for now... I just tried

text = "After stealing money from the bank vault, the bank robber was seen " \
       "fishing on the Mississippi river bank."
features = nlp(text, padding='length', truncation=True, length=40)

and the error message showed:
ValueError: 'length' is not a valid PaddingStrategy, please select one of ['longest', 'max_length', 'do_not_pad']
Anyway, thank you very much!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
