
Adding support for truncation parameter on feature-extraction pipeline. #14193

Merged: 3 commits merged into huggingface:master on Nov 3, 2021

Conversation

@Narsil (Contributor) commented Oct 28, 2021

Fixes #14183

What does this PR do?


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ioana-blue commented Oct 28, 2021

Can you also include padding? Since we're extracting features, I'd like to be able to specify both padding and truncation strategies. Thanks! @Narsil

@Narsil (Contributor, Author) commented Oct 28, 2021

Padding for pipelines is something I would like to keep orthogonal to business logic (see #13724).

That issue is more about batching than padding, but I imagine you pad mostly for batching.
Currently, pipelines never batch, meaning the padding is not used. Would adding padding change the results of feature-extraction?
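For reference, a minimal sketch (not part of this PR; the checkpoint name is just an example) of the case where padding actually matters: batching several inputs yourself with the lower-level tokenizer/model API.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Minimal sketch: padding only comes into play when several inputs of
# different lengths are batched together. "bert-base-uncased" is just an
# example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence.",
    "A much longer sentence that determines the padded length of the whole batch.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, padded_seq_len, hidden_size)
```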

@ioana-blue

Only slightly, I think. Right now you get embeddings of varying size depending on the size of your input sequence. If you want to somehow use these embeddings in a downstream task, it's weird to have varying sizes. In my case, I think I'm only going to use the embedding corresponding to the CLS token, so I'm good.
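As an illustration, a minimal sketch (assuming the nested-list output the feature-extraction pipeline returns, with a leading batch dimension) of keeping only the first-token ([CLS]) embedding to get a fixed-size vector:

```python
from transformers import pipeline

# Minimal sketch: the feature-extraction pipeline returns one embedding per
# token, so the output length varies with the input. Keeping only the first
# ([CLS]) token gives a fixed-size vector.
extractor = pipeline("feature-extraction", model="bert-base-uncased")

features = extractor("An example sentence of arbitrary length.")
# features is a nested list of shape (1, seq_len, hidden_size); the exact
# nesting may vary by transformers version.
cls_embedding = features[0][0]
print(len(cls_embedding))  # hidden_size, e.g. 768 for bert-base
```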

@ioana-blue

@Narsil Note that padding is supported in other pipelines. I think that, as a user, it's maddening to have varied behavior depending on which pipeline you use; this lack of consistency across pipelines is problematic. Take a look at this pipeline code, and note there is logic around padding:

@Narsil (Contributor, Author) commented Oct 28, 2021

> it's maddening to have varied behavior depending on which pipeline you use.

You're 100% correct; that's part of the reason for the large rewrite that is happening.

For instance, the rewrite enables you to write either pipeline(..., truncation=True) or pipe = pipeline(...); pipe(..., truncation=True).
And that works for all pipelines and all parameters. This was far from the case before.
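A minimal sketch of the two call styles (assuming this PR is merged, so that feature-extraction accepts truncation; the checkpoint name is just an example):

```python
from transformers import pipeline

# Option 1: pass the parameter when building the pipeline...
pipe = pipeline("feature-extraction", model="bert-base-uncased", truncation=True)
features = pipe("A very long input that may exceed the model's maximum length...")

# Option 2: ...or pass it at call time on an already-built pipeline.
pipe = pipeline("feature-extraction", model="bert-base-uncased")
features = pipe("A very long input that may exceed the model's maximum length...", truncation=True)
```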

If anything, dropping padding from the code you're quoting would be the way to go (at least deprecating it first; we have to maintain backward compatibility as much as possible). This code is currently legacy and should be rewritten sometime in the future. The thing is, there are a couple of directions to be considered for text2text-generation, and we're also trying to align pipelines with other libraries (https://github.com/huggingface/huggingface_hub/tree/main/api-inference-community).

Padding is like batching: support for it was very spotty across pipelines. We're closing the gap, but it takes time, and backward compatibility is important. The core idea is to get orthogonal behavior wherever possible, so as much as possible, individual pipelines should NOT handle these concerns; all this logic should be enabled in the parent class. Not all models are even capable of padding (gpt2, for instance).

Truncation, for instance, is not orthogonal, since question-answering and zero-shot-classification handle long prompts by chunking the input. Some pipelines' input cannot really be chunked: summarization, for instance, uses an encoder-decoder, and if the prompt does not fit the size of the model, the summary cannot realistically be produced by chunking (or doing so would come with its own set of drawbacks, let's say).
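For contrast, a minimal sketch of the chunking behavior mentioned above for question-answering. Treat the parameter names as assumptions recalled from the QA pipeline's options rather than something defined in this PR:

```python
from transformers import pipeline

# Minimal sketch: question-answering splits a long context into overlapping
# chunks instead of truncating it, then picks the best answer across chunks.
qa = pipeline("question-answering")

long_context = "Some very long document text. " * 500  # far longer than the model's max length
result = qa(
    question="What is the document about?",
    context=long_context,
    max_seq_len=384,  # assumed: size of each chunk fed to the model
    doc_stride=128,   # assumed: overlap between consecutive chunks
)
print(result)
```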

That's also why adding new parameters is something we try to think carefully about before jumping in.

Truncation in feature-extraction is important because, afaik, sentence embeddings do use the feature-extraction capability, and missing the last part of a sentence is OK in a lot of cases (you only want to use the first token's embedding, and missing part of the sentence is acceptable since it's only about matching later). It still needs to be opt-in, as you need to explicitly know that you can afford to miss part of the sentence. Ideally, we would also emit a warning since we're ignoring part of the sentence, and since a user sending text has no idea how long it is token-wise, it would be better to tell them which part of the sentence is being truncated.
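As a rough illustration of that warning idea, a minimal sketch (not part of this PR; it ignores special-token placement, so the boundary shown is approximate) of detecting truncation with the tokenizer alone and reporting what would be dropped:

```python
from transformers import AutoTokenizer

# Minimal sketch: check whether an input would be truncated and show the user
# which part of the text gets dropped. Special-token placement is ignored, so
# the reported boundary is approximate.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "A fairly long sentence about nothing in particular. " * 100
ids = tokenizer(text, add_special_tokens=True)["input_ids"]

max_len = tokenizer.model_max_length
if len(ids) > max_len:
    dropped = tokenizer.decode(ids[max_len:], skip_special_tokens=True)
    print(
        f"Warning: input is {len(ids)} tokens, model max is {max_len}. "
        f"Dropped text starts with: {dropped[:80]!r}"
    )
```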

Hope this clears up a bit of what's going on.
Happy to receive feedback here too.

@LysandreJik (Member) left a comment

Looks good to me, thank you for taking care of it @Narsil, and thank you for the discussion @ioana-blue!

@Narsil Narsil force-pushed the truncation_for_feature_extraction branch from 1a04b27 to 3da7ba3 on November 3, 2021 at 13:48
@Narsil Narsil merged commit dec759e into huggingface:master Nov 3, 2021
@Narsil Narsil deleted the truncation_for_feature_extraction branch November 3, 2021 14:48
Development

Successfully merging this pull request may close these issues.

Pipeline feature extraction: tensor size mismatch
3 participants