
Adding support for truncation parameter on feature-extraction pipeline. #14193

Merged: 3 commits merged into huggingface:master on Nov 3, 2021

Conversation

@Narsil (Contributor) commented Oct 28, 2021

Fixes #14183

What does this PR do?


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ioana-blue commented Oct 28, 2021

Can you also include padding? Since we're extracting features, I'd like to be able to specify both padding and truncation strategies. Thanks! @Narsil

@Narsil (Contributor, Author) commented Oct 28, 2021

Padding for pipelines is something I would like to keep orthogonal to business logic (see #13724).

That issue is more about batching than padding, but I imagine you pad mostly for batching.
Currently, pipelines never batch, meaning the padding is not used. Would adding padding change the results of feature-extraction?
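For reference, a minimal sketch (not part of this PR; the checkpoint name is just an example) of the case where padding actually matters: batching several inputs yourself with the lower-level tokenizer/model API.

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Minimal sketch: padding only comes into play when several inputs of
# different lengths are batched together. "bert-base-uncased" is just an
# example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence.",
    "A much longer sentence that determines the padded length of the whole batch.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, padded_seq_len, hidden_size)
```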

@ioana-blue

Only slightly, I think. Right now you get embeddings of varying size depending on the size of your input sequence. If you want to somehow use these embeddings in a downstream task, it's weird to have varying sizes. In my case, I think I'm only going to use the embedding corresponding to the CLS token, so I'm good.
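As an illustration, a minimal sketch (assuming the nested-list output the feature-extraction pipeline returns, with a leading batch dimension) of keeping only the first-token ([CLS]) embedding to get a fixed-size vector:

```python
from transformers import pipeline

# Minimal sketch: the feature-extraction pipeline returns one embedding per
# token, so the output length varies with the input. Keeping only the first
# ([CLS]) token gives a fixed-size vector.
extractor = pipeline("feature-extraction", model="bert-base-uncased")

features = extractor("An example sentence of arbitrary length.")
# features is a nested list of shape (1, seq_len, hidden_size); the exact
# nesting may vary by transformers version.
cls_embedding = features[0][0]
print(len(cls_embedding))  # hidden_size, e.g. 768 for bert-base
```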

@ioana-blue

@Narsil Note that padding is supported in other pipelines. I think that, as a user, it's maddening to have varied behavior depending on which pipeline you use; this lack of consistency across pipelines is problematic. Take a look at this pipeline code, and note there is logic around padding:

@Narsil (Contributor, Author) commented Oct 28, 2021

> it's maddening to have varied behavior depending on which pipeline you use.

You're 100% correct; that's part of the reason for the large rewrite that is happening.

For instance, the rewrite enables you to write either pipeline(..., truncation=True) or pipe = pipeline(...); pipe(..., truncation=True).
And that works for all pipelines and all parameters. This was far from the case before.
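A minimal sketch of the two call styles (assuming this PR is merged, so that feature-extraction accepts truncation; the checkpoint name is just an example):

```python
from transformers import pipeline

# Option 1: pass the parameter when building the pipeline...
pipe = pipeline("feature-extraction", model="bert-base-uncased", truncation=True)
features = pipe("A very long input that may exceed the model's maximum length...")

# Option 2: ...or pass it at call time on an already-built pipeline.
pipe = pipeline("feature-extraction", model="bert-base-uncased")
features = pipe("A very long input that may exceed the model's maximum length...", truncation=True)
```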

If anything, dropping padding from the code you're quoting would be the way to go (at least deprecating it first; we have to maintain backward compatibility as much as possible). This code is currently legacy and should be rewritten sometime in the future. The thing is, there are a couple of directions to be considered for text2text-generation, and we're also trying to align pipelines with other libraries (https://github.com/huggingface/huggingface_hub/tree/main/api-inference-community).

Padding is like batching: support for it was very spotty across pipelines. We're closing the gap, but it takes time, and backward compatibility is important. The core idea is to get orthogonal behavior wherever possible, so as much as possible, individual pipelines should NOT handle these concerns; all this logic should be enabled in the parent class. Not all models are even capable of padding (gpt2, for instance).

Truncation, for instance, is not orthogonal, since question-answering and zero-shot-classification handle long prompts by chunking the input. Some pipelines' input cannot really be chunked: summarization, for instance, uses an encoder-decoder, and if the prompt does not fit the size of the model, the summary cannot realistically be produced by chunking (or doing so would come with its own set of drawbacks, let's say).
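For contrast, a minimal sketch of the chunking behavior mentioned above for question-answering. Treat the parameter names as assumptions recalled from the QA pipeline's options rather than something defined in this PR:

```python
from transformers import pipeline

# Minimal sketch: question-answering splits a long context into overlapping
# chunks instead of truncating it, then picks the best answer across chunks.
qa = pipeline("question-answering")

long_context = "Some very long document text. " * 500  # far longer than the model's max length
result = qa(
    question="What is the document about?",
    context=long_context,
    max_seq_len=384,  # assumed: size of each chunk fed to the model
    doc_stride=128,   # assumed: overlap between consecutive chunks
)
print(result)
```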

That's also why adding new parameters is something we try to think carefully about before jumping in.

Truncation in feature-extraction is important because, afaik, sentence embeddings do use the feature-extraction capability, and missing the last part of a sentence is OK in a lot of cases (you only want to use the first token's embedding, and missing part of the sentence is acceptable since it's only about matching later). It still needs to be opt-in, as you need to explicitly know that you can afford to miss part of the sentence. Ideally, we would also emit a warning since we're ignoring part of the sentence, and since a user sending text has no idea how long it is token-wise, it would be better to tell them which part of the sentence is being truncated.
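As a rough illustration of that warning idea, a minimal sketch (not part of this PR; it ignores special-token placement, so the boundary shown is approximate) of detecting truncation with the tokenizer alone and reporting what would be dropped:

```python
from transformers import AutoTokenizer

# Minimal sketch: check whether an input would be truncated and show the user
# which part of the text gets dropped. Special-token placement is ignored, so
# the reported boundary is approximate.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "A fairly long sentence about nothing in particular. " * 100
ids = tokenizer(text, add_special_tokens=True)["input_ids"]

max_len = tokenizer.model_max_length
if len(ids) > max_len:
    dropped = tokenizer.decode(ids[max_len:], skip_special_tokens=True)
    print(
        f"Warning: input is {len(ids)} tokens, model max is {max_len}. "
        f"Dropped text starts with: {dropped[:80]!r}"
    )
```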

Hope this clears up a bit of what's going on.
Happy to receive feedback here too.

@LysandreJik (Member) left a comment

Looks good to me, thank you for taking care of it @Narsil, and thank you for the discussion @ioana-blue!

@Narsil Narsil force-pushed the truncation_for_feature_extraction branch from 1a04b27 to 3da7ba3 on November 3, 2021 at 13:48
@Narsil Narsil merged commit dec759e into huggingface:master Nov 3, 2021
@Narsil Narsil deleted the truncation_for_feature_extraction branch November 3, 2021 14:48
Development

Successfully merging this pull request may close these issues.

Pipeline feature extraction: tensor size mismatch
3 participants