[FEATURE] Fail documents whose embedding field is larger than the token limit of an embedding model #2466

@dtaivpp

Description

Is your feature request related to a problem?
Yes! I just had a discussion with @ylwu-amzn about how documents are embedded. I (like many others I have talked to) was under the impression that when you send a document larger than a model's token input limit, something like pooling happens under the hood. This does not seem to be the case, however: documents larger than the token limit are simply truncated.
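To make the failure mode concrete, here is an illustrative sketch (not ML Commons code): whitespace-split words stand in for real model tokens, and the token limit is an arbitrary small number for demonstration.

```python
MAX_TOKENS = 8  # hypothetical model input limit, for illustration only

def truncate_to_limit(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Mimic silent truncation: everything past the limit is dropped."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

doc = "intro intro intro intro intro intro intro intro key fact buried here"
print(truncate_to_limit(doc))
# The words "key fact buried here" never reach the model, so a query
# about them can never match this document's embedding.
```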

What solution would you like?
There should be a flag to enable/disable document truncation. Transparently truncating data as it is embedded has catastrophic consequences: a document over the limit may simply never be returned, depending on where it was truncated.

This should probably be configurable via the ML Commons settings. We may also want to enable pooling as an alternative, e.g.:

PUT _cluster/settings
{
  "persistent": {
    "plugins": {
      "ml_commons": {
        "embedding_auto_truncation": "true",
        "embedding_pooling": "false"
      }
    }
  }
}
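For the pooling alternative, the idea could look something like the sketch below: chunk the input to the model's limit, embed each chunk, and mean-pool the chunk vectors. Everything here is hypothetical (whitespace "tokens", a toy stand-in for the model call); it only illustrates the shape of the behavior the `embedding_pooling` setting would enable, not any existing implementation.

```python
def embed_long_text(text, max_tokens, embed_chunk):
    """Split text into max_tokens-sized chunks, embed each, mean-pool."""
    tokens = text.split()  # stand-in for a real tokenizer
    chunks = [" ".join(tokens[i:i + max_tokens])
              for i in range(0, len(tokens), max_tokens)]
    vectors = [embed_chunk(chunk) for chunk in chunks]
    dim = len(vectors[0])
    # Element-wise mean across all chunk vectors: every part of the
    # document contributes, instead of being silently dropped.
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Toy "model" returning a 2-dim vector (token count, character count);
# a real deployment would call the actual embedding model per chunk.
toy_embed = lambda chunk: [float(len(chunk.split())), float(len(chunk))]
vec = embed_long_text("a b c d e f g h i j", max_tokens=4, embed_chunk=toy_embed)
print(vec)
```

Mean pooling is only one option; max pooling or indexing one vector per chunk would be alternatives with different retrieval trade-offs.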

What alternatives have you considered?
I am not sure how else we can protect people from misunderstanding what happens when input exceeds an embedding model's maximum token length.

Metadata

Labels: enhancement (New feature or request)
Status: In Progress