embeddings.index Truncation RuntimeError: The size of tensor a (889) must match the size of tensor b (512) at non-singleton dimension 1 #74
Comments
Thank you for reporting this issue! Can you share the transformer model you are using for this test?
That is using
Thank you. It looks like sentence-transformers 1.x no longer defaults the tokenizer.max_seq_length, which is fine for models that have a model_max_length set. The model here does not, hence the error. I will add a config option to optionally set tokenizer.max_seq_length to address this. In the meantime, you have a couple of different options.
After the fix, it will look something like this:

```python
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})
```
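One of the interim options, confirmed to work later in this thread, is pinning sentence-transformers to a pre-1.x release, which still defaults the tokenizer's max sequence length:

```shell
# Pin sentence-transformers to the last 0.4.x release, which still
# defaults tokenizer.max_seq_length instead of relying on model_max_length
pip install sentence-transformers==0.4.1
```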
Thank you, you're absolutely right! Downgrading sentence-transformers to 0.4.1 did the trick for now!
Just committed a fix for this. You can now also try installing the latest master branch. Otherwise, it will be in txtai 3.0.
Actually
Did you run it like this?

```python
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens", "maxlength": 512})
```

The sentence-transformers change is actually an improvement, as defaulting to 128 tokens isn't the best strategy. But it does require an extra parameter if the model doesn't have the model_max_length parameter set.
You're right, it does work if I include it like that. |
Thank you for confirming this works for Embeddings. The Similarity issue is similar but separate from this one: each pipeline needs to pass a kwargs of `{"truncation": True}` when not otherwise specified. I've added #79 to address this.
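That defaulting pattern can be sketched like this (a minimal illustration only; the wrapper and stub names are hypothetical, not txtai's actual code):

```python
def call_pipeline(pipeline_fn, text, **kwargs):
    """Invoke a Hugging Face pipeline, defaulting truncation to True
    unless the caller explicitly specifies it."""
    kwargs.setdefault("truncation", True)
    return pipeline_fn(text, **kwargs)

# Stub standing in for a real transformers pipeline, to show the kwargs flow
def fake_pipeline(text, **kwargs):
    return kwargs

print(call_pipeline(fake_pipeline, "some long document"))
# {'truncation': True}
print(call_pipeline(fake_pipeline, "some long document", truncation=False))
# {'truncation': False}
```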
Any pointers on how to do this? I've noticed that strangely |
I had tried to copy what you did in the Summary pipeline to
to seemingly no change in result |
It's really about the Tokenizer.

```python
>>> tokenizer = AutoTokenizer.from_pretrained("gsarti/scibert-nli")
>>> tokenizer.model_max_length
1000000000000000019884624838656
```

Something that may work for your local testing:

```python
kwargs = {"truncation": True}

# Run ZSL pipeline
self.pipeline.tokenizer.model_max_length = 512
results = self.pipeline(text, labels, multi_label=multilabel, **kwargs)
```

I think transformers assumes this field is always set on the tokenizer. Currently, max_length can't be overridden in calls to the tokenizer for hf pipelines. It may be a worthwhile change upstream in transformers to allow that field to be set.
Going to reopen this issue. I think I can make this less manual. The plan is to remove the maxlength parameter I just added and instead copy the max_position_embeddings config parameter when the tokenizer doesn't have a model_max_length set.
Just committed a fix that should address both embeddings and pipelines. The maxlength parameter is no longer needed: it will take the max_position_embeddings config parameter when a maximum length isn't detected in the tokenizer.
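The fallback described above can be sketched roughly like this (a minimal illustration; the helper name and threshold are assumptions for this sketch, not txtai's actual implementation):

```python
# Hugging Face tokenizers report a huge placeholder value (around 1e30)
# when model_max_length was never configured for the model, as shown
# with gsarti/scibert-nli earlier in this thread.
UNSET_THRESHOLD = 10**12

def resolve_maxlength(model_max_length, max_position_embeddings):
    """Use the tokenizer's model_max_length when it looks like a real
    limit, otherwise fall back to the model config's max_position_embeddings."""
    if model_max_length and model_max_length < UNSET_THRESHOLD:
        return model_max_length
    return max_position_embeddings

# Tokenizer has a real limit configured: keep it
print(resolve_maxlength(512, 512))  # 512
# Tokenizer reports the unset placeholder: fall back to the model config
print(resolve_maxlength(1000000000000000019884624838656, 512))  # 512
```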
I can confirm that as of the latest commit, my code using a pretrained Bluebert model works without running into the issue! Thanks for the fix!
Great, glad to hear it, appreciate your help on this!
Hello, when I try to run the indexing step, I get this error.
Where to_index =
How do I fix this? I can't find anything in the documentation about this. I assume the error message:
is related, and that I need to set a max_length of 512 so that any documents longer than 512 tokens get truncated, but I don't see how to do that anywhere...