Difference in embedding when comparing SentenceTransformer vs TEI 0.4.0 and up for multilingual-e5-large #94
Comments
Hello! This part of the code:

```rust
// See https://github.com/huggingface/tokenizers/pull/1357
if let Some(pre_tokenizer) = tokenizer.get_pre_tokenizer() {
    if let PreTokenizerWrapper::Metaspace(m) = pre_tokenizer {
        // We are forced to clone since `Tokenizer` does not have a `get_mut` for `pre_tokenizer`
        let mut m = m.clone();
        m.set_prepend_scheme(PrependScheme::First);
        tokenizer.with_pre_tokenizer(PreTokenizerWrapper::Metaspace(m));
    } else if let PreTokenizerWrapper::Sequence(s) = pre_tokenizer {
        // We are forced to clone since `Tokenizer` does not have a `get_mut` for `pre_tokenizer`
        let mut s = s.clone();
        for pre_tokenizer in s.get_pre_tokenizers_mut() {
            if let PreTokenizerWrapper::Metaspace(m) = pre_tokenizer {
                m.set_prepend_scheme(PrependScheme::First);
            }
        }
        tokenizer.with_pre_tokenizer(PreTokenizerWrapper::Sequence(s));
    }
}
```

only targets SentencePiece tokenizers and is aimed at correcting a bug that was present in some of them. It seems it backfired and created a regression instead... @ArthurZucker, can you review this code? What did we miss?
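For readers following along, here is a minimal sketch of what the prepend scheme changes, using the Python `tokenizers` bindings. This assumes a tokenizers version where `Metaspace` accepts `prepend_scheme` (introduced by huggingface/tokenizers#1357); it is an illustration of the pre-tokenizer behavior, not the TEI code path itself:

```python
from tokenizers.pre_tokenizers import Metaspace, Sequence, WhitespaceSplit

text = "Car Car Car"
for scheme in ("always", "first"):
    # Whitespace + Metaspace in sequence: the combination this thread is about
    pre = Sequence([WhitespaceSplit(), Metaspace(replacement="▁", prepend_scheme=scheme)])
    pieces = [piece for piece, _ in pre.pre_tokenize_str(text)]
    # Expectation: "always" prefixes ▁ to every split, "first" only to the first one
    print(scheme, pieces)
```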
It seems the logic was broken for tokenizers that had both Whitespace and Metaspace pre-tokenizers. #96 fixes this, and the cosine similarity is back to > 0.999.
Thanks for the quick turnaround!
We are adding #101 to make sure this hopefully never happens again.
Sadly, I have to report that the problem is still not fully fixed. You only have to add a trailing space and it breaks again. My test script:

```python
import httpx
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

e5 = SentenceTransformer("intfloat/multilingual-e5-large")

headers = {'Content-Type': 'application/json'}
url_e5 = 'http://localhost:33200/embed'
texts = [
    ' '.join(["Car"] * 510),
    ' '.join(["Car"] * 507),
    ' '.join(["Car"] * 507) + ' ',
    ' '.join(["Car"] * 507) + ' \n',
    ' '.join(["Car"] * 507) + ' \n ',
]
data = {
    "inputs": texts,
    "normalize": True,
    "truncate": False,
}
response_e5 = httpx.post(url_e5, headers=headers, json=data, timeout=60)
tei_vectors_e5 = response_e5.json()
native_vectors_e5 = e5.encode(texts)

print(cos_sim(tei_vectors_e5[0], native_vectors_e5[0]))
print(cos_sim(tei_vectors_e5[1], native_vectors_e5[1]))
print(cos_sim(tei_vectors_e5[2], native_vectors_e5[2]))
print(cos_sim(tei_vectors_e5[3], native_vectors_e5[3]))
print(cos_sim(tei_vectors_e5[4], native_vectors_e5[4]))
```

prints the following response:
Environment:
By the way, when I remove the whole special handling that was introduced due to huggingface/tokenizers#1357, I get all similarities at 0.99998. Could it be that this whole block is no longer necessary because the upstream issue was fixed? I don't understand the subject well enough to assess this on my own, and I don't want to introduce further regressions with other models. @OlivierDehaene, what's your take on this?
@OlivierDehaene, have you already had a chance to think about my suggestion?
This code was removed on main.
Thank you! In case anybody else is searching for the commit: 9d35f82
System Info
Comparing SentenceTransformer output with TEI 0.2.2, 0.3.0, 0.4.0, and 0.5.0
Information
Tasks
Reproduction
I deployed the latest version of multilingual-e5-large on TEI 0.2.2, 0.3.0, 0.4.0, and 0.5.0.
They are run using:
I then generate an embedding using SentenceTransformer:
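The original snippet was not captured in this extract; a minimal equivalent, reconstructed from the test script shown earlier in the thread (the exact texts used in the original report are not known):

```python
from sentence_transformers import SentenceTransformer

# Model follows the test script earlier in the thread; the input text
# here is a placeholder standing in for the original payload.
st_model = SentenceTransformer("intfloat/multilingual-e5-large")
texts = [" ".join(["Car"] * 510)]
st_result = st_model.encode(texts)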
and I send that same payload to the other 4 backends, enabling normalize and truncate where supported (although truncation is not happening).
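A sketch of one such request, mirroring the httpx call from the test script earlier in the thread (the port is a placeholder; each of the four TEI versions would run on its own port):

```python
import httpx

data = {"inputs": texts, "normalize": True, "truncate": False}
response = httpx.post(
    "http://localhost:33200/embed",  # placeholder; repeated per deployed TEI version
    headers={"Content-Type": "application/json"},
    json=data,
    timeout=60,
)
tei_result = response.json()
```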
I then calculate the cosine similarity between all these vectors:
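Along the lines of:

```python
from sentence_transformers.util import cos_sim

# One comparison per backend, repeated for each TEI version's vectors.
print(cos_sim(st_result[0], tei_result[0]))
```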
Expected behavior
I expected the cosine similarities to be very close to 1 for all these values. However, I'm recording the following values:
st_result vs:
Is there a clear explanation of why this is happening? I did not see a similar difference when using gte-large (ST vs TEI).