Difference in embedding when comparing SentenceTransformer vs TEI 0.4.0 and up for multilingual-e5-large #94
Comments
Hello! This part of the code:

```rust
// See https://github.com/huggingface/tokenizers/pull/1357
if let Some(pre_tokenizer) = tokenizer.get_pre_tokenizer() {
    if let PreTokenizerWrapper::Metaspace(m) = pre_tokenizer {
        // We are forced to clone since `Tokenizer` does not have a `get_mut` for `pre_tokenizer`
        let mut m = m.clone();
        m.set_prepend_scheme(PrependScheme::First);
        tokenizer.with_pre_tokenizer(PreTokenizerWrapper::Metaspace(m));
    } else if let PreTokenizerWrapper::Sequence(s) = pre_tokenizer {
        // We are forced to clone since `Tokenizer` does not have a `get_mut` for `pre_tokenizer`
        let mut s = s.clone();
        for pre_tokenizer in s.get_pre_tokenizers_mut() {
            if let PreTokenizerWrapper::Metaspace(m) = pre_tokenizer {
                m.set_prepend_scheme(PrependScheme::First);
            }
        }
        tokenizer.with_pre_tokenizer(PreTokenizerWrapper::Sequence(s));
    }
}
```

only targets SentencePiece tokenizers and is aimed at correcting a bug that was present in some of them. It seems it backfired and created a regression instead... @ArthurZucker, can you review this code? What did we miss?
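For readers following along, here is a minimal sketch of what the prepend scheme changes, using the Python `tokenizers` bindings. This assumes a tokenizers version where `Metaspace` accepts `prepend_scheme` (introduced by huggingface/tokenizers#1357); it is an illustration of the pre-tokenizer behavior, not the TEI code path itself:

```python
from tokenizers.pre_tokenizers import Metaspace, Sequence, WhitespaceSplit

text = "Car Car Car"
for scheme in ("always", "first"):
    # Whitespace + Metaspace in sequence: the combination this thread is about
    pre = Sequence([WhitespaceSplit(), Metaspace(replacement="▁", prepend_scheme=scheme)])
    pieces = [piece for piece, _ in pre.pre_tokenize_str(text)]
    # Expectation: "always" prefixes ▁ to every split, "first" only to the first one
    print(scheme, pieces)
```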
It seems the logic was broken for tokenizers that had both Whitespace and Metaspace pre-tokenizers. #96 fixes this, and the cosine similarity is back to > 0.999.
Thanks for the quick turnaround!
We are adding #101 to make sure this hopefully never happens again.
Sadly, I have to report that the problem is still not fully fixed. You only have to add a trailing space and it breaks again. My test script:

```python
import httpx
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

e5 = SentenceTransformer("intfloat/multilingual-e5-large")

headers = {'Content-Type': 'application/json'}
url_e5 = 'http://localhost:33200/embed'
texts = [
    ' '.join(["Car"] * 510),
    ' '.join(["Car"] * 507),
    ' '.join(["Car"] * 507) + ' ',
    ' '.join(["Car"] * 507) + ' \n',
    ' '.join(["Car"] * 507) + ' \n ',
]
data = {
    "inputs": texts,
    "normalize": True,
    "truncate": False,
}
response_e5 = httpx.post(url_e5, headers=headers, json=data, timeout=60)
tei_vectors_e5 = response_e5.json()
native_vectors_e5 = e5.encode(texts)

print(cos_sim(tei_vectors_e5[0], native_vectors_e5[0]))
print(cos_sim(tei_vectors_e5[1], native_vectors_e5[1]))
print(cos_sim(tei_vectors_e5[2], native_vectors_e5[2]))
print(cos_sim(tei_vectors_e5[3], native_vectors_e5[3]))
print(cos_sim(tei_vectors_e5[4], native_vectors_e5[4]))
```

prints the following response:
Environment:
By the way, when I remove the whole special handling that was introduced due to huggingface/tokenizers#1357, I get all similarities at 0.99998. Could it be that this whole block is no longer necessary because the upstream issue was fixed? I don't understand the subject well enough to assess this on my own, and I don't want to introduce further regressions with other models. @OlivierDehaene, what's your take on this?
@OlivierDehaene, have you already had a chance to think about my suggestion?
This code was removed on main.
Thank you! In case anybody else is searching for the commit: 9d35f82
System Info
Comparing SentenceTransformer output with TEI 0.2.2, 0.3.0, 0.4.0, and 0.5.0
Information
Tasks
Reproduction
I deployed the latest version of multilingual-e5-large on TEI 0.2.2, 0.3.0, 0.4.0, and 0.5.0.
They are run using:
I then generate an embedding using SentenceTransformer:
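The original snippet was not captured in this extract; a minimal equivalent, reconstructed from the test script shown earlier in the thread (the exact texts used in the original report are not known):

```python
from sentence_transformers import SentenceTransformer

# Model follows the test script earlier in the thread; the input text
# here is a placeholder standing in for the original payload.
st_model = SentenceTransformer("intfloat/multilingual-e5-large")
texts = [" ".join(["Car"] * 510)]
st_result = st_model.encode(texts)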
and I send that same payload to the other 4 backends, enabling normalize and truncate where supported (although truncation is not happening).
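A sketch of one such request, mirroring the httpx call from the test script earlier in the thread (the port is a placeholder; each of the four TEI versions would run on its own port):

```python
import httpx

data = {"inputs": texts, "normalize": True, "truncate": False}
response = httpx.post(
    "http://localhost:33200/embed",  # placeholder; repeated per deployed TEI version
    headers={"Content-Type": "application/json"},
    json=data,
    timeout=60,
)
tei_result = response.json()
```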
I then calculate the cosine similarity between all these vectors:
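Along the lines of:

```python
from sentence_transformers.util import cos_sim

# One comparison per backend, repeated for each TEI version's vectors.
print(cos_sim(st_result[0], tei_result[0]))
```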
Expected behavior
I expected the cosine similarities to be very close to 1 for all these values. However, I'm recording the following values:
st_result vs:
Is there a clear explanation of why this is happening? I did not see a similar difference when using gte-large (ST vs TEI).