
Difference in embedding when comparing SentenceTransformer vs TEI 0.4.0 and up for multilingual-e5-large #94

Closed
gerritvd opened this issue Nov 29, 2023 · 9 comments · Fixed by #96


@gerritvd commented Nov 29, 2023

System Info

Comparing SentenceTransformer output with TEI 0.2.2, 0.3.0, 0.4.0, and 0.5.0

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I deployed the latest version of multilingual-e5-large on the following versions of TEI:

  1. v0.2.2
  2. v0.3.0
  3. v0.4.0
  4. v0.5.0

They are run using:

        # Replace with different versions
        image: ghcr.io/huggingface/text-embeddings-inference:89-0.3.0
        command: [ "text-embeddings-router" ]
        args:
          - "--port"
          - "7860"
          - "--dtype"
          - "float16"
          - "--model-id"
          - "/mnt/models/hub/models--intfloat--multilingual-e5-large/snapshots/9f78368af0062735ba99812349c562316e29f719"

I then generate an embedding using SentenceTransformer:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/multilingual-e5-large", device='cpu')
payload = [' '.join(["Car"]*510)]
st_result = model.encode(payload)

and I send that same payload to the other four backends, enabling normalize and truncate where supported (although truncation is never actually triggered for this input; see the request sketch after the next snippet).
I then calculate the cosine similarity between all of these vectors:

from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(st_result, tei_result)
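
For completeness, tei_result above is obtained roughly as follows (a sketch: the /embed route and the inputs/normalize/truncate fields follow TEI's HTTP API, and the port matches the deployment args above):

import httpx

# Sketch: query a deployed TEI container's /embed route with the same payload
tei_result = httpx.post(
    "http://localhost:7860/embed",
    json={"inputs": payload, "normalize": True, "truncate": True},
    timeout=60,
).json()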

Expected behavior

I expected the cosine similarities to be very close to 1 in all cases. However, I'm recording the following values:

st_result vs:

  1. v0.2.2 = 0.9999881 (close enough)
  2. v0.3.0 = 0.9999881 (close enough)
  3. v0.4.0 = 0.95674335 (looks too far off)
  4. v0.5.0 = 0.95674335 (looks too far off)

Is there a clear explanation for why this is happening? I did not see a similar difference when using gte-large (ST vs TEI).

@OlivierDehaene
Member

Hello!

This part of the code in main.rs is the culprit:

    // See https://github.com/huggingface/tokenizers/pull/1357
    if let Some(pre_tokenizer) = tokenizer.get_pre_tokenizer() {
        if let PreTokenizerWrapper::Metaspace(m) = pre_tokenizer {
            // We are forced to clone since `Tokenizer` does not have a `get_mut` for `pre_tokenizer`
            let mut m = m.clone();
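            // "First" prepends the ▁ replacement only to the first section of the input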
            m.set_prepend_scheme(PrependScheme::First);
            tokenizer.with_pre_tokenizer(PreTokenizerWrapper::Metaspace(m));
        } else if let PreTokenizerWrapper::Sequence(s) = pre_tokenizer {
            // We are forced to clone since `Tokenizer` does not have a `get_mut` for `pre_tokenizer`
            let mut s = s.clone();
            for pre_tokenizer in s.get_pre_tokenizers_mut() {
                if let PreTokenizerWrapper::Metaspace(m) = pre_tokenizer {
                    m.set_prepend_scheme(PrependScheme::First);
                }
            }
            tokenizer.with_pre_tokenizer(PreTokenizerWrapper::Sequence(s));
        }
    }

This only targets SentencePiece tokenizers and is aimed at correcting a bug that was present in some of them. It seems that it backfired and created a regression instead...

@ArthurZucker, can you review this code? What did we miss?
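
For anyone checking whether a given model hits this code path, the pre-tokenizer configuration can be dumped directly (a minimal sketch using the tokenizers Python bindings):

import json
from tokenizers import Tokenizer

# Sketch: print the serialized pre-tokenizer of the affected model
tok = Tokenizer.from_pretrained("intfloat/multilingual-e5-large")
print(json.dumps(json.loads(tok.to_str())["pre_tokenizer"], indent=2))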

@OlivierDehaene
Member

It seems the logic was broken with tokenizers that had both Whitespace and Metaspace pre-tokenizers. #96 fixes this and the cosine similarity is back to being > 0.999.
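
The interaction is easy to reproduce with the tokenizers Python bindings (a sketch; assumes a release, 0.15 or later, where Metaspace exposes prepend_scheme):

from tokenizers.pre_tokenizers import Metaspace, Sequence, WhitespaceSplit

# After a whitespace split, Metaspace sees several sections: "always" prepends
# the ▁ marker to every section, while "first" only touches the first one,
# so forcing "first" changes every token after the first word.
for scheme in ("always", "first"):
    pre = Sequence([WhitespaceSplit(), Metaspace(prepend_scheme=scheme)])
    print(scheme, pre.pre_tokenize_str("Car Car Car"))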

@gerritvd
Author

Thanks for the quick turnaround!

@OlivierDehaene
Member

We are adding #101 to make sure this hopefully never happens again.

@scriptator
Contributor

Sadly, I have to report that the problem is still not fully fixed. You only have to add a trailing space and it breaks again. My test script:

import httpx
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

e5 = SentenceTransformer("intfloat/multilingual-e5-large")
headers = {'Content-Type': 'application/json'}

url_e5 = 'http://localhost:33200/embed'

texts = [
    ' '.join(["Car"]*510),
    ' '.join(["Car"]*507),
    ' '.join(["Car"]*507) + ' ',
    ' '.join(["Car"]*507) + ' \n',
    ' '.join(["Car"]*507) + ' \n ',
]

data = {
  "inputs": texts,
  "normalize": True,
  "truncate": False,
}

response_e5 = httpx.post(url_e5, headers=headers, json=data, timeout=60)
tei_vectors_e5 = response_e5.json()

native_vectors_e5 = e5.encode(texts)

print(cos_sim(tei_vectors_e5[0], native_vectors_e5[0]))
print(cos_sim(tei_vectors_e5[1], native_vectors_e5[1]))
print(cos_sim(tei_vectors_e5[2], native_vectors_e5[2]))
print(cos_sim(tei_vectors_e5[3], native_vectors_e5[3]))
print(cos_sim(tei_vectors_e5[4], native_vectors_e5[4]))

prints the following output:

0.9999883188452392    // original example from this issue --> ok
0.9999884754659525    // original example with just 507 cars --> ok
0.9853260777816594    // add a trailing whitespace --> bad
0.9792360657039959    // add a trailing whitespace and a newline --> even worse
0.9760671020914671    // add a trailing whitespace and a newline and a whitespace --> even worse

Environment:

  • TEI: Docker CPU build from commit 6395a7a
  • Local:
    • Python 3.11.7
    • sentence-transformers==2.2.2
    • tokenizers==0.15.0

@scriptator
Contributor

scriptator commented Feb 1, 2024

By the way, after removing the whole special handling that was introduced because of huggingface/tokenizers#1357, all of my similarities are back up to 0.99998.


Could it be that this whole block is no longer necessary because the upstream issue was fixed? I don't understand the subject well enough to assess this on my own, and I don't want to introduce further regressions with other models. @OlivierDehaene, what's your take on this?

OlivierDehaene reopened this Feb 5, 2024
@scriptator
Contributor

@OlivierDehaene have you already had a chance to think about my suggestion?

@OlivierDehaene
Member

This code was removed on main.

@scriptator
Contributor

scriptator commented Feb 23, 2024

Thank you! In case anyone else is searching for it, the commit is 9d35f82.
