
mxbai-embed-large embedding not consistent with original paper #24357

Open
5 tasks done
jeugregg opened this issue Jul 17, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature
Ɑ: embeddings Related to text embedding models module

Comments

@jeugregg

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.embeddings import OllamaEmbeddings
from sentence_transformers.util import cos_sim
import numpy as np
from numpy.testing import assert_almost_equal
# definitions
ollama_emb = OllamaEmbeddings(model='mxbai-embed-large')

# test on ollama
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

r_1 = ollama_emb.embed_documents(docs)

# Calculate cosine similarity between the query and the documents
similarities = cos_sim(r_1[0], r_1[1:])
print(similarities.numpy()[0])
print("to be compared to:\n [0.7920, 0.6369, 0.1651, 0.3621]")
try:
    assert_almost_equal(similarities.numpy()[0], np.array([0.7920, 0.6369, 0.1651, 0.3621]), decimal=2)
    print("TEST 1 : OLLAMA PASSED.")
except AssertionError:
    print("TEST 1 : OLLAMA FAILED.")

Error Message and Stack Trace (if applicable)

No response

Description

The test does not pass. The same check works with Ollama directly, but not with Ollama under LangChain. It also works well with Llamafile under LangChain.
The issue seems to be the same as this one: ollama/ollama#4207
Why is it not fixed in LangChain?
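
For reference, here is a minimal sketch of the "direct Ollama" check, assuming the ollama Python client (0.2.x), where ollama.embeddings() returns a dict with an "embedding" key:

import ollama
from sentence_transformers.util import cos_sim

query = "Represent this sentence for searching relevant passages: A man is eating a piece of bread"
docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# Call the Ollama API directly, without the LangChain wrapper,
# so no instruction prefix is silently added to the texts.
q_emb = ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
d_embs = [ollama.embeddings(model="mxbai-embed-large", prompt=d)["embedding"] for d in docs]

print(cos_sim(q_emb, d_embs))  # close to [0.7920, 0.6369, 0.1651, 0.3621]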

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030
Python Version: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ]

Package Information

langchain_core: 0.2.20
langchain: 0.2.8
langchain_community: 0.2.7
langsmith: 0.1.88
langchain_chroma: 0.1.1
langchain_text_splitters: 0.2.2

ollama: 0.2.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jul 17, 2024
@jeugregg
Author

jeugregg commented Jul 19, 2024

Actually, the issue is that by default LangChain prepends an instruction to every document: embed_instruction: str = "passage: "
That prefix breaks the results with 'mxbai-embed-large'.
So to pass the test we need to override it when declaring the model:

ollama_emb = OllamaEmbeddings(model="mxbai-embed-large", embed_instruction="")

I don't know whether this example is accurate enough. What do you think we should use?
I am going to try what the mxbai-embed-large blog says to use:

- for embedding docs: embed_instruction = ""
- for queries: query_instruction = "Represent this sentence for searching relevant passages: "

It works well with that.
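
Putting both settings together, a minimal sketch of the passing test (assuming, as in langchain_community, that embed_documents prepends embed_instruction and embed_query prepends query_instruction):

from langchain_community.embeddings import OllamaEmbeddings
from sentence_transformers.util import cos_sim

ollama_emb = OllamaEmbeddings(
    model="mxbai-embed-large",
    embed_instruction="",  # no prefix on documents
    query_instruction="Represent this sentence for searching relevant passages: ",
)

docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# embed_query adds query_instruction; embed_documents adds embed_instruction ("")
q_emb = ollama_emb.embed_query("A man is eating a piece of bread")
d_embs = ollama_emb.embed_documents(docs)

print(cos_sim(q_emb, d_embs))  # should be close to [0.7920, 0.6369, 0.1651, 0.3621]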
