
mxbai-embed-large embedding not consistent with original paper #24357

Open
5 tasks done
jeugregg opened this issue Jul 17, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature
Ɑ: embeddings Related to text embedding models module

Comments

@jeugregg

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.embeddings import OllamaEmbeddings
from sentence_transformers.util import cos_sim
import numpy as np
from numpy.testing import assert_almost_equal
# definitions
ollama_emb = OllamaEmbeddings(model='mxbai-embed-large')

# test on ollama
query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'

docs = [
    query,
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

r_1 = ollama_emb.embed_documents(docs)

# Calculate cosine similarity between the query and the documents
similarities = cos_sim(r_1[0], r_1[1:])
print(similarities.numpy()[0])
print("to be compared to:\n [0.7920, 0.6369, 0.1651, 0.3621]")
try:
    assert_almost_equal(similarities.numpy()[0], np.array([0.7920, 0.6369, 0.1651, 0.3621]), decimal=2)
    print("TEST 1 : OLLAMA PASSED.")
except AssertionError:
    print("TEST 1 : OLLAMA FAILED.")

Error Message and Stack Trace (if applicable)

No response

Description

The test does not pass. The same check works with Ollama directly, but not with Ollama under LangChain. It also works well with Llamafile under LangChain.
The issue seems to be the same as this one: ollama/ollama#4207
Why is it not fixed in LangChain?
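
For reference, here is a minimal sketch of the "direct Ollama" check, assuming the ollama Python client (0.2.x), where ollama.embeddings() returns a dict with an "embedding" key:

import ollama
from sentence_transformers.util import cos_sim

query = "Represent this sentence for searching relevant passages: A man is eating a piece of bread"
docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# Call the Ollama API directly, without the LangChain wrapper,
# so no instruction prefix is silently added to the texts.
q_emb = ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
d_embs = [ollama.embeddings(model="mxbai-embed-large", prompt=d)["embedding"] for d in docs]

print(cos_sim(q_emb, d_embs))  # close to [0.7920, 0.6369, 0.1651, 0.3621]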

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030
Python Version: 3.10.4 (main, Mar 31 2022, 03:37:37) [Clang 12.0.0 ]

Package Information

langchain_core: 0.2.20
langchain: 0.2.8
langchain_community: 0.2.7
langsmith: 0.1.88
langchain_chroma: 0.1.1
langchain_text_splitters: 0.2.2

ollama: 0.2.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

@dosubot dosubot bot added Ɑ: embeddings Related to text embedding models module 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jul 17, 2024
@jeugregg
Author

jeugregg commented Jul 19, 2024

Actually, the issue is that by default LangChain prepends an instruction to every document: embed_instruction: str = "passage: "
That prefix breaks the results with 'mxbai-embed-large'.
So to pass the test we need to override it when declaring the model:

ollama_emb = OllamaEmbeddings(model="mxbai-embed-large", embed_instruction="")

I don't know whether this example is accurate enough. What do you think we should use?
I am going to try what the mxbai-embed-large blog says to use:

- for embedding docs: embed_instruction = ""
- for queries: query_instruction = "Represent this sentence for searching relevant passages: "

It works well with that.
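
Putting both settings together, a minimal sketch of the passing test (assuming, as in langchain_community, that embed_documents prepends embed_instruction and embed_query prepends query_instruction):

from langchain_community.embeddings import OllamaEmbeddings
from sentence_transformers.util import cos_sim

ollama_emb = OllamaEmbeddings(
    model="mxbai-embed-large",
    embed_instruction="",  # no prefix on documents
    query_instruction="Represent this sentence for searching relevant passages: ",
)

docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

# embed_query adds query_instruction; embed_documents adds embed_instruction ("")
q_emb = ollama_emb.embed_query("A man is eating a piece of bread")
d_embs = ollama_emb.embed_documents(docs)

print(cos_sim(q_emb, d_embs))  # should be close to [0.7920, 0.6369, 0.1651, 0.3621]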
