Incorrect results can be returned by embedding search when Content storage enabled #496

Closed
moore-ryan opened this issue Jul 7, 2023 · 1 comment

moore-ryan commented Jul 7, 2023

There seems to be a correctness issue that manifests if the following conditions are met:

  • Content storage ENABLED for the Embeddings
  • Documents being indexed are Python dicts
  • One or more of the documents being indexed has an empty "text" field in the dict (i.e. {"text":""})

If all of these conditions are met, search results for every document AFTER the first empty "text" field appear to be misaligned by the number of documents with empty text fields.

See below code to reproduce:

from txtai.embeddings import Embeddings

# The same three documents in two forms: dicts with a "text" field, and plain strings.
# Document 1 has empty text in both forms.
dict_docs = [(1, {"text": ""}, None), (2, {"text": "New York"}, None), (3, {"text": "California"}, None)]
text_docs = [(1, "", None), (2, "New York", None), (3, "California", None)]

# Dict documents, content storage enabled (the failing combination)
dict_content_embed = Embeddings({"content": True, "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1"})
dict_content_embed.index(dict_docs)

# Dict documents, content storage disabled
dict_no_content_embed = Embeddings({"content": False, "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1"})
dict_no_content_embed.index(dict_docs)

# Plain-text documents, content storage enabled
text_content_embed = Embeddings({"content": True, "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1"})
text_content_embed.index(text_docs)

# Plain-text documents, content storage disabled
text_no_content_embed = Embeddings({"content": False, "path": "sentence-transformers/multi-qa-mpnet-base-dot-v1"})
text_no_content_embed.index(text_docs)

# Run the same query against all four indexes
print(f"Content+Dict Results:\n{dict_content_embed.search('New York')}\n")
print(f"No Content+Dict Results:\n{dict_no_content_embed.search('New York')}\n")
print(f"Content+Text Results:\n{text_content_embed.search('New York')}\n")
print(f"No Content+Text Results:\n{text_no_content_embed.search('New York')}\n")

The code will print:

Content+Dict Results:
[{'id': '3', 'text': 'California', 'score': 0.9999998807907104}, {'id': '2', 'text': 'New York', 'score': 0.2981969118118286}]

No Content+Dict Results:
[(2, 0.9999998807907104), (3, 0.5046365857124329), (1, 0.2981969118118286)]

Content+Text Results:
[{'id': '2', 'text': 'New York', 'score': 0.9999998807907104}, {'id': '3', 'text': 'California', 'score': 0.5046365857124329}, {'id': '1', 'text': '', 'score': 0.2981969118118286}]

No Content+Text Results:
[(2, 0.9999998807907104), (3, 0.5046365857124329), (1, 0.2981969118118286)]

As you can see, the embedding search incorrectly returns document 3 ("California") as the top match for the query "New York" when the documents are Python dicts and content storage is enabled. Document 1 is also missing from that result list.
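
One possible reading of the Content+Dict output, offered as a guess and not as a description of txtai's internals: the vector index keeps an entry for every input document, while the content store skips the document with empty text, so positions in the two no longer line up. The toy model below uses plain Python lists for both (every name in it is hypothetical) and the scores printed above. Under those two assumptions it produces the same result list as the Content+Dict case.

```python
# Toy model only: plain lists stand in for the vector index and the content store.
docs = [(1, ""), (2, "New York"), (3, "California")]

# Scores reported in this issue for the query "New York", keyed by document id.
scores_by_id = {1: 0.2981969118118286, 2: 0.9999998807907104, 3: 0.5046365857124329}

# Assumption: the vector index holds one entry per input document, in input order.
vector_positions = [doc_id for doc_id, _ in docs]

# Assumption: the content store skips documents whose text is empty.
content_rows = [(doc_id, text) for doc_id, text in docs if text]


def search():
    """Rank vector positions by score, then resolve each position against content rows."""
    ranked = sorted(
        range(len(vector_positions)),
        key=lambda pos: scores_by_id[vector_positions[pos]],
        reverse=True,
    )
    results = []
    for pos in ranked:
        # A position past the end of the content rows has no match and drops out.
        if pos < len(content_rows):
            doc_id, text = content_rows[pos]
            score = scores_by_id[vector_positions[pos]]
            results.append({"id": str(doc_id), "text": text, "score": score})
    return results


results = search()
print(results)
```

In this model the score for "New York" lands on the row for document 3, the score for the empty document lands on the row for document 2, and the third position has no row at all.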

davidmezzetti self-assigned this on Jul 7, 2023
davidmezzetti added the bug label on Jul 7, 2023
davidmezzetti added this to the v5.6.0 milestone on Jul 7, 2023

davidmezzetti (Member) commented

Thank you for the detailed report on this!

I would ensure that None or empty strings aren't passed as the text to index, as they're not going to produce useful results. That said, the behavior should be the same regardless of whether content is enabled. I just checked in a fix for this. Thank you again for finding this bug.
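
A minimal sketch of that advice, assuming documents are the (id, data, tags) tuples used in the report. The helper name has_text is hypothetical and not part of txtai, and treating whitespace-only text as empty is a choice made here, not something stated above.

```python
def has_text(document):
    """Return True when an (id, data, tags) tuple carries non-empty text."""
    _, data, _ = document
    # Dict documents keep their text under the "text" key; otherwise data is the text itself.
    text = data.get("text") if isinstance(data, dict) else data
    return isinstance(text, str) and bool(text.strip())


dict_docs = [(1, {"text": ""}, None), (2, {"text": "New York"}, None), (3, {"text": "California"}, None)]
text_docs = [(1, None, None), (2, "New York", None), (3, "California", None)]

# Drop documents with None or empty text before handing them to the index.
clean_dict_docs = [doc for doc in dict_docs if has_text(doc)]
clean_text_docs = [doc for doc in text_docs if has_text(doc)]

print(clean_dict_docs)
print(clean_text_docs)
```

The filtered lists can then be passed to Embeddings.index in place of the originals.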
