Incorrect results can be returned by embedding search when Content storage enabled #496

moore-ryan · 2023-07-07T20:16:30Z

There seems to be a correctness issue that manifests if the following conditions are met:

Content storage is ENABLED for the Embeddings
Documents being indexed are Python dicts
One or more of the documents being indexed has an empty "text" field in the dict (i.e. {"text": ""})

If all of these conditions are met, the search results for every document AFTER the first empty "text" field appear to be misaligned by the number of documents with empty text fields.

See below code to reproduce:

from txtai.embeddings import Embeddings


dict_docs = [(1, {"text":""}, None), (2, {"text":"New York"}, None), (3, {"text":"California"}, None)]
text_docs = [(1, "", None), (2, "New York", None), (3, "California", None)]

dict_content_embed = Embeddings({"content":True,"path":"sentence-transformers/multi-qa-mpnet-base-dot-v1"})
dict_content_embed.index(dict_docs)

dict_no_content_embed = Embeddings({"content":False,"path":"sentence-transformers/multi-qa-mpnet-base-dot-v1"})
dict_no_content_embed.index(dict_docs)

text_content_embed = Embeddings({"content":True,"path":"sentence-transformers/multi-qa-mpnet-base-dot-v1"})
text_content_embed.index(text_docs)

text_no_content_embed = Embeddings({"content":False,"path":"sentence-transformers/multi-qa-mpnet-base-dot-v1"})
text_no_content_embed.index(text_docs)

print(f"Content+Dict Results:\n{dict_content_embed.search('New York')}\n")
print(f"No Content+Dict Results:\n{dict_no_content_embed.search('New York')}\n")
print(f"Content+Text Results:\n{text_content_embed.search('New York')}\n")
print(f"No Content+Text Results:\n{text_no_content_embed.search('New York')}\n")

The code will print:

Content+Dict Results:
[{'id': '3', 'text': 'California', 'score': 0.9999998807907104}, {'id': '2', 'text': 'New York', 'score': 0.2981969118118286}]

No Content+Dict Results:
[(2, 0.9999998807907104), (3, 0.5046365857124329), (1, 0.2981969118118286)]

Content+Text Results:
[{'id': '2', 'text': 'New York', 'score': 0.9999998807907104}, {'id': '3', 'text': 'California', 'score': 0.5046365857124329}, {'id': '1', 'text': '', 'score': 0.2981969118118286}]

No Content+Text Results:
[(2, 0.9999998807907104), (3, 0.5046365857124329), (1, 0.2981969118118286)]

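As you can see, the embedding search incorrectly returns document 3 ("California") as the top match when the documents are Python dicts and content storage is enabled. It also returns only two of the three documents, while the other three configurations return all three.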
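To make the mismatch easy to check programmatically, here is a minimal sketch that compares the top hit across the two result shapes printed above. The helper names result_ids and same_top_hit are hypothetical (they are not part of txtai), and the sketch assumes only what the output shows: content-enabled search returns dicts with a string 'id', while content-disabled search returns (id, score) tuples.

```python
def result_ids(results):
    """Return ranked document ids as strings for either result shape."""
    ids = []
    for row in results:
        # Content enabled: dicts with an "id" key. Content disabled: (id, score) tuples.
        doc_id = row["id"] if isinstance(row, dict) else row[0]
        ids.append(str(doc_id))
    return ids


def same_top_hit(first, second):
    """True when both result lists are non-empty and rank the same document first."""
    a, b = result_ids(first), result_ids(second)
    return bool(a) and bool(b) and a[0] == b[0]


# Results copied from the printed output in this report
content_dict = [
    {"id": "3", "text": "California", "score": 0.9999998807907104},
    {"id": "2", "text": "New York", "score": 0.2981969118118286},
]
no_content_dict = [(2, 0.9999998807907104), (3, 0.5046365857124329), (1, 0.2981969118118286)]

print(same_top_hit(content_dict, no_content_dict))  # False
```

With correct behavior, both configurations should rank the same document first, so a check like this would be expected to return True.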


davidmezzetti · 2023-07-07T23:32:54Z

Thank you for the detailed report on this!

I would make sure that None or an empty string isn't passed as the text to index, as it's not going to produce useful results. That said, the behavior should be the same regardless of whether content is enabled. I just checked in a fix for this. Thank you again for finding this bug.
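For anyone who wants to follow that advice before upgrading, one possible approach is to filter the documents before calling index. This is only a sketch: has_text and drop_empty are hypothetical helpers, not txtai functions, and they assume documents are (id, data, tags) tuples where data is either a string or a dict with a "text" key, as in the reproduction code. Treating whitespace-only text as empty is an extra assumption on my part.

```python
def has_text(document):
    """True when an (id, data, tags) tuple carries non-empty string text."""
    _, data, _ = document
    # Assumption: data is either a plain string or a dict with a "text" key
    text = data.get("text") if isinstance(data, dict) else data
    # Assumption: whitespace-only text is treated the same as empty text
    return isinstance(text, str) and bool(text.strip())


def drop_empty(documents):
    """Keep only the documents that have text worth indexing."""
    return [doc for doc in documents if has_text(doc)]


dict_docs = [(1, {"text": ""}, None), (2, {"text": "New York"}, None), (3, {"text": "California"}, None)]
text_docs = [(1, "", None), (2, "New York", None), (3, "California", None)]

print(drop_empty(dict_docs))
print(drop_empty(text_docs))
```

The filtered list can then be passed to index in place of the original one. Note that this drops the empty documents entirely, so their ids will not appear in search results.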

davidmezzetti self-assigned this Jul 7, 2023

davidmezzetti added the bug ("Something isn't working") label Jul 7, 2023

davidmezzetti added this to the v5.6.0 milestone Jul 7, 2023

davidmezzetti closed this as completed in 56f0bce Jul 7, 2023
