search(q, index="dense") returns values with high scores even if there should not be a match #908

@ddofborg

I am using product titles as my corpus. I embed them using:

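    import txtai

    embeddings = txtai.Embeddings(
        defaults=False,
        normalize=True,
        indexes={
            # Sparse keyword subindex
            "keyword": {
                "keyword": True
            },
            # Dense vector subindex backed by a Dutch sentence-transformers model
            "dense": {
                "path": "NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers"
            }
        }
    )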

Then index as normal.

When I search for words that are 100% unrelated to the corpus, the dense index almost always returns products with scores between 0.10 and 0.35, sometimes 0.50. But a fully correct, matching product would get 0.60.

Are there more fine-tuning methods to dig deeper and determine whether a match makes sense?

I also use the BM25 matcher and weight the results, but it's not ideal, as the dense index gives a very high score to bad results.
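
To make the weighting step concrete, here is a minimal sketch in plain Python, not the txtai API. The function name `combine_scores`, the `(id, score)` result shape, the equal weights, and the 0.5 cutoff are all assumptions for illustration. The cutoff is only a guess based on the score ranges reported above and would need tuning per model and corpus.

```python
def combine_scores(dense, keyword, dense_weight=0.5, min_dense=0.5):
    """Blend (id, score) results from a dense and a keyword index.

    Dense scores below min_dense are dropped, so weak semantic matches
    cannot contribute to the combined score.
    """
    combined = {}

    for uid, score in dense:
        # Ignore dense scores under the cutoff (the cutoff value is an assumption)
        if score >= min_dense:
            combined[uid] = dense_weight * score

    for uid, score in keyword:
        # Keyword scores are added with the remaining weight
        combined[uid] = combined.get(uid, 0.0) + (1 - dense_weight) * score

    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results for an unrelated query: low but non-zero dense scores,
# and no keyword hits at all
print(combine_scores([("a", 0.35), ("b", 0.10)], []))
```

With a cutoff like this, an unrelated query that only produces low dense scores and no keyword hits returns nothing, while a strong dense match still contributes to the blended score.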
