I am using product titles as my corpus. I embed them using:
    import txtai

    embeddings = txtai.Embeddings(
        defaults=False,
        normalize=True,
        indexes={
            # Sparse keyword (BM25) subindex
            "keyword": {
                "keyword": True
            },
            # Dense vector subindex
            "dense": {
                "path": "NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers"
            }
        }
    )
Then index as normal.
When I search for words that are 100% unrelated to the corpus, the dense index almost always returns products with scores between 0.10 and 0.35, sometimes 0.50. But a fully correct, matching product only gets around 0.60.
Are there more fine-tuning methods to dig deeper and determine whether a match actually makes sense?
I also use the BM25 matcher and weight the results, but that is not ideal, because the dense index gives a relatively high score to bad results.
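
To make the weighting problem concrete, here is a minimal sketch in plain Python, not txtai's own API. It assumes both indexes return (id, score) pairs and that the BM25 scores are already scaled to 0..1. The function name `combine`, the weights and the `dense_floor` cutoff are hypothetical, and the cutoff value is only a guess based on the score ranges above.

```python
def combine(bm25, dense, bm25_weight=0.5, dense_floor=0.0):
    """Weighted sum of per-document scores from two (id, score) result lists.

    dense_floor is a hypothetical cutoff: dense scores below it are treated
    as "no match" so they cannot lift an otherwise unrelated product.
    """
    bm25_scores = dict(bm25)
    dense_scores = dict(dense)

    combined = {}
    for uid in set(bm25_scores) | set(dense_scores):
        dense_score = dense_scores.get(uid, 0.0)
        if dense_score < dense_floor:
            dense_score = 0.0
        score = (bm25_weight * bm25_scores.get(uid, 0.0)
                 + (1.0 - bm25_weight) * dense_score)
        # Drop documents that neither index supports
        if score > 0.0:
            combined[uid] = score

    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Unrelated query: no keyword hits, but the dense index still returns results
dense_only = [("p1", 0.35), ("p2", 0.10)]

print(combine([], dense_only))                    # both products survive
print(combine([], dense_only, dense_floor=0.55))  # both products are dropped
```

Without a floor, the unrelated products still end up in the combined ranking purely because of their dense scores. With a floor just below the score of a good match they disappear, but so would a correct product that happens to score under the cutoff, so the value would have to be tuned on real queries.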