bm25 and cross-language searching #260

Closed
imdoge opened this issue May 15, 2024 · 3 comments

@imdoge

imdoge commented May 15, 2024

I noticed that MiniSearch has implemented a JavaScript version of BM25. I'm wondering why MiniSearch does not support cross-language searching. Recently, I have been using Python to debug RAG-related applications built with LlamaIndex and LangChain. The BM25 searches in these libraries can perform cross-language searching.

However, I am looking for a JavaScript version of BM25 search and found MiniSearch, which is an excellent library, but it doesn't support cross-language searching. Could you explain why this is the case?

P.S. For example: if the data contains "bike", searching for "vélo" should find it.

thanks~

@rolftimmermans
Contributor

rolftimmermans commented May 15, 2024

I'll try to answer this given that I opened the original BM25/BM25+ pull request for MiniSearch.

MiniSearch searches in roughly two stages: matching and ranking. (This is a simplification; for this explanation I will ignore features like filtering and boosting.)

The first step is matching. MiniSearch implements a fuzzy search algorithm that looks for words that are textually similar to the words in the query. All documents that match the query in some way are collected.

The second step is ranking. The goal is to show the matching documents in order of relevance: which documents match best? BM25 and BM25+ are ranking algorithms. They do not generate search results; they only (re-)order them.
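To make the two stages concrete, here is a minimal sketch in plain JavaScript. It is not MiniSearch's actual implementation: the sample documents, the helper names `matchDocs` and `rankDocs`, the exact-word matching, and the textbook BM25 formula with `k1 = 1.2` and `b = 0.75` are all assumptions made for illustration.

```javascript
// Toy two-stage search: match first, then rank. Illustrative only.
const docs = [
  { id: 1, text: "bike repair guide" },
  { id: 2, text: "city bike rental and bike tours" },
  { id: 3, text: "train timetable" },
];

const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Stage 1: matching. Collect every document containing at least one query term.
// (Exact word match here; a real engine would also do prefix or fuzzy matching.)
function matchDocs(query, documents) {
  const terms = tokenize(query);
  return documents.filter((doc) => {
    const words = new Set(tokenize(doc.text));
    return terms.some((term) => words.has(term));
  });
}

// Stage 2: ranking. Score only the matched documents with a textbook BM25 formula.
function rankDocs(query, matched, documents, k1 = 1.2, b = 0.75) {
  const terms = tokenize(query);
  const tokenized = documents.map((d) => tokenize(d.text));
  const avgLen =
    tokenized.reduce((sum, t) => sum + t.length, 0) / documents.length;

  return matched
    .map((doc) => {
      const words = tokenize(doc.text);
      let score = 0;
      for (const term of terms) {
        const tf = words.filter((w) => w === term).length;
        if (tf === 0) continue;
        // Number of documents in the whole collection containing the term.
        const df = tokenized.filter((t) => t.includes(term)).length;
        const idf = Math.log(1 + (documents.length - df + 0.5) / (df + 0.5));
        const norm = tf + k1 * (1 - b + (b * words.length) / avgLen);
        score += (idf * tf * (k1 + 1)) / norm;
      }
      return { id: doc.id, score };
    })
    .sort((x, y) => y.score - x.score);
}

// "bike" matches two documents, which are then ordered by score.
const ranked = rankDocs("bike", matchDocs("bike", docs), docs);

// "vélo" matches nothing, so the ranking stage has nothing to order.
const noMatches = matchDocs("vélo", docs);
```

The ranking function never sees a document that the matching stage did not return, which is why BM25 by itself cannot surface "bike" for the query "vélo".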

Cross-language searching (finding "bike" when you search for "vélo") needs to happen during the matching phase. Unless you provide your own translations, this is not something MiniSearch can do. MiniSearch provides fuzzy text-based matching, but the strings "bike" and "vélo" are not similar and will not match.
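As a rough sketch of what "providing your own translations" could look like, the query can be expanded with translated terms before it reaches the matching stage. The `translations` table and the `expandQuery` helper below are hypothetical and hand-maintained; they are not part of MiniSearch.

```javascript
// Hypothetical, hand-maintained translation table (not provided by any library).
const translations = new Map([
  ["vélo", ["bike", "bicycle"]],
  ["gare", ["station"]],
]);

// Expand each query term with its known translations so that the
// text-based matching stage has a chance to find the indexed words.
function expandQuery(query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const expanded = new Set();
  for (const term of terms) {
    expanded.add(term); // keep the original term
    for (const translated of translations.get(term) ?? []) {
      expanded.add(translated);
    }
  }
  return [...expanded].join(" ");
}

const expandedQuery = expandQuery("vélo");
```

The expanded string can then be passed to the search call in place of the original query. This only helps when query terms are combined with OR semantics, and it is only as good as the translation table behind it.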

You could:

  • Use a translation service to translate your source documents into multiple pre-defined languages and index those with MiniSearch.
  • Use an LLM to generate embeddings from your documents and implement semantic search with a similarity metric such as cosine similarity. I'm not aware of any front-end-only libraries that implement this; today you will need to implement this as a back-end service.
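The second option can be sketched as follows, assuming the embeddings have already been computed elsewhere (for example by a back-end service). The three-dimensional vectors are made up for illustration, real embeddings have hundreds of dimensions or more, and the `semanticSearch` helper is hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up embeddings standing in for the output of an embedding model.
const embeddedDocs = [
  { id: "bike", embedding: [0.9, 0.1, 0.0] },
  { id: "train", embedding: [0.1, 0.9, 0.1] },
];

// Rank every document by similarity to the query embedding.
// There is no separate matching stage: all documents are scored.
function semanticSearch(queryEmbedding, entries) {
  return entries
    .map((entry) => ({
      id: entry.id,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((x, y) => y.score - x.score);
}

// Hypothetical embedding for the query "vélo", close to the "bike" vector.
const veloEmbedding = [0.8, 0.2, 0.1];
const semanticResults = semanticSearch(veloEmbedding, embeddedDocs);
```

With a multilingual embedding model, "vélo" and "bike" would be expected to land near each other in vector space, which is what lets this approach cross languages where text-based matching cannot.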

@lucaong
Owner

lucaong commented May 17, 2024

I do not have much else to add to @rolftimmermans's great answer.

@imdoge can I close the issue, or do you have further questions?

@imdoge
Author

imdoge commented May 17, 2024

No further questions, thank you for the answer.

@imdoge imdoge closed this as completed May 17, 2024