[Feature] txtai as a proxy like nboost #21
Comments
Thank you for the feature suggestion. That is an interesting concept: basically a search result re-ranker. From reviewing how nboost works, it looks like it sits in the middle, modifies the query to return more than the requested number of results, re-ranks them and returns the top N results.

txtai is designed to work at the sentence/paragraph level rather than the document level. The embeddings.similarity method could be called between each document section and the query to get a list of scores for a document. Mean or max pooling of those scores can then build a document score, and results are returned based on that.

Issue #12 covers adding a model serving capability via FastAPI. Would an additional API call that looked something like the following be beneficial?

/$model/rank
Post Params:
Returns:

It could eventually grow into calling the HTTP APIs directly, but I would probably start with this to support any platform that can pass data this way.

Alternatively, or possibly in addition to this, if you'd be willing to index your data both in a search system and in txtai, a much more efficient search-time ranking could happen like below:

/$model/rank
Post Params:
Returns:

This way would pull all the sections from the txtai index and run a batch similarity query against the already indexed embeddings. It would depend on how much index time and data storage overhead you are willing to add. Having the data indexed in txtai could also allow a hybrid query approach, where both systems are queried and the results joined in some way.

There are a lot of different ideas here; this is a list of initial thoughts to see which path(s) sound best to explore.
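To make the pooling idea above concrete, here is a minimal sketch of turning per-section similarity scores into one score per document. The names pool_scores and rank_documents are hypothetical and not part of txtai; the section scores are assumed to have been computed already, for example by a similarity call against each section.

```python
from statistics import mean


def pool_scores(section_scores, method="max"):
    """Collapse the per-section scores of one document into a single score."""
    if not section_scores:
        return 0.0  # assumption: documents without sections get a neutral 0.0
    if method == "max":
        return max(section_scores)
    if method == "mean":
        return mean(section_scores)
    raise ValueError(f"unknown pooling method: {method}")


def rank_documents(scores_by_doc, method="max", top_n=None):
    """Return (doc_id, pooled_score) pairs, best first, truncated to top_n."""
    pooled = [
        (doc_id, pool_scores(scores, method))
        for doc_id, scores in scores_by_doc.items()
    ]
    pooled.sort(key=lambda pair: pair[1], reverse=True)
    return pooled[:top_n]  # top_n=None keeps every document


# Hypothetical section scores for two documents
scores_by_doc = {"doc-a": [0.9, 0.1], "doc-b": [0.6, 0.6]}
print(rank_documents(scores_by_doc, method="max"))
print(rank_documents(scores_by_doc, method="mean"))
```

Max pooling rewards a document for one strongly matching section, while mean pooling rewards documents that are on topic throughout, so the two can order the same documents differently.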
If you want a dataset for testing with txtai, you can use the latest arXiv dataset; the dump is available here.
Is it possible to index this dataset quickly, as there is an update every week?
I finally have the changes necessary to accomplish what you need via txtai. txtai has undergone a lot of changes since this issue was created. The best approach is indeed to use txtai as a proxy. #49 added a Similarity pipeline that can be used directly from Python. An example of ranking documents with it is below.

import numpy as np
from txtai.pipeline import Similarity

# Run existing search, assume each result has a "text" field
results = existingsearch(query, n=100)

# Use default model, can be any MNLI model on the Hugging Face model hub
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# Use scores to re-sort existing results, highest score first
reranked = [(scores[x], results[x]) for x in reversed(np.argsort(scores))]

Alternatively, the Embeddings similarity method can still be used.

# Replace lines from above
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])
# With
from txtai.embeddings import Embeddings

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
scores = embeddings.similarity(query, [r["text"] for r in results])

The same tasks can be performed through the API with the following minimal configuration:

similarity:

An example using txtai.js:

import {Embeddings} from "txtai";
let embeddings = new Embeddings("http://localhost:8000");
// Store the scores under a new name so the existing results are not shadowed
let scores = await embeddings.similarity(query, results);

There are a number of language bindings for the API that can be used to perform the same logic.

I'm going to go ahead and close this issue, but please feel free to re-open it or open a new issue if you are still interested in using txtai for this functionality.
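For readers who want the whole proxy flow in one place (over-fetch from the existing search, score, return the top N), here is a rough sketch. Everything in it is hypothetical: rerank, toy_search and overlap_score are illustration names, and the token-overlap scorer is only a stand-in so the example runs without a model. In practice the score argument could be a txtai similarity call like the ones shown in this thread, assuming it returns one score per text in input order.

```python
def rerank(query, search, score, n=10, overfetch=10):
    """Fetch n * overfetch candidates, score them, and return the best n."""
    candidates = search(query, n * overfetch)
    scores = score(query, [c["text"] for c in candidates])
    # Indices of candidates ordered by descending score
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(scores[i], candidates[i]) for i in order[:n]]


# Hypothetical stand-ins for the existing search backend and the scoring model
DOCS = [
    {"text": "cooking pasta at home"},
    {"text": "neural search with embeddings"},
    {"text": "search engines"},
]


def toy_search(query, limit):
    """Pretend search backend: returns the first `limit` documents."""
    return DOCS[:limit]


def overlap_score(query, texts):
    """Fraction of query tokens that appear in each text."""
    tokens = query.lower().split()
    return [sum(t in text.lower().split() for t in tokens) / len(tokens) for text in texts]


for score, doc in rerank("neural search", toy_search, overlap_score, n=2, overfetch=2):
    print(round(score, 2), doc["text"])
```

Keeping the search and scoring steps as plain callables means the same function can sit in front of Elasticsearch, Manticore Search or any other backend that returns a "text" field.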
Hi,
Hope you are all well !
I was wondering if we can use txtai like nboost, as a proxy for Elasticsearch or Manticore Search?
I am really interested in an integration with Manticore Search, as I wrote https://paper2code.com around this full-text search engine.
Thanks for your insights and inputs about this question.
Cheers,
X