
[Feature] txtai as a proxy like nboost #21

Closed

ghost opened this issue Sep 3, 2020 · 3 comments

@ghost commented Sep 3, 2020

Hi,

Hope you are all well!

I was wondering if we can use txtai like nboost, as a proxy for Elasticsearch or Manticore Search?

I am really interested in an integration with Manticore Search, as I built https://paper2code.com around this full-text search engine.

Thanks for your insights and input on this question.

Cheers,
X

@davidmezzetti (Member)

Thank you for the feature suggestion. That is an interesting concept, essentially a search result re-ranker.

In reviewing how nboost works, it appears to stand in the middle: it modifies the query to return more than the requested number of results, re-ranks them and returns the top N.

txtai is designed to work at the sentence/paragraph level rather than the document level. The embeddings.similarity method could be called between each document section and the query to get a list of scores for a document. Mean or max pooling of those scores would then produce a document score, and results would be returned based on that.
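A minimal sketch of that pooling approach, assuming embeddings.similarity(query, texts) returns a list of scores aligned with the input texts (as in the later examples in this thread); the documents dict and query here are made-up illustration data:

import numpy as np
from txtai.embeddings import Embeddings

# Illustration data: each document is a list of sections
documents = {
    "doc1": ["first section text", "second section text"],
    "doc2": ["another section text"]
}
query = "machine learning"

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

scores = {}
for uid, sections in documents.items():
    # Score each section against the query, then pool section scores
    # into a single document score (max pooling here, mean also works)
    sectionscores = embeddings.similarity(query, sections)
    scores[uid] = np.max(sectionscores)

# Rank documents by pooled score, best match first
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)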

Issue #12 covers adding a model serving capability via FastAPI. Suppose an additional API call looked something like this:

/$model/rank

Post Params:

  • data: List of sections grouped by document
  • query: query text
  • size: number of documents to return
  • pooling: mean or max
  • tokenize: if data should be split into sentences

Returns:
data sorted by relevance to the query

Would that be beneficial? It could eventually grow into calling the search engine's HTTP APIs directly, but I would probably start with this to support any platform that can pass data this way.
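For illustration, a call to that proposed endpoint might look like the following; the host, port and model name ("bert" standing in for $model) are all hypothetical, since this API was only a proposal at this point:

import requests

# Hypothetical call to the proposed data-based rank endpoint
response = requests.post("http://localhost:8000/bert/rank", json={
    "data": [["doc1 section 1", "doc1 section 2"], ["doc2 section 1"]],
    "query": "machine learning",
    "size": 10,
    "pooling": "max",
    "tokenize": True
})

# Sections grouped by document, sorted by relevance to the query
print(response.json())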

Alternatively, or possibly in addition to this, if you'd be willing to index your data in both the search system and txtai, a much more efficient search-time ranking could work like the one below:

/$model/rank

Post Params:

  • ids: list of document ids
  • query: query text
  • size: number of document ids to return
  • pooling: mean or max

Returns:
ids sorted by relevance to the query

This version would pull all the sections for the given ids from the txtai index and run a batch similarity query against the already-indexed embeddings. But this method depends on how much index time and data storage overhead you are willing to add. Having the data indexed in txtai would also allow a hybrid query approach, where both systems are queried and the results joined in some way.
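The request shape for this variant would differ only in passing document ids instead of raw text; again, everything here is hypothetical:

import requests

# Hypothetical ids-based variant: txtai already holds the section text
response = requests.post("http://localhost:8000/bert/rank", json={
    "ids": ["doc1", "doc2", "doc3"],
    "query": "machine learning",
    "size": 2,
    "pooling": "mean"
})

# ids sorted by relevance to the query
print(response.json())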

Lots of different ideas here, but these are initial thoughts to see which path(s) sound best to explore.

@ghost (Author) commented Sep 4, 2020

If you want a dataset to test with, you can use the latest arXiv metadata dataset with txtai.

To get the dump:

gsutil ls gs://arxiv-dataset/metadata-v5/
gsutil cp gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json .

Is it possible to index this dataset quickly, given that it is updated every week?

@davidmezzetti (Member)

txtai finally has the changes necessary to accomplish what you need. The project has undergone a lot of changes since this issue was created.

The best approach is indeed to use txtai as a proxy.

#49 added a Similarity pipeline that can be used directly from Python. An example of ranking documents with this is below.

import numpy as np

from txtai.pipeline import Similarity

# Run existing search, assume results has a "text" field
results = existingsearch(query, n=100)

# Use the default model, can be any MNLI model on the Hugging Face model hub
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# Use scores to re-sort existing results, best match first
reranked = [(scores[x], results[x]) for x in reversed(np.argsort(scores))]

Alternatively, Embeddings similarity methods can still be used.

# Replace these lines from above
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# With
from txtai.embeddings import Embeddings

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
scores = embeddings.similarity(query, [r["text"] for r in results])

The same tasks can be performed through the API with the following minimal configuration:

similarity:
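With that configuration saved to a file (config.yml is an assumed name here), the API can be started as described in the txtai documentation:

CONFIG=config.yml uvicorn "txtai.api:app"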

An example using txtai.js:

import {Embeddings} from "txtai";

let embeddings = new Embeddings("http://localhost:8000");

// texts: the list of result text fields from the existing search
let scores = await embeddings.similarity(query, texts);

There are a number of language bindings for the API that can be used to perform the same logic.

I'm going to go ahead and close this issue, but please feel free to re-open it or open a new issue if you are still interested in using txtai for this functionality.

@davidmezzetti davidmezzetti self-assigned this Jan 5, 2021
@davidmezzetti davidmezzetti added this to the v2.0.0 milestone May 13, 2021