
[Feature] txtai as a proxy like nboost #21

Closed

ghost opened this issue Sep 3, 2020 · 3 comments

@ghost commented Sep 3, 2020

Hi,

Hope you are all well!

I was wondering if we can use txtai like nboost, as a proxy for Elasticsearch or Manticore Search?

I am really interested in an integration with Manticore Search, as I built https://paper2code.com around this full-text search engine.

Thanks for your insights and input on this question.

Cheers,
X

@davidmezzetti (Member)

Thank you for the feature suggestion. That is an interesting concept, essentially a search result re-ranker.

In reviewing how nboost works, it appears to stand in the middle: it modifies the query to return more than the requested number of results, re-ranks them and returns the top N.

txtai is designed to work at the sentence/paragraph level rather than the document level. The embeddings.similarity method could be called between each document section and the query to get a list of scores for a document. Mean or max pooling of those scores would then produce a document score, and results would be returned based on that.
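A minimal sketch of that pooling approach, assuming embeddings.similarity(query, texts) returns a list of scores aligned with the input texts (as in the later examples in this thread); the documents dict and query here are made-up illustration data:

import numpy as np
from txtai.embeddings import Embeddings

# Illustration data: each document is a list of sections
documents = {
    "doc1": ["first section text", "second section text"],
    "doc2": ["another section text"]
}
query = "machine learning"

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

scores = {}
for uid, sections in documents.items():
    # Score each section against the query, then pool section scores
    # into a single document score (max pooling here, mean also works)
    sectionscores = embeddings.similarity(query, sections)
    scores[uid] = np.max(sectionscores)

# Rank documents by pooled score, best match first
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)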

Issue #12 covers adding a model serving capability via FastAPI. Suppose an additional API call looked something like this:

/$model/rank

Post Params:

  • data: List of sections grouped by document
  • query: query text
  • size: number of documents to return
  • pooling: mean or max
  • tokenize: if data should be split into sentences

Returns:
data sorted by relevance to the query

Would that be beneficial? It could eventually grow into calling the search engine's HTTP APIs directly, but I would probably start with this to support any platform that can pass data this way.
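For illustration, a call to that proposed endpoint might look like the following; the host, port and model name ("bert" standing in for $model) are all hypothetical, since this API was only a proposal at this point:

import requests

# Hypothetical call to the proposed data-based rank endpoint
response = requests.post("http://localhost:8000/bert/rank", json={
    "data": [["doc1 section 1", "doc1 section 2"], ["doc2 section 1"]],
    "query": "machine learning",
    "size": 10,
    "pooling": "max",
    "tokenize": True
})

# Sections grouped by document, sorted by relevance to the query
print(response.json())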

Alternatively, or possibly in addition to this, if you'd be willing to index your data in both the search system and txtai, a much more efficient search-time ranking could work like the one below:

/$model/rank

Post Params:

  • ids: list of document ids
  • query: query text
  • size: number of document ids to return
  • pooling: mean or max

Returns:
ids sorted by relevance to the query

This version would pull all the sections for the given ids from the txtai index and run a batch similarity query against the already-indexed embeddings. But this method depends on how much index time and data storage overhead you are willing to add. Having the data indexed in txtai would also allow a hybrid query approach, where both systems are queried and the results joined in some way.
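The request shape for this variant would differ only in passing document ids instead of raw text; again, everything here is hypothetical:

import requests

# Hypothetical ids-based variant: txtai already holds the section text
response = requests.post("http://localhost:8000/bert/rank", json={
    "ids": ["doc1", "doc2", "doc3"],
    "query": "machine learning",
    "size": 2,
    "pooling": "mean"
})

# ids sorted by relevance to the query
print(response.json())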

Lots of different ideas here, but these are initial thoughts to see which path(s) sound best to explore.

@ghost (Author) commented Sep 4, 2020

If you want a dataset to test with, you can use the latest arXiv metadata dataset with txtai.

To get the dump:

gsutil ls gs://arxiv-dataset/metadata-v5/
gsutil cp gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json .

Is it possible to index this dataset quickly, given that it is updated every week?

@davidmezzetti (Member)

txtai finally has the changes necessary to accomplish what you need. The project has undergone a lot of changes since this issue was created.

The best approach is indeed to use txtai as a proxy.

#49 added a Similarity pipeline that can be used directly from Python. An example of ranking documents with this is below.

import numpy as np

from txtai.pipeline import Similarity

# Run existing search, assume results has a "text" field
results = existingsearch(query, n=100)

# Use the default model, can be any MNLI model on the Hugging Face model hub
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# Use scores to re-sort existing results, best match first
reranked = [(scores[x], results[x]) for x in reversed(np.argsort(scores))]

Alternatively, Embeddings similarity methods can still be used.

# Replace these lines from above
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# With
from txtai.embeddings import Embeddings

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
scores = embeddings.similarity(query, [r["text"] for r in results])

The same tasks can be performed through the API with the following minimal configuration:

similarity:
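With that configuration saved to a file (config.yml is an assumed name here), the API can be started as described in the txtai documentation:

CONFIG=config.yml uvicorn "txtai.api:app"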

An example using txtai.js:

import {Embeddings} from "txtai";

let embeddings = new Embeddings("http://localhost:8000");

// texts: the list of result text fields from the existing search
let scores = await embeddings.similarity(query, texts);

There are a number of language bindings for the API that can be used to perform the same logic.

I'm going to go ahead and close this issue, but please feel free to re-open it or open a new issue if you are still interested in using txtai for this functionality.

@davidmezzetti davidmezzetti self-assigned this Jan 5, 2021
@davidmezzetti davidmezzetti added this to the v2.0.0 milestone May 13, 2021