Full text search (FTS) indices #1195
Comments
Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers
Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066
What is this for
With full text search (FTS), we can retrieve document data more efficiently, and with BM25 we can rank the results for better retrieval quality.
How we do this
The index consists of 3 parts:
We divide the index structure into 3 files because this lets us minimize I/O:
TODO items
Features
Docs
Additional items
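To make the BM25 ranking mentioned above concrete, here is a minimal, self-contained sketch of a toy in-memory inverted index with BM25 scoring. It is not Lance's implementation: the function names (`tokenize`, `build_inverted_index`, `bm25_search`), the whitespace tokenizer, and the defaults `k1=1.2` and `b=0.75` are all assumptions chosen for illustration, and the on-disk layout discussed in this issue is not modeled.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to {doc_id: term_frequency} and record document lengths."""
    postings = defaultdict(dict)
    doc_lens = []
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        doc_lens.append(len(tokens))
        for token, tf in Counter(tokens).items():
            postings[token][doc_id] = tf
    return postings, doc_lens


def bm25_search(query, postings, doc_lens, k1=1.2, b=0.75, limit=10):
    """Return up to `limit` (doc_id, score) pairs, best score first."""
    n_docs = len(doc_lens)
    avg_len = sum(doc_lens) / n_docs if n_docs else 0.0
    scores = defaultdict(float)
    for token in tokenize(query):
        plist = postings.get(token)
        if not plist:
            continue  # token does not occur in any document
        df = len(plist)
        # Non-negative IDF variant: rarer tokens contribute more.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in plist.items():
            # Length normalization: longer documents are penalized.
            norm = 1 - b + b * doc_lens[doc_id] / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]


docs = [
    "lance stores columnar data",
    "full text search with an inverted index",
    "bm25 ranks full text search results",
]
postings, doc_lens = build_inverted_index(docs)
hits = bm25_search("inverted index", postings, doc_lens)
print(hits)
```

Only the second document contains both query tokens, so it is the only hit. For a query such as "full text search", which matches two documents with identical term frequencies, the shorter document ranks first because of the length normalization term.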
To get it working as soon as possible, I haven't integrated it into the filter expression yet. Instead, I added a new interface to execute the full text search, and I will remove this interface once the parser is ready. Here is a Python example:
import random
import lance
import pyarrow as pa
import string
import tempfile
# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]
id_array = pa.array(ids, type=pa.int64())
# the inverted index only supports large string arrays
doc_array = pa.array(docs, type=pa.large_string())
table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")
results = dataset.scanner(["id", "doc"], limit=10, full_text_query=("doc", docs[0])).to_table()
print(results)
Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantivy FTS into the Rust core and offer FTS to the JS/Python/Rust bindings.
Because we need to work on a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring, can be reused as is.