Full text search (FTS) indices #1195

eddyxu · 2023-08-31T18:31:09Z

Now that https://github.com/lancedb/tantivy-object-store is ready, we can start integrating tantivy FTS into the Rust core and offer FTS to the JS/Python/Rust bindings.

Because we need to work across a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring, can be reused as is.

wjones127 · 2024-05-03T21:12:49Z

Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers

wjones127 · 2024-05-13T16:23:29Z

Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066

BubbleCal · 2024-07-15T16:34:34Z

BubbleCal · 2024-07-15T17:11:33Z

To get it working as soon as possible, I haven't integrated it into the filter expression yet. Instead, I added a new interface to execute the full text search, which will be removed once the parser is ready. Here is a Python example:

import random
import lance
import pyarrow as pa
import string
import tempfile

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]

id_array = pa.array(ids, type=pa.int64())
# the inverted index supports large string arrays only
doc_array = pa.array(docs, type=pa.large_string())

table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")

results = dataset.scanner(["id", "doc"], limit=10, full_text_query=("doc", docs[0])).to_table()
print(results)
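
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what an inverted index does conceptually: it maps each token to the row ids that contain it, so a query becomes a lookup rather than a scan. This is not how Lance or tantivy implement it. The helper names (`tokenize`, `build_inverted_index`, `search`), the whitespace tokenizer, and the "match all query tokens" semantics are assumptions chosen for illustration, and scoring is left out entirely.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, hypothetical tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the sorted list of row ids whose document contains it."""
    postings = defaultdict(set)
    for row_id, doc in enumerate(docs):
        for token in tokenize(doc):
            postings[token].add(row_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query, limit=10):
    """Return up to `limit` row ids containing every query token (assumed AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        matches &= set(index.get(token, []))
    return sorted(matches)[:limit]


docs = ["Lance columnar format", "full text search in Lance", "vector search"]
index = build_inverted_index(docs)
print(search(index, "lance"))         # [0, 1]
print(search(index, "search lance"))  # [1]
```

A real FTS index would also rank the matches (for example by term frequency) and store the postings in a format that can be read from object storage, which is the part the issue above is concerned with.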

eddyxu assigned westonpace, chebbyChefNEQ and wjones127 Aug 31, 2023

eddyxu added the labels arrow (Apache Arrow related issues) and rust (Rust related tasks) Aug 31, 2023

wjones127 changed the title ~~[Rust] Integrate with Tantive Rust crate~~ Full text search (FTS) indices Mar 12, 2024

wjones127 added this to the (WIP) Lance Roadmap milestone Mar 12, 2024

wjones127 mentioned this issue Mar 15, 2024: Roadmap 2024 #2079

BubbleCal mentioned this issue Jul 15, 2024: feat: integrate inverted index into lance index APIs #2577

BubbleCal self-assigned this Jul 15, 2024
