Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full text search (FTS) indices #1195

Open
Tracked by #2079
eddyxu opened this issue Aug 31, 2023 · 4 comments
Open
Tracked by #2079

Full text search (FTS) indices #1195

eddyxu opened this issue Aug 31, 2023 · 4 comments
Assignees
Labels
arrow Apache Arrow related issues rust Rust related tasks

Comments

@eddyxu
Copy link
Contributor

eddyxu commented Aug 31, 2023

Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantive FTS into the rust core, and offer FTS to js/python/rust bindings.

Because we need to work on a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring can be re-used as is.

@eddyxu eddyxu added arrow Apache Arrow related issues rust Rust related tasks labels Aug 31, 2023
@wjones127 wjones127 changed the title [Rust] Integrate with Tantive Rust crate Full text search (FTS) indices Mar 12, 2024
@wjones127 wjones127 added this to the (WIP) Lance Roadmap milestone Mar 12, 2024
@wjones127 wjones127 mentioned this issue Mar 15, 2024
20 tasks
@wjones127
Copy link
Contributor

Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers

@wjones127
Copy link
Contributor

Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066

@BubbleCal
Copy link
Contributor

BubbleCal commented Jul 15, 2024

What is this for

With the capability of full text search, we can retrieve the document data more efficient, and with BM25 we can rank the results to reach better retrieval quality.

How we do this

The index consists of 3 parts: TokenSet, InvertedList and DocSet. We will store them on 3 individual files.

  • TokenSet is basically a map from token(word) to token_id and the frequency of it.
  • InvertedList records the row_id and frequency each token occurs for fast retrieving
  • DocSet records the number of tokens for each row, which is used for BM25

We divide the index structure into the 3 files cause it allows us to minimize IO:

  • For appending, we need to load only the previous TokenSet and generate InvertedList and DocSet for the new data
  • For filtering, we need to load only the TokenSet and InvertedList
  • For filtering and ranking with BM25, we need to load all the files

TODO items

Features

Docs

  • Docs
  • Examples

Additional items

  • More languages support

@BubbleCal BubbleCal self-assigned this Jul 15, 2024
@BubbleCal
Copy link
Contributor

To get it work as soon as possible, I haven't integrated it into the filter expression, instead, just added a new interface to execute the full text search, but will remove this interface once we get the parser ready. Here is a Python example:

import random
import lance
import pyarrow as pa
import string
import tempfile

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]

id_array = pa.array(ids, type=pa.int64())
 # the inverted index supports large string array only
doc_array = pa.array(docs, type=pa.large_string())

table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc","INVERTED")

results = dataset.scanner(["id", "doc"], limit=10, full_text_query=("doc", docs[0])).to_table()
print(results)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues rust Rust related tasks
Projects
None yet
Development

No branches or pull requests

5 participants