Basic full text search capabilities #62

changhiskhan · 2023-05-07T00:01:58Z

This is v1 of integrating full text search index into LanceDB.

API

The query API is roughly the same as before, except that if the input is text instead of a vector, we assume it is a full-text search (FTS) query.

Example

If table is a LanceDB LanceTable, then:

Build index: table.create_fts_index("text")

Query: df = table.search("puppy").limit(10).select(["text"]).to_df()
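The dispatch rule described above can be sketched as follows. This is only an illustration: `route_query` is a hypothetical helper, not a LanceDB function, and it assumes any non-string input is an embedding vector.

```python
from typing import Sequence, Union

Query = Union[str, Sequence[float]]


def route_query(query: Query) -> str:
    """Return which search path a query would take (hypothetical helper)."""
    # Text input is assumed to be a full-text search query.
    if isinstance(query, str):
        return "fts"
    # Anything else is treated as an embedding vector.
    return "vector"


text_path = route_query("puppy")
vector_path = route_query([0.1, 0.2, 0.3])
```

The string check comes first because a `str` is itself a sequence in Python.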

Implementation

Here we use the tantivy-py package to build the index. We use the row IDs as the full-text search index's doc IDs, and then do a Take operation to fetch the matching rows.
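To make the "index, then Take" flow concrete, here is a toy sketch in pure Python. It does not use tantivy: a plain dictionary stands in for the inverted index, a list of dicts stands in for the table, and `build_index` and `search` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set

# A list of dicts stands in for the table; the list position is the row id.
rows = [
    {"text": "a puppy plays in the yard"},
    {"text": "the cat sleeps"},
    {"text": "puppy and cat are friends"},
]


def build_index(rows: List[dict], field: str) -> Dict[str, Set[int]]:
    """Map each lowercase token to the set of row ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for row_id, row in enumerate(rows):
        for token in row[field].lower().split():
            index[token].add(row_id)
    return index


def search(index: Dict[str, Set[int]], rows: List[dict], term: str,
           limit: int = 10) -> List[dict]:
    """Look up matching row ids, then 'take' those rows from the table."""
    row_ids = sorted(index.get(term.lower(), set()))[:limit]
    return [rows[i] for i in row_ids]


index = build_index(rows, "text")
hits = search(index, rows, "puppy")
```

The key point is that the index stores only row ids, so the search result is resolved against the table in a second step.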

Limitations

don't support incremental row appends yet. New data won't show up in search
local filesystem only
requires building tantivy explicitly

eddyxu · 2023-05-07T00:27:30Z

python/lancedb/fts.py

+    for b in dataset.to_batches(columns=fields):
+        for i in range(b.num_rows):
+            doc = tantivy.Document()
+            doc.add_integer("doc_id", i)


This id is the id in a RecordBatch, not in the dataset.
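The difference can be sketched with plain lists standing in for record batches. This assumes, for illustration only, that dataset-level ids are contiguous offsets in scan order; the real Lance row id scheme may differ, and both function names are hypothetical.

```python
from typing import List, Tuple

# Two "batches" of one text column each.
batches: List[List[str]] = [
    ["puppy one", "kitten"],
    ["puppy two", "parrot", "hamster"],
]


def batch_local_ids(batches: List[List[str]]) -> List[Tuple[int, str]]:
    """Problematic variant: ids restart at 0 in every batch."""
    return [(i, text) for batch in batches for i, text in enumerate(batch)]


def dataset_row_ids(batches: List[List[str]]) -> List[Tuple[int, str]]:
    """Variant with a running offset so ids are unique across the dataset."""
    out: List[Tuple[int, str]] = []
    offset = 0
    for batch in batches:
        for i, text in enumerate(batch):
            out.append((offset + i, text))
        offset += len(batch)
    return out


local = batch_local_ids(batches)
global_ids = dataset_row_ids(batches)
```

With batch-local ids, "puppy one" and "puppy two" both get id 0, so a later Take by id could not tell them apart.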

eddyxu · 2023-05-25T02:10:29Z

docs/src/fts.md

+df = table.search("puppy").limit(10).select(["text"]).to_df()
+```
+
+LanceDB automatically looks for an FTS index if the input is str.


This is a Python-only feature, right?

For now. Ideally we can integrate Tantivy from Rust, but I think that takes more architectural clarity than we currently have (since we don't really know what the usage will look like).

Could you add some docs noting that this is a Python-only API?

Yup will do

eddyxu · 2023-05-25T02:11:10Z

python/lancedb/query.py

+        # get the index path
+        index_path = self._table._get_fts_index_path()
+        # open the index
+        index = tantivy.Index.open(index_path)


Does this not support multiple columns?

So with fts indices you can have a single index that covers multiple columns.

eddyxu · 2023-05-25T02:11:47Z

python/lancedb/table.py

@@ -130,6 +133,24 @@ def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
        )
        self._reset_dataset()

+    def create_fts_index(self, field_names: Union[str, List[str]]):


Can we share the create_index() API?

We can, but it makes that API harder to use, because now you have to figure out the index type name and which kwargs go with which index type.

Would that lead to a bunch of create_foo_index() methods, e.g. create_btree_index() or create_vector_index()?

I think we can refactor this into an Index/IndexBuilder abstraction. I just didn't want to commit to it now because this is an experimental feature.
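One possible shape for such an abstraction is sketched below. Everything here is hypothetical: the class names, the `describe` method, and the single `create_index` entry point are assumptions made for illustration, not the design LanceDB adopted. The vector defaults are copied from the `create_index` signature in the diff above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


class IndexBuilder(ABC):
    """Hypothetical base class: one builder type per kind of index."""

    @abstractmethod
    def describe(self) -> str:
        """Return a short summary of the index this builder would create."""


@dataclass
class VectorIndexBuilder(IndexBuilder):
    metric: str = "L2"
    num_partitions: int = 256
    num_sub_vectors: int = 96

    def describe(self) -> str:
        return (f"vector(metric={self.metric}, "
                f"num_partitions={self.num_partitions}, "
                f"num_sub_vectors={self.num_sub_vectors})")


@dataclass
class FtsIndexBuilder(IndexBuilder):
    field_names: List[str]

    def describe(self) -> str:
        return f"fts(fields={self.field_names})"


def create_index(builder: IndexBuilder) -> str:
    """Single entry point; the builder carries the type-specific options."""
    return builder.describe()


fts_summary = create_index(FtsIndexBuilder(["text"]))
vector_summary = create_index(VectorIndexBuilder())
```

In this shape each builder owns its own keyword arguments, so a caller does not need to know which kwargs belong to which index type.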

Co-authored-by: Chang She <chang@lancedb.com>

changhiskhan requested a review from eddyxu May 7, 2023 00:01

eddyxu reviewed May 7, 2023

changhiskhan changed the title ~~basic pieces for fts integration~~ Basic full text search capabilities May 7, 2023

changhiskhan force-pushed the changhiskhan/fts branch 3 times, most recently from 3ab7b72 to 571a6b7 Compare May 7, 2023 19:38

changhiskhan added 9 commits May 24, 2023 17:07

basic pieces for fts integration

e3b037c

first cut fts integration

f417454

address review comment

6985d29

add optional tantivy dependency

7935867

py38 compat for type annotations

88ae862

isort black

c79c7ac

list -> tuple

f130097

GHA

6d29a2a

install directly from github

0053410

changhiskhan force-pushed the changhiskhan/fts branch from 571a6b7 to 0053410 Compare May 24, 2023 23:07

changhiskhan added 2 commits May 24, 2023 18:13

allow multiple columns to be indexed and add docs

e4f9f4a

remove print

96862f2

eddyxu reviewed May 25, 2023

eddyxu approved these changes May 25, 2023

changhiskhan added 2 commits May 24, 2023 22:16

address PR comments

a4a64b0

add another experimental warning

9a5ae6f

changhiskhan merged commit f485378 into main May 25, 2023
5 checks passed

changhiskhan deleted the changhiskhan/fts branch May 25, 2023 04:25
