Basic full text search capabilities #62

Merged
changhiskhan merged 13 commits into main from changhiskhan/fts on May 25, 2023


@changhiskhan changhiskhan commented May 7, 2023

This is v1 of integrating a full-text search (FTS) index into LanceDB.

API

The query API is roughly the same as before, except that if the input is text instead of a vector, we assume it is a full-text search.
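The dispatch rule described above can be sketched as follows. This is only an illustration of the idea, not the code in this PR: `query_kind` is a hypothetical helper name, and the real `search()` method may make this decision differently.

```python
from typing import Sequence, Union


def query_kind(query: Union[str, Sequence[float]]) -> str:
    """Hypothetical helper: decide which kind of search a query implies."""
    # A plain string is treated as a full-text query.
    if isinstance(query, str):
        return "fts"
    # Anything else is assumed to be a vector of floats.
    return "vector"
```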

Example

If `table` is a LanceDB `LanceTable`, then:

Build index: `table.create_fts_index("text")`

Query: `df = table.search("puppy").limit(10).select(["text"]).to_df()`

Implementation

We use the tantivy-py package to build the index. Each row's ID is stored as the document ID in the full-text search index, so a query returns row IDs, and we then do a Take operation to fetch the matching rows.
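To make the "row ID as doc ID, then Take" flow concrete, here is a toy sketch in pure Python. It deliberately swaps tantivy for a small in-memory inverted index and uses a plain list in place of a Lance dataset, so `build_index` and `search` are hypothetical names and none of this is the PR's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(rows: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the set of row IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return index


def search(index: Dict[str, Set[int]], rows: List[str], query: str, limit: int = 10) -> List[str]:
    """Look up row IDs for a single-term query, then "take" those rows."""
    row_ids = sorted(index.get(query.lower(), set()))[:limit]
    # The "take" step: fetch rows by ID from the underlying storage.
    return [rows[i] for i in row_ids]


rows = ["A puppy barks", "A kitten naps", "Puppy naps too"]
index = build_index(rows)
hits = search(index, rows, "puppy")
```

The point of the design is that the search index only needs to store IDs; the row data stays in the table and is fetched once the matching IDs are known.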

Limitations

  1. Incremental row appends are not supported yet, so newly added data won't show up in search results.
  2. Only the local filesystem is supported.
  3. Requires building tantivy explicitly.

@changhiskhan changhiskhan requested a review from eddyxu May 7, 2023 00:01
for b in dataset.to_batches(columns=fields):
    for i in range(b.num_rows):
        doc = tantivy.Document()
        doc.add_integer("doc_id", i)
Contributor

This id is the id in a RecordBatch, not in the dataset.

Contributor Author

fixed
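The thread does not show how this was fixed. One common way to turn a per-batch index into a dataset-wide one is to carry a running offset across batches, sketched below with plain lists standing in for record batches. `global_row_ids` is a hypothetical name, and the sketch assumes row IDs are simply sequential positions in the dataset, which may not match how Lance assigns them.

```python
from typing import Iterable, Iterator, List, Tuple


def global_row_ids(batches: Iterable[List[str]]) -> Iterator[Tuple[int, str]]:
    """Yield (dataset-wide row ID, text) pairs across all batches."""
    offset = 0  # number of rows seen in earlier batches
    for batch in batches:
        for i, text in enumerate(batch):
            # i restarts at 0 in every batch; offset + i does not.
            yield offset + i, text
        offset += len(batch)


batches = [["a puppy", "a kitten"], ["another puppy"]]
rows = list(global_row_ids(batches))
```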

@changhiskhan changhiskhan changed the title basic pieces for fts integration Basic full text search capabilities May 7, 2023
@changhiskhan changhiskhan force-pushed the changhiskhan/fts branch 3 times, most recently from 3ab7b72 to 571a6b7 Compare May 7, 2023 19:38
df = table.search("puppy").limit(10).select(["text"]).to_df()
```

LanceDB automatically looks for an FTS index if the input is str.
Contributor

This is a Python-only feature, right?

Contributor Author

For now. Ideally we can integrate Tantivy from Rust, but I think that takes more architectural clarity than we currently have (since we don't really know how the usage will look).

Contributor

Add some docs noting that this is a Python-only API?

Contributor Author

Yup will do

# get the index path
index_path = self._table._get_fts_index_path()
# open the index
index = tantivy.Index.open(index_path)
Contributor

Does this not support multiple columns?

Contributor Author

With FTS indices, you can have a single index that covers multiple columns.

@@ -130,6 +133,24 @@ def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
        )
        self._reset_dataset()

    def create_fts_index(self, field_names: Union[str, List[str]]):
Contributor

Can we share the create_index() API?

Contributor Author

We can, but it makes that API harder to use: you then have to figure out the index type name and which kwargs go with which index type.

Contributor

Would that lead to a bunch of create_foo_index() methods, e.g., create_btree_index() or create_vector_index()?

Contributor Author

I think we can refactor this into an Index/IndexBuilder abstraction. I just didn't want to commit to it now because this is an experimental feature.
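For readers wondering what such a refactor might look like, here is one possible shape, offered purely as a sketch. Every class and function name below is hypothetical and nothing here exists in this PR; the only values taken from the diff are the `create_index` defaults (`metric="L2"`, `num_partitions=256`, `num_sub_vectors=96`).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


class IndexBuilder(ABC):
    """Hypothetical base class: one builder per index type."""

    @abstractmethod
    def describe(self) -> str:
        """Return a human-readable summary of the index to build."""


@dataclass
class VectorIndexBuilder(IndexBuilder):
    # Defaults mirror the create_index signature shown in the diff.
    metric: str = "L2"
    num_partitions: int = 256
    num_sub_vectors: int = 96

    def describe(self) -> str:
        return (
            f"vector index (metric={self.metric}, "
            f"partitions={self.num_partitions}, sub_vectors={self.num_sub_vectors})"
        )


@dataclass
class FtsIndexBuilder(IndexBuilder):
    field_names: List[str]

    def describe(self) -> str:
        return f"fts index on {', '.join(self.field_names)}"


def create_index(builder: IndexBuilder) -> str:
    """Single entry point; the builder carries the type-specific kwargs."""
    return builder.describe()
```

With this shape, each index type keeps its own parameters on its builder, so a shared `create_index()` would not need an index type name plus a loose bag of kwargs.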

@changhiskhan changhiskhan merged commit f485378 into main May 25, 2023
5 checks passed
@changhiskhan changhiskhan deleted the changhiskhan/fts branch May 25, 2023 04:25
jaichopra pushed a commit that referenced this pull request Jun 2, 2023
Co-authored-by: Chang She <chang@lancedb.com>
raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this pull request Apr 5, 2024