Basic full text search capabilities #62
python/lancedb/fts.py
for b in dataset.to_batches(columns=fields):
    for i in range(b.num_rows):
        doc = tantivy.Document()
        doc.add_integer("doc_id", i)
This id is the id in a RecordBatch, not in the dataset.
fixed
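One way to address the batch-local id issue raised above is to carry a running offset across batches, so each document gets its position in the whole dataset rather than in its RecordBatch. The sketch below uses plain Python lists in place of Arrow batches, and `global_doc_ids` is a hypothetical helper, not the code from the actual fix. It also assumes rows are numbered contiguously in scan order, which may not match Lance's real row ids.

```python
from typing import Iterable, Iterator, List, Tuple


def global_doc_ids(batches: Iterable[List[str]]) -> Iterator[Tuple[int, str]]:
    """Yield (dataset-wide id, text) pairs for rows spread over several batches.

    Hypothetical helper: each batch is modeled as a plain list of strings.
    """
    offset = 0  # number of rows seen in all previous batches
    for batch in batches:
        for i, text in enumerate(batch):
            # i restarts at 0 in every batch, so add the running offset
            yield offset + i, text
        offset += len(batch)


batches = [["a puppy", "a kitten"], ["another puppy"]]
ids = list(global_doc_ids(batches))
```

Without the offset, the first row of the second batch would reuse id 0 and collide with the first row of the first batch.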
3ab7b72 to 571a6b7
571a6b7 to 0053410
```
df = table.search("puppy").limit(10).select(["text"]).to_df()
```

LanceDB automatically looks for an FTS index if the input is str.
This is a Python-only feature, right?
For now. Ideally we can integrate Tantivy from Rust, but I think that takes more architectural clarity than we currently have (since we don't really know what the usage will look like).
Add some docs noting that this is a Python-only API?
Yup will do
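The "looks for an FTS index if the input is str" behavior discussed in this thread amounts to dispatching on the query's type. A minimal sketch, assuming a hypothetical `pick_search_kind` helper (not a LanceDB function) and a vector query given as a sequence of floats:

```python
from typing import Sequence, Union

Query = Union[str, Sequence[float]]


def pick_search_kind(query: Query) -> str:
    """Return "fts" for text queries and "vector" for everything else.

    Hypothetical helper for illustration only.
    """
    # A str is itself a Sequence, so the str check has to come first.
    if isinstance(query, str):
        return "fts"
    return "vector"


kind_text = pick_search_kind("puppy")
kind_vec = pick_search_kind([0.1, 0.2, 0.3])
```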
# get the index path
index_path = self._table._get_fts_index_path()
# open the index
index = tantivy.Index.open(index_path)
Does this not support multi-column?
So with fts indices you can have a single index that covers multiple columns.
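To illustrate what a single index covering multiple columns can look like, here is a toy inverted index in plain Python. It is not tantivy and not the PR's code: `build_index` and `search` are hypothetical names, and tokenization is a simple lowercase whitespace split. Each posting is keyed by (field, token), so one structure serves every indexed column.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Index = Dict[Tuple[str, str], Set[int]]


def build_index(rows: List[Dict[str, str]], fields: List[str]) -> Index:
    """Build one inverted index over several text columns."""
    index: Index = defaultdict(set)
    for doc_id, row in enumerate(rows):
        for field in fields:
            for token in row[field].lower().split():
                index[(field, token)].add(doc_id)
    return index


def search(index: Index, term: str, fields: List[str]) -> List[int]:
    """Return sorted doc ids whose given fields contain the term."""
    hits: Set[int] = set()
    for field in fields:
        # .get avoids creating empty postings for unseen keys
        hits |= index.get((field, term.lower()), set())
    return sorted(hits)


rows = [
    {"title": "Puppy care", "text": "feeding a young dog"},
    {"title": "Cats", "text": "a kitten is not a puppy"},
    {"title": "Fish", "text": "aquarium basics"},
]
index = build_index(rows, ["title", "text"])
hits_any = search(index, "puppy", ["title", "text"])
hits_title = search(index, "puppy", ["title"])
```

Searching both fields finds rows 0 and 1, while restricting the search to `title` finds only row 0.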
@@ -130,6 +133,24 @@ def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
        )
        self._reset_dataset()

    def create_fts_index(self, field_names: Union[str, List[str]]):
Can we share the `create_index()` API?
We can, it just makes that API harder to use, because now you have to figure out the index type name and also which kwargs go with which index type, etc.
Would that make a bunch of `create_foo_index()`? i.e., a `create_btree_index()` or `create_vector_index()`
I think we can refactor this into an Index/IndexBuilder abstraction. I just didn't want to commit to it now because this is an experimental feature.
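One possible shape for the refactor mentioned above is a typed config object per index kind, so the kwargs that belong to each index type travel together. This is only a sketch: `VectorIndexConfig`, `FtsIndexConfig`, and `describe_index` are hypothetical names, not LanceDB APIs. The vector defaults are copied from the `create_index` signature shown in the diff.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class VectorIndexConfig:
    """Options that only make sense for a vector index (hypothetical)."""
    metric: str = "L2"
    num_partitions: int = 256
    num_sub_vectors: int = 96


@dataclass(frozen=True)
class FtsIndexConfig:
    """Options that only make sense for a full-text index (hypothetical)."""
    field_names: List[str]


IndexConfig = Union[VectorIndexConfig, FtsIndexConfig]


def describe_index(config: IndexConfig) -> str:
    """Dispatch on the config type instead of on an index type name."""
    if isinstance(config, FtsIndexConfig):
        return "fts index on " + ", ".join(config.field_names)
    if isinstance(config, VectorIndexConfig):
        return (
            f"vector index ({config.metric}, "
            f"{config.num_partitions} partitions, "
            f"{config.num_sub_vectors} sub-vectors)"
        )
    raise TypeError(f"unsupported index config: {type(config).__name__}")


fts_description = describe_index(FtsIndexConfig(["text"]))
vector_description = describe_index(VectorIndexConfig())
```

A single `create_index(config)` entry point could then accept either config, which avoids both a string index-type argument and a growing family of `create_foo_index()` methods.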
This is v1 of integrating a full-text search index into LanceDB.

The query API is roughly the same as before, except that if the input is text instead of a vector, we assume it is an FTS search.

If `table` is a LanceDB LanceTable, then:

Build index: `table.create_fts_index("text")`

Query: `df = table.search("puppy").limit(10).select(["text"]).to_df()`

Here we use the tantivy-py package to build the index. We use the row ids as the full-text search index's doc ids, then do a Take operation to fetch the rows.

1. Incremental row appends are not supported yet. New data won't show up in search.
2. Local filesystem only.
3. Requires building tantivy explicitly.

---------

Co-authored-by: Chang She <chang@lancedb.com>
This is v1 of integrating a full-text search index into LanceDB.
API
The query API is roughly the same as before, except that if the input is text instead of a vector, we assume it is an FTS search.
Example
If `table` is a LanceDB LanceTable, then:
Build index: `table.create_fts_index("text")`
Query: `df = table.search("puppy").limit(10).select(["text"]).to_df()`
Implementation
Here we use the tantivy-py package to build the index. We use the row ids as the full-text search index's doc ids, then do a Take operation to fetch the rows.
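The two-step flow described here (search returns doc ids that double as row ids, then a Take fetches those rows) can be sketched in plain Python. In this sketch a linear scan stands in for the tantivy index and a dict of columns stands in for the Lance table; `fts_row_ids` and `take` are hypothetical helpers, not the PR's functions.

```python
from typing import Dict, List

Table = Dict[str, List[str]]


def fts_row_ids(texts: List[str], term: str, limit: int) -> List[int]:
    """Stand-in for the index lookup: return row ids of matching documents."""
    ids = [i for i, t in enumerate(texts) if term.lower() in t.lower().split()]
    return ids[:limit]


def take(table: Table, row_ids: List[int], columns: List[str]) -> Table:
    """Fetch only the requested rows and columns by row id."""
    return {col: [table[col][i] for i in row_ids] for col in columns}


table = {
    "id": ["a", "b", "c"],
    "text": ["a puppy naps", "kitten", "puppy plays"],
}
row_ids = fts_row_ids(table["text"], "puppy", limit=10)
result = take(table, row_ids, ["text"])
```

Because the index stores only ids, the selected columns are read from the table itself, so the index does not need to duplicate the row data.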
Limitations