Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full Text Seach: where clause ignored or unclear behavior #721

Closed
pchalasani opened this issue Dec 20, 2023 · 5 comments · Fixed by #739
Closed

Full Text Seach: where clause ignored or unclear behavior #721

pchalasani opened this issue Dec 20, 2023 · 5 comments · Fixed by #739
Assignees

Comments

@pchalasani
Copy link

pchalasani commented Dec 20, 2023

Assuming I've set up an fts_index... when doing this:

table.search(query).where(filter).limit(10).to_list()

I expect the FTS to be limited to records where filter is True. But it doesn't seem to be the case,
as in the first query below -- the filter seems to be ignored, or I don't understand the behavior, or I'm missing some setting.

I expected the first query below to return NO results. The second one seems to work as expected.

In [1]: import lancedb
   ...:
   ...: uri = "data/sample-lancedb"
   ...: db = lancedb.connect(uri)
   ...:
   ...: table = db.create_table("my_table",
   ...:             data=[{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
   ...:                   {"vector": [5.9, 26.5], "text": "There are several kittens playing"}])

In [2]: table.create_fts_index("text")

In [3]: table.search("Frodo").where("text like '%kitten%'").limit(10).to_list()
Out[3]:
[{'vector': [3.0999999046325684, 4.099999904632568],
  'text': 'Frodo was a happy puppy',
  'score': 0.6931471824645996}]

In [4]: table.search(None).where("text like '%kitten%'").limit(10).to_list()
Out[4]:
[{'vector': [5.900000095367432, 26.5],
  'text': 'There are several kittens playing'}]
@changhiskhan changhiskhan self-assigned this Dec 20, 2023
@changhiskhan
Copy link
Contributor

I expected the first query below to return NO results. The second one seems to work as expected.

If we get the FTS results then apply the filter on top, I think that should be good right?

@pchalasani
Copy link
Author

If we get the FTS results then apply the filter on top, I think that should be good right?

that would be fine, yes. I think you mean something like this --

results = table.search("Frodo")
filtered = results.where("text like '%kitten%'")

It doesn't work yet, I assume you mean you're planning to implement it.

@changhiskhan
Copy link
Contributor

It doesn't work yet, I assume you mean you're planning to implement it.
yeah just wanted to confirm this would solve your need before I go implement it

@changhiskhan
Copy link
Contributor

@pchalasani you can take a look at the unit test in #739 to see usage. Lmk if this is what you were expecting.

@pchalasani
Copy link
Author

pchalasani commented Dec 22, 2023

Lmk if this is what you were expecting.

@changhiskhan yes it looks like the behavior I was expecting, for all results, the where condition should hold. In other words, the FTS should be restricted to records that match the where condition.
[edited]

changhiskhan added a commit that referenced this issue Dec 27, 2023
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
koolamusic pushed a commit to tecmie/lancedb that referenced this issue Jan 12, 2024
Closes lancedb#721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once lancedb#676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this issue Apr 5, 2024
Closes lancedb#721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once lancedb#676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this issue Apr 5, 2024
Closes lancedb#721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once lancedb#676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
westonpace pushed a commit that referenced this issue Apr 5, 2024
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
westonpace pushed a commit that referenced this issue Apr 5, 2024
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
alexkohler pushed a commit to alexkohler/lancedb that referenced this issue Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants