fix: prevent duplicate data in FTS index #728

Merged
merged 4 commits into from
Dec 20, 2023

Conversation

wjones127
Contributor

This forces the user to replace the whole FTS directory when re-creating the index, preventing duplicate data from being created. Previously, the whole dataset was re-added to the existing index, duplicating rows that were already in the index.
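The replace-on-rebuild behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not the lancedb implementation: `build_index`, `indexed_row_ids`, the `replace` flag, and the `docs.tsv` file layout are all hypothetical names chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def build_index(index_dir: Path, rows: list[str], replace: bool = False) -> None:
    """Hypothetical stand-in for an FTS index build; not the lancedb API."""
    if index_dir.exists():
        if not replace:
            # Refuse to add to an existing index, which would duplicate rows.
            raise FileExistsError(
                f"Index already exists at {index_dir}; pass replace=True to rebuild it"
            )
        # Drop the whole directory so no old entries survive the rebuild.
        shutil.rmtree(index_dir)
    index_dir.mkdir(parents=True)
    # One line per indexed row: "<row_id>\t<text>".
    with open(index_dir / "docs.tsv", "w", encoding="utf-8") as f:
        for row_id, text in enumerate(rows):
            f.write(f"{row_id}\t{text}\n")


def indexed_row_ids(index_dir: Path) -> list[int]:
    """Return the row ids currently stored in the toy index."""
    with open(index_dir / "docs.tsv", encoding="utf-8") as f:
        return [int(line.split("\t", 1)[0]) for line in f]


if __name__ == "__main__":
    index_dir = Path(tempfile.mkdtemp()) / "fts"
    build_index(index_dir, ["first doc", "second doc"])
    # Rebuilding requires an explicit replace, so each row id appears once.
    build_index(index_dir, ["first doc", "second doc", "third doc"], replace=True)
    print(indexed_row_ids(index_dir))
```

Because the directory is removed before the rebuild, the index is unavailable between `shutil.rmtree` and the end of the write, which is the trade-off this PR accepts.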

This (in combination with lancedb/lance#1707) caused #726: the duplicate data emitted duplicate indices for `take()`, and an upstream issue caused those queries to fail.

This solution isn't ideal, since it makes the FTS index temporarily unavailable while the index is built. In the future, we should have multiple FTS index directories, which would allow atomic commits of new indexes (as well as multiple indexes for different columns).
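One way the future "multiple index directories with atomic commits" idea could look is sketched below. This is an assumption-laden illustration, not a proposed lancedb design: the `CURRENT` pointer file, the `index-<uuid>` directory naming, and both function names are hypothetical. It relies on `os.replace`, which renames atomically on POSIX when source and destination are on the same filesystem.

```python
import os
import tempfile
import uuid
from pathlib import Path


def commit_new_index(root: Path, rows: list[str]) -> Path:
    """Build into a fresh directory, then atomically repoint CURRENT at it."""
    root.mkdir(parents=True, exist_ok=True)
    new_dir = root / f"index-{uuid.uuid4().hex}"
    new_dir.mkdir()
    # Toy "index": one document per line. Readers never see this directory
    # until the pointer below is swapped.
    (new_dir / "docs.txt").write_text("\n".join(rows), encoding="utf-8")

    # Write the pointer to a temporary file first, then swap it into place.
    tmp_pointer = root / f"CURRENT.{uuid.uuid4().hex}.tmp"
    tmp_pointer.write_text(new_dir.name, encoding="utf-8")
    os.replace(tmp_pointer, root / "CURRENT")
    return new_dir


def current_index(root: Path) -> Path:
    """Resolve the directory that the CURRENT pointer names."""
    return root / (root / "CURRENT").read_text(encoding="utf-8")


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp()) / "fts"
    commit_new_index(root, ["first doc"])
    commit_new_index(root, ["first doc", "second doc"])
    print(current_index(root).name)
```

With this layout the old index directory stays readable until the pointer is swapped, so queries would not see a window with no index. Cleaning up superseded directories is left out of the sketch.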

Fixes #498.
Fixes #726.

@wjones127 wjones127 marked this pull request as ready for review December 20, 2023 19:45
Contributor

@changhiskhan changhiskhan left a comment


looks great thanks. just one minor nit.

python/lancedb/table.py (outdated, resolved)
wjones127 and others added 4 commits December 20, 2023 12:34
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
@wjones127 wjones127 merged commit f9dd7a5 into main Dec 20, 2023
10 checks passed
@wjones127 wjones127 deleted the replace-fts-index branch December 20, 2023 21:07
raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this pull request Apr 5, 2024
westonpace pushed a commit that referenced this pull request Apr 5, 2024
alexkohler pushed a commit to alexkohler/lancedb that referenced this pull request Apr 20, 2024
Successfully merging this pull request may close these issues.

Invalid string array when doing FTS
create_fts_index creates duplicates