Moving from qdrant to lancedb #899

nairajay2k · 2024-01-30T09:56:54Z

nairajay2k
Jan 30, 2024

Hi,

We are trying to move our billion-scale vector DB (1024-dim vectors plus payload/metadata) to lancedb.

1. We observed that as the table grew from 0 to around 25 million rows, batch inserts (batch size 400) went from 0.06 seconds at the start to 1.2 seconds now. How can we speed the inserts up? We have not enabled any indexes yet, so this behaviour is a bit strange.
2. Is there a way to split the payload and vectors into separate tables? The reason is that there is a lot of redundant data in a single table (the payload repeats across 100s of vectors). Is this possible yet, or on the roadmap? We would also need to query them as a joint view, with some filters on the payload table and vector similarity on the vector table.
3. If S3 can be used instead of the file system, how does that impact insert/query performance, assuming network latency can be discounted?
4. Is this the best place to put such queries?

westonpace · 2024-01-30T13:40:35Z

westonpace
Jan 30, 2024
Maintainer

Are you running compaction and cleanup on a regular basis? A good rule of thumb is to try and limit your dataset to 100 or so fragments until you reach 1 billion rows. Unfortunately, compaction and cleanup are missing from the API docs at the moment.

import datetime
import lancedb
import pyarrow as pa

tab = pa.Table.from_pydict({"x": [1]})
conn = lancedb.connect("/tmp/lance")

table = conn.create_table("my_table", tab)
table.add(tab)
table.add(tab)
table.add(tab)
table.add(tab)

print(len(table.to_lance().get_fragments()))
# 5

table.compact_files()
print(len(table.to_lance().get_fragments()))
# 1

print(table.cleanup_old_versions(older_than=datetime.timedelta(seconds=1)))
# CleanupStats { bytes_removed: 4050, old_versions: 7 }

We are currently working on additions to the Lance format which will allow us to pick an encoding per data page, and on better data encodings. If your payload is identical across hundreds of rows then that column would probably be a good candidate for RLE encoding (or some of the pages could even be encoded with constant encoding if you have a long enough run). This is probably still a few months out but it is being actively worked on (see lance#1857, "feat: update lance file format to support per-page encoding").
I'm not entirely certain what "assuming network latency can be discounted" means here. Network latency is the most significant difference from a traditional filesystem when it comes to performance. Switching to S3 from local storage will have some impact on query / insert performance because S3 is slower.

Pooling / sharing your table instances and keeping them alive in memory will help query performance because the table metadata and indices are cached.

Yes, this is a good place for such questions. You can also reach us on Discord, though an advantage of creating GitHub issues is that they become searchable for future askers.

4 replies

nairajay2k Jan 31, 2024
Author

What does cleanup do?? Is it deleting records earlier than a certain period??

"Pooling / sharing your table instances and keeping them alive in memory will help query performance because the table metadata and indices are cached."
How to pool/share and keep in memory? Pl share some code. Will this also help insert throughput?

westonpace Jan 31, 2024
Maintainer

"What does cleanup do?? Is it deleting records earlier than a certain period??"

No, cleanup will never delete records from the current version of the dataset. Lance supports "versioning". This means we have to store some information every time we make a change that we can use to roll back that change if desired (e.g. a write-ahead-log). More info here.

Cleanup will prune this version data. Once you clean up a version you will no longer be able to roll back to that version. In normal operation it is a good idea to run cleanup every few weeks. When you are doing a large number of appends you probably want to run cleanup more frequently.

"How to pool/share and keep in memory? Pl share some code."

For example, it is faster to do:

import lancedb
conn = lancedb.connect("/tmp/lance")
# Open the table once, through the connection, and reuse the instance
table = conn.open_table("foo")
my_results = table.to_arrow()
# This query will be faster because we don't need to initialize `table`
my_results_2 = table.to_arrow()

This would be slower:

import lancedb
conn = lancedb.connect("/tmp/lance")
my_results = conn.open_table("foo").to_arrow()
# Calling open_table twice forces us to initialize the table twice
my_results_2 = conn.open_table("foo").to_arrow()

Pooling / sharing instances is probably out of scope for a simple example because it depends on your application. For example, if you are creating a restful service then you can pool tables to share as you serve requests. If you are creating a command line application where each execution does one table operation then there probably isn't much opportunity for pooling/sharing.
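That said, a minimal sketch of the idea for a long-running service might look like the following. `TablePool` is a hypothetical helper, not part of lancedb; it only assumes you pass in something that opens a table by name (for example `conn.open_table`) and that the returned table object can be shared between requests.

```python
import threading
from typing import Any, Callable, Dict


class TablePool:
    """Open each table at most once and hand out the shared instance."""

    def __init__(self, open_table: Callable[[str], Any]) -> None:
        # `open_table` is whatever opens a table by name, e.g. `conn.open_table`.
        self._open_table = open_table
        self._tables: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> Any:
        # The lock keeps two concurrent requests from opening the same table twice.
        with self._lock:
            if name not in self._tables:
                self._tables[name] = self._open_table(name)
            return self._tables[name]
```

You would create one pool at startup, e.g. `pool = TablePool(conn.open_table)`, and call `pool.get("foo")` inside each request handler instead of opening the table again.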

"Will this also help insert throughput?"

Slightly, but not much.

nairajay2k Feb 1, 2024
Author

My table is append only so far...No deletes or updates yet...So would there be multiple versions that could be cleaned?

westonpace Feb 1, 2024
Maintainer

Yes. Every transaction that modifies the table (including append) will create a new version. Queries, which do not modify the table, will not create new versions.

nairajay2k · 2024-01-31T14:56:05Z

nairajay2k
Jan 31, 2024
Author

An observation was that if batch size was reduced to 100 from 400..the insert time remained more or less the same...That could also mean larger batches could be inserted in roughly the same time, improving the throughput

1 reply

westonpace Jan 31, 2024
Maintainer

Yes, inserting larger batches will help in two ways:

1. There is an overhead cost for each call to add. If you make fewer calls you have less overhead.
2. If you insert larger batches you will have fewer fragments. There are some parts of the add function which have a complexity of O(# of fragments). This is why compaction is recommended to keep the # of fragments at a reasonable number.
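Both points can be sketched as a small helper that regroups rows into larger batches before calling add, and optionally compacts every so often. The names `rebatch` and `insert_all`, the default batch size, and the compaction interval are illustrative assumptions, not lancedb APIs or tuned values; `add` and `compact` stand in for something like `table.add` and `table.compact_files`.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def rebatch(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group an arbitrary stream of rows into lists of up to `batch_size` rows."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def insert_all(
    rows: Iterable[dict],
    add: Callable[[List[dict]], object],
    batch_size: int = 10_000,
    compact: Optional[Callable[[], object]] = None,
    compact_every: int = 100,
) -> int:
    """Insert rows in large batches; return the number of calls made to `add`."""
    calls = 0
    for batch in rebatch(rows, batch_size):
        add(batch)  # one call (and roughly one new fragment) per batch
        calls += 1
        if compact is not None and calls % compact_every == 0:
            compact()  # keep the fragment count from growing without bound
    return calls
```

With a real table this might be called as `insert_all(rows, table.add, compact=table.compact_files)`; the right batch size depends on row width and available memory.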

nairajay2k · 2024-02-01T17:21:50Z

nairajay2k
Feb 1, 2024
Author

How can we get the number of records in a table?

3 replies

westonpace Feb 1, 2024
Maintainer

Use Table.count_rows (looks like this is missing from the API docs, I'll investigate)

import lancedb
conn = lancedb.connect("/tmp/my_lance")
tab = conn.open_table("test")
print(tab.count_rows())
# 5

nairajay2k Feb 1, 2024
Author

Is this a costly call for large tables or is it fetched from some metadata?

westonpace Feb 1, 2024
Maintainer

It is fetched from metadata.

nairajay2k · 2024-02-05T11:21:34Z

nairajay2k
Feb 5, 2024
Author

How do I add a scalar index on something nested? E.g. in qdrant I would specify the field name as something like "classifications_ipcr[].classification" for a nested arrangement. Can I do this in lancedb? Please share some code for both indexing and searching.

a sample json is provided below
{
  "c1": [
    1021833000,
    89867089
  ],
  "c2": [
    {
      "e1": "ES",
      "e2": "ddb",
      "e3": "ddb",
      "e4": "Some text",
      "e5": "1"
    },
    {
      "e1": "e1",
      "e2": "e2",
      "e4": "Some other text",
      "e5": "1"
    }
  ]
}
I would want to index/search c1 and c2.e4 .

Thanks in advance

1 reply

westonpace Feb 5, 2024
Maintainer

I'm not sure if we support scalar indices on nested fields at the moment but it should be straightforward. I will try and look into it this week.

#929

nairajay2k · 2024-02-05T23:07:38Z

nairajay2k
Feb 5, 2024
Author

Can I assist?


1 reply

westonpace Feb 6, 2024
Maintainer

Sure. You might start by creating a test case here: https://github.com/lancedb/lance/blob/v0.9.13/python/python/tests/test_scalar_index.py and seeing what fails. I would expect the syntax to be something like...

dataset.create_scalar_index("c2.e4", index_type="BTREE")

I'm not exactly sure what the corresponding filter would look like. Datafusion supports nested fields in their SQL (at least according to this issue, apache/datafusion#119) but I don't know what the syntax looks like. That might be the next step. Then give the filter a try and see what fails.

nairajay2k · 2024-02-06T13:48:19Z

nairajay2k
Feb 6, 2024
Author

Wouldn't this need some change in rust code as well?


4 replies

nairajay2k Feb 8, 2024
Author

Pl point me to the steps I need to do in order to run testcases for lancedb. Apologies for my ignorance.

Thanks

westonpace Feb 9, 2024
Maintainer

It's not a problem. Yes, it will likely require changes in rust as well. It will be easier to create the test in lance instead of lancedb. To run the tests you can:

Install rust if you have not already done so
Install python
Install protobuf-compiler if it isn't already installed (protoc needs to be on your path)
Setup either conda or a venv
Install maturin (pip install maturin)

git clone https://github.com/lancedb/lance.git lance
cd lance/python
# This will build the rust code, then build the rust/python bindings
# then symlink a dev copy of pylance into your venv
maturin develop --extras=tests
python -mpytest python/tests

nairajay2k Feb 11, 2024
Author

Not sure which line in your list is building rust code. Pl let me know . I am a rust newbie

nairajay2k Feb 11, 2024
Author

This is what I get for a field that is an int array or list

ERROR python/tests/test_scalar_index.py::test_load_nested_indices - TypeError: Scalar index column arr1 must be int, float, bool, or str
