
add_columns dataset API #1458

Closed
Tracked by #1813
wjones127 opened this issue Oct 23, 2023 · 2 comments · Fixed by #1468
Labels: enhancement (New feature or request), priority: high (Issues that are high priority (for LanceDb, the organization))

Comments

@wjones127 (Contributor)

We have an add_columns() API for fragments and a merge() API for datasets. It would be nice to create an add_columns() API for datasets that is similar to the one for fragments.

It would also be very nice to add progress tracking as an option there, since this might be a long operation. For example, users might call this API to add a new embedding column to the dataset.

wjones127 added the enhancement label on Oct 23, 2023
@wjones127 (Contributor, Author)

We would also like to support a compute expression, so users can easily do a cast, for example.

The API could look like:

def add_columns(exprs_or_udf: Dict[str, str] | AddColumnUDF):
    ...

Then users could write expressions:

dataset.add_columns({"bfloat16_vec": "cast(vector as array<bfloat16, 512>)"})

Or they could specify a UDF using some decorator:

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa

pool = ThreadPoolExecutor(8)

@add_columns_udf(
    reads_columns=['vector'],
    pool=pool,
)
def cast_bfloat16(batch: pa.RecordBatch):
    ...


dataset.add_columns(cast_bfloat16)

The pool parameter is important with UDFs to allow some parallelism.
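For illustration, here is a minimal sketch of what a complete UDF body might look like under this proposal, assuming the UDF receives a RecordBatch containing only the reads_columns and returns a RecordBatch holding the new column(s). Stock pyarrow has no bfloat16 type, so this hypothetical example downcasts to float16 instead; the function and column names are made up.

import numpy as np
import pyarrow as pa

def cast_float16(batch: pa.RecordBatch) -> pa.RecordBatch:
    # Assumes 'vector' is a fixed-size list of float32.
    vectors = np.stack(batch.column("vector").to_numpy(zero_copy_only=False))
    halved = vectors.astype(np.float16)
    # Rebuild a fixed-size-list column from the flattened float16 values.
    arr = pa.FixedSizeListArray.from_arrays(
        pa.array(halved.ravel()), int(halved.shape[1])
    )
    return pa.RecordBatch.from_arrays([arr], names=["float16_vec"])

Wrapped with the proposed @add_columns_udf(reads_columns=['vector'], ...) decorator, a function like this could then be passed to dataset.add_columns() as shown above.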

@wjones127 (Contributor, Author)

wjones127 commented Jan 25, 2024

Another requirement: for the UDF, it would be nice to provide some way for the output data to be staged so that the operation can be resumed in case of a crash.

One way to do that in Python is to use a SQLite file as a durable mapping from hash(input_data) to output_data. We could even be extra smart and keep track of which fragments have already been written and where. This could also help with cleanup.
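A minimal sketch of that SQLite idea (illustrative only, not the design that eventually shipped): key a table by a hash of the IPC-serialized input batch and store the serialized output batch, so an interrupted job can skip batches it has already computed. The class and file names here are hypothetical.

import hashlib
import sqlite3
from typing import Optional

import pyarrow as pa


def _serialize(batch: pa.RecordBatch) -> bytes:
    # IPC-serialize a batch so it can be hashed or stored as a BLOB.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue().to_pybytes()


class BatchCheckpoint:
    # Durable mapping from hash(input_data) to output_data, backed by SQLite.
    def __init__(self, path: str = "add_columns_checkpoint.sqlite"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, batch BLOB)"
        )

    def get(self, in_batch: pa.RecordBatch) -> Optional[pa.RecordBatch]:
        key = hashlib.sha256(_serialize(in_batch)).hexdigest()
        row = self.conn.execute(
            "SELECT batch FROM results WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else pa.ipc.open_stream(row[0]).read_next_batch()

    def put(self, in_batch: pa.RecordBatch, out_batch: pa.RecordBatch) -> None:
        key = hashlib.sha256(_serialize(in_batch)).hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?)",
            (key, _serialize(out_batch)),
        )
        self.conn.commit()

A UDF runner could call get() before computing a batch and put() afterwards; deleting the file simply restarts from scratch. Tracking which fragments have already been rewritten would be a small extra table in the same file.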

westonpace added the priority: high label on Jan 29, 2024
wjones127 self-assigned this on Jan 29, 2024
wjones127 added a commit that referenced this issue Feb 5, 2024
Adds a new `add_columns()` method on datasets. This can be passed either
SQL expressions or a user-defined batch function.

Closes #1458
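Based on that description, a call to the landed method presumably mirrors the sketches above, e.g. reusing the expression form from earlier in this thread:

dataset.add_columns({"bfloat16_vec": "cast(vector as array<bfloat16, 512>)"})

or passing the decorated UDF directly, as in dataset.add_columns(cast_bfloat16). See PR #1468 for the exact signature.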