
add_columns dataset API #1458

Closed
Tracked by #1813
wjones127 opened this issue Oct 23, 2023 · 2 comments · Fixed by #1468
Labels: enhancement (New feature or request), priority: high (Issues that are high priority (for LanceDb, the organization))

Comments

@wjones127 (Contributor)

We have an add_columns() API for fragments and a merge() API for datasets. It would be nice to create an add_columns() API for datasets that is similar to the one for fragments.

It would also be very nice to add progress tracking as an option there, since this might be a long operation. For example, users might call this API to add a new embedding column to the dataset.

wjones127 added the enhancement label on Oct 23, 2023
@wjones127 (Contributor, Author)

We would also like to support a compute expression, so users can easily do a cast, for example.

The API could look like:

def add_columns(exprs_or_udf: Dict[str, str] | AddColumnUDF):
    ...

Then users could write expressions:

dataset.add_columns({"bfloat16_vec": "cast(vector as array<bfloat16, 512>)"})

Or they could specify a UDF using some decorator:

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa

pool = ThreadPoolExecutor(8)

@add_columns_udf(
    reads_columns=['vector'],
    pool=pool,
)
def cast_bfloat16(batch: pa.RecordBatch):
    ...


dataset.add_columns(cast_bfloat16)

The pool parameter is important with UDFs to allow some parallelism.
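For illustration, here is a minimal sketch of what a complete UDF body might look like under this proposal, assuming the UDF receives a RecordBatch containing only the reads_columns and returns a RecordBatch holding the new column(s). Stock pyarrow has no bfloat16 type, so this hypothetical example downcasts to float16 instead; the function and column names are made up.

import numpy as np
import pyarrow as pa

def cast_float16(batch: pa.RecordBatch) -> pa.RecordBatch:
    # Assumes 'vector' is a fixed-size list of float32.
    vectors = np.stack(batch.column("vector").to_numpy(zero_copy_only=False))
    halved = vectors.astype(np.float16)
    # Rebuild a fixed-size-list column from the flattened float16 values.
    arr = pa.FixedSizeListArray.from_arrays(
        pa.array(halved.ravel()), int(halved.shape[1])
    )
    return pa.RecordBatch.from_arrays([arr], names=["float16_vec"])

Wrapped with the proposed @add_columns_udf(reads_columns=['vector'], ...) decorator, a function like this could then be passed to dataset.add_columns() as shown above.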

@wjones127 (Contributor, Author)

wjones127 commented Jan 25, 2024

Another requirement: for the UDF, it would be nice to provide some way for the output data to be staged so that the operation can be resumed in case of a crash.

One way to do that in Python is to use a SQLite file as a durable mapping from hash(input_data) to output_data. We could even be extra smart and keep track of which fragments have already been written and where. This could also help with cleanup.
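A minimal sketch of that SQLite idea (illustrative only, not the design that eventually shipped): key a table by a hash of the IPC-serialized input batch and store the serialized output batch, so an interrupted job can skip batches it has already computed. The class and file names here are hypothetical.

import hashlib
import sqlite3
from typing import Optional

import pyarrow as pa


def _serialize(batch: pa.RecordBatch) -> bytes:
    # IPC-serialize a batch so it can be hashed or stored as a BLOB.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue().to_pybytes()


class BatchCheckpoint:
    # Durable mapping from hash(input_data) to output_data, backed by SQLite.
    def __init__(self, path: str = "add_columns_checkpoint.sqlite"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, batch BLOB)"
        )

    def get(self, in_batch: pa.RecordBatch) -> Optional[pa.RecordBatch]:
        key = hashlib.sha256(_serialize(in_batch)).hexdigest()
        row = self.conn.execute(
            "SELECT batch FROM results WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else pa.ipc.open_stream(row[0]).read_next_batch()

    def put(self, in_batch: pa.RecordBatch, out_batch: pa.RecordBatch) -> None:
        key = hashlib.sha256(_serialize(in_batch)).hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?)",
            (key, _serialize(out_batch)),
        )
        self.conn.commit()

A UDF runner could call get() before computing a batch and put() afterwards; deleting the file simply restarts from scratch. Tracking which fragments have already been rewritten would be a small extra table in the same file.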

westonpace added the priority: high label on Jan 29, 2024
wjones127 self-assigned this on Jan 29, 2024
wjones127 added a commit that referenced this issue Feb 5, 2024
Adds a new `add_columns()` method on datasets. This can be passed either
SQL expressions or a user-defined batch function.

Closes #1458
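Based on that description, a call to the landed method presumably mirrors the sketches above, e.g. reusing the expression form from earlier in this thread:

dataset.add_columns({"bfloat16_vec": "cast(vector as array<bfloat16, 512>)"})

or passing the decorated UDF directly, as in dataset.add_columns(cast_bfloat16). See PR #1468 for the exact signature.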