Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): expose drop_columns() in Python #1904

Merged
merged 2 commits into from Feb 3, 2024

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Feb 2, 2024

Renames the Rust method drop() to drop_columns() for clarity and also alignment with add_columns() and alter_columns().

Closes #1076
Related #1674

test

feat: support returning Pandas DF

feat: add_columns supports pools

fix test name conflict

fix: shutdown pool

fix

remove process pool

remove method

implement Rust API

clean up testing of drop_columns

feat: rename columns

partially implement casting

wip: add Python bindings

wip: start fixing impl

make it work

remove irrelevant changes

pass tests

format

format
@wjones127 wjones127 changed the title feat: rename drop() to drop_columns() feat(python): expose drop_columns() in Python Feb 2, 2024
@wjones127 wjones127 marked this pull request as ready for review February 2, 2024 22:45
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor comments / questions. This looks good!


fn drop_columns(&mut self, columns: Vec<&str>) -> PyResult<()> {
let mut new_self = self.ds.as_ref().clone();
RT.block_on(None, new_self.drop_columns(&columns))?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

block_on and not spawn? Is that because this API can't possibly call back into python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it should be safe if there's no way to call back into Python.

Comment on lines +1011 to +1014
lance::Error::InvalidInput { source, .. } => {
PyValueError::new_err(source.to_string())
}
_ => PyIOError::new_err(err.to_string()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have a generic error conversion that always maps InvalidInput to PyValueError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah refactoring our error handling is on my TODO list. I'll put up an issue soon.

@@ -1466,6 +1436,56 @@ impl Dataset {
}
}

impl Dataset {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this in a new impl block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put it in a separate impl block, it groups the methods together in the API docs. You can add a documentation block to it. My plan is to put all the "schema evolution" methods under one header and have a little doc string describing them together.

Comment on lines +1443 to +1445
/// underlying storage. In order to remove the data, you must subsequently
/// call `compact_files` to rewrite the data without the removed columns and
/// then call `cleanup_files` to remove the old files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most users aren't going to see this but I feel like we're going to get questions about it. Might need to end up in a top level rst document someday.

Copy link
Contributor Author

@wjones127 wjones127 Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I should add this to the Python API docs. But later adding it to the user guide will be good as well.

}
}

let columns_to_remove = self.manifest.schema.project(columns)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think project can handle nested column references right? What happens if a nested reference is provided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will drop the nested column. There is a unit test for this:

dataset.drop_columns(&["s.d"]).await?;

@@ -2663,270 +2667,65 @@ mod tests {
metadata.clone(),
));

let struct_array_1 = Arc::new(StructArray::from(vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for simplifying these tests!

@wjones127 wjones127 merged commit 5d898a7 into lancedb:main Feb 3, 2024
13 checks passed
@wjones127 wjones127 deleted the drop-columns branch February 3, 2024 03:23
@wjones127 wjones127 restored the drop-columns branch February 3, 2024 03:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add drop_column API to remove a column from a dataset
2 participants