feat: make LakeBench runtime and storage backend agnostic #45
Conversation

@mwc360 can you review this PR? One general thing: when you close (external) PRs like mine, instead of using a fast-forward merge I'm a fan of squash commits: basically taking all the (small) commits of the PR and combining them into one commit.

@keen85 - I meant to squash, will change the project to require squash for merges. I'll take a look next week. thx!

@mwc360, have you had the chance to have a look?

@keen85 - I apologize, it probably won't be till early next week. Had a few things come up recently.

@keen85 - love what you've added here. I'm layering in some related changes to support local execution with flexible file paths, generalizing the naming of input paths, etc., to make it more agnostic. I'll hopefully have stuff ready on Monday so we can get this merged in.

@keen85 - I made a round of updates. The last thing I want to do before merging this in is to update the benchmark class references to be storage agnostic and unify the mount and abfss input vars. Glad you created this PR; this will be a big improvement, allowing lakebench to be used in many more scenarios!

@keen85 - I've moved on to testing in Fabric and I'm hitting an issue: deleting all the files works, but the problem appears to be deleting the directories. Given that recursive deletion doesn't appear to work properly, I was thinking to list the directories and then sort descending by segment length, but fs.find is returning partial paths like the below :/ (abfss:// is missing from every item after the first value).
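
A minimal sketch of the delete-files-then-directories workaround described above, assuming a generic fsspec filesystem (`rm_tree` is a hypothetical helper name, not part of LakeBench):

```python
import fsspec

def rm_tree(fs: fsspec.AbstractFileSystem, root: str) -> None:
    """Delete all files under root, then remove directories deepest-first."""
    entries = fs.find(root, withdirs=True, detail=True)
    files = [p for p, info in entries.items() if info["type"] == "file"]
    dirs = [p for p, info in entries.items() if info["type"] == "directory"]
    for path in files:
        fs.rm_file(path)
    # Sort descending by number of path separators so child directories
    # are removed before their parents.
    for path in sorted(dirs, key=lambda p: p.count("/"), reverse=True):
        fs.rmdir(path)
```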

@keen85 - I was able to repro the issue with deleting directories recursively. I did a quick test with this passed in and it does appear to fix the issue. I just need to test it w/ obstore and then I'll likely submit a PR.

@mwc360, it seems that the rust object_store crate's behavior (and therefore also obstore's fsspec implementation) for local file systems is not very consistent with cloud object storage backends. It also does not handle directories well in general (because in classic object stores, directories are only virtual and not actual objects as they are in Azure Data Lake Gen2). So maybe the original fsspec is a better choice for making the file management operations backend agnostic. The API would stay the same, but if lakebench is supposed to support AWS S3 at some point, we'll need to add the AWS-specific fsspec implementation as well.
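
For illustration, a minimal sketch of what the plain-fsspec approach could look like; the account name and paths below are hypothetical:

```python
import fsspec

# Azure Data Lake Gen2 via the adlfs-backed "abfss" protocol
fs = fsspec.filesystem("abfss", account_name="mystorageaccount", anon=False)
fs.rm("container/lakebench/tmp", recursive=True)

# Local filesystem: same API surface, different protocol string
local_fs = fsspec.filesystem("file")
local_fs.rm("/tmp/lakebench", recursive=True)
```

An AWS S3 backend would follow the same pattern via its fsspec implementation (s3fs).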

@keen85 - ok take a look at what I did for the filesystem support: c325b22 I also standardized and simplified benchmark parameter inputs to remove abfss references and no longer delineate between mount and abfss vars. I tested everything locally and verified that things run in Fabric Spark/Python. Unless you have any feedback on my changes or edits, I think this is good to merge in. I'd say the last thing you should probably do is update the PR description to note that a combination of fsspec and notebookutils is temporarily being used due to the current issues with obstore.

Co-authored-by: keen85 <keen85@users.noreply.github.com>
Co-authored-by: mwc360 <mwc360@users.noreply.github.com>
Co-authored-by: Miles Cole <52209784+mwc360@users.noreply.github.com>
Co-authored-by: Martin <29750255+keen85@users.noreply.github.com>

This PR aims at:

- removing `notebookutils` and Fabric-specific code.
- using `obstore` for file system operations: `obstore` is a Python wrapper for the rust crate `object_store` (the same one that is used by delta-rs, polars and sail). It is now used in LakeBench to carry out file system operations (e.g. deleting temporary files), and this way it should be easy to make LakeBench also compatible with other storage backends like AWS or GCP.
- passing a single `storage_options` dictionary to the engines: since delta-rs, polars and sail build on the `object_store` rust crate, the `storage_options` parameter is supported by them as well. This allows us to configure the storage connection once and only pass the config dictionary on to the engines (see the sketch below).
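
To illustrate the shared `storage_options` idea, a minimal sketch assuming delta-rs and polars as the engines; the URI and credential values are hypothetical, and the short `object_store` key aliases are assumed to be accepted:

```python
import polars as pl
from deltalake import DeltaTable

# Configure storage once; both engines forward these options to the
# underlying object_store crate.
storage_options = {
    "account_name": "mystorageaccount",
    "tenant_id": "<tenant-id>",
    "client_id": "<client-id>",
    "client_secret": "<client-secret>",
}
table_uri = "abfss://data@mystorageaccount.dfs.core.windows.net/lineitem"

dt = DeltaTable(table_uri, storage_options=storage_options)
lf = pl.scan_delta(table_uri, storage_options=storage_options)
```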