
Conversation

@keen85
Contributor

@keen85 keen85 commented Sep 6, 2025

This PR aims at:

  • making LakeBench run locally as well and reducing the dependency on Fabric
    • remove/centralize notebookutils and Fabric-specific code
  • preparing LakeBench to support other storage backends
    • obstore is a Python wrapper for the Rust crate object_store (the same one used by delta-rs, polars, and sail). It is now used in LakeBench to carry out file system operations (e.g. deleting temporary files), which should make it easy to make LakeBench compatible with other storage backends like AWS or GCP as well.
    • since many of the engines use the object_store Rust crate, they also support the storage_options parameter. This allows us to configure the storage once and simply pass the config dictionary on to the engines.
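To illustrate the idea, a minimal sketch of configuring storage once and reusing the dict; the helper and the key values are hypothetical, and the key names only follow the general object_store conventions:

```python
# Hypothetical sketch: build the storage configuration once, reuse it everywhere.
# The concrete keys/values are placeholders, not LakeBench's actual config.
def build_storage_options(backend: str) -> dict:
    if backend == "azure":
        # e.g. account + token for ADLS Gen2 / OneLake
        return {"azure_storage_account_name": "<account>",
                "bearer_token": "<token>"}
    if backend == "s3":
        return {"aws_region": "us-east-1"}
    return {}  # local filesystem needs no extra configuration

storage_options = build_storage_options("azure")
# The same dict can then be passed to every object_store-backed engine, e.g.
#   deltalake.DeltaTable(path, storage_options=storage_options)
#   polars.scan_delta(path, storage_options=storage_options)
```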

@keen85 keen85 marked this pull request as ready for review September 7, 2025 15:03
@keen85
Contributor Author

keen85 commented Sep 7, 2025

@mwc360 can you review this PR?

One general thing: when you close (external) PRs like mine, instead of using a fast-forward merge I'm a fan of squash commits: basically taking all the (small) commits of the PR and turning them into one large commit.
IMHO this makes the git history on the main branch easier to read, and it allows quickly reverting a feature if necessary.

@mwc360
Owner

mwc360 commented Sep 11, 2025

@keen85 - I meant to squash, will change the project to require squash for merges.

I'll take a look next week. thx!

@keen85
Contributor Author

keen85 commented Sep 23, 2025

@mwc360, have you had the chance to have a look?

@mwc360
Owner

mwc360 commented Sep 24, 2025

@keen85 - I apologize, it probably won't be till early next week. Had a few things come up recently.

@mwc360
Owner

mwc360 commented Oct 3, 2025

@keen85 - love what you've added here. I'm layering in some related changes to support local execution with flexible file paths, generalizing the naming of input paths, etc. to make it more agnostic. I'll hopefully have things ready on Monday so we can get this merged in.

@mwc360
Owner

mwc360 commented Oct 11, 2025

@keen85 - I made updates per the following:

  • engine storage references are generalized and standardized to be storage agnostic (URI instead of ABFSS)
  • I tested all engines and made a few minor changes to ensure that running locally works
  • I added methods in the BaseEngine to auto-detect the runtime and OS
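A minimal sketch of what such runtime/OS auto-detection could look like; this is an assumption in the spirit of the changes, not the actual BaseEngine code:

```python
import platform

def detect_os() -> str:
    # Returns "Linux", "Windows", or "Darwin" (macOS).
    return platform.system()

def detect_runtime() -> str:
    # Heuristic (an assumption, not the actual BaseEngine logic):
    # Fabric notebooks expose the notebookutils module; anything else
    # is treated as a plain local Python session.
    try:
        import notebookutils  # noqa: F401
        return "fabric"
    except ImportError:
        return "local"
```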

The last thing I want to do before merging this in is to update the benchmark class references to be storage agnostic and unify the mount and abfss input vars.

Glad you created this PR, this will be a big improvement, allowing LakeBench to be used in many more scenarios!

@mwc360
Owner

mwc360 commented Oct 15, 2025

@keen85 - I've moved on to testing in Fabric and I'm getting an AlreadyExistsError with fsspec (the same error you already reported). Any ideas on how to work around this?

Deleting all files works. The problem appears to be deleting the directories. Given that recursive doesn't appear to work properly, I was thinking to list the directories and then sort descending by segment length, but fs.find is returning partial paths like the below :/ (abfss:// is missing from every item after the first value).

['abfss://lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/customer',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/customer/_delta_log',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/date_dim',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/date_dim/_delta_log',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/item',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/item/_delta_log',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/store',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/store/_delta_log',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/store_sales',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/store_sales/_delta_log',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/total_sales_fact',
 'lakebench@msit-onelake.dfs.fabric.microsoft.com/lakebench.Lakehouse/Tables/duckdb_eltbench_test/total_sales_fact/_delta_log']
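A sketch of one possible workaround, assuming the paths come back exactly as in the listing above (the sample list below is abbreviated and the helper names are made up): re-add the missing scheme, then order directories deepest-first so children are removed before their parents:

```python
# Workaround sketch for the fs.find output above.
SCHEME = "abfss://"

def normalize(paths):
    # fs.find drops "abfss://" from all but the first entry; add it back.
    return [p if p.startswith(SCHEME) else SCHEME + p for p in paths]

def deepest_first(paths):
    # More "/" separators means a deeper directory, so sorting by
    # separator count descending yields children before parents.
    return sorted(paths, key=lambda p: p.count("/"), reverse=True)

found = [  # abbreviated stand-in for the listing above
    "abfss://acct/Tables/test",
    "acct/Tables/test/customer",
    "acct/Tables/test/customer/_delta_log",
]
ordered = deepest_first(normalize(found))
# Each directory could then be removed in order, e.g.
#   for d in ordered: fs.rm(d, recursive=False)
```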

@mwc360
Owner

mwc360 commented Oct 15, 2025

@keen85 - I was able to repro the issue deleting directories recursively in the adlfs library; pretty sure the underlying issue is that the recursive flag isn't passed to the storage SDK: https://github.com/fsspec/adlfs/blob/f76842b4650ba29bb3383daf80cba197fd44e8c2/adlfs/spec.py#L1283C23-L1283C55

I did a quick test with this passed in and it does appear to fix the issue. I just need to test it w/ obstore and then I'll likely submit a PR.

@keen85
Contributor Author

keen85 commented Oct 15, 2025

@mwc360 , it seems that the Rust object_store crate's behavior (and therefore also obstore's fsspec implementation) for local file systems is not very consistent with the cloud object storage backends. It also does not handle directories well in general (because in classic object storages, directories are only virtual, not actual objects as they are in Azure Data Lake Gen2).

So maybe plain fsspec is a better choice for making the file management operations backend agnostic.
The downside is that fsspec is just a specification, and there is a separate implementation for each cloud provider, like the adlfs library for Azure Storage.

So the API would be the same, but if LakeBench is supposed to support AWS S3 at some point, we'll need to add the AWS-specific fsspec implementation as well.
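To make the trade-off concrete, a sketch of keeping the provider-specific fsspec implementations behind one mapping; the helper is hypothetical, but the package names (adlfs, s3fs, gcsfs) are the commonly used fsspec implementations:

```python
# Hypothetical helper: map each URI protocol to the fsspec
# implementation package that would need to be installed for it.
FSSPEC_IMPLEMENTATIONS = {
    "abfss": "adlfs",  # Azure Data Lake Storage Gen2
    "abfs": "adlfs",
    "s3": "s3fs",      # AWS S3
    "gs": "gcsfs",     # Google Cloud Storage
    "file": None,      # local paths are handled by fsspec itself
}

def required_package(uri: str):
    # Plain paths without a scheme are treated as local files.
    protocol = uri.split("://", 1)[0] if "://" in uri else "file"
    return FSSPEC_IMPLEMENTATIONS.get(protocol)
```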

@mwc360
Owner

mwc360 commented Oct 16, 2025

@keen85 - ok take a look at what I did for the filesystem support: c325b22

I also standardized and simplified benchmark parameter inputs to remove abfss references and no longer delineate between mount and abfss vars. I tested everything locally and verified that things run in Fabric Spark/Python.

Unless you have any feedback on my changes or edits, I think this is good to merge in. I'd say the last thing you should probably do is update the PR description to note that a combination of fsspec and notebookutils is temporarily being used due to the current issues with obstore.

@mwc360 mwc360 mentioned this pull request Oct 16, 2025
@mwc360 mwc360 merged commit 0efbe65 into mwc360:main Oct 16, 2025
mwc360 added a commit that referenced this pull request Oct 17, 2025
Co-authored-by: keen85 <keen85@users.noreply.github.com>
Co-authored-by: mwc360 <mwc360@users.noreply.github.com>
mwc360 added a commit that referenced this pull request Oct 17, 2025
Co-authored-by: Miles Cole <52209784+mwc360@users.noreply.github.com>
Co-authored-by: Martin <29750255+keen85@users.noreply.github.com>