Description
Is your feature request related to a problem? Please describe.
Right now, accessing workspace storage account data from within Databricks is a little convoluted (if there is an easier way, please let me know!):
- Get the workspace storage account data lake endpoint and account key
- Put the account key in a Databricks Secret (or just use it directly in notebooks, although this is definitely not best practice).
- Access the storage account from a notebook, pulling the key out of the secret with dbutils, e.g.
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>"))
# read from the workspace storage account (container/path are placeholders)
df = spark.read.json("abfss://<container>@<storage-account-name>.dfs.core.windows.net/<path>/iot_devices.json")
Describe the solution you'd like
It would be a lot easier if the Databricks workspace service automatically created a catalog in Unity Catalog pointing to the datalake container.
This would require the following steps to be automated:
- Grant the Databricks managed identity the Storage Blob Data Contributor role on the workspace storage account.
- There is already a Databricks Credential set up to use the managed identity, so a new one does not need to be created.
- Create an External Location in Databricks Unity Catalog configured to access the datalake container in the workspace storage account (stgwsNNNN).
- Create a Catalog in Unity Catalog configured to access the external location.
- All of the above steps can be performed via Terraform included in the Databricks Workspace Service (there is a first-party Terraform provider for Databricks, and the Databricks CLI actually uses Terraform internally for much of its functionality); a rough sketch follows this list.
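As a rough illustration only of what that Terraform could look like: the resource names, the "datalake" container name, and the assumption that the workspace service's existing Terraform already defines the storage account, access connector, and storage credential are all placeholders/assumptions, not the service's actual configuration.

# Assumed to already exist in the workspace service's Terraform (names are placeholders):
#   azurerm_storage_account.stgws            - the workspace storage account (stgwsNNNN)
#   azurerm_databricks_access_connector.this - the workspace managed identity
#   databricks_storage_credential.workspace  - the existing Databricks Credential

# 1. Grant the managed identity Storage Blob Data Contributor on the storage account.
resource "azurerm_role_assignment" "workspace_storage" {
  scope                = azurerm_storage_account.stgws.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_databricks_access_connector.this.identity[0].principal_id
}

# 2. Register the datalake container as a Unity Catalog External Location,
#    reusing the existing storage credential.
resource "databricks_external_location" "datalake" {
  name            = "workspace-datalake"
  url             = "abfss://datalake@${azurerm_storage_account.stgws.name}.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.workspace.name
  depends_on      = [azurerm_role_assignment.workspace_storage]
}

# 3. Create a Catalog whose storage root is the external location.
resource "databricks_catalog" "workspace_datalake" {
  name         = "workspace_datalake"
  storage_root = databricks_external_location.datalake.url
  comment      = "Workspace datalake container, created by the workspace service"
}

The depends_on on the external location matters because Unity Catalog validates access to the container when the location is created, so the role assignment needs to be in place first.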
This would allow Databricks users to immediately start working with files/data in the datalake container.
Describe alternatives you've considered
Manually accessing the storage account via the code above works, but it is cumbersome and requires multiple manual steps and connections.
Additional context
Unity Catalog is the preferred method for Databricks to access cloud storage at this point.