
Deploying the Databricks Workspace Service should automatically set up a Unity Catalog pointing to the workspace storage account datalake container #4488

@ericmeans3cloud

Description


Is your feature request related to a problem? Please describe.
Right now accessing workspace storage account data from within Databricks is a little convoluted (if there is an easier way please let me know!):

  1. Get the workspace storage account's data lake endpoint and account key.
  2. Put the account key in a Databricks secret (or use it directly in notebooks, although that is definitely not best practice). A Terraform sketch for this step follows the code below.
  3. Set the account key in the Spark config (pulling it from the secret via dbutils) and access the storage account directly, e.g.
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
  dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>"))

df = spark.read.json("abfss://<container>@<storage-account-name>.dfs.core.windows.net/<path-to-data>.json")
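
If you want to script step 2 rather than paste the key by hand, a minimal Terraform sketch using the databricks provider could look like the following (the resource names, and the assumption that the workspace storage account is exposed as azurerm_storage_account.ws, are illustrative, not part of the service today):

# Hypothetical wiring for step 2 of the workaround: store the account key
# in a Databricks secret scope so notebooks never handle the raw key.
resource "databricks_secret_scope" "ws_storage" {
  name = "<scope-name>" # must match the scope used in the notebook above
}

resource "databricks_secret" "account_key" {
  scope        = databricks_secret_scope.ws_storage.name
  key          = "<storage-account-access-key-name>" # must match the notebook lookup
  string_value = azurerm_storage_account.ws.primary_access_key
}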

Describe the solution you'd like
It would be a lot easier if the Databricks workspace service automatically created a Unity Catalog external location and catalog pointing to the datalake container.

This would require the following steps to be automated:

  1. Grant the Databricks managed identity Storage Blob Data Contributor on the workspace storage account.
  2. Reuse the Databricks storage credential that is already set up to use the managed identity (no new credential needs to be created).
  3. Create an External Location in Databricks Unity Catalog configured to access the datalake container in the workspace storage account (stgwsNNNN).
  4. Create a Catalog in Unity Catalog whose storage root is that external location.
  5. Perform all of the above via Terraform included in the Databricks Workspace Service; there is a first-party Terraform provider for Databricks (the Databricks CLI itself uses Terraform internally for most of its functionality). A sketch follows this list.
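
For illustration, a rough sketch of that Terraform using the first-party azurerm and databricks providers (the resource names, the datalake container name, and the reference to the existing credential as databricks_storage_credential.existing are assumptions about the service's internal layout):

# Step 1: grant the access connector's managed identity data-plane access
# to the workspace storage account.
resource "azurerm_role_assignment" "dbx_datalake" {
  scope                = azurerm_storage_account.ws.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_databricks_access_connector.ws.identity[0].principal_id
}

# Step 3: an external location over the datalake container, reusing the
# existing managed-identity storage credential from step 2.
resource "databricks_external_location" "datalake" {
  name            = "workspace-datalake"
  url             = "abfss://datalake@${azurerm_storage_account.ws.name}.dfs.core.windows.net/"
  credential_name = databricks_storage_credential.existing.name
  comment         = "Workspace storage account datalake container"
  depends_on      = [azurerm_role_assignment.dbx_datalake]
}

# Step 4: a catalog whose managed storage lives in that external location.
resource "databricks_catalog" "workspace" {
  name         = "workspace"
  storage_root = databricks_external_location.datalake.url
  comment      = "Auto-created catalog over the workspace datalake container"
}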

This would allow Databricks users to immediately start working with files/data in the datalake container.
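
Taking "immediately" literally, the automation could also grant workspace users access to the new catalog; a sketch, where the "account users" principal and the privilege list are illustrative choices:

# Optional: let all account-level users query the auto-created catalog.
resource "databricks_grants" "workspace_catalog" {
  catalog = databricks_catalog.workspace.name
  grant {
    principal  = "account users"
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}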

Describe alternatives you've considered
Manually accessing the storage account via the code above works, but it is cumbersome and requires multiple manual steps and connections.

Additional context

Unity Catalog is the preferred method for Databricks to access cloud storage at this point.
