
Conversation

@zacdav-db commented Oct 14, 2024

Not the prettiest first attempt; I've added support for serverless and OAuth.

  • Using the Databricks SDK and the sdkConfig as a mechanism to connect and authenticate
  • Serverless defaults to FALSE and currently still requires version to be specified (this isn't strictly required)
  • Added a boolean check to remove Spark configs when on serverless, as they can't be applied (see the sketch below)
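
A minimal sketch of that boolean check (illustrative only, not the PR's exact code; build_connect_conf is a hypothetical name):

# Drop user-supplied Spark configs when targeting serverless, since they
# can't be applied there.
build_connect_conf <- function(conf = list(), serverless = FALSE) {
  if (isTRUE(serverless)) {
    conf <- list()
  }
  conf
}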

Comment on lines 45 to 49
# # Checks for OAuth Databricks token inside the RStudio API
# if (is.null(token) && exists(".rs.api.getDatabricksToken")) {
# getDatabricksToken <- get(".rs.api.getDatabricksToken")
# token <- set_names(getDatabricksToken(databricks_host()), "oauth")
# }
@zacdav-db (Author):

This should be handled by the SDK config component.

Collaborator:

Hey, are we talking about this SDK? https://github.com/databricks/databricks-sdk-py/ And if so, can you point me to where it handles the RStudio token? I can't seem to find it

@zacdav-db (Author):

The SDK won't detect the .rs.api.getDatabricks* functions, but maybe there's a gap in my understanding; I thought connect would also write to a config file, which the SDK should pick up?
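
For reference, a rough sketch of what deferring to the SDK config could look like from R (illustrative only; assumes the databricks-sdk Python package is importable via reticulate and that credentials are discoverable on the machine):

db_sdk_core <- reticulate::import("databricks.sdk.core")
cfg <- db_sdk_core$Config()  # resolves env vars, ~/.databrickscfg profiles, etc.
cfg$host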

@edgararuiz (Collaborator) left a comment:

Thank you for sending over this PR, it's looking great!

silent <- args$silent %||% FALSE

method <- method[[1]]
token <- databricks_token(token, fail = FALSE)
Collaborator:

Based on your comment on line 137, I think we should remove this line and have token populated only when the user passes it as an argument in the spark_connect() call.

@zacdav-db (Author) commented Oct 15, 2024:

My thinking for leaving this was that users explicitly setting the DATABRICKS_TOKEN and DATABRICKS_HOST vars should have those respected, since they were set explicitly. The Databricks Python SDK won't detect those when it's done from R.

The databricks_token() function also looks for CONNECT_DATABRICKS_TOKEN, so I think it's probably important to leave that intact?

I was expecting the hierarchy to be:

  1. Explicit token
  2. DATABRICKS_TOKEN
  3. CONNECT_DATABRICKS_TOKEN
  4. .rs.api.getDatabricksToken(host)
  5. Python SDK explicit setting of profile
  6. Python SDK detection of DEFAULT profile

Where 1-4 are handled by databricks_token() (see the sketch below).
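
A rough sketch of that lookup order (illustrative only; resolve_token is a hypothetical helper, not the package's implementation):

resolve_token <- function(token = NULL, host = NULL) {
  if (!is.null(token) && nzchar(token)) return(token)      # 1. explicit token
  env_token <- Sys.getenv("DATABRICKS_TOKEN")
  if (nzchar(env_token)) return(env_token)                  # 2. DATABRICKS_TOKEN
  env_token <- Sys.getenv("CONNECT_DATABRICKS_TOKEN")
  if (nzchar(env_token)) return(env_token)                  # 3. CONNECT_DATABRICKS_TOKEN
  if (exists(".rs.api.getDatabricksToken")) {               # 4. RStudio API
    return(get(".rs.api.getDatabricksToken")(host))
  }
  NULL                                                      # 5-6. defer to the Python SDK
}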

# sdk config
conf_args <- list(host = master)
# if token is found, propagate
# otherwise trust in sdk to detect and do what it can?
Collaborator:

Yes, if we remove line 72, then it makes sense to leave this if in place.

conn <- exec(databricks_session, !!!remote_args)
sdk_config <- db_sdk$core$Config(!!!conf_args)

# unsure if this is needed anymore?
Collaborator:

I think we need to remove this from here, especially since we can't use httr2:::is_hosted_session() (a ::: is not allowed). Do you think this is important for the package to do if the user is on desktop? If so, what do you think about isolating it in its own exported function? Maybe pysparklyr::databricks_desktop_login()?

@zacdav-db (Author):

I don't think this is required; I'll do some testing without it.

@zacdav-db (Author):

I'm reviewing this with a clearer mind and I think we should attempt to defer as much as possible to the SDK for Databricks auth and logic. I'll have an attempt.

@edgararuiz (Collaborator):

Hey, want me to do another review, or are you still working on this?

@zacdav-db (Author):

@edgararuiz it's at the point where I'd appreciate some input if you have a spare moment. It's not finalised, but I want to ensure we agree on the direction / shape it's taking! 🙏

@edgararuiz (Collaborator):

@zacdav-db - Looking good. Feel free to remove sanitize_host(); we can always restore it later if needed. Also, is it passing tests locally for you?

@zacdav-db (Author):

@edgararuiz it's not passing tests locally; there seems to be an install issue.

There's also one complication with using the SDK for auth everywhere: it requires the Python env to be loaded, which depends on the version being known. The version can't be determined without the API, which, if using the SDK, requires the env.

We probably don't want to force version to be specified, but I think the only way we can defer to the SDK is to do so.

@edwardsd:

After successfully connecting to Databricks serverless compute using this PR, I tried to call sparklyr::copy_to, but it fails with an error message that serverless does not support "CACHE TABLE AS SELECT":

sc <- sparklyr::spark_connect(method     = "databricks_connect",
                              master     = databricks_host,
                              serverless = TRUE,
                              version    = "15.1")

mtcars_sdf  <- sparklyr::copy_to(sc, df = mtcars,   name = "mtcars",   overwrite = TRUE)
Error in py_call_impl(x, dots$unnamed, dots$named) : 
  pyspark.errors.exceptions.connect.AnalysisException: [NOT_SUPPORTED_WITH_SERVERLESS] CACHE TABLE AS SELECT is not supported on serverless compute. SQLSTATE: 0A000;
CacheTableAsSelect mtcars, SELECT *
<...truncated...>at#7131, wt#7132, qsec#7133, vs#7134, am#7135, gear#7136, carb#7137]
   +- SubqueryAlias sparklyr_tmp_table_b2e5c0a2_cb8e_4237_bc3d_36c41f8bdc2b
      +- View (`sparklyr_tmp_table_b2e5c0a2_cb8e_4237_bc3d_36c41f8bdc2b`, [mpg#7127, cyl#7128, disp#7129, hp#7130, drat#7131, wt#7132, qsec#7133, vs#7134, am#7135, gear#7136, carb#7137])
         +- Project [_0#7105 AS mpg#7127, _1#7106 AS cyl#7128, _2#7107 AS disp#7129, _3#7108 AS hp#7130, _4#7109 AS drat#7131, _5#7110 AS wt#7132, _6#7111 AS qsec#7133, _7#7112 AS vs#7134, _8#7113 AS am#7135, _9#7114 AS gear#7136, _10#7115 AS carb#7137]
            +- LocalRelation [_0#7105, _1#7106, _2#7107, _3#7108, _4#7109, _5#7110, _6#7111, _7#7112, _8#7113, _9#7114, _10#7115]

@zacdav-db (Author):

The issue you are seeing @edwardsd is due to caching limitations on serverless.

Should be able to adjust the behaviour to handle it; having a look at it now.
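
In the meantime, a possible workaround sketch (hypothetical usage; memory = FALSE skips sparklyr's eager cache, which is what issues the CACHE TABLE statement):

mtcars_sdf <- sparklyr::copy_to(sc, df = mtcars, name = "mtcars",
                                overwrite = TRUE, memory = FALSE)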

Zac Davies added 2 commits March 1, 2025 22:23
…aching on serverless (`memory = TRUE` will be ignored).
…itial install / deployment scenario. Adjusted minor logic on connection.
@zacdav-db (Author) commented Mar 1, 2025:

@edwardsd would you mind trying again to see if things work as expected?

@edgararuiz I did some fairly heavy refactoring - would appreciate it if you could review the changes at a high level. I still need to adjust tests.

@edwardsd commented Mar 7, 2025:

@zacdav-db based on my somewhat limited testing, things worked as expected in Posit Workbench.

I did, however, have an issue connecting, using the code below, when running in Quarto:

sc <- sparklyr::spark_connect(method     = "databricks_connect",
                              master     = databricks_host,
                              serverless = TRUE,
                              version    = "15.1")

The code fails because when you run Quarto, it shells out to a command-line R session, which is technically no longer an RStudio session and therefore does not have access to the .rs.api functions.

I think you can fix this by using the databricks.sdk.core.Config class to retrieve the Databricks OAuth token, e.g.:

db_sdk_core <- import_check("databricks.sdk.core", envname, silent = TRUE)

config <- db_sdk_core$Config(profile = "workbench")
token  <- config$token

I think you would need to update this function in pysparklyr - https://github.com/zacdav-db/pysparklyr/blob/oauth/R/databricks-utils.R#L34
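
A token obtained that way could then be passed through explicitly, e.g. (hypothetical usage, assuming spark_connect() accepts a token argument as discussed above):

sc <- sparklyr::spark_connect(method     = "databricks_connect",
                              master     = databricks_host,
                              token      = token,
                              serverless = TRUE,
                              version    = "15.1")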

@edgararuiz (Collaborator):

@zacdav-db - main had a few issues with testing, so now that those are fixed, I went ahead and merged the changes back to your branch. This should let us know which tests, if any, to change.

@edgararuiz (Collaborator):

@zacdav-db - It looks like all, or most, of the current test errors (https://github.com/mlverse/pysparklyr/actions/runs/13723934962/job/38385743702#step:11:568) are due to this line: https://github.com/mlverse/pysparklyr/pull/127/files#diff-f3356851368fadfba470c2f4566fd222adeee5cc921d5bd7072c240989aadd3cR166 - I think serverless may need to default to FALSE instead of NULL.

@zacdav-db (Author):

Thanks @edgararuiz, I've pushed that change now.

What are your thoughts regarding the Quarto credential issue on Workbench?

Merge branch 'main' into pr/127

# Conflicts:
#	R/sparklyr-spark-connect.R
@edgararuiz (Collaborator):

@zacdav-db - I merged testing updates from main, and now the test failures are due to implicitly installing the SDK, so all that's needed is to update the snapshot (https://github.com/mlverse/pysparklyr/actions/runs/13771252327/job/38513089374#step:14:466)

@zacdav-db (Author):

@edgararuiz I had a crack at that, hopefully it works.

@edgararuiz (Collaborator):

@zacdav-db - Tests are passing now. Are you going to implement the suggested solution for the Quarto/Workbench issue? It seems straightforward to cover that case.

@zacdav-db (Author):

@edgararuiz I'm having a look at that now.

@zacdav-db (Author):

My concern right now is that using the Python SDK to look for credentials is tricky when databricks_token() can be invoked in situations before the setup/install has occurred (e.g. install_databricks() uses databricks_token() to detect the runtime version of the cluster).

I'm not clear on how CONNECT_DATABRICKS_TOKEN is populated within Workbench; I feel as if that should have worked?

@edgararuiz merged commit 5aa7278 into mlverse:main on Mar 12, 2025
6 checks passed
@edgararuiz (Collaborator):

Thanks @zacdav-db !!!

@zacdav-db (Author):

@edwardsd We're going to open a separate issue/PR for the Quarto situation.
