
Conversation

@zacdav-db commented Oct 14, 2024

Not the prettiest first attempt; I've added support for serverless and OAuth.

  • Using the Databricks SDK and the sdkConfig as a mechanism to connect and authenticate
  • Serverless defaults to FALSE and currently still requires version to be specified (this isn't strictly required)
  • Added a boolean check to remove Spark configs when on serverless, as they can't be applied (see the sketch below)
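
A minimal sketch of that boolean check (illustrative only, not the PR's exact code; build_connect_conf is a hypothetical name):

# Drop user-supplied Spark configs when targeting serverless, since they
# can't be applied there.
build_connect_conf <- function(conf = list(), serverless = FALSE) {
  if (isTRUE(serverless)) {
    conf <- list()
  }
  conf
}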

Comment on lines 45 to 49
# # Checks for OAuth Databricks token inside the RStudio API
# if (is.null(token) && exists(".rs.api.getDatabricksToken")) {
# getDatabricksToken <- get(".rs.api.getDatabricksToken")
# token <- set_names(getDatabricksToken(databricks_host()), "oauth")
# }
@zacdav-db (Author):

This should be handled by the SDK config component.

Collaborator:

Hey, are we talking about this SDK? https://github.com/databricks/databricks-sdk-py/ And if so, can you point me to where it handles the RStudio token? I can't seem to find it

@zacdav-db (Author):

The SDK won't detect the .rs.api.getDatabricks* functions, but maybe there's a gap in my understanding; I thought connect would also write to a config file, which the SDK should pick up?
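
For reference, a rough sketch of what deferring to the SDK config could look like from R (illustrative only; assumes the databricks-sdk Python package is importable via reticulate and that credentials are discoverable on the machine):

db_sdk_core <- reticulate::import("databricks.sdk.core")
cfg <- db_sdk_core$Config()  # resolves env vars, ~/.databrickscfg profiles, etc.
cfg$host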

@edgararuiz (Collaborator) left a comment:

Thank you for sending over this PR, it's looking great!

silent <- args$silent %||% FALSE

method <- method[[1]]
token <- databricks_token(token, fail = FALSE)
Collaborator:

Based on your comment on line 137, I think we should remove this line and have token populated only when the user passes it as an argument in the spark_connect() call.

@zacdav-db (Author) commented Oct 15, 2024:

My thinking for leaving this was that users explicitly setting the DATABRICKS_TOKEN and DATABRICKS_HOST vars should have those respected, since they were set explicitly. The Databricks Python SDK won't detect those when it's done from R.

The databricks_token() function also looks for CONNECT_DATABRICKS_TOKEN, so I think it's probably important to leave that intact?

I was expecting the hierarchy to be:

  1. Explicit token
  2. DATABRICKS_TOKEN
  3. CONNECT_DATABRICKS_TOKEN
  4. .rs.api.getDatabricksToken(host)
  5. Python SDK explicit setting of profile
  6. Python SDK detection of DEFAULT profile

Where 1-4 are handled by databricks_token() (see the sketch below).
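
A rough sketch of that lookup order (illustrative only; resolve_token is a hypothetical helper, not the package's implementation):

resolve_token <- function(token = NULL, host = NULL) {
  if (!is.null(token) && nzchar(token)) return(token)      # 1. explicit token
  env_token <- Sys.getenv("DATABRICKS_TOKEN")
  if (nzchar(env_token)) return(env_token)                  # 2. DATABRICKS_TOKEN
  env_token <- Sys.getenv("CONNECT_DATABRICKS_TOKEN")
  if (nzchar(env_token)) return(env_token)                  # 3. CONNECT_DATABRICKS_TOKEN
  if (exists(".rs.api.getDatabricksToken")) {               # 4. RStudio API
    return(get(".rs.api.getDatabricksToken")(host))
  }
  NULL                                                      # 5-6. defer to the Python SDK
}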

# sdk config
conf_args <- list(host = master)
# if token is found, propagate
# otherwise trust in sdk to detect and do what it can?
Collaborator:

Yes, if we remove line 72, then it makes sense to leave this if in place.

conn <- exec(databricks_session, !!!remote_args)
sdk_config <- db_sdk$core$Config(!!!conf_args)

# unsure if this is needed anymore?
Collaborator:

I think we need to remove this from here, especially since we can't use httr2:::is_hosted_session() (a ::: is not allowed). Do you think this is important for the package to do if the user is on desktop? If so, what do you think about isolating it in its own exported function? Maybe pysparklyr::databricks_desktop_login()?

@zacdav-db (Author):

I don't think this is required; I'll do some testing without it.

@zacdav-db (Author):

I'm reviewing this with a clearer mind and I think we should attempt to defer as much as possible to the SDK for Databricks auth and logic. I'll have an attempt.

@edgararuiz (Collaborator):

Hey, want me to do another review, or are you still working on this?

@zacdav-db (Author):

@edgararuiz it's at the point where I'd appreciate some input if you have a spare moment. It's not finalised, but I want to ensure we agree on the direction / shape it's taking! 🙏

@edgararuiz (Collaborator):

@zacdav-db - Looking good. Feel free to remove sanitize_host(); we can always restore it later if needed. Also, is it passing tests locally for you?

@zacdav-db (Author):

@edgararuiz it's not passing tests locally; there seems to be an install issue.

There's also one complication with using the SDK for auth everywhere: it requires the Python env to be loaded, which depends on the version being known. The version can't be determined without the API, which, if using the SDK, requires the env.

We probably don't want to force version to be specified, but I think the only way we can defer to the SDK is to do so.

@edwardsd:

After successfully connecting to Databricks serverless compute using this PR, I tried to call sparklyr::copy_to, but it fails with an error message that serverless does not support "CACHE TABLE AS SELECT":

sc <- sparklyr::spark_connect(method     = "databricks_connect",
                              master     = databricks_host,
                              serverless = TRUE,
                              version    = "15.1")

mtcars_sdf  <- sparklyr::copy_to(sc, df = mtcars,   name = "mtcars",   overwrite = TRUE)
Error in py_call_impl(x, dots$unnamed, dots$named) : 
  pyspark.errors.exceptions.connect.AnalysisException: [NOT_SUPPORTED_WITH_SERVERLESS] CACHE TABLE AS SELECT is not supported on serverless compute. SQLSTATE: 0A000;
CacheTableAsSelect mtcars, SELECT *
<...truncated...>at#7131, wt#7132, qsec#7133, vs#7134, am#7135, gear#7136, carb#7137]
   +- SubqueryAlias sparklyr_tmp_table_b2e5c0a2_cb8e_4237_bc3d_36c41f8bdc2b
      +- View (`sparklyr_tmp_table_b2e5c0a2_cb8e_4237_bc3d_36c41f8bdc2b`, [mpg#7127, cyl#7128, disp#7129, hp#7130, drat#7131, wt#7132, qsec#7133, vs#7134, am#7135, gear#7136, carb#7137])
         +- Project [_0#7105 AS mpg#7127, _1#7106 AS cyl#7128, _2#7107 AS disp#7129, _3#7108 AS hp#7130, _4#7109 AS drat#7131, _5#7110 AS wt#7132, _6#7111 AS qsec#7133, _7#7112 AS vs#7134, _8#7113 AS am#7135, _9#7114 AS gear#7136, _10#7115 AS carb#7137]
            +- LocalRelation [_0#7105, _1#7106, _2#7107, _3#7108, _4#7109, _5#7110, _6#7111, _7#7112, _8#7113, _9#7114, _10#7115]

@zacdav-db (Author):

The issue you are seeing @edwardsd is due to caching limitations on serverless.

Should be able to adjust the behaviour to handle it; having a look at it now.
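
In the meantime, a possible workaround sketch (hypothetical usage; memory = FALSE skips sparklyr's eager cache, which is what issues the CACHE TABLE statement):

mtcars_sdf <- sparklyr::copy_to(sc, df = mtcars, name = "mtcars",
                                overwrite = TRUE, memory = FALSE)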

Zac Davies added 2 commits March 1, 2025 22:23
…aching on serverless (`memory = TRUE` will be ignored).
…itial install / deployment scenario. Adjusted minor logic on connection.
@zacdav-db (Author) commented Mar 1, 2025:

@edwardsd would you mind trying again to see if things work as expected?

@edgararuiz I did some fairly heavy refactoring - would appreciate it if you could review the changes at a high level. I still need to adjust tests.

@edwardsd commented Mar 7, 2025:

@zacdav-db based on my somewhat limited testing, things worked as expected in Posit Workbench.

I did, however, have an issue connecting, using the code below, when running in Quarto:

sc <- sparklyr::spark_connect(method     = "databricks_connect",
                              master     = databricks_host,
                              serverless = TRUE,
                              version    = "15.1")

The code fails because when you run Quarto, it shells out to a command-line R session, which is technically no longer an RStudio session and therefore does not have access to the .rs.api functions.

I think you can fix this by using the databricks.sdk.core.Config class to retrieve the Databricks OAuth token, e.g.:

db_sdk_core <- import_check("databricks.sdk.core", envname, silent = TRUE)

config <- db_sdk_core$Config(profile = "workbench")
token  <- config$token

I think you would need to update this function in pysparklyr - https://github.com/zacdav-db/pysparklyr/blob/oauth/R/databricks-utils.R#L34
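
A token obtained that way could then be passed through explicitly, e.g. (hypothetical usage, assuming spark_connect() accepts a token argument as discussed above):

sc <- sparklyr::spark_connect(method     = "databricks_connect",
                              master     = databricks_host,
                              token      = token,
                              serverless = TRUE,
                              version    = "15.1")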

@edgararuiz (Collaborator):

@zacdav-db - main had a few issues with testing, so now that those are fixed, I went ahead and merged the changes back to your branch. This should let us know which tests, if any, to change.

@edgararuiz (Collaborator):

@zacdav-db - It looks like all, or most, of the current test errors (https://github.com/mlverse/pysparklyr/actions/runs/13723934962/job/38385743702#step:11:568) are due to this line: https://github.com/mlverse/pysparklyr/pull/127/files#diff-f3356851368fadfba470c2f4566fd222adeee5cc921d5bd7072c240989aadd3cR166 - I think serverless may need to default to FALSE instead of NULL.

@zacdav-db (Author):

Thanks @edgararuiz, I've pushed that change now.

What are your thoughts regarding the Quarto credential issue on Workbench?

Merge branch 'main' into pr/127

# Conflicts:
#	R/sparklyr-spark-connect.R
@edgararuiz (Collaborator):

@zacdav-db - I merged testing updates from main, and now the test failures are due to implicitly installing the SDK, so all that's needed is to update the snapshot (https://github.com/mlverse/pysparklyr/actions/runs/13771252327/job/38513089374#step:14:466)

@zacdav-db (Author):

@edgararuiz I had a crack at that, hopefully it works.

@edgararuiz (Collaborator):

@zacdav-db - Tests are passing now. Are you going to implement the suggested solution for the Quarto/Workbench issue? It seems straightforward to cover that case.

@zacdav-db (Author):

@edgararuiz I'm having a look at that now.

@zacdav-db (Author):

My concern right now is that using the Python SDK to look for credentials is tricky when databricks_token() can be invoked in situations before the setup/install has occurred (e.g. install_databricks() uses databricks_token() to detect the runtime version of the cluster).

I'm not clear on how CONNECT_DATABRICKS_TOKEN is populated within Workbench; I feel as if that should have worked?

@edgararuiz merged commit 5aa7278 into mlverse:main on Mar 12, 2025
6 checks passed
@edgararuiz (Collaborator):

Thanks @zacdav-db !!!

@zacdav-db (Author):

@edwardsd We're going to open a separate issue/PR for the Quarto situation.
