[Databricks] Supporting OAuth & Serverless compute #127
Conversation
…ng to SDK for auth. This should enable full OAuth support.
R/databricks-utils.R
Outdated
```r
# # Checks for OAuth Databricks token inside the RStudio API
# if (is.null(token) && exists(".rs.api.getDatabricksToken")) {
#   getDatabricksToken <- get(".rs.api.getDatabricksToken")
#   token <- set_names(getDatabricksToken(databricks_host()), "oauth")
# }
```
This should be handled by SDK config component.
Hey, are we talking about this SDK? https://github.com/databricks/databricks-sdk-py/ And if so, can you point me to where it handles the RStudio token? I can't seem to find it.
The SDK won't detect the `.rs.api.getDatabricks*` functions, but maybe there's a gap in my understanding; I thought Connect would also write to a config file, which the SDK should pick up?
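For reference, a minimal sketch (via reticulate; the profile name is illustrative) of the config-file path being described here: if Connect/Workbench writes a profile to `~/.databrickscfg`, the Python SDK can resolve it by name without any R-side token handling.

```r
# Resolve a named profile from ~/.databrickscfg through the Python SDK;
# Config() runs the SDK's own auth-resolution logic.
db_sdk_core <- reticulate::import("databricks.sdk.core")
config <- db_sdk_core$Config(profile = "DEFAULT")
config$host  # host/token come from the SDK's resolution, not from R code
```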
edgararuiz
left a comment
Thank you for sending over this PR, it's looking great!
```r
silent <- args$silent %||% FALSE

method <- method[[1]]
token <- databricks_token(token, fail = FALSE)
```
Based on your comment on line 137, I think we should remove this line and have `token` populated only when the user passes it as an argument in the `spark_connect()` call.
My thinking for leaving this was that users explicitly setting the `DATABRICKS_TOKEN` and `DATABRICKS_HOST` vars should have those respected, since they were set explicitly. The Databricks Python SDK won't detect those when they're set from within R.
The `databricks_token()` function also looks for `CONNECT_DATABRICKS_TOKEN`, so I think it's probably important to leave that intact?
I was expecting the hierarchy to be:
1. Explicit `token`
2. `DATABRICKS_TOKEN`
3. `CONNECT_DATABRICKS_TOKEN`
4. `.rs.api.getDatabricksToken(host)`
5. Python SDK explicit setting of profile
6. Python SDK detection of `DEFAULT` profile

Where 1-4 are handled by `databricks_token()`.
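A minimal sketch of that hierarchy, based on the commented-out code earlier in the thread; the helper and function names are illustrative, not the package's actual implementation:

```r
# Return the first value that is neither NULL nor an empty string.
first_nonempty <- function(...) {
  for (x in list(...)) {
    if (!is.null(x) && nzchar(x)) return(x)
  }
  NULL
}

databricks_token_sketch <- function(token = NULL, host = NULL) {
  token <- first_nonempty(
    token,                                  # 1. explicit argument
    Sys.getenv("DATABRICKS_TOKEN"),         # 2. standard env var
    Sys.getenv("CONNECT_DATABRICKS_TOKEN")  # 3. Posit Connect env var
  )
  # 4. RStudio-managed OAuth token, only present in an RStudio session
  if (is.null(token) && exists(".rs.api.getDatabricksToken")) {
    getDatabricksToken <- get(".rs.api.getDatabricksToken")
    token <- getDatabricksToken(host)
  }
  token  # NULL here means: fall through to the Python SDK (steps 5-6)
}
```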
R/sparklyr-spark-connect.R
Outdated
```r
# sdk config
conf_args <- list(host = master)
# if token is found, propagate
# otherwise trust in sdk to detect and do what it can?
```
Yes, if we remove line 72, then this `if` makes sense to leave.
R/sparklyr-spark-connect.R
Outdated
```r
conn <- exec(databricks_session, !!!remote_args)
sdk_config <- db_sdk$core$Config(!!!conf_args)

# unsure if this is needed anymore?
```
I think we need to remove this from here, especially since we can't use `httr2:::is_hosted_session()`; a `:::` call is not allowed. Do you think this is important for the package to do if the user is on desktop? If so, what do you think about isolating it in its own exported function? Maybe `pysparklyr::databricks_desktop_login()`?
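For what it's worth, a rough sketch of what such an exported helper could look like, leaning on the Python SDK's browser-based OAuth flow; the function name and exact shape are hypothetical:

```r
# Hypothetical helper, not an existing pysparklyr API: trigger a
# browser-based OAuth login via the Python SDK on desktop sessions.
databricks_desktop_login <- function(host = NULL) {
  db_sdk_core <- reticulate::import("databricks.sdk.core")
  # "external-browser" auth should open the default browser and cache
  # the resulting OAuth token locally for later connections.
  config <- db_sdk_core$Config(host = host, auth_type = "external-browser")
  invisible(config)
}
```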
I don't think this is required; will do some testing without it.
I'm reviewing this with a clearer mind and I think we should attempt to defer as much as possible to the SDK for Databricks auth and logic. Will have an attempt.
Hey, want me to do another review, or are you still working on this?
@edgararuiz it's at the point where I'd appreciate some input if you have a spare moment. It's not finalised, but I want to ensure we agree on the direction / shape it's taking! 🙏
@zacdav-db - Looking good. Feel free to remove …
@edgararuiz not passing tests locally, seems there's an install issue. There's also one complication with using the SDK for auth everywhere: it requires the Python env to be loaded, which depends on … We probably don't want to force …
After successfully connecting to Databricks serverless compute using this PR, I tried to call `sparklyr::copy_to`, but it fails with an error message that serverless does not support "CACHE TABLE AS SELECT":

```r
sc <- sparklyr::spark_connect(
  method = "databricks_connect",
  master = databricks_host,
  serverless = TRUE,
  version = "15.1"
)
mtcars_sdf <- sparklyr::copy_to(sc, df = mtcars, name = "mtcars", overwrite = TRUE)
```
The issue you are seeing @edwardsd is due to caching limitations on serverless. I should be able to adjust the behaviour to handle it; having a look at it now.
…aching on serverless (`memory = TRUE` will be ignored).
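A minimal sketch of the behaviour that commit describes, under the assumption that the caching step can see a `serverless` flag (names illustrative):

```r
# Decide whether to honour `memory = TRUE`; serverless compute does not
# support CACHE TABLE AS SELECT, so warn and skip caching instead of failing.
resolve_memory_arg <- function(memory, serverless) {
  if (isTRUE(serverless) && isTRUE(memory)) {
    warning(
      "Serverless compute does not support `CACHE TABLE AS SELECT`; ",
      "ignoring `memory = TRUE`."
    )
    return(FALSE)
  }
  memory
}
```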
…itial install / deployment scenario. Adjusted minor logic on connection.
@edwardsd would you mind trying again to see if things work as expected? @edgararuiz I did some fairly heavy refactoring; I'd appreciate it if you could review the changes at a high level. I still need to adjust tests.
@zacdav-db based on my somewhat limited testing, things worked as expected in Posit Workbench. I did, however, have an issue connecting, using the code below, when running in Quarto:

```r
sc <- sparklyr::spark_connect(
  method = "databricks_connect",
  master = databricks_host,
  serverless = TRUE,
  version = "15.1"
)
```

The code fails because when you run Quarto it shells out to a command-line R session, which is technically no longer an RStudio session and therefore does not have access to the `.rs.api.getDatabricksToken()` function. I think you can fix this by using the SDK config with the Workbench-managed profile:

```r
db_sdk_core <- import_check("databricks.sdk.core", envname, silent = TRUE)
config <- db_sdk_core$Config(profile = "workbench")
token <- config$token
```

I think you would need to update this function in pysparklyr: https://github.com/zacdav-db/pysparklyr/blob/oauth/R/databricks-utils.R#L34
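One way that suggestion could be folded into the token lookup (this assumes `import_check()` and `envname` from pysparklyr's internals; the wrapper name is illustrative): return `NULL` instead of erroring when no `workbench` profile exists, so the other resolution methods still get a chance.

```r
# Illustrative wrapper: try the Workbench-managed profile via the Python
# SDK, falling back to NULL if the profile is missing or auth fails.
token_from_workbench_profile <- function(envname) {
  tryCatch({
    db_sdk_core <- import_check("databricks.sdk.core", envname, silent = TRUE)
    config <- db_sdk_core$Config(profile = "workbench")
    config$token
  }, error = function(e) NULL)
}
```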
@zacdav-db - …
@zacdav-db - It looks like either all, or most, of the current errors from tests (https://github.com/mlverse/pysparklyr/actions/runs/13723934962/job/38385743702#step:11:568) are due to this line: https://github.com/mlverse/pysparklyr/pull/127/files#diff-f3356851368fadfba470c2f4566fd222adeee5cc921d5bd7072c240989aadd3cR166, I think it …
…when using non-databricks connections.
Thanks @edgararuiz, I've pushed that change now. What are your thoughts regarding the Quarto credential issue on Workbench?
Merge branch 'main' into pr/127 (conflicts: R/sparklyr-spark-connect.R)
@zacdav-db - I merged testing updates from `main`.
@edgararuiz I had a crack at that, hopefully it works.
@zacdav-db - Tests are passing now. Are you going to implement the suggested solution for the Quarto/Workbench issue? It seems straightforward to cover that case.
@edgararuiz I'm having a look at that now.
My concern right now is that using the Python SDK to look for credentials is tricky when I'm not clear on how …
Thanks @zacdav-db !!!
@edwardsd We're going to open a separate issue/PR for the Quarto situation.
Not the prettiest first attempt, have added support for serverless and OAuth.
- Uses the SDK `Config` as a mechanism to connect and authenticate
- `serverless` defaults to `FALSE` and currently still requires `version` to be specified (this strictly isn't required)