Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds support for scanning parquet from GCP #1056

Merged
merged 16 commits into from
Apr 26, 2024

Conversation

andyquinterom
Copy link
Contributor

Related to #1054

This should support GCP parquet scanning.

I tested this snippet and it works perfectly.

library(polars)
library(dplyr)
library(tidypolars)

iris_parquet <- pl$scan_parquet(
  "gs://path/to/*.parquet",
  cloud_options = c("service_account" = "./service_account.json")
)

iris_parquet |>
  select(starts_with(c("Sep", "Pet"))) |> 
  mutate(
    petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
  ) |>
  filter(between(Sepal.Length, 4.5, 5.5)) |> 
  head() |>
  compute()

Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andyquinterom, I don't have any experience with interacting directly with cloud providers from R but this looks nice. I also can't compile on Windows for now, I have the same errors as in some of the CI. I'll explore a bit why.

In the meantime, I added a few minor comments. I would also like to know how we can test this. Is there a way to mock the connection to GCP and other services?

src/rust/Cargo.toml Outdated Show resolved Hide resolved
src/rust/src/rdataframe/read_parquet.rs Outdated Show resolved Hide resolved
src/rust/src/utils/mod.rs Outdated Show resolved Hide resolved
man/S3_as_arrow_table.Rd Outdated Show resolved Hide resolved
@andyquinterom
Copy link
Contributor Author

Thanks @andyquinterom, I don't have any experience with interacting directly with cloud providers from R but this looks nice. I also can't compile on Windows for now, I have the same errors as in some of the CI. I'll explore a bit why.

In the meantime, I added a few minor comments. I would also like to know how we can test this. Is there a way to mock the connection to GCP and other services?

Thanks for taking the time to read my submission.

I'll fix some details you commented, it was late last night and things could be improved.

I'll check the compilation errors in windows as well to see if I find anything.

On the mocking side, this functionality is fully on the Rust polars-io crate. I don't know if we could truly mock it or rely on the rust testing implementation. I will see how the Python layer is tested for an idea.

@andyquinterom
Copy link
Contributor Author

@etiennebacher I just pushed some commits that should fix the issues you mentioned

@andyquinterom
Copy link
Contributor Author

@etiennebacher It seems the CI fail algo happens on main. https://github.com/pola-rs/r-polars/actions/runs/8803923620/job/24163176648

@etiennebacher
Copy link
Collaborator

One failure is due to extendr and the R-devel changes upstream, but the others are specific to this PR, e.g https://github.com/pola-rs/r-polars/actions/runs/8803957782/job/24163425862?pr=1056

@andyquinterom
Copy link
Contributor Author

andyquinterom commented Apr 23, 2024

One failure is due to extendr and the R-devel changes upstream, but the others are specific to this PR, e.g https://github.com/pola-rs/r-polars/actions/runs/8803957782/job/24163425862?pr=1056

@etiennebacher
Just pushed a change adding -lcrypt32 to the windows linker arguments. Let's see if this fixes the issue

@andyquinterom
Copy link
Contributor Author

@etiennebacher Seems like the CI issues are solved.

@etiennebacher
Copy link
Collaborator

I still see a similar error here: https://github.com/pola-rs/r-polars/actions/runs/8805357520/job/24168950149?pr=1056#step:9:335

On the mocking side, this functionality is fully on the Rust polars-io crate. I don't know if we could truly mock it or rely on the rust testing implementation. I will see how the Python layer is tested for an idea.

Ideally we want to test our implementation as well. They take care of all the details but we want to ensure that our R binding is correct. I don't know whether they can mock this in python but if they do, it would be great to mimic it.

@andyquinterom
Copy link
Contributor Author

andyquinterom commented Apr 23, 2024

@etiennebacher

I'm looking into the tests it seems like during testing an S3 bucket is created.

https://github.com/pola-rs/polars/blob/ba85175a830601ba622f0cf3615e2297c70b9e96/py-polars/tests/unit/io/cloud/test_aws.py

This would make it pretty difficult to test unless we get some credential for creating and uploading files to S3.

@andyquinterom
Copy link
Contributor Author

@etiennebacher
Copy link
Collaborator

Thanks, I can compile too. For testing, I think we can use some S3/GCS buckets that are publicly available, like those in this arrow vignette. This is not ideal because we rely on something external but I don't have an alternative for now. There might be more ideas in the arrow test suite.

However, the following doesn't work for me and I don't know why:

library(arrow)
arrow_with_gcs()
#> [1] TRUE

arrow::read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet") |> 
  head()
#>   carat color clarity depth table price    x    y    z
#> 1  0.23     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 2  0.31     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 3  0.30     J     SI1  64.0    55   339 4.25 4.28 2.73
#> 4  0.30     J     SI1  63.4    54   351 4.23 4.29 2.70
#> 5  0.30     J     SI1  63.8    56   351 4.23 4.26 2.71
#> 6  0.30     I     SI2  63.3    56   351 4.26 4.30 2.71

pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")

#> Error: Execution halted with the following contexts
#>    0: In R: in pl$scan_parquet():
#>    0: During function call [pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")]
#>    1: Encountered the following error in Rust-Polars:
#>       	Generic GCS error: Error performing token request: Error after 2 retries in 3.3478343s, max_retries:2, retry_timeout:10s, 
#>            source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): 
#>            error trying to connect: tcp connect error: deadline has elapsed

@eitsupi do you have some ideas about this?

@eitsupi
Copy link
Collaborator

eitsupi commented Apr 24, 2024

I think Polars does not support such addresses, because I get an error when I try that string in Python Polars.

@andyquinterom
Copy link
Contributor Author

Thanks, I can compile too. For testing, I think we can use some S3/GCS buckets that are publicly available, like those in this arrow vignette. This is not ideal because we rely on something external but I don't have an alternative for now. There might be more ideas in the arrow test suite.

However, the following doesn't work for me and I don't know why:

library(arrow)

arrow_with_gcs()

#> [1] TRUE



arrow::read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet") |> 

  head()

#>   carat color clarity depth table price    x    y    z

#> 1  0.23     E     VS1  56.9    65   327 4.05 4.07 2.31

#> 2  0.31     J     SI2  63.3    58   335 4.34 4.35 2.75

#> 3  0.30     J     SI1  64.0    55   339 4.25 4.28 2.73

#> 4  0.30     J     SI1  63.4    54   351 4.23 4.29 2.70

#> 5  0.30     J     SI1  63.8    56   351 4.23 4.26 2.71

#> 6  0.30     I     SI2  63.3    56   351 4.26 4.30 2.71



pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")



#> Error: Execution halted with the following contexts

#>    0: In R: in pl$scan_parquet():

#>    0: During function call [pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")]

#>    1: Encountered the following error in Rust-Polars:

#>       	Generic GCS error: Error performing token request: Error after 2 retries in 3.3478343s, max_retries:2, retry_timeout:10s, 

#>            source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): 

#>            error trying to connect: tcp connect error: deadline has elapsed

@eitsupi do you have some ideas about this?

This also happens on the Rust side, if we don't have a Service Account given it will error out since it doesn't know where to look for the metadata of the bucket. GCP buckets seem to not be tested in the Python side.

The only way to fix this "issue" would be to run the GCP test suite on GCP servers or use a service account.

@eitsupi
Copy link
Collaborator

eitsupi commented Apr 24, 2024

I know of a server product that simulates such a GCS, but I don't know if polars can connect to this (and if it is even necessary to test with it).
https://github.com/fsouza/fake-gcs-server

At the moment I do not have the time to test this.

@andyquinterom
Copy link
Contributor Author

If I provide a service account it works with the dataset in the arrow vignette.

library(polars)
library(dplyr)
library(tidypolars)

pl$scan_parquet(
  "gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet",
  cloud_options = c(
    service_account = "./service_account.json"
  )
)$collect()
# shape: (4_906, 10)
# ┌───────┬───────┬─────────┬───────┬───┬──────┬──────┬──────┬──────┐
# │ carat ┆ color ┆ clarity ┆ depth ┆ … ┆ x    ┆ y    ┆ z    ┆ cut  │
# │ ---   ┆ ---   ┆ ---     ┆ ---   ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
# │ f64   ┆ cat   ┆ cat     ┆ f64   ┆   ┆ f64  ┆ f64  ┆ f64  ┆ str  │
# ╞═══════╪═══════╪═════════╪═══════╪═══╪══════╪══════╪══════╪══════╡
# │ 0.23  ┆ E     ┆ VS1     ┆ 56.9  ┆ … ┆ 4.05 ┆ 4.07 ┆ 2.31 ┆ Good │
# │ 0.31  ┆ J     ┆ SI2     ┆ 63.3  ┆ … ┆ 4.34 ┆ 4.35 ┆ 2.75 ┆ Good │
# │ 0.3   ┆ J     ┆ SI1     ┆ 64.0  ┆ … ┆ 4.25 ┆ 4.28 ┆ 2.73 ┆ Good │
# │ 0.3   ┆ J     ┆ SI1     ┆ 63.4  ┆ … ┆ 4.23 ┆ 4.29 ┆ 2.7  ┆ Good │
# │ 0.3   ┆ J     ┆ SI1     ┆ 63.8  ┆ … ┆ 4.23 ┆ 4.26 ┆ 2.71 ┆ Good │
# │ …     ┆ …     ┆ …       ┆ …     ┆ … ┆ …    ┆ …    ┆ …    ┆ …    │
# │ 0.8   ┆ G     ┆ VS2     ┆ 64.2  ┆ … ┆ 5.84 ┆ 5.81 ┆ 3.74 ┆ Good │
# │ 0.84  ┆ I     ┆ VS1     ┆ 63.7  ┆ … ┆ 5.94 ┆ 5.9  ┆ 3.77 ┆ Good │
# │ 0.74  ┆ D     ┆ SI1     ┆ 63.1  ┆ … ┆ 5.71 ┆ 5.74 ┆ 3.61 ┆ Good │
# │ 0.79  ┆ F     ┆ SI1     ┆ 58.1  ┆ … ┆ 6.06 ┆ 6.13 ┆ 3.54 ┆ Good │
# │ 0.72  ┆ D     ┆ SI1     ┆ 63.1  ┆ … ┆ 5.69 ┆ 5.75 ┆ 3.61 ┆ Good │
# └───────┴───────┴─────────┴───────┴───┴──────┴──────┴──────┴──────┘

I guess I could try to create a fake service account that is enough to have the Rust side of things to the request as it needs. I will try to look into the Rust code to see if there are improvements to be made however, that would take quite some time to dig into.

@andyquinterom
Copy link
Contributor Author

@etiennebacher I would also be willing to create a dummy service account, but I am not very confident that's outside of what the project would be comfortable doing due to possible security concerns.

@etiennebacher
Copy link
Collaborator

etiennebacher commented Apr 24, 2024

The only way to fix this "issue" would be to run the GCP test suite on GCP servers or use a service account.

How does the arrow package deal with that? They have tests for GCP if I'm not mistaken.

I would also be willing to create a dummy service account

I don't think that's the way to go. I suppose the ideal solution would be for rust-polars to support this kind of address but I don't know if they plan to do it (and I won't request this there because I never use this type of service).

For now, let's just mention that this argument is experimental but eventually we'll need some way to test this. I'll review more properly tomorrow but you can already rename this arg from cloud_options to storage_options

Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second pass of comments. Can you also add a bullet point in NEWS that mentions this is experimental?

R/io_parquet.R Outdated Show resolved Hide resolved
R/io_parquet.R Outdated Show resolved Hide resolved
src/rust/src/rdatatype.rs Outdated Show resolved Hide resolved
@andyquinterom
Copy link
Contributor Author

@etiennebacher Just pushed the changes. I tried to stay almost as similar as possible to the python documentation. Tell me if you need anything more done.

Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I just added a news item. Actually I see this PR adds this arg only for pl$scan_parquet(), but it should also be in pl$read_parquet(). Can you add it too? Sorry, I forgot that earlier

@andyquinterom
Copy link
Contributor Author

Thanks, I just added a news item. Actually I see this PR adds this arg only for pl$scan_parquet(), but it should also be in pl$read_parquet(). Can you add it too? Sorry, I forgot that earlier

@etiennebacher I did not change the read_parquet function. I have only validated functionality on scan_parquet. I see the implementation is a simple collect over a scan so I will get to it once I have some time

@etiennebacher
Copy link
Collaborator

@andyquinterom I added the arg in read_parquet so that I can merge once the ci passes. Many thanks for your contribution!

@andyquinterom
Copy link
Contributor Author

@etiennebacher I just noticed your comment. I had just pushed the changes as well :/

Tell me if you want me to revert the changes or if they are ok as they are?

@etiennebacher
Copy link
Collaborator

I'll wrap up the PR, you can let it as is (the details section is missing in the docs of read_parquet)

@etiennebacher etiennebacher merged commit ec1eda1 into pola-rs:main Apr 26, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants