Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: rewrite as_polars_df.data.frame #817

Merged
merged 8 commits into from
Feb 17, 2024
Merged

refactor: rewrite as_polars_df.data.frame #817

merged 8 commits into from
Feb 17, 2024

Conversation

eitsupi
Copy link
Collaborator

@eitsupi eitsupi commented Feb 17, 2024

Close #816
It used to call pl$DataFrame internally, but has been completely rewritten to not call it.

Speeds seem to be comparable, but memory usage may have been reduced.
(The bench package does not measure memory usage on the Rust side, but given that the output RPolarsDataFrame is the same, memory usage on the Rust side should not be much different.)

big_df = do.call(rbind, lapply(1:10, \(x) nycflights13::flights))

library(polars)
library(nanoarrow)

bench::mark(
  pl_DataFrame = {
    pl$DataFrame(big_df)
  },
  as_polars_df = {
    as_polars_df(big_df)
  },
  from_nanoarrow = {
    as_polars_df(as_nanoarrow_array_stream(big_df))
  },
  check = FALSE,
  min_iterations = 10L
)
#> # A tibble: 3 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pl_DataFrame   826.41ms 839.49ms     1.16     1.91MB    0.289
#> 2 as_polars_df   738.85ms 864.31ms     1.16   204.05KB    0.129
#> 3 from_nanoarrow    1.15s    1.27s     0.790   77.48MB    0.339

Created on 2024-02-17 with reprex v2.0.2

@eitsupi
Copy link
Collaborator Author

eitsupi commented Feb 17, 2024

I will update the test and documentation later.

@etiennebacher @sorhawell If you are interested, check the performance in the benchmarks above.

@etiennebacher
Copy link
Collaborator

I'll review in more details later, but I'm a bit confused about the objective of this PR. If the speed isn't affected and we don't know if the memory used decreases, then why rewrite this? Is this an intermediate step to fix #676?

@eitsupi
Copy link
Collaborator Author

eitsupi commented Feb 17, 2024

Yes, pl$DataFrame is not optimized for the purpose of scanning data.frame and has a lot of extra processing.
In the current main branch, as_polars_df depends on pl$DataFrame, but this should clearly be reversed, given that the latter has more to do, and as_polars_df should not depend on pl$DataFrame.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Feb 17, 2024

Given the simplicity of the process, it is not surprising that memory usage is reduced.

@eitsupi eitsupi marked this pull request as ready for review February 17, 2024 13:57
NEWS.md Outdated Show resolved Hide resolved
Co-authored-by: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com>
@eitsupi eitsupi merged commit d8a6e77 into main Feb 17, 2024
@eitsupi eitsupi deleted the rdf_to_rpldf branch February 17, 2024 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite as_polars_df.data.frame
2 participants