Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame: Simple translations #105

Merged
merged 27 commits into from
Apr 14, 2023
Merged

Conversation

vincentarelbundock
Copy link
Collaborator

@vincentarelbundock vincentarelbundock commented Apr 12, 2023

Translated and tested:

  • DataFrame: max, mean, median, min, sum, var, std, first, last, head, tail, reverse, slice, null_count, estimated_size
  • LazyFrame: max, mean, median, min, sum, var, std, first, last, head, tail, reverse, slice
  • GroupBy: max, mean, median, min, sum, var, std, first, last, null_count

TODO:

  • .quantile: QuantileInterpolOptions input
  • is_duplicated(): How do we return a ChunkedArray to R output?

@vincentarelbundock vincentarelbundock changed the title Translation DataFrame: Simple translations Apr 12, 2023
@vincentarelbundock vincentarelbundock changed the title DataFrame: Simple translations [WIP] DataFrame: Simple translations Apr 12, 2023
@eitsupi eitsupi marked this pull request as draft April 12, 2023 12:43
@sorhawell
Copy link
Collaborator

sorhawell commented Apr 12, 2023

Interesting that DataFrame.first and DataFrame.last is not implemented for py-polars, but they are in rust-polars just as you wrap them.

@sorhawell
Copy link
Collaborator

sorhawell commented Apr 12, 2023

How should I re-write the slice function below to make length argument optional (None)?

you can do like this (inspired by py-polars wrapper )

fn slice(&self, offset: Robj, length: Robj) -> Result<LazyFrame, String> {
        Ok(LazyFrame(self.0.clone().slice(
            robj_to!(i64, offset)?,
            robj_to!(Option, u32, length)?.unwrap_or(u32::MAX),
        )))
    }

when I get lost in type conversions, I find it helpful to break up the lines and let rust-analyzer explain what types I have currently made

image

@sorhawell
Copy link
Collaborator

Missing robj_to_u8 makes rob_to fail when I try to implement DataFrame.std()

you can cut a corner and use the extendr conversion like I did here in the beginning.

we can make this PR into a co-op and then I can add the u8 conversion OR just use extendr in this PR, and I file an issue to upgrade u8 conversions.

@sorhawell
Copy link
Collaborator

sorhawell commented Apr 12, 2023

Could you supply a minimal example of implementing something like .max() on DataFrameGroupBy?

here's a minimal example + tidying up the GroupBy class

after zzz.R is updated you can do like this

GroupBy_max = function() {
  self$agg(pl$all()$max())
}

and use case from py-polars

> df = pl$DataFrame(
+         a = c(1, 2, 2, 3, 4, 5),
+         b = c(0.5, 0.5, 4, 10, 13, 14),
+         c = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
+         d = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
+ )
> df$groupby("d", maintain_order=TRUE)$max()
polars DataFrame: shape: (3, 4)
┌────────┬─────┬──────┬──────┐
│ dabc    │
│ ------------  │
│ strf64f64bool │
╞════════╪═════╪══════╪══════╡
│ Apple3.010.0true │
│ Orange2.00.5true │
│ Banana5.014.0true │
└────────┴─────┴──────┴──────┘

@vincentarelbundock
Copy link
Collaborator Author

Thanks for all the extra info! Everything seems to have worked without a hitch.

I used the u8 shortcut for now.

I'll look at a few more candidates for translation and let you know when I think this PR is ready for a review.

@vincentarelbundock vincentarelbundock marked this pull request as ready for review April 13, 2023 12:50
@vincentarelbundock
Copy link
Collaborator Author

vincentarelbundock commented Apr 13, 2023

@sorhawell This PR now covers most of the trivial translations for DataFrame, LazyFrame, and GroupBy. I think it's a good place to stop for now.

Let me know when you've had time to look at this and I can implement any change you need.

Once this is merged, I'll turn to the S3 methods PR #107.

Depending on time, energy, etc., I might eventually come back to translate some of the trickier bits, but those will have to wait.

@eitsupi eitsupi changed the title [WIP] DataFrame: Simple translations DataFrame: Simple translations Apr 13, 2023
Copy link
Collaborator

@eitsupi eitsupi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
If you don't mind, would you consider improving the test using patrick?

tests/testthat/test-dataframe.R Outdated Show resolved Hide resolved
@vincentarelbundock
Copy link
Collaborator Author

Sorry for the dirty commit history. Made some stupid mistakes with patrick, but it should work now.

@sorhawell
Copy link
Collaborator

looks very good :) I will add to this PR some minor style changes later, I was just derping around in how to commit to a PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choosing what R type to replace a rust type is not always obvious. In this case i32 could not describe larger than 2Gb. Both bit64::integer64 (2^63) and R double (2^53) would be plenty big. I choose f64/double encoding of the usize type, because it will not force the user to use bit64 package. These conversion flavors could in the future be controlled with global environment variables.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... no change needed

@sorhawell
Copy link
Collaborator

I added some changes to the non-fallible rust methods. These do not need to return a Result, and if default arg value on R side, the extendr wrapper can be used directly as is with flag "use_extendr_wrapper".

with a few tests for null_count and estimated_size, this looks like a wrap :)

@sorhawell
Copy link
Collaborator

patrick tests look beautiful

@vincentarelbundock
Copy link
Collaborator Author

Great! I will study your changes and comments, and will make changes as appropriate. It may take me a few days because of work + family stuff.

@vincentarelbundock
Copy link
Collaborator Author

I pushed a simple test for .estimated_size().

Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.

Install the PR branch:

remotes::install_github("vincentarelbundock/r-polars@translation")
library(rpolars)
tmp = mtcars
tmp[1:2, 1:2] = NA
tmp[5, 3] = NA
a = pl$DataFrame(tmp)$null_count()

# correct
a
# polars DataFrame: shape: (1, 11)
# ┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
# │ mpg ┆ cyl ┆ disp ┆ hp  ┆ ... ┆ vs  ┆ am  ┆ gear ┆ carb │
# │ --- ┆ --- ┆ ---  ┆ --- ┆     ┆ --- ┆ --- ┆ ---  ┆ ---  │
# │ u32 ┆ u32 ┆ u32  ┆ u32 ┆     ┆ u32 ┆ u32 ┆ u32  ┆ u32  │
# ╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
# │ 2   ┆ 2   ┆ 1    ┆ 0   ┆ ... ┆ 0   ┆ 0   ┆ 0    ┆ 0    │
# └─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘

# incorrect
a$as_data_frame()
#             mpg           cyl          disp hp drat wt qsec vs am gear carb
# 1 9.881313e-324 9.881313e-324 4.940656e-324  0    0  0    0  0  0    0    0

Any ideas why we get this?

@eitsupi
Copy link
Collaborator

eitsupi commented Apr 14, 2023

Sorry, I merged #112 because NEWS was not updated correctly.
Could you please move the news item to the correct position?

This reverts commit 76b6964.
@vincentarelbundock
Copy link
Collaborator Author

I just reverted the NEWS commit. Not sure what's the best process to add it back, but for the record, here is what I had written:

- New methods implemented for DataFrame, LazyFrame, and GroupBy objects: min, max, mean, median, sum, std, var, first, last, head, tail, reverse, slice, null_count, estimated_size (#105 @vincentarelbundock)

## New Contributors

- @grantmcdermott made their first contribution in #81
- @vincentarelbundock made their first contribution in #105

@vincentarelbundock
Copy link
Collaborator Author

Ah I see what you mean @eitsupi . Thanks for the merge. I updated NEWS in the right position.

This reverts commit 6c11a63.
@vincentarelbundock
Copy link
Collaborator Author

nope, still broken. Giving up on this now. We can update the NEWS at the very end.

@sorhawell
Copy link
Collaborator

Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.

Currently r-polars maps arrow-u32 to R bit64::nteger64 which again is a hack to get i64 in R. Under the hood bit64::integer64 is just an i64 which is tagged with "SexpReal", if bit64 is not loaded the user is in for a surprise.

rpostgresql on other packages does the same. I prefer in future to map to f64 and let the connoisseurs actively opt for bit64 package

> library(bit64)
> library(rpolars)
> tmp = mtcars
> tmp[1:2, 1:2] = NA
> tmp[5, 3] = NA
> a = pl$DataFrame(tmp)$null_count()
> a
polars DataFrame: shape: (1, 11)
┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
│ mpgcyldisphp...vsamgearcarb │
│ ------------ ┆     ┆ ------------  │
│ u32u32u32u32 ┆     ┆ u32u32u32u32  │
╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
│ 2210...0000    │
└─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
> as.data.frame(a)
  mpg cyl disp hp drat wt qsec vs am gear carb
1   2   2    1  0    0  0    0  0  0    0    0
> str(as.data.frame(a))
'data.frame':	1 obs. of  11 variables:
 $ mpg :integer64 2 
 $ cyl :integer64 2 
 $ disp:integer64 1 
 $ hp  :integer64 0 
 $ drat:integer64 0 
 $ wt  :integer64 0 
 $ qsec:integer64 0 
 $ vs  :integer64 0 
 $ am  :integer64 0 
 $ gear:integer64 0 
 $ carb:integer64 0 

@vincentarelbundock
Copy link
Collaborator Author

Got it, thanks. I added very simple .null_count() and .estimated_size() tests.

AFAICT, this completes my TODO list. Feel free to let me know; I'm happy to make more changes to this PR if helpful.

@sorhawell
Copy link
Collaborator

excellent many thanks @vincentarelbundock

@sorhawell sorhawell merged commit 16b2d55 into pola-rs:main Apr 14, 2023
@vincentarelbundock
Copy link
Collaborator Author

Thanks for your guidance, edits, and prompt responses. This was a great contribution experience :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants