`DataFrame`: Simple translations #105

vincentarelbundock · 2023-04-12T12:40:31Z

Translated and tested:

DataFrame: max, mean, median, min, sum, var, std, first, last, head, tail, reverse, slice, null_count, estimated_size
LazyFrame: max, mean, median, min, sum, var, std, first, last, head, tail, reverse, slice
GroupBy: max, mean, median, min, sum, var, std, first, last, null_count

TODO:

.quantile: QuantileInterpolOptions input
is_duplicated(): How do we return a ChunkedArray to R output?

sorhawell · 2023-04-12T21:25:23Z

Interesting that DataFrame.first and DataFrame.last are not implemented in py-polars, but they are in rust-polars, just as you wrap them.

sorhawell · 2023-04-12T21:47:57Z

How should I re-write the slice function below to make length argument optional (None)?

You can do it like this (inspired by the py-polars wrapper):

fn slice(&self, offset: Robj, length: Robj) -> Result<LazyFrame, String> {
    Ok(LazyFrame(self.0.clone().slice(
        robj_to!(i64, offset)?,
        robj_to!(Option, u32, length)?.unwrap_or(u32::MAX),
    )))
}

When I get lost in type conversions, I find it helpful to break up the lines and let rust-analyzer show which types I have at each step.
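The `unwrap_or(u32::MAX)` idea can be tried out without polars or extendr. The sketch below is only an illustration under assumptions: `slice_rows` and its slice-of-`i32` input are hypothetical stand-ins, not r-polars API, and the offset is simplified to a non-negative `usize` (the real method takes an `i64`). Each intermediate value gets its own line and type annotation, which is the "break up the lines" habit mentioned above.

```rust
/// Hypothetical stand-in for a frame slice (not r-polars API).
/// Returns up to `length` items starting at `offset`.
/// A missing `length` means "until the end" and is encoded as `u32::MAX`.
fn slice_rows(rows: &[i32], offset: usize, length: Option<u32>) -> Vec<i32> {
    // Step 1: resolve the optional argument to a concrete u32.
    let len_u32: u32 = length.unwrap_or(u32::MAX);
    // Step 2: widen to usize so it can be used with iterator adapters.
    let len: usize = len_u32 as usize;
    // Step 3: skip `offset` items, then take at most `len` items.
    rows.iter().skip(offset).take(len).copied().collect()
}
```

Calling `slice_rows(&[1, 2, 3, 4, 5], 3, None)` takes everything from index 3 onward, while passing `Some(2)` caps the result at two items.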

sorhawell · 2023-04-12T22:04:54Z

Missing robj_to_u8 makes robj_to fail when I try to implement DataFrame.std()

You can cut a corner and use the extendr conversion, like I did here in the beginning.

We can make this PR a co-op and I can add the u8 conversion, OR you just use extendr in this PR and I file an issue to upgrade the u8 conversions.

sorhawell · 2023-04-12T23:04:28Z

Could you supply a minimal example of implementing something like .max() on DataFrameGroupBy?

Here's a minimal example, plus some tidying up of the GroupBy class.

After zzz.R is updated, you can do it like this:

GroupBy_max = function() {
  self$agg(pl$all()$max())
}

And a use case from py-polars:

> df = pl$DataFrame(
+         a = c(1, 2, 2, 3, 4, 5),
+         b = c(0.5, 0.5, 4, 10, 13, 14),
+         c = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
+         d = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
+ )
> df$groupby("d", maintain_order=TRUE)$max()
polars DataFrame: shape: (3, 4)
┌────────┬─────┬──────┬──────┐
│ d      ┆ a   ┆ b    ┆ c    │
│ ---    ┆ --- ┆ ---  ┆ ---  │
│ str    ┆ f64 ┆ f64  ┆ bool │
╞════════╪═════╪══════╪══════╡
│ Apple  ┆ 3.0 ┆ 10.0 ┆ true │
│ Orange ┆ 2.0 ┆ 0.5  ┆ true │
│ Banana ┆ 5.0 ┆ 14.0 ┆ true │
└────────┴─────┴──────┴──────┘


vincentarelbundock · 2023-04-13T01:52:05Z

Thanks for all the extra info! Everything seems to have worked without a hitch.

I used the u8 shortcut for now.

I'll look at a few more candidates for translation and let you know when I think this PR is ready for a review.

vincentarelbundock · 2023-04-13T12:53:47Z

@sorhawell This PR now covers most of the trivial translations for DataFrame, LazyFrame, and GroupBy. I think it's a good place to stop for now.

Let me know when you've had time to look at this and I can implement any change you need.

Once this is merged, I'll turn to the S3 methods PR #107.

Depending on time, energy, etc., I might eventually come back to translate some of the trickier bits, but those will have to wait.

eitsupi

Thanks for working on this!
If you don't mind, would you consider improving the test using patrick?

tests/testthat/test-dataframe.R

vincentarelbundock · 2023-04-13T14:00:29Z

Sorry for the dirty commit history. Made some stupid mistakes with patrick, but it should work now.

sorhawell · 2023-04-13T17:26:44Z

Looks very good :) I will add some minor style changes to this PR later; I was just derping around figuring out how to commit to a PR.

sorhawell · 2023-04-13T21:38:48Z

src/rust/src/rdataframe/mod.rs

Choosing which R type should replace a Rust type is not always obvious. In this case i32 cannot describe sizes larger than 2 GB. Both bit64::integer64 (up to 2^63) and R double (exact integers up to 2^53) would be plenty big. I chose the f64/double encoding of the usize type because it does not force the user to use the bit64 package. These conversion flavors could in the future be controlled with global environment variables.

... no change needed
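The size limits discussed here can be checked with a few lines of plain Rust. This is a sketch under assumptions: `fits_in_r_integer` and `size_to_r_double` are hypothetical helper names, not functions from r-polars or extendr, and `u64` is used in place of `usize` so the example behaves the same on every platform.

```rust
/// Integers up to this magnitude (2^53) are all exactly representable in an f64.
const F64_EXACT_INT_LIMIT: u64 = 1 << 53;

/// Hypothetical check: would this byte count fit in an R integer (a 32-bit signed int)?
fn fits_in_r_integer(n: u64) -> bool {
    n <= i32::MAX as u64
}

/// Hypothetical conversion: encode a byte count as an R double.
/// The boolean reports whether the value is within the exactly representable range.
fn size_to_r_double(n: u64) -> (f64, bool) {
    (n as f64, n <= F64_EXACT_INT_LIMIT)
}
```

A 3 GiB size already fails `fits_in_r_integer`, while `size_to_r_double` stays exact far beyond any realistic memory size; precision is only lost above 2^53, where for example 2^53 + 1 rounds back to 2^53.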

R/dataframe__frame.R

sorhawell · 2023-04-13T21:57:57Z

I added some changes to the non-fallible Rust methods. These do not need to return a Result, and if there is no default arg value on the R side, the extendr wrapper can be used directly as is with the flag "use_extendr_wrapper".

With a few tests for null_count and estimated_size, this looks like a wrap :)
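The difference between fallible and non-fallible methods can be sketched in plain Rust. Everything below is hypothetical and for illustration only: `Frame` is a toy struct, not the r-polars `DataFrame`, and no extendr macros are involved. The point is only that a method which cannot fail returns its value directly, while one that can fail keeps the `Result<_, String>` shape.

```rust
/// Hypothetical toy frame: each column is a vector of optional values.
struct Frame {
    columns: Vec<Vec<Option<f64>>>,
}

impl Frame {
    /// Cannot fail, so it returns the counts directly instead of a Result.
    fn null_count(&self) -> Vec<usize> {
        self.columns
            .iter()
            .map(|col| col.iter().filter(|v| v.is_none()).count())
            .collect()
    }

    /// Can fail (the index may be out of range), so a Result is still appropriate.
    fn column_null_count(&self, idx: usize) -> Result<usize, String> {
        self.columns
            .get(idx)
            .map(|col| col.iter().filter(|v| v.is_none()).count())
            .ok_or_else(|| format!("column index {} out of range", idx))
    }
}
```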

sorhawell · 2023-04-13T21:59:56Z

patrick tests look beautiful

vincentarelbundock · 2023-04-13T22:01:53Z

Great! I will study your changes and comments, and will make changes as appropriate. It may take me a few days because of work + family stuff.

vincentarelbundock · 2023-04-14T12:54:18Z

I pushed a simple test for .estimated_size().

Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.

Install the PR branch:

remotes::install_github("vincentarelbundock/r-polars@translation")

library(rpolars)
tmp = mtcars
tmp[1:2, 1:2] = NA
tmp[5, 3] = NA
a = pl$DataFrame(tmp)$null_count()

# correct
a
# polars DataFrame: shape: (1, 11)
# ┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
# │ mpg ┆ cyl ┆ disp ┆ hp  ┆ ... ┆ vs  ┆ am  ┆ gear ┆ carb │
# │ --- ┆ --- ┆ ---  ┆ --- ┆     ┆ --- ┆ --- ┆ ---  ┆ ---  │
# │ u32 ┆ u32 ┆ u32  ┆ u32 ┆     ┆ u32 ┆ u32 ┆ u32  ┆ u32  │
# ╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
# │ 2   ┆ 2   ┆ 1    ┆ 0   ┆ ... ┆ 0   ┆ 0   ┆ 0    ┆ 0    │
# └─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘

# incorrect
a$as_data_frame()
#             mpg           cyl          disp hp drat wt qsec vs am gear carb
# 1 9.881313e-324 9.881313e-324 4.940656e-324  0    0  0    0  0  0    0    0

Any ideas why we get this?

eitsupi · 2023-04-14T13:05:33Z

Sorry, I merged #112 because NEWS was not updated correctly.
Could you please move the news item to the correct position?

This reverts commit 76b6964.

vincentarelbundock · 2023-04-14T13:08:23Z

I just reverted the NEWS commit. Not sure what's the best process to add it back, but for the record, here is what I had written:

- New methods implemented for DataFrame, LazyFrame, and GroupBy objects: min, max, mean, median, sum, std, var, first, last, head, tail, reverse, slice, null_count, estimated_size (#105 @vincentarelbundock)

## New Contributors

- @grantmcdermott made their first contribution in #81
- @vincentarelbundock made their first contribution in #105

vincentarelbundock · 2023-04-14T13:11:35Z

Ah I see what you mean @eitsupi . Thanks for the merge. I updated NEWS in the right position.

This reverts commit 6c11a63.

vincentarelbundock · 2023-04-14T13:14:17Z

nope, still broken. Giving up on this now. We can update the NEWS at the very end.

sorhawell · 2023-04-14T14:14:15Z

Here’s something weird: .as_data_frame() changes the values in the .null_count() data frame.

Currently r-polars maps arrow-u32 to R bit64::integer64, which in turn is a hack to get i64 in R. Under the hood, bit64::integer64 is just an i64 which is tagged with "SexpReal"; if bit64 is not loaded, the user is in for a surprise.

rpostgresql and other packages do the same. In future I would prefer to map to f64 and let the connoisseurs actively opt in to the bit64 package.

> library(bit64)
> library(rpolars)
> tmp = mtcars
> tmp[1:2, 1:2] = NA
> tmp[5, 3] = NA
> a = pl$DataFrame(tmp)$null_count()
> a
polars DataFrame: shape: (1, 11)
┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp  ┆ ... ┆ vs  ┆ am  ┆ gear ┆ carb │
│ --- ┆ --- ┆ ---  ┆ --- ┆     ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ u32 ┆ u32 ┆ u32  ┆ u32 ┆     ┆ u32 ┆ u32 ┆ u32  ┆ u32  │
╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
│ 2   ┆ 2   ┆ 1    ┆ 0   ┆ ... ┆ 0   ┆ 0   ┆ 0    ┆ 0    │
└─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
> as.data.frame(a)
  mpg cyl disp hp drat wt qsec vs am gear carb
1   2   2    1  0    0  0    0  0  0    0    0
> str(as.data.frame(a))
'data.frame':	1 obs. of  11 variables:
 $ mpg :integer64 2 
 $ cyl :integer64 2 
 $ disp:integer64 1 
 $ hp  :integer64 0 
 $ drat:integer64 0 
 $ wt  :integer64 0 
 $ qsec:integer64 0 
 $ vs  :integer64 0 
 $ am  :integer64 0 
 $ gear:integer64 0 
 $ carb:integer64 0
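The odd values seen without bit64 loaded are what you get when the 8 bytes of an i64 are read as a double. The sketch below reproduces that reinterpretation in plain Rust; the helper names are hypothetical and this is not how r-polars or bit64 is implemented, only a demonstration of the bit pattern.

```rust
/// Reinterpret the bits of an i64 as an f64, as happens when integer64
/// storage is printed as an ordinary double.
fn i64_as_double_bits(x: i64) -> f64 {
    f64::from_bits(x as u64)
}

/// Reverse direction: recover the i64 from the double's raw bits.
fn double_bits_as_i64(x: f64) -> i64 {
    x.to_bits() as i64
}
```

The integers 1 and 2 map to the two smallest positive subnormal doubles, which R prints as 4.940656e-324 and 9.881313e-324, the same values that appeared in the mis-printed null counts. Zero maps to 0.0, which is why the other columns looked correct.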

vincentarelbundock · 2023-04-14T16:25:50Z

Got it, thanks. I added very simple .null_count() and .estimated_size() tests.

AFAICT, this completes my TODO list. Feel free to let me know; I'm happy to make more changes to this PR if helpful.

sorhawell · 2023-04-14T18:30:36Z

excellent many thanks @vincentarelbundock

vincentarelbundock · 2023-04-14T18:32:02Z

Thanks for your guidance, edits, and prompt responses. This was a great contribution experience :)

vincentarelbundock added 3 commits April 11, 2023 22:41

DataFrame translations: No arguments

9bb47ae

translate: DataFrame$reverse()

a89e355

translation: DataFrame$slice()

fa1a914

vincentarelbundock changed the title ~~Translation~~ DataFrame: Simple translations Apr 12, 2023

vincentarelbundock changed the title ~~DataFrame: Simple translations~~ [WIP] DataFrame: Simple translations Apr 12, 2023

eitsupi marked this pull request as draft April 12, 2023 12:43

etiennebacher and others added 6 commits April 12, 2023 20:13

typo in readme (pola-rs#102)

10b44cd

DataFrame$tail() (pola-rs#103)

386c0fb

* DataFrame$tail() * DataFrame_tail PR review

merge conflict cruft

3c43036

GroupBy translations

347d41c

DataFrame$slice() optional arg

a9a0fc6

translation DataFrame.var() and .std()

0e25e16

vincentarelbundock added 4 commits April 12, 2023 23:54

DataFrame: null_count(), estimated_size()

21e0733

docs

da78b0f

Merge branch 'main' into translation

2c17a2f

try to fix examples

9f4d846

vincentarelbundock marked this pull request as ready for review April 13, 2023 12:50

eitsupi changed the title ~~[WIP] DataFrame: Simple translations~~ DataFrame: Simple translations Apr 13, 2023

eitsupi reviewed Apr 13, 2023

tests/testthat/test-dataframe.R (outdated)

vincentarelbundock added 4 commits April 13, 2023 09:19

patrick

3582ca1

patrick 2

5351a36

patrick 3

4933fcf

patrick 4

96b98eb

sorhawell added 3 commits April 13, 2023 23:12

simplify basic LazyFrame methods

d938064

roxydocs for previous commit

d986571

add newline eof

960e0ed

sorhawell reviewed Apr 13, 2023

R/dataframe__frame.R

simplify DataFrame .null_count .estimated_size

943171d

test .estimated_size()

0b0fc26

NEWS

76b6964

vincentarelbundock force-pushed the translation branch from 503a52f to 76b6964 Compare April 14, 2023 13:02

Revert "NEWS"

93af6d6

This reverts commit 76b6964.

NEWS 2

6c11a63

vincentarelbundock force-pushed the translation branch from 407ad02 to 6c11a63 Compare April 14, 2023 13:12

Revert "NEWS 2"

8ce30ff

This reverts commit 6c11a63.

test .null_count() on DataFrame and GroupBy

512e806

sorhawell merged commit 16b2d55 into pola-rs:main Apr 14, 2023

sorhawell mentioned this pull request Apr 14, 2023

polars should prefer to convert to R double over bit64 as default. bit64 should be optional. #116

Closed

vincentarelbundock deleted the translation branch April 15, 2023 17:22
