-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFrame
: Simple translations
#105
Conversation
DataFrame
: Simple translationsDataFrame
: Simple translations
Interesting that DataFrame.first and DataFrame.last is not implemented for py-polars, but they are in rust-polars just as you wrap them. |
you can do like this (inspired by py-polars wrapper ) fn slice(&self, offset: Robj, length: Robj) -> Result<LazyFrame, String> {
Ok(LazyFrame(self.0.clone().slice(
robj_to!(i64, offset)?,
robj_to!(Option, u32, length)?.unwrap_or(u32::MAX),
)))
} when I get lost in type conversions, I find it helpful to break up the lines and let rust-analyzer explain what types I have currently made |
you can cut a corner and use the extendr conversion like I did here in the beginning. we can make this PR into a co-op and then I can add the u8 conversion OR just use extendr in this PR, and I file an issue to upgrade u8 conversions. |
here's a minimal example + tidying up the GroupBy class after zzz.R is updated you can do like this GroupBy_max = function() {
self$agg(pl$all()$max())
} and use case from py-polars > df = pl$DataFrame(
+ a = c(1, 2, 2, 3, 4, 5),
+ b = c(0.5, 0.5, 4, 10, 13, 14),
+ c = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE),
+ d = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana")
+ )
> df$groupby("d", maintain_order=TRUE)$max()
polars DataFrame: shape: (3, 4)
┌────────┬─────┬──────┬──────┐
│ d ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ bool │
╞════════╪═════╪══════╪══════╡
│ Apple ┆ 3.0 ┆ 10.0 ┆ true │
│ Orange ┆ 2.0 ┆ 0.5 ┆ true │
│ Banana ┆ 5.0 ┆ 14.0 ┆ true │
└────────┴─────┴──────┴──────┘
|
* DataFrame$tail() * DataFrame_tail PR review
Thanks for all the extra info! Everything seems to have worked without a hitch. I used the I'll look at a few more candidates for translation and let you know when I think this PR is ready for a review. |
@sorhawell This PR now covers most of the trivial translations for Let me know when you've had time to look at this and I can implement any change you need. Once this is merged, I'll turn to the S3 methods PR #107. Depending on time, energy, etc., I might eventually come back to translate some of the trickier bits, but those will have to wait. |
DataFrame
: Simple translationsDataFrame
: Simple translations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
If you don't mind, would you consider improving the test using patrick?
Sorry for the dirty commit history. Made some stupid mistakes with |
looks very good :) I will add to this PR some minor style changes later, I was just derping around in how to commit to a PR |
src/rust/src/rdataframe/mod.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choosing what R type to replace a rust type is not always obvious. In this case i32 could not describe larger than 2Gb. Both bit64::integer64 (2^63) and R double (2^53) would be plenty big. I choose f64/double encoding of the usize type, because it will not force the user to use bit64 package. These conversion flavors could in the future be controlled with global environment variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... no change needed
I added some changes to the non-fallible rust methods. These do not need to return a Result, and if default arg value on R side, the extendr wrapper can be used directly as is with flag "use_extendr_wrapper". with a few tests for null_count and estimated_size, this looks like a wrap :) |
patrick tests look beautiful |
Great! I will study your changes and comments, and will make changes as appropriate. It may take me a few days because of work + family stuff. |
I pushed a simple test for Here’s something weird: Install the PR branch: remotes::install_github("vincentarelbundock/r-polars@translation") library(rpolars)
tmp = mtcars
tmp[1:2, 1:2] = NA
tmp[5, 3] = NA
a = pl$DataFrame(tmp)$null_count()
# correct
a
# polars DataFrame: shape: (1, 11)
# ┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
# │ mpg ┆ cyl ┆ disp ┆ hp ┆ ... ┆ vs ┆ am ┆ gear ┆ carb │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
# │ 2 ┆ 2 ┆ 1 ┆ 0 ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
# └─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
# incorrect
a$as_data_frame()
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 9.881313e-324 9.881313e-324 4.940656e-324 0 0 0 0 0 0 0 0 Any ideas why we get this? |
503a52f
to
76b6964
Compare
Sorry, I merged #112 because NEWS was not updated correctly. |
This reverts commit 76b6964.
I just reverted the NEWS commit. Not sure what's the best process to add it back, but for the record, here is what I had written: - New methods implemented for DataFrame, LazyFrame, and GroupBy objects: min, max, mean, median, sum, std, var, first, last, head, tail, reverse, slice, null_count, estimated_size (#105 @vincentarelbundock)
## New Contributors
- @grantmcdermott made their first contribution in #81
- @vincentarelbundock made their first contribution in #105 |
Ah I see what you mean @eitsupi . Thanks for the merge. I updated NEWS in the right position. |
407ad02
to
6c11a63
Compare
This reverts commit 6c11a63.
nope, still broken. Giving up on this now. We can update the NEWS at the very end. |
Currently r-polars maps arrow-u32 to R bit64::nteger64 which again is a hack to get i64 in R. Under the hood bit64::integer64 is just an i64 which is tagged with "SexpReal", if bit64 is not loaded the user is in for a surprise. rpostgresql on other packages does the same. I prefer in future to map to f64 and let the connoisseurs actively opt for bit64 package > library(bit64)
> library(rpolars)
> tmp = mtcars
> tmp[1:2, 1:2] = NA
> tmp[5, 3] = NA
> a = pl$DataFrame(tmp)$null_count()
> a
polars DataFrame: shape: (1, 11)
┌─────┬─────┬──────┬─────┬─────┬─────┬─────┬──────┬──────┐
│ mpg ┆ cyl ┆ disp ┆ hp ┆ ... ┆ vs ┆ am ┆ gear ┆ carb │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪══════╪═════╪═════╪═════╪═════╪══════╪══════╡
│ 2 ┆ 2 ┆ 1 ┆ 0 ┆ ... ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└─────┴─────┴──────┴─────┴─────┴─────┴─────┴──────┴──────┘
> as.data.frame(a)
mpg cyl disp hp drat wt qsec vs am gear carb
1 2 2 1 0 0 0 0 0 0 0 0
> str(as.data.frame(a))
'data.frame': 1 obs. of 11 variables:
$ mpg :integer64 2
$ cyl :integer64 2
$ disp:integer64 1
$ hp :integer64 0
$ drat:integer64 0
$ wt :integer64 0
$ qsec:integer64 0
$ vs :integer64 0
$ am :integer64 0
$ gear:integer64 0
$ carb:integer64 0
|
Got it, thanks. I added very simple .null_count() and .estimated_size() tests. AFAICT, this completes my TODO list. Feel free to let me know; I'm happy to make more changes to this PR if helpful. |
excellent many thanks @vincentarelbundock |
Thanks for your guidance, edits, and prompt responses. This was a great contribution experience :) |
Translated and tested:
DataFrame
:max
,mean
,median
,min
,sum
,var
,std
,first
,last
,head
,tail
,reverse
,slice
,null_count
,estimated_size
LazyFrame
:max
,mean
,median
,min
,sum
,var
,std
,first
,last
,head
,tail
,reverse
,slice
GroupBy
:max
,mean
,median
,min
,sum
,var
,std
,first
,last
,null_count
TODO:
.quantile
:QuantileInterpolOptions
inputis_duplicated()
: How do we return aChunkedArray
to R output?