Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) #12420
Comments
Prefer |
A somewhat related suggestion was for In trying to see what this is called elsewhere, it seems |
There already is Adding this new expression shortcut seems fine with me. It should either be called The resulting data type should be |
I think What about naming the expression Apart from how I use it, it does seem odd that it's the only |
I think someone pointed out in a previous discussion that If it was only an expression you would need to use df.select(pl.row_number(), pl.all()) (Or, at least, that's what I think the suggestion was.) |
Keep in mind that we don't need any of these methods - they are simply for convenience. With that in mind, the DataFrame method makes sense because row numbers make a lot of sense in a DataFrame context - people trying to mimic indexes. I think it's fine to have both. |
Wait a minute. Shouldn't this just be adding It seems they do the same thing even if Also, if I do timeit on a 1M row df, then cumcount takes 272 µs ± 30.8 µs per loop whereas int_range takes 1.63 ms ± 30.5 µs per loop. |
They differ in types,
pl.select(pl.lit('foo').cumcount()).schema
# OrderedDict([('literal', UInt32)])
pl.select(pl.int_range(0, 1)).schema
# OrderedDict([('int', Int64)])
pl.select(pl.lit('foo')).with_row_count().schema
# OrderedDict([('row_nr', UInt32), ('literal', Utf8)])
I guess the idea of wanting .rolling(index_column = pl.foo(), ... versus .rolling(index_column = pl.foo().cast(int), ... |
+1 for having |
Small nit - if it's called "row number", should it start counting from 1 by default like SQL does instead of 0? |
I considered this but I don't think so. We start counting at 0, right? I find the SQL behavior here strange. The |
I would expect |
If that's the case then I think I would prefer |
I also considered |
My vote goes to |
Would it be too much clutter to include both? a An alternative is Come to think of it, this would be a pretty useful in general: if you're building up a series of frames and want the next frame to start at the next index, you could begin at whatever |
@mcrumiller Thinking about it some more I don't think it'd really be necessary for |
There
We can accept a dtype in the expression. I'd like to slow this down a little bit. I don't think we should add an expression for this. We have ranges, a well-known construct in many programming languages. |
Yes, though I'd argue In practice people would probably just use |
This doesn't really help when trying to bring someone to your point of view.... That being said, I agree that this whole conversation is being muddled. I think many of us agree that being able to quickly index columns and frames by integer number is very useful and common enough that it warrants some function. The cleanest solution to me is a single expression
# default input args shown for clarity
df.select(pl.row_index(start=0, dtype=pl.UInt32))
df.with_row_index(start=0, dtype=pl.UInt32) # |
I also think it's worth adding a |
Note that in the next release, you will be able to write |
I know this is closed, but I really think a |
As per Ritchie's comment:
So it is marked as "not planned". We may change our tune in the future. |
Use
import polars as pl
df = pl.DataFrame({
"group": [1, 1, 1, 2, 2],
"value": [1, None, 2, 3, 4],
})
df.sort('value', descending=True).with_columns(pl.int_range(1, pl.len()+1).over('group').alias('nth'))
|
Sorry I deleted my comment, I realized I could just use |
Careful though, rank won't give you a rank for nulls. |
Thank you. In my case I don't have nulls so this will be ok. Can't say I love |
@mkleinbort-ic I work around this now by always defining a
import polars as pl
def row_index():
    return pl.int_range(1, pl.len()+1)
df = pl.DataFrame({
"group": [1, 1, 1, 2, 2],
"value": [1, None, 2, 3, 4],
})
df.sort('value', descending=True).with_columns(row_index().over('group').alias('nth'))
Doesn't have to be a function but it's more obvious that it's computing something. |
I also really think that Using Furthermore, |
Actually, there is also a significant difference between
a = pl.DataFrame({"x": ["a","b","c"]})
b = pl.DataFrame({"y": ["d","e"]})
a.join(
b,
left_on=pl.int_range(pl.len()).mod(len(b)),
right_on=pl.int_range(pl.len()),
)
# InvalidOperationError: All join key expressions must be elementwise. |
Description
Having a quick expression to grab an integer-based row number is often requested and would be pretty useful. Instead of the usual options:
This would simplify to: