[ENH] nth function #3404

samukweku · 2023-01-03T20:28:14Z

Implement dt.nth(cols, n) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, NA-row is returned.

Closes #3128
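The behaviour described above can be sketched in plain Python. This is only an illustration of the intended semantics, not the C++ implementation in this PR; the helper name `nth_per_group` and the sample data are assumptions made for the example, and `None` stands in for NA.

```python
def nth_per_group(keys, values, n):
    """Return {key: n-th value of that key's group}; None (NA) if n is out of bounds."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    result = {}
    for k, rows in groups.items():
        # A negative n counts from the end of the group.
        idx = n if n >= 0 else len(rows) + n
        result[k] = rows[idx] if 0 <= idx < len(rows) else None
    return result


A = [1, 1, 2, 1, 2]
B = [None, 2.0, 3.0, 4.0, 5.0]
print(nth_per_group(A, B, 0))   # {1: None, 2: 3.0}
print(nth_per_group(A, B, -1))  # {1: 4.0, 2: 5.0}
print(nth_per_group(A, B, 3))   # {1: None, 2: None}
```

Missing values are not skipped here: the first row of group 1 is NA, so `n=0` returns NA for that group.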

oleksiyskononenko · 2023-01-03T22:21:58Z

Thanks @samukweku. So the issue with this PR is that skipna doesn't work? Why do the tests fail with "got an unexpected keyword argument n"?

oleksiyskononenko · 2023-01-10T02:47:48Z

I guess this implementation has an issue even without skipna. Look at the example below:

from datatable import dt, f, by
DT = dt.Frame([1, 1, 2])
DT[:, f.C0.nth(0), by(f.C0)]

Triggers

AssertionError: Assertion 'i < nrows()' failed in src/core/column.cc, line 236

UPD: I guess this issue has never been fixed: #3346 (comment)
I propose not closing PRs when there is something wrong with your commits; just let me know and we can work together on fixing it. Otherwise, we lose the conversation and some issues like the one above.

oleksiyskononenko · 2023-01-10T02:59:31Z

As for skipna, I've been looking into this, and it seems that if we just pass the validity mask to Nth_ColumnImpl we could have issues with negative n: to figure out which row is row -X, we first need to establish all the other row ids (skipping missing values) and then count X rows back from the end.

So what we need to do first is apply the validity mask to the original frame (filtering out "all" or "any" missing values).
However, since the rows are already grouped, we may need to adjust the group-by...
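A rough plain-Python sketch of the ordering described above: the valid rows of each group are collected first, and only then is a negative n resolved against them. The function name `nth_skipna` and the sample data are hypothetical, `None` stands in for NA, and this works on a single column rather than on datatable's grouped frames.

```python
def nth_skipna(keys, values, n):
    """n-th non-missing value per group; None (NA) if fewer valid rows exist."""
    valid = {}
    for k, v in zip(keys, values):
        # Register the group even if all of its rows turn out to be missing.
        bucket = valid.setdefault(k, [])
        if v is not None:
            bucket.append(v)
    result = {}
    for k, rows in valid.items():
        # Only now is a negative n meaningful: it counts back over valid rows.
        idx = n if n >= 0 else len(rows) + n
        result[k] = rows[idx] if 0 <= idx < len(rows) else None
    return result


keys = [1, 1, 1, 2, 2]
vals = [None, 2.0, 4.0, 3.0, None]
print(nth_skipna(keys, vals, -1))  # {1: 4.0, 2: 3.0}
print(nth_skipna(keys, vals, -2))  # {1: 2.0, 2: None}
```

Group 2 shows why the mask alone is not enough: its last physical row is missing, so row -1 can only be located after the valid rows are known.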

samukweku · 2023-01-10T21:56:25Z

my bad ... I am lost on the explanation regarding skipna

samukweku · 2023-01-10T22:01:18Z

@oleksiyskononenko I get this warning: /home/sam/datatable/docs/api/fexpr/nth.rst: WARNING: document isn't included in any toctree. Where is the toctree?

oleksiyskononenko · 2023-01-10T22:33:50Z

@samukweku ah, you just need to add the nth() function to the API, just like we do for all the other functions; see index-api.rst.

samukweku · 2023-01-14T13:38:14Z

@oleksiyskononenko what do you suggest is the way forward for this PR? Drop the skipna parameter and let the user handle it instead? Do you mind showing me an implementation of skipna assuming a positive n, so I can understand how FExpr_all and any would be used?

samukweku · 2023-01-21T13:06:55Z

Looking further at pandas' implementation of dropna, the dropna is applied to the entire dataframe (including the groupby column), before the nth function is applied. What we should be doing for nth and any other relevant function is to apply this per column; if the user wants it per row, that should be applied in the i section as a filter, same way we'd do if we were removing nulls from the dataframe:

Using the example from pandas' home page

from datatable import dt, f, by
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
g = df.groupby('A')

DT = dt.Frame(df)

In [5]: df
Out[5]: 
   A    B
0  1  NaN
1  1  2.0
2  2  3.0
3  1  4.0
4  2  5.0

In [6]: DT
Out[6]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     1        2
 2 |     2        3
 3 |     1        4
 4 |     2        5
[5 rows x 2 columns]

without skipping nulls:

In [7]: g.nth(0)
Out[7]: 
     B
A     
1  NaN
2  3.0

In [9]: DT[:, dt.nth(f.B,0), 'A']
Out[9]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

In [10]: DT[:, dt.nth(f.B,1), 'A']
Out[10]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        5
[2 rows x 2 columns]

In [11]: g.nth(1)
Out[11]: 
     B
A     
1  2.0
2  5.0

In [12]: g.nth(-1)
Out[12]: 
     B
A     
1  4.0
2  5.0

In [13]: DT[:, dt.nth(f.B,-1), 'A']
Out[13]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        4
 1 |     2        5
[2 rows x 2 columns]

skipping nulls:

In [14]: g.nth(0, dropna='any')
Out[14]: 
     B
A     
1  2.0
2  3.0

In [21]: g.nth(0, dropna='all')
Out[21]: 
     B
A     
1  NaN
2  3.0

In [42]: DT[~((f[:]==None).rowall()), :][:, dt.nth(f.B, 0), f.A]
Out[42]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

In [43]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 0), f.A]
Out[43]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        3
[2 rows x 2 columns]

In [54]: g.nth(3, dropna='any')
Out[54]: 
    B
A    
1 NaN
2 NaN

In [55]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 3), f.A]
Out[55]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2       NA
[2 rows x 2 columns]

It probably wouldn't be a bad idea to implement a dropna function for use in the i section; a better option (which I hope to get to sometime) is to implement methods for ==, !=, >, ... which would make chaining easier and the computation clearer and cleaner to read, e.g. DT[~f[:].eq(None).rowany(), :].

That was a digression; my point is that pandas treats nth as per row, whereas we should treat it as per column; if the user wants something per row, it should go into the i section (and rightly so).

Of course, another option would be via cumcount, and subsequent filtering:

In [51]: DT[:, f[:].extend(dt.cumcount()), f.A]
Out[51]: 
   |     A        B     C0
   | int64  float64  int64
-- + -----  -------  -----
 0 |     1       NA      0
 1 |     1        2      1
 2 |     1        4      2
 3 |     2        3      0
 4 |     2        5      1
[5 rows x 3 columns]

# fetch the second row per group
In [52]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==1, :-1]
Out[52]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        5
[2 rows x 2 columns]

# fetch the first row per group
In [53]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==0, :-1]
Out[53]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

Applying skipna per column also lets us extend the first and last implementations to fetch the first non-null value, if the user desires that.
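To make the per-row versus per-column distinction concrete, here is a small plain-Python sketch. It is not datatable or pandas code: the helper names are made up for illustration, `None` stands in for NA, and column C is a hypothetical second value column added so that the two approaches give different answers.

```python
def group_rows(keys, rows):
    """Collect row tuples by group key, preserving order."""
    groups = {}
    for k, row in zip(keys, rows):
        groups.setdefault(k, []).append(row)
    return groups


def pick(seq, n):
    """n-th element of seq (negative n counts from the end), or None if out of bounds."""
    idx = n if n >= 0 else len(seq) + n
    return seq[idx] if 0 <= idx < len(seq) else None


def nth_rowwise_any(keys, rows, n):
    """Drop every row containing any NA first, then take the n-th remaining row."""
    ncols = len(rows[0])
    out = {}
    for k, grp in group_rows(keys, rows).items():
        kept = [r for r in grp if all(v is not None for v in r)]
        row = pick(kept, n)
        out[k] = row if row is not None else (None,) * ncols
    return out


def nth_columnwise(keys, rows, n):
    """Skip NAs independently in each column, then take the n-th valid value."""
    out = {}
    for k, grp in group_rows(keys, rows).items():
        out[k] = tuple(pick([v for v in col if v is not None], n)
                       for col in zip(*grp))
    return out


A = [1, 1, 2, 1, 2]
B = [None, 2.0, 3.0, 4.0, 5.0]
C = [10.0, None, 30.0, 40.0, 50.0]  # hypothetical extra column
rows = list(zip(B, C))
print(nth_rowwise_any(A, rows, 0))  # {1: (4.0, 40.0), 2: (3.0, 30.0)}
print(nth_columnwise(A, rows, 0))   # {1: (2.0, 10.0), 2: (3.0, 30.0)}
```

In group 1 the row-wise filter discards the first two rows entirely, while the column-wise variant keeps each column's own first valid value; with `n=0` and `n=-1` the latter behaves like a "first non-null" and "last non-null" per column.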

samukweku · 2023-01-22T01:00:42Z

@oleksiyskononenko revisiting the issue of skipna per column or rowwise ^^^^^^^^^^^^^^^

samukweku · 2023-03-02T09:33:29Z

@oleksiyskononenko, @sh1ng, @st-pasha - I need your help with how to use the FExpr_row functions:

Workframe inputs = arg_->evaluate_n(ctx);
Grouping gmode = inputs.get_grouping_mode();
colvec columns;
size_t ncols = inputs.ncols();
size_t nrows = 1;
columns.reserve(ncols);
for (size_t i = 0; i < ncols; ++i) {
  Column col = inputs.retrieve_column(i);
  xassert(i == 0 || nrows == col.nrows());
  nrows = col.nrows();
  columns.emplace_back(col);
}

Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);

Error message:

src/core/expr/fexpr_nth.cc: In member function ‘dt::expr::Workframe dt::expr::FExpr_Nth<SKIPNA>::evaluate_n(dt::expr::EvalContext&) const’:
src/core/expr/fexpr_nth.cc:77:52: error: cannot call member function ‘virtual Column dt::expr::FExpr_RowAll::apply_function(colvec&&, dt::expr::size_t, dt::expr::size_t) const’ without object
   77 |       Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);

What is the correct way of using FExpr_RowAll?

samukweku · 2023-04-16T23:40:32Z

@oleksiyskononenko this is dependent on PR #3404 and #3444. Once those PRs are concluded, I can pick this up.

Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>

samukweku · 2023-04-24T02:13:17Z

@oleksiyskononenko I figured out how to implement SKIPNA='any' or SKIPNA='all', similar to what pandas implements. If you've got some time, please review after the build completes. Thanks.

samukweku · 2023-05-27T23:06:06Z

@oleksiyskononenko just checking in, waiting for your feedback

samukweku added the "new feature" label (Feature requests for new functionality) Jan 3, 2023

samukweku requested a review from oleksiyskononenko January 3, 2023 20:28

samukweku self-assigned this Jan 3, 2023

oleksiyskononenko reviewed Jan 3, 2023

docs/api/dt/nth.rst (outdated)

oleksiyskononenko reviewed Jan 3, 2023

docs/releases/v1.1.0.rst (outdated)

oleksiyskononenko reviewed Jan 3, 2023

src/datatable/__init__.py (outdated)

oleksiyskononenko reviewed Jan 3, 2023

tests/dt/test-nth.py (outdated)

oleksiyskononenko reviewed Jan 4, 2023

tests/dt/test-nth.py (outdated)

samukweku changed the title from "Samukweku/nth function" to "[ENH] nth function" Jan 4, 2023

oleksiyskononenko reviewed Jan 5, 2023

docs/api/dt/nth.rst (outdated)

samukweku force-pushed the samukweku/nth_function branch from d0dd1a8 to f9a3234 on February 18, 2023 05:38

samukweku mentioned this pull request Mar 11, 2023

[ENH] Refactor count/countna to use FExpr #3440

Merged

samukweku force-pushed the samukweku/nth_function branch from 67890c5 to 7435bf5 on April 11, 2023 11:59

samukweku and others added 6 commits April 21, 2023 09:30

implement nth, rewrite first/last as FExpr

34348e3

rename file

5b3a99f

revert commits; restore skipna argument; focus only on nth function

339b159

whitespace

d89578b

Update docs/releases/v1.1.0.rst

77ddd4b

Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>

fix test failures

2f645b5

samukweku and others added 11 commits April 21, 2023 09:30

fix test failures for test-f.py

691f018

fix test failures for test-f.py

140fa82

mark pytest funcs

738668c

Update docs/api/dt/nth.rst

3371dfe

Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>

fix for grouped column

d8fc26f

add test for grouped column

7f6b10f

add test for grouped column

6f0c35c

add fexpr.rst

2c01928

default 0 for nth

82c1876

fix fexpr.cc

8e1d633

early work for skipna

0bc78d9

samukweku force-pushed the samukweku/nth_function branch from 7435bf5 to 0bc78d9 on April 20, 2023 23:30

samukweku added 2 commits April 24, 2023 11:48

implement skipna

a1f8575

refactor

eb2fb7b

samukweku added 2 commits May 17, 2023 20:18

Merge branch 'main' into samukweku/nth_function

4e7f524

fix test fails

b8b5c1e
