[ENH] `nth` function #3346

samukweku · 2022-09-03T22:00:15Z

Implement dt.nth(cols, n=0) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, NA-row is returned.

Closes #3128
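The semantics described above can be modelled in plain Python. This is only a sketch, not the PR's code: `nth` and `nth_per_group` below are hypothetical helpers, `None` stands in for NA, and treating a negative `n` as counting from the end is an assumption borrowed from the pandas function the PR is modelled on.

```python
from typing import Dict, Optional, Sequence, TypeVar

T = TypeVar("T")


def nth(values: Sequence[T], n: int = 0) -> Optional[T]:
    """Return the n-th element, or None (standing in for NA) if n is out of bounds."""
    size = len(values)
    index = n + size if n < 0 else n  # assumption: negative n counts from the end
    if 0 <= index < size:
        return values[index]
    return None


def nth_per_group(groups: Dict[str, Sequence[T]], n: int = 0) -> Dict[str, Optional[T]]:
    """Apply nth() to every group independently."""
    return {key: nth(rows, n) for key, rows in groups.items()}


groups = {"a": [1, 2, 3], "b": [10]}
print(nth_per_group(groups, 0))   # {'a': 1, 'b': 10}
print(nth_per_group(groups, 1))   # {'a': 2, 'b': None}
print(nth_per_group(groups, -1))  # {'a': 3, 'b': 10}
```

Group `b` has a single row, so `n=1` falls out of bounds there and yields the NA placeholder.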

samukweku · 2022-09-03T22:01:13Z

@oleksiyskononenko MVP implementation of the nth function, similar to pandas' nth function/dplyr's nth function. Feedback appreciated before adding docs

oleksiyskononenko · 2022-09-07T21:16:59Z

I wonder if you want to replace first()/last() with nth()? Like nth(0) would mean first() and nth(-1) would mean last()?

samukweku · 2022-09-07T23:34:23Z

@oleksiyskononenko not a bad idea, as it is more generic.

oleksiyskononenko · 2022-09-08T20:11:11Z

So if you look at how the first() / last() functions work, you will see that it is a purely virtual column and not even a rowindex on the source column: https://github.com/h2oai/datatable/blob/main/src/core/expr/head_reduce_unary.cc#L112-L165

FirstLast_ColumnImpl::get_element() just gets the first or last element from a group. What you need to do in the case of the nth(n=...) function is to see how your n compares to the group size and return NA if it is out of bounds.

In the case the column is grouped, first() / last() immediately return an NA column for a zero-row frame, or the source column otherwise (note that the first or last element of each group in a grouped column is actually the grouped column itself): https://github.com/h2oai/datatable/blob/main/src/core/expr/head_reduce_unary.cc#L179-L182

So for the nth(n=...) function on a grouped column, n can only be 0 or -1; otherwise, an NA column should be returned.

I suggest we convert the implementation in this PR to something similar to what we have for first() / last().
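To make the virtual-column idea concrete, here is a rough Python model of it. It is not the datatable C++ code: `NthColumn` and its `offsets` layout are hypothetical, and `None` stands in for NA. The point is that each element is computed on demand from the group boundaries, with the bounds check done inside `get_element()`.

```python
from typing import List, Optional, Sequence


class NthColumn:
    """Hypothetical virtual column: nothing is precomputed or copied."""

    def __init__(self, source: Sequence, offsets: List[int], n: int):
        # offsets[i] and offsets[i + 1] delimit the rows of group i in `source`
        self.source = source
        self.offsets = offsets
        self.n = n

    def __len__(self) -> int:
        # One output element per group
        return len(self.offsets) - 1

    def get_element(self, i: int) -> Optional[object]:
        start, end = self.offsets[i], self.offsets[i + 1]
        size = end - start
        # Compare n against the group size; negative n counts from the end
        pos = self.n + size if self.n < 0 else self.n
        if not 0 <= pos < size:
            return None  # out of bounds, so NA
        return self.source[start + pos]


source = [5, 6, 7, 8, 9, 10]
offsets = [0, 3, 4, 6]  # groups: [5, 6, 7], [8], [9, 10]
col = NthColumn(source, offsets, n=1)
print([col.get_element(i) for i in range(len(col))])  # [6, None, 10]
```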

samukweku · 2022-09-08T21:10:26Z

@oleksiyskononenko what's the disadvantage of using a rowindex for this function? Performance? Or something else?

oleksiyskononenko · 2022-09-08T21:12:10Z

Yeah, the current implementation uses as much memory as is needed for

Buffer buf = Buffer::mem(gby.size() * sizeof(int32_t));

When you do it virtual as for first() / last(), you won't really need additional memory.
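For a sense of scale, the size of that buffer is simple arithmetic: one `int32_t` per group. The helper below is a hypothetical illustration, and the group count is made up.

```python
INT32_SIZE = 4  # bytes, i.e. sizeof(int32_t)


def rowindex_buffer_bytes(ngroups: int) -> int:
    """Bytes needed for a rowindex buffer holding one int32 per group."""
    return ngroups * INT32_SIZE


print(rowindex_buffer_bytes(10_000_000))  # 40000000 bytes, roughly 38 MiB
```

A virtual column avoids this allocation because each element is computed when it is requested.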

samukweku · 2022-09-08T21:23:29Z

Thanks @oleksiyskononenko

While #2562 is still WIP, we need to have proper documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.

WIP for #3279

Cumulative functions access the column's data in parallel. If the input column appears to be a `Latent_ColumnImpl`, materialization of such a column happens in parallel. However, `Latent_ColumnImpl::materialize()` is not a thread-safe method, and invoking it from multiple threads results in a segfault. In this PR we `vivify()` the source column, i.e. materialization, if necessary, happens in one thread only, before processing it in parallel. Closes #3345
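The pattern of materializing once before the parallel section can be sketched in Python as follows. This is an analogy, not datatable's code: `LazyColumn` is a hypothetical stand-in for a latent column, and the explicit `materialize()` call before the thread pool plays the role that `vivify()` plays in the fix.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyColumn:
    """Hypothetical lazily computed column."""

    def __init__(self, compute):
        self._compute = compute
        self._data = None

    def materialize(self):
        # Not thread-safe: two threads could both see None and both compute
        if self._data is None:
            self._data = self._compute()
        return self._data

    def get(self, i):
        return self.materialize()[i]


def chunk_sum(column, start, end):
    return sum(column.get(i) for i in range(start, end))


column = LazyColumn(lambda: list(range(1000)))
column.materialize()  # done once, in a single thread, before going parallel

chunks = [(i, i + 250) for i in range(0, 1000, 250)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(lambda c: chunk_sum(column, *c), chunks))
total = sum(partial)
print(total)  # 499500
```

Once the data exists, the worker threads only read it, so the unsafe branch in `materialize()` is never taken concurrently.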

As we eliminated Travis from our building pipeline (#3042), its status badge stopped working. In this PR we replace it with the AppVeyor's badge.

samukweku · 2022-09-15T11:12:43Z

@oleksiyskononenko made changes to the nth function, based on feedback. I also added a skipna parameter, to get non null values.

When you have some time, kindly have a look at the PR; your feedback is always appreciated.
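One plausible reading of `skipna` is sketched below in plain Python. The comment above does not spell out the exact semantics, so this is an assumption: missing values (`None` here) are dropped first, and `n` then indexes into what remains. `nth_skipna` is a hypothetical helper, not the PR's function.

```python
def nth_skipna(values, n=0):
    """Return the n-th non-missing value, or None if there are not enough of them."""
    non_null = [v for v in values if v is not None]
    size = len(non_null)
    index = n + size if n < 0 else n  # assumption: negative n counts from the end
    return non_null[index] if 0 <= index < size else None


print(nth_skipna([None, 4, None, 7], 0))  # 4
print(nth_skipna([None, 4, None, 7], 1))  # 7
print(nth_skipna([None, 4, None, 7], 2))  # None
print(nth_skipna([None, None], 0))        # None
```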

samukweku · 2022-09-17T09:21:27Z

I made a blunder on this and should have just done a git pull 🤦

src/core/expr/fexpr.cc

src/core/expr/fexpr_nth.cc

This PR implements column's aliasing as proposed in #2684. We couldn't name the method `.as()` though, because `as` is a built-in python keyword — hence, we use `.alias()` instead. Column aliasing is now also available in the group-by clause. Closes #2504

Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691

Allow column names to be missing when detecting a header in CSV files. Closes #3363

Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691

- use `col`/`cols` as a parameter name when dt function supports single/multiple column(s);
- convert `dt.shift()` documentation into standard format;
- cosmetics.

Implement `dt.categories()` to get categories for categorical columns. WIP for #1691

…#3368)

- fix "See also" section for categorical types;
- improve `cbind()`/`rbind()` documentation.

It seems that `dt.categories()` is the first datatable function, that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for the cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have different number of underlying categories. In this PR we

- add some basic support for `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns, that in the case of uneven number of rows are promoted to `Grouping::GtoALL`;
- do minor refactoring in `dt.alias()` function.

WIP for #1691

Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has been also changed to signed integers. WIP for #1691

…tegorical columns (#3372)

In this PR we

- implement casts from `dt.cat*(...)` to all of the basic types;
- as a consequence, support for converting categorical columns to CSV has been added.

WIP for #1691

Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`. WIP for #1691

#3344) Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`. WIP for #3279

Since python `3.6` reached its end of life, update documentation to `3.7+`. Closes #3376

…and `cummax()` (#3381)

Add `reverse` parameter to control direction of cumulative function's calculations:

- when `False`, calculation is done from top to bottom (default);
- when `True`, calculation is done from bottom to top.

Closes #3279
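The effect of `reverse` can be illustrated with a plain-Python cumulative sum. This is a sketch and not datatable's implementation: the `cumsum` helper is hypothetical, missing values are ignored, and keeping each result aligned with its original row is an assumption about the output layout.

```python
from itertools import accumulate


def cumsum(values, reverse=False):
    """Cumulative sum, top to bottom by default or bottom to top when reverse=True."""
    if reverse:
        # Accumulate from the last row upwards, then restore the original row order
        return list(accumulate(reversed(values)))[::-1]
    return list(accumulate(values))


print(cumsum([1, 2, 3, 4]))                # [1, 3, 6, 10]
print(cumsum([1, 2, 3, 4], reverse=True))  # [10, 9, 7, 4]
```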

It seems that `na_position` parameter was missing in the `dt.sort()` documentation for some reason, though it was referred to in the examples. In this PR we fix this issue.

It appears as though we never initialized `na_position_` in the case of `dt.by()`, and this resulted in some random data corruption for columns, that contain missing values. As of this PR, we initialize `na_position_` to `NaPosition::FIRST` to be consistent with what we declare in `dt.by()` [documentation](https://datatable.readthedocs.io/en/latest/api/dt/by.html):

```
The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values.
```

Also, we switch to python 3.8 for testing debug wheels, so that we keep track of the status of the mentioned groupby tests.

Closes #3331
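The documented ordering (ascending, with NA before any other value) can be mimicked in plain Python. This is only an illustration of the ordering rule, with `None` standing in for NA and `sort_groups_na_first` being a hypothetical helper.

```python
def sort_groups_na_first(keys):
    """Sort ascending, placing None (standing in for NA) before all other values."""
    # The first tuple element is False for None, so NA keys sort first;
    # the 0 is a placeholder that is never compared against a real key.
    return sorted(keys, key=lambda k: (k is not None, k if k is not None else 0))


print(sort_groups_na_first([3, None, 1, 2]))  # [None, 1, 2, 3]
```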

- make codebase compatible with Python 3.11 changes;
- only use the required adopted code from `python/pythoncapi-compat`;
- remove obsolete code for the older Python versions;
- switch to Python 3.11 on AppVeyor for Windows and Linux;
- add Python 3.11 support to Jenkins.

Closes #3374

WIP for #2562 Closes #3390

Closes #3392

)

- the fix for #3390 has been already pushed as a part of #3388, here we merely add a corresponding test;
- as of #2472, `f[:]` excludes the groupby columns, in this PR we make the corresponding adjustments to the docs.

Closes #3390

WIP for #2562


samukweku · 2023-01-03T02:37:59Z

just could not resolve the conflicts - moved to #3403

sammychoco added 5 commits September 2, 2022 14:32

skeleton for nth integers

6b80fb7

seketon logic for nth, using rowindex

9359413

add tests

08d04ac

add method and tests for method

784c3f7

update

41ac89d

samukweku added the new feature Feature requests for new functionality label Sep 3, 2022

samukweku requested a review from oleksiyskononenko September 3, 2022 22:00

samukweku self-assigned this Sep 3, 2022

sammychoco added 2 commits September 4, 2022 08:03

add newline

081c462

add newline

b6a4f79

sammychoco and others added 7 commits September 15, 2022 19:25

use VirtualColumn implementation instead of RowIndex

0923a80

FExpr/Expr adjustments in the docs (#3347)

135b4c6

While #2562 is still WIP, we need to have proper documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.

Minor docs adjustments to cumcount() and ngroup() (#3349)

a4d84f3

WIP for #3279

Fix status badge in documentation (#3351)

7a0e192

As we eliminated Travis from our building pipeline (#3042), its status badge stopped working. In this PR we replace it with the AppVeyor's badge.

skeleton for skipna logic

5118a0d

skipna added

36ec769

Merge branch 'main' into samukweku/nth

e4c2c91

oleksiyskononenko reviewed Sep 19, 2022


src/core/expr/fexpr.cc (outdated)

src/core/expr/fexpr_nth.cc (outdated)

samukweku and others added 2 commits September 20, 2022 17:39

add nth method

565ba32

oleksiyskononenko and others added 25 commits January 3, 2023 12:07

Fix casting void columns to categoricals (#3362)

6eb48a2

Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691

Improve header detection heuristics in fread() (#3364)

bae12ea

Allow column names to be missing when detecting a header in CSV files. Closes #3363

Implement casting of the most column types to categoricals (#3365)

93c2244

Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691

Use col/cols and convert shift docs to standard format (#3366)

fb0df58

- use `col`/`cols` as a parameter name when dt function supports single/multiple column(s);
- convert `dt.shift()` documentation into standard format;
- cosmetics.

Implement dt.categories() (#3367)

098c410

Implement `dt.categories()` to get categories for categorical columns. WIP for #1691

Fix "See also" sections for cat* types and cbind()/rbind() docs (…

fd310d0

…#3368)

- fix "See also" section for categorical types;
- improve `cbind()`/`rbind()` documentation.

Implement dt.codes() (#3371)

ae6e9c6

Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has been also changed to signed integers. WIP for #1691

Implement casts from categorical types, add to_csv() support for ca…

690e094

…tegorical columns (#3372)

In this PR we

- implement casts from `dt.cat*(...)` to all of the basic types;
- as a consequence, support for converting categorical columns to CSV has been added.

WIP for #1691

Adjust copyright years in types/type_*.cc

ad728aa

Implement statistics for categorical columns (#3373)

a1dd57c

Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`. WIP for #1691

[ENH] Enhance dt.fillna() to support filling with a particular value (

20e3da1

#3344) Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`. WIP for #3279

Update documentation regarding removal of python 3.6 (#3377)

c11a122

Since python `3.6` reached its end of life, update documentation to `3.7+`. Closes #3376

Add parameter na_position to dt.sort() documentation (#3389)

020961e

It seems that `na_position` parameter was missing in the `dt.sort()` documentation for some reason, though it was referred to in the examples. In this PR we fix this issue.

Refactor sum() and prod() reducers to use FExpr (#3388)

dbad9c5

WIP for #2562 Closes #3390

Enable macOS on AppVeyor for py311 (#3395)

f220005

Closes #3392

Refactor .extend() and .remove() to use FExpr (#3393)

56317c6

WIP for #2562

add method and tests for method

da2c467

FExpr/Expr adjustments in the docs (#3347)

e3f5fbd

While #2562 is still WIP, we need to have proper documentation for both `Expr` and `FExpr` input/return types. This PR fixes it.

samukweku closed this Jan 3, 2023

samukweku deleted the samukweku/nth branch January 3, 2023 02:42

oleksiyskononenko mentioned this pull request Jan 10, 2023

[ENH] nth function #3404

Open

st-pasha removed this from the Release 1.1.0 milestone Nov 14, 2023
