[ENH] Column renaming #3313

samukweku · 2022-07-15T07:41:59Z

Implementation for column renaming

from datatable import dt, f, by

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
data = {'ID': ["x%d" % r for r in range(10)],
        'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
        'ExamYear': [2007, 2007, 2007, 2008, 2008,
                     2008, 2008, 2009, 2009, 2009],
        'Class': ['algebra', 'stats', 'bio', 'algebra',
                  'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
        'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                         'yes', 'yes', 'yes', 'yes', 'yes'],
        'Passed': ['yes' if x > 50 else 'no' for x in grades],
        'Employed': [True, True, True, False,
                     False, False, False, True, True, False],
        'Grade': grades}

df = dt.Frame(data)

# proposal via this PR
df[:, dt.mean(f.Grade), by((f.ExamYear < 2009).alias('grp'))]

   |   grp    Grade
   | bool8  float64
-- + -----  -------
 0 |     0  60.6667
 1 |     1  70.8571
[2 rows x 2 columns]
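
The numbers in the output above can be cross-checked without datatable. The following is only a plain-Python sketch of the same grouping; `grouped_mean` is a hypothetical helper written for this check and is not part of the PR or of datatable.

```python
# Plain-Python cross-check of the grouped means; datatable is not needed.
grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
exam_year = [2007, 2007, 2007, 2008, 2008, 2008, 2008, 2009, 2009, 2009]


def grouped_mean(values, keys):
    """Return {key: mean of the values that share that key}."""
    buckets = {}
    for value, key in zip(values, keys):
        buckets.setdefault(key, []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# The key plays the role of the `grp` column: True when ExamYear < 2009.
means = grouped_mean(grades, [year < 2009 for year in exam_year])
for grp in sorted(means):
    print(int(grp), round(means[grp], 4))
```

The two printed rows should agree with the `grp`/`Grade` table above.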

samukweku · 2022-07-15T07:43:03Z

@oleksiyskononenko early stages on this PR; the idea is formalised; however, your review will be helpful and guide me in the right direction.

I have also noticed some issues regarding the methods, which I'll bring up in this PR after your review

oleksiyskononenko · 2022-07-19T10:09:26Z

Do you see any benefits in renaming as() to alias()? It seems to me that alias is an alternative name, while here we create a new name, so as() could be more appropriate.

src/core/expr/fexpr_alias.cc

tests/dt/test-alias.py

oleksiyskononenko · 2022-07-19T10:52:24Z

src/core/expr/fexpr_alias.cc

+        base_frame.add_column(wf.retrieve_column(i),
+                              std::string(),
+                              gmode);
+        base_frame.rename(names_[i]);


My feeling is that .rename() should be able to rename a set of columns to different names. Right now this method only accepts one name for all columns and in some cases it actually adds a prefix...
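
To make the suggestion concrete, here is a small Python sketch of positional, one-name-per-column renaming. It only illustrates the requested behaviour: `rename_columns` and the `(name, data)` pair layout are hypothetical and do not reflect how the C++ `.rename()` method is actually implemented.

```python
def rename_columns(columns, new_names):
    """Return the columns with new names, matched by position.

    `columns` is a list of (name, data) pairs. This is an assumed,
    simplified model of a frame, not datatable's internal structure.
    """
    if len(new_names) != len(columns):
        raise ValueError(
            "expected %d names, got %d" % (len(columns), len(new_names))
        )
    # Keep each column's data and replace only its name.
    return [(new, data) for (_, data), new in zip(columns, new_names)]


cols = [("ExamYear", [2007, 2008, 2009]), ("Grade", [48, 80, 68])]
renamed = rename_columns(cols, ["year", "score"])
print([name for name, _ in renamed])
```

Raising on a length mismatch is one possible design choice; a single name applied to every column would be a separate code path.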

samukweku · 2022-07-19T11:16:04Z

> Do you see any benefits in renaming as() to alias()? It seems to me that alias is an alternative name, while here we create a new name, so as() could be more appropriate.

@oleksiyskononenko `as` is a Python keyword, hence my issue. Alternatively, if we could make `as` a method only, we could possibly avoid this issue. How to do that, though, is beyond me (I made too many errors, so I took the easy way out).

As always, I am open to learning how to do things, so please feel free to guide me.

oleksiyskononenko · 2022-07-19T18:41:57Z

Yeah, you're right. If we introduce a function dt.as() then we could potentially have issues when doing from datatable import as. So here we can only support .as() as an f-method.

At the same time, we already have a couple of functions with the same names as Python's built-ins: dt.min(), dt.max() and dt.sum(). However, to make them work properly we had to add wrappers that decide which function (dt or Python) to call:
https://github.com/h2oai/datatable/blob/main/src/datatable/expr/reduce.py#L113-L142
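
The general shape of such a wrapper can be sketched as follows. This is a simplified illustration, not the code behind the link: `FakeExpr` and `dt_min` are hypothetical names, and the tuple returned for expressions merely stands in for whatever reducer object a real library would build.

```python
import builtins


class FakeExpr:
    """Hypothetical stand-in for a column expression such as f.Grade."""

    def __init__(self, name):
        self.name = name


def dt_min(*args, **kwargs):
    """Dispatch to an expression-aware reducer or to the built-in min().

    In a library this function would be exported under the name `min`,
    which is why it has to fall back to the built-in for ordinary input.
    """
    if len(args) == 1 and isinstance(args[0], FakeExpr):
        # Assumed behaviour: return a marker instead of a real reducer.
        return ("min-reducer", args[0].name)
    return builtins.min(*args, **kwargs)


print(dt_min(3, 1, 2))
print(dt_min([4, 2, 9]))
print(dt_min(FakeExpr("Grade")))
```

The dispatch works because `min` is an ordinary name that can be rebound; the wrapper keeps plain Python calls behaving as before.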

Unfortunately, I don't think it is possible to have such a wrapper for a Python keyword.
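
A quick way to see how the Python parser treats `as` is to compile a few candidate spellings without running them. The sketch below uses only the standard library; `parses` is a hypothetical helper, and the datatable names appear only inside strings, so nothing is imported or executed.

```python
import keyword


def parses(source):
    """Return True if `source` compiles as Python, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(keyword.iskeyword("as"))                      # True
print(keyword.iskeyword("alias"))                   # False
print(parses("from datatable import min"))          # True
print(parses("from datatable import as"))           # False
print(parses("f.ExamYear.as('grp')"))               # False
print(parses("f.ExamYear.alias('grp')"))            # True
print(parses("getattr(f.ExamYear, 'as')('grp')"))   # True
```

Note that the dotted form `.as(...)` is rejected by the parser as well, so a method with that name would only be reachable through `getattr`, which may be worth weighing when choosing between `as` and `alias`.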

samukweku · 2022-07-19T22:07:11Z

If we can implement it as a method only on f, similar to extend and remove, I feel we can avoid the keyword issue. Unfortunately, I am not knowledgeable enough to implement that.

samukweku · 2022-07-19T22:08:57Z

By the way, @oleksiyskononenko, is there a difference between FExpr and Expr? I always thought they were the same, but it seems there is an Expr object and an FExpr object.

samukweku · 2022-07-19T22:10:17Z

Also, the alias idea I borrowed from pyspark and Polars libraries

oleksiyskononenko · 2022-07-19T22:13:28Z

@samukweku yes, you can think of FExpr as the new version of Expr. See this issue: #2562

oleksiyskononenko · 2022-07-20T05:20:41Z

@samukweku No worries, we will figure out how to move forward with this PR. Let's first finalize #3310 and #3311, then we will come back and implement a proper column renaming.

Adjust our custom theme in a way similar to `sphinx_rtd_theme`, see readthedocs/sphinx_rtd_theme#1021. This fixes the search functionality for sphinx `4.*`. We can take care of the recently released sphinx `5.*` later, if needed. Closes #3299

Improve "Using datatable" section by adding more consistency to the code and fixing the text. In the future, we may also want to add a sample "in.csv" file, so that all the code examples could really be copy-pasted to Python for execution.

…3324) Label ids for both FTRL and LinearModel are stored as an `int32` column, so it makes no sense to use an `ARR64` rowindex to identify the new labels. In this PR we safely change `RowIndex::ARR64` to `RowIndex::ARR32` when creating new labels for classification problems.

…ils` (#3321) Currently our Jenkins is using macOS Big Sur, and in order to make OS coverage as large as possible we switch AppVeyor to use macOS Monterey. Also, in this PR we replace the deprecated `distutils` module with `sysconfig` in order to get the proper platform tag. Closes #3322 Closes #3177

…els (#3327) Support for `manylinux2010` image that we're using to build datatable on `x86_64` is about to be dropped (pypa/manylinux#1281), so we switch to `manylinux2014` that we're already using on `ppc64le`. Also, Python 3.7 will reach its end of life soon, hence we switch to Python 3.8 when generating debug wheels. In principle, we can generate debug wheels for all the supported Python versions, however, this will significantly slow down our building pipeline.

In this PR we adjust AppVeyor builds to
- enable `pyarrow` tests;
- on Windows, enable C++ tests by testing debug wheels for Python 3.9;
- on Windows, fix builds to properly report failures;
- for consistency, rename `DTTEST` to `DT_TEST` and namespace `dttest` to `dt::tests`.

Note that, when enabled, the C++ tests [redefine](https://github.com/h2oai/datatable/blob/main/src/core/utils/tests.h#L91-L100) `protected` and `private` keywords to `public`. This is a pretty dangerous approach that we might need to reconsider, because this redefinition only happens in the files which include `utils/tests.h`. On Windows, for instance, this caused a pile of linking errors due to the fact that some methods were expected to be `public`, but were declared as `protected` or `private`.

samukweku · 2022-08-10T10:20:11Z

Bungled this; closing this and creating a new one: #3333

sammychoco added 7 commits July 14, 2022 14:49

skeleton

ffaccfd

skeleton base

d6edc7c

base implementation of as

c429373

use alias as function name

5e6f9e5

add tests

247439b

add tests for method .alias

f4c7287

add docs links

cb05a73

samukweku added the new feature label Jul 15, 2022

samukweku requested a review from oleksiyskononenko July 15, 2022 07:42

samukweku self-assigned this Jul 15, 2022

add whitespace

fccdd11

oleksiyskononenko reviewed Jul 19, 2022

src/core/expr/fexpr_alias.cc (outdated, resolved)

tests/dt/test-alias.py (outdated, resolved)

oleksiyskononenko reviewed Jul 19, 2022

samukweku and others added 2 commits July 19, 2022 21:17

Update tests/dt/test-alias.py

6ce12db

Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>

Update src/core/expr/fexpr_alias.cc

15f7bc2

Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>

oleksiyskononenko and others added 6 commits August 10, 2022 17:46

Improve documentation for first() and last() (#3312)

e5ad7a7

It appears as though we had some incorrect directions in documentation for `first()` and `last()`. In this PR we fix them and also make some other minor improvements to the text.

Add DT_DISABLE changelog record

d4568c2

Fix a broken link on the documentation page Creating a new FExpr

2c2c64f

Add support for datetime in cumminmax (#3314)

24c0c51

In #3288 we seem to miss datetime types. In this PR we add support for `date32` and `time64` types in `cummin()` and `cummax()` functions. WIP for #3279

Remove cumsum() and cummax() from "Missing functionality" in pand…

cb22dd2

…as comparison

oleksiyskononenko and others added 14 commits August 10, 2022 17:46

Clean-up dt.nunique() internals (#3317)

4504d01

Remove unused `gby` in the case when `dt.unique()` is called in the group by context.

Support void grouped columns in dt.nunique() (#3318)

2e7b9c2

When we've been working on #3284, we missed void grouped columns support for `dt.nunique()`. This PR fixes it. Closes #3284

[ENH] Add cumcount() and ngroup() functions (#3310)

201c2a6

WIP for #3279 Closes #2892

Make Frame's method .sum() and dt.sum() reducer to return consist…

2e4248b

…ent column types (#3319) Frame's method `.sum()` now returns the same column types as the corresponding `dt.sum()` reducer. Closes #2904

Improve docs for cumcount() and ngroups() (#3320)

30f9fa2

Cosmetic improvements of docs for `cumcount()` and `ngroups()`. WIP for #3279

Support freading from public S3 buckets (#3325)

de9e024

Closes #3302

Fix signatures of the dt functions referenced in the FExpr API (#…

7a9158a

…3332) - make signatures of the functions referenced in the `FExpr` API section to be consistent with the actual signatures of the `dt.*()` functions; - couple of other minor fixes.

Enhance AppVeyor.yml

7232e82

Fix main builds on AppVeyor

d8507e0

samukweku force-pushed the samukweku/as branch from 11b352c to d8507e0 Compare August 10, 2022 09:58

samukweku closed this Aug 10, 2022

samukweku deleted the samukweku/as branch October 29, 2022 09:02
