Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Column renaming #3313

Closed
wants to merge 30 commits into from
Closed

[ENH] Column renaming #3313

wants to merge 30 commits into from

Conversation

samukweku
Copy link
Collaborator

Implementation for column renaming

from datatable import dt, f, by

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
data = {'ID': ["x%d" % r for r in range(10)],
             'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
             'ExamYear': [2007, 2007, 2007, 2008, 2008,
                          2008, 2008, 2009, 2009, 2009],
             'Class': ['algebra', 'stats', 'bio', 'algebra',
                       'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
             'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                              'yes', 'yes', 'yes', 'yes', 'yes'],
             'Passed': ['yes' if x > 50 else 'no' for x in grades],
             'Employed': [True, True, True, False,
                          False, False, False, True, True, False],
             'Grade': grades}

df = dt.Frame(data)

# proposal via this PR
df[:, dt.mean(f.Grade), by((f.ExamYear < 2009).alias('grp'))]

   |   grp    Grade
   | bool8  float64
-- + -----  -------
 0 |     0  60.6667
 1 |     1  70.8571
[2 rows x 2 columns]

@samukweku samukweku added the new feature Feature requests for new functionality label Jul 15, 2022
@samukweku samukweku self-assigned this Jul 15, 2022
@samukweku
Copy link
Collaborator Author

@oleksiyskononenko early stages on this PR; the idea is formalised; however, your review will be helpful and guide me in the right direction.

I have also noticed some issues regarding the methods, which I'll bring up in this PR after your review

@oleksiyskononenko
Copy link
Contributor

Do you see any benefits renaming as() to alias()? It seems to me that alias is an an alternative name, while here we create a new name, so as() could be more appropriate.

src/core/expr/fexpr_alias.cc Outdated Show resolved Hide resolved
tests/dt/test-alias.py Outdated Show resolved Hide resolved
base_frame.add_column(wf.retrieve_column(i),
std::string(),
gmode);
base_frame.rename(names_[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that .rename() should be able to rename a set of columns to different names. Right now this method only accepts one name for all columns and in some cases it actually adds a prefix...

@samukweku
Copy link
Collaborator Author

Do you see any benefits renaming as() to alias()? It seems to me that alias is an an alternative name, while here we create a new name, so as() could be more appropriate.

@oleksiyskononenko as is a python keyword, hence my issue. Alternatively, if we could just make as a method, we can possibly avoid this issue. How to do that though is beyond me (I made too many errors; so I took the easy way out).

As always, I am open to learning how to do things, so pls feel free to guide me

samukweku and others added 2 commits July 19, 2022 21:17
Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>
Co-authored-by: oleksiyskononenko <35204136+oleksiyskononenko@users.noreply.github.com>
@oleksiyskononenko
Copy link
Contributor

Yeah, you're right. If we introduce a function dt.as() then we could potentially have issues when doing from datatable import as. So here we can only support .as() as an f-method.

At the same time we already have a couple of functions with the same names as the python's built-ins: dt.min(), dt.max() and dt.sum(). However, to make them work properly we had to add wrappers that would decide which function (dt or python) to call:
https://github.com/h2oai/datatable/blob/main/src/datatable/expr/reduce.py#L113-L142

Unfortunately, I don't think it is possible to have such a wrapper for a python's keyword.

@samukweku
Copy link
Collaborator Author

If we can implement it as a method only on f, similar to extend and remove, I feel we can avoid the keyword issue. Unfortunately I am not knowledgable enough to implement that

@samukweku
Copy link
Collaborator Author

By the way, @oleksiyskononenko is there a difference between FExpr and Expr. I always thought they were the same, but it seems there is an Expr object and FExpr

@samukweku
Copy link
Collaborator Author

samukweku commented Jul 19, 2022

Also, the alias idea I borrowed from pyspark and Polars libraries

@oleksiyskononenko
Copy link
Contributor

@samukweku yes, you can think of FExpr as the new version of Expr. See this issue: #2562

@oleksiyskononenko
Copy link
Contributor

@samukweku No worries, we will figure out how to move forward with this PR. Let's first finalize #3310 and #3311, then we will come back and implement a proper column renaming.

oleksiyskononenko and others added 6 commits August 10, 2022 17:46
It appears as though we had some incorrect directions in documentation for `first()` and `last()`. In this PR we fix them and also make some other minor improvements to the text.
In #3288 we seem to miss datetime types. In this PR we add support for `date32` and `time64` types in `cummin()` and `cummax()` functions.

WIP for #3279
Adjust our custom theme in a way similar to `sphinx_rtd_theme`, see  readthedocs/sphinx_rtd_theme#1021. This fixes the search functionality for sphinx `4.*`. We can take care of sphinx `5.*`, that was recently released later, if needed.

Closes #3299
oleksiyskononenko and others added 14 commits August 10, 2022 17:46
Remove unused `gby` in the case when `dt.unique()` is called in the group by context.
When we've been working on #3284, we missed void grouped columns support for `dt.nunique()`. This PR fixes it.

Closes #3284
…ent column types (#3319)

Frame's method `.sum()` now returns the same column types as the corresponding `dt.sum()` reducer.

Closes #2904
Cosmetic improvements of docs for `cumcount()` and `ngroups()`.

WIP for #3279
Improve "Using datatable" section by adding more consistency to the code and fixing the text. In future, we may also want to add a sample "in.csv" file , so that all the code examples could really be copy-pasted to python for execution.
…3324)

Label ids for both FTRL and LinearModel are stored as `int32` column, so It makes no sense to use `ARR64` rowindex to identify the new labels. In this PR we safely change `RowIndex::ARR64` to `RowIndex::ARR32` when creating new labels for classification problems.
…ils` (#3321)

Currently our Jenkins is using macOS Big Sur, and in order to make OS coverage as large as possible we switch AppVeyor to use macOS Monterey. Also, in this PR we replace the deprecated `distutils` module with `sysconfig` in order to get the proper platform tag.

Closes #3322 
Closes #3177
…els (#3327)

Support for `manylinux2010` image that we're using to build datatable on `x86_64` is about to be dropped (pypa/manylinux#1281), so we switch to `manylinux2014` that we're already using on `ppc64le`. 

Also, Python 3.7 will reach its end of life soon, hence we switch to Python 3.8 when generating debug wheels. In principle, we can generate debug wheels for all the supported Python versions, however, this will significantly slow down our building pipeline.
In this PR we adjust AppVeyor builds to 
- enable `pyarrow` tests;
- on Windows, enable C++ tests by testing debug wheels for Python 3.9;
- on Windows, fix builds to properly report failures;
- for consistency, rename `DTTEST` to `DT_TEST` and namespace `dttest` to `dt::tests`.

Note, that, when enabled, the C++ tests [redefine](https://github.com/h2oai/datatable/blob/main/src/core/utils/tests.h#L91-L100) `protected` and `private` keywords to `public`. This is a pretty dangerous approach, that we might need to reconsider, because this redefinition only happens in the files which include `utils/tests.h`. On Windows, for instance, this caused a pile of linking errors due to the fact that some methods were expected to be `public`, but were declared as `protected` or `private`.
…3332)

- make signatures of the functions referenced in the `FExpr` API section to be consistent with the actual signatures of the `dt.*()` functions;
- couple of other minor fixes.
@samukweku
Copy link
Collaborator Author

bungled this, closing this and creating a new one : #3333

@samukweku samukweku closed this Aug 10, 2022
@samukweku samukweku deleted the samukweku/as branch October 29, 2022 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature Feature requests for new functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FR] Option to add a name to grouping in by, especially for boolean expressions
2 participants