[ENH] Column aliasing #3333

Merged
merged 11 commits into from Sep 20, 2022

samukweku
Collaborator

@samukweku samukweku commented Aug 10, 2022

This PR implements column aliasing as proposed in #2684. We couldn't name the method .as(), though, because as is a reserved Python keyword; hence, we use .alias() instead. Column aliasing is now also available in the group-by clause.

Closes #2504

@samukweku samukweku added the new feature Feature requests for new functionality label Aug 10, 2022
@samukweku samukweku self-assigned this Aug 10, 2022
@samukweku samukweku mentioned this pull request Aug 10, 2022
@oleksiyskononenko
Contributor

oleksiyskononenko commented Aug 11, 2022

@samukweku So what if we only introduce an f-method .as() but do not introduce a corresponding dt.as() function? Then we won't have to deal with the issues related to the Python built-in keyword as.

Otherwise, my feeling is that

dt.alias(f.C0, "newcol")

is not much better than just

{"newcol" : f.C0}

The latter even looks cleaner to me.

Btw, one of the tests is failing due to some documentation variable being missing.

@samukweku
Collaborator Author

@oleksiyskononenko that is so much better/cleaner and was the original idea of the PR. I just could not figure out how to write methods off the f symbol.

@samukweku
Collaborator Author

@oleksiyskononenko how do I go about implementing f.as? or do you have plans for this that you would like to implement?

@oleksiyskononenko
Contributor

oleksiyskononenko commented Aug 12, 2022

@samukweku I guess you have already implemented many of the f-methods here: https://github.com/h2oai/datatable/blob/main/src/core/expr/fexpr.cc However, your implementations normally did two things:

  • to import a function from the main datatable module;
  • invoke this function.

Here you won't be able to do this, because we're not going to implement any new dt.as() function. So the way to go is to implement f.as() only as py::oobj PyFExpr::as(const XArgs&). Just like all these functions: https://github.com/h2oai/datatable/blob/main/src/core/expr/fexpr.cc#L235-L287 No need to implement an additional static py::oobj fn_as(const py::XArgs& args).

@oleksiyskononenko
Contributor

So essentially you just need to move your implementation from

static py::oobj pyfn_alias(const py::XArgs &args) {
  YOUR_IMPLEMENTATION
}

to

oobj PyFExpr::alias(const XArgs& args) {
  YOUR_IMPLEMENTATION
}

and make some other adjustments, so that it works.

@samukweku
Collaborator Author

been busy these past days on some other stuff ... I'll try and resume on this on the weekend

@samukweku
Collaborator Author

@oleksiyskononenko I'm not even sure what I'm doing here; I don't know how to directly set up a method on the f function. If you do not mind, kindly point me in the right direction. I could not follow along with the re.match or re.len examples

@oleksiyskononenko
Contributor

@samukweku Sure, let me rewrite what you've got here, so that you will see.

@samukweku
Collaborator Author

I will greatly appreciate that @oleksiyskononenko

@samukweku samukweku mentioned this pull request Aug 25, 2022
8 tasks
@oleksiyskononenko
Contributor

oleksiyskononenko commented Aug 30, 2022

@samukweku So I've fixed what you currently have to only support the f.as() method, but it appears as though Python won't support any method called as() and issues a SyntaxError immediately. So you're right and we need to invent some other name for this functionality. Another question is if we also want to support dt.as() or not:

  • if not, then I can just push what I currently have and we just rename f.as() to something else;
  • if yes, then we need to go back to your original implementation of dt.alias() / f.alias(), but I'm not sure what was wrong with it.
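
The SyntaxError mentioned above can be reproduced without datatable. The following is a minimal sketch using only the Python standard library; the helper `parses` is a hypothetical name introduced for illustration, and the expressions are only parsed, never evaluated, so no `f` object is needed.

```python
import keyword

# `as` is a reserved word, so it cannot be used as an attribute or method name.
print(keyword.iskeyword("as"))     # True
print(keyword.iskeyword("alias"))  # False


def parses(source: str) -> bool:
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# The parser rejects `.as(...)` before any datatable code could run.
as_ok = parses('f.C0.as("newcol")')
alias_ok = parses('f.C0.alias("newcol")')
print(as_ok, alias_ok)
```

Because the failure happens at parse time, no implementation on the C++ side can make a method named `as` callable with normal attribute syntax.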

@samukweku
Collaborator Author

@oleksiyskononenko having it as a method is better and cleaner; it will work the same way as remove/extend. f.alias is a good fit.

@oleksiyskononenko
Contributor

Having it as an f-method and also as a dt function will probably not hurt. I'm also starting to think that since we already have both dt.as_type() and f.as_type(), why don't we call the new functionality .as_name()?

@samukweku
Collaborator Author

samukweku commented Aug 31, 2022

Not sure as_name is intuitive enough. The alternatives are rename, as in pandas/dplyr/julia, or alias, as in pyspark and pypolars. I'm thinking of names that are already familiar to dataframe users.

@samukweku
Collaborator Author

@oleksiyskononenko if we leave it as is (allowing for both a function and a method), are we agreed on what the name should be? rename? alias? as_name? (not a fan of this last one, tbh) @pradkrish @vopani @Peter-Pasta any thoughts on this?

@oleksiyskononenko
Contributor

My feeling is that

  • rename is normally to rename existing columns in the frame;
  • alias is an alternative name to a column, i.e. it could be accessed with two or more different names;
  • as is to select a column as something else, i.e. as_type() — as a different type, as_name() — under a different name, etc.

Since we can't use as due to Python restrictions, we should probably be more specific and indicate what exactly that "something else" is going to be.

@samukweku
Collaborator Author

samukweku commented Sep 11, 2022

as, as used in SQL, is to rename a column, so we are not far off if we use rename/alias

Also with this function we can rename columns and alias them. So it does not matter much IMO. alias just seems more direct - basically saying my original name is C0, but U can call me Jeremy now

@oleksiyskononenko
Contributor

@samukweku yeah, let me push updates to this PR to fix a couple of things, then we can decide on the name.

@oleksiyskononenko
Contributor

So I've re-factored the function, while keeping the name dt.alias(). Made it work for both dt.alias() and FExpr.alias() cases, streamlined some code and improved tests.

Please see if everything makes sense to you and don't hesitate to ask any questions.

What we're missing here is documentation and also the final agreement on the function name.

While we can leave it as alias, it may still be confusing. For instance, there is a pandas issue on column aliasing with a totally different meaning: pandas-dev/pandas#11723

One of the options for the name I came up with is name_as(). For f-methods it will look just fine: something like f[0].name_as("count") or f[:].name_as(["currency", "count"]) is pretty intuitive. However, it won't look so good if we use it through dt.name_as(f.C0, "count").

@samukweku
Collaborator Author

samukweku commented Sep 17, 2022

Thanks @oleksiyskononenko . I believe your name suggestion (name_as) is right. However, it should be only a method on the f symbol, and not as a dt function

@samukweku
Collaborator Author

@oleksiyskononenko I think we should go ahead and use the name_as but only as a method on f. If you don't mind, can you show how to do this, and I can use that as a template for future work maybe?

@oleksiyskononenko
Contributor

@samukweku It's not a big deal to have only the f-method; however, I don't think it will hurt if we keep a dt function as well. For instance, as_type() is available in both forms: dt.as_type() and f.as_type().

@oleksiyskononenko
Contributor

The only advantage I see in keeping this as an f-method only is that we could then accept names not just as a list/tuple/string, but also as a variable number of arguments, i.e. changing the signature from

name_as(names)

to

name_as(*names)

Then, instead of doing f[0:2].name_as(["C0", "C1"]) one could simply do f[0:2].name_as("C0", "C1").
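
To illustrate the signature change, here is a minimal pure-Python sketch of a method that accepts both call styles. `ColumnSelection` is a hypothetical stand-in invented for this example, not datatable's `FExpr`, and the check that the number of names matches the number of columns is an assumption rather than documented behaviour.

```python
class ColumnSelection:
    """Hypothetical stand-in for an f-expression selecting `ncols` columns."""

    def __init__(self, ncols):
        self.ncols = ncols
        self.names = None

    def name_as(self, *names):
        # Keep the old call style working: a single list/tuple argument
        # is unpacked, so name_as(["C0", "C1"]) == name_as("C0", "C1").
        if len(names) == 1 and isinstance(names[0], (list, tuple)):
            names = tuple(names[0])
        if not all(isinstance(n, str) for n in names):
            raise TypeError("column names must be strings")
        # Assumption for this sketch: one name per selected column.
        if len(names) != self.ncols:
            raise ValueError(f"expected {self.ncols} names, got {len(names)}")
        renamed = ColumnSelection(self.ncols)
        renamed.names = list(names)
        return renamed


as_list = ColumnSelection(2).name_as(["C0", "C1"])
as_varargs = ColumnSelection(2).name_as("C0", "C1")
print(as_list.names == as_varargs.names)
```

Unpacking a lone list or tuple keeps existing calls valid while allowing the shorter variadic form.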

@samukweku
Collaborator Author

@oleksiyskononenko the variable args is great! Totally forgot about it. I think it is a great idea for both function and method. I think name_as is ok - I'll always be biased toward alias 😁

@oleksiyskononenko
Contributor

Unfortunately, with the function dt.name_as(cols, names) it could be messy to use variable arguments. That's because both cols and names could consist of more than one element.
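
One way to see the problem, as a rough sketch: if a function took both columns and names as one flat variadic list, the boundary between them would not be recoverable from the arguments alone. A single column expression such as f[:] may expand to several columns, so argument counts do not settle it. `possible_splits` is a hypothetical helper written only for this illustration.

```python
def possible_splits(*args):
    """List every way to cut a flat argument list into (cols, names)."""
    return [(args[:i], args[i:]) for i in range(1, len(args))]


# Four flat arguments admit three different (cols, names) readings.
splits = possible_splits("A", "B", "X", "Y")
for cols, names in splits:
    print(cols, "->", names)
```

With the method form, the columns are fixed by the object the method is called on, so every positional argument can safely be treated as a name.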

@samukweku
Collaborator Author

samukweku commented Sep 17, 2022

Does dt.also_as make sense? Or dt.aka?

@samukweku
Collaborator Author

I'll add the docs and make changes, while changing to name_as

@oleksiyskononenko
Contributor

@samukweku I guess you can leave it as alias in the docs. Since the as approach is called alias in SQL, it should be intuitive for the users. I will remove dt.alias() and will make FExpr.alias() accept a variable number of arguments.

@oleksiyskononenko oleksiyskononenko added this to the Release 1.1.0 milestone Sep 19, 2022
@oleksiyskononenko oleksiyskononenko changed the title [ENH] Column renaming [ENH] Column aliasing Sep 19, 2022
@oleksiyskononenko
Contributor

@samukweku I found that I have already created documentation, just had to commit the files. Please take a look and if it's ok, we can merge this PR.

@oleksiyskononenko oleksiyskononenko added documentation test Add new tests, or fix existing tests labels Sep 20, 2022
@oleksiyskononenko oleksiyskononenko merged commit 75cbb0f into main Sep 20, 2022
@oleksiyskononenko oleksiyskononenko deleted the samukweku/alias branch September 20, 2022 06:20
@samukweku
Collaborator Author

My bad @oleksiyskononenko I was having a look at it. Glad it has been merged though - it is a superb addition to the library

@oleksiyskononenko
Contributor

@samukweku oops, sorry, I saw your thumbs up and thought you had already finished the review. Anyway, please continue your review, and if you see anything that is wrong or needs clarification, please open a separate PR with additions.

samukweku added a commit that referenced this pull request Sep 20, 2022
This PR implements column's aliasing as proposed in #2684. We couldn't name the method `.as()` though, because `as` is a built-in python keyword — hence, we use `.alias()` instead. Column aliasing is now also available in the group-by clause.

Closes #2504
samukweku added a commit that referenced this pull request Sep 21, 2022
samukweku added a commit that referenced this pull request Jan 2, 2023
samukweku added a commit that referenced this pull request Jan 3, 2023
Labels
documentation new feature Feature requests for new functionality test Add new tests, or fix existing tests
Development

Successfully merging this pull request may close these issues.

[FR] Option to add a name to grouping in by, especially for boolean expressions
2 participants