Implement equivalent of Pandas qcut() #1680

Closed
navdeep-G opened this issue Feb 22, 2019 · 10 comments · Fixed by #2559

@navdeep-G

  • Describe what feature you would like to see implemented.

Pandas equivalent of qcut(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

  • If possible, give an example of how it may look in the code and what result
    will be produced.

Similar to Pandas qcut(). There may be some API improvements to make, but I'm unsure of those at the moment.

@navdeep-G navdeep-G added the new feature Feature requests for new functionality label Feb 22, 2019
@st-pasha
Contributor

In pandas, qcut produces a categorical column.
We don't have categoricals just yet, but we can produce integer codes. Would this be acceptable? Also, once we add support for categoricals, the return type of this function would probably have to be changed.

In terms of API, I'm thinking of a simple method that applies to a column expression and produces a quantile encoding:

DT[:, f.A.qcut(q=4)]

@navdeep-G
Author

navdeep-G commented Feb 23, 2019

@st-pasha Yes, DT[:, f.A.qcut(q=4)] seems like a good API to start with.

However, I am not sure how integer codes would work in this case. Can you give an example of what you expect the output to look like given the current datatable framework?

@st-pasha
Contributor

qcut() is supposed to convert any int/real variable into quantile codes. So, for example, range(12) cut into 4 quartiles would be converted into [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3].
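
The integer encoding described above can be sketched in plain Python. This is only an illustration of the idea, not datatable's implementation: the helper name `quantile_codes` is hypothetical, the cut points are assumed to be linearly interpolated between sorted values, and bins are assumed to be closed on the right. Duplicate cut points (heavily tied data) are not handled.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_codes(values, q):
    """Map each value to the index (0..q-1) of the quantile bin it falls in.

    Hypothetical helper for illustration; not part of datatable or pandas.
    """
    data = list(values)
    # q - 1 interior cut points, linearly interpolated between sorted values.
    cuts = quantiles(data, n=q, method="inclusive")
    # Bins are closed on the right: a value equal to a cut point
    # stays in the lower bin.
    return [bisect_left(cuts, v) for v in data]


codes = quantile_codes(range(12), 4)
print(codes)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

For range(12) the interior cut points are 2.75, 5.5 and 8.25, so each quartile receives exactly three values.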

@oleksiyskononenko
Contributor

oleksiyskononenko commented Feb 26, 2019

What about using Aggregator for this purpose?

>>> import datatable as dt
>>> from datatable.models import aggregate
>>> df = dt.Frame(range(12))
>>> [df_agg, df_map] = aggregate(df, min_rows = 0, n_bins = 4)
>>> df_map
     exemplar_id
---  -----------
 0             0
 1             0
 2             0
 3             1
 4             1
 5             1
 6             2
 7             2
 8             2
 9             3
10             3
11             3

@st-pasha
Contributor

@oleksiyskononenko Aggregator doesn't have such precise semantics. For example,

>>> df = dt.Frame([0.0, 0.1, 0.1, 0.1, 0.4, 0.8, 0.9, 0.99, 0.95, 0.91])
>>> [df_agg, df_map] = aggregate(df, min_rows=0, n_bins=4)
>>> df_map
    exemplar_id
--  -----------
 0            0
 1            0
 2            0
 3            0
 4            1
 5            2
 6            2
 7            2
 8            2
 9            2

[10 rows x 1 column]

Note that in this case the fourth quartile isn't even produced.

@st-pasha
Contributor

@navdeep-G Will integer codes work for you, or should we hold off on this issue until we have a categorical column type?

@navdeep-G
Author

@st-pasha I am not sure if integer codes would work for me. I think waiting for categorical types would be best. Is there an issue for this? We can tag it here as a dependency.

@st-pasha
Contributor

Prerequisite: #1691

@oleksiyskononenko
Contributor

@st-pasha Yes, that’s because I compress it at the end. Could be easily disabled or enabled depending on an input parameter.
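
One way to read the compression remark is sketched below. It assumes the 1-D binning is equal-width over the value range, followed by relabeling the non-empty bins consecutively; the thread does not confirm that this is what Aggregator does internally, and both helper names are hypothetical. Under that assumption the sketch reproduces the `exemplar_id` column shown earlier for the ten-value example.

```python
def equal_width_codes(values, n_bins):
    """Assign each value to one of n_bins equal-width bins over [min, max].

    Hypothetical helper; assumes the values are not all identical.
    """
    data = list(values)
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in data]


def compress(codes):
    """Relabel codes so that the non-empty bins are numbered 0, 1, 2, ..."""
    relabel = {c: i for i, c in enumerate(sorted(set(codes)))}
    return [relabel[c] for c in codes]


values = [0.0, 0.1, 0.1, 0.1, 0.4, 0.8, 0.9, 0.99, 0.95, 0.91]
raw = equal_width_codes(values, 4)
print(raw)            # [0, 0, 0, 0, 1, 3, 3, 3, 3, 3]
print(compress(raw))  # [0, 0, 0, 0, 1, 2, 2, 2, 2, 2]
```

The third equal-width bin is empty for this data, so after relabeling only three distinct ids remain. Quantile-based cutting places its boundaries by rank rather than by value range, which is why it would not leave a bin empty here.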

@mdymczyk

mdymczyk commented Mar 4, 2019

@navdeep-G @st-pasha actually pandas.qcut creates either categoricals or integers but that's not important :-)

I think we (or at least MLI) should be OK with anything that describes which group a given value belongs to, so integers should be fine.

Currently, we're using the group boundaries i.e. [[0, 1), [1,2], (2,3]] etc. as extra training features, but if we instead use [0,1,2] it should give us the same result.
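
A small sketch of why the two encodings carry the same information: given the list of bin edges, an integer code maps straight back to its pair of boundaries. The edge values and the helper name `code_to_interval` are hypothetical, chosen only to mirror the example above.

```python
def code_to_interval(code, edges):
    """Return the (lower, upper) boundaries of the bin with the given code."""
    return edges[code], edges[code + 1]


edges = [0, 1, 2, 3]   # hypothetical bin boundaries
codes = [0, 1, 2, 1]   # integer codes, one per row
intervals = [code_to_interval(c, edges) for c in codes]
print(intervals)  # [(0, 1), (1, 2), (2, 3), (1, 2)]
```

Which end of each interval is closed is a separate convention that the integer code alone does not record.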

oleksiyskononenko added a commit that referenced this issue Aug 9, 2020
In this PR we add a `qcut(cols, nquantiles=10)` function that can be applied to a Frame or to an f-expression. The function behavior is very close to the following Pandas code: `pd.qcut(input, q, labels=False)`.

WIP for #2358 
Closes #1680
@st-pasha st-pasha added this to the Release 0.11.0 milestone Sep 18, 2020