Implement categorical columns #1691
Comments
Just a quick question: the |
Hmm, interesting, I have not heard about that. Perhaps there are specific scenarios where factor variables become slower, such as when the number of factors approaches the number of rows? There are clearly situations where factors would be preferable. For example, if there are only a few of them, this would speed up sorting and could greatly reduce the required storage space. Anyway, the categorical type should be in addition to, not a replacement for, the regular string type, so users will be able to use whichever format better suits their needs.
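To make the storage argument concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not datatable code: the function name `storage_estimate`, the 4-byte offset per string, and the rule for picking the code width are all assumptions made for illustration only.

```python
def storage_estimate(values):
    """Return (plain_bytes, categorical_bytes) for a list of strings.

    Assumed model (hypothetical, for illustration): a string column stores
    the UTF-8 bytes of every value plus one 4-byte offset per row; a
    categorical column stores one integer code per row plus a small string
    column holding each distinct value once.
    """
    def string_bytes(items):
        return sum(len(s.encode("utf-8")) for s in items) + 4 * len(items)

    categories = set(values)
    ncat = len(categories)
    # Smallest signed integer width that can hold codes 0..ncat-1
    code_width = 1 if ncat <= 2**7 else 2 if ncat <= 2**15 else 4

    plain = string_bytes(values)
    categorical = code_width * len(values) + string_bytes(categories)
    return plain, categorical


# Few categories: the categorical form is much smaller.
few_plain, few_cat = storage_estimate(["red", "green", "blue"] * 1000)

# Every value distinct: the codes are pure overhead.
uniq_plain, uniq_cat = storage_estimate([str(i) for i in range(1000)])
```

Under these assumptions the three-colour column shrinks by roughly a factor of eight, while the all-distinct column gets larger, which matches the intuition that categoricals stop paying off as the number of categories approaches the number of rows.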
@XiaomoWu the reason to avoid factor was not speed but problems with its levels when combining, filtering, or performing string operations like |
When implemented it might allow to read 1e9 data sets into db-benchmark for pandas, currently |
Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691
Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691
It seems that `dt.categories()` is the first datatable function that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for the cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have different numbers of underlying categories.

In this PR we
- add some basic support for the `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns that, in the case of an uneven number of rows, are promoted to `Grouping::GtoALL`;
- do minor refactoring in the `dt.alias()` function.

WIP for #1691
Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`. WIP for #1691
A categorical column is semantically equivalent to a string column, except that it uses integer codes to store the values. The layout of such a column is therefore:
(where `T` could be `int8`, `int16` or `int32`).
Tasks and operations we can support for categoricals are:
- `cat8`, `cat16` and `cat32`
- `Categorical_ColumnImpl` internals
- N/A handling
- `[i, j]`
- `[:, j]`
- `[i, :]`
- Conversion:
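The integer-code layout described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not datatable's implementation: `encode` and `decode` are hypothetical helper names, sorting the categories is an arbitrary choice, and representing N/A as the code `-1` is an assumption made here for simplicity.

```python
import array

NA_CODE = -1  # assumed sentinel for missing values (illustrative only)


def encode(values):
    """Encode a list of strings (None = N/A) as (codes, categories)."""
    categories = sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(categories)}

    # Pick the narrowest signed integer type T that can hold every code:
    # "b" is 1 byte, "h" is 2 bytes, "i" is typically 4 bytes.
    n = len(categories)
    typecode = "b" if n <= 2**7 else "h" if n <= 2**15 else "i"

    codes = array.array(
        typecode,
        (NA_CODE if v is None else index[v] for v in values),
    )
    return codes, categories


def decode(codes, categories):
    """Reconstruct the original string values from codes and categories."""
    return [None if c == NA_CODE else categories[c] for c in codes]


original = ["b", "a", None, "b"]
codes, categories = encode(original)
restored = decode(codes, categories)
```

Here each row costs one small integer, and the distinct strings are stored only once in `categories`; decoding is a plain lookup, which is why the column can still behave like a string column from the user's point of view.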