Implement categorical columns #1691

Open
14 of 25 tasks
st-pasha opened this issue Feb 27, 2019 · 4 comments
Labels
EPIC ⭐ Big task that may encompass many smaller ones new feature Feature requests for new functionality

Comments

@st-pasha
Contributor

st-pasha commented Feb 27, 2019

A categorical column is semantically equivalent to a string column, except that it uses integer codes to store the values. The layout of such a column is therefore:

T values[n];  // array of indices into a dictionary
StringColumn<int32> dict;  // "dictionary" column

(where T could be int8, int16 or int32).
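To make the layout above concrete, here is a minimal pure-Python sketch of dictionary encoding. It is not datatable's implementation: the helper names `encode`, `decode` and `code_width` are hypothetical, `None` stands in for a missing value, and the width rule assumes signed codes.

```python
def encode(values):
    """Dictionary-encode a list of strings (None marks a missing value)."""
    dictionary = []   # unique strings, in order of first appearance
    index = {}        # string -> integer code
    codes = []        # one code per row: the `values` array in the layout
    for v in values:
        if v is None:
            codes.append(None)
            continue
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return codes, dictionary


def decode(codes, dictionary):
    """Map the codes back to strings by looking them up in the dictionary."""
    return [None if c is None else dictionary[c] for c in codes]


def code_width(n_categories):
    """Smallest signed width (in bits) able to hold codes 0..n_categories-1."""
    for bits in (8, 16, 32):
        if n_categories <= 2 ** (bits - 1):
            return bits
    raise ValueError("too many categories for int32 codes")


codes, dictionary = encode(["red", "green", "red", None, "blue"])
```

Each row stores only a small integer, and every distinct string is stored once in the dictionary, which is where the space saving comes from when the number of categories is small.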

Tasks and operations we can support for categoricals are:

Conversion:

  • implement type casts to categorical columns
  • implement type casts from categorical columns
  • read/write categorical columns from/into Jay
  • write categorical columns to csv
  • read categorical columns from csv (fread)
  • convert categorical columns to numpy
  • create categorical columns from numpy
  • convert categorical column to pandas
  • create categorical column from pandas
  • convert categorical columns to pyarrow
  • create categorical columns from pyarrow
@st-pasha st-pasha added the new feature Feature requests for new functionality label Feb 27, 2019
@XiaomoWu

XiaomoWu commented May 2, 2019

Just a quick question: Rdatatable avoids categorical columns (called factors in R) partly because they slow down performance. I wonder whether the performance of pydatatable will be affected by introducing categorical columns.

@st-pasha
Contributor Author

st-pasha commented May 2, 2019

Hmm, interesting, I have not heard about that. Perhaps, there are specific scenarios where the factor variables become slower? Like, when the number of factors approaches the number of rows?

There are clearly situations where factors would be preferable. For example, if there are only a few of them, this would speed up sorting and could also greatly reduce the required storage space.
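As a rough illustration of the sorting point, assuming a column is held as integer codes plus a dictionary of strings, the strings only need to be compared once per category and the rows can then be ordered with a counting sort. This is a sketch, not datatable's sorting code; `sort_categorical` is a hypothetical helper and missing values are not handled.

```python
def sort_categorical(codes, dictionary):
    """Return row indices that order the column by its string values."""
    k = len(dictionary)
    # Rank the categories once: string comparisons happen only here.
    order = sorted(range(k), key=lambda c: dictionary[c])
    rank = [0] * k
    for r, c in enumerate(order):
        rank[c] = r
    # Counting sort of the rows by rank: O(n + k), integers only.
    counts = [0] * (k + 1)
    for c in codes:
        counts[rank[c] + 1] += 1
    for i in range(k):
        counts[i + 1] += counts[i]
    result = [0] * len(codes)
    for row, c in enumerate(codes):
        r = rank[c]
        result[counts[r]] = row   # next free slot for this rank
        counts[r] += 1
    return result


dictionary = ["pear", "apple", "fig"]
codes = [0, 1, 2, 1, 0]
row_order = sort_categorical(codes, dictionary)
```

With n rows and k categories the cost is roughly k log k string comparisons plus a linear pass over the codes, so the benefit shrinks as k approaches n.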

Anyway, the categorical type should be an addition to, not a replacement for, the regular string type, so users will be able to pick whichever format better suits their needs.

@jangorecki
Contributor

@XiaomoWu the reason to avoid factors was not speed but problems with their levels when combining, filtering, or performing string operations like paste. Factors are faster than character vectors and can be processed in parallel, whereas R's global cache for character strings is not thread safe.
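The problem with levels when combining can be sketched in a few lines. Assuming two columns each hold their own dictionary, the same string may have a different code in each, so the codes of the second column have to be remapped before appending. `append_categorical` is a hypothetical helper, not an R or datatable function.

```python
def append_categorical(codes_a, dict_a, codes_b, dict_b):
    """Append column b to column a, merging their dictionaries."""
    merged = list(dict_a)
    index = {s: i for i, s in enumerate(merged)}
    remap = []   # remap[old code in b] -> code in the merged dictionary
    for s in dict_b:
        if s not in index:
            index[s] = len(merged)
            merged.append(s)
        remap.append(index[s])
    return list(codes_a) + [remap[c] for c in codes_b], merged


# "y" has code 1 in the first column but code 0 in the second.
codes, levels = append_categorical([0, 1], ["x", "y"], [0, 1, 0], ["y", "z"])
```

Concatenating the raw codes without the remapping step would silently turn the second column's "y" into "x".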

@jangorecki
Contributor

Once implemented, this might make it possible to load the 1e9 data sets in db-benchmark for pandas; currently to_pandas() fails with OOM (afair). Having categoricals instead of objects could significantly reduce the memory footprint.
A recent attempt to optimise pandas read_csv failed, see h2oai/db-benchmark#99

@st-pasha st-pasha added this to the Release 1.1.0 milestone Jul 2, 2021
@st-pasha st-pasha added the EPIC ⭐ Big task that may encompass many smaller ones label Jul 2, 2021
@st-pasha st-pasha mentioned this issue Jul 2, 2021
4 tasks
oleksiyskononenko added a commit that referenced this issue Aug 24, 2021
For the moment, these types are not used anywhere, but we need them in order to implement the `Categorical_ColumnImpl`. This PR also includes minor corrections for array types.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Sep 8, 2021
…3158)

- implemented `CategoricalColumn_Impl`;
- added support for categorical columns in `dt.Frame()`;
- added support for categorical columns in a terminal, also allowing the element access through `[i, j]` selector;
- added and modified some relevant tests.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Sep 27, 2022
Fix casting `void` columns to categoricals and add corresponding tests.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Oct 4, 2022
Implement casting of boolean, integer, float, date, time and string columns to categoricals.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Oct 11, 2022
Implement `dt.categories()` to get categories for categorical columns.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Oct 15, 2022
It seems that `dt.categories()` is the first datatable function that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for the cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have different numbers of underlying categories.

In this PR we
- add some basic support for `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns, which are promoted to `Grouping::GtoALL` when the numbers of rows are uneven;
- do minor refactoring in `dt.alias()` function.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Oct 18, 2022
Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has also been changed to signed integers.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Oct 21, 2022
…tegorical columns (#3372)

In this PR we 
- implement casts from `dt.cat*(...)` to all of the basic types;
- as a consequence, support for converting categorical columns to CSV has been added.

WIP for #1691
oleksiyskononenko added a commit that referenced this issue Oct 26, 2022
Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
Fix casting `void` columns to categoricals and add corresponding tests.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
Implement casting of boolean, integer, float, date, time and string columns to categoricals.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
Implement `dt.categories()` to get categories for categorical columns.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
It seems that `dt.categories()` is the first datatable function that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for the cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have different numbers of underlying categories.

In this PR we
- add some basic support for `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns, which are promoted to `Grouping::GtoALL` when the numbers of rows are uneven;
- do minor refactoring in `dt.alias()` function.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has also been changed to signed integers.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
…tegorical columns (#3372)

In this PR we 
- implement casts from `dt.cat*(...)` to all of the basic types;
- as a consequence, support for converting categorical columns to CSV has been added.

WIP for #1691
samukweku pushed a commit that referenced this issue Jan 3, 2023
Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`.

WIP for #1691
samukweku pushed a commit that referenced this issue May 2, 2023
samukweku pushed a commit that referenced this issue May 2, 2023
@st-pasha st-pasha removed this from the Release 1.1.0 milestone Nov 14, 2023
4 participants