Implement categorical columns #1691

st-pasha · 2019-02-27T20:13:04Z · edited by oleksiyskononenko

A categorical column is semantically equivalent to a string column, except that it uses integer codes to store the values. The layout of such a column is therefore:

T values[n];  // array of indices into a dictionary
StringColumn<int32> dict;  // "dictionary" column

(where T could be int8, int16 or int32).
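To make the layout concrete, here is a minimal pure-Python sketch of dictionary encoding. It is only an illustration of the idea above, not datatable's implementation: the names `encode`, `decode` and `WIDTHS` are hypothetical, the `array` typecodes stand in for int8/int16/int32, and missing values are not handled.

```python
from array import array

# Signed integer widths standing in for int8, int16 and int32, given as
# (array typecode, largest code that fits). "b", "h" and "l" are guaranteed
# to be at least 8, 16 and 32 bits wide. This mapping is an assumption of
# the sketch, not something taken from datatable.
WIDTHS = [("b", 2**7 - 1), ("h", 2**15 - 1), ("l", 2**31 - 1)]


def encode(values):
    """Return (codes, dictionary) for a list of strings."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> position in `dictionary`
    raw_codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        raw_codes.append(index[v])
    # Pick the narrowest signed type that can hold the largest code.
    largest = len(dictionary) - 1
    typecode = next(tc for tc, limit in WIDTHS if largest <= limit)
    return array(typecode, raw_codes), dictionary


def decode(codes, dictionary):
    """Recover the original strings from the codes and the dictionary."""
    return [dictionary[c] for c in codes]


codes, dictionary = encode(["red", "green", "red", "blue", "green"])
```

Each row costs one small integer, and each distinct string is stored once in the dictionary.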

Tasks and operations we can support for categoricals are:

Conversion:

XiaomoWu · 2019-05-02T06:35:17Z

Just a quick question: Rdatatable avoids categorical columns (called factors in R) partly because they slow down performance. I wonder whether the performance of pydatatable will be affected by introducing categorical columns.

st-pasha · 2019-05-02T21:41:06Z

Hmm, interesting, I have not heard about that. Perhaps, there are specific scenarios where the factor variables become slower? Like, when the number of factors approaches the number of rows?

There are clearly situations where factors would be preferable. For example, if there are only a few of them, this would speed up sorting and could greatly reduce the required storage space.

Anyway, the categorical type should be an addition to, not a replacement for, the regular string type, so users will be able to use whichever format better suits their needs.
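One way to see why a small number of categories could speed up sorting: string comparisons are needed only to order the dictionary, after which the rows can be placed with a counting sort over the integer codes. The sketch below assumes the codes-plus-dictionary layout from the issue description; `sort_by_category` is a hypothetical helper, not datatable's sorting code.

```python
def sort_by_category(codes, dictionary):
    """Return row indices that order the rows by their string value."""
    # rank[c] = position of category c once the dictionary is sorted.
    order = sorted(range(len(dictionary)), key=lambda c: dictionary[c])
    rank = [0] * len(dictionary)
    for position, c in enumerate(order):
        rank[c] = position

    # Counting sort: count rows per rank, then turn counts into start offsets.
    counts = [0] * len(dictionary)
    for c in codes:
        counts[rank[c]] += 1
    starts = [0] * len(dictionary)
    for r in range(1, len(dictionary)):
        starts[r] = starts[r - 1] + counts[r - 1]

    # Place each row index into its slot; equal values keep their row order.
    result = [0] * len(codes)
    for row, c in enumerate(codes):
        r = rank[c]
        result[starts[r]] = row
        starts[r] += 1
    return result


dictionary = ["red", "green", "blue"]
codes = [0, 1, 0, 2, 1]
row_order = sort_by_category(codes, dictionary)
sorted_values = [dictionary[codes[i]] for i in row_order]
```

With k categories and n rows, this does roughly k log k string comparisons plus O(n + k) integer work, instead of comparing strings for every pair of rows the sort visits.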

jangorecki · 2019-05-03T10:23:06Z

@XiaomoWu the reason to avoid factors was not speed but problems with their levels when combining, filtering, or performing string operations like paste. Factors are faster than character vectors and can be processed in parallel, whereas R's global string cache used by character vectors is not thread-safe.
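The levels problem when combining can be illustrated with a short sketch: two categorical columns may assign the same code to different strings, so concatenating the codes directly is wrong and one side has to be remapped onto a merged dictionary. The `combine` function and the sample data are hypothetical and do not reflect R's or datatable's actual behaviour.

```python
def combine(codes_a, dict_a, codes_b, dict_b):
    """Concatenate two categorical columns that use different dictionaries."""
    merged = list(dict_a)
    position = {value: i for i, value in enumerate(merged)}
    # Translate each code of the second column into the merged dictionary.
    remap = []
    for value in dict_b:
        if value not in position:
            position[value] = len(merged)
            merged.append(value)
        remap.append(position[value])
    return list(codes_a) + [remap[c] for c in codes_b], merged


# The same code means different strings in the two inputs.
codes_a, dict_a = [0, 1, 0], ["low", "high"]
codes_b, dict_b = [0, 1, 1], ["high", "mid"]

naive = codes_a + codes_b   # wrong: code 0 of column b would read as "low"
codes, merged = combine(codes_a, dict_a, codes_b, dict_b)
decoded = [merged[c] for c in codes]
```

Filtering raises a related question: after rows are dropped, the dictionary may still hold values that no row refers to.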

jangorecki · 2019-08-24T09:26:35Z

Once implemented, this might make it possible to read the 1e9 data sets into db-benchmark for pandas; currently to_pandas() fails with OOM (afair). Having categoricals instead of objects could significantly reduce the memory footprint.
A recent attempt to optimise pandas read_csv failed, see h2oai/db-benchmark#99
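As a rough, back-of-the-envelope illustration of the potential memory saving, the sketch below compares one string per row with codes plus a dictionary. The layout it models (4-byte offsets plus UTF-8 bytes versus 1-byte codes plus a small dictionary) and the sample labels are assumptions made for this example; it is not a measurement of pandas or datatable.

```python
def packed_string_bytes(values, offset_size=4):
    """Bytes for one offset per row plus the UTF-8 data of every row."""
    return offset_size * len(values) + sum(len(v.encode("utf-8")) for v in values)


def categorical_bytes(values, code_size=1, offset_size=4):
    """Bytes for one code per row plus a packed dictionary of distinct values."""
    distinct = sorted(set(values))
    return code_size * len(values) + packed_string_bytes(distinct, offset_size)


# One million rows drawn from three short labels (made-up data).
labels = ["id001", "id002", "id003"]
values = [labels[i % 3] for i in range(1_000_000)]

as_strings = packed_string_bytes(values)     # 4 + 5 bytes per row
as_categories = categorical_bytes(values)    # 1 byte per row + tiny dictionary
```

Under these assumptions the string layout needs 9,000,000 bytes and the categorical layout 1,000,027 bytes. The ratio depends heavily on string length and on how many distinct values there are.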

For the moment, these types are not used anywhere, but we need them in order to implement the `Categorical_ColumnImpl`. This PR also includes minor corrections for array types. WIP for #1691

…3158)
- implemented `CategoricalColumn_Impl`;
- added support for categorical columns in `dt.Frame()`;
- added support for categorical columns in a terminal, also allowing the element access through `[i, j]` selector;
- added and modified some relevant tests.

WIP for #1691

WIP for #1691

Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691

Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691

Implement `dt.categories()` to get categories for categorical columns. WIP for #1691

It seems that `dt.categories()` is the first datatable function that needs to produce `Grouping::GtoFEW` columns. That's because the number of categories in a categorical column could be anything between `0` and `nrows - 1`. Currently, datatable doesn't really support `Grouping::GtoFEW`, but may need it for cases when `dt.categories()` is combined with other f-expressions, or when `dt.categories()` is applied to columns that have different numbers of underlying categories. In this PR we
- add some basic support for the `Grouping::GtoFEW` grouping mode;
- adjust `dt.categories()` to produce `Grouping::GtoFEW` columns that, in the case of an uneven number of rows, are promoted to `Grouping::GtoALL`;
- do minor refactoring in the `dt.alias()` function.

WIP for #1691

Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has also been changed to signed integers. WIP for #1691

…tegorical columns (#3372)
In this PR we
- implement casts from `dt.cat*(...)` to all of the basic types;
- as a consequence, support for converting categorical columns to CSV has been added.

WIP for #1691

Implement most of the statistical functions for categorical columns. Once we implement sorting of categorical columns, we could add the missing `dt.mode()`. WIP for #1691

st-pasha added the "new feature" label (Feature requests for new functionality) Feb 27, 2019

st-pasha mentioned this issue Feb 27, 2019

Implement equivalent of Pandas qcut() #1680

Closed

jangorecki mentioned this issue Aug 24, 2019

pandas/dask try to optimise read_csv to load 1e9 rows data h2oai/db-benchmark#99

Closed

jangorecki mentioned this issue Jan 5, 2021

Whole script performance h2oai/db-benchmark#177

Open

st-pasha added this to the Release 1.1.0 milestone Jul 2, 2021

st-pasha assigned oleksiyskononenko Jul 2, 2021

st-pasha added the "EPIC ⭐" label (Big task that may encompass many smaller ones) Jul 2, 2021

st-pasha mentioned this issue Jul 2, 2021

Roadmap 1.1.0 #3046

Open

4 tasks

oleksiyskononenko mentioned this issue Aug 9, 2021

Implementation of Categorical Columns in datatable #3136

Open

oleksiyskononenko mentioned this issue Aug 20, 2021

Add categorical types cat8, cat16 and cat32 #3149

Merged

oleksiyskononenko mentioned this issue Sep 4, 2021

Implement CategoricalColumn_Impl and support for basic operations #3158

Merged

This was referenced Feb 22, 2022

Internals for validity buffer for categorical columns #3239

Merged

Allow using column selector on categorical columns #3240

Merged

oleksiyskononenko added a commit that referenced this issue Feb 22, 2022

Allow using column selector on cat columns (#3240)

6376ea0

WIP for #1691

oleksiyskononenko mentioned this issue Sep 26, 2022

Fix casting void columns to categoricals #3362

Merged

oleksiyskononenko added a commit that referenced this issue Sep 27, 2022

Fix casting void columns to categoricals (#3362)

0e79950

Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691

oleksiyskononenko mentioned this issue Oct 4, 2022

Implement casting of the most column types to categoricals #3365

Merged

oleksiyskononenko added a commit that referenced this issue Oct 4, 2022

Implement casting of bool/int/float/str to categoricals (#3365)

b4866f5

Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691

oleksiyskononenko added a commit that referenced this issue Oct 4, 2022

Implement casting of the most column types to categoricals (#3365)

5992075

Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691

oleksiyskononenko mentioned this issue Oct 10, 2022

Implement dt.categories() #3367

Merged

oleksiyskononenko added a commit that referenced this issue Oct 11, 2022

Implement dt.categories() (#3367)

e610c90

Implement `dt.categories()` to get categories for categorical columns. WIP for #1691

oleksiyskononenko mentioned this issue Oct 14, 2022

Add basic support for Grouping::GtoFEW #3370

Merged

oleksiyskononenko mentioned this issue Oct 17, 2022

Implement dt.codes() #3371

Merged

oleksiyskononenko mentioned this issue Oct 21, 2022

Implement casts from categorical columns to other types #3372

Merged

oleksiyskononenko mentioned this issue Oct 25, 2022

Implement statistics for categorical columns #3373

Merged

oleksiyskononenko mentioned this issue Oct 29, 2022

Implement slicing for categorical columns #3379

Merged

samukweku pushed a commit that referenced this issue Jan 3, 2023

Fix casting void columns to categoricals (#3362)

6eb48a2

Fix casting `void` columns to categoricals and add corresponding tests. WIP for #1691

samukweku pushed a commit that referenced this issue Jan 3, 2023

Implement casting of the most column types to categoricals (#3365)

93c2244

Implement casting of boolean, integer, float, date, time and string columns to categoricals. WIP for #1691

samukweku pushed a commit that referenced this issue Jan 3, 2023

Implement dt.categories() (#3367)

098c410

Implement `dt.categories()` to get categories for categorical columns. WIP for #1691

oleksiyskononenko added a commit that referenced this issue Apr 25, 2023

Implement slicing for categorical columns (#3379)

887ad6b

WIP for #1691

oleksiyskononenko mentioned this issue Apr 27, 2023

Minor refactoring of methods to get the underlying column type #3458

Merged

oleksiyskononenko added a commit that referenced this issue Apr 28, 2023

Minor refactoring of methods to get the underlying column type (#3458)

3811806

WIP for #1691

samukweku pushed a commit that referenced this issue Apr 28, 2023

Implement slicing for categorical columns (#3379)

7158865

WIP for #1691

samukweku pushed a commit that referenced this issue Apr 28, 2023

Minor refactoring of methods to get the underlying column type (#3458)

9ecdd96

WIP for #1691

samukweku pushed a commit that referenced this issue May 2, 2023

Implement slicing for categorical columns (#3379)

390df8a

WIP for #1691

samukweku pushed a commit that referenced this issue May 2, 2023

Minor refactoring of methods to get the underlying column type (#3458)

544d71a

WIP for #1691

samukweku pushed a commit that referenced this issue May 2, 2023

Implement slicing for categorical columns (#3379)

a734bcf

WIP for #1691

samukweku pushed a commit that referenced this issue May 2, 2023

Minor refactoring of methods to get the underlying column type (#3458)

f59c1b6

WIP for #1691

st-pasha removed this from the Release 1.1.0 milestone Nov 14, 2023
