Implement dt.codes() (#3371)
Implement `dt.codes()` to get integer codes for categorical columns. Since we currently don't support unsigned column types, the internal representation of codes has also been changed to signed integers.

WIP for #1691
oleksiyskononenko authored and samukweku committed Jan 3, 2023
1 parent a836fd6 commit ae6e9c6
Showing 15 changed files with 236 additions and 13 deletions.
2 changes: 1 addition & 1 deletion docs/api/dt/categories.rst
@@ -7,7 +7,7 @@

.. x-version-added:: 1.1.0

- For each column from `cols` get the underlying categories.
+ Get categories for categorical data.

Parameters
----------
23 changes: 23 additions & 0 deletions docs/api/dt/codes.rst
@@ -0,0 +1,23 @@

.. xfunction:: datatable.codes
    :src: src/core/expr/fexpr_codes.cc pyfn_codes
    :tests: tests/types/test-categorical.py
    :cvar: doc_dt_codes
    :signature: codes(cols)

    .. x-version-added:: 1.1.0

    Get integer codes for categorical data.

    Parameters
    ----------
    cols: FExpr
        Input categorical data.

    return: FExpr
        f-expression that returns integer codes for each column
        from `cols`.

    except: TypeError
        The exception is raised when one of the columns from `cols`
        has a non-categorical type.
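
To illustrate what the integer codes documented above represent, here is a pure-Python sketch that does not use datatable. `model_codes` is a hypothetical helper. It assumes that categories are ordered with missing values first and the remaining values sorted, which agrees with the reference values in this commit's tests but is not confirmed to be datatable's internal algorithm.

```python
def model_codes(values):
    """Hypothetical helper: return (categories, codes) for a list of values.

    Assumption: missing values (None) come first, other values are sorted.
    """
    categories = sorted(set(values), key=lambda v: (v is not None, v))
    index = {cat: i for i, cat in enumerate(categories)}
    return categories, [index[v] for v in values]


cats, codes = model_codes(["cat", "dog", "mouse", "cat"])
print(cats)   # ['cat', 'dog', 'mouse']
print(codes)  # [0, 1, 2, 0]
```

Each code is the position of the row's value in the list of categories, so repeated values share a code.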
4 changes: 4 additions & 0 deletions docs/api/fexpr.rst
@@ -163,6 +163,9 @@
* - :meth:`.categories()`
- Same as :func:`dt.categories()`.

* - :meth:`.codes()`
- Same as :func:`dt.codes()`.

* - :meth:`.count()`
- Same as :func:`dt.count()`.

@@ -307,6 +310,7 @@
.alias() <fexpr/alias>
.as_type() <fexpr/as_type>
.categories() <fexpr/categories>
.codes() <fexpr/codes>
.count() <fexpr/count>
.countna() <fexpr/countna>
.cummin() <fexpr/cummin>
7 changes: 7 additions & 0 deletions docs/api/fexpr/codes.rst
@@ -0,0 +1,7 @@

.. xmethod:: datatable.FExpr.codes
    :src: src/core/expr/fexpr.cc PyFExpr::codes
    :cvar: doc_FExpr_codes
    :signature: codes()

    Equivalent to :func:`dt.codes(cols)`.
1 change: 1 addition & 0 deletions docs/api/index-api.rst
@@ -245,6 +245,7 @@ Other
by() <dt/by>
categories() <dt/categories>
cbind() <dt/cbind>
codes() <dt/codes>
corr() <dt/corr>
count() <dt/count>
countna() <dt/countna>
10 changes: 9 additions & 1 deletion docs/releases/v1.1.0.rst
@@ -110,6 +110,13 @@
-[new] Class :class:`dt.FExpr` now has method :meth:`.alias()`,
to assign new names to the selected columns. [#2684]
-[new] Added function :func:`dt.categories()`, as well as the corresponding
:meth:`.categories()` method, to retrieve categories
for categorical columns. [#3367]
-[new] Added function :func:`dt.codes()`, as well as the corresponding
:meth:`.codes()` method, to retrieve codes for categorical columns. [#3371]
-[enh] Function :func:`dt.re.match()` now supports case insensitive matching. [#3216]
-[enh] Function :func:`dt.qcut()` can now be used in a groupby context. [#3165]
@@ -172,14 +179,15 @@

-[new] Added properties :attr:`.is_array <dt.Type.is_array>`,
:attr:`.is_boolean <dt.Type.is_boolean>`,
:attr:`.is_categorical <dt.Type.is_categorical>`,
:attr:`.is_compound <dt.Type.is_compound>`,
:attr:`.is_float <dt.Type.is_float>`,
:attr:`.is_integer <dt.Type.is_integer>`,
:attr:`.is_numeric <dt.Type.is_numeric>`,
:attr:`.is_object <dt.Type.is_object>`,
:attr:`.is_string <dt.Type.is_string>`,
:attr:`.is_temporal <dt.Type.is_temporal>`,
- :attr:`.is_void <dt.Type.is_void>` to class :class:`dt.Type`. [#3101]
+ :attr:`.is_void <dt.Type.is_void>` to class :class:`dt.Type`. [#3101] [#3149]

-[enh] Added support for macOS Big Sur. [#3175]

6 changes: 3 additions & 3 deletions src/core/column/categorical.cc
@@ -163,9 +163,9 @@ bool Categorical_ColumnImpl<T>::get_element(size_t i, Column* out) const {
}


- template class Categorical_ColumnImpl<uint8_t>;
- template class Categorical_ColumnImpl<uint16_t>;
- template class Categorical_ColumnImpl<uint32_t>;
+ template class Categorical_ColumnImpl<int8_t>;
+ template class Categorical_ColumnImpl<int16_t>;
+ template class Categorical_ColumnImpl<int32_t>;


} // namespace dt
6 changes: 3 additions & 3 deletions src/core/column/categorical.h
@@ -65,9 +65,9 @@ class Categorical_ColumnImpl : public Virtual_ColumnImpl {
};


- extern template class Categorical_ColumnImpl<uint8_t>;
- extern template class Categorical_ColumnImpl<uint16_t>;
- extern template class Categorical_ColumnImpl<uint32_t>;
+ extern template class Categorical_ColumnImpl<int8_t>;
+ extern template class Categorical_ColumnImpl<int16_t>;
+ extern template class Categorical_ColumnImpl<int32_t>;


} // namespace dt
2 changes: 2 additions & 0 deletions src/core/documentation.h
@@ -27,6 +27,7 @@ extern const char* doc_dt_as_type;
extern const char* doc_dt_by;
extern const char* doc_dt_categories;
extern const char* doc_dt_cbind;
extern const char* doc_dt_codes;
extern const char* doc_dt_corr;
extern const char* doc_dt_count;
extern const char* doc_dt_countna;
@@ -286,6 +287,7 @@ extern const char* doc_FExpr;
extern const char* doc_FExpr_alias;
extern const char* doc_FExpr_as_type;
extern const char* doc_FExpr_categories;
extern const char* doc_FExpr_codes;
extern const char* doc_FExpr_count;
extern const char* doc_FExpr_countna;
extern const char* doc_FExpr_cummax;
10 changes: 10 additions & 0 deletions src/core/expr/fexpr.cc
@@ -363,6 +363,16 @@ DECLARE_METHOD(&PyFExpr::categories)
->docs(dt::doc_FExpr_categories);


oobj PyFExpr::codes(const XArgs&) {
  auto codesFn = oobj::import("datatable", "codes");
  return codesFn.call({this});
}

DECLARE_METHOD(&PyFExpr::codes)
    ->name("codes")
    ->docs(dt::doc_FExpr_codes);


oobj PyFExpr::count(const XArgs&) {
auto countFn = oobj::import("datatable", "count");
return countFn.call({this});
1 change: 1 addition & 0 deletions src/core/expr/fexpr.h
@@ -182,6 +182,7 @@ class PyFExpr : public py::XObject<PyFExpr> {
py::oobj alias(const py::XArgs&);
py::oobj as_type(const py::XArgs&);
py::oobj categories(const py::XArgs&);
py::oobj codes(const py::XArgs&);
py::oobj count(const py::XArgs&);
py::oobj countna(const py::XArgs&);
py::oobj cummin(const py::XArgs&);
116 changes: 116 additions & 0 deletions src/core/expr/fexpr_codes.cc
@@ -0,0 +1,116 @@
//------------------------------------------------------------------------------
// Copyright 2022 H2O.ai
//
// Permission is hereby granted, free of charge, to any person obtaining a
// copy of this software and associated documentation files (the "Software"),
// to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
// IN THE SOFTWARE.
//------------------------------------------------------------------------------
#include "_dt.h"
#include "documentation.h"
#include "column/const.h"
#include "column/sentinel_fw.h"
#include "expr/eval_context.h"
#include "expr/fexpr_func.h"
#include "python/xargs.h"
namespace dt {
namespace expr {


//------------------------------------------------------------------------------
// FExpr_Codes
//------------------------------------------------------------------------------

class FExpr_Codes : public FExpr_Func {
  private:
    ptrExpr arg_;

  public:
    FExpr_Codes(ptrExpr &&arg)
      : arg_(std::move(arg)) {}


    std::string repr() const override {
      std::string out = "codes";
      out += '(';
      out += arg_->repr();
      out += ')';
      return out;
    }


    Workframe evaluate_n(EvalContext &ctx) const override {
      Workframe wf = arg_->evaluate_n(ctx);

      for (size_t i = 0; i < wf.ncols(); ++i) {
        Column col = wf.retrieve_column(i);
        if (!col.type().is_categorical()) {
          throw TypeError() << "Invalid column of type `" << col.stype()
                            << "` in " << repr();
        }

        Column col_codes;
        switch (col.stype()) {
          case SType::CAT8 : col_codes = evaluate1<int8_t>(col, SType::INT8); break;
          case SType::CAT16 : col_codes = evaluate1<int16_t>(col, SType::INT16); break;
          case SType::CAT32 : col_codes = evaluate1<int32_t>(col, SType::INT32); break;
          default: throw RuntimeError() << "Unknown categorical type: "
                                        << col.stype();
        }

        wf.replace_column(i, std::move(col_codes));
      }
      return wf;
    }


    template <typename T>
    Column evaluate1(const Column& col, const SType stype) const {
      Column col_codes;
      if (col.n_children()) {
        col_codes = Column(new SentinelFw_ColumnImpl<T>(
          col.nrows(),
          stype,
          Buffer(col.get_data_buffer(1))
        ));
      } else {
        col_codes = Const_ColumnImpl::make_int_column(col.nrows(), 0, stype);
      }
      return col_codes;
    }
};



//------------------------------------------------------------------------------
// Python-facing `codes()` function
//------------------------------------------------------------------------------

static py::oobj pyfn_codes(const py::XArgs& args) {
auto cols = args[0].to_oobj();
return PyFExpr::make(new FExpr_Codes(as_fexpr(cols)));
}


DECLARE_PYFN(&pyfn_codes)
->name("codes")
->docs(doc_dt_codes)
->arg_names({"cols"})
->n_positional_args(1)
->n_required_args(1);


}} // dt::expr
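
The control flow of `FExpr_Codes::evaluate_n()` and `evaluate1()` above can be modeled roughly as follows. This is a plain-Python sketch, not datatable code: `codes_for_column`, the string type names and the `codes_buffer` argument are hypothetical stand-ins for the C++ column objects, and a column without children is modeled as a missing buffer.

```python
# Hypothetical model of the dispatch in FExpr_Codes; all names are illustrative.
CODE_TYPE = {"cat8": "int8", "cat16": "int16", "cat32": "int32"}


def codes_for_column(type_name, nrows, codes_buffer=None):
    """Return (code_type, codes) for one column, or raise TypeError."""
    if type_name not in CODE_TYPE:
        # Mirrors the TypeError thrown for non-categorical columns.
        raise TypeError(f"Invalid column of type {type_name} in codes(...)")
    if codes_buffer is None:
        # No child column: the result is a constant column of zeros.
        return CODE_TYPE[type_name], [0] * nrows
    # Otherwise the stored codes are exposed as an integer column
    # of the same width as the categorical type.
    return CODE_TYPE[type_name], list(codes_buffer)


print(codes_for_column("cat8", 3))                  # ('int8', [0, 0, 0])
print(codes_for_column("cat16", 4, [0, 1, 2, 0]))   # ('int16', [0, 1, 2, 0])
```

The sketch copies the buffer for simplicity, whereas the C++ code wraps the existing data buffer in a new column.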
8 changes: 4 additions & 4 deletions src/core/types/type_categorical.cc
@@ -146,9 +146,9 @@ Column Type_Cat::cast_column(Column&& col) const {
case SType::STR64:
case SType::OBJ:
switch (stype()) {
- case SType::CAT8: cast_non_compound<uint8_t>(col); break;
- case SType::CAT16: cast_non_compound<uint16_t>(col); break;
- case SType::CAT32: cast_non_compound<uint32_t>(col); break;
+ case SType::CAT8: cast_non_compound<int8_t>(col); break;
+ case SType::CAT16: cast_non_compound<int16_t>(col); break;
+ case SType::CAT32: cast_non_compound<int32_t>(col); break;
default: throw RuntimeError()
<< "Unknown categorical type: " << stype();
}
@@ -187,7 +187,7 @@ void Type_Cat::cast_non_compound(Column& col) const {
auto buf_codes_ptr = static_cast<T*>(buf_codes.xptr());
auto buf_cats_ptr = static_cast<int32_t*>(buf_cats.xptr());

- const size_t MAX_CATS = std::numeric_limits<T>::max() + size_t(1);
+ const size_t MAX_CATS = size_t(1) << (sizeof(T) * 8);

if (gb.size() > MAX_CATS) {
throw ValueError() << "Number of categories in the column is `" << gb.size()
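
The two `MAX_CATS` formulas in this hunk give different limits once `T` is a signed type. The sketch below compares them in Python for 1-, 2- and 4-byte code types. `max_cats` is a hypothetical helper, and the reading that the new formula keeps the limit the same as with the previous unsigned types is an inference, not something the commit states.

```python
def max_cats(nbytes):
    """Return (old_limit, new_limit) for a signed code type of `nbytes` bytes."""
    signed_max = (1 << (8 * nbytes - 1)) - 1  # std::numeric_limits<T>::max() for signed T
    old_limit = signed_max + 1                # std::numeric_limits<T>::max() + size_t(1)
    new_limit = 1 << (8 * nbytes)             # size_t(1) << (sizeof(T) * 8)
    return old_limit, new_limit


for name, nbytes in [("int8_t", 1), ("int16_t", 2), ("int32_t", 4)]:
    print(name, max_cats(nbytes))
# int8_t (128, 256)
# int16_t (32768, 65536)
# int32_t (2147483648, 4294967296)
```

With the unsigned types used before this commit, the old formula would have produced the same values as the new one (256, 65536 and 4294967296).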
2 changes: 2 additions & 0 deletions src/datatable/__init__.py
@@ -28,6 +28,7 @@
by,
categories,
cbind,
codes,
cumcount,
cummax,
cummin,
@@ -93,6 +94,7 @@
"by",
"categories",
"cbind",
"codes",
"corr",
"count",
"cov",
51 changes: 50 additions & 1 deletion tests/types/test-categorical.py
@@ -491,7 +491,7 @@ def test_repr_numbers_in_terminal(t):


#-------------------------------------------------------------------------------
- # Getting categories
+ # Categories
#-------------------------------------------------------------------------------

def test_categories_wrong_type():
@@ -588,3 +588,52 @@ def test_categories_multicolumn_uneven_ncats(cat_type):
assert_equals(DT_cats, DT_ref)


#-------------------------------------------------------------------------------
# Codes
#-------------------------------------------------------------------------------

def test_codes_wrong_type():
    DT = dt.Frame(range(10))
    msg = r"Invalid column of type int32 in codes\(f\.C0\)"
    with pytest.raises(TypeError, match=msg):
        DT[:, dt.codes(f.C0)]


@pytest.mark.parametrize('cat_type, code_type', [(dt.Type.cat8, dt.Type.int8),
                                                 (dt.Type.cat16, dt.Type.int16),
                                                 (dt.Type.cat32, dt.Type.int32)])
def test_codes_void(cat_type, code_type):
    N = 11
    src = [None] * N
    DT = dt.Frame(src, type=cat_type(dt.Type.void))
    DT_codes = DT[:, dt.codes(f[:])]
    DT_ref = dt.Frame([0] * N, type=code_type)
    assert_equals(DT_codes, DT_ref)


@pytest.mark.parametrize('cat_type, code_type', [(dt.Type.cat8, dt.Type.int8),
                                                 (dt.Type.cat16, dt.Type.int16),
                                                 (dt.Type.cat32, dt.Type.int32)])
def test_codes_simple(cat_type, code_type):
    src = ["cat", "dog", "mouse", "cat"]
    DT = dt.Frame([src], type=cat_type(dt.Type.str32))
    DT_codes = DT[:, dt.codes(f.C0)]
    DT_ref = dt.Frame([0, 1, 2, 0], type=code_type)
    assert_equals(DT_codes, DT_ref)


@pytest.mark.parametrize('cat_type, code_type', [(dt.Type.cat8, dt.Type.int8),
                                                 (dt.Type.cat16, dt.Type.int16),
                                                 (dt.Type.cat32, dt.Type.int32)])
def test_codes_multicolumn(cat_type, code_type):
    N = 123
    src_int = [None, 100, 500, None, 100, 100500, 100, 500] * N
    src_str = [None, "dog", "mouse", None, "dog", "cat", "dog", "pig"] * N
    DT = dt.Frame([src_int, src_str],
                  types=[cat_type(dt.Type.int32), cat_type(dt.Type.str32)])
    DT_codes = DT[:, dt.codes(f[:])]
    DT_ref = dt.Frame([[0, 1, 2, 0, 1, 3, 1, 2] * N,
                       [0, 2, 3, 0, 2, 1, 2, 4] * N], type=code_type)
    assert_equals(DT_codes, DT_ref)

