Aggregator #1156

oleksiyskononenko · 2018-07-10T23:57:55Z

Initial implementation of datatable.extras.aggregate for 1D, 2D and ND table aggregations.

Includes:

- 1D continuous binning;
- 2D continuous binning;
- 1D categorical aggregation;
- 2D categorical aggregation;
- 2D mixed (continuous/categorical) aggregation;
- ND aggregation that also includes a projection method when ncols > max_dimensions;
- implementations of the `first()` and `count()` `datatable` reducers;
- other minor changes to `datatable`.
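For readers unfamiliar with the term, 1D continuous binning can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the PR's C++ code: `bin_1d_continuous` is a hypothetical helper, `None` stands in for NA, bins are equal-width, and the `epsilon` parameter of the real aggregator is not modelled.

```python
def bin_1d_continuous(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (hypothetical helper)."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing but missing values: no bins can be assigned.
        return [None] * len(values)
    lo, hi = min(present), max(present)
    if lo == hi:
        # Constant column (min == max): everything falls into a single bin.
        return [None if v is None else 0 for v in values]
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        if v is None:
            bins.append(None)
        else:
            # Clamp so that the maximum lands in the last bin, not one past it.
            bins.append(min(int((v - lo) / width), n_bins - 1))
    return bins


bins = bin_1d_continuous([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
```

The clamp is what keeps the largest value inside the last bin; without it, the maximum would map to bin index `n_bins`.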

Includes a general Python/C++ layout and implementations of
- 1D continuous aggregation
- 2D continuous aggregation
- `count()` reduce function
- wrappers to enable the usage of the `first()` reducer

1D categorical aggregation can now be done directly from Python through `groupby/count`. It will be implemented in C++ along with the remaining 2D and ND aggregators.

Initial implementations of
- 1D categorical aggregation
- 2D mixed aggregation

From now on, the original dataframe is not modified in place. Instead, we return a new dataframe consisting of shallow copies of all the aggregated columns, with the binning information appended as an additional column at the end. Only int32_t bins are supported, because `groupby` has the same restriction.

Also includes modifications to other aggregators. To avoid casting each individual value to double, we cast all continuous columns to double in advance. This may increase memory usage; that will be addressed later.

Instead of two-column sorting, which is not implemented in `datatable` yet (#1082), we generate group ids by sorting each column separately. Also, instead of using getters/setters, we now access the memory buffer directly, which should be better for performance.
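One way to picture building group ids from per-column sorts is the sketch below. It is an assumption-laden illustration, not necessarily what the PR does in C++: `column_group_ids` and `combined_group_ids` are hypothetical helpers, and the combination rule (`id0 + n0 * id1`) is just one simple way to merge two per-column ids into a single id.

```python
def column_group_ids(column):
    """Rank the distinct values of one column by sorting them (hypothetical helper)."""
    rank = {value: i for i, value in enumerate(sorted(set(column)))}
    return [rank[v] for v in column], len(rank)


def combined_group_ids(col0, col1):
    """Merge two per-column ids into one id per row, without a two-column sort."""
    ids0, n0 = column_group_ids(col0)
    ids1, _ = column_group_ids(col1)
    # Rows share a combined id exactly when they agree in both columns.
    return [a + n0 * b for a, b in zip(ids0, ids1)]


group_ids = combined_group_ids(["a", "b", "a", "b"], ["x", "x", "y", "x"])
```

Because each column is ranked independently, no joint sort over pairs is needed; the combined ids are not necessarily contiguous, which may or may not matter to the consumer.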

st-pasha

I took a quick look, and here are some general comments:

- Please set up your editor to automatically convert Tabs into Spaces, and also convert all existing Tabs into spaces (otherwise the code looks badly indented).
- Rebase on top of master and check that the code still compiles / runs successfully. It's been quite a while since you created this branch, and it has probably diverged from master substantially already.
- Make sure that the code produces no warnings. If you're not sure how to eliminate some of the warnings, I can help. The code currently in master has no warnings when compiled with the latest Clang.
- If possible, try to keep to a line length limit of 80 characters. This is not a hard rule; however, reviewing code on GitHub is easier if the lines are not too long.

st-pasha · 2018-07-11T00:08:19Z

c/extras/aggregator.cc

+{
+  create_dt_out();
+}
+


where's ~Aggregator() ?

This dt_out is the dataframe created and returned by the aggregator to the user. I'm not sure we should destroy it when the aggregator is destroyed. Do you think we should?

st-pasha · 2018-07-11T00:09:40Z

c/expr/reduceop.cc

+      case ST_REAL_F4:     		return count_skipna<float, int64_t>;
+      case ST_REAL_F8:     		return count_skipna<double, int64_t>;
+      case ST_STRING_I4_VCHAR:  return count_skipna<int32_t, int64_t>;
+      case ST_STRING_I8_VCHAR:  return count_skipna<int64_t, int64_t>;


these should be <uint32_t, ...> and <uint64_t, ...>.

Yes, you're right. Will fix it. Thanks.

st-pasha · 2018-07-11T00:22:21Z

c/extras/aggregator.cc

+void Aggregator::aggregate_2d_continuous(double epsilon, int32_t nx_bins, int32_t ny_bins) {
+  RealColumn<double>* c0 = (RealColumn<double>*) dt_out->columns[0];
+  RealColumn<double>* c1 = (RealColumn<double>*) dt_out->columns[1];
+  double* d_c0 = static_cast<double*>(dt_out->columns[0]->data_w());


You could use c0->elements_w() here and below

st-pasha · 2018-07-11T00:35:04Z

c/py_datatable.cc

@@ -261,6 +262,23 @@ PyObject* delete_columns(obj* self, PyObject* args) {



+PyObject* aggregate(obj* self, PyObject* args) {


This method should not be in (py)DataTable class: the class will become user-facing once #1066 is implemented, so we want to keep the API clean.
The easiest approach is to declare this function global (e.g. listed in DatatableModuleMethods in datatablemodule.c) and move its body to extras/aggregate.cc

Ok, got it.

st-pasha · 2018-07-11T00:37:32Z

datatable/extras/aggregate.py

@@ -0,0 +1,14 @@
+    #!/usr/bin/env python3


no spaces before the #!

Yes, another bug of my editor. Fixed now. Thanks.

st-pasha · 2018-07-11T00:40:24Z

datatable/extras/aggregate.py

+    dt_agg = self._dt.aggregate(epsilon, n_bins, nx_bins, ny_bins, max_dimensions, seed)
+    return Frame(dt_agg)
+
+Frame.aggregate = aggregate


I wouldn't recommend adding this method to Frame class. If you just remove this line the code will work just fine:

    from datatable.extras import aggregate
    aggregate(frame, ...)

Been using it as dt.aggregate(...). Sure, will remove it from the Frame.

st-pasha · 2018-07-11T00:42:32Z

tests/extras/test_aggregate.py

+def test_aggregate_2d_continuous_integer_sorted():
+    d0 = dt.Frame([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 
+                   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
+    d1 = d0.aggregate(1e-15, 0, 3, 3)


perhaps it's better to use named parameters here... How do you know what those 0, 3, 3 mean?

The definition of aggregate is as follows:

def aggregate(self, epsilon=1.0e-15, n_bins=500, nx_bins=50, ny_bins=50, max_dimensions=50, seed=0)

So these are just the number of bins for the test.
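To illustrate the reviewer's point about named parameters, here is a small sketch. The `aggregate` below is only a stand-in stub that mirrors the signature quoted above (with `frame` in place of `self`) and returns the resolved settings; it is not the real aggregator.

```python
def aggregate(frame, epsilon=1.0e-15, n_bins=500, nx_bins=50, ny_bins=50,
              max_dimensions=50, seed=0):
    """Stand-in stub: echoes the resolved arguments instead of aggregating."""
    return {
        "epsilon": epsilon,
        "n_bins": n_bins,
        "nx_bins": nx_bins,
        "ny_bins": ny_bins,
        "max_dimensions": max_dimensions,
        "seed": seed,
    }


# Positional form, as in the test under review: the reader has to look up
# the signature to learn what 0, 3, 3 mean.
positional = aggregate(None, 1e-15, 0, 3, 3)

# Keyword form: same call, but self-documenting.
named = aggregate(None, epsilon=1e-15, n_bins=0, nx_bins=3, ny_bins=3)
```

Both calls resolve to the same settings; the keyword form also stays correct if parameters are later reordered.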

st-pasha · 2018-07-11T07:35:35Z

Regarding comments -- the easiest way is to leave them using the form at the bottom of the page in the "Conversation" tab of the PR. Comments left as replies to in-line comments will eventually be collapsed out of view when the line they were attached to changes.

Regarding my comment about ~Aggregator(): I merely noticed that you declare it in the aggregator.h file, but then do not define it in the aggregator.cc file. Normally this would compile but then fail at the linking stage. However, your code doesn't fail at linking, which surprised me. I suspect the reason is that there is no place in the code where the destructor is invoked. Indeed, if you look at the pydatatable::aggregate() function, you'll see that you're creating Aggregator objects via new Aggregator(), but then never delete them. Thus, it's a resource leak.

As for the dt_in and dt_out members of Aggregator: there should be a clear understanding of who owns these pointers, i.e. who is required to free those resources once they are no longer needed. It looks like Aggregator does not own them right now. If that's intentional, then I suggest leaving a clarifying comment about the correct usage of the Aggregator class: in particular, what measures the user must take to ensure that the pointers do not become dangling.

st-pasha · 2018-07-12T21:51:05Z

datatable/extras/__init__.py

+#   file, You can obtain one at http://mozilla.org/MPL/2.0/.
+#-------------------------------------------------------------------------------
+
+__all__ = ("aggregate", )


Generally, in order to export a symbol, you need to import or define it first.
However, in your case (allowing Python to find datatable.extras.aggregate), having an empty __init__.py should work just fine.

Yes, I was about to remove this line. Thanks!

st-pasha · 2018-07-16T21:32:54Z

c/expr/reduceop.cc

+//------------------------------------------------------------------------------
+
+template<typename IT, typename OT>
+static void count_skipna(const int32_t* groups, int32_t grp, void** params) {


Traditionally, count(x) function returns the number of non-NA values in x.
In your implementation, however, the function doesn't count anything, but merely returns the number of elements in each group. It's a valid function, just not a suitable name...

Reimplemented this function to match the existing name.
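The distinction the reviewer draws, group size versus number of non-NA values, can be sketched as follows. This is a plain-Python illustration under assumptions: `None` stands in for NA, groups are described by a hypothetical `offsets` list of row boundaries, and neither function is the PR's C++ reducer.

```python
def group_sizes(offsets):
    """Number of rows in each group, regardless of missing values."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


def count_skipna(values, offsets):
    """Number of non-missing values in each group."""
    counts = []
    for start, end in zip(offsets, offsets[1:]):
        count = 0
        for x in values[start:end]:
            # A bool adds as 0 or 1, so missing values contribute nothing.
            count += x is not None
        counts.append(count)
    return counts


values = [1, None, 3, None, None, 6]
offsets = [0, 3, 6]          # two groups: rows 0..2 and rows 3..5
sizes = group_sizes(offsets)
counts = count_skipna(values, offsets)
```

Here both groups have three rows, but they contain two and one non-missing values respectively, which is why the two functions deserve different names.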

st-pasha · 2018-07-16T21:37:16Z

c/extras/aggregator.cc

+  const double* d_c0 = static_cast<const double*>(dt_out->columns[0]->data());
+  int32_t* d_c1 = static_cast<int32_t*>(dt_out->columns[1]->data_w());
+
+//TODO: handle the case when the column is constant, i.e. min = max


Please indent comments (here and below) at the same level as the surrounding code, otherwise it messes up code folding in my editor.

st-pasha · 2018-07-16T21:41:34Z

c/extras/aggregator.h

+
+DECLARE_FUNCTION(
+  aggregate,
+  "aggregate()\n\n",


This argument is the function's docstring. In your case the function has signature OiiiiI, so it would be good to list what parameters the function actually takes.

st-pasha · 2018-07-16T21:44:01Z

datatable/frame.py

@@ -563,8 +563,7 @@ def _delete_columns(self, cols):
            newnames += self.names[(cols[i - 1] + 1):cols[i]]
        newnames += self.names[cols[-1] + 1:]
        self._fill_from_dt(self._dt, names=newnames)
-
-
+


Looks like your editor doesn't have "Trim trailing whitespace" option turned on.

Done. Actually C++ editor had this option turned on, had to fix the Python one only.

st-pasha · 2018-07-18T20:20:41Z

c/expr/reduceop.cc

+    [&](int64_t i) {
+      IT x = inputs[i];
+      if (!ISNA<IT>(x))
+        ++count;


count += !ISNA<IT>(x); should be faster (in theory), as it avoids branching

Right, fixed.

st-pasha · 2018-07-18T20:40:50Z

See also #1177 for further suggestions for improvement (after this PR is merged)

Oleksiy Kononenko added 6 commits June 20, 2018 19:02

Initial commit of ND aggregator.

b50cbd7

Tests for 1D and 2D aggregators.

1c17235

Extracting aggregator as a separate class.

1f4d7f5

oleksiyskononenko requested a review from st-pasha July 10, 2018 23:57

st-pasha assigned oleksiyskononenko Jul 10, 2018

Merge branch 'master' into aggregator

10cf165

st-pasha reviewed Jul 11, 2018

Oleksiy Kononenko added 2 commits July 10, 2018 18:21

Changes to make the aggregator branch be consistent with the master one.

d025bee

Various fixes: editor problems, warnings, etc.

e85f74e

Oleksiy Kononenko added 2 commits July 11, 2018 20:02

Moving the Python aggregator code from py_datatable to `extras/aggregator`. Cosmetics.

ea708af

Adding __init__.py

56be087

st-pasha reviewed Jul 12, 2018

Oleksiy Kononenko added 3 commits July 12, 2018 15:00

Emptying the __init__.py file

d932738

Merge remote-tracking branch 'origin/master' into aggregator.

57d7232

Also adjusting the LType/SType usage.

Move machine precision epsilon to the Aggregator class.

b231984

h2oai deleted a comment from oleksiyskononenko Jul 16, 2018

st-pasha reviewed Jul 16, 2018

h2oai deleted a comment from oleksiyskononenko Jul 16, 2018

Oleksiy Kononenko added 4 commits July 18, 2018 12:30

Generate exemplars and members dataframes

27de79d

Aggregator now returns two dataframes:
1) an exemplars dataframe in the format (original_data_columns, number_of_members);
2) a members dataframe in the format (exemplar_id_it_belongs_to), where the exemplar ids are the row ids from frame 1).
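The two-frame layout described in this commit message can be sketched with plain Python lists. This is an illustration under assumptions, not the PR's code: `exemplars_and_members` is a hypothetical helper, the bin ids are taken as given, and choosing the first row of each bin as its exemplar is an assumption.

```python
def exemplars_and_members(rows, bin_ids):
    """Build (exemplars, members) from per-row bin ids (hypothetical helper)."""
    exemplar_of_bin = {}   # bin id -> row id in the exemplars list
    exemplars = []         # each entry: [original_row, number_of_members]
    members = []           # for each input row: id of the exemplar it belongs to
    for row, bin_id in zip(rows, bin_ids):
        if bin_id not in exemplar_of_bin:
            # First row seen in this bin becomes its exemplar (assumption).
            exemplar_of_bin[bin_id] = len(exemplars)
            exemplars.append([row, 0])
        exemplar_id = exemplar_of_bin[bin_id]
        exemplars[exemplar_id][1] += 1
        members.append(exemplar_id)
    return [(row, n) for row, n in exemplars], members


rows = [(0.1,), (0.2,), (5.0,), (0.3,)]
exemplars, members = exemplars_and_members(rows, [0, 0, 2, 0])
```

Each entry of `members` indexes into `exemplars`, so the member counts in the first frame always sum to the number of input rows.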

Proper implementation of count_skipna reducer

f1695dc

This is not a count(*) function that was used by the aggregator before. count(*) still to be implemented.

Adding the list of parameters to the aggregate docstring

90a0c5a

Plus cosmetics.

Removing trailing whitespace

3b7fbf9

st-pasha reviewed Jul 18, 2018

Minor count_skipna modification to avoid branching

b76e964

Setting up correct column names for exemplars and members

e6979aa

st-pasha approved these changes Jul 19, 2018

st-pasha merged commit ab59524 into master Jul 19, 2018

st-pasha deleted the aggregator branch July 19, 2018 00:49

st-pasha added this to the Release 0.7.0 milestone Jan 28, 2020
