Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregator #1156

Merged
merged 20 commits into from Jul 19, 2018
Merged

Aggregator #1156

merged 20 commits into from Jul 19, 2018

Conversation

oleksiyskononenko
Copy link
Contributor

@oleksiyskononenko oleksiyskononenko commented Jul 10, 2018

Initial implementation of datatable.extras.aggregate for 1D, 2D and ND table aggregations.

Includes:

  • 1D continuous binning;
  • 2D continuous binning;
  • 1D categorical aggregation;
  • 2D categorical aggregation;
  • 2D mixed (continuous/categorical) aggregation;
  • ND aggregation that also includes a projection method when ncols > max_dimensions;
  • implementations of the first() and count() datatable reducers;
  • other minor changes to datatable.

Closes #1077

Oleksiy Kononenko added 6 commits June 20, 2018 19:02
Includes a general Python/C++ layout and implementations of
	- 1D continuous aggregation
	- 2D continuous aggregation
	- `count()` reduce function
	- wrappers to enable the usage of the `first()` reducer

1D categorical aggregation can now be done directly from Python through
`groupby/count`. It will be implemented in C++ along with the remaining
2D and ND aggregators.
Initial implementations of
	- 1D categorical aggregation
	- 2D mixed aggregation

From now on, the original dataframe will not be modified in-place,
instead, we return a new dataframe consisting of the shallow copies of
all the aggregated columns and include the binning information as the
additional one at the end. Only int32_t bins are supported, because of
the same restriction being valid for `groupby`.
Also includes modifications to other aggregators. To prevent casting to
double for each individual value, we cast all the continuous columns to
the double ones in advance. This may have consequences for the memory
usage, those will be addressed later.
Instead of two-column sorting that is not implemented in `datatable` yet
(#1082), we generate group id's by sorting each column separately. Also,
instead of using getters/setters, we now access the memory buffer
directly that should be better from the performance point of view.
Copy link
Contributor

@st-pasha st-pasha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick look, and here some general comments:

  • Please set up your editor to automatically convert Tabs into Spaces, and also convert all existing Tabs into spaces (otherwise the code looks badly indented).
  • Rebase on top of master and check that the code still compiles / runs successfully. It's been quite a while since you created this branch, and it has probably diverged from master substantially already.
  • Make sure that the code produces no warnings. If you're not sure how to eliminate some of the warnings, I can help. The code currently in master has no warnings with when compiled with the latest Clang.
  • If possible try to keep to line length limit of 80 characters. This is not a hard rule, however reviewing code on GitHub is easier if the lines are not too long.

{
create_dt_out();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where's ~Aggregator() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dt_out is the dataframe created and returned by the aggregator to the user. I'm not sure we should destroy it when the aggregator is destroyed. Do you think we should?

case ST_REAL_F4: return count_skipna<float, int64_t>;
case ST_REAL_F8: return count_skipna<double, int64_t>;
case ST_STRING_I4_VCHAR: return count_skipna<int32_t, int64_t>;
case ST_STRING_I8_VCHAR: return count_skipna<int64_t, int64_t>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be <uint32_t, ...> and <uint64_t, ...>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. Will fix it. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

void Aggregator::aggregate_2d_continuous(double epsilon, int32_t nx_bins, int32_t ny_bins) {
RealColumn<double>* c0 = (RealColumn<double>*) dt_out->columns[0];
RealColumn<double>* c1 = (RealColumn<double>*) dt_out->columns[1];
double* d_c0 = static_cast<double*>(dt_out->columns[0]->data_w());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use c0->elements_w() here and below

@@ -261,6 +262,23 @@ PyObject* delete_columns(obj* self, PyObject* args) {



PyObject* aggregate(obj* self, PyObject* args) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should not be in (py)DataTable class: the class will become user-facing once #1066 is implemented, so we want to keep the API clean.
The easiest approach is to declare this function global (e.g. listed in DatatableModuleMethods in datatablemodule.c) and move its body to extras/aggregate.cc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -0,0 +1,14 @@
#!/usr/bin/env python3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no spaces before the #!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, another bug of my editor. Fixed now. Thanks.

dt_agg = self._dt.aggregate(epsilon, n_bins, nx_bins, ny_bins, max_dimensions, seed)
return Frame(dt_agg)

Frame.aggregate = aggregate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't recommend adding this method to Frame class. If you just remove this line the code will work just fine:

from datatable.extras import aggregate
aggregate(frame, ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Been using it as dt.aggregate(...). Sure, will remove it from the Frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

def test_aggregate_2d_continuous_integer_sorted():
d0 = dt.Frame([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
d1 = d0.aggregate(1e-15, 0, 3, 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps it's better to use named parameters here... How do you know what those 0, 3, 3 mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of aggregate is as follows:

def aggregate(self, epsilon=1.0e-15, n_bins=500, nx_bins=50, ny_bins=50, max_dimensions=50, seed=0)

So these are just the number of bins for the test.

@st-pasha
Copy link
Contributor

Regarding comments -- the easiest way is to leave them from the form at the bottom of the page in "Conversation" tab of the PR. Comments left as answers to in-line comments will eventually be collapsed out of view when the line they were assigned to changes.

Regarding my comment about ~Aggregator(). I merely noticed that you declare it in the aggregator.h file, but then do not define in the aggregator.cc file. Normally this would compile but then fail at linking stage. However, your code doesn't fail at linking, which got me very surprised. I suspect the reason is that there is no place in the code where the destructor is invoked. Indeed, if you look at pydatatable::aggregate() function you'll see that you're creating Aggregate objects via new Aggregate(), but then never delete them. Thus, it's a resource leak.

As for dt_in and dt_out members of Aggregate -- there should be a clear understanding on who owns these pointers, i.e. who is required to free those resources once they are no longer needed. It looks like Aggregate is not owning them right now -- if that's intentional, then I suggest to leave a clarifying comment about the correct usage of the Aggregate class: in particular, what measures must the user take to ensure that the pointers do not become dangling.

Oleksiy Kononenko added 2 commits July 11, 2018 20:02
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
#-------------------------------------------------------------------------------

__all__ = ("aggregate", )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, in order to export a symbol, you need to import or define it first.
However in your case (allow Python to find datatable.extras.aggregate) having an empty __init__.py should work just fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was about to remove this line. Thanks!

@h2oai h2oai deleted a comment from oleksiyskononenko Jul 16, 2018
//------------------------------------------------------------------------------

template<typename IT, typename OT>
static void count_skipna(const int32_t* groups, int32_t grp, void** params) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traditionally, count(x) function returns the number of non-NA values in x.
In your implementation, however, the function doesn't count anything, but merely returns the number of elements in each group. It's a valid function, just not a suitable name...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reimplemented this function to match the existing name.

const double* d_c0 = static_cast<const double*>(dt_out->columns[0]->data());
int32_t* d_c1 = static_cast<int32_t*>(dt_out->columns[1]->data_w());

//TODO: handle the case when the column is constant, i.e. min = max
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please indent comments (here and below) at the same level as the surrounding code, otherwise it messes up with code-folding in my editor.


DECLARE_FUNCTION(
aggregate,
"aggregate()\n\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument is the function's docstring. In your case the function has signature OiiiiI, so it would be good to list what parameters the function actually takes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -563,8 +563,7 @@ def _delete_columns(self, cols):
newnames += self.names[(cols[i - 1] + 1):cols[i]]
newnames += self.names[cols[-1] + 1:]
self._fill_from_dt(self._dt, names=newnames)



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like your editor doesn't have "Trim trailing whitespace" option turned on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Actually C++ editor had this option turned on, had to fix the Python one only.

@h2oai h2oai deleted a comment from oleksiyskononenko Jul 16, 2018
@h2oai h2oai deleted a comment from oleksiyskononenko Jul 16, 2018
Oleksiy Kononenko added 4 commits July 18, 2018 12:30
Aggregator now returns two dataframes:
1) exemplars dataframe in the format of (original_data_columns,
number_of_members)

2) members dataframe in the format of (exemplar_id_it_belongs_to), the
exemplar ids are the row ids from the frame 1)
This is not a count(*) function that was used by the aggregator before.
count(*) still to be implemented.
[&](int64_t i) {
IT x = inputs[i];
if (!ISNA<IT>(x))
++count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count += !ISNA<IT>(x); should be faster (in theory), as it avoids branching

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed.

@st-pasha
Copy link
Contributor

See also #1177 for further suggestions for improvement (after this PR is merged)

@st-pasha st-pasha merged commit ab59524 into master Jul 19, 2018
@st-pasha st-pasha deleted the aggregator branch July 19, 2018 00:49
abal5 pushed a commit that referenced this pull request Sep 13, 2018
Initial implementation of `datatable.extras.aggregate` for 1D, 2D and ND table aggregations.

Includes:
- 1D continuous binning;
- 2D continuous binning;
- 1D categorical aggregation;
- 2D categorical aggregation;
- 2D mixed (continuous/categorical) aggregation;
- ND aggregation that also includes a projection method when ncols > max_dimensions;
- implementations of the `first()` and `count()` `datatable` reducers;
- other minor changes to `datatable`.
@st-pasha st-pasha added this to the Release 0.7.0 milestone Jan 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggregator in datatable
2 participants