New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregator #1156
Aggregator #1156
Conversation
Includes a general Python/C++ layout and implementations of - 1D continuous aggregation - 2D continuous aggregation - `count()` reduce function - wrappers to enable the usage of the `first()` reducer 1D categorical aggregation can now be done directly from Python through `groupby/count`. It will be implemented in C++ along with the remaining 2D and ND aggregators.
Initial implementations of - 1D categorical aggregation - 2D mixed aggregation From now on, the original dataframe will not be modified in-place, instead, we return a new dataframe consisting of the shallow copies of all the aggregated columns and include the binning information as the additional one at the end. Only int32_t bins are supported, because of the same restriction being valid for `groupby`.
Also includes modifications to other aggregators. To prevent casting to double for each individual value, we cast all the continuous columns to the double ones in advance. This may have consequences for the memory usage, those will be addressed later.
Instead of two-column sorting that is not implemented in `datatable` yet (#1082), we generate group id's by sorting each column separately. Also, instead of using getters/setters, we now access the memory buffer directly that should be better from the performance point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look, and here some general comments:
- Please set up your editor to automatically convert Tabs into Spaces, and also convert all existing Tabs into spaces (otherwise the code looks badly indented).
- Rebase on top of master and check that the code still compiles / runs successfully. It's been quite a while since you created this branch, and it has probably diverged from master substantially already.
- Make sure that the code produces no warnings. If you're not sure how to eliminate some of the warnings, I can help. The code currently in master has no warnings with when compiled with the latest Clang.
- If possible try to keep to line length limit of 80 characters. This is not a hard rule, however reviewing code on GitHub is easier if the lines are not too long.
c/extras/aggregator.cc
Outdated
{ | ||
create_dt_out(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where's ~Aggregator()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dt_out
is the dataframe created and returned by the aggregator to the user. I'm not sure we should destroy it when the aggregator is destroyed. Do you think we should?
c/expr/reduceop.cc
Outdated
case ST_REAL_F4: return count_skipna<float, int64_t>; | ||
case ST_REAL_F8: return count_skipna<double, int64_t>; | ||
case ST_STRING_I4_VCHAR: return count_skipna<int32_t, int64_t>; | ||
case ST_STRING_I8_VCHAR: return count_skipna<int64_t, int64_t>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be <uint32_t, ...>
and <uint64_t, ...>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. Will fix it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
c/extras/aggregator.cc
Outdated
void Aggregator::aggregate_2d_continuous(double epsilon, int32_t nx_bins, int32_t ny_bins) { | ||
RealColumn<double>* c0 = (RealColumn<double>*) dt_out->columns[0]; | ||
RealColumn<double>* c1 = (RealColumn<double>*) dt_out->columns[1]; | ||
double* d_c0 = static_cast<double*>(dt_out->columns[0]->data_w()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use c0->elements_w()
here and below
c/py_datatable.cc
Outdated
@@ -261,6 +262,23 @@ PyObject* delete_columns(obj* self, PyObject* args) { | |||
|
|||
|
|||
|
|||
PyObject* aggregate(obj* self, PyObject* args) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should not be in (py)DataTable class: the class will become user-facing once #1066 is implemented, so we want to keep the API clean.
The easiest approach is to declare this function global (e.g. listed in DatatableModuleMethods
in datatablemodule.c) and move its body to extras/aggregate.cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
datatable/extras/aggregate.py
Outdated
@@ -0,0 +1,14 @@ | |||
#!/usr/bin/env python3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no spaces before the #!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, another bug of my editor. Fixed now. Thanks.
datatable/extras/aggregate.py
Outdated
dt_agg = self._dt.aggregate(epsilon, n_bins, nx_bins, ny_bins, max_dimensions, seed) | ||
return Frame(dt_agg) | ||
|
||
Frame.aggregate = aggregate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't recommend adding this method to Frame
class. If you just remove this line the code will work just fine:
from datatable.extras import aggregate
aggregate(frame, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Been using it as dt.aggregate(...)
. Sure, will remove it from the Frame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
tests/extras/test_aggregate.py
Outdated
def test_aggregate_2d_continuous_integer_sorted(): | ||
d0 = dt.Frame([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], | ||
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]) | ||
d1 = d0.aggregate(1e-15, 0, 3, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps it's better to use named parameters here... How do you know what those 0, 3, 3
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of aggregate is as follows:
def aggregate(self, epsilon=1.0e-15, n_bins=500, nx_bins=50, ny_bins=50, max_dimensions=50, seed=0)
So these are just the number of bins for the test.
Regarding comments -- the easiest way is to leave them from the form at the bottom of the page in "Conversation" tab of the PR. Comments left as answers to in-line comments will eventually be collapsed out of view when the line they were assigned to changes. Regarding my comment about As for |
`extras/aggregator`. Cosmetics.
datatable/extras/__init__.py
Outdated
# file, You can obtain one at http://mozilla.org/MPL/2.0/. | ||
#------------------------------------------------------------------------------- | ||
|
||
__all__ = ("aggregate", ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, in order to export a symbol, you need to import or define it first.
However in your case (allow Python to find datatable.extras.aggregate
) having an empty __init__.py
should work just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was about to remove this line. Thanks!
Also adjusting the LType/SType usage.
//------------------------------------------------------------------------------ | ||
|
||
template<typename IT, typename OT> | ||
static void count_skipna(const int32_t* groups, int32_t grp, void** params) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Traditionally, count(x)
function returns the number of non-NA values in x
.
In your implementation, however, the function doesn't count anything, but merely returns the number of elements in each group. It's a valid function, just not a suitable name...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reimplemented this function to match the existing name.
c/extras/aggregator.cc
Outdated
const double* d_c0 = static_cast<const double*>(dt_out->columns[0]->data()); | ||
int32_t* d_c1 = static_cast<int32_t*>(dt_out->columns[1]->data_w()); | ||
|
||
//TODO: handle the case when the column is constant, i.e. min = max |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please indent comments (here and below) at the same level as the surrounding code, otherwise it messes up with code-folding in my editor.
c/extras/aggregator.h
Outdated
|
||
DECLARE_FUNCTION( | ||
aggregate, | ||
"aggregate()\n\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This argument is the function's docstring. In your case the function has signature OiiiiI
, so it would be good to list what parameters the function actually takes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
datatable/frame.py
Outdated
@@ -563,8 +563,7 @@ def _delete_columns(self, cols): | |||
newnames += self.names[(cols[i - 1] + 1):cols[i]] | |||
newnames += self.names[cols[-1] + 1:] | |||
self._fill_from_dt(self._dt, names=newnames) | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like your editor doesn't have "Trim trailing whitespace" option turned on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Actually C++ editor had this option turned on, had to fix the Python one only.
Aggregator now returns two dataframes: 1) exemplars dataframe in the format of (original_data_columns, number_of_members) 2) members dataframe in the format of (exemplar_id_it_belongs_to), the exemplar ids are the row ids from the frame 1)
This is not a count(*) function that was used by the aggregator before. count(*) still to be implemented.
Plus cosmetics.
c/expr/reduceop.cc
Outdated
[&](int64_t i) { | ||
IT x = inputs[i]; | ||
if (!ISNA<IT>(x)) | ||
++count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count += !ISNA<IT>(x);
should be faster (in theory), as it avoids branching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, fixed.
See also #1177 for further suggestions for improvement (after this PR is merged) |
Initial implementation of `datatable.extras.aggregate` for 1D, 2D and ND table aggregations. Includes: - 1D continuous binning; - 2D continuous binning; - 1D categorical aggregation; - 2D categorical aggregation; - 2D mixed (continuous/categorical) aggregation; - ND aggregation that also includes a projection method when ncols > max_dimensions; - implementations of the `first()` and `count()` `datatable` reducers; - other minor changes to `datatable`.
Initial implementation of
datatable.extras.aggregate
for 1D, 2D and ND table aggregations.Includes:
first()
andcount()
datatable
reducers;datatable
.Closes #1077