-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PUBDEV-6938 GroupBy - support for grouping by String columns #4594
PUBDEV-6938 GroupBy - support for grouping by String columns #4594
Conversation
Add GroupBy support for String values - allowing one to group values using String columns. Add unit test for this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall this looks good to me, memory-wise this can get quite hungry though and I guess that would be the reason this was not supported
overall I am fine with the code but I would urge the author to once again think about the necessity to have the two arrays of columns everywhere, from my quick look it seems possible to have only one array and maybe two arrays with different value types and one mapping array to find the right value for given column/chunk, would make this change much smaller
for (int i = 0; i < gbColsStr.length; i++) { | ||
if (g1._gsStr[i] != null && g2._gsStr[i] == null) return -1; | ||
if (g1._gsStr[i] == null && g2._gsStr[i] != null) return 1; | ||
if (!g1._gsStr[i].equals(g2._gsStr[i])) return g1._gsStr[i].compareTo(g2._gsStr[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compareTo already does equals, I think doing both here is not efficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that it should not return 0
if we hit values that are are equal - based on the previous ordering implementation, as explained in the comment below:
// Compare 2 groups. Iterate down _gs, stop when _gs[i] > that._gs[i],
// or _gs[i] < that._gs[i]. Order by various columns specified by
// gbCols. NaN is treated as least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, thats true, than maybe you could do
int res = g1._gsStr[i].compareTo(g2._gsStr[i]); if (res != 0) return res;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this looks cleaner and more efficient. I changed it.
|
||
// Load into working array | ||
if (vec.isString()) | ||
_gsStr[c] = chks[c].stringAt(row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is better to keep everything in BufferedStrings as much as possible (atStr method), this won't duplicate the memory needed to hold the string, BufferedString just references an existing byte array that holds the actual data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I will adjust it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to use BufferedString
internally.
@honzasterba I see. Hm. I don't know whether I get what you mean by:
Could you elaborate a bit more? The other possiblity that came to my mind was to use single array for column indices by data should be grouped and then add another array of types (containing enum values of certain allowed types - |
yes, sort order is also an argument to use just one array of cols, I think the string/double dychotomy could be very well hidden inside the Group object. |
b41eac2
to
14a5a6c
Compare
h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java
Outdated
Show resolved
Hide resolved
h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java
Outdated
Show resolved
Hide resolved
h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java
Outdated
Show resolved
Hide resolved
h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java
Outdated
Show resolved
Hide resolved
42c6c59
to
75cfaef
Compare
So I changed the code to use single array for column indices for grouping the data (based on the previous discussion with @honzasterba). Therefore now it is already supported throught the "Ast API". I added unit test for this case as well. |
h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java
Outdated
Show resolved
Hide resolved
h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java
Outdated
Show resolved
Hide resolved
75cfaef
to
eed14a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
_aggs = aggs; | ||
_hasMedian = hasMedian; | ||
} | ||
|
||
protected void map(Chunk[] cs, IcedHashSet<G> groups) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thank you for removing the duplicated code!
|
||
// Load into working array | ||
if (vec.isString()) | ||
_gsStr[c] = chks[c].atStr(new BufferedString(), row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation of the design of the G class is to avoid allocating memory per-observation/row
To follow the same pattern we should be re-using (pre-)allocated BufferedStrings when the G class gets created. This is a minor comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I changed it to use pre-allocated BufferedStrings
.
for (int c = 0; c < chks.length; c++) // For all selection cols | ||
_gs[c] = chks[c].atd(row); // Load into working array | ||
for (int c = 0; c < chks.length; c++) { // For all selection cols | ||
Vec vec = chks[c].vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of code in this PR handles switching between the 2 arrays (_gs nad _gsStr). This is due to the fact that we can have an arbitrary order of string/numerical columns on the input. However, in the implementation itself, we can re-order the columns to make sure we will have the exact order we want, eg. numerical first then string ones. Would that be something that would help with making the code less complex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing that comes to my mind that could solve this problem. Switching between _gs
and _gsStr
is done in 2 places - filling the output frame and sorting it. In both cases it is expected to preserve order of the columns. There is a possibility (at least for the "filling" part) of replacing gbColsTypes
array with 2 arrays defining order for each of the columns in _gs
and _gsStr
. But I don't know whether it significantly reduces the complexity, if at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my point of view, two arrays are ok in this case. One array with both types of values has the same complexity; there should also be if
where we will have to decide what to do with string and numeric separately. Maybe some class can solve this, however, I think it is overengineering in this case.
@jancijen thank you, it was a pleasure reading this PR. I suggest thinking about if we can avoid allocation of BufferedStrings in the G#fill method. Also would like to know if re-ordering of the columns would be a feasible way of reducing the complexity of the code (in some parts - for sort you need the original order). |
eed14a3
to
62cbb2e
Compare
Thank you both for thorough feedback. I addressed your comments, but as I mentioned I am not sure about reducing the complexity in switching between |
Rework GroupBy functionality to work with single array of groupby columns indices. This also fixes previous issue with ordering of groupby columns in the output frame and allows it to be used through Ast api. Use BufferedStrings internally to achieve better memory efficiency.
62cbb2e
to
1ece7b8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Only one recommendation for better tests. Thank you, @michalkurka and @honzasterba, for complex review. Thank you, @jancijen, for this PR, I especially like you described every step and change.
System.out.println("GroupBy result:"); | ||
System.out.println(resFrame.toTwoDimTable().toString()); | ||
|
||
Assert.assertEquals(expectedResFrame.numCols(), resFrame.numCols()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add message to all asserts. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I added messages to all asserts, and I added unit tests for new methods in ArrayUtils
- select
and occurrenceCount
.
for (int c = 0; c < chks.length; c++) // For all selection cols | ||
_gs[c] = chks[c].atd(row); // Load into working array | ||
for (int c = 0; c < chks.length; c++) { // For all selection cols | ||
Vec vec = chks[c].vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my point of view, two arrays are ok in this case. One array with both types of values has the same complexity; there should also be if
where we will have to decide what to do with string and numeric separately. Maybe some class can solve this, however, I think it is overengineering in this case.
Add unit tests for new ArrayUtils - select and occurrenceCount methods. Add assert messages to the unit tests.
So reviews are final, @jancijen, you can merge this PR. |
This PR adds GroupBy support for
String
values (not forAstGroup.FCN
functions, but it allows grouping rows by a column which is of typeString
), in order to implement TF-IDF (#4380) by using GroupBy operation (AstGroup
).This functionality is not directly accessible from "Ast API", but it can be used via public methods and classes from
AstGroup
as shown in the unit test.