PUBDEV-6938 GroupBy - support for grouping by String columns #4594

jancijen · 2020-05-10T15:56:32Z

This PR adds GroupBy support for String values (not for AstGroup.FCN functions, but it allows grouping rows by a column which is of type String), in order to implement TF-IDF (#4380) by using GroupBy operation (AstGroup).

This functionality is not directly accessible from "Ast API", but it can be used via public methods and classes from AstGroup as shown in the unit test.

Add GroupBy support for String values - allowing one to group values using String columns. Add unit test for this functionality.

honzasterba

overall this looks good to me, memory-wise this can get quite hungry though and I guess that would be the reason this was not supported

Overall I am fine with the code, but I would urge the author to think once more about whether the two arrays of columns are needed everywhere. From a quick look, it seems possible to have only one array of columns, perhaps with two value arrays of different types and one mapping array to find the right value for a given column/chunk. That would make this change much smaller.

honzasterba · 2020-05-11T08:35:11Z

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

+          for (int i = 0; i < gbColsStr.length; i++) {
+            if (g1._gsStr[i] != null && g2._gsStr[i] == null) return -1;
+            if (g1._gsStr[i] == null && g2._gsStr[i] != null) return 1;
+            if (!g1._gsStr[i].equals(g2._gsStr[i])) return g1._gsStr[i].compareTo(g2._gsStr[i]);


compareTo already does equals, I think doing both here is not efficient

I believe it should not return 0 when we hit values that are equal, based on the previous ordering implementation, as explained in the comment below:

// Compare 2 groups. Iterate down _gs, stop when _gs[i] > that._gs[i],
// or _gs[i] < that._gs[i]. Order by various columns specified by
// gbCols. NaN is treated as least

Ah, that's true. Then maybe you could do

int res = g1._gsStr[i].compareTo(g2._gsStr[i]); if (res != 0) return res;

Sure, this looks cleaner and more efficient. I changed it.
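
For readers following this thread, the agreed pattern can be sketched outside of H2O as below. This is a minimal illustration, not the code that was merged: the class `GroupKeyCompareSketch` and method `compareKeys` are hypothetical, plain `String[]` keys stand in for `_gsStr`, and the both-null branch is an assumption added here because the quoted hunk does not show how that case is handled.

```java
import java.util.Arrays;

// Standalone sketch; plain String[] keys stand in for the PR's _gsStr arrays.
final class GroupKeyCompareSketch {

    // Compares two group keys column by column. A null (missing) value sorts
    // after a non-null one, mirroring the -1 / 1 returns in the hunk above.
    // Assumes both keys have the same number of columns.
    static int compareKeys(String[] a, String[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == null && b[i] == null) continue; // assumption: both missing is a tie
            if (a[i] != null && b[i] == null) return -1;
            if (a[i] == null) return 1;
            int res = a[i].compareTo(b[i]); // one call, no separate equals()
            if (res != 0) return res;       // return only once a column differs
        }
        return 0; // every column compared equal
    }

    public static void main(String[] args) {
        String[][] keys = { {"b", "x"}, {"a", null}, {"a", "y"} };
        Arrays.sort(keys, GroupKeyCompareSketch::compareKeys);
        System.out.println(Arrays.deepToString(keys));
    }
}
```

Returning inside the loop only when `res != 0` lets the comparison fall through to the next group-by column on a tie, which is the behaviour the quoted comment describes.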

michalkurka · 2020-05-11T14:39:42Z

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

+
+        // Load into working array
+        if (vec.isString())
+          _gsStr[c] = chks[c].stringAt(row);


it is better to keep everything in BufferedStrings as much as possible (atStr method), this won't duplicate the memory needed to hold the string, BufferedString just references an existing byte array that holds the actual data

Makes sense. I will adjust it.

I changed it to use BufferedString internally.
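
To illustrate the memory argument in general terms, here is a small sketch of a view over a shared byte array. `ByteSliceSketch` is a hypothetical stand-in written for this note; it is not H2O's `BufferedString` and makes no claim about that class's actual fields or methods.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a BufferedString-like view; not H2O's class.
final class ByteSliceSketch {
    final byte[] buf; // shared backing array, never copied by the view
    final int off;    // start of this value inside buf
    final int len;    // length of this value in bytes

    ByteSliceSketch(byte[] buf, int off, int len) {
        this.buf = buf;
        this.off = off;
        this.len = len;
    }

    // Materializing a java.lang.String is the step that copies the bytes.
    @Override
    public String toString() {
        return new String(buf, off, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] chunk = "applebanana".getBytes(StandardCharsets.UTF_8);
        ByteSliceSketch first = new ByteSliceSketch(chunk, 0, 5);
        ByteSliceSketch second = new ByteSliceSketch(chunk, 5, 6);
        System.out.println(first + " " + second + " shared=" + (first.buf == second.buf));
    }
}
```

Each view costs a reference and two ints, while the string bytes exist once in the backing array.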

jancijen · 2020-05-11T19:31:39Z

@honzasterba I see.

Hm. I don't know whether I get what you mean by:

... and maybe two arrays with different value types and one mapping array to find the right value for given column/chunk ...

Could you elaborate a bit more?

The other possibility that came to my mind was to use a single array for the indices of the columns by which the data should be grouped, and then add another array of types (containing enum values of the allowed types, Numeric and String). Then just add if branching wherever the columns are used. I think this would end up with a similar amount of code, but it would at least remove the "ordering issue": currently, String groupby columns always come first, followed by Numeric groupby columns.

honzasterba · 2020-05-11T19:40:30Z

Yes, sort order is also an argument for using just one array of cols. I think the string/double dichotomy could very well be hidden inside the Group object.
Code like ncs[j].addStr(g._gsStr[j]); could become g.set(ncs[j], j), or even the whole loops of https://github.com/h2oai/h2o-3/pull/4594/files#diff-4a6a6d53c166511396303e901c6f2338R330 could be moved into Group.
Also, you could make G comparable.
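
One way to read this suggestion is sketched below. Everything here is hypothetical (the class `MixedKeySketch`, the `isStr` type array, `appendTo`) and is not the PR's `G` class; a `StringBuilder` stands in for the output column, and the sketch assumes no missing values.

```java
import java.util.Arrays;

// Hypothetical names throughout; a sketch of the idea, not the PR's G class.
final class MixedKeySketch implements Comparable<MixedKeySketch> {
    final boolean[] isStr; // per group-by column: true = String, false = numeric
    final double[] nums;   // used only at numeric positions
    final String[] strs;   // used only at String positions

    MixedKeySketch(boolean[] isStr, Object... vals) {
        this.isStr = isStr;
        nums = new double[vals.length];
        strs = new String[vals.length];
        for (int i = 0; i < vals.length; i++) {
            if (isStr[i]) strs[i] = (String) vals[i];
            else nums[i] = ((Number) vals[i]).doubleValue();
        }
    }

    // Callers ask the key to emit column j; the type branch lives only here.
    void appendTo(StringBuilder out, int j) {
        if (isStr[j]) out.append(strs[j]);
        else out.append(nums[j]);
    }

    // Columns are compared in their original order, whatever their type.
    @Override
    public int compareTo(MixedKeySketch o) {
        for (int i = 0; i < isStr.length; i++) {
            int res = isStr[i] ? strs[i].compareTo(o.strs[i])
                               : Double.compare(nums[i], o.nums[i]);
            if (res != 0) return res;
        }
        return 0;
    }

    public static void main(String[] args) {
        boolean[] types = {true, false};
        MixedKeySketch[] keys = {
            new MixedKeySketch(types, "b", 1),
            new MixedKeySketch(types, "a", 3),
            new MixedKeySketch(types, "a", 2)
        };
        Arrays.sort(keys); // uses compareTo
        StringBuilder sb = new StringBuilder();
        for (MixedKeySketch k : keys) {
            k.appendTo(sb, 0);
            sb.append('/');
            k.appendTo(sb, 1);
            sb.append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

With the branch confined to the key class, call sites no longer need to know which backing array a column lives in. Note that `Double.compare` orders NaN last, so a real implementation that treats NaN as least would need its own numeric comparison.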

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

jancijen · 2020-05-13T21:58:05Z

So I changed the code to use a single array of column indices for grouping the data (based on the previous discussion with @honzasterba). As a result, it is now supported through the "Ast API" as well. I added a unit test for this case too.

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

honzasterba

lgtm

michalkurka · 2020-05-14T17:21:50Z

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

      _aggs = aggs;
      _hasMedian = hasMedian;
    }
+
+    protected void map(Chunk[] cs, IcedHashSet<G> groups) {


Very nice, thank you for removing the duplicated code!

michalkurka · 2020-05-14T17:40:43Z

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

+
+        // Load into working array
+        if (vec.isString())
+          _gsStr[c] = chks[c].atStr(new BufferedString(), row);


The motivation of the design of the G class is to avoid allocating memory per-observation/row

To follow the same pattern we should be re-using (pre-)allocated BufferedStrings when the G class gets created. This is a minor comment.

I see. I changed it to use pre-allocated BufferedStrings.
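
The reuse pattern discussed here can be sketched as follows. The names (`ReusableKeySketch`, `Slice`, `fill`, the `starts` offset table) are hypothetical and the data layout is invented for the example; it only shows holders allocated once and re-pointed per row, not how H2O chunks store strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of per-key, pre-allocated string holders.
final class ReusableKeySketch {

    // Mutable view that can be re-pointed at another value without allocation.
    static final class Slice {
        byte[] buf;
        int off;
        int len;

        Slice set(byte[] buf, int off, int len) {
            this.buf = buf;
            this.off = off;
            this.len = len;
            return this;
        }

        @Override
        public String toString() {
            return new String(buf, off, len, StandardCharsets.UTF_8);
        }
    }

    final Slice[] cols; // allocated once, when the key object is created

    ReusableKeySketch(int ncols) {
        cols = new Slice[ncols];
        for (int c = 0; c < ncols; c++) cols[c] = new Slice();
    }

    // Re-points the existing holders at the given row; nothing is allocated here.
    // starts[c][row] .. starts[c][row + 1] delimits the value of column c.
    void fill(byte[][] colBytes, int[][] starts, int row) {
        for (int c = 0; c < cols.length; c++) {
            int from = starts[c][row];
            cols[c].set(colBytes[c], from, starts[c][row + 1] - from);
        }
    }

    public static void main(String[] args) {
        byte[][] colBytes = { "redgreen".getBytes(StandardCharsets.UTF_8) };
        int[][] starts = { {0, 3, 8} };
        ReusableKeySketch key = new ReusableKeySketch(1);
        for (int row = 0; row < 2; row++) {
            key.fill(colBytes, starts, row);
            System.out.println(key.cols[0]);
        }
    }
}
```

A key that has to be retained, for example after being stored in a set, needs holders of its own, because the next `fill` call overwrites the shared ones.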

michalkurka · 2020-05-14T17:44:41Z

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

-      for (int c = 0; c < chks.length; c++) // For all selection cols
-        _gs[c] = chks[c].atd(row); // Load into working array
+      for (int c = 0; c < chks.length; c++) { // For all selection cols
+        Vec vec = chks[c].vec();


A lot of code in this PR handles switching between the 2 arrays (_gs and _gsStr). This is due to the fact that we can have an arbitrary order of string/numerical columns on the input. However, in the implementation itself, we can re-order the columns to make sure we have the exact order we want, e.g. numerical first, then string ones. Would that be something that would help with making the code less complex?

There is nothing that comes to my mind that could solve this problem. Switching between _gs and _gsStr is done in 2 places - filling the output frame and sorting it. In both cases it is expected to preserve order of the columns. There is a possibility (at least for the "filling" part) of replacing gbColsTypes array with 2 arrays defining order for each of the columns in _gs and _gsStr. But I don't know whether it significantly reduces the complexity, if at all.

From my point of view, two arrays are OK in this case. One array holding both types of values has the same complexity; there would still have to be an if wherever we need to decide what to do with string and numeric values separately. Maybe some class could solve this, but I think that would be overengineering in this case.
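
For reference, the re-ordering idea raised in this thread amounts to a stable partition plus an inverse permutation to restore the caller's column order. The sketch below uses hypothetical names (`ColumnReorderSketch`, `numericFirst`, `inverse`) and a plain `boolean[]` type array; it was not part of the PR.

```java
import java.util.Arrays;

// Hypothetical sketch of "numeric first, then String" column re-ordering.
final class ColumnReorderSketch {

    // order[p] = original index of the column placed at position p.
    // Numeric columns come first, then String ones; each group keeps input order.
    static int[] numericFirst(boolean[] isStr) {
        int[] order = new int[isStr.length];
        int k = 0;
        for (int i = 0; i < isStr.length; i++) if (!isStr[i]) order[k++] = i;
        for (int i = 0; i < isStr.length; i++) if (isStr[i]) order[k++] = i;
        return order;
    }

    // inv[orig] = position of original column orig in the re-ordered layout,
    // which is what output and sorting need to recover the original order.
    static int[] inverse(int[] order) {
        int[] inv = new int[order.length];
        for (int p = 0; p < order.length; p++) inv[order[p]] = p;
        return inv;
    }

    public static void main(String[] args) {
        boolean[] isStr = {true, false, true, false};
        int[] order = numericFirst(isStr);
        System.out.println(Arrays.toString(order));
        System.out.println(Arrays.toString(inverse(order)));
    }
}
```

The lookup through `inv` is still needed wherever the original order matters, which is consistent with the doubt expressed above about how much complexity this would actually remove.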

michalkurka · 2020-05-14T17:52:38Z

@jancijen thank you, it was a pleasure reading this PR.

I suggest thinking about if we can avoid allocation of BufferedStrings in the G#fill method. Also would like to know if re-ordering of the columns would be a feasible way of reducing the complexity of the code (in some parts - for sort you need the original order).

jancijen · 2020-05-14T19:44:38Z

Thank you both for thorough feedback.

I addressed your comments, but as I mentioned I am not sure about reducing the complexity in switching between _gs and _gsStr.

Rework GroupBy functionality to work with single array of groupby columns indices. This also fixes previous issue with ordering of groupby columns in the output frame and allows it to be used through Ast api. Use BufferedStrings internally to achieve better memory efficiency.

maurever

Looks good to me. Only one recommendation, for better tests. Thank you, @michalkurka and @honzasterba, for the thorough review. Thank you, @jancijen, for this PR; I especially like that you described every step and change.

maurever · 2020-05-18T08:58:31Z

h2o-core/src/test/java/water/rapids/GroupByTest.java

+      System.out.println("GroupBy result:");
+      System.out.println(resFrame.toTwoDimTable().toString());
+
+      Assert.assertEquals(expectedResFrame.numCols(), resFrame.numCols());


Please, add message to all asserts. :)

👍 I added messages to all asserts, and I added unit tests for new methods in ArrayUtils - select and occurrenceCount.

maurever · 2020-05-18T09:04:46Z

h2o-core/src/main/java/water/rapids/ast/prims/mungers/AstGroup.java

-      for (int c = 0; c < chks.length; c++) // For all selection cols
-        _gs[c] = chks[c].atd(row); // Load into working array
+      for (int c = 0; c < chks.length; c++) { // For all selection cols
+        Vec vec = chks[c].vec();


From my point of view, two arrays are OK in this case. One array holding both types of values has the same complexity; there would still have to be an if wherever we need to decide what to do with string and numeric values separately. Maybe some class could solve this, but I think that would be overengineering in this case.

Add unit tests for new ArrayUtils - select and occurrenceCount methods. Add assert messages to the unit tests.

maurever · 2020-05-18T14:47:16Z

So reviews are final, @jancijen, you can merge this PR.

GroupBy support for String values

3a99030

maurever requested review from maurever, honzasterba and michalkurka May 11, 2020 08:00

maurever assigned jancijen May 11, 2020

honzasterba reviewed May 11, 2020

michalkurka reviewed May 11, 2020

jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from b41eac2 to 14a5a6c Compare May 13, 2020 19:29

honzasterba reviewed May 13, 2020

jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch 2 times, most recently from 42c6c59 to 75cfaef Compare May 13, 2020 21:50

jancijen requested review from honzasterba and michalkurka May 13, 2020 21:55

honzasterba reviewed May 14, 2020

jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from 75cfaef to eed14a3 Compare May 14, 2020 16:10

jancijen requested a review from honzasterba May 14, 2020 16:12

honzasterba approved these changes May 14, 2020

michalkurka reviewed May 14, 2020

jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from eed14a3 to 62cbb2e Compare May 14, 2020 19:30

jancijen requested a review from michalkurka May 14, 2020 19:45

jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from 62cbb2e to 1ece7b8 Compare May 17, 2020 10:51

maurever approved these changes May 18, 2020

Unit tests for select and occurrenceCount ArrayUtils

e19dfb6

michalkurka approved these changes May 18, 2020

maurever merged commit 22814dd into h2oai:master May 18, 2020

h2o-ops mentioned this pull request May 14, 2023

Implement TF-IDF algorithm #8698

Closed
