Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PUBDEV-6938 GroupBy - support for grouping by String columns #4594

Merged
merged 3 commits into from
May 18, 2020

Conversation

jancijen
Copy link
Contributor

This PR adds GroupBy support for String values (not for AstGroup.FCN functions, but it allows grouping rows by a column which is of type String), in order to implement TF-IDF (#4380) by using GroupBy operation (AstGroup).

This functionality is not directly accessible from "Ast API", but it can be used via public methods and classes from AstGroup as shown in the unit test.

Add GroupBy support for String values - allowing one to group
values using String columns. Add unit test for this functionality.
Copy link
Contributor

@honzasterba honzasterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall this looks good to me, memory-wise this can get quite hungry though and I guess that would be the reason this was not supported

overall I am fine with the code but I would urge the author to once again think about the necessity to have the two arrays of columns everywhere, from my quick look it seems possible to have only one array and maybe two arrays with different value types and one mapping array to find the right value for given column/chunk, would make this change much smaller

for (int i = 0; i < gbColsStr.length; i++) {
if (g1._gsStr[i] != null && g2._gsStr[i] == null) return -1;
if (g1._gsStr[i] == null && g2._gsStr[i] != null) return 1;
if (!g1._gsStr[i].equals(g2._gsStr[i])) return g1._gsStr[i].compareTo(g2._gsStr[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compareTo already does equals, I think doing both here is not efficient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that it should not return 0 if we hit values that are are equal - based on the previous ordering implementation, as explained in the comment below:

// Compare 2 groups.  Iterate down _gs, stop when _gs[i] > that._gs[i],
// or _gs[i] < that._gs[i].  Order by various columns specified by
// gbCols.  NaN is treated as least

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, thats true, than maybe you could do

int res = g1._gsStr[i].compareTo(g2._gsStr[i]); if (res != 0) return res;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this looks cleaner and more efficient. I changed it.


// Load into working array
if (vec.isString())
_gsStr[c] = chks[c].stringAt(row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is better to keep everything in BufferedStrings as much as possible (atStr method), this won't duplicate the memory needed to hold the string, BufferedString just references an existing byte array that holds the actual data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I will adjust it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to use BufferedString internally.

@jancijen
Copy link
Contributor Author

@honzasterba I see.

Hm. I don't know whether I get what you mean by:

... and maybe two arrays with different value types and one mapping array to find the right value for given column/chunk ...

Could you elaborate a bit more?

The other possiblity that came to my mind was to use single array for column indices by data should be grouped and then add another array of types (containing enum values of certain allowed types - Numeric and String). Then just add if branching whenever columns are used. This I think, would end up with similar amount of code, but it would at least remove the "ordering issue" - now String groupby columns are always first followed by Numeric groupby columns.

@honzasterba
Copy link
Contributor

yes, sort order is also an argument to use just one array of cols, I think the string/double dychotomy could be very well hidden inside the Group object.
code like ncs[j].addStr(g._gsStr[j]); could become g.set(ncs[j], j) or even the whole loops of https://github.com/h2oai/h2o-3/pull/4594/files#diff-4a6a6d53c166511396303e901c6f2338R330 could be moved into Group
also you could make G comparable

@jancijen jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from b41eac2 to 14a5a6c Compare May 13, 2020 19:29
@jancijen jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch 2 times, most recently from 42c6c59 to 75cfaef Compare May 13, 2020 21:50
@jancijen
Copy link
Contributor Author

So I changed the code to use single array for column indices for grouping the data (based on the previous discussion with @honzasterba). Therefore now it is already supported throught the "Ast API". I added unit test for this case as well.

@jancijen jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from 75cfaef to eed14a3 Compare May 14, 2020 16:10
@jancijen jancijen requested a review from honzasterba May 14, 2020 16:12
Copy link
Contributor

@honzasterba honzasterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

_aggs = aggs;
_hasMedian = hasMedian;
}

protected void map(Chunk[] cs, IcedHashSet<G> groups) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you for removing the duplicated code!


// Load into working array
if (vec.isString())
_gsStr[c] = chks[c].atStr(new BufferedString(), row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation of the design of the G class is to avoid allocating memory per-observation/row

To follow the same pattern we should be re-using (pre-)allocated BufferedStrings when the G class gets created. This is a minor comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I changed it to use pre-allocated BufferedStrings.

for (int c = 0; c < chks.length; c++) // For all selection cols
_gs[c] = chks[c].atd(row); // Load into working array
for (int c = 0; c < chks.length; c++) { // For all selection cols
Vec vec = chks[c].vec();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of code in this PR handles switching between the 2 arrays (_gs nad _gsStr). This is due to the fact that we can have an arbitrary order of string/numerical columns on the input. However, in the implementation itself, we can re-order the columns to make sure we will have the exact order we want, eg. numerical first then string ones. Would that be something that would help with making the code less complex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing that comes to my mind that could solve this problem. Switching between _gs and _gsStr is done in 2 places - filling the output frame and sorting it. In both cases it is expected to preserve order of the columns. There is a possibility (at least for the "filling" part) of replacing gbColsTypes array with 2 arrays defining order for each of the columns in _gs and _gsStr. But I don't know whether it significantly reduces the complexity, if at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, two arrays are ok in this case. One array with both types of values has the same complexity; there should also be if where we will have to decide what to do with string and numeric separately. Maybe some class can solve this, however, I think it is overengineering in this case.

@michalkurka
Copy link
Contributor

@jancijen thank you, it was a pleasure reading this PR.

I suggest thinking about if we can avoid allocation of BufferedStrings in the G#fill method. Also would like to know if re-ordering of the columns would be a feasible way of reducing the complexity of the code (in some parts - for sort you need the original order).

@jancijen jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from eed14a3 to 62cbb2e Compare May 14, 2020 19:30
@jancijen
Copy link
Contributor Author

jancijen commented May 14, 2020

Thank you both for thorough feedback.

I addressed your comments, but as I mentioned I am not sure about reducing the complexity in switching between _gs and _gsStr.

@jancijen jancijen requested a review from michalkurka May 14, 2020 19:45
Rework GroupBy functionality to work with single array of groupby
columns indices. This also fixes previous issue with ordering
of groupby columns in the output frame and allows it to be used through
Ast api. Use BufferedStrings internally to achieve better memory
efficiency.
@jancijen jancijen force-pushed the jendrusak_PUBDEV-6938_string-groupby branch from 62cbb2e to 1ece7b8 Compare May 17, 2020 10:51
Copy link
Contributor

@maurever maurever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Only one recommendation for better tests. Thank you, @michalkurka and @honzasterba, for complex review. Thank you, @jancijen, for this PR, I especially like you described every step and change.

System.out.println("GroupBy result:");
System.out.println(resFrame.toTwoDimTable().toString());

Assert.assertEquals(expectedResFrame.numCols(), resFrame.numCols());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add message to all asserts. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I added messages to all asserts, and I added unit tests for new methods in ArrayUtils - select and occurrenceCount.

for (int c = 0; c < chks.length; c++) // For all selection cols
_gs[c] = chks[c].atd(row); // Load into working array
for (int c = 0; c < chks.length; c++) { // For all selection cols
Vec vec = chks[c].vec();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, two arrays are ok in this case. One array with both types of values has the same complexity; there should also be if where we will have to decide what to do with string and numeric separately. Maybe some class can solve this, however, I think it is overengineering in this case.

Add unit tests for new ArrayUtils - select and occurrenceCount
methods. Add assert messages to the unit tests.
@maurever
Copy link
Contributor

So reviews are final, @jancijen, you can merge this PR.

@maurever maurever merged commit 22814dd into h2oai:master May 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants