[SW-2449] asH2OFrame Method Could Fail on a String Column Having More Than 10 Million Distinct Values #2341

mn-mikke · 2020-10-02T16:47:14Z

This PR converts all Spark string columns to H2O-3 categorical columns as a first step. The H2O backend then converts the columns back to string if they are identified as too unique or if the limit for maximum categorical levels is exceeded.
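The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name, the method name, and the uniqueness threshold are hypothetical, and the level cap is assumed from the "10 million" in the PR title.

```java
// Hypothetical sketch: names and thresholds are illustrative assumptions,
// not part of the Sparkling Water or H2O-3 API.
final class StringColumnTypeSketch {
    enum ColumnType { CATEGORICAL, STRING }

    // Assumed cap on categorical levels, taken from the PR title.
    static final long MAX_CATEGORICAL_LEVELS = 10_000_000L;

    // Assumed uniqueness threshold; the PR does not state the real value.
    static final double MAX_UNIQUE_RATIO = 0.95;

    // Start from "categorical" and fall back to "string" when the column
    // exceeds the level cap or is almost entirely unique.
    static ColumnType decide(long distinctValues, long totalValues) {
        if (distinctValues > MAX_CATEGORICAL_LEVELS) {
            return ColumnType.STRING;
        }
        if (totalValues > 0
                && (double) distinctValues / totalValues > MAX_UNIQUE_RATIO) {
            return ColumnType.STRING;
        }
        return ColumnType.CATEGORICAL;
    }

    public static void main(String[] args) {
        System.out.println(decide(3, 1_000));
        System.out.println(decide(990, 1_000));
        System.out.println(decide(11_000_000L, 1_000_000_000L));
    }
}
```

In this sketch the level cap is checked first, so a column with a low uniqueness ratio but more distinct values than the cap still ends up as a string column.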

core/src/main/scala/ai/h2o/sparkling/backend/converters/DataTypeConverter.scala


extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/ImportFrameHandler.scala

michalkurka · 2020-10-12T18:00:13Z



-    ChunkUtils.finalizeFrame(request.key, rowsPerChunk, columnTypes, domains)
+    convertColumnsWithTooManyCategoricalLevelsToStringColumns(frame, columnTypesAwareOfEmptyFrames, stringDomains)


This is a part I don't understand: the frame is already in DKV and has "too many" categoricals. Should the frame not be in DKV yet?

Is there any locking in place? ParseDataset locks the frame against reading.

The frame is in DKV from the beginning of the conversion. The initialize method puts the Frame into DKV (ChunkUtils.initFrame) as a partial frame:

    public static void initFrame(String keyName, String[] names) {
        Frame fr = new water.fvec.Frame(Key.<Frame>make(keyName));
        fr.preparePartialFrame(names);
        // Save it directly to DKV
        fr.update();
    }

Do you think that keeping the frame partial and not finalizing it would prevent the frame from being read in a provisional state?

The columns with too many levels have null domains. During finalization they are converted to T_NUM, and ConvertCategoricalToStringColumnsTask then converts them to T_STR.

I needed to finalize the frame first to get Vector instances which are passed to ConvertCategoricalToStringColumnsTask.

Or is there any dedicated mechanism for locking frames for reading?
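The selection step described in this comment (columns that were expected to be categorical but ended up with a null domain) might look roughly like the sketch below. The class and method names are hypothetical, and plain arrays stand in for the H2O-3 frame and vector types. It only shows how the column indices to convert could be picked, not the conversion task itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plain arrays stand in for H2O-3 frame metadata.
final class NullDomainColumnsSketch {

    // Returns the indices of columns that were requested as categorical
    // but have no domain, i.e. the candidates for conversion to string.
    static List<Integer> columnsToConvert(boolean[] expectedCategorical,
                                          String[][] domains) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < domains.length; i++) {
            if (expectedCategorical[i] && domains[i] == null) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] expected = {true, true, false};
        String[][] domains = {{"a", "b"}, null, null};
        System.out.println(columnsToConvert(expected, domains));
    }
}
```

The extra expectedCategorical flag is an assumption made here so that ordinary numeric columns, which also have no domain, are not selected.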

After looking at the code mentioned by @mn-mikke, there is no reason why the frame should not be in DKV, at least to my understanding. H2O, and presumably SW, is a single-user environment, so accidental reads are not possible this way.

The original ParseDataset phase is finished once the frame is finalized. And there is no ParseDataset active during convertColumnsWithTooManyCategoricalLevelsToStringColumns, or is there?

If not, then just removing the old vec (the replace method of a Frame is called) and replacing it with a new one seems to be a safe operation. The chunk layout is adapted.

This is done sequentially.

.../test/scala/ai/h2o/sparkling/backend/converters/DataFrameConverterCategoricalTestSuite.scala

extensions/src/main/scala/water/parser/CategoricalPreviewParseWriter.java

Pscheidl · 2020-10-14T11:45:18Z


+    super(1);
+    this._nstrings[0] = totalCount;
+    IcedHashMap<String, String>[] domains = new IcedHashMap[1];
+    domains[0] = new IcedHashMapWrapper(domain);


The proxy/wrapper is fine for now. There is a clear reason for that - memory savings. The very fact that the original PreviewParseWriter uses HashMap as a set effectively doubles the already doubled memory (extreme case).

My suggestion:

Keep this as-is, as the internal logic of the guessTypes method seems to be compliant with this proxy implementation of IcedHashMap.

In the very next H2O fix release, make changes to PreviewParseWriter to enable better embedding in SW. This could also include migration to the much later introduced IcedHashSet.

An overridable method getDomain() should be introduced in the parent class as well.

(This action is motivated by our intention to release SW with this functionality now, and not with the next fix release.)
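The memory argument for the wrapper can be illustrated with a small stand-alone sketch. It uses plain java.util collections rather than IcedHashMap, and the class name SetBackedMapView is hypothetical. It is not the IcedHashMapWrapper from this PR, only an example of the same idea: exposing an existing set of domain values through a read-only map view, so that no second copy of the keys is stored.

```java
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a read-only Map view over an existing Set, where
// each key maps to itself. No keys or values are copied.
final class SetBackedMapView extends AbstractMap<String, String> {
    private final Set<String> domain;

    SetBackedMapView(Set<String> domain) {
        this.domain = domain;
    }

    @Override public int size() { return domain.size(); }

    @Override public boolean containsKey(Object key) { return domain.contains(key); }

    @Override public String get(Object key) {
        // A key contained in a Set<String> is necessarily a String.
        return domain.contains(key) ? (String) key : null;
    }

    @Override public Set<String> keySet() {
        return Collections.unmodifiableSet(domain);
    }

    @Override public Set<Map.Entry<String, String>> entrySet() {
        return new AbstractSet<Map.Entry<String, String>>() {
            @Override public Iterator<Map.Entry<String, String>> iterator() {
                final Iterator<String> it = domain.iterator();
                return new Iterator<Map.Entry<String, String>>() {
                    @Override public boolean hasNext() { return it.hasNext(); }
                    @Override public Map.Entry<String, String> next() {
                        String k = it.next();
                        // Entries are created lazily during iteration.
                        return new AbstractMap.SimpleImmutableEntry<>(k, k);
                    }
                };
            }
            @Override public int size() { return domain.size(); }
        };
    }
}
```

Because the view only delegates to the underlying set, code that reads the map through keySet(), size(), or containsKey() sees the domain without a second hash table being allocated. Mutating calls such as put are not supported in this sketch.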

Pscheidl

The business logic makes complete sense to me.

LGTM. @mn-mikke Let's improve the PreviewParseWriter on the H2O-3 side in the next iteration. We made a deal with @mn-mikke that I'll help with the improvement; it won't take long.

mn-mikke added the next fix release label Oct 2, 2020

mn-mikke requested review from honzasterba, satai and michalkurka October 2, 2020 16:47

mn-mikke added the work in progress and next major release labels and removed the next fix release label Oct 5, 2020

honzasterba reviewed Oct 5, 2020

core/src/main/scala/ai/h2o/sparkling/backend/converters/DataTypeConverter.scala (outdated)

mn-mikke added 12 commits October 12, 2020 11:24

[SW-2449] asH2OFrame Method Could Fail on a String Column Having More Than 10 Million Distinct Values

a1c4419

spotlessApply

7496fda

Fix SupportedRDDConverterTestSuite tests

0efa40e

Revert changes in tests

79d66e8

Fix OOM in tests

bb44db3

Escape names

7374ce9

Move the whole conversion logic to H2O backend

bc749d4

Use ExpectedType.Categorical

9ff21d7

typo

7b14bdf

spotlessApply

79d7d32

Remove DataTypeConverterTestSuite

f76389c

fix DataFrameConverterTestSuite

a92457a

mn-mikke force-pushed the mn/SW-2449 branch from 14d9cd6 to a92457a Compare October 12, 2020 09:24

mn-mikke added 5 commits October 12, 2020 12:12

fix calculation of the ratio

1e55627

fix ConvertCategoricalToStringColumnsTask

b2991b0

fix empty frames

d54fba5

conversion logic to separate methods

a4ce1e4

Add more tests

af28732

mn-mikke removed the work in progress WIP label Oct 12, 2020

mn-mikke requested a review from Pscheidl October 12, 2020 15:56

mn-mikke commented Oct 12, 2020

extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/ImportFrameHandler.scala (outdated)

condition for unique columns

8ca5d9c

michalkurka reviewed Oct 12, 2020

extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/ImportFrameHandler.scala (outdated)

michalkurka reviewed Oct 12, 2020

.../test/scala/ai/h2o/sparkling/backend/converters/DataFrameConverterCategoricalTestSuite.scala (outdated)

mn-mikke added 7 commits October 13, 2020 16:52

Add tests on one partition

c0889d3

Use PreviewParseWriter

5a8ac1c

spotless

459a939

fix categorical preview writer

79086e3

remove irrelevant column

02f6df1

Virtual ice hash map

bf2508b

spotlessApply

0f0bf7e

Pscheidl reviewed Oct 14, 2020

extensions/src/main/scala/water/parser/CategoricalPreviewParseWriter.java

Pscheidl reviewed Oct 14, 2020


mn-mikke added 3 commits October 14, 2020 15:47

Adding DKV.put and disabling tests with big datasets on external backend

c2794e4

spotlessApply

72ca306

change test for external backend in test

e96f5ba

Pscheidl approved these changes Oct 14, 2020


mn-mikke merged commit b2f2d55 into master Oct 14, 2020

mn-mikke deleted the mn/SW-2449 branch October 14, 2020 19:54

DinukaH2O mentioned this pull request May 23, 2023

asH2OFrame Method Could Fail on a String Column Having More Than 10 Million Distinct Values #3106

Closed
