New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SW-2449] asH2OFrame Method Could Fail on a String Column Having More Than 10 Million Distinct Values #2341
Conversation
core/src/main/scala/ai/h2o/sparkling/backend/converters/DataTypeConverter.scala
Outdated
Show resolved
Hide resolved
… Than 10 Million Distinct Values
extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/ImportFrameHandler.scala
Outdated
Show resolved
Hide resolved
extensions/src/main/scala/ai/h2o/sparkling/extensions/rest/api/ImportFrameHandler.scala
Outdated
Show resolved
Hide resolved
|
||
ChunkUtils.finalizeFrame(request.key, rowsPerChunk, columnTypes, domains) | ||
convertColumnsWithTooManyCategoricalLevelsToStringColumns(frame, columnTypesAwareOfEmptyFrames, stringDomains) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a part I don't understand - the frame is already in DKV and has "too many" categoricals - should thee frame not be in DKV yet?
is there any locking in place, the ParseDataset locks the frame against reading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The frame is in DKV from the beginning of the conversion. The initialize method puts the Frame into DKV (ChunkUtils.initFrame
) as partial frame:
public static void initFrame(String keyName, String[] names) {
Frame fr = new water.fvec.Frame(Key.<Frame>make(keyName));
fr.preparePartialFrame(names);
// Save it directly to DKV
fr.update();
}
Do you think that keeping the frame as partial and not finalizing it can avoid to read frame in a provisional state?
The columns with too many levels have null domains and during finalization are converted to T_NUM
and then those columns are converted to T_STR
with ConvertCategoricalToStringColumnsTask.
I needed to finalize the frame first to get Vector instances which are passed to ConvertCategoricalToStringColumnsTask.
Or is there any dedicated mechanism for locking frames for reading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at the code mentioned by @mn-mikke, there is no reason why the frame should not be in DKV - at least to my understanding. H2O and presumably SW is a single-user environment. Therefore any accidental reads are not possible this way.
The original ParseDataset phase is finished once the frame is finalized. And there is no ParseDataset active during the convertColumnsWithTooManyCategoricalLevelsToStringColumns
- or is there ?
If not, then just removing the old vec (the replace
method of a Frame
is called) and replacing it with a new one seems to be safe operation. Chunk layout is adapted.
This is done sequentially.
.../test/scala/ai/h2o/sparkling/backend/converters/DataFrameConverterCategoricalTestSuite.scala
Outdated
Show resolved
Hide resolved
extensions/src/main/scala/water/parser/CategoricalPreviewParseWriter.java
Show resolved
Hide resolved
super(1); | ||
this._nstrings[0] = totalCount; | ||
IcedHashMap<String, String>[] domains = new IcedHashMap[1]; | ||
domains[0] = new IcedHashMapWrapper(domain); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proxy/wrapper is fine for now. There is a clear reason for that - memory savings. The very fact that the original PreviewParseWriter
uses HashMap as a set effectively doubles the already doubled memory (extreme case).
My suggestion:
- Keep this as-is, as the internal logic of the
guessTypes
method seems to be compliant with this proxy implementation of IcedHashMap. - In the very next H2O fix release, make changes to
PreviewParseWriter
to enabled better embedding in SW. This could also include migration to a much later introducerd IcedHashSet. - Overridable method
getDomain()
should be introduced in the parent class as well.
(this action is motivated by our intentions to release SW with this functionality now, and not with next fix release)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR converts all Spark string columns to H2O-3 categorical columns as a first step. The H2O backend then converts the columns back to string if they are identified as too unique or if the limit for maximum categorical levels is exceeded.