Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SW-2449] asH2OFrame Method Could Fail on a String Column Having More Than 10 Million Distinct Values #2341

Merged
merged 28 commits into from Oct 14, 2020

Conversation

mn-mikke
Copy link
Collaborator

@mn-mikke mn-mikke commented Oct 2, 2020

This PR converts all Spark string columns to H2O-3 categorical columns as a first step. The H2O backend then converts the columns back to string if they are identified as too unique or if the limit for maximum categorical levels is exceeded.


ChunkUtils.finalizeFrame(request.key, rowsPerChunk, columnTypes, domains)
convertColumnsWithTooManyCategoricalLevelsToStringColumns(frame, columnTypesAwareOfEmptyFrames, stringDomains)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a part I don't understand - the frame is already in DKV and has "too many" categoricals - should thee frame not be in DKV yet?

is there any locking in place, the ParseDataset locks the frame against reading

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The frame is in DKV from the beginning of the conversion. The initialize method puts the Frame into DKV (ChunkUtils.initFrame) as partial frame:

    public static void initFrame(String keyName, String[] names) {
        Frame fr = new water.fvec.Frame(Key.<Frame>make(keyName));
        fr.preparePartialFrame(names);
        // Save it directly to DKV
        fr.update();
    }

Do you think that keeping the frame as partial and not finalizing it can avoid to read frame in a provisional state?

The columns with too many levels have null domains and during finalization are converted to T_NUM and then those columns are converted to T_STR with ConvertCategoricalToStringColumnsTask.

I needed to finalize the frame first to get Vector instances which are passed to ConvertCategoricalToStringColumnsTask.

Or is there any dedicated mechanism for locking frames for reading?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the code mentioned by @mn-mikke, there is no reason why the frame should not be in DKV - at least to my understanding. H2O and presumably SW is a single-user environment. Therefore any accidental reads are not possible this way.

The original ParseDataset phase is finished once the frame is finalized. And there is no ParseDataset active during the convertColumnsWithTooManyCategoricalLevelsToStringColumns - or is there ?

If not, then just removing the old vec (the replace method of a Frame is called) and replacing it with a new one seems to be safe operation. Chunk layout is adapted.

This is done sequentially.

super(1);
this._nstrings[0] = totalCount;
IcedHashMap<String, String>[] domains = new IcedHashMap[1];
domains[0] = new IcedHashMapWrapper(domain);
Copy link
Contributor

@Pscheidl Pscheidl Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proxy/wrapper is fine for now. There is a clear reason for that - memory savings. The very fact that the original PreviewParseWriter uses HashMap as a set effectively doubles the already doubled memory (extreme case).

My suggestion:

  • Keep this as-is, as the internal logic of the guessTypes method seems to be compliant with this proxy implementation of IcedHashMap.
  • In the very next H2O fix release, make changes to PreviewParseWriter to enabled better embedding in SW. This could also include migration to a much later introducerd IcedHashSet.
  • Overridable method getDomain() should be introduced in the parent class as well.

(this action is motivated by our intentions to release SW with this functionality now, and not with next fix release)

Copy link
Contributor

@Pscheidl Pscheidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The business logic makes complete sense to me.

LGTM. @mn-mikke Let's improve the PreviewParseWriter on the H2O-3 side in the next iteration. We made a deal with @mn-mikke I'll help with the improvement, won't take long.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
next major release Goes into Major release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants