New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix integer overflow when turning dictionary into direct #12015

Merged
merged 1 commit into from Jan 8, 2019

Conversation

Projects
None yet
3 participants
@wenleix
Copy link
Contributor

wenleix commented Dec 5, 2018

Turning dictionary into direct in ORC writer can end up with writing
huge data.
eb6028d tries to fix this by
checking the written size after converting each row group. However,
in certain cases a single row group already exceed 2G after converting
into direct. For example, a varchar column repeated with a single
value with length around 500k.

This commit fixed this by checking the written size after each batch of
write.

Fixes #11930

@wenleix wenleix requested a review from dain Dec 5, 2018

@wenleix

This comment has been minimized.

Copy link
Contributor

wenleix commented Dec 5, 2018

Resolves #11930

@wenleix wenleix removed the request for review from dain Dec 5, 2018

@wenleix wenleix assigned wenleix and unassigned dain Dec 5, 2018

@wenleix wenleix changed the title Fix integer overflow when turning dictionary into direct [WIP] Fix integer overflow when turning dictionary into direct Dec 5, 2018

@wenleix

This comment has been minimized.

Copy link
Contributor

wenleix commented Dec 5, 2018

The batch size is fixed to 1024, which can be still too large in theory, say the cell is a 8MB string.

I can see two solution

  • Make batch size smaller, which is a lazy solution (say, 32). Should be good enough...
  • Estimate the batch size based on the max string length in the dictionary -- will it be an overkill?

@dain , I am wondering you have any preference / suggestions?

@dain

dain approved these changes Dec 9, 2018

Copy link
Contributor

dain left a comment

Assuming we don't need a new test, looks good.

@@ -222,8 +220,15 @@ private void writeDictionaryRowGroup(Block dictionary, int valueCount, IntBigArr
DictionaryBlock dictionaryBlock = new DictionaryBlock(positionCount, dictionary, segment);
directColumnWriter.writeBlock(dictionaryBlock);
valueCount -= positionCount;

if (directColumnWriter.getBufferedBytes() > maxDirectBytes) {
directColumnWriter.close();

This comment has been minimized.

@dain

dain Dec 9, 2018

Contributor

Just curious, why move these here. I ask because the caller is still doing directColumnWriter.finishRowGroup();

This comment has been minimized.

@wenleix

wenleix Dec 10, 2018

Contributor

Good point, I originally want to avoid repeating the following two lines:

    directColumnWriter.close();
    directColumnWriter.reset();

But this does make the abstraction a bit weird.

if (directColumnWriter.getBufferedBytes() > maxDirectBytes) {
directColumnWriter.close();
directColumnWriter.reset();
if (!writeDictionaryRowGroup(dictionaryValues, rowGroup.getValueCount(), rowGroup.getDictionaryIndexes(), maxDirectBytes)) {

This comment has been minimized.

@dain

dain Dec 9, 2018

Contributor

Do we have an existing test that covers this?

This comment has been minimized.

@wenleix

wenleix Dec 10, 2018

Contributor

@dain : The case of direct conversion exceeds stripe size limit is tested in testNotDirectConversionOnDictionaryFull.

However, when a single row group exceeds 2G after direct conversion, it will fail inside SliceDictionaryColumnWriter::tryConvertToDirect. And this is fixed in this commit.

I will see if a unit test for SliceDictionaryColumnWriter::tryConvertToDirect is straightforward to add.

@wenleix wenleix force-pushed the wenleix:orc_overflow branch 3 times, most recently from c25fef9 to db273d7 Dec 10, 2018

@wenleix wenleix changed the title [WIP] Fix integer overflow when turning dictionary into direct Fix integer overflow when turning dictionary into direct Dec 10, 2018

@dain

dain approved these changes Dec 10, 2018

Copy link
Contributor

dain left a comment

One comment about the placement of finishRowGroup call. Assuming that is correct, looks good.

@wenleix wenleix force-pushed the wenleix:orc_overflow branch 3 times, most recently from 7c19e51 to 3138077 Dec 31, 2018

@dain

dain approved these changes Jan 7, 2019

Copy link
Contributor

dain left a comment

Looks good

@wenleix wenleix force-pushed the wenleix:orc_overflow branch 2 times, most recently from 3934737 to 64edecc Jan 8, 2019

Fix integer overflow when turning dictionary into direct
Turning dictionary into direct in ORC writer can end up with writing
huge data.

eb6028d tries to fix this by
checking the written size after converting each row group. However,
in certain cases a single row group already exceed 2G after converting
into direct. For example, a varchar column repeated with a single
value with length around 500k.

This commit fixed this by checking the written size after each batch of
write.

@wenleix wenleix force-pushed the wenleix:orc_overflow branch from 64edecc to 7264516 Jan 8, 2019

@wenleix wenleix merged commit 8db7641 into prestodb:master Jan 8, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@wenleix wenleix deleted the wenleix:orc_overflow branch Jan 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment