Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support skip header and footer for hive CREATE TABLE #845

Merged
merged 1 commit into from Jun 5, 2019

Conversation

@ebyhr
Copy link
Contributor

commented May 29, 2019

Related to #572

I'll send another PR for delimiter.

@cla-bot cla-bot bot added the cla-signed label May 29, 2019

@findepi findepi requested a review from kokosing May 29, 2019

@ebyhr

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

Let me check the cause of travis failure later.

@@ -50,6 +50,8 @@
public static final String ORC_BLOOM_FILTER_COLUMNS = "orc_bloom_filter_columns";
public static final String ORC_BLOOM_FILTER_FPP = "orc_bloom_filter_fpp";
public static final String AVRO_SCHEMA_URL = "avro_schema_url";
public static final String TEXT_SKIP_HEADER_COUNT = "skip_header_count";

This comment has been minimized.

Copy link
@findepi

findepi May 30, 2019

Member

Maybe:

Suggested change
public static final String TEXT_SKIP_HEADER_COUNT = "skip_header_count";
public static final String TEXTFILE_SKIP_HEADER_LINE_COUNT = "textfile_skip_header_line_count";

WDYT?

This comment has been minimized.

Copy link
@ebyhr

ebyhr May 30, 2019

Author Contributor

Agreed, thank you for your suggestion.

@@ -50,6 +50,8 @@
public static final String ORC_BLOOM_FILTER_COLUMNS = "orc_bloom_filter_columns";
public static final String ORC_BLOOM_FILTER_FPP = "orc_bloom_filter_fpp";
public static final String AVRO_SCHEMA_URL = "avro_schema_url";
public static final String TEXT_SKIP_HEADER_COUNT = "skip_header_count";
public static final String TEXT_SKIP_FOOTER_COUNT = "skip_footer_count";

This comment has been minimized.

Copy link
@findepi

findepi May 30, 2019

Member

... then this would be something like:

Suggested change
public static final String TEXT_SKIP_FOOTER_COUNT = "skip_footer_count";
public static final String TEXTFILE_SKIP_FOOTER_LINE_COUNT = "textfile_skip_footer_line_count";
@@ -123,6 +125,8 @@ public HiveTableProperties(TypeManager typeManager, HiveConfig config)
config.getOrcDefaultBloomFilterFpp(),
false),
integerProperty(BUCKET_COUNT_PROPERTY, "Number of buckets", 0, false),
integerProperty(TEXT_SKIP_HEADER_COUNT, "Number of header lines", 0, false),

This comment has been minimized.

Copy link
@kokosing

kokosing May 30, 2019

Member

nit. I think default should be null to distinguish case when user set it explicitly to case when user is using default.

assertUpdate("DROP TABLE test_table_skip_header_footer");
}

@Test(expectedExceptions = RuntimeException.class, expectedExceptionsMessageRegExp = "Cannot specify skip_header_count table property for storage format: ORC")

This comment has been minimized.

Copy link
@kokosing

kokosing May 30, 2019

Member

nit prefer org.assertj.core.api.Assertions#assertThatThrownBy

This comment has been minimized.

Copy link
@ebyhr

ebyhr May 30, 2019

Author Contributor

Changed to assertThatThrownBy and added two assertions.

@ebyhr ebyhr force-pushed the ebyhr:hive/ct-header-footer branch from e9ee63c to 61181df May 30, 2019

@ebyhr

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

The travis failed reason is presto-phoenix.

@@ -761,6 +778,35 @@ public void createTable(ConnectorSession session, ConnectorTableMetadata tableMe
tableProperties.put(AVRO_SCHEMA_URL_KEY, validateAndNormalizeAvroSchemaUrl(avroSchemaUrl, hdfsContext));
}

// Textfile specific properties
Integer textHeaderSkipCount = getTextHeaderSkipCount(tableMetadata.getProperties());

This comment has been minimized.

Copy link
@kokosing

kokosing Jun 1, 2019

Member

nit: use Optional<Integer>

This comment has been minimized.

Copy link
@dain

dain Jun 1, 2019

Member

There is also OptionalInt. I'm not sure which is better here, just a reminder :)

This comment has been minimized.

Copy link
@kokosing

kokosing Jun 1, 2019

Member

Optional<Integer> is better, OptionalInt is missing map method which is useful.

This comment has been minimized.

Copy link
@findepi

findepi Jun 1, 2019

Member

We had this discussion already. Since the classes cannot be used easily together (eg missing .map method), we decided to use generic classes everywhere except where there is compelling performance advantage from using specialized classes


assertUpdate(createTableSql);

MaterializedResult actual = computeActual("SHOW CREATE TABLE test_table_skip_header_footer");

This comment has been minimized.

Copy link
@kokosing

kokosing Jun 1, 2019

Member

can you also add a tests where only one of these is set?

@ebyhr ebyhr force-pushed the ebyhr:hive/ct-header-footer branch from 61181df to 587a2ce Jun 2, 2019

@ebyhr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 2, 2019

Changed to Optional<Integer> and added two tests.

@electrum
Copy link
Member

left a comment

The test code that creates a file with header/footer using an insert query is not really correct. We shouldn't allow inserting into such tables for two reasons:

  • we might have multiple writers
  • the order in which rows are written is arbitrary

So I think we either need to disallow this, or we need to change the textfile writer to write an appropriate number of header/footer lines (maybe a dummy line like ### HEADER ###).

The expected use case for setting these properties is for creating external tables pointing at existing data files that already have a header/footer, so simply disallowing write support seems reasonable.

We should be able to write these tests as a product test. TestAvroSchemaUrl looks to be a good example for that.

public void testCreateTableInvalidSkipHeaderFooter()
{
assertThatThrownBy(() -> assertUpdate("CREATE TABLE test_orc_skip_header (col1 bigint) WITH (format = 'ORC', textfile_skip_header_line_count = 1)"))
.isInstanceOf(RuntimeException.class)

This comment has been minimized.

Copy link
@electrum

electrum Jun 3, 2019

Member

Nit: asserting RuntimeException here isn't useful. Since it's not a specific exception type, all we care is that the message matches.

@@ -761,6 +778,37 @@ public void createTable(ConnectorSession session, ConnectorTableMetadata tableMe
tableProperties.put(AVRO_SCHEMA_URL_KEY, validateAndNormalizeAvroSchemaUrl(avroSchemaUrl, hdfsContext));
}

// Textfile specific properties
Optional<Integer> textHeaderSkipCount = getTextHeaderSkipCount(tableMetadata.getProperties());

This comment has been minimized.

Copy link
@electrum

electrum Jun 3, 2019

Member

Now that these are using Optional, they can be simplified using ifPresent:

getTextHeaderSkipCount(tableMetadata.getProperties()).ifPresent(headerSkipCount ->
    ...
);
@ebyhr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 3, 2019

Created another PR #891 to disable INERT INTO Hive table having skip.header.line.count or skip.footer.line.count.

@ebyhr ebyhr force-pushed the ebyhr:hive/ct-header-footer branch from 587a2ce to 2aaf062 Jun 5, 2019

@ebyhr ebyhr force-pushed the ebyhr:hive/ct-header-footer branch from 2aaf062 to 0340e25 Jun 5, 2019

@ebyhr ebyhr assigned electrum and unassigned ebyhr Jun 5, 2019

@electrum electrum merged commit a4d1342 into prestosql:master Jun 5, 2019

2 checks passed

Travis CI - Pull Request Build Passed
Details
verification/cla-signed
Details
@electrum

This comment has been minimized.

Copy link
Member

commented Jun 5, 2019

The new test looks great. Merged, thanks!

@electrum electrum added this to the 314 milestone Jun 5, 2019

@electrum electrum referenced this pull request Jun 7, 2019

Closed

Release notes for 314 #879

2 of 6 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.