Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for CSV format in hive #920

Merged
merged 7 commits into from Jul 5, 2019

Conversation

4 participants
@kokosing
Copy link
Member

commented Jun 5, 2019

Fixes #1057

@cla-bot cla-bot bot added the cla-signed label Jun 5, 2019

@kokosing

This comment has been minimized.

Copy link
Member Author

commented Jun 5, 2019

This depends on #845. It is very likely that changes around TEXTFILE are no longer needed.

@kokosing kokosing force-pushed the kokosing:origin/master/121_csv branch from df82136 to 46af7b4 Jun 7, 2019

@kokosing kokosing requested review from ebyhr and electrum Jun 7, 2019

@kokosing kokosing changed the title Support for CSV format in hive, fixes around TEXTFILE Support for CSV format in hive Jun 7, 2019

@kokosing kokosing requested a review from findepi Jun 7, 2019

// CSV specific properties
getCsvProperty(tableMetadata.getProperties(), CSV_ESCAPE)
.ifPresent(escape -> {
if (hiveStorageFormat != HiveStorageFormat.TEXTFILE) {

This comment has been minimized.

Copy link
@ebyhr

ebyhr Jun 7, 2019

Contributor

I think HiveStorageFormat.CSV is expected condition. (also, L819 and L826)

This comment has been minimized.

Copy link
@kokosing

kokosing Jun 7, 2019

Author Member

Correct. Then it looks like tests are also missing. Thanks for catching this.

}
String csvPropertyValue = (String) tableProperties.get(csvPropertyKey);
if (csvPropertyValue.length() != 1) {
throw new PrestoException(INVALID_TABLE_PROPERTY, format("%s must be set to a single character string, but was: ", csvPropertyKey, csvPropertyValue));

This comment has been minimized.

Copy link
@ebyhr

ebyhr Jun 7, 2019

Contributor

Last %s is missing
... but was: %s

return serdeInfo.getSerializationLib() != null &&
(table.getParameters().get(AVRO_SCHEMA_URL_KEY) != null ||
(serdeInfo.getParameters() != null && serdeInfo.getParameters().get(AVRO_SCHEMA_URL_KEY) != null)) &&
serdeInfo.getSerializationLib().equals(AVRO.getSerDe());

This comment has been minimized.

Copy link
@ebyhr

ebyhr Jun 7, 2019

Contributor

This method looks wrong. At least, this line should be

serdeInfo.getSerializationLib().equals(CSV.getSerDe());

This comment has been minimized.

Copy link
@kokosing

kokosing Jun 11, 2019

Author Member

Sorry about this rebasing mistake.

@electrum

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

Commit description wording:

these properties requires to be passed to

should be

these properties must be passed to

if (serdePropertyValue == null) {
return Optional.ofNullable(tablePropertyValue);
}
else if (tablePropertyValue == null) {

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

Nit: redundant else

}
if (!tablePropertyValue.equals(serdePropertyValue)) {
throw new PrestoException(
HIVE_UNSUPPORTED_FORMAT,

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

HIVE_INVALID_METADATA

getCsvProperty(tableMetadata.getProperties(), CSV_ESCAPE)
.ifPresent(escape -> {
if (hiveStorageFormat != HiveStorageFormat.TEXTFILE) {
throw new PrestoException(INVALID_TABLE_PROPERTY, format("Cannot specify %s table property for storage format: %s", CSV_ESCAPE, hiveStorageFormat));

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

Maybe introduce a method

checkFormatForProperty(hiveStorageFormat, TEXTFILE, CSV_ESCAPE);
String joinedUnsupportedColumns = notSupportedCsvColumns.stream()
.map(columnMetadata -> format("%s %s", columnMetadata.getName(), columnMetadata.getType()))
.collect(Collectors.joining(", "));
throw new PrestoException(

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

This looks bad. Let's put it on one line.

throw new PrestoException(
NOT_SUPPORTED,
format(
"Hive CSV storage format does not support columns: [%s]. All non-partitioning columns must be of unbounded varchar type",

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

How about

"Hive CSV storage format only supports VARCHAR (unbounded). Unsupported columns: " + joinedUnsupportedColumns
@@ -4188,6 +4192,9 @@ protected static SchemaTableName temporaryTable(String database, String tableNam
.put(BUCKETED_BY_PROPERTY, ImmutableList.of())
.put(BUCKET_COUNT_PROPERTY, 0)
.put(SORTED_BY_PROPERTY, ImmutableList.of())
.put(CSV_ESCAPE, "\\")

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

This seems wrong. Shouldn't we only set these if the storage format is CSV?

@@ -196,6 +196,136 @@ public void testCreatePartitionedTableAs(StorageFormat storageFormat)
query(format("DROP TABLE %s", tableName));
}

@Test(groups = STORAGE_FORMATS)

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

The rest of the tests in this file seem to be for all formats. Let's instead create a new class named something like TestHiveCsvStorageFormat

public void testCsvFile(int rowCount)
throws Exception
{
Set<String> csvSupportingColumns = ImmutableSet.of("t_empty_string", "t_string");

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

I think we can also handle

  • t_empty_varchar
  • t_varchar
  • t_varchar_max_length
  • t_null_string
  • t_null_varchar

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

We might want to build the list like this:

List<TestColumn> testColumns = TEST_COLUMNS.stream()
        .filter(column -> column.isPartitionKey() || isUnboundedVarchar(column.getType()))
        ...

private static boolean isUnboundedVarchar(String hiveTypeName)
{
    return HiveType.valueOf(hiveTypeName).getType(TYPE_MANAGER).equals(VARCHAR);
}
.withRowsCount(rowCount)
.isReadableByRecordCursor(new GenericHiveRecordCursorProvider(HDFS_ENVIRONMENT));
}

@Test
public void testCsvFileWithNull()

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

This should be redundant after adding the columns in the above comment

throws Exception
{
assertThatFileFormat(CSV)
.withColumns(ImmutableList.of(new TestColumn("p_null_string", javaStringObjectInspector, null, null)))

This comment has been minimized.

Copy link
@electrum

electrum Jun 10, 2019

Member

p_ means partition column, t_ means table column

@kokosing kokosing force-pushed the kokosing:origin/master/121_csv branch 3 times, most recently from ef1c880 to e54aa8c Jun 11, 2019

@kokosing

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2019

@electrum, @ebyhr comments addressed. Please take a look again.

@@ -432,6 +436,8 @@ else if (insertExistingPartitionsBehavior == InsertExistingPartitionsBehavior.ER
}
}

additionalTableParameters.forEach((key, value) -> schema.put(key, value));

This comment has been minimized.

Copy link
@ebyhr

ebyhr Jun 12, 2019

Contributor

How about
schema.putAll(additionalTableParameters); or additionalTableParameters.forEach(schema::put);

This comment has been minimized.

Copy link
@findepi
@@ -131,10 +136,43 @@ public void testSequenceFile(int rowCount)

assertThatFileFormat(SEQUENCEFILE)
.withColumns(testColumns)
.withRowsCount(rowCount)
.isReadableByRecordCursor(new GenericHiveRecordCursorProvider(HDFS_ENVIRONMENT));

This comment has been minimized.

Copy link
@ebyhr

ebyhr Jun 12, 2019

Contributor

nit: These indent look wrong.

{
testCreatePartitionedCsvTableAs(
"storage_formats_test_create_table_as_select_partitioned_csv_with_custom_parameters",
", csv_escape = 'e', csv_separator='s', csv_quote='q'");

This comment has been minimized.

Copy link
@ebyhr

ebyhr Jun 12, 2019

Contributor

nit: Add spaces or remove the first csv_escape='e'

Suggested change
", csv_escape = 'e', csv_separator='s', csv_quote='q'");
", csv_escape = 'e', csv_separator = 's', csv_quote = 'q'");

This comment has been minimized.

Copy link
@findepi

@kokosing kokosing force-pushed the kokosing:origin/master/121_csv branch from e54aa8c to b703255 Jun 13, 2019

@kokosing

This comment has been minimized.

Copy link
Member Author

commented Jun 13, 2019

@ebyhr Thanks for comments. I have just addressed them.

@kokosing

This comment has been minimized.

Copy link
Member Author

commented Jun 24, 2019

@electrum Ping

@kokosing

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

@findepi

findepi approved these changes Jul 3, 2019

}

@Override
public ConnectorPageSink createPageSink(ConnectorTransactionHandle transaction, ConnectorSession session, ConnectorInsertTableHandle tableHandle)
{
HiveInsertTableHandle handle = (HiveInsertTableHandle) tableHandle;
return createPageSink(handle, false, session);
return createPageSink(handle, false, session, ImmutableMap.of());

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

Add a comment explaining that/why we don't need to pass table parameters here.

@@ -125,6 +126,7 @@
private final int maxOpenSortFiles;
private final boolean immutablePartitions;
private final InsertExistingPartitionsBehavior insertExistingPartitionsBehavior;
private final Map<String, String> additionalTableParameters;

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

This could go after HiveStorageFormat tableStorageFormat,

@@ -432,6 +436,8 @@ else if (insertExistingPartitionsBehavior == InsertExistingPartitionsBehavior.ER
}
}

additionalTableParameters.forEach((key, value) -> schema.put(key, value));

This comment has been minimized.

Copy link
@findepi
@@ -778,6 +770,13 @@ public void createTable(ConnectorSession session, ConnectorTableMetadata tableMe
return tableProperties.build();
}

private static void checkFormatForProperty(HiveStorageFormat actualStorageFormat, HiveStorageFormat storageFormat, String propertyName)

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

storageFormat -> expectedStorageFormat

@@ -228,13 +228,13 @@
public static final String PRESTO_VERSION_NAME = "presto_version";
public static final String PRESTO_QUERY_ID_NAME = "presto_query_id";
public static final String TABLE_COMMENT = "comment";
public static final Set<String> RESERVED_ROLES = ImmutableSet.of("all", "default", "none");
private static final Set<String> RESERVED_ROLES = ImmutableSet.of("all", "default", "none");

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

Limit scope of variables

Limit visibility of constants

import static io.prestosql.tests.TestGroups.STORAGE_FORMATS;
import static java.lang.String.format;

public class TestHiveCsvStorageFormat

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

Remove "Hive".

maybe just "TestCsv" ?

{
private static final String TPCH_SCHEMA = "tiny";

@Test(groups = STORAGE_FORMATS)

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

this triggers intellij warning for me
please consider using @Test(groups = {STORAGE_FORMATS}) as a workaround

This comment has been minimized.

Copy link
@kokosing

kokosing Jul 4, 2019

Author Member

what warnning do you get? I didn't get any.

" shipinstruct VARCHAR," +
" shipmode VARCHAR," +
" comment VARCHAR," +
" returnflag VARCHAR" +

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

nit: we usually lowercase varchar

{
testCreatePartitionedCsvTableAs(
"storage_formats_test_create_table_as_select_partitioned_csv_with_custom_parameters",
", csv_escape = 'e', csv_separator='s', csv_quote='q'");

This comment has been minimized.

Copy link
@findepi
QueryResult actual = query(format(query, tableName));
assertThat(actual)
.hasColumns(expected.getColumnTypes())
.containsExactly(expectedRows);

This comment has been minimized.

Copy link
@findepi

findepi Jul 3, 2019

Member

.containsOnly since your queries have no ORDER BY

@kokosing
Copy link
Member Author

left a comment

Comments addressed.

if (!tablePropertyValue.equals(serdePropertyValue)) {
throw new PrestoException(
HIVE_INVALID_METADATA,
format("Different values for '%s' set in serde properties '%s' and table properties '%s'", key, serdePropertyValue, tablePropertyValue));

This comment has been minimized.

Copy link
@kokosing

kokosing Jul 4, 2019

Author Member

What's hive's behavior when property declared twice, differently?

https://gist.github.com/kokosing/ffbc55bd3a8aa5bf8ad56ea89a7bffd8

table properties win

stringProperty(AVRO_SCHEMA_URL, "URI pointing to Avro schema for the table", null, false),
stringProperty(CSV_SEPARATOR, "CSV separator character", null, false),
stringProperty(CSV_QUOTE, "CSV quote character", null, false),
stringProperty(CSV_ESCAPE, "CSV escape character", null, false));

This comment has been minimized.

Copy link
@kokosing

kokosing Jul 4, 2019

Author Member

It looks like AVRO should be moved before TEXTFILE

{
private static final String TPCH_SCHEMA = "tiny";

@Test(groups = STORAGE_FORMATS)

This comment has been minimized.

Copy link
@kokosing

kokosing Jul 4, 2019

Author Member

what warnning do you get? I didn't get any.

@kokosing kokosing force-pushed the kokosing:origin/master/121_csv branch 2 times, most recently from fb6c913 to cb322a8 Jul 4, 2019

Pass table properties to Hive page sink
In case of CTAS query when a table does not yet exists in metastore and
has custom properties, these properties requires to be passed to
underlying storage format.

@kokosing kokosing force-pushed the kokosing:origin/master/121_csv branch from cb322a8 to a472c0a Jul 4, 2019

@kokosing kokosing force-pushed the kokosing:origin/master/121_csv branch from a472c0a to 41093c0 Jul 5, 2019

@kokosing kokosing closed this Jul 5, 2019

@kokosing kokosing deleted the kokosing:origin/master/121_csv branch Jul 5, 2019

@kokosing kokosing merged commit 41093c0 into prestosql:master Jul 5, 2019

1 of 3 checks passed

Travis CI - Branch Build Created
Details
Travis CI - Pull Request Build Created
Details
verification/cla-signed
Details

@kokosing kokosing referenced this pull request Jul 5, 2019

Closed

Release notes for 316 #1000

5 of 6 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.