
Error reading Parquet file #13261

Closed
l1x opened this issue Aug 21, 2019 · 7 comments

l1x commented Aug 21, 2019

Query:

select internal_id from hive.db.table limit 10;

Output:

Query 20190821_080531_00002_dj2ve failed: com.facebook.presto.spi.type.VarcharType

The same query runs perfectly with Hive or Athena.

Table create:

CREATE TABLE hive.db.table (
   ...
   internal_id varchar,
   ...
)

Parquet schema:

{
  "type": "record",
  "name": "flatschema",
  "namespace": "ns",
  "fields": [ {
    "name": "internal_id",
    "type": ["null", "string"],
    "default": null
  }]
}

Version:

presto-cli --version
Presto CLI 0.220

babrar commented Aug 21, 2019

There is a Stack Overflow thread for this problem:
https://stackoverflow.com/questions/51918860/unable-to-query-parquet-data-with-nested-fields-in-presto-db

Problem: Presto is reading data based on column indices instead of column names.

Solution: Add hive.parquet.use-column-names=true to your hive.properties file to force Presto to read data using column names instead of column indices.
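As a minimal sketch, assuming the Hive catalog file lives at the default etc/catalog/hive.properties path (adjust paths and the metastore address to your deployment):

```properties
# etc/catalog/hive.properties (path and metastore URI are placeholders)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
# Match Parquet columns to Hive columns by name rather than by position
hive.parquet.use-column-names=true
```

Catalog properties are only read at startup, so every Presto node needs a restart after editing the file.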


l1x commented Aug 22, 2019

Thanks @babrar but it still fails:

Query 20190822_083805_00007_2sbhh failed: com.facebook.presto.spi.type.VarcharType

I can load the file with parquet tools and Python. Not sure why Presto cannot read it.


babrar commented Aug 22, 2019

Seems like the same issue as before.
Make sure you restart Presto after adding the new configuration to allow the changes to take effect.


l1x commented Aug 22, 2019

I have made the change and restarted Presto.

It is visible in the live session:

 hive.parquet_fail_with_corrupted_statistics     | false                 | true                  | boolean | Parquet: Fail when
 hive.parquet_use_column_names                   | true                  | true                  | boolean | Experimental: Parq
 hive.parquet_writer_block_size                  | 134217728B            | 134217728B            | varchar | Parquet: Writer bl
 hive.parquet_writer_page_size                   | 1048576B              | 1048576B              | varchar | Parquet: Writer pa
 hive.partition_statistics_sample_size           | 100                   | 100                   | integer | Maximum sample siz


l1x commented Aug 23, 2019

We figured it out with @maugly24. The problem is that Presto cannot read from a table where the field order in the table definition does not match the Parquet field order.

Parquet field order -> A, B, C
Table in Hive -> A, C, B

The value of hive.parquet_use_column_names does not seem to matter, unless I am missing something. Does this value need to be set on all nodes or only on the coordinator node?
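To illustrate the mismatch, here is a hypothetical Python sketch (plain dictionaries, not Presto internals) contrasting index-based and name-based column resolution:

```python
# Hypothetical illustration of the two column-resolution strategies.
parquet_schema = ["A", "B", "C"]       # physical column order in the Parquet file
hive_schema = ["A", "C", "B"]          # column order declared in the Hive metastore

row = {"A": 1, "B": "text", "C": 3.0}  # one record as written to the Parquet file

# Index-based resolution: Hive column i is read from Parquet column i.
by_index = {hive_schema[i]: row[parquet_schema[i]] for i in range(3)}
# Hive column "C" now holds the string from Parquet column "B": the types no longer match.

# Name-based resolution: columns are matched by name, so order is irrelevant.
by_name = {name: row[name] for name in hive_schema}
```

With index-based resolution, a Hive column declared as varchar can end up pointing at a Parquet column of a different type, which would be consistent with the com.facebook.presto.spi.type.VarcharType failure above.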


ioah86 commented Aug 23, 2019

@babrar and I have set this up on all nodes, workers and coordinator.
We have actually had the scenario where a new column suddenly appeared that was positioned differently in the Parquet file compared to the Hive metastore, and there was no issue.


l1x commented Aug 24, 2019

So the setting has to be set on all nodes, I guess.

@l1x l1x closed this as completed Aug 28, 2019