
unable to insert empty map data type into parquet format #5934

Open
edwardxwu opened this issue Aug 23, 2016 · 14 comments


@edwardxwu

Hi,

I am trying to insert data that may contain empty values for a map column into a Parquet table, and I keep getting:

Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

Looks similar to this avro issue: https://github.com/Parquet/parquet-mr/pull/169

Edward

@nezihyigitbasi
Contributor

This is a known issue and the root cause boils down to the Hive Parquet writers, which Presto uses to write Parquet data (see HIVE-11625).

@edwardxwu
Author

But when I used Hive/Hadoop to perform the same operation, it completed without error.

@nezihyigitbasi
Contributor

Which Hive version are you using?

@edwardxwu
Author

I was using Hive 1.0 on EMR. The issue I am having is a bit different from the ticket you linked: Presto seems to fail if the map being inserted is empty (like {}).

@edwardxwu
Author

edwardxwu commented Aug 24, 2016

Here is a concrete example:

presto:default> create table test (a map<varchar(3), integer>) with (format='PARQUET');
CREATE TABLE
presto:default> show columns from test;
 Column |           Type           | Comment 
--------+--------------------------+---------
 a      | map(varchar(3), integer) |         
(1 row)

Query 20160824_221422_00259_hnsq7, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 70B] [2 rows/s, 196B/s]

presto:default> insert into test select MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]);
INSERT: 1 row

Query 20160824_221457_00260_hnsq7, FINISHED, 3 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> select * from test;
       a        
----------------
 {bar=2, foo=1} 
(1 row)

Query 20160824_221502_00261_hnsq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 344B] [3 rows/s, 1.05KB/s]

presto:default> insert into test select MAP(ARRAY[], ARRAY[]);

Query 20160824_221509_00262_hnsq7, FAILED, 3 nodes
Splits: 4 total, 2 done (50.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20160824_221509_00262_hnsq7 failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

presto:default> select MAP(ARRAY[], ARRAY[]);
 _col0 
-------
 {}    
(1 row)

Query 20160824_221535_00263_hnsq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

@nezihyigitbasi
Contributor

If you re-run your query and check the complete stack trace (pass --debug to the CLI) you will see that the root cause is the same as HIVE-11625. So it's really a problem with the underlying Hive Parquet writers, and it's hard to fix at the Presto level unless Presto has its own Parquet writers.
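
For reference, a CLI invocation with the debug flag enabled might look like the following (the server address, catalog, and schema here are placeholders):

presto --server localhost:8080 --catalog hive --schema default --debug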

Query 20160825_165119_00008_3zd6n failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
        at com.facebook.presto.hive.HivePageSink$HiveRecordWriter.addRow(HivePageSink.java:747)
        at com.facebook.presto.hive.HivePageSink.doAppend(HivePageSink.java:411)
        at com.facebook.presto.hive.HivePageSink.lambda$appendPage$2(HivePageSink.java:390)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
        at com.facebook.presto.hive.HivePageSink.appendPage(HivePageSink.java:390)
        at com.facebook.presto.spi.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:42)
        at com.facebook.presto.operator.TableWriterOperator.addInput(TableWriterOperator.java:207)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:384)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
        at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
        at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
        ... 23 more

@edwardxwu
Author

Got it. I will upvote there.

@genomics-geek

I am having the same issue. How did you overcome this? Did you change your schema or replace null values?

@pranavoyo

I am also facing this issue. Is there any workaround?

@igorcalabria

igorcalabria commented Feb 15, 2019

@nezihyigitbasi You are right that the Parquet writer does not allow empty maps, but I may have found something specific to Presto: if you insert an empty array whose element type is a struct, you'll get the same error.

CREATE TABLE empty_map_array (some_list ARRAY(ROW(thing varchar))) WITH (format = 'PARQUET');
INSERT INTO empty_map_array (some_list) VALUES (ARRAY[]);

you'll get:

Query 20190215_193022_00006_auzvk, FAILED, 1 node
Splits: 51 total, 33 done (64.71%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20190215_193022_00006_auzvk failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

If I'm not mistaken, empty arrays should be allowed in Parquet (tested in Spark).
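
For comparison, a quick check along these lines in Spark SQL (the table name is a placeholder) writes the empty array without complaint:

-- Spark SQL sketch: Spark's Parquet writer accepts the empty array
CREATE TABLE empty_array_check (some_list ARRAY<STRING>) USING parquet;
INSERT INTO empty_array_check VALUES (array());
SELECT * FROM empty_array_check;  -- returns []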

@PowerToThePeople111

I stumbled upon the same problem with Athena: when I do a CTAS query there that has empty arrays in the result set, I get this error.

@alecbw

alecbw commented Apr 4, 2020

Not really a solution to the actual issue here but re: @PowerToThePeople111's problem, you can work around the write-to-Parquet error by writing the Athena CTAS query to JSON instead.

There are performance/compression tradeoffs, but they may be preferable to the hours I spent fruitlessly trying to coalesce my nulls into compliance.
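
A minimal sketch of that workaround, with the table, bucket, and source names as placeholders:

-- Athena CTAS writing JSON instead of Parquet, sidestepping the empty-field check
CREATE TABLE my_table_json
WITH (format = 'JSON', external_location = 's3://my-bucket/my_table_json/')
AS SELECT * FROM source_with_empty_arrays;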

@abhishekupadhyaya

abhishekupadhyaya commented Apr 23, 2020

Has anyone come across a way to write an Athena CTAS query that checks the cardinality of a nested structure? In my case, I have a struct that can contain empty arrays.
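
One sketch of such a cardinality check in Presto/Athena SQL, rebuilding the row so an empty nested array becomes NULL (the payload column and its fields are hypothetical):

SELECT
  CAST(
    ROW(
      IF(cardinality(payload.items) = 0, NULL, payload.items),  -- null out the empty array
      payload.label
    )
    AS ROW(items ARRAY(VARCHAR), label VARCHAR)
  ) AS payload
FROM source_table;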

@mgmarino

mgmarino commented Jun 2, 2020

In our case, this happens when an Athena CTAS tries to write rows with empty arrays. Our fix is to use nullif, e.g.:

INSERT INTO ... 
SELECT
  nullif(my_col, ARRAY[]) as my_col
FROM ...

Not sure if this is an option in your case, @alecbw, but if it is, it may be better than writing to JSON. @abhishekupadhyaya, perhaps you can try checking for an empty struct in the same way, maybe:

nullif(my_col, MAP(ARRAY[], ARRAY[])) as my_col
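
Applied to the test table from the repro above, the map variant would look something like this (the staging source table is a placeholder):

INSERT INTO test
SELECT nullif(a, MAP(ARRAY[], ARRAY[])) AS a
FROM staging;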
