
unable to insert empty map data type into parquet format #5934

Open
edwardxwu opened this issue Aug 23, 2016 · 14 comments


@edwardxwu

Hi,

I am trying to insert data that may contain empty values for a map column into a Parquet table, and I keep getting:

Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

Looks similar to this avro issue: https://github.com/Parquet/parquet-mr/pull/169

Edward

@nezihyigitbasi
Contributor

This is a known issue and the root cause boils down to the Hive Parquet writers, which Presto uses to write Parquet data (see HIVE-11625).

@edwardxwu
Author

But when I used Hive/Hadoop to perform the same operation, it completed without error.

@nezihyigitbasi
Contributor

Which Hive version are you using?

@edwardxwu
Author

I was using Hive 1.0 on EMR. The issue I am having is a bit different from the ticket you linked: Presto seems to fail if the map being inserted is empty (like {}).

@edwardxwu
Author

edwardxwu commented Aug 24, 2016

Here is a concrete example:

presto:default> create table test (a map<varchar(3), integer>) with (format='PARQUET');
CREATE TABLE
presto:default> show columns from test;
 Column |           Type           | Comment 
--------+--------------------------+---------
 a      | map(varchar(3), integer) |         
(1 row)

Query 20160824_221422_00259_hnsq7, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 70B] [2 rows/s, 196B/s]

presto:default> insert into test select MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]);
INSERT: 1 row

Query 20160824_221457_00260_hnsq7, FINISHED, 3 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> select * from test;
       a        
----------------
 {bar=2, foo=1} 
(1 row)

Query 20160824_221502_00261_hnsq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [1 rows, 344B] [3 rows/s, 1.05KB/s]

presto:default> insert into test select MAP(ARRAY[], ARRAY[]);

Query 20160824_221509_00262_hnsq7, FAILED, 3 nodes
Splits: 4 total, 2 done (50.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20160824_221509_00262_hnsq7 failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

presto:default> select MAP(ARRAY[], ARRAY[]);
 _col0 
-------
 {}    
(1 row)

Query 20160824_221535_00263_hnsq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

@nezihyigitbasi
Contributor

If you re-run your query and check the complete stack trace (pass --debug to the CLI) you will see that the root cause is the same as HIVE-11625. So it's really a problem with the underlying Hive Parquet writers, and it's hard to fix at the Presto level unless Presto has its own Parquet writers.
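
For reference, a CLI invocation with the debug flag enabled might look like the following (the server address, catalog, and schema here are placeholders):

presto --server localhost:8080 --catalog hive --schema default --debug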

Query 20160825_165119_00008_3zd6n failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
        at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
        at com.facebook.presto.hive.HivePageSink$HiveRecordWriter.addRow(HivePageSink.java:747)
        at com.facebook.presto.hive.HivePageSink.doAppend(HivePageSink.java:411)
        at com.facebook.presto.hive.HivePageSink.lambda$appendPage$2(HivePageSink.java:390)
        at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
        at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
        at com.facebook.presto.hive.HivePageSink.appendPage(HivePageSink.java:390)
        at com.facebook.presto.spi.classloader.ClassLoaderSafeConnectorPageSink.appendPage(ClassLoaderSafeConnectorPageSink.java:42)
        at com.facebook.presto.operator.TableWriterOperator.addInput(TableWriterOperator.java:207)
        at com.facebook.presto.operator.Driver.processInternal(Driver.java:384)
        at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
        at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
        at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
        at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
        at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
        ... 23 more

@edwardxwu
Author

Got it. I will upvote there.

@genomics-geek

I am having the same issue. How did you overcome this? Did you change your schema or replace null values?

@pranavoyo

I am also facing this issue. Is there any workaround?

@igorcalabria

igorcalabria commented Feb 15, 2019

@nezihyigitbasi You are right that the Parquet writer does not allow empty maps, but I may have found something specific to Presto: if you insert an empty array whose element type is a struct, you'll get the same error.

CREATE TABLE empty_map_array (some_list ARRAY(ROW(thing varchar))) WITH (format = 'PARQUET');
INSERT INTO empty_map_array (some_list) VALUES (ARRAY[]);

you'll get:

Query 20190215_193022_00006_auzvk, FAILED, 1 node
Splits: 51 total, 33 done (64.71%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20190215_193022_00006_auzvk failed: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead

If I'm not mistaken, empty arrays should be allowed in Parquet (tested in Spark).
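
For comparison, a quick check along these lines in Spark SQL (the table name is a placeholder) writes the empty array without complaint:

-- Spark SQL sketch: Spark's Parquet writer accepts the empty array
CREATE TABLE empty_array_check (some_list ARRAY<STRING>) USING parquet;
INSERT INTO empty_array_check VALUES (array());
SELECT * FROM empty_array_check;  -- returns []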

@PowerToThePeople111

I stumbled upon the same problem with Athena: when I do a CTAS query there that has empty arrays in the result set, I get this error.

@alecbw

alecbw commented Apr 4, 2020

Not really a solution to the actual issue here but re: @PowerToThePeople111's problem, you can work around the write-to-Parquet error by writing the Athena CTAS query to JSON instead.

There are performance/compression tradeoffs, but they may be preferable to the hours I spent fruitlessly trying to coalesce my nulls into compliance.
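
A minimal sketch of that workaround, with the table, bucket, and source names as placeholders:

-- Athena CTAS writing JSON instead of Parquet, sidestepping the empty-field check
CREATE TABLE my_table_json
WITH (format = 'JSON', external_location = 's3://my-bucket/my_table_json/')
AS SELECT * FROM source_with_empty_arrays;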

@abhishekupadhyaya

abhishekupadhyaya commented Apr 23, 2020

Has anyone come across a way to write an Athena CTAS query that checks the cardinality of a nested structure? In my case, I have a struct that can contain empty arrays.
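
One sketch of such a cardinality check in Presto/Athena SQL, rebuilding the row so an empty nested array becomes NULL (the payload column and its fields are hypothetical):

SELECT
  CAST(
    ROW(
      IF(cardinality(payload.items) = 0, NULL, payload.items),  -- null out the empty array
      payload.label
    )
    AS ROW(items ARRAY(VARCHAR), label VARCHAR)
  ) AS payload
FROM source_table;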

@mgmarino

mgmarino commented Jun 2, 2020

In our case, this happens when an Athena CTAS tries to write rows with empty arrays. Our fix is to use nullif, e.g.:

INSERT INTO ... 
SELECT
  nullif(my_col, ARRAY[]) as my_col
FROM ...

Not sure if this is an option in your case, @alecbw, but if it is, it may be better than writing to JSON. @abhishekupadhyaya, perhaps you can try checking for an empty struct in the same way, maybe:

nullif(my_col, MAP(ARRAY[], ARRAY[])) as my_col
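
Applied to the test table from the repro above, the map variant would look something like this (the staging source table is a placeholder):

INSERT INTO test
SELECT nullif(a, MAP(ARRAY[], ARRAY[])) AS a
FROM staging;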
