Support for map Parquet legacy format #279

Closed
smf-srogozins opened this issue Nov 11, 2022 · 4 comments

@smf-srogozins

Hello, this is likely related to #184

I am using parquet4s 2.6.0, which as far as I understand uses parquet-mr 1.12.0, and I need to read some files that were generated with parquet-mr 1.11.0. The issue is that the files contain map fields, and apparently the logical name for them in the Parquet schema changed between versions from map to key_value. The last version of parquet4s that used 1.11.0 is 1.7.0, which is a pretty big downgrade. I see something related to this in the Spark code as well:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L412-L437

Is there an existing workaround in parquet4s that I can use for this, or could support for the legacy map type be added?

@mjakubowski84
Owner

Hi @smf-srogozins,
Parquet4s should be able to read both the current and the legacy format of maps. It ignores the name of the group (it doesn't care whether it is named map or key_value). The name is used only during writing. You can check it in the code: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L682. If I missed something and backward compatibility is broken, do not hesitate to propose a fix. PRs are warmly welcome :)

@smf-srogozins
Author

OK, that's weird. To double-check this, I tried generating a similar file with version 1.11.0 of parquet4s, and I am not seeing the error with that file. I need to spend more time to understand the cause, because the error specifically complains about "key_value not found in optional group myMap (MAP)". Not sure if that is useful, but this also only seems to occur when I am using projection. Also, I believe the failing parquet files were generated using Apache Iceberg.

@mjakubowski84
Owner

mjakubowski84 commented Nov 14, 2022

> Not sure if that is useful, but this also only seems to occur when I am using projection

Oh, yes, it is useful. When using projection, you define the exact schema you expect from the file you read. And when using Parquet 1.12, the schema will contain key_value, which does not match your data.
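
To illustrate the mismatch described above, here is a minimal sketch in plain Scala. It is a toy model only: `Field`, `Leaf`, `Group` and `firstMissing` are hypothetical names invented for this example and are not part of parquet4s or parquet-mr, and the real libraries resolve schemas with more rules than a name lookup. It only shows why a projection that asks for key_value cannot be satisfied by a file whose inner map group is named map.

```scala
// Toy model of a schema tree. NOT the parquet-mr or parquet4s API;
// all names here are hypothetical and exist only for illustration.
sealed trait Field { def name: String }
final case class Leaf(name: String) extends Field
final case class Group(name: String, children: List[Field]) extends Field

object ProjectionSketch {
  // Walks the projection and reports the first field that the file schema
  // does not contain under the same name (assumed name-based matching).
  def firstMissing(file: Group, projection: Group): Option[String] =
    projection.children.iterator.map { wanted =>
      (file.children.find(_.name == wanted.name), wanted) match {
        case (None, _)                  => Some(s"${wanted.name} not found in ${file.name}")
        case (Some(g: Group), w: Group) => firstMissing(g, w)
        case _                          => None
      }
    }.collectFirst { case Some(message) => message }

  private val keyAndValue = List(Leaf("key"), Leaf("value"))

  // File written in the legacy layout: the inner group is named "map".
  val legacyFile: Group =
    Group("message", List(Group("myMap", List(Group("map", keyAndValue)))))

  // Projection derived under the newer layout: the inner group is "key_value".
  val currentProjection: Group =
    Group("message", List(Group("myMap", List(Group("key_value", keyAndValue)))))

  // Projection that uses the same inner group name as the file.
  val legacyProjection: Group =
    Group("message", List(Group("myMap", List(Group("map", keyAndValue)))))

  def main(args: Array[String]): Unit = {
    println(firstMissing(legacyFile, currentProjection)) // Some(key_value not found in myMap)
    println(firstMissing(legacyFile, legacyProjection))  // None
  }
}
```

In this toy model the lookup succeeds as soon as the projection uses the same inner group name as the file, which matches the observation that the error shows up only when a projection is involved.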
There are two ways to work around this:
