[BUG] Cannot open Parquet file with 2 similar column names (different case) #68

Closed
MCRE-BE opened this issue Jan 27, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@MCRE-BE
Contributor

MCRE-BE commented Jan 27, 2023

Parquet Viewer Version
What version of Parquet Viewer are you experiencing the issue with?
2.4.2.0

Where was the parquet file created?
pyarrow

Sample File
Example.zip

Describe the bug
I believe the bug comes from having two column names that are identical when compared case-insensitively (they differ only in letter case).
I can open the file in pyarrow/python, not in ParquetViewer.
Screenshot 2023-01-27 113819
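
One way to confirm this kind of collision before opening a file in a viewer is to group the column names by their case-folded form. The sketch below is plain Python with no Parquet dependency; `find_case_collisions` and the sample column names are hypothetical and are not taken from the attached file. With pyarrow, the list of names could come from the file's schema instead.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of column names that are equal when compared case-insensitively."""
    groups = defaultdict(list)
    for name in names:
        # casefold() is a more aggressive lower() intended for caseless matching
        groups[name.casefold()].append(name)
    # Only groups with more than one member are collisions
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical column names; "Value" and "value" differ only in case
columns = ["id", "Value", "value", "timestamp"]
print(find_case_collisions(columns))
```

An empty result means no two names differ only in case.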

Screenshots
Screenshot 2023-01-27 113152

Additional context
The similar column names are a bug in my code, but they should not make the program crash.

Note: This tool relies on the parquet-dotnet library for all the actual Parquet processing. So any issues where that library cannot process a parquet file will not be addressed by us. Please open a ticket on that library's repo to address such issues.

@MCRE-BE MCRE-BE added the bug Something isn't working label Jan 27, 2023
mukunku added a commit that referenced this issue Jan 29, 2023
@mukunku
Owner

mukunku commented Jan 29, 2023

I gave this a shot, but it turns out DataTable column names are case-insensitive. So it's not possible to show two fields whose names differ only in case.

For now I've added logic to gracefully exclude duplicate fields from the output. It's not ideal but at least the utility won't crash when opening such files.
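
The idea of skipping duplicate fields can be sketched as follows. This is an illustration in Python, not the code from the referenced commit: the function name is hypothetical, and keeping the first occurrence of each name is an assumption, since the thread does not say which field is kept.

```python
def drop_case_insensitive_duplicates(names):
    """Split names into (kept, skipped), ignoring case when detecting duplicates.

    Assumption: the first occurrence of each name wins.
    """
    seen = set()
    kept, skipped = [], []
    for name in names:
        key = name.casefold()
        if key in seen:
            skipped.append(name)  # a case-variant of an earlier field
        else:
            seen.add(key)
            kept.append(name)
    return kept, skipped


kept, skipped = drop_case_insensitive_duplicates(["id", "Value", "value"])
print("loaded:", kept)
print("excluded:", skipped)
```

Returning the skipped names separately makes it possible to tell the user which fields were left out instead of dropping them silently.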

Give it a shot here if you get the chance: https://github.com/mukunku/ParquetViewer/releases/tag/v2.5.1

I'll leave this ticket open since the original issue hasn't been solved, and it should be possible, albeit difficult, to handle case-sensitive field names.

@MCRE-BE
Contributor Author

MCRE-BE commented Jan 29, 2023

So the issue is not with you but with the underlying library you are using to parse Parquet files? I can open a bug report there.

I'll test the fix, but indeed it's a workaround...

@mukunku
Owner

mukunku commented Jan 29, 2023

@MCRE-BE The issue is with the data structure the app uses to store the data in memory. It doesn't support multiple columns whose names differ only in case, because it compares column names case-insensitively.
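
To illustrate the constraint, here is a toy Python model of a column collection that compares names case-insensitively. It is only an analogy for the behavior described above: `CaseInsensitiveColumns` is a hypothetical class, not part of ParquetViewer or of any .NET API.

```python
class CaseInsensitiveColumns:
    """Toy model of a column collection that treats names as case-insensitive."""

    def __init__(self):
        self._by_key = {}  # case-folded name -> name as originally added

    def add(self, name):
        key = name.casefold()
        if key in self._by_key:
            # A second name that differs only in case cannot be stored
            raise ValueError(
                f"column {name!r} collides with existing column {self._by_key[key]!r}"
            )
        self._by_key[key] = name

    def names(self):
        return list(self._by_key.values())


cols = CaseInsensitiveColumns()
cols.add("Value")
try:
    cols.add("value")
except ValueError as err:
    print(err)
```

Any loader built on such a structure has to either skip, rename, or reject the second column.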

In your original bug report you mentioned:

The similar column names are a bug in my code, but they should not make the program crash.

Is this a legitimate use case in your workflow, or was it a mistake and you don't normally have column names that differ only in casing?

If this isn't a normal use case, maybe just gracefully warning the user about the problem is a sufficient solution here:
image

@MCRE-BE
Contributor Author

MCRE-BE commented Jan 30, 2023

For me it was a mistake. So for me it's a sufficient solution, but might not be for others 🙄 But thanks for the fix 😄

I guess you can't easily change the column names (for example, by appending a suffix such as _x)? That's how pandas handles overlapping column names when merging DataFrames.

@MCRE-BE MCRE-BE closed this as completed Jan 30, 2023
@mukunku
Owner

mukunku commented Jan 31, 2023

Appending a suffix might be the only way to handle these, but it's not straightforward. It might not be worth investing time if it's such a rare use case. Let's see if anyone else needs this kind of support. If demand increases, I can take a look.
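
A suffix-based rename could look roughly like the sketch below. It is written in Python for illustration and is not something the project implements. The function name and the numeric `_1`, `_2` suffix scheme are assumptions (pandas uses `_x`/`_y` when merging). One reason the approach is not trivial is that a generated name can itself collide with another column, so the sketch reserves every original name up front.

```python
def dedupe_with_suffix(names):
    """Rename case-insensitive duplicates by appending _1, _2, ... until unique.

    The first occurrence of each name is left untouched.
    """
    # Every original name is reserved so a generated name never shadows one
    reserved = {name.casefold() for name in names}
    seen = set()
    result = []
    for name in names:
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            result.append(name)
            continue
        counter = 1
        candidate = f"{name}_{counter}"
        while candidate.casefold() in reserved:
            counter += 1
            candidate = f"{name}_{counter}"
        reserved.add(candidate.casefold())
        result.append(candidate)
    return result


print(dedupe_with_suffix(["Value", "value", "VALUE"]))
```

A viewer that renames columns this way would probably also need to show the user the original names, since the displayed names no longer match the file's schema.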
