
Parquet Preview: human readable timestamp and date values #7159

Closed
3 tasks done
keen85 opened this issue Aug 7, 2023 · 15 comments
Labels
💡 feature request New feature or request 💡 file previewing Feature request for file previewing 📋 feedback wanted Issue is for soliciting of feedback regarding a feature/idea/issue/etc. ❔ external Root cause of this issue is in another component, product, or service ✅ merged A fix for this issue has been merged

Comments

keen85 commented Aug 7, 2023

Preflight Checklist

Problem

When previewing Parquet files, values of columns with the datatypes timestamp and date are hard to read.
[screenshot: 20230807_FeatureRequest_AzureStorageExplorer_ParquetPreview]

Example parquet:
part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet.zip

PySpark code for reproducing the example parquet:

import pyspark.sql.functions as F

df = (
    spark.range(3).toDF("id")
    # base timestamp; the fractional part beyond microseconds gets truncated
    .withColumn("ts_base", F.to_timestamp(F.lit("2021-07-24 12:01:19.123456789012")))
    # three timestamps spaced 10,000,000 seconds (~115 days) apart
    .withColumn("ts", (F.unix_timestamp(F.col("ts_base")) + F.col("id") * 10000000).cast("timestamp"))
    .withColumn("ts_str", F.col("ts").cast("string"))
    .withColumn("ts_unix", F.unix_timestamp(F.col("ts")))
    .withColumn("date", F.col("ts").cast("date"))
    .withColumn("date_str", F.col("date").cast("string"))
    .drop("ts_base")
)
df.printSchema()
display(df)  # Synapse notebook helper
# root_path points to a storage location defined elsewhere in the notebook
df.coalesce(1).write.format("parquet").mode("overwrite").option("overwriteSchema", "True").save(root_path + 'test1/')

Desired Solution

Timestamp and date column values should be rendered in an (easily) human-readable format, e.g. yyyy-MM-dd'T'HH:mm:ss[.SSSSSS]'Z' for timestamps and yyyy-MM-dd for dates (ISO 8601).
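For reference, the requested rendering can be sketched in Python's standard library (a hypothetical illustration of the target format, not Storage Explorer code):

```python
from datetime import date, datetime, timezone

# Render a UTC timestamp as yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z' and a date as yyyy-MM-dd
ts = datetime(2021, 7, 24, 12, 1, 19, 123456, tzinfo=timezone.utc)
print(ts.strftime("%Y-%m-%dT%H:%M:%S.%fZ"))  # 2021-07-24T12:01:19.123456Z
print(date(2021, 7, 24).isoformat())         # 2021-07-24
```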

Alternatives and Workarounds

No response

Additional Context

No response

MRayermannMSFT (Member) commented:

@keen85, slightly unrelated to your issue, would you mind sharing what is shown when you click on "Chart" in your Jupyter Notebook? :)

@MRayermannMSFT MRayermannMSFT added 💡 feature request New feature or request 💡 file previewing Feature request for file previewing labels Aug 7, 2023
@MRayermannMSFT MRayermannMSFT added this to the Near Future milestone Aug 7, 2023
keen85 (Author) commented Aug 7, 2023

@MRayermannMSFT sure, but it has little meaning for this dataframe:
[screenshot: chart view of the dataframe]

This is a convenience feature of Azure Synapse Analytics:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-data-visualization#displaydf-function

MRayermannMSFT commented:

> @MRayermannMSFT sure, but it has little meaning for this dataframe

Sure, was just curious. :)

pierrejeandev commented:

Same issue here on v1.31.1 (93).

craxal (Contributor) commented Dec 7, 2023

@keen85 @pierrejeandev

For anyone who's interested, we have a private build that adds date formatting to the Parquet preview. Please give it a try and leave your feedback here.

Windows:
https://cralvordtest02.blob.core.windows.net/share/1.32.1-datepreview.1/StorageExplorer-windows-x64.exe?sv=2023-01-03&st=2023-12-07T18%3A38%3A14Z&se=2023-12-22T18%3A38%3A00Z&sr=b&sp=r&sig=DJPjcxkltVwrP7oSR74KGzf%2B3NxY1jOvi6%2FG783F45U%3D

For any other platforms, please let us know, and we'll provide additional links.

keen85 (Author) commented Dec 10, 2023

Hi @craxal,
Thanks for your work.
I already had Storage Explorer installed; I downloaded your preview build and re-installed, but I'm not sure whether that was successful.
The info dialog looks like this:
[screenshot: About dialog]

I tried previewing the Parquet file that I provided originally.

The preview looks like this for me (though I'm not sure I actually tested your preview build):
[screenshot: Parquet preview]
The timestamp is still not readable, and the date is localized and also includes the time "00:00". I'd prefer the ISO 8601 format.

@MRayermannMSFT MRayermannMSFT added the 📋 feedback wanted Issue is for soliciting of feedback regarding a feature/idea/issue/etc. label Dec 11, 2023

craxal commented Jan 6, 2024

@keen85 We can stick to ISO format unless we get additional feedback to the contrary.

Our testing shows that DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS types should be parsed correctly. But in your Parquet file, only the date field has a DATE type. All the other fields are just UTF8 or integer types. I don't think the parsing library we use handles TIMESTAMP(...) types at the moment.

Supporting timestamp types is possible with the deprecated TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS types. Additional support for TIME and TIMESTAMP is forthcoming (LibertyDSNP/parquetjs#99).


keen85 commented Jan 6, 2024

Hi @craxal,
I did some research, and I think I now understand what the problem is with TIMESTAMP...

In the past, Parquet persisted TIMESTAMP columns using the physical data type INT96. As far as I understand, INT96 was used exclusively to represent TIMESTAMP columns, which is why no logical data type needed to be specified.
On the Parquet side this is now deprecated; INT64 should be used instead, with the logical type TIMESTAMP(unit=MILLIS), TIMESTAMP(unit=MICROS), or TIMESTAMP(unit=NANOS).

However, despite the deprecation, very prominent data processing engines (e.g. Apache Spark, Apache Impala) still write Parquet files the old way by default, with TIMESTAMP as INT96 (and no logical type), because of the larger precision/range. This is also true for prominent Microsoft/Azure services that build on Apache Spark (e.g. Azure Synapse, Azure HDInsight, Microsoft Fabric, Databricks).
I created the example Parquet file using Azure Synapse; that is why the parsing library sees TIMESTAMP as INT96.

It would be greatly appreciated if Azure Storage Explorer supported this old, deprecated encoding. I saw that there is a parquetjs feature request for supporting INT96, so maybe that will solve the issue for Azure Storage Explorer indirectly?
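As a possible writer-side workaround (editor's sketch, not part of the original discussion; `spark`, `df`, and `root_path` are the objects from the repro above), Spark can be told to emit INT64 timestamps with a proper logical type instead of the legacy INT96 encoding via the `spark.sql.parquet.outputTimestampType` setting:

```python
# Write TIMESTAMP columns as INT64 with logical type TIMESTAMP(unit=MICROS)
# instead of the legacy INT96 encoding (Spark's default is "INT96").
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

df.coalesce(1).write.format("parquet").mode("overwrite").save(root_path + 'test1/')
```

Files written this way carry the annotated INT64 type that parsing libraries can format without any INT96 heuristics, at the cost of truncating to microsecond precision.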


craxal commented Jan 9, 2024

We will keep this item open to track the progress of those features. Not much else we can do until fixes/features become available.

@craxal craxal added ❔ external Root cause of this issue is in another component, product, or service ✅ merged A fix for this issue has been merged labels Jan 9, 2024
@craxal craxal modified the milestones: 1.33.0, 1.34.0 Feb 12, 2024
MRayermannMSFT commented:

@craxal for 1.34, let's start with a one-day initial investigation into whether we have the skills needed to contribute back to the open-source library.


craxal commented Mar 7, 2024

Let's break this down a bit. There currently exist three different ways to represent timestamps in the Parquet format:

| Type | Description |
| --- | --- |
| `INT96` | These were originally intended to store timestamps with nanosecond precision. This type is deprecated but remains for backward compatibility. The LibertyDSNP/parquetjs library currently supports this type. Storage Explorer is not formatting these values as timestamps, but it should be reasonably simple to do so. |
| `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` | These are what the Parquet spec refers to as ConvertedTypes, which are deprecated. The LibertyDSNP/parquetjs library currently supports these types. Storage Explorer currently formats these values as timestamps. |
| `TIMESTAMP` | This is what the Parquet spec refers to as a LogicalType. LogicalTypes differ from ConvertedTypes in that they have parameters (for example, `TIMESTAMP(unit=NANOS, isAdjustedToUTC=true)`). The LibertyDSNP/parquetjs library only appears to support ConvertedTypes. Contributing support for LogicalTypes of any kind would be a considerable undertaking (understanding the Parquet encoding, learning the library codebase, adding the necessary encode/decode logic throughout, adding tests, etc.). |

This leads me to conclude that we should do the following:

  • Format INT96 values as timestamps, since this is likely still widely in use.
  • Wait for support for LogicalTypes to be added to the library by those who understand Parquet and the codebase better. If demand is high, we can push for support. Alternatively, we can tinker with it over time ourselves and eventually add support at some future time.


craxal commented Apr 15, 2024

We started running into a build-related issue (see LibertyDSNP/parquetjs#125).


craxal commented Apr 16, 2024

@keen85 It turns out converting INT96 values into ISO-formatted strings is a lot more complicated than I initially thought.

Incorrect Assumptions

At first, I assumed INT96 values simply encoded the number of nanoseconds from some epoch. If that were true, the only question would be which epoch.

Upon further investigation, I found things to be much more complex. The first eight bytes represent the time of day in nanoseconds (the bytes have to be reversed and possibly converted if it's negative). The last 4 bytes represent a Julian day (the bytes also need to be reversed, and the Julian day needs to be converted to a Gregorian date).

This is a lot of work to convert a number to a recognizable date, work that is better left, I think, to the Parquet library.
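Following that description, a minimal decoding sketch in Python (editor's illustration; it assumes the common little-endian layout and the Unix-epoch Julian day number 2440588, and the function name is invented) might look like:

```python
import struct
from datetime import datetime, timedelta, timezone

UNIX_EPOCH_JULIAN_DAY = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp (little-endian).

    The first 8 bytes hold a signed count of nanoseconds since midnight;
    the last 4 bytes hold the Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days = julian_day - UNIX_EPOCH_JULIAN_DAY
    # datetime only resolves microseconds, so sub-microsecond detail is dropped
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days, microseconds=nanos_of_day // 1000))

# Example: Julian day 2459420 with 43,279 s of nanoseconds = 2021-07-24 12:01:19 UTC
raw = struct.pack("<qI", 43_279_000_000_000, 2459420)
print(decode_int96_timestamp(raw).isoformat())
```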

Varying Interpretations

As if a complex encoding weren't enough, the interpretation of INT96 values can vary. Spark configurations can change how values are read or written, which means INT96 values might not be represented consistently. Furthermore, only Impala seems to strictly interpret INT96 values as dates/times; that is, in the broader ecosystem, INT96 values might not always represent dates/times at all.

As an example, in the discussion for the PR you linked to, there's mention that no reliable method is known for determining when an INT96 value can be considered a date/time.

Additionally, it's not clear whether the values in your sample data align with the above encoding (the most significant 6 bytes are all zeroes, for starters), which suggests your data may interpret INT96 date/time values differently.

These consistency problems are probably why the INT96 data type was deprecated in the first place.

Conclusion

All of these points add up to a big consistency problem in how Storage Explorer should interpret INT96 values. If Storage Explorer were designed more specifically for Parquet data, we would want to consider supporting INT96 as dates/times more carefully. However, I think the best (and safest) way forward is to leave Storage Explorer as is and encourage the Parquet library authors to add support for this.



keen85 commented Apr 16, 2024

Hi @craxal,
Thanks for your comprehensive analysis.
I was not fully aware of these problems when I opened the issue, sorry for that.

I understand your reasoning and suggest closing this issue.

keen85 closed this as not planned Apr 16, 2024

craxal commented Apr 16, 2024

@keen85 No apologies necessary! It was a perfectly valid request. We did our homework, and it turned out to be more work than we initially thought. All part of the process. Thank you for the feedback.
