## Timestamp implementation in different frameworks

**Arrow timestamps** have three parts:
1. a **64-bit integer**,
2. **metadata** that associates a **time unit** (e.g. milliseconds, microseconds, or nanoseconds),
3. an **optional time zone**.
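
To make the role of the unit metadata concrete, here is a minimal sketch in plain Python (it does not use the pyarrow API). The `UNITS_PER_SECOND` table and the `encode` helper are hypothetical names introduced only for illustration; they show how the same instant becomes a different 64-bit integer depending on the unit.

```python
from datetime import datetime, timezone

# Ticks per second for each unit (illustrative table, not a pyarrow object)
UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode(instant: datetime, unit: str) -> int:
    """Return the integer that represents `instant` in the given unit."""
    delta = instant - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    ticks = UNITS_PER_SECOND[unit]
    # Integer arithmetic only, so no floating-point rounding is involved
    return whole_seconds * ticks + delta.microseconds * ticks // 10**6


instant = datetime(2046, 1, 1, tzinfo=timezone.utc)
for unit in UNITS_PER_SECOND:
    print(unit, encode(instant, unit))
```

Without the unit, the integer alone is ambiguous: the same number could mean seconds or nanoseconds.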

**Pandas (Timestamp)** has two parts:
1. a **64-bit integer** representing **nanoseconds**
2. an **optional time zone**.

Python/Pandas timestamp types without an associated time zone are referred to as “Time Zone Naive”.
Python/Pandas timestamp types with an associated time zone are referred to as “Time Zone Aware”.
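
As a small illustration of the naive/aware distinction, the sketch below uses only the standard library `datetime` type (pandas `Timestamp` is a subclass of it and exposes the same `tzinfo` attribute). The `is_aware` helper is a hypothetical name used for this example.

```python
from datetime import datetime, timedelta, timezone

# No tzinfo attached: "time zone naive"
naive = datetime(2046, 1, 1)
# Fixed UTC-08:00 offset attached: "time zone aware"
aware = datetime(2046, 1, 1, tzinfo=timezone(timedelta(hours=-8)))


def is_aware(dt: datetime) -> bool:
    """A datetime is aware when its tzinfo yields a UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


print(is_aware(naive))
print(is_aware(aware))
```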

**Spark timestamps** have one part:
1. a **64-bit integer** representing **microseconds since the UNIX epoch**.

Note: do not confuse the `long` returned by `unix_timestamp` (seconds since the UNIX epoch) with the Spark `timestamp` type (microseconds since the UNIX epoch). They are two different data types.

Note that Spark does not store any time zone metadata with its timestamps. Spark interprets timestamps in
the session local time zone (i.e. `spark.sql.session.timeZone`). If that time zone is undefined, Spark falls back to
the default system time zone.
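
The effect of the session time zone on display can be sketched in plain Python. This is only a model of the behaviour described above, not Spark code: `render` is a hypothetical helper, and the fixed UTC-08:00 offset stands in for US/Pacific in winter.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render(micros: int, session_tz: timezone) -> str:
    """Format a stored microsecond count as wall-clock time in session_tz."""
    instant = EPOCH + timedelta(microseconds=micros)
    return instant.astimezone(session_tz).strftime("%Y-%m-%d %H:%M:%S")


# One stored integer, no time zone attached to it
stored = 2_398_406_400_000_000

utc = timezone.utc
pacific_winter = timezone(timedelta(hours=-8))  # assumption: stands in for US/Pacific

print(render(stored, utc))
print(render(stored, pacific_winter))
```

The stored integer never changes; only the rendering does.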

## Consequences of the different timestamp implementations

- Time zone information is lost (all timestamps that result from converting from Spark to Arrow/pandas are “time zone naive”).

- Timestamps are truncated to microseconds.

- The session time zone might have unintuitive impacts on translation of timestamp values.
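
The truncation to microseconds can be shown with plain integer arithmetic. The sketch below uses the nanosecond value of the aware timestamp from this notebook; the variable names are chosen for illustration.

```python
NANOS_PER_MICRO = 1_000

# Nanoseconds since the UNIX epoch for 2046-01-01 00:00:00.000000500-08:00
aware_ns = 2_398_406_400_000_000_500

# Integer division drops everything below one microsecond
aware_us, lost_ns = divmod(aware_ns, NANOS_PER_MICRO)

print(f"microseconds kept: {aware_us}")
print(f"nanoseconds lost:  {lost_ns}")
```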


In [1]:
from datetime import datetime, timezone, timedelta

import pandas as pd
from pandas import Timestamp
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, from_unixtime, lit, unix_timestamp
import pyarrow as pa
import pyarrow.parquet as pq
import os
import s3fs

In [2]:
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("PandasSparkTimeStamp") \
    .getOrCreate()

pdf = pd.DataFrame({'naive': [datetime(2046, 1, 1, 0)],
                    'aware': [Timestamp(year=2046, month=1, day=1,
                                        nanosecond=500, tz=timezone(timedelta(hours=-8)))]})
# Print the pandas DataFrame to show both datetime columns
print(pdf.head())
# Enable Arrow-based columnar data transfers
# (Spark 3.0+ names this setting spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")


       naive                               aware
0 2046-01-01 2046-01-01 00:00:00.000000500-08:00


## Convert pandas datetime to spark

In [3]:
# Set the Spark session time zone to UTC
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Spark converts the pandas datetimes using the UTC session time zone
utc_df = spark.createDataFrame(pdf)
print("UTC converted datetime in UTC timezone")
utc_df.show()

# Change the Spark session time zone and convert the same pandas data again
spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
# Spark now converts the pandas datetimes using the US/Pacific session time zone
pst_df = spark.createDataFrame(pdf)
print("US/Pacific converted datetime in US/Pacific timezone")
pst_df.show()
# The DataFrame created under UTC is now displayed in US/Pacific
print("UTC converted datetime in US/Pacific timezone")
utc_df.show()

UTC converted datetime in UTC timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2046-01-01 00:00:00|2046-01-01 08:00:00|
+-------------------+-------------------+

US/Pacific converted datetime in US/Pacific timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2046-01-01 00:00:00|2046-01-01 00:00:00|
+-------------------+-------------------+

UTC converted datetime in US/Pacific timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2045-12-31 16:00:00|2046-01-01 00:00:00|
+-------------------+-------------------+
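
The three tables above are consistent with the following model, sketched in plain Python: a naive value is read as wall-clock time in the session time zone, while an aware value keeps its own offset. This is an interpretation of the outputs, not Spark's actual code; `to_epoch_micros` is a hypothetical helper and the fixed UTC-08:00 offset stands in for US/Pacific in winter.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(dt: datetime, session_tz: timezone) -> int:
    """Model of ingestion: naive values take session_tz, aware values keep their offset."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=session_tz)
    return (dt - EPOCH) // timedelta(microseconds=1)


utc = timezone.utc
pacific_winter = timezone(timedelta(hours=-8))  # assumption: stands in for US/Pacific

naive = datetime(2046, 1, 1)
aware = datetime(2046, 1, 1, tzinfo=pacific_winter)

# Session UTC: naive and aware end up 8 hours apart
print(to_epoch_micros(naive, utc), to_epoch_micros(aware, utc))
# Session US/Pacific: both describe the same instant
print(to_epoch_micros(naive, pacific_winter), to_epoch_micros(aware, pacific_winter))
```

Under this model the naive value ingested with a UTC session is 8 hours earlier than the aware one, which is why it displays as 2045-12-31 16:00:00 once the session switches to US/Pacific.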



## Convert spark datetime back to pandas

In the first cell below, the session time zone is US/Pacific.

In the second cell, the session time zone is UTC.

In [4]:
# Set the session time zone to US/Pacific
spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
# Convert the Spark DataFrame back to a pandas DataFrame.
# Spark stores no time zone, so the resulting pandas timestamps cannot carry one.
ppst_df1 = pst_df.toPandas()
print(ppst_df1.head())

# Compare the value from the original pandas DataFrame with the value that went through Spark.
print(f"spark converted pandas data frame {ppst_df1['aware'][0]}")
print(f"pandas origin data frame{pdf['aware'][0]}")

# The difference should be 0, but the Spark-converted DataFrame lost the time zone info, so we get an 8-hour difference.
print(f"time zone hours {(ppst_df1['aware'][0].timestamp() - pdf['aware'][0].timestamp()) / 3600}")

       naive      aware
0 2046-01-01 2046-01-01
spark converted pandas data frame 2046-01-01 00:00:00
pandas origin data frame2046-01-01 00:00:00.000000500-08:00
time zone hours -8.0
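
The -8.0 above is consistent with the round-tripped value keeping the Pacific wall-clock digits while being read as UTC when `.timestamp()` is called on the naive pandas value. A minimal standard-library sketch of that arithmetic, with illustrative variable names and the sub-microsecond part omitted:

```python
from datetime import datetime, timedelta, timezone

# Original aware value: midnight at UTC-08:00
original = datetime(2046, 1, 1, tzinfo=timezone(timedelta(hours=-8)))

# What comes back from Spark: the same wall-clock digits, no time zone
round_tripped = datetime(2046, 1, 1)

# Assumption for this sketch: the naive value is interpreted as UTC
as_utc = round_tripped.replace(tzinfo=timezone.utc)

shift_hours = (as_utc - original) / timedelta(hours=1)
print(shift_hours)
```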


Note that the surprising shift for aware doesn’t happen when the session time zone is UTC (but the timestamps still become “time zone naive”):


In [5]:
# set the session timezone to UTC again
spark.conf.set("spark.sql.session.timeZone", "UTC")

ppst_df2 = pst_df.toPandas()
print(ppst_df2.head())

# Compare the value from the original pandas DataFrame with the value that went through Spark.
print(f"spark converted pandas data frame {ppst_df2['aware'][0]}")
print(f"pandas origin data frame{pdf['aware'][0]}")

# With the session time zone set to UTC the difference is 0: the instant is preserved, even though the time zone info is still lost.
print(f"time zone hours {(ppst_df2['aware'][0].timestamp() - pdf['aware'][0].timestamp()) / 3600}")

                naive               aware
0 2046-01-01 08:00:00 2046-01-01 08:00:00
spark converted pandas data frame 2046-01-01 08:00:00
pandas origin data frame2046-01-01 00:00:00.000000500-08:00
time zone hours 0.0


# Test the date compatibility of the output Parquet files

In the tests above, we tested data conversion through the frameworks' in-memory converters.

Now, if we write the data to a Parquet file with pyarrow and read it with Spark, and vice versa, is it still compatible?

In [6]:
# 1. Create a pandas DataFrame; it is written to a Parquet file in the next step
pdf = pd.DataFrame({'naive': [datetime(2046, 1, 1, 0)],
                    'aware': [Timestamp(year=2046, month=1, day=1,
                                        nanosecond=500, tz=timezone(timedelta(hours=-8)))]})
# Print the pandas DataFrame to show both datetime columns
print(pdf.head())

       naive                               aware
0 2046-01-01 2046-01-01 00:00:00.000000500-08:00


In [19]:
# 2. Write it as a Parquet dataset
def write_parquet_as_partitioned_dataset(table, endpoint, bucket_name, path, partition_cols=None,
                                         compression="SNAPPY", version="1.0"):
    url = f"https://{endpoint}"
    fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    if version == "1.0":
        # Without coerce_timestamps='ms' the write fails, because the nanosecond
        # timestamps are not converted automatically.
        # allow_truncated_timestamps=True lets the write proceed even though time precision is lost.
        pq.write_to_dataset(table, root_path=file_uri, partition_cols=partition_cols, filesystem=fs,
                            compression=compression, version=version,
                            coerce_timestamps='ms', allow_truncated_timestamps=True)
    elif version == "2.0":
        pq.write_to_dataset(table, root_path=file_uri, partition_cols=partition_cols, filesystem=fs,
                            compression=compression, version=version)
    else:
        raise ValueError("The parquet version must be 1.0 or 2.0")

# Omit the index by using preserve_index=False
table = pa.Table.from_pandas(pdf, preserve_index=False)

In [11]:
# Writing Parquet format version 1.0 with Arrow: the timestamp cast between pandas and Arrow loses data, e.g.
#   Casting from timestamp[ns, tz=-08:00] to timestamp[us] would lose data: 2398406400000000500
# With format version 2.0, this message no longer appears.

endpoint=os.environ['AWS_S3_ENDPOINT']
bucket_name="pengfei"
path_v1="diffusion/data_format/timestamp_compability/arrow_time_v1.0"
path_v2="diffusion/data_format/timestamp_compability/arrow_time_v2.0"



In [20]:
# write parquet with format version 1.0
write_parquet_as_partitioned_dataset(table, endpoint, bucket_name, path_v1,version="1.0")

In [21]:
# write parquet with format version 2.0
write_parquet_as_partitioned_dataset(table, endpoint, bucket_name, path_v2,version="2.0")

In [26]:
# 3. Arrow reads it back
# This function reads a Parquet dataset (partitioned Parquet files) from S3 and returns an Arrow table
def read_parquet_from_s3(endpoint: str, bucket_name, path):
    url = f"https://{endpoint}"
    fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': url})
    file_uri = f"{bucket_name}/{path}"
    str_info = fs.info(file_uri)
    print(f"input file metadata: {str_info}")
    dataset = pq.ParquetDataset(file_uri, filesystem=fs, metadata_nthreads=8)
    table = dataset.read()
    return table

# 4. Spark reads the different Parquet versions


In [22]:
spath_v1=f"s3a://pengfei/{path_v1}"
spath_v2=f"s3a://pengfei/{path_v2}"


In [23]:
# Check the compatibility of Arrow Parquet v1
df_v1=spark.read.parquet(spath_v1)
df_v1=df_v1.withColumn("now", lit(unix_timestamp()))
# In the DataFrame schema, the naive and aware columns are both recognized as type timestamp automatically,
# because the pandas/Arrow write coerced the nanosecond values to a coarser unit that Spark maps to its timestamp column type.
df_v1.printSchema()
df_v1.show()

root
 |-- naive: timestamp (nullable = true)
 |-- aware: timestamp (nullable = true)
 |-- now: long (nullable = true)

+-------------------+-------------------+----------+
|              naive|              aware|       now|
+-------------------+-------------------+----------+
|2046-01-01 00:00:00|2046-01-01 08:00:00|1632823275|
+-------------------+-------------------+----------+



In [24]:
spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
df_v1.show()

+-------------------+-------------------+----------+
|              naive|              aware|       now|
+-------------------+-------------------+----------+
|2045-12-31 16:00:00|2046-01-01 00:00:00|1632823439|
+-------------------+-------------------+----------+



In [31]:
# Check the compatibility of Arrow Parquet v2
df_v2=spark.read.parquet(spath_v2)
df_v2=df_v2.withColumn("now", lit(unix_timestamp()))
df_v2.printSchema()
df_v2.show()

root
 |-- naive: long (nullable = true)
 |-- aware: long (nullable = true)
 |-- now: long (nullable = true)

+-------------------+-------------------+----------+
|              naive|              aware|       now|
+-------------------+-------------------+----------+
|2398377600000000000|2398406400000000500|1632499078|
+-------------------+-------------------+----------+



In [34]:

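# from_unixtime expects seconds since the UNIX epoch. The naive and aware columns hold nanoseconds here,
# so their converted values are meaningless; only the now column (seconds) converts correctly.
df_v2_convert = df_v2.select(
        from_unixtime(col("naive"), "MM-dd-yyyy HH:mm:ss").alias("naive_convert"),
        from_unixtime(col("aware"), "MM-dd-yyyy HH:mm:ss").alias("aware_convert"),
        from_unixtime(col("now"), "MM-dd-yyyy HH:mm:ss").alias("now_convert"))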

df_v2_convert.show(truncate=False)

+----------------------+---------------------+-------------------+
|naive_convert         |aware_convert        |now_convert        |
+----------------------+---------------------+-------------------+
|03-16-+183309 11:28:57|10-10--73164 03:33:37|09-24-2021 16:01:22|
+----------------------+---------------------+-------------------+



In [50]:
# Dividing the nanosecond values by 1e9 yields seconds (not microseconds, despite the column names),
# which is the unit from_unixtime expects.
df_v2_nano_to_micro = (df_v2.withColumn("micro_naive", col("naive") / 1000000000)
                            .withColumn("micro_aware", col("aware") / 1000000000)
                            .withColumn("convert_naive", from_unixtime(col("micro_naive"), "yyyy-MM-dd HH:mm:ss"))
                            .withColumn("convert_aware", from_unixtime(col("micro_aware"), "yyyy-MM-dd HH:mm:ss"))
                            .withColumn("convert_now", from_unixtime(col("now"), "yyyy-MM-dd HH:mm:ss")))

In [52]:
df_v2_nano_to_micro.select("convert_naive","convert_aware","convert_now").show()

+-------------------+-------------------+-------------------+
|      convert_naive|      convert_aware|        convert_now|
+-------------------+-------------------+-------------------+
|2046-01-01 00:00:00|2046-01-01 08:00:00|2021-09-24 16:17:41|
+-------------------+-------------------+-------------------+
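
The division above goes through a floating-point value and the result is formatted to whole seconds. As a cross-check outside Spark, the same nanosecond longs can be decoded with integer arithmetic only. This is a standard-library sketch; `ns_to_datetime` is a hypothetical helper, and the result is expressed in UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns: int) -> Tuple[datetime, int]:
    """Decode nanoseconds since the epoch into a UTC datetime plus the leftover nanoseconds."""
    micros, leftover_ns = divmod(ns, 1_000)
    return EPOCH + timedelta(microseconds=micros), leftover_ns


# Values taken from the v2 DataFrame shown earlier
naive_dt, naive_rest = ns_to_datetime(2398377600000000000)
aware_dt, aware_rest = ns_to_datetime(2398406400000000500)

print(naive_dt, naive_rest)
print(aware_dt, aware_rest)
```

The decoded instants match the `convert_naive` and `convert_aware` columns, and the leftover part shows the 500 nanoseconds that a microsecond-resolution type cannot hold.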

