## Timestamp implementations in different frameworks

**Arrow timestamps** have three parts:
1. a **64-bit integer**,
2. **metadata** that associates a **time unit** with it (e.g. milliseconds, microseconds, or nanoseconds),
3. an **optional time zone**.

**Pandas (Timestamp)** has two parts:
1. a **64-bit integer** representing **nanoseconds**
2. an **optional time zone**.

Python/Pandas timestamp types without an associated time zone are referred to as “Time Zone Naive”.
Python/Pandas timestamp types with an associated time zone are referred to as “Time Zone Aware”.
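
As a minimal sketch of the two parts described above (assuming pandas is installed; the variable names are illustrative), `Timestamp.value` exposes the integer nanoseconds since the UNIX epoch and `Timestamp.tz` exposes the optional time zone:

```python
from datetime import timedelta, timezone

import pandas as pd

# A naive timestamp: integer nanoseconds, no time zone attached.
naive = pd.Timestamp("2046-01-01 00:00:00")
# An aware timestamp: same wall-clock time, pinned to a fixed UTC-08:00 offset.
aware = pd.Timestamp("2046-01-01 00:00:00", tz=timezone(timedelta(hours=-8)))

print(naive.tz)  # no time zone for a naive timestamp
print(aware.tz)  # the fixed UTC-08:00 offset

# .value is nanoseconds since the UNIX epoch (counted in UTC for aware values),
# so midnight at UTC-08:00 lies 8 hours after midnight read as UTC.
offset_hours = (aware.value - naive.value) / (3600 * 10**9)
print(offset_hours)
```

The same wall-clock time therefore maps to two different integers depending on whether a time zone is attached.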

**Spark timestamps** have one part:
1. a **64-bit integer** representing **microseconds since the UNIX epoch**.

Note that Spark does not store any time zone metadata with its timestamps. Spark interprets timestamps in
the session-local time zone (i.e. spark.sql.session.timeZone). If that time zone is undefined, Spark falls back to
the default system time zone.
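
The effect of the session time zone can be sketched in plain Python, without Spark. This is an illustration only: `render`, `stored_us`, and the fixed UTC-08:00 offset standing in for US/Pacific in winter are assumptions made for the example, not Spark APIs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render(epoch_us, session_tz):
    """Format stored epoch microseconds as wall-clock time in session_tz."""
    instant = EPOCH + timedelta(microseconds=epoch_us)
    return instant.astimezone(session_tz).strftime("%Y-%m-%d %H:%M:%S")


# Hypothetical stored value: 2046-01-01 08:00:00 UTC as epoch microseconds.
stored_us = (datetime(2046, 1, 1, 8, tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)

utc = timezone.utc
pacific_standard = timezone(timedelta(hours=-8))  # stand-in for US/Pacific in winter

print(render(stored_us, utc))               # 2046-01-01 08:00:00
print(render(stored_us, pacific_standard))  # 2046-01-01 00:00:00
```

The stored integer never changes; only the wall-clock rendering depends on the time zone used to interpret it.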

## Consequences of the different timestamp implementations

- Time zone information is lost: every timestamp converted from Spark to Arrow/pandas is “time zone naive”.

- Timestamps are truncated to microsecond precision, so any nanoseconds stored by pandas are dropped.

- The session time zone might have unintuitive impacts on the translation of timestamp values.
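
The truncation to microseconds can be sketched with integer arithmetic. This is a simplified model rather than Spark's actual conversion code; `to_spark_micros` and the sample value are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NS_PER_US = 1_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_spark_micros(epoch_ns):
    """Drop sub-microsecond digits (floor division; equals truncation for post-epoch values)."""
    return epoch_ns // NS_PER_US


# Hypothetical pandas-style value: 2046-01-01 08:00:00 UTC plus 500 nanoseconds.
whole_us = (datetime(2046, 1, 1, 8, tzinfo=timezone.utc) - EPOCH) // timedelta(microseconds=1)
epoch_ns = whole_us * NS_PER_US + 500

epoch_us = to_spark_micros(epoch_ns)
lost_ns = epoch_ns - epoch_us * NS_PER_US
print(lost_ns)  # 500
```

Converting the microsecond value back to nanoseconds cannot recover the 500 ns that were dropped.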


In [10]:
from datetime import datetime, timezone, timedelta

import pandas as pd
from pandas import Timestamp
from pyspark.sql import SparkSession

In [11]:
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("PandasSparkTimeStamp") \
    .getOrCreate()

pdf = pd.DataFrame({'naive': [datetime(2046, 1, 1, 0)],
                    'aware': [Timestamp(year=2046, month=1, day=1,
                                        nanosecond=500, tz=timezone(timedelta(hours=-8)))]})
# print the pandas DataFrame to show the original datetimes
print(pdf.head())
# Enable Arrow-based columnar data transfers
# (Spark 3.x renames this key to spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")


       naive                               aware
0 2046-01-01 2046-01-01 00:00:00.000000500-08:00


## Convert pandas datetime to spark

In [12]:
# set the Spark session time zone to UTC
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Spark interprets the naive datetime in the session time zone (UTC)
utc_df = spark.createDataFrame(pdf)
print("UTC converted datetime in UTC timezone")
utc_df.show()

# change the session time zone and convert the same pandas DataFrame again
spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
# Spark now interprets the naive datetime as US/Pacific local time
pst_df = spark.createDataFrame(pdf)
print("US/Pacific converted datetime in US/Pacific timezone")
pst_df.show()
# utc_df keeps its stored values; only their display shifts to US/Pacific
print("UTC converted datetime in US/Pacific timezone")
utc_df.show()

UTC converted datetime in UTC timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2046-01-01 00:00:00|2046-01-01 08:00:00|
+-------------------+-------------------+

US/Pacific converted datetime in US/Pacific timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2046-01-01 00:00:00|2046-01-01 00:00:00|
+-------------------+-------------------+

UTC converted datetime in US/Pacific timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2045-12-31 16:00:00|2046-01-01 00:00:00|
+-------------------+-------------------+



## Convert spark datetime back to pandas

In [13]:
# convert the Spark DataFrame back to a pandas DataFrame
# Spark stores no time zone, so the resulting pandas timestamps are time zone naive
ppst_df = pst_df.toPandas()
print(ppst_df.head())
print(ppst_df.info())

# compare the datetime in the original pandas DataFrame with the one that round-tripped through Spark
print(ppst_df['aware'][0])
print(pdf['aware'][0])
print(f"time zone hours {(ppst_df['aware'][0].timestamp() - pdf['aware'][0].timestamp()) / 3600}")

       naive      aware
0 2046-01-01 2046-01-01
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   naive   1 non-null      datetime64[ns]
 1   aware   1 non-null      datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 144.0 bytes
None
2046-01-01 00:00:00
2046-01-01 00:00:00.000000500-08:00
time zone hours -8.0
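
The -8.0 above can be reproduced without Spark. A sketch, assuming (as pandas does in `Timestamp.timestamp()`) that a naive value is read as UTC; the variable names are illustrative:

```python
from datetime import datetime, timedelta, timezone

# The original aware value: midnight at UTC-08:00.
original = datetime(2046, 1, 1, tzinfo=timezone(timedelta(hours=-8)))
# The round-tripped value lost its zone; reading the same wall-clock time as UTC
# mirrors how a naive pandas Timestamp is turned into epoch seconds.
round_tripped_as_utc = datetime(2046, 1, 1, tzinfo=timezone.utc)

shift_hours = (round_tripped_as_utc.timestamp() - original.timestamp()) / 3600
print(shift_hours)  # -8.0
```

The wall-clock digits survive the round trip, but the instant they denote moves by the dropped offset.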


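Note that the surprising shift for the aware column doesn’t happen when the session time zone is UTC (but the timestamps still become “time zone naive”). The cell below reuses ppst_df, which was converted while the session time zone was US/Pacific, so it still reports -8.0; the shift only disappears once toPandas() is called again under the UTC session time zone: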


In [18]:
# set the session time zone back to UTC
spark.conf.set("spark.sql.session.timeZone", "UTC")

# pst_df was created under US/Pacific; its stored values are now displayed in UTC
print("US/Pacific converted datetime in US/Pacific timezone")
pst_df.show()

print(f"spark converted to pandas aware time: {ppst_df['aware'][0]}")

print(f"pandas aware time: {pdf['aware'][0]}")

(ppst_df['aware'][0].timestamp()-pdf['aware'][0].timestamp())/3600

US/Pacific converted datetime in US/Pacific timezone
+-------------------+-------------------+
|              naive|              aware|
+-------------------+-------------------+
|2046-01-01 08:00:00|2046-01-01 08:00:00|
+-------------------+-------------------+

spark converted to pandas aware time: 2046-01-01 00:00:00
pandas aware time: 2046-01-01 00:00:00.000000500-08:00


-8.0

In [None]:
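from pyspark.sql import functions as f

# assumes df is an existing Spark DataFrame with a string column CallDate in dd/MM/yyyy format
# unix_timestamp returns epoch seconds (long); to_timestamp returns a Spark timestamp
df_mod = df.withColumn("callDate_unix", f.unix_timestamp("CallDate", "dd/MM/yyyy")) \
    .withColumn("callDate_ts", f.to_timestamp("CallDate", "dd/MM/yyyy"))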

df_mod.printSchema()