## Timestamp implementations in different frameworks:

**Arrow timestamps** have three parts:
1. a **64-bit integer**,
2. **metadata** that associates a **time unit** with the integer (e.g. milliseconds, microseconds, or nanoseconds),
3. an **optional time zone**.

**Pandas (Timestamp)** has two parts:
1. a **64-bit integer** representing **nanoseconds**
2. an **optional time zone**.

Python/Pandas timestamp types without an associated time zone are referred to as “Time Zone Naive”.
Python/Pandas timestamp types with an associated time zone are referred to as “Time Zone Aware”.
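
The naive/aware distinction can be checked with the standard library alone. The following is a minimal sketch; the helper name `is_aware` is an illustrative assumption, not part of pandas or Python.

```python
from datetime import datetime, timedelta, timezone

# A naive timestamp carries no time zone; an aware one carries a UTC offset.
naive = datetime(2019, 1, 1, 0)
aware = datetime(2019, 1, 1, 0, tzinfo=timezone(timedelta(hours=-8)))


def is_aware(ts: datetime) -> bool:
    """Return True when the timestamp carries a usable UTC offset (hypothetical helper)."""
    return ts.tzinfo is not None and ts.utcoffset() is not None


print(is_aware(naive))
print(is_aware(aware))
```

A pandas `Timestamp` subclasses `datetime`, so the same check should apply to it as well.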

**Spark timestamps** have one part:
1. a **64-bit integer** representing **microseconds since the UNIX epoch**.

Note that Spark does not store any time zone metadata with its timestamps. Spark interprets timestamps in
the session local time zone (i.e. `spark.sql.session.timeZone`). If that time zone is not set, Spark falls back to
the default system time zone.
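
The model described above can be imitated in plain Python, without Spark. This is only a sketch of the idea: `to_epoch_micros` and `render` are hypothetical helpers, and a fixed UTC-8 offset stands in for a named session time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(ts: datetime) -> int:
    """Collapse an aware datetime into integer microseconds since the UNIX epoch."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def render(micros: int, session_tz: timezone) -> str:
    """Display the stored integer as wall-clock time in the given zone."""
    moment = EPOCH + timedelta(microseconds=micros)
    return moment.astimezone(session_tz).strftime("%Y-%m-%d %H:%M:%S")


# Assumption: a fixed -08:00 offset stands in for the session time zone.
pst = timezone(timedelta(hours=-8))
stored = to_epoch_micros(datetime(2019, 1, 1, 0, tzinfo=pst))

print(stored)                        # the only thing that is stored
print(render(stored, timezone.utc))  # same integer, shown in UTC
print(render(stored, pst))           # same integer, shown in UTC-8
```

The stored integer never changes; only the rendering depends on the chosen zone.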

## Consequences of the different timestamp implementations:

- Time zone information is lost (all timestamps that result from converting from Spark to Arrow/pandas are “time zone naive”).

- Timestamps are truncated to microseconds, so any nanosecond component of a pandas timestamp is dropped.

- The session time zone might have unintuitive impacts on the translation of timestamp values.
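
The precision loss in the second point can be illustrated with integer arithmetic. This is a sketch that assumes truncation behaves like floor division of the nanosecond count; the helper names are hypothetical.

```python
NANOS_PER_MICRO = 1_000


def nanos_to_micros(nanos: int) -> int:
    """Truncate a nanosecond count to whole microseconds (floor division)."""
    return nanos // NANOS_PER_MICRO


def micros_to_nanos(micros: int) -> int:
    """Expand a microsecond count back to nanoseconds."""
    return micros * NANOS_PER_MICRO


# 2019-01-01 08:00:00.000000500 UTC expressed as nanoseconds since the epoch
original_ns = 1_546_329_600_000_000_500
round_tripped_ns = micros_to_nanos(nanos_to_micros(original_ns))

# Number of nanoseconds that do not survive the round trip
print(original_ns - round_tripped_ns)
```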


## Convert pandas datetime to Spark

In [None]:
from datetime import datetime, timezone, timedelta

import pandas as pd
from pandas import Timestamp
from pyspark.sql import SparkSession



In [None]:
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("PandasSparkTimeStamp") \
    .getOrCreate()

pdf = pd.DataFrame({'naive': [datetime(2019, 1, 1, 0)],
                    'aware': [Timestamp(year=2019, month=1, day=1,
                                        nanosecond=500, tz=timezone(timedelta(hours=-8)))]})
# Print the pandas DataFrame to see the original datetimes
print(pdf.head())
# Enable Arrow-based columnar data transfers
# (Spark 3.0+ names this option spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Set the Spark session time zone
spark.conf.set("spark.sql.session.timeZone", "UTC")

In [None]:


# Spark reads the datetimes with the session time zone set to UTC
utc_df = spark.createDataFrame(pdf)
utc_df.show()

# Change the Spark session time zone and read the same datetimes again
spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
pst_df = spark.createDataFrame(pdf)
pst_df.show()
# Show the first DataFrame again, now displayed under the new session time zone
utc_df.show()

In [None]:
# Convert the Spark DataFrame back to a pandas DataFrame.
# Spark stores no time zone, so the resulting pandas timestamps are time zone naive.
ppst_df = pst_df.toPandas()
print(ppst_df.head())
# info() prints its report itself and returns None, so it is not wrapped in print()
ppst_df.info()

# Compare the datetime in the original pandas DataFrame with the one that went through Spark
print(ppst_df['aware'][0])
print(pdf['aware'][0])
print(f"time zone offset in hours: {(ppst_df['aware'][0].timestamp() - pdf['aware'][0].timestamp()) / 3600}")
