# PySpark SparkSQL Date/Time

## Date/Time Format Patterns

* [A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0](https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html) (must read)

* [Datetime Patterns for Formatting and Parsing](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)

> Spark uses pattern letters in the following table for date and timestamp parsing and formatting:

* [SparkSQL - Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html)
* [Migration Guide: SQL, Datasets and DataFrame](https://spark.apache.org/docs/latest/sql-migration-guide.html)

* [Deep Dive into Apache Spark DateTime Functions](https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-datetime-functions-b66de737950a)

> Catalog of DateTime functions in Apache Spark

In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

In [3]:
import os
import sys
import gc
from datetime import (
    datetime,
    date
)

# Environment Variables

## Hadoop

In [4]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"

In [5]:
%%bash
export HADOOP_CONF_DIR="/opt/hadoop/hadoop-3.2.2/etc/hadoop"
ls $HADOOP_CONF_DIR | head -n 5

capacity-scheduler.xml
configuration.xsl
container-executor.cfg
core-site.xml
core-site.xml.48132.2022-02-15@12:29:41~


## PYTHONPATH

Load the **pyspark** modules from ```$SPARK_HOME/python/lib``` in the Spark installation.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (standalone only).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

In [6]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])
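
The hard-coded paths above can also be discovered at runtime, mirroring the shell one-liner quoted from the install guide. This is a minimal sketch under assumptions: the helper names `spark_lib_zips` and `add_spark_to_sys_path` are hypothetical, and `SPARK_HOME` is assumed to point at an extracted Spark distribution whose `python/lib` directory holds the pyspark and py4j zip files.

```python
import glob
import os
import sys


def spark_lib_zips(spark_home):
    """Return the sorted *.zip files under <spark_home>/python/lib (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))


def add_spark_to_sys_path(spark_home):
    """Append zips that are not yet on sys.path and return the ones added."""
    added = [z for z in spark_lib_zips(spark_home) if z not in sys.path]
    sys.path.extend(added)
    return added


# SPARK_HOME is read from the environment; nothing happens if it is unset.
_spark_home = os.environ.get("SPARK_HOME")
if _spark_home:
    print(add_spark_to_sys_path(_spark_home))
```

Globbing avoids pinning the py4j version in the notebook, at the cost of depending on the environment variable being set before the kernel starts.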

## Python packages

Execute after the PYTHONPATH setup.

### pyspark.sql.functions

See the [pyspark.sql.functions module](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#module-pyspark.sql.functions) for the functions available for import. The Spark SQL [Built-in Functions](https://spark.apache.org/docs/latest/api/sql/index.html#day) reference lists SQL functions such as ```day``` that the linked [pyspark.sql.functions module](https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#module-pyspark.sql.functions) page does not have, so they cannot be imported from Python. Call them through ```spark.sql``` or use the Python counterpart (for example ```dayofmonth```).

In [7]:
import pyspark.sql 
from pyspark.sql.types import *
from pyspark.sql.functions import (
    col,
    lit,
    avg,
    stddev,
    isnan,
    to_date,
    to_timestamp,
    date_format,
    year,
    month,
    hour,
    minute,
    second,
)

---
# Spark Session


In [8]:
from pyspark.sql import SparkSession

In [9]:
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

2022-02-20 19:40:53,883 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-20 19:40:56,300 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2022-02-20 19:40:58,938 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [10]:
NUM_CORES = 4
NUM_PARTITIONS = 3

spark.conf.set("spark.sql.shuffle.partitions", NUM_CORES * NUM_PARTITIONS)
spark.conf.set("spark.default.parallelism", NUM_CORES * NUM_PARTITIONS)

# Date/Time Format String

* [Datetime Patterns for Formatting and Parsing](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)


| Symbol | Meaning                      | Presentation | Examples                                       |
|--------|------------------------------|--------------|------------------------------------------------|
| G      | era                          | text         | AD; Anno Domini                                |
| y      | year                         | year         | 2020; 20                                       |
| D      | day-of-year                  | number(3)    | 189                                            |
| M/L    | month-of-year                | month        | 7; 07; Jul; July                               |
| d      | day-of-month                 | number(3)    | 28                                             |
| Q/q    | quarter-of-year              | number/text  | 3; 03; Q3; 3rd quarter                         |
| E      | day-of-week                  | text         | Tue; Tuesday                                   |
| F      | aligned day of week in month | number(1)    | 3                                              |
| a      | am-pm-of-day                 | am-pm        | PM                                             |
| h      | clock-hour-of-am-pm (1-12)   | number(2)    | 12                                             |
| K      | hour-of-am-pm (0-11)         | number(2)    | 0                                              |
| k      | clock-hour-of-day (1-24)     | number(2)    | 0                                              |
| H      | hour-of-day (0-23)           | number(2)    | 0                                              |
| m      | minute-of-hour               | number(2)    | 30                                             |
| s      | second-of-minute             | number(2)    | 55                                             |
| S      | fraction-of-second           | fraction     | 978                                            |
| V      | time-zone ID                 | zone-id      | America/Los_Angeles; Z; -08:30                 |
| z      | time-zone name               | zone-name    | Pacific Standard Time; PST                     |
| O      | localized zone-offset        | offset-O     | GMT+8; GMT+08:00; UTC-08:00;                   |
| X      | zone-offset 'Z' for zero     | offset-X     | Z; -08; -0830; -08:30; -083015; -08:30:15;     |
| x      | zone-offset                  | offset-x     | +0000; -08; -0830; -08:30; -083015; -08:30:15; |
| Z      | zone-offset                  | offset-Z     | +0000; -0800; -08:00;                          |
| '      | escape for text              | delimiter    |                                                |
| ''     | single quote                 | literal      | '                                              |
| [      | optional section start       |              |                                                |
| ]      | optional section end         |              |                                                |
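
As a cross-check when reading the table, a few of the pattern letters have rough counterparts in Python's `strftime`. The mapping below is an illustration only and not part of Spark: the dictionary name `SPARK_TO_STRFTIME` is hypothetical, and it covers only patterns whose output coincides with `strftime` in the default C locale.

```python
from datetime import datetime

# Hypothetical lookup: Spark pattern -> closest Python strftime directive.
SPARK_TO_STRFTIME = {
    "yyyy": "%Y",  # four-digit year
    "yy": "%y",    # two-digit year
    "MM": "%m",    # zero-padded month number
    "MMM": "%b",   # abbreviated month name
    "dd": "%d",    # zero-padded day of month
    "DDD": "%j",   # zero-padded day of year
    "E": "%a",     # abbreviated day name
    "HH": "%H",    # hour of day 00-23
    "hh": "%I",    # clock hour 01-12
    "mm": "%M",    # minute
    "ss": "%S",    # second
    "a": "%p",     # AM/PM marker
}

sample = datetime(2007, 11, 13, 9, 0, 0)
for pattern, directive in SPARK_TO_STRFTIME.items():
    print(f"{pattern:>4} -> {sample.strftime(directive)}")
```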

# Data Types


* [Data Types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html#data-types)

```from pyspark.sql.types import *```

| Data type | Value type in Python | API to access or create a data type |  |
|:---|:---|:---|:--|
|ByteType | int or long Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | ByteType() |  |
| ShortType | int or long Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | ShortType() |  |
| IntegerType | int or long | IntegerType() |  |
| LongType | long Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. | LongType() |  |
| FloatType | float Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |  |
| DoubleType | float | DoubleType() |  |
| DecimalType | decimal.Decimal | DecimalType() |  |
| StringType | string | StringType() |  |
| BinaryType | bytearray | BinaryType() |  |
| BooleanType | bool | BooleanType() |  |
| TimestampType | datetime.datetime | TimestampType() |  |
| DateType | datetime.date | DateType() |  |
| ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]) Note:The default value of containsNull is True. |  |
| MapType | dict | MapType(keyType, valueType, [valueContainsNull]) Note:The default value of valueContainsNull is True. |  |
| StructType | list or tuple | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |  |
| StructField | The value type in Python of the data type of this field (For example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]) Note: The default value of nullable is True. |  |


# Date/Timestamp Literals

* [Datetime Literal](https://spark.apache.org/docs/latest/sql-ref-literals.html#datetime-literal)

### Date Literal

> A datetime literal is used to specify a date or timestamp value.
> ```
> DATE { 'yyyy' |
>        'yyyy-[m]m' |
>        'yyyy-[m]m-[d]d' |
>        'yyyy-[m]m-[d]d[T]'  }
> ```

Example: ```DATE '2011-11-11'``` is a date literal, which the Spark SQL engine parses according to its own literal syntax shown above.
```
SELECT DATE '2011-11-11' AS col;
+----------+
|       col|
+----------+
|2011-11-11|
+----------+
```

### Timestamp Literal

> ```
> TIMESTAMP { 'yyyy' |
>             'yyyy-[m]m' |
>             'yyyy-[m]m-[d]d' |
>             'yyyy-[m]m-[d]d ' |
>             'yyyy-[m]m-[d]d[T][h]h[:]' |
>             'yyyy-[m]m-[d]d[T][h]h:[m]m[:]' |
>             'yyyy-[m]m-[d]d[T][h]h:[m]m:[s]s[.]' |
>             'yyyy-[m]m-[d]d[T][h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]' }
> ```

Example:

```
SELECT TIMESTAMP '1997-01-31 09:26:56.66666666UTC+08:00' AS col;
+--------------------------+
|                      col |
+--------------------------+
|1997-01-30 17:26:56.666666|
+--------------------------+
```
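
The shift from 09:26 to 17:26 on the previous day is an offset conversion. The sketch below redoes the arithmetic in plain Python, assuming (it is not stated here) that the session time zone of the documentation example is UTC-08:00. It also truncates the eight fractional digits to microseconds, which matches the six digits shown in the output.

```python
from datetime import datetime, timedelta, timezone

# Literal from the example: 1997-01-31 09:26:56.66666666 at UTC+08:00.
fraction = "66666666"
micros = int(fraction[:6])  # keep microsecond precision, drop the extra digits
literal = datetime(1997, 1, 31, 9, 26, 56, micros,
                   tzinfo=timezone(timedelta(hours=8)))

# Assumed session time zone (inferred from the output, not given in the text).
session_tz = timezone(timedelta(hours=-8))
shown = literal.astimezone(session_tz)

print(shown.strftime("%Y-%m-%d %H:%M:%S.%f"))
```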

---
# 2 digit-year handling

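With the default parser in Spark 3.0 and later, ```to_date``` resolves a two-digit year against the base value 2000, so ```31-DEC-98``` becomes ```2098-12-31``` rather than ```1998-12-31```.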

* [spark to_date function - how to convert 31-DEC-98 to 1998-12-31 not 2098-12-31](https://stackoverflow.com/questions/71182230)

> On Spark 3.0, a new dates parser was introduced, with a changed behavior for dealing with 2 digits year.
> You could find a reference for the change under Upgrading from Spark SQL 2.4 to 3.0.
> ```spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')``` will give you the original behavior with the required results

```
from pyspark.sql import functions as F

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

(spark.createDataFrame([('31-DEC-98',)], 'my_date string')
 .select(F.to_date('my_date','dd-MMM-yy')
 .alias('my_new_date')).show()
)

+-----------+
|my_new_date|
+-----------+
| 1998-12-31|
+-----------+
```
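
The two outcomes can be modelled in plain Python. This is a sketch, not Spark code: `base_2000_year` follows the base-value rule quoted from the pattern documentation, and `sliding_window_year` follows the "80 years before, 20 years after" rule documented for Java's `SimpleDateFormat`, which is assumed here to be what the `LEGACY` policy relies on. Both function names are hypothetical, and the window is computed at year granularity only.

```python
from datetime import date


def base_2000_year(yy):
    """Base-value rule: a two-digit year maps into 2000..2099."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 2000 + yy


def sliding_window_year(yy, today):
    """Window rule: pick the century so the year falls within
    80 years before and 20 years after `today` (year granularity only)."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    window_start = today.year - 80
    candidate = window_start - window_start % 100 + yy
    if candidate < window_start:
        candidate += 100
    return candidate


run_date = date(2022, 2, 20)  # date taken from the session log in this notebook
print(base_2000_year(98))                 # 2098
print(sliding_window_year(98, run_date))  # 1998
```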

* [spark - where is spark.sql.legacy.timeParserPolicy documented](https://stackoverflow.com/questions/71190476/spark-where-is-spark-sql-legacy-timeparserpolicy-documented)

In [11]:
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

---
# Extract date/time element


* [Spark documentation - date_format](https://spark.apache.org/docs/latest/api/sql/index.html#date_format)

> date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.
> * timestamp - A date/timestamp or string to be converted to the given format.
> * fmt - Date/time format pattern to follow. See Datetime Patterns for valid date and time format patterns.


## Year

* [Datetime Patterns for Formatting and Parsing](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)

> Year: The count of letters determines the minimum field width below which padding is used. **If the count of letters is two, then a reduced two digit form is used**. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive.
> 
> **If the count of letters is less than four (but not two)**, then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when ‘G’ is not present. 7 or more letters will fail.

In [12]:
spark.sql("select date_format(date '2007-11-13T09:00', 'y') AS year_number").show()

+-----------+
|year_number|
+-----------+
|       2007|
+-----------+



                                                                                

In [13]:
spark.sql("select date_format(date '2007-11-13T09:00', 'yy') AS year_number").show()

+-----------+
|year_number|
+-----------+
|         07|
+-----------+



## Month

* [Datetime Patterns for Formatting and Parsing](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)

> Month: It follows the rule of Number/Text. The text form is depend on letters - ‘M’ denotes the ‘standard’ form, and ‘L’ is for ‘stand-alone’ form. These two forms are different only in some certain languages. For example, in Russian, ‘Июль’ is the stand-alone form of July, and ‘Июля’ is the standard form. Here are examples for all supported pattern letters:

```
select date_format(date '1970-01-01', "M")
1
```

```
select date_format(date '1970-09-01', "MM")
09
```

```
select date_format(date '1970-01-01', "d MMM")
1 Jan
```

```
select date_format(date '1970-01-01', "d MMMM")
1 January
```

In [14]:
spark.sql("select date_format(date '2007-11-13T09:00', 'MMM') AS month_text").show()

+----------+
|month_text|
+----------+
|       Nov|
+----------+



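## Week in month (?)

Not yet verified. The pattern table above lists ```F``` as "aligned day of week in month", while this heading reflects the week-in-month reading. The ```LEGACY``` time parser policy set earlier is in effect for the cells in this section.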

* [spark - what is F string for the date/time format?](https://stackoverflow.com/questions/71190684/spark-what-is-f-string-for-the-date-time-format)

In [15]:
spark.sql("select date_format(date '2007-11-10', 'F') AS day_in_week_text").show()

+----------------+
|day_in_week_text|
+----------------+
|               2|
+----------------+



In [16]:
spark.sql("select date_format(date '2007-11-17', 'F') AS day_in_week_text").show()

+----------------+
|day_in_week_text|
+----------------+
|               3|
+----------------+
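
To help decide between the two readings of `F`, the sketch below computes both candidates in plain Python for the two dates used above. The function names are hypothetical and the formulas are the usual "aligned" definitions counted from the 1st of the month. It is an arithmetic cross-check, not a statement of how Spark implements the pattern.

```python
from datetime import date


def aligned_week_of_month(d):
    """1 for days 1-7, 2 for days 8-14, and so on."""
    return (d.day - 1) // 7 + 1


def aligned_day_of_week_in_month(d):
    """Position 1-7 inside the aligned week, counting from the 1st of the month."""
    return (d.day - 1) % 7 + 1


for d in (date(2007, 11, 10), date(2007, 11, 17)):
    print(d, aligned_week_of_month(d), aligned_day_of_week_in_month(d))
```

Only the week-style value gives 2 and 3 for these two Saturdays; the day-style value is 3 for both.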



## Day

### Day in the month

In [17]:
spark.sql("select date_format(date '2007-11-13T09:00', 'd') AS day_in_month_number").show()

+-------------------+
|day_in_month_number|
+-------------------+
|                 13|
+-------------------+



### Day in the year

In [18]:
spark.sql("select date_format(date '2007-11-13T09:00', 'D') AS day_in_year_number").show()

+------------------+
|day_in_year_number|
+------------------+
|               317|
+------------------+



### Day in the week

In [19]:
spark.sql("select date_format(date '2007-11-13T09:00', 'E') AS day_in_week_text").show()

+----------------+
|day_in_week_text|
+----------------+
|             Tue|
+----------------+



In [33]:
query = """
SELECT
    date_format(to_date('2007-11-13', 'yyyy-MM-dd'), 'E') AS day_in_week_text,
    dayofweek(to_date('2007-11-13', 'yyyy-MM-dd')) AS day_in_week_num
"""
spark.sql(query).show()

+----------------+---------------+
|day_in_week_text|day_in_week_num|
+----------------+---------------+
|             Tue|              3|
+----------------+---------------+
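
`dayofweek` numbers the days from Sunday = 1 to Saturday = 7, which differs from Python's ISO numbering (Monday = 1 to Sunday = 7). A small plain-Python sketch of the conversion, with the hypothetical helper name `spark_dayofweek`, can be used to sanity-check results like the one above.

```python
from datetime import date


def spark_dayofweek(d):
    """Sunday=1 ... Saturday=7, derived from ISO numbering (Monday=1 ... Sunday=7)."""
    return d.isoweekday() % 7 + 1


d = date(2007, 11, 13)
# Day-of-week number and day-of-year for the date used in the cells above.
print(spark_dayofweek(d), d.timetuple().tm_yday)
```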



## Hour in the day (0-23)

* [spark - how to extract hour from timestamp?](https://stackoverflow.com/questions/71190604/spark-how-to-extract-hour-from-timestamp)

> why the date_format does not extract 08:15 for 8:15am
> ```
> spark.sql("select date_format(date '1994-11-05T08:15:30-05:00', 'HH:mm') AS hour_in_day_number").show()
>
>+------------------+
>|hour_in_day_number|
>+------------------+
>|             00:00|
>+------------------+
>```

> You used date, which only keeps the year, month and day.
> ```date '1994-11-05T08:15:30-05:00'```
> You can try using timestamp as below:
> ```timestamp '1994-11-05T08:15:30-05:00'```

In [20]:
spark.sql("select date_format(timestamp '1994-11-05T08:15:30-05:00', 'hh:mm') AS hour_in_day_number").show()

+------------------+
|hour_in_day_number|
+------------------+
|             12:15|
+------------------+



In [21]:
spark.sql("select date_format(timestamp '1994-11-05T08:15:30-05:00', 'HH:mm') AS hour_in_day_number").show()

+------------------+
|hour_in_day_number|
+------------------+
|             00:15|
+------------------+



In [22]:
spark.sql("select date_format(date '1994-11-05T08:15:30-05:00', 'kk:mm') AS hour_in_day_number").show()

+------------------+
|hour_in_day_number|
+------------------+
|             24:00|
+------------------+
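
The results above depend on two things: the session time zone the timestamp is rendered in, and which hour letter is used. The sketch below reproduces both in plain Python. The UTC+11:00 session offset is an assumption inferred from the `00:15` result, not something stated in this notebook. The `kk:mm` cell above prints `24:00` because it uses a `date` literal, which drops the time part.

```python
from datetime import datetime, timedelta, timezone

# Literal from the cells above: 1994-11-05T08:15:30-05:00.
ts = datetime(1994, 11, 5, 8, 15, 30, tzinfo=timezone(timedelta(hours=-5)))

# Assumed session time zone, inferred from the 00:15 result.
session_tz = timezone(timedelta(hours=11))
local = ts.astimezone(session_tz)

hour_of_day = local.hour                  # 'H': 0-23
clock_hour_am_pm = local.hour % 12 or 12  # 'h': 1-12
clock_hour_of_day = local.hour or 24      # 'k': 1-24

print(f"HH:mm -> {hour_of_day:02d}:{local.minute:02d}")
print(f"hh:mm -> {clock_hour_am_pm:02d}:{local.minute:02d}")
print(f"kk:mm -> {clock_hour_of_day:02d}:{local.minute:02d}")
```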



## AM/PM

In [23]:
spark.sql("select date_format(timestamp '2007-11-13T09:00', 'a') AS AMPM").show()

+----+
|AMPM|
+----+
|  AM|
+----+



In [24]:
spark.sql("select date_format(date '2007-11-13', 'MMM') AS month_text").show()

+----------+
|month_text|
+----------+
|       Nov|
+----------+



---
# Date/Time Comparison

```to_date``` and ```date_format``` **expect a column** as their parameter. Use ```lit()``` to convert a literal value into a column holding that value.

* [pyspark.sql.functions.lit(col)](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.lit.html)

> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)


Note that ```date_format``` and ```to_date``` return a Column object.

In [25]:
date_format(date=lit(datetime(year=2000,month=12,day=11,hour=10,minute=20,second=20)),format="yyyy-MM-dd")

Column<'date_format(TIMESTAMP '2000-12-11 10:20:20', yyyy-MM-dd)'>

## Date comparison

In [27]:
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("number", DoubleType(), False),
    StructField("date", DateType(), False),
    StructField("boolean", BooleanType(), False),
])
data = [
    (1, "tako", 3.1415, date(year=2000,month=1,day=1), True),
    (2, "ika", 1.6180, date(year=1999,month=12,day=31), False)
]
df = spark.createDataFrame(data=data, schema=schema)
df.show()

+---+----+------+----------+-------+
| id|name|number|      date|boolean|
+---+----+------+----------+-------+
|  1|tako|3.1415|2000-01-01|   true|
|  2| ika| 1.618|1999-12-31|  false|
+---+----+------+----------+-------+



In [28]:
df.createOrReplaceTempView("df")
query = """
SELECT
    date_format(date, "EEE") AS day_text,
    dayofweek(date) AS day_num
FROM
    df
"""
spark.sql(query).show()

+--------+-------+
|day_text|day_num|
+--------+-------+
|     Sat|      7|
|     Fri|      6|
+--------+-------+



In [27]:
df.where(col("date")==date(year=2000,month=1,day=1)).show()

+---+----+------+----------+-------+
| id|name|number|      date|boolean|
+---+----+------+----------+-------+
|  1|tako|3.1415|2000-01-01|   true|
+---+----+------+----------+-------+



In [28]:
df.where(col("date")==to_date(lit("2000-01-01"))).show()

+---+----+------+----------+-------+
| id|name|number|      date|boolean|
+---+----+------+----------+-------+
|  1|tako|3.1415|2000-01-01|   true|
+---+----+------+----------+-------+



In [29]:
df.where(col("date")==date_format(lit(date(year=1999,month=12,day=31)), "yyyy-MM-dd")).show()

+---+----+------+----------+-------+
| id|name|number|      date|boolean|
+---+----+------+----------+-------+
|  2| ika| 1.618|1999-12-31|  false|
+---+----+------+----------+-------+



In [30]:
df.select(col("date")).show()

+----------+
|      date|
+----------+
|2000-01-01|
|1999-12-31|
+----------+



## Timestamp comparison

In [31]:
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("number", DoubleType(), False),
    StructField("datetime", TimestampType(), False),
    StructField("boolean", BooleanType(), False),
])
data = [
    (1, "tako", 3.1415, datetime(year=2000,month=1,day=1,hour=10,minute=20, second=30), True),
    (2, "ika", 1.6180, datetime(year=1999,month=12,day=31,hour=20,minute=50, second=34), False)
]
timestamp_df = spark.createDataFrame(data=data, schema=schema)
timestamp_df.show()

+---+----+------+-------------------+-------+
| id|name|number|           datetime|boolean|
+---+----+------+-------------------+-------+
|  1|tako|3.1415|2000-01-01 10:20:30|   true|
|  2| ika| 1.618|1999-12-31 20:50:34|  false|
+---+----+------+-------------------+-------+



In [32]:
timestamp_df.where(hour(col("datetime")) == 20).show()

+---+----+------+-------------------+-------+
| id|name|number|           datetime|boolean|
+---+----+------+-------------------+-------+
|  2| ika| 1.618|1999-12-31 20:50:34|  false|
+---+----+------+-------------------+-------+



In [33]:
timestamp_df.select("datetime", hour(col("datetime")) == 20).show()

+-------------------+---------------------+
|           datetime|(hour(datetime) = 20)|
+-------------------+---------------------+
|2000-01-01 10:20:30|                false|
|1999-12-31 20:50:34|                 true|
+-------------------+---------------------+



In [34]:
timestamp_df.where(
    date_format(col("datetime"), 'yyyy-MM-dd:HH') == "1999-12-31:20"
).show()

+---+----+------+-------------------+-------+
| id|name|number|           datetime|boolean|
+---+----+------+-------------------+-------+
|  2| ika| 1.618|1999-12-31 20:50:34|  false|
+---+----+------+-------------------+-------+



---
# Stop Spark Session

In [34]:
spark.stop()



# Cleanup

In [35]:
del spark
gc.collect()

1683