# Datetime Functions

##### Objectives
1. Cast to a timestamp
2. Format datetimes
3. Extract attributes from a timestamp
4. Convert to a date
5. Manipulate datetimes

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html" target="_blank">Column</a>: **`cast`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#datetime-functions" target="_blank">Built-In Functions</a>: **`date_format`**, **`to_date`**, **`date_add`**, **`year`**, **`month`**, **`dayofweek`**, **`minute`**, **`second`**

In [None]:
%pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=a0608ea28c4a3cabaf7b0423d1e39a278751a10e8db4bd8c8618e4420ae9389f
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.master('local[*]').appName('datetimes').getOrCreate()
sc = SparkContext.getOrCreate()

In [None]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [None]:
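from ucimlrepo import fetch_ucirepo

# Fetch the UCI Air Quality dataset (id=360); the features arrive as a pandas DataFrame
air_quality = fetch_ucirepo(id=360)
df_aq = air_quality.data.features
# Convert the pandas DataFrame into a Spark DataFrame
df_aq = spark.createDataFrame(df_aq)
df_aq.show()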

+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+
|     Date|    Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|   T|  RH|    AH|
+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+
|3/10/2004|18:00:00|   2.6|       1360|     150|    11.9|         1046|    166|        1056|    113|        1692|       1268|13.6|48.9|0.7578|
|3/10/2004|19:00:00|   2.0|       1292|     112|     9.4|          955|    103|        1174|     92|        1559|        972|13.3|47.7|0.7255|
|3/10/2004|20:00:00|   2.2|       1402|      88|     9.0|          939|    131|        1140|    114|        1555|       1074|11.9|54.0|0.7502|
|3/10/2004|21:00:00|   2.2|       1376|      80|     9.2|          948|    172|        1092|    122|        1584|       1203|11.0|60.0|0.7867|

### Built-In Functions: Date Time Functions

| Method | Description |
| --- | --- |
| **`add_months`** | Returns the date that is numMonths after startDate |
| **`current_timestamp`** | Returns the current timestamp at the start of query evaluation as a timestamp column |
| **`date_format`** | Converts a date/timestamp/string to a string in the format given by the second argument |
| **`dayofweek`** | Extracts the day of the week as an integer from a given date/timestamp/string (1 = Sunday, 7 = Saturday) |
| **`from_unixtime`** | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format |
| **`minute`** | Extracts the minutes as an integer from a given date/timestamp/string. |
| **`unix_timestamp`** | Converts time string with given pattern to Unix timestamp (in seconds) |

### Cast to a Timestamp

#### **`cast()`**

In [None]:
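from pyspark.sql.functions import col

# 'Time' holds only a time of day (e.g. '18:00:00'), so the cast fills in the
# date on which the query runs, not the value stored in the 'Date' column.
timestamp_df = df_aq.withColumn('timestamp', col('Time').cast('timestamp'))
timestamp_df.show()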

+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+
|     Date|    Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|   T|  RH|    AH|          timestamp|
+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+
|3/10/2004|18:00:00|   2.6|       1360|     150|    11.9|         1046|    166|        1056|    113|        1692|       1268|13.6|48.9|0.7578|2023-12-04 18:00:00|
|3/10/2004|19:00:00|   2.0|       1292|     112|     9.4|          955|    103|        1174|     92|        1559|        972|13.3|47.7|0.7255|2023-12-04 19:00:00|
|3/10/2004|20:00:00|   2.2|       1402|      88|     9.0|          939|    131|        1140|    114|        1555|       1074|11.9|54.0|0.7502|2023-12-04 20:00:00|
|3/10/2004|21:00:00|  

#### Datetime Patterns

Spark uses datetime patterns in several common scenarios:

- CSV/JSON data sources use the pattern string to parse and format datetime content.
- Datetime functions that convert StringType to or from DateType or TimestampType, for example `unix_timestamp`, `date_format`, `from_unixtime`, `to_date`, and `to_timestamp`.

##### Datetime Patterns for Formatting and Parsing

Spark uses <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html" target="_blank">pattern letters to parse and format dates and times</a>. A subset of these patterns is shown below.

| Symbol | Meaning         | Presentation | Examples               |
| ------ | --------------- | ------------ | ---------------------- |
| G      | era             | text         | AD; Anno Domini        |
| y      | year            | year         | 2020; 20               |
| D      | day-of-year     | number(3)    | 189                    |
| M/L    | month-of-year   | month        | 7; 07; Jul; July       |
| d      | day-of-month    | number(3)    | 28                     |
| Q/q    | quarter-of-year | number/text  | 3; 03; Q3; 3rd quarter |
| E      | day-of-week     | text         | Tue; Tuesday           |

#### Format date

#### **`date_format()`**

In [None]:
from pyspark.sql.functions import date_format

formatted_df = (
    timestamp_df
    .withColumn('date_string', date_format('timestamp', 'MMMM dd, yyyy'))
    .withColumn('time_string', date_format('timestamp', 'HH:mm:ss.SSSSSS'))
)

formatted_df.show()

+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+-----------------+---------------+
|     Date|    Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|   T|  RH|    AH|          timestamp|      date_string|    time_string|
+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+-----------------+---------------+
|3/10/2004|18:00:00|   2.6|       1360|     150|    11.9|         1046|    166|        1056|    113|        1692|       1268|13.6|48.9|0.7578|2023-12-04 18:00:00|December 04, 2023|18:00:00.000000|
|3/10/2004|19:00:00|   2.0|       1292|     112|     9.4|          955|    103|        1174|     92|        1559|        972|13.3|47.7|0.7255|2023-12-04 19:00:00|December 04, 2023|19:00:00.000000|
|3/10/2004|20:0

#### Extract Datetime Attributes from a Timestamp

#### **`year`**

##### Similar methods: **`month`**, **`dayofweek`**, **`minute`**, **`second`**, etc.

In [None]:
from pyspark.sql.functions import year, month, dayofweek, minute, second

datetime_df = (
    timestamp_df
    .withColumn('year', year(col('timestamp')))
    .withColumn('month', month(col('timestamp')))
    .withColumn('dayofweek', dayofweek(col('timestamp')))
    .withColumn('minute', minute(col('timestamp')))
    .withColumn('second', second(col('timestamp')))
)

datetime_df.show()

+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+----+-----+---------+------+------+
|     Date|    Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|   T|  RH|    AH|          timestamp|year|month|dayofweek|minute|second|
+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+----+-----+---------+------+------+
|3/10/2004|18:00:00|   2.6|       1360|     150|    11.9|         1046|    166|        1056|    113|        1692|       1268|13.6|48.9|0.7578|2023-12-04 18:00:00|2023|   12|        2|     0|     0|
|3/10/2004|19:00:00|   2.0|       1292|     112|     9.4|          955|    103|        1174|     92|        1559|        972|13.3|47.7|0.7255|2023-12-04 19:00:00|2023|   12|        2|     0|     0|
|3/10/2004

#### Convert to Date

#### **`to_date`**

In [None]:
from pyspark.sql.functions import to_date

date_df = timestamp_df.withColumn('date_', to_date(col('timestamp')))

date_df.show()

+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+----------+
|     Date|    Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|   T|  RH|    AH|          timestamp|     date_|
+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+----------+
|3/10/2004|18:00:00|   2.6|       1360|     150|    11.9|         1046|    166|        1056|    113|        1692|       1268|13.6|48.9|0.7578|2023-12-04 18:00:00|2023-12-04|
|3/10/2004|19:00:00|   2.0|       1292|     112|     9.4|          955|    103|        1174|     92|        1559|        972|13.3|47.7|0.7255|2023-12-04 19:00:00|2023-12-04|
|3/10/2004|20:00:00|   2.2|       1402|      88|     9.0|          939|    131|        1140|    114|        1555|       1074|11.9|

### Datetime Manipulation

#### **`date_add`**

In [None]:
from pyspark.sql.functions import date_add

# date_add returns a date, so the time of day is dropped from the result
plus_2_df = timestamp_df.withColumn('plus_two_days', date_add(col('timestamp'), 2))
plus_2_df.show()

+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+-------------+
|     Date|    Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|   T|  RH|    AH|          timestamp|plus_two_days|
+---------+--------+------+-----------+--------+--------+-------------+-------+------------+-------+------------+-----------+----+----+------+-------------------+-------------+
|3/10/2004|18:00:00|   2.6|       1360|     150|    11.9|         1046|    166|        1056|    113|        1692|       1268|13.6|48.9|0.7578|2023-12-04 18:00:00|   2023-12-06|
|3/10/2004|19:00:00|   2.0|       1292|     112|     9.4|          955|    103|        1174|     92|        1559|        972|13.3|47.7|0.7255|2023-12-04 19:00:00|   2023-12-06|
|3/10/2004|20:00:00|   2.2|       1402|      88|     9.0|          939|    131|        1140|    114|        1555|  