## Config stuff

In [1]:

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
import ConnectionConfig as cc
cc.setupEnvironment()


## Start the cluster
Look at the getActiveSession() method in the ConnectionConfig.py file. It returns the active session, adds the Delta package to it, and adds the extra jars that are needed to connect to the SQL Server database.

In [2]:
spark = cc.startLocalCluster("DIM_DATE",4)
spark.getActiveSession()

# Creating Date dimension from scratch

In this example we will build a date dimension from scratch.

## Step 1: Generate rows for a sequence of dates


In [3]:
#extract
from pyspark.sql.functions import *

beginDate = '2009-01-01'
endDate = '2023-12-31'

df_SQL = spark.sql(f"select explode(sequence(to_date('{beginDate}'), to_date('{endDate}'), interval 1 day)) as calendarDate, monotonically_increasing_id() as dateSK ")


df_SQL.createOrReplaceTempView('neededDates' )

spark.sql("select * from neededDates").show()

+------------+------+
|calendarDate|dateSK|
+------------+------+
|  2009-01-01|     0|
|  2009-01-02|     1|
|  2009-01-03|     2|
|  2009-01-04|     3|
|  2009-01-05|     4|
|  2009-01-06|     5|
|  2009-01-07|     6|
|  2009-01-08|     7|
|  2009-01-09|     8|
|  2009-01-10|     9|
|  2009-01-11|    10|
|  2009-01-12|    11|
|  2009-01-13|    12|
|  2009-01-14|    13|
|  2009-01-15|    14|
|  2009-01-16|    15|
|  2009-01-17|    16|
|  2009-01-18|    17|
|  2009-01-19|    18|
|  2009-01-20|    19|
+------------+------+


In this example a dataframe df_SQL is created based on the result of a select statement:
* ```spark.sql``` creates the date rows with a SQL query. All available SQL functions are listed [here](https://spark.apache.org/docs/latest/api/sql/).
* [```sequence```](https://spark.apache.org/docs/latest/api/sql/#sequence) creates an array of dates from the begin date to the end date, both inclusive. The interval is 1 day.
* [```explode```](https://spark.apache.org/docs/latest/api/sql/#explode) generates a row for each item in the array.
* [```monotonically_increasing_id```](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.monotonically_increasing_id.html#pyspark.sql.functions.monotonically_increasing_id) generates a unique, increasing id for each row, also in a clustered environment. The ids are not guaranteed to be consecutive.
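
To illustrate why the ids are unique but not necessarily consecutive: the Spark documentation describes the current implementation as putting the partition index in the upper 31 bits and the row position within the partition in the lower 33 bits. The plain-Python sketch below only mimics that layout and does not call Spark; the helper name `simulated_id` is hypothetical. The consecutive values 0, 1, 2, ... in the output above are what you would expect if all rows sit in a single partition, which is an assumption about this run.

```python
# Sketch of the documented id layout of monotonically_increasing_id():
# partition index in the upper bits, row position in the lower 33 bits.
# `simulated_id` is a hypothetical helper, not part of PySpark.

def simulated_id(partition_index: int, row_in_partition: int) -> int:
    return (partition_index << 33) + row_in_partition

# Two partitions with three rows each.
ids = [simulated_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

If the dimension needs gap-free surrogate keys regardless of partitioning, the ids have to be derived in another way.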

The DataFrame is registered as a temporary view called "neededDates", so later queries can use it like a table.
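
A quick sanity check for "neededDates" is to compare its row count with the number of calendar days in the range. The sketch below computes the expected count in plain Python, without Spark; the helper name `expected_row_count` is hypothetical.

```python
from datetime import date

def expected_row_count(begin: str, end: str) -> int:
    """Number of calendar days from begin to end, both inclusive."""
    first = date.fromisoformat(begin)
    last = date.fromisoformat(end)
    return (last - first).days + 1

row_count = expected_row_count("2009-01-01", "2023-12-31")
print(row_count)  # 5478
```

A `select count(*) from neededDates` should return the same number, because the date sequence includes both end points.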

## Step 2: Create all typical dimension fields
Because we want to represent the date in different ways (weekday, month...), we have to do several transformations. You can see an extract of our goal below:
```
+------+--------+------------+------------+-------------+-----------+-----------+---------+--------------------+---------+----------+----------------+---------+-------------+-------------+
|dateSK| dateInt|CalendarDate|CalendarYear|CalendarMonth|MonthOfYear|CalendarDay|DayOfWeek|DayOfWeekStartMonday|IsWeekDay|DayOfMonth|IsLastDayOfMonth|DayOfYear|WeekOfYearIso|QuarterOfYear|
+------+--------+------------+------------+-------------+-----------+-----------+---------+--------------------+---------+----------+----------------+---------+-------------+-------------+
|     0|20090101|  2009-01-01|        2009|      January|          1|   Thursday|        5|                   4|        Y|         1|               N|        1|            1|            1|
|     1|20090102|  2009-01-02|        2009|      January|          1|     Friday|        6|                   5|        Y|         2|               N|        2|            1|            1|
|     2|20090103|  2009-01-03|        2009|      January|          1|   Saturday|        7|                   6|        N|         3|               N|        3|            1|            1|
|     3|20090104|  2009-01-04|        2009|      January|          1|     Sunday|        1|                   7|        N|         4|               N|        4|            1|            1|
```

### Method a: Use spark.sql to perform all the transformations with the help of a sql-query.
For many, writing a SQL select statement is the easiest way to perform the transformation.

In [4]:
dimDate = spark.sql("select dateSK, \
  year(calendarDate) * 10000 + month(calendarDate) * 100 + day(calendarDate) as dateInt, \
  CalendarDate, \
  year(calendarDate) AS CalendarYear, \
  date_format(calendarDate, 'MMMM') as CalendarMonth, \
  month(calendarDate) as MonthOfYear, \
  date_format(calendarDate, 'EEEE') as CalendarDay, \
  dayofweek(calendarDate) AS DayOfWeek, \
  weekday(calendarDate) + 1 as DayOfWeekStartMonday, \
  case \
    when weekday(calendarDate) < 5 then 'Y' \
    else 'N' \
  end as IsWeekDay, \
  dayofmonth(calendarDate) as DayOfMonth, \
  case \
    when calendarDate = last_day(calendarDate) then 'Y' \
    else 'N' \
  end as IsLastDayOfMonth, \
  dayofyear(calendarDate) as DayOfYear, \
  weekofyear(calendarDate) as WeekOfYearIso, \
  quarter(calendarDate) as QuarterOfYear \
from  \
  neededDates \
order by \
  calendarDate")

dimDate.show()

+------+--------+------------+------------+-------------+-----------+-----------+---------+--------------------+---------+----------+----------------+---------+-------------+-------------+
|dateSK| dateInt|CalendarDate|CalendarYear|CalendarMonth|MonthOfYear|CalendarDay|DayOfWeek|DayOfWeekStartMonday|IsWeekDay|DayOfMonth|IsLastDayOfMonth|DayOfYear|WeekOfYearIso|QuarterOfYear|
+------+--------+------------+------------+-------------+-----------+-----------+---------+--------------------+---------+----------+----------------+---------+-------------+-------------+
|     0|20090101|  2009-01-01|        2009|      January|          1|   Thursday|        5|                   4|        Y|         1|               N|        1|            1|            1|
|     1|20090102|  2009-01-02|        2009|      January|          1|     Friday|        6|                   5|        Y|         2|               N|        2|            1|            1|
|     2|20090103|  2009-01-03|        2009|      Januar

* ```spark.sql``` runs the select query and returns the desired DataFrame. Remember to look up the possible functions [here](https://spark.apache.org/docs/latest/api/sql/).
* ```dimDate.show()``` is used to show the records in a DataFrame. Use it during development, but remove or disable it when it is no longer needed.
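
To cross-check individual rows of the query result, the same fields can be computed for a single date with the Python standard library. This is a minimal sketch, not part of the notebook: the helper name `reference_row` is hypothetical, and the month and day names assume the default (English) locale. Note the two weekday conventions: Spark's `dayofweek` counts from Sunday = 1, while `weekday` counts from Monday = 0.

```python
import calendar
from datetime import date

def reference_row(d: date) -> dict:
    """Compute the dimension fields for one date without Spark."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "dateInt": d.year * 10000 + d.month * 100 + d.day,
        "CalendarYear": d.year,
        "CalendarMonth": calendar.month_name[d.month],
        "MonthOfYear": d.month,
        "CalendarDay": calendar.day_name[d.weekday()],
        "DayOfWeek": d.isoweekday() % 7 + 1,      # like dayofweek: 1 = Sunday
        "DayOfWeekStartMonday": d.weekday() + 1,  # like weekday + 1: 1 = Monday
        "IsWeekDay": "Y" if d.weekday() < 5 else "N",
        "DayOfMonth": d.day,
        "IsLastDayOfMonth": "Y" if d.day == last_day else "N",
        "DayOfYear": d.timetuple().tm_yday,
        "WeekOfYearIso": d.isocalendar()[1],
        "QuarterOfYear": (d.month - 1) // 3 + 1,
    }

row = reference_row(date(2009, 1, 1))
print(row["dateInt"], row["CalendarDay"], row["DayOfWeek"])  # 20090101 Thursday 5
```

Comparing a few such rows (a weekday, a Sunday, a month end) with the output of `dimDate.show()` is a cheap way to catch an off-by-one in the weekday columns.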


### Method b: Use the dataframe API

This method does not use the SQL-like language. You can achieve the same result with the DataFrame API, and you get better code completion. See [DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).
As an example, three columns were added with ```withColumn```.
```expr()``` is used to write a snippet of SQL code and parse it into a column.

In [5]:
#from pyspark.sql.functions import explode, expr, sequence,col, date_format
df_SparkSQL = df_SQL \
    .withColumn("year", date_format("calendarDate",'yyyy')) \
    .withColumn("month", date_format("calendarDate",'MMMM')) \
    .withColumn("lastDayOfMonth" \
                ,expr("case when calendarDate = last_day(calendarDate) then 'Y' \
                else 'N' \
                end as IsLastDayOfMonth"))
df_SparkSQL.show()

+------------+------+----+-------+--------------+
|calendarDate|dateSK|year|  month|lastDayOfMonth|
+------------+------+----+-------+--------------+
|  2009-01-01|     0|2009|January|             N|
|  2009-01-02|     1|2009|January|             N|
|  2009-01-03|     2|2009|January|             N|
|  2009-01-04|     3|2009|January|             N|
|  2009-01-05|     4|2009|January|             N|
|  2009-01-06|     5|2009|January|             N|
|  2009-01-07|     6|2009|January|             N|
|  2009-01-08|     7|2009|January|             N|
|  2009-01-09|     8|2009|January|             N|
|  2009-01-10|     9|2009|January|             N|
|  2009-01-11|    10|2009|January|             N|
|  2009-01-12|    11|2009|January|             N|
|  2009-01-13|    12|2009|January|             N|
|  2009-01-14|    13|2009|January|             N|
|  2009-01-15|    14|2009|January|             N|
|  2009-01-16|    15|2009|January|             N|
|  2009-01-17|    16|2009|January|             N|


> ## TASK:
> Complete the transformation in method b until the result matches the result of method a.

## Step 3: Writing the data to a Delta table

In [6]:
dimDate.write.format("delta").mode("overwrite").saveAsTable("dimDate")


In [7]:
spark.stop()