# Loading Data To Gold Zone

**This Notebook:**
* Loads data into the Gold Zone of the Data Lakehouse
* Models the data as a Star Schema and as One Big Table
* Creates **`IDENTITY`** columns in Databricks Delta tables

## 1.0 Initial Setup

In [0]:

%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator"

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting dbldatagen
  Downloading dbldatagen-0.4.0.post1-py3-none-any.whl (122 kB)
Installing collected packages: dbldatagen
Successfully installed dbldatagen-0.4.0.post1


## 2.0 Create `Gold Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS gold")

DataFrame[]

## 3.0 `Sales Star Schema` Modeling 

To optimize queries over large datasets, we use a dimensional model.
Following Ralph Kimball's data warehouse principles, we build a Star Schema: fact tables that reference descriptive dimension tables through surrogate keys.
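
As a rough illustration of what a star schema buys at query time, the sketch below builds one dimension and one fact table and joins them on the surrogate key. It uses SQLite from the Python standard library as a stand-in for the lakehouse. The `fact_sales` table, its columns, and the sample rows are hypothetical and do not come from this notebook.

```python
import sqlite3

# SQLite stands in for the lakehouse; fact_sales and the sample rows are
# hypothetical and only illustrate the shape of a star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_course (
        sk_course   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        course_uuid TEXT,                               -- natural key from the source
        course_name TEXT
    );
    CREATE TABLE fact_sales (
        sk_course  INTEGER REFERENCES dim_course (sk_course),
        sale_value REAL
    );
""")

conn.executemany(
    "INSERT INTO dim_course (course_uuid, course_name) VALUES (?, ?)",
    [("c-001", "Spark Basics"), ("c-002", "Delta Lake")],
)
conn.executemany(
    "INSERT INTO fact_sales (sk_course, sale_value) VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 80.0)],
)

# Typical star-schema query: aggregate the fact, describe it with a dimension.
sales_by_course = conn.execute("""
    SELECT d.course_name, SUM(f.sale_value) AS total
    FROM fact_sales AS f
    JOIN dim_course AS d ON d.sk_course = f.sk_course
    GROUP BY d.course_name
    ORDER BY d.course_name
""").fetchall()

print(sales_by_course)
```

The fact table stores only the integer key and the measure, so descriptive attributes such as the course name live in one place and are joined in when needed.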


### `Dimension Tables`
- **dim_calendar** - Dimension with date information.
- **dim_cod** - Dimension of codes: low-cardinality attributes grouped into a single junk dimension:
  - **user_origin** - API vs. Files
  - **access_from** - mobile vs. computer
  - **payment_method** - Pix vs. Boleto vs. Cartão
  - **percent_discount** - 5% vs. 10% vs. 15%
- **dim_course** - Dimension that stores course information.
- **dim_user** - Dimension that stores student information.
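
To get a feel for the size of the junk dimension, the sketch below enumerates every combination of the four `dim_cod` attributes listed above. Populating `dim_cod` with the full cross product is an assumption made for illustration. The notebook may instead load only the combinations that actually occur in the source data.

```python
from itertools import product

# Attribute values as listed for dim_cod. Loading the full cross product is an
# assumption for illustration, not something this notebook shows.
user_origin = ["API", "Files"]
access_from = ["mobile", "computer"]
payment_method = ["Pix", "Boleto", "Cartão"]
percent_discount = ["5%", "10%", "15%"]

dim_cod_rows = [
    {
        "user_origin": origin,
        "access_from": device,
        "payment_method": method,
        "percent_discount": discount,
    }
    for origin, device, method, discount in product(
        user_origin, access_from, payment_method, percent_discount
    )
]

print(len(dim_cod_rows))  # 2 * 2 * 3 * 3 = 36
```

With only 36 possible rows, one small dimension replaces four separate low-cardinality columns (or four tiny dimensions) on the fact table.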


All tables have a **Surrogate Key (SK)** column declared as **`<col_name> BIGINT GENERATED ALWAYS AS IDENTITY`**. Delta populates this column at write time with an automatically incremented value; by default the sequence starts at 1 and increases by 1.
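
Since every dimension follows the same pattern, the DDL can also be generated from a column specification. The helper below is a hypothetical sketch (`build_dim_ddl` is not part of this notebook or of any library). It only assembles the SQL string, and it refuses columns declared without a data type, a mistake that would otherwise surface only as a `ParseException` when the statement runs.

```python
def build_dim_ddl(table: str, sk_name: str, columns: dict) -> str:
    """Return a CREATE TABLE statement whose first column is an identity surrogate key.

    Hypothetical helper: `columns` maps column names to SQL data types.
    """
    missing = [name for name, dtype in columns.items() if not dtype.strip()]
    if missing:
        raise ValueError(f"columns without a data type: {missing}")

    lines = [f"    {sk_name} BIGINT GENERATED ALWAYS AS IDENTITY"]
    lines += [f"    {name} {dtype}" for name, dtype in columns.items()]
    body = ",\n".join(lines)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{body}\n)"


ddl = build_dim_ddl(
    "gold.dim_cod",
    "sk_cod",
    {
        "user_origin": "STRING",
        "access_from": "STRING",
        "payment_method": "STRING",
        "percent_discount": "STRING",
        "dt_load": "TIMESTAMP",
    },
)
print(ddl)
```

The resulting string could then be passed to `spark.sql(ddl)` in a Databricks session.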


### 3.1 `Sales Dimensions`

In [0]:
spark.sql( """
    CREATE TABLE IF NOT EXISTS gold.dim_calendar(
        sk_tempo BIGINT GENERATED ALWAYS AS IDENTITY,
        date DATE,
        year INT, 
        month STRING,
        month_year INT,
        day_week_int INT, 
        day_week STRING,
        fl_day_week BOOLEAN,
        day_month INT,
        fl_last_month_day INT,
        day_year INT,
        week_year INT,
        bimonthly INT,
        quarter INT, 
        semester INT, 
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_cod (
        sk_cod BIGINT GENERATED ALWAYS AS IDENTITY,
        user_origin STRING,
        access_from STRING,
        payment_method STRING,
        percent_discount STRING,
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_course(
    sk_course BIGINT GENERATED ALWAYS AS IDENTITY,
    course_uuid STRING,
    course_name STRING, 
    course_level STRING,
    cource_price DECIMAL(9,2),
    dt_carga TIMESTAMP

    )          
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_user(
    sk_user BIGINT GENERATED ALWAYS AS IDENTITY,
    user_uuid STRING,
    name_user STRING,
    user_email STRING,
    user_age INT, 
    user_gender STRING,
    user_state STRING,
    user_profession STRING,
    company STRING,
    dt_load TIMESTAMP
    )
""")

### 3.2 `Calendar Dimension`  

The view **`vw_dim_calendar`** contains one row per day:
* Start date: **2024-06-01**
* End date: **2025-12-31**

In [0]:

from pyspark.sql.functions import explode, sequence, to_date

start_date = '2024-06-01'
end_date = '2025-12-31'

spark.sql(f"""         
  with dates as (
    select
      explode(
        sequence(
          to_date('{start_date}'),
          to_date('{end_date}'),
          interval 1 day
        )
      ) as date
  )
  select
    date,
    year(date) AS year,
    to_csv(
      named_struct('date', date),
      map('dateFormat', 'MMMM', 'locale', 'EN')
    ) AS month,
    month(date) as month_year,
    dayofweek(date) AS day_week_int,
    to_csv(
      named_struct('date', date),
      map('dateFormat', 'EEEE', 'locale', 'EN')
    ) AS day_week,
    case
      when weekday(date) < 5 then True
      else False
    end as fl_day_week,
    dayofmonth(date) as day_month,
    case
      when date = last_day(date) then True
      else False
    end as fl_last_month_day,
    dayofyear(date) as day_year,
    weekofyear(date) as week_year,
    case
      when month(date) in (1, 2) then 1
      when month(date) in (3, 4) then 2
      when month(date) in (5, 6) then 3
      when month(date) in (7, 8) then 4
      when month(date) in (9, 10) then 5
      when month(date) in (11, 12) then 6
    end as bimonthly,
    case
      when month(date) in (1, 2, 3) then 1
      when month(date) in (4, 5, 6) then 2
      when month(date) in (7, 8, 9) then 3
      when month(date) in (10, 11, 12) then 4
    end as quarter,
    case
      when month(date) in (1, 2, 3, 4, 5, 6) then 1
      when month(date) in (7, 8, 9, 10, 11, 12) then 2
    end as semester
  from
    dates
""").createOrReplaceTempView('vw_dim_calendar')

spark.sql('SELECT * FROM vw_dim_calendar LIMIT 3').display()

date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester
2024-06-01,2024,June,6,7,Saturday,False,1,False,153,22,3,2,1
2024-06-02,2024,June,6,1,Sunday,False,2,False,154,22,3,2,1
2024-06-03,2024,June,6,2,Monday,True,3,False,155,23,3,2,1
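
The date arithmetic in the view can be cross-checked outside Spark. The sketch below recomputes the same attributes with the Python standard library. The function name `calendar_row` is hypothetical, and the mapping of each Spark function to its Python counterpart is stated in the comments.

```python
import calendar
from datetime import date, timedelta


def calendar_row(d: date) -> dict:
    """Compute the vw_dim_calendar attributes for one date (hypothetical helper)."""
    return {
        "date": d.isoformat(),
        "year": d.year,
        "month": calendar.month_name[d.month],   # English under the default C locale
        "month_year": d.month,
        "day_week_int": d.isoweekday() % 7 + 1,  # Spark dayofweek: 1 = Sunday ... 7 = Saturday
        "day_week": calendar.day_name[d.weekday()],
        "fl_day_week": d.weekday() < 5,          # like Spark weekday: Monday = 0, so 0-4 are weekdays
        "day_month": d.day,
        "fl_last_month_day": d.day == calendar.monthrange(d.year, d.month)[1],
        "day_year": d.timetuple().tm_yday,
        "week_year": d.isocalendar()[1],         # ISO 8601 week number, like Spark weekofyear
        "bimonthly": (d.month + 1) // 2,         # months 1-2 -> 1, 3-4 -> 2, ..., 11-12 -> 6
        "quarter": (d.month + 2) // 3,
        "semester": 1 if d.month <= 6 else 2,
    }


start, end = date(2024, 6, 1), date(2025, 12, 31)
rows = [calendar_row(start + timedelta(days=i)) for i in range((end - start).days + 1)]

print(len(rows))  # 579 days between the two dates, inclusive
print(rows[0])
```

The integer divisions for `bimonthly` and `quarter` give the same result as the `CASE` expressions in the view, in a more compact form.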



We will now use the **`MERGE`** command to load the `dim_calendar` data.

In [0]:
spark.sql("""
    MERGE INTO gold.dim_calendar as dest
    USING vw_dim_calendar AS orig
        ON dest.date = orig.date
    WHEN NOT MATCHED    
        THEN INSERT(
            date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester,dt_load
            )
        VALUES(
            date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester,getdate()
        )
    """).display()


spark.sql("SELECT * FROM gold.dim_calendar LIMIT 5").display()

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
579,0,0,579


sk_tempo,date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester,dt_load
1,2024-06-01,2024,June,6,7,Saturday,False,1,0,153,22,3,2,1,2024-11-25T10:27:01.396Z
2,2024-06-02,2024,June,6,1,Sunday,False,2,0,154,22,3,2,1,2024-11-25T10:27:01.396Z
3,2024-06-03,2024,June,6,2,Monday,True,3,0,155,23,3,2,1,2024-11-25T10:27:01.396Z
4,2024-06-04,2024,June,6,3,Tuesday,True,4,0,156,23,3,2,1,2024-11-25T10:27:01.396Z
5,2024-06-05,2024,June,6,4,Wednesday,True,5,0,157,23,3,2,1,2024-11-25T10:27:01.396Z
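
Because the `MERGE` has only a `WHEN NOT MATCHED` clause, running the load again should insert nothing: every date is already in `gold.dim_calendar`. The plain-Python sketch below mimics that insert-only behaviour on a dictionary keyed by `date`. The function `merge_insert_only` is hypothetical, and it assumes the source keys are unique, as they are in `vw_dim_calendar`.

```python
def merge_insert_only(target: dict, source: list, key: str) -> int:
    """Insert source rows whose key is absent from target; return the number inserted.

    Hypothetical stand-in for MERGE ... WHEN NOT MATCHED THEN INSERT,
    assuming each key appears at most once in the source.
    """
    inserted = 0
    for row in source:
        if row[key] not in target:
            target[row[key]] = row
            inserted += 1
    return inserted


target = {}
source = [{"date": "2024-06-01"}, {"date": "2024-06-02"}, {"date": "2024-06-03"}]

first_run = merge_insert_only(target, source, "date")
second_run = merge_insert_only(target, source, "date")

print(first_run, second_run)  # 3 0
```

This is what makes the calendar load safe to re-run: extending `end_date` later would add only the new days.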
