# Loading Data To Gold Zone

**This Notebook:**
* Loads data into the Gold Zone of the Data Lakehouse
* Models the data as a Star Schema and as One Big Table
* Creates **`IDENTITY`** columns in Databricks Delta tables

## 1.0 Initial Setup

In [0]:

%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator" 

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting dbldatagen
  Downloading dbldatagen-0.4.0.post1-py3-none-any.whl (122 kB)
Installing collected packages: dbldatagen
Successfully installed dbldatagen-0.4.0.post1


## 2.0 Create `Gold Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS gold")

DataFrame[]

## 3.0 `Sales Star Schema` Modeling 

To optimize queries over large datasets, we use a dimensional model.
Following Ralph Kimball's data warehouse principles, we build a Star Schema.


### `Dimension Tables`
- **dim_calendar** - Dimension with date information
- **dim_cod** - Dimension with codes: low-cardinality attributes grouped into one Junk Dimension:
  - **user_origin** - API vs. Files
  - **access_from** - mobile vs. computer
  - **payment_method** - Pix vs. Boleto vs. Cartão
  - **percent_discount** - 5% vs. 10% vs. 15%
- **dim_course** - Dimension that stores course information.
- **dim_user** - Dimension that stores student information.
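
A junk dimension such as **dim_cod** holds combinations of its low-cardinality attributes. The sketch below is plain Python, not the notebook's actual load logic: it assumes the dimension is pre-filled with the full cross product of the values listed above, which is one common approach. The names `attributes` and `junk_rows` are illustrative only.

```python
from itertools import product

# Attribute values taken from the dim_cod description above.
attributes = {
    "user_origin": ["API", "Files"],
    "access_from": ["mobile", "computer"],
    "payment_method": ["Pix", "Boleto", "Cartão"],
    "percent_discount": ["5%", "10%", "15%"],
}

# Assumption: one row per combination of attribute values (cross product).
junk_rows = [
    dict(zip(attributes, combo))
    for combo in product(*attributes.values())
]

print(len(junk_rows))  # 2 * 2 * 3 * 3 = 36 combinations
print(junk_rows[0])
```

With only 36 possible rows, the fact table can carry a single `sk_cod` key instead of four separate string columns.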


All tables have a **Surrogate Key (SK)** column created with the **`<col_name> BIGINT GENERATED ALWAYS AS IDENTITY`** clause. Delta Lake populates this column at write time with an incrementing value, by default starting at 1 and stepping by 1 (`START WITH 1 INCREMENT BY 1`). Generated values are unique but not guaranteed to be consecutive.
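
To illustrate the idea of a surrogate key, here is a minimal plain-Python sketch. It only mimics the concept (rows arrive without a key and receive an increasing integer); it is not how Delta Lake implements `IDENTITY`, and it does not model the gaps that real identity columns may contain. The helper `assign_surrogate_keys` and the sample rows are hypothetical.

```python
from itertools import count

def assign_surrogate_keys(rows, key_name, start=1, step=1):
    """Attach an increasing integer key to each row (conceptual stand-in for IDENTITY)."""
    keys = count(start, step)
    return [{key_name: next(keys), **row} for row in rows]

# Hypothetical source rows: they carry only the natural key (course_uuid).
courses = [
    {"course_uuid": "uuid-a", "course_name": "Course A"},
    {"course_uuid": "uuid-b", "course_name": "Course B"},
]

dim_course = assign_surrogate_keys(courses, "sk_course")
for row in dim_course:
    print(row)
```

The point of the design is that the writer never supplies the key: with `GENERATED ALWAYS`, the column is filled by the table itself, so fact tables can join on a compact integer rather than on the UUID.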


### 3.1 `Sales Dimensions`

In [0]:
spark.sql( """
    CREATE TABLE IF NOT EXISTS gold.dim_calendar(
        sk_tempo BIGINT GENERATED ALWAYS AS IDENTITY,
        date DATE,
        year INT, 
        month STRING,
        month_year INT,
        day_week_int INT, 
        day_week STRING,
        fl_day_week BOOLEAN,
        day_month INT,
        fl_last_month_day BOOLEAN,
        day_year INT,
        week_year INT,
        bimonthly INT,
        quarter INT, 
        semester INT, 
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_cod (
        sk_cod BIGINT GENERATED ALWAYS AS IDENTITY,
        user_origin STRING,
        access_from STRING,
        payment_method STRING,
        percent_discount STRING,
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_course(
    sk_course BIGINT GENERATED ALWAYS AS IDENTITY,
    course_uuid STRING,
    course_name STRING, 
    course_level STRING,
    course_price DECIMAL(9,2),
    dt_carga TIMESTAMP
    )          
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_user(
    sk_user BIGINT GENERATED ALWAYS AS IDENTITY,
    user_uuid STRING,
    name_user STRING,
    user_email STRING,
    user_age INT, 
    user_gender STRING,
    user_state STRING,
    user_profession STRING,
    company STRING,
    dt_load TIMESTAMP
    )
""")

### 3.2 `Calendar Dimension`  

The view **`vw_dim_tempo`** generates one row per day:
* Start date: **2024-06-01**
* End date: **2025-12-31**
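
As a quick sanity check on the size of the view, the date range can be reproduced in plain Python. This is a sketch outside Spark; `date_sequence` is a hypothetical helper that mirrors the inclusive behaviour of Spark SQL's `sequence(start, stop, interval 1 day)`.

```python
from datetime import date, timedelta

def date_sequence(start, end):
    """Return every date from start to end, both ends included."""
    n_days = (end - start).days
    return [start + timedelta(days=i) for i in range(n_days + 1)]

calendar_dates = date_sequence(date(2024, 6, 1), date(2025, 12, 31))

print(len(calendar_dates))  # 579 rows: 214 days in 2024 plus 365 days in 2025
print(calendar_dates[0], calendar_dates[-1])
```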

In [0]:

spark.sql(f"""         
  with date as (
    select
      explode(
        sequence(
          to_date('{data_inicio}'),
          to_date('{data_fim}'),
          interval 1 day
        )
      ) as data
  )
  select
    data,
    year(data) AS year,
    to_csv(
      named_struct('date', data),
      map('dateFormat', 'MMMM', 'locale', 'PT')
    ) AS mes,
    month(data) as month_year,
    dayofweek(data) AS day_week_int,
    to_csv(
      named_struct('date', data),
      map('dateFormat', 'EEEE', 'locale', 'PT')
    ) AS day_week,
    case
      when weekday(data) < 5 then True
      else False
    end as fl_dia_semana,
    dayofmonth(data) as day_month,
    case
      when data = last_day(data) then True
      else False
    end as fl_ultimo_dia_mes,
    dayofyear(data) as dia_ano,
    weekofyear(data) as week_year,
    case
      when month(data) in (1, 2) then 1
      when month(data) in (3, 4) then 2
      when month(data) in (5, 6) then 3
      when month(data) in (7, 8) then 4
      when month(data) in (9, 10) then 5
      when month(data) in (11, 12) then 6
    end as bimestre,
    case
      when month(data) in (1, 2, 3) then 1
      when month(data) in (4, 5, 6) then 2
      when month(data) in (7, 8, 9) then 3
      when month(data) in (10, 11, 12) then 4
    end as trimestre,
    case
      when month(data) in (1, 2, 3, 4, 5, 6) then 1
      when month(data) in (7, 8, 9, 10, 11, 12) then 2
    end as semestre
  from
    date
""").createOrReplaceTempView('vw_dim_tempo')

spark.sql('SELECT * FROM vw_dim_tempo LIMIT 5').display()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-101296273828189>, line 6
      1 spark.sql(f"""
      2   with date as (
      3     select
      4       explode(
      5         sequence(
----> 6           to_date('{data_inicio}'),
      7           to_date('{data_fim}'),
      8           interval 1 day
      9         )
     10       ) as

The cell fails because `data_inicio` and `data_fim` are not defined at this point: Python raises `NameError` while building the f-string, before any SQL reaches Spark.

In [0]:
# Date range of the calendar dimension (ISO format, both ends inclusive).
start_date = "2024-06-01"
end_date = "2025-12-31"

# Quick test: generate one row per day and format the month name in Portuguese.
spark.sql(f"""
  with date as (
    select
      explode(
        sequence(
          to_date('{start_date}'),
          to_date('{end_date}'),
          interval 1 day
        )
      ) as Date
  )
  select
    Date,
    year(Date) AS year,
    to_csv(
      named_struct('date', Date),
      map('dateFormat', 'MMMM', 'locale', 'PT')
    ) AS month
  from date
""").createOrReplaceTempView('test_view')

spark.sql('SELECT * FROM test_view LIMIT 5').display()

Date,year,month
2024-06-01,2024,Junho
2024-06-02,2024,Junho
2024-06-03,2024,Junho
2024-06-04,2024,Junho
2024-06-05,2024,Junho


In [0]:
spark.sql(f"""         
  with datas as (
    select
      explode(
        sequence(
          to_date('{start_date}'),
          to_date('{end_date}'),
          interval 1 day
        )
      ) as data
  )
  select
    data,
    year(data) AS ano,
    to_csv(
      named_struct('date', data),
      map('dateFormat', 'MMMM', 'locale', 'PT')
    ) AS mes,
    month(data) as mes_ano,
    dayofweek(data) AS dia_semana_int,
    to_csv(
      named_struct('date', data),
      map('dateFormat', 'EEEE', 'locale', 'PT')
    ) AS dia_semana,
    case
      when weekday(data) < 5 then True
      else False
    end as fl_dia_semana,
    dayofmonth(data) as dia_mes,
    case
      when data = last_day(data) then True
      else False
    end as fl_ultimo_dia_mes,
    dayofyear(data) as dia_ano,
    weekofyear(data) as semana_ano,
    case
      when month(data) in (1, 2) then 1
      when month(data) in (3, 4) then 2
      when month(data) in (5, 6) then 3
      when month(data) in (7, 8) then 4
      when month(data) in (9, 10) then 5
      when month(data) in (11, 12) then 6
    end as bimestre,
    case
      when month(data) in (1, 2, 3) then 1
      when month(data) in (4, 5, 6) then 2
      when month(data) in (7, 8, 9) then 3
      when month(data) in (10, 11, 12) then 4
    end as trimestre,
    case
      when month(data) in (1, 2, 3, 4, 5, 6) then 1
      when month(data) in (7, 8, 9, 10, 11, 12) then 2
    end as semestre
  from
    datas
""").createOrReplaceTempView('vw_dim_tempo')

spark.sql('SELECT * FROM vw_dim_tempo LIMIT 5').display()
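
The derived columns of the view can be cross-checked for a single date in plain Python. The sketch below is an assumption-labelled equivalent, not part of the pipeline: `calendar_attributes` is a hypothetical helper, and the integer arithmetic for `bimestre` and `trimestre` is a compact replacement for the `CASE` expressions. It relies on Spark's conventions that `dayofweek` returns 1 for Sunday through 7 for Saturday, `weekday` returns 0 for Monday through 6 for Sunday, and `weekofyear` follows ISO 8601 weeks.

```python
import calendar
from datetime import date

def calendar_attributes(d):
    """Compute the numeric/boolean columns of vw_dim_tempo for one date."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "data": d,
        "ano": d.year,
        "mes_ano": d.month,
        "dia_semana_int": d.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
        "fl_dia_semana": d.weekday() < 5,          # True for Monday to Friday
        "dia_mes": d.day,
        "fl_ultimo_dia_mes": d.day == last_day,
        "dia_ano": d.timetuple().tm_yday,
        "semana_ano": d.isocalendar()[1],          # ISO 8601 week number
        "bimestre": (d.month + 1) // 2,            # months 1-2 -> 1 ... 11-12 -> 6
        "trimestre": (d.month + 2) // 3,           # months 1-3 -> 1 ... 10-12 -> 4
        "semestre": 1 if d.month <= 6 else 2,
    }

first_row = calendar_attributes(date(2024, 6, 1))
print(first_row)
```

The month and weekday names (`mes`, `dia_semana`) are left out because they depend on locale-aware formatting, which the view delegates to `to_csv` with the `PT` locale.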