# Loading Data To Gold Zone

**This Notebook:**
* Loads data into the Gold Zone of the Data Lakehouse
* Models the data as a Star Schema and as One Big Table
* Creates an **`IDENTITY`** column in Databricks Delta tables

## 1.0 Initial Setup

In [0]:

%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator"

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting dbldatagen
  Using cached dbldatagen-0.4.0.post1-py3-none-any.whl (122 kB)
Installing collected packages: dbldatagen
Successfully installed dbldatagen-0.4.0.post1


## 2.0 Create `Gold Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS gold")

DataFrame[]

## 3.0 `Sales Star Schema` Modeling 

To keep queries fast on large datasets, we will use a dimensional model.
Following Ralph Kimball's data warehouse principles, we will build a Star Schema.


### `Dimensional Tables`
- **dim_calendar** - Dimension with date information
- **dim_cod** - Junk dimension that groups the low-cardinality codes:
  - **user_origin** - API vs. Files
  - **access_from** - mobile vs. computer
  - **payment_method** - Pix vs. Boleto vs. Cartão
  - **percent_discount** - 5% vs. 10% vs. 15%
- **dim_course** - Dimension that stores course information.
- **dim_user** - Dimension that stores student information.


All tables will have a **Surrogate Key (SK)** column, created with the **`<col_name> BIGINT GENERATED ALWAYS AS IDENTITY`** clause. Delta Lake populates this column at write time with a unique, increasing value (by default starting at 1 with a step of 1). The generated values are not guaranteed to be consecutive.
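As a minimal sketch of this pattern, the helper below assembles such a DDL statement as a plain string. The function `identity_table_ddl` and the table `gold.dim_example` are hypothetical names introduced only for illustration. The explicit `START WITH 1 INCREMENT BY 1` spells out the default behavior.

```python
def identity_table_ddl(table, sk_name, columns, start=1, step=1):
    """Build a CREATE TABLE statement whose first column is an identity surrogate key.

    `columns` is a list of (name, type) pairs for the remaining columns.
    """
    sk = (
        f"{sk_name} BIGINT GENERATED ALWAYS AS IDENTITY "
        f"(START WITH {start} INCREMENT BY {step})"
    )
    cols = [sk] + [f"{name} {dtype}" for name, dtype in columns]
    body = ",\n    ".join(cols)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n    {body}\n)"


ddl = identity_table_ddl(
    "gold.dim_example",  # hypothetical table, not part of this notebook
    "sk_example",
    [("code", "STRING"), ("dt_load", "TIMESTAMP")],
)
print(ddl)
# In a Databricks notebook the statement would be run with: spark.sql(ddl)
```

Because the key is `GENERATED ALWAYS`, the `INSERT` column lists in the loads below leave the surrogate key out and let Delta fill it in.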


### 3.1 `Sale Dimensions`  

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/gold.db/dim_calendar

In [0]:

%fs rm -r dbfs:/user/hive/warehouse/gold.db/dim_cod

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/gold.db/dim_course

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/gold.db/dim_user

In [0]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_calendar(
        sk_tempo BIGINT GENERATED ALWAYS AS IDENTITY,
        date DATE,
        year INT, 
        month STRING,
        month_year INT,
        day_week_int INT, 
        day_week STRING,
        fl_day_week BOOLEAN,
        day_month INT,
        fl_last_month_day INT,
        day_year INT,
        week_year INT,
        bimonthly INT,
        quarter INT, 
        semester INT, 
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_cod (
        sk_cod BIGINT GENERATED ALWAYS AS IDENTITY,
        user_origin STRING,
        access_from STRING,
        payment_method STRING,
        percent_discount STRING,
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_course(
        sk_course BIGINT GENERATED ALWAYS AS IDENTITY,
        course_uuid STRING,
        course_name STRING, 
        course_level STRING,
        course_price DECIMAL(9,2),
        dt_load TIMESTAMP
    )          
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_user(
    sk_user BIGINT GENERATED ALWAYS AS IDENTITY,
    user_uuid STRING,
    name_user STRING,
    user_email STRING,
    user_age INT, 
    user_gender STRING,
    user_state STRING,
    user_profession STRING,
    company STRING,
    dt_load TIMESTAMP
    )
""")

DataFrame[]

### 3.2 `Calendar Dimension`  

The view **`vw_dim_calendar`** generates one row per day:
* Start date: **2024-06-01**
* End date: **2025-12-31**

In [0]:

from pyspark.sql.functions import explode, sequence, to_date

start_date = '2024-06-01'
end_date = '2025-12-31'

spark.sql(f"""         
  with dates as (
    select
      explode(
        sequence(
          to_date('{start_date}'),
          to_date('{end_date}'),
          interval 1 day
        )
      ) as date
  )
  select
    date,
    year(date) AS year,
    to_csv(
      named_struct('date', date),
      map('dateFormat', 'MMMM', 'locale', 'EN')
    ) AS month,
    month(date) as month_year,
    dayofweek(date) AS day_week_int,
    to_csv(
      named_struct('date', date),
      map('dateFormat', 'EEEE', 'locale', 'EN')
    ) AS day_week,
    case
      when weekday(date) < 5 then True
      else False
    end as fl_day_week,
    dayofmonth(date) as day_month,
    case
      when date = last_day(date) then True
      else False
    end as fl_last_month_day,
    dayofyear(date) as day_year,
    weekofyear(date) as week_year,
    case
      when month(date) in (1, 2) then 1
      when month(date) in (3, 4) then 2
      when month(date) in (5, 6) then 3
      when month(date) in (7, 8) then 4
      when month(date) in (9, 10) then 5
      when month(date) in (11, 12) then 6
    end as bimonthly,
    case
      when month(date) in (1, 2, 3) then 1
      when month(date) in (4, 5, 6) then 2
      when month(date) in (7, 8, 9) then 3
      when month(date) in (10, 11, 12) then 4
    end as quarter,
    case
      when month(date) in (1, 2, 3, 4, 5, 6) then 1
      when month(date) in (7, 8, 9, 10, 11, 12) then 2
    end as semester
  from
    dates
""").createOrReplaceTempView('vw_dim_calendar')

spark.sql('SELECT * FROM vw_dim_calendar LIMIT 3').display()

date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester
2024-06-01,2024,June,6,7,Saturday,False,1,False,153,22,3,2,1
2024-06-02,2024,June,6,1,Sunday,False,2,False,154,22,3,2,1
2024-06-03,2024,June,6,2,Monday,True,3,False,155,23,3,2,1
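The numeric attributes of the view can be cross-checked without Spark. The sketch below recomputes them with the Python standard library. `calendar_row` is a hypothetical helper, and the month and weekday names are left out to keep the example independent of locale settings.

```python
import calendar
from datetime import date, timedelta


def calendar_row(d):
    """Recompute the numeric vw_dim_calendar attributes for a single date."""
    return {
        "date": d.isoformat(),
        "year": d.year,
        "month_year": d.month,
        # Spark dayofweek(): 1 = Sunday ... 7 = Saturday
        "day_week_int": d.isoweekday() % 7 + 1,
        # Spark weekday(): 0 = Monday ... 6 = Sunday, same as date.weekday()
        "fl_day_week": d.weekday() < 5,
        "day_month": d.day,
        "fl_last_month_day": d.day == calendar.monthrange(d.year, d.month)[1],
        "day_year": d.timetuple().tm_yday,
        # ISO 8601 week number, matching Spark weekofyear()
        "week_year": d.isocalendar()[1],
        "bimonthly": (d.month + 1) // 2,
        "quarter": (d.month - 1) // 3 + 1,
        "semester": 1 if d.month <= 6 else 2,
    }


start, end = date(2024, 6, 1), date(2025, 12, 31)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
rows = [calendar_row(d) for d in days]

print(len(rows))  # 579 days in the range
print(rows[0])
```

The arithmetic forms `(month + 1) // 2` and `(month - 1) // 3 + 1` produce the same buckets as the `CASE` expressions in the view.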



We will now use the **`MERGE`** command to load the `dim_calendar` data.

In [0]:
spark.sql("""
    MERGE INTO gold.dim_calendar as dest
    USING vw_dim_calendar AS orig
        ON dest.date = orig.date
    WHEN NOT MATCHED    
        THEN INSERT(
            date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester,dt_load
            )
        VALUES(
            date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester,getdate()
        )
    """).display()


spark.sql("SELECT * FROM gold.dim_calendar LIMIT 5").display()

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
579,0,0,579


sk_tempo,date,year,month,month_year,day_week_int,day_week,fl_day_week,day_month,fl_last_month_day,day_year,week_year,bimonthly,quarter,semester,dt_load
1,2024-06-01,2024,June,6,7,Saturday,False,1,0,153,22,3,2,1,2024-11-26T00:44:01.331Z
2,2024-06-02,2024,June,6,1,Sunday,False,2,0,154,22,3,2,1,2024-11-26T00:44:01.331Z
3,2024-06-03,2024,June,6,2,Monday,True,3,0,155,23,3,2,1,2024-11-26T00:44:01.331Z
4,2024-06-04,2024,June,6,3,Tuesday,True,4,0,156,23,3,2,1,2024-11-26T00:44:01.331Z
5,2024-06-05,2024,June,6,4,Wednesday,True,5,0,157,23,3,2,1,2024-11-26T00:44:01.331Z
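Since the statement only has a `WHEN NOT MATCHED` clause, running it again with the same view should insert nothing. The sketch below mimics that behavior on in-memory lists. It is a simplified stand-in for illustration, not Delta Lake itself, and `merge_insert_only` is a hypothetical helper.

```python
def merge_insert_only(target, source, key):
    """Mimic MERGE ... WHEN NOT MATCHED THEN INSERT on lists of dicts.

    Source rows whose key already exists in the target are ignored.
    Returns the number of inserted rows.
    """
    # Keys are read once, before inserting, like a snapshot of the target
    existing = {row[key] for row in target}
    new_rows = [dict(row) for row in source if row[key] not in existing]
    target.extend(new_rows)
    return len(new_rows)


target = []
source = [{"date": "2024-06-01"}, {"date": "2024-06-02"}]

first = merge_insert_only(target, source, "date")
second = merge_insert_only(target, source, "date")
print(first, second, len(target))  # 2 0 2
```

This is why the load can be rerun safely: dates that are already in `gold.dim_calendar` keep their surrogate keys.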


### 3.3 Junk Dimension
We will first create a **`vw_dim_cod`** view that builds the **Cartesian product of the low-cardinality codes**.

Grouping the codes this way avoids building several tiny dimensions with only a few rows each.


In [0]:
spark.sql( """
    SELECT DISTINCT
        s.origin AS user_origin,
        -- file-origin rows get the placeholder instead of an access value
        CASE WHEN s.origin = 'File' THEN 'do not apply' ELSE a.local_access END AS local_access,
        s.payment_method,
        COALESCE(s.percent_discount, 'do not apply') AS percent_discount
    FROM silver.tb_sales AS s
    CROSS JOIN silver.tb_access AS a
    ORDER BY user_origin, local_access, payment_method, percent_discount
""").createOrReplaceTempView("vw_dim_cod")

spark.sql("select * from vw_dim_cod limit 5").display()

user_origin,local_access,payment_method,percent_discount
API,Computer,boleto,5%
API,Computer,boleto,do not apply
API,Computer,credito,10%
API,Computer,credito,15%
API,Computer,credito,5%
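To make the idea concrete, here is a small in-memory sketch of the same logic. The `sales` and `access` lists are hypothetical samples standing in for `silver.tb_sales` and `silver.tb_access`. The surrogate keys are numbered from 1 only to imitate an identity column.

```python
from itertools import product

NA = "do not apply"

# Hypothetical sample rows, not real data from the silver zone
sales = [
    {"origin": "API", "payment_method": "boleto", "percent_discount": "5%"},
    {"origin": "API", "payment_method": "credito", "percent_discount": None},
    {"origin": "File", "payment_method": "pix", "percent_discount": "10%"},
    {"origin": "API", "payment_method": "boleto", "percent_discount": "5%"},  # repeated combination
]
access = [{"local_access": "Computer"}, {"local_access": "Mobile"}]

# Cross join, apply the two placeholder rules, then keep distinct combinations
combos = sorted({
    (
        s["origin"],
        NA if s["origin"] == "File" else a["local_access"],
        s["payment_method"],
        s["percent_discount"] if s["percent_discount"] is not None else NA,
    )
    for s, a in product(sales, access)
})

columns = ("user_origin", "access_from", "payment_method", "percent_discount")
dim_cod = [
    {"sk_cod": sk, **dict(zip(columns, combo))}
    for sk, combo in enumerate(combos, start=1)
]

for row in dim_cod:
    print(row)
```

Four sales rows and two access rows collapse into five dimension rows, so a fact table needs a single `sk_cod` foreign key instead of four separate code columns.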


In [0]:
spark.sql("""
    MERGE INTO gold.dim_cod as dest
    using vw_dim_cod as orig
    on dest.user_origin = orig.user_origin
        and dest.payment_method = orig.payment_method
        and dest.access_from = orig.local_access
        and dest.percent_discount = orig.percent_discount

    when not matched 
        then INSERT (
            user_origin,
            access_from,
            payment_method,
            percent_discount,
           dt_load
        )
        values(
            user_origin,
            local_access,
            payment_method,
            percent_discount,
           getdate()
        )       
""").display()

spark.sql("SELECT * FROM gold.dim_cod LIMIT 5").display()

num_affected_rows,num_updated_rows,num_deleted_rows,num_inserted_rows
22,0,0,22


sk_cod,user_origin,access_from,payment_method,percent_discount,dt_load
1,API,Computer,boleto,5%,2024-11-26T01:31:21.901Z
2,API,Computer,boleto,do not apply,2024-11-26T01:31:21.901Z
3,API,Computer,credito,10%,2024-11-26T01:31:21.901Z
4,API,Computer,credito,15%,2024-11-26T01:31:21.901Z
5,API,Computer,credito,5%,2024-11-26T01:31:21.901Z


### 3.4 User Dimension

In [0]:
spark.sql("""
    MERGE INTO gold.dim_user AS dest
    USING silver.tb_user AS orig
        ON dest.user_uuid = orig.user_uuid

    -- Update only when at least one tracked attribute changed
    WHEN MATCHED AND (
            dest.user_email != orig.user_email
            OR dest.user_age != orig.user_age
            OR dest.user_state != orig.user_state
            OR dest.user_profession != orig.user_profession
            OR dest.company != orig.company
        )
        THEN UPDATE
            SET dest.user_email = orig.user_email,
                dest.user_age = orig.user_age,
                dest.user_state = orig.user_state,
                dest.user_profession = orig.user_profession,
                dest.company = orig.company

    -- Target columns follow the gold.dim_user DDL above. The source column
    -- names are assumptions: check them against silver.tb_user before running.
    WHEN NOT MATCHED
        THEN INSERT (
            user_uuid,
            name_user,
            user_email,
            user_age,
            user_gender,
            user_state,
            user_profession,
            company,
            dt_load
        )
        VALUES (
            orig.user_uuid,
            orig.user_name,
            orig.user_email,
            orig.user_age,
            orig.gender,
            orig.user_state,
            orig.user_profession,
            orig.company,
            getdate()
        )
""").display()

spark.sql("SELECT * FROM gold.dim_user LIMIT 5").display()
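One caveat about the `WHEN MATCHED` condition above: a SQL `!=` comparison evaluates to NULL when either side is NULL, so a value that changes to or from NULL would not trigger the update. Spark SQL's null-safe equality operator `<=>` avoids this. The sketch below builds such a predicate as a string. `change_condition` is a hypothetical helper, and the column list is taken from the `SET` clause of the merge.

```python
def change_condition(columns, dest="dest", orig="orig"):
    """Return a null-safe 'any column differs' predicate for a MERGE clause."""
    return "\n    OR ".join(
        f"NOT ({dest}.{col} <=> {orig}.{col})" for col in columns
    )


tracked = ["user_email", "user_age", "user_state", "user_profession", "company"]
condition = change_condition(tracked)
print(condition)
# The string could replace the chain of != comparisons, for example:
#   WHEN MATCHED AND (<condition>) THEN UPDATE SET ...
```

Whether this matters depends on whether the tracked columns in `silver.tb_user` can actually contain NULLs.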

### 3.5 Courses Dimension

We will load **`dim_course`** with the **`MERGE`** command.
* Source: the **`silver.tb_course`** table


In [0]:
spark.sql("""
    merge into gold.dim_course as dest 
        using silver.tb_course as orig
            on dest.course_uuid = orig.course_uuid

    when matched and (
            dest.course_name != orig.course_name
            or dest.course_level != orig.course_level
            or dest.course_price != orig.course_price
        )
        then update 
            set 
                dest.course_name = orig.course_name,
                dest.course_level = orig.course_level,
                dest.course_price = orig.course_price
    when not matched
        then insert (
            course_uuid,
            course_name,
            course_level,
            course_price,
            dt_load
        )
    values (
            course_uuid,
            course_name,
            course_level,
            course_price,
            GETDATE()
    )
""").display()

spark.sql("SELECT * FROM gold.dim_course LIMIT 5").display()