# Loading Data To Gold Zone

**This Notebook:**
* Loads data into the Gold Zone of the Data Lakehouse
* Models the data as a Star Schema and as One Big Table
* Creates an **`IDENTITY`** column in a Databricks Delta table

## 1.0 Initial Setup

In [0]:

%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator" 

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting dbldatagen
  Downloading dbldatagen-0.4.0.post1-py3-none-any.whl (122 kB)
Installing collected packages: dbldatagen
Successfully installed dbldatagen-0.4.0.post1


## 2.0 Create `Gold Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS gold")

DataFrame[]

## 3.0 `Sales Star Schema` Modeling 

To speed up analytical queries over large datasets, we use a dimensional model.
We follow Ralph Kimball's data warehouse principles and build a Star Schema.


### `Dimension Tables`
- **dim_calendar** - Date dimension with calendar attributes
- **dim_cod** - Junk dimension that groups low-cardinality codes:
  - **user_origin** - API vs. Files
  - **access_from** - mobile vs. computer
  - **payment_method** - Pix vs. Boleto vs. Cartão (card)
  - **percent_discount** - 5% vs. 10% vs. 15%
- **dim_course** - Dimension that stores course information.
- **dim_user** - Dimension that stores student information.
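
A junk dimension usually holds one row per combination of its low-cardinality attributes. The sketch below enumerates those combinations in plain Python. It assumes the attribute values are stored exactly as written in the list above, and that every combination is wanted; the actual load may instead keep only the combinations observed in the source data.

```python
from itertools import product

# Attribute values copied from the dim_cod description (assumed literal values).
JUNK_ATTRIBUTES = {
    "user_origin": ["API", "Files"],
    "access_from": ["mobile", "computer"],
    "payment_method": ["Pix", "Boleto", "Cartão"],
    "percent_discount": ["5%", "10%", "15%"],
}


def junk_dimension_rows(attributes):
    """Return one dict per combination of the attribute values."""
    names = list(attributes)
    return [dict(zip(names, combo)) for combo in product(*attributes.values())]


rows = junk_dimension_rows(JUNK_ATTRIBUTES)
print(len(rows))  # 2 * 2 * 3 * 3 = 36 combinations
print(rows[0])
```

Because the attribute count and cardinalities are small, the full cross product stays tiny (36 rows here), which is what makes a single junk dimension cheaper than four separate one-column dimensions.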


All tables have a **Surrogate Key (SK)** column created with the **`<col_name> BIGINT GENERATED ALWAYS AS IDENTITY`** clause. Delta Lake populates this column at write time with a unique, increasing value; by default the sequence starts at 1 and increments by 1, although the generated values are not guaranteed to be consecutive.
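
As an illustration of the identity clause, the sketch below builds the DDL with the start and step written out explicitly, plus an `INSERT` that leaves the surrogate key out of the column list (a `GENERATED ALWAYS` column does not accept user-supplied values). The helper names and the source view `vw_dim_cod` are hypothetical and used only for this example; the functions only build SQL strings, so they run without a cluster.

```python
def identity_table_ddl(table, sk_col, columns, start=1, step=1):
    """Build a CREATE TABLE statement with an explicit identity surrogate key."""
    col_lines = [
        f"{sk_col} BIGINT GENERATED ALWAYS AS IDENTITY "
        f"(START WITH {start} INCREMENT BY {step})"
    ]
    col_lines += [f"{name} {dtype}" for name, dtype in columns]
    body = ",\n    ".join(col_lines)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n    {body}\n)"


def insert_without_sk(table, columns, source):
    """Build an INSERT that lists every column except the identity key."""
    names = ", ".join(name for name, _ in columns)
    return f"INSERT INTO {table} ({names}) SELECT {names} FROM {source}"


# Columns of gold.dim_cod, excluding the surrogate key.
DIM_COD_COLUMNS = [
    ("user_origin", "STRING"),
    ("access_from", "STRING"),
    ("payment_method", "STRING"),
    ("percent_discount", "STRING"),
    ("dt_load", "TIMESTAMP"),
]

ddl = identity_table_ddl("gold.dim_cod", "sk_cod", DIM_COD_COLUMNS)
# "vw_dim_cod" is a hypothetical source view, not defined in this notebook.
insert_sql = insert_without_sk("gold.dim_cod", DIM_COD_COLUMNS, "vw_dim_cod")
print(ddl)
print(insert_sql)
```

On Databricks, each string would be passed to `spark.sql(...)`; the surrogate key is then assigned by Delta Lake during the write.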


### 3.1 `Sales Dimensions`

In [0]:
spark.sql( """
    CREATE TABLE IF NOT EXISTS gold.dim_calendar(
        sk_tempo BIGINT GENERATED ALWAYS AS IDENTITY,
        date DATE,
        year INT, 
        month STRING,
        month_year INT,
        day_week_int INT, 
        day_week STRING,
        fl_day_week BOOLEAN,
        day_month INT,
        fl_last_month_day INT,
        day_year INT,
        week_year INT,
        bimonthly INT,
        quarter INT, 
        semester INT, 
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_cod (
        sk_cod BIGINT GENERATED ALWAYS AS IDENTITY,
        user_origin STRING,
        access_from STRING,
        payment_method STRING,
        percent_discount STRING,
        dt_load TIMESTAMP
    )  
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_course(
        sk_course BIGINT GENERATED ALWAYS AS IDENTITY,
        course_uuid STRING,
        course_name STRING,
        course_level STRING,
        cource_price DECIMAL(9,2),
        dt_carga TIMESTAMP
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.dim_user(
        sk_user BIGINT GENERATED ALWAYS AS IDENTITY,
        user_uuid STRING,
        name_user STRING,
        user_email STRING,
        user_age INT,
        user_gender STRING,
        user_state STRING,
        user_profession STRING,
        company STRING,
        dt_load TIMESTAMP
    )
""")

### 3.2 `Calendar Dimension`  

The view **`vw_dim_tempo`** covers the following date range:
* Start date: **2024-06-01**
* End date: **2025-12-31**

In [0]:
from pyspark.sql.functions import explode,sequence,to_date 

start_date = "2024-06-01"
end_date = "2025-12-31"

spark.sql("""




""")