# Loading Data to Silver Zone

This Notebook:
* We will iIngest data from **Bronze Zone** to **Silver Zone** using spark
* We will use Spark Structured Streaming with **`trigger(availableNow=True)`** for batch loading
* We will do a **load control** of the batch processes through Structured Streaming **checkpoint**
* We will use **`awaitTermination()`**  ethod to transform the streaming queries in a synchronous process
* We will Combine spark and sql in order to do the data load

## 1.0 Initial Setup

In [0]:
%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator" 

In [0]:
%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Load_Bronze_Zone"   

[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.[0m
[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.[0m


## 2.0 Create `Silver Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS silver")

DataFrame[]

data_venda,nome_empresa,sexo,nome_funcionario,email_functionario,profissao,idade,estado,curso,valor,disconto,source_file_name,processing_timestamp
2025-02-26,Empresa A,M,Funcionario c00a3119ac,funcionario_c00a3119ac@empresaa.com.br,Cientista de Dados,44,RO,Construindo o seu Primeiro Pipeline de Dados com o Databricks,"R$ 789,90",5%,part-00000-tid-577086507526502122-e3723548-ae8b-4131-841b-c65d35b11604-0-1-c000.csv,2024-11-28T09:23:08.677Z
2024-07-01,Empresa A,F,Funcionario f2e3eedb83,funcionario_f2e3eedb83@empresaa.com.br,Desenvolvedor de ETL,31,AC,Do Primeiro Pipeline ao Data Lakehouse com o Databricks,"R$ 689,90",5%,part-00000-tid-577086507526502122-e3723548-ae8b-4131-841b-c65d35b11604-0-1-c000.csv,2024-11-28T09:23:08.677Z
2025-11-23,Empresa A,M,Funcionario a86bec1438,funcionario_a86bec1438@empresaa.com.br,Desenvolvedor de ETL,19,AM,Construindo Pipelines de Dados usando o Spark Structured Streaming,"R$ 549,90",5%,part-00000-tid-577086507526502122-e3723548-ae8b-4131-841b-c65d35b11604-0-1-c000.csv,2024-11-28T09:23:08.677Z
2025-04-07,Empresa A,F,Funcionario 3d824a84e7,funcionario_3d824a84e7@empresaa.com.br,Analista de Dados,21,RR,Construindo o seu Primeiro Pipeline de Dados com o Databricks,"R$ 789,90",5%,part-00000-tid-577086507526502122-e3723548-ae8b-4131-841b-c65d35b11604-0-1-c000.csv,2024-11-28T09:23:08.677Z
2024-08-10,Empresa A,M,Funcionario 34eb9d299a,funcionario_34eb9d299a@empresaa.com.br,Arquiteto de Dados,44,PA,Do Primeiro Pipeline ao Data Lakehouse com o Databricks,"R$ 689,90",5%,part-00000-tid-577086507526502122-e3723548-ae8b-4131-841b-c65d35b11604-0-1-c000.csv,2024-11-28T09:23:08.677Z


Total of rows: 100000
File Name: part-00000-tid-577086507526502122-e3723548-ae8b-4131-841b-c65d35b11604-0-1-c000.csv


Qnt of rows in bromze.file_data table: 100000


stop_all_streams-inicio-2024-11-28 09:23:40.463751
O stream display_query_1 fui finalizado com sucesso.
O stream generate_api_stream_data fui finalizado com sucesso.
O stream None fui finalizado com sucesso.
stop_all_streams-fim-2024-11-28 09:23:42.552468
              
clean_up_landing_dir-inicio-2024-11-28 09:23:42.552570
Todos os arquivos e diretórios dentro de 'dbfs:/FileStore/landing/' foram excluidos com sucesso.
clean_up_landing_dir-fim-2024-11-28 09:23:43.647624
              


## 3.0 Businesse Requirements for Silver Zone

1. The ingestion need to be done in batch in order to avoid extra costs 
    * Even though the API data is stored in streaming in landing zone 
2. API and Batch Data need to be stored in the same table 
3. Each table will have an uuid column with a hash to identify each register
4. We have to garantee the correct data type of all column 
5. We need to create better column names for each table 

## 4.0 Data Modeling

### 4.1 Courses Table  ( Domain Table)

The table `tb_courses` is a **domain table** and we will store all the available courses information ( the product ). Its information will be added manually.
* We will use the **md5()** function to create the **curso_uuuid** column by the course name 
* The column **data_carga** : Contains the processing date

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/tb_courses

In [0]:
spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.tb_courses
  AS
    SELECT  
      md5('Construindo o seu Primeiro Pipeline de Dados com o Databricks') AS course_uuid,
      'Construindo o seu Primeiro Pipeline de Dados com o Databricks' AS course_name,
      'beginner' AS course_level,
      589.90 AS course_price,
      getdate() AS dt_load

      UNION

    SELECT
      md5('Do Primeiro Pipeline ao Data Lakehouse com o Databricks') AS course_uuid,
      'Do Primeiro Pipeline ao Data Lakehouse com o Databricks' AS course_name,
      'intermediate' as course_level,
      659.90 AS course_price,
      getdate() AS dt_load


      UNION

    SELECT
      md5('Construindo Pipelines de Dados usando o Spark Structured Streaming') AS course_uuid,
      'Construindo Pipelines de Dados usando o Spark Structured Streaming' as course_name,
      'advanced' as course_level,
      549.90 as course_price,
      getdate() as dt_load
"""
)


spark.sql('SELECT * FROM silver.tb_courses').display()

course_uuid,course_name,course_level,course_price,dt_load
f260cd97c6c9813b01601e834a2added,Construindo o seu Primeiro Pipeline de Dados com o Databricks,beginner,589.9,2024-11-29T15:37:40.47Z
34bdd77f6954552d11c4f5547cb41458,Do Primeiro Pipeline ao Data Lakehouse com o Databricks,intermediate,659.9,2024-11-29T15:37:40.47Z
c2d6bcbc3e46555bb1e7e9afbc24d3af,Construindo Pipelines de Dados usando o Spark Structured Streaming,advanced,549.9,2024-11-29T15:37:40.47Z


We will now create two streaming views called **`stream_temp_vw_api`**  and **`stream_temp_vw_files`** that will be used as source data for our loading process.

In [0]:
api_df = spark.readStream.table('bronze.api_data')
api_df.createOrReplaceTempView('stream_temp_vw_api')

files_df = spark.readStream.table('bronze.file_data')
files_df.createOrReplaceTempView('stream_temp_vw_files')

### 4.2 Access Table  

This table stores all the website access.

**`df_access`** : Dataframe that used to load data into **`tb_access`** table. This dataframe stores all information regarding the website visitors and its information comes from API. 

**Columns:**
* **`acesso_uuid` column** : Created with the **`md5()`** function  by **`concatenating`** the columns below:
  * **`access_date`** - After being converted to Timestamp
  * **`ip_address`**. - ip address of the computer 
  * **`access_point`** -Identify the access point as mobile or computer (**local_acesso**).
* **`usuario_uuid` column**: Created with the **`md5()`** function  by **`concatenating`** the columns below:
  * **`access_date`** - After being converted to Timestamp
  * **`payload.info_usuario.nome`** - name of the user
* **`data_carga` column**: The processing date of the register 

In [0]:
%sql
select * from stream_temp_vw_api limit 2

access_date,access_point,ip_address,payload,_rescued_data,source_file_name,processing_timestamp
2024-06-02T10:51:16.000Z,safari,69.127.75.83,"List(null, null, null)",,part-00001-487e7b86-4de6-4481-89dd-9509c10ba409-c000.json,2024-11-29T13:48:50.178Z
2024-06-02T11:24:12.000Z,android,168.18.37.100,"List(null, null, null)",,part-00001-487e7b86-4de6-4481-89dd-9509c10ba409-c000.json,2024-11-29T13:48:50.178Z


In [0]:
df_access = spark.sql(""" 
    SELECT 
      CAST( access_date AS TIMESTAMP) AS access_timestamp,
      ip_address AS access_ip_address,
      CASE WHEN access_point IN ('iphone','android') THEN 'Mobile' ELSE 'Computer' END AS local_access,
      md5(concat(
        CAST(access_date AS TIMESTAMP),
        ip_address,
        CASE WHEN access_point IN ('iphne', 'android') THEN 'Mobile' ELSE 'Computer' END
      )) AS access_uuid,
      md5(concat(
          CAST( access_date AS TIMESTAMP),
          payload.info_usuario.nome        
         )) AS user_uuid,
        payload.info_produto.product_uuid AS course_uuid,
      getdate() AS dt_load
    FROM stream_temp_vw_api
"""
)

df_access.limit(5).display()

access_timestamp,access_ip_address,local_access,access_uuid,user_uuid,course_uuid,dt_load
2024-06-02T10:51:16Z,69.127.75.83,Computer,d6894f8d0be6fc07738efd35ad4d076d,,,2024-11-29T15:17:48.667Z
2024-06-02T11:24:12Z,168.18.37.100,Mobile,9282482ea2d2753cc2fca9fe45584434,,,2024-11-29T15:17:48.667Z
2024-06-02T11:57:08Z,113.109.66.208,Computer,8f8888544fef2ab31eba86ac2439643f,,,2024-11-29T15:17:48.667Z
2024-06-02T12:30:04Z,87.241.252.59,Computer,0b187aed08deadfc694ca80bb5beb8c6,,,2024-11-29T15:17:48.667Z
2024-06-02T13:03:00Z,188.111.120.11,Mobile,8954acd1862b490bac52b46314bb1327,,,2024-11-29T15:17:48.667Z


We have just read the stream data from our API. 

Important note about spark:
* The origin and destination of the should have the same caracteristics. In other words:
  * `Source` : Stream data and `Destination`: Stream data
  * `Source` : Batch data and `Destionation`: batch data

In other to overcome this issue, we will use `Spark Structured Streaming` to read the sterming data in batch.

We use `Spark Structured Streaming` to load micro batch of data.

`Spark Structured Streaming` :
* No need to manage a checkpoint table to identify data that have been loead
* The `Spark Structured Streaming` uses a **`checkpoint directory`** defined by the writeStream method. This checkpoint stores the last file/offset/row  that have been stored.This way, in case of failing the process, the spark will be able to garantee the **Stream Exactly-Once Semantics**. In other words, the checkpoint is responsible for controlling the load as it should be loaded.
* **trigger(availableNow == True)** : States spark to do the load in batch by using the Structured Streaming Process. That way, spark will ingest the data in micro-batches. After finishing all mapped data 


O Spark Structured Streaming permite o uso do tipo de **trigger availableNow**. Quando definido como **True** dentro do método **`trigger`** no método **`writeStream`** indicará ao Spark que **realize a carga de dados em Batch** usando o processo do Structured Streaming. O spark irá realizar a leitura de todos os registros disponíveis para carga e irá realizar a ingestão de todos esses dados em micro-batchs. Ao terminar a execução de todos os registros mapeados no início do processo de carga, o **Spark will stop the  Stream query automatically.**.  



* **`.writeStream`** : Stores the dataframe data into the **silver.tb_access** table which will be created on-the-fly through the **`.table()`** method 
* **`.outputMode('append')`**: States the the data will be appended in the destiny table
* **`option('CheckpointLocation', access_checkpoint_location)`**: Defines the diretory where the spark will use to control the streaming data and perform the `exactly-once delivery`
* **`.trigger(availableNow=True)`**: States that the writeStream process will be performed in batch.
* **`.awaitTermination()`**: Makes the stream query a synchronous process. Used when the **`availableNow`** parameter is equal to **`True`**


In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/tb_access

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_access

In [0]:
access_checkpoint_location = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_access'
(
    df_access.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation', access_checkpoint_location)
        .trigger(availableNow = True)
        .table('silver.tb_access').awaitTermination()
)



spark.sql('SELECT * FROM silver.tb_access LIMIT 5').display()

access_timestamp,access_ip_address,local_access,access_uuid,user_uuid,course_uuid,dt_load
2024-06-02T10:51:16Z,69.127.75.83,Computer,d6894f8d0be6fc07738efd35ad4d076d,,,2024-11-29T15:18:32.449Z
2024-06-02T11:24:12Z,168.18.37.100,Mobile,9282482ea2d2753cc2fca9fe45584434,,,2024-11-29T15:18:32.449Z
2024-06-02T11:57:08Z,113.109.66.208,Computer,8f8888544fef2ab31eba86ac2439643f,,,2024-11-29T15:18:32.449Z
2024-06-02T12:30:04Z,87.241.252.59,Computer,0b187aed08deadfc694ca80bb5beb8c6,,,2024-11-29T15:18:32.449Z
2024-06-02T13:03:00Z,188.111.120.11,Mobile,8954acd1862b490bac52b46314bb1327,,,2024-11-29T15:18:32.449Z


### 4.3 Users Table 

Important Note:
* Even though we have two different sources ( API and Files) building the same user table, we still  have to identify two different checkpoint directory for each of the source.


### 4.3.1 API users data

* **`df_users_api`**: contains all the user information from the API data. 
* **`user_uuid`** : We will use the **`md5()`** function to create the `user_uuid`. To do that, we will concat the columns below:
  * **`access_date`** - we will cast to timestamp
  * **`payload.info_usuario.nome`** 
* **`origin`**: Informs wheather the data is from API or file ( data vault modelling principle)
* **`dt_load`**: The process date

In [0]:
df_user_api = spark.sql("""
          SELECT 
            md5(concat(
            cast(access_date AS TIMESTAMP),
            payload.info_usuario.nome
            )) AS user_uuid,
            payload.info_usuario.nome AS user_name,
            payload.info_usuario.email AS user_email,
            CAST(payload.info_usuario.idade  AS INT) AS user_idade,
            payload.info_usuario.sexo as user_gender,
            payload.info_usuario.estado AS user_state,
            payload.info_usuario.profissao AS user_profession,
            CAST(NULL AS STRING) AS company,
            'API' as origin,
            getdate() as dt_load
            FROM stream_temp_vw_api
            WHERE payload.info_usuario IS NOT NULL
""")

df_user_api.limit(5).display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
b743fc5f984f623958d0ea0bc0edafef,Usuario 15ce4bb4b2,usuario_15ce4bb4b2@uol.com,21,F,MS,Cientista de Dados,,API,2024-11-29T13:57:52.384Z
9e26cd7d73ab748a2e84d60f63e37d1e,Usuario f9db37a838,usuario_f9db37a838@gmail.com,37,M,PE,Arquiteto de Dados,,API,2024-11-29T13:57:52.384Z
b4087b3b31810e76478bf36d15a5a004,Usuario 3c3d30dcfd,usuario_3c3d30dcfd@uol.com,41,M,PR,Desenvolvedor de Sistemas,,API,2024-11-29T13:57:52.384Z
6b6f45a7ba1a53661b2ec7344d9e7e60,Usuario 412bbf0591,usuario_412bbf0591@outlook.com,23,F,SE,Desenvolvedor de ETL,,API,2024-11-29T13:57:52.384Z
6505bcb29d968af8150ddd9082899312,Usuario 1dadb89898,usuario_1dadb89898@gmail.com,20,F,RR,Desenvolvedor de ETL,,API,2024-11-29T13:57:52.384Z


* **`option('mergeSchema',True)`** Indicates the process of **Schema Evolution**.

* **`.trigger(availableNow=True)`** States that the writeStream occours in  **batch**.

* **`.awaitTermination()`**  Makes the stream query to be a synchronous 


In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_users

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/tb_users

In [0]:
user_api_checkpoint_path = "dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_users"
(
  df_user_api.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation',user_api_checkpoint_path)
        .option('mergeSchema', True)
        .trigger(availableNow=True)
        .table('silver.tb_users')
).awaitTermination()

spark.sql("SELECT * FROM silver.tb_users WHERE origin = 'API' LIMIT 5").display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
b743fc5f984f623958d0ea0bc0edafef,Usuario 15ce4bb4b2,usuario_15ce4bb4b2@uol.com,21,F,MS,Cientista de Dados,,API,2024-11-29T13:58:41.602Z
9e26cd7d73ab748a2e84d60f63e37d1e,Usuario f9db37a838,usuario_f9db37a838@gmail.com,37,M,PE,Arquiteto de Dados,,API,2024-11-29T13:58:41.602Z
b4087b3b31810e76478bf36d15a5a004,Usuario 3c3d30dcfd,usuario_3c3d30dcfd@uol.com,41,M,PR,Desenvolvedor de Sistemas,,API,2024-11-29T13:58:41.602Z
6b6f45a7ba1a53661b2ec7344d9e7e60,Usuario 412bbf0591,usuario_412bbf0591@outlook.com,23,F,SE,Desenvolvedor de ETL,,API,2024-11-29T13:58:41.602Z
6505bcb29d968af8150ddd9082899312,Usuario 1dadb89898,usuario_1dadb89898@gmail.com,20,F,RR,Desenvolvedor de ETL,,API,2024-11-29T13:58:41.602Z


### 4.3.2 Files users data


We will now create a **`df_users_file`** that will be used to ingest users data from all batch files.

* **`usuario_uuid`** : We will use the `md5()` function with concat to create the user uuid
  * **`nome_empresa`**.
  * **`nome_funcionario`**.
* **`origin`** :
* **`dt_loat`**:

In [0]:
df_users_file = spark.sql("""
      SELECT   md5(CONCAT(
            nome_empresa,
            nome_funcionario
        )) AS user_uuid,
        nome_funcionario AS user_name,
        email_functionario AS user_email,
        CAST(idade AS INT) AS user_idade,
        sexo AS user_gender,
        estado as user_state,
        profissao as user_profession,  
        nome_empresa AS company,
        'FILE'  as origin,
        getdate() as dt_load       
        FROM stream_temp_vw_files
""")

df_users_file.limit(5).display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
984e334f07cddc9ad2640420af9a5345,Funcionario 9ea557ec22,funcionario_9ea557ec22@empresaa.com.br,38,F,ES,Arquiteto de Dados,Empresa A,FILE,2024-11-29T13:59:10.918Z
45895efd021febd9a70f246e3d99adb4,Funcionario 004dfe5038,funcionario_004dfe5038@empresaa.com.br,46,F,RJ,Arquiteto de Dados,Empresa A,FILE,2024-11-29T13:59:10.918Z
1e09fc978496b61c33e8d2ca831dae6e,Funcionario 7df9309e4f,funcionario_7df9309e4f@empresaa.com.br,36,F,SP,Analista de BI,Empresa A,FILE,2024-11-29T13:59:10.918Z
f6bc71b39ec053409d3ea68b6a7a816f,Funcionario 109a634962,funcionario_109a634962@empresaa.com.br,48,M,PR,Cientista de Dados,Empresa A,FILE,2024-11-29T13:59:10.918Z
865cd17c600cf8c81200b91e04e7c6f5,Funcionario fafcb19dae,funcionario_fafcb19dae@empresaa.com.br,48,M,SC,Cientista de Dados,Empresa A,FILE,2024-11-29T13:59:10.918Z


In [0]:
user_files_checkpoint_files = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/files/tb_users'
(
df_users_file.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation',user_files_checkpoint_files)
        .option('mergeSchema',True)
        .trigger(availableNow=True)
        .table('silver.tb_users')
).awaitTermination()

spark.sql("SELECT * FROM silver.tb_users WHERE origin = 'FILE' LIMIT 5").display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
e783e39b9a202f3c75199676ed309876,Funcionario ee2d2a7774,funcionario_ee2d2a7774@empresaa.com.br,38,F,PA,Analista de Negocio,Empresa A,FILE,2024-11-29T14:00:14.792Z
c82b6ce03e00c77ab7e8fc1e64dfc6e4,Funcionario 3df40f8365,funcionario_3df40f8365@empresaa.com.br,20,M,AP,Analista de BI,Empresa A,FILE,2024-11-29T14:00:14.792Z
a33a0ab22834df686bcfe3797602af4a,Funcionario 574315e0f4,funcionario_574315e0f4@empresaa.com.br,45,M,TO,Analista de Dados,Empresa A,FILE,2024-11-29T14:00:14.792Z
9233664dd866ac4d3eff10d73b537086,Funcionario 855b095c21,funcionario_855b095c21@empresaa.com.br,31,F,MA,Cientista de Dados,Empresa A,FILE,2024-11-29T14:00:14.792Z
d84d335fdbb0dfb5c9135be223911dd4,Funcionario e3206f6f30,funcionario_e3206f6f30@empresaa.com.br,44,M,PI,Analista de BI,Empresa A,FILE,2024-11-29T14:00:14.792Z


### 4.4 Sales Table 

We will load sales information fom the streming and API data and create the dataframes **`df_sales_api`** and **`df_sales_file`**. The first one with API sales data and the second containing file sales data.

1. **`df_sales_api`** 
* **`acesso_uuid`** We will use the**`md5()`** function with **`concat`** to create the uuid with the columns below:
  * **`access_date`** - cast to timestamp
  * **`ip_address`**.
  * **`access_point`** -  mobile or Computador (**local_acesso**).
* **`usuario_uuid`**  **`md5()`** + **`concat`** to create the uuid by concatenating the columns below:
  * **`access_date`** - cast to timestamp
  * **`payload.info_usuario.nome`**.
* **`total_value`**, **`percent_descount`** and **`descount_value`** treated
* **`origin`** API or FILE
* **`td_load`** the loading date

#### 4.4.1 Sales API Data


In [0]:
df_sales_api = spark.sql("""
      SELECT 
        CAST( access_date AS TIMESTAMP) AS dt_sale,
        md5(concat(
          CAST(access_date AS TIMESTAMP),
          ip_address,
          CASE WHEN access_point IN ('iphone', 'android') THEN 'Mobile' ELSE 'Computer' END
        )) AS access_uuid,
        md5(concat(
          cast(access_date as TIMESTAMP),
          payload.info_usuario.nome
       )) user_uuid,
        payload.info_produto.product_uuid AS course_uuid,
        payload.info_pagamento.forma_pagamento as payment_method, 
        CAST(payload.info_pagamento.quantidade_parcelas AS INT) AS qnt_instalments,  
        CAST(payload.info_pagamento.valor_parcelas as DECIMAL(9,2) ) AS instalments_values,
        CAST(payload.info_pagamento.quantidade_parcelas * payload.info_pagamento.valor_parcelas AS DECIMAL(9,2)) AS total_value, 
        concat(CAST((payload.info_pagamento.disconto*100) AS INT ), '%' ) AS percent_discount ,
        CAST( replace(substr(payload.info_produto.valor,4),',','.' ) * payload.info_pagamento.disconto AS DECIMAL(9,2)) as discount_value,
        'API' as origin,
        getdate() as dt_load     
      FROM stream_temp_vw_api
      WHERE payload.info_pagamento IS NOT NULL       
                         
                         
""")

df_sales_api.limit(5).display()

dt_sale,access_uuid,user_uuid,course_uuid,payment_method,qnt_instalments,instalments_values,total_value,percent_discount,discount_value,origin,dt_load
2024-06-02T21:17:00Z,c2bf54a92f643d4b800b026c29aaa91d,9e26cd7d73ab748a2e84d60f63e37d1e,f260cd97c6c9813b01601e834a2added,credito,10,58.99,589.9,,,API,2024-11-29T14:00:51.766Z
2024-06-02T21:49:56Z,0c7b85d124e6762b3d1345e1e8d4a90b,b4087b3b31810e76478bf36d15a5a004,c2d6bcbc3e46555bb1e7e9afbc24d3af,credito,10,54.99,549.9,,,API,2024-11-29T14:00:51.766Z
2024-06-03T02:13:24Z,f1d885fcd44adf54b836f44480c900d7,6505bcb29d968af8150ddd9082899312,f260cd97c6c9813b01601e834a2added,credito,2,280.2,560.4,5%,29.5,API,2024-11-29T14:00:51.766Z
2024-06-03T17:02:36Z,199d1865c4941093c38b780228597cc3,6e0be9ade7feb0c9609799b8468cd3c2,f260cd97c6c9813b01601e834a2added,credito,10,56.04,560.4,5%,29.5,API,2024-11-29T14:00:51.766Z
2024-06-01T20:35:00Z,61ee98d03d8969ca8ea80e0e14d75a28,9b43b24a68017eee83bfdcd98d5a3460,f260cd97c6c9813b01601e834a2added,boleto,1,589.9,589.9,,,API,2024-11-29T14:00:51.766Z


In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/tb_sales

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_sales

In [0]:
sales_api_checkout_path = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_sales'
(
df_sales_api.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation',sales_api_checkout_path)
        .option('mergeSchema','true')
        .trigger(availableNow = True)
        .toTable('silver.tb_sales')
).awaitTermination()

spark.sql("select * from silver.tb_sales where origin ='API' limit 5")

DataFrame[dt_sale: timestamp, access_uuid: string, user_uuid: string, course_uuid: string, payment_method: string, qnt_instalments: int, instalments_values: decimal(9,2), total_value: decimal(9,2), percent_discount: string, discount_value: decimal(9,2), origin: string, dt_load: timestamp]

#### 4.4.1 Sales File Data 


In [0]:
df_sales_file = spark.sql("""
        SELECT 
          CAST(data_venda as TIMESTAMP) as dt_sale,
          CAST(NULL AS STRING) AS access_uuid,
          md5(concat(
            nome_empresa,
            nome_funcionario
             )) AS user_uuid,
          md5(curso) AS course_uuid,
          'pix' AS payment_method,
          1 AS qnt_instalments,
          CAST((replace(substr(valor,4), ',','.') - CAST(replace(substr(valor,4), ',','.' * (replace(disconto,'%','') / 100)) AS DECIMAL(9,2))) AS DECIMAL(9,2)) AS instalments_values,
          CAST((replace(substr(valor,4),',','.') - CAST((replace(substr(valor,4),',','.') * (replace(disconto,'%','')/100)) AS DECIMAL(9,2))) AS DECIMAL(9,2)) AS total_value,
          disconto AS percent_discount,
          CAST((replace(substr(valor,4),',','.') * (replace(disconto,'%','')/100)) AS DECIMAL(9,2)) AS discount_value,
          'FILE' as origin,
          getdate() as dt_load
          FROM stream_temp_vw_files                       
""")

df_sales_file.limit(5).display()

dt_sale,access_uuid,user_uuid,course_uuid,payment_method,qnt_instalments,instalments_values,total_value,percent_discount,discount_value,origin,dt_load
2025-04-07T00:00:00Z,,984e334f07cddc9ad2640420af9a5345,c2d6bcbc3e46555bb1e7e9afbc24d3af,pix,1,,522.4,5%,27.5,FILE,2024-11-29T14:04:30.79Z
2025-10-04T00:00:00Z,,45895efd021febd9a70f246e3d99adb4,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-29T14:04:30.79Z
2024-10-19T00:00:00Z,,1e09fc978496b61c33e8d2ca831dae6e,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-29T14:04:30.79Z
2025-09-04T00:00:00Z,,f6bc71b39ec053409d3ea68b6a7a816f,c2d6bcbc3e46555bb1e7e9afbc24d3af,pix,1,,522.4,5%,27.5,FILE,2024-11-29T14:04:30.79Z
2025-01-27T00:00:00Z,,865cd17c600cf8c81200b91e04e7c6f5,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-29T14:04:30.79Z


* **`.writeStream`** and **`.table()`**: loads data from **`df_sales_file`** to the **silver.tb_sales** table 
* **`outputMode('append')`**: States that the data will be appended 
* **`option('CheckpointLocation', file_checkpoint_path)`**: Defines the checkpoint diretory where Spark will control the streaming process with **`exactly-once-delivery`**
* **`option('mergeSchema',True)`**: Indicates a **Schema Evolution** processing
* **`trigger(availableNow=True)`**: States that the process will occcour in batches
* **`.trigger(availableNow=True)`**: Makes the query a scyncronous process

In [0]:
file_checkpoint_path = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/arquivo/tb_sales'
(

df_sales_file.writeStream
      .format('delta')
      .outputMode('append')
      .option('mergeSchema', 'true')
      .option('CheckpointLocation', file_checkpoint_path)
      .trigger(availableNow=True)
      .table('silver.tb_sales')
).awaitTermination()

spark.sql("SELECT * FROM silver.tb_sales WHERE origin = 'FILE' LIMIT 5").display()

dt_sale,access_uuid,user_uuid,course_uuid,payment_method,qnt_instalments,instalments_values,total_value,percent_discount,discount_value,origin,dt_load
2024-09-19T00:00:00Z,,e783e39b9a202f3c75199676ed309876,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-29T14:05:42.578Z
2025-01-27T00:00:00Z,,c82b6ce03e00c77ab7e8fc1e64dfc6e4,c2d6bcbc3e46555bb1e7e9afbc24d3af,pix,1,,522.4,5%,27.5,FILE,2024-11-29T14:05:42.578Z
2024-07-31T00:00:00Z,,a33a0ab22834df686bcfe3797602af4a,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-29T14:05:42.578Z
2025-01-07T00:00:00Z,,9233664dd866ac4d3eff10d73b537086,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-29T14:05:42.578Z
2024-11-28T00:00:00Z,,d84d335fdbb0dfb5c9135be223911dd4,c2d6bcbc3e46555bb1e7e9afbc24d3af,pix,1,,522.4,5%,27.5,FILE,2024-11-29T14:05:42.578Z


In [0]:
stop_all_streams()

stop_all_streams-inicio-2024-11-29 14:06:41.414082
O stream display_query_7 fui finalizado com sucesso.
O stream display_query_5 fui finalizado com sucesso.
O stream display_query_2 fui finalizado com sucesso.
O stream display_query_3 fui finalizado com sucesso.
O stream display_query_6 fui finalizado com sucesso.
O stream display_query_4 fui finalizado com sucesso.
stop_all_streams-fim-2024-11-29 14:06:42.818463
              
