# Loading Data to Silver Zone

This Notebook:
* We will iIngest data from **Bronze Zone** to **Silver Zone** using spark
* We will use Spark Structured Streaming with **`trigger(availableNow=True)`** for batch loading
* We will do a **load control** of the batch processes through Structured Streaming **checkpoint**
* We will use **`awaitTermination()`**  ethod to transform the streaming queries in a synchronous process
* We will Combine spark and sql in order to do the data load

## 1.0 Initial Setup

In [0]:
%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator" 

In [0]:
#%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Load_Bronze_Zone"   

## 2.0 Create `Silver Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS silver")

DataFrame[]

## 3.0 Businesse Requirements for Silver Zone

1. The ingestion need to be done in batch in order to avoid extra costs 
    * Even though the API data is stored in streaming in landing zone 
2. API and Batch Data need to be stored in the same table 
3. Each table will have an uuid column with a hash to identify each register
4. We have to garantee the correct data type of all column 
5. We need to create better column names for each table 

## 4.0 Data Modeling

### 4.1 Courses Table  ( Domain Table)

The table `tb_courses` is a **domain table** and we will store all the available courses information ( the product ). Its information will be added manually.
* We will use the **md5()** function to create the **curso_uuuid** column by the course name 
* The column **data_carga** : Contains the processing date

In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/tb_courses

In [0]:
spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.tb_courses
  AS
    SELECT  
      md5('Data Pipeline with Databricks') AS course_uuid,
      'Data Pipeline with Databricks' AS course_name,
      'beginner' AS course_level,
      589.90 AS course_price,
      getdate() AS dt_load

      UNION

    SELECT
      md5('From your first data pipeline to a Data Lakehouse with Databricks') AS course_uuid,
      'From your first data pipeline to a Data Lakehouse with Databricks' AS course_name,
      'intermediate' as course_level,
      659.90 AS course_price,
      getdate() AS dt_load


      UNION

    SELECT
      md5('Building a Data Pipeline with Spark Structured Streaming') AS course_uuid,
      'Building a Data Pipeline with Spark Structured Streaming' as course_name,
      'advanced' as course_level,
      549.90 as course_price,
      getdate() as dt_load
"""
)


spark.sql('SELECT * FROM silver.tb_courses').display()

course_uuid,course_name,course_level,course_price,dt_load
fb95df132ca7f41d392bc98ccf0cfeb8,Data Pipeline with Databricks,beginner,589.9,2024-11-23T09:43:09.142Z
bda125b01c9596e123e5f9b3bf00f3a8,From your first data pipeline to a Data Lakehouse with Databricks,intermediate,659.9,2024-11-23T09:43:09.142Z
ff17869bc6f9d9865e0bf8133c4ce3c3,Building a Data Pipeline with Spark Structured Streaming,advanced,549.9,2024-11-23T09:43:09.142Z


We will now create two streaming views called **`stream_temp_vw_api`**  and **`stream_temp_vw_files`** that will be used as source data for our loading process.

In [0]:
api_df = spark.readStream.table('bronze.api_data')
api_df.createOrReplaceTempView('stream_temp_vw_api')

files_df = spark.readStream.table('bronze.file_data')
files_df.createOrReplaceTempView('stream_temp_vw_files')

### 4.2 Access Table  

This table stores all the website access.

**`df_access`** : Dataframe that used to load data into **`tb_access`** table. This dataframe stores all information regarding the website visitors and its information comes from API. 

**Columns:**
* **`acesso_uuid` column** : Created with the **`md5()`** function  by **`concatenating`** the columns below:
  * **`access_date`** - After being converted to Timestamp
  * **`ip_address`**. - ip address of the computer 
  * **`access_point`** -Identify the access point as mobile or computer (**local_acesso**).
* **`usuario_uuid` column**: Created with the **`md5()`** function  by **`concatenating`** the columns below:
  * **`access_date`** - After being converted to Timestamp
  * **`payload.info_usuario.nome`** - name of the user
* **`data_carga` column**: The processing date of the register 

In [0]:
%sql
select * from stream_temp_vw_api limit 2

access_date,access_point,ip_address,payload,_rescued_data,source_file_name,processing_timestamp
2024-06-02T04:20:11.000Z,safari,69.127.75.83,"List(null, null, List(usuario_022d744bd9@hotmail.com, PI, 20, Usuario 022d744bd9, Arquiteto de Dados, F))",,part-00004-a3c1111d-bc3c-4dda-9b95-fc4fb36260cd-c000.json,2024-11-23T09:37:34.673Z
2024-06-02T04:53:07.000Z,android,168.18.37.100,"List(null, null, null)",,part-00004-a3c1111d-bc3c-4dda-9b95-fc4fb36260cd-c000.json,2024-11-23T09:37:34.673Z


In [0]:
df_access = spark.sql(""" 
    SELECT 
      CAST( access_date AS TIMESTAMP) AS access_timestamp,
      ip_address AS access_ip_address,
      CASE WHEN access_point IN ('iphone','android') THEN 'Mobile' ELSE 'Computer' END AS local_access,
      md5(concat(
        CAST(access_date AS TIMESTAMP),
        ip_address,
        CASE WHEN access_point IN ('iphne', 'android') THEN 'mobile' ELSE 'computer' END
      )) AS access_uuid,
      md5(concat(
          CAST( access_date AS TIMESTAMP),
          payload.info_usuario.nome        
         )) AS user_uuid,
        payload.info_produto.product_uuid AS course_uuid,
      getdate() AS dt_load
    FROM stream_temp_vw_api
"""
)

df_access.limit(5).display()

access_timestamp,access_ip_address,local_access,access_uuid,user_uuid,course_uuid,dt_load
2024-06-02T04:20:11Z,69.127.75.83,Computer,58b30dcf3f31a346a1e8309ffbdf2887,f4044489ec92e523c515291487ba1415,,2024-11-23T09:45:13.265Z
2024-06-02T04:53:07Z,168.18.37.100,Mobile,2558841a58daf7b7d6bce7237c9c2321,,,2024-11-23T09:45:13.265Z
2024-06-02T05:26:03Z,113.109.66.208,Computer,643b450160c408db38b3f5ec5311f3b7,,,2024-11-23T09:45:13.265Z
2024-06-02T05:58:59Z,87.241.252.59,Computer,4f10a5f62ae52a8f5e8ac2b05d782963,5a80e866b5b0a425d501536855566096,,2024-11-23T09:45:13.265Z
2024-06-02T06:31:55Z,188.111.120.11,Mobile,1719c3662fb07ac4c153e25e003799a5,,,2024-11-23T09:45:13.265Z


We have just read the stream data from our API. 

Important note about spark:
* The origin and destination of the should have the same caracteristics. In other words:
  * `Source` : Stream data and `Destination`: Stream data
  * `Source` : Batch data and `Destionation`: batch data

In other to overcome this issue, we will use `Spark Structured Streaming` to read the sterming data in batch.

We use `Spark Structured Streaming` to load micro batch of data.

`Spark Structured Streaming` :
* No need to manage a checkpoint table to identify data that have been loead
* The `Spark Structured Streaming` uses a **`checkpoint directory`** defined by the writeStream method. This checkpoint stores the last file/offset/row  that have been stored.This way, in case of failing the process, the spark will be able to garantee the **Stream Exactly-Once Semantics**. In other words, the checkpoint is responsible for controlling the load as it should be loaded.
* **trigger(availableNow == True)** : States spark to do the load in batch by using the Structured Streaming Process. That way, spark will ingest the data in micro-batches. After finishing all mapped data 


O Spark Structured Streaming permite o uso do tipo de **trigger availableNow**. Quando definido como **True** dentro do método **`trigger`** no método **`writeStream`** indicará ao Spark que **realize a carga de dados em Batch** usando o processo do Structured Streaming. O spark irá realizar a leitura de todos os registros disponíveis para carga e irá realizar a ingestão de todos esses dados em micro-batchs. Ao terminar a execução de todos os registros mapeados no início do processo de carga, o **Spark will stop the  Stream query automatically.**.  



* **`.writeStream`** : Stores the dataframe data into the **silver.tb_access** table which will be created on-the-fly through the **`.table()`** method 
* **`.outputMode('append')`**: States the the data will be appended in the destiny table
* **`option('CheckpointLocation', access_checkpoint_location)`**: Defines the diretory where the spark will use to control the streaming data and perform the `exactly-once delivery`
* **`.trigger(availableNow=True)`**: States that the writeStream process will be performed in batch.
* **`.awaitTermination()`**: Makes the stream query a synchronous process. Used when the **`availableNow`** parameter is equal to **`True`**


In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_access

In [0]:
access_checkpoint_location = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_access'
(
    df_access.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation', access_checkpoint_location)
        .trigger(availableNow = True)
        .table('silver.tb_access').awaitTermination()
)



spark.sql('SELECT * FROM silver.tb_access LIMIT 5').display()

access_timestamp,access_ip_address,local_access,access_uuid,user_uuid,course_uuid,dt_load
2024-06-02T04:20:11Z,69.127.75.83,Computer,58b30dcf3f31a346a1e8309ffbdf2887,f4044489ec92e523c515291487ba1415,,2024-11-23T09:47:04.931Z
2024-06-02T04:53:07Z,168.18.37.100,Mobile,2558841a58daf7b7d6bce7237c9c2321,,,2024-11-23T09:47:04.931Z
2024-06-02T05:26:03Z,113.109.66.208,Computer,643b450160c408db38b3f5ec5311f3b7,,,2024-11-23T09:47:04.931Z
2024-06-02T05:58:59Z,87.241.252.59,Computer,4f10a5f62ae52a8f5e8ac2b05d782963,5a80e866b5b0a425d501536855566096,,2024-11-23T09:47:04.931Z
2024-06-02T06:31:55Z,188.111.120.11,Mobile,1719c3662fb07ac4c153e25e003799a5,,,2024-11-23T09:47:04.931Z


### 4.3 Users Table 

Important Note:
* Even though we have two different sources ( API and Files) building the same user table, we still  have to identify two different checkpoint directory for each of the source.


### 4.3.1 API users data

* **`df_users_api`**: contains all the user information from the API data. 
* **`user_uuid`** : We will use the **`md5()`** function to create the `user_uuid`. To do that, we will concat the columns below:
  * **`access_date`** - we will cast to timestamp
  * **`payload.info_usuario.nome`** 
* **`origin`**: Informs wheather the data is from API or file ( data vault modelling principle)
* **`dt_load`**: The process date

In [0]:
df_user_api = spark.sql("""
          SELECT 
            md5(concat(
            cast(access_date AS TIMESTAMP),
            payload.info_usuario.nome
            )) AS user_uuid,
            payload.info_usuario.nome AS user_name,
            payload.info_usuario.email AS user_email,
            CAST(payload.info_usuario.idade  AS INT) AS user_idade,
            payload.info_usuario.sexo as user_gender,
            payload.info_usuario.estado AS user_state,
            payload.info_usuario.profissao AS user_profession,
            CAST(NULL AS STRING) AS company,
            'API' as origin,
            getdate() as dt_load
            FROM stream_temp_vw_api
            WHERE payload.info_usuario IS NOT NULL
""")

df_user_api.limit(5).display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
f4044489ec92e523c515291487ba1415,Usuario 022d744bd9,usuario_022d744bd9@hotmail.com,20,F,PI,Arquiteto de Dados,,API,2024-11-23T09:48:15.233Z
5a80e866b5b0a425d501536855566096,Usuario 17e7c1bc35,usuario_17e7c1bc35@hotmail.com,20,F,AP,Analista de Dados,,API,2024-11-23T09:48:15.233Z
7c36c47f0d173f1b1d2f34331228b83c,Usuario 11d0f5abaf,usuario_11d0f5abaf@hotmail.com,30,F,AM,Cientista de Dados,,API,2024-11-23T09:48:15.233Z
605fd40767a69a20931e1beb8820ba82,Usuario a563ca5283,usuario_a563ca5283@uol.com,37,M,RN,Analista de BI,,API,2024-11-23T09:48:15.233Z
adf11c2b0254e8d4e72a51b2bc86737c,Usuario 993c1e8a92,usuario_993c1e8a92@outlook.com,46,M,RJ,Analista de Dados,,API,2024-11-23T09:48:15.233Z


* **`option('mergeSchema',True)`** Indicates the process of **Schema Evolution**.

* **`.trigger(availableNow=True)`** States that the writeStream occours in  **batch**.

* **`.awaitTermination()`**  Makes the stream query to be a synchronous 


In [0]:
%fs rm -r dbfs:/user/hive/warehouse/silver.db/tb_users

In [0]:
user_api_checkpoint_path = "dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_users"
(
  df_user_api.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation',user_api_checkpoint_path)
        .option('mergeSchema', True)
        .trigger(availableNow=True)
        .table('silver.tb_users')
).awaitTermination()

spark.sql("SELECT * FROM silver.tb_users WHERE origin = 'API' LIMIT 5").display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
f4044489ec92e523c515291487ba1415,Usuario 022d744bd9,usuario_022d744bd9@hotmail.com,20,F,PI,Arquiteto de Dados,,API,2024-11-23T09:50:48.55Z
5a80e866b5b0a425d501536855566096,Usuario 17e7c1bc35,usuario_17e7c1bc35@hotmail.com,20,F,AP,Analista de Dados,,API,2024-11-23T09:50:48.55Z
7c36c47f0d173f1b1d2f34331228b83c,Usuario 11d0f5abaf,usuario_11d0f5abaf@hotmail.com,30,F,AM,Cientista de Dados,,API,2024-11-23T09:50:48.55Z
605fd40767a69a20931e1beb8820ba82,Usuario a563ca5283,usuario_a563ca5283@uol.com,37,M,RN,Analista de BI,,API,2024-11-23T09:50:48.55Z
adf11c2b0254e8d4e72a51b2bc86737c,Usuario 993c1e8a92,usuario_993c1e8a92@outlook.com,46,M,RJ,Analista de Dados,,API,2024-11-23T09:50:48.55Z


### 4.3.2 Files users data


We will now create a **`df_users_file`** that will be used to ingest users data from all batch files.

* **`usuario_uuid`** : We will use the `md5()` function with concat to create the user uuid
  * **`nome_empresa`**.
  * **`nome_funcionario`**.
* **`origin`** :
* **`dt_loat`**:

In [0]:
df_users_file = spark.sql("""
      SELECT   md5(CONCAT(
            nome_empresa,
            nome_funcionario
        )) AS user_uuid,
        nome_funcionario AS user_name,
        email_functionario AS user_email,
        CAST(idade AS INT) AS user_idade,
        sexo AS user_gender,
        estado as user_state,
        profissao as user_profession,  
        nome_empresa AS company,
        'FILE'  as origin,
        getdate() as dt_load       
        FROM stream_temp_vw_files
""")

df_users_file.limit(5).display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
90d0bf57fa11177af392c348d2a5bd72,Funcionario 3284054f49,funcionario_3284054f49@empresaa.com.br,44,M,RO,Cientista de Dados,Empresa A,FILE,2024-11-23T09:51:39.493Z
807a38110c4576283cb8d42233072e47,Funcionario 85502baecc,funcionario_85502baecc@empresaa.com.br,31,F,AC,Desenvolvedor de ETL,Empresa A,FILE,2024-11-23T09:51:39.493Z
0d9b75f8fb5b36b9980780b335c37ddb,Funcionario cbdc66550c,funcionario_cbdc66550c@empresaa.com.br,19,M,AM,Desenvolvedor de ETL,Empresa A,FILE,2024-11-23T09:51:39.493Z
893ef864f4c3c3d96103b79c9f1d0277,Funcionario 348609e786,funcionario_348609e786@empresaa.com.br,21,F,RR,Analista de Dados,Empresa A,FILE,2024-11-23T09:51:39.493Z
901d1824911ad8b25d56b719a9e68942,Funcionario cfe808cdc0,funcionario_cfe808cdc0@empresaa.com.br,44,M,PA,Arquiteto de Dados,Empresa A,FILE,2024-11-23T09:51:39.493Z


In [0]:
user_files_checkpoint_files = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/files/tb_users'
(
df_users_file.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation',user_files_checkpoint_files)
        .option('mergeSchema',True)
        .trigger(availableNow=True)
        .table('silver.tb_users')
).awaitTermination()

spark.sql("SELECT * FROM silver.tb_users WHERE origin = 'FILE' LIMIT 5").display()

user_uuid,user_name,user_email,user_idade,user_gender,user_state,user_profession,company,origin,dt_load
6f9fe2aab0141bb55e4ac96386619fa8,Funcionario 259218b251,funcionario_259218b251@empresaa.com.br,43,M,AM,Engenheiro de Dados,Empresa A,FILE,2024-11-23T09:51:59.49Z
2c6044088cfed30dae78630eb4212151,Funcionario b1df875968,funcionario_b1df875968@empresaa.com.br,34,F,RR,Desenvolvedor de ETL,Empresa A,FILE,2024-11-23T09:51:59.49Z
72d4b60fa3676f45cb57d010e00efd03,Funcionario d0a2e2fa77,funcionario_d0a2e2fa77@empresaa.com.br,25,M,PA,Analista de BI,Empresa A,FILE,2024-11-23T09:51:59.49Z
4fbf0fbb8df8b0ae31dc523e2f1aff7b,Funcionario a774a75a5d,funcionario_a774a75a5d@empresaa.com.br,45,M,AP,Desenvolvedor de ETL,Empresa A,FILE,2024-11-23T09:51:59.49Z
212df58d29be91116d1e6317678e797c,Funcionario 9a97c8cb83,funcionario_9a97c8cb83@empresaa.com.br,45,M,TO,Analista de Dados,Empresa A,FILE,2024-11-23T09:51:59.49Z


### 4.4 Sales Table 

We will load sales information fom the streming and API data and create the dataframes **`df_sales_api`** and **`df_sales_file`**. The first one with API sales data and the second containing file sales data.

1. **`df_sales_api`** 
* **`acesso_uuid`** We will use the**`md5()`** function with **`concat`** to create the uuid with the columns below:
  * **`access_date`** - cast to timestamp
  * **`ip_address`**.
  * **`access_point`** -  mobile or Computador (**local_acesso**).
* **`usuario_uuid`**  **`md5()`** + **`concat`** to create the uuid by concatenating the columns below:
  * **`access_date`** - cast to timestamp
  * **`payload.info_usuario.nome`**.
* **`total_value`**, **`percent_descount`** and **`descount_value`** treated
* **`origin`** API or FILE
* **`td_load`** the loading date

#### 4.4.1 Sales API Data


In [0]:
df_sales_api = spark.sql("""
      SELECT 
        CAST( access_date AS TIMESTAMP) AS dt_sale,
        md5(concat(
          CAST(access_date AS TIMESTAMP),
          ip_address,
          CASE WHEN access_point IN ('iphone', 'android') THEN 'Mobile' ELSE 'Computer' END
        )) AS access_uuid,
        md5(concat(
          cast(access_date as TIMESTAMP),
          payload.info_usuario.nome
       )) user_uuid,
        payload.info_produto.product_uuid AS course_uuid,
        payload.info_pagamento.forma_pagamento as payment_method, 
        CAST(payload.info_pagamento.quantidade_parcelas AS INT) AS qnt_instalments,  
        CAST(payload.info_pagamento.valor_parcelas as DECIMAL(9,2) ) AS instalments_values,
        CAST(payload.info_pagamento.quantidade_parcelas * payload.info_pagamento.valor_parcelas AS DECIMAL(9,2)) AS total_value, 
        concat(CAST((payload.info_pagamento.disconto*100) AS INT ), '%' ) AS percent_discount ,
        CAST( replace(substr(payload.info_produto.valor,4),',','.' ) * payload.info_pagamento.disconto AS DECIMAL(9,2)) as discount_value,
        'API' as origin,
        getdate() as dt_load     
      FROM stream_temp_vw_api
      WHERE payload.info_pagamento IS NOT NULL       
                         
                         
""")

df_sales_api.limit(5).display()

dt_sale,access_uuid,user_uuid,course_uuid,payment_method,qnt_instalments,instalments_values,total_value,percent_discount,discount_value,origin,dt_load
2024-06-02T07:37:47Z,c67bcead77c723dd2273bdc537cb5085,7c36c47f0d173f1b1d2f34331228b83c,c2d6bcbc3e46555bb1e7e9afbc24d3af,credito,10,54.99,549.9,,,API,2024-11-23T10:24:57.967Z
2024-06-02T21:21:07Z,9f703f48102ed386e6d3fe36b3db5ae6,a08e693ac7d1e3630788552626a31379,34bdd77f6954552d11c4f5547cb41458,credito,10,65.99,659.9,,,API,2024-11-23T10:24:57.967Z
2024-06-03T11:04:27Z,5516deecfae24f492753d612a7c9b0bb,dc402a364c5a6eb1e4a0e5d148a663c0,f260cd97c6c9813b01601e834a2added,credito,10,58.99,589.9,,,API,2024-11-23T10:24:57.967Z
2024-06-03T12:43:15Z,2b3e9b7fb951ea3d26df2c4f0eee41d7,5b71e33fcb230c85a909cccc4a7c4614,f260cd97c6c9813b01601e834a2added,credito,12,49.16,589.92,,,API,2024-11-23T10:24:57.967Z
2024-06-03T13:49:07Z,a006e739b422b40b0c0950f4c21d8819,228f3881e527db54ea749dd08f65c789,34bdd77f6954552d11c4f5547cb41458,boleto,1,659.9,659.9,,,API,2024-11-23T10:24:57.967Z


In [0]:
sales_api_checkout_path = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_sales'
(
df_sales_api.writeStream
        .format('delta')
        .outputMode('append')
        .option('CheckpointLocation',sales_api_checkout_path)
        .option('mergeSchema','true')
        .trigger(availableNow = True)
        .toTable('silver.tb_sales')
).awaitTermination()

spark.sql("select * from silver.tb_sales where origin ='API' limit 5")

DataFrame[dt_sale: timestamp, access_uuid: string, user_uuid: string, course_uuid: string, payment_method: string, qnt_instalments: int, instalments_values: decimal(9,2), total_value: decimal(9,2), percent_discount: string, discount_value: decimal(9,2), origin: string, dt_load: timestamp]

#### 4.4.1 Sales File Data 


In [0]:
df_sales_file = spark.sql("""
        SELECT 
          CAST(data_venda as TIMESTAMP) as dt_sale,
          CAST(NULL AS STRING) AS access_uuid,
          md5(concat(
            nome_empresa,
            nome_funcionario
             )) AS user_uuid,
          md5(curso) AS course_uuid,
          'pix' AS payment_method,
          1 AS qnt_instalments,
          CAST((replace(substr(valor,4), ',','.') - CAST(replace(substr(valor,4), ',','.' * (replace(disconto,'%','') / 100)) AS DECIMAL(9,2))) AS DECIMAL(9,2)) AS instalments_values,
          CAST((replace(substr(valor,4),',','.') - CAST((replace(substr(valor,4),',','.') * (replace(disconto,'%','')/100)) AS DECIMAL(9,2))) AS DECIMAL(9,2)) AS total_value,
          disconto AS percent_discount,
          CAST((replace(substr(valor,4),',','.') * (replace(disconto,'%','')/100)) AS DECIMAL(9,2)) AS discount_value,
          'FILE' as origin,
          getdate() as dt_load
          FROM stream_temp_vw_files                       
""")

df_sales_file.limit(5).display()

dt_sale,access_uuid,user_uuid,course_uuid,payment_method,qnt_instalments,instalments_values,total_value,percent_discount,discount_value,origin,dt_load
2025-02-26T00:00:00Z,,90d0bf57fa11177af392c348d2a5bd72,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-23T10:47:35.11Z
2024-07-01T00:00:00Z,,807a38110c4576283cb8d42233072e47,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-23T10:47:35.11Z
2025-11-23T00:00:00Z,,0d9b75f8fb5b36b9980780b335c37ddb,c2d6bcbc3e46555bb1e7e9afbc24d3af,pix,1,,522.4,5%,27.5,FILE,2024-11-23T10:47:35.11Z
2025-04-07T00:00:00Z,,893ef864f4c3c3d96103b79c9f1d0277,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-23T10:47:35.11Z
2024-08-10T00:00:00Z,,901d1824911ad8b25d56b719a9e68942,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-23T10:47:35.11Z


* **`.writeStream`** and **`.table()`**: loads data from **`df_sales_file`** to the **silver.tb_sales** table 
* **`outputMode('append')`**: States that the data will be appended 
* **`option('CheckpointLocation', file_checkpoint_path)`**: Defines the checkpoint diretory where Spark will control the streaming process with **`exactly-once-delivery`**
* **`option('mergeSchema',True)`**: Indicates a **Schema Evolution** processing
* **`trigger(availableNow=True)`**: States that the process will occcour in batches
* **`.trigger(availableNow=True)`**: Makes the query a scyncronous process

In [0]:
file_checkpoint_path = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/arquivo/tb_sales'
(

df_sales_file.writeStream
      .format('delta')
      .outputMode('append')
      .option('mergeSchema', 'true')
      .option('CheckpointLocation', file_checkpoint_path)
      .trigger(availableNow=True)
      .table('silver.tb_sales')
).awaitTermination()

spark.sql("SELECT * FROM silver.tb_sales WHERE origin = 'FILE' LIMIT 5").display()

dt_sale,access_uuid,user_uuid,course_uuid,payment_method,qnt_instalments,instalments_values,total_value,percent_discount,discount_value,origin,dt_load
2025-02-26T00:00:00Z,,fbf6574746a46651e9a72cf771891f14,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-23T10:51:35.968Z
2024-07-01T00:00:00Z,,0146e0149c663954281270fe0396b8d3,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-23T10:51:35.968Z
2025-11-23T00:00:00Z,,fe04c6beb42683f312bb07a3e344e4a2,c2d6bcbc3e46555bb1e7e9afbc24d3af,pix,1,,522.4,5%,27.5,FILE,2024-11-23T10:51:35.968Z
2025-04-07T00:00:00Z,,27eb0a19f0d73169e5163c1612647bc0,f260cd97c6c9813b01601e834a2added,pix,1,,750.4,5%,39.5,FILE,2024-11-23T10:51:35.968Z
2024-08-10T00:00:00Z,,4eca5d4c0b997cb694dc95bc5f46aa1d,34bdd77f6954552d11c4f5547cb41458,pix,1,,655.4,5%,34.5,FILE,2024-11-23T10:51:35.968Z


In [0]:
stop_all_streams()

stop_all_streams-inicio-2024-11-23 11:01:50.047884
O stream display_query_5 fui finalizado com sucesso.
O stream display_query_6 fui finalizado com sucesso.
O stream display_query_3 fui finalizado com sucesso.
O stream display_query_9 fui finalizado com sucesso.
O stream display_query_8 fui finalizado com sucesso.
O stream display_query_4 fui finalizado com sucesso.
stop_all_streams-fim-2024-11-23 11:01:51.755516
              
