# Loading Data to Silver Zone

In this notebook:
* We will ingest data from the **Bronze Zone** into the **Silver Zone** using Spark
* We will use Spark Structured Streaming with **`trigger(availableNow=True)`** for batch loading
* We will implement **load control** for the batch processes through the Structured Streaming **checkpoint**
* We will use the **`awaitTermination()`** method to turn the streaming queries into a synchronous process
* We will combine Spark and SQL to load the data

## 1.0 Initial Setup

In [0]:
%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator" 


## 2.0 Create `Silver Zone` Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS silver")

DataFrame[]

## 3.0 Business Requirements for Silver Zone

1. The ingestion needs to be done in batch in order to avoid extra costs
    * Even though the API data is stored via streaming in the landing zone
2. API and batch data need to be stored in the same table
3. Each table will have a uuid column with a hash to identify each record
4. We have to guarantee the correct data type of every column
5. We need to create better column names for each table

## 4.0 Data Modeling

### 4.1 Courses Table (Domain Table)

The table `tb_courses` is a **domain table** that stores the information for all available courses (the product). Its rows are added manually.
* We will use the **md5()** function to create the **course_uuid** column from the course name
* The **dt_load** column contains the processing date
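
To get a feel for how a hash-based key behaves, the sketch below reproduces the idea outside Spark with Python's `hashlib`. It assumes that Spark's `md5()` hashes the UTF-8 bytes of the string and returns a 32-character hex digest; the helper name `course_uuid` is hypothetical and exists only for this illustration.

```python
import hashlib


def course_uuid(course_name: str) -> str:
    """Hypothetical helper: hex MD5 digest of the UTF-8 encoded course name."""
    return hashlib.md5(course_name.encode("utf-8")).hexdigest()


courses = [
    "Data Pipeline with Databricks",
    "From your first data pipeline to a Data Lakehouse with Databricks",
    "Building a Data Pipeline with Spark Structured Streaming",
]

# The same name always yields the same key, so the key is stable across reloads.
keys = {name: course_uuid(name) for name in courses}
for name, key in keys.items():
    print(key, name)
```

Because the key is derived only from the course name, renaming a course would produce a different `course_uuid`.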

In [0]:
spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.tb_courses
  AS
    SELECT  
      md5('Data Pipeline with Databricks') AS course_uuid,
      'Data Pipeline with Databricks' AS course_name,
      'beginner' AS course_level,
      589.90 AS course_price,
      getdate() AS dt_load

      UNION

    SELECT
      md5('From your first data pipeline to a Data Lakehouse with Databricks') AS course_uuid,
      'From your first data pipeline to a Data Lakehouse with Databricks' AS course_name,
      'intermediate' as course_level,
      659.90 AS course_price,
      getdate() AS dt_load


      UNION

    SELECT
      md5('Building a Data Pipeline with Spark Structured Streaming') AS course_uuid,
      'Building a Data Pipeline with Spark Structured Streaming' as course_name,
      'advanced' as course_level,
      549.90 as course_price,
      getdate() as dt_load
"""
)


spark.sql('SELECT * FROM silver.tb_courses').display()

course_uuid,course_name,course_level,course_price,dt_load
fb95df132ca7f41d392bc98ccf0cfeb8,Data Pipeline with Databricks,beginner,589.9,2024-11-22T09:39:02.963Z
bda125b01c9596e123e5f9b3bf00f3a8,From your first data pipeline to a Data Lakehouse with Databricks,intermediate,659.9,2024-11-22T09:39:02.963Z
ff17869bc6f9d9865e0bf8133c4ce3c3,Building a Data Pipeline with Spark Structured Streaming,advanced,549.9,2024-11-22T09:39:02.963Z


We will now create two streaming views called **`stream_temp_vw_api`**  and **`stream_temp_vw_files`** that will be used as source data for our loading process.

In [0]:
api_df = spark.readStream.table('bronze.api_data_')
api_df.createOrReplaceTempView('stream_temp_vw_api')

files_df = spark.readStream.table('bronze.file_data_')
files_df.createOrReplaceTempView('stream_temp_vw_files')

### 4.2 Access Table  

This table stores every access to the website.

**`df_access`**: DataFrame used to load data into the **`tb_access`** table. It holds all the information about the website visitors, which comes from the API.

**Columns:**
* **`access_uuid` column**: Created with the **`md5()`** function by **concatenating** the columns below:
  * **`access_date`** - After being converted to Timestamp
  * **`ip_address`** - IP address of the computer
  * **`access_point`** - Identifies the access point as mobile or computer (**`local_access`**)
* **`user_uuid` column**: Created with the **`md5()`** function by **concatenating** the columns below:
  * **`access_date`** - After being converted to Timestamp
  * **`payload.info_usuario.nome`** - Name of the user
* **`dt_load` column**: The processing date of the record
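
One detail worth keeping in mind: Spark SQL's `concat` returns NULL when any of its arguments is NULL, so `md5(concat(...))` is also NULL for rows where a field is missing (for example, when the `payload` fields are null). The sketch below mimics that behavior in plain Python. `sql_concat` and `sql_md5` are hypothetical stand-ins rather than Spark functions, and the rows are made-up values for illustration only.

```python
import hashlib
from typing import Optional


def sql_concat(*parts: Optional[str]) -> Optional[str]:
    """Mimic Spark SQL concat: the result is None (NULL) if any part is None."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def sql_md5(value: Optional[str]) -> Optional[str]:
    """Mimic md5 on a nullable string: NULL in, NULL out."""
    if value is None:
        return None
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# Illustrative rows: (access timestamp as text, user name taken from the payload)
rows = [
    ("2024-06-02 04:32:32", None),            # payload missing, so no user name
    ("2024-06-02 05:38:24", "Example User"),  # made-up name
]

user_uuids = [sql_md5(sql_concat(ts, name)) for ts, name in rows]
print(user_uuids)
```

If a key is required for every row, one option is to wrap nullable fields in `coalesce` before concatenating them, at the cost of anonymous rows sharing the same suffix.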

In [0]:
%sql
select * from stream_temp_vw_api limit 2

access_date,access_point,ip_address,payload,_rescued_data,source_file_name,processing_timestamp
2024-06-02T04:32:32.000Z,android,69.127.75.83,"List(null, null, null)",,part-00007-dcf3b00e-1836-4744-b6b7-0b017475c73e-c000.json,2024-11-22T10:18:20.134Z
2024-06-02T05:05:28.000Z,firefox,168.18.37.100,"List(null, null, null)",,part-00007-dcf3b00e-1836-4744-b6b7-0b017475c73e-c000.json,2024-11-22T10:18:20.134Z


In [0]:
df_access = spark.sql(""" 
    SELECT 
      CAST( access_date AS TIMESTAMP) AS access_timestamp,
      ip_address AS access_ip_address,
      CASE WHEN access_point IN ('iphone','android') THEN 'Mobile' ELSE 'Computer' END AS local_access,
      md5(concat(
        CAST(access_date AS TIMESTAMP),
        ip_address,
        CASE WHEN access_point IN ('iphone', 'android') THEN 'mobile' ELSE 'computer' END
      )) AS access_uuid,
      md5(concat(
          CAST( access_date AS TIMESTAMP),
          payload.info_usuario.nome        
         )) AS user_uuid,
        payload.info_produto.product_uuid AS course_uuid,
      getdate() AS dt_load
    FROM stream_temp_vw_api
"""
)

df_access.limit(5).display()

access_timestamp,access_ip_address,local_access,access_uuid,uuid_user,course_uuid,dt_load
2024-06-02T04:32:32Z,69.127.75.83,Mobile,1103c7b8628eb7d9e78de86aa2b29d68,,,2024-11-22T11:07:38.842Z
2024-06-02T05:05:28Z,168.18.37.100,Computer,9b6441d6a9398786735cd962eabaf763,,,2024-11-22T11:07:38.842Z
2024-06-02T05:38:24Z,113.109.66.208,Computer,af4e558b3f8e7bda5f938521fc9f8d71,6a7bba03bab91ad37c8b99cc62b2b49b,f260cd97c6c9813b01601e834a2added,2024-11-22T11:07:38.842Z
2024-06-02T06:11:20Z,87.241.252.59,Mobile,09203dff192ad95f83d644065fdc86c9,,,2024-11-22T11:07:38.842Z
2024-06-02T06:44:16Z,188.111.120.11,Computer,a234a81dc3a2926c09d2a198e8d09750,d6f409ceb737a0a28b5c020593a4dff9,34bdd77f6954552d11c4f5547cb41458,2024-11-22T11:07:38.842Z


We have just read the stream data from our API. 

Important note about Spark:
* The source and destination of a load should have the same characteristics. In other words:
  * `Source`: stream data and `Destination`: stream data
  * `Source`: batch data and `Destination`: batch data

To overcome this constraint, we will use `Spark Structured Streaming` to read the streaming data in batch, loading it in micro-batches.

`Spark Structured Streaming`:
* No need to manage a checkpoint table to identify data that has already been loaded
* It uses a **`checkpoint directory`** defined on the `writeStream`. This checkpoint stores the last file/offset/row that has been stored. This way, if the process fails, Spark is able to guarantee **exactly-once semantics** for the stream. In other words, the checkpoint is responsible for controlling what still needs to be loaded.
* **`trigger(availableNow=True)`**: Tells Spark to perform the load in batch using the Structured Streaming process.

Spark Structured Streaming supports the **availableNow** trigger type. When it is set to **True** in the **`trigger`** method of **`writeStream`**, it tells Spark to **perform the data load in batch** using the Structured Streaming process. Spark reads all the records available for loading and ingests them in micro-batches. Once all the records mapped at the start of the load have been processed, **Spark stops the streaming query automatically**.



* **`.writeStream`**: Writes the DataFrame into the **silver.tb_access** table, which is created on the fly by the **`.table()`** method
* **`.outputMode('append')`**: States that the data will be appended to the destination table
* **`.option('CheckpointLocation', access_checkpoint_location)`**: Defines the directory that Spark uses to track the streaming progress and provide `exactly-once delivery`
* **`.trigger(availableNow=True)`**: States that the writeStream process will run as a batch
* **`.awaitTermination()`**: Blocks until the streaming query finishes, which makes it a synchronous process
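
The load-control idea behind the checkpoint can be pictured with a small toy model in plain Python. This is only a sketch under simplifying assumptions: the function `run_available_now`, the `offset.json` file, and the list-based source and sink are all hypothetical, and Spark's real checkpoint format and recovery logic are considerably more involved.

```python
import json
import os
import tempfile


def run_available_now(source, sink, checkpoint_dir, batch_size=2):
    """Toy model of an availableNow run: process what is pending, then stop."""
    offset_file = os.path.join(checkpoint_dir, "offset.json")

    # Resume from the last committed offset, or start at 0 on the first run.
    start = 0
    if os.path.exists(offset_file):
        with open(offset_file) as f:
            start = json.load(f)["offset"]

    end = len(source)  # snapshot of what is available when the run starts
    while start < end:
        batch = source[start:min(start + batch_size, end)]
        sink.extend(batch)  # comparable to the "append" output mode
        start += len(batch)
        with open(offset_file, "w") as f:  # commit progress after each micro-batch
            json.dump({"offset": start}, f)
    return start  # the call returns only when the run is finished


source = ["r1", "r2", "r3"]
sink = []
checkpoint_dir = tempfile.mkdtemp()

run_available_now(source, sink, checkpoint_dir)  # first run loads r1..r3
run_available_now(source, sink, checkpoint_dir)  # second run finds nothing new
source.extend(["r4", "r5"])
run_available_now(source, sink, checkpoint_dir)  # third run loads only r4 and r5
print(sink)
```

Running the load again never duplicates rows, because each run starts from the offset stored in the checkpoint directory. Deleting that directory would make the next run reprocess everything from the beginning.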


In [0]:
access_checkpoint_location = 'dbfs:/user/hive/warehouse/silver.db/_checkpoint/api/tb_acesso'
(
    df_access.writeStream
        .format('delta')
        .outputMode('Append')
        .option('CheckpointLocation', access_checkpoint_location)
        .trigger(availableNow = True)
        .table('silver.tb_access').awaitTermination()
)
spark.sql('SELECT * FROM silver.tb_access LIMIT 5').display()

access_timestamp,access_ip_address,local_access,access_uuid,uuid_user,course_uuid,dt_load
2024-06-02T04:32:32Z,69.127.75.83,Mobile,1103c7b8628eb7d9e78de86aa2b29d68,,,2024-11-22T11:27:33.031Z
2024-06-02T05:05:28Z,168.18.37.100,Computer,9b6441d6a9398786735cd962eabaf763,,,2024-11-22T11:27:33.031Z
2024-06-02T05:38:24Z,113.109.66.208,Computer,af4e558b3f8e7bda5f938521fc9f8d71,6a7bba03bab91ad37c8b99cc62b2b49b,f260cd97c6c9813b01601e834a2added,2024-11-22T11:27:33.031Z
2024-06-02T06:11:20Z,87.241.252.59,Mobile,09203dff192ad95f83d644065fdc86c9,,,2024-11-22T11:27:33.031Z
2024-06-02T06:44:16Z,188.111.120.11,Computer,a234a81dc3a2926c09d2a198e8d09750,d6f409ceb737a0a28b5c020593a4dff9,34bdd77f6954552d11c4f5547cb41458,2024-11-22T11:27:33.031Z


### 4.3 Users Table 

In [0]:
stop_all_streams()
clean_up_landing_dir()

stop_all_streams-start-2024-11-22 11:44:08.223156
Stream display_query_5 was stopped successfully.
Stream display_query_3 was stopped successfully.
stop_all_streams-end-2024-11-22 11:44:10.136551

clean_up_landing_dir-start-2024-11-22 11:44:10.136712
All files and directories inside 'dbfs:/FileStore/landing/' were deleted successfully.
clean_up_landing_dir-end-2024-11-22 11:44:20.700540
              
