# Loading Data To Bronze Layer 

1. The table `dados_arquivo` originates from files and is loaded in batch mode using Spark.
2. The table `dados_api` originates from the sales API and is loaded in streaming mode using Auto Loader.


* Batch data: sales made by sales consultants (B2B)
* Streaming data: API sales

## 1.0 Initial Setup 

In [0]:
%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator"

The Data

1. **generate_api_data** - This function generates the Sales API streaming data, which is in JSON format and is stored in the **Landing Zone**: `dbfs:/FileStore/landing/stream/`

* The payload struct holds information about the user registration, the sale itself, and the payment method. If a user only visits the website without taking any further action (registration or purchase), all payload elements are set to null.

An example API file is shown below:

```
    {
        "access_date":"2024-06-02T19:01:09.000Z",
        "ip_address":"207.198.60.166",
        "access_point":"chrome",
        "payload":{
            "info_usuario":{
                "nome":"Usuario c3e5d305e1",
                "idade":"44",
                "sexo":"F",
                "email":"usuario_c3e5d305e1@outlook.com",
                "profissao":"Desenvolvedor de ETL",
                "estado":"TO"
            },
            "info_produto":{
                "product_uuid":"f260cd97c6c9813b01601e834a2added",
                "valor":"R$ 589,90"
            },
            "info_pagamento":{
                "valor":"589.90",
                "forma_pagamento":"credito",
                "quantidade_parcelas":"2",
                "valor_parcelas":"294.95"
            }
        }
    }
```
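To make the payload structure concrete, here is a minimal sketch in plain Python (it is not part of the original notebook) that reads one record and tells a visit-only event apart from a completed sale. The helper name `summarize_event` and the two shortened sample records are hypothetical. The sketch assumes that a visit-only record carries either a null `payload` or null sections inside it, as described above.

```python
import json
from decimal import Decimal


def summarize_event(raw_json: str) -> dict:
    """Summarize one API record (hypothetical helper, for illustration only)."""
    event = json.loads(raw_json)
    # A visit-only record may have a null payload or null sections inside it.
    payload = event.get("payload") or {}
    user = payload.get("info_usuario") or {}
    payment = payload.get("info_pagamento") or {}
    amount = payment.get("valor")
    return {
        "access_point": event.get("access_point"),
        "registered": bool(user),
        "purchased": amount is not None,
        # info_pagamento.valor uses a dot as the decimal separator.
        "amount": Decimal(amount) if amount is not None else None,
    }


# Shortened sample records modeled on the example file above.
sale = json.dumps({
    "access_point": "chrome",
    "payload": {
        "info_usuario": {"nome": "Usuario c3e5d305e1", "estado": "TO"},
        "info_produto": {"valor": "R$ 589,90"},
        "info_pagamento": {"valor": "589.90", "forma_pagamento": "credito"},
    },
})
visit = json.dumps({
    "access_point": "android",
    "payload": {"info_usuario": None, "info_produto": None, "info_pagamento": None},
})

print(summarize_event(sale))
print(summarize_event(visit))
```

Using `Decimal` rather than `float` keeps monetary values exact, which is usually preferable when amounts are later aggregated.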

2. **generate_files_data** - This function generates the batch data in CSV format. As mentioned above, this CSV data records sales made by sales consultants who sell the products to businesses (B2B). All the data is stored in the **Landing Zone**: `dbfs:/FileStore/landing/files/`.

An example batch file is shown below:

```
    data_venda,nome_empresa,sexo,nome_funcionario,email_functionario,profissao,idade,estado,curso,valor,disconto
    2025-02-26,Empresa A,M,Funcionario 3a0f99b401,funcionario_3a0f99b401@empresaa.com.br,Cientista de Dados,44,RO,Construindo o seu Primeiro Pipeline de Dados com o Databricks,"R$ 789,90",5%
```
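The `valor` and `disconto` columns arrive as formatted text, so they need parsing before any arithmetic. The following is a minimal sketch in plain Python (not part of the original notebook) of how the sample row above could be read. The helpers `parse_brl` and `parse_percent` are hypothetical, and the sketch assumes the Brazilian number format, where `.` groups thousands and `,` marks decimals.

```python
import csv
import io
from decimal import Decimal

# Header and sample row copied from the batch file example above.
SAMPLE = '''data_venda,nome_empresa,sexo,nome_funcionario,email_functionario,profissao,idade,estado,curso,valor,disconto
2025-02-26,Empresa A,M,Funcionario 3a0f99b401,funcionario_3a0f99b401@empresaa.com.br,Cientista de Dados,44,RO,Construindo o seu Primeiro Pipeline de Dados com o Databricks,"R$ 789,90",5%
'''


def parse_brl(text: str) -> Decimal:
    """Convert a string such as 'R$ 789,90' to a Decimal (hypothetical helper)."""
    digits = text.replace("R$", "").strip()
    # Assumption: '.' is a thousands separator and ',' is the decimal separator.
    return Decimal(digits.replace(".", "").replace(",", "."))


def parse_percent(text: str) -> Decimal:
    """Convert a string such as '5%' to a fraction (hypothetical helper)."""
    return Decimal(text.strip().rstrip("%")) / Decimal(100)


# csv.DictReader handles the quoted "R$ 789,90" field, which contains a comma.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
row = rows[0]
price = parse_brl(row["valor"])
discount = parse_percent(row["disconto"])
print(row["nome_empresa"], row["estado"], price, discount)
```

The quoting matters here: splitting the line on commas by hand would break the `valor` field in two, which is why a CSV-aware reader is used.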

In [0]:
# generate api data
query = generate_api_data()

# generate file data
generate_files_data(100000)

generate_api_data-inicio-2024-11-20 01:40:31.691490
generate_api_data-fim-2024-11-20 01:40:41.352557
              
generate_files_data-inicio-2024-11-20 01:40:41.353228
O arquivo .csv com 100000 registros foi gerado no diretorio 'dbfs:/FileStore/landing/files'.
generate_files_data-fim-2024-11-20 01:41:07.392084
              


## 2.0 Create Bronze Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")

DataFrame[]

## 3.0 Streaming Data Ingestion to Bronze Layer 

Using the **spark.readStream** method with the Auto Loader source, **.format('cloudFiles')**, we can read the API streaming data without a predefined schema.

* `option('cloudFiles.format', 'json')`: Identifies the format of the files Auto Loader will process, in this case JSON.
* `option('cloudFiles.schemaLocation', schema_location_api)`: The location where the inferred schema is stored and versioned. Providing it is what enables schema inference.
* `option('cloudFiles.inferColumnTypes', True)`: Asks Auto Loader to infer column data types; without it, JSON columns are inferred as strings.

In [0]:
from pyspark.sql.functions import col, current_timestamp

# Location where Auto Loader stores and versions the inferred schema
schema_location_api = 'dbfs:/user/hive/warehouse/bronze.db/_schemas/load_api_raw_data'
# Landing Zone directory that receives the API JSON files
stream_landing_path = "dbfs:/FileStore/landing/stream/"

auto_loader_df = (
  spark.readStream
       .format('cloudFiles')
       .option('cloudFiles.format','json')
       .option('cloudFiles.schemaLocation', schema_location_api)
       .option('cloudFiles.inferColumnTypes', True)
       .load(stream_landing_path)
)

# Add audit columns: the source file name and the processing timestamp
auto_loader_df = (
  auto_loader_df.withColumn("source_file_name", col("_metadata.file_name"))
                .withColumn("processing_timestamp", current_timestamp() )
)

auto_loader_df.limit(5).display()

access_date,access_point,ip_address,payload,_rescued_data,source_file_name,processing_timestamp
2024-06-02T02:33:09.000Z,chrome,69.127.75.83,"List(null, null, null)",,part-00000-e08a9d60-07fc-435c-91f3-5cc817acc52e-c000.json,2024-11-20T01:57:46.01Z
2024-06-02T03:06:05.000Z,iphone,168.18.37.100,"List(null, null, List(usuario_41a8154efd@hotmail.com, ES, 33, Usuario 41a8154efd, Cientista de Dados, F))",,part-00000-e08a9d60-07fc-435c-91f3-5cc817acc52e-c000.json,2024-11-20T01:57:46.01Z
2024-06-02T03:39:01.000Z,safari,113.109.66.208,"List(null, null, List(usuario_0cbfedc6af@uol.com, GO, 25, Usuario 0cbfedc6af, Analista de Dados, F))",,part-00000-e08a9d60-07fc-435c-91f3-5cc817acc52e-c000.json,2024-11-20T01:57:46.01Z
2024-06-02T04:11:57.000Z,android,87.241.252.59,"List(null, null, null)",,part-00000-e08a9d60-07fc-435c-91f3-5cc817acc52e-c000.json,2024-11-20T01:57:46.01Z
2024-06-02T04:44:53.000Z,firefox,188.111.120.11,"List(List(null, boleto, 1, 549.90, 549.90), List(c2d6bcbc3e46555bb1e7e9afbc24d3af, R$ 549,90), List(usuario_5425fd2789@gmail.com, SE, 27, Usuario 5425fd2789, Engenheiro de Dados, M))",,part-00000-e08a9d60-07fc-435c-91f3-5cc817acc52e-c000.json,2024-11-20T01:57:46.01Z


In [0]:
stop_all_streams()
clean_up_landing_dir()

stop_all_streams-inicio-2024-11-20 02:02:14.188870
O stream generate_api_stream_data fui finalizado com sucesso.
O stream display_query_1 fui finalizado com sucesso.
stop_all_streams-fim-2024-11-20 02:02:15.846736
              
clean_up_landing_dir-inicio-2024-11-20 02:02:15.846880
Todos os arquivos e diretórios dentro de 'dbfs:/FileStore/landing/' foram excluidos com sucesso.
clean_up_landing_dir-fim-2024-11-20 02:02:22.001455
              
