# Loading Data to the Bronze Layer

1. The `dados_arquivo` table is sourced from files and will be loaded in batch using Spark.
2. The `dados_api` table is sourced from the sales API and will be loaded as a stream using Auto Loader.


* Batch data: sales made by sales consultants (B2B)
* Streaming data: API sales

## 1.0 Initial Setup 

In [0]:
%run "/Users/cabreirajm@gmail.com/DataPipelineCabreira/Helpers/data_generator"

The Data
1. **generate_api_data** - This function generates the Sales API streaming data, which is in JSON format and is stored in the **Landing Zone**: `dbfs:/FileStore/landing/stream/`

* The payload struct holds information about the user registration, the sale itself, and the payment method. If a user only visits the website without taking any further action (registering or buying), all elements are set to null.

An example API file is shown below:

```
    {
        "access_date":"2024-06-02T19:01:09.000Z",
        "ip_address":"207.198.60.166",
        "access_point":"chrome",
        "payload":{
            "info_usuario":{
                "nome":"Usuario c3e5d305e1",
                "idade":"44",
                "sexo":"F",
                "email":"usuario_c3e5d305e1@outlook.com",
                "profissao":"Desenvolvedor de ETL",
                "estado":"TO"
            },
            "info_produto":{
                "product_uuid":"f260cd97c6c9813b01601e834a2added",
                "valor":"R$ 589,90"
            },
            "info_pagamento":{
                "valor":"589.90",
                "forma_pagamento":"credito",
                "quantidade_parcelas":"2",
                "valor_parcelas":"294.95"
            }
        }
    }
```
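
To make the payload structure concrete, the sketch below parses a trimmed copy of the record above with Python's standard `json` module and reports which payload sections are populated. The helper name `payload_sections_present` is hypothetical, and the visit-only record is made up for illustration: it assumes that each of the three sections is set to null, which is one reading of the description above.

```python
import json

# Trimmed copy of the example record above (only a few fields kept).
sale_record = json.loads("""
{
    "access_date": "2024-06-02T19:01:09.000Z",
    "payload": {
        "info_usuario": {"nome": "Usuario c3e5d305e1", "estado": "TO"},
        "info_produto": {"product_uuid": "f260cd97c6c9813b01601e834a2added", "valor": "R$ 589,90"},
        "info_pagamento": {"valor": "589.90", "forma_pagamento": "credito"}
    }
}
""")

# Hypothetical visit without registration or sale (assumption: every section is null).
visit_record = json.loads("""
{
    "access_date": "2024-06-02T19:05:00.000Z",
    "payload": {"info_usuario": null, "info_produto": null, "info_pagamento": null}
}
""")

SECTIONS = ("info_usuario", "info_produto", "info_pagamento")


def payload_sections_present(record):
    """Return the names of the payload sections that are not null."""
    payload = record.get("payload") or {}
    return [name for name in SECTIONS if payload.get(name) is not None]


print(payload_sections_present(sale_record))
print(payload_sections_present(visit_record))
```

The first call lists all three sections and the second returns an empty list, which is how a visit-only record could be told apart from a sale.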

2. **generate_files_data** - This function generates the batch data in CSV format. As mentioned above, these CSV files contain sales made by sales consultants who sell the products to businesses (B2B). All the data is stored in the **Landing Zone**: `dbfs:/FileStore/landing/files/`.

An example batch file is shown below:

```
    data_venda,nome_empresa,sexo,nome_funcionario,email_functionario,profissao,idade,estado,curso,valor,disconto
    2025-02-26,Empresa A,M,Funcionario 3a0f99b401,funcionario_3a0f99b401@empresaa.com.br,Cientista de Dados,44,RO,Construindo o seu Primeiro Pipeline de Dados com o Databricks,"R$ 789,90",5%
```
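
Note that the `valor` field is quoted because it contains a comma, so splitting each line on `,` would break the record. The sketch below reads the example row with Python's standard `csv` module and shows one way the currency and discount strings could be converted to numbers. It only illustrates the file format and is not part of the ingestion code; the helpers `parse_brl` and `parse_percent` are hypothetical, and `parse_brl` assumes Brazilian number formatting (`.` for thousands, `,` for decimals).

```python
import csv
import io
from decimal import Decimal

# Header and row copied from the example above.
sample_csv = (
    "data_venda,nome_empresa,sexo,nome_funcionario,email_functionario,"
    "profissao,idade,estado,curso,valor,disconto\n"
    "2025-02-26,Empresa A,M,Funcionario 3a0f99b401,"
    "funcionario_3a0f99b401@empresaa.com.br,Cientista de Dados,44,RO,"
    "Construindo o seu Primeiro Pipeline de Dados com o Databricks,"
    '"R$ 789,90",5%\n'
)


def parse_brl(value):
    """Convert a string such as 'R$ 789,90' to Decimal('789.90')."""
    digits = value.replace("R$", "").strip()
    # Drop thousands separators, then switch the decimal comma to a point.
    return Decimal(digits.replace(".", "").replace(",", "."))


def parse_percent(value):
    """Convert a string such as '5%' to Decimal('0.05')."""
    return Decimal(value.rstrip("%")) / Decimal(100)


# csv.DictReader honors the quotes around the valor field.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]
print(first["valor"], "->", parse_brl(first["valor"]))
print(first["disconto"], "->", parse_percent(first["disconto"]))
```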

In [0]:
# generate api data
query = generate_api_data()

# generate file data
generate_files_data(100000)

generate_api_data-inicio-2024-11-20 14:21:05.553818
generate_api_data-fim-2024-11-20 14:21:13.186053
              
generate_files_data-inicio-2024-11-20 14:21:13.186325
O arquivo .csv com 100000 registros foi gerado no diretorio 'dbfs:/FileStore/landing/files'.
generate_files_data-fim-2024-11-20 14:21:42.492506
              


## 2.0 Create Bronze Schema

In [0]:
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")

DataFrame[]

## 3.0 Streaming Data Ingestion to Bronze Layer 

Using the **spark.readStream** method with the Auto Loader source **.format('cloudFiles')**, we can read the API streaming data without a predefined schema.

* `option('cloudFiles.format', 'json')`: Identifies the format of the files that Auto Loader will process, JSON in this case.
* `option('cloudFiles.schemaLocation', schema_location_api)`: The location where the inferred schema is stored and versioned.
* `option('cloudFiles.inferColumnTypes', True)`: Allows Auto Loader to infer the data type of each column. Without it, the columns of JSON files are inferred as strings.
* `load(stream_lading_path)`: Creates the streaming read pointing to the location where the API writes the streaming data (the source path).
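
The same options can also be collected in a dictionary and passed in a single call, since `DataStreamReader.options` accepts keyword arguments. The sketch below is one possible arrangement, not the notebook's own code: the function name `build_auto_loader_reader` is hypothetical, and it assumes a Databricks `spark` session, where the `cloudFiles` source is available.

```python
def build_auto_loader_reader(spark, source_path, schema_location):
    """Return a streaming DataFrame that reads JSON files with Auto Loader.

    Assumption: `spark` is a Databricks SparkSession; the `cloudFiles`
    source is not part of open-source Spark.
    """
    auto_loader_options = {
        "cloudFiles.format": "json",                   # format of the incoming files
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
        "cloudFiles.inferColumnTypes": "true",         # infer column types instead of all strings
    }
    return (
        spark.readStream
             .format("cloudFiles")
             .options(**auto_loader_options)
             .load(source_path)
    )
```

In a notebook this could be called as `build_auto_loader_reader(spark, stream_lading_path, schema_location_api)`. Keeping the options in one dictionary makes it easier to reuse them across streams.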

The `auto_loader_df` DataFrame is created below following the options described above. In addition, we create two new columns:
1. `source_file_name`: The name of the file from which the record originated.
2. `processing_timestamp`: The timestamp at which the data is ingested and processed.

Note that these two columns are added to help with debugging.

In [0]:

from pyspark.sql.functions import col, current_timestamp

# Read the API streaming data with Auto Loader
schema_location_api = 'dbfs:/user/hive/warehouse/bronze.db/_schemas/load_api_raw_data'
stream_lading_path  = "dbfs:/FileStore/landing/stream/"
auto_loader_df = (
  spark.readStream
       .format('cloudFiles')
       .option('cloudFiles.format','json')
       .option('cloudFiles.schemaLocation', schema_location_api)
       .option('cloudFiles.inferColumnTypes', True)
       .load(stream_lading_path)
)

# Add the debugging columns: source file name and processing timestamp
auto_loader_df = (
  auto_loader_df.withColumn("source_file_name", col("_metadata.file_name"))
                .withColumn("processing_timestamp", current_timestamp() )
)

auto_loader_df.limit(5).display()

* The `_rescued_data` column holds data that Auto Loader could not fit into the inferred schema. In other words, values that cannot be parsed into the expected columns are stored in this column instead of being dropped.
* Example: a field whose data type changes between files may end up in `_rescued_data`, since Auto Loader may not pick up the change through schema evolution.
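
The idea behind `_rescued_data` can be illustrated in plain Python. The sketch below is not Auto Loader's implementation; it is a simplified model which assumes that a value that fails to cast to the expected type is replaced by null in its column and preserved in a JSON string. The schema, the function name `apply_schema_with_rescue`, and the sample records are all hypothetical.

```python
import json

# Hypothetical expected schema: column name -> Python type used for casting.
EXPECTED_SCHEMA = {"idade": int, "estado": str}


def apply_schema_with_rescue(record, schema=EXPECTED_SCHEMA):
    """Cast each schema column; keep values that fail to cast in `_rescued_data`."""
    row = {}
    rescued = {}
    for name, caster in schema.items():
        value = record.get(name)
        if value is None:
            # Missing or null values stay null.
            row[name] = None
            continue
        try:
            row[name] = caster(value)
        except (TypeError, ValueError):
            # Type mismatch: null the column and rescue the original value.
            row[name] = None
            rescued[name] = value
    row["_rescued_data"] = json.dumps(rescued) if rescued else None
    return row


print(apply_schema_with_rescue({"idade": "44", "estado": "TO"}))
print(apply_schema_with_rescue({"idade": "forty-four", "estado": "TO"}))
```

In the second call `idade` cannot be cast to an integer, so the column becomes null and the original string is kept under `_rescued_data`, where it can be inspected later.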

In [0]:
stop_all_streams()
clean_up_landing_dir()

stop_all_streams-inicio-2024-11-20 14:32:34.089135
O stream generate_api_stream_data fui finalizado com sucesso.
stop_all_streams-fim-2024-11-20 14:32:34.878001
              
clean_up_landing_dir-inicio-2024-11-20 14:32:34.878151
Todos os arquivos e diretórios dentro de 'dbfs:/FileStore/landing/' foram excluidos com sucesso.
clean_up_landing_dir-fim-2024-11-20 14:32:38.731395
              
