<center><a href="https://ilum.cloud"><img src="../logo.svg" alt="ILUM Logo"></a></center>

<center><h1 style="padding-left: 32px;">Raw data to Bronze</h1></center>
<center>Welcome to the Ilum Interactive Capabilities Tutorial! In this section you can load the first batch of data into the bronze layer. Let's dive in!</center>
<br>

# The Bronze Layer

The **Bronze Layer** is the foundational tier of a Medallion (Lakehouse) architecture, designed as a landing zone for raw, unprocessed data from multiple sources. It stores data exactly as it arrives from source systems such as databases, APIs, IoT sensors, and logs. Keeping the data unmodified preserves its integrity and completeness before any further processing or analysis.

## Key Capabilities

- **Raw Data Storage:**  
  Acts as a central repository for raw data in its original format, capturing information from operational systems (ERP, CRM, IoT devices, logs) without preprocessing. This approach ensures that all source data remains available for future use.

- **Immutable Data Handling:**  
  Data in the Bronze layer is typically stored as immutable, append-only records. This immutability preserves the original state of data, enabling auditing and exact reproducibility of historical data.

- **Multi-Format Support:**  
  Supports diverse data formats such as JSON, CSV, Parquet, Avro, and XML. Storing data in its original form allows handling semi-structured and unstructured datasets without upfront conversions.

- **Schema Flexibility:**  
  Utilizes a schema-on-read approach, applying schemas only at read-time. This flexibility accommodates evolving data structures and changing business requirements without disrupting ingestion.

- **Efficient Data Ingestion:**  
  Optimized for high-throughput ingestion from batch and streaming sources, allowing low-latency data capture. Minimal transformation during ingestion ensures data is loaded rapidly into the Bronze Layer.

- **Data Lineage and Metadata Management:**  
  Records metadata (e.g., ingestion timestamps, source identifiers, batch IDs) alongside data. This provides robust data lineage, facilitating governance, compliance, and troubleshooting.
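
To make the capabilities above more concrete, here is a minimal, standard-library-only sketch of append-only ingestion with lineage metadata and schema-on-read. It is an illustration under assumptions, not how Ilum or Spark implement the Bronze Layer: the field names `_source`, `_batch_id`, and `_ingested_at`, the helper functions, and the sample rows are all hypothetical.

```python
import csv
import io
import uuid
from datetime import datetime, timezone


def ingest_raw(csv_text, source, batch_id=None):
    """Wrap each raw CSV row with lineage metadata; values stay untouched strings."""
    batch_id = batch_id or str(uuid.uuid4())
    ingested_at = datetime.now(timezone.utc).isoformat()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "payload": dict(row),         # raw values, exactly as received
            "_source": source,            # hypothetical field: where the row came from
            "_batch_id": batch_id,        # hypothetical field: which load produced it
            "_ingested_at": ingested_at,  # hypothetical field: when it landed
        }
        for row in reader
    ]


def read_with_schema(records, schema):
    """Schema-on-read: cast payload fields only when the data is consumed."""
    return [
        {field: cast(record["payload"][field]) for field, cast in schema.items()}
        for record in records
    ]


# Append-only store: rows are added, never updated in place.
bronze = []
bronze.extend(
    ingest_raw("id,weight\n007,4.5\n008,12\n", source="animals.csv", batch_id="batch-1")
)

# The schema is chosen by the reader, not fixed at ingestion time.
typed = read_with_schema(bronze, {"id": str, "weight": float})

print(bronze[0]["payload"])  # {'id': '007', 'weight': '4.5'}
print(typed)  # [{'id': '007', 'weight': 4.5}, {'id': '008', 'weight': 12.0}]
```

Note how the leading zeros in `id` survive because nothing is cast during ingestion; a different consumer could read the same records with a different schema.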

## Why Use the Bronze Layer?

Raw data serves as a crucial foundation for advanced analytics and machine learning. The Bronze Layer provides key benefits, including:

- **Data Reliability:**  
  Preserving raw data as an unaltered source-of-truth enables reprocessing or re-analysis when downstream logic or requirements evolve.

- **Scalability:**  
  Designed to handle massive volumes of diverse data sources, the Bronze Layer supports continuous ingestion without performance bottlenecks, ensuring data availability for future use cases.

- **Flexibility for Downstream Processing:**  
  Acts as a staging area, allowing subsequent layers (Silver and Gold) to independently cleanse, integrate, and transform data as needed. Separation of ingestion from processing accelerates data onboarding and enhances analytical flexibility.

- **Regulatory Compliance:**  
  Maintaining original data with detailed lineage aids governance and traceability, essential for compliance with regulatory frameworks such as GDPR and HIPAA.

## Summary

The Bronze Layer is an essential component of modern data lakehouse architectures. By reliably capturing and preserving raw data, it supports scalable, flexible, and compliant data management, laying the foundation for advanced analytics and informed decision-making in subsequent layers.


As a continuation, let's now walk through an example of loading data into the **Bronze Layer**.

---

## Example: Loading Data into the Bronze Layer

In this example, we load raw data into the Bronze Layer of the Medallion architecture. The sample data consists of three small CSV files: animals, owners, and species.

### **Step 1: Set Up the Environment**
To begin, we need to make sure the environment is ready for data processing and Hive integration. This includes applying the necessary configuration and importing the required libraries.

<div class="alert alert-info" role="alert">
  <h4 class="alert-heading">Before running your notebook</h4>
  <p>Please ensure your environment is properly configured for Hive integration.</p>
  <ul>
    <li>
      <strong>Hive Integration Requirements:</strong> This notebook is integrated with Hive. To properly support Hive, you must enable Hive in your environment. For detailed instructions, please refer to <a href="https://ilum.cloud/resources/getting-started" target="_blank" rel="noopener noreferrer">this guide</a>. Also, add the following properties to your cluster configuration:
      <table class="table table-bordered" style="text-align: left;">
        <thead>
          <tr>
            <th>key</th>
            <th>value</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>spark.hadoop.hive.metastore.uris</td>
            <td>thrift://ilum-hive-metastore:9083</td>
          </tr>
          <tr>
            <td>spark.sql.catalogImplementation</td>
            <td>hive</td>
          </tr>
          <tr>
            <td>spark.sql.warehouse.dir</td>
            <td>s3a://ilum-data/</td>
          </tr>
        </tbody>
      </table>
    </li>
    <li>
      <strong>Session-Specific Hive Capabilities:</strong> If Hive is only required for a specific session, configure the necessary environment variables and dependencies on a per-session basis. For example:
      <pre><code>{"conf": {"spark.sql.warehouse.dir": "s3a://ilum-data/", "spark.kubernetes.container.image": "ilum/spark:3.5.3-delta", "spark.hadoop.hive.metastore.uris": "thrift://ilum-hive-metastore:9083", "spark.sql.catalogImplementation": "hive"}, "driverMemory": "1000M", "executorCores": 2}</code></pre>
      This configuration enables Hive, on a Delta-enabled Spark image, for this session only, without affecting other workflows.
    </li>
  </ul>
</div>
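
If you prefer to generate the per-session configuration rather than type it by hand, the sketch below builds the same JSON from the property values listed above. The helper name `hive_session_config` and its parameters are hypothetical; only the keys and default values come from this notebook.

```python
import json


def hive_session_config(
    metastore_uri="thrift://ilum-hive-metastore:9083",
    warehouse_dir="s3a://ilum-data/",
    image="ilum/spark:3.5.3-delta",
    driver_memory="1000M",
    executor_cores=2,
):
    """Return the session properties as a dict (hypothetical helper)."""
    return {
        "conf": {
            "spark.sql.warehouse.dir": warehouse_dir,
            "spark.kubernetes.container.image": image,
            "spark.hadoop.hive.metastore.uris": metastore_uri,
            "spark.sql.catalogImplementation": "hive",
        },
        "driverMemory": driver_memory,
        "executorCores": executor_cores,
    }


# Print the JSON so it can be pasted into the session properties field.
print(json.dumps(hive_session_config(), indent=2))
```

Keeping the values in one function makes it easy to change, for example, the warehouse directory without editing the JSON by hand.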

First, load the sparkmagic extension by running the following command:

In [None]:
%load_ext sparkmagic.magics

Ilum's bundled Jupyter works out of the box and has a predefined endpoint address that points to `livy-proxy`.

Use **%manage_spark** to create a new session.

Choose between Scala and Python, adjust the Spark settings if necessary, and then click the `Create Session` button. It's as simple as that.

The following example is written in `Python`.

In [None]:
%manage_spark

Before we start processing, we need to import the necessary libraries.

In [None]:
%%spark
   
    import pandas as pd

**Creating a Dedicated Database for the Use Case**

A good practice in data engineering is to separate data within dedicated databases for specific use cases. This approach helps maintain data organization and makes it easier to manage, query, and scale.

For this use case, we will create a database named `example_bronze`. This will ensure that all data related to this use case is stored in a structured and isolated manner.

To create the database, we use the following command:

In [None]:
%%spark

    spark.sql("CREATE DATABASE IF NOT EXISTS example_bronze")
    spark.sql("USE example_bronze")
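
The database setup can also be wrapped in a small helper that is safe to re-run and confirms which database is active afterwards. This is a sketch, not part of the original notebook: `ensure_database` is a hypothetical name, and it assumes the `spark` session object that sparkmagic provides inside a `%%spark` cell.

```python
import re

# Conservative pattern for database names: letters, digits, and underscores only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def ensure_database(spark, name):
    """Create the database if it is missing, switch to it, and return the active database."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe database name: {name!r}")
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {name}")
    spark.sql(f"USE {name}")
    return spark.catalog.currentDatabase()


# Inside a %%spark cell (assumes the session's `spark` object):
# print(ensure_database(spark, "example_bronze"))
```

The name check is there because the value is interpolated into SQL text; rejecting anything that is not a plain identifier avoids accidental or malicious statements.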

### **Step 2: Load Raw Data**
The second step is to read the raw data that will land in the Bronze Layer. In production, ingestion is usually automated and covers many different sources, but in this notebook the test data is loaded manually.

Below, each of the three sample data packages is downloaded from a remote repository without any processing.

**Animals:**

In [None]:
%%spark

    animals_url = 'https://raw.githubusercontent.com/ilum-cloud/ilum-python-examples/main/animals.csv'
    
    animals_df = spark.createDataFrame(pd.read_csv(animals_url))
    animals_df.printSchema()
    animals_df.show(5)

**Owners:**

In [None]:
%%spark

    owners_url = 'https://raw.githubusercontent.com/ilum-cloud/ilum-python-examples/main/owners.csv'

    owners_df = spark.createDataFrame(pd.read_csv(owners_url))
    owners_df.printSchema()
    owners_df.show(5)

**Species:**

In [None]:
%%spark
    
    species_url = 'https://raw.githubusercontent.com/ilum-cloud/ilum-python-examples/main/species.csv'

    species_df = spark.createDataFrame(pd.read_csv(species_url))
    species_df.printSchema()
    species_df.show(5)
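
The three cells above follow the same pattern, so the loading step can also be expressed as a loop. The sketch below is one possible way to do that; `source_urls` and `load_sources` are hypothetical helpers, and the commented usage assumes the `spark` session and the `pd` import from the earlier cells.

```python
BASE_URL = "https://raw.githubusercontent.com/ilum-cloud/ilum-python-examples/main"
SOURCES = ("animals", "owners", "species")


def source_urls(names=SOURCES, base_url=BASE_URL):
    """Map each source name to the URL of its CSV file."""
    return {name: f"{base_url.rstrip('/')}/{name}.csv" for name in names}


def load_sources(load_one, names=SOURCES):
    """Apply load_one(url) to every source and return {name: loaded_object}."""
    return {name: load_one(url) for name, url in source_urls(names).items()}


# Inside a %%spark cell (assumes `spark` and `pd` from the earlier cells):
# frames = load_sources(lambda url: spark.createDataFrame(pd.read_csv(url)))
# frames["animals"].printSchema()
# frames["animals"].show(5)
```

Passing the loader as a function keeps the URL logic independent of pandas and Spark, so a new source only needs a new entry in `SOURCES`.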

### **Step 3: Save Data to the Bronze Layer**
In this step, we will save the raw data to a dedicated Bronze Layer location. Since Ilum provides integrated S3 storage and Hive integration, no credentials are required to access the storage.

In [None]:
%%spark

    animals_df.write.format("csv").saveAsTable("animals")
    owners_df.write.format("csv").saveAsTable("owners")
    species_df.write.format("csv").saveAsTable("species")
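
After writing, it can be useful to confirm that the tables exist and contain rows. The following sketch shows one way to do this with the Spark catalog API; `bronze_row_counts` is a hypothetical helper, and it assumes the `spark` session from the `%%spark` cells and the `example_bronze` database created earlier.

```python
def bronze_row_counts(spark, database, expected_tables):
    """Return {table_name: row_count}; raise if an expected table is missing."""
    present = {table.name for table in spark.catalog.listTables(database)}
    missing = sorted(set(expected_tables) - present)
    if missing:
        raise RuntimeError(f"Missing tables in {database}: {missing}")
    return {
        name: spark.table(f"{database}.{name}").count()
        for name in expected_tables
    }


# Inside a %%spark cell (assumes the session's `spark` object):
# print(bronze_row_counts(spark, "example_bronze", ["animals", "owners", "species"]))
```

Comparing these counts with the row counts of the source DataFrames is a simple check that nothing was dropped during the write.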

### **Summary**
In this example:
 - **We loaded raw data** from three CSV files (animals, owners, and species).
 - **We saved the data in CSV format** as Hive tables in a dedicated Bronze Layer database, backed by Ilum's integrated S3 storage.

Storing raw data in the Bronze Layer this way ensures a solid foundation for further processing and analysis in the higher layers of the Medallion architecture.

### Cleaning up

Now that you’re done with your work, clean up your Spark sessions to free resources that are no longer in use.
Simply click the Delete buttons!

![Ilum session clean](../../images/clean_ilum_jupyter_session.png)

In [None]:
%manage_spark

#### [Click here to proceed to the "Bronze to silver" section.](2_Bronze_to_silver.ipynb)

