### <font color='blue'>**Download Kaggle Datasets** | **Load to LakeHouse (Bronze)**</font>
The dataset at https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce contains nine CSV files:
1. olist_customers_dataset.csv
2. olist_geolocation_dataset.csv
3. olist_order_items_dataset.csv
4. olist_order_payments_dataset.csv
5. olist_order_reviews_dataset.csv
6. olist_orders_dataset.csv
7. olist_products_dataset.csv
8. olist_sellers_dataset.csv
9. product_category_name_translation.csv
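
Once the files are downloaded, a quick completeness check against the list above can catch a partial extraction early. The following is a minimal sketch; `EXPECTED_FILES` and `find_missing_files` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from pathlib import Path

# The nine file names listed above.
EXPECTED_FILES = [
    "olist_customers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "product_category_name_translation.csv",
]


def find_missing_files(directory: str) -> list[str]:
    """Return the expected CSV names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

For example, `find_missing_files("/lakehouse/default/Files/KaggleData/")` should return an empty list when all nine files are in place.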

### <font color='blue'> **Install Kaggle Library** </font>

In [1]:
# Install the Kaggle library
# 1. Used for authenticating with Kaggle's public API.
# 2. Allows programmatic download of datasets (e.g., Olist E-commerce).
%pip install kaggle


Collecting kaggle
  Downloading kaggle-1.8.4-py3-none-any.whl.metadata (15 kB)
Collecting kagglesdk<1.0,>=0.1.15 (from kaggle)
  Downloading kagglesdk-0.1.15-py3-none-any.whl.metadata (13 kB)
Downloading kaggle-1.8.4-py3-none-any.whl (75 kB)
Downloading kagglesdk-0.1.15-py3-none-any.whl (160 kB)
Installing collected packages: kagglesdk, kaggle
Successfully installed kaggle-1.8.4 kagglesdk-0.1.15

[notice] A new release of pip is available: 24.0 -> 26.0.1
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.



## <font color='blue'>**Import Libraries** </font>

| Import Statement | Purpose |
| :--- | :--- |
| `import os` | Standard-library interface to the operating system. Here it sets the Kaggle credential environment variables, checks whether the download folder exists, and creates it if needed. |
| `from kaggle.api.kaggle_api_extended import KaggleApi` | Client class for Kaggle's public API. It handles authentication and lets the notebook download datasets programmatically. |
| `from notebookutils import mssparkutils` | Provides a suite of utilities for Microsoft Fabric, used for file system management (OneLake), notebook orchestration, and credential handling. |

In [None]:
import os

# Set credentials before the Kaggle client authenticates.
# Replace the placeholder strings with your own values and never commit real keys
# (Fabric environment variables are a safer place to keep them).
os.environ['KAGGLE_USERNAME'] = "your_actual_username"
os.environ['KAGGLE_KEY'] = "your_actual_key"

# Third-party libraries
from kaggle.api.kaggle_api_extended import KaggleApi
from notebookutils import mssparkutils

api = KaggleApi()
api.authenticate()

# Define path (Fabric notebooks mount the Lakehouse at /lakehouse/default/)
# download_path = "/lakehouse/default/Files/KaggleData/"

# Download and unzip
# api.dataset_download_files('olistbr/brazilian-ecommerce', path=download_path, unzip=True)

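Because the Kaggle client reads `KAGGLE_USERNAME` and `KAGGLE_KEY` from the environment, it can help to confirm that both are set before calling `api.authenticate()`. The sketch below is one way to do that; `REQUIRED_VARS` and `missing_kaggle_credentials` are hypothetical names, not part of the Kaggle library.

```python
import os

# Environment variables the Kaggle client looks for.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

If `missing_kaggle_credentials()` returns a non-empty list, the notebook could stop with a clear message instead of failing later inside the authentication call.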

In [3]:
def extract_kaggle_dataset(dataset_name: str, local_path: str):
    """
    Authenticates with Kaggle, then downloads and unzips a dataset into a Fabric Lakehouse directory.

    Args:
        dataset_name (str): The Kaggle dataset slug (e.g., 'olistbr/brazilian-ecommerce').
        local_path (str): The destination path in the Fabric Lakehouse (e.g., '/lakehouse/default/Files/KaggleData/').
    """
    try:
        # Initialize the Kaggle API client and authenticate.
        # Credentials are read from ~/.kaggle/kaggle.json or from the
        # KAGGLE_USERNAME / KAGGLE_KEY environment variables.
        api = KaggleApi()
        api.authenticate()
        
        # Ensure the directory exists
        if not os.path.exists(local_path):
            os.makedirs(local_path)
            print(f"Created directory: {local_path}")

        print(f"Downloading dataset '{dataset_name}' to {local_path}...")
        
        # Download and unzip
        api.dataset_download_files(dataset_name, path=local_path, unzip=True)
        
        print("Download and extraction complete!")
        
    except Exception as e:
        # Report the failure, then re-raise so later steps do not run without the data
        print(f"An error occurred: {e}")
        raise


In [4]:
def load_csv_to_delta(file_path: str, table_name: str, mode: str = "overwrite"):
    """
    Reads a CSV file from the Lakehouse 'Files' section and saves it as a Delta Table.
    
    Parameters:
        file_path (str): Path to the CSV, relative to the default Lakehouse (e.g., 'Files/KaggleData/olist_orders_dataset.csv').
        table_name (str): The name of the destination Delta table.
        mode (str): Spark write mode. Defaults to 'overwrite', which replaces the table.
    """
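    try:
        # 1. Read the CSV. Every column is loaded as a string (inferSchema=false);
        #    multiLine plus quote/escape handle quoted fields that contain line breaks.
        df = (spark.read.format("csv")
            .option("header", "true")
            .option("multiLine", "true")
            .option("quote", "\"")
            .option("escape", "\"")
            .option("inferSchema", "false")
            .load(file_path)
        )

        # Count the rows read from the file
        file_row_count = df.count()

        # 2. Save as a Delta table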
        df.write.format("delta") \
            .mode(mode) \
            .option("overwriteSchema", "true") \
            .saveAsTable(table_name)

        print(f"Table '{table_name}' created successfully from {file_path}!")

        delta_df = spark.table(table_name)

        # Get the loaded table count
        delta_row_count = delta_df.count()
        print(f"Total rows compare : read: {file_row_count}, Delta table: {delta_row_count}")
        
        # Verify that every row read from the CSV was written to the Delta table
        assert_expected_rows_loaded(file_path, table_name, file_row_count, delta_row_count)

    except Exception as e:
        # Report the failure, then re-raise so a failed load or row-count mismatch stops the pipeline
        print(f"Error processing table {table_name}: {str(e)}")
        raise

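The `multiLine`, `quote`, and `escape` options matter because a quoted CSV field can contain line breaks and doubled quotes, so one record may span several physical lines. The sketch below illustrates the same idea with Python's standard `csv` module rather than Spark; the sample text and both helper functions are hypothetical and exist only for this illustration.

```python
import csv
import io

# Hypothetical sample: a header row and two records. The second comment holds
# a line break inside a quoted field and a doubled (escaped) quote.
SAMPLE = (
    'id,comment\n'
    '1,"Great product"\n'
    '2,"Arrived late.\nBox was ""damaged""."\n'
)


def count_physical_lines(text: str) -> int:
    """Count raw lines, ignoring CSV quoting rules."""
    return len(text.splitlines())


def count_csv_records(text: str) -> int:
    """Count data records (header excluded) using quote-aware parsing."""
    reader = csv.reader(io.StringIO(text), quotechar='"', doublequote=True)
    next(reader)  # skip the header row
    return sum(1 for _ in reader)
```

Here the text has four physical lines but only two data records. A reader that splits on line breaks without honoring quotes would miscount, which is why a row-count comparison is only meaningful when the CSV is parsed with quoting enabled.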

In [5]:
def assert_expected_rows_loaded(src_name: str, target_name: str, before_rows_count: int, loaded_rows_count: int):
    """
    Validates data integrity by comparing source record counts against loaded records.
    
    This assertion acts as a 'circuit breaker' in the ETL pipeline. If the counts 
    do not match exactly, it raises an AssertionError to prevent downstream 
    processing of incomplete or corrupted data.

    Args:
        src_name (str): The name of the source file or table (for logging).
        target_name (str): The name of the destination Lakehouse/Warehouse table.
        before_rows_count (int): The number of records read from the source.
        loaded_rows_count (int): The number of records successfully written to the target.

    Raises:
        AssertionError: If the source count and loaded count are not identical.
    """
    assert before_rows_count == loaded_rows_count, f"Row count mismatch for {src_name} and {target_name}! Read: {before_rows_count}, Delta: {loaded_rows_count}"

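A point worth keeping in mind with this circuit-breaker pattern: `AssertionError` is a subclass of `Exception`, so a broad `except Exception` handler around the check catches it too. The assertion only halts the pipeline if the handler re-raises. The sketch below contrasts the two behaviours; all three function names are hypothetical.

```python
def check_counts(read_rows: int, written_rows: int) -> None:
    """Raise AssertionError when the two counts differ."""
    assert read_rows == written_rows, (
        f"Row count mismatch! Read: {read_rows}, Delta: {written_rows}"
    )


def load_swallowing_errors(read_rows: int, written_rows: int) -> str:
    """The handler only prints, so the caller never sees the failure."""
    try:
        check_counts(read_rows, written_rows)
    except Exception as e:
        print(f"Error: {e}")
    return "continued"


def load_reraising_errors(read_rows: int, written_rows: int) -> str:
    """The handler prints and re-raises, so the failure reaches the caller."""
    try:
        check_counts(read_rows, written_rows)
    except Exception as e:
        print(f"Error: {e}")
        raise
    return "continued"
```

With mismatched counts, `load_swallowing_errors(10, 9)` still returns `"continued"`, while `load_reraising_errors(10, 9)` raises `AssertionError`. Note also that `assert` statements are skipped when Python runs with the `-O` flag, so an explicit `if ...: raise` is a common alternative for checks that must always run.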

In [6]:
def load_bronze_delta_tables():
    """
    Orchestrates the loading of raw Olist E-commerce CSV files into Delta tables.
    
    This function serves as a wrapper to call load_csv_to_delta for each 
    individual file in the dataset, ensuring a consistent 'Bronze' naming convention.
    """

    # Load Customers dataset
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_customers_dataset.csv",
        table_name="customers_bronze"
    )
    
    # Load Geolocation dataset
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_geolocation_dataset.csv",
        table_name="geolocation_bronze"
    )

    # Load Order_Items dataset      
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_order_items_dataset.csv",
        table_name="order_items_bronze"
    )

    # Load Order_Reviews dataset  
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_order_reviews_dataset.csv",
        table_name="order_reviews_bronze"
    )

    # Load Order_Payments dataset  
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_order_payments_dataset.csv",
        table_name="order_payments_bronze"
    )

    # Load Orders dataset
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_orders_dataset.csv",
        table_name="orders_bronze"
    )

    # Load Sellers dataset
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_sellers_dataset.csv",
        table_name="sellers_bronze"
    )

    # Load Products dataset
    load_csv_to_delta(
        file_path="Files/KaggleData/olist_products_dataset.csv",
        table_name="products_bronze"
    )

    # Load Product_Category_Name_Translation dataset
    load_csv_to_delta(
        file_path="Files/KaggleData/product_category_name_translation.csv",
        table_name="product_category_name_translation_bronze"
    )

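The nine calls above follow a regular naming rule: drop the `.csv` extension, the `olist_` prefix, and the `_dataset` suffix, then append `_bronze`. As a sketch of a more compact alternative, the rule can be written once and applied to a list of file names. `CSV_FILES`, `bronze_table_name`, and `bronze_load_plan` are hypothetical names, and `removeprefix`/`removesuffix` assume Python 3.9 or later.

```python
# The nine CSV file names from the dataset listing.
CSV_FILES = [
    "olist_customers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_sellers_dataset.csv",
    "olist_products_dataset.csv",
    "product_category_name_translation.csv",
]


def bronze_table_name(file_name: str) -> str:
    """Derive the Bronze table name from a CSV file name."""
    stem = file_name.removesuffix(".csv")
    stem = stem.removeprefix("olist_").removesuffix("_dataset")
    return f"{stem}_bronze"


def bronze_load_plan(folder: str = "Files/KaggleData") -> list[tuple[str, str]]:
    """Return (file_path, table_name) pairs, one per CSV file."""
    return [(f"{folder}/{name}", bronze_table_name(name)) for name in CSV_FILES]
```

In the notebook, the plan could then drive a loop such as `for file_path, table_name in bronze_load_plan(): load_csv_to_delta(file_path=file_path, table_name=table_name)`, which yields the same table names as the explicit calls.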

In [7]:
# ==============================================================================
# MAIN EXECUTION FLOW
# ==============================================================================

# 1. Global Configurations
# Define the Kaggle dataset slug and the Lakehouse mount path where the files are saved

kaggle_dataset_name: str = 'olistbr/brazilian-ecommerce'
files_local_path: str = '/lakehouse/default/Files/KaggleData/'

# 2. Ingestion Phase
extract_kaggle_dataset(kaggle_dataset_name, files_local_path)

# 3. Storage Phase
load_bronze_delta_tables()

print("Full Bronze Ingestion Pipeline completed successfully.")


Downloading dataset 'olistbr/brazilian-ecommerce' to /lakehouse/default/Files/KaggleData/...
Dataset URL: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
Download and extraction complete!
Table 'customers_bronze' created successfully from Files/KaggleData/olist_customers_dataset.csv!
Total rows compare : read: 99441, Delta table: 99441
Table 'geolocation_bronze' created successfully from Files/KaggleData/olist_geolocation_dataset.csv!
Total rows compare : read: 1000163, Delta table: 1000163
Table 'order_items_bronze' created successfully from Files/KaggleData/olist_order_items_dataset.csv!
Total rows compare : read: 112650, Delta table: 112650
Table 'order_reviews_bronze' created successfully from Files/KaggleData/olist_order_reviews_dataset.csv!
Total rows compare : read: 99224, Delta table: 99224
Table 'order_payments_bronze' created successfully from Files/KaggleData/olist_order_payments_dataset.csv!
Total rows compare : read: 103886, Delta table: 103886
Table 'orders_br
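The log shows two path styles for the same folder: the download step writes to the mount path `/lakehouse/default/Files/KaggleData/`, while the Spark loads use paths that start at `Files/KaggleData/`. Assuming both refer to the default Lakehouse, as the notebook's comments suggest, a small helper can derive one from the other so the folder is defined only once. `MOUNT_PREFIX` and `to_spark_relative_path` are hypothetical names for this sketch.

```python
import posixpath

# Assumed mount point of the default Lakehouse in a Fabric notebook.
MOUNT_PREFIX = "/lakehouse/default/"


def to_spark_relative_path(mount_path: str, file_name: str = "") -> str:
    """Convert a mount path into the Lakehouse-relative form used for Spark reads."""
    if not mount_path.startswith(MOUNT_PREFIX):
        raise ValueError(f"Path is not under {MOUNT_PREFIX}: {mount_path}")
    relative = mount_path[len(MOUNT_PREFIX):]
    return posixpath.join(relative, file_name) if file_name else relative
```

For example, `to_spark_relative_path('/lakehouse/default/Files/KaggleData/', 'olist_orders_dataset.csv')` gives `'Files/KaggleData/olist_orders_dataset.csv'`.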

In [8]:
# Stop the Spark session before exiting
mssparkutils.session.stop()
