# Bronze Layer – Kaggle API Ingestion (Fabric)

## Purpose
This notebook handles **Bronze ingestion from Kaggle using the Kaggle API** in Microsoft Fabric.

It is a **Phase 1 (Fabric-only)** ingestion path and is designed to:
- Pull raw data from Kaggle via API
- Land files into the Lakehouse Files area
- Convert raw files into **parallel Bronze Delta tables (`br_api_*`)**
- Avoid impacting existing stable Bronze tables (`br_*`)

This notebook focuses on **ingestion only**, not transformation.

---

## Scope (Sprint 1)
- Platform: Microsoft Fabric
- Ingestion method: Kaggle API (Python / PySpark)
- Storage:
  - Files → `Files/bronze_kaggle_api/`
  - Tables → `br_api_*` (Delta tables)

This allows safe validation before any promotion or swap.

---

## Input
- Kaggle datasets (via Kaggle API)
- Authentication via a project-scoped Kaggle API token (set as the `KAGGLE_API_TOKEN` environment variable in this notebook)

---

## Output (WRITE ONLY)
- Parallel Bronze tables:
  - `br_api_customers`
  - `br_api_orders`
  - `br_api_order_items`
  - `br_api_order_payments`
  - `br_api_order_reviews`
  - `br_api_products`
  - `br_api_sellers`
  - `br_api_geolocation`
  - `br_api_product_category_translation`

---

## Rules & Constraints
- This notebook **must not overwrite** existing `br_*` tables
- Schema inference is disabled (`inferSchema = False`)
- No cleaning, joining, or business logic is applied
- One-to-one mapping between Kaggle files and Bronze tables
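
The no-overwrite rule can be enforced in code rather than by convention. The following is a minimal sketch, assuming a hypothetical helper `assert_safe_target` (not part of the original notebook) that is called on each target name before any write:

```python
# Hypothetical guard: the function name and constant are illustrative assumptions.
API_PREFIX = "br_api_"

def assert_safe_target(table_name: str) -> str:
    """Return table_name if it is in the parallel br_api_* namespace, else raise."""
    if not table_name.startswith(API_PREFIX):
        raise ValueError(
            f"Refusing to write to '{table_name}': "
            f"only {API_PREFIX}* tables may be written by this notebook."
        )
    return table_name

# Example: validate every target before any write happens.
targets = ["br_api_customers", "br_api_orders"]
checked = [assert_safe_target(t) for t in targets]
print(checked)
```

Checking all names up front means a typo such as `br_orders` fails before the first table is touched, instead of halfway through the load.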

---

## Relationship to Other Layers
- Silver development reads from **approved Bronze tables only**
- Promotion from `br_api_*` → `br_*` (if any) is a **manual PO decision**
- Downstream Silver / Gold logic remains unchanged

---

## Notes
- Kaggle API token is **project-scoped** and reusable for Phase 2
- In Phase 2, ingestion may move to Azure Data Factory + Key Vault
- This notebook remains as the Fabric-based reference ingestion

## IMPORTANT:
`olist_order_reviews_dataset.csv` contains multiline review text.
Therefore:
- `multiLine=True` is required when ingesting `br_api_order_reviews`
- `inferSchema=False` is enforced for all Bronze tables

This prevents incorrect row counts and schema drift during overwrites.
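
One way to keep these two rules in a single place is a small per-file options lookup. This is a sketch only: `csv_options_for`, `BASE_OPTIONS`, and `MULTILINE_FILES` are hypothetical names, while the option names and values are the ones used in this notebook.

```python
# Hypothetical helper; only the option names and values come from this notebook.
BASE_OPTIONS = {"header": True, "inferSchema": False}
MULTILINE_FILES = {"olist_order_reviews_dataset.csv"}

def csv_options_for(fname: str) -> dict:
    """Return the CSV reader options to use for one Kaggle file."""
    opts = dict(BASE_OPTIONS)  # copy so the shared defaults are never mutated
    if fname in MULTILINE_FILES:
        # Review comments contain line breaks and doubled quotes
        opts.update({"multiLine": True, "quote": '"', "escape": '"'})
    return opts

print(csv_options_for("olist_orders_dataset.csv"))
print(csv_options_for("olist_order_reviews_dataset.csv"))
```

In a Spark session the result could be passed as `spark.read.options(**csv_options_for(fname)).csv(path)`, so that the reviews file gets multiline handling on the first load.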


In [1]:
%pip install -q kaggle

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
library-metadata-cooker 3.5.0.1 requires mypy==1.4.1, but you have mypy 1.19.1 which is incompatible.

[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.



In [2]:
import os

# Input: paste your token between the quotes
os.environ["KAGGLE_API_TOKEN"] = "KGAT_e25d27dbdf44bc750554fd**********" # Token masked for security

print("KAGGLE_API_TOKEN set:", os.getenv("KAGGLE_API_TOKEN") is not None)

KAGGLE_API_TOKEN set: True
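
Note that `os.getenv(...) is not None` also reports `True` for an empty string. A slightly stricter check, sketched here with a hypothetical helper `masked_token_status`, treats an empty value as unset and never prints the full secret:

```python
import os

def masked_token_status(env_var: str = "KAGGLE_API_TOKEN", visible: int = 4) -> str:
    """Describe whether the token is set without printing the secret itself."""
    token = os.environ.get(env_var, "")
    if not token:
        return f"{env_var} is not set"
    # Show only the first few characters so logs never contain the full token
    return f"{env_var} is set ({token[:visible]}{'*' * 8}, length={len(token)})"

print(masked_token_status())
```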


In [10]:
import sys, os

print("Python executable:", sys.executable)

bin_dir = os.path.dirname(sys.executable)
print("Bin dir:", bin_dir)

print("kaggle exists:", os.path.exists(os.path.join(bin_dir, "kaggle")))
print("Files with kaggle in name:", [f for f in os.listdir(bin_dir) if "kaggle" in f.lower()])

Python executable: /nfs4/pyenv-d46785ae-aec0-4036-9391-efa66e570217/bin/python
Bin dir: /nfs4/pyenv-d46785ae-aec0-4036-9391-efa66e570217/bin
kaggle exists: True
Files with kaggle in name: ['kaggle']


In [13]:
kaggle_cli = os.path.join(os.path.dirname(sys.executable), "kaggle")

!{kaggle_cli} --version
!{kaggle_cli} datasets list -s olist

Kaggle API 1.8.3
ref                                                               title                                                    size  lastUpdated                 downloadCount  voteCount  usabilityRating  
----------------------------------------------------------------  -------------------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
olistbr/brazilian-ecommerce                                       Brazilian E-Commerce Public Dataset by Olist         44717580  2021-10-01 19:08:27.970000         420735       3867                1  
olistbr/marketing-funnel-olist                                    Marketing Funnel by Olist                              284562  2018-11-16 14:00:20.677000          17747        319                1  
terencicp/e-commerce-dataset-by-olist-as-an-sqlite-database       E-commerce dataset by Olist (SQLite)                 51085670  2024-04-28 14:56:35.423000           9422         

In [14]:
out_dir = "/lakehouse/default/Files/bronze_kaggle_api"
!{kaggle_cli} datasets download -d olistbr/brazilian-ecommerce -p "{out_dir}" --unzip

Dataset URL: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
License(s): CC-BY-NC-SA-4.0
Downloading brazilian-ecommerce.zip to /lakehouse/default/Files/bronze_kaggle_api
 77%| 33.0M/42.6M [00:00<00:00, 47.2MB/s]
100%| 42.6M/42.6M [00:01<00:00, 26.6MB/s]
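
The `!` shell magic does not stop the notebook if the download fails. As a hedged alternative, the same CLI call can be run through `subprocess` so that a failure raises an exception. The helper names below are hypothetical, the flags are the ones used in the cell above, and the sketch assumes the CLI signals failure with a non-zero exit code:

```python
import os
import subprocess
import sys

def build_download_cmd(dataset: str, out_dir: str) -> list:
    """Build the same 'kaggle datasets download' command used in the cell above."""
    kaggle_cli = os.path.join(os.path.dirname(sys.executable), "kaggle")
    return [kaggle_cli, "datasets", "download", "-d", dataset, "-p", out_dir, "--unzip"]

def download_dataset(dataset: str, out_dir: str) -> None:
    """Run the download; raises CalledProcessError on a non-zero exit code."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_download_cmd(dataset, out_dir), check=True)

# Only the command is printed here; call download_dataset(...) to actually run it.
cmd = build_download_cmd(
    "olistbr/brazilian-ecommerce",
    "/lakehouse/default/Files/bronze_kaggle_api",
)
print(" ".join(cmd))
```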


In [15]:
import os
os.listdir("/lakehouse/default/Files/bronze_kaggle_api")[:20]

['product_category_name_translation.csv',
 'olist_customers_dataset.csv',
 'olist_orders_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_sellers_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_products_dataset.csv']

In [16]:
import os
sorted(os.listdir("/lakehouse/default/Files/bronze_kaggle_api"))

['olist_customers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_orders_dataset.csv',
 'olist_products_dataset.csv',
 'olist_sellers_dataset.csv',
 'product_category_name_translation.csv']
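
Instead of inspecting the listing by eye, the presence of all nine files can be checked programmatically before loading. This is a minimal sketch: `missing_files` and `EXPECTED_FILES` are hypothetical names, and the file names are taken from the listing above.

```python
import os
import tempfile

# File names taken from the directory listing above.
EXPECTED_FILES = {
    "olist_customers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_orders_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
    "product_category_name_translation.csv",
}

def missing_files(folder: str, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that are not present in folder, sorted."""
    present = set(os.listdir(folder))
    return sorted(set(expected) - present)

# Demo against a temporary folder that contains only one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "olist_orders_dataset.csv"), "w").close()
    demo = missing_files(
        tmp, expected={"olist_orders_dataset.csv", "olist_sellers_dataset.csv"}
    )
    print(demo)
```

In the notebook, `missing_files("/lakehouse/default/Files/bronze_kaggle_api")` returning an empty list would indicate that the download landed every expected file.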

In [17]:
base = "/lakehouse/default/Files/bronze_kaggle_api"

file_to_table = {
    "olist_customers_dataset.csv": "br_api_customers",
    "olist_geolocation_dataset.csv": "br_api_geolocation",
    "olist_orders_dataset.csv": "br_api_orders",
    "olist_order_items_dataset.csv": "br_api_order_items",
    "olist_order_payments_dataset.csv": "br_api_order_payments",
    "olist_order_reviews_dataset.csv": "br_api_order_reviews",
    "olist_products_dataset.csv": "br_api_products",
    "olist_sellers_dataset.csv": "br_api_sellers",
    "product_category_name_translation.csv": "br_api_product_category_translation",
}

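for fname, tbl in file_to_table.items():
    # Read from the local Lakehouse mount (file: scheme); all columns stay strings
    path = f"file:{base}/{fname}"
    df = (spark.read
          .option("header", True)
          .option("inferSchema", False)   # keep Bronze stable
          .csv(path))
    # NOTE: default CSV options mis-split the multiline reviews file;
    # br_api_order_reviews is re-ingested with multiLine=True in a later cell
    df.write.mode("overwrite").format("delta").saveAsTable(tbl)
    print(f" {tbl} created from {fname} | rows={df.count()}")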

 br_api_customers created from olist_customers_dataset.csv | rows=99441
 br_api_geolocation created from olist_geolocation_dataset.csv | rows=1000163
 br_api_orders created from olist_orders_dataset.csv | rows=99441
 br_api_order_items created from olist_order_items_dataset.csv | rows=112650
 br_api_order_payments created from olist_order_payments_dataset.csv | rows=103886
 br_api_order_reviews created from olist_order_reviews_dataset.csv | rows=104162
 br_api_products created from olist_products_dataset.csv | rows=32951
 br_api_sellers created from olist_sellers_dataset.csv | rows=3095
 br_api_product_category_translation created from product_category_name_translation.csv | rows=71


### Special Handling: Order Reviews (Multiline CSV)

#### Why this exists

The Kaggle file `olist_order_reviews_dataset.csv` contains free-text review comments with embedded line breaks and quotation marks.

When read with default Spark CSV options, this caused:
* Rows truncated and split at line breaks inside review comments
* Incorrect row counts (104,162 rows read instead of 99,224)
* Mismatch against the existing CSV-based Bronze table (`br_reviews`)

#### Observed issue

* `br_reviews` row count ≠ `br_api_order_reviews` row count

#### Resolution
The CSV was re-read with explicit multiline and quote handling enabled:

* `multiLine = true`
* `quote = "\""`
* `escape = "\""`
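
The effect of these settings can be illustrated without Spark. The sketch below uses Python's standard `csv` module on two invented sample rows (not taken from the dataset) to show how counting physical lines overstates the number of records, while a quote-aware parser recovers one record per review:

```python
import csv
import io

# Invented sample data: one comment spans two lines, one contains doubled quotes.
raw = (
    "review_id,review_comment_message\n"
    'r1,"Great product.\nArrived early."\n'
    'r2,"Seller said ""thank you"" in a note."\n'
)

# Naive view: every physical line after the header is treated as a row.
physical_lines = len(raw.splitlines()) - 1

# Quote-aware view: newlines inside quotes stay within one record.
records = list(csv.reader(io.StringIO(raw)))[1:]

print("physical lines:", physical_lines)
print("logical records:", len(records))
print(records[1][1])
```

The line-based count is higher than the record count, which is the same kind of inflation seen in the first load of the reviews file.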

The corrected DataFrame was then written as a Delta table, overwriting `br_api_order_reviews`.

#### Validation
After applying this fix:

* `br_reviews` = 99,224
* `br_api_order_reviews` = 99,224

This table should be treated as the authoritative API-based Bronze source for order reviews going forward.

In [26]:
# FIX: Re-ingest reviews CSV with multiline support
# Reason: review_comment_message contains line breaks which cause row mismatch
# This ensures 1 CSV row = 1 logical review record

base = "/lakehouse/default/Files/bronze_kaggle_api"
path = f"file:{base}/olist_order_reviews_dataset.csv"

df = (
    spark.read
    .option("header", True)
    .option("inferSchema", False)
    .option("multiLine", True)
    .option("escape", '"')
    .option("quote", '"')
    .option("mode", "PERMISSIVE")
    .csv(path)
)

df.write.mode("overwrite").format("delta").saveAsTable("br_api_order_reviews")
print("rows:", df.count())

rows: 99224


In [28]:
pairs = [
    ("br_customers", "br_api_customers"),
    ("br_geolocation", "br_api_geolocation"),
    ("br_orders", "br_api_orders"),
    ("br_order_items", "br_api_order_items"),
    ("br_payments", "br_api_order_payments"),
    ("br_reviews", "br_api_order_reviews"),
    ("br_products", "br_api_products"),
    ("br_sellers", "br_api_sellers"),
    ("br_product_category_translation", "br_api_product_category_translation"),
]

for a, b in pairs:
    ca = spark.table(a).count()
    cb = spark.table(b).count()
    print(f"{a:35} {ca:10} | {b:35} {cb:10} | match={ca==cb}")

br_customers                             99441 | br_api_customers                         99441 | match=True
br_geolocation                         1000163 | br_api_geolocation                     1000163 | match=True
br_orders                                99441 | br_api_orders                            99441 | match=True
br_order_items                          112650 | br_api_order_items                      112650 | match=True
br_payments                             103886 | br_api_order_payments                   103886 | match=True
br_reviews                               99224 | br_api_order_reviews                     99224 | match=True
br_products                              32951 | br_api_products                          32951 | match=True
br_sellers                                3095 | br_api_sellers                            3095 | match=True
br_product_category_translation             71 | br_api_product_category_translation         71 | match=True
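
For an automated run, the comparison above could fail loudly instead of relying on someone reading the `match=` column. The following is a sketch under assumptions: `compare_counts` is a hypothetical helper, and the counts in `sample_counts` are stand-ins copied from outputs in this notebook (including the pre-fix reviews count) so the example runs without Spark.

```python
def compare_counts(pairs, count_of):
    """Return (stable, api, stable_count, api_count) for every mismatching pair."""
    mismatches = []
    for stable, api in pairs:
        ca, cb = count_of(stable), count_of(api)
        if ca != cb:
            mismatches.append((stable, api, ca, cb))
    return mismatches

# Stand-in counts; 104162 is the reviews row count before the multiline fix.
sample_counts = {
    "br_sellers": 3095,
    "br_api_sellers": 3095,
    "br_reviews": 99224,
    "br_api_order_reviews": 104162,
}
sample_pairs = [
    ("br_sellers", "br_api_sellers"),
    ("br_reviews", "br_api_order_reviews"),
]

result = compare_counts(sample_pairs, sample_counts.__getitem__)
print(result)
```

In the notebook, `count_of` could be `lambda name: spark.table(name).count()`, followed by raising an error when the returned list is not empty. Matching row counts are a necessary check but do not by themselves show that the row contents are identical.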
